Bacterial genes and genome dynamics in the environment by MASSACHUSTS INST OF TECHNOLOGY Sonia C.Timberlake tE MAY 2 9 2013 Bachelor of Science, Biology California Institute of Technology, 2003 SUBMITTED TO THE DEPARTMENT OF BIOLOGICAL ENGINEERING~ IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN BIOLOGICAL ENGINEERING AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY FEBRUARY 2013 © 2013 Massachusetts Institute of Technology. All Rights Reserved. Signature of Author: Sonia C. Timberake January 11, 2013 Certified by: Eric J Aim Associate Professor of Civil and Environmental Engineering and Biological Engineering Thesis Supervisor 14i Accepted by: Forest A. White Associate Professor of Biological Engineering Course 20 Graduate Committee Chairperson --.- i This doctoral thesis has been examined by a committee of the Biological Engineering Department as follows: Ernest Fraenkel Chairperson, Graduate Thesis Committee: Associate Professor of Biological Engineering and Biology Eric J. Alm Thesis Advisor, Committee Member Associate Professor of Civil and Environmental Engineering and Biological Engineering Martin P. Polz Committee Member Professor of Civil and Environmental Engineering Janelle R. Thompson Committee Member Assistant Professor of Civil and Environmental Engineering 3 4 Genes and genome dynamics in the environment by Sonia C Timberlake Submitted to the Department of Biological Engineering on January 11, 2013, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biological Engineering Abstract One of the most marvelous features of microbial life is its ability to thrive in such diverse and dynamic environments. My scientific interest lies in the variety of modes by which microbial life accomplishes this feat. In the first half of this thesis I present tools to leverage high throughput sequencing for the study of environmental genomes. In the second half of this thesis, I describe modes of environmental adaptation by bacteria via gene content or gene expression evolution. Associating genes' usage and evolution to adaptation in various environments is a cornerstone of microbiology. New technologies and approaches have revolutionized this pursuit, and I begin by describing the computational challenges I resolved in order to bring these technologies to bear on microbial genomics. In Chapter 1, I describe SHE-RA, an algorithm that increases the useable read length of ultra-high throughput sequencing technologies, thus extending their range of applications to include environmental sequencing. In Chapter 2, I design a new hybrid assembly approach for short reads and assemble 82 Vibrio genomes. Using the ecologically defined groups of this bacterial family, I investigate the genomic and metabolic correlates of habitat and differentiation, and evaluate a neutral model of gene content. In Chapter 3, I report the extent to which orthologous genes in bacteria exhibit the same transcriptional response to the same change in environment, and describe the features and functions of bacterial transcriptional networks that are conserved. I conclude this thesis with a summary of my tools and results, their use in other studies, and their relevance to future work. In particular, I discuss the future experiments and analytical strategies that I am eager to see applied to compelling open questions in microbial ecology and evolution. Thesis Supervisor: Eric JAlm Associate Professor of Civil and Environmental Engineering and Biological Engineering 5 6 Acknowledgements I want to begin by thanking the Biological Engineering department, for welcoming me to MIT to let me pursue my love of science. Doug Lauffenburger, Al Grodzinsky, Forest White and other members of the BE community have worked hard to build scientific family that is full of exciting new ideas and friendly, persistent encouragement to pursue them! In particular I want to thank Dalia Fares for always signing and turning in who knows what all paperwork I forgot every single term to keep me enrolled at MIT. I want thank Sheila Frankel for all of her excellent advice and encouragement and indefatigable efforts, aside Jim Long, to keep Parsons from falling down around us. I want to thank my committee members, Ernest Fraenkel, Martin Polz, and Janelle Thompson, for all of their feedback and advice. In particular I want to thank my advisor, Eric Alm, for his absolutely unstinting support and enthusiasm. Thank you, Eric, for being even more excited and optimistic about my projects than I was. To Lawrence David, Sean Clarke, Jesse Shapiro and the rest of my lab mates in the Alm Lab, it was so lucky for me that you are both exceptional scientists and human beings. In general, the people I've met in BE, Parsons, and the MIT community at large, are not only ferociously intelligent and interesting, but also so generous, accepting, and brave. I feel so lucky have been a part of these communities at MIT and at Caltech. I want to specially thank all my family for always making a welcoming home for me wherever you are. I want to thank my parents for raising me with the knowledge that nothing I could ever ask of them would be too much. Megan Bartley, you were a daily source of support and encouragement. Nina, thank you for being so sane. Finally, a big thank you to my wonderful husband, Kenzie MacIsaac, who makes my world a better, funnier, happier place. 7 8 Table of Contents A BST RACT ................................................................................................................................................................... 5 ACKNOWLEDGEMENTS ............................................................................................................................................. 7 INTRODUCTION ................................................................................................... 14 0 .1 O VE RV IEW ..................................................................................................................................................... 14 0.2 THE PIVOTAL IMPORTANCE OF DEEP COVERAGE IN ENVIRONMENTAL GENOMICS.....................15 0.3 A MODEL SYSTEM FOR BACTERIAL ECOLOGY AND EVOLUTION....................................................... 0.4 CONSERVATION OF COEXPRESSION WITHOUT CONSERVATION OF RESPONSE IN BACTERIAL GENE CHAPTER 0. 17 REGULATORY NETWORKS.......................................................................................................................................23 CHAPTER 1. UNLOCKING SHORT READS FOR METAGENOMICS..................................25 1.1 INTRODUCTION.............................................................................................................................................26 1 .2 R ESU LTS ........................................................................................................................................................ 1 .3 D ISCU SSIO N............................................................................................................................................---.-.2 9 1.4 MATERIALS AND METHODS ....................................................................................................................... 29 1.5 REFERENCES.................................................................................................................................................34 1.6 CONTRIBUTIONS TO THIS MANUSCRIPT .............................................................................................. 1 .7 A PPEND IX................................................................................................................................ 1.7.1 26 35 ................... 35 Supplemental Note: alignment confidence metric........................................................ CHAPTER 2. 35 ASSEMBLY AND ANALYSIS OF 82 VIBRIO GENOMES ............................... 39 2 .1 A BSTRA CT ...................................................................................................................................... ...... --.3 9 2.2 INTRODUCTION............................................................................................................................... ......... 2.3 TECHNICAL CHALLENGES AND NEW ASSEMBLY APPROACHES ....................................................... 40 42 2.3.1 Hybridgenome assembly approachfor short reads ..................................................... 42 2.3.2 Pipelinefor 16S rRNA gene assembly from short reads ............................................... 45 2.3.3 Using whole genome phylogeny to support ecologicalpopulationpredictions......46 2.4 RESULTS AND DISCUSSION ................................................................................................................. 46.....46 2.4.1 M etabolic profiles' relationshipwith phylogeny and ecology ..................................... 46 2.4.2 Gene content relationshipswith phylogeny and ecology ............................................. 49 2.4.3 High rates of HGT among strainsco-existing in the water column.......................... 51 2.4.4 Core and flexible gene turnover............................................................................................. 52 9 2.4.5 2 .5 Genes fixing within a population..................5 .................................................................................................. 53 2 .5 C ONCLU SIO NS ....................................................................... 55 2 .6 M ETHODS ...................................................................................................................................................... 58 2.6.1 Whole genome shotgun librariesand sequencing........................................................... 58 2.6.2 Hybrid assembly pipeline................................................................................................................. 58 2.6.3 16S "contig-mapping" assembly pipeline ............................................................................ 60 2.6.4 165 phy log eny....................................................................................................................................... 60 2.6.5 16S distance matrix,clustering, and pairwisedistances............................................... 60 2.6.6 Identifying recentHGT between populations.................................................................... 61 2.6.7 Counting recentHGT within 80 vibrio strains.................................................................... 61 2.6.8 Callinggenes and annotatinggenomes ............................................................................... 61 2.6.9 Building orthologousgroups ................................................................................................... 61 2.6.10 Building concatenatedgene species phylogenies........................................................ 62 2.6.11 Genome similarityby protein content............................................................................... 62 2.6.12 Populationsimilarity by protein content.......................................................................... 62 2.6.13 Substrateutilization assays................................................................................................... 63 2.6.14 Substrate utilization profiles and similarity ................................................................... 63 2.6.15 Populationsimilarityby substrate utilization profile................................................ 63 2.7 AUTHOR CONTRIBUTIONS..........................................................................................................................63 2 .8 R EFEREN CES ................................................................................................................................................. 64 See main thesis document reference section, Error!Reference source not found............. 64 2.9 2.10 FIGU RES AN D L EGEN DS .............................................................................................................................. SUPPLEMENTAL NOTES AND DISCUSSION ..................................................................................... 65 81 2.10.1 Hybrid assembly pipeline: discussion of accuracy........................................................ 81 2.10.2 16S rRNA phylogeny....................................................................................................................... 81 2.10.3 Defining gene families in 82 genomes: sensitivity to methodology, sampling....... 81 2.10.4 Using gene content divergence to confirm population predictions...................... 82 2.10.5 Habitat-specificgene gain and loss in V. cyclitrophicus............................................ 82 2.10.6 Gene deletions in general:bias, neutrality,orselection?.......................................... 83 CHAPTER 3. CONSERVATION OF COEXPRESSION WITHOUT CONSERVATION OF RESPONSE IN BACTERIAL GENE REGULATORY NETW ORKS........................................... 85 3.1 AUTHORSHIP & PUBLICATION NOTE.................................................................................................. 85 3 .2 A BST RA CT ..................................................................................................................................................... 86 10 3.3 BACKGROUND ............................................................................................................................................... 87 3.4 RESULTS AND DISCUSSION ......................................................................................................................... 89 3.4.1 Coexpression is significantlyconserved acrossbacteria................................................ 89 3.4.2 Expression phenotypes are not conserved ......................................................................... 92 3.5 CONCLUSIONS AND FUTURE DIRECTIONS ........................................................................................... 95 3.6 M ATERIALS AND M ETHODS. ...................................................................................................................... 97 3.6.1 Biom ass production and stress application...................................................................... 97 3.6.2 M icroarrayextraction and hybridization..............................................................................100 3.6.3 Com putationalm ethods................................................................................................................. 3.7 AUTHOR CONTRIBUTIONS.......................................................................................................................106 3.8 FIGURES AND LEGENDS ........................................................................................................................... 3.8.1 102 106 Supplemental Note: transcriptomicexperiments obtainedfrom other laboratories and their com parison ..................................................................................................................................... CHAPTER 4. CONCLUSIONS AND FUTURE DIRECTIONS................................................... 117 125 4.1 SHORT READS FOR METAGENOMICS......................................................................................................125 4.2 ASSEM BLY AND ANALYSIS OF 82 ENVIRONM ENTAL BACTERIA ........................................................ 4.3 CONSERVATION OF COEXPRESSION WITHOUT CONSERVATION OF RESPONSE IN BACTERIAL GENE 127 REGULATORY NETW ORKS............................................................................................................................----....134 CH A PTER 5. GLO SSA RY ............................................................................................................... 140 CH A PTER 6. REFERENCES .......................................................................................................... 143 11 Index of Figures and Tables Chapter 1 Figure 1. Size-dependent isolation of DNA fragments from sheared genomic DNA via dSPRI ............................................................................................................................................................................... Figure 2. Reproducibility of double-SPRI................................................................................................ 27 28 Table 1. Oligonucleotides sequences..............................................................................................................28 Figure 3. Quality of Illumina reads out to 143 bp............................................................................... 29 Figure 4. High-confidence alignment yield by insert lengths......................................................... 30 Figure 5. Quality of composite fragments after overlapping......................................................... 30 Figure 6. Composite Illumina reads constitute a legitimate alternative to pyrosequencing for m etagen om ics stu d ies ................................................................................................................................ Table 2. Cost estimates of sequencing strategies .............................................................................. 31 31 Figure 7. Short insertions and deletions in low-confidence composite Illumina reads ........... 33 Supplemental Figure 1. SHE-RA alignment confidence metric separates true alignments from m is-align m ents...................................................................................................................................3 6 Supplemental Figure 2. Taxa abundances in metagenomics sample are similar when using overlapped Illumina or 454 sequencing ........................................................................................ 37 Chapter 2 Figure 1. Assembly NSO improvement using hybrid assembly approach ................................ Figure 2. Species phylogeny of Vibrio genomes.................................................................................. 65 66 Figure 3. H abitat and gene content..................................................................................................................67 Figure 4. Habitat and substrate utilization profiles............................................................................ 68 Figure 5. Counting recent HGT within 80 vibrio isolates ................................................................. 69 Figure 6. Network of recent HGT between vibrio populations ..................................................... 70 Figure 7. Core and pan genomes for 4 species definitions .............................................................. 71 Figure 8. V. cyclitrophicusstrains clustered by flexible gene content and neutral markers..72 Supplemental Figure 1. Distribution of polymorphisms in 16S genes assembled with short rea d p ip elin e ................................................................................................................................................... 73 Supplemental Figure 2. 16S phylogeny of our populations and other vibrio type strains......74 12 75 Supplemental Figure 3. Gene content divergence with phylogenetic distance....................... Supplemental Figure 4. Overlap in gene content between large-particle-associated 76 p o p u latio n s ...................................................................................................................................................... Supplemental Figure 5. Shared rare genes by same-filter strains ................................................ 77 Supplemental Figure 6. Substrate utilization in the Vibrio family.............................................. 78 Supplemental Figure 7. Substrate utilization profile divergence with phylogenetic distance 79 ............................................................................................................................................................................... Supplemental Figure 8. Distribution of gene family penetrance in V. cyclitrophicus............80 Chapter 3 Figure 1. Conservation of coexpression across the bacterial domain .......................................... 108 Figure 2. Transcriptomic response to pH is highly variable, even within the same species. ............................................................................................................................................................................ 1 09 Figure 3. Comparison of phenotype in our stress compendium of S. oneidensis and D. v u lg aris ........................................................................................................................................................... 11 1 Figure 4. Comparison of phenotypes in our stress compendium of D. vulgaris and G. m etalliredu cens. ......................................................................................................................................... 11 3 Table 1. Genes of our heat shock module...................................................................................................114 Figure 5. Conserved response of E. coli heat shock module across five species despite global d iv ergen ce of resp on se. ......................................................................................................................... 1 15 Supplemental Figure 1. Comparison of phenotypes in our stress compendium of S. oneidensis and D. vulgaris using all putative orthologs.............................................................116 Supplemental Table 1. Significantly-changing genes common to multiple species' conditionsp ecific resp o n ses......................................................................................................................................1 13 17 Chapter 0. Introduction 0.1 Overview This is a very exciting time to be investigating the wonderfully diverse and complex microbial world. One of the most marvelous features of microbial life is its ability to thrive in such diverse and dynamic environments. My scientific interest lies in the variety of modes by which microbial life accomplishes this feat. In the first half of this thesis I present tools to leverage high throughput sequencing for the study of environmental genomes. In the second half of this thesis, I describe modes of environmental adaptation by bacteria via gene content or gene expression evolution. Associating genes' usage and evolution to adaptation in various environments is a cornerstone of microbiology. New technologies and approaches have revolutionized this pursuit, and I begin by describing the computational challenges I resolved in order to bring these technologies to bear on microbial genomics. In Chapter 1, I describe SHE-RA, an algorithm that increases the useable read length of ultra-high throughput sequencing technologies, thus extending their range of applications to include environmental sequencing. In Chapter 2, I design a new hybrid assembly approach for short reads and assemble 82 Vibrio genomes. Using the ecologically defined groups of this bacterial family, I investigate the genomic and metabolic correlates of habitat and differentiation, and evaluate a neutral model of gene content. In Chapter 3, I report the extent to which orthologous genes of co-existing bacteria exhibit the same transcriptional response to the same change in environment, and describe the features and functions of bacterial transcriptional networks that are conserved. I conclude this thesis with a summary of my tools and results, their use in other studies, and their relevance to future work. In particular, I discuss the future experiments and analytical strategies that I am eager to see applied to compelling open questions in microbial ecology and evolution. 14 0.2 The pivotal importance of deep coverage in environmental genomics While the deep sequencing enabled by cheap next generation technologies has revolutionized our understanding of the diversity and dynamics of the microbial world (Tautz, Ellegren, and Weigel 2010), short read length has prevented environmental surveys from making use of the most prolific platforms. Of these ultra high-throughput sequencing (UHTS) platforms, Illumina offered slightly longer reads than Solid (Shokralla et al. 2012), but at -100bp, still fell below the length threshold where most proteins can be reliably identified (Hoff 2009; Brady and Salzberg 2009) and where most bacteria can be identified by standard community profiling methods. Indeed, the majority of Illumina 16S ribosomal reads from human gut microbiota could not be classified down to the genus level, even if the optimal variable region was targeted for sequencing (Soergel et al. 2012; Claesson et al. 2010). And despite the vast coverage improvement compared to 454, diversity estimates from Illumina were unreliable (Claesson et al. 2010). Efforts to obtain longer sequences from short-read instruments have been made (Hiatt et al. 2010; Sorber et al. 2008), but these multistep library preparations impose a complexity bottleneck, limiting their utility for applications like environmental genomics (Hiatt et al. 2010). Thus, work in these areas chiefly relied on longer read technologies (Logares et al. 2012; Bruijn 2011), and suffered the -48-160x drop in the number of reads per dollar (Rodrigue et al. 2010; Bruijn 2011). The next generation sequencing revolution has helped us to appreciate the extremely diverse and dynamic nature of microbial communities, and that we have sampled only a fraction of their diversity (Quince, Curtis, and Sloan 2008). Study after study reports that their community profiles suffer from tremendous undersampling in many and varied environments (Fierer et al. 2007; Fuhrman et al. 2008; J. A. Gilbert et al. 2009; Lopez et al. 2012), such that many estimates of diversity involve substantial extrapolations of species abundance curves (Lopez et al. 2012). In some cases, increasing sequencing depth does not just improve our comprehension of a system, but make it possible. To learn to triage potential pediatric IBD patients non-invasively, stool community profiles were fed to a machine learning tool, and the taxa it relied on as discriminating features to build the classifier were mainly low-abundance. The vast majority of these taxa would have escaped observation without the coverage afforded by UHTS technologies, making this non-invasive 15 diagnosis method not just less effective but most likely impossible (Papa et al. 2012). Community function is orders of magnitude more diverse, and so the undersampling is even more severe. In the metatranscriptomic sample of a coral holobiont community, sequencing to a depth of 0.5-1e6 reads resulted in the majority of identified microbial transcripts sampled by just a single read (Timberlake, unpublished data), partially because a large fraction of the sample was dominated by ribosomal reads, which is typical even after substantial efforts to chemically deplete them (Giannoukos et al. 2012; Ciulla et al. 2010). For a 1 million read marine microbial community transcriptome, a tracer RNA was used to estimate sample coverage at less than 10e-7 (Gifford et al. 2010). The microbial consortia that naturally live on and interact with humans and all living things are increasingly recognized as contributing to health and disease in complex ways; it is thus important not only for our understanding of the microbial world but for a variety of fields to extend the deep coverage afforded by UHTS platforms to environmental microbiology studies. In Chapter 1, I describe SHE-RA, an algorithm that helps remove both the read length and error rate barriers to the use of short reads. Short fragments of DNA are be sequenced from both ends, such that the low-quality terminii of each read pair overlap. By aligning these two fragments and combining the information of two largely independent measurements of the base identity at these overlapping positions, SHE-RA corrects errors and extends the useable read length. If shotgun DNA is of interest, generating a narrow library size distribution appropriate for overlapping read pairs requires some precision in the DNA fragment size-selection step, and so SHE-RA is presented in a conjunction with an optimized chemical protocol for dSPRI magnetic beads use in precise DNA fragment size selection (Lennon et al. 2010) as part of a scalable high-throughput protocol (Rodrigue et al. 2010). I compare SHE-RA output (overlapped Illumina read pairs) to the standard operating procedure for environmental sequencing at the time, 454-FLX, and show that it is a legitimate alternative to 454 for identifying protein functions and taxa in a shotgun marine metagenome. While the SHE-RA algorithm was tested on Illumina reads, it is designed to be largely platform-independent, and will, in principle, work well with any sequencing technology that produces paired ends reads, thus making metagenomics, transcriptomics, and community profiling tractable for the shorter reads produced by UHTS technologies. 16 0.3 A model system for bacterial ecology and evolution Comparative genomics is central to our understanding of function and evolution, but in bacteria its scope has been limited in several important ways. A large fraction of sequenced and studied genomes are either very distant relatives or closely related pathogens (Markowitz et al. 2009). Extremely long divergence times concentrate our understanding of evolutionary dynamics and adaptation on the very slowest processes, while the proclivity for pathogen sequencing skews our understanding of faster evolutionary dynamics towards the evolutionary modes of that particular lifestyle. Finally, the lack of a commonly accepted bacterial species definition makes the adaptive value of variation difficult to interpret (Fraser et al. 2009; Shapiro et al. 2009). In Chapter 2 1 describe the assembly and analysis of 82 genomes of the Vibrionaceae, an important model system for microbial evolution and ecology. This family of heterotrophic bacteria is abundant in aquatic systems worldwide and includes many human and animal pathogens, and is thus ecologically, medically and economically important. Unlike the vast majority of environmental bacteria, the 'vibrios' are easily cultivatable in the lab, and are thus an ideal system for comparative genomics study, particularly because of the breadth of ecological information collected for this group and the natural phylogenetic units that this ecology defined. In the first half of this chapter, I describe the methodological advances required to assemble 82 genomes from short reads. In the second half of this chapter, I survey the gene content of these bacteria, including its lateral acquisition, fast turnover, fixation, and relation to ecology. I do so in the context of one of the premier features of this model system: taxonomic groups distinguished by ecology. Using the environmental data collected with more than 1900 marine isolates, ecologically cohesive populations were delineated in the Vibrionaeae family. Bacterial taxonomic units are typically demarcated by drawing arbitrary phylogenetic cutoffs, but this clustering often corresponds poorly to organisms' physiology or ecology (Francisco, Cohan, and Krizanc 2012). Instead, phylogenetically as well as ecologically cohesive vibrio populations were identified with the machine learning algorithm AdaptML (Hunt et al. 2008). Using season (Fall/Spring) and particle size, ecological differentiation was found to occur at genetic distances that were surprisingly fine. However, subsequent studies 17 consistently discovered these same ecological populations: they were largely cohesive in a variety of gene phylogenies (Preheim, Timberlake, and Polz 2011), reproducibly recovered across time (Szabo et al. 2012), and robust to the particular ecological trait used to define them (Preheim, Timberlake, and Polz 2011; Preheim et al. 2011). The ecological distinctness between populations demonstrates that despite co-existing in the water column and sometimes very close genetic relations, these groups do not have same niche, instead partitioning resources spatially and temporally. Within populations, individuals interact socially (Cordero, Wildschutte, et al. 2012) and sexually (Shapiro et al. 2012), experiencing rampant homologous recombination along population lines that could result in genetic differentiation in a manner akin to eukaryotic speciation. All in all, these ecological populations have so many of the hallmarks of a fundamental taxonomic unit, and it is likely that within each, individual genotypes share a niche and face similar selective pressures, making this the ideal model system to study variation, adaptation to environment, and differentiation. In addition to the breadth of ecological information associated with this group, part of the strength of the model system stems from our genome sequences' sampling the hierarchy of diversity and divergence in the Vibrionaeae family: from finely sampled 'sister strains' separated by just 60 core genome SNPs, fewer than can be seen to accumulate from a few weeks' selection in the lab (Timberlake and Clarke, unpublished results) to the most divergent genomes, which include much of the family's -600 million years radiation (Thompson 2007). Unlike previous comparative genomics studies that focus either on several strains or divergent clades, here we can quantify genomic and phenotypic variance within a population, the unit on which selection acts, as well as the divergence between populations, illuminating the evolutionary processes involved in adaptation to new habitats. Importantly, while previous studies described evolutionary processes in the context of the operational taxonomic definitions ("OTU", typically 3% 16S divergence) or named species, such studies are likely combining multiple differentiated natural populations with divergent ecologies (Hunt et al. 2008; Francisco, Cohan, and Krizanc 2012), thus convolving intrapopulation and inter-population variation, and convolving the effects of the likely distinct forces of selection experienced in the multiple niches in that "species". Thus this genomic reference for the vibrio model system affords us a unique opportunity to investigate the evolutionary dynamics governing variation in a single niche, adaptation to new ecologies, as well as the interactions between co-existing populations. 18 In the first part of this chapter, I describe the methodological advances required to assemble 82 genomes for this model system. Ultra high-throughput sequencing platforms (UHTS) has made such genome assembly projects tractable, but there are theoretical limits on de novo assembly of short reads that the deep coverage cannot overcome. This is particularly true of the early platforms' short single-end reads that constituted most of our dataset (Chaisson, Brinza, and Pevzner 2009). I report a computational approach designed to assemble full-length16S genes, so that important phylogenetic marker could be utilized, and a more general approach for more contiguous genome assembly from short reads. Expensive complementary approaches such as long-insert mate-pair libraries are typically required for genome assembly contiguity (Chaisson, Brinza, and Pevzner 2009; Butler et al.; Wetzel, Kingsford, and Pop 2011). By designing a hybrid reference and de novo assembly approach, I was able to leverage just 1 near-finished genome per population to achieve markedly better assembly contiguity (2-3x N50's) compared with de novo assembly alone. In the second part of this chapter, I use these 82 new genomes to survey the relationships of phenotype and gene content with ecology in this model system. Vibrios are known for their great genetic and metabolic diversity, but the driving forces and dynamics of metabolic evolution remain largely unknown (Keymer et al. 2007). For each of these natural populations, an inferred habitat, based on the isolation data of their phylogenetic group, serves as a representation of their ecology. In this study, we use season (Fall/Spring) and particle size, two environmental measures available for the vast majority of the strains full genome sequenced. These ecological traits have been shown to be important axes for ecological differentiation in this group (Thompson et al. 2004; Hunt et al. 2008) as well as stable descriptions of populations' ecology over time (Szabo et al. 2012) but it is not yet known to what extent they engender genotypic or phenotype similarity. To explore the effect of similar microscale ecology on these features, I determine the extent to which cohabitants share substrate utilization capabilities and overall gene content, as well as the frequency and structure of recent horizontal gene transfer and its structure in this clade. What is the series of events that coincides with the differentiation of a population? Recent studies report that within traditional species boundaries exist multiple subpopulations that can be robustly separated along various non-nucleotide-divergence axes such as ecology (Rocap et al. 2003), geography (Keymer et al. 2007), host organism 19 (Preheim et al. 2011; Sim et al. 2008). However the genomic and phenotypic correlates for divergence along those axes, and their temporal sequence, are largely unknown. Proteomic expression has been suggested to be diverging faster than gene content (Konstantinidis et al. 2009), but the relative timescales for the divergence of other attributes is unclear. Our superfine phylogenetic sampling included many genomes that are distinguished by only 100-300 SNPs genome-wide, a number comparable to what accumulates in just a few weeks of selection and bottlenecks of these same strains in the laboratory (Clarke and Timberlake, unpublished results). Vibrios are known to partition resources at surprisingly fine levels of phylogenetic divergence, and though a recent study illuminated some of the crucial events that occur early in this differentiation process (Shapiro et al. 2012), but the role of gene content and the extent to which this partitioning on fine timescales involves new DNA, new alleles, or new regulation is unknown. In Chapter 2, I ask to what degree flexible DNA, gene content, core genome, or metabolic profiles are distinct between closely related individuals and populations and start to compare the relative timescales for the divergence of these different measures as populations differentiate. Closely related bacteria are increasingly appreciated to vary greatly in their gene repertoire (Tettelin et al. 2005), and the vibrios in particular have remarkably fast gene turnover of many genes (Thompson 2005), but the selective implication of this large and labile flexible genome is poorly understood (Polz et al. 2006). Indeed, same-cluster isolates (1% 16S divergence) can exhibit 20% differences in genome size (Thompson 2005). It is astounding but not implausible to suggest that this large fraction of the genome's content may not be adaptive (Berg and Kurland 2002; Polz et al. 2006), indeed, in the absence of other evidence, neutral variation seems a more appropriate null hypothesis. However, the current literature usually makes one of two opposing assumptions about the adaptive value of these genes with intermediate penetrance in a bacterial group: either strain-specific gene content is held responsible for pathogenic or other adaptive phenotype, and lowpenetrance or strain-specific genes are dismissed as being inessential or not adaptive. A neutral model of flexible gene fixation was recently introduced (Haegeman and Weitz 2012), however, the model has been tested only in its ability to fit observed distributions of flexible genes penetrance in species groups. The ecological data available in this Vibrio model system will enable us to directly test the selective value of flexible DNA and its neutral or selective fixation. In this study, I quantify the extent to which useful flexible DNA, 20 core genes and metabolism diverge at the time scales where resource partitioning is known to occur, and the extent to which flexible gene fixation is concordant with a neutral model. While the neutrality of gene content variation remains obscure, the model that each genome partitions into two kinds of genes, "core" and "flexible", has become a very popular. The model was originally proposed as a way to reconcile the huge intraspecific variation in gene content with any species concept (Ruiting Lan and Reeves 2000; R. Lan and Reeves 2001). Despite the rampant illegitimate recombination introducing new DNA and diversifying the flexible genome, a species' core genome could be resistant to diversifying forces. This core genome, stable and common to all, would be homogenized by intraspecies homologous recombination, thus generating the phylogenetic clusters we empirically observe and the cohesion we associate with the idea of species. Indeed, a few studies have shown the enrichment of recombination in core genes within phylogenetic clusters (Lef6bure et al. 2010; Bennett et al. 2010; Shapiro et al. 2012), and even highlighted a possible mechanism (Budroni et al. 2011). While this rationale (and indeed the bacterial species concept itself (Doolittle 2012; Cohan 2011)) remains the subject of active debate, the core/flexible dichotomy appears to be biologically and ecologically meaningful, as the two gene types have been show have differ in genomic location (Hacker and Carniel 2001; Dobrindt et al. 2004; Coleman et al. 2006) functions (Makarova et al. 2006; Kettler et al. 2007), evolutionary modes (Bennett et al. 2010), expression (Stewart et al. 2011), purifying selection strength (V. Tai et al. 2011), and geographical distribution (Boucher et al. 2011). The success of this core/flexible model is remarkable given how widely their definitions vary (due, in part, to the lack of a consistent definition for an evolutionary unit in bacteria). Various studies defining "core" genes as those shared at clonal complex, named species, genus or even family level, a span of more than 2-12% 16S divergence (Grote et al. 2012). This begs the question whether the core versus flexible findings are the result of analyzing two extremes from a continuum of gene types. Indeed, as comparative studies have included increasing numbers of genomes, these two gene types are being further subdivided in the literature into "core", "extended core", "unique core", "character", "accessory" and "flexible" (Lilburn, Cai, and Gu 2009; Zwick et al. 2012; Lapierre and Gogarten 2009), based on the fraction of genomes in which they are observed. This further partitioning leads one to question if the core/flexible dichotomy concept is valid, and whether a better and consistent definition of core and flexible, one that is consistent with bacterial ecology and differentiation, could make this model of gene-type partitions more 21 meaningful, and could even reinforce the delineation of species in bacteria (Ruiting Lan and Reeves 2000). The 82 Vibrio genomes I have assembled, partitioned into natural evolutionary units, present a unique opportunity to properly define core and flexible gene content in a biologically meaningful way in accordance with the original Core Genome Hypothesis. In this preliminary analysis of those core and flexible genomes, I evaluate a neutral model of flexible gene content penetrance, and the extent to which flexible genes fix and can replace core genes on the timescales of ecological differentiation and with ecological correlates. 22 0.4 Conservation of coexpression without conservation of response in bacterial gene regulatory networks. Microbes sense their changing environment and alter their phenotype via gene regulation in response. These gene regulatory responses are thought to be critical for fitness and adaptation (Britten and Davidson 1969; Cases, de Lorenzo, and Ouzounis 2003), but despite extensive documentation of differentially expressed genes (DeRisi, Iyer, and Brown 1997; Faith et al. 2008), even a basic understanding of their fitness contributions, conservation, or adaptive change is lacking (Birrell et al. 2002; Giaever et al. 2002; S. L. Tai et al. 2005; Klosinska et al. 2011). A fundamental unanswered question is whether orthologous bacterial genes respond the same way to the same environmental cues. Comparative transcriptomic studies in bacteria typically focus on species-specific differences, which can then be implicated in pathogenic or other adaptive phenotypes (Yoder-Himes et al. 2009; Guell et al. 2011; Zhang, Dudley, and Wade 2011). Impressive studies have characterized the divergence of transcriptomic responses in 5 closely related yeast species (Tirosh et al. 2006), and of proteomic abundance profiles in 10 co-generic proteobacteria (Konstantinidis et al. 2009), but to our knowledge, conservation of bacterial transcriptomic responses had been never been quantified. While the evolution of response has not been explored in prokaryotes, evolution of their transcriptional network has been explored by looking at features of network architecture, including modules (McAdams, Srinivasan, and Arkin 2004; Babu et al. 2004; Gelfand 2006). Studies have convincingly demonstrated that genomes are partitioned into functional modules that are expressed together (Stuart et al. 2003; Bergmann, Ihmels, and Barkai 2004), clustered together (Price et al. 2005), physically interacting (Zinman, Zhong, and Bar-Joseph 2011) and laterally transferred together (Morgan N. Price, Arkin, and Alm 2006). As the transcriptional network in evolves, these units can remain cohesive, with some conserved transcriptional modules documented in species as divergent as E coli and metazoans (Stuart et al. 2003; Bergmann, Ihmels, and Barkai 2004), or even out to B. subtilis (Waltman et al. 2010; Zarrineh et al. 2010). Overall, network evolution is complex, including flexibility and extensive rewiring of functional modules (Lozada-Chavez, Janga, and Collado- 23 Vides 2006; Roguev et al. 2008; Singh et al. 2008; Fokkens and Snel 2009), however, the parts of the transcriptional network that are conserved over 4 billion years of bacterial evolution may be fundamental to microbial life and its most common environmental/adaptive challenges. In Chapter 3 of this thesis, I assess conservation in the bacterial transcriptional network in two ways. First, I quantify the conservation and function of the simplest possible units of the transcriptional networks, co-expressed gene pairs, in four very divergent bacteria, starting with an extensive compendium of transcriptomic data we generated in two environmental isolates (Venkateswaran et al. 1999; Heidelberg, Seshadri, Haveman, Hemme, Paulsen, Kolonay, Eisen, Ward, Methe, Brinkac, et al. 2004; Heidelberg, Seshadri, Haveman, Hemme, Paulsen, Kolonay, Eisen, Ward, Methe, Brinkac, et al. 2004) supplemented by mining public databases for expression profiles in other species (Barrett et al. 2007; Faith et al. 2008). Second, I conduct a comparative survey of transcriptional phenotype to determine the features of transcriptional response that are conserved, focusing on three model metal-reducing bacteria and the environmental perturbations which are thought to be relevant for their ecology (Lovley et al. 1993; Finneran, Housewright, and Lovley 2002). This comparative survey of transcriptional phenotype revealed the surprisingly fast-diverging and mercurial nature of bacterial gene regulatory responses, motivating an re-examination of how functional conclusions are drawn from transcriptomic experiments and highlighting the clear need for a phylogenetic model of gene expression evolution. 24 Chapter 1. Unlocking short reads for metagenomics Note: this chapter is presented as it originally appeared in: PLoS ONE 5 (7) (July 28): e11840. doi:10.1371/journal.pone.0011840, plus the following additions: 1.6 Contributions to this manuscript 1.7 Appendix 25 0'PLoS OPEN a ACCESS Freely available online one Unlocking Short Read Sequencing for Metagenomics S6bastien Rodrigue*, Arne C. Maternal, Sonia C. Timberlake, Matthew C. Blackburn, Rex R. Malmstrom, Eric J. Alm*, Sallie W. Chisholm* Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachussetts, United States of America Abstract Background: Different high-throughput nucleic acid sequencing platforms are currently available but a trade-off currently exists between the cost and number of reads that can be generated versus the read length that can be achieved. Methodology/Principa/ Findings:We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read. Conclus/ons/Igniflcance:This strategy is broadly applicable to sequencing applications that benefit from low-cost highthroughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the lilumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing. Citation: Rodrigue 5,Materna AC, Timberlake SC,Blackburn MC, Malmstrom RR,et al. (2010) Unlocking Short Read Sequencing for Metagenomics. PLoS ONE 5(7): e11840. doi:10.1371/journal.pone.0011840 Editor. Jack Anthony Gilbert, Plymouth Marine Laboratory, United Kingdom Received May 26, 2010; Accepted July 6, 2010; Published July 28, 2010 Copyright: © 2010 Rodrigue et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This work was supported by funds from the Gordon and Betty Moore Foundation Marine Microbiology Initiative (www.moore.org), and the Center for Microbial Oceanography: Research and Education (C-MORE) (http://cmore.soest.hawaii.edu) awarded to S.W.C., and by funds from the Department of Energy Genome-to-Life (DOE-GTL) (http://genomicscience.energy.gov) program awarded to S.W.C. and EJA. S.R.is the recipient of postdoctoral fellowships from the (www.nserc-crsng.gc.ca) and from the Fonds Quebecois de Recherche sur Ia Nature et Natural Science and Engineering Research Council of Canada (NSERC) Technologies (FQRNT) (www.fqmt.gouv.qc.ca). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing interests exist. E-mail: ejalm@mit.edu (EJA); chisholm@mit.edu (SWC) @These authors contributed equally to this work. although the procedures we describe could be applied to other sequencing platforms that can generate paired end data. Introduction New DNA sequencing technologies are dramatically changing the research landscape in biology by enabling experiments that were previously too expensive or time-consuming. Three platforms, the Roche454 Genome Sequencer, the Illumina Genome Analyzer II and the Applied Bio-Systems SOLiD system, are widely used [1]. The latter two instruments produce a very large number of short reads, typically 25-75 nucleotides long, at a significantly lower cost per basepair (bp) compared to the Roche454 pyrosequencing system. However, short reads create difficulties for some applications such as metagenomics and metatranscriptomics, and therefore pyrosequencing remains the platform of choice. Efforts have been made to obtain longer sequences from short-read instruments, and further expand their range of applications [2,3]. A clever strategy was recently developed by Hiatt and colleagues, but their approach requires many steps and imposes a library complexity bottleneck that can limit its utility for applications like metagenomics [2]. We sought to establish a more straightforward and broadly applicable method. Our approach builds on preparing libraries with tunable size distributions such that sequencing from both ends generates overlapping reads. The lower-quality ends of mated reads can then be overlapped to produce high-quality consensus sequence. We demonstrate our strategy using the Illumina GAII technology .PLoS ONE | www.plosone-org Results We developed a simple and automatable protocol to rapidly, efficiently and reproducibly isolate DNA fragments of a specific size distribution without the need for gel electrophoresisis. The method is similar to a recently published protocol [4] but allows adjusting the size distribution of the isolated DNA fragments as appropriate for the sequencing platform used. After ultrasonic DNA shearing, size-selection is carried out via double solid phase reversible immobilization (dSPRI) using carboxyl coated magnetic beads (Figure 1). DNA associates with these particles in a strictly size-dependent manner according to the concentration of polyethylene glycol and salts [4-7], which can be adjusted by changing the volume ratio of SPRI bead suspension to DNA solution (Figure 1). The size distribution of selectively bound DNA is concentration independent (Figure 2), which supports the preparation of sequencing libraries from dilute samples. In the first step of this procedure, DNA fragments larger than the desired library insert size are bound to magnetic beads, which are then discarded. The supernatant is saved, and mixed with an appropriate volume of fresh SPRI beads. This second incubation preferentially immobilizes DNA fragments of the targeted size 26 July 2010 1 Volume 5 | Issue 7 | e11840 DNA Sequencing Pipeline A RaO hesdADNA B S,,,.n12 S.,,.,1 Msme 8ncubndonUMe RW*o beadaftp1 Inuba fne A ag A 1u11.3 65e50 20 Mn 15min "46p 25.5 5 6550l: 20 min 60115[pI :0.52 15min labp 23.3 C 46ASOpI09 20miW 5SSMLprI0.s5 7min larbp 2M.e 455 MO ; 0.9 20 mi 40M5[pA;0.42 7 min " 248 E 4&50; 20min 350mnq;0.32 7min assp F 4 20min 20/5 p;0.21 7mm Sbp 22,1 G 45050LA;]005 20min Iwssjp;0.11 7min flhp Is 40501s1Q;0s8 5mn0 100/85[ius sI.18 5mkn srp H .3 ;0.S 9 0 Kpg:0 1550 100150 200 300 0115Ipit 0.87 400 500 700 1500 C CV[% bp 232 fragment length [bp] 32.4 15 50 100150 200 300 4 15 50 100 150200 300 400 500 700 1500 15 50 100 150200 300 400 500 700 1500 100 40 200I 100 I. 100 H 40 20 F 100 so so 0 40 40 30 20 15 50 100 150200 300 400 500 700 1500 150501010200 300 400 500 700 1500 fragment Wngth 1bp] Figure 1. Size-dependent isolation of DNA fragments from sheared genomic DNA via dSPRI. AMPure XP SPRI beads bind DNA fragments in a size dependent manner according to the concentration of salts and polyethylene glycol (PEG) in the reaction [4-7], which can easily be changed by using different volume ratios of DNA to SPRI bead solutions. Atwo-step procedure isemployed to isolate targeted DNA size fractions. Panels Ato H present Bioanalyzer DNA-1000 assays showing the sheared genomic DNA used as starting material (black), the larger size DNA fragments discarded in separation 1 (red), and the size fraction purified and recovered after separation 2 (blue). Panel I is a table summarizing the conditions and results displayed in panels Ato H.All Bioanalyzer DNA-1000 traces after separation 1 (panel J), and after separation 2 (panel K), are respectively displayed on a graph for the conditions presented in panels A to H. The conditions displayed in panel H were used to obtaine the Illumina composite reads discussed in the text. The wider DNA fragment size distribution from panel H allowed to better analyze the effects of shorter versus longer overlapping regions on consensus reads. doi:10.1 371/journal.pone.001 1840.gOO1 range. Shorter DNA molecules, salts and other reagents are removed by wash steps, and the DNA fraction that remains bound to the magnetic SPRI beads can subsequently be recovered with virtually no loss [4]. The resulting DNA population reproducibly exhibits a relatively narrow size distribution (Figure 2), and can be used to prepare high-throughput sequencing libraries by ligating appropriate oligonucleotide adapters (Table 1). DNA size selection allowed us to choose an appropriate library insert length with precision, such that mate-pair sequences of each insert overlap in the middle, and can be aligned to produce a longer, more accurate composite read. We used our streamlined dSPRI protocol to produce a paired-end library with an insert size range of 100-300 bp (Figure 1H) for the Illumina GAIIx platform, and generated millions of paired-end reads 143 nucleotides (ntds) in length. We observed that the sequencing .Q.PLoS ONE I www.plosone.org error rate remained <1% for 110 ntds in the forward read, and 87 ntds in the reverse read (Figure 3A). To reconstruct each insert's sequence, we developed the SHERA (SHortread ErrorReducing Aligner) software tool, which aligns each mate-pair to produce a composite read with Phred-like quality scores [8]. With each alignment, we report a confidence metric, allowing misaligned reads to be reliably filtered out. We chose a cut-off such that 87% of reads were retained and less than 1% of paired reads were incorrectly aligned. The average composite read length reached 180±40 bp (Figure 4), as expected from the library insert size distribution. Importantly, the composite reads' length can easily be adjusted by modifying dSPRI selection conditions (Figure 1), and reach >200 bp. For nucleotides in the aligned region, SHERA recalculates the error probability of each bp, given the base call data from both reads (Figure 3B). 27 July 2010 1 Volume 5 1 Issue 7 1 e11840 DNA Sequencing Pipeline A 1550100 200 300 400 500 B C 5. 5. I I 700 1500 1550100 fragment length (bp] 200 300 400 500 700 1500 15 50 100 fragment length (bp] 200 300 400 500 700 1500 fragment length [bp] Figure 2. Reproducibility of double-SPRI. The panel shows DNA fragment size distributions as obtained by Bioanalyzer DNA-1000 assays. The curves represent the size fractions removed during the first separation step A) or recovered after the second separation B). The two size fractions were independently reproduced in 4 or 8 separation experiments. The curves in B) represent the libraries sequenced after dSPRI based size selection, adapter ligation and PCR enrichment. While concentrations (arbitrary fluorescence units) vary between reproduced libraries the range of removed or enriched DNA fragment sizes was highly reproducible. Panel c) shows the DNA fragment size distribution recovered after the second separation when using decreasing amounts sheared genomic DNA. dSPRI allows reliable size selection in a DNA concentration independent manner. doi:10.1 371/joumal.pone.001 1840.g002 Composite read quality exceeds that of individual short-reads (Figure 3). The shortest composite reads have the best quality scores, still, the average error rate remained <1% across 95% of 250 bp reads (Figure 5). New technologies and sequencing chemistry improvements are likely to further increase ultra high-throughput read lengths, thus extending the achievable length of high-quality composite sequences. Short-read technologies have not been widely used for metagenomic analyses because of the difficulty in confidently assigning phylogeny or putative gene function to short sequences, although a strategy using Illumina mate paired libraries to assign taxonomy has recently been evaluated with simulated data [9]. We sought to test if our composite reads could potentially transcend this limitation. To address this question, we sequenced a metagenomic DNA sample from the Hawaii Ocean Time Series [10] using our overlapping paired-end approach and compared the results with those obtained with the Roche 454FLX technology, the most widely-used platform for these studies. Over 4.8 million high-confidence composite sequences of 180±40 bp were generated from a single sequencing lane of the Illumina GAI~x while 673,673 454-FLX reads of 207±71 bp were already available [10]. Despite the slightly shorter average read length of the composite Illumina reads relative to 454-FLX, the fraction of reads that could be assigned to a taxon was similar in both cases, even when considering longer 454-FLX reads (Figure 6A). The relative abundance of the sequences that could be assigned to each taxon was also consistent between the 454FLX and composite read datasets (Figure 6B). We have focused our comparison on Illumina and 454-FLX, because they generate similar read lengths, and read length can make a difference in performance for some applications [11]. As the technology landscape evolves our general approach for experimental and computational overlapping of short paired reads will be readily adapted to new platforms and upgrades, and therefore is likely to continue to play a role in providing additional cost-effective sequencing options. Taken together, these results demonstrate that Illumina composite reads produced by SHERA constitute a practical and cost-effective alternative to 454-FLX for metagenome sequencing and analysis while providing significantly deeper sampling at a lower cost. Table 1. Oligonucleotides sequences. Name IGA-A-down Purification PLC Sequence (5' to 3') B'B'B AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT AC/3AmM/ IGA-A-up HPLC /5AmMC6/ACA CTC TTT CCCTAC ACG ACG CTCTTC CGA TCT BBBB IGA-PE-80-down HPLC /5AmMC6/CTC GGC ATT CCT GCT GAA CCG CTC TTC CGA TCT IGA-PE-80-up HPLC AGA TCGGAA GAG CGG TTC AGC AGG AAT GCC GAG /3AmM/ IGA-PCR-PE-F PAGE AAT GAT ACG GCG ACC ACC GAG ATC TAC ACT CTT TCC CTA CAC GAC GCT CTT CCG ATC T IGA-PCR-PE-R PAGE CAA GCAGAA GAC GGC ATA CGA GAT CGG TCT CGG CAT TCC TGC TGA ACC GCT CTT CCG ATC T QPCA.Iibrary quantification-F Standard desakilng MAT GAT ACG GCG ACC ACC GA QPCR-library quantiflcation-R Standard desalting CAA GCA GAA GAC GGC ATA CGA All oligonucleotides were purchased from IDT (www.idtdna.com). 5AmMC6: 5' Amino Modifier C6. 3AmM: 5' Amino Modifier. B: Barcode base. The sequencing reads start with these bases. The barcodes used in this study were AAAC, ACCC, AGGC and ATTC. B': Complementary barcode base when the two oligonucleotide from the same adapter are annealed. doi:10.1371/journal.pone.001 1840.t001 .PLoS ONE | www.plosone.org 28 July 2010 1 Volume 5 1 Issue 7 | e11840 DNA Sequencing Pipeline A CM. Q75 0. CM 0504 a. mean Q25 6 6o position on forward read 2o 8b 4o 10 1k0 140 140 120 80 100 60 40 20 0 position on reverse read - B 075 (0 C>- 050 mean 0 CD 025 a 0 - CD, ' I 0 20 I I I 40 60 80 I I I I I 100 120 140 160 180 position on composite fragment [bp] Figure 3. Quality of llumina reads out to 143 bp. A) The mean Phred quality of single reads is shown in the solid red lines, with error bars displaying quartiles. Read quality is highly variable toward the reads' ends. Read quality as a function of base pair isworse on the second mate of the pair. B) The mean and quartiles of Phred quality by base for the average-length composite read. doi:10.1 371/journal.pone.001 1840.g003 applications including amplicon sequencing, transcriptomics, and de novo assembly, will take advantage of the millions of longer high quality reads produced by this simple and scalable method. Discussion We have demonstrated that our overlapping mate-pair strategy can substantially increase read length while maintaining high sequence quality, in turn making economical short-read platforms suitable for a range of applications. For example, the performance of phylogenetic classification and annotation of DNA sequences improves with read length, the most significant increase being observed by extending read length from 100 to 200 bp [12-13]. By allowing to reach and exceed this latter threshold, our method enables the use of the ubiquitous Illumina platform for metagenomics. Importantly, the dSPRI DNA fragment size selection and the SHERA package for computational read joining can be used with any existing or emerging short-read sequencing platforms capable of producing matepairs. Compared to pyrosequencing, the dominant sequencing technology in metagenomics, our composite read strategy already provides a >30-fold increase in sequencing capacity per dollar at similar or higher accuracy (Table 2). We expect that the advances described here will not only unlock short read sequencing for metagenomics, but that a number of other .. PLoS ONE I www.plosone.org Materials and Methods Double-SPRI (dSPRI) DNA fragment size-selection High-molecular weight genomic DNA was obtained from a marine environmental sample (Hawaii Ocean Time Series, cruise 186, 75 m depth, [10]) and from laboratory cultures using standard procedures [14], or by amplifying genomes from individually flow-sorted cells as described [15]. DNA samples (50 pL, 2 pag) were sheared using 18-24 cycles of alternating 30 seconds ultasonic bursts and 30 seconds pauses in a 4*C water bath (Bioruptor UCD-200, Diagenode). The resulting DNA molecules were end-repaired and phosphorylated according to manufacturer's recommendations (End-repair kit, Enzymatics or New England Biolabs). The size distribution of the resulting fragments ranged on average from -100 to -700 bp as judged by Agilent Bioanalyzer DNA-1000 assays. 29 July 2010 1 Volume 5 | Issue 7 | e11840 DNA Sequencing Pipeline Illumina paired end reads 40000 - 30000 - 20000 - Overlapped composite fragment A R sequence length 143bp E 0) 0 Co, 0) E Q - CL cc 10000- l- M~ 0- 100 150 200 250 300 350 0- fragment length [bp] 0 Figure 4. High-confidence alignment yield by Insert lengths. Distribution of insert lengths by aligning the original reads to the reference sequence (gray), and lengths of composite reads retained (red). doi:10.1371/journal.pone.001 1840.g004 100 150 200 250 .0 a. sequence length 180bp 8 a We optimized a two-step dSPRI protocol [4] for DNA fragment size selection that quantitatively removes fragments above or below the desired size range. In a first separation step, DNA fragments above an upper size threshold are bound by Agencourt AMPure XP SPRI magnetic beads and discarded. In the following second separation step fragments in the desired size range are selectively captured on the beads while all shorter fragments are discarded with the supernatant. The conditions during first and second separation are tunable to enable selective enrichment for other fragment size distributions (Figure 1). In addition, we showed that the size specificity of the dSPRI procedure is stable over a wide range of DNA concentrations (Figure 2). The sequencing libraries prepared for this work were obtained by combining 50 pL of sheared DNA with 40 piL of Agencourt AMPure XP SPRI magnetic beads and incubated at room temperature for 5-7 minutes. The tubes were next placed on a magnet holder (Invitrogen-Dynal) for 2 minutes before transferring the supernatant to a new tube and discarding AMPure XP particles from this first separation step. For the second separation step, 100 pL of fresh AMPure XP beads were added to 85 pL of the supematant from the previous separation, mixed well, and incubated for 5 minutes before transferring the tube to the magnet holder rack. The supernatant from this second step was discarded and the SPRI particles were washed twice with 0.5 mL of 70% ethanol. The wash steps ensure complete removal of polyethylene glycol (PEG) and salts, but also of any remaining unbound fragments smaller than the desired fragment size distribution. After removing the 70% ethanol from the wash steps, the beads were air dried at room temperature (or at 37'C) for 10-15 minutes. In order to recover the bound DNA, 30 pL of water was added to the beads, and incubated at room temperature for at least I minute. The DNA solution was separated from the Agencourt AMPure XP SPRI magnetic beads for 2 min on the magnet holder, and transferred to a new tube. Applying aforementioned conditions for dSPRI led to the specific enrichment of a DNA size-fraction with an average size of 207 bp. Enrichment for other size distribution can easily be achieved by modifying the dSPRI conditions as described in Figure 1. .PLoS ONE | www.plosone.org 50 B 00 0 (N 0 Ca 0 50 100 150 200 250 a-. C I.. | sequence length 250bp - 0 I A CD 0 . 0 Cu -0 0 0. ~0 o0 0 50 100 150 200 250 position on composite sequence Figure S. Quality of composite fragments after overlapping. The mean Phred quality (and corresponding error rate) at each position of the read. We show the original Illumina reads and the composite fragments generated by overlapping. A) 143 bp, the length of the original Illumina read; B) 180 bp, the mean length of overlapped fragment in this library and C) 250 bp, roughly the mean length generated by 454-FLX technology. doi:10.1 371/journal.pone.001 1840.g005 30 July 2010 | Volume 5 | Issue 7 1 el 1840 DNA Sequencing Pipeline A B assignable fraction of reads 0- 0.0 0.1 0.2 0.3 0.4 0.5 0.6 a 0 top 25 taxa represented in sample 0 0 .* LL 0 *t3 .b 0- 101 102 103 104 o10 composite Illumina reads Figure 6. Composite Illumina reads constitute a legitimate alternative to pyrosequencing for metagenomics studies. DNA from a marine metagenomics sample was sequenced with our overlapping mate-pair approach. The composite reads were directly compared to 454-FLX sequences from the exact same sample. A) Fraction of reads that could be assigned to a taxon using the composite reads (mean read length 180 bp), the entire 454FLX dataset (mean read length 207 bp), or longer 454-FLX reads (mean read length 254 bp). B)Comparison of taxon assignments using composite reads and 454-FLX pyrosequencing reads. The top 25 represented taxa, with colored symbols next to the bracket, are listed in Table S1. doi:1 0.1371/journal.pone.001 1840.g006 Illumina paired-end library construction reaction consisting of 20 mM Tris-HCI pH 8.8, 10 mM (NH 4) 2SO 4 , 10 mM KCI, 2 mM MgSO 4 , and 0.1% Triton X100. The resulting DNA fragments were diluted 2 fold with ddH 2O, and PCR amplified using Phusion Hot Start HighFidelity DNA polymerase (New England Biolabs), and simultaneously monitored on a Bio-Rad Opticon real-time PCR instrument. Each reaction consisted of I sL of diluted nicktranslated DNA, 0.5 units of Phusion Hot Start High-Fidelity DNA polymerase, IX Phusion HF buffer, 0.4 pM of each IGAPCR-PE-F and IGA-PCR-PE-R primers [described in Table 1], 200 pM dNTPs as well as 0.25X SYBR green I. The initial denaturation step was at 98'C for 1 minute, followed by successive cycles of 98'C for 15 seconds, 65'C for 10 seconds and 72'C for 15 seconds. The reactions were stopped in the late logarithmic amplification phase, and the DNA from the corresponding samples were pooled. The resulting libraries were composed of fragments with distinct adapters at each extremities due to PCR suppression effect [17-19]. The amplified libraries were subjected to an additional round of AMPure XP SPRI beads purification, with a DNA/beads ratio of 1, to remove residual primers and adapter dimers. In order to achieve optimal cluster densities for sequencing on the Illumina GAII platform, samples were quantified using two methods. The samples were first analyzed using the High-Sensitivity DNA Kit for the Bioanalyzer (Agilent Technologies) to detect any primer-dimers and to determine the average molecular weight of each library. The samples were next quantified by real-time PCR on a Light~ycler II 480 (Roche), using the LightCycler 480 SYBR Green I Master mix (Roche), appropriate primers (Table 1), and serially diluted PhiX 335 bp control library (Illumina) as a standard (quantified with the Agilent Bioanalyzer; we observed that the Illumina PhiX 335 bp Our strategy to create Illumina compatible sequencing libraries uses a blunt-end ligation step of two distinct adapters rather than the ligation of a single Y-shaped adapter [16]. dSPRI-selected DNA fragments were combined with a 10-fold molecular excess for each oligonucleotide adapters (see Table I for oligonucleotide sequences). Blunt-end DNA fragments were ligated using a rapidligation kit (Enzymatics or New England Biolabs) for 5-10 minutes at room temperature. Adapter excess was subsequently removed using Agencourt AMPure XP SPRI beads according to the manufacturer's recommendations (DNA/bead ratio of 1). A nicktranslation step was performed with 2 units of BstI exo~ DNA polymerase (New England Biolabs) at 65'C for 25 minutes in a Table 2. Cost estimates of sequencing strategies. Sequencing technology Composite illumina Cost of a run ()4,200.00 Number of reads obtained Average Read length (bp) 454-FLX 8,000.00 10,000,000 400,000 180 25 Reads per $ 2381 50 bp per $ 428,571 12,500 Cost ratio per read 48 1 Cost ratio per bp 34 1 'The lilumina sequencing pricing is from the MIT BioMicro center (http:// openwetware.org/wiki/BioMicroCenter:Pricing) and the 454-FLX is from http:// www.biosystems.usu.edu/core_labs/genomics/454/. doi:10.1371/journal.pone.001 1840.t002 .PLoS ONE | www.plosone.org 31 July 2010 | Volume 5 | Issue 7 | e11840 DNA Sequencing Pipeline probability that the consensus base is wrong, given the base calls at that position in the alignment and their respective quality scores. Calculating this probability required making a few assumptions and approximations: concentration vary from lot to lot, and typical concentrations were found to range from 11-15 nM). The phiX standard curve was required to have an r 2 value of >0.98 or the run was repeated. This step both confirmed the concentration of the libraries as well as confirmed the presence of proper primers for Illumina sequencing. 1. Given a basecall was in error, there is a uniform probability of reporting one of the other three bases. Illumina paired-end sequencing 2. The probability of correct alignment was approximated to be 1. (A decent approximation given true-positive rates of >99% in our quality filtered alignments, see Counting False Alignments below). Illumina libraries were clustered for sequencing immediately following sample quantification according to manufacturers recomendations (Illumina). Following clustering, the samples were loaded on to an Illumina GAJIx sequencer. Sequencing reagents for each read were pooled prior to loading. Data from the Illumina GAIIx was analyzed using the Illumina pipeline 1.4.0 to generate fastq files. 3. The basecalls of the forward and reverse read are independent observations, given the true base in the insert. 4. The prior probability of a base being present in an insert in this flowcell can be estimated from the low-error portion of the reads or approximated as uniform at the user's discretion (In our case, we used empirical observations to estimate %GC in single-genome samples, and a uniform distribution for the metagenomic sample.) Mate-Read Overlapper Algorithm In order to align overlapping mate reads and produce accurate composite fragments, we developed the SHERA software pacakge, a collection of computer code written in the PERL programming language. The algorithm for aligning paired reads and computing quality scores was designed to be broadly applicable, independent of the sequencing platform. The source code is freely available at http://alnlab.mit.edu/ALM/Software/Software.html. In brief, SHERA takes overlapping short reads as input, finds the best alignment, produces a composite read, calculates new quality scores for the overlapping bases, and scores the alignment confidence. A detailed description of each of these steps is provided below. We also describe our procedure for evaluating false alignment rates on both real and simulated reads. We first confirmed that the posterior error probability for the overlapped region was correct by counting the true error rate in composite reads constructed from simulated reads. Next, using real mate-reads as input, we confirmed that our Phred quality scores up to 32 correspond to the correct error rate as calculated by comparing to the reference. Above 32, the error rate does not improve (for neither the overlapped bases nor the original Illumina bases). This is due to errors in the reference and/or a very small percentage of Illumina reads with insertions or deletions. Finding the best alignment Counting False Alignments For each mate-pair, the algorithm scores all possible ungapped alignments (this includes alignments shorter than the read length, which are produced when a fragment is fully sequenced resulting in and additional synthesis reactions proceed past its 5-prime end, base-pairing with the Illumina adapters.) Each alignment is scored by summing over all matches and subtracting a penalty for each mismatch. For the alignment with the best score, the algorithm constructs the consensus sequence in this overlapping region by reporting the nucleotide with the higher quality value, or the informative base call if one nucleotide is 'N'. A Phred quality score is calculated for each consensus base (see below). The algorithm output includes sequence and quality of the composite read generated from each mate pair, as well as an alignment confidence score that enables filtering out probable mis-alignments. Minimizing the number of false alignments is an important step for some applications such as de novo assembly. Our goal was to find a threshold for alignment confidence that would allow no more than 1% misalignments to remain in our subset of composite reads (based on previous experience with de novo assembly of libraries containing hybrid reads). We used several approaches to assess our alignment accuracy and found similar results. First, we used the read-simulator from the MAQ package [20] to produce simulated reads that mimicked the error profiles observed in the quality values of our Illumina data as well as the distribution of overlap lengths. In the case of simulated reads, the true distance between mate pairs is known exactly and the reference is perfect. After filtering (confidence metric 0.5) we retain 89.7% of the composite reads, 0.5% of which were constructed from mis-alignments. This strategy is useful when no reference is available, or as a test case on new platforms. Second, we used mate-reads from the PhiX control lane sequenced during our Illumina sequencing run. This DNA library is prepared from the PhiX174 bacteriophage genome for purposes of quality control, and we found that it had and insert length 200±21 bp (estimated by mapping to the reference with MAQ. These mapped mate-pairs defined the "true" insert length of each sequenced fragment. After constructing the composite sequences and filtering, we plotted the difference in length between a composite fragment and its insert length as predicted by MAQ (Figure 7). This histogram is a sum of two distributions, the overlapper software's misalignments (a broad gaussian) and a sharp peak of small (1-2 bp) indels. We examined gapped alignments to confirm this was the case and also found that many of the insertion or deletions occur in homopolymer runs. We used a simple linear model to infer the number of misalignments; this yielded a false positive rate of 1.0% for PhiX174. Alignment confidence metric Alignment confidence was quantified in order to identify composite reads constructed from mis-alignments. Due to the relatively short alignment lengths with non-uniform base composition and quality, we did not model the statistical significance of an alignment analytically, but instead chose an empirical metric. Because mis-alignments of similar length have similar alignment scores (within noise), we use seven alignments of similar length of the same mate-pair to calculate a baseline, and divide by the range of the scores to estimate the alignment's significance. Alignments of all confidence levels are reported; we leave selection of this threshold to the user to achieve the sensitivity/specificity required of their downstream application. Calculation of a Phred Quality Score for the overlapped region SHERA reports a Phred-scaled error probability for each base in the composite read. For overlapped bases, this is the posterior . 0 .1ldn. - PLoS ONE I www.plosone.org 32 July 2010 1 Volume 5 | Issue 7 | el 1840 DNA Sequencing Pipeline 0 0.0 q 8 o 0 0 CD 0 q, 0) 0 CDO 0 I1T -200 0 000111100f-am00 -150 -100 I -50 i 0 1 50 0 100 difference between pairs mapped length and composite fragment length Figure 7. Short insertions and deletions in low-confidence composite Illumina reads. Mate-reads from the control lane (PhiX174 bacteriophage genome) were used to assess false alignments introduced by the SHERA pipeline. After constructing the composite sequences and filtering, we plotted the difference in length (if any) between a composite fragment and its insert length as predicted by MAQ by mapping the original mate-pairs to the reference.This histogram isa sum of two distributions, the overlapper software's misalignments (a broad gaussian) and a sharp peak of small (1-2 bp) indels. We used a simple and conservative linear model to remove the indels and infer the number of misalignments. doi:10.1 371/journal.pone.001 1840.g007 used the popular metagenomics software MEGAN [22] for downstream analysis. We compared the fraction of unassignable reads (due to lack of significant blast hit in the database) across four subsets of sequence data from the same sample, which were chosen to compare sequencing platforms and the effect of read length: highconfidence composite Illumina reads, low-confidence composite Illunina reads, a representative sample of our 454 reads (mean length 207 bp), a longer subset of our 454 reads (mean length 254 bp). For comparison purposes we used random samples of 50,000 reads each. We used the MEGAN parameters recommended by the authors [22] as suitable for comparing reads of different topperminscorebylength = 0.35, lengths (minscore = 35.0, cent= 10, minsupport = 1; the low minsupport value was chosen because some of the data subsets tested were necessarily small. We also tested the alternative parameter set with similar results: minscore = 40.0, toppercent = 10, minsupport = 5). In Figure 6A we show the fraction of DNA fragments for which MEGAN was able to assign to a known genome or taxon via significant protein homology. We show that composite Illumina reads from highconfidence alignments perform at least as well as reads obtained from 454-FLX. We also find that composite reads produced by alignments we designated as "low-confidence" have a very high fraction of unassignable reads, further proof of the ability of our algorithm to identify misalignments. Finally, we show results for 454-FLX reads of mean length 254 bp (range: 230-270 bp), more typical of datasets produced by the 454-FLX platform. The assignable read fraction is comparable. Next we asked if the taxonomic classification was consistent across platforms. We used MEGAN [22] to classify all of the 454FLX reads and an equal number of composite Illumina reads. Figure 6B shows the taxonomic classification is relatively invariant with respect to the sequencing technology chosen. We performed this analysis with two reasonable parameters sets (minscore = 35.0, minscorebylength=0.35, toppercent= 10, mninsupport=5 OR minscore =40.0, toppercent =10,minsupport = 5) and observed equally good agreement in taxonomic classification across platforms. Finally, for Figure 3C, we assessed alignment yield and accuracy on mate-pair reads sequenced from a dSPRI library. This library was generated from a single Prochlorococcus cell (as described [15]), for which a draft copy of the genome (N50 = ~7 Kb) was constructed independently by de novo assembly of 454-Titanium reads from the same DNA sample. We mapped the Illumina mate-paired reads to these 454Titanium contigs to define the "true" insert length of each sequenced insert. A small fraction of mate-pairs (<5%) were unmapped or mapped to different contigs; these were excluded from further analysis. As in the case of PhiX 174, there were a number of small indels evident in the comparison between reads and reference which were excluded from our statistic. We estimate that 0.5% of our high confidence set of composite reads were mis-alignments introduced by SHERA. Comparison of 454-FLX and overlapped Illumina reads for metagenomic applications To evaluate the suitability of overlapped Illumina for metagenomic analysis, we made a direct comparison to the Roche 454FLX platform, currently the most commonly used sequencing platform for this application. We used sequencing data from the exact same marine metagenomic sample [10]. We asked whether reads from the two platforms were equally able to identify significant protein homology with known proteins for the purposes of taxonomic classification. We obtained 6,339,825 mate-paired Illumina reads of length 143 ntds each with insert lengths shown in Figure 3. We joined the Illumina mate-pair reads with our algorithm and filtered the composite reads by alignment confidence, resulting in 4,830,552 high-confidence composite reads of length 180±40 bp. We analyzed both high-confidence and low-confidence sets of composite reads as described below. We also had access to 673,673 454FLX reads of length 207 - 71 ntds available for comparison [10]. We blasted all 454-FLX reads and composite Illumina reads against the NCBI protein database nr [21] using translated nucleotide queries (blastx parameters -b 100 -v 100 -e 30 -Q 11 -F "in S"), and .PLoS ONE I www.plosone.org 33 July 2010 1 Volume 5 | Issue 7 | e11840 DNA Sequencing Pipeline sample with 454-FLX. We would also like to thank the MIT BioMicro Center for performing the Illumina sequencing reported in this study. Supporting Information Table S1 Top 25 taxa identified in the HOT186 75 meters depth metagenomics sample. Found at: doi:10.1371/journal.pone.0011840.s001 Author Contributions (0.80 MB Conceived and designed the experiments: SR ACM EJA SWC. Performed the experiments: SR ACM MCB RM. Analyzed the data: SR ACM SCT. Contributed reagents/materials/analysis tools: SR ACM SCT. Wrote the paper: SR ACM SCT EJA SWC. PDF) Acknowledgments We are grateful to the members of the DeLong and Chisholm laboratories (MIT) that collected, processed, and sequenced the metagenomic DNA References I. Schuster SC (2008) Next-generation sequencing transforms today's biology. Nat 12. Hofl KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P (2008) Methods 5: 16- 18. 2. HiattJB, Patwardhan RP, Turner EH, Lee C. ShendureJ (2010) Parallel, tagdirected assembly of locally derived short sequence reads. Nat Methods 7: Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 9: 217. 13. Brady A, Salzberg SL (2009) Phymm and PhynmBL: metagenomic phyloge- 119-122. 3. Sorber K, Chiu C, Webster D, Dimon M, Ruby JG, et al. (2008) The long 14. march: a sample preparation technique that enhances contig length and coverage by high-throughput short-read sequencing. PLoS One 3: e3495. 4. Lennon NJ. Lintner RE, Anderson S, Alvarez P, Barry A, et al. (2010) A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biol 11: R15. 15. 5. DeAngelis MM, Wang DG, Hawkins TL (1995) Solid-phase reversible 6. 7. 8. 9. 10. 11. 16. immobilization for the isolation of PCR products. Nucleic Acids Res 23: 4742-4743. Hawkins TL, O'Connor-Morin T, Roy A, Santillan C (1994) DNA purification and isolation using a solid-phase. Nucleic Acids Res 22: 4543-4544. Krizova J, Spanova A, Rittich B, Horak D (2005) Magnetic hydrophilic methacrylate-based polymer microspheres for genomic DNA isolation. J Chromatogr A 1064: 247-253. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. 11. Error probabilities. Genome Res 8: 186- 194. Mitra S, Schubach M, Huson DH (2010) Short clones or long clones? A simulation study on the use of paired reads in metagenomics. BMC Bioinformatics I I Suppl 1: S12. Martinez A, Tyson GW, Delong EF (2010) Widespread known and novel phosphonate utilization pathways in marine bacteria revealed by functional screening and metagenomic analyses. Environ Microbiol 12: 222-238. Wommack KE, Bhavsar J, Ravel J (2008) Metagenomics: read length matters. Appi Environ Microbiol 74(5): 1453 63. .1PLoS ONE | www.plosone.org 17. netic classification with interpolated Markov models. Nature Methods 6: 673-676. Kettler GC. Martiny AC, Huang K, Zucker J, Coleman ML, et al. (2007) Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet 3: e231. Rodrigue S, Malmstrom RR, Berlin AM, Birren BW, Henn MR, et al. (2009) Whole genome amplification and de novo assembly of single bacterial cells. PLoS One 4: e6864. Bentley DR. Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59. Matz M, Shagin D, Bogdanova E, Britanova 0, Lukyanov S, et al. (1999) Amplification of cDNA ends based on template-switching effect and step-out PCR. Nucleic Acids Res 27: 1558-1560. 18. Shagin DA, Lukyanov KA, Vagner LL, Matz MV (1999) Regulation of average length of complex PCR product. Nucleic Acids Res 27: e23. 19. Siebert PD, Chenchik A, Kellogg DE, Lukyanov KA, Lukyanov SA (1995) An improved PCR method for walking in uncloned genomic DNA. Nucleic Acids Res 23: 1087-1088. 20. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851-1858. 21. Altschul SF, Madden TL, Schaffer AA ZhangJ, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402. 22. Huson DH, Auch AF, Qi J, Schuster SC (2007) MEGAN analysis of metagenomic data. Genome Res 17: 377 -386. 34 July 2010 | Volume 5 1 Issue 7 | el 1840 1.6 Contributions to this manuscript As the sole author with bioinformatic/computational expertise, I was 100% responsible for designing, writing, and testing the SHE-RA algorithm, analyzing all of the sequence data, and making all of the figures with the exception of the Bioanalyzer machine plots. I was responsible for writing all aspects of the paper that deal with the algorithm and sequence analysis. 1.7 Appendix 1.7.1 Supplemental Note: alignment confidence metric To assess alignment confidence, one could calculate the probability of an alignment being due to chance analytically. The multivariate hypergeometric distribution would be appropriate for this calculation if certain simplifying assumptions are true. Namely, the probability of picking a particular base must be fixed. The short alignment lengths and high local variance of base and GC content make this assumption untenable. Instead, an empirical metric was developed. The chosen alignment's score is compared to the empirical distribution of scores of (mis-)alignments of similar length from the same read and region. Alignment confidence metric is calculated as maximal score's excess over this distribution, normalized by distribution width. 35 S0 -o- correct alignments -o- mis-alignments 4) Co C 0 4- 00 a 0 000 0L 0 000 0 o-o==0-o-o-8-o-o-o-o P I 0.0 I 0.2 I 0.4 I 0.6 I 0.8 1.0 SHE-RA alignment confidence metric Supplemental Figure 1. SHE-RA alignment confidence metric separates true alignments from mis-alignments. SHE-RA reports the best alignment for all reads, along with a confidence score that allows the user to tune the stringency of overlaps according to the fidelity required for their downstream application. Simulated reads of the same length and quality distribution as the Illumina IIGA reads were used to definitively assess true and mis-alignments. The figure shows that the alignment confidence score produces a good separation between true alignments and mis-alignments. The resulting separation was confirmed consistent in real data, where overlap accuracy was estimated by mapping to a draft reference genome assembled with separate data. 36 IO 0- 0 top 25 taxa represented In sample a Prochlorococcus marinus 0 Bacteria x cellular organisms root Candidatus Pelagibacter 0 + 3 Proteobacteria 0 0- Cyanobacteria 0 Tl) 0 -o M. U C- U- 4 U.O Gammaproteobacteria Bacteroidetes U SAR11 cluster M A Psychroflexus torquis ATCC 700755 Prochiorococcus phage P-SSM2 * 1P 6 Candidatus Pelagibacter sp. HTCC7211 Alphaproteobacteria * N 0 Alteromonas macleodii Candidatus Pelagibacter ubique Not assigned 5e Rhodobacterales Viruses Flavobacteria ., I I 10 I 102 10 3 4 10 I overlapped Illumina reads 0 Prochlorococcus phage P-SSP7 0 Candidatus Pelagibacter ubique HTCC1 062 0 Alteromonas macleodii ATCC 27126 O Prochlorococcus marinus str. AS9601 Prochlorococcus marinus str. MIT 9312 Supplemental Figure 2. Taxa abundances in metagenomics sample are similar when using overlapped Illumina or 454 sequencinj Reads from 454 and SHE-RA-processed Illumina paired end reads were blastx-ed against nr and the software MEGAN was used to assign each read to a known genome or taxon via significant protein homology. (A subset of 500,000 unfiltered random reads from each dataset was used for comparison.) The close correspondence in taxa abundance demonstrates that the overlapped Illumina reads are a legitimate alternative for surveying community composition and that the library construction protocol does not suffer any gross biases or bottlenecks. 37 Prochlorococcus gene content 454-FLX E overlapped Illumina Supplemental Figure 3. Gene functions identified in metagenomics sample HOT186 75m show similar fractional abundances using overlapped Illumina or 454 sequencing. Community shotgun genomic data from the HOT186 75m sample were analyzed for gene content from Prochlorococcus,the most abundant taxon in this sample. Reads from 454 and SHE-RA-processed Illumina paired end reads were blastx-ed against nr and those reads assigned to the Prochlorococcusgenus were binned by their GOslim functional category. (A subset of 500,000 unfiltered random reads from each dataset was used for comparison.) The correspondence in gene function abundance demonstrates that the overlapped Illumina reads are effective for studying community gene function. 38 Chapter 2. Assembly and analysis of 82 Vibrio genomes 2.1 Abstract The Vibrionaceae family of marine bacteria is an important model system for bacterial evolution and ecology. Drawing from our strain collection of thousands of environmental isolates grouped into natural, ecologically distinct populations, we sequenced 82 whole genomes from 22 populations to better understand the genomic underpinnings of ecological differentiation. In the first half of this chapter, I describe methodological advances required for short read assembly, in order to build and analyze the genomic reference for this model system. A hybrid assembly approach was designed, enabling efficient construction of related genome sequences. This approach produces significantly more contiguous genomes than de novo assembly alone can, and is generally applicable to short read assembly problems. In the second half of this chapter, I survey the relationships of phenotype and gene content with ecology in this model system. Core and flexible genes were defined using these ecological populations, and genome comparisons revealed fast turnover and adaptive fixation of flexible genes on the extremely fast phylogenetic timescales where we previously observed ecological partitioning at the microscale. A high rate of recent HGT exceeded that previously found between ecologically similar organisms. However, over the long term, neither variance in gene content nor metabolic profiles were correlated with our measures of ecology and instead was primarily explained by phylogenetic divergence: cohabitant populations did not have more similar gene content, flexible pools or substrate utilization profiles once the effect of distance was removed. 39 2.2 Introduction The Vibrionaceae are heterotrophic bacteria abundant in aquatic environments worldwide and include many human and animal pathogens, and are thus ecologically, medically and economically important. Unlike the vast majority of environmental bacteria, vibrios are easily cultivatable in the lab, making them a good model system for comparative genomics study, particularly because of the breadth of ecological data collected for this group and the ecologically-differentiated populations we were thus able to define. Using the environmental data collected with more than 1900 marine isolates, ecologically cohesive populations have been delineated among the vibrios. Bacterial taxonomic units are typically demarcated by drawing arbitrary phylogenetic cutoffs, but such groupings often correspond poorly to organisms' physiology or ecology (Francisco, Cohan, and Krizanc 2012). Instead, the vibrios were grouped into both phylogenetically and ecologically cohesive populations (Hunt et al. 2008; Preheim et al. 2011; Szabo et al. 2012; Preheim 2010), using a variety of environmental datasets analyzed by the machine learning algorithm AdaptML (Hunt et al. 2008). Ecological differentiation of related populations was observed at levels of phylogenetic divergence that were surprisingly fine, but subsequent studies consistently rediscovered these same ecological populations: they were largely cohesive in a variety of gene phylogenies (Preheim, Timberlake, and Polz 2011), reproducibly recovered across time (Szabo et al. 2012), and robust to the particular ecological trait used to define them (Preheim, Timberlake, and Polz 2011; Preheim et al. 2011). The ecological distinctness between populations demonstrates that despite coexisting in the water column and sometimes very close genetic relations, these groups do not have same niche, instead partitioning resources spatially and temporally. Within populations, individuals interact socially (Cordero, Wildschutte, et al. 2012) and sexually (Shapiro et al. 2012), experiencing rampant homologous recombination along population lines that could result in genetic differentiation in a manner akin to eukaryotic speciation. All in all, these ecological populations have so many of the hallmarks of a fundamental taxonomic unit, and it is likely that within each, individual genotypes share a niche and face similar selective pressures, making this an excellent model system to study diversity, divergence, and adaptation to environment. While ecologically differentiated populations have been defined in this family, it is not yet known to what extent this differentiation engenders genotypic or phenotypic 40 divergence. In general, the extent to which ecology predicts similarity of gene content or other phenotype is not well understood. While bacterial lineages with similar 'lifestyles' have been reported to possess more similar gene functional content (Lauro et al. 2009), both gene type and ecology are defined broadly in such studies. Genomes of similar ecology are also reported to exchange more genes, either from co-habitants or other, geographically distinct, genomes of similar ecology (Hooper, Mavromatis, and Kyrpides 2009; Smillie et al. 2011). However, the overall long-term effect of this gene pool sharing is not known: is it so rampant and persistent that multiple lineages adapting to the same habitat do so through the use of convergent strategies? To explore the effect of similar microscale ecology on gene sharing, we determine the extent to which populations with similar environmental distributions share overall gene content, as well as the frequency and structure of recent horizontal gene transfer and its structure in this clade. For each of our ecological populations defined by AdaptML a projected habitat was inferred based on the isolation data of their phylogenetic group, and serves as one representation of the group's ecology. In this study, we consider habitats learned from season (Fall/Spring) and particle size, two environmental measures available for the vast majority of the strains whose full genome was sequenced. These environmental traits have been shown to be important axes for ecological differentiation in this group (Thompson et al. 2004; Hunt et al. 2008) as well as stable descriptions of population's ecology over time (Szabo et al. 2012) but their utility as a descriptor of true ecology, and genomic and metabolic correlates were unknown. We ask the extent to which multiple lineages with the same projected habitat share genes or common carbon substrates. Vibrios are known for their great metabolic potential and diversity, but the driving forces and dynamics of metabolic evolution are still largely unknown (Keymer et al. 2007). Substrate metabolism could be largely stable on the timescales of bacterial species divergence, driven primarily by lineage history (phylogenetic inertia), or it could be substantially divergent, driven by the current ecological preferences of the lineage. It is also possible that it could be diverse, varying substantially with in a population on the timescales where we observe rapid flexible gene turnover. Using 46 metabolically diverse strains of vibrio, we investigate the timescales for diversity and divergence of metabolic profiles, and the extent to which multiple adaptations to the same inferred habitat involve convergence of phenotypes. 41 Closely related bacteria are increasingly appreciated to vary greatly in their gene repertoire (Tettelin et al. 2005), and vibrios in particular have remarkably fast turnover of much of their genome (Thompson 2005), but the selective implication of this large and labile flexible genome is poorly understood (Polz et al. 2006). Conspecific vibrios can differ by 1 megabase in genome content in their 5 megabase genomes (Polz et al. 2006), and even strains related as closely as -1% 16S divergence can exhibit 20% differences in genome size (Thompson 2005). It is not implausible to suggest that this surprisingly large fraction of the genome may not be adaptive (Berg and Kurland 2002; Polz et al. 2006), indeed, in the absence of other evidence, neutral variation seems the most appropriate null hypothesis. However, the current literature typically makes one of two opposing assumptions about the adaptive value of flexible genes: studies that anecdotally attribute any strain-specific gene content as responsible for pathogenic or other adaptive phenotype, and those that dismiss low-penetrance or strain-specific genes as likely being inessential or not adaptive (Haegeman and Weitz 2012). Thus, opposing interpretations on the adaptive value of this rather extensive 'flexible' gene content are common, and recently introduced neutral model for flexible genes distribution has only been tested in a limited way. Because our dataset includes ecological differentiation on a similarly fine scale as this rapid gene turnover, we can use ecology to investigate the extent to which flexible genes fix neutrally in a population. To construct the genomic reference for this important model system and perform this first survey of the interplay between ecology and genotype, we relied on next generation sequencing (NGS). NGS platforms have made genome sequencing tractable and affordable, however, short reads introduce a number of limitations and technical challenges. In the first part of this study, we discuss two methodological advances developed to overcome the limitations of current short read assembly methods. The first is a generally applicable approach developed for efficient and contiguous assembly of bacterial genomes. The second is a pipeline to construct from short reads the important phylogenetic marker, the16S gene. 2.3 Technical challenges and new assembly approaches 2.3.1 Hybrid genome assembly approach for short reads. A hybrid approach to assembly was designed which significantly improved assembly contiguity for many types of short reads. Assembly of short reads presents a number of 42 theoretical and technical challenges. There are practical limits on the efficiency with which de novo assembler heuristics infer linkage information (Bankevich et al. 2012), especially in light of abundant errors and regions of aberrant coverage due to amplification and sequencing biases (Aird et al. 2011) . There are also theoretical limits on de novo assembly of certain features by short reads (Wetzel, Kingsford, and Pop 2011; Chaisson, Brinza, and Pevzner 2009), particularly the early Illumina platforms' short single-end reads that constituted most of our dataset. On the other hand, reference assembly approaches require an extremely closely related (near-)finished genome and ignore the substantial amount of strain-specific DNA. None of these issues can be overcome simply by the abundant deep coverage afforded by short reads. State-of-the art de novo assembly of single end short read datasets produced genome sequences fragmented into many thousands of contigs (Zerbino and Birney 2008; Salzberg et al. 2008). Because of systematic biases and theoretical constraints, the gaps between contigs are not random failures but largely fall in the same regions, particularly in related genomes (Wetzel, Kingsford, and Pop 2011; Chaisson, Brinza, and Pevzner 2009). Our initial approach to assembly of many related genomes, gap-filling between contigs via alignment to a related strain's contigs, proved ineffective, as the vast majority of gaps occurred at the same trouble spots in both genomes (S. Timberlake, unpublished data). Long-insert libraries (-4kb), required to overcome the theoretical limits on de novo assembly of short reads (Chaisson, Brinza, and Pevzner 2009; Butler et al.; Wetzel, Kingsford, and Pop 2011) provide long range linkage information between regions bordering trouble spots so assemblies no longer break in those regions. However, long DNA fragments are physically incompatible with the Illumina platform (Levine 2012), and the library construction protocols designed to sidestep this limitation are very costly in terms of reagents and time. A hybrid assembly approach was designed to make efficient use of labor-intensive long-insert linkage information, propagating it to related strains in a conservative manner. Each hybrid assembly requires a reference genome, but the required relatedness of this reference is more relaxed than is typical for reference assembly. A minimal set of costly references is chosen by estimating the maximum nucleotide divergence at which reference assembly was useful, given the constraints of the mapping software at the time (Li, Ruan, and Durbin 2008). We used long-insert libraries, prepared in addition to standard libraries, to close (or nearly close) one genome per population, using a standard de novo assembly algorithm (Zerbino et al. 2009). References so generated have complete sequences in these 43 otherwise unassemblable regions, making synteny information available for the rest of the closely related strains in the population. Where long inserts were unavailable, even synteny information from reference strains assembled from short-insert paired reads improved single-end read genome assembly with this hybrid approach. Hybrid assembly is a modular process proceeding in two main steps: reference assembly to propagate conserved synteny information and de novo assembly to incorporate strain-specific DNA and resolve strain-specific linkage. Briefly, for each strain "S" undergoing hybrid assembly, whole-genome shotgun short reads from S were mapped to the most closely related reference strain "R", generating reference contigs that were supplied to the de novo assembly algorithm with all reads from strain S. Any single base of R not covered by mapped reads from S was considered a possible strain-specific synteny breakpoint. This conservative approach was motivated by the observation that closely related vibrios experience frequent synteny-disrupting gene insertion and deletion, and that the genomic repeats which are hotspots for such events (Achaz et al. 2003; Rocha 2003) include the exact same features where synteny cannot be resolved by short reads (Wetzel, Kingsford, and Pop 2011; Chaisson, Brinza, and Pevzner 2009). This hybrid assembly approach substantially increased N50, the standard metric for assembly contiguity, for all common lengths and types of Illumina reads. (The N50 is the length of the smallest contig that needs to be included in the set of contigs comprising half the total bases in the genome.) Hybrid assembly improved N50's by 2-3 fold (Figure 1) over de novo assembly alone, and was effective for both single-end reads (51-76bp) and the paired-end reads (63-1,02bp) produced by more mature NGS, the Illumina IIGA. (Extensive parameter space exploration was conducted for both assembly methods; see Methods). Moreover, this modular approach is generally applicable both in terms of software algorithms, short read platform, and long-range linkage information source. With the modules used here, reference genomes that successfully increased contiguity were as distant as -90% average nucleotide identity in the core genome, substantially more distance than is typically suggested for reference assembly of any kind, including some other hybrid approaches (Salzberg et al. 2008)). However, long-range linkage information need not come from a near-complete reference genome; Pacific Biosciences now offers long reads which will likely be a suitable a suitable reference for this approach (English et al. 2012). 44 2.3.2 Pipeline for 16S rRNA gene assembly from short reads. The small ribosomal RNA subunit (16S rRNA) cannot be assembled from short read data by traditional methods, but it is integral to microbial taxonomy and phylogeny as the canonical unit of evolutionary distance. As part of the operational bacterial taxon definition, it is used in the standard procedure for identifying type strains or named species for an isolate, as well as for some definitions of HGT (Smillie et al. 2011). Given these important roles for the 16S ribosomal sequence and its unavailability from our short read sequences, a custom pipeline was built to assemble it. 16S sequences are particularly recalcitrant to assembly from short reads due to aberrant coverage and repetition in the genome. Vibrio genomes in particular possess an extremely high number of 16S gene copies among known bacteria, with each genome encoding multiple (8-14) polymorphic copies (Z. M. P. Lee, Bussema III, and Schmidt 2009), whose diversity within a single genome (Moreno, Romero, and Espejo 2002) approaches that between OTUs. Like all divergent repeats, it is theoretically impossible to assemble de novo from short reads. However, reference assembly requires a full-length 16S sequence from a closely related conspecific, which is unavailable for many environmental bacteria. A recently published algorithm for 16S gene assembly iterates read mapping to references with reference update steps, but requires paired-end reads (Miller et al. 2011). Our hybrid genome assembly approach described above could assemble full-length ribosomal sequences, but only in populations after long-mate pair reads were utilized. A computational pipeline was designed for assembly of full-length Vibrio 16S sequences from short reads with the respective limitations of de novo and reference assembly in mind. Briefly, centigs from de novo assembly of whole genome shotgun short reads were screened for the presence of 16S fragments. Subsequences matching 16S significantly were identified by sensitive alignment to 16S reference sequences of Vibrionaceae complete genomes, filtered for sufficient coverage, and used to construct fulllength sequences while retaining all observed bases at polymorphic sites. Unlike polymorphisms in reads, which may be due to sequencing error, contigs over a minimum coverage depth are predominantly error-free and polymorphic sites thus represent real intragenome diversity (Supplemental Figure 1). The resulting 16S sequence from each genome is an amalgam of all 16S alleles in the genome, and the number and location of the variable positions was consistent with Sanger-sequenced finished genomes. There are as many as 24 polymorphic sites within the 16S genes of a single isolate, so intragenomic 45 alleles could differ by as much as -1.5%. This is not without precedent: for example, family member Photobacterium profundum SS9 has intragenomic 16S divergence of >1.5% (P. S. Dehal et al. 2010). Evolutionary distances based on these 16S sequences were used to build phylogenies and confirm taxonomy by clustering our strains with named type strains (Supplemental Figure 2, Supplemental Notes), and as a measure of phylogenetic distance. 2.3.3 Using whole genome phylogeny to support ecological population predictions Whole genome sequences revealed that most ecological populations previously predicted using hsp60 were robustly supported by the majority of core genes, while within populations, phylogenetic relationships were inconsistent and/or poorly resolved. Ecological populations were originally delineated using the housekeeping gene hsp60 due to its ubiquity and better phylogenetic resolution than 16S. However, subsequent analysis found that though these groups were largely cohesive in a number of housekeeping genebased phylogenies, some populations were incongruent or could not be confirmed (Preheim et al. 2011). To determine population cohesion and phylogeny more precisely, species phylogenies were constructed using the majority of ubiquitous genes or, alternatively, a subset of 30 ribosomal proteins (Methods). Core gene phylogenies robustly supported most previously predicted ecological populations but undermined or rejected a few (see Supplemental notes). In particular, strains of V. splendidus with no ecological population prediction or ones with poor or conflicting support were combined for some of this analysis, as has been suggested previously (Preheim 2010), or omitted. Intrapopulation nodes typically had low bootstrap values and inconsistencies between phylogenies. This likely reflects rampant intrapopulation recombination in the core genome, as found previously in this group (Shapiro et al. 2012), and descent may be impossible to resolve. 2.4 Results and Discussion 2.4.1 Metabolic profiles' relationship with phylogeny and ecology The patterns and drivers of metabolic diversity and divergence are largely unknown, so we sought to identify them in this metabolically diverse family. Substrate metabolism could be driven by the current ecological preferences of the lineage, or by lineage history. Likewise, substrate metabolism could vary widely in a population (due to flexible DNA), or these capabilities could largely be encoded in a slowly changing core genome. The role of ecology in shaping metabolism, particularly microscale ecology, is also not well understood. Multiple lineages living in the same habitat may have convergently evolved the potential to 46 exploit the same types of resources found in a habitat, or they may partition those resources so as not to be in direct competition. The projected habitat inferred for each population provides a representation of that population's microscale ecology, and we asked whether particle size preferences (and season) entailed similar carbon substrate utilization profiles. To get a metabolic profile of each strain, we quantified its ability to respire 95 different carbon substrates on a Biolog GN2 plate as the sole carbon source. The metabolic profiles of 46 fully sequenced strains from 16 populations spanning the diversity of our family exhibited substantial within-population similarity but within-species variance, no ecological convergence, and evolution dominated by phylogenetic divergence. Metabolic profiles in these 46 vibrio strains were remarkably diverse and malleable (Supplemental Figure 6). There was only a single substrate that could be ubiquitously respired (alpha-D-glucose, though several others were common) and many distant relatives utilized zero common substrates. Moreover, the ability to utilize large numbers of substrates has been gained and lost multiple times. There was a great degree of diversity in metabolic profiles within named species. Strains within an OTU (<3%16S distance) exhibited obvious variation, more so than biological replicates (Welch's t-test on strainpairs Pearson correlations; p<0.012). However, ecologically defined populations had largely cohesive metabolic profiles (indistinguishable from the similarity of biological replicates: p>0.25). If metabolic profiles are more similar between co-habiting lineages, this could indicate convergent strategies for resource exploitation in that habitat. Same-habitat strains (either by particle size preference or seasonal co-occurrence), however, do not have more similar metabolic profiles than other strains with similar evolutionary divergence (Figure 4). In contrast, a recent study within environmental V. cholerae strains found a number of metabolic capabilities correlated with environmental parameters (Keymer et al. 2007). There are a number of possible explanations for these results. One possibility is that within any one inferred microhabitat in our system, competition drives resource partitioning and thus divergent metabolic potential, whereas the V. cholerae clades examined in that previous study were primarily not co-existing geographically even at the macroscale. A second and likely explanation is that our inferred habitats, based on limited environmental measurements, may correspond poorly with true ecology. For example, it is known that particle size bins group together many disparate substrate types, including particles derived from plants versus animal detritus (Preheim 2010), and thus particle size might be 47 an extremely limited measure of ecological similarity, even though it (together with season) successfully delineated very closely related populations (Hunt et al. 2008). To drill deeper for ecological drivers of metabolism, more numerous or more precise environmental preferences (for example, particle type, host or body site preference (Preheim 2010)) might be tested for a signal of metabolic convergence. A third explanation is that phylogeny, not ecology, largely drives metabolic profiles. Consistent with this idea, phylogenetic distance explained a large part of the variation between populations (Supplemental Figure 7). Metabolic divergence between populations (mean Pearson correlation in metabolic profiles) increased roughly linearly with phylogenetic distance, a pattern that is consistent with overall neutral evolution of a character along a phylogeny, and the idea that metabolic profiles are largely a function of evolutionary history. Phylogenetic distance was not controlled for in the previous V. cholera study. It is generally accepted that a bacterial species unit should be defined by a common phenotype in addition to other features like core genome (Keymer et al. 2007), but whether substrate utilization assays like the Biolog plates are appropriate for delineating bacterial groups is not certain. Previous study reported that for three vibrio populations, substrate utilization profiles were more similar (average Pearson correlation) within populations than between populations (Preheim 2010). However, this observation can easily be explained by the large phylogenetic distance between the populations, rather than the population delineation per se. Instead, when we exclude the effect of phylogenetic distance, metabolic differentiation between populations is less clear. We focus on V. cyclitrophicus, which consists of two diverging populations that are generally not phylogenetically separated (for example, in core gene phylogenies) but are ecologically differentiated into large and small particle specialists (SP-specialists). This ecological split is robustly supported by SNPs a handful of genome regions (Shapiro et al. 2012), but within- and between-population genetic divergences are similar in these early stages of differentiation (Shapiro et al. 2012). Here, ecological populations' metabolic profiles were not cohesive: while the LP strains' metabolic profiles were more similar within than between populations (p<0.01), SP populations were not (p>0.75), and hierarchical clustering of V. cyclitrophicus metabolic profiles were not congruent with the ecological split. This may be because the early stages of ecological differentiation do not engender clear metabolic differentiation, or because such metabolic differentiation was not measured by this particular assay. While the robustness or generality of this result remains to be seen, it suggests that even high- 48 throughput measures of metabolic phenotype, such as the commonly employed Biolog plates, may not be sensitive for delineating bacterial species units. 2.4.2 Gene content relationships with phylogeny and ecology Vibrios of this model system are known to partition resources at the microscale (Hunt et al. 2008), however, the extent to which adaptation to these microscale habitats involves the use of the same genes from the flexible gene pool was unknown. As with metabolism, gene content may be driven by the current ecological preferences of the lineage, or lineage history; it is even possible that neutral variance might dominate gene content. Enrichment of shared flexible genes and shared total genes in same habitat populations revealed that this measure of ecological similarity was not predictive of preferential acquisition or retention of genes. Instead, the majority of total gene content divergence could be explained by phylogenetic distance alone. Gene content is largely driven by lineage history, as the overall fraction of shared gene families (mean Jaccard similarity) between populations decreases roughly linearly with phylogenetic distance (Supplemental Figure 3). Same-habitat populations were not more similar than other populations once the effect of distance was accounted (Figure 3). In fact, different-habitat populations apparently shared more gene content for any given 16S divergence, but this difference was not significant at any one distance (Welch's t-test on population-pairs' averages, p>0.07 for every bin). We also considered gene content similarity between just the large particle specialist populations (LP-specialists), as this projected habitat had a particularly distinct signal (Hunt et al. 2008), and it includes larger, denser, less ephemeral bacterial communities (Simon et al. 2002; Hunt et al. 2008). However, gene content actually appeared less similar between LP-specialist populations than other populations at each phylogenetic distance (Supplemental Figure 4), though this was only statistically significant at one distance (3%< 16S distance <4%, p=0.025). Measuring similarity between total genome content could obscure a signal of convergence among more recently acquired flexible genes relevant for the lineage's current habitat, particularly because gene content is largely attributable to phylogenetic history, and might therefore be shaped more by the long ecological history of the lineage than by the current ecology. However, rare genes were no more likely to be shared by separate lineages with the same habitat, indicating that even recent overlap of flexible gene pools is not driven by these habitats. While the age of gene acquisition for rare genes was hard to assess, rare genes are more likely to be novel than genes fixed in many lineages, so shared rare 49 genes between populations may indicate that they have recently been accessing the same flexible gene pool. Same-habitat genome pairs' tendency to share flexible genes was calculated by considering rare genes (present in 2-5 genomes), less rare genes (2-8 genomes) and relatively common but non-ubiquitous genes (2-20 or 2-40 genomes). However, regardless of gene rarity, distance bin, or habitat metric, same-habitat genomes did not consistently share more flexible genes (p>0.25), with one particular and consistent exception (Supplemental Figure 5). Within-cluster (first distance bin) comparisons always share more genes between same-habitat strains (p<0.01). This dichotomy of within population versus between population gene sharing, though not enumerated as part of the original Core Genome Hypothesis (Ruiting Lan and Reeves 2000), is certainly consistent with it's attempts to incorporate HGT in to a cohesive species model for bacteria. While laterally acquired DNA is typically considered a diversifying force, this within-population gene sharing constitutes an additional homogenizing force within populations and a diversifying force between them. Our measures of ecological similarity within this family, particle size preference and season, were not predictive of preferential acquisition or retention of genes by a variety of methods. This may be because resource partitioning and/or phylogenetic history could drive very different adaptive strategies in the same habitat, however, the literature includes quite a bit of lore suggesting that similar ecology predicts similar gene content. It is often suggested that certain lifestyles entail enrichment of certain genes or gene functions (Lauro et al. 2009), though these lifestyle categories may represent very differencet trophic strategies and very divergent bacteria. Studies within closely related bacteria have also suggested that ecology drives gene content similarity (Luo et al. 2011; Sim et al. 2008), however it was not clear that genetic divergence was not driving both. Indeed, we found that phylogenetic distance predicts a large fraction of gene content similarity (Supplemental Figure 3). Lateral gene sharing is found to be largely driven by ecology after accounting for distance, even when using broad labels for ecology (Tuller et al. 2011; Smillie et al. 2011). However, it is possible that similarity of ecology amongst all our vibrios (all heterotrophic marine bacteria from the same coastal waters) dwarfed any additional similarity engendered by our projected habitats. This may be a likely explanation as our projected habitats were based on just two axes of the niche space, and even similar particle size encompasses a variety of substrate types, as discussed above. We were not able to detect 50 convergent adaptations to similar habitats, but whether this is due to competitive exclusion or another ecological explanation remains open. 2.4.3 High rates of HGT among strains co-existing in the water column. Because our metrics of ecology did not drive similar metabolism or enriched gene sharing between our inferred habitats, we therefore took an unbiased approach to discover the structure of DNA exchange within our family. We found that these vibrio genomes have a very high rate of horizontal transfer, and describe a structured network of exchange concentrated in a few partners that share a particle-size based ecology. To analyze gene exchange amongst 80 vibrio strains, we focused on recent HGT, as older transfers may have been driven by historical ecology distinct from the current niche. The rate of recent HGT between two bacterial lineages was quantified as the probability that a pair of genomes drawn from those two lineages (>=3% 16S distance) recently shared at least one laterally acquired piece of DNA (a stretch of DNA >500bp long and >99% identical, and therefore extremely unlikely to be vertically inherited). This empirical method for identifying recent HGT was duplicated exactly from a recently publication (Smillie et al. 2011).for purposes of direct comparison with their estimates of transfer frequency in 2235 complete genomes. In addition to quantifying recent gene exchange between 2% 16S clusters for comparison, we also calculated HGT rates between our ecologically-defined groups. Recent HGT frequency was remarkably high (Figure 5), roughly 4x higher between our 2% 16S clusters of vibrios than between (non human-microbiome) genomes from the same environment, and rivaling that of genomes isolated from humans but not observed at the same body site. This is despite the fact that these vibrios are known to have different ecologies (Hunt et al. 2008; Preheim et al. 2011). However, previous studies quantifying high rates of horizontal gene exchange between microbes with the "same ecology" rely on mostly broad labels like "marine" or "hot spring" to describe ecology (Tuller et al. 2011; Smillie et al. 2011). Even our different-habitat strains meet this permissive definition of same ecology and are further known to co-exist in the water column geographically and, to some degree, spatially (e.g., co-occurrence on specific plants or animals (Preheim 2010)), features that are known to have a positive effect on gene exchange frequency (Smillie et al. 2011; Boucher et al. 2011). Breadth or specificity of ecological label might similarly explain the exceptionally high HGT frequency observed between microbiota of the same human 51 body site: this definition of ecology is likely the most specific, and broader definitions encompass dissimilar ecologies that dilute the signial. Thus it is likely that recent studies highlighting ecology as a predominant force in gene transfer still underestimate its role. While our counts are too low to rigorously assess the effect of microscale habitat on horizontal transfer, the network of populations organized by HGT frequency has some structure, perhaps due to ecology (Figure 6). Some populations are never observed to engage in transfer within our populations, while others partner quite frequently. A nexus of transfer between phylogenetically distant relatives connects the three known LP-specialist populations: V. breoganii,V. crassostreae,V. sp. F10, and V. splendidus population 19 (though some of these are now known to preferentially attach to very different kinds of particles or metazoans (Preheim 2010)). Another subnetwork of HGT partners connects the only human pathogens in our dataset with some of the fish pathogens: V. cholerae (node "VC"), V. alginolyticus (5) (Nishiguchi and Nair 2003), and V. ordalli (3) and V. splendidus (18). This is despite being assigned to different microscale habitats, and in the case of V. cholerae, geography (V. cholerae strains included in this analysis were isolated from ponds and from clinical patients). Though the effect of shared microscale ecology on HGT is statistically uncertain here, the high degree to which shared ecology drives transfer (Smillie et al. 2011) suggests that many newly acquired genes are indeed adaptive in the environment. 2.4.4 Core and flexible gene turnover. These vibrios not only have a high rate of recent transfer with each other, but an astonishingly high rate of introduction of new DNA overall. Our superfine phylogenetic sampling included many pairs of genomes distinguished by only 100-300 SNPs genomewide, a number comparable to what accumulates in a few weeks of bottlenecks in the laboratory (Clarke and Timberlake, unpublished results). Surprisingly, during those short divergence times in the wild, 100-300kb of new DNA can accumulate in each strain. This rapid and persistent turnover of flexible DNA introduces abundant raw material for selection. Rather than a false signal of diversity composed simply of assembly artifacts or other junk DNA, >30% of the DNA gained and lost between these sister strains was organized into open reading frames with identifiable homology. While genes of phage origin and other possibly selfish elements typically accounted for 10% of these genes, the majority of annotated ORFs could have adaptive value: these were primarily transporters and other extracellular genes, as well as carbohydrate-active or DNA-active enzymes. Thus a 52 very rapid influx of functional genes provides abundant raw material for selection even faster than the short timescales where stable resource partitioning has been observed to occur. While the majority of this newly introduced DNA never reaches fixation, what fixes is still sufficient to replace a large fraction of core genes within traditional OTU (3% 16S) divergence. Of 4711 flexible gene families observed in V. cyclitrophicus(likely a huge underestimate of the genes that have been gained and lost already in this clade), 2100 were observed in only a single strain, despite our very fine phylogenetic sampling. To get an overview of the rate at which this functional, protein-coding flexible DNA might accumulate in the genome and fix in populations, we constructed core and pan genome curves. Focusing on our most finely sampled clades, V. cyclitrophicusand V. splendidus (Figure 7), we defined core genomes for each ecological group and the phylogenetic groups that contain them. Despite rapid loss of most new flexible DNA, the rate of replacement of genes in a population's core was still remarkably high. On the timescale of within-OTU divergence, turnover of more than 25% of the genome core to an ecological population has occurred. Of 3839 genes core to this large-particle LP-specialist population, only 2806 remain core in all of the 3% 16S OTU cluster (subsampled to equalize genome count). The practical implications are clear: the choice of phylogenetic grouping has a large qualitative impact on genes' assignment to core or flexible categories. The biological implications of this very fast turnover are still unclear (Thompson 2005; Polz et al. 2006): is this evolution be neutral or adaptive? 2.4.5 Genes fixing within a population Independent association of a genotypic trait with ecology can be good evidence of positive selection, provided shared ancestry cannot explain both. While we found flexible gene content is not more similar between independent lineages with the same inferred habitat, it is significantly more similar between same-habitat strains within a diverging clade, possibly indicating selective fixation. V. cyclitrophicus consists of two diverging populations, which are generally not phylogenetically separated (for example, in core gene phylogenies) but are ecologically differentiated into large and small particle specialists (SP-specialists), a split robustly supported by SNPs a handful of genome regions (Shapiro et al. 2012). Within the SP and LP-specialist habitats, genotypes shared more rare genes (t-test on all species pairs, p=O; Supplemental Figure 5), and had more similar gene content overall: hierarchical clustering of V. cyclitrophicusstrains based on presence/absence of all protein families 53 delineates two groups falling along ecological lines (Figure 8). This abundance of habitatspecific flexible genes evident so early in the differentiation process suggests a preeminent role of new, laterally acquired flexible DNA in the adaptation and ecological differentiation of populations (Supplemental Note). Flexible genes, i.e., genes of intermediate penetrance in a bacterial group, are not well established to be neutral or adaptive (Haegeman and Weitz 2012). Penetrance of flexible genes into V. cyclitrophicusexhibit a frequency distribution curve characteristic of many bacterial populations but poorly understood: the vast majority of genes are very rare or completely fixed, and few have intermediate penetrance (Supplemental Figure 8). The adaptive value of these genes and their reason for intermediate fixation was not clear. Interpretations for low penetrance genes in the literature directly conflict: strain-specific DNA is often held responsible for pathogenic or other adaptive phenotype, while other studies assume that low penetrance implies low selective benefit. Systematic evidence confirming either of these opposing interpretations is lacking, but a recent paper on flexible gene distribution suggested a neutral model that might be appropriate for flexible gene fixation (Haegeman and Weitz 2012). To test the extent to which flexible genes are undergoing neutral gene fixation, we looked at the habitat specificity of flexible gene families within V. cyclitrophicus.We focused on 850 flexible gene families present in more than 3 V. cyclitrophicusgenomes. We focused on this subset of families for two reasons: 1) they have sufficient counts for a meaningful gene-by-gene habitat-enrichment test and 2) gene families present in fewer strains could be argued to have been gained ancestrally and present in same-habitat strains due to clonal expansion. Of these 850 gene families, 175 were distributed significantly unevenly between the two ecologies (2-tailed hypergeometric test with 5% FDR). Thus, more than 30% of the flexible genes apparently rising to fixation (or being deleted) are doing so in a clearly habitat-specific and likely non-neutral manner. The selective forces governing gain versus loss of these genes likely differ (Supplemental Notes). A possible parsimonious explanation for this pattern is gene presence/absence in one ancestral SP- or LP-specialist, followed by clonal expansions and vertical inheritance of flexible genes by their same-ecology descendants. While any signal for clonal ancestry within V. cyclitrophicushas been obscured by rampant recombination (no single phylogeny accounts for more than 0.7% of core genome length), the vast majority of the core genome's 54 length does not support monophyly for SP and LP strains, with trees of 69% of recombination-free blocks rejecting the ecological split (Shapiro et al. 2012 SOM; Shapiro 2010). Moreover, the tree of 30 ribosomal proteins (presumably neutral markers for this ecological adaptation) has good support for phylogenetically well-mixed SP and LP specialists (Figure 8; most nodes' bootstraps>80%). Therefore parallel fixation of the same flexible genes in genomes of the same ecology is a likely explanation and is good evidence that their fixation occurs by selection, not drift. In total, V. cyclitrophicusgenomes harbor 4710 flexible gene families, so the majority of flexible genes that enter a population are deleted. The timescale on which this occurs seems distinct from those gene losses driven by the deletion bias and selective deletion commonly discussed in the literature (Kuo and Ochman 2009) (Supplemental Notes), and thus neither the mechanism nor adaptive value of these rapid deletions is known. 2.5 Conclusions This study presented the assembly of 82 genomes and a first survey of flexible gene turnover, gene sharing, metabolism, and ecology among populations of this Vibrionaceae model system. There exists a persistent gap between abundant short read sequences and tool development for their use. I developed two assembly strategies to fill this gap, including a hybrid assembly approach that significantly increases genome assembly contiguity by efficiently leveraging costly long-insert library information. If long-insert library construction continues to be an expensive and limiting step, extending this approach for use with more distant reference genomes would be extremely useful, and may be possible with the next generation of fast but more permissive read-mappers (Niu et al. 2011). Hybrid assembly has a modular implementation and is broadly applicable: it was effective both for the latest generation of Illumina paired-end reads, and for the older single-end short reads, which may once again be relevant with the newest generation of sequencing platforms (Glenn 2011). It was effective using a fairly wide range of reference genomes, in terms of both relatedness and contiguity. A survey of genotype, phenotype and ecology in 82 environmental bacteria revealed several surprising findings. The adaptive fixation and preeminent role of flexible gene 55 content in population differentiation is clear, occurring with surprising rapidity, with the very fine phylogenetic distances where ecological differentiation occurs. While the inferred habitats in our study were not predictive of genotypic or metabolic convergence, they delineated ecological populations largely congruent by a number of measures, including gene content, core genome phylogeny, and metabolism. I showed that new and potentially beneficial genes are rapidly introduced to each genome, providing abundant raw material for selection with the very rapid timescales where we have previously reported ecological differentiation. While the majority of this very fast flexible DNA never reaches fixation, a large amount does fix, apparently nonneutrally, on timescales compatible with ecological differentiation, and delineates populations in a way congruent with their ecology, before core genome divergence or metabolic phenotype. By OTU-like divergence, it has resulted in large-scale (-25%) turnover in the core of any one population. This highlights the large qualitative discrepancy between core and flexible genomes defined by arbitrary phylogenetic cutoffs and other species-group definitions. In particular, a large fraction of genes apparently at intermediate penetrance in a species were actually fixed in a population, very likely by selection. These new vibrio genomes, despite association with different microscale habitats, shared genes with a frequency much higher than a recent survey of HGT reported for sameecology bacteria or even for bacteria isolated from various sites on the human body, a known nexus of transfer. To what extent this is due to shared geography and isolation method, or perhaps to the vibrios' natural tendency for DNA uptake, is an open question, but the relative breadth of the labels used to describe "ecology" in these studies certainly has an effect. HGT is enriched among strains with similar ecology, pointing to a likely role for these genes in environmental adaptation. However, our measures of ecology did not have predictive power for sharing of gene content over the long term, or of metabolic similarity. Both genotype and phenotype were found to be largely driven by phylogenetic history. In fact, metabolic phenotype could only statistically distinguish ecological populations when they were sufficiently separated by phylogenetic distance. However, we cannot dismiss the possibility that ecology could drive convergent adaptations, as our projected habitats (inferred ecological preferences based on particle size and season) may not sufficiently represent the populations' true ecology. 56 Finally, there is an apparent dichotomy in the role of flexible genes in adapting to a habitat. Within a differentiating clade (V cyclitrophicus), strains with the same habitat acquire large numbers of the same flexible genes, apparently in parallel. However, between populations, flexible genes are no more likely to be shared by lineages adapting in parallel to the same projected habitat. This is true even amongst the closely related populations, within phylogenetic groups that are typically treated as conspecifics (e.g., V. splendidus clade). If ecological populations rather than arbitrarily defined OTUs are considered the fundamental evolutionary unit for selection, then this dichotomy becomes consistent with our concepts of species and evolution in macro-organisms: individuals from the same species can share a niche, resources, genetic information, but competitive exclusion precludes different species found in the same habitat from doing so. The quantitative ecological data collected with this model system, and the fine phylogenetic resolution of genomic sequences and ecological partitioning, make it an ideal system to study evolutionary dynamics and selection on a hierarchy of timescales. Here we have demonstrated that even the shortest timescales are relevant for gene content to be selected and play a role in ecological differentiation, but that phenotype and genotype dynamics on longer timescale are not so easily explained by our measures of ecology. Future work on this system can likely illuminate the longer term ecological benefits and core genome integration, as well and the variety of selective forces governing the fixation and loss, of these dynamic and important flexible genes. 57 2.6 Methods 2.6.1 Whole genome shotgun libraries and sequencing For these 82 genomes, a wide variety of library types, read lengths, insert sizes and sequencing machines was used. Each genome was sequenced using a standard whole genome shotgun short-read library preparation, with some genomes additionally sequenced with mate-pair libraries (detailed below) or Sanger sequencing that had previously been assembled (12B01, 12G01). Genomic DNA libraries' were prepared by random shearing per Illumina's protocol (and ligation of custom paired-end barcodes for multiplexing, where applicable). For most genomes, short (50 or 76bp) single-end reads were sequenced at the Broad Institute Illumina GI. For V. cyclitrophicus (population 15) genomes, paired-end (63 or 102bp) reads were sequenced on the Illumina GAII at MIT's BioMicroCenter. Additionally, to aid in genome closure, long-insert mate-pair libraries were constructed for 11 genomes (1F-53, ZF-14, ZF-29, 9ZC13, 12B09, 5S-149, 5S-101, FF-33, 1F211, ZS-17, FF-500) by two methodologies. For strains 1F-53 and ZF-270, long-insert matepair libraries were constructed with an ECOP15 "jumping construct" protocol (Collins et al. 1987; Young et al. 2010). Briefly, 4-6kb sheared gDNA fragments were ligated to oligomers containing a restriction enzyme recognition site that directs cleavage -2 Sbp downstream, and sequenced in 35bp paired reads on an Illumina GAII at MIT's BioMicroCenter. ECOP15 restriction sites were identified by pattern matching and reads trimmed of this nongenomic DNA. The remaining 10 long-insert mate-pair libraries were constructed using the Illumina mate-pair kit. Briefly, 4-6kb sheared gDNA fragments were blunt-end biotinylated, circularized and (randomly) re-sheared, and 80bp paired reads sequenced on an Illumina GAII at MIT's BioMicroCenter. All reads were quality trimmed with the trimBWAstyle.pl script to eliminate regions of low-quality bases (average less than phred score 20) (Li and Durbin 2009). Strains 12B01 and 12G01 were previously sequenced with Sanger reads (GenBank WGS scaffolds NZCH724170-84.1, NZCH902589-98.1, respectively); short read mapping was used to amend -20 errors or mutations in each of those Sanger-based drafts. 2.6.2 Hybrid assembly pipeline The hybrid assembly approach developed here proceeds in two main steps: short reads from each strain are scaffolded on each reference strain genome to generate "reference 58 assembly contigs", which are then used in combination with short reads in a final de novo assembly. For each strain "S"undergoing hybrid assembly, we chose one or more of the most closely related reference strains "R" to scaffold reads. For 69 strains undergoing hybrid assembly, 13 strains with near-closed genomes (from Sanger or long-insert-mate pair sequencing efforts) were available for use as references. Reference assembly contigs were generated as follows. Untrimmed short reads from strain S were mapped onto each reference R separately using maq-0.7.1 (parameters: maq map -N -n 3 -e 800) allowing a maximum of 3 mismatches in the seed (mapper sensitivity maximum) but up to 20-25 total high quality mismatches, allowing codon 3rd positions to be approximately saturated for a trimmed -75bp read. Where paired reads were available, only proper pairs were kept. (Proper pairs are those with both reads mapping within 2 standard deviations of the insert and in the correct FR orientation). The maq consensus sequences generated from this reference assembly were split on any single base that was uncovered or had ambiguous consensus to generate "reference assembly contigs." These rather stringent criteria for use of synteny information from reference strains were chosen to avoid assuming that synteny is highly conserved in the absence of corroborating evidence from strain S short reads. Reference assembly contigs were used in combination with de novo assembly as follows. To assemble strain 12F01, reads were assembled de novo using the program eulersr (Chaisson, Brinza, and Pevzner 2009). De novo contigs and reference assembly contigs were merged with the AMOS suite program minimus2 (Sommer et al. 2007), using a minimum overlap of 25 and minimum percent id of 98. For the other strains, reads were trimmed of low quality bases as above. All short trimmed short reads as well as reference assembly contigs were supplied to Velvet for de novo assembly (the latter were labeled as belonging to the "Long" read category for velvet (-long-multcutoff=0). Parameter space was searched to optimize N50 with the following constraints: covcutoff > 3, predicted exp-cov * 0.5< exp-cov < predicted.exp-cov * 2.5, where predicted-exp-cov was calculated directly from the reads or estimated automatically by Velvet from the data. 25 < k < 49, except for the "ditag" type long-insert assembies, which require k<=21. Assemblies built with longer kmers were preferred unless N50 difference was > 5%. 59 2.6.3 16S "contig-mapping"assembly pipeline Custom assembly pipeline for 16S sequences consisted of three main steps: de novo assembly of all short reads, scaffolding of contigs, and consensus building. All de novo contigs for each genome with coverage higher than 10-fold were blasted against a database of 16S sequences from finished genomes in the Vibrio family (blast parameters "-W=7 e=0.001"). Aligned contigs were trimmed down to 16S subsequences, and Mothur (Schloss et al. 2009) was used to build a consensus sequence for each strain, retaining all observed bases at intra-genomic polymorphic positions (encoded with IUPAC ambiguity codes). Unlike polymorphisms in reads, which may be due to sequencing error, contigs over a minimum threshold coverage depth should be relatively free of errors, as our library construction protocol was performed in 4-8 parallel subsamples, to avoid propagation of errors in early rounds of PCR. For 16S genes assembled this way, intragenome polymorphism frequency was concentrated in variable regions also seen in finished genomes from the family, consistent with the idea that they are true polymorphisms, not short read sequencing errors or assembly artifacts. (Variable regions for the group were calculated using 129 16S rRNA sequences found in 15 finished genomes of 10 Vibrionales species downloaded from the MicrobesOnline database (Alm et al. 2005)). Where this contig-mapping approach resulted in too many Ns and ambiguities, maq was used to map reads to a within-population 16S sequence (assembled via the contig-mapping) and to call consensus. As an additional note, in organisms where 16S intragenome polymorphism is minimal or unimportant, simpler approaches to the assembly of 16S are likely effective. Here, short reads, single-ends, and high intragenomic polymorphism teamed up in a tragic trifecta that required this approach. 2.6.4 16S phylogeny Ribosomal rRNA sequences of Vibrio family strains were obtained from IMG genomes using that database's ORF coordinates. All IMG vibrios and our 82 Vibrio strains were aligned to the Silva reference alignment with Mothur, and Phyml was used with default parameters to estimate the phylogeny. 2.6.5 16S distance matrix, clustering, and pairwise distances A distance matrix between species was constructed using the minimum Hamming distance between two sequences where each polymorphic site received a distance count of zero if 60 the sets of nucleotides intersected. This is an underestimate of the true distance. To calculate between-cluster distances, one representative per population with the fewest ambiguous (polymorphic) sequence positions was selected. 2.6.6 Identifying recent HGT between populations Recent horizontally transferred sequences were discovered according to an empirical method (Smillie et al. 2011). Briefly, segments of >300bp >99% identical found (by blastn) in genomes diverged more than 3% at 16S rRNA locii were considered HGT (only segments of >500bp were considered for the purposes of direct comparison of counts with a recent study (Smillie et al. 2011), see "Counting recent HGT..." below). Multiple HGT sequences between the same pair of genomes were counted as a single HGT event, and HGT event frequencies were calculated between each population pair (rather than between each 2% 16S cluster as below) as the fraction of their genome pairs with at least 1 HGT sequence. 2.6.7 Counting recent HGT within 80 vibrio strains Recent HGT frequencies were calculated according to an empirical method (Smillie et al. 2011) for the purposes of direct comparison with those counts. The following minimal modifications to their pipeline were made to accommodate the assembled 16S of this dataset. First, the assembled 16S with the fewest ambiguous positions was chosen to represent each population. Representative sequences were then clustered with Mothur at average neighbor <2% distance. Finally, distance between these clusters was calculated as the minimum distance between any two representative sequences in the cluster. Each of these modifications underestimates 16S diversity and divergence, producing a conservative underestimate of the 16S distance on the order of 0.5%. 2.6.8 Calling genes and annotating genomes ORFs were called by Glimmer3 (Delcher et al. 2007) with default parameters as implemented by the RAST pipeline (Aziz et al. 2008). RAST was also used to assign functional annotations. 2.6.9 Building orthologous groups Orthologous groups were built by two methods. OrthoMCL was used with default parameters, except for a 50% amino acid identity cutoff (Fischer et al. 2002) . Orthologous groups are sensitive to parameter choices and to sampling (see Supplemental Note), however OrthoMCL gene families resulted in Vibrio family core genome sizes comparable to 61 estimates for other groups of family members using the well validated approach of tree- based orthology (P. S. Dehal et al. 2010; M. Price, Dehal, and Arkin 2007). OrthoMCL gene families were used in all gene content analysis, except for the shared rare genes analysis (Supplemental Figure 5) and the species phylogenies that include V. cholerae-like strains). For those, gene families were defined by Blastclust (L=0.75,S=1.75) (Dondoshansky and Wolf 2000). 2.6.10 Building concatenated gene species phylogenies In additional to 16S phylogenies, two species phylogenies were constructed using concatenated alignments of many protein-coding genes to evaluate support for, and membership in, ecological populations. A core genome phylogeny was constructed from 66 near-ubiquitous gene families (defined by blastclust, above) present in single copy in at least 75 genomes, more than half the total number of ubiquitous gene families. Nucleotide sequences were aligned with MUSCLE (default settings) and concatenated alignments' phylogeny was estimated with Phyml (default settings). Separately, a concatenated ribosomal protein tree was constructed. The amino acid sequences of 30 ubiquitous OrthoMCL families encoding ribosomal proteins were aligned with MUSCLE and guided alignment of nucleotide sequences with Pal2nal (Suyama, Torrents, and Bork 2006) using default parameters. 2.6.11 Genome similarity by protein content Global similarity between two strains gene content was measured as the fractional overlap in gene content. The Jaccard similarity between genomes A and B is A-, where A is the set of protein families present in genome A and B the set of protein families present in genome B. This jaccard metric was calculated using the Vegan package of the R computing environment (Dixon 2003; Team 2012). Hierarchical clustering of proteomes was performed by UPGMA, implemented in the 'cluster' package in R (Maechler et al. 2012). 2.6.12 Population similarity by protein content Average genome-wide similarity of protein content between two populations was calculated as the mean of the set of jaccard distances between pairs of genomes composed of one genome from each population. 62 2.6.13 Substrate utilization assays All substrate utilization data was previously generated and generously supplied by S. Preheim (Preheim 2010 and unpublished data) and detailed methods are available elsewhere. Briefly, cells grown in suspension to OD600=0.1 were inoculated into each well of a 96-well Biolog GN2 plate (containing 95 different carbon substrates plus water) and incubated at 22C for four days. Optical density readings were taken for each well at 580 nm to quantify metabolism (i.e. respiration) in each well. 2.6.14 Substrate utilization profiles and similarity Each strain's substrate utilization profile was defined as the vector of the ODs resulting from the respiration of 95 different carbon substrates on the Biolog GN2 plate. Similarity of profiles was defined as the Pearson correlation of these vectors. 2.6.15 Population similarity by substrate utilization profile Overall similarity of substrate utilization between two populations was calculated as the mean of the set of Pearson correlations between pairs of strains composed of one strain from each population. 2.7 Author Contributions. The analysis and writing for this chapter were solely the work of S. Timberlake, with thanks to O.X. Cordero for running OrthoMCL, one method for building orthologous gene families. S.P. Preheim conceived of and performed all of the Biolog substrate utilization experiments. Thanks to S.P. Preheim, M.C. Butler, A. Perrotta and others for isolating, culturing, and prepping sequencing libraries of the vibrios used in this work. M.F. Polz and E.J. Alm provided materials and reagents and valuable feedback. 63 2.8 References See main thesis document reference section, Error! Reference source not found.. 64 2.9 Figures and Legends 350000 300000 250000 in z 200000 150000 100000 50000 0 a,> >. 600000 hybrid assembly 500000 I 6 de a no 400000 0 U., z 300000 de novoa alone 200000 I 100000 0 in N Ln in - N r N N N50 N N N Figure 1. Assembly N50 improvement using hybrid assembly approach. The improvement of hybrid assemblies over de novo assemblies is quantified by a comparison of the assembly statistic N50. The N50 represents the length of the smallest contig that needs to be included in a set of contigs containing half the total base pairs of the genome assembly. Red bars, N50 of de novo assembly alone; blue bars, N50 of hybrid assembly. Top panel, single-end reads (51-76bp); bottom panel, paired-end reads (63102bp). Genomes for which no hybrid assembly was performed (due to either assembly by long-insert mate-pair reads or lack of a suitable reference) are omitted. 65 0.1 nudeotide substitutions per site Size Fraction: >63 pm 63-5 tm 5-1 pm <1 Rm Season: U * spring fall Figure 2. Species phylogeny of Vibrio genomes Alternating gray/blue shaded backgrounds delineate ecologically differentiated clades defined in previous studies (Hunt et al. 2008; Preheim 2010). Strains without population prediction or where population prediction is directly contradicted by this species tree topology are left unshaded. Leaf labels contain strain names and, where applicable, population numbers predicted by AdaptML on the Fractionation 2006 data (Hunt et al. 2008). Color strips indicate size fraction and season of collection (not inferred habitat). Core genome phylogeny reveals robust support for F11 and F17, clades previously unsupported due to conflicting signals in housekeeping gene trees (Preheim, Timberlake, and Polz 2011). Groups correspond to the following named taxa: 1, Enterovibriocalviensis-like;2, Enterovibrionorvegicus; 3, V.ordalli;4, V.rumoiensis-like; 5, V. alginolyticus;6, V. 66 aestuarianus;7, Aliivibrio logei; 8, V. Aliivibriofischeri;9, V. breoganii; G10, V. sp F10; 11 V. splendidus cluster 1; 13, V. crassostreae;14, V. kanaloae; 15, V. cyclitrophicus; 16, V. tasmaniensis;17 V. lentus; 18-25 V. splendidus. This phylogeny was constructed from 66 ubiquitous gene families present in single copy in at least 75 genomes. Gene families were defined by Blastclust (L=0.75,S=1.75) (Dondoshansky and Wolf 2000). Nucleotide sequences were aligned with muscle (default settings) and concatenated alignments' phylogeny was estimated with Phyml (default settings, with 100 bootstrap replicates), and visualized with iTOL (Letunic and Bork 2011). Nodes with bootstrap >80 are indicated with dots. Scale bar is in units of nucleotide substitutions per site. inferred habitat: particle size same-habitatedifferent-habitat c - C 8 CL inferred habitat: season . 0 6 C0 2 4 6 8 10 2 %16S distance between populations 4 6 8 10 %16S distance between populations Figure 3. Habitat and gene content The overlap in protein-coding gene families (Jaccard distance) is plotted as a function of 16S distance between populations. Populations with the same inferred habitat (adapted from (Hunt et al. 2008)), either in terms of particle size (left panel) or seasonal preference (right panel), do not have more similarity in their overall gene content. Each point represents the mean gene overlap (Jaccard similarity) of all population pairs in that distance bin (calculated separately for same-season or same-particle). Jaccard similarity between each population pair was calculated as the mean of the set of Jaccard similarities between all pairwise genome comparisons composed of one genome from each population. 67 inferred habitat: season inferred habitat: particle size same-habitat e - - different-habitat oC o .2 .5 0 0- i III 10 0 IIII 0 2 4 6 8 2 4 I 6 II 8 10 phylogenetic distance (%16S) phylogenetic distance (%16S) Figure 4. Habitat and substrate utilization profiles The similarity in substrate utilization profiles is plotted as a function of phylogenetic distance between populations. Populations with the same inferred habitat (adapted from (Hunt et al. 2008)), either in terms of particle size (left panel) or seasonal preference (right panel), do not have more similar substrate utilization profiles. Each point represents the mean of metabolic similarities between all population pairs in that distance bin (calculated separately for same-season or same-particle). Overall metabolic similarity between each population pair was calculated as the mean of the set of Pearson correlations of strain pairs composed of one strain from each population. The dashed gray line represents the mean Pearson correlation in metabolic profile between 3 sets of biological duplicates. Data were generated and generously supplied by (S. Preheim, unpublished data). 68 HGT frequency in 2235 bacteria 50 Human - A C- 40 0 -o Same site Different site Non-Human ea.. E 1s5L1.0 0.5 30 U00 3 o 20 I s 6 1o1sa HGT frequency in 80 vibrio genomes Is 16SDistance()_____________________ - 10- 3 6 9 12 165 Distance (%) 15 18 3 6 9 16S Distance (%) Figure 5. Counting recent HGT within 80 vibrio isolates. Recent HGT was counted within 80 Vibrio isolates (right panel) using a recent method that estimated frequency of recent HGT amongst all 2235 full genomes publicly available for the purposes of direct comparison (left panel, reproduced from previously published data (Smillie et al. 2011)). The y-axis is the probability that a pair of genomes drawn from two different species at a particular phylogenetic divergence (>=3% 16S distance) recently shared at least one piece of DNA (a stretch of DNA >500bp long and >99% identical). Minimal modifications to %16S distance calculation methods were made to accommodate the assembled 16S of this dataset (see methods), resulting in conservative underestimates of 16S distance. 69 <1p ... 1-5p *@0 >63p Figure 6. Network of recent HGT between vibrio populations Each node represents a population colored by particle size preference. Population numbers and habitats from Fractionation 2006 study (Hunt et al. 2008) or merged from (Preheim, Timberlake, and Polz 2011), except Vibrio cholerae-like strains, "VC". V. splendidus strains unassigned to supported populations were grouped together, "SPL". HGT frequencies were used to weight a spring-embedded force diagram made in Cytoscape (Smoot et al. 2011), and unconnected nodes had no recent HGT observed with the other populations in this dataset. Recent horizontally transferred sequences were discovered according to an empirical method (Smillie et al. 2011) with the following modifications: nucleotide stretches of 300-500bp (not just >500bp) were also considered and HGT event frequencies were calculated for each population pair (not 2% average neighbor cluster) as the fraction of their genome-pairs with at least one HGT sequence. 70 Pan genomes for 4 definitions of "species" Vcyc:LP Vcyc:LP+SP I -OTU (<3% 16S) Vibrio family 0 00 0 Core genomes for 4 definitions of "species" 0 0 0N ! l-Il-iI- 0 0 LO/ /r 5 10 15 20 number of genomes Figure 7. Core and pan genomes for 4 species definitions. The core genome and pan genome curves are plotted for four different ways of delineating species-like groups. For pan curves (upper panel) each point (x,y) counts the total number of orthologous gene families y found in any of the set of x genomes genomes (the vertical spread results from different combinations of x genomes from the total N in the group; where CxN > 100, 100 sets of x genomes are chosen randomly); the line traces the average over all combinations. For core curves, (lower panel), each point (x,y) counts the number of 71 orthologous gene families y present in all x genomes. Core and pan genomes were defined using 4 different hierarchical groups: a single ecologically differentiated population of largeparticle specialists (Vcyc:SP), a concatenated gene tree phylogenetic cluster comprised of two ecological populations (Vcyc:SP+LP), -3% 16S rRNA cluster (OTU <3%16S), or by named taxonomic family (Vibrio family). When considering the larger phylogenetic groups, population sample number was equalized by random downsampling. 0.1 Species phylogeny Jaccard distance of gene content ZF-270 ZF-14 ZF-28 1F-175 F-53 F-160 1F-97 F-111 1F-273 E5F-59 FF-160 ZF-99 1F-97 1F-111 1F-273 F-270 1F-289 ZF-14 FF-75 F-207 ZF-264 ZF-207 ZF-30 1F-289 ZF-65 ZF-30 F-75 ZF-205 inferred habitat F59 ZF-65 F-99 LP: >63p ZF-264 SP: 1-5p ZF-170 ZF-205 1F-53 1F-175 ZF-170 F-28 Figure 8. V. cyclitrophicus strains clustered by flexible gene content and neutral markers Strains of the V. cyclitrophicus population are hierarchically clustered according to their fraction of shared protein families. Green and red squares label strains assigned to small particle and large particle specialist populations of this species, respectively (Hunt et al. 2008; Shapiro et al. 2012). Phylogeny of 30 concatenated ribosomal proteins supports nonclonal origins for the SP and LP specialists. Species tree shown is a subtree of the concatenated ribosomal protein phylogeny; dots denote branches with >80% bootstrap support. 72 16S from Vibrio finished genomes (R 0 075' C~L 0 I C)J 6N 0 1500 1000 500 C0~ CD 16S assembled from short reads 0O 0) EE 04I 1111i11III Ell 0 . IIAHIMA IIE 111111 1 1 is 1000 500 111 IIII il I I I I I'A 1500 Position in 16S rRNA alignment Supplemental Figure 1. Distribution of polymorphisms in 16S genes assembled with short read pipeline. Polymorphism frequency by position across the 16S gene assembled from short reads was concentrated in variable regions also seen in finished genomes from the family, consistent with the idea that they are true polymorphisms, not short-read sequencing/assembly artifacts. Top panel, the distribution of intra- and intergenome polymorphism observed for the family using 129 16S rRNA genes found in 15 finished genomes of 10 Vibrionales species. Lower panel, intragenome polymorphic sites of 16S genes assembled by short read pipeline. 73 Fa_FF -2 ~ imi _4us Vibio~ Vibrio'w 6421-8 1..31. .mm^sVT5C66840_12 Vaarlo-pRC58 _04723020 153 Supplemental Figure 2. 16S phylogeny of our populations and other vibrio type strains. 16S rRNA genes assembled from short reads are placed in phylogenetic context with other cultured Vibrio isolates. One representative per population with the fewest ambiguous (polymorphic) positions was selected, aligned to the Silva reference, and Phyml was used to estimate the phylogeny. 74 C) 2 11,spl 0 (1) E % 6 -7,8 vb p1,2 CL* CU * -~ o .5,9 a) 0 CU (D 6 E 0 2 4 8 6 10 % 16S divergence between populations Supplemental Figure 3. Gene content divergence with phylogenetic distance The fractional overlap in protein-coding gene families (Jaccard distance) is plotted against 16S distance between populations. Each point is the mean Jaccard distance between two populations (Jaccard distance between each population pair was calculated as the mean of the set of all Jaccard distances between genome pairs composed of one genome from each population.) Outliers are labeled with population numbers from Fractionation2006 study; spl, V. splendidus putative populations 19,20,23. Some populations have %16S distance=-0 because intragenomic diversity exceeds between-population divergence. Gene content 's linear divergence seems to saturate at high %16S distance, but it is possible this is due to orthologous group definitions. 75 inferred habitat: large particle specialist LP-specialist pairs 0other pairs 0- 0 o C 0 0 C 't2 0 Q 00- - I I I I I 2 4 6 8 10 %16S distance between populations Supplemental Figure 4. Overlap in gene content between large-particle-associated populations The fractional overlap in protein-coding gene families (Jaccard distance) is plotted as a function of 16S distance between populations. Large-particle specialist populations (Hunt et al. 2008), do not have more similarity in their overall gene content. Each point represents the mean Jaccard distance between population pairs in that distance bin (calculated separately for same-season or same-particle). Jaccard distance between each population pair was calculated as the mean Jaccard distance of the set of all pairwise genome comparisons composed of one genome from each population. Gene content overlap in sameand different-habitat comparisons differed only at one distance bin (3% < 16S distance <4%, p=0.025) 76 CD E 0 CL -, 0 a) a) E C, CO S same-habitatFra m ndifferent-habitatg a) -0 E 0) C%4 0. 05 .01. C 00 notaveae o each poutosding prevous lotsthspis somihrdltetheiga of rare genes.) Rare genes were defined as genes found in between 3 and 20 genomes (other definitions were used with similar results). There was no consistent pattern of increased shared genes in genome pairs of same versus different habitats, with one consistent exception for all definitions of rarity: Same-habitat strains within a population (first distance bin) always share more genes between (p<0.01). The outlier at intermediate distance consists of comparisons of Population 1 and 2 strains, which have surprisingly similar gene content given their distance; the reason for this is unknown. 77 IF- 302 1 FF- 2 I I I FF-4514 1 ZF-Ii a I 4 'CIO Z 299 "-W9 ZF-"9 12GOI 107 3 F$-,44 3 F 233 3 Z112910 gZam 10 R01371 2311 6 win A 6=7713 1313 9CS108 13 ZF-01 ;3 IF-63so ZF-Z7015A ZF-204 13A ;F-28915A F 11158 ,F-273 ZF-14150k Z 17015A ZS-139 II IS" OF-79 16 ZS-1717 FF_= Is ZS-03 15A V-SO 19 Fr- IS 04AA0 MEN arm a am man a on ME miummanowne an a MEMMEN a a a 0 MNMMMMM 0 0 a 0 MEN 0 on WE 0 a a 0 We Sam a a P a a a WMI 0 0 0 a of OWN On . a WOMMOMMIN a ON so a a 0 0 a 0 0 MEN a a 0 !al an No 18 IF-157 '2NI 8-12123 a Supplemental Figure 6. Substrate utilization in the Vibrio family A heat map of 95 substrates respiration (OD on Biolog GN2) in each of 46 strains with fully sequenced genomes. Substrates are hierarchically clustered according to the OD vector in 46 strains (eisenCluster implemented in the R software package (Chipman and Tibshirani 2006) ); strains by core genes phylogeny. Leaves are labeled with strain names and original population number (from Fractionation2006 study or adapted where necessary (Preheim, Timberlake, and Polz 2011)). Leaf color strips code for season or filter size of isolation. In large circles, numbers indicate assigned populations for clade (for confirmed populations, see Methods) and colors indicate inferred habitats (Hunt et al. 2008). 78 0 I)I Q 0 0) V 0 0 0 O 0 io [Td, o qo 0 0 0 2 6 4 8 10 %16S distance Supplemental Figure 7. Substrate utilization profile divergence with phylogenetic distance Similarity of substrate utilization for each population pair is plotted against evolutionary distance (%16 distance). The Pearson correlation of each strain pair was calculated using the vector of OD's for growth on 96 Biolog substrates); populations' similarity calculated as the mean correlation between all pairwise strain comparisons composed of one strain from each population and bars represent standard deviation. Where only comparison was available, the median standard deviation of all other comparisons is plotted. The dashed gray line represents the mean Pearson correlation between 3 sets of biological duplicates. Large variation at intermediate distance is due to comparisons of V. splendidus populations with similarly metabolically versatile V. alginolyticusversus comparisons with metabolically diminished V. breoganii. 79 E) 0 E LO z LO 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 penetrance in cyclitrophicus Supplemental Figure 8. Distribution of gene family penetrance in V.cyclitrophicus A gene frequency distribution curves counts the number of gene families ("clusters") found in 1, 2,...,X genomes of the strains of V. cyclitrophicus. 80 2.10 Supplemental notes and discussion. 2.10.1 Hybrid assembly pipeline: discussion of accuracy. Assembly of whole genome shotgun reads is a probabilistic process, involving heuristic steps (Zerbino et al. 2009; Chaisson, Brinza, and Pevzner 2009). Assemblies can incorporate several major classes of errors: base substitution, base insertion, contig mis-joining, and region duplication or deletion. Because this short read sequencing dataset wasn't created with methods development in mind, there is no gold standard to check for accuracy. However, there are two modular steps in the hybrid assembly pipeline, so we can consider the accuracy of each separately. The latter step consists of de-novo assembly (using reference assembly contigs as fake long reads), and the accuracy the de novo assembler Velvet in terms of these errors classes has been documented extensively in the literature. The first step consists of the generation of reference assembly contigs. Future work should future quantify how errors they contain can be propagated through the de novo assembly process. 2.10.2 16S rRNA phylogeny To explore the phylogeny and taxonomy of our sequenced Vibrionaceae strains with other members of the genus, a phylogeny was built including 16S sequences from complete genome sequences of the Vibrionales order (Supplemental Figure 2). A striking topological difference is the placement of the V. cholera-like clade as an outgroup to the rest of the Vibrio genus. It is certainly possible this placement spurious and the 16S-based distance estimates to this group inflated compared to most measures of phylogenetic divergence. However, V. cholera-like strains were not particle-size filtered and were not used throughout the majority of this study. More thorough analysis of Vibrionaceae family's phylogeny (Kirkup et al. 2010) and these populations' taxonomy (Preheim, Timberlake, and Polz 2011) have been undertaken elsewhere. 2.10.3 Defining gene families in 82 genomes: sensitivity to methodology, sampling. Assigning orthology should not be challenging in a group of this level of sequence divergence, but the size of the dataset presented challenges, as sequencing has outpaced analysis tools. Several standard methods of constructing gene families were assessed for our set of genomes, with widely varying results. Grouping of genes varied by program 81 choice, parameter choice, but most strikingly, with genome sampling within clades, resulting in family core genome predictions (provisionally defined as those ubiquitously present in the family's assemblies) that ranged between 70-1000. Within-population downsampling resulted in highly discrepant gene family predictions: Blastclust predicted only 160 ubiquitous gene families in the vibrios when all strains were used for clustering, but more than 1000 when only 1 genome from each population was used (a number of Blastclust parameter sets were tried with similarly discrepant results). Newer methods designed for finely sampled genomes (Angiuoli et al. 2011) start with whole genome alignments, using synteny to improve scaffolds, ORF calls, and ortholog assignments, however, these methods are only appropriate at close phylogenetic distances and could not be applied to our family. OrthoMCL was ultimately chosen for this genome content analysis as it is commonly used and the size of core genomes produced were similar to estimates for other Vibrionaceae family members, using the well-validated approach of tree-based orthology (P. S. Dehal et al. 2010; M. Price, Dehal, and Arkin 2007). 2.10.4 Using gene content divergence to confirm population predictions. The Fractionation2006 study inferred many recently differentiated pops in the adaptive radiation of the V. splendidus/tasmeniensisclade. Whether these are truly distinct populations has been difficult to confirm. They were not well supported by MLSA analysis, or sampling occasions after the 2006 study (Preheim, Timberlake, and Polz 2011). It is not yet well established that proteome content divergence can be used generally to distinguish differentiated clusters, however, both proteome and concatenated gene phylogenies support Population18 but both reject Population19. V. tasmeniensispopulations 16 and 17 should definitely be placed apart even though they were combined into a single population by some ecological characters (Preheim 2010 Particle2007 study); they likely represent more than two populations, as prescribed by habitats based on host and body site (Preheim et al. 2011 Invert2007 study). 2.10.5 Habitat-specific gene gain and loss in V. cyclitrophicus We consider the results of using a broader group of fixing genes than that presented in the text. We considered 1155 flexible gene families present in more than 2 V. cyclitrophicus strains, of which 128 were significantly enriched in one habitat (2-tailed hypergeometric test with 5% FDR). Thus, more than 10% of the flexible genes apparently rising to fixation (or being deleted) are doing so in a clearly habitat-specific manner. 82 Narrowing our focus to more common genes in the main text resulted in a higher estimate of habitat-specific fixation. The selective processes governing flexible gene gain and loss are very likely different. If separate these gene families into two groups, those likely to be present in the last common ancestor and undergoing loss, and those that have been recently gained and rising to fixation, we see separate trends. V. cyclitrophicus flexible gene families also present in two or more closely related (<3% 16S distance) outgroups were defined as ancestral and undergoing loss. These 565 flexible gene families are putatively being deleted from a common ancestor, while the rest (590) are recent gains. For recent gains, nearly 20% of them (114/590) are being fixed in a habitat-specific manner (2-tailed hypergeometric test with 5% FDR). Gene gains are much more likely to have a habitatspecific distribution than genes lost (ks.test on hypergeometric p-values: DA-= 0.2125, p<1e-11). As above, the results are consistent when a narrower focus considers more common genes. There are several major caveats for this approach. The main issue is that the strains are treated as if they are on a star phylogeny, with independent history of gain and loss. While the well-mixed nature of the genome does not directly reject this idea, there is certainly some evidence for same-habitat strains sharing more phylogenetic history. Although the SP and LP specialists are not each clonal expansions, the same-habitat pairs constitute some of the most closely related strains in the group, and gene gains in these lineages likely do not meet the criteria of independence for the hypergeometric test.. To try to minimize this effect we considered genes near fixation, such that their distribution in multiple genomes was unlikely to be a result of clonal expansion. Secondly, this test is treating each ORF's presence/absence independently, and therefore their fixation is independent. This is almost certainly not the case, as genes arrive on stretches of DNA, sometimes very large ones, that are mobilized together (Shapiro et al. 2012). However, all models of gene content we are aware of treat each ORF independently, including the neutral gene content models recently developed (Haegeman and Weitz 2012), and our intention was to assess their applicability in this system. 2.10.6 Gene deletions in general: bias, neutrality, or selection? The majority of flexible genes that enter a population are not fixing by selection, rather, they are deleted rapidly. Is this deletion is neutral? During experimental evolution of M. 83 extorquens, accessory genome deletions were found to be due to selection rather than deletion bias, as they were not observed in mutation accumulation assays (M. C. Lee and Marx 2012). However, others argue for strong and pervasive deletion bias in bacterial genomes (Kuo, Moran, and Ochman 2009; Kuo and Ochman 2009), though the time scales are much broader. Our comparison of strains evolving in the wild versus an equivalent amount of SNP-divergence in the laboratory revealed a stark contrast, with wild strains deleting orders of magnitude more DNA. In 7 laboratory lines of V. splendidus 12B01 evolved under conditions of elevating salt concentration, lines accumulated 50-100 SNPs, but deletions were rarely observed. In stark contrast, wild-type sister strains of similar divergence (1F-111/1F-273, 1F-53/1F-175, ZF-207/ZF-30 from V. cyclitrophicus, 12F01/12B01 and 12B09/FS-144 from V. splendidus and V. ordalifi respectively), each with <100 core genome SNPS (or <300 total genome SNPs, including SNP clusters likely due to recombined regions), had >100-300kb of flexible DNA deleted in each lineage. This figure is likely an underestimate of the total amount of DNA deletions that have occurred along each branch since only the leaves can be sampled. This suggests that perhaps fast loss of flexible DNA may be mediated by turnover (i.e., replacement by new incoming DNA). Whether this fast turnover of flexible DNA is mediated by different mechanisms than the deletion bias typically discussed in the literature has not been explored, as strains this closely related have not, to our knowledge, been analyzed previously. 84 Chapter 3. Conservation of coexpression without conservation of response in bacterial gene regulatory networks 3.1 Authorship & Publication Note This chapter has been submitted as a manuscript by the following authors: Sonia C Timberlake, Romy Chakraborty, Aindrila Mukhopadhyay, Katherine Huang, Dominique C Joyner, Marcin P Joachimiak, Jason K Baumohl, Paramvir S Dehal, Zhili He, Aifen Zhou, Jizhong Zhou, Adam P Arkin , Terry C Hazen, Eric J Alm Detailed author contributions are available in section 3.7. 85 3.2 Abstract Transcriptional response to environmental perturbation is important for fitness, but how these adaptive strategies evolve remains poorly characterized. In this study, we address two fundamental questions about the evolution of bacterial gene regulation. First, we asked to what extent the basic units of the transcriptional network, coexpressed gene pairs, are conserved across the billions of years of evolution spanned by three bacterial clades. Second, we asked whether the global transcriptional phenotype in an environment is conserved. We report an extensive compendium of transcriptomic data in three metal-reducing bacteria we generated to conduct our comparative survey of transcriptional phenotype. We also analyze hundreds of previously described microbial gene expression experiments, to quantify coexpression conserved in our model organisms and across the bacterial domain. Comparing orthologous genes' responses, however, revealed that global transcriptional phenotypes are not conserved over relatively short time spans, although we identify specific exceptions where orthologous genes respond similarly to an environmental cue. We also show that bacterial transcriptional responses are extremely sensitive to variations in experimental treatments. We find a lack of consistency or conservation in bacterial transcriptional responses, which calls into question the standard analysis of transcriptomic data listing significant changers in a single experiment. Comparative transcriptomics may be critical for identifying the elements of any transcriptional response that are functionally (and not just statistically) significant, but consistently produced data from very closely related bacteria will be required. Our comparative survey of gene expression underscores the need for a model of gene expression evolution, one that can relate the robustness and fitness of transcriptional responses to their subsequent selection. 86 3.3 Background Microbes in changing environmental conditions alter their phenotype via gene regulation in response, but the selection on this response is poorly understood. Although instances of differentially regulated genes have been extensively documented (DeRisi, Iyer, and Brown 1997; Faith et al. 2008), even a basic understanding of their fitness contributions (Birrell et al. 2002; Giaever et al. 2002; Klosinska et al. 2011; S. L. Tai et al. 2007), divergence, or drift is lacking. A fundamental unanswered question is whether orthologous bacterial genes respond the same way to the same environmental cues. While recent studies have characterized the divergence of transcriptomic responses in 5 closely related yeast species (Tirosh et al. 2006), and of proteomic abundance profiles in 10 co-generic proteobacteria (Konstantinidis et al. 2009), to our knowledge, conservation of bacterial transcriptomic responses had been never been quantified. Comparative transcriptomic studies in bacteria typically revolve around strain-specific differences, which can then be implicated in pathogenic or other adaptive phenotypes (Yoder-Himes et al. 2009; Guell et al. 2011; Zhang, Dudley, and Wade 2011), but the extent to which these differences are due to selection or drift has not been shown (M. Lynch 2007). Indeed, even upregulated genes' presumed selective advantage is questioned (Birrell et al. 2002; Giaever et al. 2002; Klosinska et al. 2011; S. L. Tai et al. 2007), but their conservation could support it. Whether bacterial genes respond the same way to the same condition is thus fundamental to our understanding of transcriptional response. To characterize the evolution of transcriptional phenotype in a variety of species and environmental stressors, we generated an extensive compendium of transcriptomic data in three proteobacteria, and analyzed data from public repositories. We focused our study of transcriptional phenotype on three dissimilatory metalreducers that have gathered interest as agents of bioremediation. Geobacter metallireducens GS-15 and Desulfovibrio vulgaris Hildenborough are delta-proteobacteria and likely diverged from Shewanella oneidensis MR-1 and the rest of the gamma-proteobacteria three billion years ago at the base of the Proteobacterial clade (Ciccarelli et al. 2006; David and Alm 2010). D. vulgaris is an anaerobic sulfate-reducer and S. oneidensis is a facultative anaerobe, but their common natural environments and dissimilatory metal reducing lifestyles and may pose similar challenges. On a genomic level, they share 832 putatively orthologous genes, roughly a quarter of D. vulgaris' gene content, including half of its 87 transcription factors (Heidelberg, Seshadri, Haveman, Hemme, Paulsen, Kolonay, Eisen, Ward, Methe, Brinkac, et al. 2004; Heidelberg, Seshadri, Haveman, Hemme, Paulsen, Kolonay, Eisen, Ward, Methe, and Brinkac 2004; Mulder et al. 2007), though recent work suggests that transcription factors identified as bi-directional best hits (BBH) may be an overestimate of the vertically-inherited regulators controlling orthologous regulons (M. Price, Dehal, and Arkin 2007). We also assayed phenotype of G.metailireducens,a much closer relative of D. vulgaris, sharing more gene content (Paramvir S. Dehal et al.), and the obligate anaerobic lifestyle (Lovley et al. 1993). To ask how orthologous genes respond to similar environmental stressors, we generated an extensive compendium of transcriptional profiles by subjecting these bacteria to a variety of ancient and ubiquitous environmental changes: heat, cold, acid, alkaline, salt, and nitrate. We also analyzed public transcriptomic datasets of for evidence of transcriptional response conservation in other species and conditions. While the evolution of transcriptional response has not been described in prokaryotes, conservation of their transcriptional network has been explored by looking at conserved network architecture, including modules (McAdams, Srinivasan, and Arkin 2004; Babu et al. 2004; Gelfand 2006). Studies have convincingly demonstrated that genomes are partitioned into functional modules that are expressed together (Stuart et al. 2003; Bergmann, Ihmels, and Barkai 2004), clustered together (Price et al. 2005), physically interacting (Zinman, Zhong, and Bar-Joseph 2011) and laterally transferred together (Morgan N. Price, Arkin, and Alm 2006). As the transcriptional network in evolves, these units can remain cohesive, with some conserved transcriptional modules documented in species as divergent as E. coli and metazoans (Stuart et al. 2003; Bergmann, Ihmels, and Barkai 2004), or even out to B. subtilis (Waltman et al. 2010; Zarrineh et al. 2010), but are part of a complex picture of network evolution, including flexibility and extensive rewiring of functional modules (Fokkens and Snel 2009; Roguev et al. 2008; Singh et al. 2008). For a look at overall conservation of the transcriptional networks in our model organisms and across the bacterial domain, we start by quantifying the conservation and function of the simplest possible units, coexpressed gene pairs, in four very divergent bacteria. Despite abundant transcriptomic data in bacteria, no model exists for its evolution, and the correspondence between orthologous genes responses is not even qualitatively understood. This comparative survey of transcriptional phenotype was undertaken to 88 examine transcriptional network evolution in bacteria in two fundamental ways: conservation of links on the network, and conservation of network output. To do so, we generated microarray data in three bacterial species and drew on a large database of public microarray data to evaluate conservation of coexpression and response to a variety of environmental stressors. 3.4 Results and Discussion 3.4.1 Coexpression is significantly conserved across bacteria Our broad survey of conservation of coexpression painted a complex picture: we identify a very significant number of coexpression couplings conserved over vast evolutionary distances spanned by bacterial phyla. The majority, however, we do not find conserved, likely reflecting re-organization of species' transcriptional programs, in agreement with previous reports (Zarrineh et al. 2010). We used hundreds of microarray experiments to define coexpressed genes and assess their conservation across the bacterial domain, focusing on the simplest unit of the transcriptional network, coexpressed gene pairs. Gene pairs that are up-regulated or down-regulated together across many conditions tend to be functionally associated (Brown and Botstein 1999; Eisen 1998; Fink et al. 1998), and may participate in the same pathway or physically interact. Coexpressed genes have been further clustered into larger modules that act cohesively across conditions and evolutionary time (Cai, 2010; van Noort et al., 2003; Waltman et al., 2010; Zarrineh et al., 2010). Conservation in an ancient gene regulatory network was assessed by comparing coexpressed gene pairs in three very divergent bacterial clades: a cyanobacterium (Synechocystis PCC)(Kaneko et al. 1996) and our two model organisms (deeply branching proteobacteria), using data from the well-studied E. coil MG1655 to bolster inference of coexpressed gene in the gamma-Proteobacterial branch. Between the extensive phenotypic dataset we generated and public databases (Barrett et al. 2007; Paramvir S. Dehal et al.), these four organisms had some of the most extensive and varied assays of their transcriptomic responses to common ecological conditions. This wide variety of experimental conditions sampled many distinct transcriptional states, and we built a comprehensive picture of coexpression for millions of gene pairs, then asked which coexpression links were conserved. Despite a few billion years spent diverging from a common ancestor, these transcriptional networks shared more than 14 times the number of coexpressed pairs 89 expected by chance: of 18196 coexpressed pairs in the gamma-proteobacteria with orthologous genes and data in the other two clades, 2629 of these links were conserved whereas only 182 were expected under a model of random assortment. The model of random assortment was constructed without reference to operon structure, because while conservation of operon structure could be a mechanism underlying many conserved coexpression links, operon structure is driven by co-regulation (M. N. Price et al. 2005), so it is therefore likely that selection on coexpression preserves co-operon structure rather than the reverse. We identified putative coexpressed pairs in a single organism as those in the top 10th percentile of correlated expression over all conditions. (See methods for details and results using alternative cutoffs). To validate our cutoff for defining coexpressed pairs, we verified that this level of coexpression conservation was typical of protein partners by considering the subset of coexpressed pairs also known to be physically interacting in vivo (Butland et al. 2005; Hu et al. 2009). For the subset of 198 ubiquitous proteins where coexpression confidence was increased by physical interaction data, adding this criterion resulted in pairs just over twice as likely to be conserved. Thus, our simple strategy identified coexpressed gene pairs with regulatory conservation comparable to that of coexpressed proteins also known to physically interact in E. coli. What are the cellular functions of these pieces of the transcriptional network conserved across vast distances? Figure la depicts the conserved gene pairs' functions (as defined by the Riley functional categories used in COG (Tatusov et al. 2003), and which are unexpectedly likely to be linked given the distribution of conserved genes. Our conserved set of links was enriched five-fold for pairs participating in protein translation, a process known to be tightly co-regulated (Leary and Huang 2001; Kaczanowska and Ryden-Aulin 2007), though the fact that stressful conditions comprise a large part of microarray databases and our data compendium may bias us towards discovery of this functional categories. Stress conditions tend to slow growth rates and ribosomal biogenesis in general (Brauer et al. 2008; Rzhetsky et al. 2009), and this is particularly true in our experiments, as experimental perturbations were tuned such that each resulted in a 50% reduction in growth rate. While a large fraction of conserved genes are in the replication, recombination, and repair functional category (L), conserved links to this category are few, indicating that these conserved genes have divergent regulation. While previous studies have described predominantly within-functional-category coexpression links (Bergmann, Ihmels, and Barkai 2004; Stuart et al. 2003), Figure la illustrates many conserved links between genes 90 from different COG functional categories. For instance, energy production and conversion genes are coexpressed with translational machinery genes (C-J), at a rate 10-fold higher than expected, with the genes of the ATP synthase complex among the most highly connected, a reflection of the high cost of ribosome biosynthesis (Warner 1999). Interestingly, the steep energy cost of amino acid usage in translation (Akashi and Gojobori 2002) is not reflected by conserved regulatory links with the energetic functional category (E-C), though nucleotide metabolism is unexpectedly coupled with energy. The choice of correlation threshold for defining coexpressed pairs influences quantification of conservation, so we took a broader view by visualizing the divergence in the whole coexpression distribution instead of just the top 10%. The observed joint distribution of any two species counts where in each species' coexpression distribution a gene pairs' correlation is conserved in the orthologous pair. Under a model of random assortment, the correlation of a gene-pair in one organism is independent of its orthologous gene-pairs in others, so the expected joint distribution is the product of the two independent species' distributions. Figure lb quantifies the joint distribution's departure from the random assortment model: at the extremes of coexpression, and the waning similarity between successively more distant pairs of species is suggestive that the divergence in coexpression has not yet plateaued. We were surprised to see that a number of gene pairs remain anticorrelated over billions of years (Figure 1c). While approximately half of these conserved pairs were expected by chance and may be spurious, the links could generate hypotheses about function. For example, ycal is currently uncharacterized (Uniprot Consortium 2011) and has no known physical interactions (Su et al. 2008) nor convincing functional data compiled in online databases (Szklarczyk et al. 2010), apart from homology to DNA internalization proteins. The expression of ycal in opposition to a suite of growth and translation proteins (cogJ, asnS, engA, ileS) (Uniprot Consortium 2011) and to a few proteins of the de novo nucleotide synthesis pathway (guaA, purB), is consistent across billions of years, raising the possibility that despite the diversity of natural competence programs observed in bacteria (Gilbreath et al. 2011), competence induced out of phase with the ribosomal/growth transcriptional program is ancient and conserved. Anticorrelated expression relationships, and conserved ones in particular, can complement the usual expression profile clustering as a means for generating critically needed hypotheses about physiology or function (Dhillon, Marcotte, and Roshan 2003; Shatkay et al. 2000) but are not frequently used. 91 3.4.2 Expression phenotypes are not conserved A fundamental question in the evolution of transcriptional phenotypes is whether orthologous genes respond in a similar way to similar conditions. Because we could not find this directly addressed in the scientific literature, we searched public databases to gather comparable stress experiments performed in two or more species. For a number of common environmental perturbations, including heat stress, oxidative stress, acid, alcohol, and UV, we compared transcriptional responses to each stress in the closest species available (typically same-genus or family), and never observed significant conservation (for all orthologous genes, we compared log fold-change between treatment and control expression, transcriptional response hereafter). The single exception was in the close relativesEscherichiacoli and Salmonella enterica,in which we were able to find one time point in heat shock with a modest signal for conservation (r 2=0.22) in our comparison of six pairs of experimental time courses (Allen et al. 2003; Gutierrez-Rios et al. 2003; Jozefczuk et al. 2010) (see Supplemental Notes for details). In fact, we noticed that when a species' response to a given condition was assayed by more than one laboratory, the inter-lab comparisons overwhelmingly and surprisingly showed poor or zero correlation. Figure 2 highlights an example: E. coli differential expression in pH perturbations show little or sometimes negative correlation across different laboratories (Faith et al. 2008; Glasner et al. 2003; Kannan et al. 2008; Maurer et al. 2005; Hayes, Wilks, Yohannes, et al. 2006; Hayes, Wilks, Sanfilippo, et al. 2006). In the case of acute acid stress, three datasets (Faith et al. 2008; Glasner et al. 2003; Kannan et al. 2008) report a total of 836 significantly up-regulated genes, only two of which were common to all three studies (at any time point). Indeed, the responses to acid compared across laboratories were less similar than acid and alkaline within a laboratory, indicating that the transcriptional response is driven more by the details of experimental protocol and analysis than by the ancient ecological stressor. This observation argues for caution when drawing physiological conclusions from any single transcriptomic experiment. It also suggests that the dissimilarity we observed between species in comparisons of public data could equally be due to differences in experimental details as to divergence between species. To limit the non-evolutionary sources of dissimilarity, we designed a standardized experimental protocol and data analysis pipeline and produced a compendium of stress response data in three species (see Methods). In S. oneidensis and D. vulgaris,we produced 92 time series of transcriptional responses to five common environmental stressors: heat and cold, salt, acid, and alkaline. We also grew S. oneidensisboth aerobically and anaerobically for the salt stress condition to interrogate the possible effects of aerobic growth on stress phenotype. The transcriptional phenotype of these two species responding to environmental perturbations showed little evidence of conservation: only a modest correlation in temperature shock (r 2 <0.11, Figure 3a). We first considered similarity of response of all 830 shared genes (BBH, see Supplementary Figure 1) but found temperature shock responses were more correlated when we restricted our comparison to 509 tree-based orthologs, the subset of orthologs whose gene trees support strict vertical descent (M. Price, Dehal, and Arkin 2007; Paramvir S. Dehal et al.), highlighting the utility of tree-based methods for identifying functional orthologs. We compared same-condition transcriptomic responses (Figure 3a, bold boxes on diagonal) to ask if responses were similar, and different-condition responses to ask if the similarity was condition specific or more general. Within each condition, the most similar responses (annotated b-e) show very weak correlation (0<r 2 <0.11). Other than the limited conservation in temperature shock, several considerations indicate that global similarity is absent: in many comparisons no significantly changing genes are shared, and within acid and salt shock, there are as many anticorrelated as correlated responses. We also do not observe a consistent conserved general stress response (GSR) in this transcriptomic data. This may be due to the fact that GSR regulation diverges rapidly (Caimano et al. 2004; Gunesekere et al. 2006; Mittenhuber 2002), or that the ancient part of the response is largely post-translational (Mittenhuber 2002; Staron and Mascher 2010),but transcriptional conservation of the GSR was not found to be addressed in the literature. This lack of conservation in response is somewhat surprising given that coexpression showed a significant level of conservation: 4083 gene pairs were still coexpressed in both genomes, whereas only 1210 are expected by chance. This highlights an important distinction between the conservation of coexpression and the conservation of phenotype: coexpressed genes may act as a unit in both species, but that unit may not respond in a similar manner to the same condition. To look for conservation of phenotype over shorter evolutionary intervals, we compared transcriptional responses of D. vulgaris to those of Geobactermetallireducens,but 93 still found no evidence for conservation (Figure 4). We grew G. metallireducensunder salt and nitrate stress using protocols minimally modified from D. vulgaris,and compared samecondition transcriptomic responses as well as different-condition responses (Figure 4 and additional data in Supplement). We observe some weak correlation in these species' responses to the same condition (Figure 4b-c; r 2=0.09 for nitrate and 0.03 for salt), but as these were less similar than responses to different conditions (for example, salt versus nitrate or nitrate versus pH) they do not support condition-specific conservation. While a conserved general stress response might exhibit such a signature, upon further investigation, the most similar nitrate and salt responses across species (Figure 4b-c) share only a single significantly changing gene, the DNA repair protein recR. Thus, over these evolutionary intervals, neither general stress responses nor condition-specific ones are similar at the transcriptional level. Despite divergent global responses, there may be functional modules with conserved expression. Under any particular condition, some critical modules may respond directly to environmental cues, while others respond to indirect signals propagated through the cell's global regulatory network. We reasoned that the expression phenotypes of modules responding directly to any particular condition might be more conserved evolutionarily. To test this hypothesis, we focused on the heat shock response, as it showed the best signal of conservation, and the genes important for fitness have been well characterized by independent experiments. We examined heat shock response data culled from the databases for a broader set of species (Chhabra et al. 2006; Frye et al.; Gao et al. 2004; Gutierrez-Rios et al. 2003; Helmann et al. 2001), and apart from the close sister species E. coli and S. enterica,little or no correlation of global transcriptional response was found in comparisons of more evolutionarily distant pairs of species (Figure 5). However, heat shock module genes, known a priorito be important for fitness in this condition, have a highly conserved response between E. coli and S. enterica,and the general response is consistent across all of the species tested, despite the large evolutionary separations between them. In fact, the magnitudes of the responses for each gene in the heat shock module are similar, which is remarkable given that the experiments were done in different culture media, aerobicity, temperatures, platforms, and using overall different lab protocols. The differences in the global transcriptional profiles could be driven by the sensitivity to experimental protocols, or by drift of relatively unimportant parts of the response. However, in the case of the heat shock module's response, selection not only 94 conserved it over vast evolutionary distance, but also made it robust to the many other environmental cues acting in addition to heat shock. This observed conservation and consistency places bounds on the effects of drift and sensitivity on genes expression with known fitness benefit. 3.5 Conclusions and future directions We showed that an environmental perturbation elicits vastly different transcriptional responses in orthologous bacterial genes. Why are orthologous genes not expressed in a similar way in a similar condition? Two elements contribute to this dissimilarity: evolutionary divergence between species, and protocol-specific side effects. The extent to which each of these is due to natural selection versus neutral drift is an important unanswered question. The phenomenon of similar microarray-based experiments producing inconsistent results has been noted previously. For example, different platforms and analysis methods may inconsistently quantify absolute expression level or low abundance transcripts (Draghici et al. 2006; Shippy et al. 2006). However, a large consortium's systematic evaluation found excellent quantitative agreement in global transcriptional profiles between a variety of platforms, sites, and normalization techniques, particularly for foldchange values (Canales et al. 2006; Shippy et al. 2006; Shi et al. 2006), leading us to suppose that most of the protocol-specific variation is present in the biomass, rather than introduced by downstream measurement and processing. Indeed, more than a decade's worth of microarray measurements in well-studied yeast systems provide the perfect comparison to the variability of response we have highlighted in bacteria. In a number of yeast species, a diverse set of stressors all elicit a stereotypical global transcriptional program, termed the environmental stress response (ESR) (Gasch et al. 2000). In contrast to the "general stress response" (GSR) mediated by alternative sigma factors in prokaryotes, this is a global and highly stereotypical pattern, and is consistent across experiments done in different labs, with different methodology, platforms, and strains (Gasch et al. 2000). The global correlation of response is also largely shared between S. cerevisiae and S. pombe (Chen et al. 2003; Gasch 2007) despite 500 million years of evolutionary divergence (Berbee and Taylor 1993) and significant rewiring of even basic cellular transcription programs (Rustici et al. 2004). More recent analysis of the yeast ESR has found the effect largely due to cell-cycle genes (Slavov et al. 2012), but the robust and conserved global response remains in striking 95 contrast to our observation of response variability in the same species or closely related bacteria, and could indicate fundamental differences in the strategies used to tolerate stress across the domains of life. The extreme sensitivity in the bacterial transcriptional response to stress not only calls into question the functional conclusions drawn from a single experiment's list of significantly changing genes, but also the fraction of the global transcriptional response that actually contributes to fitness, in line with previous studies that have found differential expression poorly correlated with positive fitness effects measured by other means (Birrell et al. 2002; Giaever et al. 2002; Klosinska et al. 2011; S. L. Tai et al. 2007). Has selection tuned every part of the global transcriptional state, or is it more likely that of the hundreds of genes that respond (significantly) to an environmental perturbation, only a fraction of these contribute to fitness, and the remaining changes are side-effects of the global network that are mostly neutral for fitness? For example, the E. coli CRP regulon affects hundreds of downstream genes (Salgado et al. 2006; Zheng et al. 2004), but it seems implausible that each of them is responding optimally and contributing to fitness in every condition where CRP activity is modulated. It may be that for any one condition, the expression of a handful are important while the expression of the rest are 'hitchhiking' via their shared regulator. The extent to which bacterial transcriptional responses are evolving under selection or neutral drift remains an open question, but even where neutral evolution may not be the best model for expression (Bedford and Hartl 2009) that doesn't preclude widespread neutrality in a dynamic, condition-specific response (M. Lynch 2007). Questioning global optimality seems even more probable when one considers that a monoculture laboratory dish is a poor duplicate of the natural ecology experienced by a bacterium, and whatever the experimental protocol used, those conditions have likely never been optimized by selection. The most common approach to transcriptome analysis is to list the genes predicted to change with the highest degree of statistical confidence, but here we showed that variation in transcriptional responses with a species can lead to inconsistent lists, if not misleading conclusions. Whether the mercurial nature of bacterial transcriptional responses is due to protocol-specific responses or widespread neutrality is an important fundamental question, but regardless, comparative transcriptomics provides a practical strategy for identifying the functionally important parts of the response: by honing in on 96 those conserved by purifying or stabilizing selection, just as the conserved residues in a multiple sequence alignment can highlight the residues critical for protein function. We showed a module known to be important for fitness had far greater conservation; conversely, conserved condition-specific modules could be learned (Lazzeroni and Owen 2002; Waltman et al. 2010) where they are not known a prioriand thus focus our physiological interpretation of a response with information complementary to differential expression. In our data compendium of three metal-reducing proteobacteria, we found that despite significant conservation of an ancient regulatory network, orthologous proteins very rarely show a conserved response. Taking these findings into account, future comparisons of transcriptional phenotypes from multiple closely related bacterial species, employing rather stringent protocol standardization or in situ techniques, will be a valuable approach for understanding the both the physiology of transcriptional responses in bacteria and their evolution. 3.6 Materials and Methods. 3.6.1 Biomass production and stress application. Our compendium of microarray data was conceived with the aim of comparing transcriptional responses in three model metal-reducing species (S. oneidensis,D. vulgaris, G. metallireducens).To that end, our experimental design included efforts to make biomass production and quality control similar, and stress conditions comparable wherever possible. In some cases experimental protocols could not be made identical due to biological constraints (different media and/or gas composition of headspace were used to accommodate the significantly different physiology of our organisms) or technological constraints (e.g., microarrays were not available for all three species on the same platform). Efforts to make the environmental stress comparable included determining the stress level for each condition, in each organism, such that the cell growth was possible but significantly inhibited. Various concentrations of stressors were tested to determine an appropriate inhibitory concentration. In most cases we were able to find a stressor concentration to reduce growth by 50% of the maximum compared to the control (noted as the minimum inhibitory concentration or MIC). In some cases, transcriptomic responses of 97 a number of inhibitory concentrations were assayed. Biomass was collected at a number of time points post treatment to capture response dynamics, these time points differed by organism in accordance with their stress response and growth curves. Some experiments we generated in our three model organisms have been published previously. In S. oneidensis,heat shock (Gao et al. 2004), cold shock (Gao et al. 2006), and pH (Leaphart et al. 2006) were described previously. In D. vulgaris,heat shock (Chhabra et al. 2006), alkaline (Stolyar et al. 2007), some salt data (Mukhopadhyay et al. 2006) and some nitrate data (He et al. 2010) were described previously. The detailed protocols for unpublished experimental data are described below. Shewanella oneidensis. For salt stress studies on aerobically grown Shewanella oneidensis strain MR1, starter cultures were inoculated 1:100 from 2ml vials of -80C freezer stocks into LS4D medium and grown in a 30C water bath until log phase (OD-0.34). Then they were inoculated into six 2000ml biomass production cultures. At an OD 5 9 o of approximately 0.4, 500mM NaCl stress was applied to three of these, and the other three were treated as non-stressed control incubations. To facilitate comparison of gene expression of Shewanella oneidensis to D. vulgaris, we also assayed S. oneidensisstress response under strict anaerobic conditions. For salt stress studies cultures were grown in LS4D medium (Mukhopadhyay et al. 2006) minimally modified to allow growth of Shewanella oneidensis strain MR1 by substituting 50mM sulfate with 50mM fumarate. Minimum inhibitory concentration under these conditions was determined to be 300mM NaCl; during biomass production for transcriptomic studies 300mM NaCl was added as the stressor in triplicate incubations after cells reached mid-log phase as described for D. vulgaris.Subsamples were harvested at 0, 60, 120 and 240 mins after stressor addition. Desulfovibriovulgaris.For Desulfovibrio vulgaris Hildenborough biomass production, starter cultures were grown in LS4D medium from frozen glycerol stocks. Inoculations were typically a 1:10 dilution. For example, 1ml thawed stock was added to 10ml LS4D for overnight cultures. When these starter cultures reached mid-log phase, they were inoculated into six biomass production cultures using a 1:100 dilution. At an ODso of 98 approximately 0.6 per ml, stressor was applied to three of these, and the other three were mock-treated to serve as non-stressed control incubations. In the case of salt stress, the MIC was determined to be 250mM NaCl (in addition to what is present in LS4D). For the control incubations, a volume of LS4D or deionized degassed water equal to the solution of NaCl was added. Subsamples were collected from the stressed and non-stressed treatments generally at 0, 30, 60, 120, and 240 min following stressor addition. To arrest further cell activity at this point, samples were processed as described previously (Mukhopadhyay et al. 2006). Briefly, the required volume of culture was harvested through a cooling tube, centrifuged at 11,000rcf for 10 minutes to pellet cells. The supernatant was discarded and pelleted cells were resuspended in 30 ml chilled PBS, spun down with the same parameters as before, and the pellet was stored in -80C until further analysis. In the case of nitrate stress, the MIC was determined to be 6500ppm. LS4D media was inoculated with frozen stocks at 1:100 dilution and grown at 30C in an anaerobic chamber (10% C02, 5%H2, remainder N2) overnight to an OD of 0.3/ml; then this culture was used to inoculate samples (1:10) for biomass production. Six bottles were grown until 0.4/ml OD, at which point we took a TO sample from each bottle, and added 40 ml of 4.43 M NaNO3 to three of the six bottles; an equal volume of water was added to the other three for control. Additional samples for 30,60,120, and 240 minute time points were obtained. All samples were obtained by aspiration from bottles in an airtight manner. The four nitrate time courses analyzed in this study included slight experimental variations from this workflow. Time zero "TO" OD varied between 0.28-0.4, and final pH of control between 7.07.3. Experiment 23 excluded the additional PBS-washing step after centrifugation. These additional experiments with slight variations were all included in this meta-analysis to allow for every possible chance of observing similarity in transcriptomic phenotype. In the case of pH stress, 1.8M H2SO4 was added to reach a culture pH of 6.2 or 5.5 (Exp 111 and 38 respectively), and nothing was added to the control cultures. In the case of cold stress, biomass production proceeded as described above for D. vulgaris.After TO harvesting, the three treatment bottles were taken out of the anaerobic chamber and placed in the water bath at 8C, pumped through a chilling device and quickly chilled to 8C. Samples were taken at 60,120, and 240 minutes. 99 D. vulgaris heat shock experiments (Chhabra et al. 2006) were performed like the rest of the D. vulgaris experiments, but with the following key differences: lactate sulfate medium (LS) with yeast extract (not defined) was used as culture media. 37C rather than 30C was used for overnight cell growth as well as control conditions for biomass production. Samples were transferred on the benchtop to a pre-chilled 50ml falcon tube and spun down at 10,000rcf before storage in -80C. Geobacter metallireducens. To facilitate comparison of gene expression of Geobacter metallireducensto D. vulgaris,LS4D medium was minimally modified to accommodate the growth of the iron-reducing Geobacterstrain GS 15. Notably, 10mM Iron-NTA was added as the electron acceptor instead of 50mM sulfate, and 10mM acetate added instead of 60mM lactate as the electron donor. Starter cultures were initiated in this medium from frozen stocks, subcultured once before inoculation into biomass production cultures. Due to the infeasibility of tracking cell growth in cultures containing iron by optical density, Iron(II) concentration and cell counts were used as measures of growth. The MICs for G. metallireducenswere determined as for D. vulgaris. Control and stressed cultures were prepared in triplicate under anaerobic conditions as described above for D. vulgaris and subsamples collected and processed at 0, 60, 120 and 240 mins after addition of stressor at the MIC (100mM NaCl in the case of salt stress and 10mM in the case of nitrate). 3.6.2 Microarray extraction and hybridization. For S. oneidensis salt experiments, microarray extraction and hybridization were performed as described previously (Zhou et al. 2010) to a custom oligo array (Gao et al. 2004). For D. vulgaris, microarray extraction and hybridization were performed as described previously (Stolyar et al. 2007) with the exception of nitrate, which was previously published (He et al. 2010). For Geobactermetallireducens,extraction and labeling of total RNA and genomic DNA were carried out as described for D. vulgaris. Microarray hybridization was performed as described previously (Zhou et al. 2010). Data processing. Log expression levels, including global normalization, were first computed for each microarray. Log expression levels obtained from replicate arrays were averaged. Each gene was represented by two spots on each microarray, and spots flagged by the scanning software were excluded. The net signal of each spot was calculated by 100 subtracting the background signal and adding a pseudosignal of 100 to obtain a positive value. For resulting net signals of 50, a value of 50 was used. For each spot, the level of expression was the ratio of the two channels (ratio of mRNA to genomic DNA). For each replicate, the levels were normalized so that the total expression levels for the spots that were present on all replicates were identical. Finally, mean expression levels and standard deviations of each spot were calculated, which required n > 1. To estimate differential gene expression in control and treatment conditions, normalized log ratios were used. The log ratio was log2 (treatment)/log2 (control). This log ratio was normalized using LOWESS on the difference versus the sum of the log expression level (Dudoit and Fridlyand 2002). Sector-based artifacts were observed; therefore, the log ratio was further normalized by subtracting the median of all spots within each sector. Up to this point, data were processed using spots instead of genes to allow sector-based normalization. Finally, the spots for each gene were averaged to obtain a final normalized log ratio. To assess the significance of the normalized log ratio, a Z score was calculated by using the following equation: Z = log2(treatment/control) / sqrt(0.25 + variance) , where 0.25 is a pseudovariance term. Log ratios and Z values for all microarray data in this study are available at http://www.microbesonline.org/cgi-bin/microarray/viewExp.cgi?expld. The log ratios and Z values were used to generate two types of plots, (i) volcano plots and (ii) operon-based estimates of local accuracy. For volcano plots, we plotted the log2 ratio of all genes versus Z. This provided an estimate of the total numbers of significant changers in a microarray comparison and allowed determination of which time showed the most changers. For operon-based estimates of local accuracy, each point represented a group of 100 predicted significant changers with similar Z scores (the point showed the least significant Z value in the set). The estimated accuracy of each group of changers was derived by inspecting other genes in the same operons as these changers. For random changers, the transcripts for 50% of these genes should have been regulated in the same direction, and for perfect changers 100% of the genes should have been regulated in the same direction. Members of the operons without a consistent signal across replicates (Z < 0.5) were excluded. Even with perfect microarray data, the estimated accuracy was somewhat less than 100% due to errors in operon predictions. For operon plots see http://www.microbesonline.org/cgibin/microarray/viewExp.cgi?expId=XX and select Plots. For public datasets, we averaged replicates and normalized data (median centered and treatment versus control) where applicable, such that each gene was represented by a 101 log-fold-change in expression between treatment and control. The control was always taken from the same data series as the treatment. For time-course data sets, we used the series' zero time point as a control. For continuous culture experiments, the unstressed growth condition with identical metadata was used as a control. Public datasets are cited in the main text; exact experiment reference numbers, control conditions and processing are enumerated in Supplemental Notes. 3.6.3 Computational methods Significant changers. As z-scores were not available for all of the public microarray data, we used 2-fold up- or down-regulation as a threshold for significantly changing genes. If, when investigating correlated responses in two or more species or from two or more separate experiments, no conserved significant changing genes were discovered, we relaxed this threshold to 1.5-fold (as it is less likely that the signal is spurious in both experiments). Gene function and COG functional categories. All functional annotations were downloaded from MicrobesOnline (Paramvir S. Dehal et al.). This includes COG functional categories, many of which were not assigned in the COG database (which contains a limited set of species/strains) but were annotated by the MicrobesOnline pipeline. To assign a COG functional category to each orthologous gene family (tree-orthologs from the MicrobesOnline database), we used all COG functional categories assigned to family members (typically 1 but sometimes more). Coexpressed gene pairs in a single species. We define putative "coexpressed pairs" as the gene pairs with the most correlated response vectors across many conditions in each organism. To supplement our own transcriptomic data (36 experiments for S oneidensis and 79 for D. vulgaris), we gathered datasets from several public microarray databases for two other organisms under diverse conditions: 40 for S. PCC, 40; 212 for E. coli. Some conditions were represented by multiple time points. Prior to calculating coexpression, we removed experiments that were too similar, i.e. the Pearson correlation over all genes, p>0.90, to avoid their dominating the signal for coexpression. For each pair of genes in an organism, we computed the Pearson correlation of those two gene's response vectors over all conditions, i.e., a measure of tendency for the two genes to be up- or down-regulated together in response to a variety of perturbations. Each gene in a species was represented by a vector g where each entry gn is the log-foldchange in gene expression for that gene in experiment n (using all transcriptomic 102 experiments for that organism). The Pearson correlation of these vectors, p(gia, ggi) is the coexpression of gene-pair (i, j) over all experiments n in species 1. (We only considered gene pairs for which at least half the experiments had data.) For each organism, this yielded a distribution of expression correlations p of 6 to 9 million gene pairs (G choose 2 where G is the total number of genes in the genome). We took the top 10th percentile of gene pairs from each organism's distribution as putatively coexpressed pairs. This low-specificity cutoff for putative coexpressed pairs was a compromise to maintain reasonable sensitivity, as it already excluded more than 25% of (predicted or experimentally validated) cooperonic genes in E. coli. Similarly, anticorrelated gene pairs were defined those in the bottom 10th percentile (and Pearson's p<O) in each organism. Conservation of coexpressed pairs. To quantify conservation of coexpression, we took the list of putative coexpressed gene pairs in each species, and asked which orthologous gene-pairs were also coexpressed. To calculate both the observed and expected fractions conserved, we considered only the gene pairs with conserved orthologous gene pairs present in the other species, so that the signal of regulatory rewiring would not be convoluted with the signal of gene presence/absence. For the four-species comparative analyses, we used orthologous gene families built from tree-based orthologs (M. Price, Dehal, and Arkin 2007; Paramvir S. Dehal et al.), 295 gene families present in S. PCC, D. vulgaris,and at least one of our two gammaproteobacteria. Each gene-family-pair (i, and expression correlation p(gi2, gj2) j) has expression correlation p(gui, gai) in species 1 in species 2, and conserved coexpression if both correlations p rank in the top 10th percentile in their respective species. To calculate the expected number of coexpression links conserved, we started with the set of gene pairs coexpressed ( rank( p(gji, gi) ) >0.1) in either gamma-proteobacteria, considering only those with orthologs and data in the other two branches (18196 gene pairs), and multiplied by 0.10 twice (for each of the other two clades). Under a model of random assortment, 10% of pairs would be expected to fall in the top 10th percentile of correlation for each phylum added (D. vulgaris and S. PCC). A stricter definition of coexpression, the 1st percentile of correlation, results in 407 conserved coexpressed pairs when only 0.5 are expected, a >800fold enrichment. Thus it is likely that our estimate of 14-fold enrichment of conservation over expectation is quite conservative. 103 To describe which cellular functions are linked by these ultra conserved coexpression links (Figure 1a), we assigned all 295 conserved gene families one (or more) functional categories, or to "unknown" (there are 18 functional categories defined for COG, following Riley (Riley 1993)). The expected 2-D distribution of links between categories was calculated as the network of random linkages between conserved genes, i.e., by multiplying this marginal distribution for all categories. The observed distribution of links was calculated as following: for each coexpressed ortholog-pair we added one count linking the pair of functional categories to which those ortholog groups belong (splitting counts for membership in multiple categories). For the fold-enrichment for each pair of functional categories, we divided observed counts in each category-pair by the number expected. For the conservation of correlation matrices (Figure 1b) we considered two genomes' conserved coexpressed gene-pairs. For the observed joint (2D) distribution, in each uniform bin we counted how many orthologous gene-pairs share that level of coexpression in the two organisms: each gene-family-pair (i, j) places a count into the 2D bin defined by coordinates (p(gi, gji), p(gi2, g2)) for all Pearson's p. The expected joint distribution for each pair of genomes was calculated empirically under a null model of random assortment (i.e., orthologs are no more likely to conserve their coexpression links than by random chance): we multiplied the marginal (discrete) distributions of each of the two species. All distributions were normalized to sum to 1. Each bin of the observed joint distribution was divided by the counts in the expected joint distribution to yield the factor of excess orthologous coexpressed pairs present at each level of correlation. Protein-protein interactions taken from two genome-wide data sets (Butland et al. 2005; Hu et al. 2009) were used to identify the subset of E. coli coexpressed pairs also physically interacting. Conservation of coexpression in this subset was calculated as for the full set. Similarity of transcriptional response The similarity in global transcriptional response was computed as the Pearson correlation of the log-fold-change in expression of all orthologous genes. Tree-based orthologs from MicrobesOnline were used, except where we noted the bi-directional best hit (BBH) ortholog methodology (best hits were required over 75% of the length). Each transcriptional response (all genes in one time point of an experimental treatment) was represented by a vector v, where each entry vg is the log-foldchange in gene expression for gene g (and the vector includes all genes in that genome for 104 which data was available for that experiment). Similarity of response is the Pearson correlation p between two vectors vi and V2 (for all shared orthologous genes g in species 1 and 2), and is often discussed as p 2 or r 2 in the text, i.e., in the more intuitive terms of fractional variance explained. Conservation of transcriptional phenotype: definition. To decide whether correlation of expression profiles represented significant conservation of condition-specific response, we considered two things. If there was no other data for the two species being compared (as was often the case for public datasets) we drew a cutoff of r 2 =0.01 to consider a response may be correlated. We considered the maximal correlation for all possible comparisons of timepoints, so as not to assume that the dynamics of responses would be identical at the same time points in different species. While r 2<0.01 would have a significant p-value in a test where every orthologs' data point was assumed independent, that is certainly an overestimate of the true number of independently expressed genes in a genome, so this is likely a permissive threshold for significant correlation. Secondly, we asked if the response was more correlated within the same treatment condition than between different treatment conditions. Where responses to different-conditions are more similar than to sameconditions, the correlation is more likely due to chance or to a non-specific stress response than to similar condition-specific adaptations acting in both the species. In all cases we examined the list of conserved up- and down-regulated genes for corroboration and physiological insights. Estimation of divergence times. Approximate divergence times were taken from a recent paper (David and Alm 2010). Briefly, a chronogram of the tree of life was constructed with the PhyloBayes software package; this chronogram was calibrated according to a set of temporal constraints drawn from the paleontology and biogeochemistry literature. 105 3.7 Author Contributions S. C. Timberlake performed all computational analysis (with the exception of processing of raw microarray signal) and wrote the paper with E.J. Alm. D. Joyner, R. Chakraborty, A. Mukhopadhyay and others performed growth assays, phenotype microarrays and experiments and measurements. M.P. joachimiak and others designed the microarray probe analysis pipeline. M.P. Joachimiak, J.K. Baumohl, P.S. Dehal and K.H. Huang maintain(ed) the MicrobesOnline database, which defines ortholgs and also imported expression data. Special thanks to M. Price for E coli expression data compendium. A.P. Arkin, T. Hazen, and E.J. Alm conceived of comparative stress experiments and provided materials and reagents. 3.8 Figures and Legends 106 C- a. (5 (D 3 -1 - - ..... 0o 0 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -- xa 0) C~) 0 'A ce 8 0 o 0) o eli CD f 6 b i0:j .Obmmm -0.6' co .... -0.8 -1 0.8' 0.6 -0.2 -0.4 00.2 0.4 - -0.6 -0.8 -1 0.8 1-1 0.6 'a Cf Cbs C,) 0.4 - 6 0 j :1o 0 o o mA .. . oj 0.2- 0- -0.2 - -0.4 - -0.6- -0.8 - b o 6 :C O O 6 O correlation of gene pair in S. oneidensis I U I to '9 0 3 0 0, 3f 3t - 107 (3 ~ o 0- 0 e .. o0 0 o. e 0 o.0 j2 E!8 3 0 33 0 . 0 *0 R 5- 0D .. . 0 - 0 0 0 0 0 0 6 e 3S ia 00. 0@ S0 0 00 0 . . 00 : 0 o0 . 0 0 0 * e . - .. 0. 0. 0~ 0 .@WinODD -*e 33 0 0 0 . 0 -0 . . S0 - Sr a 2 3 -n 0. * e ~ 0 I Co . . * - . 0 - C. OR 0 3 C ci 3- 0 3.;. ~ 3.-' I I # conserved genes I 0) 0I eS Figure 1. Conservation of coexpression across the bacterial domain a) Cellular functions of ultra conserved genes and coexpressed gene pairs. The grey bars profile the COG functional categories of 295 orthologous genes present (and having expression data) in the three clades. The matrix of dots depicts the distribution of 2629 conserved coexpression partners amongst the functional categories: for each pair of functional categories, dot size indicates the absolute number of coexpressed gene pairs, while red-blue shading highlights function pairs with an unexpected share of coexpressed members (compared to random links between the 295 genes present in all). b) Evolution of the coexpression distribution between S. oneidensis and three increasingly distant species. Each square (a bin of Pearson correlations) depicts the enrichment of orthologous genepairs whose level of coexpression has been conserved (compared to the number expected under a model of random assortment (see methods)). The majority of the distribution exhibits random assortment, but enrichment in the highly correlated gene pairs is a strong signal of conservation between species. Some anticorrelated pairs' conservation is also evident. c) Anticorrelated interactions conserved across divergent bacteria. Edges connect genes with opposite gene expression profiles conserved in our three clades. Nodes are labeled with a gene name where available; node colors encode the COG functional category assigned to the gene. The main hub aat is a leucyltransferase that functions in the N-end rule pathway for protein degradation. The other hub, ycal, is uncharacterized but is putatively a membrane protein involved in DNA internalization and competence for transformation. 108 o = acid o alkaline = = acute = anaerobic 10' @ pH12 0.9 10' @ pH12 0.7 2' @ pH15.5 0.5 0.3 10'@ pHS1.50. 0.1 -0.1 continuous @ pH5.0 -0.3 continuous @ pH8.7 continuous @ pH5.7 -0.5 -0.7 continuous @ pH8.5 -0.9 o 0 ( 0.. Figure 2. Transcriptomic response to pH is highly variable, even within the same species. We compared the global responses from five data sets (separated by thin lines) assaying acute or chronic pH-stress phenotype (timepoints and conditions are indicated) or pH growth phenotype ("continuous" culture) in E. coli. Each square depicts the Pearson correlation r of the response of all genes measured. Between labs, correlation is slight or sometimes negative (maximum r 2=0.12). The responses to acid and alkaline from within a dataset are far better correlated than the response to acid across datasets, despite being normalized to their own controls. (All experiments were normalized to the pH7 condition from the same data set; see methods for details). 109 D. vulgaris Heat shock Cold Acid Alkaline shock shock shock Salt shock 5' Heat shock I 2' isy 10' 5' r- NA 5, I IIA L.B I IV VA V.B Cold shock 1' .IS II.B 20' 4a' 80' 160' 65' Acid shock ExperIments reference InformatIon S. oneldensls D. vulgarlS exp8:42C exp1; 15C expl; 8C exp7; pH4 exp7; pHl10 exp394; 02+ expl715;02- VI VI VIlA VII.B IX XA X.B LC XD III d Alkaline shock exp24;50C exp25; 8C exp38; pH5 exp1l; pH6.2 exp37; pH10 exp2O; [250mM] exp14; [250mM] exp6; [500mM] exp6; [250mM] IV 3120' VA 120' V.B I Salt shock -0.9 -0.7 -0.5 -0.3 -0.1 VII VNIA VIII.B IX XA XB XC 0.3 0.5 0.7 0.9 Pearson correlation 240' VI 0.1 X.D e. Salt shock c. Cold shock b. Heat shock d. Acid shock 00 cm1 o 0 CS N 0- 0000I 00o0 CII .2 -4 - CmI 0 2 0 r. 9 4 I 80 . .0--- -- ---- --------- -2 S o0 Coo 0b2 CIJ r00 ra. 0.073 -4 I - 0 2 log2rato, S onekdenass 4 I I i j I I I I 1 -4 -2 0 2 4 -4 -2 0 2 r =.0.00 I r=. .031 - 4 -4 -2 0 2 4 0 Figure 3. Comparison of phenotype in our stress compendium of S. oneidensis and D. vulgaris 3a) Correlations in transcriptional responses between these two species in five environmental perturbations. Each colored square in the matrix displays the Pearson correlation between the response of all orthologs in S. oneidensis and D. vulgaris,comparing each sample point in the time courses; the five bold boxes highlight the same-condition comparisons (all-versus-all timepoints) for heat, cold, acid, alkaline, and salt. Roman numerals mark experimental conditions while letters delineate multiple protocols or stress levels within a condition. Experiment numbers (expXX) reference MicrobesOnline expression database ids. The most correlated response within each condition is marked and expanded in Figure 3b-e. 3b-e) The most similar transcriptional responses within each condition. Within each treatment condition (indicated in a), the most similar responses are expanded in the scatter plots, illustrating the up- or down-regulation of all orthologous genes. Each point represents the log-fold-change measured in S. oneidensis along the x-axis and D. vulgaris along the yaxis. Alkaline responses were never correlated, and so omitted. 111 ............. ...... ...................................... .- ...-. ....... .... .............. G.metalllreducens Salt shock Salt__shock Nitrate shock __ Experiments reference Information 30' 60' 120' IA D. vulgarls LA exp6; [500mM] .B exp6; [250mM] 240' 15' 60' 120' Jw b 11expl4 III IVA IV.B V VI IB 240' 30' 60' 120' exp2o exp3l exp3l; w/sorbitol exp23 exp30 VI exp42 IIA VIN exp177 240' 120' ci lilA 30' VA 30' IVB G. metallreducens IX exp443 30' X exp1272 XI exp448 60' 120' 240' Jw 240' Vi 240' 480' 30' 60' Vil 120' 240 -0.9 -0.7 -0.5 -0.3 IX 0.1 0.3 0.5 0.7 Pearson correlation XI X -0.1 c. Nitrate shock b. Salt shock S C c'j 0 0:01 I 10 S 0- 0 ei 6 1 o9 0 CI- * * CmJ I* r 2= 0.030 -4 -2 0 2 -4 4 log2ratlo, D. vulgar!s 112 -2 0 2 4 0.9 Figure 4. Comparison of phenotypes in our stress compendium of D. vulgaris and G. metallireducens. a) Correlations in transcriptional responses between these two species in salt and nitrate stress. Each colored square in the matrix displays the Pearson correlation between the response of all orthologs in D. vulgaris and G.metallireducens at each sample point of the time courses; the two bold boxes highlight the same-condition comparisons (all-versus-all timepoints). The most correlated response within each condition is marked and expanded in Figure 4b-c. Roman numerals mark experimental conditions while letters delineate multiple protocols or stress levels within a condition. Experiment numbers reference MicrobesOnline expression database ids. The most correlated responses for each condition are marked and expanded in Figure 4b and c. b-c) The most similar responses within each condition. Within each treatment condition (indicated in a), the most similar responses are expanded in the scatter plots, illustrating the up- or down-regulation of all orthologous genes. Each point represents the log-fold-change measured in D. vulgaris along the x-axis and G. metallireducensalong the yaxis. 113 Table 1. Genes of our heat shock module Name Orthologs Description dnaK 1 1 1 1 chaperone Hsp70; DNA biosynthesis; autoregulated heat shock proteins yabH CIP11 Ion 1 1 0 0 1 1 1 1 DnaJ-domain-containing proteins 1 'f heat.sock: protein F215 eds dpeltcsbunkt ATP-depnden DNA-binding, ATP-dependent protease La; heat shock K-protein ycaL 1 0 0 0 putative heat shock protein; Zn-dependent protease with chaperone function htrB 1 1 0 0 heat shock protein; lipid A biosynthesis lauroyl acyltransferase hslJ 1 0 0 0 heat-inducible protein ydhM 1 1 0 0 pred-icted moecuar chaipleen distardi relaed tok HSP704fold htpX 1 1 1 1 heat shock protein, integral membrane protein yegk 1 1 0 0 ddg hscA yfhE 1 I 1 0 1 1 0 0 0 0 0 0 putative heat shock protein; Lauroyl/myristoyl acyltransferase illis7Ol iginsfeINl lilasmber etshockiroteki, chsapieoe DnaJ-domain-containing proteins 1 rp*E 1 1 0 1 Rtymell cipB 1 1 1 1 protein disaggregation chaperone mreB 1 1 1 1 ON yrfl I 1 1 1 0 0 1 regulator of ftsl, penicillin binding protein 3, Heat shock protein Hsp70 domain tein&M40A$ dsh hypothetical protein; Disulfide bond chaperones of the Hsp33 family 0 0 0 heat shock Patoln;idilKWif 0 heat shock protein; Molecular chaperone (small heat shock protein) 1 lb 1 1 0 ~1 h~~slU 1 1 1 1 1 lbpA hslV eIn e 1 -het t hock ilililllllf esigmaat eok pr-otein.hitimApil - heat sokandeoidativetrs Omihileign@.iln@# %m~shockprteliv) hnwlogwsrtodailents se sulbunltti 1 heat shock protein hslVU, proteasome-related peptidase subunit 1 14 mopB 1 1 1 1 co-chaperonin GroES lysi lmiislphl!#Idil ; ieshilldisatein yJIL 0 0 1 0 Activator of 2-hydroxyglutaryl-CoA dehydratase (HSP70-class ATPase domain) Heat shock module genes were identified by text mining of E. coli gene annotations. Orthologous genes defined by the BBH method were identified in four other organisms presence/absence of orthologous genes indicated by 1/0 in Se, Salmonella enterica;So, Shewanella oneidensis; Dv, Desulfovibrio vulgaris; and Bs, Bacillus subtilis.BBH orthologs were used in this case because limiting the module to tree-orthologs eliminated nearly all 114 the genes under consideration at long phylogenetic distances; we judge this less strict definition of orthology as more likely to yield a false negative than false positive result. * genes of heat shock module - all pairwise orthologs I I S. enterica CD- 0 B. subtills D. vulgaris S. oneldensis (D- WD CD Qoo0 0 (N o0 (N- 'Ii (N 0 0M (N *1* -6 -4 D. -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 log2 fold-change In expression Figure 5. Conserved response of E. col heat shock module across five species despite global divergence of response. Global response of all orthologs (gray points) is poorly conserved but the response of genes in the heat shock module (red points) remains consistent over billions of years of evolutionary time in the heat shock condition, despite different protocols and conditions for the perturbation. Genes of the heat shock module are listed in Table 1. 115 2 4 6 D. vulgarls Hea shock Acid shock Cold shock Salt shock Alkaline shock 15' j25' Hea shock ! Expomte fernc hnfonalio 2V LA 8V 16 6 Coldshock 1 2V . oneldeness I exp8;42C LA expl;15C ILBexpl;8C M exp7;pH4 LB 4V 82 . VDgwe VI exp24;50C 8C Vi exp25; pH5 VUiLAexp38; VilLB explll;pH6.2 IV exp7;pH10 V.A exp394; 02+ V.B exp1715;02- IX exp37; pH10 XA exp20;250mM X. exp14; [250MnV X.C exps:[500mM 6 XD expO; [250mM Acidshock 3V Alkaline shock I 12 -0.9 -0.7 4.5 -0.3 4.1 IwV.B Softshock VI Vi VHLA VIItB IX XA X.B XC 0.1 0.3 0.5 0.7 0.9 V.A Pearson correlation X. Supplemental Figure 1. Comparison of phenotypes in our stress compendium of S. oneidensisand D. vulgarisusing all putative orthologs. We compare transcriptional responses of these two species to five environmental perturbations, using all 832 putative orthologs identified by the bi-directional best hit method. Each colored square in the matrix displays the Pearson correlation between the response of all orthologs in S. oneidensis and D.vulgaris at each sample point of the time courses; the five bold boxes highlight the same-condition comparisons (all-versus-all timepoints) for heat, cold, acid, alkaline, and salt. Roman numerals mark experimental conditions while letters delineate multiple protocols or stress levels within a condition. Experiment numbers (expXX) reference MicrobesOnline expression database ids. 116 3.8.1 Supplemental Note: transcriptomic experiments obtained from other laboratories and their comparison We compared the E. coli transcriptional phenotypes in response to pH stress, as assayed by a number of laboratories. Published microarray data were imported into MicrobesOnline to facilitate comparison (MicrobesOnline experiment ids: 40, 1005, 1064, 1142, 1231-3). Where necessary, M3D metadata was used to guide data normalization (treatment versus control). For time series, the zero time point was used as a control where possible, otherwise, we normalized to the control growth condition in the same data set, having identical metadata (except for pH=-7), including the same experimenter name. For experiment ids 1005, 1064, 1142, 1231-3, the control experiments used for normalization were 995, 1065, 1143 and 1230, respectively. To look for the similarity of response in E. coli and S. enterica in heat shock, previously published microarray data from a number of labs (Allen et al. 2003; Gutierrez-Rios et al. 2003) were imported into MicrobesOnline (experiment ids: 12, 1004, 82, 83). This included GEO accession GSE807 and GSE808, which are not associated with any known publication. Additional heat shock data was obtained from a supplementary data associated with a recent publication (Jozefczuk et al. 2010), and normalized as described in their supplementary notes. For the heat shock comparison among five species (Figure 5), published data (Chhabra et al. 2006; Frye et al.; Gao et al. 2004; Gutierrez-Rios et al. 2003; Helmann et al. 2001) were imported into MicrobesOnline experiment ids 2,8,12,24,82, and 83. Supplemental Table 1. Significantly-changing genes common to multiple species' condition-specific responses. A list of possibly conserved elements of condition-specific transcriptional responses: orthologous genes common to multiple species' responses to the same condition in our compendium of S. oneidensis,D. vulgaris, and G. metallireducens(So, Dv, Gm). For each condition, the best-correlated comparison was considered (these were highlighted in Figure 3b-e and Figure 4b-c). Alkaline (pH+) had no positively correlated responses across species; 117 instead the comparison with the highest number of shared significantly-changing genes was used (60 minutes in So versus 30 minutes in Dv). Significantly changing genes were defined or downregulated (+) more than 2-fold; no genes met these criteria in the as those up- (t) acid condition (pH-), so genes with 1.5-fold change (4o) in both species are reported. Gene IDs are those of D. vulgaris'genesin the MicrobesOnline database and are listed multiple times where they were part of conserved responses to multiple conditions. Dv.vs.So Dv.vs.Gm gene ID (Dv) *C+ *C- pH- pH+ salt salt N03 209537 + 206238 4 + + gene name description LysE Lysine efflux permease dnaK Molecular chaperone Molecular chaperone GrpE (heat shock 206239 $ grpE protein) Preprotein translocase subunit SecA 206252 $ secA (ATPase, RNA helicase) 206718 + ftsH ATP-dependent Zn proteases ATP-dependent Lon protease, bacterial 206779 + Ion type ATP-dependent protease HsIVU (CIpYQ), 206912 + hslU ATPase subunit ATP-dependent protease HsIVU (CpYQ), 207028 $ hslV peptidase subunit DNA-directed RNA polymerase, sigma 207253 + rpoD subunit (sigma70/sigma32) 207254 $ dnaG DNA primase (bacterial type) DnaJ-class molecular chaperone with C- 208766 + dnaJ terminal Zn finger domain Phosphoribosylcarboxyaminoimidazole 209423 + purE (NCAIR) mutase ABC-type tungstate transport system, 206171 + permease component 118 206203 4 atpG FOF1-type ATP synthase, gamma subunit 206204 + atpA FOF1-type ATP synthase, alpha subunit FOF-type ATP synthase, delta subunit (mitochondrial oligomycin sensitivity + atpH protein) 206212 4 rodA Bacterial cell division membrane protein 206205 Phosphoribosylaminoimidazolesuccinocar 206222 4 purC boxamide (SAICAR) synthase 206262 4 rplS Ribosomal protein L19 206266 rpsP Ribosomal protein S16 206298 + + frr Ribosome recycling factor 206358 4 rplU Ribosomal protein L21 206643 + fabF 3-oxoacyl-(acyl-carrier-protein) synthase 206644 4 acpP Acyl carrier protein Dehydrogenases with different specificities (related to short-chain 206645 4 fabG alcohol dehydrogenases) 3-oxoacyl-[acyl-carrier-protein] synthase 206646 4 206650 + 4 fabH III rpmB Ribosomal protein L28 (acyl-carrier-protein) S- 206688 4 fabD malonyltransferase 206748 4 rplB Ribosomal protein L2 206764 4 rplO Ribosomal protein L15 206772 4 rplQ Ribosomal protein L17 Predicted membrane GTPase involved in 207715 4 4 typA stress response Holliday junction resolvasome, 207743 4 ruvC endonuclease subunit 207942 4 metK S-adenosylmethionine synthetase rplT Ribosomal protein L20 bioB Biotin synthase and related enzymes 208032 4 208055 4 119 208436 + rplJ Ribosomal protein L10 208437 4 rpIL Ribosomal protein L7/L12 208452 + purB Adenylosuccinate lyase 207021 + rho Transcription termination factor 209448 + nusA Transcription elongation factor Uncharacterized protein conserved in bacteria; Ribosome maturation factor 209449 + RimP (IP) 206238 $4+ dnaK Molecular chaperone 207836 $ dut dUTPase 209441 + rpsO Ribosomal protein S15P/S13E 206739 4 rpsL Ribosomal protein S12 rpIR Ribosomal protein L18 +4 206761 $ 206769 + rpsK Ribosomal protein S11 208768 + greA Transcription elongation factor mutY A/G-specific DNA glycosylase hisF Imidazoleglycerol-phosphate synthase 209216 + 209220 209256 + eno Enolase 209332 + hup-1 Bacterial nucleoid DNA-binding protein 209403 + trpD Anthranilate phosphoribosyltransferase 206202 $ atpD FOF1-type ATP synthase, beta subunit $ Preprotein translocase subunit SecA 206252 $ $ 206301 + 206635 secA (ATPase, RNA helicase) tsf Translation elongation factor Ts + leuS Leucyl-tRNA synthetase 206637 4 ribH Riboflavin synthase beta-chain 206752 4 rplP Ribosomal protein L16/L1OE 206753 4 rpmC Ribosomal protein L29 + rplR Ribosomal protein L18 206764 + 4 rplO Ribosomal protein L15 208034 + infC Translation initiation factor 3 (IF-3) 206761 4 + + 120 hypothetical protein; Cell division protein 208169 ZapA-like (TIGR) DNA-directed RNA polymerase, beta' 208439 rpoC $ subunit/160 kD subunit DNA-directed RNA polymerase, alpha 206771 $ 206800 + 207060 $ rpoA subunit/40 kD subunit Predicted phosphatases dapB Dihydrodipicolinate reductase Uncharacterized conserved protein; 207526 membrane protein, putative (TIGR) + 3-hydroxymyristoyl/3-hydroxydecanoyl207856 fabZ + (acyl carrier protein) dehydratases Dehydrogenases with different specificities (related to short-chain 206645 + + fabG alcohol dehydrogenases) 206650 + + rpmB Ribosomal protein L28 + rpsT Ribosomal protein S20 207364 conserved hypothetical protein; 206683 + 206980 4 coaD Phosphopantetheine adenylyltransferase 207285 4 yajC Preprotein translocase subunit YajC Mammalian cell entry-related (TIGR) Protein-L-isoaspartate 207315 4 pcm carboxylmethyltransferase Peptidyl-prolyl cis-trans isomerase 207339 4 208903 + ppiB-2 (rotamase) - cyclophilin family + ftsE Predicted ATPase involved in cell division 209074 + trpS Tryptophanyl-tRNA synthetase 206202 + atpD FOF1-type ATP synthase, beta subunit 206301 + tsf Translation elongation factor Ts 206530 + argG Argininosuccinate synthase 206531 + argF Ornithine carbamoyltransferase 206748 4 + rplB Ribosomal protein L2 206749 + rpsS Ribosomal protein S19 121 206750 + rplV Ribosomal protein L22 206751 + rpsC Ribosomal protein S3 + rpmC Ribosomal protein L29 + rpsQ Ribosomal protein S17 + rplR Ribosomal protein L18 206762 $ rpsE Ribosomal protein 55 206763 $ rpmD Ribosomal protein L30/L7E 206753 + 206754 206761 4+ Inorganic 207088 ppaC + pyrophosphatase/exopolyphosphatase Uncharacterized protein conserved in bacteria; DNA-binding protein, YbaB/EbfC 208720 family (TIGR) + tRNA delta(2)-isopentenylpyrophosphate 4 206981 + 208903 + + miaA transferase ftsE Predicted ATPase involved in cell division Flagellar biosynthesis pathway, + 208972 fliP component FliP Flagellar biosynthesis/type Ill secretory 209245 206238 pathway protein + dnaK + Molecular chaperone 208317 + 208611 + rbr Rubrerythrin 207805 + rbr2 Rubrerythrin cytochrome c3 Chemotaxis protein; stimulates 208485 + methylation of MCP proteins 207043 + Methyl-accepting chemotaxis protein 209249 4 flgC Flagellar basal body rod protein Nitrate/TMAO reductases, membrane- 209570 4 206917 4 nrfH bound tetraheme cytochrome c subunit Predicted ATP-dependent protease 6Fe-6S prismane cluster-containing 208040 4 b0873 122 protein 208775 membrane protein, HPP family + FOF1-type ATP synthase, epsilon subunit + 206201 + atpC (mitochondrial delta subunit) tRNA delta(2)-isopentenylpyrophosphate 206981 4 miaA transferase 207355 4 gmhA Phosphoheptose isomerase 207398 4 IspA Lipoprotein signal peptidase Predicted membrane GTPase involved in 207715 4 4 typA stress response CMP-2-keto-3-deoxyoctulosonic acid 208632 4 kdsB synthetase Recombinational DNA repair protein 208721 4 4 recR 123 (RecF pathway) 124 Chapter 4. Conclusions and future directions 4.1 Short reads for metagenomics In Chapter 1 I introduced SHE-RA, an algorithm that significantly extends the range of applications for ultra high-throughput sequencing, a crucial technological advance for the interrogation of complex environmental samples. SHE-RA aligns overlapping paired-end reads, correcting errors, and offering -48-160x deeper sampling (per dollar) compared to the technology commonly used for such studies, 454. I demonstrated that SHE-RA made Illumina a viable alternative to 454 for community analysis (Rodrigue et al. 2010), the Chisholm group used it to aid in the assembly of single-cell bacterial genome fragments that are the result of multiple displacement amplification (Malmstrom et al. 2012), a protocol that is growing in popularity to interrogate the large fraction of the microbial world that remains unculturable (De Jager and Siezen 2011). The Moran group used it to analyze the day-night and seasonal dynamics of a metatranscriptome of a complex marsh community (Gifford et al in prep). The code was made open source on a website (Timberlake 2012), and has at least 60 users. SHE-RA also includes automatic identification and trimming of veryshort-insert overlapping pairs, where adapter and barcode sequence can drive improper de novo read clustering or create false-positive hits in protein space. While this work was motivated by the desire to get longer Illumina reads' error rates from 10e-1 to 10e-2 or below, the general principle of error rate reduction by multiple read passes on the same DNA molecule could be used elsewhere for to push error below 10e-3 or 10e-4 for new applications. Because some high fidelity PCR protocols boast error rates around 10e-7 (Vallania et al. 2012), the sequencing platform could once again represent the dominant source of error. Further decreases in error rates would improve assessments of microbial diversity (Beerenwinkel and Zagordi 2011; Zagordi et al. 2011), 125 and make for more sensitive detection of rare variants such are seen in viral populations or heterogeneous samples, such as tumors (Macalalad et al. 2012; Choi et al. 2009; Aparicio and Huntsman 2010). While the SHE-RA algorithm was tested on Illumina reads, it was designed to be largely platform-independent, and will, in principle, work well with any sequencing technology where individual reads' measurement error is largely independent. More generally, the primary challenges in applying UHTS to community datasets remain computational: both de novo clustering and homology identification, initial steps in taxonomic or functional profiling, are currently computationally prohibitive in terms of working memory (RAM) and execution time, respectively (Preheim and Timberlake, unpublished observations). In particular, there is a gap between the fast but insensitive read mapping software written for UHTS and the more traditional and sensitive methods of homology detection, such as blastx (Altschul et al. 1997). While fast read mappers can align millions of sequences to reference genomes (Li and Durbin 2009; Langmead and Salzberg 2012, 2), their limited sensitivity means that only reads very similar to a known reference sequence can be identified, a tiny fraction in most communities. This gap has been partly bridged by speeding up sequence comparisons using software (Edgar 2010; Koskinen and Holm 2012) or hardware (Suzuki et al. 2012), but sequencing production outpaces these improvements, such that the bioinformatic processing remains the limiting step (Ghodsi 2012; Nekrutenko and Taylor 2012). Beyond speeding up basic pairwise homology searches, the most useful addition to the community genomics toolbox at this point would be scalable versions of the search algorithms for matching probabilistic representations of protein families and domains (position specific scoring matrices and hidden Markov models) (Altschul et al. 1997; Eddy 2001). These represent the greatest area for progress in computational analysis of natural microbial communities. 126 4.2 Assembly and analysis of 82 environmental bacteria In Chapter 2, I described my work to assemble the genomic reference for an important model system for marine bacterial and evolution and ecology. In the first part of the chapter, I described an efficient assembly strategy to assemble 82 Vibrionaceae genomes from short reads. In the second part, I use these genomes to survey the relationship between ecology, phenotype and genotype in this family. The hybrid assembly approach I developed minimizes the need for labor-intensive and expensive long-insert libraries, but avoids the pitfalls of reference-only approaches, which require a very closely related finish genomes, and miss all of the diversity in gene content that exists between even the closest conspecifics. My analysis shows that 1 closed or near-closed genome per species is sufficient to significantly improve related strains' assemblies via this hybrid reference and de novo assembly approach, and even genomes assembled with short-insert paired-end reads were useful references. Even very fragmented reference assemblies (N50-2kb) to more distant genomes or fragmented references could be effectively utilized by this strategy for improving assembly contiguity. The upper limit for this approach, in terms of distance to reference, is still limited by length and quality of short read sequence and will likely benefit from the next generation of more permissive read-mappers (e.g. FR-HIT(Niu et al. 2011)). My hybrid reference and de novo assembly approach was able to effect a -3fold improvement in N50 even for the most recent generation of Illumina reads. It was shown effective with short single-end reads, which may be relevant again as a new generation of sequencing platforms comes online (Glenn 2011). A similar 'assisted assembly' approach was reported (Gnerre et al. 2009), but it was designed for mammalian genome assemblies and required paired-end reads. The ABBA algorithm had a similar goal, but could only be used to close gaps spanned by known protein sequences (Salzberg et al. 2008) , and requires users' a priori knowledge of all connections between contigs in the assembly, closer reference strains, extensive expert knowledge, and manual curation. My hybrid approach requires no expert genomic knowledge or manual curation beyond the selection of a suitable reference genome with sufficient average nucleotide identity, which can be easily approximated from blast output (Altschul et al. 1997). It was designed as a modular pipeline using two of the most commonly used algorithms for short-read analysis, MAQ and Velvet (Li, Ruan, and Durbin 127 2008; Zerbino et al. 2009), and therefore is accessible to a wide audience as well as easily generalizable to other tools. These 82 assembled genomes have served as the genomic reference for an important model system for marine bacteria and natural populations, and have already informed to our understanding of bacterial evolution and ecology through a number of collaborations or independent work, mainly in the Alm and Polz labs. These genomes were used to search for the genetic bases of phosphothioration, a newly identified type of restriction-modification system that is ecologically-associated (Wang et al. 2011). They were recently used in a study detailing the initial events in sympatrically-differentiating populations (Shapiro et al. 2012), which demonstrated how homologous recombination can be force promoting divergence of populations. Because of the naturally defined populations in this model system, they have been invaluable for illuminating the social relationships within and between bacterial populations. For example, antibiotics production and iron acquisition were each shown to be contexts where these populations behaved as social entities, exhibiting cooperative and cheating behaviors (Cordero, Wildschutte, et al. 2012; Cordero, Ventouras, et al. 2012). In the rest of Chapter 2, I presented a survey of flexible gene turnover, gene sharing, metabolism, and ecology in these 82 Vibrionaceae. In addition to their congruence demonstrated by a number of other means, the ecological populations delineated in this model system were also predominantly congruent in terms of metabolism, core genome phylogenies, and in particular, gene content, which was able to delineate populations along ecological lines very early in the differentiation process. While the ecologically defined populations behaved as separate entities in a number of ways, the inferred habitats (based on particle size and season) may not represent consistent ecology across independent lineages adapted to that habitat. This measure of ecology did not have predictive power for sharing of gene content over the long term, or of metabolic similarity, perhaps because both genotype and phenotype are driven largely by phylogenetic history. In fact, metabolic phenotype could only statistically distinguish ecological populations after they were sufficiently separated by phylogenetic distance. Despite a diversity of ecologies and microhabitat preferences in this group, these genomes shared genes with a frequency much higher than a recent survey of HGT reported for same-ecology bacteria or even for bacteria isolated from various sites on the human 128 body, a known nexus of transfer. To what extent this is due to shared geography and isolation method, or perhaps to the vibrio's natural tendency for DNA uptake, is an open question, but the relative breadth of the labels used to describe "ecology" in these studies almost certainly has an effect. These vibrios not only have a high rate of recent transfer with each other, but an astonishingly high rate of introduction of new DNA overall. Our superfine phylogenetic sampling included many pairs of genomes that are distinguished by only 100-300 SNPs genome-wide, but more than 100-300kb of new DNA. I showed that new, potentially beneficial, genes are rapidly introduced, providing abundant raw material for selection within the very fine-scale phylogenetic distances where we have previously reported ecological differentiation (Hunt et al. 2008). While the majority of this very fast flexible DNA never reaches fixation, a large amount does fix in a population. Flexible genes fixation is apparently non-neutral, occurs on timescales compatible with ecological differentiation, and delineates populations in a way congruent with their ecology before core genes or metabolic phenotype can do so. My analysis found that 10-30% of new flexible genes are fixing in a habitat-specific and thus non-neutral manner; this should be confirmed by further work. Additional characteristics of recently fixed species-specific flexible genes, such dN/dS or nonrandom functional distribution might bolster our ecological evidence of non-neutral fixation. By OTU-like divergence, fixation of flexible DNA has resulted in large scale (-25%) turnover in the core of any one population. This highlights the large numerical discrepancy with core and flexible genomes defined by common arbitrary phylogenetic cutoffs, and mislabeling of the types. In particular, a large fraction of 'flexible' genes apparently at intermediate penetrance in a V. cyclitrophicus were actually fixed in a population, and arguably now core. These observations further motivate the attention to core and flexible definition in future study. The lack of proper species definitions is only part of the trouble with gene frequency distribution curves. Gene frequency distribution curves are a common focus of flexible gene study and served as the test bed for a recently developed neutral model of flexible genes fixation (Luo et al. 2011; Haegeman and Weitz 2012). However, if not all flexible genes are selectively neutral, a more fundamental issue is that these curves combine new genes being fixed, fixed genes being lost, and genes whose equilibrium penetrance is intermediate. These three processes almost certainly separate, both mechanistically and in terms of the selective forces driving them, and should really be treated separately instead of combined 129 into a single penetrance curve. Future study in this model system may be able to identify the distinct selective pressures driving these three processes. Of particular interest are the genes with equilibrium intermediate penetrance in populations, as may be driven by frequency-dependent selection. Genes of low or intermediate penetrance provide diversity in the population pan genome, but little is understood about this process. Some have hypothesized it is useful for long term buffering against environmental variation, others studies point to phage evasion. A survey of the functional content and selective forces on these gene sequences might demonstrate their utility: a preliminary analysis of vibrios' very rare proteins (an approximation of the intermediate penetrance gene set) revealed that most evolve under weak purifying selection (median dN/dS<0.25 at most distance bins). One established force selecting for intermediate penetrance is cheating, where certain individuals in a population bear the cost to make a resource publicly available (for example, siderophore production results in extracellular bioavailable iron), but all members are able to benefit (by taking up bound siderophores), and some do so without bearing the cost (Cordero, Ventouras, et al. 2012), Substrate utilization profiles have been measured for a large number of strains in this model system, providing an excellent dataset to investigate similar patterns in substrate utilization. In particular, some complex carbohydrates and other long polymers must be degraded extracellularly before import, and during that time they can diffuse away from the original enzyme producer. In fact, the organic particles in the marine environment that are a major source of organic matter for copiotrophs are known to be surrounded by large diffuse plumes of partially degraded dissolved organic matter, possibly escaped the degrader and intended consumer. Biolog profiles might reveal that substrate utilization abilities for less labile organic matter (which would be prone to cheating behavior) share a tendency of high within-population variation. While the nature of this cheating can be regulatory (Wilder, Diggle, and Schuster 2011) and situationdependent, it could also be forced by genotype, and perhaps associated with certain ecologies. One intriguing possible role for intermediate penetrance genes is the creation of allelic diversity through periods of hypermutation in a few individuals, perhaps in times of stress. We have repeatably observed stress-induced hypermutation in vibrios in the laboratory mediated by prophage exicision (S. Clarke and S. Timberlake, unpublished data) 130 but whether the allelic diversity so generated is passed on laterally in the wild is a fascinating open question. Genes fixed at intermediate penetrance are one of many things motivating a reexamination of the operational definitions of core and flexible genes. While these unubiquitous genes would be labeled flexible by any common definition, some genes' loss from a population could trigger collapse (e.g., loss of siderophore production capabilities). More generally, the study of flexible genomes has likely been confounded by the wide variety of arbitrary phylogenetic cutoffs used for species group delineation (Grote et al. 2012). The vibrio model system's natural definitions of populations and hence core and flexible genomes are ideal to address key predictions of the Core Genome Hypothesis and its proposed species definition, which has been largely ignored in the prolific core/flexible genome literature. The core and flexible genome dichotomy was originally proposed to reconcile the seeming contradiction of bacterial gene content variability with species-like clusters: despite rampant HGT in the flexible genome driving intraspecific diversity, intraspecific homologous recombination in the core genome could still provide a homogenizing influence on a stable, species-defining core genome (Ruiting Lan and Reeves 2000; R. Lan and Reeves 2001). Homogenizing intra-population versus diversifying interpopulation recombination in core genes (including population-specific core and family core) versus flexible genes could be quantified to test this prediction. Enriched intra-cluster recombination frequency has been found previously (Lef6bure et al. 2010; Bennett et al. 2010; Muzzi and Donati 2011), and within-population exchange dominated recent recombination in the core genome of V. cyclitrophicus (Shapiro et al. 2012), though the generality of this finding needs to be demonstrated. In fact, one study found recombination frequency was not enriched among strains sharing some environmental labels (Muzzi and Donati 2011). The time scale by which a new member of the species-specific core appears core-like and diversifying inter-cluster recombination is suppressed seems surprisingly fast but the mechanisms are unknown. Suppression of inter-population exchange at these distances may not be easily to explain by the falloff in homologous recombination frequency with distance; could selection act quickly enough, or perhaps other mechanisms: clusterspecific restriction modification systems were identified as mechanism to actively exclude inter-cluster DNA (Budroni et al. 2011). However, this mechanism may not be wholly responsible or generally applicable (restriction modification systems appear largely variable within vibrio populations ((Wang et al. 2011) and S. Timberlake, unpublished 131 results). The processes of fixation, recombination and integration of flexible genes into core genomes and populations have been largely ignored in core and flexible genome analyses but are likely fundamental to understanding adaptation and speciation in bacteria. Lastly, this model system is unique in its combination of hierarchically sampled whole genomes and quantitative ecological data. Here I draw attention to a few unexplored strengths of this dataset and the questions they could address in future study. I include some preliminary results. 1) Genetic drift and bacterial gene deletion The association between genome size and species phylogeny branch length in this family is obvious to the naked eye. However, the possible relationship between genome size and drift and its mechanisms are still a matter of active debate (Kuo, Moran, and Ochman 2009; Whitney and Garland 2010; Whitney et al. 2011; Michael Lynch 2011). This dataset is excellent for estimating natural populations' Nepl, Ka/Ks, and inferring gene gain and loss with better temporal precision than most clades to understand the timescales for these processes and the strength of possible selection on gene deletions. Additionally, these selective forces might be largely distinct from mechanisms driving the vast fast loss of flexible genes. 2) Quantifying habitat specialization and its evolutionary dynamics It is commonly held that specialists undergo population size and genome reduction, however, the label 'specialist' is poorly defined and therefore hard to investigate. On what time scales does habitat invariability qualify for specialization, and on what time scales might it impact core and flexible genome size? I propose that one quantitative measure of population specialization is the diversity represented in a population's ecological metadata. For example, a preliminary investigation revealed a modest association between population's entropy in size fraction and core genome size (r=0.6), but no association with genomic fluidity (a recently introduced measure of population gene content diversity (Haegeman and Weitz 2012)). Population pan-genomes have been proposed as a mechanism for buffering environmental variability, but perhaps the environmental variation measured by habitat entropy occurs with such frequency that sensing should be hardcoded in an individual (Kussell and Leibler 2005), rather than just buffered by population diversity. Indeed, studies of the vibrios have previously found evidence of high migration rates, for example, between the water column and 132 animals (Preheim et al. 2011). Quantitation of the level of specialization was not previously possible, and revealing the relevant timescales for environmental and genotypic variability would inform our understanding of genome and pan genome dynamics. 3) How do census population sizes relate to evolutionary dynamics? Estimates of census population sizes and of effective populations sizes typically differ by many many orders of magnitude. Preliminary analysis revealed that relative census population sizes (i.e., populations' relative abundances recovered by the isolation process) are in general well-correlated with mean evolutionary rate (S. Timberlake, unpublished results). Where census population sizes are predictive of Ne and other measures of drift, and where they differ, could improve our understanding of this discrepancy and the evolutionary dynamics association with such processes. 133 4.3 Conservation of coexpression without conservation of response in bacterial gene regulatory networks In Chapter 3, I quantified coexpressed gene pairs conserved across bacteria, and contrasted that with a striking lack of conservation of transcriptomic response in our model species and more closely related species drawn from the literature. Why are orthologous genes not expressed in a similar way under similar conditions? Two elements can contribute to this dissimilarity: evolutionary divergence between species, and protocol-specific side-effects. The extent to which each of these elements contribution is due to natural selection versus neutral drift is an important unanswered question. The phenomena of similar microarray experiments producing inconsistent results has been documented in the literature, but those effects are unlikely to be the driving force behind these observations. Different platforms and analysis methods may inconsistently quantify absolute expression level or low abundance transcripts, leading to different results (Draghici et al., 2006; Shippy et al., 2006). However, a large consortium's systematic evaluation found excellent quantitative agreement in global profiles between a variety of platforms, sites, and normalization techniques, particularly for the fold-change values used throughout this comparative analysis (Canales et al., 2006; Shi et al., 2006; Shippy et al., 2006), consistent with the idea that most of the variation occurs in the biomass. Indeed, abundant microarray measurements in well-studied yeast systems provide the perfect comparison to the variability of response we have highlighted in bacteria. In a number of yeast species, a diverse array of stressors all elicit a stereotypical global transcriptional program, termed the environmental stress response (ESR(Gasch et al., 2000)). The ESR is a global and highly stereotypical pattern, and is consistent across experiments done in different labs, with different methodology, platforms, and strains. The global correlation in response is also largely shared between S. cerevisiae and S. pombe (Chen et al., 2003; Gasch, 2007) despite 500 million years of evolutionary divergence (Berbee and Taylor, 1993), and significant rewiring of even basic cellular transcription programs (Rustici et al. 2004). More recent analyses have concluded that the yeast ESR is largely explained by cell-cycle genes (Nikolai 2011), but it remains in striking contrast to our observations of response variability in the same species or closely related bacteria. The 134 lack of such a global response in bacteria is intriguing, as it could indicate fundamental differences in the strategies used to tolerate stress across the domains of life. While some cross-condition response similarities were noted in this survey, also absent was evidence of the "general stress response" (GSR), mediated by alternative sigma factors in bacteria. This may be because regulatory control of the GSR turns over rapidly (Caimano et al. 2004; Gunesekere et al. 2006; Mittenhuber 2002) or because the ancient features of the GSR network are largely post-translational (Mittenhuber 2002; Staron and Mascher 2010). Also, while descriptive studies of the GSR abound in the literature, quantitative assessments of its transcriptional conservation were absent and may be a useful area of further study, perhaps to contrast with the well-studied yeast ESR. Is it plausible that bacteria and yeast are using fundamentally different strategies to adapt phenotype transcriptionally to changing conditions, and if so, what drives this distinction? Perhaps another striking difference in their evolutionary modes: bacterial gene content changes principally by HGT (David and Alm 2010; Pal, Papp, and Lercher 2005; Treangen and Rocha 2011) while eukaryotic genomes typically grow by duplication followed by sub- or neo-functionalization, and formation of regulatory connections following these two processes is likely distinct (Morgan N. Price, Arkin, and Alm 2006; Pal, Papp, and Lercher 2005). Moreover, the timescale for bacterial gene content turnover is comparable to the timescale for expression divergence: nearly half of the bacterial proteome turns over within many bacterial OTUs (clusters of <3% 16S divergence) (reviewed in(Grote et al. 2012)) the time by which E coli and S enterica transcriptional response to heat is nearly uncorrelated (Paramvir S. Dehal et al.). And while transcripts from this flexible genome are not often not observed to be expressed in laboratory conditions (Taoka et al. 2004), in environmental samples, many flexible genes are more highly expressed than core genes (Stewart et al. 2011). Thus the fast acquisition of new genes could substantially mold the expression profile and its evolution in bacteria in a way distinct from eukaryotes'. Microarrays and rna-seq provide the most comprehensive and unbiased examination of transcriptional responses, and standard approach to their analysis is to list the genes whose differential expression has the highest degree of statistical confidence (Pan 2002; Kerr and Churchill 2001), but this comparative survey showed that variation in transcriptional responses can lead to inconsistent lists, if not misleading conclusions. The 135 extreme sensitivity in the bacterial transcriptional response to stress not only calls into question the functional conclusions drawn from a single experiment's list of significantly changing genes, but also the fraction of the global transcriptional response that actually contributes to fitness. Previous reports highlighting a lack of correspondence between gene upregulation and condition-specific fitness defects have also cast doubt on the common assumption that differential expression is indicative of a positive fitness effect (S. L. Tai et al. 2007; Birrell et al. 2002; Giaever et al. 2002; Klosinska et al. 2011). However, the cross-protective nature observed for many stress responses (Leyer 1993) could account for some of this discrepancy: differential expression may not be beneficial for the conditions at hand but anticipatory for the likely future perturbations (Mitchell et al. 2009). Still, the overall contribution of such effects remains unclear. Despite compelling evidence that certain features of bacterial transcriptional responses have been highly optimized by selection, among them, response timing (Zaslaver et al. 2004), regulon spatial organization (Janga 2012), and even anticipatory responses (Mitchell et al. 2009), that doesn't preclude widespread neutrality in a single, dynamic, condition-specific response (Lynch, 2007). Has selection tuned every part of a dynamic global transcriptional states, or is it more likely that of the hundreds of genes that respond (significantly) to an environmental perturbation, only a fraction of these contribute to fitness, and the remaining changes are side-effects of the global network that are mostly neutral for fitness? For example, the E coli CRP regulon controls hundreds of downstream genes (Zheng et al. 2004; Salgado et al. 2006), but it seems implausible that each of them is responding optimally and contributing to fitness in every condition where CRP activity is modulated. It may be that for any one condition, the expression of a handful are important while the expression of the rest are 'hitchhiking' via their shared regulator. Questioning global optimality seems even more reasonable when one considers that a monoculture laboratory dish is a poor representation of the natural ecology experienced by a bacterium, and whatever the experimental protocol used, those conditions have likely never been optimized by selection. Culture-free measurement techniques may thus play an important role in addressing these questions. Despite a lack of global conservation, I showed a module known to be important for fitness had far greater conservation; conversely, conserved condition-specific modules 136 could be learned (Lazzeroni and Owen 2002) where they are not known a priori.In this way, comparative transcriptomics should help to identify functionally important parts of the response: by honing in on those conserved by selection, just as the conserved residues in a multiple sequence alignment can highlight the residues crucial for protein function which purifying selection has maintained as invariant. This will not only provide an important new dimension to inform physiological interpretation of a response, but will also help start to constrain the extent to which gene expression evolves by selection versus drift. Even a basic evolutionary context for bacterial gene expression and the transcriptional response is sorely lacking. A recent review on the promise of next generation sequencing for the study of bacterial transcription (Guell et al. 2011) did not even mention conservation, selection, or drift, ideas that we know to be central to the understanding of all biological processes (Dobzhansky 1973). A quantitative phylogenetic model of expression evolution would allow us to test whether divergently expressed genes differ from expectation of the null neutral model, as well as estimate strength of selection on transcriptional profiles and thus their fitness effects (Bedford and Hartl 2009). At present, the extent to which transcriptional regulation evolves largely by selection versus drift is unclear. While recent studies have weighed in on appropriate models for evolution of steady state transcript abundance: for largely neutral evolution in primates livers but not brains (Khaitovich et al. 2004; Khaitovich et al. 2006), or for stabilizing selection in flies (Rifkin, Kim, and White 2003; Bedtheford and Hartl 2009), neither for bacteria nor for transcriptional responses has a model been proposed or tested. Phylogenetic modeling of the evolution of quantitative traits has long been a pillar of ecological study (Garland, Harvey, and Ives 1992; Ricklefs and Starck 1996; Felsenstein and others 2004; Whitehead and Crawford 2006) and its application to transcriptional data is not without precedent (Fay and Wittkopp 2007). The justification for modeling the expression level (or change in level) of a gene as an 0-U process is that many small effects can influence a gene's expression. An unbiased walk is representative of neutral drift and it has been argued this fits gene expression divergence well for some datasets (Khaitovich et al. 2004). In addition to modeling neutral drift, the mathematical formulation of this process can model a number of selective forces: the walk can be biased (directional selection), mean-reverting (stabilizing selection), and include branch-specific rates or even punctuated equilibria (Gu 2004). These models, however, have only very rarely been applied to data (Oakley et al. 2005; Bedford and Hartl 2009). 137 A major hurdle to the application of these phylogenetic random-walk models is the taxon sampling required for parameter estimation in even the simplest models. Simulations of gene fold-change values evolving under a random walk model and sampled by realistically noisy measurements, I estimated that at least 20 transcriptomes would be required to reject the null hypothesis for even strong directional selection (2 times the standard deviation by drift) on a gene-by-gene basis (test AUC=0.8; S Timberlake, unpublished data). Still, genome-wide trends are possible to establish with less extensive datasets: a recent phylogenetic analysis of Drosophila tissue gene expression found that stabilizing selection was a better model than neutral divergence without bound to explain the average of genome-wide expression divergence (Bedford and Hartl 2009). (The same study also noted that with 7 taxa, the test to reject neutrality in favor of stabilizing selection on a gene-by-gene basis was still vastly underpowered.) That technological advances will make a sufficient level of transcriptome sampling available in the very near future leads me to outline these key analyses: 1) Does a model of neutral or stabilizing selection better explain bacterial expression evolution within populations, and between populations? 2) Can estimates of gene expression drift help reconcile the discrepancy between gene upregulation and deletion fitness effect? 3) Are most genes, and in particular new genes, under stabilizing or directional selection in most relevant conditions? 4) Can the varying strengths of selection in different conditions inform our understanding of the recent and ongoing evolutionary pressures of the population's niche? The application of quantitative phylogenetic methods to the study of transcriptional regulation will be critical for answering fundamental questions about the evolution of transcriptional regulation, and thus for informed functional interpretations of transcriptomic data. 138 139 Chapter 5. Glossary NOTE: some of these were taken or adaptedfrom previous publicationsincluding (Shapiro et al. 2009). de novo assembly: the process of finding overlaps between reads and building them into contigs de novo, i.e., without any prior knowledge of the source sequence reference assembly: the process of aligning ("mapping") reads from one library onto a reference sequence of a closely related strain that was obtained through other means, following by inference of the reads source sequence by consensus building in the aligned regions. The term reference assembly is a bit misleading. N50: an assembly contiguity metric: the length (in base pairs) of the shortest contig of the assembly that would have to be included in the set of contigs containing 50% of the total source sequence. N80 (with an analogous meaning) is also sometimes reported. paired-end reads: many sequencing technologies can read sequence starting from each end of a DNA molecule. The resulting pair of reads is known to occur in a specific orientation in the source sequence, separated by some distance (usually approximated by measurement of the distribution of DNA molecule lengths sequenced; this distance is called the insert length). ecology: the niche, hyperspace whose dimensions are defined as environmental variables (including biotic and abiotic conditions) in which a species is able to persist and maintain stable population sizes (Hutchinson 1957; Hardesty 1975). Measurements of these environmental axes may be used to approximate an organism's ecology. habitat: In this thesis, I typically refer to the inferred habitats output by L. David's algorithm, AdaptML (Hunt et al. 2008). Habitats should be an approximation, though perhaps a very limited one, of a genome's ecology. 140 species: Definitions of species, or even their existence, in Bacteria are highly contentious. I use species in the following ways: 1) the term "named species" refers to taxonomical nomenclature for which there is a historical precedent in microbiology, and wide use in the literature, but which might not correspond to any modern post-NGS or theoretical concept of bacterial species, and 2) a purely theoretical notion of species, which will share a niche and sexual gene exchange, harbor largely neutral diversity, and in every other way conform to all microbiologists criteria for species. phylogenetic cluster: a set of genomes related much more closely to each other than to nearest relatives, (as measured by housekeeping gene or other genetic distance); these are sometimes treated as bacterial species. population: Populations are species-like entities, a phylogenetic cluster of genomes that is ecologically cohesive. The strictest definition requires their discovery by one run of an experimental sampling protocol, but populations found congruent by previous study (Preheim, Timberlake, and Polz 2011) were sometimes combined in this work when there was good phylogenetic evidence to do so. Populations are used as the primary taxonomic unit in Chapter 3. sympatric speciation: The process through which new species arise in the absence of physical barriers to gene flow between them. niche partitioning: The process whereby different organisms co-exist in a community rather than competing for resources. Niches can be partitioned when lineages avoid competition by using different resources, or restricting their activity to different physical spaces, seasons, times of day, etc. convergence: Similar evolutionary adaptations occurring independently in two lineages. Used here to refer to gene acquisition and elements of phenotype including gene expression and substrate metabolism. Convergence is often used as evidence of (positive) selection. penetrance: the proportion of a population with a particular genotype; in this work, the genotype is more specifically the possession of a particular flexible gene. (neutral) drift: The process by which genetic changes with negligible effects on fitness become stochastically fixed in a population of finite size. negative/purifying/stabilizing selection: The evolutionary force selecting against deleterious genomic changes and promoting conservation of the ancestral state. 141 positive/diversifying selection: The evolutionary force causing novel alleles, genes, or phenotype conferring a fitness advantage to rise in frequency in a population. homologous recombination: The exchange of identical or similar (homologous) stretches of DNA. illegitimate recombination (horizontal transfer): The lateral acquisition of DNA that does not replace orthologous DNA. 142 Chapter 6. References Achaz, G., E. Coissac, P. Netter, and E. P. C. Rocha. 2003. "Associations Between Inverted Repeats and the Structural Evolution of Bacterial Genomes." Genetics 164 (4): 1279- 1289. Aird, D., M. G. Ross, W. S. Chen, M. Danielsson, T. Fennell, C. Russ, D. B. Jaffe, C. Nusbaum, and A. Gnirke. 2011. "Analyzing and Minimizing PCR Amplification Bias in Illumina Sequencing Libraries." Genome Biol 12 (2): R18. Akashi, H., and T. Gojobori. 2002. "Metabolic Efficiency and Amino Acid Composition in the Proteomes of Escherichia Coli and Bacillus Subtilis." Proceedingsof the National Academy of Sciences of the United States ofAmerica 99 (6): 3695. Allen, T. E., M. J. Herrgard, M. Liu, Y. Qiu, J. D. Glasner, F. R. Blattner, and B. 0. Palsson. 2003. "Genome-scale Analysis of the Uses of the Escherichia Coli Genome: Model-driven Analysis of Heterogeneous Data Sets."Journalof Bacteriology 185 (21): 6392. Alm, E. J., K. H. Huang, M. N. Price, R. P. Koche, K. Keller, 1. L. Dubchak, and A. P. Arkin. 2005. "The MicrobesOnline Web Site for Comparative Genomics." Genome Research 15 (7): 1015-1022. Altschul, S. F., T. L. Madden, A. A. SchAffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. "Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs." Nucleic Acids Research 25 (17): 3389-3402. Angiuoli, Samuel, Julie Dunning Hotopp, Steven Salzberg, and Herv6 Tettelin. 2011. "Improving Pan-genome Annotation Using Whole Genome Multiple Alignment." BMC Bioinformatics 12 (1) (June 30): 272. doi:10.1186/1471-2105-12-272. Aparicio, S. A. J. R., and D. G. Huntsman. 2010. "Does Massively Parallel DNA Resequencing Signify the End of Histopathology as We Know It?" The Journalof Pathology 220 (2): 307-315. Aziz, Ramy K, Daniela Bartels, Aaron A Best, Matthew DeJongh, Terrence Disz, Robert A Edwards, Kevin Formsma, et al. 2008. "The RAST Server: Rapid Annotations Using Subsystems Technology." BMC Genomics 9: 75. doi:10.1186/1471-2164-9-75. Babu, M. M., N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann. 2004. "Structure and Evolution of Transcriptional Regulatory Networks." Current Opinion in StructuralBiology 14 (3): 283-291. Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, et al. 2012. "SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing." Journalof Computational Biology 19 (5) (May): 455-477. doi:10.1089/cmb.2012.0021. Barrett, T., D. B. Troup, S. E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista, I. F. Kim, A. Soboleva, M. Tomashevsky, and R. Edgar. 2007. "NCBI GEO: mining tens of millions of expression profiles-database and tools update." Nucleic acids research 35 (Database issue) (January): D760-5. 143 Bedford, T., and D. L. Hart. 2009. "Optimization of Gene Expression by Natural Selection." Proceedingsof the NationalAcademy of Sciences 106 (4): 113 3. Beerenwinkel, N., and 0. Zagordi. 2011. "Ultra-deep Sequencing for the Analysis of Viral Populations." Curr.Opin. Virol 1 (5): 413-418. Bennett, Julia, Stephen Bentley, Georgios Vernikos, Michael Quail, Inna Cherevach, Brian White, Julian Parkhill, and Martin Maiden. 2010. "Independent evolution of the core and accessory gene sets in the genus Neisseria: insights gained from the genome of Neisseria lactamica isolate 020-06." BMC Genomics 11: 652. Berbee, M. L., and J. W. Taylor. 1993. "Dating the Evolutionary Radiations of the True Fungi." CanadianJournalof Botany 71 (8): 1114-1127. Berg, 0. G., and C. G. Kurland. 2002. "Evolution of Microbial Genomes: Sequence Acquisition and Loss." MolecularBiology and Evolution 19 (12): 2265-2276. Bergmann, S., J. Ihmels, and N. Barkai. 2004. "Similarities and Differences in Genome-wide Expression Data of Six Organisms." PLoS Biology 2 (1): e9. Birrell, G. W., J. A. Brown, H. I. Wu, G. Giaever, A. M. Chu, R. W. Davis, and J. M. Brown. 2002. "Transcriptional Response of Saccharomyces Cerevisiae to DNA-damaging Agents Does Not Identify the Genes That Protect Against These Agents." Proceedingsof the NationalAcademy of Sciences 99 (13): 8778. Boucher, Y., 0. X. Cordero, A. Takemura, D. E. Hunt, K. Schliep, E. Bapteste, P. Lopez, C. L. Tarr, and M. F. Polz. 2011. "Local Mobile Gene Pools Rapidly Cross Species Boundaries to Create Endemicity Within Global Vibrio Cholerae Populations." MBio 2 (2). http://mbio.asm.org/content/2/2/eoo335-10.short. Brady, A., and S. L. Salzberg. 2009. "Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models." Nature Methods 6 (9): 673-676. Brauer, M. J., C. Huttenhower, E. M. Airoldi, R. Rosenstein, J. C. Matese, D. Gresham, V. M. Boer, 0. G. Troyanskaya, and D. Botstein. 2008. "Coordination of growth rate, cell cycle, stress response, and metabolic activity in yeast." Molecularbiology of the cell 19 (1) (January): 352-367. Britten, R. J., and E. H. Davidson. 1969. "Gene regulation for higher cells: a theory." Science (New York, N. Y.) 165 (891) (July 25): 349-357. Brown, P. 0., and D. Botstein. 1999. "Exploring the New World of the Genome with DNA Microarrays." Nature Genetics 21 (1 Suppl): 33-37. Bruijn, Frans J. de. 2011. Handbook of MolecularMicrobial Ecology I: Metagenomics and ComplementaryApproaches.John Wiley & Sons. Budroni, S., E. Siena, J. C. D. Hotopp, K. L. Seib, D. Serruto, C. Nofroni, M. Comanducci, D. R. Riley, S. C. Daugherty, and S. V. Angiuoli. 2011. "Neisseria Meningitidis Is Structured in Clades Associated with Restriction Modification Systems That Modulate Homologous Recombination." Proceedingsof the NationalAcademy of Sciences 108 (11): 4494-4499. Butland, G., J. M. Peregrin-Alvarez, J. Li, W. Yang, X. Yang, V. Canadien, A. Starostine, et al. 2005. "Interaction network containing conserved and essential protein complexes in Escherichia coli." Nature 433 (7025) (February 3): 531-537. Butler, J., M. Kleber, P. Alvarez, W. Brockman, C.W. Chin, S. Gnerre, M. Grabherr, E. Mauceli, B. Birren, and E. S. Lander. "ALLPATHS: A New Genome Assembly Algorithm That Preserves Intrinsic Ambiguity." Caimano, M. J., C. H. Eggers, K. R. 0. Hazlett, and J. D. Radolf. 2004. "RpoS Is Not Central to the General Stress Response in Borrelia Burgdorferi but Does Control Expression of One or More Essential Virulence Determinants." Infection and Immunity 72 (11): 64336445. 144 Canales, R. D., Y. Luo, J.C.Willey, B. Austermiller, C. C. Barbacioru, C. Boysen, K. Hunkapiller, R. V. Jensen, C. R. Knight, and K. Y. Lee. 2006. "Evaluation of DNA Microarray Results with Quantitative Gene Expression Platforms." Nature Biotechnology 24 (9): 11151122. Cases, Ildefonso, Victor de Lorenzo, and Christos A. Ouzounis. 2003. "Transcription Regulation and Environmental Adaptation in Bacteria." Trends in Microbiology 11 (6) (June): 248-253. doi:10.1016/SO966-842X(03)00103-3. Chaisson, M. J., D. Brinza, and P. A. Pevzner. 2009. "De Novo Fragment Assembly with Short Mate-paired Reads: Does the Read Length Matter?" Genome Research 19 (2): 336- 346. Chen, D., W. M. Toone, J. Mata, R. Lyne, G. Burns, K. Kivinen, A. Brazma, N. Jones, and J. Bahler. 2003. "Global Transcriptional Responses of Fission Yeast to Environmental Stress." MolecularBiology of the Cell 14 (1): 214. Chhabra, S. R., Q. He, K. H. Huang, S. P. Gaucher, E. J.Alm, Z. He, M. Z. Hadi, T. C. Hazen, J. D. Wall, and J. Zhou. 2006. "Global Analysis of Heat Shock Response in Desulfovibrio Vulgaris Hildenborough." Journalof Bacteriology 188 (5): 1817-1828. Chipman, Hugh, and Robert Tibshirani. 2006. "Hybrid Hierarchical Clustering with Applications to Microarray Data." Biostatistics 7 (2) (April 1): 286-301. doi:10.1093/biostatistics/kxjOO7. Choi, M., U. 1. Scholl, W. Ji, T. Liu, I. R. Tikhonova, P. Zumbo, A. Nayir, et al. 2009. "Genetic Diagnosis by Whole Exome Capture and Massively Parallel DNA Sequencing." Proceedingsof the NationalAcademy of Sciences 106 (45): 19096-19101. Ciccarelli, F. D., T. Doerks, C.von Mering, C. J. Creevey, B. Snel, and P. Bork. 2006. "Toward Automatic Reconstruction of a Highly Resolved Tree of Life." Science 311 (5765): 1283-1287. Ciulla, D., G. Giannoukos, A. Earl, M. Feldgarden, D. Gevers, J. Levin, J. Livny, et al. 2010. "Evaluation of Bacterial Ribosomal RNA (rRNA) Depletion Methods for Sequencing Microbial Community Transcriptomes." Genome Biology 11 (Suppl 1): P9. Claesson, Marcus J., Qiong Wang, Orla O'Sullivan, Rachel Greene-Diniz, James R. Cole, R. Paul Ross, and Paul W. O'Toole. 2010. "Comparison of Two Next-generation Sequencing Technologies for Resolving Highly Complex Microbiota Composition Using Tandem Variable 16S rRNA Gene Regions." Nucleic Acids Research 38 (22) (December): e200. doi:10.1093/nar/gkq873. Cohan, F. M. 2011. "Are Species cohesive?-A View from Bacteriology." http://works.bepress.com/cgi/viewcontent.cgi?article=1005&context=frederickco han. Coleman, M. L., M. B. Sullivan, A. C. Martiny, C. Steglich, K. Barry, E. F. DeLong, and S. W. Chisholm. 2006. "Genomic Islands and the Ecology and Evolution of Prochlorococcus." Science 311 (5768): 1768-1770. Collins, F. S., M. L. Drumm, J. L. Cole, W. K. Lockwood, G. F. Vande Woude, and M. C. lannuzzi. 1987. "Construction of a General Human Chromosome Jumping Library, with Application to Cystic Fibrosis." Science 235 (4792): 1046. Cordero, 0, LA Ventouras, Edward F. DeLong, and Martin F. Polz. 2012. "Public Good Dynamics Drive Evolution of Iron Acquisition Strategies in Natural Bacterioplankton Populations." Proceedingsof the NationalAcademy of Sciences (November 19). doi:10.1073/pnas.1213344109. 2 http://www.pnas.org/content/early/ 012/11/14/1213344109. Cordero, 0., H. Wildschutte, B. Kirkup, S. Proehl, L. Ngo, F. Hussain, F. Le Roux, T. Mincer, and M. F. Polz. 2012. "Ecological Populations of Bacteria Act as Socially Cohesive Units of Antibiotic Production and Resistance." Science 337 (6099): 1228-1231. 145 David, L. A., and E. J. Alm. 2010. "Rapid Evolutionary Innovation During an Archaean Genetic Expansion." Nature. Dehal, P. S., M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, D. Chivian, G. D. Friedland, et al. 2010. "MicrobesOnline: An Integrated Portal for Comparative and Functional Genomics." Nucleic Acids Research 38 (suppl 1): D396-D400. Dehal, Paramvir S., Marcin P. Joachimiak, Morgan N. Price, John T. Bates, Jason K. Baumohl, Dylan Chivian, Greg D. Friedland, et al. "MicrobesOnline: An Integrated Portal for Comparative and Functional Genomics." Nucleic Acids Research. Delcher, A. L., K. A. Bratke, E. C. Powers, and S. L. Salzberg. 2007. "Identifying Bacterial Genes and Endosymbiont DNA with Glimmer." Bioinformatics 23 (6): 673-679. DeRisi, J. L., V. R. Iyer, and P. 0. Brown. 1997. "Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale." Science 278 (5338): 680-686. Dhillon, I. S., E. M. Marcotte, and U. Roshan. 2003. "Diametrical Clustering for Identifying Anti-correlated Gene Clusters." Bioinformatics 19 (13): 1612-1619. Dixon, Philip. 2003. "VEGAN, a Package of R Functions for Community Ecology." Journalof Vegetation Science 14 (6): 927-930. doi:10.1111/j.1654-1103.2003.tb02228.x. Dobrindt, U., B. Hochhut, U. Hentschel, and J. Hacker. 2004. "Genomic Islands in Pathogenic and Environmental Microorganisms." Nature Reviews Microbiology 2 (5): 414-424. Dondoshansky, I., and Y. Wolf. 2000. BLASTCLUST-BLAST Score-basedSingle-linkage Clustering. Doolittle, W. F. 2012. "Population Genomics: How Bacterial Species Form and Why They Don't Exist." CurrentBiology 22 (11): R451-R453. Draghici, S., P. Khatri, A. C. Eklund, and Z. Szallasi. 2006. "Reliability and Reproducibility Issues in DNA Microarray Measurements." TRENDS in Genetics 22 (2): 101-109. Dudoit, S., and J. Fridlyand. 2002. "A Prediction-based Resampling Method for Estimating the Number of Clusters in a Dataset." Genome Biology 3 (7): research0036. Eddy, S. R. 2001. "${$HMMER: Profile Hidden Markov Models for Biological Sequence Analysis$}$." http://www.citeulike.org/group/2050/article/1079185. Edgar, R. C. 2010. "Search and Clustering Orders of Magnitude Faster Than BLAST." Bioinformatics 26 (19): 2460-2461. Eisen, J. A. 1998. "Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis." Genome Research 8 (3): 163. English, Adam C., Stephen Richards, Yi Han, Min Wang, Vanesa Vee, Jiaxin Qu, Xiang Qin, et al. 2012. "Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology." Ed. Zhanjiang Liu. PLoS ONE 7 (11) (November 21): e47768. doi:10.1371/journal.pone.0047768. J. J., M. E. Driscoll, V. A. Fusaro, E. J. Cosgrove, B. Hayete, F. S. Juhn, S. J. Schneider, and T. S. Gardner. 2008. "Many Microbe Microarrays Database: Uniformly Normalized Affymetrix Compendia with Structured Experimental Metadata." Nucleic Acids Research 36 (Database issue): D866. Fay, J. C., and P. J. Wittkopp. 2007. "Evaluating the role of natural selection in the evolution of gene regulation." Heredity (May 23). Felsenstein, J., and others. 2004. Inferring Phylogenies.Vol. 2. Sinauer Associates Sunderland. http://orton.catie.ac.cr/cgibin/wxis.exe/?IsisScript=SUV.xis&method=post&formato=2&cantidad=1&expresio Faith, n=mfn=007729. Fierer, Noah, Mya Breitbart, James Nulton, Peter Salamon, Catherine Lozupone, Ryan Jones, Michael Robeson, et al. 2007. "Metagenomic and Small-Subunit rRNA Analyses Reveal the Genetic Diversity of Bacteria, Archaea, Fungi, and Viruses in Soil." Applied 146 and EnvironmentalMicrobiology 73 (21) (November 1): 7059-7066. doi: 10.1 128/AEM.00358-07. Fink, G. R., P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. 0. Brown, D. Botstein, and B. Futcher. 1998. "Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization." Molecular Biology of the Cell 9 (12): 3273-3297. Finneran, K. T., M. E. Housewright, and D. R. Lovley. 2002. "Multiple Influences of Nitrate on Uranium Solubility During Bioremediation of Uranium-contaminated Subsurface Sediments." EnvironmentalMicrobiology 4 (9): 510-516. Fischer, Steve, Brian P. Brunk, Feng Chen, Xin Gao, Omar S. Harb, John B. lodice, Dhanasekaran Shanmugam, David S. Roos, and Christian J. Stoeckert. 2002. "Using OrthoMCL to Assign Proteins to OrthoMCL-DB Groups or to Cluster Proteomes Into New Ortholog Groups." In CurrentProtocolsin Bioinformatics.John Wiley & Sons, 4 7 1 250953.biO612s35/abstract. Inc. http://onlinelibrary.wiley.com/doi/10.1002/0 of Functional Modules in Evolution Fokkens, L., and B. Snel. 2009. "Cohesive Versus Flexible Eukaryotes." PLoS Comput Biol 5 (1): e1000276. Francisco, J. C., F. M. Cohan, and D. Krizanc. 2012. "Demarcation of Bacterial Ecotypes from DNA Sequence Data: A Comparative Analysis of Four Algorithms." In 2nd IEEE InternationalConference on ComputationalAdvances in Bio and Medical Sciences (ICCABS). http://works.bepress.com/cgi/viewcontent.cgi?article=1095&context=frederickco han. Fraser, C., E. J. Alm, M. F. Polz, B. G. Spratt, and W. P. Hanage. 2009. "The Bacterial Species Challenge: Making Sense of Genetic and Ecological Diversity." Science 323 (5915): 741-746. Frye, JG, Blackmer F, Cheng P, Proctor E, Porwollik S, and McClelland M. Series GSE808. Fuhrman, Jed A., Joshua A. Steele, Ian Hewson, Michael S. Schwalbach, Mark V. Brown, Jessica L. Green, and James H. Brown. 2008. "A Latitudinal Diversity Gradient in Planktonic Marine Bacteria." Proceedingsof the NationalAcademy of Sciences 105 (22) (June 3): 7774-7778. doi:10.1073/pnas.0803070105. Gao, H., Y. Wang, X. Liu, T. Yan, L. Wu, E. Alm, A. Arkin, D. K. Thompson, and J. Zhou. 2004. "Global Transcriptome Analysis of the Heat Shock Response of Shewanella Oneidensis." Journalof Bacteriology 186 (22): 7796. Gao, H., Z. K. Yang, L. Wu, D. K. Thompson, and J.Zhou. 2006. "Global Transcriptome Analysis of the Cold Shock Response of Shewanella Oneidensis MR-1 and Mutational Analysis of Its Classical Cold Shock Proteins."Journal of Bacteriology 188 (12): 4560. Garland, T., P. H. Harvey, and A. R. Ives. 1992. "Procedures for the Analysis of Comparative Data Using Phylogenetically Independent Contrasts." Systematic Biology 41 (1): 18- 32. Gasch, A. P. 2007. "Comparative Genomics of the Environmental Stress Response in Ascomycete Fungi." Yeast 24 (11): 961-976. Gasch, A. P., P. T. Spellman, C. M. Kao, 0. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. 0. Brown. 2000. "Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes." Science's STKE 11 (12): 4241-4257. Gelfand, M. S. 2006. "Evolution of transcriptional regulatory networks in microbial genomes." Currentopinion in structuralbiology 16 (3) (June): 420-429. Ghodsi, M. 2012. "Searching, Clustering and Evaluating Biological Sequences." http://drum.lib.umd.edu/handle/1903/13186. 147 Giaever, G., A. M. Chu, L. Ni, C. Connelly, L. Riles, S. V6ronneau, S. Dow, A. Lucau-Danila, K. Anderson, and B. Andre. 2002. "Functional Profiling of the Saccharomyces Cerevisiae Genome." Nature 418: 387-391. Giannoukos, G., D. M. Ciulla, K. Huang, B. J. Haas, J. Izard, J. Z. Levin, J. Livny, et al. 2012. "Efficient and Robust RNA-seq Process for Cultured Bacteria and Complex Community Transcriptomes." Genome Biology 13 (3): r23. Gifford, Scott M, Shalabh Sharma, Johanna M Rinta-Kanto, and Mary Ann Moran. 2010. "Quantitative Analysis of a Deeply Sequenced Marine Microbial Metatranscriptome." The ISME Journal 5 (3) (September 16): 461-472. doi:10.1038/ismej.2010.141. Gilbert, J.A., D. Field, P. Swift, L. Newbold, A. Oliver, T. Smyth, P. J.Somerfield, S. Huse, and I. Joint. 2009. "The Seasonal Structure of Microbial Communities in the Western English Channel." EnvironmentalMicrobiology 11 (12): 3132-3139. Gilbreath, J. J., W. L. Cody, D. S. Merrell, and D. R. Hendrixson. 2011. "Change Is Good: Variations in Common Biological Mechanisms in the Epsilonproteobacterial Genera Campylobacter and Helicobacter." Microbiology and MolecularBiology Reviews 75 (1): 84. Glasner, J. D., P. Liss, G. Plunkett III, A. Darling, T. Prasad, M. Rusch, A. Byrnes, M. Gilson, B. Biehl, and F. R. Blattner. 2003. "ASAP, a Systematic Annotation Package for Community Analysis of Genomes." Nucleic Acids Research 31 (1): 147. Glenn, Travis C. 2011. "Field Guide to Next-generation DNA Sequencers." MolecularEcology Resources 11 (5): 759-769. doi:10.1111/j.1755-0998.2011.03024.x. Gnerre, Sante, Eric Lander, Kerstin Lindblad-Toh, and David Jaffe. 2009. "Assisted Assembly: How to Improve a De Novo Genome Assembly by Using Related Species." Genome Biology 10 (8) (August 27): R88. doi:10.1186/gb-2009-10-8-r88. Grote, J., J. C. Thrash, M. J. Huggett, Z. C. Landry, P. Carini, S. J. Giovannoni, and M. S. Rappe. 2012. "Streamlining and Core Genome Conservation Among Highly Divergent Members of the SAR11 Clade." mBio 3 (5) (September 18): e00252-12-e00252-12. doi:10.1128/mBio.00252-12. Gu, X. 2004. "Statistical Framework for Phylogenomic Analysis of Gene Family Expression Profiles." Genetics 167 (1): 531-542. Guell, Marc, Eva Yus, Maria Lluch-Senar, and Luis Serrano. 2011. "Bacterial Transcriptomics: What Is Beyond the RNA Horiz-ome?" Nature Reviews Microbiology 9 (9) (August 12): 658-669. doi:10.1038/nrmicro2620. Gunesekere, Ishara C., Charlene M. Kahler, David R. Powell, Lori A. S. Snyder, Nigel J. Saunders, Julian I. Rood, and John K. Davies. 2006. "Comparison of the RpoHDependent Regulon and General Stress Response in Neisseria Gonorrhoeae." Journal of Bacteriology 188 (13) (July 1): 4769-4776. doi:10.1128/JB.01807-05. Gutierrez-Rios, R. M., D. A. Rosenblueth, J. A. Loza, A. M. Huerta, J. D. Glasner, F. R. Blattner, and J. Collado-Vides. 2003. "Regulatory Network of Escherichia Coli: Consistency Between Literature Knowledge and Microarray Profiles." Genome Research 13 (11): 2435-2443. Hacker, J., and E. Carniel. 2001. "Ecological Fitness, Genomic Islands and Bacterial Pathogenicity." EMBO Reports 2 (5): 376-381. Haegeman, Bart, and Joshua S Weitz. 2012. "A Neutral Theory of Genome Evolution and the Frequency Distribution of Genes." BMC Genomics 13: 196. doi: 10.1186/1471-2164- 13-196. Hardesty, Donald L. 1975. "The Niche Concept: Suggestions for Its Use in Human Ecology." Human Ecology 3 (2) (April): 71-85. doi:10.1007/BF01552263. Hayes, E. T., J. C. Wilks, P. Sanfilippo, E. Yohannes, D. P. Tate, B. D. Jones, M. D. Radmacher, S. S. BonDurant, and J. L. Slonczewski. 2006. "Oxygen limitation modulates pH 148 regulation of catabolism and hydrogenases, multidrug transporters, and envelope composition in Escherichia coli K-12." BMC microbiology 6 (October 6): 89. Hayes, E. T., J. C.Wilks, E. Yohannes, D. P. Tate, M. Radmacher, S. S. BonDurant, and J. L. Slonczewski. 2006. "pH and Anaerobiosis Coregulate Catabolism, Hydrogenases, Ion and Multidrug Transporters, and Envelope Composition in Escherichia Coli K-12." BMC Microbiol 6: 89. He, Q., Z. He, D. C. Joyner, M. Joachimiak, M. N. Price, Z. K. Yang, H. C. B. Yen, C. L. Hemme, W. Chen, and M. M. Fields. 2010. "Impact of Elevated Nitrate on Sulfate-reducing Bacteria: a Comparative Study of Desulfovibrio Vulgaris." The ISMEJournal4 (11): 1386-1397. Heidelberg, J. F., R. Seshadri, S. A. Haveman, C. L. Hemme, I. T. Paulsen, J. F. Kolonay, J.A. Eisen, N. Ward, B. Methe, L. M. Brinkac, et al. 2004. "The genome sequence of the anaerobic, sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough." Nature biotechnology 22 (5) (May): 554-559. Heidelberg, J. F., R. Seshadri, S. A. Haveman, C. L. Hemme, I. T. Paulsen, J. F. Kolonay, J. A. Eisen, N. Ward, B. Methe, and L. M. Brinkac. 2004. "Genome Sequence of the Dissimilatory Metal Ion-reducing Bacterium Shewanella Oneidensis." Nature Biotechnology 22: 554-559. Helmann, J. D., M. F. W. Wu, P. A. Kobel, F. J. Gamo, M. Wilson, M. M. Morshedi, M. Navre, and C. Paddon. 2001. "Global Transcriptional Response of Bacillus Subtilis to Heat Shock."Journal of Bacteriology 183 (24): 7318-7328. Hiatt, J. B., R. P. Patwardhan, E. H. Turner, C. Lee, and J. Shendure. 2010. "Parallel, Tagdirected Assembly of Locally Derived Short Sequence Reads." Nature Methods 7 (2): 119-122. Hoff, K. 2009. "The Effect of Sequencing Errors on Metagenomic Gene Prediction." Bmc Genomics 10 (1): 520. Hooper, Sean D, Konstantinos Mavromatis, and Nikos C Kyrpides. 2009. "Microbial Cohabitation and Lateral Gene Transfer: What Transposases Can Tell Us." Genome Biology 10 (4): R45. doi:10.1186/gb-2009-10-4-r45. Hu, P., S. C. Janga, M. Babu, J. J. Diaz-Mejia, G. Butland, W. Yang, 0. Pogoutse, X. Guo, S. Phanse, and P. Wong. 2009. "Global Functional Atlas of Escherichia Coli Encompassing Previously Uncharacterized Proteins." PLoS Biol 7 (4): e96. Hunt, D. E., L. A. David, D. Gevers, S. P. Preheim, E. J. Alm, and M. F. Polz. 2008. "Resource Partitioning and Sympatric Differentiation Among Closely Related Bacterioplankton." Science 320 (5879): 1081-1085. Hutchinson, G. E. 1957. "Concluding Remarks." Cold Spring HarborSymposia on Quantitative Biology 22 (0) (January 1): 415-427. doi:10.1101/SQB.1957.022.01.039. De Jager, V., and R. J. Siezen. 2011. "Single-cell Genomics: Unravelling the Genomes of Unculturable Microorganisms." MicrobialBiotechnology 4 (4): 43 1-437. Jozefczuk, S., S. Klie, G. Catchpole, J. Szymanski, A. Cuadros-Inostroza, D. Steinhauser, J. Selbig, and L. Willmitzer. 2010. "Metabolomic and Transcriptomic Stress Response of Escherichia Coli." Molecular Systems Biology 6 (1). Kaczanowska, M., and M. Ryden-Aulin. 2007. "Ribosome biogenesis and the translation process in Escherichia coli." Microbiology and molecular biology reviews: MMBR 71 (3) (September): 477-494. Kaneko, T., S. Sato, H. Kotani, A. Tanaka, E. Asamizu, Y. Nakamura, N. Miyajima, M. Hirosawa, M. Sugiura, and S. Sasamoto. 1996. "Sequence Analysis of the Genome of the Unicellular Cyanobacterium Synechocystis Sp. Strain PCC6803. II. Sequence Determination of the Entire Genome and Assignment of Potential Protein-coding Regions." DNA Research 3 (3): 109-136. 149 Kannan, G., J. C.Wilks, D. M. Fitzgerald, B. D. Jones, S. S. Bondurant, and J. L. Slonczewski. 2008. "Rapid acid treatment of Escherichia coli: transcriptomic response and recovery." BMC microbiology 8 (February 26): 37. Kerr, M. Kathleen, and Gary A. Churchill. 2001. "Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments." Proceedingsof the NationalAcademy of Sciences 98 (16) (July 31): 8961-8965. doi:10.1073/pnas.161273698. Kettler, G. C., A. C.Martiny, K. Huang, J.Zucker, M. L. Coleman, S. Rodrigue, F. Chen, A. Lapidus, S. Ferriera, and J. Johnson. 2007. "Patterns and Implications of Gene Gain and Loss in the Evolution of Prochlorococcus." PLoS Genetics 3 (12): e231. Keymer, D. P., M. C.Miller, G. K.Schoolnik, and A. B. Boehm. 2007. "Genomic and Phenotypic Diversity of Coastal Vibrio Cholerae Strains Is Linked to Environmental Factors." Applied and EnvironmentalMicrobiology 73 (11): 3705-3714. Khaitovich, P., W. Enard, M. Lachmann, and S. Paabo. 2006. "Evolution of Primate Gene Expression." Nature Reviews Genetics 7 (9): 693-702. Khaitovich, P., G. Weiss, M. Lachmann, I. Hellmann, W. Enard, B. Muetzel, U. Wirkner, W. Ansorge, and S. Paabo. 2004. "A neutral model of transcriptome evolution." PLoS biology 2 (5) (May): E132. Kirkup, Benjamin, LeeAnn Chang, Sarah Chang, Dirk Gevers, and Martin Polz. 2010. "Vibrio Chromosomes Share Common History." BMC Microbiology 10 (1) (May 10): 137. doi:10.1186/1471-2180-10-137. Klosinska, M. M., C. A. Crutchfield, P. H. Bradley, J. D. Rabinowitz, and J. R. Broach. 2011. "Yeast cells can access distinct quiescent states." Genes & development 25 (4) (February 15): 336-349. Konstantinidis, K. T., M. H. Serres, M. F. Romine, J. L. M. Rodrigues, J. Auchtung, L. A. McCue, M. S. Lipton, A. Obraztsova, C. S. Giometti, and K. H. Nealson. 2009. "Comparative Systems Biology Across an Evolutionary Gradient Within the Shewanella Genus." Proceedingsof the NationalAcademy of Sciences 106 (37): 15909. Koskinen, J. Patrik, and Liisa Holm. 2012. "SANS: High-throughput Retrieval of Protein Sequences Allowing 50% Mismatches." Bioinformatics 28 (18) (September 15): i438-i443. doi:10.1093/bioinformatics/bts4l7. Kuo, Chih-Horng, Nancy A Moran, and Howard Ochman. 2009. "The Consequences of Genetic Drift for Bacterial Genome Complexity." Genome Research 19 (8) (August): 1450-1454. doi:10.1101/gr.091785.109. Kuo, Chih-Horng, and Howard Ochman. 2009. "Deletional Bias Across the Three Domains of Life." Genome Biology and Evolution 1: 145-152. doi:10.1093/gbe/evpO16. Kussell, Edo, and Stanislas Leibler. 2005. "Phenotypic Diversity, Population Growth, and Information in Fluctuating Environments." Science 309 (5743) (September 23): 2075-2078. doi:10.1126/science.1114383. Lan, R., and P. R. Reeves. 2001. "When Does a Clone Deserve a Name? A Perspective on Bacterial Species Based on Population Genetics." Trends in Microbiology9 (9): 419- 424. Lan, Ruiting, and Peter R. Reeves. 2000. "Intraspecies Variation in Bacterial Genomes: The Need for a Species Genome Concept." Trends in Microbiology 8 (9) (September 1): 396-401. doi:10.1016/SO966-842X(00)01791-1. Langmead, B., and S. L. Salzberg. 2012. "Fast Gapped-read Alignment with Bowtie 2." Nature Methods 9 (4): 357-359. Lapierre, P., and J. P. Gogarten. 2009. "Estimating the Size of the Bacterial Pan-genome." Trends in Genetics 25 (3): 107-110. 150 Lauro, Federico M., Diane McDougald, Torsten Thomas, Timothy J.Williams, Suhelen Egan, Scott Rice, Matthew Z. DeMaere, et al. 2009. "The Genomic Basis of Trophic Strategy in Marine Bacteria." Proceedingsof the NationalAcademy of Sciences 106 (37) (September 15): 15527-15533. doi:10.1073/pnas.0903507106. Lazzeroni, L., and A. Owen. 2002. Statistica Sinica 12 (1): 61-86. Leaphart, A. B., D. K. Thompson, K. Huang, E. Alm, X. F. Wan, A. Arkin, S. D. Brown, L. Wu, T. Yan, and X. Liu. 2006. "Transcriptome Profiling of Shewanella Oneidensis Gene Expression Following Exposure to Acidic and Alkaline pH." Journalof Bacteriology 188 (4): 1633. Leary, Daniel J., and Sui Huang. 2001. "Regulation of Ribosome Biogenesis Within the Nucleolus." FEBS Letters 509 (2): 145 <last-page> 150. Lee, M. C., and C. J. Marx. 2012. "Repeated, Selection-driven Genome Reduction of Accessory Genes in Experimental Populations." PLoS Genetics 8 (5): e1002651. Lee, Z. M. P., C. Bussema Ill, and T. M. Schmidt. 2009. "rrnDB: Documenting the Number of rRNA and tRNA Genes in Bacteria and Archaea." Nucleic Acids Research 37 (suppl 1): D489-D493. Lef~bure, T., P. D. P. Bitar, H. Suzuki, and M. J. Stanhope. 2010. "Evolutionary Dynamics of Complete Campylobacter Pan-genomes and the Bacterial Species Concept." Genome Biology and Evolution 2: 646. Lennon, N. J., R. E. Lintner, S. Anderson, P. Alvarez, A. Barry, W. Brockman, R. Daza, R. L. Erlich, G. Giannoukos, and L. Green. 2010. "Method A Scalable, Fully Automated Process for Construction of Sequence-ready Barcoded Libraries for 454." http://www.biomedcentral.com/content/pdf/gb-2010-11-2-r15.pdf. Letunic, I., and P. Bork. 2011. "Interactive Tree Of Life V2: Online Annotation and Display of Phylogenetic Trees Made Easy." Nucleic Acids Research 39 (Web Server) (April 5): W475-W478. doi:10.1093/nar/gkr201. Levine, Stuart. 2012. "Biomicrocenter Website." http://openwetware.org/wiki/BioMicroCenter:Pricing. Li, H., and R. Durbin. 2009. "Fast and Accurate Short Read Alignment with BurrowsWheeler Transform." Bioinformatics 25 (14): 1754-1760. Li, H., J. Ruan, and R. Durbin. 2008. "Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores." Genome Research 18 (11): 1851-1858. Lilburn, Timothy G., Hong Cai, and Jianying Gu. 2009. "The Core and Pan-Genome of the Vibrionaceae." In, 84-90. IEEE. doi:10.1109/IJCBS.2009.60. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5260735. Logares, Ramiro, Thomas H.A. Haverkamp, Surendra Kumar, Anders Lanz6n, Alexander J. Nederbragt, Christopher Quince, and H avard Kauserud. 2012. "Environmental Microbiology Through the Lens of High-throughput DNA Sequencing: Synopsis of Current Platforms and Bioinformatics Approaches."Journalof Microbiological Methods 91 (1) (October): 106-113. doi:10.1016/j.mimet.2012.07.017. Lopez, L. C. S., M. P. de Aguiar Fracasso, D. 0. Mesquita, A. R. T. Palma, and P. Riul. 2012. "The Relationship Between Percentage of Singletons and Sampling Effort: A New Approach to Reduce the Bias of Richness Estimates." EcologicalIndicators 14 (1): 164-169. Lovley, D. R., S. J. Giovannoni, D. C. White, J. E. Champine, E. J. P. Phillips, Y. A. Gorby, and S. Goodwin. 1993. "Geobacter Metallireducens Gen. Nov. Sp. Nov., a Microorganism Capable of Coupling the Complete Oxidation of Organic Compounds to the Reduction of Iron and Other Metals." Archives of Microbiology 159 (4): 336-344. Lozada-Chavez, I., S. C. janga, and J. Collado-Vides. 2006. "Bacterial Regulatory Networks Are Extremely Flexible in Evolution." Nucleic Acids Research 34 (12): 3434-3445. 151 Luo, Chengwei, Seth T. Walk, David M. Gordon, Michael Feldgarden, James M. Tiedje, and Konstantinos T. Konstantinidis. 2011. "Genome Sequencing of Environmental Escherichia Coli Expands Understanding of the Ecology and Speciation of the Model Bacterial Species." Proceedingsof the NationalAcademy of Sciences 108 (17) (April 26): 7200-7205. doi:10.1073/pnas.1015622108. Lynch, M. 2007. "The evolution of genetic networks by non-adaptive processes." Nature reviews.Genetics 8 (10) (October): 803-813. Lynch, Michael. 2011. "Statistical Inference on the Mechanisms of Genome Evolution." Ed. Nancy A. Moran. PLoS Genetics 7 (6) (June 9): e1001389. doi:10.1371/journal.pgen.1001389. Macalalad, A. R., M. C.Zody, P. Charlebois, N. J.Lennon, R. M. Newman, C.M. Malboeuf, E. M. Ryan, et al. 2012. "Highly Sensitive and Specific Detection of Rare Variants in Mixed Viral Populations from Massively Parallel Sequence Data." PLoS Computational Biology 8 (3): e1002417. Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. 2012. Cluster: ClusterAnalysis Basics and Extensions. Makarova, K, A Slesarev, Y Wolf, A Sorokin, B Mirkin, E Koonin, A Pavlov, et al. 2006. "Comparative Genomics of the Lactic Acid Bacteria." Proceedingsof the National Academy of Sciences of the United States ofAmerica 103 (42) (October 17): 15611- 15616. doi:10.1073/pnas.0607117103. Malmstrom, R. R., S. Rodrigue, K. H. Huang, L. Kelly, S. E. Kern, A. Thompson, S. Roggensack, P. M. Berube, M. R. Henn, and S. W. Chisholm. 2012. "Ecology of Uncultured Prochlorococcus Clades Revealed Through Single-cell Genomics and Biogeographic Analysis." The ISME Journal. http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej201289a.html. Markowitz, V. M., K. Mavromatis, N. N. Ivanova, I. Chen, A. Min, K. Chu, and N. C. Kyrpides. 2009. "IMG ER: a System for Microbial Genome Annotation Expert Review and Curation."Bioinformatics 25 (17): 2271-2278. Maurer, L. M., E. Yohannes, S. S. Bondurant, M. Radmacher, and J. L. Slonczewski. 2005. "pH Regulates Genes for Flagellar Motility, Catabolism, and Oxidative Stress in Escherichia Coli K-12." Journalof Bacteriology 187 (1): 304. McAdams, H. H., B. Srinivasan, and A. P. Arkin. 2004. "The Evolution of Genetic Regulatory Systems in Bacteria." Nature Reviews Genetics 5 (3): 169-178. Miller, Christopher, Brett Baker, Brian Thomas, Steven Singer, and Jillian Banfield. 2011. "EMIRGE: Reconstruction of Full-length Ribosomal Genes from Microbial Community Short Read Sequencing Data." Genome Biology 12 (5) (May 19): R44. doi:10.1186/gb-2011-12-5-r44. Mitchell, Amir, Gal H Romano, Bella Groisman, Avihu Yona, Erez Dekel, Martin Kupiec, Orna Dahan, and Yitzhak Pilpel. 2009. "Adaptive Prediction of Environmental Changes by Microorganisms." Nature 460 (7252) (July 9): 220-224. doi:10.1038/nature08112. Mittenhuber, G. 2002. "A Phylogenomic Study of the General Stress Response Sigma Factor sigmaB of Bacillus Subtilis and Its Regulatory Proteins."Journalof Molecular Microbiology and Biotechnology 4 (4): 427. Moreno, C., J. Romero, and R. T. Espejo. 2002. "Polymorphism in Repeated 16S rRNA Genes Is a Common Property of Type Strains and Environmental Isolates of the Genus Vibrio." Microbiology 148 (4): 1233-1239. Mukhopadhyay, A., Z. He, E. J.Alm, A. P. Arkin, E. E. Baidoo, S. C.Borglin, W. Chen, T. C. Hazen, Q. He, and H. Y. Holman. 2006. "Salt Stress in Desulfovibrio Vulgaris Hildenborough: An Integrated Genomics Approach." Journalof Bacteriology 188 (11): 4068-4078. 152 Mulder, NicolaJ, Rolf Apweiler, TeresaK Attwood, Amos Bairoch, Alex Bateman, David Binns, Peer Bork, et al. 2007. "New Developments in the InterPro Database." Nucleic Acids Research 35 (suppl_1) (January 12): D224-228. Muzzi, A., and C. Donati. 2011. "Population Genetics and Evolution of the Pan-genome of< I> Streptococcus Pneumoniae</i>." InternationalJournalof Medical Microbiology. 422 111000944. http://www.sciencedirect.com/science/article/pii/S1438 Nekrutenko, A., and J. Taylor. 2012. "Next-generation Sequencing Data Interpretation: Enhancing Reproducibility and Accessibility." Nature Reviews Genetics 13 (9): 667- 672. Nishiguchi, Michele K., and Vinod S. Nair. 2003. "Evolution of Symbiosis in the Vibrionaceae: a Combined Approach Using Molecules and Physiology." Intern ationalJournalof Systematic and Evolutionary Microbiology 53 (6) (November 1): 2019-2026. doi:10.1099/ijs.0.02792-0. Niu, Beifang, Zhengwei Zhu, Limin Fu, Sitao Wu, and Weizhong Li. 2011. "FR-HIT, a Very Fast Program to Recruit Metagenomic Reads to Homologous Reference Genomes." Bioinformatics27 (12) (June 15): 1704-1705. doi:10.1093/bioinformatics/btr252. Oakley, T. H., Z. Gu, E. Abouheif, N. H. Patel, and W. H. Li. 2005. "Comparative Methods for the Analysis of Gene-Expression Evolution: An Example Using Yeast Functional Genomic Data." Molecular Biology and Evolution 22 (1): 40-50. Pal, C., B. Papp, and M. J. Lercher. 2005. "Adaptive evolution of bacterial metabolic networks by horizontal gene transfer." Nature genetics 37 (12) (December): 1372-1375. Pan, Wei. 2002. "A Comparative Review of Statistical Methods for Discovering Differentially Expressed Genes in Replicated Microarray Experiments." Bioinformatics 18 (4) (April 1): 546-554. doi:10.1093/bioinformatics/18.4.546. Papa, Eliseo, Michael Docktor, Christopher Smillie, Sarah Weber, Sarah P. Preheim, Dirk Gevers, Georgia Giannoukos, et al. 2012. "Non-Invasive Mapping of the Gastrointestinal Microbiota Identifies Children with Inflammatory Bowel Disease." PLoS ONE 7 (6) (June 29): e39242. doi:10.1371/journal.pone.0039242. Polz, M. F., D. E. Hunt, S. P. Preheim, and D. M. Weinreich. 2006. "Patterns and Mechanisms of Genetic and Phenotypic Differentiation in Marine Microbes." Philosophical Transactionsof the Royal Society B: BiologicalSciences 361 (1475) (November 29): 2009-2021. doi:10.1098/rstb.2006.1928. Preheim, S. P. 2010. "Ecology and Population Structure of Vibrionaceae in the Coastal Ocean". Massachusetts Institute of Technology. http://dspace.mit.edu/handle/1721.1/58184. Preheim, S. P., Y. Boucher, H. Wildschutte, L. A. David, D. Veneziano, E. J. Alm, and M. F. Polz. 2011. "Metapopulation Structure of Vibrionaceae Among Coastal Marine Invertebrates." EnvironmentalMicrobiology 13 (1): 265-275. Preheim, S. P., S. Timberlake, and M. F. Polz. 2011. "Merging Taxonomy with Ecological Population Prediction in a Case Study of Vibrionaceae." Applied and Environmental Microbiology 77 (20): 7195-7206. Price, M. N., K. H. Huang, A. P. Arkin, and E. J. Alm. 2005. "Operon Formation Is Driven by Coregulation and Not by Horizontal Gene Transfer." Genome Research 15 (6): 809-819. Price, Morgan N., Adam P. Arkin, and Eric J.Alm. 2006. "The Life-Cycle of Operons." PLoS Genetics 2 (6) (June): e96 OP. Price, MorganN, ParamvirS Dehal, and AdamP Arkin. 2007. "Orthologous Transcription Factors in Bacteria Have Different Functions and Regulate Different Genes." PLoS ComputationalBiology 3 (9) (September): e175 OP. 153 Quince, Christopher, Thomas P. Curtis, and William T. Sloan. 2008. "The Rational Exploration of Microbial Diversity." The ISMEJournal2 (10): 997-1006. doi:10.1038/ismej.2008.69. Ricklefs, R. E., and J. M. Starck. 1996. "Applications of Phylogenetically Independent Contrasts: a Mixed Progress Report." Oikos 77 (1): 167-172. Rifkin, S. A., J. Kim, and K. P. White. 2003. "Evolution of Gene Expression in the Drosophila Melanogaster Subgroup." Nature Genetics 33 (2): 138-144. Riley, M. 1993. "Functions of the Gene Products of Escherichia Coli." MicrobiologicalReviews 57 (4): 862. Rocap, G., F. W. Larimer, J. Lamerdin, S. Malfatti, P. Chain, N. A. Ahlgren, A. Arellano, M. Coleman, L. Hauser, and W. R. Hess. 2003. "Genome Divergence in Two Prochlorococcus Ecotypes Reflects Oceanic Niche Differentiation." Nature 424 (6952): 1042-1047. Rocha, E. P. C. 2003. "DNA Repeats Lead to the Accelerated Loss of Gene Order in Bacteria." TRENDS in Genetics 19 (11): 600-603. Rodrigue, Sebastien, Arne C. Materna, Sonia C. Timberlake, Matthew C. Blackburn, Rex R. Malmstrom, Eric J. Alm, and Sallie W. Chisholm. 2010. "Unlocking Short Read Sequencing for Metagenomics." Ed. Jack Anthony Gilbert. PLoS ONE 5 (7) (July 28): e11840. doi:10.1371/journal.pone.0011840. Roguev, A., S. Bandyopadhyay, M. Zofall, K. Zhang, T. Fischer, S. R. Collins, H. Qu, et al. 2008. "Conservation and rewiring of functional modules revealed by an epistasis map in fission yeast." Science (New York, N.Y.) 322 (5900) (October 17): 405-410. Rustici, G., J. Mata, K. Kivinen, P. Lio, C.J. Penkett, G. Burns, J. Hayles, A. Brazma, P. Nurse, and J. Bahler. 2004. "Periodic gene expression program of the fission yeast cell cycle." Nature genetics 36 (8) (August): 809-817. Rzhetsky, Andrey, Edoardo M. Airoldi, Curtis Huttenhower, David Gresham, Charles Lu, Amy A. Caudy, Maitreya J. Dunham, James R. Broach, David Botstein, and Olga G. Troyanskaya. 2009. "Predicting Cellular Growth from Gene Expression Signatures." PLoS ComputationalBiology 5 (1): e1000257. Salgado, H., S. Gama-Castro, M. Peralta-Gil, E. Diaz-Peredo, F. Sanchez-Solano, A. SantosZavaleta, I. Martinez-Flores, et al. 2006. "RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions." Nucleic acids research 34 (Database issue) (January 1): D394-7. Salzberg, S. L., D. D. Sommer, D. Puiu, and V. T. Lee. 2008. "Gene-boosted Assembly of a Novel Bacterial Genome from Very Short Reads." PLoS ComputationalBiology 4 (9): e1000186. Schloss, P. D., S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B. Hollister, R. A. Lesniewski, B. B. Oakley, D. H. Parks, and C. J. Robinson. 2009. "Introducing Mothur: Open-source, Platform-independent, Community-supported Software for Describing and Comparing Microbial Communities." Applied and EnvironmentalMicrobiology 75 (23): 7537-7541. Shapiro, B. J. 2010. "Genomic Signatures of Sex, Selection and Speciation in the Microbial World". Massachusetts Institute of Technology. http://dspace.mit.edu/handle/1721.1/61788. Shapiro, B. J., L. A. David, J. Friedman, and E. J. Alm. 2009. "Looking for Darwin's Footprints in the Microbial World." Trends in Microbiology 17 (5): 196-204. Shapiro, B. J., J. Friedman, 0. X. Cordero, S. P. Preheim, S. C.Timberlake, G. Szab6, M. F. Polz, and E. J. Alm. 2012. "Population Genomics of Early Events in the Ecological Differentiation of Bacteria." Science 336 (6077): 48-51. 154 Shatkay, H., S. Edwards, W. J. Wilbur, and M. Boguski. 2000. "Genes, Themes and Microarrays." In Int. Conf on Intelligent Systems in Molecular Biology. http://www.aaai.org/Papers/ISMB/2000/ISMB0O-033.pdf. Shi, L., L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J. Collins, F. De Longueville, E. S. Kawasaki, and K. Y. Lee. 2006. "The MicroArray Quality Control (MAQC) Project Shows Inter-and Intraplatform Reproducibility of Gene Expression Measurements." Nature Biotechnology 24 (9): 1151-1161. Shippy, R., S. Fulmer-Smentek, R. V. Jensen, W. D. Jones, P. K. Wolber, C. D. Johnson, P. S. Pine, C. Boysen, X. Guo, and E. Chudin. 2006. "Using RNA Sample Titrations to Assess Microarray Platform Performance and Normalization Techniques." Nature Biotechnology 24 (9): 1123-1131. Shokralla, Shadi, Jennifer L Spall, Joel F Gibson, and Mehrdad Hajibabaei. 2012. "Nextgeneration Sequencing Technologies for Environmental DNA Research." Molecular Ecology 21 (8) (April): 1794-1805. doi:10.1111/j.1365-294X.2012.05538.x. Sim, Siew Hoon, Yiting Yu, Chi Ho Lin, R. Krishna M. Karuturi, Vanaporn Wuthiekanun, Apichai Tuanyok, Hui Hoon Chua, et al. 2008. "The Core and Accessory Genomes of Burkholderia Pseudomallei: Implications for Human Melioidosis." PLoS Pathog 4 (10) (October 17): e1000178. doi:10.1371/journal.ppat.1000178. Simon, M., H. P. Grossart, B. Schweitzer, and H. Plough. 2002. "Microbial Ecology of Organic Aggregates in Aquatic Ecosystems." Aquatic MicrobialEcology 28 (2): 175-2 11. Singh, A. H., D. M. Wolf, P. Wang, and A. P. Arkin. 2008. "Modularity of Stress Response Evolution." Science's STKE 105 (21): 7500. Slavov, N., E. M. Airoldi, A. van Oudenaarden, and D. Botstein. 2012. "A Conserved Cell Growth Cycle Can Account for the Environmental Stress Responses of Divergent Eukaryotes." MolecularBiology of the Cell 23 (10) (March 28): 1986-1997. doi: 10.1091/mbc.E1 1-1 1-0961. Smillie, Chris S., Mark B. Smith, Jonathan Friedman, Otto X. Cordero, Lawrence A. David, and Eric J. Alm. 2011. "Ecology Drives a Global Network of Gene Exchange Connecting the Human Microbiome." Nature. Smoot, Michael E, Keiichiro Ono, Johannes Ruscheinski, Peng-Liang Wang, and Trey Ideker. 2011. "Cytoscape 2.8: New Features for Data Integration and Network Visualization." Bioinformatics (Oxford, England) 27 (3) (February 1): 431-432. doi:10.1093/bioinformatics/btq675. Soergel, David A W, Neelendu Dey, Rob Knight, and Steven E Brenner. 2012. "Selection of Primers for Optimal Taxonomic Classification of Environmental 16S rRNA Gene Sequences." The ISME Journal6 (7) (July): 1440-1444. doi:10.1038/ismej.2011.208. Sommer, D. D., A. L. Delcher, S. L. Salzberg, and M. Pop. 2007. "Minimus: a Fast, Lightweight Genome Assembler." BMC Bioinformatics 8 (1): 64. Sorber, K., C. Chiu, D. Webster, M. Dimon, J. G. Ruby, A. Hekele, and J. L. DeRisi. 2008. "The Long March: a Sample Preparation Technique That Enhances Contig Length and Coverage by High-throughput Short-read Sequencing." PloS One 3 (10): e3495. Staron, Anna, and Thorsten Mascher. 2010. "General Stress Response in A-proteobacteria: PhyR and Beyond." Molecular Microbiology 78 (2): 271 <last-page> 277. Stewart, F. J., A. K. Sharma, J. A. Bryant, J. M. Eppley, E. F. DeLong, and others. 2011. "Community Transcriptomics Reveals Universal Patterns of Protein Sequence Conservation in Natural Microbial Communities." Genome Biology 12 (3): R26. Stolyar, S., Q. He, M. P. Joachimiak, Z. He, Z. K. Yang, S. E. Borglin, D. C. Joyner, et al. 2007. "Response of Desulfovibrio vulgaris to Alkaline Stress."Journalof Bacteriology (October 5). 155 Stuart, J. M., E. Segal, D. Koller, and S. K. Kim. 2003. "A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules." Science 302 (5643): 249-255. Su, C., J. M. Peregrin-Alvarez, G. Butland, S. Phanse, V. Fong, A. Emili, and J. Parkinson. 2008. "Bacteriome. Org-an Integrated Protein Interaction Database for E. Coli." Nucleic Acids Research 36 (suppl 1): D632. Suyama, M., D. Torrents, and P. Bork. 2006. "PAL2NAL: Robust Conversion of Protein Sequence Alignments into the Corresponding Codon Alignments." NucleicAcids Research 34 (suppl 2): W609-W612. Suzuki, S., T. Ishida, K. Kurokawa, and Y.Akiyama. 2012. "GHOSTM: A GPU-Accelerated Homology Search Tool for Metagenomics." PLoS One 7 (5): e36060. Szabo, Gitta, Sarah P. Preheim, Kathryn M. Kauffman, Lawrence A. David, Jesse Shapiro, Eric J. Alm, and Martin F. Polz. 2012. "Reproducibility of Vibrionaceae Population Structure in Coastal Bacterioplankton." The ISME Journal(November 22). doi:10.1038/ismej.2012.134. http://www.nature.com/ismej/journal/vaop/ncurrent/abs/ismej2012134a.html. Szklarczyk, D., A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, et al. 2010. "The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored." Nucleic acids research (November 2). www.refworks.com. Tai, S. L., V. M. Boer, P. Daran-Lapujade, M. C.Walsh, J.H. de Winde, J.M. Daran, and J.T. Pronk. 2005. "Two-dimensional transcriptome analysis in chemostat cultures. Combinatorial effects of oxygen availability and macronutrient limitation in Saccharomyces cerevisiae." The Journalof biologicalchemistry 280 (1) (January 7): 437-447. Tai, S. L., I. Snoek, M. A. H. Luttik, M. J.H. Almering, M. C.Walsh, J.T. Pronk, and J.M. Daran. 2007. "Correlation Between Transcript Profiles and Fitness of Deletion Mutants in Anaerobic Chemostat Cultures of Saccharomyces Cerevisiae."Microbiology 153 (3): 877. Tai, V., A. F. Y. Poon, I. T. Paulsen, and B. Palenik. 2011. "Selection in Coastal Synechococcus (Cyanobacteria) Populations Evaluated from Environmental Metagenomes." PloS One 6 (9): e24249. Taoka, M., Y. Yamauchi, T. Shinkawa, H. Kaji, W. Motohashi, H. Nakayama, N. Takahashi, and T. Isobe. 2004. "Only a Small Subset of the Horizontally Transferred Chromosomal Genes in Escherichia Coli Are Translated into Proteins." Molecular& Cellular Proteomics 3 (8): 780-787. Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, et al. 2003. "The COG database: an updated version includes eukaryotes." BMC bioinformatics4 (September 11): 41. Tautz, Diethard, Hans Ellegren, and Detlef Weigel. 2010. "Next Generation Molecular Ecology." MolecularEcology 19: 1-3. doi:10.111 1/j.1365-294X.2009.04489.x. Team, R. Core. 2012. R: A Language and Environmentfor StatisticalComputing. Vienna, Austria. http://www.R-project.org. Tettelin, H., V. Masignani, M. J. Cieslewicz, C. Donati, D. Medini, N. L. Ward, S. V. Angiuoli, et al. 2005. "Genome Analysis of Multiple Pathogenic Isolates of Streptococcus Agalactiae: Implications for the Microbial 'pan-genome'." Proceedingsof the National Academy of Sciences of the United States ofAmerica 102 (39): 13950-13955. Thompson, J. R. 2005. "Genotypic Diversity Within a Natural Coastal Bacterioplankton Population." Science 307 (5713) (February 25): 1311-1313. doi:10.1126/science.1106028. 156 Thompson, J. R., M. A. Randa, L. A. Marcelino, A. Tomita-Mitchell, E. Lim, and M. F. Polz. 2004. "Diversity and Dynamics of a North Atlantic Coastal Vibrio Community." Applied and Environmental Microbiology 70 (7): 4103-4110. Timberlake, Sonia. 2012. "SHERA Website." Almlab Software. Accessed October 27. http://almlab.mit.edu/ALM/Software/Software.html. Tirosh, I., A. Weinberger, M. Carmi, and N. Barkai. 2006. "A Genetic Signature of Interspecies Variations in Gene Expression." Nature Genetics 38: 830-834. Treangen, Todd J., and Eduardo P. C. Rocha. 2011. "Horizontal Transfer, Not Duplication, Drives the Expansion of Protein Families in Prokaryotes." PLoS Genet 7 (1) (January 27): e1001284. doi:10.1371/journal.pgen.1001284. Tuller, T., Y. Girshovich, Y. Sella, A. Kreimer, S. Freilich, M. Kupiec, U. Gophna, and E. Ruppin. 2011. "Association between translation efficiency and horizontal gene transfer within microbial communities." Nucleic acids research 39 (11): 4743-4755. Uniprot Consortium. 2011. "Ongoing and Future Developments at the Universal Protein Resource." Nucleic Acids Res. 39: 214-219. Vallania, F., E. Ramos, S. Cresci, R. D. Mitra, and T. E. Druley. 2012. "Detection of Rare Genomic Variants from Pooled Sequencing Using SPLINTER." Journal of Visualized Experiments: JoVE (64). http://www.ncbi.nlm.nih.gov/pubmed/22760212. Venkateswaran, K., D. P. Moser, M. E. Dollhopf, D. P. Lies, D. A. Saffarini, B. J. MacGregor, D. B. Ringelberg, et al. 1999. "Polyphasic Taxonomy of the Genus Shewanella and Description of Shewanella Oneidensis Sp. Nov." InternationalJournalof Systematic Bacteriology 49 (2): 705-724. Waltman, P., T. Kacmarczyk, A. R. Bate, D. B. Kearns, D. J. Reiss, P. Eichenberger, and R. Bonneau. 2010. "Multi-species Integrative Biclustering." Genome Biology 11 (9): R96. Wang, L., S. Chen, K. L. Vergin, S. J. Giovannoni, S. W. Chan, M. S. DeMott, K. Taghizadeh, et al. 2011. "DNA Phosphorothioation Is Widespread and Quantized in Bacterial Genomes." Proceedingsof the NationalAcademy of Sciences 108 (7): 2963-2968. Warner, J. R. 1999. "The economics of ribosome biosynthesis in yeast." Trends in biochemical sciences 24 (11) (November): 437-440. Wetzel, Joshua, Carl Kingsford, and Mihai Pop. 2011. "Assessing the Benefits of Using Matepairs to Resolve Repeats in De Novo Short-read Prokaryotic Assemblies." BMC Bioinformatics 12 (1) (April 13): 95. doi:10.1186/1471-2105-12-95. Whitehead, A., and D. L. Crawford. 2006. "Neutral and adaptive variation in gene expression." Proceedingsof the NationalAcademy of Sciences of the United States of America 103 (14) (April 4): 5425-5430. Whitney, Kenneth D, Bastien Boussau, Eric J Baack, and Theodore Garland Jr. 2011. "Drift and Genome Complexity Revisited." PLoS Genetics 7 (6) (June): e1002092. doi:10.1371/journal.pgen.1002092. Whitney, Kenneth D, and Theodore Garland Jr. 2010. "Did Genetic Drift Drive Increases in Genome Complexity?" PLoS Genetics 6 (8) (August). doi:10.1371/journal.pgen.1001080. Wilder, Cara N., Stephen P. Diggle, and Martin Schuster. 2011. "Cooperation and Cheating in Pseudomonas Aeruginosa: The Roles of the Las, Rhl and Pqs Quorum-sensing Systems." The ISME Journal 5 (8): 1332-1343. doi:10.1038/ismej.2011.13. Yoder-Himes, D. R., P. S. G. Chain, Y. Zhu, 0. Wurtzel, E. M. Rubin, J. M. Tiedje, and R. Sorek. 2009. "Mapping the Burkholderia Cenocepacia Niche Response via High-throughput Sequencing." Proceedingsof the NationalAcademy of Sciences 106 (10): 3976. 157 Young, A. L., H. 0. Abaan, D. Zerbino, J.C.Mullikin, E. Birney, and E. H. Margulies. 2010. "A New Strategy for Genome Assembly Using Short Sequence Reads and Reduced Representation Libraries." Genome Research 20 (2): 249-256. Zagordi, 0., A. Bhattacharya, N. Eriksson, and N. Beerenwinkel. 2011. "ShoRAH: Estimating the Genetic Diversity of a Mixed Sample from Next-generation Sequencing Data." BMC Bioinformatics 12 (1): 119. Zarrineh, P., A. C. Fierro, A. Sanchez-Rodriguez, B. De Moor, K. Engelen, and K. Marchal. 2010. "COMODO: an adaptive coclustering strategy to identify conserved coexpression modules between organisms." Nucleic acids research. http://nar.oxfordjournals.org. Zaslaver, A., A. E. Mayo, R. Rosenberg, P. Bashkin, H. Sberro, M. Tsalyuk, M. G. Surette, and U. Alon. 2004. "Just-in-time Transcription Program in Metabolic Pathways." Nature Genetics 36: 486-49 1. Zerbino, D. R., and E. Birney. 2008. "Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs." Genome Research 18 (5): 821-829. Zerbino, D. R., G. K. McEwen, E. H. Margulies, and E. Birney. 2009. "Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-read De Novo Assembler." PloS One 4 (12): e8407. Zhang, Wei, Edward G. Dudley, and Joseph T. Wade. 2011. "Genomic and Transcriptomic Analyses of Foodborne Bacterial Pathogens." In Genomics of FoodborneBacterial Pathogens,ed. Martin Wiedmann and Wei Zhang, 311-341. New York, NY: Springer New York. http://www.springerlink.com/index/10.1007/978-1-4419-7686-4_10. Zheng, Dongling, Chrystala Constantinidou, Jon L Hobman, and Stephen D Minchin. 2004. "Identification of the CRP Regulon Using in Vitro and in Vivo Transcriptional Profiling."Nucleic Acids Research 32 (19): 5874-5893. doi:10.1093/nar/gkh9O8. Zhou, A., Z. He, A. M. Redding-Johanson, A. Mukhopadhyay, C. L. Hemme, M. P. Joachimiak, F. Luo, Y. Deng, K. S. Bender, and Q. He. 2010. "Hydrogen Peroxide-induced Oxidative Stress Responses in Desulfovibrio Vulgaris Hildenborough." Environmental Microbiology. Zinman, G. E., S. Zhong, and Z. Bar-Joseph. 2011. "Biological interaction networks are conserved at the module level." BMC systems biology 5 (August 23): 134-0509-5- 134. Zwick, M. E., S. J.Joseph, X. Didelot, P. E. Chen, K.A. Bishop-Lilly, A. C.Stewart, K. Willner, et al. 2012. "Genomic Characterization of the Bacillus Cereus Sensu Lato Species: Backdrop to the Evolution of Bacillus Anthracis." Genome Research 22 (8): 1512- 1524. 158