
Bacterial genes and genome dynamics in the environment
by
Sonia C. Timberlake
Bachelor of Science, Biology
California Institute of Technology, 2003
SUBMITTED TO THE DEPARTMENT OF BIOLOGICAL ENGINEERING
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY IN BIOLOGICAL ENGINEERING
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
FEBRUARY 2013
© 2013 Massachusetts Institute of Technology. All Rights Reserved.
Signature of Author:
Sonia C. Timberlake
January 11, 2013
Certified by:
Eric J. Alm
Associate Professor of Civil and Environmental Engineering and Biological Engineering
Thesis Supervisor
Accepted by:
Forest A. White
Associate Professor of Biological Engineering
Course 20 Graduate Committee Chairperson
This doctoral thesis has been examined by a committee of the Biological Engineering
Department as follows:
Ernest Fraenkel
Chairperson, Graduate Thesis Committee:
Associate Professor of Biological Engineering and Biology
Eric J. Alm
Thesis Advisor, Committee Member
Associate Professor of Civil and Environmental Engineering and Biological Engineering
Martin P. Polz
Committee Member
Professor of Civil and Environmental Engineering
Janelle R. Thompson
Committee Member
Assistant Professor of Civil and Environmental Engineering
Genes and genome dynamics in the environment
by
Sonia C. Timberlake
Submitted to the Department of Biological Engineering on January 11, 2013, in partial
fulfillment of the requirements for the degree of Doctor of Philosophy in Biological
Engineering
Abstract
One of the most marvelous features of microbial life is its ability to thrive in such diverse
and dynamic environments. My scientific interest lies in the variety of modes by which
microbial life accomplishes this feat. In the first half of this thesis I present tools to leverage
high throughput sequencing for the study of environmental genomes. In the second half of
this thesis, I describe modes of environmental adaptation by bacteria via gene content or
gene expression evolution.
Associating genes' usage and evolution to adaptation in various environments is a
cornerstone of microbiology. New technologies and approaches have revolutionized this
pursuit, and I begin by describing the computational challenges I resolved in order to bring
these technologies to bear on microbial genomics. In Chapter 1, I describe SHE-RA, an
algorithm that increases the useable read length of ultra-high throughput sequencing
technologies, thus extending their range of applications to include environmental
sequencing. In Chapter 2, I design a new hybrid assembly approach for short reads and
assemble 82 Vibrio genomes. Using the ecologically defined groups of this bacterial family, I
investigate the genomic and metabolic correlates of habitat and differentiation, and
evaluate a neutral model of gene content. In Chapter 3, I report the extent to which
orthologous genes in bacteria exhibit the same transcriptional response to the same change
in environment, and describe the features and functions of bacterial transcriptional
networks that are conserved. I conclude this thesis with a summary of my tools and results,
their use in other studies, and their relevance to future work. In particular, I discuss the
future experiments and analytical strategies that I am eager to see applied to compelling
open questions in microbial ecology and evolution.
Thesis Supervisor: Eric J. Alm
Associate Professor of Civil and Environmental Engineering and Biological Engineering
Acknowledgements
I want to begin by thanking the Biological Engineering department, for welcoming me to
MIT to let me pursue my love of science. Doug Lauffenburger, Al Grodzinsky, Forest White
and other members of the BE community have worked hard to build a scientific family that is
full of exciting new ideas and friendly, persistent encouragement to pursue them! In
particular I want to thank Dalia Fares for always signing and turning in who knows what all
paperwork I forgot every single term to keep me enrolled at MIT.
I want to thank Sheila Frankel for all of her excellent advice and encouragement and
indefatigable efforts, alongside Jim Long, to keep Parsons from falling down around us.
I want to thank my committee members, Ernest Fraenkel, Martin Polz, and Janelle
Thompson, for all of their feedback and advice. In particular I want to thank my advisor, Eric
Alm, for his absolutely unstinting support and enthusiasm. Thank you, Eric, for being even
more excited and optimistic about my projects than I was. To Lawrence David, Sean Clarke,
Jesse Shapiro and the rest of my lab mates in the Alm Lab, it was so lucky for me that you
are both exceptional scientists and human beings. In general, the people I've met in BE,
Parsons, and the MIT community at large, are not only ferociously intelligent and
interesting, but also so generous, accepting, and brave. I feel so lucky to have been a part of
these communities at MIT and at Caltech.
I want to specially thank all my family for always making a welcoming home for me
wherever you are. I want to thank my parents for raising me with the knowledge that
nothing I could ever ask of them would be too much. Megan Bartley, you were a daily source
of support and encouragement. Nina, thank you for being so sane.
Finally, a big thank you to my wonderful husband, Kenzie MacIsaac, who makes my
world a better, funnier, happier place.
Table of Contents
ABSTRACT
ACKNOWLEDGEMENTS
CHAPTER 0. INTRODUCTION
0.1 OVERVIEW
0.2 THE PIVOTAL IMPORTANCE OF DEEP COVERAGE IN ENVIRONMENTAL GENOMICS
0.3 A MODEL SYSTEM FOR BACTERIAL ECOLOGY AND EVOLUTION
0.4 CONSERVATION OF COEXPRESSION WITHOUT CONSERVATION OF RESPONSE IN BACTERIAL GENE REGULATORY NETWORKS
CHAPTER 1. UNLOCKING SHORT READS FOR METAGENOMICS
1.1 INTRODUCTION
1.2 RESULTS
1.3 DISCUSSION
1.4 MATERIALS AND METHODS
1.5 REFERENCES
1.6 CONTRIBUTIONS TO THIS MANUSCRIPT
1.7 APPENDIX
1.7.1 Supplemental Note: alignment confidence metric
CHAPTER 2. ASSEMBLY AND ANALYSIS OF 82 VIBRIO GENOMES
2.1 ABSTRACT
2.2 INTRODUCTION
2.3 TECHNICAL CHALLENGES AND NEW ASSEMBLY APPROACHES
2.3.1 Hybrid genome assembly approach for short reads
2.3.2 Pipeline for 16S rRNA gene assembly from short reads
2.3.3 Using whole genome phylogeny to support ecological population predictions
2.4 RESULTS AND DISCUSSION
2.4.1 Metabolic profiles' relationship with phylogeny and ecology
2.4.2 Gene content relationships with phylogeny and ecology
2.4.3 High rates of HGT among strains co-existing in the water column
2.4.4 Core and flexible gene turnover
2.4.5 Genes fixing within a population
2.5 CONCLUSIONS
2.6 METHODS
2.6.1 Whole genome shotgun libraries and sequencing
2.6.2 Hybrid assembly pipeline
2.6.3 16S "contig-mapping" assembly pipeline
2.6.4 16S phylogeny
2.6.5 16S distance matrix, clustering, and pairwise distances
2.6.6 Identifying recent HGT between populations
2.6.7 Counting recent HGT within 80 vibrio strains
2.6.8 Calling genes and annotating genomes
2.6.9 Building orthologous groups
2.6.10 Building concatenated gene species phylogenies
2.6.11 Genome similarity by protein content
2.6.12 Population similarity by protein content
2.6.13 Substrate utilization assays
2.6.14 Substrate utilization profiles and similarity
2.6.15 Population similarity by substrate utilization profile
2.7 AUTHOR CONTRIBUTIONS
2.8 REFERENCES
See main thesis document reference section
2.9 FIGURES AND LEGENDS
2.10 SUPPLEMENTAL NOTES AND DISCUSSION
2.10.1 Hybrid assembly pipeline: discussion of accuracy
2.10.2 16S rRNA phylogeny
2.10.3 Defining gene families in 82 genomes: sensitivity to methodology, sampling
2.10.4 Using gene content divergence to confirm population predictions
2.10.5 Habitat-specific gene gain and loss in V. cyclitrophicus
2.10.6 Gene deletions in general: bias, neutrality, or selection?
CHAPTER 3. CONSERVATION OF COEXPRESSION WITHOUT CONSERVATION OF RESPONSE IN BACTERIAL GENE REGULATORY NETWORKS
3.1 AUTHORSHIP & PUBLICATION NOTE
3.2 ABSTRACT
3.3 BACKGROUND
3.4 RESULTS AND DISCUSSION
3.4.1 Coexpression is significantly conserved across bacteria
3.4.2 Expression phenotypes are not conserved
3.5 CONCLUSIONS AND FUTURE DIRECTIONS
3.6 MATERIALS AND METHODS
3.6.1 Biomass production and stress application
3.6.2 Microarray extraction and hybridization
3.6.3 Computational methods
3.7 AUTHOR CONTRIBUTIONS
3.8 FIGURES AND LEGENDS
3.8.1 Supplemental Note: transcriptomic experiments obtained from other laboratories and their comparison
CHAPTER 4. CONCLUSIONS AND FUTURE DIRECTIONS
4.1 SHORT READS FOR METAGENOMICS
4.2 ASSEMBLY AND ANALYSIS OF 82 ENVIRONMENTAL BACTERIA
4.3 CONSERVATION OF COEXPRESSION WITHOUT CONSERVATION OF RESPONSE IN BACTERIAL GENE REGULATORY NETWORKS
CHAPTER 5. GLOSSARY
CHAPTER 6. REFERENCES
Index of Figures and Tables
Chapter 1
Figure 1. Size-dependent isolation of DNA fragments from sheared genomic DNA via dSPRI
Figure 2. Reproducibility of double-SPRI
Table 1. Oligonucleotides sequences
Figure 3. Quality of Illumina reads out to 143 bp
Figure 4. High-confidence alignment yield by insert lengths
Figure 5. Quality of composite fragments after overlapping
Figure 6. Composite Illumina reads constitute a legitimate alternative to pyrosequencing for metagenomics studies
Table 2. Cost estimates of sequencing strategies
Figure 7. Short insertions and deletions in low-confidence composite Illumina reads
Supplemental Figure 1. SHE-RA alignment confidence metric separates true alignments from mis-alignments
Supplemental Figure 2. Taxa abundances in metagenomics sample are similar when using overlapped Illumina or 454 sequencing
Chapter 2
Figure 1. Assembly N50 improvement using hybrid assembly approach
Figure 2. Species phylogeny of Vibrio genomes
Figure 3. Habitat and gene content
Figure 4. Habitat and substrate utilization profiles
Figure 5. Counting recent HGT within 80 vibrio isolates
Figure 6. Network of recent HGT between vibrio populations
Figure 7. Core and pan genomes for 4 species definitions
Figure 8. V. cyclitrophicus strains clustered by flexible gene content and neutral markers
Supplemental Figure 1. Distribution of polymorphisms in 16S genes assembled with short read pipeline
Supplemental Figure 2. 16S phylogeny of our populations and other vibrio type strains
Supplemental Figure 3. Gene content divergence with phylogenetic distance
Supplemental Figure 4. Overlap in gene content between large-particle-associated populations
Supplemental Figure 5. Shared rare genes by same-filter strains
Supplemental Figure 6. Substrate utilization in the Vibrio family
Supplemental Figure 7. Substrate utilization profile divergence with phylogenetic distance
Supplemental Figure 8. Distribution of gene family penetrance in V. cyclitrophicus
Chapter 3
Figure 1. Conservation of coexpression across the bacterial domain
Figure 2. Transcriptomic response to pH is highly variable, even within the same species.
Figure 3. Comparison of phenotype in our stress compendium of S. oneidensis and D. vulgaris
Figure 4. Comparison of phenotypes in our stress compendium of D. vulgaris and G. metallireducens.
Table 1. Genes of our heat shock module
Figure 5. Conserved response of E. coli heat shock module across five species despite global divergence of response.
Supplemental Figure 1. Comparison of phenotypes in our stress compendium of S. oneidensis and D. vulgaris using all putative orthologs
Supplemental Table 1. Significantly-changing genes common to multiple species' condition-specific responses
Chapter 0.
Introduction
0.1 Overview
This is a very exciting time to be investigating the wonderfully diverse and complex
microbial world. One of the most marvelous features of microbial life is its ability to thrive
in such diverse and dynamic environments. My scientific interest lies in the variety of
modes by which microbial life accomplishes this feat. In the first half of this thesis I present
tools to leverage high throughput sequencing for the study of environmental genomes. In
the second half of this thesis, I describe modes of environmental adaptation by bacteria via
gene content or gene expression evolution.
Associating genes' usage and evolution to adaptation in various environments is a
cornerstone of microbiology. New technologies and approaches have revolutionized this
pursuit, and I begin by describing the computational challenges I resolved in order to bring
these technologies to bear on microbial genomics. In Chapter 1, I describe SHE-RA, an
algorithm that increases the useable read length of ultra-high throughput sequencing
technologies, thus extending their range of applications to include environmental
sequencing. In Chapter 2, I design a new hybrid assembly approach for short reads and
assemble 82 Vibrio genomes. Using the ecologically defined groups of this bacterial family, I
investigate the genomic and metabolic correlates of habitat and differentiation, and
evaluate a neutral model of gene content. In Chapter 3, I report the extent to which
orthologous genes of co-existing bacteria exhibit the same transcriptional response to the
same change in environment, and describe the features and functions of bacterial
transcriptional networks that are conserved. I conclude this thesis with a summary of my
tools and results, their use in other studies, and their relevance to future work. In particular,
I discuss the future experiments and analytical strategies that I am eager to see applied to
compelling open questions in microbial ecology and evolution.
0.2 The pivotal importance of deep coverage in environmental
genomics
While the deep sequencing enabled by cheap next generation technologies has
revolutionized our understanding of the diversity and dynamics of the microbial world
(Tautz, Ellegren, and Weigel 2010), short read length has prevented environmental surveys
from making use of the most prolific platforms. Of these ultra high-throughput sequencing
(UHTS) platforms, Illumina offered slightly longer reads than SOLiD (Shokralla et al. 2012),
but at ~100 bp, still fell below the length threshold where most proteins can be reliably
identified (Hoff 2009; Brady and Salzberg 2009) and where most bacteria can be identified
by standard community profiling methods. Indeed, the majority of Illumina 16S ribosomal
reads from human gut microbiota could not be classified down to the genus level, even if the
optimal variable region was targeted for sequencing (Soergel et al. 2012; Claesson et al.
2010). And despite the vast coverage improvement compared to 454, diversity estimates
from Illumina were unreliable (Claesson et al. 2010). Efforts to obtain longer sequences
from short-read instruments have been made (Hiatt et al. 2010; Sorber et al. 2008), but
these multistep library preparations impose a complexity bottleneck, limiting their utility
for applications like environmental genomics (Hiatt et al. 2010). Thus, work in these areas
chiefly relied on longer read technologies (Logares et al. 2012; Bruijn 2011), and suffered
the ~48-160x drop in the number of reads per dollar (Rodrigue et al. 2010; Bruijn 2011).
The next generation sequencing revolution has helped us to appreciate the
extremely diverse and dynamic nature of microbial communities, and that we have sampled
only a fraction of their diversity (Quince, Curtis, and Sloan 2008). Study after study reports
that their community profiles suffer from tremendous undersampling in many and varied
environments (Fierer et al. 2007; Fuhrman et al. 2008;
J. A. Gilbert et al. 2009; Lopez et al.
2012), such that many estimates of diversity involve substantial extrapolations of species
abundance curves (Lopez et al. 2012). In some cases, increasing sequencing depth does not
just improve our comprehension of a system, but makes it possible. To learn to triage
potential pediatric IBD patients non-invasively, stool community profiles were fed to a
machine learning tool, and the taxa it relied on as discriminating features to build the
classifier were mainly low-abundance. The vast majority of these taxa would have escaped
observation without the coverage afforded by UHTS technologies, making this non-invasive
diagnosis method not just less effective but most likely impossible (Papa et al. 2012).
Community function is orders of magnitude more diverse, and so the undersampling is even
more severe. In the metatranscriptomic sample of a coral holobiont community, sequencing
to a depth of 0.5-1e6 reads resulted in the majority of identified microbial transcripts
sampled by just a single read (Timberlake, unpublished data), partially because a large
fraction of the sample was dominated by ribosomal reads, which is typical even after
substantial efforts to chemically deplete them (Giannoukos et al. 2012; Ciulla et al. 2010).
For a 1 million read marine microbial community transcriptome, a tracer RNA was used to
estimate sample coverage at less than 10e-7 (Gifford et al. 2010). The microbial consortia
that naturally live on and interact with humans and all living things are increasingly
recognized as contributing to health and disease in complex ways; it is thus important not
only for our understanding of the microbial world but for a variety of fields to extend the
deep coverage afforded by UHTS platforms to environmental microbiology studies.
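The undersampling argument above can be made concrete with a minimal sketch. It uses Good's coverage estimator, C = 1 - n1/N, where n1 is the number of taxa (or transcripts) observed exactly once and N is the total number of reads. This estimator is a standard illustration chosen here for simplicity; it is not the tracer-RNA approach or any other method used in the cited studies, and the function name and toy counts are hypothetical.

```python
from collections import Counter

def goods_coverage(counts):
    """Good's estimator of sample coverage: 1 - (singletons / total reads).

    `counts` maps each observed taxon or transcript to its read count.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads observed")
    # Items seen exactly once suggest that more unseen items remain.
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / total

# Toy data (hypothetical): one dominant taxon plus several rare ones.
reads = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
counts = Counter(reads)
coverage = goods_coverage(counts)
```

When most observed items are singletons, as in the metatranscriptomic example above, the estimate approaches zero, which is one way to see why deeper sequencing is needed.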
In Chapter 1, I describe SHE-RA, an algorithm that helps remove both the read
length and error rate barriers to the use of short reads. Short fragments of DNA are
sequenced from both ends, such that the low-quality termini of each read pair overlap. By
aligning these two fragments and combining the information of two largely independent
measurements of the base identity at these overlapping positions, SHE-RA corrects errors
and extends the useable read length. If shotgun DNA is of interest, generating a narrow
library size distribution appropriate for overlapping read pairs requires some precision in
the DNA fragment size-selection step, and so SHE-RA is presented in conjunction with an
optimized chemical protocol for using dSPRI magnetic beads in precise DNA fragment size
selection (Lennon et al. 2010) as part of a scalable high-throughput protocol (Rodrigue et al.
2010). I compare SHE-RA output (overlapped Illumina read pairs) to the standard
operating procedure for environmental sequencing at the time, 454-FLX, and show that it is
a legitimate alternative to 454 for identifying protein functions and taxa in a shotgun
marine metagenome. While the SHE-RA algorithm was tested on Illumina reads, it is
designed to be largely platform-independent, and will, in principle, work well with any
sequencing technology that produces paired-end reads, thus making metagenomics,
transcriptomics, and community profiling tractable for the shorter reads produced by UHTS
technologies.
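The general idea of overlapping a read pair and combining two measurements of each base can be sketched as follows. This is a simplified illustration, not the SHE-RA implementation: the overlap search (lowest mismatch fraction), the quality-combination rule (sum Phred scores on agreement, keep the higher-quality base and the score difference on disagreement), the quality cap, and all function and parameter names are assumptions made for the example. It also assumes the sequenced fragment is at least as long as each read.

```python
MAX_Q = 41  # assumed cap on combined Phred scores (hypothetical choice)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def best_overlap(fwd, rev_rc, min_overlap=10, max_mismatch_frac=0.1):
    """Return the overlap length with the lowest mismatch fraction, or None."""
    best = None
    # Scan from the longest candidate overlap down to the minimum allowed.
    for n in range(min(len(fwd), len(rev_rc)), max(min_overlap, 1) - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-n:], rev_rc[:n]))
        frac = mismatches / n
        if frac <= max_mismatch_frac and (best is None or frac < best[1]):
            best = (n, frac)
    return None if best is None else best[0]

def merge_pair(fwd, fwd_q, rev, rev_q, min_overlap=10):
    """Merge a read pair into one composite read with per-base qualities.

    fwd_q and rev_q are lists of integer Phred scores, one per base,
    each in the orientation of its own read.
    """
    rev_rc = reverse_complement(rev)
    rev_rc_q = rev_q[::-1]
    n = best_overlap(fwd, rev_rc, min_overlap)
    if n is None:
        return None  # no acceptable overlap; the pair is left unmerged
    seq = list(fwd[:-n])
    qual = list(fwd_q[:-n])
    offset = len(fwd) - n
    for i in range(n):
        b1, q1 = fwd[offset + i], fwd_q[offset + i]
        b2, q2 = rev_rc[i], rev_rc_q[i]
        if b1 == b2:
            # Two agreeing observations increase confidence.
            seq.append(b1)
            qual.append(min(q1 + q2, MAX_Q))
        elif q1 >= q2:
            seq.append(b1)
            qual.append(q1 - q2)
        else:
            seq.append(b2)
            qual.append(q2 - q1)
    seq.extend(rev_rc[n:])
    qual.extend(rev_rc_q[n:])
    return "".join(seq), qual
```

In this sketch a disagreement is resolved in favor of the more confident call, which is how a low-quality error at the end of one read can be corrected by the other read.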
0.3 A model system for bacterial ecology and evolution
Comparative genomics is central to our understanding of function and evolution, but in
bacteria its scope has been limited in several important ways. A large fraction of sequenced
and studied genomes are either very distant relatives or closely related pathogens
(Markowitz et al. 2009). Extremely long divergence times concentrate our understanding of
evolutionary dynamics and adaptation on the very slowest processes, while the proclivity
for pathogen sequencing skews our understanding of faster evolutionary dynamics towards
the evolutionary modes of that particular lifestyle. Finally, the lack of a commonly accepted
bacterial species definition makes the adaptive value of variation difficult to interpret
(Fraser et al. 2009; Shapiro et al. 2009).
In Chapter 2, I describe the assembly and analysis of 82 genomes of the
Vibrionaceae, an important model system for microbial evolution and ecology. This family
of heterotrophic bacteria is abundant in aquatic systems worldwide and includes many
human and animal pathogens, and is thus ecologically, medically and economically
important. Unlike the vast majority of environmental bacteria, the 'vibrios' are easily
cultivatable in the lab, and are thus an ideal system for comparative genomics study,
particularly because of the breadth of ecological information collected for this group and
the natural phylogenetic units that this ecology defined. In the first half of this chapter, I
describe the methodological advances required to assemble 82 genomes from short reads.
In the second half of this chapter, I survey the gene content of these bacteria, including its
lateral acquisition, fast turnover, fixation, and relation to ecology. I do so in the context of
one of the premier features of this model system: taxonomic groups distinguished by
ecology.
Using the environmental data collected with more than 1900 marine isolates,
ecologically cohesive populations were delineated in the Vibrionaceae family. Bacterial
taxonomic units are typically demarcated by drawing arbitrary phylogenetic cutoffs, but
this clustering often corresponds poorly to organisms' physiology or ecology (Francisco,
Cohan, and Krizanc 2012). Instead, phylogenetically as well as ecologically cohesive vibrio
populations were identified with the machine learning algorithm AdaptML (Hunt et al.
2008). Using season (Fall/Spring) and particle size, ecological differentiation was found to
occur at genetic distances that were surprisingly fine. However, subsequent studies
consistently discovered these same ecological populations: they were largely cohesive in a
variety of gene phylogenies (Preheim, Timberlake, and Polz 2011), reproducibly recovered
across time (Szabo et al. 2012), and robust to the particular ecological trait used to define
them (Preheim, Timberlake, and Polz 2011; Preheim et al. 2011). The ecological
distinctness between populations demonstrates that despite co-existing in the water
column and sometimes being very closely related genetically, these groups do not share the same niche,
instead partitioning resources spatially and temporally. Within populations, individuals
interact socially (Cordero, Wildschutte, et al. 2012) and sexually (Shapiro et al. 2012),
experiencing rampant homologous recombination along population lines that could result
in genetic differentiation in a manner akin to eukaryotic speciation. All in all, these
ecological populations have many of the hallmarks of a fundamental taxonomic unit, and
it is likely that within each, individual genotypes share a niche and face similar selective
pressures, making this the ideal model system to study variation, adaptation to
environment, and differentiation.
In addition to the breadth of ecological information associated with this group, part
of the strength of the model system stems from how our genome sequences sample the
hierarchy of diversity and divergence in the Vibrionaceae family: from finely sampled 'sister
strains' separated by just 60 core genome SNPs, fewer than can be seen to accumulate from
a few weeks' selection in the lab (Timberlake and Clarke, unpublished results), to the most
divergent genomes, which span much of the family's ~600-million-year radiation
(Thompson 2007). Unlike previous comparative genomics studies that focus either on
several strains or divergent clades, here we can quantify genomic and phenotypic variance
within a population, the unit on which selection acts, as well as the divergence between
populations, illuminating the evolutionary processes involved in adaptation to new habitats.
Importantly, while previous studies described evolutionary processes in the context of the
operational taxonomic definitions ("OTU", typically 3% 16S divergence) or named species,
such studies are likely combining multiple differentiated natural populations with divergent
ecologies (Hunt et al. 2008; Francisco, Cohan, and Krizanc 2012), thus convolving
intra-population and inter-population variation, and convolving the effects of the likely distinct
forces of selection experienced in the multiple niches in that "species". Thus this genomic
reference for the vibrio model system affords us a unique opportunity to investigate the
evolutionary dynamics governing variation in a single niche, adaptation to new ecologies, as
well as the interactions between co-existing populations.
In the first part of this chapter, I describe the methodological advances required to
assemble 82 genomes for this model system. Ultra high-throughput sequencing platforms
(UHTS) have made such genome assembly projects tractable, but there are theoretical limits
on de novo assembly of short reads that deep coverage cannot overcome. This is
particularly true of the early platforms' short single-end reads that constituted most of our
dataset (Chaisson, Brinza, and Pevzner 2009). I report a computational approach designed
to assemble full-length 16S genes, so that this important phylogenetic marker could be utilized,
and a more general approach for more contiguous genome assembly from short reads.
Expensive complementary approaches such as long-insert mate-pair libraries are typically
required for genome assembly contiguity (Chaisson, Brinza, and Pevzner 2009; Butler et al.;
Wetzel, Kingsford, and Pop 2011). By designing a hybrid reference and de novo assembly
approach, I was able to leverage just 1 near-finished genome per population to achieve
markedly better assembly contiguity (2-3x N50's) compared with de novo assembly alone.
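The contiguity metric quoted above, N50, can be made concrete with a short sketch. The contig lengths below are invented purely for illustration (they are not results from this thesis), and the function name `n50` is an assumption of this example.

```python
def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together contain at least half of the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths in bp, for illustration only.
de_novo = [40_000, 30_000, 20_000, 10_000, 5_000, 5_000]
hybrid = [90_000, 15_000, 5_000]

print(n50(de_novo), n50(hybrid))
```

Both toy assemblies contain the same number of bases, so the difference in N50 reflects contiguity alone: fewer, longer contigs give a higher value.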
In the second part of this chapter, I use these 82 new genomes to survey the
relationships of phenotype and gene content with ecology in this model system. Vibrios are
known for their great genetic and metabolic diversity, but the driving forces and dynamics
of metabolic evolution remain largely unknown (Keymer et al. 2007). For each of these
natural populations, an inferred habitat, based on the isolation data of their phylogenetic
group, serves as a representation of their ecology. In this study, we use season (Fall/Spring)
and particle size, two environmental measures available for the vast majority of the strains
full genome sequenced. These ecological traits have been shown to be important axes for
ecological differentiation in this group (Thompson et al. 2004; Hunt et al. 2008) as well as
stable descriptions of populations' ecology over time (Szabo et al. 2012) but it is not yet
known to what extent they engender genotypic or phenotypic similarity. To explore the
effect of similar microscale ecology on these features, I determine the extent to which cohabitants
share substrate utilization capabilities and overall gene content, as well as the
frequency and structure of recent horizontal gene transfer in this clade.
What is the series of events that coincides with the differentiation of a population?
Recent studies report that within traditional species boundaries exist multiple sub-populations
that can be robustly separated along various non-nucleotide-divergence axes
such as ecology (Rocap et al. 2003), geography (Keymer et al. 2007), and host organism
(Preheim et al. 2011; Sim et al. 2008). However, the genomic and phenotypic correlates for
divergence along those axes, and their temporal sequence, are largely unknown. Proteomic
expression has been suggested to be diverging faster than gene content (Konstantinidis et
al. 2009), but the relative timescales for the divergence of other attributes are unclear. Our
superfine phylogenetic sampling included many genomes that are distinguished by only
100-300 SNPs genome-wide, a number comparable to what accumulates in just a few weeks
of selection and bottlenecks of these same strains in the laboratory (Clarke and Timberlake,
unpublished results). Vibrios are known to partition resources at surprisingly fine levels of
phylogenetic divergence, and a recent study illuminated some of the crucial events
that occur early in this differentiation process (Shapiro et al. 2012), but the role of gene
content and the extent to which this partitioning on fine timescales involves new DNA, new
alleles, or new regulation is unknown. In Chapter 2, I ask to what degree flexible DNA, gene
content, core genome, or metabolic profiles are distinct between closely related individuals
and populations and start to compare the relative timescales for the divergence of these
different measures as populations differentiate.
Closely related bacteria are increasingly appreciated to vary greatly in their gene
repertoire (Tettelin et al. 2005), and the vibrios in particular have remarkably fast gene
turnover of many genes (Thompson 2005), but the selective implication of this large and
labile flexible genome is poorly understood (Polz et al. 2006). Indeed, same-cluster isolates
(1% 16S divergence) can exhibit 20% differences in genome size (Thompson 2005). It is
astounding but not implausible to suggest that this large fraction of the genome's content
may not be adaptive (Berg and Kurland 2002; Polz et al. 2006); indeed, in the absence of
other evidence, neutral variation seems a more appropriate null hypothesis. However, the
current literature usually makes one of two opposing assumptions about the adaptive value
of these genes with intermediate penetrance in a bacterial group: either strain-specific gene
content is held responsible for pathogenic or other adaptive phenotypes, or low-penetrance
or strain-specific genes are dismissed as being inessential or not adaptive. A
neutral model of flexible gene fixation was recently introduced (Haegeman and Weitz
2012); however, the model has been tested only in its ability to fit observed distributions of
flexible gene penetrance in species groups. The ecological data available in this Vibrio
model system will enable us to directly test the selective value of flexible DNA and its
neutral or selective fixation. In this study, I quantify the extent to which useful flexible DNA,
core genes and metabolism diverge at the time scales where resource partitioning is known
to occur, and the extent to which flexible gene fixation is concordant with a neutral model.
While the neutrality of gene content variation remains obscure, the model that each
genome partitions into two kinds of genes, "core" and "flexible", has become very popular.
The model was originally proposed as a way to reconcile the huge intraspecific variation in
gene content with any species concept (Ruiting Lan and Reeves 2000; R. Lan and Reeves
2001). Despite the rampant illegitimate recombination introducing new DNA and
diversifying the flexible genome, a species' core genome could be resistant to diversifying
forces. This core genome, stable and common to all, would be homogenized by intraspecies
homologous recombination, thus generating the phylogenetic clusters we empirically
observe and the cohesion we associate with the idea of species. Indeed, a few studies have
shown the enrichment of recombination in core genes within phylogenetic clusters
(Lefébure et al. 2010; Bennett et al. 2010; Shapiro et al. 2012), and even highlighted a
possible mechanism (Budroni et al. 2011). While this rationale (and indeed the bacterial
species concept itself (Doolittle 2012; Cohan 2011)) remains the subject of active debate,
the core/flexible dichotomy appears to be biologically and ecologically meaningful, as the
two gene types have been shown to differ in genomic location (Hacker and Carniel 2001;
Dobrindt et al. 2004; Coleman et al. 2006), functions (Makarova et al. 2006; Kettler et al.
2007), evolutionary modes (Bennett et al. 2010), expression (Stewart et al. 2011), purifying
selection strength (V. Tai et al. 2011), and geographical distribution (Boucher et al. 2011).
The success of this core/flexible model is remarkable given how widely their
definitions vary (due, in part, to the lack of a consistent definition for an evolutionary unit in
bacteria). Various studies define "core" genes as those shared at clonal complex, named
species, genus or even family level, a span of more than 2-12% 16S divergence (Grote et al.
2012). This raises the question of whether the core versus flexible findings are the result of
analyzing two extremes from a continuum of gene types. Indeed, as comparative studies
have included increasing numbers of genomes, these two gene types are being further
subdivided in the literature into "core", "extended core", "unique core", "character",
"accessory" and "flexible" (Lilburn, Cai, and Gu 2009; Zwick et al. 2012; Lapierre and
Gogarten 2009), based on the fraction of genomes in which they are observed. This further
partitioning leads one to question if the core/flexible dichotomy concept is valid, and
whether a better and consistent definition of core and flexible, one that is consistent with
bacterial ecology and differentiation, could make this model of gene-type partitions more
meaningful, and could even reinforce the delineation of species in bacteria (Ruiting Lan and
Reeves 2000).
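To illustrate how the "core" and "flexible" labels depend on the penetrance cutoff and on the set of genomes being compared, here is a minimal sketch. The genome names, gene-family identifiers and thresholds are hypothetical, and the functions are not taken from the thesis; they only show the bookkeeping behind a presence/absence definition.

```python
def gene_penetrance(presence):
    """presence maps a genome name to the set of gene-family ids it carries.
    Returns a dict: gene family -> fraction of genomes carrying it."""
    n_genomes = len(presence)
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: count / n_genomes for gene, count in counts.items()}


def partition(presence, core_threshold=1.0):
    """Split gene families into (core, flexible) by a penetrance cutoff."""
    penetrance = gene_penetrance(presence)
    core = {g for g, frac in penetrance.items() if frac >= core_threshold}
    flexible = set(penetrance) - core
    return core, flexible


# Hypothetical presence/absence data for four genomes.
presence = {
    "genome1": {"a", "b", "c"},
    "genome2": {"a", "b"},
    "genome3": {"a", "c", "d"},
    "genome4": {"a", "b"},
}

strict_core, strict_flex = partition(presence, core_threshold=1.0)
relaxed_core, relaxed_flex = partition(presence, core_threshold=0.75)
print(sorted(strict_core), sorted(relaxed_core))
```

Relaxing the cutoff from 100% to 75% moves family "b" from flexible to core in this toy data, which is the kind of definitional sensitivity discussed above.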
The 82 Vibrio genomes I have assembled, partitioned into natural evolutionary
units, present a unique opportunity to properly define core and flexible gene content in a
biologically meaningful way in accordance with the original Core Genome Hypothesis. In
this preliminary analysis of those core and flexible genomes, I evaluate a neutral model of
flexible gene content penetrance, and the extent to which flexible genes fix and can replace
core genes on the timescales of ecological differentiation and with ecological correlates.
0.4 Conservation of coexpression without conservation of
response in bacterial gene regulatory networks.
Microbes sense their changing environment and alter their phenotype via gene regulation in
response. These gene regulatory responses are thought to be critical for fitness and
adaptation (Britten and Davidson 1969; Cases, de Lorenzo, and Ouzounis 2003), but despite
extensive documentation of differentially expressed genes (DeRisi, Iyer, and Brown 1997;
Faith et al. 2008), even a basic understanding of their fitness contributions, conservation, or
adaptive change is lacking (Birrell et al. 2002; Giaever et al. 2002; S. L. Tai et al. 2005;
Klosinska et al. 2011).
A fundamental unanswered question is whether orthologous bacterial genes
respond the same way to the same environmental cues. Comparative transcriptomic studies
in bacteria typically focus on species-specific differences, which can then be implicated in
pathogenic or other adaptive phenotypes (Yoder-Himes et al. 2009; Guell et al. 2011; Zhang,
Dudley, and Wade 2011). Impressive studies have characterized the divergence of
transcriptomic responses in 5 closely related yeast species (Tirosh et al. 2006), and of
proteomic abundance profiles in 10 co-generic proteobacteria (Konstantinidis et al. 2009),
but to our knowledge, conservation of bacterial transcriptomic responses had never
been quantified.
While the evolution of response has not been explored in prokaryotes, evolution of
their transcriptional network has been explored by looking at features of network
architecture, including modules (McAdams, Srinivasan, and Arkin 2004; Babu et al. 2004;
Gelfand 2006). Studies have convincingly demonstrated that genomes are partitioned into
functional modules that are expressed together (Stuart et al. 2003; Bergmann, Ihmels, and
Barkai 2004), clustered together (Price et al. 2005), physically interacting (Zinman, Zhong,
and Bar-Joseph 2011) and laterally transferred together (Morgan N. Price, Arkin, and Alm
2006). As the transcriptional network evolves, these units can remain cohesive, with
some conserved transcriptional modules documented in species as divergent as E. coli and
metazoans (Stuart et al. 2003; Bergmann, Ihmels, and Barkai 2004), or even out to B. subtilis
(Waltman et al. 2010; Zarrineh et al. 2010). Overall, network evolution is complex, including
flexibility and extensive rewiring of functional modules (Lozada-Chavez, Janga, and Collado-Vides
2006; Roguev et al. 2008; Singh et al. 2008; Fokkens and Snel 2009); however, the
parts of the transcriptional network that are conserved over 4 billion years of bacterial
evolution may be fundamental to microbial life and its most common
environmental/adaptive challenges.
In Chapter 3 of this thesis, I assess conservation in the bacterial transcriptional
network in two ways. First, I quantify the conservation and function of the simplest possible
units of the transcriptional networks, co-expressed gene pairs, in four very divergent
bacteria, starting with an extensive compendium of transcriptomic data we generated in
two environmental isolates (Venkateswaran et al. 1999; Heidelberg, Seshadri, Haveman,
Hemme, Paulsen, Kolonay, Eisen, Ward, Methe, Brinkac, et al. 2004)
supplemented by mining public databases for expression profiles in other species (Barrett
et al. 2007; Faith et al. 2008). Second, I conduct a comparative survey of transcriptional
phenotype to determine the features of transcriptional response that are conserved,
focusing on three model metal-reducing bacteria and the environmental perturbations
which are thought to be relevant for their ecology (Lovley et al. 1993; Finneran,
Housewright, and Lovley 2002). This comparative survey of transcriptional phenotype
revealed the surprisingly fast-diverging and mercurial nature of bacterial gene regulatory
responses, motivating a re-examination of how functional conclusions are drawn from
transcriptomic experiments and highlighting the clear need for a phylogenetic model of
gene expression evolution.
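As a rough illustration of what "conservation of co-expressed gene pairs" can mean operationally, the sketch below calls a pair co-expressed when the Pearson correlation of the two expression profiles exceeds a cutoff, then asks what fraction of one species' pairs is also co-expressed in another. The cutoff of 0.8, the toy profiles, and the assumption that genes are already keyed by a shared ortholog identifier are all choices made for this example, not the procedure used in Chapter 3.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-zero variance assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, threshold=0.8):
    """Return the set of gene pairs whose profiles correlate at or above threshold."""
    return {
        frozenset((g, h))
        for g, h in combinations(sorted(profiles), 2)
        if pearson(profiles[g], profiles[h]) >= threshold
    }


def conserved_fraction(pairs_a, pairs_b):
    """Fraction of species A's co-expressed pairs that are also co-expressed in B."""
    return len(pairs_a & pairs_b) / len(pairs_a)


# Hypothetical expression profiles keyed by ortholog group id.
species_a = {"og1": [1, 2, 3, 4], "og2": [2, 4, 6, 8],
             "og3": [4, 3, 2, 1], "og4": [8, 6, 4, 2]}
species_b = {"og1": [0, 1, 0, 1], "og2": [0, 2, 0, 2],
             "og3": [0, 1, 0, 1], "og4": [1, 0, 1, 0]}

pairs_a = coexpressed_pairs(species_a)
pairs_b = coexpressed_pairs(species_b)
print(conserved_fraction(pairs_a, pairs_b))
```

Note that the pair is the unit of comparison here: a pair can stay co-expressed in both species even if the conditions that induce it differ, which is the distinction between conserved co-expression and conserved response drawn in this section.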
Chapter 1.
Unlocking short reads for
metagenomics
Note: this chapter is presented as it originally appeared in:
PLoS ONE 5 (7) (July 28): e11840. doi:10.1371/journal.pone.0011840,
plus the following additions:
1.6 Contributions to this manuscript
1.7 Appendix
PLoS ONE
OPEN ACCESS Freely available online
Unlocking Short Read Sequencing for Metagenomics
Sébastien Rodrigue*, Arne C. Materna, Sonia C. Timberlake, Matthew C. Blackburn, Rex R. Malmstrom,
Eric J. Alm*, Sallie W. Chisholm*
Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
Abstract
Background: Different high-throughput nucleic acid sequencing platforms are currently available but a trade-off currently
exists between the cost and number of reads that can be generated versus the read length that can be achieved.
Methodology/Principal Findings: We describe an experimental and computational pipeline yielding millions of reads that
can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an
automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately
sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form
a longer composite read.
Conclusions/Significance: This strategy is broadly applicable to sequencing applications that benefit from low-cost
high-throughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic
analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing.
Citation: Rodrigue S, Materna AC, Timberlake SC, Blackburn MC, Malmstrom RR, et al. (2010) Unlocking Short Read Sequencing for Metagenomics. PLoS ONE 5(7):
e11840. doi:10.1371/journal.pone.0011840
Editor: Jack Anthony Gilbert, Plymouth Marine Laboratory, United Kingdom
Received May 26, 2010; Accepted July 6, 2010; Published July 28, 2010
Copyright: © 2010 Rodrigue et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by funds from the Gordon and Betty Moore Foundation Marine Microbiology Initiative (www.moore.org), and the Center for
Microbial Oceanography: Research and Education (C-MORE) (http://cmore.soest.hawaii.edu) awarded to S.W.C., and by funds from the Department of Energy
Genome-to-Life (DOE-GTL) (http://genomicscience.energy.gov) program awarded to S.W.C. and E.J.A. S.R. is the recipient of postdoctoral fellowships from the
Natural Science and Engineering Research Council of Canada (NSERC) (www.nserc-crsng.gc.ca) and from the Fonds Québécois de Recherche sur la Nature et
Technologies (FQRNT) (www.fqrnt.gouv.qc.ca). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the
manuscript.
Competing Interests: The authors have declared that no competing interests exist.
E-mail: ejalm@mit.edu (EJA); chisholm@mit.edu (SWC)
@These authors contributed equally to this work.
Introduction
New DNA sequencing technologies are dramatically changing
the research landscape in biology by enabling experiments that
were previously too expensive or time-consuming. Three platforms,
the Roche-454 Genome Sequencer, the Illumina Genome
Analyzer II and the Applied Bio-Systems SOLiD system, are
widely used [1]. The latter two instruments produce a very large
number of short reads, typically 25-75 nucleotides long, at a
significantly lower cost per basepair (bp) compared to the
Roche-454 pyrosequencing system. However, short reads create
difficulties for some applications such as metagenomics and
metatranscriptomics, and therefore pyrosequencing remains the
platform of choice. Efforts have been made to obtain longer
sequences from short-read instruments, and further expand their
range of applications [2,3]. A clever strategy was recently
developed by Hiatt and colleagues, but their approach requires
many steps and imposes a library complexity bottleneck that can
limit its utility for applications like metagenomics [2]. We sought
to establish a more straightforward and broadly applicable
method. Our approach builds on preparing libraries with tunable
size distributions such that sequencing from both ends generates
overlapping reads. The lower-quality ends of mated reads can then
be overlapped to produce high-quality consensus sequence. We
demonstrate our strategy using the Illumina GAII technology,
although the procedures we describe could be applied to other
sequencing platforms that can generate paired end data.
Results
We developed a simple and automatable protocol to rapidly,
efficiently and reproducibly isolate DNA fragments of a specific
size distribution without the need for gel electrophoresis. The
method is similar to a recently published protocol [4] but allows
adjusting the size distribution of the isolated DNA fragments as
appropriate for the sequencing platform used. After ultrasonic
DNA shearing, size-selection is carried out via double solid phase
reversible immobilization (dSPRI) using carboxyl coated magnetic
beads (Figure 1). DNA associates with these particles in a strictly
size-dependent manner according to the concentration of
polyethylene glycol and salts [4-7], which can be adjusted by
changing the volume ratio of SPRI bead suspension to DNA
solution (Figure 1). The size distribution of selectively bound DNA
is concentration independent (Figure 2), which supports the
preparation of sequencing libraries from dilute samples. In the first
step of this procedure, DNA fragments larger than the desired
library insert size are bound to magnetic beads, which are then
discarded. The supernatant is saved, and mixed with an
appropriate volume of fresh SPRI beads. This second incubation
preferentially immobilizes DNA fragments of the targeted size
range. Shorter DNA molecules, salts and other reagents are
removed by wash steps, and the DNA fraction that remains bound
to the magnetic SPRI beads can subsequently be recovered with
virtually no loss [4]. The resulting DNA population reproducibly
exhibits a relatively narrow size distribution (Figure 2), and can be
used to prepare high-throughput sequencing libraries by ligating
appropriate oligonucleotide adapters (Table 1).
[Figure 1 image: Bioanalyzer traces in panels A to H, J and K, plotted against fragment length [bp]; panel I is a table of separation conditions and results.]
Figure 1. Size-dependent isolation of DNA fragments from sheared genomic DNA via dSPRI. AMPure XP SPRI beads bind DNA fragments
in a size dependent manner according to the concentration of salts and polyethylene glycol (PEG) in the reaction [4-7], which can easily be changed
by using different volume ratios of DNA to SPRI bead solutions. A two-step procedure is employed to isolate targeted DNA size fractions. Panels A to
H present Bioanalyzer DNA-1000 assays showing the sheared genomic DNA used as starting material (black), the larger size DNA fragments discarded
in separation 1 (red), and the size fraction purified and recovered after separation 2 (blue). Panel I is a table summarizing the conditions and results
displayed in panels A to H. All Bioanalyzer DNA-1000 traces after separation 1 (panel J), and after separation 2 (panel K), are respectively displayed on
a graph for the conditions presented in panels A to H. The conditions displayed in panel H were used to obtain the Illumina composite reads
discussed in the text. The wider DNA fragment size distribution from panel H allowed us to better analyze the effects of shorter versus longer
overlapping regions on consensus reads.
doi:10.1371/journal.pone.0011840.g001
DNA size selection allowed us to choose an appropriate library
insert length with precision, such that mate-pair sequences of
each insert overlap in the middle, and can be aligned to produce
a longer, more accurate composite read. We used our streamlined
dSPRI protocol to produce a paired-end library with an insert
size range of 100-300 bp (Figure 1H) for the Illumina GAIIx
platform, and generated millions of paired-end reads 143
nucleotides (ntds) in length. We observed that the sequencing
error rate remained <1% for 110 ntds in the forward read, and
87 ntds in the reverse read (Figure 3A). To reconstruct each
insert's sequence, we developed the SHERA (SHort-read Error-Reducing Aligner) software tool, which aligns each mate-pair to
produce a composite read with Phred-like quality scores [8]. With
each alignment, we report a confidence metric, allowing
misaligned reads to be reliably filtered out. We chose a cut-off
such that 87% of reads were retained and less than 1% of paired
reads were incorrectly aligned. The average composite read
length reached 180±40 bp (Figure 4), as expected from the
library insert size distribution. Importantly, the composite reads'
length can easily be adjusted by modifying dSPRI selection
conditions (Figure 1), and reach >200 bp. For nucleotides in the
aligned region, SHERA recalculates the error probability of each
bp, given the base call data from both reads (Figure 3B).
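The idea of joining overlapping mates and recomputing per-base error can be sketched as follows. This is not the SHERA implementation: the overlap search (longest ungapped overlap within a mismatch tolerance), the uniform substitution-error model used to combine two base calls, and all function names and parameters are assumptions made for illustration. It also assumes the insert is at least as long as a read, so neither read runs into adapter sequence. Phred scores follow the standard relation Q = -10 log10(p), so an error rate below 1% corresponds to Q above 20; with two full-length 143 nt reads, a 180 bp insert implies an overlap of 2 x 143 - 180 = 106 nt.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def phred_to_p(q):
    return 10 ** (-q / 10)


def p_to_phred(p):
    return -10 * math.log10(p)


def combine(b1, q1, b2, q2):
    """Posterior call and Phred score for one overlapped position, assuming
    independent reads and errors spread uniformly over the three other bases."""
    p1, p2 = phred_to_p(q1), phred_to_p(q2)
    if b1 == b2:
        right = (1 - p1) * (1 - p2)
        wrong = p1 * p2 / 3
        return b1, p_to_phred(wrong / (right + wrong))
    if q1 < q2:  # on disagreement, keep the more confident call
        b1, p1, b2, p2 = b2, p2, b1, p1
    keep = (1 - p1) * p2 / 3
    other = p1 * (1 - p2) / 3 + 2 * p1 * p2 / 9
    return b1, p_to_phred(other / (keep + other))


def merge_pair(fwd, fq, rev, rq, min_overlap=10, max_mismatch_frac=0.1):
    """Join a read pair into one composite read, or return None if no
    acceptable overlap is found. Qualities are lists of Phred scores."""
    rc, rcq = revcomp(rev), rq[::-1]
    for ov in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        a, b = fwd[-ov:], rc[:ov]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch_frac * ov:
            start = len(fwd) - ov
            seq, qual = list(fwd[:start]), list(fq[:start])
            for i in range(ov):
                base, q = combine(a[i], fq[start + i], b[i], rcq[i])
                seq.append(base)
                qual.append(q)
            seq.extend(rc[ov:])
            qual.extend(rcq[ov:])
            return "".join(seq), qual
    return None


# Toy example: a 20 nt insert sequenced with two 14 nt reads (8 nt overlap).
insert = "ACGTTGCAAGGCTTACCGTA"
fwd, rev = insert[:14], revcomp(insert[-14:])
merged, quals = merge_pair(fwd, [20] * 14, rev, [20] * 14, min_overlap=5)
print(merged, [round(q) for q in quals])
```

In the toy example every input base has Q20, and positions covered by both reads come out with a higher score than either read alone, which is the qualitative effect described in the text for the aligned region.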
[Figure 2 image: panels A to C, Bioanalyzer traces plotted against fragment length [bp].]
Figure 2. Reproducibility of double-SPRI. The panel shows DNA fragment size distributions as obtained by Bioanalyzer DNA-1000 assays. The
curves represent the size fractions removed during the first separation step A) or recovered after the second separation B). The two size fractions
were independently reproduced in 4 or 8 separation experiments. The curves in B) represent the libraries sequenced after dSPRI based size selection,
adapter ligation and PCR enrichment. While concentrations (arbitrary fluorescence units) vary between reproduced libraries the range of removed or
enriched DNA fragment sizes was highly reproducible. Panel C) shows the DNA fragment size distribution recovered after the second separation
when using decreasing amounts of sheared genomic DNA. dSPRI allows reliable size selection in a DNA concentration independent manner.
doi:10.1371/journal.pone.0011840.g002
Composite read quality exceeds that of individual short-reads
(Figure 3). The shortest composite reads have the best quality
scores; still, the average error rate remained <1% across 95% of
250 bp reads (Figure 5). New technologies and sequencing
chemistry improvements are likely to further increase ultra
high-throughput read lengths, thus extending the achievable
length of high-quality composite sequences.
Short-read technologies have not been widely used for
metagenomic analyses because of the difficulty in confidently
assigning phylogeny or putative gene function to short sequences,
although a strategy using Illumina mate paired libraries to assign
taxonomy has recently been evaluated with simulated data [9].
We sought to test if our composite reads could potentially
transcend this limitation. To address this question, we sequenced
a metagenomic DNA sample from the Hawaii Ocean Time
Series [10] using our overlapping paired-end approach and
compared the results with those obtained with the Roche 454-FLX
technology, the most widely-used platform for these studies.
Over 4.8 million high-confidence composite sequences of
180±40 bp were generated from a single sequencing lane of
the Illumina GAIIx while 673,673 454-FLX reads of 207±71 bp
were already available [10]. Despite the slightly shorter average
read length of the composite Illumina reads relative to 454-FLX,
the fraction of reads that could be assigned to a taxon was similar
in both cases, even when considering longer 454-FLX reads
(Figure 6A). The relative abundance of the sequences that could
be assigned to each taxon was also consistent between the
454-FLX and composite read datasets (Figure 6B). We have focused
our comparison on Illumina and 454-FLX, because they generate
similar read lengths, and read length can make a difference in
performance for some applications [11]. As the technology
landscape evolves our general approach for experimental and
computational overlapping of short paired reads will be readily
adapted to new platforms and upgrades, and therefore is likely to
continue to play a role in providing additional cost-effective
sequencing options. Taken together, these results demonstrate
that Illumina composite reads produced by SHERA constitute a
practical and cost-effective alternative to 454-FLX for metagenome sequencing and analysis while providing significantly deeper
sampling at a lower cost.
Table 1. Oligonucleotide sequences.
Name                           Purification        Sequence (5' to 3')
IGA-A-down                     HPLC                B'B'B'B' AGA TCG GAA GAG CGT CGT GTA GGG AAA GAG TGT AC/3AmM/
IGA-A-up                       HPLC                /5AmMC6/ACA CTC TTT CCC TAC ACG ACG CTC TTC CGA TCT BBBB
IGA-PE-80-down                 HPLC                /5AmMC6/CTC GGC ATT CCT GCT GAA CCG CTC TTC CGA TCT
IGA-PE-80-up                   HPLC                AGA TCG GAA GAG CGG TTC AGC AGG AAT GCC GAG /3AmM/
IGA-PCR-PE-F                   PAGE                AAT GAT ACG GCG ACC ACC GAG ATC TAC ACT CTT TCC CTA CAC GAC GCT CTT CCG ATC T
IGA-PCR-PE-R                   PAGE                CAA GCA GAA GAC GGC ATA CGA GAT CGG TCT CGG CAT TCC TGC TGA ACC GCT CTT CCG ATC T
QPCR-library quantification-F  Standard desalting  AAT GAT ACG GCG ACC ACC GA
QPCR-library quantification-R  Standard desalting  CAA GCA GAA GAC GGC ATA CGA
All oligonucleotides were purchased from IDT (www.idtdna.com).
5AmMC6: 5' Amino Modifier C6.
3AmM: 3' Amino Modifier.
B: Barcode base. The sequencing reads start with these bases. The barcodes used in this study were AAAC, ACCC, AGGC and ATTC.
B': Complementary barcode base when the two oligonucleotides from the same adapter are annealed.
doi:10.1371/journal.pone.0011840.t001
[Figure 3 image: panel A, Phred quality (mean, Q25, Q50, Q75) against position on forward read and position on reverse read; panel B, Phred quality (mean, Q25, Q50, Q75) against position on composite fragment [bp].]
Figure 3. Quality of Illumina reads out to 143 bp. A) The mean Phred quality of single reads is shown in the solid red lines, with error bars
displaying quartiles. Read quality is highly variable toward the reads' ends. Read quality as a function of base pair is worse on the second mate of the
pair. B) The mean and quartiles of Phred quality by base for the average-length composite read.
doi:10.1371/journal.pone.0011840.g003
Discussion
We have demonstrated that our overlapping mate-pair
strategy can substantially increase read length while maintaining
high sequence quality, in turn making economical short-read
platforms suitable for a range of applications. For example, the
performance of phylogenetic classification and annotation of
DNA sequences improves with read length, the most significant
increase being observed by extending read length from 100 to
200 bp [12-13]. By allowing reads to reach and exceed this latter
threshold, our method enables the use of the ubiquitous Illumina
platform for metagenomics. Importantly, the dSPRI DNA
fragment size selection and the SHERA package for computational
read joining can be used with any existing or emerging
short-read sequencing platforms capable of producing mate-pairs.
Compared to pyrosequencing, the dominant sequencing
technology in metagenomics, our composite read strategy
already provides a >30-fold increase in sequencing capacity
per dollar at similar or higher accuracy (Table 2). We expect that
the advances described here will not only unlock short read
sequencing for metagenomics, but that a number of other
applications including amplicon sequencing, transcriptomics,
and de novo assembly, will take advantage of the millions of
longer high quality reads produced by this simple and scalable
method.
Materials and Methods
Double-SPRI (dSPRI) DNA fragment size-selection
High-molecular weight genomic DNA was obtained from a
marine environmental sample (Hawaii Ocean Time Series, cruise
186, 75 m depth, [10]) and from laboratory cultures using
standard procedures [14], or by amplifying genomes from
individually flow-sorted cells as described [15]. DNA samples
(50 µL, 2 µg) were sheared using 18-24 cycles of alternating 30-second
ultrasonic bursts and 30-second pauses in a 4°C water
bath (Bioruptor UCD-200, Diagenode). The resulting DNA
molecules were end-repaired and phosphorylated according to the
manufacturer's recommendations (End-repair kit, Enzymatics or
New England Biolabs). The size distribution of the resulting
fragments ranged on average from ~100 to ~700 bp as judged by
Agilent Bioanalyzer DNA-1000 assays.
[Figure 4: histogram; y-axis ticks 0 to 40,000; x-axis: fragment length [bp] (100 to 350).]
[Figure 5, panel A: sequence length 143 bp. Legend: Illumina paired end reads; overlapped composite fragment.]
Figure 4. High-confidence alignment yield by insert lengths.
Distribution of insert lengths obtained by aligning the original reads to the
reference sequence (gray), and lengths of composite reads retained
(red).
doi:10.1371/journal.pone.0011840.g004
[Figure 5, panel B: sequence length 180 bp.]
We optimized a two-step dSPRI protocol [4] for DNA fragment
size selection that quantitatively removes fragments above or
below the desired size range. In a first separation step, DNA
fragments above an upper size threshold are bound by Agencourt
AMPure XP SPRI magnetic beads and discarded. In the following
second separation step fragments in the desired size range are
selectively captured on the beads while all shorter fragments are
discarded with the supernatant. The conditions during first and
second separation are tunable to enable selective enrichment for
other fragment size distributions (Figure 1). In addition, we
showed that the size specificity of the dSPRI procedure is stable
over a wide range of DNA concentrations (Figure 2). The
sequencing libraries prepared for this work were obtained by
combining 50 µL of sheared DNA with 40 µL of Agencourt
AMPure XP SPRI magnetic beads and incubating at room
temperature for 5-7 minutes. The tubes were next placed on a
magnet holder (Invitrogen-Dynal) for 2 minutes before transferring the supernatant to a new tube and discarding the AMPure XP
particles from this first separation step. For the second separation
step, 100 µL of fresh AMPure XP beads were added to 85 µL of
the supernatant from the previous separation, mixed well, and
incubated for 5 minutes before transferring the tube to the magnet
holder rack. The supernatant from this second step was discarded
and the SPRI particles were washed twice with 0.5 mL of 70%
ethanol. The wash steps ensure complete removal of polyethylene
glycol (PEG) and salts, but also of any remaining unbound
fragments smaller than the desired fragment size distribution. After
removing the 70% ethanol from the wash steps, the beads were air
dried at room temperature (or at 37°C) for 10-15 minutes. In
order to recover the bound DNA, 30 µL of water was added to the
beads, and incubated at room temperature for at least 1 minute.
The DNA solution was separated from the Agencourt AMPure
XP SPRI magnetic beads for 2 min on the magnet holder, and
transferred to a new tube. Applying the aforementioned conditions for
dSPRI led to the specific enrichment of a DNA size-fraction with
an average size of 207 bp. Enrichment for other size distributions
can easily be achieved by modifying the dSPRI conditions as
described in Figure 1.
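The volumes quoted in the dSPRI protocol above can be turned into bead-to-sample ratios with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes that the relevant quantity at each step is the ratio of bead suspension volume to original sample volume, with the bead buffer from the first step carried over in the supernatant. The source does not state the ratios in this form.

```python
def dspri_ratios(sample_ul, beads_step1_ul, supernatant_ul, beads_step2_ul):
    """Return (step-1 ratio, cumulative step-2 ratio) of bead buffer to sample.

    Assumption: the supernatant taken after step 1 is a well-mixed aliquot of
    sample plus bead buffer, so both are carried over in the same proportion.
    """
    ratio_step1 = beads_step1_ul / sample_ul

    # Fraction of the step-1 mixture that is transferred to the second step.
    carried_fraction = supernatant_ul / (sample_ul + beads_step1_ul)
    sample_carried = sample_ul * carried_fraction
    buffer_carried = beads_step1_ul * carried_fraction

    # Bead buffer present in step 2: carried-over buffer plus fresh beads.
    ratio_step2 = (buffer_carried + beads_step2_ul) / sample_carried
    return ratio_step1, ratio_step2


# Volumes from the protocol: 50 µL DNA + 40 µL beads, then 85 µL of
# supernatant + 100 µL fresh beads.
first, second = dspri_ratios(50, 40, 85, 100)
print(f"step 1 ratio: {first:.2f}x, cumulative step 2 ratio: {second:.2f}x")
```

Under these assumptions, a lower first-step ratio leaves only the largest fragments on the discarded beads, while a higher cumulative second-step ratio captures shorter fragments; Figure 1 of the paper remains the reference for how to retune the conditions.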
[Figure 5, panel C: sequence length 250 bp; x-axis: position on composite sequence.]
Figure 5. Quality of composite fragments after overlapping. The
mean Phred quality (and corresponding error rate) at each position of the
read. We show the original Illumina reads and the composite fragments
generated by overlapping. A) 143 bp, the length of the original Illumina
read; B) 180 bp, the mean length of overlapped fragments in this library; and
C) 250 bp, roughly the mean length generated by 454-FLX technology.
doi:10.1371/journal.pone.0011840.g005
[Figure 6: (A) assignable fraction of reads (0.0 to 0.6); (B) top 25 taxa represented in sample, plotted against composite Illumina reads (log scale).]
Figure 6. Composite Illumina reads constitute a legitimate alternative to pyrosequencing for metagenomics studies. DNA from a marine
metagenomics sample was sequenced with our overlapping mate-pair approach. The composite reads were directly compared to 454-FLX sequences
from the exact same sample. A) Fraction of reads that could be assigned to a taxon using the composite reads (mean read length 180 bp), the entire 454-FLX dataset (mean read length 207 bp), or longer 454-FLX reads (mean read length 254 bp). B) Comparison of taxon assignments using composite reads
and 454-FLX pyrosequencing reads. The top 25 represented taxa, with colored symbols next to the bracket, are listed in Table S1.
doi:10.1371/journal.pone.0011840.g006
Illumina paired-end library construction
Our strategy to create Illumina-compatible sequencing libraries uses a blunt-end ligation
step of two distinct adapters rather than the ligation of a single Y-shaped adapter [16].
dSPRI-selected DNA fragments were combined with a 10-fold molecular excess of each
oligonucleotide adapter (see Table 1 for oligonucleotide sequences). Blunt-end DNA fragments
were ligated using a rapid-ligation kit (Enzymatics or New England Biolabs) for 5-10 minutes
at room temperature. Adapter excess was subsequently removed using Agencourt AMPure XP SPRI
beads according to the manufacturer's recommendations (DNA/bead ratio of 1). A
nick-translation step was performed with 2 units of BstI exo- DNA polymerase (New England
Biolabs) at 65°C for 25 minutes in a reaction consisting of 20 mM Tris-HCl pH 8.8, 10 mM
(NH4)2SO4, 10 mM KCl, 2 mM MgSO4, and 0.1% Triton X-100. The resulting DNA fragments were
diluted 2-fold with ddH2O, PCR amplified using Phusion Hot Start High-Fidelity DNA polymerase
(New England Biolabs), and simultaneously monitored on a Bio-Rad Opticon real-time PCR
instrument. Each reaction consisted of 1 µL of diluted nick-translated DNA, 0.5 units of
Phusion Hot Start High-Fidelity DNA polymerase, 1X Phusion HF buffer, 0.4 µM of each of the
IGA-PCR-PE-F and IGA-PCR-PE-R primers [described in Table 1], 200 µM dNTPs, and 0.25X
SYBR Green I. The initial denaturation step was at 98°C for 1 minute, followed by successive
cycles of 98°C for 15 seconds, 65°C for 10 seconds and 72°C for 15 seconds. The reactions
were stopped in the late logarithmic amplification phase, and the DNA from the corresponding
samples was pooled. The resulting libraries were composed of fragments with distinct adapters
at each extremity due to the PCR suppression effect [17-19]. The amplified libraries were
subjected to an additional round of AMPure XP SPRI bead purification, with a DNA/bead ratio
of 1, to remove residual primers and adapter dimers.
In order to achieve optimal cluster densities for sequencing on the Illumina GAII platform,
samples were quantified using two methods. The samples were first analyzed using the
High-Sensitivity DNA Kit for the Bioanalyzer (Agilent Technologies) to detect any
primer-dimers and to determine the average molecular weight of each library. The samples were
next quantified by real-time PCR on a LightCycler II 480 (Roche), using the LightCycler 480
SYBR Green I Master mix (Roche), appropriate primers (Table 1), and a serially diluted PhiX
335 bp control library (Illumina) as a standard (quantified with the Agilent Bioanalyzer; we
observed that the Illumina PhiX 335 bp
Table 2. Cost estimates of sequencing strategies.
Sequencing technology        Composite Illumina    454-FLX
Cost of a run ($)*           4,200.00              8,000.00
Number of reads obtained     10,000,000            400,000
Average read length (bp)     180                   250
Reads per $                  2,381                 50
bp per $                     428,571               12,500
Cost ratio per read          48                    1
Cost ratio per bp            34                    1
*The Illumina sequencing pricing is from the MIT BioMicro center (http://openwetware.org/wiki/BioMicroCenter:Pricing) and the 454-FLX pricing is from http://www.biosystems.usu.edu/core_labs/genomics/454/.
doi:10.1371/journal.pone.0011840.t002
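The derived rows of Table 2 follow directly from the run cost, read count, and mean read length. As a quick check, the short sketch below recomputes them; the function and variable names are illustrative only, and the 454-FLX mean read length of 250 bp is taken to be the value implied by the table's own "bp per $" and "reads per $" rows.

```python
def throughput_per_dollar(run_cost, n_reads, mean_read_len_bp):
    """Return (reads per dollar, base pairs per dollar) for one run."""
    reads_per_dollar = n_reads / run_cost
    return reads_per_dollar, reads_per_dollar * mean_read_len_bp


# Inputs as listed in Table 2.
illumina = throughput_per_dollar(4200.00, 10_000_000, 180)
flx = throughput_per_dollar(8000.00, 400_000, 250)

# Cost ratios relative to 454-FLX.
read_ratio = illumina[0] / flx[0]
bp_ratio = illumina[1] / flx[1]

print(f"Illumina: {illumina[0]:.0f} reads/$, {illumina[1]:.0f} bp/$")
print(f"454-FLX:  {flx[0]:.0f} reads/$, {flx[1]:.0f} bp/$")
print(f"ratio per read: {read_ratio:.0f}, ratio per bp: {bp_ratio:.0f}")
```

The per-bp ratio is the more conservative figure because it accounts for the longer 454-FLX reads, which is why the Discussion quotes a ">30-fold" increase rather than the per-read value.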
concentration varies from lot to lot, and typical concentrations were found to range from
11-15 nM). The PhiX standard curve was required to have an r² value of >0.98 or the run was
repeated. This step confirmed both the concentration of the libraries and the presence of
proper primers for Illumina sequencing.
Illumina paired-end sequencing
Illumina libraries were clustered for sequencing immediately following sample quantification
according to the manufacturer's recommendations (Illumina). Following clustering, the samples
were loaded onto an Illumina GAIIx sequencer. Sequencing reagents for each read were pooled
prior to loading. Data from the Illumina GAIIx were analyzed using the Illumina pipeline
1.4.0 to generate fastq files.
Mate-Read Overlapper Algorithm
In order to align overlapping mate reads and produce accurate composite fragments, we
developed the SHERA software package, a collection of computer code written in the PERL
programming language. The algorithm for aligning paired reads and computing quality scores
was designed to be broadly applicable, independent of the sequencing platform. The source
code is freely available at http://alnlab.mit.edu/ALM/Software/Software.html. In brief,
SHERA takes overlapping short reads as input, finds the best alignment, produces a composite
read, calculates new quality scores for the overlapping bases, and scores the alignment
confidence. A detailed description of each of these steps is provided below. We also describe
our procedure for evaluating false alignment rates on both real and simulated reads.
Finding the best alignment
For each mate-pair, the algorithm scores all possible ungapped alignments (this includes
alignments shorter than the read length, which are produced when a fragment is fully
sequenced and additional synthesis reactions proceed past its 5-prime end, base-pairing with
the Illumina adapters). Each alignment is scored by summing over all matches and subtracting
a penalty for each mismatch. For the alignment with the best score, the algorithm constructs
the consensus sequence in this overlapping region by reporting the nucleotide with the higher
quality value, or the informative base call if one nucleotide is 'N'. A Phred quality score
is calculated for each consensus base (see below). The algorithm output includes the sequence
and quality of the composite read generated from each mate pair, as well as an alignment
confidence score that enables filtering out probable mis-alignments.
Alignment confidence metric
Alignment confidence was quantified in order to identify composite reads constructed from
mis-alignments. Due to the relatively short alignment lengths with non-uniform base
composition and quality, we did not model the statistical significance of an alignment
analytically, but instead chose an empirical metric. Because mis-alignments of similar length
have similar alignment scores (within noise), we use seven alignments of similar length of
the same mate-pair to calculate a baseline, and divide by the range of the scores to estimate
the alignment's significance. Alignments of all confidence levels are reported; we leave
selection of this threshold to the user to achieve the sensitivity/specificity required by
their downstream application.
Calculation of a Phred Quality Score for the overlapped region
SHERA reports a Phred-scaled error probability for each base in the composite read. For
overlapped bases, this is the posterior probability that the consensus base is wrong, given
the base calls at that position in the alignment and their respective quality scores.
Calculating this probability required making a few assumptions and approximations:
1. Given that a basecall was in error, there is a uniform probability of reporting one of the
other three bases.
2. The probability of correct alignment was approximated to be 1. (A decent approximation
given true-positive rates of >99% in our quality filtered alignments, see Counting False
Alignments below.)
3. The basecalls of the forward and reverse read are independent observations, given the true
base in the insert.
4. The prior probability of a base being present in an insert in this flowcell can be
estimated from the low-error portion of the reads or approximated as uniform at the user's
discretion. (In our case, we used empirical observations to estimate %GC in single-genome
samples, and a uniform distribution for the metagenomic sample.)
We first confirmed that the posterior error probability for the overlapped region was correct
by counting the true error rate in composite reads constructed from simulated reads. Next,
using real mate-reads as input, we confirmed that our Phred quality scores up to 32
correspond to the correct error rate as calculated by comparing to the reference. Above 32,
the error rate does not improve (for either the overlapped bases or the original Illumina
bases). This is due to errors in the reference and/or a very small percentage of Illumina
reads with insertions or deletions.
Counting False Alignments
Minimizing the number of false alignments is an important step for some applications such as
de novo assembly. Our goal was to find a threshold for alignment confidence that would allow
no more than 1% misalignments to remain in our subset of composite reads (based on previous
experience with de novo assembly of libraries containing hybrid reads). We used several
approaches to assess our alignment accuracy and found similar results.
First, we used the read-simulator from the MAQ package [20] to produce simulated reads that
mimicked the error profiles observed in the quality values of our Illumina data as well as
the distribution of overlap lengths. In the case of simulated reads, the true distance
between mate pairs is known exactly and the reference is perfect. After filtering (confidence
metric 0.5) we retained 89.7% of the composite reads, 0.5% of which were constructed from
mis-alignments. This strategy is useful when no reference is available, or as a test case on
new platforms.
Second, we used mate-reads from the PhiX control lane sequenced during our Illumina
sequencing run. This DNA library is prepared from the PhiX174 bacteriophage genome for
purposes of quality control, and we found that it had an insert length of 200±21 bp
(estimated by mapping to the reference with MAQ). These mapped mate-pairs defined the "true"
insert length of each sequenced fragment. After constructing the composite sequences and
filtering, we plotted the difference in length between a composite fragment and its insert
length as predicted by MAQ (Figure 7). This histogram is a sum of two distributions, the
overlapper software's misalignments (a broad gaussian) and a sharp peak of small (1-2 bp)
indels. We examined gapped alignments to confirm this was the case and also found that many
of the insertions or deletions occur in homopolymer runs. We used a simple linear model to
infer the number of misalignments; this yielded a false positive rate of 1.0% for PhiX174.
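The overlap search and the quality calculation described in the Methods (under "Finding the best alignment" and "Calculation of a Phred Quality Score for the overlapped region") can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SHERA implementation (which is written in Perl): the function names, the +1/-1 match and mismatch weights, the minimum overlap of 4, and the default uniform base prior are all choices made for this example, and inserts shorter than the read length are not handled.

```python
import math

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_score(fwd, rev_rc, overlap, mismatch_penalty=1):
    """Ungapped score: last `overlap` bases of fwd vs first `overlap` of rev_rc."""
    score = 0
    for a, b in zip(fwd[-overlap:], rev_rc[:overlap]):
        if a == "N" or b == "N":
            continue  # assumption: uninformative positions contribute nothing
        score += 1 if a == b else -mismatch_penalty
    return score


def consensus_call(b1, q1, b2, q2, prior=None):
    """Consensus base and Phred score from two independent base calls."""
    if b1 == "N" and b2 == "N":
        return "N", 0.0
    if b1 == "N":
        return b2, float(q2)
    if b2 == "N":
        return b1, float(q1)
    prior = prior or {b: 0.25 for b in BASES}
    e1, e2 = 10 ** (-q1 / 10), 10 ** (-q2 / 10)
    # Likelihood of each observation given true base t: (1 - e) if it matches,
    # otherwise e / 3 (errors spread uniformly over the other three bases).
    post = {t: prior[t]
               * ((1 - e1) if b1 == t else e1 / 3)
               * ((1 - e2) if b2 == t else e2 / 3)
            for t in BASES}
    call = b1 if q1 >= q2 else b2  # report the higher-quality nucleotide
    p_err = sum(p for t, p in post.items() if t != call) / sum(post.values())
    return call, -10 * math.log10(max(p_err, 1e-10))


def merge_pair(fwd, fwd_q, rev, rev_q, min_overlap=4):
    """Return (composite sequence, qualities, overlap length) for one pair."""
    rev_rc, rev_rc_q = reverse_complement(rev), rev_q[::-1]
    longest = min(len(fwd), len(rev_rc))
    best = max(range(min_overlap, longest + 1),
               key=lambda o: overlap_score(fwd, rev_rc, o))
    start = len(fwd) - best
    seq, qual = list(fwd[:start]), list(fwd_q[:start])
    for i in range(best):
        base, q = consensus_call(fwd[start + i], fwd_q[start + i],
                                 rev_rc[i], rev_rc_q[i])
        seq.append(base)
        qual.append(q)
    seq.extend(rev_rc[best:])
    qual.extend(rev_rc_q[best:])
    return "".join(seq), qual, best


# Toy example: a 20 bp fragment sequenced with two 14 bp reads.
fragment = "ACGTTGCAAGGCTTAACCGT"
fwd = fragment[:14]
rev = reverse_complement(fragment[-14:])
composite, qualities, overlap = merge_pair(fwd, [30] * 14, rev, [30] * 14)
```

When the two calls agree, the posterior error is far smaller than either read's own error, so overlapped bases gain quality; when they disagree, the reported score drops below that of the better call, which is the behavior the four assumptions listed in the Methods imply.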
[Figure 7: histogram; x-axis: difference between pairs' mapped length and composite fragment length (-200 to 100).]
Figure 7. Short insertions and deletions in low-confidence composite Illumina reads. Mate-reads from the control lane (PhiX174
bacteriophage genome) were used to assess false alignments introduced by the SHERA pipeline. After constructing the composite sequences and
filtering, we plotted the difference in length (if any) between a composite fragment and its insert length as predicted by MAQ by mapping the
original mate-pairs to the reference. This histogram is a sum of two distributions, the overlapper software's misalignments (a broad gaussian) and a
sharp peak of small (1-2 bp) indels. We used a simple and conservative linear model to remove the indels and infer the number of misalignments.
doi:10.1371/journal.pone.0011840.g007
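The length-difference comparison behind Figure 7 can be illustrated with a short sketch. It is deliberately simpler than the linear model used in the paper, whose details are not given here: it assumes a fixed tolerance of 2 bp for small indels, sets those pairs aside, and counts any larger discrepancy as a misalignment. The function name, the tolerance parameter, and the toy input are hypothetical.

```python
from collections import Counter


def classify_length_differences(diffs, indel_tolerance=2):
    """Classify (composite length - mapped insert length) values.

    Returns the category counts and the misalignment rate among pairs that
    are not attributed to small indels.
    """
    counts = Counter()
    for d in diffs:
        if d == 0:
            counts["exact"] += 1
        elif abs(d) <= indel_tolerance:
            counts["small_indel"] += 1  # excluded from the rate
        else:
            counts["misaligned"] += 1
    considered = counts["exact"] + counts["misaligned"]
    rate = counts["misaligned"] / considered if considered else 0.0
    return counts, rate


# Toy input: 95 exact matches, 3 small indels, 2 gross length errors.
toy_diffs = [0] * 95 + [1, -2, 2] + [-40, 17]
toy_counts, toy_rate = classify_length_differences(toy_diffs)
print(dict(toy_counts), f"misalignment rate = {toy_rate:.3f}")
```

A hard cutoff like this ignores misalignments that happen to fall within 1-2 bp of the true length, which is the part of the broad distribution a fitted model would account for.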
Finally, for Figure 3C, we assessed alignment yield and accuracy on mate-pair reads sequenced
from a dSPRI library. This library was generated from a single Prochlorococcus cell (as
described [15]), for which a draft copy of the genome (N50 = ~7 Kb) was constructed
independently by de novo assembly of 454-Titanium reads from the same DNA sample. We mapped
the Illumina mate-paired reads to these 454-Titanium contigs to define the "true" insert
length of each sequenced insert. A small fraction of mate-pairs (<5%) were unmapped or mapped
to different contigs; these were excluded from further analysis. As in the case of PhiX174,
there were a number of small indels evident in the comparison between reads and reference,
which were excluded from our statistic. We estimate that 0.5% of our high-confidence set of
composite reads were mis-alignments introduced by SHERA.
Comparison of 454-FLX and overlapped Illumina reads for metagenomic applications
To evaluate the suitability of overlapped Illumina reads for metagenomic analysis, we made a
direct comparison to the Roche 454-FLX platform, currently the most commonly used sequencing
platform for this application. We used sequencing data from the exact same marine metagenomic
sample [10]. We asked whether reads from the two platforms were equally able to identify
significant protein homology with known proteins for the purposes of taxonomic
classification.
We obtained 6,339,825 mate-paired Illumina reads of length 143 ntds each with insert lengths
shown in Figure 3. We joined the Illumina mate-pair reads with our algorithm and filtered the
composite reads by alignment confidence, resulting in 4,830,552 high-confidence composite
reads of length 180±40 bp. We analyzed both high-confidence and low-confidence sets of
composite reads as described below. We also had access to 673,673 454-FLX reads of length
207±71 ntds available for comparison [10]. We blasted all 454-FLX reads and composite
Illumina reads against the NCBI protein database nr [21] using translated nucleotide queries
(blastx parameters -b 100 -v 100 -e 30 -Q 11 -F "m S"), and used the popular metagenomics
software MEGAN [22] for downstream analysis. We compared the fraction of unassignable reads
(due to lack of a significant blast hit in the database) across four subsets of sequence data
from the same sample, which were chosen to compare sequencing platforms and the effect of
read length: high-confidence composite Illumina reads, low-confidence composite Illumina
reads, a representative sample of our 454 reads (mean length 207 bp), and a longer subset of
our 454 reads (mean length 254 bp). For comparison purposes we used random samples of 50,000
reads each. We used the MEGAN parameters recommended by the authors [22] as suitable for
comparing reads of different lengths (minscore = 35.0, minscorebylength = 0.35,
toppercent = 10, minsupport = 1; the low minsupport value was chosen because some of the data
subsets tested were necessarily small. We also tested the alternative parameter set with
similar results: minscore = 40.0, toppercent = 10, minsupport = 5). In Figure 6A we show the
fraction of DNA fragments that MEGAN was able to assign to a known genome or taxon via
significant protein homology. We show that composite Illumina reads from high-confidence
alignments perform at least as well as reads obtained from 454-FLX. We also find that
composite reads produced by alignments we designated as "low-confidence" have a very high
fraction of unassignable reads, further proof of the ability of our algorithm to identify
misalignments. Finally, we show results for 454-FLX reads of mean length 254 bp (range:
230-270 bp), more typical of datasets produced by the 454-FLX platform. The assignable read
fraction is comparable.
Next we asked if the taxonomic classification was consistent across platforms. We used MEGAN
[22] to classify all of the 454-FLX reads and an equal number of composite Illumina reads.
Figure 6B shows the taxonomic classification is relatively invariant with respect to the
sequencing technology chosen. We performed this analysis with two reasonable parameter sets
(minscore = 35.0, minscorebylength = 0.35, toppercent = 10, minsupport = 5 OR
minscore = 40.0, toppercent = 10, minsupport = 5) and observed equally good agreement in
taxonomic classification across platforms.
Supporting Information
Table S1 Top 25 taxa identified in the HOT186 75 meters depth metagenomics sample.
Found at: doi:10.1371/journal.pone.0011840.s001 (0.80 MB PDF)
Acknowledgments
We are grateful to the members of the DeLong and Chisholm laboratories
(MIT) that collected, processed, and sequenced the metagenomic DNA
sample with 454-FLX. We would also like to thank the MIT BioMicro
Center for performing the Illumina sequencing reported in this study.
Author Contributions
Conceived and designed the experiments: SR ACM EJA SWC. Performed
the experiments: SR ACM MCB RM. Analyzed the data: SR ACM SCT.
Contributed reagents/materials/analysis tools: SR ACM SCT. Wrote the
paper: SR ACM SCT EJA SWC.
References
1. Schuster SC (2008) Next-generation sequencing transforms today's biology. Nat Methods 5: 16-18.
2. Hiatt JB, Patwardhan RP, Turner EH, Lee C, Shendure J (2010) Parallel, tag-directed assembly of locally derived short sequence reads. Nat Methods 7: 119-122.
3. Sorber K, Chiu C, Webster D, Dimon M, Ruby JG, et al. (2008) The long march: a sample preparation technique that enhances contig length and coverage by high-throughput short-read sequencing. PLoS One 3: e3495.
4. Lennon NJ, Lintner RE, Anderson S, Alvarez P, Barry A, et al. (2010) A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biol 11: R15.
5. DeAngelis MM, Wang DG, Hawkins TL (1995) Solid-phase reversible immobilization for the isolation of PCR products. Nucleic Acids Res 23: 4742-4743.
6. Hawkins TL, O'Connor-Morin T, Roy A, Santillan C (1994) DNA purification and isolation using a solid-phase. Nucleic Acids Res 22: 4543-4544.
7. Krizova J, Spanova A, Rittich B, Horak D (2005) Magnetic hydrophilic methacrylate-based polymer microspheres for genomic DNA isolation. J Chromatogr A 1064: 247-253.
8. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8: 186-194.
9. Mitra S, Schubach M, Huson DH (2010) Short clones or long clones? A simulation study on the use of paired reads in metagenomics. BMC Bioinformatics 11 Suppl 1: S12.
10. Martinez A, Tyson GW, Delong EF (2010) Widespread known and novel phosphonate utilization pathways in marine bacteria revealed by functional screening and metagenomic analyses. Environ Microbiol 12: 222-238.
11. Wommack KE, Bhavsar J, Ravel J (2008) Metagenomics: read length matters. Appl Environ Microbiol 74(5): 1453-63.
12. Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P (2008) Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics 9: 217.
13. Brady A, Salzberg SL (2009) Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods 6: 673-676.
14. Kettler GC, Martiny AC, Huang K, Zucker J, Coleman ML, et al. (2007) Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet 3: e231.
15. Rodrigue S, Malmstrom RR, Berlin AM, Birren BW, Henn MR, et al. (2009) Whole genome amplification and de novo assembly of single bacterial cells. PLoS One 4: e6864.
16. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59.
17. Matz M, Shagin D, Bogdanova E, Britanova O, Lukyanov S, et al. (1999) Amplification of cDNA ends based on template-switching effect and step-out PCR. Nucleic Acids Res 27: 1558-1560.
18. Shagin DA, Lukyanov KA, Vagner LL, Matz MV (1999) Regulation of average length of complex PCR product. Nucleic Acids Res 27: e23.
19. Siebert PD, Chenchik A, Kellogg DE, Lukyanov KA, Lukyanov SA (1995) An improved PCR method for walking in uncloned genomic DNA. Nucleic Acids Res 23: 1087-1088.
20. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851-1858.
21. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
22. Huson DH, Auch AF, Qi J, Schuster SC (2007) MEGAN analysis of metagenomic data. Genome Res 17: 377-386.
1.6 Contributions to this manuscript
As the sole author with bioinformatic/computational expertise, I was 100% responsible for
designing, writing, and testing the SHE-RA algorithm, analyzing all of the sequence data, and
making all of the figures with the exception of the Bioanalyzer machine plots. I was
responsible for writing all aspects of the paper that deal with the algorithm and sequence
analysis.
1.7 Appendix
1.7.1
Supplemental Note: alignment confidence metric
To assess alignment confidence, one could calculate the probability of an alignment being
due to chance analytically. The multivariate hypergeometric distribution would be
appropriate for this calculation if certain simplifying assumptions are true. Namely, the
probability of picking a particular base must be fixed. The short alignment lengths and high
local variance of base and GC content make this assumption untenable. Instead, an empirical
metric was developed. The chosen alignment's score is compared to the empirical
distribution of scores of (mis-)alignments of similar length from the same read and region.
The alignment confidence metric is calculated as the maximal score's excess over this
distribution, normalized by the distribution's width.
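As a rough illustration of the empirical metric described in this note, the sketch below scores one mate-pair given its alignment scores indexed by overlap length. Several details are assumptions made for the example and are not taken from the SHE-RA source: the baseline is the median of the seven neighboring scores, the width is the range spanned by the best score and those neighbors (which keeps the value between 0 and 1 when the best score is the maximum), and the function name is hypothetical.

```python
from statistics import median


def alignment_confidence(scores, best_overlap, n_neighbors=7):
    """Excess of the best score over nearby (mis-)alignments, scaled by width.

    `scores` maps overlap length -> ungapped alignment score for one mate-pair.
    """
    best_score = scores[best_overlap]
    # Alignments of the most similar length, excluding the chosen one.
    others = sorted((o for o in scores if o != best_overlap),
                    key=lambda o: (abs(o - best_overlap), o))
    neighbors = [scores[o] for o in others[:n_neighbors]]
    if not neighbors:
        return 0.0
    baseline = median(neighbors)
    width = max(best_score, *neighbors) - min(best_score, *neighbors)
    if width == 0:
        return 0.0
    return (best_score - baseline) / width


# A clear overlap at 100 bp versus a best score that barely exceeds the noise.
strong_scores = {96: -3, 97: 2, 98: -1, 99: 0, 100: 90,
                 101: 1, 102: -2, 103: 3, 104: 4}
weak_scores = {**strong_scores, 100: 5}
strong = alignment_confidence(strong_scores, 100)
weak = alignment_confidence(weak_scores, 100)
```

A user would then keep composite reads whose value exceeds a threshold suited to the downstream application, as described for the 0.5 cutoff used with simulated reads.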
[Supplemental Figure 1: distributions of correct alignments and mis-alignments versus the SHE-RA alignment confidence metric (x-axis, 0.0 to 1.0).]
Supplemental Figure 1. SHE-RA alignment confidence metric separates true
alignments from mis-alignments.
SHE-RA reports the best alignment for all reads, along with a confidence score that allows
the user to tune the stringency of overlaps according to the fidelity required for their
downstream application. Simulated reads of the same length and quality distribution as the
Illumina IIGA reads were used to definitively assess true and mis-alignments. The figure
shows that the alignment confidence score produces a good separation between true
alignments and mis-alignments. The resulting separation was confirmed consistent in real
data, where overlap accuracy was estimated by mapping to a draft reference genome
assembled with separate data.
[Supplemental Figure 2: top 25 taxa represented in sample; x-axis: overlapped Illumina reads (log scale). Legend: Prochlorococcus marinus; Bacteria; cellular organisms; root; Candidatus Pelagibacter; Proteobacteria; Cyanobacteria; Gammaproteobacteria; Bacteroidetes; SAR11 cluster; Psychroflexus torquis ATCC 700755; Prochlorococcus phage P-SSM2; Candidatus Pelagibacter sp. HTCC7211; Alphaproteobacteria; Alteromonas macleodii; Candidatus Pelagibacter ubique; Not assigned; Rhodobacterales; Viruses; Flavobacteria; Prochlorococcus phage P-SSP7; Candidatus Pelagibacter ubique HTCC1062; Alteromonas macleodii ATCC 27126; Prochlorococcus marinus str. AS9601; Prochlorococcus marinus str. MIT 9312.]
Supplemental Figure 2. Taxa abundances in metagenomics sample are similar when
using overlapped Illumina or 454 sequencing.
Reads from 454 and SHE-RA-processed Illumina paired end reads were blastx-ed against nr
and the software MEGAN was used to assign each read to a known genome or taxon via
significant protein homology. (A subset of 500,000 unfiltered random reads from each
dataset was used for comparison.) The close correspondence in taxa abundance
demonstrates that the overlapped Illumina reads are a legitimate alternative for surveying
community composition and that the library construction protocol does not suffer any gross
biases or bottlenecks.
[Supplemental Figure 3: Prochlorococcus gene content; series: 454-FLX, overlapped Illumina.]
Supplemental Figure 3. Gene functions identified in metagenomics sample HOT186
75m show similar fractional abundances using overlapped Illumina or 454
sequencing.
Community shotgun genomic data from the HOT186 75m sample were analyzed for gene
content from Prochlorococcus, the most abundant taxon in this sample. Reads from 454 and
SHE-RA-processed Illumina paired end reads were blastx-ed against nr and those reads
assigned to the Prochlorococcus genus were binned by their GOslim functional category. (A
subset of 500,000 unfiltered random reads from each dataset was used for comparison.)
The correspondence in gene function abundance demonstrates that the overlapped Illumina
reads are effective for studying community gene function.
Chapter 2. Assembly and analysis of 82 Vibrio genomes
2.1 Abstract
The Vibrionaceae family of marine bacteria is an important model system for bacterial
evolution and ecology. Drawing from our strain collection of thousands of environmental
isolates grouped into natural, ecologically distinct populations, we sequenced 82 whole
genomes from 22 populations to better understand the genomic underpinnings of
ecological differentiation.
In the first half of this chapter, I describe methodological advances required for
short read assembly, in order to build and analyze the genomic reference for this model
system. A hybrid assembly approach was designed, enabling efficient construction of
related genome sequences. This approach produces significantly more contiguous genomes
than de novo assembly alone can, and is generally applicable to short read assembly
problems.
In the second half of this chapter, I survey the relationships of phenotype and gene
content with ecology in this model system. Core and flexible genes were defined using these
ecological populations, and genome comparisons revealed fast turnover and adaptive
fixation of flexible genes on the extremely fast phylogenetic timescales where we previously
observed ecological partitioning at the microscale. A high rate of recent HGT exceeded that
previously found between ecologically similar organisms. However, over the long term,
variance in neither gene content nor metabolic profiles was correlated with our measures
of ecology; instead, it was primarily explained by phylogenetic divergence: cohabitant
populations did not have more similar gene content, flexible pools or substrate utilization
profiles once the effect of distance was removed.
2.2 Introduction
The Vibrionaceae are heterotrophic bacteria abundant in aquatic environments
worldwide and include many human and animal pathogens, and are thus ecologically,
medically and economically important. Unlike the vast majority of environmental bacteria,
vibrios are easily cultivatable in the lab, making them a good model system for comparative
genomics study, particularly because of the breadth of ecological data collected for this
group and the ecologically-differentiated populations we were thus able to define.
Using the environmental data collected with more than 1900 marine isolates,
ecologically cohesive populations have been delineated among the vibrios. Bacterial
taxonomic units are typically demarcated by drawing arbitrary phylogenetic cutoffs, but
such groupings often correspond poorly to organisms' physiology or ecology (Francisco,
Cohan, and Krizanc 2012). Instead, the vibrios were grouped into both phylogenetically and
ecologically cohesive populations (Hunt et al. 2008; Preheim et al. 2011; Szabo et al. 2012;
Preheim 2010), using a variety of environmental datasets analyzed by the machine learning
algorithm AdaptML (Hunt et al. 2008). Ecological differentiation of related populations was
observed at levels of phylogenetic divergence that were surprisingly fine, but subsequent
studies consistently rediscovered these same ecological populations: they were largely
cohesive in a variety of gene phylogenies (Preheim, Timberlake, and Polz 2011),
reproducibly recovered across time (Szabo et al. 2012), and robust to the particular
ecological trait used to define them (Preheim, Timberlake, and Polz 2011; Preheim et al.
2011). The ecological distinctness between populations demonstrates that, despite
co-existing in the water column and sometimes being very closely related, these groups do
not have the same niche, instead partitioning resources spatially and temporally. Within
populations, individuals interact socially (Cordero, Wildschutte, et al. 2012) and sexually
(Shapiro et al. 2012), experiencing rampant homologous recombination along population
lines that could result in genetic differentiation in a manner akin to eukaryotic speciation.
All in all, these ecological populations have many of the hallmarks of a fundamental
taxonomic unit, and it is likely that within each, individual genotypes share a niche and face
similar selective pressures, making this an excellent model system to study diversity,
divergence, and adaptation to environment.
While ecologically differentiated populations have been defined in this family, it is not
yet known to what extent this differentiation engenders genotypic or phenotypic
divergence. In general, the extent to which ecology predicts similarity of gene content or
other phenotype is not well understood. While bacterial lineages with similar 'lifestyles'
have been reported to possess more similar gene functional content (Lauro et al. 2009),
both gene type and ecology are defined broadly in such studies. Genomes of similar ecology
are also reported to exchange more genes, either from co-habitants or other, geographically
distinct, genomes of similar ecology (Hooper, Mavromatis, and Kyrpides 2009; Smillie et al.
2011). However, the overall long-term effect of this gene pool sharing is not known: is it so
rampant and persistent that multiple lineages adapting to the same habitat do so through
the use of convergent strategies?
To explore the effect of similar microscale ecology on gene sharing, we determine
the extent to which populations with similar environmental distributions share overall gene
content, as well as the frequency and structure of recent horizontal gene transfer in this
clade. For each of our ecological populations defined by AdaptML, a
projected habitat was inferred based on the isolation data of their phylogenetic group, and
serves as one representation of the group's ecology. In this study, we consider habitats
learned from season (Fall/Spring) and particle size, two environmental measures available
for the vast majority of the strains whose full genome was sequenced. These environmental
traits have been shown to be important axes for ecological differentiation in this group
(Thompson et al. 2004; Hunt et al. 2008) as well as stable descriptions of populations'
ecology over time (Szabo et al. 2012), but their utility as descriptors of true ecology, and
their genomic and metabolic correlates, were unknown. We ask the extent to which multiple
lineages with the same projected habitat share genes or common carbon substrates.
Vibrios are known for their great metabolic potential and diversity, but the driving
forces and dynamics of metabolic evolution are still largely unknown (Keymer et al. 2007).
Substrate metabolism could be largely stable on the timescales of bacterial species
divergence, driven primarily by lineage history (phylogenetic inertia), or it could be
substantially divergent, driven by the current ecological preferences of the lineage. It is also
possible that it could be diverse, varying substantially within a population on the
timescales where we observe rapid flexible gene turnover. Using 46 metabolically diverse
strains of vibrio, we investigate the timescales for diversity and divergence of metabolic
profiles, and the extent to which multiple adaptations to the same inferred habitat involve
convergence of phenotypes.
Closely related bacteria are increasingly appreciated to vary greatly in their gene
repertoire (Tettelin et al. 2005), and vibrios in particular have remarkably fast turnover of
much of their genome (Thompson 2005), but the selective implication of this large and
labile flexible genome is poorly understood (Polz et al. 2006). Conspecific vibrios can differ
by 1 megabase in genome content in their 5 megabase genomes (Polz et al. 2006), and even
strains related as closely as ~1% 16S divergence can exhibit 20% differences in genome
size (Thompson 2005). It is not implausible to suggest that this surprisingly large fraction
of the genome may not be adaptive (Berg and Kurland 2002; Polz et al. 2006); indeed, in the
absence of other evidence, neutral variation seems the most appropriate null hypothesis.
However, the current literature typically makes one of two opposing assumptions about the
adaptive value of flexible genes: some studies anecdotally hold any strain-specific gene
content responsible for pathogenic or other adaptive phenotypes, while others dismiss
low-penetrance or strain-specific genes as likely being inessential or not adaptive
(Haegeman and Weitz 2012). Thus, opposing interpretations of the adaptive value of this
rather extensive 'flexible' gene content are common, and a recently introduced neutral model
for flexible gene distribution has only been tested in a limited way. Because our dataset
includes ecological differentiation on a similarly fine scale as this rapid gene turnover, we
can use ecology to investigate the extent to which flexible genes fix neutrally in a
population.
To construct the genomic reference for this important model system and perform
this first survey of the interplay between ecology and genotype, we relied on next
generation sequencing (NGS). NGS platforms have made genome sequencing tractable and
affordable; however, short reads introduce a number of limitations and technical challenges.
In the first part of this study, we discuss two methodological advances developed to
overcome the limitations of current short read assembly methods. The first is a generally
applicable approach developed for efficient and contiguous assembly of bacterial genomes.
The second is a pipeline to construct from short reads the important phylogenetic marker,
the 16S gene.
2.3 Technical challenges and new assembly approaches
2.3.1 Hybrid genome assembly approach for short reads
A hybrid approach to assembly was designed which significantly improved assembly
contiguity for many types of short reads. Assembly of short reads presents a number of
theoretical and technical challenges. There are practical limits on the efficiency with which
de novo assembler heuristics infer linkage information (Bankevich et al. 2012), especially in
light of abundant errors and regions of aberrant coverage due to amplification and
sequencing biases (Aird et al. 2011). There are also theoretical limits on de novo assembly
of certain features by short reads (Wetzel, Kingsford, and Pop 2011; Chaisson, Brinza, and
Pevzner 2009), particularly the early Illumina platforms' short single-end reads that
constituted most of our dataset. On the other hand, reference assembly approaches require
an extremely closely related (near-)finished genome and ignore the substantial amount of
strain-specific DNA. None of these issues can be overcome simply by the abundant deep
coverage afforded by short reads. State-of-the-art de novo assembly of single end short read
datasets produced genome sequences fragmented into many thousands of contigs (Zerbino
and Birney 2008; Salzberg et al. 2008).
Because of systematic biases and theoretical constraints, the gaps between contigs
are not random failures but largely fall in the same regions, particularly in related genomes
(Wetzel, Kingsford, and Pop 2011; Chaisson, Brinza, and Pevzner 2009). Our initial
approach to assembly of many related genomes, gap-filling between contigs via alignment
to a related strain's contigs, proved ineffective, as the vast majority of gaps occurred at the
same trouble spots in both genomes (S. Timberlake, unpublished data). Long-insert
libraries (~4kb), required to overcome the theoretical limits on de novo assembly of short
reads (Chaisson, Brinza, and Pevzner 2009; Butler et al.; Wetzel, Kingsford, and Pop 2011),
provide long range linkage information between regions bordering trouble spots so
assemblies no longer break in those regions. However, long DNA fragments are physically
incompatible with the Illumina platform (Levine 2012), and the library construction
protocols designed to sidestep this limitation are very costly in terms of reagents and time.
A hybrid assembly approach was designed to make efficient use of labor-intensive
long-insert linkage information, propagating it to related strains in a conservative manner.
Each hybrid assembly requires a reference genome, but the required relatedness of this
reference is more relaxed than is typical for reference assembly. A minimal set of costly
references was chosen by estimating the maximum nucleotide divergence at which reference
assembly was useful, given the constraints of the mapping software at the time (Li, Ruan,
and Durbin 2008). We used long-insert libraries, prepared in addition to standard libraries,
to close (or nearly close) one genome per population, using a standard de novo assembly
algorithm (Zerbino et al. 2009). References so generated have complete sequences in these
otherwise unassemblable regions, making synteny information available for the rest of the
closely related strains in the population. Where long inserts were unavailable, even synteny
information from reference strains assembled from short-insert paired reads improved
single-end read genome assembly with this hybrid approach.
Hybrid assembly is a modular process proceeding in two main steps: reference
assembly to propagate conserved synteny information and de novo assembly to incorporate
strain-specific DNA and resolve strain-specific linkage. Briefly, for each strain "S"
undergoing hybrid assembly, whole-genome shotgun short reads from S were mapped to
the most closely related reference strain "R", generating reference contigs that were
supplied to the de novo assembly algorithm with all reads from strain S. Any single base of
R not covered by mapped reads from S was considered a possible strain-specific synteny
breakpoint. This conservative approach was motivated by the observation that closely
related vibrios experience frequent synteny-disrupting gene insertion and deletion, and
that the genomic repeats which are hotspots for such events (Achaz et al. 2003; Rocha
2003) include the exact same features where synteny cannot be resolved by short reads
(Wetzel, Kingsford, and Pop 2011; Chaisson, Brinza, and Pevzner 2009).
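The conservative breakpoint rule described above (any base of R with no mapped read from S ends a reference contig) can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in this work: the function name `reference_contigs`, the string representation of R, and the per-base depth list are hypothetical, and for simplicity the contigs are sliced directly from R rather than called from the mapped reads of S.

```python
def reference_contigs(reference, coverage):
    """Split a reference sequence at every base with zero mapped coverage.

    reference: sequence of strain R (hypothetical input, a plain string).
    coverage:  per-base depth of strain S reads mapped onto R.
    Returns the covered stretches, each treated as one reference contig.
    """
    if len(reference) != len(coverage):
        raise ValueError("reference and coverage must be the same length")
    contigs = []
    start = None  # start index of the covered stretch currently open
    for i, depth in enumerate(coverage):
        if depth > 0 and start is None:
            start = i                      # a covered stretch begins
        elif depth == 0 and start is not None:
            contigs.append(reference[start:i])  # uncovered base: break here
            start = None
    if start is not None:                  # close a stretch reaching the end
        contigs.append(reference[start:])
    return contigs


# Toy example: two uncovered regions split the reference into three contigs.
toy_reference = "ACGTACGTAC"
toy_coverage = [3, 3, 2, 0, 0, 1, 4, 4, 0, 2]
print(reference_contigs(toy_reference, toy_coverage))  # ['ACG', 'CGT', 'C']
```

Breaking at every uncovered base discards some linkage that is probably real, but it avoids imposing the synteny of R on regions where S may carry an insertion or deletion.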
This hybrid assembly approach substantially increased N50, the standard metric for
assembly contiguity, for all common lengths and types of Illumina reads. (The N50 is the
length of the smallest contig in the minimal set of largest contigs that together comprise
half the total bases in the genome.) Hybrid assembly improved N50s by 2-3 fold (Figure 1) over
de novo assembly alone, and was effective for both single-end reads (51-76bp) and the
paired-end reads (63-102bp) produced by more mature NGS, the Illumina IIGA. (Extensive
parameter space exploration was conducted for both assembly methods; see Methods).
Moreover, this modular approach is generally applicable in terms of software
algorithms, short read platform, and long-range linkage information source. With the
modules used here, reference genomes that successfully increased contiguity were as
distant as ~90% average nucleotide identity in the core genome, substantially more
distant than is typically suggested for reference assembly of any kind, including some
other hybrid approaches (Salzberg et al. 2008). However, long-range linkage information
need not come from a near-complete reference genome; Pacific Biosciences now offers long
reads which will likely be a suitable reference for this approach (English et al.
2012).
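For readers unfamiliar with the N50 statistic used above, a minimal sketch of its computation follows. The function name `n50` and the contig lengths are made up for illustration, and the sketch assumes the total is taken over the assembled contigs themselves rather than over a known genome size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are added from longest to shortest until they hold at least
    half of the total assembled bases; the last length added is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")


# Hypothetical contig lengths (bp) with the same total under two strategies.
de_novo = [40, 30, 20, 20, 10, 10, 10, 10]
hybrid = [80, 40, 20, 10]
print(n50(de_novo), n50(hybrid))  # 20 80
```

Both toy assemblies contain 150 bases, so the difference in N50 reflects contiguity alone.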
2.3.2 Pipeline for 16S rRNA gene assembly from short reads
The small ribosomal RNA subunit (16S rRNA) cannot be assembled from short read data by
traditional methods, but it is integral to microbial taxonomy and phylogeny as the canonical
unit of evolutionary distance. As part of the operational bacterial taxon definition, it is used
in the standard procedure for identifying type strains or named species for an isolate, as
well as for some definitions of HGT (Smillie et al. 2011). Given these important roles for the
16S ribosomal sequence and its unavailability from our short read sequences, a custom
pipeline was built to assemble it.
16S sequences are particularly recalcitrant to assembly from short reads due to
aberrant coverage and repetition in the genome. Vibrio genomes in particular possess an
extremely high number of 16S gene copies among known bacteria, with each genome
encoding multiple (8-14) polymorphic copies (Z. M. P. Lee, Bussema III, and Schmidt 2009),
whose diversity within a single genome (Moreno, Romero, and Espejo 2002) approaches
that between OTUs. Like all divergent repeats, it is theoretically impossible to assemble de
novo from short reads. However, reference assembly requires a full-length 16S sequence
from a closely related conspecific, which is unavailable for many environmental bacteria. A
recently published algorithm for 16S gene assembly iterates read mapping to references
with reference update steps, but requires paired-end reads (Miller et al. 2011). Our hybrid
genome assembly approach described above could assemble full-length ribosomal
sequences, but only in populations after long-mate pair reads were utilized.
A computational pipeline was designed for assembly of full-length Vibrio 16S
sequences from short reads with the respective limitations of de novo and reference
assembly in mind. Briefly, contigs from de novo assembly of whole genome shotgun short
reads were screened for the presence of 16S fragments. Subsequences matching 16S
significantly were identified by sensitive alignment to 16S reference sequences of
Vibrionaceae complete genomes, filtered for sufficient coverage, and used to construct
full-length sequences while retaining all observed bases at polymorphic sites. Unlike
polymorphisms in reads, which may be due to sequencing error, contigs over a minimum
coverage depth are predominantly error-free and polymorphic sites thus represent real
intragenome diversity (Supplemental Figure 1). The resulting 16S sequence from each
genome is an amalgam of all 16S alleles in the genome, and the number and location of the
variable positions was consistent with Sanger-sequenced finished genomes. There are as
many as 24 polymorphic sites within the 16S genes of a single isolate, so intragenomic
alleles could differ by as much as ~1.5%. This is not without precedent: for example, family
member Photobacterium profundum SS9 has intragenomic 16S divergence of >1.5% (P. S.
Dehal et al. 2010). Evolutionary distances based on these 16S sequences were used to build
phylogenies and confirm taxonomy by clustering our strains with named type strains
(Supplemental Figure 2, Supplemental Notes), and as a measure of phylogenetic distance.
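One way to picture the "amalgam" 16S sequence is as a column-wise merge of aligned contig fragments in which every observed base is kept. The sketch below assumes the fragments are already placed on common reference coordinates (with '-' where a fragment has no data) and assumes IUPAC ambiguity codes as the representation of polymorphic sites; both are assumptions of this illustration, and `amalgam_consensus` is a hypothetical name rather than part of the pipeline described here.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def amalgam_consensus(aligned_fragments):
    """Merge aligned fragments, keeping all bases seen at each position."""
    length = len(aligned_fragments[0])
    if any(len(frag) != length for frag in aligned_fragments):
        raise ValueError("fragments must share one alignment length")
    consensus = []
    for column in zip(*aligned_fragments):
        bases = frozenset(b for b in column if b in "ACGT")
        # Positions covered by no fragment are reported as N.
        consensus.append(IUPAC[bases] if bases else "N")
    return "".join(consensus)


def polymorphic_sites(consensus):
    """Count positions where more than one base was observed."""
    return sum(1 for base in consensus if base not in "ACGT")


fragments = ["ACGT--", "ACAT-G", "--ATCG"]  # toy aligned contig fragments
merged = amalgam_consensus(fragments)
print(merged, polymorphic_sites(merged))  # ACRTCG 1
```

Note that this count also includes fully uncovered positions (reported as N), so it is only meaningful once the whole gene is covered.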
2.3.3 Using whole genome phylogeny to support ecological population predictions
Whole genome sequences revealed that most ecological populations previously predicted
using hsp60 were robustly supported by the majority of core genes, while within
populations, phylogenetic relationships were inconsistent and/or poorly resolved.
Ecological populations were originally delineated using the housekeeping gene hsp60 due
to its ubiquity and better phylogenetic resolution than 16S. However, subsequent analysis
found that though these groups were largely cohesive in a number of housekeeping
gene-based phylogenies, some populations were incongruent or could not be confirmed (Preheim
et al. 2011). To determine population cohesion and phylogeny more precisely, species
phylogenies were constructed using the majority of ubiquitous genes or, alternatively, a
subset of 30 ribosomal proteins (Methods). Core gene phylogenies robustly supported most
previously predicted ecological populations but undermined or rejected a few (see
Supplemental notes). In particular, strains of V. splendidus with no ecological population
prediction or ones with poor or conflicting support were combined for some of this analysis,
as has been suggested previously (Preheim 2010), or omitted. Intrapopulation nodes
typically had low bootstrap values and inconsistencies between phylogenies. This likely
reflects rampant intrapopulation recombination in the core genome, as found previously in
this group (Shapiro et al. 2012), and descent may be impossible to resolve.
2.4 Results and Discussion
2.4.1 Metabolic profiles' relationship with phylogeny and ecology
The patterns and drivers of metabolic diversity and divergence are largely unknown, so we
sought to identify them in this metabolically diverse family. Substrate metabolism could be
driven by the current ecological preferences of the lineage, or by lineage history. Likewise,
substrate metabolism could vary widely in a population (due to flexible DNA), or these
capabilities could largely be encoded in a slowly changing core genome. The role of ecology
in shaping metabolism, particularly microscale ecology, is also not well understood.
Multiple lineages living in the same habitat may have convergently evolved the potential to
exploit the same types of resources found in a habitat, or they may partition those resources
so as not to be in direct competition. The projected habitat inferred for each population
provides a representation of that population's microscale ecology, and we asked whether
particle size preferences (and season) entailed similar carbon substrate utilization profiles.
To get a metabolic profile of each strain, we quantified its ability to respire 95 different
carbon substrates on a Biolog GN2 plate as the sole carbon source. The metabolic profiles of
46 fully sequenced strains from 16 populations spanning the diversity of our family
exhibited substantial within-population similarity but within-species variance, no ecological
convergence, and evolution dominated by phylogenetic divergence.
Metabolic profiles in these 46 vibrio strains were remarkably diverse and malleable
(Supplemental Figure 6). There was only a single substrate that could be ubiquitously
respired (alpha-D-glucose, though several others were common) and many distant relatives
utilized zero common substrates. Moreover, the ability to utilize large numbers of
substrates has been gained and lost multiple times. There was a great degree of diversity in
metabolic profiles within named species. Strains within an OTU (<3% 16S distance)
exhibited obvious variation, more so than biological replicates (Welch's t-test on
strain-pair Pearson correlations; p<0.012). However, ecologically defined populations had largely
cohesive metabolic profiles (indistinguishable from the similarity of biological replicates:
p>0.25).
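The comparison above rests on two ingredients: a Pearson correlation between the substrate-utilization profiles of each pair of strains, and Welch's t-test between two sets of such correlations. A minimal sketch using only the standard library is given below; the function names and the three-substrate profiles are hypothetical, and the sketch stops at the t statistic and Welch-Satterthwaite degrees of freedom (converting these to a p-value requires a t-distribution, for example from a statistics package).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(profiles):
    """Correlations for every pair of strains in a {name: profile} dict."""
    return [pearson(profiles[a], profiles[b])
            for a, b in combinations(sorted(profiles), 2)]


def welch_t(sample1, sample2):
    """Welch's t statistic and degrees of freedom for unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1) / n1, variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical respiration signals for three strains on three substrates.
toy_profiles = {"s1": [0.9, 0.1, 0.5], "s2": [0.8, 0.2, 0.6], "s3": [0.1, 0.9, 0.4]}
print(pairwise_correlations(toy_profiles))
```

In the analysis described above, one sample would hold correlations between strains of the same OTU or population and the other would hold correlations between biological replicates.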
If metabolic profiles are more similar between co-habiting lineages, this could
indicate convergent strategies for resource exploitation in that habitat. Same-habitat strains
(either by particle size preference or seasonal co-occurrence), however, do not have more
similar metabolic profiles than other strains with similar evolutionary divergence (Figure
4). In contrast, a recent study within environmental V. cholerae strains found a number of
metabolic capabilities correlated with environmental parameters (Keymer et al. 2007).
There are a number of possible explanations for these results. One possibility is that within
any one inferred microhabitat in our system, competition drives resource partitioning and
thus divergent metabolic potential, whereas the V. cholerae clades examined in that
previous study were primarily not co-existing geographically even at the macroscale. A
second and likely explanation is that our inferred habitats, based on limited environmental
measurements, may correspond poorly with true ecology. For example, it is known that
particle size bins group together many disparate substrate types, including particles
derived from plants versus animal detritus (Preheim 2010), and thus particle size might be
an extremely limited measure of ecological similarity, even though it (together with season)
successfully delineated very closely related populations (Hunt et al. 2008). To drill deeper
for ecological drivers of metabolism, more numerous or more precise environmental
preferences (for example, particle type, host or body site preference (Preheim 2010)) might
be tested for a signal of metabolic convergence. A third explanation is that phylogeny, not
ecology, largely drives metabolic profiles. Consistent with this idea, phylogenetic distance
explained a large part of the variation between populations (Supplemental Figure 7).
Metabolic divergence between populations (mean Pearson correlation in metabolic
profiles) increased roughly linearly with phylogenetic distance, a pattern that is consistent
with overall neutral evolution of a character along a phylogeny, and the idea that metabolic
profiles are largely a function of evolutionary history. Phylogenetic distance was not
controlled for in the previous V. cholerae study.
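Controlling for phylogenetic distance, as discussed above, amounts to comparing same-habitat and different-habitat pairs only within the same distance bin. The sketch below illustrates that bookkeeping; the bin edges (in percent 16S distance), the tuple layout, and the name `compare_within_bins` are assumptions made for this example, not values taken from the analysis.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def compare_within_bins(pairs, edges=(1.0, 2.0, 3.0, 4.0)):
    """Mean similarity of same- vs different-habitat pairs per distance bin.

    pairs: iterable of (distance, same_habitat, similarity) tuples, where
           distance is percent 16S divergence between the two strains.
    edges: upper bin boundaries; a distance d falls in bin bisect_right(edges, d).
    Returns {bin_index: (mean_same, mean_different)} for bins containing
    both kinds of pair; bins lacking either kind are skipped.
    """
    bins = defaultdict(lambda: {True: [], False: []})
    for distance, same_habitat, similarity in pairs:
        bins[bisect_right(edges, distance)][bool(same_habitat)].append(similarity)
    summary = {}
    for index in sorted(bins):
        same, different = bins[index][True], bins[index][False]
        if same and different:
            summary[index] = (mean(same), mean(different))
    return summary


# Hypothetical strain pairs: (16S distance in %, same habitat?, similarity).
toy_pairs = [
    (0.5, True, 0.9), (0.5, False, 0.8), (0.7, True, 0.7),
    (2.5, True, 0.4), (2.5, False, 0.6), (3.5, True, 0.3),
]
print(compare_within_bins(toy_pairs))
```

A per-bin test (such as Welch's t-test on the two lists in each bin) can then be applied without distance confounding the habitat comparison.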
It is generally accepted that a bacterial species unit should be defined by a common
phenotype in addition to other features like core genome (Keymer et al. 2007), but whether
substrate utilization assays like the Biolog plates are appropriate for delineating bacterial
groups is not certain. A previous study reported that for three vibrio populations, substrate
utilization profiles were more similar (average Pearson correlation) within populations
than between populations (Preheim 2010). However, this observation can easily be
explained by the large phylogenetic distance between the populations, rather than the
population delineation per se. Instead, when we exclude the effect of phylogenetic distance,
metabolic differentiation between populations is less clear. We focus on V. cyclitrophicus,
which consists of two diverging populations that are generally not phylogenetically
separated (for example, in core gene phylogenies) but are ecologically differentiated into
large-particle and small-particle specialists (LP- and SP-specialists). This ecological split is
robustly supported by SNPs in a handful of genome regions (Shapiro et al. 2012), but within- and
between-population genetic divergences are similar in these early stages of differentiation
(Shapiro et al. 2012). Here, ecological populations' metabolic profiles were not cohesive:
while the LP strains' metabolic profiles were more similar within than between populations
(p<0.01), those of the SP strains were not (p>0.75), and hierarchical clustering of V. cyclitrophicus
metabolic profiles was not congruent with the ecological split. This may be because the
early stages of ecological differentiation do not engender clear metabolic differentiation, or
because such metabolic differentiation was not measured by this particular assay. While the
robustness or generality of this result remains to be seen, it suggests that even
high-throughput measures of metabolic phenotype, such as the commonly employed Biolog
plates, may not be sensitive for delineating bacterial species units.
2.4.2 Gene content relationships with phylogeny and ecology
Vibrios of this model system are known to partition resources at the microscale (Hunt et al.
2008); however, the extent to which adaptation to these microscale habitats involves the
use of the same genes from the flexible gene pool was unknown. As with metabolism, gene
content may be driven by the current ecological preferences of the lineage, or lineage
history; it is even possible that neutral variance might dominate gene content. Enrichment
of shared flexible genes and shared total genes in same habitat populations revealed that
this measure of ecological similarity was not predictive of preferential acquisition or
retention of genes. Instead, the majority of total gene content divergence could be explained
by phylogenetic distance alone.
Gene content is largely driven by lineage history, as the overall fraction of shared
gene families (mean Jaccard similarity) between populations decreases roughly linearly
with phylogenetic distance (Supplemental Figure 3). Same-habitat populations were not
more similar than other populations once the effect of distance was accounted for (Figure 3). In
fact, different-habitat populations apparently shared more gene content for any given 16S
divergence, but this difference was not significant at any one distance (Welch's t-test on
population-pairs' averages, p>0.07 for every bin). We also considered gene content
similarity between just the large particle specialist populations (LP-specialists), as this
projected habitat had a particularly distinct signal (Hunt et al. 2008), and it includes larger,
denser, less ephemeral bacterial communities (Simon et al. 2002; Hunt et al. 2008).
However, gene content actually appeared less similar between LP-specialist populations
than other populations at each phylogenetic distance (Supplemental Figure 4), though this
was only statistically significant at one distance (3% < 16S distance < 4%, p=0.025).
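The gene content similarity used above, mean Jaccard similarity between populations, can be sketched as below. Gene families are represented as sets of hypothetical family identifiers, and the averaging over all cross-population genome pairs is one plausible reading of "mean Jaccard similarity", stated here as an assumption.

```python
from itertools import product
from statistics import mean


def jaccard(families_a, families_b):
    """Shared gene families divided by all families in either genome."""
    a, b = set(families_a), set(families_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty genomes
    return len(a & b) / len(union)


def mean_population_jaccard(population_a, population_b):
    """Average Jaccard similarity over all cross-population genome pairs."""
    return mean(jaccard(g1, g2) for g1, g2 in product(population_a, population_b))


# Hypothetical gene family sets for two small populations.
pop_a = [{"f1", "f2", "f3"}, {"f1", "f2"}]
pop_b = [{"f2", "f3", "f4"}]
print(mean_population_jaccard(pop_a, pop_b))  # 0.375
```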
Measuring similarity between total genome content could obscure a signal of
convergence among more recently acquired flexible genes relevant for the lineage's current
habitat, particularly because gene content is largely attributable to phylogenetic history,
and might therefore be shaped more by the long ecological history of the lineage than by the
current ecology. However, rare genes were no more likely to be shared by separate lineages
with the same habitat, indicating that even recent overlap of flexible gene pools is not
driven by these habitats. While the age of gene acquisition for rare genes was hard to assess,
rare genes are more likely to be novel than genes fixed in many lineages, so shared rare
genes between populations may indicate that they have recently been accessing the same
flexible gene pool. Same-habitat genome pairs' tendency to share flexible genes was
calculated by considering rare genes (present in 2-5 genomes), less rare genes (2-8
genomes) and relatively common but non-ubiquitous genes (2-20 or 2-40 genomes).
However, regardless of gene rarity, distance bin, or habitat metric, same-habitat genomes
did not consistently share more flexible genes (p>0.25), with one particular and consistent
exception (Supplemental Figure 5). Within-cluster (first distance bin) comparisons always
share more genes between same-habitat strains (p<0.01). This dichotomy of within
population versus between population gene sharing, though not enumerated as part of the
original Core Genome Hypothesis (Ruiting Lan and Reeves 2000), is certainly consistent
with its attempts to incorporate HGT into a cohesive species model for bacteria. While
laterally acquired DNA is typically considered a diversifying force, this within-population
gene sharing constitutes an additional homogenizing force within populations and a
diversifying force between them.
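The rarity classes used above (for example, genes present in 2-5 genomes) can be expressed as a simple prevalence filter followed by a count of shared families per genome pair. This is a sketch under assumed inputs: the dictionary of genome name to gene family set, the family identifiers, and both function names are hypothetical.

```python
from collections import Counter


def genes_by_prevalence(genomes, min_genomes, max_genomes):
    """Gene families found in between min_genomes and max_genomes genomes."""
    counts = Counter(fam for fams in genomes.values() for fam in set(fams))
    return {fam for fam, n in counts.items() if min_genomes <= n <= max_genomes}


def shared_rare_genes(genomes, a, b, min_genomes=2, max_genomes=5):
    """Number of families in the chosen rarity class shared by genomes a and b."""
    rare = genes_by_prevalence(genomes, min_genomes, max_genomes)
    return len(set(genomes[a]) & set(genomes[b]) & rare)


# Toy dataset: "core" is in all six genomes, "z" in one, "x" and "y" in two.
toy_genomes = {
    "g1": {"core", "x", "y"}, "g2": {"core", "x"}, "g3": {"core", "y", "z"},
    "g4": {"core"}, "g5": {"core"}, "g6": {"core"},
}
print(shared_rare_genes(toy_genomes, "g1", "g2"))  # 1
```

Widening `max_genomes` (to 8, 20 or 40) reproduces the progressively less rare classes described in the text.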
Our measures of ecological similarity within this family, particle size preference and
season, were not predictive of preferential acquisition or retention of genes by a variety of
methods. This may be because resource partitioning and/or phylogenetic history could
drive very different adaptive strategies in the same habitat; however, the literature includes
quite a bit of lore suggesting that similar ecology predicts similar gene content. It is often
suggested that certain lifestyles entail enrichment of certain genes or gene functions (Lauro
et al. 2009), though these lifestyle categories may represent very different trophic
strategies and very divergent bacteria. Studies within closely related bacteria have also
suggested that ecology drives gene content similarity (Luo et al. 2011; Sim et al. 2008);
however, it was not clear that genetic divergence was not driving both. Indeed, we found
that phylogenetic distance predicts a large fraction of gene content similarity (Supplemental
Figure 3). Lateral gene sharing is found to be largely driven by ecology after accounting for
distance, even when using broad labels for ecology (Tuller et al. 2011; Smillie et al. 2011).
However, it is possible that similarity of ecology amongst all our vibrios (all heterotrophic
marine bacteria from the same coastal waters) dwarfed any additional similarity
engendered by our projected habitats. This may be a likely explanation as our projected
habitats were based on just two axes of the niche space, and even similar particle size
encompasses a variety of substrate types, as discussed above. We were not able to detect
convergent adaptations to similar habitats, but whether this is due to competitive exclusion
or another ecological explanation remains open.
2.4.3 High rates of HGT among strains co-existing in the water column
Because our metrics of ecology did not drive similar metabolism or enriched gene sharing
between our inferred habitats, we took an unbiased approach to discover the
structure of DNA exchange within our family. We found that these vibrio genomes have a
very high rate of horizontal transfer, and describe a structured network of exchange
concentrated in a few partners that share a particle-size based ecology.
To analyze gene exchange amongst 80 vibrio strains, we focused on recent HGT, as
older transfers may have been driven by historical ecology distinct from the current niche.
The rate of recent HGT between two bacterial lineages was quantified as the probability
that a pair of genomes drawn from those two lineages (>=3% 16S distance) recently shared
at least one laterally acquired piece of DNA (a stretch of DNA >500bp long and >99%
identical, and therefore extremely unlikely to be vertically inherited). This empirical
method for identifying recent HGT was duplicated exactly from a recent publication
(Smillie et al. 2011) for purposes of direct comparison with their estimates of transfer
frequency in 2235 complete genomes. In addition to quantifying recent gene exchange
between 2% 16S clusters for comparison, we also calculated HGT rates between our
ecologically-defined groups.
Recent HGT frequency was remarkably high (Figure 5), roughly 4x higher between
our 2% 16S clusters of vibrios than between (non human-microbiome) genomes from the
same environment, and rivaling that of genomes isolated from humans but not observed at
the same body site. This is despite the fact that these vibrios are known to have different
ecologies (Hunt et al. 2008; Preheim et al. 2011). However, previous studies quantifying
high rates of horizontal gene exchange between microbes with the "same ecology" rely on
mostly broad labels like "marine" or "hot spring" to describe ecology (Tuller et al. 2011;
Smillie et al. 2011). Even our different-habitat strains meet this permissive definition of
same ecology and are further known to co-exist in the water column geographically and, to
some degree, spatially (e.g., co-occurrence on specific plants or animals (Preheim 2010)),
features that are known to have a positive effect on gene exchange frequency (Smillie et al.
2011; Boucher et al. 2011). Breadth or specificity of ecological label might similarly explain
the exceptionally high HGT frequency observed between microbiota of the same human
body site: this definition of ecology is likely the most specific, and broader definitions
encompass dissimilar ecologies that dilute the signal. Thus it is likely that recent studies
highlighting ecology as a predominant force in gene transfer still underestimate its role.
While our counts are too low to rigorously assess the effect of microscale habitat on
horizontal transfer, the network of populations organized by HGT frequency has some
structure, perhaps due to ecology (Figure 6). Some populations are never observed to
engage in transfer within our populations, while others partner quite frequently. A nexus of
transfer between phylogenetically distant relatives connects the three known LP-specialist
populations: V. breoganii, V. crassostreae, V. sp. F10, and V. splendidus population 19 (though
some of these are now known to preferentially attach to very different kinds of particles or
metazoans (Preheim 2010)). Another subnetwork of HGT partners connects the only
human pathogens in our dataset with some of the fish pathogens: V. cholerae (node "VC"), V.
alginolyticus (5) (Nishiguchi and Nair 2003), and V. ordalii (3) and V. splendidus (18). This is
despite being assigned to different microscale habitats, and in the case of V. cholerae,
geography (V. cholerae strains included in this analysis were isolated from ponds and from
clinical patients). Though the effect of shared microscale ecology on HGT is statistically
uncertain here, the high degree to which shared ecology drives transfer (Smillie et al. 2011)
suggests that many newly acquired genes are indeed adaptive in the environment.
2.4.4 Core and flexible gene turnover.
These vibrios not only have a high rate of recent transfer with each other, but an
astonishingly high rate of introduction of new DNA overall. Our superfine phylogenetic
sampling included many pairs of genomes distinguished by only 100-300 SNPs genome-wide, a number comparable to what accumulates in a few weeks of bottlenecks in the
laboratory (Clarke and Timberlake, unpublished results). Surprisingly, during those short
divergence times in the wild, 100-300kb of new DNA can accumulate in each strain.
This rapid and persistent turnover of flexible DNA introduces abundant raw
material for selection. Rather than a false signal of diversity composed simply of assembly
artifacts or other junk DNA, >30% of the DNA gained and lost between these sister strains
was organized into open reading frames with identifiable homology. While genes of phage
origin and other possibly selfish elements typically accounted for 10% of these genes, the
majority of annotated ORFs could have adaptive value: these were primarily transporters
and other extracellular genes, as well as carbohydrate-active or DNA-active enzymes. Thus a
very rapid influx of functional genes provides abundant raw material for selection even
faster than the short timescales where stable resource partitioning has been observed to
occur.
While the majority of this newly introduced DNA never reaches fixation, what fixes
is still sufficient to replace a large fraction of core genes within traditional OTU (3% 16S)
divergence. Of 4711 flexible gene families observed in V. cyclitrophicus (likely a huge
underestimate of the genes that have been gained and lost already in this clade), 2100 were
observed in only a single strain, despite our very fine phylogenetic sampling. To get an
overview of the rate at which this functional, protein-coding flexible DNA might accumulate
in the genome and fix in populations, we constructed core and pan genome curves. Focusing
on our most finely sampled clades, V. cyclitrophicus and V. splendidus (Figure 7), we defined
core genomes for each ecological group and the phylogenetic groups that contain them.
Despite rapid loss of most new flexible DNA, the rate of replacement of genes in a
population's core was still remarkably high. On the timescale of within-OTU divergence,
turnover of more than 25% of the genome core to an ecological population has occurred. Of
3839 genes core to this large-particle LP-specialist population, only 2806 remain core in all
of the 3% 16S OTU cluster (subsampled to equalize genome count). The practical
implications are clear: the choice of phylogenetic grouping has a large qualitative impact on
genes' assignment to core or flexible categories. The biological implications of this very fast
turnover are still unclear (Thompson 2005; Polz et al. 2006): is this evolution neutral or
adaptive?
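As a quick check of the turnover figure, the fraction can be recomputed from the two counts quoted above (3839 genes core to the LP-specialist population, of which 2806 remain core across the 3% 16S OTU cluster). No other numbers are assumed:

```latex
\[
\frac{3839 - 2806}{3839} = \frac{1033}{3839} \approx 0.269
\]
```

That is, roughly 27% of the population core is no longer core at the OTU level, consistent with the "more than 25%" stated above.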
2.4.5 Genes fixing within a population
Independent association of a genotypic trait with ecology can be good evidence of positive
selection, provided shared ancestry cannot explain both. While we found flexible gene
content is not more similar between independent lineages with the same inferred habitat, it
is significantly more similar between same-habitat strains within a diverging clade, possibly
indicating selective fixation. V. cyclitrophicus consists of two diverging populations, which
are generally not phylogenetically separated (for example, in core gene phylogenies) but are
ecologically differentiated into large and small particle specialists (SP-specialists), a split
robustly supported by SNPs in a handful of genome regions (Shapiro et al. 2012). Within the
SP and LP-specialist habitats, genotypes shared more rare genes (t-test on all species pairs,
p=0; Supplemental Figure 5), and had more similar gene content overall: hierarchical
clustering of V. cyclitrophicus strains based on presence/absence of all protein families
delineates two groups falling along ecological lines (Figure 8). This abundance of habitat-specific flexible genes evident so early in the differentiation process suggests a preeminent
role of new, laterally acquired flexible DNA in the adaptation and ecological differentiation
of populations (Supplemental Note).
Flexible genes, i.e., genes of intermediate penetrance in a bacterial group, are not
well established to be neutral or adaptive (Haegeman and Weitz 2012). Penetrance of
flexible genes in V. cyclitrophicus exhibits a frequency distribution curve characteristic of
many bacterial populations but poorly understood: the vast majority of genes are very rare
or completely fixed, and few have intermediate penetrance (Supplemental Figure 8). The
adaptive value of these genes and their reason for intermediate fixation was not clear.
Interpretations for low penetrance genes in the literature directly conflict: strain-specific
DNA is often held responsible for pathogenic or other adaptive phenotype, while other
studies assume that low penetrance implies low selective benefit. Systematic evidence
confirming either of these opposing interpretations is lacking, but a recent paper on flexible
gene distribution suggested a neutral model that might be appropriate for flexible gene
fixation (Haegeman and Weitz 2012).
To test the extent to which flexible genes are undergoing neutral gene fixation, we
looked at the habitat specificity of flexible gene families within V. cyclitrophicus. We focused
on 850 flexible gene families present in more than 3 V. cyclitrophicus genomes. We chose
this subset of families for two reasons: 1) they have sufficient counts for a meaningful
gene-by-gene habitat-enrichment test and 2) gene families present in fewer strains could be
argued to have been gained ancestrally and present in same-habitat strains due to clonal
expansion. Of these 850 gene families, 175 were distributed significantly unevenly between
the two ecologies (2-tailed hypergeometric test with 5% FDR). Thus, more than 30% of the
flexible genes apparently rising to fixation (or being deleted) are doing so in a clearly
habitat-specific and likely non-neutral manner. The selective forces governing gain versus
loss of these genes likely differ (Supplemental Notes).
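The gene-by-gene habitat-enrichment test described above can be sketched as follows. This is a minimal Python illustration, not the code used in this study. The function names, the definition of the two-tailed p-value (summing all outcomes no more probable than the observed one), and the use of the Benjamini-Hochberg procedure for FDR control are assumptions, since the text specifies only a 2-tailed hypergeometric test at 5% FDR.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): k of the n gene-carrying genomes fall in habitat A,
    given N genomes in total of which K belong to habitat A."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def two_tailed_hypergeom(k, N, K, n):
    """Two-tailed p-value: total probability of every outcome that is
    no more likely than the observed one (an assumed convention)."""
    lo, hi = max(0, n - (N - K)), min(n, K)
    p_obs = hypergeom_pmf(k, N, K, n)
    total = 0.0
    for i in range(lo, hi + 1):
        p_i = hypergeom_pmf(i, N, K, n)
        if p_i <= p_obs * (1 + 1e-9):  # small tolerance for float ties
            total += p_i
    return min(1.0, total)


def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per test: True if significant at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank  # keep the largest rank passing its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant
```

With hypothetical counts, a family found in 8 genomes, all 8 of them among 10 SP-specialists out of 20 genomes, would be tested as `two_tailed_hypergeom(8, 20, 10, 8)`; the resulting p-values for all families would then be passed together to `benjamini_hochberg`.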
A possible parsimonious explanation for this pattern is gene presence/absence in
one ancestral SP- or LP-specialist, followed by clonal expansions and vertical inheritance of
flexible genes by their same-ecology descendants. While any signal for clonal ancestry
within V. cyclitrophicus has been obscured by rampant recombination (no single phylogeny
accounts for more than 0.7% of core genome length), the vast majority of the core genome's
length does not support monophyly for SP and LP strains, with trees of 69% of
recombination-free blocks rejecting the ecological split (Shapiro et al. 2012 SOM; Shapiro
2010). Moreover, the tree of 30 ribosomal proteins (presumably neutral markers for this
ecological adaptation) has good support for phylogenetically well-mixed SP and LP
specialists (Figure 8; most nodes' bootstraps>80%). Therefore parallel fixation of the same
flexible genes in genomes of the same ecology is a likely explanation and is good evidence
that their fixation occurs by selection, not drift.
In total, V. cyclitrophicus genomes harbor 4710 flexible gene families, so the majority
of flexible genes that enter a population are deleted. The timescale on which this occurs
seems distinct from those gene losses driven by the deletion bias and selective deletion
commonly discussed in the literature (Kuo and Ochman 2009) (Supplemental Notes), and
thus neither the mechanism nor adaptive value of these rapid deletions is known.
2.5 Conclusions
This study presented the assembly of 82 genomes and a first survey of flexible gene
turnover, gene sharing, metabolism, and ecology among populations of this Vibrionaceae
model system.
There exists a persistent gap between abundant short read sequences and tool
development for their use. I developed two assembly strategies to fill this gap, including a
hybrid assembly approach that significantly increases genome assembly contiguity by
efficiently leveraging costly long-insert library information. If long-insert library
construction continues to be an expensive and limiting step, extending this approach for use
with more distant reference genomes would be extremely useful, and may be possible with
the next generation of fast but more permissive read-mappers (Niu et al. 2011). Hybrid
assembly has a modular implementation and is broadly applicable: it was effective both for
the latest generation of Illumina paired-end reads, and for the older single-end short reads,
which may once again be relevant with the newest generation of sequencing platforms
(Glenn 2011). It was effective using a fairly wide range of reference genomes, in terms of
both relatedness and contiguity.
A survey of genotype, phenotype and ecology in 82 environmental bacteria revealed
several surprising findings. The adaptive fixation and preeminent role of flexible gene
content in population differentiation is clear, occurring with surprising rapidity, within the
very fine phylogenetic distances where ecological differentiation occurs. While the inferred
habitats in our study were not predictive of genotypic or metabolic convergence, they
delineated ecological populations largely congruent by a number of measures, including
gene content, core genome phylogeny, and metabolism.
I showed that new and potentially beneficial genes are rapidly introduced to each
genome, providing abundant raw material for selection within the very rapid timescales
where we have previously reported ecological differentiation. While the majority of this
very fast flexible DNA never reaches fixation, a large amount does fix, apparently non-neutrally, on timescales compatible with ecological differentiation, and delineates
populations in a way congruent with their ecology, before core genome divergence or
metabolic phenotype. By OTU-like divergence, it has resulted in large-scale (~25%)
turnover in the core of any one population. This highlights the large qualitative discrepancy
between core and flexible genomes defined by arbitrary phylogenetic cutoffs and other
species-group definitions. In particular, a large fraction of genes apparently at intermediate
penetrance in a species were actually fixed in a population, very likely by selection.
These new vibrio genomes, despite association with different microscale habitats,
shared genes with a frequency much higher than a recent survey of HGT reported for same-ecology bacteria or even for bacteria isolated from various sites on the human body, a
known nexus of transfer. To what extent this is due to shared geography and isolation
method, or perhaps to the vibrios' natural tendency for DNA uptake, is an open question,
but the relative breadth of the labels used to describe "ecology" in these studies certainly
has an effect. HGT is enriched among strains with similar ecology, pointing to a likely role
for these genes in environmental adaptation.
However, our measures of ecology did not have predictive power for sharing of gene
content over the long term, or of metabolic similarity. Both genotype and phenotype were
found to be largely driven by phylogenetic history. In fact, metabolic phenotype could only
statistically distinguish ecological populations when they were sufficiently separated by
phylogenetic distance. However, we cannot dismiss the possibility that ecology could drive
convergent adaptations, as our projected habitats (inferred ecological preferences based on
particle size and season) may not sufficiently represent the populations' true ecology.
Finally, there is an apparent dichotomy in the role of flexible genes in adapting to a
habitat. Within a differentiating clade (V. cyclitrophicus), strains with the same habitat
acquire large numbers of the same flexible genes, apparently in parallel. However, between
populations, flexible genes are no more likely to be shared by lineages adapting in parallel
to the same projected habitat. This is true even amongst the closely related populations,
within phylogenetic groups that are typically treated as conspecifics (e.g., V. splendidus
clade). If ecological populations rather than arbitrarily defined OTUs are considered the
fundamental evolutionary unit for selection, then this dichotomy becomes consistent with
our concepts of species and evolution in macro-organisms: individuals from the same
species can share a niche, resources, genetic information, but competitive exclusion
precludes different species found in the same habitat from doing so.
The quantitative ecological data collected with this model system, and the fine
phylogenetic resolution of genomic sequences and ecological partitioning, make it an ideal
system to study evolutionary dynamics and selection on a hierarchy of timescales. Here we
have demonstrated that even the shortest timescales are relevant for gene content to be
selected and play a role in ecological differentiation, but that phenotype and genotype
dynamics on longer timescales are not so easily explained by our measures of ecology.
Future work on this system can likely illuminate the longer term ecological benefits and
core genome integration, as well as the variety of selective forces governing the fixation
and loss, of these dynamic and important flexible genes.
2.6 Methods
2.6.1 Whole genome shotgun libraries and sequencing
For these 82 genomes, a wide variety of library types, read lengths, insert sizes and
sequencing machines was used. Each genome was sequenced using a standard whole
genome shotgun short-read library preparation, with some genomes additionally
sequenced with mate-pair libraries (detailed below) or Sanger sequencing that had
previously been assembled (12B01, 12G01). Genomic DNA libraries were prepared by
random shearing per Illumina's protocol (and ligation of custom paired-end barcodes for
multiplexing, where applicable). For most genomes, short (50 or 76bp) single-end reads
were sequenced at the Broad Institute Illumina GI. For V. cyclitrophicus (population 15)
genomes, paired-end (63 or 102bp) reads were sequenced on the Illumina GAII at MIT's
BioMicroCenter.
Additionally, to aid in genome closure, long-insert mate-pair libraries were
constructed for 11 genomes (1F-53, ZF-14, ZF-29, 9ZC13, 12B09, 5S-149, 5S-101, FF-33, 1F-211, ZS-17, FF-500) by two methodologies. For strains 1F-53 and ZF-270, long-insert mate-pair libraries were constructed with an ECOP15 "jumping construct" protocol (Collins et al.
1987; Young et al. 2010). Briefly, 4-6kb sheared gDNA fragments were ligated to oligomers
containing a restriction enzyme recognition site that directs cleavage ~25bp downstream,
and sequenced in 35bp paired reads on an Illumina GAII at MIT's BioMicroCenter. ECOP15
restriction sites were identified by pattern matching and reads trimmed of this non-genomic DNA. The remaining 10 long-insert mate-pair libraries were constructed using the
Illumina mate-pair kit. Briefly, 4-6kb sheared gDNA fragments were blunt-end biotinylated,
circularized and (randomly) re-sheared, and 80bp paired reads sequenced on an Illumina
GAII at MIT's BioMicroCenter. All reads were quality trimmed with the trimBWAstyle.pl
script to eliminate regions of low-quality bases (average less than phred score 20) (Li and
Durbin 2009). Strains 12B01 and 12G01 were previously sequenced with Sanger reads
(GenBank WGS scaffolds NZCH724170-84.1, NZCH902589-98.1, respectively); short read
mapping was used to amend ~20 errors or mutations in each of those Sanger-based drafts.
2.6.2 Hybrid assembly pipeline
The hybrid assembly approach developed here proceeds in two main steps: short reads
from each strain are scaffolded on each reference strain genome to generate "reference
assembly contigs", which are then used in combination with short reads in a final de novo
assembly. For each strain "S" undergoing hybrid assembly, we chose one or more of the
most closely related reference strains "R" to scaffold reads. For 69 strains undergoing
hybrid assembly, 13 strains with near-closed genomes (from Sanger or long-insert-mate
pair sequencing efforts) were available for use as references.
Reference assembly contigs were generated as follows. Untrimmed short reads from
strain S were mapped onto each reference R separately using maq-0.7.1 (parameters: maq
map -N -n 3 -e 800) allowing a maximum of 3 mismatches in the seed (mapper sensitivity
maximum) but up to 20-25 total high quality mismatches, allowing codon 3rd positions to be
approximately saturated for a trimmed ~75bp read. Where paired reads were available,
only proper pairs were kept. (Proper pairs are those with both reads mapping within 2
standard deviations of the insert and in the correct FR orientation). The maq consensus
sequences generated from this reference assembly were split on any single base that was
uncovered or had ambiguous consensus to generate "reference assembly contigs." These
rather stringent criteria for use of synteny information from reference strains were chosen
to avoid assuming that synteny is highly conserved in the absence of corroborating
evidence from strain S short reads.
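The splitting rule for reference assembly contigs can be illustrated with a short Python sketch. It assumes the consensus is available as a plain uppercase string in which uncovered bases appear as N and ambiguous calls as IUPAC codes; the actual maq consensus output format may differ, and the function name and `min_len` parameter are hypothetical.

```python
import re


def split_consensus(consensus, min_len=1):
    """Split a consensus sequence at every position that is not an
    unambiguous A, C, G or T, returning the remaining contigs."""
    pieces = re.split(r"[^ACGT]+", consensus)
    # Drop empty strings produced by leading or trailing breaks,
    # and any piece shorter than the (hypothetical) length cutoff.
    return [p for p in pieces if len(p) >= min_len]
```

For example, `split_consensus("ACGTNNACRTTG")` yields three contigs, broken once at the uncovered NN run and once at the ambiguous R.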
Reference assembly contigs were used in combination with de novo assembly as
follows. To assemble strain 12F01, reads were assembled de novo using the program euler-sr (Chaisson, Brinza, and Pevzner 2009). De novo contigs and reference assembly contigs
were merged with the AMOS suite program minimus2 (Sommer et al. 2007), using a
minimum overlap of 25 and minimum percent id of 98. For the other strains, reads were
trimmed of low quality bases as above. All trimmed short reads as well as reference
assembly contigs were supplied to Velvet for de novo assembly (the latter were labeled as
belonging to the "Long" read category for Velvet, with -long_mult_cutoff=0). Parameter space
was searched to optimize N50 with the following constraints: cov_cutoff > 3;
0.5 * predicted_exp_cov < exp_cov < 2.5 * predicted_exp_cov, where predicted_exp_cov was
calculated directly from the reads or estimated automatically by Velvet from the data; and 25 < k
< 49, except for the "ditag" type long-insert assemblies, which require k<=21. Assemblies
built with longer kmers were preferred unless N50 difference was > 5%.
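One possible reading of the final selection rule is sketched below in Python: among candidate assemblies, take the longest k whose N50 is within 5% of the best N50 observed. This interpretation, the function name, and the dictionary keys are assumptions made for illustration; the text does not state how the comparison was implemented.

```python
def pick_assembly(candidates, tolerance=0.05):
    """Choose among candidate assemblies.

    candidates: list of dicts with (hypothetical) keys 'k' and 'n50'.
    Returns the candidate with the longest k among those whose N50 is
    within `tolerance` of the best N50.
    """
    best_n50 = max(c["n50"] for c in candidates)
    near_best = [c for c in candidates if c["n50"] >= best_n50 * (1 - tolerance)]
    return max(near_best, key=lambda c: c["k"])
```

Under this reading, a k=41 assembly with an N50 3% below that of a k=31 assembly would be preferred, whereas a k=47 assembly with an N50 20% lower would not.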
2.6.3 16S "contig-mapping" assembly pipeline
Custom assembly pipeline for 16S sequences consisted of three main steps: de novo
assembly of all short reads, scaffolding of contigs, and consensus building. All de novo
contigs for each genome with coverage higher than 10-fold were blasted against a database
of 16S sequences from finished genomes in the Vibrio family (blast parameters "-W=7 -e=0.001"). Aligned contigs were trimmed down to 16S subsequences, and Mothur (Schloss
et al. 2009) was used to build a consensus sequence for each strain, retaining all observed
bases at intra-genomic polymorphic positions (encoded with IUPAC ambiguity codes).
Unlike polymorphisms in reads, which may be due to sequencing error, contigs over a
minimum threshold coverage depth should be relatively free of errors, as our library
construction protocol was performed in 4-8 parallel subsamples, to avoid propagation of
errors in early rounds of PCR.
For 16S genes assembled this way, intragenome polymorphism frequency was
concentrated in variable regions also seen in finished genomes from the family, consistent
with the idea that they are true polymorphisms, not short read sequencing errors or
assembly artifacts. (Variable regions for the group were calculated using 129 16S rRNA
sequences found in 15 finished genomes of 10 Vibrionales species downloaded from the
MicrobesOnline database (Alm et al. 2005)). Where this contig-mapping approach resulted
in too many Ns and ambiguities, maq was used to map reads to a within-population 16S
sequence (assembled via the contig-mapping) and to call consensus.
As an additional note, in organisms where 16S intragenome polymorphism is
minimal or unimportant, simpler approaches to the assembly of 16S are likely effective.
Here, short reads, single-ends, and high intragenomic polymorphism teamed up in a tragic
trifecta that required this approach.
2.6.4 16S phylogeny
16S rRNA sequences of Vibrio family strains were obtained from IMG genomes using
that database's ORF coordinates. All IMG vibrios and our 82 Vibrio strains were aligned to
the Silva reference alignment with Mothur, and Phyml was used with default parameters to
estimate the phylogeny.
2.6.5 16S distance matrix, clustering, and pairwise distances
A distance matrix between species was constructed using the minimum Hamming distance
between two sequences where each polymorphic site received a distance count of zero if
the sets of nucleotides intersected. This is an underestimate of the true distance. To
calculate between-cluster distances, one representative per population with the fewest
ambiguous (polymorphic) sequence positions was selected.
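The minimum Hamming distance over ambiguity codes can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: sequences are assumed to be pre-aligned and of equal length, gap characters are treated as an ordinary symbol, and the function name is hypothetical.

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
# Treating "-" as its own symbol is an assumption made for this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "-",
}


def min_hamming(seq_a, seq_b):
    """Count aligned positions whose nucleotide sets do not intersect."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if not set(IUPAC[a]) & set(IUPAC[b]):
            mismatches += 1
    return mismatches
```

A position coded R (A or G) in one strain and G in another contributes zero, which is why the result is a lower bound on the true distance. Dividing the count by the alignment length gives the fractional distance.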
2.6.6 Identifying recent HGT between populations
Recent horizontally transferred sequences were discovered according to an empirical
method (Smillie et al. 2011). Briefly, segments of >300bp >99% identical found (by blastn)
in genomes diverged more than 3% at 16S rRNA loci were considered HGT (only segments
of >500bp were considered for the purposes of direct comparison of counts with a recent
study (Smillie et al. 2011), see "Counting recent HGT..." below). Multiple HGT sequences
between the same pair of genomes were counted as a single HGT event, and HGT event
frequencies were calculated between each population pair (rather than between each 2%
16S cluster as below) as the fraction of their genome pairs with at least 1 HGT sequence.
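The event-counting step can be sketched in Python as below. The sketch assumes blastn hits have already been parsed into tuples and that genome pairs have already been restricted to those diverged by more than 3% at 16S; the tuple layout, function names, and genome identifiers are hypothetical. The default thresholds use the >500bp, >99% criterion applied for direct comparison of counts.

```python
from itertools import product


def pairs_with_hgt(blast_hits, min_len=500, min_identity=99.0):
    """Collapse hits into the set of genome pairs sharing at least one
    qualifying segment, so multiple segments count as a single event.

    blast_hits: iterable of (query_genome, subject_genome,
                             percent_identity, alignment_length).
    """
    return {
        frozenset((q, s))
        for q, s, identity, length in blast_hits
        if q != s and length > min_len and identity > min_identity
    }


def hgt_frequency(pop_a, pop_b, hgt_pairs):
    """Fraction of genome pairs (one genome from each population)
    that share at least one HGT segment."""
    pairs = list(product(pop_a, pop_b))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if frozenset((a, b)) in hgt_pairs)
    return hits / len(pairs)
```

Storing each pair as a `frozenset` makes the lookup independent of which genome was the blast query.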
2.6.7 Counting recent HGT within 80 vibrio strains
Recent HGT frequencies were calculated according to an empirical method (Smillie et al.
2011) for the purposes of direct comparison with those counts. The following minimal
modifications to their pipeline were made to accommodate the assembled 16S of this
dataset. First, the assembled 16S with the fewest ambiguous positions was chosen to
represent each population. Representative sequences were then clustered with Mothur at
average neighbor <2% distance. Finally, distance between these clusters was calculated as
the minimum distance between any two representative sequences in the cluster. Each of
these modifications underestimates 16S diversity and divergence, producing a conservative
underestimate of the 16S distance on the order of 0.5%.
2.6.8 Calling genes and annotating genomes
ORFs were called by Glimmer3 (Delcher et al. 2007) with default parameters as
implemented by the RAST pipeline (Aziz et al. 2008). RAST was also used to assign
functional annotations.
2.6.9 Building orthologous groups
Orthologous groups were built by two methods. OrthoMCL was used with default
parameters, except for a 50% amino acid identity cutoff (Fischer et al. 2002). Orthologous
groups are sensitive to parameter choices and to sampling (see Supplemental Note);
however, OrthoMCL gene families resulted in Vibrio family core genome sizes comparable to
estimates for other groups of family members using the well validated approach of tree-
based orthology (P. S. Dehal et al. 2010; M. Price, Dehal, and Arkin 2007). OrthoMCL gene
families were used in all gene content analysis, except for the shared rare genes analysis
(Supplemental Figure 5) and the species phylogenies that include V. cholerae-like strains.
For those, gene families were defined by Blastclust (L=0.75,S=1.75) (Dondoshansky and
Wolf 2000).
2.6.10 Building concatenated gene species phylogenies
In addition to 16S phylogenies, two species phylogenies were constructed using
concatenated alignments of many protein-coding genes to evaluate support for, and
membership in, ecological populations. A core genome phylogeny was constructed from 66
near-ubiquitous gene families (defined by blastclust, above) present in single copy in at
least 75 genomes, more than half the total number of ubiquitous gene families. Nucleotide
sequences were aligned with MUSCLE (default settings) and concatenated alignments'
phylogeny was estimated with Phyml (default settings). Separately, a concatenated
ribosomal protein tree was constructed. The amino acid sequences of 30 ubiquitous
OrthoMCL families encoding ribosomal proteins were aligned with MUSCLE and used to guide
alignment of nucleotide sequences with Pal2nal (Suyama, Torrents, and Bork 2006) using
default parameters.
2.6.11 Genome similarity by protein content
Global similarity between two strains' gene content was measured as the fractional overlap
in gene content. The Jaccard similarity between genomes A and B is
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where A is the set
of protein families present in genome A and B the set of protein families present in genome
B. This Jaccard metric was calculated using the Vegan package of the R computing
environment (Dixon 2003; Team 2012). Hierarchical clustering of proteomes was
performed by UPGMA, implemented in the 'cluster' package in R (Maechler et al. 2012).
2.6.12 Population similarity by protein content
Average genome-wide similarity of protein content between two populations was
calculated as the mean of the set of Jaccard distances between pairs of genomes composed
of one genome from each population.
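The genome-level and population-level calculations can be sketched in Python as follows. This is an illustrative re-expression, not the R code used in this study, and the function names are hypothetical. Each genome is represented as a set of protein family identifiers. The sketch works with Jaccard similarity; the corresponding Jaccard distance is one minus this value.

```python
def jaccard_similarity(families_a, families_b):
    """Shared protein families divided by all families seen in either genome."""
    a, b = set(families_a), set(families_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty genomes (an assumption)
    return len(a & b) / len(union)


def population_similarity(pop_a, pop_b):
    """Mean Jaccard similarity over all genome pairs composed of one
    genome from each population (each genome is a set of family ids)."""
    values = [jaccard_similarity(x, y) for x in pop_a for y in pop_b]
    return sum(values) / len(values)
```

For instance, genomes with family sets {1, 2, 3} and {2, 3, 4} share two of four families and so have a similarity of 0.5.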
2.6.13 Substrate utilization assays
All substrate utilization data was previously generated and generously supplied by S.
Preheim (Preheim 2010 and unpublished data) and detailed methods are available
elsewhere. Briefly, cells grown in suspension to OD600=0.1 were inoculated into each well
of a 96-well Biolog GN2 plate (containing 95 different carbon substrates plus water) and
incubated at 22°C for four days. Optical density readings were taken for each well at 580 nm
to quantify metabolism (i.e. respiration) in each well.
2.6.14 Substrate utilization profiles and similarity
Each strain's substrate utilization profile was defined as the vector of the ODs resulting
from the respiration of 95 different carbon substrates on the Biolog GN2 plate. Similarity of
profiles was defined as the Pearson correlation of these vectors.
2.6.15 Population similarity by substrate utilization profile
Overall similarity of substrate utilization between two populations was calculated as the
mean of the set of Pearson correlations between pairs of strains composed of one strain
from each population.
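The profile comparison can be sketched in Python as below. Each strain's profile is taken to be a list of optical density readings, one per substrate; the function names are hypothetical, and rejecting constant profiles with an error is a choice made for this sketch only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def population_profile_similarity(pop_a, pop_b):
    """Mean Pearson correlation over all strain pairs composed of one
    strain from each population (each strain is a list of ODs)."""
    values = [pearson(x, y) for x in pop_a for y in pop_b]
    return sum(values) / len(values)
```

Because Pearson correlation is insensitive to scaling, two strains that use the same substrates but differ in overall growth yield would still score as highly similar.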
2.7 Author Contributions.
The analysis and writing for this chapter were solely the work of S. Timberlake, with thanks
to O.X. Cordero for running OrthoMCL, one method for building orthologous gene families.
S.P. Preheim conceived of and performed all of the Biolog substrate utilization experiments.
Thanks to S.P. Preheim, M.C. Butler, A. Perrotta and others for isolating, culturing, and
prepping sequencing libraries of the vibrios used in this work. M.F. Polz and E.J. Alm
provided materials and reagents and valuable feedback.
2.8 References
See main thesis document reference section.
2.9 Figures and Legends
[Figure 1 image: two bar-chart panels of assembly N50 per genome (y-axis), with series labeled "hybrid assembly" and "de novo alone".]
Figure 1. Assembly N50 improvement using hybrid assembly approach.
The improvement of hybrid assemblies over de novo assemblies is quantified by a
comparison of the assembly statistic N50. The N50 is the length of the shortest contig in the
smallest set of largest contigs that together contain at least half the total base pairs of the
genome assembly. Red bars, N50 of de novo assembly alone; blue bars, N50 of hybrid
assembly. Top panel, single-end reads (51-76bp); bottom panel, paired-end reads (63-102bp). Genomes for which no hybrid assembly was performed (due to either assembly by
long-insert mate-pair reads or lack of a suitable reference) are omitted.
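For reference, the N50 statistic used in this figure can be computed with a few lines of Python. This is a generic sketch of the standard definition rather than the script used to produce the figure, and the function name is hypothetical.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken over
    contigs sorted from longest to shortest, first reaches half of
    the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty input
```

For contigs of length 80, 70, 50, 40, 30 and 20 (total 290), the two longest contigs together cover 150 bases, which exceeds half the total, so the N50 is 70.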
[Figure 2 image: phylogenetic tree. Scale bar, 0.1 nucleotide substitutions per site. Color strip legend: size fraction (>63 µm, 63-5 µm, 5-1 µm, <1 µm) and season (spring, fall).]
Figure 2. Species phylogeny of Vibrio genomes
Alternating gray/blue shaded backgrounds delineate ecologically differentiated clades
defined in previous studies (Hunt et al. 2008; Preheim 2010). Strains without population
prediction or where population prediction is directly contradicted by this species tree
topology are left unshaded. Leaf labels contain strain names and, where applicable,
population numbers predicted by AdaptML on the Fractionation 2006 data (Hunt et al.
2008). Color strips indicate size fraction and season of collection (not inferred habitat). Core
genome phylogeny reveals robust support for F11 and F17, clades previously unsupported
due to conflicting signals in housekeeping gene trees (Preheim, Timberlake, and Polz 2011).
Groups correspond to the following named taxa: 1, Enterovibrio calviensis-like; 2,
Enterovibrio norvegicus; 3, V. ordalii; 4, V. rumoiensis-like; 5, V. alginolyticus; 6, V.
aestuarianus; 7, Aliivibrio logei; 8, Aliivibrio fischeri; 9, V. breoganii; G10, V. sp. F10; 11, V.
splendidus cluster 1; 13, V. crassostreae; 14, V. kanaloae; 15, V. cyclitrophicus; 16, V.
tasmaniensis; 17, V. lentus; 18-25, V. splendidus.
This phylogeny was constructed from 66 ubiquitous gene families present in single copy in
at least 75 genomes. Gene families were defined by Blastclust (L=0.75,S=1.75)
(Dondoshansky and Wolf 2000). Nucleotide sequences were aligned with MUSCLE (default
settings), the phylogeny of the concatenated alignment was estimated with PhyML (default
settings, with 100 bootstrap replicates), and the tree was visualized with iTOL (Letunic and Bork 2011).
Nodes with bootstrap >80 are indicated with dots. Scale bar is in units of nucleotide
substitutions per site.
[Figure 3 image: two panels, "inferred habitat: particle size" and "inferred habitat: season"; series, same-habitat and different-habitat; x-axis, %16S distance between populations.]
Figure 3. Habitat and gene content
The overlap in protein-coding gene families (Jaccard distance) is plotted as a function of 16S
distance between populations. Populations with the same inferred habitat (adapted from
(Hunt et al. 2008)), either in terms of particle size (left panel) or seasonal preference (right
panel), do not have more similarity in their overall gene content. Each point represents the
mean gene overlap (Jaccard similarity) of all population pairs in that distance bin
(calculated separately for same-season or same-particle).
Jaccard similarity between each
population pair was calculated as the mean of the set of Jaccard similarities between all
pairwise genome comparisons composed of one genome from each population.
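The population-level comparison described in this legend can be sketched as follows. This is an illustrative sketch only: it assumes each genome is represented as a set of gene family identifiers, and the function names and toy data are hypothetical.

```python
from itertools import product


def jaccard_similarity(families_a, families_b):
    """Shared gene families divided by all gene families found in either genome."""
    a, b = set(families_a), set(families_b)
    union = a | b
    if not union:
        raise ValueError("both genomes are empty")
    return len(a & b) / len(union)


def population_similarity(pop_a, pop_b):
    """Mean Jaccard similarity over all genome pairs with one genome from each population."""
    scores = [jaccard_similarity(g, h) for g, h in product(pop_a, pop_b)]
    return sum(scores) / len(scores)


# Toy populations: lists of genomes, each genome a set of gene family ids.
pop_1 = [{"f1", "f2", "f3"}, {"f1", "f2"}]
pop_2 = [{"f2", "f3", "f4"}]
similarity = population_similarity(pop_1, pop_2)
print(similarity, 1.0 - similarity)  # similarity and the corresponding Jaccard distance
```

Averaging over all cross-population genome pairs keeps a single unusual genome from dominating the population-level value.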
[Figure 4 image: two panels, "inferred habitat: particle size" and "inferred habitat: season"; series, same-habitat and different-habitat; x-axis, phylogenetic distance (%16S).]
Figure 4. Habitat and substrate utilization profiles
The similarity in substrate utilization profiles is plotted as a function of phylogenetic
distance between populations. Populations with the same inferred habitat (adapted from
(Hunt et al. 2008)), either in terms of particle size (left panel) or seasonal preference (right
panel), do not have more similar substrate utilization profiles. Each point represents the
mean of metabolic similarities between all population pairs in that distance bin (calculated
separately for same-season or same-particle). Overall metabolic similarity between each
population pair was calculated as the mean of the set of Pearson correlations of strain pairs
composed of one strain from each population. The dashed gray line represents the mean
Pearson correlation in metabolic profile between 3 sets of biological duplicates. Data were
generated and generously supplied by S. Preheim (unpublished data).
[Figure 5 image: two panels, "HGT frequency in 2235 bacteria" (series: Human, Non-Human; Same site, Different site) and "HGT frequency in 80 vibrio genomes"; x-axis, 16S distance (%).]
Figure 5. Counting recent HGT within 80 vibrio isolates.
Recent HGT was counted within 80 Vibrio isolates (right panel) using a recent method that
estimated frequency of recent HGT amongst all 2235 full genomes publicly available for the
purposes of direct comparison (left panel, reproduced from previously published data
(Smillie et al. 2011)). The y-axis is the probability that a pair of genomes drawn from two
different species at a particular phylogenetic divergence (>=3% 16S distance) recently
shared at least one piece of DNA (a stretch of DNA >500bp long and >99% identical).
Minimal modifications to %16S distance calculation methods were made to accommodate
the assembled 16S of this dataset (see methods), resulting in conservative underestimates
of 16S distance.
[Figure 6 image: network diagram. Node color legend (particle size): <1 µm, 1-5 µm, >63 µm.]
Figure 6. Network of recent HGT between vibrio populations
Each node represents a population colored by particle size preference. Population numbers
and habitats from Fractionation 2006 study (Hunt et al. 2008) or merged from (Preheim,
Timberlake, and Polz 2011), except Vibrio cholerae-like strains, "VC". V. splendidus strains
unassigned to supported populations were grouped together, "SPL". HGT frequencies were
used to weight a spring-embedded force diagram made in Cytoscape (Smoot et al. 2011),
and unconnected nodes had no recent HGT observed with the other populations in this
dataset. Recent horizontally transferred sequences were discovered according to an
empirical method (Smillie et al. 2011) with the following modifications: nucleotide
stretches of 300-500bp (not just >500bp) were also considered and HGT event frequencies
were calculated for each population pair (not 2% average neighbor cluster) as the fraction
of their genome-pairs with at least one HGT sequence.
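The per-population-pair HGT frequency defined in this legend (the fraction of genome pairs with at least one HGT sequence) can be sketched as below. Detecting the shared stretches themselves is outside the scope of this sketch; it assumes that step has already produced a set of genome pairs with at least one qualifying hit. All names and data are hypothetical.

```python
from itertools import product


def hgt_frequency(pop_a, pop_b, pairs_with_hgt):
    """Fraction of cross-population genome pairs sharing at least one HGT sequence.

    pop_a, pop_b   -- lists of genome identifiers
    pairs_with_hgt -- set of frozenset({genome_x, genome_y}) for which at least
                      one qualifying shared stretch was detected (assumed input)
    """
    pairs = list(product(pop_a, pop_b))
    hits = sum(1 for x, y in pairs if frozenset((x, y)) in pairs_with_hgt)
    return hits / len(pairs)


# Toy example: 2 x 3 = 6 genome pairs, 2 of which share a transferred sequence.
population_a = ["a1", "a2"]
population_b = ["b1", "b2", "b3"]
detected = {frozenset(("a1", "b1")), frozenset(("a2", "b3"))}
print(hgt_frequency(population_a, population_b, detected))
```

Storing pairs as frozensets makes the lookup independent of the order in which the two genomes were compared.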
[Figure 7 image: two panels, "Pan genomes for 4 definitions of 'species'" and "Core genomes for 4 definitions of 'species'"; series, Vcyc:LP, Vcyc:LP+SP, OTU (<3% 16S), Vibrio family; x-axis, number of genomes.]
Figure 7. Core and pan genomes for 4 species definitions.
The core genome and pan genome curves are plotted for four different ways of delineating
species-like groups. For pan curves (upper panel), each point (x,y) counts the total number
of orthologous gene families y found in any of a set of x genomes (the vertical
spread results from different combinations of x genomes from the total N in the group;
where the number of combinations C(N,x) > 100, 100 sets of x genomes are chosen randomly);
the line traces the average over all combinations. For core curves (lower panel), each point
(x,y) counts the number of orthologous gene families y present in all x genomes. Core and
pan genomes were defined using 4 different hierarchical groups: a single ecologically
differentiated population of large-particle specialists (Vcyc:LP), a concatenated gene tree
phylogenetic cluster comprised of two ecological populations (Vcyc:LP+SP), a ~3% 16S rRNA
cluster (OTU <3% 16S), or a named taxonomic family (Vibrio family). When considering the
larger phylogenetic groups, population sample number was equalized by random downsampling.
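The construction of the pan and core curves can be sketched as follows, assuming each genome is given as a set of orthologous gene family identifiers. The function name, the seed, and the toy data are hypothetical; the cap of 100 randomly chosen sets follows the legend, although the sampling details here (independent draws that may repeat a combination) are an assumption.

```python
import random
from itertools import combinations
from math import comb


def pan_core_points(genomes, x, max_sets=100, seed=0):
    """Pan and core genome sizes for subsets of x genomes.

    genomes -- list of sets of gene family ids, one set per genome
    Returns (pan_sizes, core_sizes), one entry per subset examined.
    """
    n = len(genomes)
    if comb(n, x) > max_sets:
        # Too many combinations to enumerate: draw random subsets instead.
        rng = random.Random(seed)
        subsets = [rng.sample(range(n), x) for _ in range(max_sets)]
    else:
        subsets = list(combinations(range(n), x))
    pan_sizes, core_sizes = [], []
    for subset in subsets:
        chosen = [genomes[i] for i in subset]
        pan_sizes.append(len(set.union(*chosen)))          # families in any genome
        core_sizes.append(len(set.intersection(*chosen)))  # families in every genome
    return pan_sizes, core_sizes


toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
for x in range(1, len(toy_genomes) + 1):
    pan, core = pan_core_points(toy_genomes, x)
    print(x, sum(pan) / len(pan), sum(core) / len(core))
```

Each subset contributes one point to the vertical spread at x, and the printed means correspond to the line traced through the points.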
[Figure 8 image: species phylogeny (scale bar, 0.1) shown beside a clustering of strains by Jaccard distance of gene content; leaves are strain names. Legend, inferred habitat: LP, >63 µm; SP, 1-5 µm.]
Figure 8. V. cyclitrophicus strains clustered by flexible gene content and neutral
markers
Strains of the V. cyclitrophicus population are hierarchically clustered according to their
fraction of shared protein families. Green and red squares label strains assigned to small
particle and large particle specialist populations of this species, respectively (Hunt et al.
2008; Shapiro et al. 2012). Phylogeny of 30 concatenated ribosomal proteins supports
non-clonal origins for the SP and LP specialists. Species tree shown is a subtree of the
concatenated ribosomal protein phylogeny; dots denote branches with >80% bootstrap
support.
[Supplemental Figure 1 image: two panels, "16S from Vibrio finished genomes" and "16S assembled from short reads"; x-axis, position in 16S rRNA alignment.]
Supplemental Figure 1. Distribution of polymorphisms in 16S genes assembled with
short read pipeline.
Polymorphism frequency by position across the 16S gene assembled from short reads was
concentrated in variable regions also seen in finished genomes from the family, consistent
with the idea that they are true polymorphisms, not short-read sequencing/assembly
artifacts. Top panel, the distribution of intra- and intergenome polymorphism observed for
the family using 129 16S rRNA genes found in 15 finished genomes of 10 Vibrionales
species. Lower panel, intragenome polymorphic sites of 16S genes assembled by short read
pipeline.
Supplemental Figure 2. 16S phylogeny of our populations and other vibrio type
strains.
16S rRNA genes assembled from short reads are placed in phylogenetic context with other
cultured Vibrio isolates. One representative per population with the fewest ambiguous
(polymorphic) positions was selected, aligned to the Silva reference, and Phyml was used to
estimate the phylogeny.
[Supplemental Figure 3 image: scatter plot with labeled outliers; x-axis, % 16S divergence between populations.]
Supplemental Figure 3. Gene content divergence with phylogenetic distance
The fractional overlap in protein-coding gene families (Jaccard distance) is plotted against
16S distance between populations. Each point is the mean Jaccard distance between two
populations (Jaccard distance between each population pair was calculated as the mean of
the set of all Jaccard distances between genome pairs composed of one genome from each
population.) Outliers are labeled with population numbers from the Fractionation2006 study;
spl, V. splendidus putative populations 19, 20, 23. Some populations have %16S distance ≈ 0
because intragenomic diversity exceeds between-population divergence. The linear
divergence of gene content seems to saturate at high %16S distance, but it is possible this is
due to orthologous group definitions.
[Supplemental Figure 4 image: panel "inferred habitat: large particle specialist"; series, LP-specialist pairs and other pairs; x-axis, %16S distance between populations.]
Supplemental Figure 4. Overlap in gene content between large-particle-associated
populations
The fractional overlap in protein-coding gene families (Jaccard distance) is plotted as a
function of 16S distance between populations. Large-particle specialist populations (Hunt et
al. 2008), do not have more similarity in their overall gene content. Each point represents
the mean Jaccard distance between population pairs in that distance bin (calculated
separately for same-season or same-particle). Jaccard distance between each population
pair was calculated as the mean Jaccard distance of the set of all pairwise genome
comparisons composed of one genome from each population. Gene content overlap in same-
and different-habitat comparisons differed only at one distance bin (3% < 16S distance
<4%, p=0.025).
[Supplemental Figure 5 image: series, same-habitat and different-habitat. The title and opening of this legend are illegible in the source.]
... of rare genes.) Rare genes were defined as genes found in between 3 and 20 genomes (other
definitions were used with similar results). There was no consistent pattern of increased
shared genes in genome pairs of same versus different habitats, with one consistent
exception for all definitions of rarity: same-habitat strains within a population (first
distance bin) always share more genes (p<0.01). The outlier at intermediate
distance consists of comparisons of Population 1 and 2 strains, which have surprisingly
similar gene content given their distance; the reason for this is unknown.
Supplemental Figure 6. Substrate utilization in the Vibrio family
A heat map of respiration on 95 substrates (OD on Biolog GN2) in each of 46 strains with fully
sequenced genomes. Substrates are hierarchically clustered according to the OD vector in
46 strains (eisenCluster, implemented in the R software package (Chipman and Tibshirani
2006)); strains are arranged by the core gene phylogeny. Leaves are labeled with strain names and original
population number (from Fractionation2006 study or adapted where necessary (Preheim,
Timberlake, and Polz 2011)). Leaf color strips code for season or filter size of isolation. In
large circles, numbers indicate assigned populations for clade (for confirmed populations,
see Methods) and colors indicate inferred habitats (Hunt et al. 2008).
[Supplemental Figure 7 image: x-axis, %16S distance.]
Supplemental Figure 7. Substrate utilization profile divergence with phylogenetic
distance
Similarity of substrate utilization for each population pair is plotted against evolutionary
distance (%16S distance). The Pearson correlation of each strain pair was calculated using
the vector of ODs for growth on 96 Biolog substrates; each population pair's similarity was
calculated as the mean correlation between all pairwise strain comparisons composed of one
strain from each population, and bars represent standard deviation. Where only one comparison
was available, the median standard deviation of all other comparisons is plotted. The dashed
gray line represents the mean Pearson correlation between 3 sets of biological duplicates.
Large variation at intermediate distance is due to comparisons of V. splendidus populations
with similarly metabolically versatile V. alginolyticus versus comparisons with metabolically
diminished V. breoganii.
[Supplemental Figure 8 image: histogram; x-axis, penetrance in cyclitrophicus (1 to 19 genomes).]
Supplemental Figure 8. Distribution of gene family penetrance in V. cyclitrophicus
A gene frequency distribution curve counts the number of gene families ("clusters") found
in 1, 2,...,X genomes of the strains of V. cyclitrophicus.
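The gene frequency distribution described in this legend can be tallied with a short sketch. It assumes gene families are available as a mapping from a family identifier to the genomes in which it occurs; the names and toy data are hypothetical.

```python
from collections import Counter


def penetrance_distribution(family_to_genomes):
    """Count gene families by the number of distinct genomes in which they occur."""
    return Counter(len(set(genomes)) for genomes in family_to_genomes.values())


# Toy input: family id -> genomes carrying a member (paralogs may repeat a genome).
toy_families = {
    "fam1": ["g1", "g2", "g3"],
    "fam2": ["g1"],
    "fam3": ["g2", "g2"],
    "fam4": ["g1", "g3"],
}
distribution = penetrance_distribution(toy_families)
for n_genomes in sorted(distribution):
    print(n_genomes, distribution[n_genomes])
```

Converting each genome list to a set means a family with several paralogs in one genome is still counted once for that genome.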
2.10 Supplemental notes and discussion.
2.10.1 Hybrid assembly pipeline: discussion of accuracy.
Assembly of whole genome shotgun reads is a probabilistic process, involving heuristic
steps (Zerbino et al. 2009; Chaisson, Brinza, and Pevzner 2009). Assemblies can incorporate
several major classes of errors: base substitution, base insertion, contig mis-joining, and
region duplication or deletion. Because this short read sequencing dataset wasn't created
with methods development in mind, there is no gold standard to check for accuracy.
However, there are two modular steps in the hybrid assembly pipeline, so we can consider
the accuracy of each separately. The latter step consists of de novo assembly (using
reference assembly contigs as fake long reads), and the accuracy of the de novo assembler
Velvet in terms of these error classes has been documented extensively in the literature.
The first step consists of the generation of reference assembly contigs. Future work should
further quantify how the errors they contain can be propagated through the de novo assembly
process.
2.10.2 16S rRNA phylogeny
To explore the phylogeny and taxonomy of our sequenced Vibrionaceae strains with other
members of the genus, a phylogeny was built including 16S sequences from complete
genome sequences of the Vibrionales order (Supplemental Figure 2). A striking topological
difference is the placement of the V. cholerae-like clade as an outgroup to the rest of the
Vibrio genus. It is certainly possible that this placement is spurious and that the 16S-based
distance estimates to this group are inflated compared to most measures of phylogenetic divergence.
However, V. cholerae-like strains were not particle-size filtered and were not used
throughout the majority of this study. More thorough analyses of the Vibrionaceae family's
phylogeny (Kirkup et al. 2010) and these populations' taxonomy (Preheim, Timberlake, and
Polz 2011) have been undertaken elsewhere.
2.10.3 Defining gene families in 82 genomes: sensitivity to methodology, sampling.
Assigning orthology should not be challenging in a group of this level of sequence
divergence, but the size of the dataset presented challenges, as sequencing has outpaced
analysis tools. Several standard methods of constructing gene families were assessed for
our set of genomes, with widely varying results. Grouping of genes varied by program
choice and parameter choice but, most strikingly, with genome sampling within clades,
resulting in family core genome predictions (provisionally defined as those ubiquitously
present in the family's assemblies) that ranged from 70 to 1000. Within-population
downsampling resulted in highly discrepant gene family predictions: Blastclust predicted
only 160 ubiquitous gene families in the vibrios when all strains were used for clustering,
but more than 1000 when only 1 genome from each population was used (a number of
Blastclust parameter sets were tried with similarly discrepant results). Newer methods
designed for finely sampled genomes (Angiuoli et al. 2011) start with whole genome
alignments, using synteny to improve scaffolds, ORF calls, and ortholog assignments;
however, these methods are only appropriate at close phylogenetic distances and could not
be applied to our family. OrthoMCL was ultimately chosen for this genome content analysis
as it is commonly used and the sizes of the core genomes it produced were similar to estimates for
other Vibrionaceae family members, using the well-validated approach of tree-based
orthology (P. S. Dehal et al. 2010; M. Price, Dehal, and Arkin 2007).
2.10.4 Using gene content divergence to confirm population predictions.
The Fractionation2006 study inferred many recently differentiated populations in the
adaptive radiation of the V. splendidus/tasmaniensis clade. Whether these are truly distinct
populations has been difficult to confirm. They were not well supported by MLSA analysis
or by sampling occasions after the 2006 study (Preheim, Timberlake, and Polz 2011). It is not
yet well established that proteome content divergence can be used generally to distinguish
differentiated clusters; however, both proteome and concatenated gene phylogenies
support Population 18 but both reject Population 19. V. tasmaniensis populations 16 and 17
should definitely be placed apart even though they were combined into a single population
by some ecological characters (Preheim 2010, Particle2007 study); they likely represent
more than two populations, as prescribed by habitats based on host and body site (Preheim
et al. 2011, Invert2007 study).
2.10.5 Habitat-specific gene gain and loss in V. cyclitrophicus
We consider the results of using a broader group of fixing genes than that presented
in the text. We considered 1155 flexible gene families present in more than 2 V.
cyclitrophicus strains, of which 128 were significantly enriched in one habitat (2-tailed
hypergeometric test with 5% FDR). Thus, more than 10% of the flexible genes apparently
rising to fixation (or being deleted) are doing so in a clearly habitat-specific manner.
Narrowing our focus to more common genes in the main text resulted in a higher estimate
of habitat-specific fixation.
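The enrichment test described above can be sketched in a few lines. The thesis does not specify how the two tails were combined or which FDR procedure was used, so this sketch makes two assumptions: the two-tailed p-value sums the probabilities of all outcomes no more likely than the observed one (as in Fisher's exact test), and the 5% FDR is controlled with the Benjamini-Hochberg procedure. Function names and example numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(a, n_strains, n_habitat, n_carriers):
    """P(exactly a of the n_carriers strains with the gene belong to the habitat)."""
    return (comb(n_habitat, a) * comb(n_strains - n_habitat, n_carriers - a)
            / comb(n_strains, n_carriers))


def two_tailed_p(a, n_strains, n_habitat, n_carriers):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    lo = max(0, n_carriers - (n_strains - n_habitat))
    hi = min(n_habitat, n_carriers)
    p_obs = hypergeom_pmf(a, n_strains, n_habitat, n_carriers)
    tail = sum(p for p in (hypergeom_pmf(i, n_strains, n_habitat, n_carriers)
                           for i in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))  # small tolerance for float ties
    return min(1.0, tail)


def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per p-value: True if significant at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank  # largest rank passing its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        significant[i] = rank <= cutoff_rank
    return significant


# Toy case: 20 strains, 10 per habitat; a gene family found in 10 strains,
# all from one habitat.
print(two_tailed_p(10, 20, 10, 10))
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

As the caveats later in this section note, a test of this kind treats strains and gene families as independent observations.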
The selective processes governing flexible gene gain and loss are very likely
different. If we separate these gene families into two groups, those likely to be present in the
last common ancestor and undergoing loss, and those that have been recently gained and
are rising to fixation, we see separate trends. V. cyclitrophicus flexible gene families also
present in two or more closely related (<3% 16S distance) outgroups were defined as
ancestral and undergoing loss. These 565 flexible gene families are putatively being deleted
from a common ancestor, while the rest (590) are recent gains. For recent gains, nearly
20% of them (114/590) are being fixed in a habitat-specific manner (2-tailed
hypergeometric test with 5% FDR). Gene gains are much more likely to have a habitat-specific
distribution than genes lost (ks.test on hypergeometric p-values: D^- = 0.2125,
p<1e-11). As above, the results are consistent when a narrower focus considers more
common genes.
There are several major caveats for this approach. The main issue is that the strains
are treated as if they are on a star phylogeny, with independent history of gain and loss.
While the well-mixed nature of the genome does not directly reject this idea, there is
certainly some evidence for same-habitat strains sharing more phylogenetic history.
Although the SP and LP specialists are not each clonal expansions, the same-habitat pairs
constitute some of the most closely related strains in the group, and gene gains in these
lineages likely do not meet the criteria of independence for the hypergeometric test. To try
to minimize this effect we considered genes near fixation, such that their distribution in
multiple genomes was unlikely to be a result of clonal expansion. Secondly, this test
treats each ORF's presence/absence, and therefore its fixation, as
independent. This is almost certainly not the case, as genes arrive on stretches of DNA,
sometimes very large ones, that are mobilized together (Shapiro et al. 2012). However, all
models of gene content we are aware of treat each ORF independently, including the neutral
gene content models recently developed (Haegeman and Weitz 2012), and our intention
was to assess their applicability in this system.
2.10.6 Gene deletions in general: bias, neutrality, or selection?
The majority of flexible genes that enter a population are not fixed by selection; rather, they
are deleted rapidly. Is this deletion neutral? During experimental evolution of M.
extorquens, accessory genome deletions were found to be due to selection rather than
deletion bias, as they were not observed in mutation accumulation assays (M. C. Lee and
Marx 2012). However, others argue for strong and pervasive deletion bias in bacterial
genomes (Kuo, Moran, and Ochman 2009; Kuo and Ochman 2009), though the time scales
are much broader. Our comparison of strains evolving in the wild versus an equivalent
amount of SNP-divergence in the laboratory revealed a stark contrast, with wild strains
deleting orders of magnitude more DNA. In 7 laboratory lines of V. splendidus 12B01
evolved under conditions of elevating salt concentration, lines accumulated 50-100 SNPs,
but deletions were rarely observed. In stark contrast, wild-type sister strains of similar
divergence (1F-111/1F-273, 1F-53/1F-175, ZF-207/ZF-30 from V. cyclitrophicus,
12F01/12B01 and 12B09/FS-144 from V. splendidus and V. ordalii respectively), each with
<100 core genome SNPs (or <300 total genome SNPs, including SNP clusters likely due to
recombined regions), had >100-300kb of flexible DNA deleted in each lineage. This figure is
likely an underestimate of the total amount of DNA deletions that have occurred along each
branch since only the leaves can be sampled. This suggests that perhaps fast loss of flexible
DNA may be mediated by turnover (i.e., replacement by new incoming DNA). Whether this
fast turnover of flexible DNA is mediated by different mechanisms than the deletion bias
typically discussed in the literature has not been explored, as strains this closely related
have not, to our knowledge, been analyzed previously.
Chapter 3. Conservation of coexpression without conservation of response in bacterial gene regulatory networks
3.1 Authorship & Publication Note
This chapter has been submitted as a manuscript by the following authors:
Sonia C Timberlake, Romy Chakraborty, Aindrila Mukhopadhyay, Katherine Huang,
Dominique C Joyner, Marcin P Joachimiak, Jason K Baumohl, Paramvir S Dehal, Zhili He,
Aifen Zhou, Jizhong Zhou, Adam P Arkin, Terry C Hazen, Eric J Alm
Detailed author contributions are available in section 3.7.
3.2 Abstract
Transcriptional response to environmental perturbation is important for fitness, but how
these adaptive strategies evolve remains poorly characterized. In this study, we address
two fundamental questions about the evolution of bacterial gene regulation. First, we asked
to what extent the basic units of the transcriptional network, coexpressed gene pairs, are
conserved across the billions of years of evolution spanned by three bacterial clades.
Second, we asked whether the global transcriptional phenotype in an environment is
conserved.
We report an extensive compendium of transcriptomic data in three metal-reducing
bacteria we generated to conduct our comparative survey of transcriptional phenotype. We
also analyze hundreds of previously described microbial gene expression experiments, to
quantify coexpression conserved in our model organisms and across the bacterial domain.
Comparing orthologous genes' responses, however, revealed that global transcriptional
phenotypes are not conserved over relatively short time spans, although we identify specific
exceptions where orthologous genes respond similarly to an environmental cue. We also
show that bacterial transcriptional responses are extremely sensitive to variations in
experimental treatments.
We find a lack of consistency or conservation in bacterial transcriptional responses,
which calls into question the standard analysis of transcriptomic data listing significant
changers in a single experiment. Comparative transcriptomics may be critical for identifying
the elements of any transcriptional response that are functionally (and not just statistically)
significant, but consistently produced data from very closely related bacteria will be
required. Our comparative survey of gene expression underscores the need for a model of
gene expression evolution, one that can relate the robustness and fitness of transcriptional
responses to their subsequent selection.
3.3 Background
Microbes in changing environmental conditions alter their phenotype via gene regulation in
response, but the selection on this response is poorly understood. Although instances of
differentially regulated genes have been extensively documented (DeRisi, Iyer, and Brown
1997; Faith et al. 2008), even a basic understanding of their fitness contributions (Birrell et
al. 2002; Giaever et al. 2002; Klosinska et al. 2011; S. L. Tai et al. 2007), divergence, or drift
is lacking.
A fundamental unanswered question is whether orthologous bacterial genes
respond the same way to the same environmental cues. While recent studies have
characterized the divergence of transcriptomic responses in 5 closely related yeast species
(Tirosh et al. 2006), and of proteomic abundance profiles in 10 co-generic proteobacteria
(Konstantinidis et al. 2009), to our knowledge, conservation of bacterial transcriptomic
responses had never been quantified. Comparative transcriptomic studies in bacteria
typically revolve around strain-specific differences, which can then be implicated in
pathogenic or other adaptive phenotypes (Yoder-Himes et al. 2009; Guell et al. 2011; Zhang,
Dudley, and Wade 2011), but the extent to which these differences are due to selection or
drift has not been shown (M. Lynch 2007). Indeed, even upregulated genes' presumed
selective advantage is questioned (Birrell et al. 2002; Giaever et al. 2002; Klosinska et al.
2011; S. L. Tai et al. 2007), but their conservation could support it. Whether bacterial genes
respond the same way to the same condition is thus fundamental to our understanding of
transcriptional response. To characterize the evolution of transcriptional phenotype in a
variety of species and environmental stressors, we generated an extensive compendium of
transcriptomic data in three proteobacteria, and analyzed data from public repositories.
We focused our study of transcriptional phenotype on three dissimilatory metal-reducers
that have gathered interest as agents of bioremediation. Geobacter metallireducens
GS-15 and Desulfovibrio vulgaris Hildenborough are delta-proteobacteria and likely
diverged from Shewanella oneidensis MR-1 and the rest of the gamma-proteobacteria three
billion years ago at the base of the Proteobacterial clade (Ciccarelli et al. 2006; David and
Alm 2010). D. vulgaris is an anaerobic sulfate-reducer and S. oneidensis is a facultative
anaerobe, but their common natural environments and dissimilatory metal-reducing
lifestyles may pose similar challenges. On a genomic level, they share 832 putatively
orthologous genes, roughly a quarter of D. vulgaris' gene content, including half of its
transcription factors (Heidelberg et al. 2004; Mulder et al. 2007), though recent work
suggests that transcription factors identified as bi-directional best hits (BBH) may be an
overestimate of the vertically-inherited regulators controlling orthologous regulons (M.
Price, Dehal, and Arkin 2007). We also assayed the phenotype of G. metallireducens, a much
closer relative of D. vulgaris, sharing more gene content (Paramvir S. Dehal et al.) and the
obligate anaerobic lifestyle (Lovley et al. 1993).
To ask how orthologous genes respond to similar environmental stressors, we
generated an extensive compendium of transcriptional profiles by subjecting these bacteria
to a variety of ancient and ubiquitous environmental changes: heat, cold, acid, alkaline, salt,
and nitrate. We also analyzed public transcriptomic datasets for evidence of
transcriptional response conservation in other species and conditions.
While the evolution of transcriptional response has not been described in
prokaryotes, conservation of their transcriptional network has been explored by looking at
conserved network architecture, including modules (McAdams, Srinivasan, and Arkin 2004;
Babu et al. 2004; Gelfand 2006). Studies have convincingly demonstrated that genomes are
partitioned into functional modules that are expressed together (Stuart et al. 2003;
Bergmann, Ihmels, and Barkai 2004), clustered together (Price et al. 2005), physically
interacting (Zinman, Zhong, and Bar-Joseph 2011) and laterally transferred together
(Morgan N. Price, Arkin, and Alm 2006). As the transcriptional network evolves, these
units can remain cohesive, with some conserved transcriptional modules documented in
species as divergent as E. coli and metazoans (Stuart et al. 2003; Bergmann, Ihmels, and
Barkai 2004), or even out to B. subtilis (Waltman et al. 2010; Zarrineh et al. 2010), but are
part of a complex picture of network evolution, including flexibility and extensive rewiring
of functional modules (Fokkens and Snel 2009; Roguev et al. 2008; Singh et al. 2008). For a
look at overall conservation of the transcriptional networks in our model organisms and
across the bacterial domain, we start by quantifying the conservation and function of the
simplest possible units, coexpressed gene pairs, in four very divergent bacteria.
Despite abundant transcriptomic data in bacteria, no model exists for its evolution,
and the correspondence between orthologous genes' responses is not even qualitatively
understood. This comparative survey of transcriptional phenotype was undertaken to
examine transcriptional network evolution in bacteria in two fundamental ways:
conservation of links on the network, and conservation of network output. To do so, we
generated microarray data in three bacterial species and drew on a large database of public
microarray data to evaluate conservation of coexpression and response to a variety of
environmental stressors.
3.4 Results and Discussion
3.4.1 Coexpression is significantly conserved across bacteria
Our broad survey of conservation of coexpression painted a complex picture: we identify a
very significant number of coexpression couplings conserved over vast evolutionary
distances spanned by bacterial phyla. The majority, however, we do not find conserved,
likely reflecting re-organization of species' transcriptional programs, in agreement with
previous reports (Zarrineh et al. 2010). We used hundreds of microarray experiments to
define coexpressed genes and assess their conservation across the bacterial domain,
focusing on the simplest unit of the transcriptional network, coexpressed gene pairs. Gene
pairs that are up-regulated or down-regulated together across many conditions tend to be
functionally associated (Brown and Botstein 1999; Eisen 1998; Fink et al. 1998), and may
participate in the same pathway or physically interact. Coexpressed genes have been
further clustered into larger modules that act cohesively across conditions and evolutionary
time (Cai, 2010; van Noort et al., 2003; Waltman et al., 2010; Zarrineh et al., 2010).
Conservation in an ancient gene regulatory network was assessed by comparing
coexpressed gene pairs in three very divergent bacterial clades: a cyanobacterium
(Synechocystis PCC) (Kaneko et al. 1996) and our two model organisms (deeply branching
proteobacteria), using data from the well-studied E. coli MG1655 to bolster inference of
coexpressed genes in the gamma-Proteobacterial branch. Between the extensive phenotypic
dataset we generated and public databases (Barrett et al. 2007; Paramvir S. Dehal et al.),
these four organisms had some of the most extensive and varied assays of their
transcriptomic responses to common ecological conditions. This wide variety of
experimental conditions sampled many distinct transcriptional states, and we built a
comprehensive picture of coexpression for millions of gene pairs, then asked which
coexpression links were conserved.
Despite a few billion years spent diverging from a common ancestor, these
transcriptional networks shared more than 14 times the number of coexpressed pairs
expected by chance: of 18196 coexpressed pairs in the gamma-proteobacteria with
orthologous genes and data in the other two clades, 2629 of these links were conserved
whereas only 182 were expected under a model of random assortment. The model of
random assortment was constructed without reference to operon structure, because while
conservation of operon structure could be a mechanism underlying many conserved
coexpression links, operon structure is driven by co-regulation (M. N. Price et al. 2005), so it
is therefore likely that selection on coexpression preserves co-operon structure rather than
the reverse. We identified putative coexpressed pairs in a single organism as those in the
top 10th percentile of correlated expression over all conditions. (See methods for details and
results using alternative cutoffs). To validate our cutoff for defining coexpressed pairs, we
verified that this level of coexpression conservation was typical of protein partners by
considering the subset of coexpressed pairs also known to be physically interacting in vivo
(Butland et al. 2005; Hu et al. 2009). For the subset of 198 ubiquitous proteins where
coexpression confidence was increased by physical interaction data, adding this criterion
resulted in pairs just over twice as likely to be conserved. Thus, our simple strategy
identified coexpressed gene pairs with regulatory conservation comparable to that of
coexpressed proteins also known to physically interact in E. coli.
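
As a rough illustration of the pair-level analysis described above, the following Python sketch marks gene pairs in the top 10% of pairwise correlation as coexpressed and computes an expected number of conserved pairs. The function names, the use of Pearson correlation, and the null calculation (each pair independently has a 10% chance of being coexpressed in each of the other two clades) are assumptions made for illustration. The null calculation in particular is only one reading that happens to reproduce the reported expectation of about 182 from 18196 pairs; the Methods remain the reference for the actual procedure.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpressed_pairs(profiles, top_fraction=0.10):
    """Return the gene pairs in the top `top_fraction` of correlation.

    `profiles` maps a gene name to its expression values across conditions
    (hypothetical input layout, not the thesis data format).
    """
    scored = sorted(
        ((pearson(profiles[a], profiles[b]), a, b)
         for a, b in combinations(sorted(profiles), 2)),
        reverse=True,
    )
    keep = max(1, round(top_fraction * len(scored)))
    return {(a, b) for _, a, b in scored[:keep]}


def expected_conserved(n_pairs, top_fraction=0.10, n_other_clades=2):
    """Expected conserved pairs if each clade's top set is independent
    (an assumed reading of the random assortment model)."""
    return n_pairs * top_fraction ** n_other_clades


expected = expected_conserved(18196)   # about 182
fold_excess = 2629 / expected          # a little over 14
```

Under these assumptions, 2629 observed links against roughly 182 expected corresponds to the reported excess of more than 14-fold.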
What are the cellular functions of these pieces of the transcriptional network
conserved across vast distances? Figure 1a depicts the conserved gene pairs' functions (as
defined by the Riley functional categories used in COG (Tatusov et al. 2003)), and which
functions are unexpectedly likely to be linked given the distribution of conserved genes. Our conserved
set of links was enriched five-fold for pairs participating in protein translation, a process
known to be tightly co-regulated (Leary and Huang 2001; Kaczanowska and Ryden-Aulin
2007), though the fact that stressful conditions comprise a large part of microarray
databases and our data compendium may bias us towards discovery of this functional
category. Stress conditions tend to slow growth rates and ribosomal biogenesis in general
(Brauer et al. 2008; Rzhetsky et al. 2009), and this is particularly true in our experiments, as
experimental perturbations were tuned such that each resulted in a 50% reduction in
growth rate. While a large fraction of conserved genes are in the replication, recombination,
and repair functional category (L), conserved links to this category are few, indicating that
these conserved genes have divergent regulation. While previous studies have described
predominantly within-functional-category coexpression links (Bergmann, Ihmels, and
Barkai 2004; Stuart et al. 2003), Figure 1a illustrates many conserved links between genes
from different COG functional categories. For instance, energy production and conversion
genes are coexpressed with translational machinery genes (C-J), at a rate 10-fold higher
than expected, with the genes of the ATP synthase complex among the most highly
connected, a reflection of the high cost of ribosome biosynthesis (Warner 1999).
Interestingly, the steep energy cost of amino acid usage in translation (Akashi and Gojobori
2002) is not reflected by conserved regulatory links with the energetic functional category
(E-C), though nucleotide metabolism is unexpectedly coupled with energy.
The choice of correlation threshold for defining coexpressed pairs influences
quantification of conservation, so we took a broader view by visualizing the divergence in
the whole coexpression distribution instead of just the top 10%. The observed joint
distribution for any two species records, for each gene pair, where its correlation falls in
one species' coexpression distribution and where the correlation of its orthologous pair
falls in the other's. Under a model of random
assortment, the correlation of a gene pair in one organism is independent of its orthologous
gene pairs in others, so the expected joint distribution is the product of the two
independent species' distributions. Figure 1b quantifies the joint distribution's departure
from the random assortment model: the departure occurs at the extremes of coexpression,
and the waning similarity between successively more distant pairs of species suggests that the
divergence in coexpression has not yet plateaued.
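
The comparison of an observed joint distribution with the product of the two species' own distributions can be sketched as follows. This is a minimal illustration, assuming rank-based binning into ten bins and a log2 ratio as the measure of departure; the binning scheme, the function names, and the input layout are illustrative choices, not details taken from the analysis behind Figure 1b.

```python
from math import log2


def rank_bins(values, n_bins=10):
    """Assign each value a rank-based bin index from 0 to n_bins - 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def log2_enrichment(corr_a, corr_b, n_bins=10):
    """log2(observed / expected) for every cell of the joint distribution.

    corr_a[k] and corr_b[k] are the correlations of the same orthologous
    gene pair in species A and B. The expected count of a cell is the
    product of its row and column totals divided by the number of pairs,
    i.e. the count under independence. Empty cells are returned as None.
    """
    n = len(corr_a)
    bins_a = rank_bins(corr_a, n_bins)
    bins_b = rank_bins(corr_b, n_bins)
    observed = [[0] * n_bins for _ in range(n_bins)]
    for i, j in zip(bins_a, bins_b):
        observed[i][j] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(row[j] for row in observed) for j in range(n_bins)]
    enrichment = [[None] * n_bins for _ in range(n_bins)]
    for i in range(n_bins):
        for j in range(n_bins):
            expected = row_totals[i] * col_totals[j] / n
            if observed[i][j] and expected:
                enrichment[i][j] = log2(observed[i][j] / expected)
    return enrichment
```

Positive values in the corner cells of the returned matrix would correspond to pairs that are strongly correlated (or strongly anticorrelated) in both species more often than independence predicts.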
We were surprised to see that a number of gene pairs remain anticorrelated over
billions of years (Figure 1c). While approximately half of these conserved pairs were
expected by chance and may be spurious, the links could generate hypotheses about
function. For example, ycaI is currently uncharacterized (Uniprot Consortium 2011) and has
no known physical interactions (Su et al. 2008) nor convincing functional data compiled in
online databases (Szklarczyk et al. 2010), apart from homology to DNA internalization
proteins. The expression of ycaI in opposition to a suite of growth and translation proteins
(cogJ, asnS, engA, ileS) (Uniprot Consortium 2011) and to a few proteins of the de novo
nucleotide synthesis pathway (guaA, purB), is consistent across billions of years, raising the
possibility that despite the diversity of natural competence programs observed in bacteria
(Gilbreath et al. 2011), competence induced out of phase with the ribosomal/growth
transcriptional program is ancient and conserved. Anticorrelated expression relationships,
and conserved ones in particular, can complement the usual expression profile clustering as
a means for generating critically needed hypotheses about physiology or function (Dhillon,
Marcotte, and Roshan 2003; Shatkay et al. 2000) but are not frequently used.
3.4.2 Expression phenotypes are not conserved
A fundamental question in the evolution of transcriptional phenotypes is whether
orthologous genes respond in a similar way to similar conditions. Because we could not find
this directly addressed in the scientific literature, we searched public databases to gather
comparable stress experiments performed in two or more species. For a number of
common environmental perturbations, including heat stress, oxidative stress, acid, alcohol,
and UV, we compared transcriptional responses to each stress in the closest species
available (typically same-genus or family), and never observed significant conservation (for
all orthologous genes, we compared log fold-change between treatment and control
expression, transcriptional response hereafter). The single exception was in the close
relatives Escherichia coli and Salmonella enterica, in which we were able to find one time
point in heat shock with a modest signal for conservation (r² = 0.22) in our comparison of six
pairs of experimental time courses (Allen et al. 2003; Gutierrez-Rios et al. 2003; Jozefczuk et
al. 2010) (see Supplemental Notes for details).
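
The comparison described above reduces to correlating log fold-changes over orthologous genes. A minimal sketch follows, assuming Pearson correlation squared as the similarity measure and a simple list of ortholog pairs; the gene identifiers and function names are hypothetical.

```python
from math import log2


def log2_fold_change(treatment, control):
    """Transcriptional response: log2 of treatment over control expression."""
    return log2(treatment / control)


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def response_similarity(resp_a, resp_b, orthologs):
    """r^2 between two species' responses over their shared orthologs.

    resp_a and resp_b map gene name to log2 fold-change in each species;
    orthologs is a list of (gene_in_a, gene_in_b) pairs. Pairs with
    missing data in either species are skipped.
    """
    shared = [(resp_a[a], resp_b[b]) for a, b in orthologs
              if a in resp_a and b in resp_b]
    xs, ys = zip(*shared)
    return r_squared(xs, ys)
```

Because r² discards the sign of the correlation, a strongly anticorrelated pair of responses would also score highly, so the sign of the underlying correlation is worth inspecting separately.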
In fact, we noticed that when a species' response to a given condition was assayed
by more than one laboratory, the inter-lab comparisons overwhelmingly and surprisingly
showed poor or zero correlation. Figure 2 highlights an example: E. coli differential
expression in pH perturbations shows little or sometimes negative correlation across
different laboratories (Faith et al. 2008; Glasner et al. 2003; Kannan et al. 2008; Maurer et
al. 2005; Hayes, Wilks, Yohannes, et al. 2006; Hayes, Wilks, Sanfilippo, et al. 2006). In the
case of acute acid stress, three datasets (Faith et al. 2008; Glasner et al. 2003; Kannan et al.
2008) report a total of 836 significantly up-regulated genes, only two of which were
common to all three studies (at any time point). Indeed, the responses to acid compared
across laboratories were less similar than acid and alkaline within a laboratory, indicating
that the transcriptional response is driven more by the details of experimental protocol and
analysis than by the ancient ecological stressor. This observation argues for caution when
drawing physiological conclusions from any single transcriptomic experiment. It also
suggests that the dissimilarity we observed between species in comparisons of public data
could equally be due to differences in experimental details as to divergence between
species.
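
The overlap between lists of significantly changing genes reported by different studies can be tallied with basic set operations. The sketch below is illustrative only: the gene identifiers are placeholders, and whether the reported total counts genes once or once per study is not assumed here.

```python
def overlap_summary(gene_lists):
    """Return (number of distinct genes, genes common to every list)."""
    sets = [set(genes) for genes in gene_lists]
    distinct = set().union(*sets)
    common = set.intersection(*sets)
    return len(distinct), common


# Placeholder gene names, not taken from any of the cited datasets.
study_lists = [
    ["g1", "g2", "g3"],
    ["g2", "g3", "g4"],
    ["g3", "g5"],
]
n_distinct, common = overlap_summary(study_lists)
```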
To limit the non-evolutionary sources of dissimilarity, we designed a standardized
experimental protocol and data analysis pipeline and produced a compendium of stress
response data in three species (see Methods). In S. oneidensis and D. vulgaris, we produced
time series of transcriptional responses to five common environmental stressors: heat and
cold, salt, acid, and alkaline. We also grew S. oneidensis both aerobically and anaerobically
for the salt stress condition to interrogate the possible effects of aerobic growth on stress
phenotype.
The transcriptional phenotype of these two species responding to environmental
perturbations showed little evidence of conservation: only a modest correlation in
temperature shock (r² < 0.11, Figure 3a). We first considered similarity of response of all 830
shared genes (BBH, see Supplementary Figure 1) but found temperature shock responses
were more correlated when we restricted our comparison to 509 tree-based orthologs, the
subset of orthologs whose gene trees support strict vertical descent (M. Price, Dehal, and
Arkin 2007; Paramvir S. Dehal et al.), highlighting the utility of tree-based methods for
identifying functional orthologs. We compared same-condition transcriptomic responses
(Figure 3a, bold boxes on diagonal) to ask if responses were similar, and different-condition
responses to ask if the similarity was condition specific or more general. Within each
condition, the most similar responses (annotated b-e) show very weak correlation
(0 < r² < 0.11). Other than the limited conservation in temperature shock, several
considerations indicate that global similarity is absent: in many comparisons no
significantly changing genes are shared, and within acid and salt shock, there are as many
anticorrelated as correlated responses. We also do not observe a consistent conserved
general stress response (GSR) in this transcriptomic data. This may be due to the fact that
GSR regulation diverges rapidly (Caimano et al. 2004; Gunesekere et al. 2006; Mittenhuber
2002), or that the ancient part of the response is largely post-translational (Mittenhuber
2002; Staron and Mascher 2010), but we did not find transcriptional conservation of the GSR
addressed in the literature.
This lack of conservation in response is somewhat surprising given that
coexpression showed a significant level of conservation: 4083 gene pairs were still
coexpressed in both genomes, whereas only 1210 are expected by chance. This highlights
an important distinction between the conservation of coexpression and the conservation of
phenotype: coexpressed genes may act as a unit in both species, but that unit may not
respond in a similar manner to the same condition.
To look for conservation of phenotype over shorter evolutionary intervals, we
compared transcriptional responses of D. vulgaris to those of Geobacter metallireducens, but
still found no evidence for conservation (Figure 4). We grew G. metallireducens under salt
and nitrate stress using protocols minimally modified from D. vulgaris, and compared
same-condition transcriptomic responses as well as different-condition responses (Figure 4 and
additional data in Supplement). We observe some weak correlation in these species'
responses to the same condition (Figure 4b-c; r² = 0.09 for nitrate and 0.03 for salt), but as
these were less similar than responses to different conditions (for example, salt versus
nitrate or nitrate versus pH) they do not support condition-specific conservation. While a
conserved general stress response might exhibit such a signature, upon further
investigation, the most similar nitrate and salt responses across species (Figure 4b-c) share
only a single significantly changing gene, the DNA repair protein recR. Thus, over these
evolutionary intervals, neither general stress responses nor condition-specific ones are
similar at the transcriptional level.
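
The logic of comparing same-condition against different-condition similarity can be sketched as below. This is an illustrative reading, assuming that a condition supports condition-specific conservation only when its same-condition r² exceeds every different-condition r² involving it; that criterion, the function names, and the toy inputs are assumptions, not the procedure used for Figure 4.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def cross_condition_r2(responses_a, responses_b):
    """r^2 for every (condition in A, condition in B) combination.

    Each argument maps a condition name to log fold-changes listed over
    the same ordered set of orthologs.
    """
    return {(ca, cb): r_squared(va, vb)
            for ca, va in responses_a.items()
            for cb, vb in responses_b.items()}


def supports_condition_specific(table):
    """For each shared condition, is same-condition r^2 the largest?"""
    shared = {ca for ca, cb in table if (ca, ca) in table}
    result = {}
    for cond in shared:
        others = [v for (ca, cb), v in table.items()
                  if ca != cb and cond in (ca, cb)]
        result[cond] = all(table[(cond, cond)] > v for v in others)
    return result
```

A condition whose best match across species is a different condition, as reported above for salt and nitrate, would be flagged False by this check.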
Despite divergent global responses, there may be functional modules with
conserved expression. Under any particular condition, some critical modules may respond
directly to environmental cues, while others respond to indirect signals propagated through
the cell's global regulatory network. We reasoned that the expression phenotypes of
modules responding directly to any particular condition might be more conserved
evolutionarily. To test this hypothesis, we focused on the heat shock response, as it showed
the best signal of conservation, and the genes important for fitness have been well
characterized by independent experiments. We examined heat shock response data culled
from the databases for a broader set of species (Chhabra et al. 2006; Frye et al.; Gao et al.
2004; Gutierrez-Rios et al. 2003; Helmann et al. 2001), and apart from the close sister
species E. coli and S. enterica, little or no correlation of global transcriptional response was
found in comparisons of more evolutionarily distant pairs of species (Figure 5). However,
heat shock module genes, known a priori to be important for fitness in this condition, have a
highly conserved response between E. coli and S. enterica, and the general response is
consistent across all of the species tested, despite the large evolutionary separations
between them. In fact, the magnitudes of the responses for each gene in the heat shock
module are similar, which is remarkable given that the experiments were done in different
culture media, aerobicity, temperatures, platforms, and using overall different lab protocols.
The differences in the global transcriptional profiles could be driven by the
sensitivity to experimental protocols, or by drift of relatively unimportant parts of the
response. However, in the case of the heat shock module's response, selection not only
conserved it over vast evolutionary distance, but also made it robust to the many other
environmental cues acting in addition to heat shock. This observed conservation and
consistency places bounds on the effects of drift and sensitivity on the expression of genes with
known fitness benefit.
3.5 Conclusions and future directions
We showed that an environmental perturbation elicits vastly different
transcriptional responses in orthologous bacterial genes. Why are orthologous genes not
expressed in a similar way in a similar condition? Two elements contribute to this
dissimilarity: evolutionary divergence between species, and protocol-specific side effects.
The extent to which each of these is due to natural selection versus neutral drift is an
important unanswered question.
The phenomenon of similar microarray-based experiments producing inconsistent
results has been noted previously. For example, different platforms and analysis methods
may inconsistently quantify absolute expression level or low abundance transcripts
(Draghici et al. 2006; Shippy et al. 2006). However, a large consortium's systematic
evaluation found excellent quantitative agreement in global transcriptional profiles
between a variety of platforms, sites, and normalization techniques, particularly for
fold-change values (Canales et al. 2006; Shippy et al. 2006; Shi et al. 2006), leading us to suppose
that most of the protocol-specific variation is present in the biomass, rather than introduced
by downstream measurement and processing. Indeed, more than a decade's worth of
microarray measurements in well-studied yeast systems provide the perfect comparison to
the variability of response we have highlighted in bacteria. In a number of yeast species, a
diverse set of stressors all elicit a stereotypical global transcriptional program, termed the
environmental stress response (ESR) (Gasch et al. 2000). In contrast to the "general stress
response" (GSR) mediated by alternative sigma factors in prokaryotes, this is a global and
highly stereotypical pattern, and is consistent across experiments done in different labs,
with different methodology, platforms, and strains (Gasch et al. 2000). The global
correlation of response is also largely shared between S. cerevisiae and S. pombe (Chen et al.
2003; Gasch 2007) despite 500 million years of evolutionary divergence (Berbee and Taylor
1993) and significant rewiring of even basic cellular transcription programs (Rustici et al.
2004). More recent analysis of the yeast ESR has found the effect largely due to cell-cycle
genes (Slavov et al. 2012), but the robust and conserved global response remains in striking
contrast to our observation of response variability in the same species or closely related
bacteria, and could indicate fundamental differences in the strategies used to tolerate stress
across the domains of life.
The extreme sensitivity in the bacterial transcriptional response to stress not only
calls into question the functional conclusions drawn from a single experiment's list of
significantly changing genes, but also the fraction of the global transcriptional response that
actually contributes to fitness, in line with previous studies that have found differential
expression poorly correlated with positive fitness effects measured by other means (Birrell
et al. 2002; Giaever et al. 2002; Klosinska et al. 2011; S. L. Tai et al. 2007).
Has selection tuned every part of the global transcriptional state, or is it more likely
that of the hundreds of genes that respond (significantly) to an environmental perturbation,
only a fraction of these contribute to fitness, and the remaining changes are side-effects of
the global network that are mostly neutral for fitness? For example, the E. coli CRP regulon
affects hundreds of downstream genes (Salgado et al. 2006; Zheng et al. 2004), but it seems
implausible that each of them is responding optimally and contributing to fitness in every
condition where CRP activity is modulated. It may be that for any one condition, the
expression of a handful is important while the expression of the rest is 'hitchhiking'
via their shared regulator. The extent to which bacterial transcriptional responses are
evolving under selection or neutral drift remains an open question, but even where neutral
evolution may not be the best model for expression (Bedford and Hartl 2009) that doesn't
preclude widespread neutrality in a dynamic, condition-specific response (M. Lynch 2007).
Questioning global optimality seems even more probable when one considers that a
monoculture laboratory dish is a poor duplicate of the natural ecology experienced by a
bacterium, and whatever the experimental protocol used, those conditions have likely never
been optimized by selection.
The most common approach to transcriptome analysis is to list the genes predicted
to change with the highest degree of statistical confidence, but here we showed that
variation in transcriptional responses within a species can lead to inconsistent lists, if not
misleading conclusions. Whether the mercurial nature of bacterial transcriptional
responses is due to protocol-specific responses or widespread neutrality is an important
fundamental question, but regardless, comparative transcriptomics provides a practical
strategy for identifying the functionally important parts of the response: by homing in on
those conserved by purifying or stabilizing selection, just as the conserved residues in a
multiple sequence alignment can highlight the residues critical for protein function. We
showed a module known to be important for fitness had far greater conservation;
conversely, conserved condition-specific modules could be learned (Lazzeroni and Owen
2002; Waltman et al. 2010) where they are not known a priori and thus focus our
physiological interpretation of a response with information complementary to differential
expression.
In our data compendium of three metal-reducing proteobacteria, we found that
despite significant conservation of an ancient regulatory network, orthologous proteins
very rarely show a conserved response. Taking these findings into account, future
comparisons of transcriptional phenotypes from multiple closely related bacterial species,
employing rather stringent protocol standardization or in situ techniques, will be a valuable
approach for understanding both the physiology of transcriptional responses in bacteria
and their evolution.
3.6 Materials and Methods.
3.6.1 Biomass production and stress application.
Our compendium of microarray data was conceived with the aim of comparing
transcriptional responses in three model metal-reducing species (S. oneidensis, D. vulgaris,
G. metallireducens). To that end, our experimental design included efforts to make biomass
production and quality control similar, and stress conditions comparable wherever
possible. In some cases experimental protocols could not be made identical due to biological
constraints (different media and/or gas composition of headspace were used to
accommodate the significantly different physiology of our organisms) or technological
constraints (e.g., microarrays were not available for all three species on the same platform).
Efforts to make the environmental stress comparable included determining the
stress level for each condition, in each organism, such that the cell growth was possible but
significantly inhibited. Various concentrations of stressors were tested to determine an
appropriate inhibitory concentration. In most cases we were able to find a stressor
concentration to reduce growth by 50% of the maximum compared to the control (noted as
the minimum inhibitory concentration or MIC). In some cases, transcriptomic responses of
a number of inhibitory concentrations were assayed. Biomass was collected at a number of
time points post treatment to capture response dynamics; these time points differed by
organism in accordance with their stress response and growth curves.
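
Choosing a stressor concentration that halves growth can be illustrated with a simple interpolation over a dose-response series. The sketch below assumes linear interpolation between tested concentrations and uses invented growth values; neither the interpolation method nor the numbers come from the experiments described here.

```python
def half_inhibitory_concentration(concentrations, growth, fraction=0.5):
    """Interpolate the concentration at which growth falls to
    `fraction` of the unstressed control.

    `concentrations` must be ascending and start at 0 (the control);
    `growth` holds the matching growth rates. Returns None if the
    target level is never crossed in the tested range.
    """
    target = fraction * growth[0]
    points = list(zip(concentrations, growth))
    for (c0, g0), (c1, g1) in zip(points, points[1:]):
        if g0 >= target >= g1 and g0 != g1:
            return c0 + (g0 - target) * (c1 - c0) / (g0 - g1)
    return None


# Invented example values (arbitrary concentration units).
tested = [0, 50, 100, 200]
relative_growth = [1.0, 0.9, 0.7, 0.3]
estimate = half_inhibitory_concentration(tested, relative_growth)
```

With these invented values the 50% level is crossed between the last two tested concentrations, and the estimate falls midway between them.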
Some experiments we generated in our three model organisms have been published
previously. In S. oneidensis, heat shock (Gao et al. 2004), cold shock (Gao et al. 2006), and pH
(Leaphart et al. 2006) were described previously. In D. vulgaris, heat shock (Chhabra et al.
2006), alkaline (Stolyar et al. 2007), some salt data (Mukhopadhyay et al. 2006) and some
nitrate data (He et al. 2010) were described previously. The detailed protocols for
unpublished experimental data are described below.
Shewanella oneidensis. For salt stress studies on aerobically grown Shewanella oneidensis
strain MR1, starter cultures were inoculated 1:100 from 2ml vials of -80C freezer stocks
into LS4D medium and grown in a 30C water bath until log phase (OD ~0.34). Then they
were inoculated into six 2000ml biomass production cultures. At an OD590 of approximately
0.4, 500mM NaCl stress was applied to three of these, and the other three were treated as
non-stressed control incubations.
To facilitate comparison of gene expression of Shewanella oneidensis to D. vulgaris,
we also assayed S. oneidensis stress response under strict anaerobic conditions. For salt
stress studies cultures were grown in LS4D medium (Mukhopadhyay et al. 2006) minimally
modified to allow growth of Shewanella oneidensis strain MR1 by substituting 50mM sulfate
with 50mM fumarate. Minimum inhibitory concentration under these conditions was
determined to be 300mM NaCl; during biomass production for transcriptomic studies
300mM NaCl was added as the stressor in triplicate incubations after cells reached mid-log
phase as described for D. vulgaris. Subsamples were harvested at 0, 60, 120 and 240 mins
after stressor addition.
Desulfovibrio vulgaris. For Desulfovibrio vulgaris Hildenborough biomass production,
starter cultures were grown in LS4D medium from frozen glycerol stocks. Inoculations were
typically a 1:10 dilution. For example, 1ml thawed stock was added to 10ml LS4D for
overnight cultures. When these starter cultures reached mid-log phase, they were
inoculated into six biomass production cultures using a 1:100 dilution. At an ODso of
approximately 0.6 per ml, stressor was applied to three of these, and the other three were
mock-treated to serve as non-stressed control incubations.
In the case of salt stress, the MIC was determined to be 250mM NaCl (in addition to
what is present in LS4D). For the control incubations, a volume of LS4D or deionized
degassed water equal to the solution of NaCl was added. Subsamples were collected from
the stressed and non-stressed treatments generally at 0, 30, 60, 120, and 240 min following
stressor addition. To arrest further cell activity at this point, samples were processed as
described previously (Mukhopadhyay et al. 2006). Briefly, the required volume of culture
was harvested through a cooling tube and centrifuged at 11,000rcf for 10 minutes to pellet cells.
The supernatant was discarded and pelleted cells were resuspended in 30 ml chilled PBS,
spun down with the same parameters as before, and the pellet was stored in -80C until
further analysis.
In the case of nitrate stress, the MIC was determined to be 6500ppm. LS4D media
was inoculated with frozen stocks at 1:100 dilution and grown at 30C in an anaerobic
chamber (10% CO2, 5% H2, remainder N2) overnight to an OD of 0.3/ml; then this culture
was used to inoculate samples (1:10) for biomass production. Six bottles were grown until
0.4/ml OD, at which point we took a T0 sample from each bottle, and added 40 ml of 4.43 M
NaNO3 to three of the six bottles; an equal volume of water was added to the other three for
control. Additional samples for 30, 60, 120, and 240 minute time points were obtained. All
samples were obtained by aspiration from bottles in an airtight manner. The four nitrate
time courses analyzed in this study included slight experimental variations from this
workflow. Time zero "T0" OD varied between 0.28-0.4, and final pH of control between
7.0-7.3. Experiment 23 excluded the additional PBS-washing step after centrifugation. These
additional experiments with slight variations were all included in this meta-analysis to
allow for every possible chance of observing similarity in transcriptomic phenotype.
In the case of pH stress, 1.8M H2SO4 was added to reach a culture pH of 6.2 or 5.5
(Exp 111 and 38 respectively), and nothing was added to the control cultures.
In the case of cold stress, biomass production proceeded as described above for D.
vulgaris. After T0 harvesting, the three treatment bottles were taken out of the anaerobic
chamber and placed in the water bath at 8C, pumped through a chilling device and quickly
chilled to 8C. Samples were taken at 60, 120, and 240 minutes.
D. vulgaris heat shock experiments (Chhabra et al. 2006) were performed like the
rest of the D. vulgaris experiments, but with the following key differences: lactate sulfate
medium (LS) with yeast extract (not defined) was used as culture media. 37C rather than
30C was used for overnight cell growth as well as control conditions for biomass
production. Samples were transferred on the benchtop to a pre-chilled 50ml falcon tube and
spun down at 10,000rcf before storage in -80C.
Geobacter metallireducens. To facilitate comparison of gene expression of Geobacter
metallireducens to D. vulgaris, LS4D medium was minimally modified to accommodate the
growth of the iron-reducing Geobacter strain GS-15. Notably, 10mM Iron-NTA was added as
the electron acceptor instead of 50mM sulfate, and 10mM acetate added instead of 60mM
lactate as the electron donor. Starter cultures were initiated in this medium from frozen
stocks, subcultured once before inoculation into biomass production cultures. Due to the
infeasibility of tracking cell growth in cultures containing iron by optical density, Iron(II)
concentration and cell counts were used as measures of growth. The MICs for G.
metallireducens were determined as for D. vulgaris. Control and stressed cultures were
prepared in triplicate under anaerobic conditions as described above for D. vulgaris and
subsamples collected and processed at 0, 60, 120 and 240 mins after addition of stressor at
the MIC (100mM NaCl in the case of salt stress and 10mM in the case of nitrate).
3.6.2 Microarray extraction and hybridization.
For S. oneidensis salt experiments, microarray extraction and hybridization were performed
as described previously (Zhou et al. 2010) to a custom oligo array (Gao et al. 2004). For D.
vulgaris, microarray extraction and hybridization were performed as described previously
(Stolyar et al. 2007) with the exception of nitrate, which was previously published (He et al.
2010).
For Geobacter metallireducens, extraction and labeling of total RNA and genomic DNA were
carried out as described for D. vulgaris. Microarray hybridization was performed as
described previously (Zhou et al. 2010).
Data processing. Log expression levels, including global normalization, were first
computed for each microarray. Log expression levels obtained from replicate arrays were
averaged. Each gene was represented by two spots on each microarray, and spots flagged
by the scanning software were excluded. The net signal of each spot was calculated by
subtracting the background signal and adding a pseudosignal of 100 to obtain a positive
value. For resulting net signals below 50, a value of 50 was used. For each spot, the level of
expression was the ratio of the two channels (ratio of mRNA to genomic DNA). For each
replicate, the levels were normalized so that the total expression levels for the spots that
were present on all replicates were identical. Finally, mean expression levels and standard
deviations of each spot were calculated, which required n > 1. To estimate differential gene
expression in control and treatment conditions, normalized log ratios were used. The log
ratio was log2(treatment) - log2(control). This log ratio was normalized using LOWESS on
the difference versus the sum of the log expression level (Dudoit and Fridlyand 2002).
Sector-based artifacts were observed; therefore, the log ratio was further normalized by
subtracting the median of all spots within each sector. Up to this point, data were processed
using spots instead of genes to allow sector-based normalization. Finally, the spots for each
gene were averaged to obtain a final normalized log ratio. To assess the significance of the
normalized log ratio, a Z score was calculated by using the following equation:
Z = log2(treatment/control) / sqrt(0.25 + variance),
where 0.25 is a pseudovariance term. Log ratios and Z values for all microarray data in this study are available at
http://www.microbesonline.org/cgi-bin/microarray/viewExp.cgi?expId.
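The per-spot arithmetic described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline that was actually used: the function names and the use of NumPy are hypothetical, the floor is applied to net signals below 50, and the LOWESS step is omitted.

```python
import numpy as np

PSEUDOSIGNAL = 100.0    # added after background subtraction (value from the text)
FLOOR = 50.0            # assumed: net signals below 50 are set to 50
PSEUDOVARIANCE = 0.25   # pseudovariance term in the Z score (value from the text)

def net_signal(raw, background):
    """Background-subtract, add the pseudosignal, then apply the floor."""
    net = np.asarray(raw, float) - np.asarray(background, float) + PSEUDOSIGNAL
    return np.maximum(net, FLOOR)

def log_ratio(treatment, control):
    """log2(treatment) - log2(control) for per-spot expression levels."""
    return np.log2(treatment) - np.log2(control)

def sector_normalize(log_ratios, sectors):
    """Subtract the median log ratio of each array sector from its spots."""
    log_ratios = np.asarray(log_ratios, float)
    sectors = np.asarray(sectors)
    out = log_ratios.copy()
    for s in np.unique(sectors):
        mask = sectors == s
        out[mask] -= np.median(log_ratios[mask])
    return out

def z_score(mean_log_ratio, variance):
    """Z = log2(treatment/control) / sqrt(0.25 + variance)."""
    return mean_log_ratio / np.sqrt(PSEUDOVARIANCE + variance)
```

The pseudovariance keeps Z bounded when replicates happen to agree closely, so a gene with near-zero measured variance cannot reach an arbitrarily large score.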
The log ratios and
Z values were used to generate two types of plots, (i) volcano plots and (ii) operon-based
estimates of local accuracy. For volcano plots, we plotted the log2 ratio of all genes versus Z.
This provided an estimate of the total numbers of significant changers in a microarray
comparison and allowed determination of which time showed the most changers. For
operon-based estimates of local accuracy, each point represented a group of 100 predicted
significant changers with similar Z scores (the point showed the least significant Z value in
the set). The estimated accuracy of each group of changers was derived by inspecting other
genes in the same operons as these changers. For random changers, the transcripts for 50%
of these genes should have been regulated in the same direction, and for perfect changers
100% of the genes should have been regulated in the same direction. Members of the
operons without a consistent signal across replicates (Z < 0.5) were excluded. Even with
perfect microarray data, the estimated accuracy was somewhat less than 100% due to
errors in operon predictions. For operon plots see http://www.microbesonline.org/cgi-bin/microarray/viewExp.cgi?expId=XX and select Plots.
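The operon-based accuracy estimate can be illustrated with a short Python sketch. The data structures (dictionaries keyed by gene name) and the function name are hypothetical, and the exclusion rule is read here as |Z| < 0.5, which is an assumption about the intended meaning of "Z < 0.5".

```python
def operon_accuracy(changers, operon_of, log_ratio, z, min_abs_z=0.5):
    """Fraction of operon partners regulated in the same direction as each changer.

    changers  : genes predicted to change significantly
    operon_of : dict mapping gene -> operon id (from operon predictions)
    log_ratio : dict mapping gene -> normalized log2 ratio
    z         : dict mapping gene -> Z score
    """
    same = total = 0
    for gene in changers:
        for partner, operon in operon_of.items():
            if partner == gene or operon != operon_of.get(gene):
                continue
            if abs(z[partner]) < min_abs_z:   # no consistent signal: excluded
                continue
            total += 1
            same += int((log_ratio[partner] > 0) == (log_ratio[gene] > 0))
    return same / total if total else float("nan")
```

A value near 0.5 corresponds to the random-changer expectation described above, and a value near 1.0 to the perfect-changer expectation.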
For public datasets, we averaged replicates and normalized data (median centered
and treatment versus control) where applicable, such that each gene was represented by a
log-fold-change in expression between treatment and control. The control was always taken
from the same data series as the treatment. For time-course data sets, we used the series'
zero time point as a control. For continuous culture experiments, the unstressed growth
condition with identical metadata was used as a control. Public datasets are cited in the
main text; exact experiment reference numbers, control conditions and processing are
enumerated in Supplemental Notes.
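As a rough illustration of the normalization of public datasets, the sketch below median-centers each array and subtracts a control column (for example, the zero time point of a series). The function name, the array layout (genes by conditions), and the order of the two steps are assumptions made for the example.

```python
import numpy as np

def log_fold_change(log_expr, control_col=0):
    """log_expr: genes x conditions array of replicate-averaged log2 levels.

    Returns log-fold-changes of every non-control column versus the control.
    """
    log_expr = np.asarray(log_expr, float)
    centered = log_expr - np.nanmedian(log_expr, axis=0)   # median-center each array
    treatment = np.delete(centered, control_col, axis=1)
    return treatment - centered[:, [control_col]]
```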
3.6.3 Computational methods
Significant changers.
As z-scores were not available for all of the public microarray data,
we used 2-fold up- or down-regulation as a threshold for significantly changing genes. If,
when investigating correlated responses in two or more species or from two or more
separate experiments, no conserved significant changing genes were discovered, we relaxed
this threshold to 1.5-fold (as it is less likely that the signal is spurious in both experiments).
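The thresholding rule can be sketched as follows. The function names and the dictionary layout (ortholog family id mapped to log2 fold change) are hypothetical, and the requirement that a conserved changer move in the same direction in every response is an assumption of this example.

```python
import math

def significant_changers(log2_fc, fold=2.0):
    """Genes whose absolute log2 fold change exceeds the fold threshold."""
    cutoff = math.log2(fold)
    return {g for g, v in log2_fc.items() if abs(v) > cutoff}

def conserved_changers(responses, folds=(2.0, 1.5)):
    """Shared changers across responses; relax 2-fold to 1.5-fold if none are found."""
    for fold in folds:
        shared = set.intersection(*(significant_changers(r, fold) for r in responses))
        # assumed: keep only genes regulated in the same direction everywhere
        shared = {g for g in shared if len({r[g] > 0 for r in responses}) == 1}
        if shared:
            return fold, shared
    return folds[-1], set()
```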
Gene function and COG functional categories.
All functional annotations were
downloaded from MicrobesOnline (Paramvir S. Dehal et al.). This includes COG functional
categories, many of which were not assigned in the COG database (which contains a limited
set of species/strains) but were annotated by the MicrobesOnline pipeline. To assign a COG
functional category to each orthologous gene family (tree-orthologs from the
MicrobesOnline database), we used all COG functional categories assigned to family
members (typically 1 but sometimes more).
Coexpressed gene pairs in a single species.
We define putative "coexpressed pairs" as
the gene pairs with the most correlated response vectors across many conditions in each
organism. To supplement our own transcriptomic data (36 experiments for S. oneidensis
and 79 for D. vulgaris), we gathered datasets from several public microarray databases for
two other organisms under diverse conditions: 40 for S. PCC and 212 for E. coli. Some
conditions were represented by multiple time points. Prior to calculating coexpression, we
removed experiments that were too similar (Pearson correlation over all genes,
ρ > 0.90), to avoid their dominating the signal for coexpression.
For each pair of genes in an organism, we computed the Pearson correlation of
those two genes' response vectors over all conditions, i.e., a measure of the tendency for the
two genes to be up- or down-regulated together in response to a variety of perturbations.
Each gene in a species was represented by a vector g where each entry g_n is the log-fold-change in gene expression for that gene in experiment n (using all transcriptomic
experiments for that organism). The Pearson correlation of these vectors, ρ(g_i1, g_j1), is the
coexpression of gene-pair (i, j) over all experiments n in species 1. (We only considered
gene pairs for which at least half the experiments had data.) For each organism, this yielded
a distribution of expression correlations ρ of 6 to 9 million gene pairs (G choose 2 where G
is the total number of genes in the genome). We took the top 10th percentile of gene pairs
from each organism's distribution as putatively coexpressed pairs. This low-specificity
cutoff for putative coexpressed pairs was a compromise to maintain reasonable sensitivity,
as it already excluded more than 25% of (predicted or experimentally validated) co-operonic genes in E. coli. Similarly, anticorrelated gene pairs were defined as those in the
bottom 10th percentile (and Pearson's ρ < 0) in each organism.
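A small Python sketch of the pairwise calculation follows. It assumes a genes-by-experiments NumPy array with NaN marking missing data; the function names are hypothetical, and the explicit double loop is for readability only (it would be slow for the millions of gene pairs in a real genome).

```python
import numpy as np

def pair_correlations(X, min_shared=0.5):
    """Pearson correlation of response vectors for every gene pair.

    X : genes x experiments array of log-fold-changes (NaN = no data).
    Pairs with data in fewer than `min_shared` of the experiments are skipped.
    """
    n_genes, n_exp = X.shape
    rho = {}
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            ok = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if ok.sum() < min_shared * n_exp:
                continue
            rho[(i, j)] = np.corrcoef(X[i, ok], X[j, ok])[0, 1]
    return rho

def top_pairs(rho, fraction=0.10):
    """Pairs at or above the (1 - fraction) quantile of the correlation distribution."""
    cutoff = np.quantile(list(rho.values()), 1 - fraction)
    return {pair for pair, r in rho.items() if r >= cutoff}
```

Anticorrelated pairs could be selected the same way from the lower tail, with the additional condition that the correlation is negative.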
Conservation of coexpressed pairs.
To quantify conservation of coexpression, we took
the list of putative coexpressed gene pairs in each species, and asked which orthologous
gene-pairs were also coexpressed. To calculate both the observed and expected fractions
conserved, we considered only the gene pairs with conserved orthologous gene pairs
present in the other species, so that the signal of regulatory rewiring would not be
convoluted with the signal of gene presence/absence.
For the four-species comparative analyses, we used orthologous gene families built
from tree-based orthologs (M. Price, Dehal, and Arkin 2007; Paramvir S. Dehal et al.), 295
gene families present in S. PCC, D. vulgaris, and at least one of our two gammaproteobacteria. Each gene-family-pair (i, j) has expression correlation ρ(g_i1, g_j1) in species 1
and expression correlation ρ(g_i2, g_j2) in species 2, and conserved coexpression if both
correlations ρ rank in the top 10th percentile in their respective species. To calculate the
expected number of coexpression links conserved, we started with the set of gene pairs
coexpressed (rank(ρ(g_i1, g_j1)) > 0.1) in either gammaproteobacterium, considering only
those with orthologs and data in the other two branches (18196 gene pairs), and multiplied
by 0.10 twice (once for each of the other two clades). Under a model of random assortment, 10%
of pairs would be expected to fall in the top 10th percentile of correlation for each phylum
added (D. vulgaris and S. PCC). A stricter definition of coexpression, the 1st percentile of
correlation, results in 407 conserved coexpressed pairs when only 0.5 are expected, a >800-fold enrichment. Thus it is likely that our estimate of 14-fold enrichment of conservation
over expectation is quite conservative.
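The expectation can be checked with a few lines of arithmetic. The sketch assumes that the 2629 conserved coexpression partners reported in the Figure 1 legend are the observed count behind the 14-fold estimate; the variable names are illustrative only.

```python
n_pairs = 18196        # pairs coexpressed in either gammaproteobacterium, with data elsewhere
p_top = 0.10           # chance of ranking in the top 10th percentile under random assortment

expected = n_pairs * p_top * p_top       # one factor per additional clade
observed = 2629                          # assumed: count from the Figure 1 legend
fold_enrichment = observed / expected    # about 14-fold

strict_fold = 407 / 0.5                  # 1st-percentile definition: observed / expected

print(round(expected, 1), round(fold_enrichment, 1), strict_fold)
```

Under these assumptions roughly 182 conserved pairs are expected by chance, which gives the approximately 14-fold enrichment quoted above, and the stricter definition gives a ratio above 800.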
To describe which cellular functions are linked by these ultra conserved
coexpression links (Figure 1a), we assigned all 295 conserved gene families one (or more)
functional categories, or to "unknown" (there are 18 functional categories defined for COG,
following Riley (Riley 1993)). The expected 2-D distribution of links between categories
was calculated as the network of random linkages between conserved genes, i.e., by
multiplying this marginal distribution for all categories. The observed distribution of links
was calculated as follows: for each coexpressed ortholog-pair we added one count linking
the pair of functional categories to which those ortholog groups belong (splitting counts for
membership in multiple categories). For the fold-enrichment for each pair of functional
categories, we divided observed counts in each category-pair by the number expected.
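The category-pair enrichment can be sketched as below. The function name and inputs are hypothetical; the sketch assumes every family has at least one category (with "unknown" counted as a category) and splits each link's count evenly across multiple memberships and across the two symmetric cells of the matrix.

```python
import numpy as np

def category_link_enrichment(family_cats, coexpressed, categories):
    """Observed / expected counts of coexpression links between functional categories.

    family_cats : dict mapping family id -> list of categories (at least one each)
    coexpressed : list of (family_a, family_b) conserved coexpressed pairs
    categories  : ordered list of category labels
    """
    idx = {c: k for k, c in enumerate(categories)}
    n = len(categories)
    marginal = np.zeros(n)
    for cats in family_cats.values():
        for c in cats:
            marginal[idx[c]] += 1 / len(cats)     # split multi-category families
    marginal /= marginal.sum()
    observed = np.zeros((n, n))
    for a, b in coexpressed:
        w = 1 / (len(family_cats[a]) * len(family_cats[b]))
        for ca in family_cats[a]:
            for cb in family_cats[b]:
                observed[idx[ca], idx[cb]] += w / 2   # keep the matrix symmetric
                observed[idx[cb], idx[ca]] += w / 2
    expected = np.outer(marginal, marginal) * len(coexpressed)
    with np.errstate(divide="ignore", invalid="ignore"):
        return observed / expected
```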
For the conservation of correlation matrices (Figure 1b) we considered two
genomes' conserved coexpressed gene-pairs. For the observed joint (2D) distribution, in
each uniform bin we counted how many orthologous gene-pairs share that level of
coexpression in the two organisms: each gene-family-pair (i, j) places a count into the 2D
bin defined by coordinates (ρ(g_i1, g_j1), ρ(g_i2, g_j2)) for all Pearson's ρ. The expected joint
distribution for each pair of genomes was calculated empirically under a null model of
random assortment (i.e., orthologs are no more likely to conserve their coexpression links
than by random chance): we multiplied the marginal (discrete) distributions of each of the
two species. All distributions were normalized to sum to 1. Each bin of the observed joint
distribution was divided by the counts in the expected joint distribution to yield the factor
of excess orthologous coexpressed pairs present at each level of correlation.
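A compact sketch of the observed-over-expected calculation is given below. The function name and bin count are illustrative, and the marginals are taken from the joint histogram of orthologous pairs, which is an assumption about how the marginal distributions were obtained.

```python
import numpy as np

def conservation_enrichment(rho1, rho2, n_bins=10):
    """Ratio of the observed joint distribution to the product of its marginals.

    rho1, rho2 : correlations of the same orthologous gene pairs in species 1 and 2.
    """
    edges = np.linspace(-1, 1, n_bins + 1)            # uniform bins over [-1, 1]
    observed, _, _ = np.histogram2d(rho1, rho2, bins=[edges, edges])
    observed /= observed.sum()                        # normalize to sum to 1
    marginal1 = observed.sum(axis=1)
    marginal2 = observed.sum(axis=0)
    expected = np.outer(marginal1, marginal2)         # random-assortment null
    with np.errstate(divide="ignore", invalid="ignore"):
        return observed / expected
```

Bins with a ratio near 1 are consistent with random assortment; ratios well above 1 in the high-correlation corner indicate conserved coexpression.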
Protein-protein interactions taken from two genome-wide data sets (Butland et al.
2005; Hu et al. 2009) were used to identify the subset of E. coli coexpressed pairs also
physically interacting. Conservation of coexpression in this subset was calculated as for the
full set.
Similarity of transcriptional response
The similarity in global transcriptional response
was computed as the Pearson correlation of the log-fold-change in expression of all
orthologous genes. Tree-based orthologs from MicrobesOnline were used, except where we
noted the bi-directional best hit (BBH) ortholog methodology (best hits were required over
75% of the length). Each transcriptional response (all genes in one time point of an
experimental treatment) was represented by a vector v, where each entry v_g is the log-fold-change in gene expression for gene g (and the vector includes all genes in that genome for
which data was available for that experiment). Similarity of response is the Pearson
correlation ρ between two vectors v1 and v2 (for all shared orthologous genes g in species 1
and 2), and is often discussed as ρ² or r² in the text, i.e., in the more intuitive terms of
fractional variance explained.
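For illustration, the similarity measure can be written as a short function. The names and the dictionary-based inputs are hypothetical; only ortholog pairs with data in both responses contribute.

```python
import numpy as np

def response_similarity(resp1, resp2, orthologs):
    """Pearson r and r^2 between two responses over shared orthologs.

    resp1, resp2 : dicts mapping gene id -> log-fold-change in each species
    orthologs    : list of (gene_in_species1, gene_in_species2) pairs
    """
    x, y = [], []
    for g1, g2 in orthologs:
        if g1 in resp1 and g2 in resp2:     # require data in both species
            x.append(resp1[g1])
            y.append(resp2[g2])
    r = np.corrcoef(x, y)[0, 1]
    return r, r ** 2
```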
Conservation of transcriptional phenotype: definition.
To decide whether correlation
of expression profiles represented significant conservation of condition-specific response,
we considered two things. If there was no other data for the two species being compared (as
was often the case for public datasets) we used a cutoff of r² = 0.01, above which a response
was considered possibly correlated. We considered the maximal correlation over all possible comparisons of
timepoints, so as not to assume that the dynamics of responses would be identical at the
same time points in different species. While r² < 0.01 would have a significant p-value in a
test where every ortholog's data point was assumed independent, that is certainly an
overestimate of the true number of independently expressed genes in a genome, so this is
likely a permissive threshold for significant correlation. Secondly, we asked if the response
was more correlated within the same treatment condition than between different treatment
conditions. Where responses to different conditions are more similar than responses to the same
condition, the correlation is more likely due to chance or to a non-specific stress response
than to similar condition-specific adaptations acting in both species. In all cases we
examined the list of conserved up- and down-regulated genes for corroboration and
physiological insights.
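The two criteria can be sketched as follows. The function names are hypothetical, the response vectors are assumed to be already restricted to shared orthologs in the same order, and requiring a positive correlation is an assumption of the example.

```python
import numpy as np

def best_timepoint_match(series1, series2):
    """Maximal Pearson r over all pairs of timepoints from two time courses."""
    r = np.array([[np.corrcoef(a, b)[0, 1] for b in series2] for a in series1])
    i, j = np.unravel_index(np.argmax(r), r.shape)
    return r[i, j], (int(i), int(j))

def passes_cutoff(r, min_r2=0.01):
    """Permissive threshold: positive correlation with r^2 of at least 0.01."""
    return r > 0 and r ** 2 >= min_r2

def more_similar_within(same_condition_r, different_condition_rs):
    """True if the same-condition correlation exceeds every cross-condition one."""
    return same_condition_r > max(different_condition_rs)
```

Taking the maximum over all timepoint pairs avoids assuming that the two species respond on the same timescale, at the cost of making the threshold more permissive.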
Estimation of divergence times.
Approximate divergence times were taken from a
recent paper (David and Alm 2010). Briefly, a chronogram of the tree of life was constructed
with the PhyloBayes software package; this chronogram was calibrated according to a set of
temporal constraints drawn from the paleontology and biogeochemistry literature.
3.7 Author Contributions
S. C. Timberlake performed all computational analysis (with the exception of processing of
raw microarray signal) and wrote the paper with E.J. Alm.
D. Joyner, R. Chakraborty, A. Mukhopadhyay and others performed growth assays,
phenotype microarrays and experiments and measurements.
M.P. Joachimiak and others designed the microarray probe analysis pipeline.
M.P. Joachimiak, J.K. Baumohl, P.S. Dehal and K.H. Huang maintain(ed) the MicrobesOnline
database, which defines orthologs and also imported expression data.
Special thanks to M. Price for the E. coli expression data compendium.
A.P. Arkin, T. Hazen, and E.J. Alm conceived of comparative stress experiments and provided
materials and reagents.
3.8 Figures and Legends
[Figure 1: image not recoverable. Legible labels: "correlation of gene pair in S. oneidensis" (axis, ticks from -1 to 1) and "# conserved genes".]
Figure 1. Conservation of coexpression across the bacterial domain
a) Cellular functions of ultra conserved genes and coexpressed gene pairs. The grey bars
profile the COG functional categories of 295 orthologous genes present (and having
expression data) in the three clades. The matrix of dots depicts the distribution of 2629
conserved coexpression partners amongst the functional categories: for each pair of
functional categories, dot size indicates the absolute number of coexpressed gene pairs,
while red-blue shading highlights function pairs with an unexpected share of coexpressed
members (compared to random links between the 295 genes present in all). b) Evolution of
the coexpression distribution between S. oneidensis and three increasingly distant species.
Each square (a bin of Pearson correlations) depicts the enrichment of orthologous gene-pairs whose level of coexpression has been conserved (compared to the number expected
under a model of random assortment (see methods)). The majority of the distribution
exhibits random assortment, but enrichment in the highly correlated gene pairs is a strong
signal of conservation between species. Some anticorrelated pairs' conservation is also
evident. c) Anticorrelated interactions conserved across divergent bacteria. Edges connect
genes with opposite gene expression profiles conserved in our three clades. Nodes are
labeled with a gene name where available; node colors encode the COG functional category
assigned to the gene. The main hub aat is a leucyltransferase that functions in the N-end
rule pathway for protein degradation. The other hub, ycaL, is uncharacterized but is
putatively a membrane protein involved in DNA internalization and competence for
transformation.
[Figure 2: image not recoverable. Legible labels: legend entries acid, alkaline, acute, anaerobic; rows for timed acute exposures and for "continuous" cultures at pH 5.0, 5.7, 8.5 and 8.7; color scale from -0.9 to 0.9.]
Figure 2. Transcriptomic response to pH is highly variable, even within the same
species.
We compared the global responses from five data sets (separated by thin lines) assaying
acute or chronic pH-stress phenotype (timepoints and conditions are indicated) or pH
growth phenotype ("continuous" culture) in E. coli. Each square depicts the Pearson
correlation r of the response of all genes measured. Between labs, correlation is slight or
sometimes negative (maximum r² = 0.12). The responses to acid and alkaline from within a
dataset are far better correlated than the response to acid across datasets, despite being
normalized to their own controls. (All experiments were normalized to the pH7 condition
from the same data set; see methods for details).
[Figure 3: image not recoverable. Panels: a, correlation matrix of S. oneidensis versus D. vulgaris responses (color scale: Pearson correlation, -0.9 to 0.9); b, Heat shock; c, Cold shock; d, Acid shock; e, Salt shock (scatter plots; x-axis "log2 ratio, S. oneidensis", ticks from -4 to 4).
Experiments reference information. S. oneidensis: I, exp8, 42C; II.A, exp1, 15C; II.B, exp1, 8C; III, exp7, pH4; IV, exp7, pH10; V.A, exp394, O2+; V.B, exp1715, O2-. D. vulgaris: VI, exp24, 50C; VII, exp25, 8C; VIII.A, exp38, pH5; VIII.B, exp11, pH6.2; IX, exp37, pH10; X.A, exp20, [250mM]; X.B, exp14, [250mM]; X.C, exp6, [500mM]; X.D, exp6, [250mM].]
Figure 3. Comparison of phenotype in our stress compendium of S. oneidensis and D.
vulgaris
3a) Correlations in transcriptional responses between these two species in five
environmental perturbations. Each colored square in the matrix displays the Pearson
correlation between the response of all orthologs in S. oneidensis and D. vulgaris, comparing
each sample point in the time courses; the five bold boxes highlight the same-condition
comparisons (all-versus-all timepoints) for heat, cold, acid, alkaline, and salt. Roman
numerals mark experimental conditions while letters delineate multiple protocols or stress
levels within a condition. Experiment numbers (expXX) reference MicrobesOnline
expression database ids. The most correlated response within each condition is marked and
expanded in Figure 3b-e.
3b-e) The most similar transcriptional responses within each condition. Within each
treatment condition (indicated in a), the most similar responses are expanded in the scatter
plots, illustrating the up- or down-regulation of all orthologous genes. Each point represents
the log-fold-change measured in S. oneidensis along the x-axis and D. vulgaris along the y-axis. Alkaline responses were never correlated, and so were omitted.
[Figure 4: image not recoverable. Panels: a, correlation matrix of D. vulgaris versus G. metallireducens responses (color scale: Pearson correlation, -0.9 to 0.9); b, Salt shock; c, Nitrate shock (scatter plots; x-axis "log2 ratio, D. vulgaris", ticks from -4 to 4).
Experiments reference information. D. vulgaris: I.A, exp6, [500mM]; I.B, exp6, [250mM]; II, exp14; III, exp20; IV.A, exp31; IV.B, exp31, w/sorbitol; V, exp23; VI, exp30; VII, exp42; VIII, exp177. G. metallireducens: IX, exp443; X, exp1272; XI, exp448.]
Figure 4. Comparison of phenotypes in our stress compendium of D. vulgaris and G.
metallireducens.
a) Correlations in transcriptional responses between these two species in salt and
nitrate stress. Each colored square in the matrix displays the Pearson correlation between
the response of all orthologs in D. vulgaris and G. metallireducens at each sample point of the
time courses; the two bold boxes highlight the same-condition comparisons (all-versus-all
timepoints). The most correlated response within each condition is marked and expanded
in Figure 4b-c. Roman numerals mark experimental conditions while letters delineate
multiple protocols or stress levels within a condition. Experiment numbers reference
MicrobesOnline expression database ids.
b-c) The most similar responses within each condition. Within each treatment
condition (indicated in a), the most similar responses are expanded in the scatter plots,
illustrating the up- or down-regulation of all orthologous genes. Each point represents the
log-fold-change measured in D. vulgaris along the x-axis and G. metallireducens along the y-axis.
Table 1. Genes of our heat shock module
Name
Orthologs
Description
dnaK
1
1
1
1
chaperone Hsp70; DNA biosynthesis; autoregulated heat shock proteins
yabH
CIP11
Ion
1
1
0
0
1
1
1
1
DnaJ-domain-containing proteins 1
'f
heat.sock: protein F215
eds dpeltcsbunkt
ATP-depnden
DNA-binding, ATP-dependent protease La; heat shock K-protein
ycaL
1
0
0
0
putative heat shock protein; Zn-dependent protease with chaperone
function
htrB
1
1
0
0
heat shock protein; lipid A biosynthesis lauroyl acyltransferase
hslJ
1
0
0
0
heat-inducible protein
ydhM
1 1
0
0
pred-icted moecuar chaipleen distardi relaed tok HSP704fold
htpX
1
1
1
1
heat shock protein, integral membrane protein
yegk
1
1
0
0
ddg
hscA
yfhE
1
I
1
0
1
1
0
0
0
0
0
0
putative heat shock protein; Lauroyl/myristoyl acyltransferase
illis7Ol
iginsfeINl
lilasmber
etshockiroteki, chsapieoe
DnaJ-domain-containing proteins 1
rp*E
1
1
0
1
Rtymell
cipB
1
1
1
1
protein disaggregation chaperone
mreB
1
1
1
1
ON
yrfl
I
1
1
1
0
0
1
regulator of ftsl, penicillin binding protein 3, Heat shock protein Hsp70
domain
tein&M40A$
dsh
hypothetical protein; Disulfide bond chaperones of the Hsp33 family
0
0
0
heat shock Patoln;idilKWif
0
heat shock protein; Molecular chaperone (small heat shock protein)
1
lb
1
1
0
~1
h~~slU
1
1
1
1
1
lbpA
hslV
eIn
e
1 -het
t hock
ilililllllf
esigmaat
eok pr-otein.hitimApil
-
heat sokandeoidativetrs
Omihileign@.iln@# %m~shockprteliv)
hnwlogwsrtodailents
se sulbunltti
1
heat shock protein hslVU, proteasome-related peptidase subunit
1
14
mopB
1
1
1
1
co-chaperonin GroES
lysi
lmiislphl!#Idil
;
ieshilldisatein
yJIL
0
0
1
0
Activator of 2-hydroxyglutaryl-CoA dehydratase (HSP70-class ATPase
domain)
Heat shock module genes were identified by text mining of E. coli gene annotations.
Orthologous genes defined by the BBH method were identified in four other organisms;
presence/absence of orthologous genes is indicated by 1/0 in Se, Salmonella enterica; So,
Shewanella oneidensis; Dv, Desulfovibrio vulgaris; and Bs, Bacillus subtilis. BBH orthologs
were used in this case because limiting the module to tree-orthologs eliminated nearly all
the genes under consideration at long phylogenetic distances; we judge this less strict
definition of orthology as more likely to yield a false negative than false positive result.
[Figure 5: image not recoverable. Legend: genes of heat shock module; all pairwise orthologs. Panels: S. enterica, B. subtilis, D. vulgaris, S. oneidensis. Axis: log2 fold-change in expression (ticks from -6 to 6).]
Figure 5. Conserved response of E. coli heat shock module across five species despite
global divergence of response.
Global response of all orthologs (gray points) is poorly conserved but the response of genes
in the heat shock module (red points) remains consistent over billions of years of
evolutionary time in the heat shock condition, despite different protocols and conditions for
the perturbation. Genes of the heat shock module are listed in Table 1.
[Supplemental Figure 1: image not recoverable. Correlation matrix of S. oneidensis versus D. vulgaris responses for heat, cold, acid, alkaline and salt shock (color scale: Pearson correlation, -0.9 to 0.9).
Experiments reference information. S. oneidensis: I, exp8, 42C; II.A, exp1, 15C; II.B, exp1, 8C; III, exp7, pH4; IV, exp7, pH10; V.A, exp394, O2+; V.B, exp1715, O2-. D. vulgaris: VI, exp24, 50C; VII, exp25, 8C; VIII.A, exp38, pH5; VIII.B, exp11, pH6.2; IX, exp37, pH10; X.A, exp20, [250mM]; X.B, exp14, [250mM]; X.C, exp6, [500mM]; X.D, exp6, [250mM].]
Supplemental Figure 1. Comparison of phenotypes in our stress compendium of S.
oneidensis and D. vulgaris using all putative orthologs.
We compare transcriptional responses of these two species to five environmental
perturbations, using all 832 putative orthologs identified by the bi-directional best hit
method. Each colored square in the matrix displays the Pearson correlation between the
response of all orthologs in S. oneidensis and D. vulgaris at each sample point of the time
courses; the five bold boxes highlight the same-condition comparisons (all-versus-all
timepoints) for heat, cold, acid, alkaline, and salt. Roman numerals mark experimental
conditions while letters delineate multiple protocols or stress levels within a condition.
Experiment numbers (expXX) reference MicrobesOnline expression database ids.
3.8.1 Supplemental Note: transcriptomic experiments obtained from other laboratories and their comparison
We compared the E. coli transcriptional phenotypes in response to pH stress, as assayed by
a number of laboratories. Published microarray data were imported into MicrobesOnline to
facilitate comparison (MicrobesOnline experiment ids: 40, 1005, 1064, 1142, 1231-3).
Where necessary, M3D metadata was used to guide data normalization (treatment versus
control). For time series, the zero time point was used as a control where possible;
otherwise, we normalized to the control growth condition in the same data set, having
identical metadata (except for pH ≈ 7), including the same experimenter name. For
experiment ids 1005, 1064, 1142, 1231-3, the control experiments used for normalization
were 995, 1065, 1143 and 1230, respectively.
To look for the similarity of response in E. coli and S. enterica in heat shock, previously
published microarray data from a number of labs (Allen et al. 2003; Gutierrez-Rios et al.
2003) were imported into MicrobesOnline (experiment ids: 12, 1004, 82, 83). This included
GEO accession GSE807 and GSE808, which are not associated with any known publication.
Additional heat shock data were obtained from supplementary data associated with a
recent publication (Jozefczuk et al. 2010), and normalized as described in their
supplementary notes.
For the heat shock comparison among five species (Figure 5), published data
(Chhabra et al. 2006; Frye et al.; Gao et al. 2004; Gutierrez-Rios et al. 2003; Helmann et al.
2001) were imported into MicrobesOnline as experiment ids 2, 8, 12, 24, 82, and 83.
Supplemental Table 1. Significantly-changing genes common to multiple species'
condition-specific responses.
A list of possibly conserved elements of condition-specific transcriptional responses:
orthologous genes common to multiple species' responses to the same condition in our
compendium of S. oneidensis, D. vulgaris, and G. metallireducens (So, Dv, Gm). For each
condition, the best-correlated comparison was considered (these were highlighted in Figure
3b-e and Figure 4b-c). Alkaline (pH+) had no positively correlated responses across species;
instead the comparison with the highest number of shared significantly-changing genes was
used (60 minutes in So versus 30 minutes in Dv). Significantly changing genes were defined
as those up- (t) or downregulated (+) more than 2-fold; no genes met these criteria in the
acid condition (pH-), so genes with 1.5-fold change (4o) in both species are reported. Gene
IDs are those of D. vulgaris genes in the MicrobesOnline database and are listed multiple
times where they were part of conserved responses to multiple conditions.
Dv.vs.So
Dv.vs.Gm
gene ID
(Dv)
°C+ °C- pH- pH+ salt salt NO3
209537 +
206238
4
+
+
gene name
description
LysE
Lysine efflux permease
dnaK
Molecular chaperone
Molecular chaperone GrpE (heat shock
206239 $
grpE
protein)
Preprotein translocase subunit SecA
206252 $
secA
(ATPase, RNA helicase)
206718 +
ftsH
ATP-dependent Zn proteases
ATP-dependent Lon protease, bacterial
206779 +
Ion
type
ATP-dependent protease HsIVU (CIpYQ),
206912 +
hslU
ATPase subunit
ATP-dependent protease HsIVU (CpYQ),
207028 $
hslV
peptidase subunit
DNA-directed RNA polymerase, sigma
207253 +
rpoD
subunit (sigma70/sigma32)
207254 $
dnaG
DNA primase (bacterial type)
DnaJ-class molecular chaperone with C-
208766 +
dnaJ
terminal Zn finger domain
Phosphoribosylcarboxyaminoimidazole
209423 +
purE
(NCAIR) mutase
ABC-type tungstate transport system,
206171
+
permease component
206203
4
atpG
FOF1-type ATP synthase, gamma subunit
206204
+
atpA
FOF1-type ATP synthase, alpha subunit
FOF-type ATP synthase, delta subunit
(mitochondrial oligomycin sensitivity
+
atpH
protein)
206212 4
rodA
Bacterial cell division membrane protein
206205
Phosphoribosylaminoimidazolesuccinocar
206222
4
purC
boxamide (SAICAR) synthase
206262
4
rplS
Ribosomal protein L19
206266
rpsP
Ribosomal protein S16
206298
+
+
frr
Ribosome recycling factor
206358
4
rplU
Ribosomal protein L21
206643
+
fabF
3-oxoacyl-(acyl-carrier-protein) synthase
206644
4
acpP
Acyl carrier protein
Dehydrogenases with different
specificities (related to short-chain
206645 4
fabG
alcohol dehydrogenases)
3-oxoacyl-[acyl-carrier-protein] synthase
206646 4
206650
+
4
fabH
III
rpmB
Ribosomal protein L28
(acyl-carrier-protein) S-
206688
4
fabD
malonyltransferase
206748
4
rplB
Ribosomal protein L2
206764
4
rplO
Ribosomal protein L15
206772 4
rplQ
Ribosomal protein L17
Predicted membrane GTPase involved in
207715
4
4
typA
stress response
Holliday junction resolvasome,
207743
4
ruvC
endonuclease subunit
207942
4
metK
S-adenosylmethionine synthetase
rplT
Ribosomal protein L20
bioB
Biotin synthase and related enzymes
208032 4
208055
4
208436
+
rplJ
Ribosomal protein L10
208437
4
rpIL
Ribosomal protein L7/L12
208452
+
purB
Adenylosuccinate lyase
207021
+
rho
Transcription termination factor
209448
+
nusA
Transcription elongation factor
Uncharacterized protein conserved in
bacteria; Ribosome maturation factor
209449
+
RimP (IP)
206238 $4+
dnaK
Molecular chaperone
207836
$
dut
dUTPase
209441
+
rpsO
Ribosomal protein S15P/S13E
206739
4
rpsL
Ribosomal protein S12
rpIR
Ribosomal protein L18
+4
206761
$
206769
+
rpsK
Ribosomal protein S11
208768
+
greA
Transcription elongation factor
mutY
A/G-specific DNA glycosylase
hisF
Imidazoleglycerol-phosphate synthase
209216
+
209220
209256
+
eno
Enolase
209332
+
hup-1
Bacterial nucleoid DNA-binding protein
209403
+
trpD
Anthranilate phosphoribosyltransferase
206202
$
atpD
FOF1-type ATP synthase, beta subunit
$
Preprotein translocase subunit SecA
206252 $
$
206301
+
206635
secA
(ATPase, RNA helicase)
tsf
Translation elongation factor Ts
+
leuS
Leucyl-tRNA synthetase
206637
4
ribH
Riboflavin synthase beta-chain
206752
4
rplP
Ribosomal protein L16/L1OE
206753
4
rpmC
Ribosomal protein L29
+
rplR
Ribosomal protein L18
206764
+
4
rplO
Ribosomal protein L15
208034
+
infC
Translation initiation factor 3 (IF-3)
206761
4
+
+
hypothetical protein; Cell division protein
208169
ZapA-like (TIGR)
DNA-directed RNA polymerase, beta'
208439
rpoC
$
subunit/160 kD subunit
DNA-directed RNA polymerase, alpha
206771
$
206800
+
207060
$
rpoA
subunit/40 kD subunit
Predicted phosphatases
dapB
Dihydrodipicolinate reductase
Uncharacterized conserved protein;
207526
membrane protein, putative (TIGR)
+
3-hydroxymyristoyl/3-hydroxydecanoyl207856
fabZ
+
(acyl carrier protein) dehydratases
Dehydrogenases with different
specificities (related to short-chain
206645
+
+
fabG
alcohol dehydrogenases)
206650
+
+
rpmB
Ribosomal protein L28
+
rpsT
Ribosomal protein S20
207364
conserved hypothetical protein;
206683
+
206980
4
coaD
Phosphopantetheine adenylyltransferase
207285
4
yajC
Preprotein translocase subunit YajC
Mammalian cell entry-related (TIGR)
Protein-L-isoaspartate
207315
4
pcm
carboxylmethyltransferase
Peptidyl-prolyl cis-trans isomerase
207339
4
208903
+
ppiB-2
(rotamase) - cyclophilin family
+
ftsE
Predicted ATPase involved in cell division
209074
+
trpS
Tryptophanyl-tRNA synthetase
206202
+
atpD
FOF1-type ATP synthase, beta subunit
206301
+
tsf
Translation elongation factor Ts
206530
+
argG
Argininosuccinate synthase
206531
+
argF
Ornithine carbamoyltransferase
206748 4
+
rplB
Ribosomal protein L2
206749
+
rpsS
Ribosomal protein S19
206750
+
rplV
Ribosomal protein L22
206751
+
rpsC
Ribosomal protein S3
+
rpmC
Ribosomal protein L29
+
rpsQ
Ribosomal protein S17
+
rplR
Ribosomal protein L18
206762
$
rpsE
Ribosomal protein 55
206763
$
rpmD
Ribosomal protein L30/L7E
206753
+
206754
206761
4+
Inorganic
207088
ppaC
+
pyrophosphatase/exopolyphosphatase
Uncharacterized protein conserved in
bacteria; DNA-binding protein, YbaB/EbfC
208720
family (TIGR)
+
tRNA delta(2)-isopentenylpyrophosphate
4
206981
+
208903
+
+
miaA
transferase
ftsE
Predicted ATPase involved in cell division
Flagellar biosynthesis pathway,
+
208972
fliP
component FliP
Flagellar biosynthesis/type III secretory
209245
206238
pathway protein
+
dnaK
+
Molecular chaperone
208317
+
208611
+
rbr
Rubrerythrin
207805
+
rbr2
Rubrerythrin
cytochrome c3
Chemotaxis protein; stimulates
208485
+
methylation of MCP proteins
207043
+
Methyl-accepting chemotaxis protein
209249
4
flgC
Flagellar basal body rod protein
Nitrate/TMAO reductases, membrane-
209570
4
206917
4
nrfH
bound tetraheme cytochrome c subunit
Predicted ATP-dependent protease
6Fe-6S prismane cluster-containing
208040
4
b0873
protein
208775
membrane protein, HPP family
+
F0F1-type ATP synthase, epsilon subunit
+
206201
+
atpC
(mitochondrial delta subunit)
tRNA delta(2)-isopentenylpyrophosphate
206981
4
miaA
transferase
207355
4
gmhA
Phosphoheptose isomerase
207398
4
lspA
Lipoprotein signal peptidase
Predicted membrane GTPase involved in
207715
4
4
typA
stress response
CMP-2-keto-3-deoxyoctulosonic acid
208632
4
kdsB
synthetase
Recombinational DNA repair protein
208721
4
4
recR
(RecF pathway)
Chapter 4. Conclusions and future directions
4.1 Short reads for metagenomics
In Chapter 1 I introduced SHE-RA, an algorithm that significantly extends the range of
applications for ultra high-throughput sequencing, a crucial technological advance for the
interrogation of complex environmental samples. SHE-RA aligns overlapping paired-end
reads, correcting errors and offering ~48-160x deeper sampling (per dollar) compared to
454, the technology commonly used for such studies. I demonstrated that SHE-RA made
Illumina a viable alternative to 454 for community analysis (Rodrigue et al. 2010). The
Chisholm group used it to aid in the assembly of single-cell bacterial genome fragments that
are the result of multiple displacement amplification (Malmstrom et al. 2012), a protocol
that is growing in popularity for interrogating the large fraction of the microbial world that
remains unculturable (De Jager and Siezen 2011). The Moran group used it to analyze the
day-night and seasonal dynamics of a metatranscriptome of a complex marsh community
(Gifford et al., in prep). The code was made open source on a website (Timberlake 2012) and
has at least 60 users. SHE-RA also includes automatic identification and trimming of
very-short-insert overlapping pairs, where adapter and barcode sequence can drive improper
de novo read clustering or create false-positive hits in protein space.
While this work was motivated by the desire to bring longer Illumina reads' error
rates from 10^-1 to 10^-2 or below, the general principle of error rate reduction by multiple
read passes on the same DNA molecule could be used elsewhere to push error below
10^-3 or 10^-4 for new applications. Because some high-fidelity PCR protocols boast error
rates around 10^-7 (Vallania et al. 2012), the sequencing platform could once again
represent the dominant source of error. Further decreases in error rates would improve
assessments of microbial diversity (Beerenwinkel and Zagordi 2011; Zagordi et al. 2011),
and make for more sensitive detection of rare variants such as are seen in viral populations or
heterogeneous samples, such as tumors (Macalalad et al. 2012; Choi et al. 2009; Aparicio
and Huntsman 2010). While the SHE-RA algorithm was tested on Illumina reads, it was
designed to be largely platform-independent, and will, in principle, work well with any
sequencing technology where individual reads' measurement error is largely independent.
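The benefit of a second pass over the same molecule can be illustrated with a small calculation. The sketch below is not the SHE-RA scoring model; it assumes two independent reads of one base, each wrong with a known probability, and assumes that a miscalled base is equally likely to be any of the three incorrect nucleotides. The function names are hypothetical.

```python
def agreeing_error_rate(p1: float, p2: float) -> float:
    """Probability that two independent reads agree on the same wrong base.

    Assumption: each read errs independently, and a miscall is equally
    likely to be any of the three incorrect nucleotides.
    """
    return p1 * p2 / 3.0


def posterior_error_given_agreement(p1: float, p2: float) -> float:
    """Error probability of the consensus call, given that both reads agree."""
    both_right = (1.0 - p1) * (1.0 - p2)
    both_wrong_same = agreeing_error_rate(p1, p2)
    return both_wrong_same / (both_right + both_wrong_same)


if __name__ == "__main__":
    # Show how the consensus error scales with the per-read error rate.
    for p in (1e-1, 1e-2, 1e-3):
        consensus = posterior_error_given_agreement(p, p)
        print(f"per-read error {p:.0e} -> consensus error {consensus:.2e}")
```

Under these assumptions the consensus error falls roughly with the product of the two per-read error rates, which is why independence of measurement error between reads matters for the approach.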
More generally, the primary challenges in applying UHTS to community datasets
remain computational: both de novo clustering and homology identification, initial steps in
taxonomic or functional profiling, are currently computationally prohibitive in terms of
working memory (RAM) and execution time, respectively (Preheim and Timberlake,
unpublished observations). In particular, there is a gap between the fast but insensitive
read mapping software written for UHTS and the more traditional and sensitive methods of
homology detection, such as blastx (Altschul et al. 1997). While fast read mappers can align
millions of sequences to reference genomes (Li and Durbin 2009; Langmead and Salzberg
2012), their limited sensitivity means that only reads very similar to a known reference
sequence can be identified, a tiny fraction in most communities. This gap has been partly
bridged by speeding up sequence comparisons using software (Edgar 2010; Koskinen and
Holm 2012) or hardware (Suzuki et al. 2012), but sequencing production outpaces these
improvements, such that the bioinformatic processing remains the limiting step (Ghodsi
2012; Nekrutenko and Taylor 2012). Beyond speeding up basic pairwise homology
searches, the most useful addition to the community genomics toolbox at this point would
be scalable versions of the search algorithms for matching probabilistic representations of
protein families and domains (position specific scoring matrices and hidden Markov
models) (Altschul et al. 1997; Eddy 2001). These represent the greatest area for progress in
computational analysis of natural microbial communities.
4.2 Assembly and analysis of 82 environmental bacteria
In Chapter 2, I described my work to assemble the genomic reference for an important
model system for marine bacterial evolution and ecology. In the first part of the
chapter, I described an efficient strategy to assemble 82 Vibrionaceae genomes
from short reads. In the second part, I used these genomes to survey the relationship
between ecology, phenotype and genotype in this family.
The hybrid assembly approach I developed minimizes the need for labor-intensive
and expensive long-insert libraries, but avoids the pitfalls of reference-only approaches,
which require very closely related finished genomes and miss all of the diversity in gene
content that exists between even the closest conspecifics. My analysis shows that one closed
or near-closed genome per species is sufficient to significantly improve related strains'
assemblies via this hybrid reference and de novo assembly approach, and even genomes
assembled with short-insert paired-end reads were useful references. Even very
fragmented reference assemblies (N50 ~2 kb) or references from more distant genomes
could be effectively utilized by this strategy for improving assembly contiguity.
The upper limit for this approach, in terms of distance to reference, is still set by the length
and quality of short read sequence and will likely benefit from the next generation of more
permissive read-mappers (e.g. FR-HIT (Niu et al. 2011)). My hybrid reference and de novo
assembly approach was able to effect a ~3-fold improvement in N50 even for the most
recent generation of Illumina reads. It was shown to be effective with short single-end reads,
which may be relevant again as a new generation of sequencing platforms comes online
(Glenn 2011). A similar 'assisted assembly' approach was reported (Gnerre et al. 2009), but
it was designed for mammalian genome assemblies and required paired-end reads. The
ABBA algorithm had a similar goal, but could only be used to close gaps spanned by known
protein sequences (Salzberg et al. 2008), and required users' a priori knowledge of all
connections between contigs in the assembly, closer reference strains, extensive expert
knowledge, and manual curation. My hybrid approach requires no expert genomic
knowledge or manual curation beyond the selection of a suitable reference genome with
sufficient average nucleotide identity, which can be easily approximated from blast output
(Altschul et al. 1997). It was designed as a modular pipeline using two of the most
commonly used algorithms for short-read analysis, MAQ and Velvet (Li, Ruan, and Durbin
2008; Zerbino et al. 2009), and therefore is accessible to a wide audience as well as easily
generalizable to other tools.
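As a minimal sketch of how average nucleotide identity might be approximated from blast output when choosing a reference, the function below takes tabular BLAST results (the default 12-column `-outfmt 6` layout), keeps the best-scoring hit per query, and returns the length-weighted mean percent identity. The function name, the minimum alignment length filter and its default value are illustrative assumptions, not part of the pipeline described above.

```python
from typing import Dict, Iterable, Tuple


def approximate_ani(blast_lines: Iterable[str], min_length: int = 100) -> float:
    """Length-weighted mean percent identity over the best hit per query.

    Expects BLAST tabular output (-outfmt 6) with the default column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. The min_length filter is an illustrative choice.
    """
    # query id -> (bitscore, percent identity, alignment length)
    best: Dict[str, Tuple[float, float, int]] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query = fields[0]
        pident = float(fields[2])
        length = int(fields[3])
        bitscore = float(fields[11])
        if length < min_length:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, pident, length)

    total_length = sum(length for _, _, length in best.values())
    if total_length == 0:
        raise ValueError("no alignments passed the length filter")
    weighted = sum(pident * length for _, pident, length in best.values())
    return weighted / total_length
```

In use, contigs or fixed-size fragments of the draft assembly would be the queries and the candidate reference the subject; the resulting value is only a rough guide for reference selection.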
These 82 assembled genomes have served as the genomic reference for an important
model system for marine bacteria and natural populations, and have already contributed to
our understanding of bacterial evolution and ecology through a number of collaborations or
independent work, mainly in the Alm and Polz labs. These genomes were used to search for
the genetic bases of phosphorothioation, a newly identified type of restriction-modification
system that is ecologically associated (Wang et al. 2011). They were recently used in a
study detailing the initial events in sympatrically differentiating populations (Shapiro et al.
2012), which demonstrated how homologous recombination can be a force promoting
divergence of populations. Because of the naturally defined populations in this model
system, they have been invaluable for illuminating the social relationships within and
between bacterial populations. For example, antibiotic production and iron acquisition
were each shown to be contexts where these populations behaved as social entities,
exhibiting cooperative and cheating behaviors (Cordero, Wildschutte, et al. 2012; Cordero,
Ventouras, et al. 2012).
In the rest of Chapter 2, I presented a survey of flexible gene turnover, gene sharing,
metabolism, and ecology in these 82 Vibrionaceae. In addition to their congruence
demonstrated by a number of other means, the ecological populations delineated in this
model system were also predominantly congruent in terms of metabolism, core genome
phylogenies, and in particular, gene content, which was able to delineate populations along
ecological lines very early in the differentiation process.
While the ecologically defined populations behaved as separate entities in a number
of ways, the inferred habitats (based on particle size and season) may not represent
consistent ecology across independent lineages adapted to that habitat. This measure of
ecology did not have predictive power for sharing of gene content over the long term, or of
metabolic similarity, perhaps because both genotype and phenotype are driven largely by
phylogenetic history. In fact, metabolic phenotype could only statistically distinguish
ecological populations after they were sufficiently separated by phylogenetic distance.
Despite a diversity of ecologies and microhabitat preferences in this group, these
genomes shared genes with a frequency much higher than a recent survey of HGT reported
for same-ecology bacteria or even for bacteria isolated from various sites on the human
body, a known nexus of transfer. To what extent this is due to shared geography and
isolation method, or perhaps to the vibrios' natural tendency for DNA uptake, is an open
question, but the relative breadth of the labels used to describe "ecology" in these studies
almost certainly has an effect.
These vibrios not only have a high rate of recent transfer with each other, but an
astonishingly high rate of introduction of new DNA overall. Our superfine phylogenetic
sampling included many pairs of genomes that are distinguished by only 100-300 SNPs
genome-wide, but more than 100-300kb of new DNA. I showed that new, potentially
beneficial, genes are rapidly introduced, providing abundant raw material for selection
within the very fine-scale phylogenetic distances where we have previously reported
ecological differentiation (Hunt et al. 2008). While the majority of this very fast flexible DNA
never reaches fixation, a large amount does fix in a population. Flexible gene fixation is
apparently non-neutral, occurs on timescales compatible with ecological differentiation, and
delineates populations in a way congruent with their ecology before core genes or
metabolic phenotype can do so. My analysis found that 10-30% of new flexible genes are
fixing in a habitat-specific and thus non-neutral manner; this should be confirmed by
further work. Additional characteristics of recently fixed species-specific flexible genes,
such as dN/dS or nonrandom functional distribution, might bolster our ecological evidence of
non-neutral fixation. By OTU-like divergence, fixation of flexible DNA has resulted in
large-scale (~25%) turnover in the core of any one population. This highlights the large
numerical discrepancy with core and flexible genomes as defined by common arbitrary
phylogenetic cutoffs, and the resulting mislabeling of gene types. In particular, a large
fraction of 'flexible' genes apparently at intermediate penetrance in V. cyclitrophicus were
actually fixed in a population, and are arguably now core. These observations further
motivate careful attention to the definitions of core and flexible genes in future study.
The lack of proper species definitions is only part of the trouble with gene frequency
distribution curves. These curves are a common focus of flexible gene
study and served as the test bed for a recently developed neutral model of flexible gene
fixation (Luo et al. 2011; Haegeman and Weitz 2012). However, if not all flexible genes are
selectively neutral, a more fundamental issue is that these curves combine new genes being
fixed, fixed genes being lost, and genes whose equilibrium penetrance is intermediate.
These three processes are almost certainly separate, both mechanistically and in terms of the
selective forces driving them, and should be treated separately instead of combined
into a single penetrance curve. Future study in this model system may be able to identify the
distinct selective pressures driving these three processes.
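For concreteness, a gene frequency distribution curve of the kind discussed above can be computed from a gene presence/absence table as in the sketch below. The data layout (a mapping from genome name to its set of gene family identifiers) and the function name are assumptions made for illustration. Note that the resulting single curve carries no information about which of the three processes placed a gene at a given penetrance, which is the limitation raised in the text.

```python
from collections import Counter
from typing import Dict, Set


def penetrance_spectrum(genomes: Dict[str, Set[str]]) -> Dict[int, int]:
    """Map k -> number of gene families present in exactly k genomes.

    `genomes` maps a genome name to the set of gene family identifiers it
    carries (a hypothetical layout chosen for this example).
    """
    # Count, for each gene family, how many genomes contain it.
    genomes_per_gene = Counter(
        gene for gene_set in genomes.values() for gene in gene_set
    )
    # Tally how many families occur at each penetrance level.
    return dict(Counter(genomes_per_gene.values()))


if __name__ == "__main__":
    toy_population = {
        "strain_a": {"g1", "g2"},
        "strain_b": {"g1", "g3"},
        "strain_c": {"g1", "g2"},
    }
    print(penetrance_spectrum(toy_population))
```

Families found in every genome of a population would be counted at k equal to the number of genomes, so the same tally can be restricted to one ecologically defined population to separate population-fixed genes from those at intermediate penetrance.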
Of particular interest are the genes with intermediate equilibrium penetrance in
populations, as may be driven by frequency-dependent selection. Genes of low or
intermediate penetrance provide diversity in the population pan-genome, but little is
understood about this process. Some have hypothesized that it is useful for long-term buffering
against environmental variation; other studies point to phage evasion. A survey of the
functional content and selective forces on these gene sequences might demonstrate their
utility: a preliminary analysis of vibrios' very rare proteins (an approximation of the
intermediate penetrance gene set) revealed that most evolve under weak purifying
selection (median dN/dS<0.25 at most distance bins).
One established force selecting for intermediate penetrance is cheating, where
certain individuals in a population bear the cost to make a resource publicly available (for
example, siderophore production results in extracellular bioavailable iron), but all members
are able to benefit (by taking up bound siderophores), and some do so without bearing the
cost (Cordero, Ventouras, et al. 2012). Substrate utilization profiles have been measured for
a large number of strains in this model system, providing an excellent dataset to investigate
similar patterns in substrate utilization. In particular, some complex carbohydrates and
other long polymers must be degraded extracellularly before import, and during that time
they can diffuse away from the original enzyme producer. In fact, the organic particles in the
marine environment that are a major source of organic matter for copiotrophs are known to
be surrounded by large diffuse plumes of partially degraded dissolved organic matter,
which has possibly escaped the degrader and intended consumer. Biolog profiles might reveal that
substrate utilization abilities for less labile organic matter (which would be prone to
cheating behavior) share a tendency of high within-population variation. While the nature
of this cheating can be regulatory (Wilder, Diggle, and Schuster 2011) and
situation-dependent, it could also be forced by genotype, and perhaps associated with certain
ecologies.
One intriguing possible role for intermediate penetrance genes is the creation of
allelic diversity through periods of hypermutation in a few individuals, perhaps in times of
stress. We have repeatedly observed stress-induced hypermutation in vibrios in the
laboratory mediated by prophage excision (S. Clarke and S. Timberlake, unpublished data)
but whether the allelic diversity so generated is passed on laterally in the wild is a
fascinating open question.
Genes fixed at intermediate penetrance are one of many things motivating a
re-examination of the operational definitions of core and flexible genes. While these
non-ubiquitous genes would be labeled flexible by any common definition, some genes' loss
from a population could trigger collapse (e.g., loss of siderophore production capabilities).
More generally, the study of flexible genomes has likely been confounded by the wide
variety of arbitrary phylogenetic cutoffs used for species group delineation (Grote et al.
2012). The vibrio model system's natural definitions of populations and hence core and
flexible genomes are ideal to address key predictions of the Core Genome Hypothesis and its
proposed species definition, which has been largely ignored in the prolific core/flexible
genome literature. The core and flexible genome dichotomy was originally proposed to
reconcile the seeming contradiction of bacterial gene content variability with species-like
clusters: despite rampant HGT in the flexible genome driving intraspecific diversity,
intraspecific homologous recombination in the core genome could still provide a
homogenizing influence on a stable, species-defining core genome (Lan and Reeves
2000; Lan and Reeves 2001). Homogenizing intra-population versus diversifying
interpopulation recombination in core genes (including population-specific core and family
core) versus flexible genes could be quantified to test this prediction. Enriched intra-cluster
recombination frequency has been found previously (Lefébure et al. 2010; Bennett et al.
2010; Muzzi and Donati 2011), and within-population exchange dominated recent
recombination in the core genome of V. cyclitrophicus (Shapiro et al. 2012), though the
generality of this finding needs to be demonstrated. In fact, one study found recombination
frequency was not enriched among strains sharing some environmental labels (Muzzi and
Donati 2011). The time scale on which a new member of the species-specific core becomes
core-like, with diversifying inter-cluster recombination suppressed, seems surprisingly fast,
but the mechanisms are unknown. Suppression of inter-population exchange at these
distances may not be easy to explain by the falloff in homologous recombination frequency
with distance. Could selection act quickly enough, or are other mechanisms at work?
Cluster-specific restriction modification systems were identified as a mechanism to actively
exclude inter-cluster DNA (Budroni et al. 2011). However, this mechanism may not be wholly
responsible or generally applicable (restriction modification systems appear largely
variable within vibrio populations; Wang et al. 2011 and S. Timberlake, unpublished
results). The processes of fixation, recombination and integration of flexible genes into core
genomes and populations have been largely ignored in core and flexible genome analyses
but are likely fundamental to understanding adaptation and speciation in bacteria.
Lastly, this model system is unique in its combination of hierarchically sampled whole
genomes and quantitative ecological data. Here I draw attention to a few unexplored
strengths of this dataset and the questions they could address in future study. I include
some preliminary results.
1) Genetic drift and bacterial gene deletion
The association between genome size and species phylogeny branch length in this
family is obvious to the naked eye. However, the possible relationship between genome
size and drift and its mechanisms are still a matter of active debate (Kuo, Moran, and
Ochman 2009; Whitney and Garland 2010; Whitney et al. 2011; Michael Lynch 2011).
This dataset is excellent for estimating natural populations' Neμ, Ka/Ks, and inferring
gene gain and loss with better temporal precision than most clades to understand the
timescales for these processes and the strength of possible selection on gene deletions.
Additionally, these selective forces might be largely distinct from mechanisms driving
the rapid, large-scale loss of flexible genes.
2) Quantifying habitat specialization and its evolutionary dynamics
It is commonly held that specialists undergo population size and genome reduction;
however, the label 'specialist' is poorly defined and therefore hard to investigate. On
what time scales does habitat invariability qualify for specialization, and on what time
scales might it impact core and flexible genome size? I propose that one quantitative
measure of population specialization is the diversity represented in a population's
ecological metadata. For example, a preliminary investigation revealed a modest
association between a population's entropy in size fraction and core genome size (r=0.6),
but no association with genomic fluidity (a recently introduced measure of population
gene content diversity (Haegeman and Weitz 2012)). Population pan-genomes have
been proposed as a mechanism for buffering environmental variability, but perhaps the
environmental variation measured by habitat entropy occurs with such frequency that
sensing should be hardcoded in an individual (Kussell and Leibler 2005), rather than
just buffered by population diversity. Indeed, studies of the vibrios have previously
found evidence of high migration rates, for example, between the water column and
animals (Preheim et al. 2011). Quantitation of the level of specialization was not
previously possible, and revealing the relevant timescales for environmental and
genotypic variability would inform our understanding of genome and pan genome
dynamics.
3) How do census population sizes relate to evolutionary dynamics?
Estimates of census population sizes and of effective population sizes typically differ by
many orders of magnitude. Preliminary analysis revealed that relative census
population sizes (i.e., populations' relative abundances recovered by the isolation
process) are in general well correlated with mean evolutionary rate (S. Timberlake,
unpublished results). Determining where census population sizes are predictive of Ne and
other measures of drift, and where they differ, could improve our understanding of this
discrepancy and the evolutionary dynamics associated with such processes.
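The habitat entropy proposed in item 2 above as a quantitative measure of specialization can be sketched as a Shannon entropy over a population's ecological labels. The example below is a minimal illustration, assuming each isolate carries one categorical size-fraction label; the label values and the function name are hypothetical, and the choice of base-2 logarithm is arbitrary.

```python
import math
from collections import Counter
from typing import Iterable


def habitat_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy (in bits) of a population's habitat labels.

    A population whose isolates all come from one size fraction scores 0;
    a population spread evenly over n fractions scores log2(n).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one label is required")
    entropy = 0.0
    for count in counts.values():
        fraction = count / total
        entropy -= fraction * math.log2(fraction)
    return entropy


if __name__ == "__main__":
    # Hypothetical size-fraction labels for two populations.
    specialist = ["large_particle"] * 8
    generalist = ["large_particle", "small_particle", "free_living", "zooplankton"] * 2
    print(habitat_entropy(specialist), habitat_entropy(generalist))
```

Per-population entropies computed this way could then be compared against core genome size or genomic fluidity, as in the preliminary analysis mentioned above.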
4.3 Conservation of coexpression without conservation of
response in bacterial gene regulatory networks
In Chapter 3, I quantified coexpressed gene pairs conserved across bacteria, and contrasted
that with a striking lack of conservation of transcriptomic response in our model species
and more closely related species drawn from the literature. Why are orthologous genes not
expressed in a similar way under similar conditions? Two elements can contribute to this
dissimilarity: evolutionary divergence between species, and protocol-specific side-effects.
The relative contribution of each element, and the extent to which the evolutionary divergence
is due to natural selection versus neutral drift, remain important unanswered questions.
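One simple way to quantify conservation of response between two species is to correlate fold-change values across their shared orthologs. The sketch below assumes each response is given as a mapping from ortholog group identifier to log2 fold change and uses a plain Pearson correlation; the data layout and function name are illustrative assumptions, not the exact procedure of Chapter 3.

```python
import math
from typing import Dict


def response_correlation(fc_a: Dict[str, float], fc_b: Dict[str, float]) -> float:
    """Pearson correlation of log2 fold changes over shared ortholog groups."""
    shared = sorted(set(fc_a) & set(fc_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared ortholog groups")
    xs = [fc_a[group] for group in shared]
    ys = [fc_b[group] for group in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("fold changes are constant in one species")
    return covariance / math.sqrt(var_x * var_y)
```

A value near 1 would indicate a conserved global response and a value near 0 an essentially uncorrelated one; genes without an ortholog in the other species are ignored here.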
The phenomenon of similar microarray experiments producing inconsistent results
has been documented in the literature, but those effects are unlikely to be the driving force
behind these observations. Different platforms and analysis methods may inconsistently
quantify absolute expression level or low abundance transcripts, leading to different results
(Draghici et al., 2006; Shippy et al., 2006). However, a large consortium's systematic
evaluation found excellent quantitative agreement in global profiles between a variety of
platforms, sites, and normalization techniques, particularly for the fold-change values used
throughout this comparative analysis (Canales et al., 2006; Shi et al., 2006; Shippy et al.,
2006), consistent with the idea that most of the variation occurs in the biomass.
Indeed, abundant microarray measurements in well-studied yeast systems provide
the perfect comparison to the variability of response we have highlighted in bacteria. In a
number of yeast species, a diverse array of stressors all elicit a stereotypical global
transcriptional program, termed the environmental stress response (ESR; Gasch et al.,
2000). The ESR is a global and highly stereotypical pattern, and is consistent across
experiments done in different labs, with different methodology, platforms, and strains. The
global correlation in response is also largely shared between S. cerevisiae and S. pombe
(Chen et al., 2003; Gasch, 2007) despite 500 million years of evolutionary divergence
(Berbee and Taylor, 1993), and significant rewiring of even basic cellular transcription
programs (Rustici et al. 2004). More recent analyses have concluded that the yeast ESR is
largely explained by cell-cycle genes (Nikolai 2011), but it remains in striking contrast to
our observations of response variability in the same species or closely related bacteria. The
lack of such a global response in bacteria is intriguing, as it could indicate fundamental
differences in the strategies used to tolerate stress across the domains of life.
While some cross-condition response similarities were noted in this survey, also
absent was evidence of the "general stress response" (GSR), mediated by alternative sigma
factors in bacteria. This may be because regulatory control of the GSR turns over rapidly
(Caimano et al. 2004; Gunesekere et al. 2006; Mittenhuber 2002) or because the ancient
features of the GSR network are largely post-translational (Mittenhuber 2002; Staron and
Mascher 2010). Also, while descriptive studies of the GSR abound in the literature,
quantitative assessments of its transcriptional conservation were absent and may be a
useful area of further study, perhaps to contrast with the well-studied yeast ESR.
Is it plausible that bacteria and yeast are using fundamentally different strategies to
adapt phenotype transcriptionally to changing conditions, and if so, what drives this
distinction? Perhaps another striking difference in their evolutionary modes: bacterial gene
content changes principally by HGT (David and Alm 2010; Pal, Papp, and Lercher 2005;
Treangen and Rocha 2011) while eukaryotic genomes typically grow by duplication
followed by sub- or neo-functionalization, and formation of regulatory connections
following these two processes is likely distinct (Morgan N. Price, Arkin, and Alm 2006; Pal,
Papp, and Lercher 2005). Moreover, the timescale for bacterial gene content turnover is
comparable to the timescale for expression divergence: nearly half of the bacterial
proteome turns over within many bacterial OTUs (clusters of <3% 16S divergence;
reviewed in Grote et al. 2012), the time by which the E. coli and S. enterica transcriptional
responses to heat are nearly uncorrelated (Paramvir S. Dehal et al.). And while transcripts
from this flexible genome are often not observed to be expressed in laboratory
conditions (Taoka et al. 2004), in environmental samples, many flexible genes are more
highly expressed than core genes (Stewart et al. 2011). Thus the fast acquisition of new
genes could substantially mold the expression profile and its evolution in bacteria in a way
distinct from eukaryotes'.
Microarrays and RNA-seq provide the most comprehensive and unbiased
examination of transcriptional responses, and the standard approach to their analysis is to list
the genes whose differential expression has the highest degree of statistical confidence (Pan
2002; Kerr and Churchill 2001), but this comparative survey showed that variation in
transcriptional responses can lead to inconsistent lists, if not misleading conclusions. The
extreme sensitivity in the bacterial transcriptional response to stress not only calls into
question the functional conclusions drawn from a single experiment's list of significantly
changing genes, but also the fraction of the global transcriptional response that actually
contributes to fitness.
Previous reports highlighting a lack of correspondence between gene upregulation
and condition-specific fitness defects have also cast doubt on the common assumption that
differential expression is indicative of a positive fitness effect (S. L. Tai et al. 2007; Birrell et
al. 2002; Giaever et al. 2002; Klosinska et al. 2011). However, the cross-protective nature
observed for many stress responses (Leyer 1993) could account for some of this
discrepancy: differential expression may not be beneficial for the conditions at hand but
anticipatory for the likely future perturbations (Mitchell et al. 2009). Still, the overall
contribution of such effects remains unclear.
Despite compelling evidence that certain features of bacterial transcriptional
responses have been highly optimized by selection, among them, response timing (Zaslaver
et al. 2004), regulon spatial organization (Janga 2012), and even anticipatory responses
(Mitchell et al. 2009), that doesn't preclude widespread neutrality in a single, dynamic,
condition-specific response (Lynch, 2007). Has selection tuned every part of a dynamic
global transcriptional state, or is it more likely that of the hundreds of genes that respond
(significantly) to an environmental perturbation, only a fraction contribute to
fitness, and the remaining changes are side-effects of the global network that are mostly
neutral for fitness? For example, the E. coli CRP regulon controls hundreds of downstream
genes (Zheng et al. 2004; Salgado et al. 2006), but it seems implausible that each of them is
responding optimally and contributing to fitness in every condition where CRP activity is
modulated. It may be that for any one condition, the expression of a handful is important
while the expression of the rest is 'hitchhiking' via their shared regulator. Questioning
global optimality seems even more reasonable when one considers that a monoculture
laboratory dish is a poor representation of the natural ecology experienced by a bacterium,
and whatever the experimental protocol used, those conditions have likely never been
optimized by selection. Culture-free measurement techniques may thus play an important
role in addressing these questions.
Despite a lack of global conservation, I showed a module known to be important for
fitness had far greater conservation; conversely, conserved condition-specific modules
could be learned (Lazzeroni and Owen 2002) where they are not known a priori. In this
way, comparative transcriptomics should help to identify functionally important parts of
the response by homing in on those conserved by selection, just as the conserved residues
in a multiple sequence alignment can highlight the residues crucial for protein function
which purifying selection has maintained as invariant. This will not only provide an
important new dimension to inform physiological interpretation of a response, but will also
help start to constrain the extent to which gene expression evolves by selection versus drift.
Even a basic evolutionary context for bacterial gene expression and the
transcriptional response is sorely lacking. A recent review on the promise of next
generation sequencing for the study of bacterial transcription (Guell et al. 2011) did not
even mention conservation, selection, or drift, ideas that we know to be central to the
understanding of all biological processes (Dobzhansky 1973). A quantitative phylogenetic
model of expression evolution would allow us to test whether divergently expressed genes
differ from the expectation of the null neutral model, as well as to estimate the strength of selection on
transcriptional profiles and thus their fitness effects (Bedford and Hartl 2009). At present,
the extent to which transcriptional regulation evolves largely by selection versus drift is
unclear. Recent studies have weighed in on appropriate models for the evolution of
steady-state transcript abundance, arguing for largely neutral evolution in primate livers but
not brains (Khaitovich et al. 2004; Khaitovich et al. 2006), or for stabilizing selection in flies
(Rifkin, Kim, and White 2003; Bedford and Hartl 2009), but no model has been proposed or
tested for bacteria or for transcriptional responses.
Phylogenetic modeling of the evolution of quantitative traits has long been a pillar of
ecological study (Garland, Harvey, and Ives 1992; Ricklefs and Starck 1996; Felsenstein and
others 2004; Whitehead and Crawford 2006) and its application to transcriptional data is
not without precedent (Fay and Wittkopp 2007). The justification for modeling the
expression level (or change in level) of a gene as an O-U (Ornstein-Uhlenbeck) process is that many small effects
can influence a gene's expression. An unbiased walk is representative of neutral drift, and it
has been argued that this fits gene expression divergence well for some datasets (Khaitovich et
al. 2004). In addition to modeling neutral drift, the mathematical formulation of this process
can model a number of selective forces: the walk can be biased (directional selection),
mean-reverting (stabilizing selection), and include branch-specific rates or even punctuated
equilibria (Gu 2004). These models, however, have only very rarely been applied to data
(Oakley et al. 2005; Bedford and Hartl 2009).
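To make the distinction between these regimes concrete, the following is a minimal sketch (not the model fitted in any of the studies cited above) that simulates the expression value of one gene along independent lineages in discrete time. The function name `simulate_ends`, the step size `sigma`, and the parameters `drift` and `alpha` are illustrative assumptions: `drift` biases each step (directional selection), `alpha` pulls the value back toward an optimum (stabilizing selection, a discretized O-U process), and setting both to zero gives an unbiased walk (neutral drift).

```python
import random
import statistics


def simulate_ends(steps, n_lineages=1000, sigma=0.1, drift=0.0,
                  alpha=0.0, optimum=0.0, seed=0):
    """Return the final expression value of each of `n_lineages` independent
    lineages after `steps` discrete time steps, all starting at 0.0.

    All parameter names and default values are hypothetical choices made
    for illustration only.
    """
    rng = random.Random(seed)
    ends = []
    for _ in range(n_lineages):
        x = 0.0
        for _ in range(steps):
            # Deterministic pull (bias and/or mean reversion) plus a small
            # Gaussian perturbation standing in for many small effects.
            x += drift + alpha * (optimum - x) + rng.gauss(0.0, sigma)
        ends.append(x)
    return ends


if __name__ == "__main__":
    regimes = [("neutral", {}),
               ("directional", {"drift": 0.05}),
               ("stabilizing", {"alpha": 0.1})]
    for label, kwargs in regimes:
        for steps in (100, 400):
            ends = simulate_ends(steps, **kwargs)
            # Print the mean and variance of divergence among lineages.
            print(label, steps,
                  round(statistics.mean(ends), 3),
                  round(statistics.pvariance(ends), 3))
```

Under these assumptions, the variance among lineages grows roughly linearly with time for the unbiased walk but plateaus for the mean-reverting walk; this bounded divergence is the signature that separates stabilizing selection from neutral divergence without bound.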
A major hurdle to the application of these phylogenetic random-walk models is the
taxon sampling required for parameter estimation in even the simplest models. From simulations
of gene fold-change values evolving under a random walk model and sampled by
realistically noisy measurements, I estimated that at least 20 transcriptomes would be
required to reject the null hypothesis for even strong directional selection (2 times the
standard deviation by drift) on a gene-by-gene basis (test AUC=0.8; S Timberlake,
unpublished data). Still, genome-wide trends are possible to establish with less extensive
datasets: a recent phylogenetic analysis of Drosophila tissue gene expression found that
stabilizing selection was a better model than neutral divergence without bound to explain
the average of genome-wide expression divergence (Bedford and Hartl 2009). (The same
study also noted that with 7 taxa, the test to reject neutrality in favor of stabilizing selection
on a gene-by-gene basis was still vastly underpowered.) That technological advances will
make a sufficient level of transcriptome sampling available in the very near future leads me
to outline these key analyses:
1) Does a model of neutral or stabilizing selection better explain bacterial expression
evolution within populations, and between populations?
2) Can estimates of gene expression drift help reconcile the discrepancy between gene
upregulation and deletion fitness effect?
3) Are most genes, and in particular new genes, under stabilizing or directional
selection in most relevant conditions?
4) Can the varying strengths of selection in different conditions inform our
understanding of the recent and ongoing evolutionary pressures of the population's
niche?
The application of quantitative phylogenetic methods to the study of transcriptional
regulation will be critical for answering fundamental questions about the evolution of
transcriptional regulation, and thus for informed functional interpretations of
transcriptomic data.
Chapter 5.
Glossary
NOTE: some of these were taken or adapted from previous publications, including (Shapiro et
al. 2009).
de novo assembly: the process of finding overlaps between reads and building them into
contigs de novo, i.e., without any prior knowledge of the source sequence
reference assembly: the process of aligning ("mapping") reads from one library onto a
reference sequence of a closely related strain that was obtained through other means,
followed by inference of the reads' source sequence by consensus building in the aligned
regions. The term reference assembly is a bit misleading.
N50: an assembly contiguity metric: the length (in base pairs) of the shortest contig in the
smallest set of contigs (taken from longest to shortest) that together contain 50% of the total
source sequence. N80 (with an analogous meaning) is also sometimes reported.
paired-end reads: many sequencing technologies can read sequence starting from each
end of a DNA molecule. The resulting pair of reads is known to occur in a specific
orientation in the source sequence, separated by some distance (usually approximated by
measurement of the distribution of DNA molecule lengths sequenced; this distance is called
the insert length).
ecology: the niche, a hyperspace whose dimensions are defined by environmental variables
(including biotic and abiotic conditions) in which a species is able to persist and maintain
stable population sizes (Hutchinson 1957; Hardesty 1975). Measurements of these
environmental axes may be used to approximate an organism's ecology.
habitat: In this thesis, I typically refer to the inferred habitats output by L. David's
algorithm, AdaptML (Hunt et al. 2008). Habitats should be an approximation, though
perhaps a very limited one, of a genome's ecology.
species: Definitions of species, or even their existence, in Bacteria are highly contentious. I
use species in the following ways: 1) the term "named species" refers to taxonomical
nomenclature for which there is a historical precedent in microbiology, and wide use in the
literature, but which might not correspond to any modern post-NGS or theoretical concept
of bacterial species, and 2) a purely theoretical notion of species, which will share a niche
and sexual gene exchange, harbor largely neutral diversity, and in every other way conform
to all microbiologists' criteria for species.
phylogenetic cluster: a set of genomes related much more closely to each other than to
their nearest relatives (as measured by housekeeping gene or other genetic distance); these are
sometimes treated as bacterial species.
population: Populations are species-like entities, a phylogenetic cluster of genomes that is
ecologically cohesive. The strictest definition requires their discovery by one run of an
experimental sampling protocol, but populations found congruent by previous study
(Preheim, Timberlake, and Polz 2011) were sometimes combined in this work when there
was good phylogenetic evidence to do so. Populations are used as the primary taxonomic
unit in Chapter 3.
sympatric speciation: The process through which new species arise in the absence of
physical barriers to gene flow between them.
niche partitioning: The process whereby different organisms co-exist in a community
rather than competing for resources. Niches can be partitioned when lineages avoid
competition by using different resources, or restricting their activity to different physical
spaces, seasons, times of day, etc.
convergence: Similar evolutionary adaptations occurring independently in two lineages.
Used here to refer to gene acquisition and elements of phenotype including gene expression
and substrate metabolism. Convergence is often used as evidence of (positive) selection.
penetrance: the proportion of a population with a particular genotype; in this work, the
genotype is more specifically the possession of a particular flexible gene.
(neutral) drift: The process by which genetic changes with negligible effects on fitness
become stochastically fixed in a population of finite size.
negative/purifying/stabilizing selection: The evolutionary force selecting against
deleterious genomic changes and promoting conservation of the ancestral state.
positive/diversifying selection: The evolutionary force causing novel alleles, genes, or
phenotypes conferring a fitness advantage to rise in frequency in a population.
homologous recombination: The exchange of identical or similar (homologous) stretches
of DNA.
illegitimate recombination (horizontal transfer): The lateral acquisition of DNA that
does not replace orthologous DNA.
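As a concrete illustration of the N50 and N80 entries in this glossary, the following is a minimal sketch of how such a contiguity metric can be computed from a list of contig lengths. The function name `nx_length` and the example contig lengths are hypothetical and are not taken from the assembly pipeline used in this thesis. Because the true source sequence length is usually unknown, the sketch falls back on the summed contig length unless `total_length` is supplied.

```python
def nx_length(contig_lengths, fraction=0.5, total_length=None):
    """Return the Nx contiguity metric for a list of contig lengths (in bp).

    Contigs are taken from longest to shortest; the result is the length of
    the contig at which the running total first reaches `fraction` of
    `total_length`. If `total_length` is None, the summed contig length is
    used as a stand-in for the source sequence length (an assumption of
    this sketch).
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if total_length is None:
        total_length = sum(contig_lengths)
    target = fraction * total_length
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    # The contigs cover less than the requested fraction of the source.
    return None


if __name__ == "__main__":
    contigs = [100, 80, 50, 40, 30]   # hypothetical contig lengths, 300 bp total
    print(nx_length(contigs))         # N50 -> 80
    print(nx_length(contigs, 0.8))    # N80 -> 40
```

Passing an independent estimate of the source sequence length as `total_length` makes the metric comparable between assemblies of the same genome, at the cost of returning no value when the contigs cover too little of the source.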
Chapter 6.
References
Achaz, G., E. Coissac, P. Netter, and E. P. C. Rocha. 2003. "Associations Between Inverted
Repeats and the Structural Evolution of Bacterial Genomes." Genetics 164 (4): 1279-
1289.
Aird, D., M. G. Ross, W. S. Chen, M. Danielsson, T. Fennell, C. Russ, D. B. Jaffe, C. Nusbaum, and
A. Gnirke. 2011. "Analyzing and Minimizing PCR Amplification Bias in Illumina
Sequencing Libraries." Genome Biol 12 (2): R18.
Akashi, H., and T. Gojobori. 2002. "Metabolic Efficiency and Amino Acid Composition in the
Proteomes of Escherichia Coli and Bacillus Subtilis." Proceedings of the National
Academy of Sciences of the United States of America 99 (6): 3695.
Allen, T. E., M. J. Herrgard, M. Liu, Y. Qiu, J. D. Glasner, F. R. Blattner, and B. O. Palsson. 2003.
"Genome-scale Analysis of the Uses of the Escherichia Coli Genome: Model-driven
Analysis of Heterogeneous Data Sets." Journal of Bacteriology 185 (21): 6392.
Alm, E. J., K. H. Huang, M. N. Price, R. P. Koche, K. Keller, I. L. Dubchak, and A. P. Arkin. 2005.
"The MicrobesOnline Web Site for Comparative Genomics." Genome Research 15 (7):
1015-1022.
Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman.
1997. "Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search
Programs." Nucleic Acids Research 25 (17): 3389-3402.
Angiuoli, Samuel, Julie Dunning Hotopp, Steven Salzberg, and Hervé Tettelin. 2011.
"Improving Pan-genome Annotation Using Whole Genome Multiple Alignment." BMC
Bioinformatics 12 (1) (June 30): 272. doi:10.1186/1471-2105-12-272.
Aparicio, S. A. J. R., and D. G. Huntsman. 2010. "Does Massively Parallel DNA Resequencing
Signify the End of Histopathology as We Know It?" The Journal of Pathology 220 (2):
307-315.
Aziz, Ramy K, Daniela Bartels, Aaron A Best, Matthew DeJongh, Terrence Disz, Robert A
Edwards, Kevin Formsma, et al. 2008. "The RAST Server: Rapid Annotations Using
Subsystems Technology." BMC Genomics 9: 75. doi:10.1186/1471-2164-9-75.
Babu, M. M., N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann. 2004. "Structure
and Evolution of Transcriptional Regulatory Networks." Current Opinion in
Structural Biology 14 (3): 283-291.
Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin,
Alexander S. Kulikov, Valery M. Lesin, et al. 2012. "SPAdes: A New Genome Assembly
Algorithm and Its Applications to Single-Cell Sequencing." Journal of Computational
Biology 19 (5) (May): 455-477. doi:10.1089/cmb.2012.0021.
Barrett, T., D. B. Troup, S. E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista, I. F. Kim, A.
Soboleva, M. Tomashevsky, and R. Edgar. 2007. "NCBI GEO: mining tens of millions
of expression profiles-database and tools update." Nucleic acids research 35
(Database issue) (January): D760-5.
Bedford, T., and D. L. Hartl. 2009. "Optimization of Gene Expression by Natural Selection."
Proceedings of the National Academy of Sciences 106 (4): 1133.
Beerenwinkel, N., and O. Zagordi. 2011. "Ultra-deep Sequencing for the Analysis of Viral
Populations." Curr. Opin. Virol 1 (5): 413-418.
Bennett, Julia, Stephen Bentley, Georgios Vernikos, Michael Quail, Inna Cherevach, Brian
White, Julian Parkhill, and Martin Maiden. 2010. "Independent evolution of the core
and accessory gene sets in the genus Neisseria: insights gained from the genome of
Neisseria lactamica isolate 020-06." BMC Genomics 11: 652.
Berbee, M. L., and J. W. Taylor. 1993. "Dating the Evolutionary Radiations of the True Fungi."
Canadian Journal of Botany 71 (8): 1114-1127.
Berg, O. G., and C. G. Kurland. 2002. "Evolution of Microbial Genomes: Sequence Acquisition
and Loss." Molecular Biology and Evolution 19 (12): 2265-2276.
Bergmann, S., J. Ihmels, and N. Barkai. 2004. "Similarities and Differences in Genome-wide
Expression Data of Six Organisms." PLoS Biology 2 (1): e9.
Birrell, G. W., J. A. Brown, H. I. Wu, G. Giaever, A. M. Chu, R. W. Davis, and J. M. Brown. 2002.
"Transcriptional Response of Saccharomyces Cerevisiae to DNA-damaging Agents
Does Not Identify the Genes That Protect Against These Agents." Proceedings of the
National Academy of Sciences 99 (13): 8778.
Boucher, Y., O. X. Cordero, A. Takemura, D. E. Hunt, K. Schliep, E. Bapteste, P. Lopez, C. L.
Tarr, and M. F. Polz. 2011. "Local Mobile Gene Pools Rapidly Cross Species
Boundaries to Create Endemicity Within Global Vibrio Cholerae Populations." MBio
2 (2). http://mbio.asm.org/content/2/2/e00335-10.short.
Brady, A., and S. L. Salzberg. 2009. "Phymm and PhymmBL: Metagenomic Phylogenetic
Classification with Interpolated Markov Models." Nature Methods 6 (9): 673-676.
Brauer, M. J., C. Huttenhower, E. M. Airoldi, R. Rosenstein, J. C. Matese, D. Gresham, V. M.
Boer, O. G. Troyanskaya, and D. Botstein. 2008. "Coordination of growth rate, cell
cycle, stress response, and metabolic activity in yeast." Molecular biology of the cell
19 (1) (January): 352-367.
Britten, R. J., and E. H. Davidson. 1969. "Gene regulation for higher cells: a theory." Science
(New York, N. Y.) 165 (891) (July 25): 349-357.
Brown, P. O., and D. Botstein. 1999. "Exploring the New World of the Genome with DNA
Microarrays." Nature Genetics 21 (1 Suppl): 33-37.
Bruijn, Frans J. de. 2011. Handbook of Molecular Microbial Ecology I: Metagenomics and
Complementary Approaches. John Wiley & Sons.
Budroni, S., E. Siena, J. C. D. Hotopp, K. L. Seib, D. Serruto, C. Nofroni, M. Comanducci, D. R.
Riley, S. C. Daugherty, and S. V. Angiuoli. 2011. "Neisseria Meningitidis Is Structured
in Clades Associated with Restriction Modification Systems That Modulate
Homologous Recombination." Proceedings of the National Academy of Sciences 108
(11): 4494-4499.
Butland, G., J. M. Peregrin-Alvarez, J. Li, W. Yang, X. Yang, V. Canadien, A. Starostine, et al.
2005. "Interaction network containing conserved and essential protein complexes in
Escherichia coli." Nature 433 (7025) (February 3): 531-537.
Butler, J., M. Kleber, P. Alvarez, W. Brockman, C.W. Chin, S. Gnerre, M. Grabherr, E. Mauceli,
B. Birren, and E. S. Lander. "ALLPATHS: A New Genome Assembly Algorithm That
Preserves Intrinsic Ambiguity."
Caimano, M. J., C. H. Eggers, K. R. O. Hazlett, and J. D. Radolf. 2004. "RpoS Is Not Central to the
General Stress Response in Borrelia Burgdorferi but Does Control Expression of One
or More Essential Virulence Determinants." Infection and Immunity 72 (11): 6433-6445.
Canales, R. D., Y. Luo, J. C. Willey, B. Austermiller, C. C. Barbacioru, C. Boysen, K. Hunkapiller,
R. V. Jensen, C. R. Knight, and K. Y. Lee. 2006. "Evaluation of DNA Microarray Results
with Quantitative Gene Expression Platforms." Nature Biotechnology 24 (9): 1115-1122.
Cases, Ildefonso, Victor de Lorenzo, and Christos A. Ouzounis. 2003. "Transcription
Regulation and Environmental Adaptation in Bacteria." Trends in Microbiology 11
(6) (June): 248-253. doi:10.1016/S0966-842X(03)00103-3.
Chaisson, M. J., D. Brinza, and P. A. Pevzner. 2009. "De Novo Fragment Assembly with Short
Mate-paired Reads: Does the Read Length Matter?" Genome Research 19 (2): 336-
346.
Chen, D., W. M. Toone, J. Mata, R. Lyne, G. Burns, K. Kivinen, A. Brazma, N. Jones, and J.
Bahler. 2003. "Global Transcriptional Responses of Fission Yeast to Environmental
Stress." Molecular Biology of the Cell 14 (1): 214.
Chhabra, S. R., Q. He, K. H. Huang, S. P. Gaucher, E. J. Alm, Z. He, M. Z. Hadi, T. C. Hazen, J. D.
Wall, and J. Zhou. 2006. "Global Analysis of Heat Shock Response in Desulfovibrio
Vulgaris Hildenborough." Journal of Bacteriology 188 (5): 1817-1828.
Chipman, Hugh, and Robert Tibshirani. 2006. "Hybrid Hierarchical Clustering with
Applications to Microarray Data." Biostatistics 7 (2) (April 1): 286-301.
doi:10.1093/biostatistics/kxj007.
Choi, M., U. I. Scholl, W. Ji, T. Liu, I. R. Tikhonova, P. Zumbo, A. Nayir, et al. 2009. "Genetic
Diagnosis by Whole Exome Capture and Massively Parallel DNA Sequencing."
Proceedings of the National Academy of Sciences 106 (45): 19096-19101.
Ciccarelli, F. D., T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork. 2006. "Toward
Automatic Reconstruction of a Highly Resolved Tree of Life." Science 311 (5765):
1283-1287.
Ciulla, D., G. Giannoukos, A. Earl, M. Feldgarden, D. Gevers, J. Levin, J. Livny, et al. 2010.
"Evaluation of Bacterial Ribosomal RNA (rRNA) Depletion Methods for Sequencing
Microbial Community Transcriptomes." Genome Biology 11 (Suppl 1): P9.
Claesson, Marcus J., Qiong Wang, Orla O'Sullivan, Rachel Greene-Diniz, James R. Cole, R. Paul
Ross, and Paul W. O'Toole. 2010. "Comparison of Two Next-generation Sequencing
Technologies for Resolving Highly Complex Microbiota Composition Using Tandem
Variable 16S rRNA Gene Regions." Nucleic Acids Research 38 (22) (December): e200.
doi:10.1093/nar/gkq873.
Cohan, F. M. 2011. "Are Species cohesive?-A View from Bacteriology."
http://works.bepress.com/cgi/viewcontent.cgi?article=1005&context=frederickcohan.
Coleman, M. L., M. B. Sullivan, A. C. Martiny, C. Steglich, K. Barry, E. F. DeLong, and S. W.
Chisholm. 2006. "Genomic Islands and the Ecology and Evolution of
Prochlorococcus." Science 311 (5768): 1768-1770.
Collins, F. S., M. L. Drumm, J. L. Cole, W. K. Lockwood, G. F. Vande Woude, and M. C. Iannuzzi.
1987. "Construction of a General Human Chromosome Jumping Library, with
Application to Cystic Fibrosis." Science 235 (4792): 1046.
Cordero, O, LA Ventouras, Edward F. DeLong, and Martin F. Polz. 2012. "Public Good
Dynamics Drive Evolution of Iron Acquisition Strategies in Natural Bacterioplankton
Populations." Proceedings of the National Academy of Sciences (November 19).
doi:10.1073/pnas.1213344109.
http://www.pnas.org/content/early/2012/11/14/1213344109.
Cordero, O., H. Wildschutte, B. Kirkup, S. Proehl, L. Ngo, F. Hussain, F. Le Roux, T. Mincer, and
M. F. Polz. 2012. "Ecological Populations of Bacteria Act as Socially Cohesive Units of
Antibiotic Production and Resistance." Science 337 (6099): 1228-1231.
David, L. A., and E. J. Alm. 2010. "Rapid Evolutionary Innovation During an Archaean Genetic
Expansion." Nature.
Dehal, P. S., M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, D. Chivian, G. D. Friedland,
et al. 2010. "MicrobesOnline: An Integrated Portal for Comparative and Functional
Genomics." Nucleic Acids Research 38 (suppl 1): D396-D400.
Dehal, Paramvir S., Marcin P. Joachimiak, Morgan N. Price, John T. Bates, Jason K. Baumohl,
Dylan Chivian, Greg D. Friedland, et al. "MicrobesOnline: An Integrated Portal for
Comparative and Functional Genomics." Nucleic Acids Research.
Delcher, A. L., K. A. Bratke, E. C. Powers, and S. L. Salzberg. 2007. "Identifying Bacterial
Genes and Endosymbiont DNA with Glimmer." Bioinformatics 23 (6): 673-679.
DeRisi, J. L., V. R. Iyer, and P. O. Brown. 1997. "Exploring the Metabolic and Genetic Control
of Gene Expression on a Genomic Scale." Science 278 (5338): 680-686.
Dhillon, I. S., E. M. Marcotte, and U. Roshan. 2003. "Diametrical Clustering for Identifying
Anti-correlated Gene Clusters." Bioinformatics 19 (13): 1612-1619.
Dixon, Philip. 2003. "VEGAN, a Package of R Functions for Community Ecology." Journal of
Vegetation Science 14 (6): 927-930. doi:10.1111/j.1654-1103.2003.tb02228.x.
Dobrindt, U., B. Hochhut, U. Hentschel, and J. Hacker. 2004. "Genomic Islands in Pathogenic
and Environmental Microorganisms." Nature Reviews Microbiology 2 (5): 414-424.
Dondoshansky, I., and Y. Wolf. 2000. BLASTCLUST-BLAST Score-based Single-linkage
Clustering.
Doolittle, W. F. 2012. "Population Genomics: How Bacterial Species Form and Why They
Don't Exist." Current Biology 22 (11): R451-R453.
Draghici, S., P. Khatri, A. C. Eklund, and Z. Szallasi. 2006. "Reliability and Reproducibility
Issues in DNA Microarray Measurements." TRENDS in Genetics 22 (2): 101-109.
Dudoit, S., and J. Fridlyand. 2002. "A Prediction-based Resampling Method for Estimating
the Number of Clusters in a Dataset." Genome Biology 3 (7): research0036.
Eddy, S. R. 2001. "HMMER: Profile Hidden Markov Models for Biological Sequence
Analysis." http://www.citeulike.org/group/2050/article/1079185.
Edgar, R. C. 2010. "Search and Clustering Orders of Magnitude Faster Than BLAST."
Bioinformatics 26 (19): 2460-2461.
Eisen, J. A. 1998. "Phylogenomics: Improving Functional Predictions for Uncharacterized
Genes by Evolutionary Analysis." Genome Research 8 (3): 163.
English, Adam C., Stephen Richards, Yi Han, Min Wang, Vanesa Vee, Jiaxin Qu, Xiang Qin, et
al. 2012. "Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read
Sequencing Technology." Ed. Zhanjiang Liu. PLoS ONE 7 (11) (November 21):
e47768. doi:10.1371/journal.pone.0047768.
Faith, J. J., M. E. Driscoll, V. A. Fusaro, E. J. Cosgrove, B. Hayete, F. S. Juhn, S. J. Schneider, and
T. S. Gardner. 2008. "Many Microbe Microarrays Database: Uniformly Normalized
Affymetrix Compendia with Structured Experimental Metadata." Nucleic Acids
Research 36 (Database issue): D866.
Fay, J. C., and P. J. Wittkopp. 2007. "Evaluating the role of natural selection in the evolution
of gene regulation." Heredity (May 23).
Felsenstein, J., and others. 2004. Inferring Phylogenies. Vol. 2. Sinauer Associates
Sunderland. http://orton.catie.ac.cr/cgi-bin/wxis.exe/?IsisScript=SUV.xis&method=post&formato=2&cantidad=1&expresion=mfn=007729.
Fierer, Noah, Mya Breitbart, James Nulton, Peter Salamon, Catherine Lozupone, Ryan Jones,
Michael Robeson, et al. 2007. "Metagenomic and Small-Subunit rRNA Analyses
Reveal the Genetic Diversity of Bacteria, Archaea, Fungi, and Viruses in Soil." Applied
and Environmental Microbiology 73 (21) (November 1): 7059-7066.
doi:10.1128/AEM.00358-07.
Fink, G. R., P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O.
Brown, D. Botstein, and B. Futcher. 1998. "Comprehensive Identification of Cell
Cycle-regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray
Hybridization." Molecular Biology of the Cell 9 (12): 3273-3297.
Finneran, K. T., M. E. Housewright, and D. R. Lovley. 2002. "Multiple Influences of Nitrate on
Uranium Solubility During Bioremediation of Uranium-contaminated Subsurface
Sediments." Environmental Microbiology 4 (9): 510-516.
Fischer, Steve, Brian P. Brunk, Feng Chen, Xin Gao, Omar S. Harb, John B. Iodice,
Dhanasekaran Shanmugam, David S. Roos, and Christian J. Stoeckert. 2002. "Using
OrthoMCL to Assign Proteins to OrthoMCL-DB Groups or to Cluster Proteomes Into
New Ortholog Groups." In Current Protocols in Bioinformatics. John Wiley & Sons,
Inc. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0612s35/abstract.
Fokkens, L., and B. Snel. 2009. "Cohesive Versus Flexible Evolution of Functional Modules in
Eukaryotes." PLoS Comput Biol 5 (1): e1000276.
Francisco, J. C., F. M. Cohan, and D. Krizanc. 2012. "Demarcation of Bacterial Ecotypes from
DNA Sequence Data: A Comparative Analysis of Four Algorithms." In 2nd IEEE
International Conference on Computational Advances in Bio and Medical Sciences
(ICCABS).
http://works.bepress.com/cgi/viewcontent.cgi?article=1095&context=frederickcohan.
Fraser, C., E. J. Alm, M. F. Polz, B. G. Spratt, and W. P. Hanage. 2009. "The Bacterial Species
Challenge: Making Sense of Genetic and Ecological Diversity." Science 323 (5915):
741-746.
Frye, JG, Blackmer F, Cheng P, Proctor E, Porwollik S, and McClelland M. Series GSE808.
Fuhrman, Jed A., Joshua A. Steele, Ian Hewson, Michael S. Schwalbach, Mark V. Brown,
Jessica L. Green, and James H. Brown. 2008. "A Latitudinal Diversity Gradient in
Planktonic Marine Bacteria." Proceedings of the National Academy of Sciences 105
(22) (June 3): 7774-7778. doi:10.1073/pnas.0803070105.
Gao, H., Y. Wang, X. Liu, T. Yan, L. Wu, E. Alm, A. Arkin, D. K. Thompson, and J. Zhou. 2004.
"Global Transcriptome Analysis of the Heat Shock Response of Shewanella
Oneidensis." Journal of Bacteriology 186 (22): 7796.
Gao, H., Z. K. Yang, L. Wu, D. K. Thompson, and J. Zhou. 2006. "Global Transcriptome Analysis
of the Cold Shock Response of Shewanella Oneidensis MR-1 and Mutational Analysis
of Its Classical Cold Shock Proteins." Journal of Bacteriology 188 (12): 4560.
Garland, T., P. H. Harvey, and A. R. Ives. 1992. "Procedures for the Analysis of Comparative
Data Using Phylogenetically Independent Contrasts." Systematic Biology 41 (1): 18-
32.
Gasch, A. P. 2007. "Comparative Genomics of the Environmental Stress Response in
Ascomycete Fungi." Yeast 24 (11): 961-976.
Gasch, A. P., P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and
P. O. Brown. 2000. "Genomic Expression Programs in the Response of Yeast Cells to
Environmental Changes." Science's STKE 11 (12): 4241-4257.
Gelfand, M. S. 2006. "Evolution of transcriptional regulatory networks in microbial
genomes." Current opinion in structural biology 16 (3) (June): 420-429.
Ghodsi, M. 2012. "Searching, Clustering and Evaluating Biological Sequences."
http://drum.lib.umd.edu/handle/1903/13186.
Giaever, G., A. M. Chu, L. Ni, C. Connelly, L. Riles, S. Véronneau, S. Dow, A. Lucau-Danila, K.
Anderson, and B. Andre. 2002. "Functional Profiling of the Saccharomyces
Cerevisiae Genome." Nature 418: 387-391.
Giannoukos, G., D. M. Ciulla, K. Huang, B. J. Haas, J. Izard, J. Z. Levin, J. Livny, et al. 2012.
"Efficient and Robust RNA-seq Process for Cultured Bacteria and Complex
Community Transcriptomes." Genome Biology 13 (3): r23.
Gifford, Scott M, Shalabh Sharma, Johanna M Rinta-Kanto, and Mary Ann Moran. 2010.
"Quantitative Analysis of a Deeply Sequenced Marine Microbial Metatranscriptome."
The ISME Journal 5 (3) (September 16): 461-472. doi:10.1038/ismej.2010.141.
Gilbert, J. A., D. Field, P. Swift, L. Newbold, A. Oliver, T. Smyth, P. J. Somerfield, S. Huse, and I.
Joint. 2009. "The Seasonal Structure of Microbial Communities in the Western
English Channel." Environmental Microbiology 11 (12): 3132-3139.
Gilbreath, J. J., W. L. Cody, D. S. Merrell, and D. R. Hendrixson. 2011. "Change Is Good:
Variations in Common Biological Mechanisms in the Epsilonproteobacterial Genera
Campylobacter and Helicobacter." Microbiology and Molecular Biology Reviews 75
(1): 84.
Glasner, J. D., P. Liss, G. Plunkett III, A. Darling, T. Prasad, M. Rusch, A. Byrnes, M. Gilson, B.
Biehl, and F. R. Blattner. 2003. "ASAP, a Systematic Annotation Package for
Community Analysis of Genomes." Nucleic Acids Research 31 (1): 147.
Glenn, Travis C. 2011. "Field Guide to Next-generation DNA Sequencers." Molecular Ecology
Resources 11 (5): 759-769. doi:10.1111/j.1755-0998.2011.03024.x.
Gnerre, Sante, Eric Lander, Kerstin Lindblad-Toh, and David Jaffe. 2009. "Assisted Assembly:
How to Improve a De Novo Genome Assembly by Using Related Species." Genome
Biology 10 (8) (August 27): R88. doi:10.1186/gb-2009-10-8-r88.
Grote, J., J. C. Thrash, M. J. Huggett, Z. C. Landry, P. Carini, S. J. Giovannoni, and M. S. Rappe.
2012. "Streamlining and Core Genome Conservation Among Highly Divergent
Members of the SAR11 Clade." mBio 3 (5) (September 18): e00252-12-e00252-12.
doi:10.1128/mBio.00252-12.
Gu, X. 2004. "Statistical Framework for Phylogenomic Analysis of Gene Family Expression
Profiles." Genetics 167 (1): 531-542.
Guell, Marc, Eva Yus, Maria Lluch-Senar, and Luis Serrano. 2011. "Bacterial Transcriptomics:
What Is Beyond the RNA Horiz-ome?" Nature Reviews Microbiology 9 (9) (August
12): 658-669. doi:10.1038/nrmicro2620.
Gunesekere, Ishara C., Charlene M. Kahler, David R. Powell, Lori A. S. Snyder, Nigel J.
Saunders, Julian I. Rood, and John K. Davies. 2006. "Comparison of the RpoH-Dependent
Regulon and General Stress Response in Neisseria Gonorrhoeae." Journal
of Bacteriology 188 (13) (July 1): 4769-4776. doi:10.1128/JB.01807-05.
Gutierrez-Rios, R. M., D. A. Rosenblueth, J. A. Loza, A. M. Huerta, J. D. Glasner, F. R. Blattner,
and J. Collado-Vides. 2003. "Regulatory Network of Escherichia Coli: Consistency
Between Literature Knowledge and Microarray Profiles." Genome Research 13 (11):
2435-2443.
Hacker, J., and E. Carniel. 2001. "Ecological Fitness, Genomic Islands and Bacterial
Pathogenicity." EMBO Reports 2 (5): 376-381.
Haegeman, Bart, and Joshua S Weitz. 2012. "A Neutral Theory of Genome Evolution and the
Frequency Distribution of Genes." BMC Genomics 13: 196. doi:10.1186/1471-2164-13-196.
Hardesty, Donald L. 1975. "The Niche Concept: Suggestions for Its Use in Human Ecology."
Human Ecology 3 (2) (April): 71-85. doi:10.1007/BF01552263.
Hayes, E. T., J. C. Wilks, P. Sanfilippo, E. Yohannes, D. P. Tate, B. D. Jones, M. D. Radmacher, S.
S. BonDurant, and J. L. Slonczewski. 2006. "Oxygen limitation modulates pH
regulation of catabolism and hydrogenases, multidrug transporters, and envelope
composition in Escherichia coli K-12." BMC microbiology 6 (October 6): 89.
Hayes, E. T., J. C. Wilks, E. Yohannes, D. P. Tate, M. Radmacher, S. S. BonDurant, and J. L.
Slonczewski. 2006. "pH and Anaerobiosis Coregulate Catabolism, Hydrogenases, Ion
and Multidrug Transporters, and Envelope Composition in Escherichia Coli K-12."
BMC Microbiol 6: 89.
He, Q., Z. He, D. C. Joyner, M. Joachimiak, M. N. Price, Z. K. Yang, H. C. B. Yen, C. L. Hemme, W.
Chen, and M. M. Fields. 2010. "Impact of Elevated Nitrate on Sulfate-reducing
Bacteria: a Comparative Study of Desulfovibrio Vulgaris." The ISME Journal 4 (11):
1386-1397.
Heidelberg, J. F., R. Seshadri, S. A. Haveman, C. L. Hemme, I. T. Paulsen, J. F. Kolonay, J. A.
Eisen, N. Ward, B. Methe, L. M. Brinkac, et al. 2004. "The genome sequence of the
anaerobic, sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough."
Nature biotechnology 22 (5) (May): 554-559.
Heidelberg, J. F., R. Seshadri, S. A. Haveman, C. L. Hemme, I. T. Paulsen, J. F. Kolonay, J. A.
Eisen, N. Ward, B. Methe, and L. M. Brinkac. 2004. "Genome Sequence of the
Dissimilatory Metal Ion-reducing Bacterium Shewanella Oneidensis." Nature
Biotechnology 22: 554-559.
Helmann, J. D., M. F. W. Wu, P. A. Kobel, F. J. Gamo, M. Wilson, M. M. Morshedi, M. Navre, and
C. Paddon. 2001. "Global Transcriptional Response of Bacillus Subtilis to Heat
Shock." Journal of Bacteriology 183 (24): 7318-7328.
Hiatt, J. B., R. P. Patwardhan, E. H. Turner, C. Lee, and J. Shendure. 2010. "Parallel, Tag-directed
Assembly of Locally Derived Short Sequence Reads." Nature Methods 7 (2):
119-122.
Hoff, K. 2009. "The Effect of Sequencing Errors on Metagenomic Gene Prediction." BMC
Genomics 10 (1): 520.
Hooper, Sean D, Konstantinos Mavromatis, and Nikos C Kyrpides. 2009. "Microbial Co-habitation
and Lateral Gene Transfer: What Transposases Can Tell Us." Genome
Biology 10 (4): R45. doi:10.1186/gb-2009-10-4-r45.
Hu, P., S. C. Janga, M. Babu, J. J. Diaz-Mejia, G. Butland, W. Yang, O. Pogoutse, X. Guo, S.
Phanse, and P. Wong. 2009. "Global Functional Atlas of Escherichia Coli
Encompassing Previously Uncharacterized Proteins." PLoS Biol 7 (4): e96.
Hunt, D. E., L. A. David, D. Gevers, S. P. Preheim, E. J. Alm, and M. F. Polz. 2008. "Resource
Partitioning and Sympatric Differentiation Among Closely Related
Bacterioplankton." Science 320 (5879): 1081-1085.
Hutchinson, G. E. 1957. "Concluding Remarks." Cold Spring Harbor Symposia on Quantitative
Biology 22 (0) (January 1): 415-427. doi:10.1101/SQB.1957.022.01.039.
De Jager, V., and R. J. Siezen. 2011. "Single-cell Genomics: Unravelling the Genomes of
Unculturable Microorganisms." Microbial Biotechnology 4 (4): 431-437.
Jozefczuk, S., S. Klie, G. Catchpole, J. Szymanski, A. Cuadros-Inostroza, D. Steinhauser, J.
Selbig, and L. Willmitzer. 2010. "Metabolomic and Transcriptomic Stress Response
of Escherichia Coli." Molecular Systems Biology 6 (1).
Kaczanowska, M., and M. Ryden-Aulin. 2007. "Ribosome biogenesis and the translation
process in Escherichia coli." Microbiology and molecular biology reviews: MMBR 71
(3) (September): 477-494.
Kaneko, T., S. Sato, H. Kotani, A. Tanaka, E. Asamizu, Y. Nakamura, N. Miyajima, M. Hirosawa,
M. Sugiura, and S. Sasamoto. 1996. "Sequence Analysis of the Genome of the
Unicellular Cyanobacterium Synechocystis Sp. Strain PCC6803. II. Sequence
Determination of the Entire Genome and Assignment of Potential Protein-coding
Regions." DNA Research 3 (3): 109-136.
Kannan, G., J. C. Wilks, D. M. Fitzgerald, B. D. Jones, S. S. Bondurant, and J. L. Slonczewski.
2008. "Rapid acid treatment of Escherichia coli: transcriptomic response and
recovery." BMC microbiology 8 (February 26): 37.
Kerr, M. Kathleen, and Gary A. Churchill. 2001. "Bootstrapping Cluster Analysis: Assessing
the Reliability of Conclusions from Microarray Experiments." Proceedings of the
National Academy of Sciences 98 (16) (July 31): 8961-8965.
doi:10.1073/pnas.161273698.
Kettler, G. C., A. C. Martiny, K. Huang, J. Zucker, M. L. Coleman, S. Rodrigue, F. Chen, A.
Lapidus, S. Ferriera, and J. Johnson. 2007. "Patterns and Implications of Gene Gain
and Loss in the Evolution of Prochlorococcus." PLoS Genetics 3 (12): e231.
Keymer, D. P., M. C. Miller, G. K. Schoolnik, and A. B. Boehm. 2007. "Genomic and Phenotypic
Diversity of Coastal Vibrio Cholerae Strains Is Linked to Environmental Factors."
Applied and Environmental Microbiology 73 (11): 3705-3714.
Khaitovich, P., W. Enard, M. Lachmann, and S. Paabo. 2006. "Evolution of Primate Gene
Expression." Nature Reviews Genetics 7 (9): 693-702.
Khaitovich, P., G. Weiss, M. Lachmann, I. Hellmann, W. Enard, B. Muetzel, U. Wirkner, W.
Ansorge, and S. Paabo. 2004. "A neutral model of transcriptome evolution." PLoS
biology 2 (5) (May): E132.
Kirkup, Benjamin, LeeAnn Chang, Sarah Chang, Dirk Gevers, and Martin Polz. 2010. "Vibrio
Chromosomes Share Common History." BMC Microbiology 10 (1) (May 10): 137.
doi:10.1186/1471-2180-10-137.
Klosinska, M. M., C. A. Crutchfield, P. H. Bradley, J. D. Rabinowitz, and J. R. Broach. 2011.
"Yeast cells can access distinct quiescent states." Genes & development 25 (4)
(February 15): 336-349.
Konstantinidis, K. T., M. H. Serres, M. F. Romine, J. L. M. Rodrigues, J. Auchtung, L. A. McCue,
M. S. Lipton, A. Obraztsova, C. S. Giometti, and K. H. Nealson. 2009. "Comparative
Systems Biology Across an Evolutionary Gradient Within the Shewanella Genus."
Proceedings of the National Academy of Sciences 106 (37): 15909.
Koskinen, J. Patrik, and Liisa Holm. 2012. "SANS: High-throughput Retrieval of Protein
Sequences Allowing 50% Mismatches." Bioinformatics 28 (18) (September 15):
i438-i443. doi:10.1093/bioinformatics/bts417.
Kuo, Chih-Horng, Nancy A Moran, and Howard Ochman. 2009. "The Consequences of
Genetic Drift for Bacterial Genome Complexity." Genome Research 19 (8) (August):
1450-1454. doi:10.1101/gr.091785.109.
Kuo, Chih-Horng, and Howard Ochman. 2009. "Deletional Bias Across the Three Domains of
Life." Genome Biology and Evolution 1: 145-152. doi:10.1093/gbe/evp016.
Kussell, Edo, and Stanislas Leibler. 2005. "Phenotypic Diversity, Population Growth, and
Information in Fluctuating Environments." Science 309 (5743) (September 23):
2075-2078. doi:10.1126/science.1114383.
Lan, R., and P. R. Reeves. 2001. "When Does a Clone Deserve a Name? A Perspective on
Bacterial Species Based on Population Genetics." Trends in Microbiology 9 (9): 419-424.
Lan, Ruiting, and Peter R. Reeves. 2000. "Intraspecies Variation in Bacterial Genomes: The
Need for a Species Genome Concept." Trends in Microbiology 8 (9) (September 1):
396-401. doi:10.1016/S0966-842X(00)01791-1.
Langmead, B., and S. L. Salzberg. 2012. "Fast Gapped-read Alignment with Bowtie 2." Nature
Methods 9 (4): 357-359.
Lapierre, P., and J. P. Gogarten. 2009. "Estimating the Size of the Bacterial Pan-genome."
Trends in Genetics 25 (3): 107-110.
Lauro, Federico M., Diane McDougald, Torsten Thomas, Timothy J. Williams, Suhelen Egan,
Scott Rice, Matthew Z. DeMaere, et al. 2009. "The Genomic Basis of Trophic Strategy
in Marine Bacteria." Proceedings of the National Academy of Sciences 106 (37)
(September 15): 15527-15533. doi:10.1073/pnas.0903507106.
Lazzeroni, L., and A. Owen. 2002. Statistica Sinica 12 (1): 61-86.
Leaphart, A. B., D. K. Thompson, K. Huang, E. Alm, X. F. Wan, A. Arkin, S. D. Brown, L. Wu, T.
Yan, and X. Liu. 2006. "Transcriptome Profiling of Shewanella Oneidensis Gene
Expression Following Exposure to Acidic and Alkaline pH." Journal of Bacteriology
188 (4): 1633.
Leary, Daniel J., and Sui Huang. 2001. "Regulation of Ribosome Biogenesis Within the
Nucleolus." FEBS Letters 509 (2): 145-150.
Lee, M. C., and C. J. Marx. 2012. "Repeated, Selection-driven Genome Reduction of Accessory
Genes in Experimental Populations." PLoS Genetics 8 (5): e1002651.
Lee, Z. M. P., C. Bussema III, and T. M. Schmidt. 2009. "rrnDB: Documenting the Number of
rRNA and tRNA Genes in Bacteria and Archaea." Nucleic Acids Research 37 (suppl 1):
D489-D493.
Lefébure, T., P. D. P. Bitar, H. Suzuki, and M. J. Stanhope. 2010. "Evolutionary Dynamics of
Complete Campylobacter Pan-genomes and the Bacterial Species Concept." Genome
Biology and Evolution 2: 646.
Lennon, N. J., R. E. Lintner, S. Anderson, P. Alvarez, A. Barry, W. Brockman, R. Daza, R. L.
Erlich, G. Giannoukos, and L. Green. 2010. "A Scalable, Fully Automated
Process for Construction of Sequence-ready Barcoded Libraries for 454."
http://www.biomedcentral.com/content/pdf/gb-2010-11-2-r15.pdf.
Letunic, I., and P. Bork. 2011. "Interactive Tree Of Life V2: Online Annotation and Display of
Phylogenetic Trees Made Easy." Nucleic Acids Research 39 (Web Server) (April 5):
W475-W478. doi:10.1093/nar/gkr201.
Levine, Stuart. 2012. "Biomicrocenter Website."
http://openwetware.org/wiki/BioMicroCenter:Pricing.
Li, H., and R. Durbin. 2009. "Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform." Bioinformatics 25 (14): 1754-1760.
Li, H., J. Ruan, and R. Durbin. 2008. "Mapping Short DNA Sequencing Reads and Calling
Variants Using Mapping Quality Scores." Genome Research 18 (11): 1851-1858.
Lilburn, Timothy G., Hong Cai, and Jianying Gu. 2009. "The Core and Pan-Genome of the
Vibrionaceae." In, 84-90. IEEE. doi:10.1109/IJCBS.2009.60.
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5260735.
Logares, Ramiro, Thomas H. A. Haverkamp, Surendra Kumar, Anders Lanzén, Alexander J.
Nederbragt, Christopher Quince, and Håvard Kauserud. 2012. "Environmental
Microbiology Through the Lens of High-throughput DNA Sequencing: Synopsis of
Current Platforms and Bioinformatics Approaches." Journal of Microbiological
Methods 91 (1) (October): 106-113. doi:10.1016/j.mimet.2012.07.017.
Lopez, L. C. S., M. P. de Aguiar Fracasso, D. O. Mesquita, A. R. T. Palma, and P. Riul. 2012. "The
Relationship Between Percentage of Singletons and Sampling Effort: A New
Approach to Reduce the Bias of Richness Estimates." Ecological Indicators 14 (1):
164-169.
Lovley, D. R., S. J. Giovannoni, D. C. White, J. E. Champine, E. J. P. Phillips, Y. A. Gorby, and S.
Goodwin. 1993. "Geobacter Metallireducens Gen. Nov. Sp. Nov., a Microorganism
Capable of Coupling the Complete Oxidation of Organic Compounds to the Reduction
of Iron and Other Metals." Archives of Microbiology 159 (4): 336-344.
Lozada-Chavez, I., S. C. Janga, and J. Collado-Vides. 2006. "Bacterial Regulatory Networks Are
Extremely Flexible in Evolution." Nucleic Acids Research 34 (12): 3434-3445.
Luo, Chengwei, Seth T. Walk, David M. Gordon, Michael Feldgarden, James M. Tiedje, and
Konstantinos T. Konstantinidis. 2011. "Genome Sequencing of Environmental
Escherichia Coli Expands Understanding of the Ecology and Speciation of the Model
Bacterial Species." Proceedings of the National Academy of Sciences 108 (17) (April
26): 7200-7205. doi:10.1073/pnas.1015622108.
Lynch, M. 2007. "The evolution of genetic networks by non-adaptive processes." Nature
reviews. Genetics 8 (10) (October): 803-813.
Lynch, Michael. 2011. "Statistical Inference on the Mechanisms of Genome Evolution." Ed.
Nancy A. Moran. PLoS Genetics 7 (6) (June 9): e1001389.
doi:10.1371/journal.pgen.1001389.
Macalalad, A. R., M. C. Zody, P. Charlebois, N. J. Lennon, R. M. Newman, C. M. Malboeuf, E. M.
Ryan, et al. 2012. "Highly Sensitive and Specific Detection of Rare Variants in Mixed
Viral Populations from Massively Parallel Sequence Data." PLoS Computational
Biology 8 (3): e1002417.
Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. 2012. Cluster:
Cluster Analysis Basics and Extensions.
Makarova, K, A Slesarev, Y Wolf, A Sorokin, B Mirkin, E Koonin, A Pavlov, et al. 2006.
"Comparative Genomics of the Lactic Acid Bacteria." Proceedings of the National
Academy of Sciences of the United States of America 103 (42) (October 17): 15611-15616.
doi:10.1073/pnas.0607117103.
Malmstrom, R. R., S. Rodrigue, K. H. Huang, L. Kelly, S. E. Kern, A. Thompson, S. Roggensack,
P. M. Berube, M. R. Henn, and S. W. Chisholm. 2012. "Ecology of Uncultured
Prochlorococcus Clades Revealed Through Single-cell Genomics and Biogeographic
Analysis." The ISME Journal.
http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej201289a.html.
Markowitz, V. M., K. Mavromatis, N. N. Ivanova, I. Chen, A. Min, K. Chu, and N. C. Kyrpides.
2009. "IMG ER: a System for Microbial Genome Annotation Expert Review and
Curation." Bioinformatics 25 (17): 2271-2278.
Maurer, L. M., E. Yohannes, S. S. Bondurant, M. Radmacher, and J. L. Slonczewski. 2005. "pH
Regulates Genes for Flagellar Motility, Catabolism, and Oxidative Stress in
Escherichia Coli K-12." Journal of Bacteriology 187 (1): 304.
McAdams, H. H., B. Srinivasan, and A. P. Arkin. 2004. "The Evolution of Genetic Regulatory
Systems in Bacteria." Nature Reviews Genetics 5 (3): 169-178.
Miller, Christopher, Brett Baker, Brian Thomas, Steven Singer, and Jillian Banfield. 2011.
"EMIRGE: Reconstruction of Full-length Ribosomal Genes from Microbial
Community Short Read Sequencing Data." Genome Biology 12 (5) (May 19): R44.
doi:10.1186/gb-2011-12-5-r44.
Mitchell, Amir, Gal H Romano, Bella Groisman, Avihu Yona, Erez Dekel, Martin Kupiec, Orna
Dahan, and Yitzhak Pilpel. 2009. "Adaptive Prediction of Environmental Changes by
Microorganisms." Nature 460 (7252) (July 9): 220-224. doi:10.1038/nature08112.
Mittenhuber, G. 2002. "A Phylogenomic Study of the General Stress Response Sigma Factor
sigmaB of Bacillus Subtilis and Its Regulatory Proteins." Journal of Molecular
Microbiology and Biotechnology 4 (4): 427.
Moreno, C., J. Romero, and R. T. Espejo. 2002. "Polymorphism in Repeated 16S rRNA Genes
Is a Common Property of Type Strains and Environmental Isolates of the Genus
Vibrio." Microbiology 148 (4): 1233-1239.
Mukhopadhyay, A., Z. He, E. J. Alm, A. P. Arkin, E. E. Baidoo, S. C. Borglin, W. Chen, T. C.
Hazen, Q. He, and H. Y. Holman. 2006. "Salt Stress in Desulfovibrio Vulgaris
Hildenborough: An Integrated Genomics Approach." Journal of Bacteriology 188
(11): 4068-4078.
Mulder, Nicola J, Rolf Apweiler, Teresa K Attwood, Amos Bairoch, Alex Bateman, David Binns,
Peer Bork, et al. 2007. "New Developments in the InterPro Database." Nucleic Acids
Research 35 (suppl_1) (January 12): D224-228.
Muzzi, A., and C. Donati. 2011. "Population Genetics and Evolution of the Pan-genome of
Streptococcus Pneumoniae." International Journal of Medical Microbiology.
http://www.sciencedirect.com/science/article/pii/S1438422111000944.
Nekrutenko, A., and J. Taylor. 2012. "Next-generation Sequencing Data Interpretation:
Enhancing Reproducibility and Accessibility." Nature Reviews Genetics 13 (9): 667-672.
Nishiguchi, Michele K., and Vinod S. Nair. 2003. "Evolution of Symbiosis in the Vibrionaceae:
a Combined Approach Using Molecules and Physiology." International Journal of
Systematic and Evolutionary Microbiology 53 (6) (November 1): 2019-2026.
doi:10.1099/ijs.0.02792-0.
Niu, Beifang, Zhengwei Zhu, Limin Fu, Sitao Wu, and Weizhong Li. 2011. "FR-HIT, a Very
Fast Program to Recruit Metagenomic Reads to Homologous Reference Genomes."
Bioinformatics 27 (12) (June 15): 1704-1705. doi:10.1093/bioinformatics/btr252.
Oakley, T. H., Z. Gu, E. Abouheif, N. H. Patel, and W. H. Li. 2005. "Comparative Methods for
the Analysis of Gene-Expression Evolution: An Example Using Yeast Functional
Genomic Data." Molecular Biology and Evolution 22 (1): 40-50.
Pal, C., B. Papp, and M. J. Lercher. 2005. "Adaptive evolution of bacterial metabolic networks
by horizontal gene transfer." Nature genetics 37 (12) (December): 1372-1375.
Pan, Wei. 2002. "A Comparative Review of Statistical Methods for Discovering Differentially
Expressed Genes in Replicated Microarray Experiments." Bioinformatics 18 (4)
(April 1): 546-554. doi:10.1093/bioinformatics/18.4.546.
Papa, Eliseo, Michael Docktor, Christopher Smillie, Sarah Weber, Sarah P. Preheim, Dirk
Gevers, Georgia Giannoukos, et al. 2012. "Non-Invasive Mapping of the
Gastrointestinal Microbiota Identifies Children with Inflammatory Bowel Disease."
PLoS ONE 7 (6) (June 29): e39242. doi:10.1371/journal.pone.0039242.
Polz, M. F., D. E. Hunt, S. P. Preheim, and D. M. Weinreich. 2006. "Patterns and Mechanisms
of Genetic and Phenotypic Differentiation in Marine Microbes." Philosophical
Transactions of the Royal Society B: Biological Sciences 361 (1475) (November 29):
2009-2021. doi:10.1098/rstb.2006.1928.
Preheim, S. P. 2010. "Ecology and Population Structure of Vibrionaceae in the Coastal
Ocean". Massachusetts Institute of Technology.
http://dspace.mit.edu/handle/1721.1/58184.
Preheim, S. P., Y. Boucher, H. Wildschutte, L. A. David, D. Veneziano, E. J. Alm, and M. F. Polz.
2011. "Metapopulation Structure of Vibrionaceae Among Coastal Marine
Invertebrates." Environmental Microbiology 13 (1): 265-275.
Preheim, S. P., S. Timberlake, and M. F. Polz. 2011. "Merging Taxonomy with Ecological
Population Prediction in a Case Study of Vibrionaceae." Applied and Environmental
Microbiology 77 (20): 7195-7206.
Price, M. N., K. H. Huang, A. P. Arkin, and E. J. Alm. 2005. "Operon Formation Is Driven by Co-regulation and Not by Horizontal Gene Transfer." Genome Research 15 (6): 809-819.
Price, Morgan N., Adam P. Arkin, and Eric J. Alm. 2006. "The Life-Cycle of Operons." PLoS
Genetics 2 (6) (June): e96 OP.
Price, Morgan N, Paramvir S Dehal, and Adam P Arkin. 2007. "Orthologous Transcription
Factors in Bacteria Have Different Functions and Regulate Different Genes." PLoS
Computational Biology 3 (9) (September): e175 OP.
Quince, Christopher, Thomas P. Curtis, and William T. Sloan. 2008. "The Rational
Exploration of Microbial Diversity." The ISME Journal 2 (10): 997-1006.
doi:10.1038/ismej.2008.69.
Ricklefs, R. E., and J. M. Starck. 1996. "Applications of Phylogenetically Independent
Contrasts: a Mixed Progress Report." Oikos 77 (1): 167-172.
Rifkin, S. A., J. Kim, and K. P. White. 2003. "Evolution of Gene Expression in the Drosophila
Melanogaster Subgroup." Nature Genetics 33 (2): 138-144.
Riley, M. 1993. "Functions of the Gene Products of Escherichia Coli." Microbiological Reviews
57 (4): 862.
Rocap, G., F. W. Larimer, J. Lamerdin, S. Malfatti, P. Chain, N. A. Ahlgren, A. Arellano, M.
Coleman, L. Hauser, and W. R. Hess. 2003. "Genome Divergence in Two
Prochlorococcus Ecotypes Reflects Oceanic Niche Differentiation." Nature 424
(6952): 1042-1047.
Rocha, E. P. C. 2003. "DNA Repeats Lead to the Accelerated Loss of Gene Order in Bacteria."
TRENDS in Genetics 19 (11): 600-603.
Rodrigue, Sebastien, Arne C. Materna, Sonia C. Timberlake, Matthew C. Blackburn, Rex R.
Malmstrom, Eric J. Alm, and Sallie W. Chisholm. 2010. "Unlocking Short Read
Sequencing for Metagenomics." Ed. Jack Anthony Gilbert. PLoS ONE 5 (7) (July 28):
e11840. doi:10.1371/journal.pone.0011840.
Roguev, A., S. Bandyopadhyay, M. Zofall, K. Zhang, T. Fischer, S. R. Collins, H. Qu, et al. 2008.
"Conservation and rewiring of functional modules revealed by an epistasis map in
fission yeast." Science (New York, N.Y.) 322 (5900) (October 17): 405-410.
Rustici, G., J. Mata, K. Kivinen, P. Lio, C. J. Penkett, G. Burns, J. Hayles, A. Brazma, P. Nurse,
and J. Bahler. 2004. "Periodic gene expression program of the fission yeast cell
cycle." Nature genetics 36 (8) (August): 809-817.
Rzhetsky, Andrey, Edoardo M. Airoldi, Curtis Huttenhower, David Gresham, Charles Lu, Amy
A. Caudy, Maitreya J. Dunham, James R. Broach, David Botstein, and Olga G.
Troyanskaya. 2009. "Predicting Cellular Growth from Gene Expression Signatures."
PLoS Computational Biology 5 (1): e1000257.
Salgado, H., S. Gama-Castro, M. Peralta-Gil, E. Diaz-Peredo, F. Sanchez-Solano, A. Santos-Zavaleta, I. Martinez-Flores, et al. 2006. "RegulonDB (version 5.0): Escherichia coli
K-12 transcriptional regulatory network, operon organization, and growth
conditions." Nucleic acids research 34 (Database issue) (January 1): D394-7.
Salzberg, S. L., D. D. Sommer, D. Puiu, and V. T. Lee. 2008. "Gene-boosted Assembly of a
Novel Bacterial Genome from Very Short Reads." PLoS Computational Biology 4 (9):
e1000186.
Schloss, P. D., S. L. Westcott, T. Ryabin, J. R. Hall, M. Hartmann, E. B. Hollister, R. A.
Lesniewski, B. B. Oakley, D. H. Parks, and C. J. Robinson. 2009. "Introducing Mothur:
Open-source, Platform-independent, Community-supported Software for Describing
and Comparing Microbial Communities." Applied and Environmental Microbiology 75
(23): 7537-7541.
Shapiro, B. J. 2010. "Genomic Signatures of Sex, Selection and Speciation in the Microbial
World". Massachusetts Institute of Technology.
http://dspace.mit.edu/handle/1721.1/61788.
Shapiro, B. J., L. A. David, J. Friedman, and E. J. Alm. 2009. "Looking for Darwin's Footprints
in the Microbial World." Trends in Microbiology 17 (5): 196-204.
Shapiro, B. J., J. Friedman, O. X. Cordero, S. P. Preheim, S. C. Timberlake, G. Szabó, M. F. Polz,
and E. J. Alm. 2012. "Population Genomics of Early Events in the Ecological
Differentiation of Bacteria." Science 336 (6077): 48-51.
Shatkay, H., S. Edwards, W. J. Wilbur, and M. Boguski. 2000. "Genes, Themes and
Microarrays." In Int. Conf on Intelligent Systems in Molecular Biology.
http://www.aaai.org/Papers/ISMB/2000/ISMB00-033.pdf.
Shi, L., L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J. Collins, F. De
Longueville, E. S. Kawasaki, and K. Y. Lee. 2006. "The MicroArray Quality Control
(MAQC) Project Shows Inter-and Intraplatform Reproducibility of Gene Expression
Measurements." Nature Biotechnology 24 (9): 1151-1161.
Shippy, R., S. Fulmer-Smentek, R. V. Jensen, W. D. Jones, P. K. Wolber, C. D. Johnson, P. S. Pine,
C. Boysen, X. Guo, and E. Chudin. 2006. "Using RNA Sample Titrations to Assess
Microarray Platform Performance and Normalization Techniques." Nature
Biotechnology 24 (9): 1123-1131.
Shokralla, Shadi, Jennifer L Spall, Joel F Gibson, and Mehrdad Hajibabaei. 2012. "Next-generation Sequencing Technologies for Environmental DNA Research." Molecular
Ecology 21 (8) (April): 1794-1805. doi:10.1111/j.1365-294X.2012.05538.x.
Sim, Siew Hoon, Yiting Yu, Chi Ho Lin, R. Krishna M. Karuturi, Vanaporn Wuthiekanun,
Apichai Tuanyok, Hui Hoon Chua, et al. 2008. "The Core and Accessory Genomes of
Burkholderia Pseudomallei: Implications for Human Melioidosis." PLoS Pathog 4
(10) (October 17): e1000178. doi:10.1371/journal.ppat.1000178.
Simon, M., H. P. Grossart, B. Schweitzer, and H. Plough. 2002. "Microbial Ecology of Organic
Aggregates in Aquatic Ecosystems." Aquatic Microbial Ecology 28 (2): 175-211.
Singh, A. H., D. M. Wolf, P. Wang, and A. P. Arkin. 2008. "Modularity of Stress Response
Evolution." Science's STKE 105 (21): 7500.
Slavov, N., E. M. Airoldi, A. van Oudenaarden, and D. Botstein. 2012. "A Conserved Cell
Growth Cycle Can Account for the Environmental Stress Responses of Divergent
Eukaryotes." Molecular Biology of the Cell 23 (10) (March 28): 1986-1997.
doi:10.1091/mbc.E11-11-0961.
Smillie, Chris S., Mark B. Smith, Jonathan Friedman, Otto X. Cordero, Lawrence A. David, and
Eric J. Alm. 2011. "Ecology Drives a Global Network of Gene Exchange Connecting
the Human Microbiome." Nature.
Smoot, Michael E, Keiichiro Ono, Johannes Ruscheinski, Peng-Liang Wang, and Trey Ideker.
2011. "Cytoscape 2.8: New Features for Data Integration and Network
Visualization." Bioinformatics (Oxford, England) 27 (3) (February 1): 431-432.
doi:10.1093/bioinformatics/btq675.
Soergel, David A W, Neelendu Dey, Rob Knight, and Steven E Brenner. 2012. "Selection of
Primers for Optimal Taxonomic Classification of Environmental 16S rRNA Gene
Sequences." The ISME Journal 6 (7) (July): 1440-1444. doi:10.1038/ismej.2011.208.
Sommer, D. D., A. L. Delcher, S. L. Salzberg, and M. Pop. 2007. "Minimus: a Fast, Lightweight
Genome Assembler." BMC Bioinformatics 8 (1): 64.
Sorber, K., C. Chiu, D. Webster, M. Dimon, J. G. Ruby, A. Hekele, and J. L. DeRisi. 2008. "The
Long March: a Sample Preparation Technique That Enhances Contig Length and
Coverage by High-throughput Short-read Sequencing." PloS One 3 (10): e3495.
Staron, Anna, and Thorsten Mascher. 2010. "General Stress Response in α-proteobacteria:
PhyR and Beyond." Molecular Microbiology 78 (2): 271-277.
Stewart, F. J., A. K. Sharma, J. A. Bryant, J. M. Eppley, E. F. DeLong, and others. 2011.
"Community Transcriptomics Reveals Universal Patterns of Protein Sequence
Conservation in Natural Microbial Communities." Genome Biology 12 (3): R26.
Stolyar, S., Q. He, M. P. Joachimiak, Z. He, Z. K. Yang, S. E. Borglin, D. C. Joyner, et al. 2007.
"Response of Desulfovibrio vulgaris to Alkaline Stress." Journal of Bacteriology
(October 5).
Stuart, J. M., E. Segal, D. Koller, and S. K. Kim. 2003. "A Gene-Coexpression Network for
Global Discovery of Conserved Genetic Modules." Science 302 (5643): 249-255.
Su, C., J. M. Peregrin-Alvarez, G. Butland, S. Phanse, V. Fong, A. Emili, and J. Parkinson. 2008.
"Bacteriome. Org-an Integrated Protein Interaction Database for E. Coli." Nucleic
Acids Research 36 (suppl 1): D632.
Suyama, M., D. Torrents, and P. Bork. 2006. "PAL2NAL: Robust Conversion of Protein
Sequence Alignments into the Corresponding Codon Alignments." Nucleic Acids
Research 34 (suppl 2): W609-W612.
Suzuki, S., T. Ishida, K. Kurokawa, and Y. Akiyama. 2012. "GHOSTM: A GPU-Accelerated
Homology Search Tool for Metagenomics." PLoS One 7 (5): e36060.
Szabo, Gitta, Sarah P. Preheim, Kathryn M. Kauffman, Lawrence A. David, Jesse Shapiro, Eric
J. Alm, and Martin F. Polz. 2012. "Reproducibility of Vibrionaceae Population
Structure in Coastal Bacterioplankton." The ISME Journal (November 22).
doi:10.1038/ismej.2012.134.
http://www.nature.com/ismej/journal/vaop/ncurrent/abs/ismej2012134a.html.
Szklarczyk, D., A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, et al.
2010. "The STRING database in 2011: functional interaction networks of proteins,
globally integrated and scored." Nucleic acids research (November 2).
Tai, S. L., V. M. Boer, P. Daran-Lapujade, M. C. Walsh, J. H. de Winde, J. M. Daran, and J. T.
Pronk. 2005. "Two-dimensional transcriptome analysis in chemostat cultures.
Combinatorial effects of oxygen availability and macronutrient limitation in
Saccharomyces cerevisiae." The Journal of biological chemistry 280 (1) (January 7):
437-447.
Tai, S. L., I. Snoek, M. A. H. Luttik, M. J. H. Almering, M. C. Walsh, J. T. Pronk, and J. M. Daran.
2007. "Correlation Between Transcript Profiles and Fitness of Deletion Mutants in
Anaerobic Chemostat Cultures of Saccharomyces Cerevisiae." Microbiology 153 (3):
877.
Tai, V., A. F. Y. Poon, I. T. Paulsen, and B. Palenik. 2011. "Selection in Coastal Synechococcus
(Cyanobacteria) Populations Evaluated from Environmental Metagenomes." PloS
One 6 (9): e24249.
Taoka, M., Y. Yamauchi, T. Shinkawa, H. Kaji, W. Motohashi, H. Nakayama, N. Takahashi, and
T. Isobe. 2004. "Only a Small Subset of the Horizontally Transferred Chromosomal
Genes in Escherichia Coli Are Translated into Proteins." Molecular & Cellular
Proteomics 3 (8): 780-787.
Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M.
Krylov, et al. 2003. "The COG database: an updated version includes eukaryotes."
BMC bioinformatics 4 (September 11): 41.
Tautz, Diethard, Hans Ellegren, and Detlef Weigel. 2010. "Next Generation Molecular
Ecology." Molecular Ecology 19: 1-3. doi:10.1111/j.1365-294X.2009.04489.x.
Team, R. Core. 2012. R: A Language and Environment for Statistical Computing. Vienna,
Austria. http://www.R-project.org.
Tettelin, H., V. Masignani, M. J. Cieslewicz, C. Donati, D. Medini, N. L. Ward, S. V. Angiuoli, et
al. 2005. "Genome Analysis of Multiple Pathogenic Isolates of Streptococcus
Agalactiae: Implications for the Microbial 'pan-genome'." Proceedings of the National
Academy of Sciences of the United States of America 102 (39): 13950-13955.
Thompson, J. R. 2005. "Genotypic Diversity Within a Natural Coastal Bacterioplankton
Population." Science 307 (5713) (February 25): 1311-1313.
doi:10.1126/science.1106028.
Thompson, J. R., M. A. Randa, L. A. Marcelino, A. Tomita-Mitchell, E. Lim, and M. F. Polz. 2004.
"Diversity and Dynamics of a North Atlantic Coastal Vibrio Community." Applied and
Environmental Microbiology 70 (7): 4103-4110.
Timberlake, Sonia. 2012. "SHERA Website." Almlab Software. Accessed October 27.
http://almlab.mit.edu/ALM/Software/Software.html.
Tirosh, I., A. Weinberger, M. Carmi, and N. Barkai. 2006. "A Genetic Signature of Interspecies
Variations in Gene Expression." Nature Genetics 38: 830-834.
Treangen, Todd J., and Eduardo P. C. Rocha. 2011. "Horizontal Transfer, Not Duplication,
Drives the Expansion of Protein Families in Prokaryotes." PLoS Genet 7 (1) (January
27): e1001284. doi:10.1371/journal.pgen.1001284.
Tuller, T., Y. Girshovich, Y. Sella, A. Kreimer, S. Freilich, M. Kupiec, U. Gophna, and E. Ruppin.
2011. "Association between translation efficiency and horizontal gene transfer
within microbial communities." Nucleic acids research 39 (11): 4743-4755.
Uniprot Consortium. 2011. "Ongoing and Future Developments at the Universal Protein
Resource." Nucleic Acids Res. 39: 214-219.
Vallania, F., E. Ramos, S. Cresci, R. D. Mitra, and T. E. Druley. 2012. "Detection of Rare
Genomic Variants from Pooled Sequencing Using SPLINTER." Journal of Visualized
Experiments: JoVE (64). http://www.ncbi.nlm.nih.gov/pubmed/22760212.
Venkateswaran, K., D. P. Moser, M. E. Dollhopf, D. P. Lies, D. A. Saffarini, B. J. MacGregor, D. B.
Ringelberg, et al. 1999. "Polyphasic Taxonomy of the Genus Shewanella and
Description of Shewanella Oneidensis Sp. Nov." International Journal of Systematic
Bacteriology 49 (2): 705-724.
Waltman, P., T. Kacmarczyk, A. R. Bate, D. B. Kearns, D. J. Reiss, P. Eichenberger, and R.
Bonneau. 2010. "Multi-species Integrative Biclustering." Genome Biology 11 (9):
R96.
Wang, L., S. Chen, K. L. Vergin, S. J. Giovannoni, S. W. Chan, M. S. DeMott, K. Taghizadeh, et al.
2011. "DNA Phosphorothioation Is Widespread and Quantized in Bacterial
Genomes." Proceedings of the National Academy of Sciences 108 (7): 2963-2968.
Warner, J. R. 1999. "The economics of ribosome biosynthesis in yeast." Trends in biochemical
sciences 24 (11) (November): 437-440.
Wetzel, Joshua, Carl Kingsford, and Mihai Pop. 2011. "Assessing the Benefits of Using Mate-pairs to Resolve Repeats in De Novo Short-read Prokaryotic Assemblies." BMC
Bioinformatics 12 (1) (April 13): 95. doi:10.1186/1471-2105-12-95.
Whitehead, A., and D. L. Crawford. 2006. "Neutral and adaptive variation in gene
expression." Proceedings of the National Academy of Sciences of the United States of
America 103 (14) (April 4): 5425-5430.
Whitney, Kenneth D, Bastien Boussau, Eric J Baack, and Theodore Garland Jr. 2011. "Drift
and Genome Complexity Revisited." PLoS Genetics 7 (6) (June): e1002092.
doi:10.1371/journal.pgen.1002092.
Whitney, Kenneth D, and Theodore Garland Jr. 2010. "Did Genetic Drift Drive Increases in
Genome Complexity?" PLoS Genetics 6 (8) (August).
doi:10.1371/journal.pgen.1001080.
Wilder, Cara N., Stephen P. Diggle, and Martin Schuster. 2011. "Cooperation and Cheating in
Pseudomonas Aeruginosa: The Roles of the Las, Rhl and Pqs Quorum-sensing
Systems." The ISME Journal 5 (8): 1332-1343. doi:10.1038/ismej.2011.13.
Yoder-Himes, D. R., P. S. G. Chain, Y. Zhu, O. Wurtzel, E. M. Rubin, J. M. Tiedje, and R. Sorek.
2009. "Mapping the Burkholderia Cenocepacia Niche Response via High-throughput
Sequencing." Proceedings of the National Academy of Sciences 106 (10): 3976.
Young, A. L., H. O. Abaan, D. Zerbino, J. C. Mullikin, E. Birney, and E. H. Margulies. 2010. "A
New Strategy for Genome Assembly Using Short Sequence Reads and Reduced
Representation Libraries." Genome Research 20 (2): 249-256.
Zagordi, O., A. Bhattacharya, N. Eriksson, and N. Beerenwinkel. 2011. "ShoRAH: Estimating
the Genetic Diversity of a Mixed Sample from Next-generation Sequencing Data."
BMC Bioinformatics 12 (1): 119.
Zarrineh, P., A. C. Fierro, A. Sanchez-Rodriguez, B. De Moor, K. Engelen, and K. Marchal.
2010. "COMODO: an adaptive coclustering strategy to identify conserved
coexpression modules between organisms." Nucleic acids research.
http://nar.oxfordjournals.org.
Zaslaver, A., A. E. Mayo, R. Rosenberg, P. Bashkin, H. Sberro, M. Tsalyuk, M. G. Surette, and U.
Alon. 2004. "Just-in-time Transcription Program in Metabolic Pathways." Nature
Genetics 36: 486-491.
Zerbino, D. R., and E. Birney. 2008. "Velvet: Algorithms for De Novo Short Read Assembly
Using De Bruijn Graphs." Genome Research 18 (5): 821-829.
Zerbino, D. R., G. K. McEwen, E. H. Margulies, and E. Birney. 2009. "Pebble and Rock Band:
Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-read De Novo
Assembler." PloS One 4 (12): e8407.
Zhang, Wei, Edward G. Dudley, and Joseph T. Wade. 2011. "Genomic and Transcriptomic
Analyses of Foodborne Bacterial Pathogens." In Genomics of Foodborne Bacterial
Pathogens, ed. Martin Wiedmann and Wei Zhang, 311-341. New York, NY: Springer
New York. http://www.springerlink.com/index/10.1007/978-1-4419-7686-4_10.
Zheng, Dongling, Chrystala Constantinidou, Jon L Hobman, and Stephen D Minchin. 2004.
"Identification of the CRP Regulon Using in Vitro and in Vivo Transcriptional
Profiling." Nucleic Acids Research 32 (19): 5874-5893. doi:10.1093/nar/gkh908.
Zhou, A., Z. He, A. M. Redding-Johanson, A. Mukhopadhyay, C. L. Hemme, M. P. Joachimiak, F.
Luo, Y. Deng, K. S. Bender, and Q. He. 2010. "Hydrogen Peroxide-induced Oxidative
Stress Responses in Desulfovibrio Vulgaris Hildenborough." Environmental
Microbiology.
Zinman, G. E., S. Zhong, and Z. Bar-Joseph. 2011. "Biological interaction networks are
conserved at the module level." BMC systems biology 5 (August 23): 134-0509-5-
134.
Zwick, M. E., S. J. Joseph, X. Didelot, P. E. Chen, K. A. Bishop-Lilly, A. C. Stewart, K. Willner, et
al. 2012. "Genomic Characterization of the Bacillus Cereus Sensu Lato Species:
Backdrop to the Evolution of Bacillus Anthracis." Genome Research 22 (8): 1512-1524.