Beiko_et_al_2008_043_R1

advertisement
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Running Head: LGT and Genome Phylogeny
The Impact of Reticulate Evolution on
Genome Phylogeny
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
Research Article
Robert G. Beiko1†, W. Ford Doolittle2, and Robert L. Charlebois2
1
Faculty of Computer Science, Dalhousie University, and Institute for Molecular
Bioscience / ARC Centre for Bioinformatics, Brisbane, Australia
2
GenomeAtlantic, Department of Biochemistry & Molecular Biology, Dalhousie
University, Halifax, Nova Scotia, Canada
†
Corresponding author:
Robert G. Beiko
Faculty of Computer Science, Dalhousie University
6050 University Avenue
Halifax, Nova Scotia, Canada B3H 1W5
Tel: 1 902 494 8043
Fax: 1 902 492 1517
Email: beiko@cs.dal.ca
Keywords: Evolutionary simulation, lateral genetic transfer, genome phylogeny
Abbreviations: LGT, lateral genetic transfer; PDS, phylogenetically discordant
sequence; SPR, subtree prune and regraft
1
1
2
ABSTRACT
Genome phylogenies are used to build tree-like representations of
3
evolutionary relationships among genomes. However, in condensing the phylogenetic
4
signals within a set of genomes down to a single tree, these methods generally do not
5
explicitly take into account discordant signals arising due to lateral genetic transfer.
6
Since conflicting vertical and horizontal signals can produce compromise trees that do
7
not reflect either type of history, it is essential to understand the sensitivity of inferred
8
genome phylogenies to these confounding effects. Using replicated simulations of
9
genome evolution, we show that different scenarios of lateral genetic transfer have
10
significant impacts on the ability to recover the ‘true’ tree of genomes, even when
11
corrections for phylogenetically discordant signals are used.
12
13
2
1
An important motivator in many phylogenetic analyses is that the branching
2
relationships inferred from a set of orthologous sequences may serve as a direct
3
indicator of organismal phylogeny. The best example of this is found in the use of 16S
4
and 18S ribosomal DNA sequences for phylogenetic classification of organisms
5
(Woese et al., 1990). The recognition that a single set of putatively orthologous
6
sequences may not yield an accurate depiction of organismal descent due to violations
7
of the phylogenetic model used, insufficient phylogenetic signal, cryptic paralogy or
8
lateral genetic transfer (LGT) led to a partial abandonment of single-gene methods in
9
prokaryotes. In their place emerged a plethora of methods that depend on much
10
greater data availability: concatenated sequence phylogenies (Baldauf et al., 2000;
11
Brochier et al., 2004), supertrees and networks constructed from many individual
12
phylogenetic trees (Daubin et al., 2001; Creevey et al., 2004; Beiko et al., 2005;
13
Holland et al., 2005), and whole-genome methods that typically simplify genetic data
14
to yield an easily computed summary of relationships between genomes.
15
Many genome properties have been used as basic characters for the inference
16
of genome-genome relationships, including gene content (Snel et al., 1999; Lake and
17
Rivera, 2004), gene order (Sankoff et al., 1992; Belda et al., 2005) and properties of
18
the distribution of sequence similarities between genomes (Clarke et al., 2002; Auch
19
et al., 2006). All of the above analyses assume that convergent evolution of the
20
character under consideration is unlikely, when compared with other genome
21
properties such as (for example) G+C content or synonymous codon usage. The
22
clearest violation of this non-convergence assumption is in the reduction of parasitic
23
genomes: since genomes tend to lose many of the same genes when undergoing
24
genome reduction, using gene absence as a parsimoniously informative character
25
leads to artifactual grouping of small genomes in gene content trees (Wolf et al.,
3
1
2001; Kunin et al., 2005). Other, more subtle biases may influence these methods as
2
well: if distantly related organisms have similar biases in nucleotide or amino acid
3
usage due to e.g. mutational biases, nutrient limitations or environmental conditions,
4
then some of their genes may appear less distant from one another than the true
5
divergence time would suggest (Weisburg et al., 1989). Lateral genetic transfer,
6
which can introduce sequences into a genome from organisms with any degree of
7
relatedness, yields an apparent convergence of gene content, similarity, or sometimes
8
order, but in fact violates the fundamental assumption in phylogenetic methods that
9
evolution is tree-like.
10
An understanding of the extent and impact of LGT is crucial to interpreting
11
the relationships shown in a genome tree. One approach is to identify sequences that
12
are discordant using a surrogate method, and either downweight their contribution to
13
the genome tree, or eliminate them entirely from consideration (Clarke et al., 2002;
14
Dutilh et al., 2004; Gophna et al., 2005). A limitation of this approach is that different
15
methods for identifying conflicting genes tend to identify different sets of genes
16
(Ragan, 2001; Ragan, 2006), so the choice of homology criterion and filtering method
17
can have a substantial impact on the inferred genome history. The fundamental
18
problem is that without an accurate estimate of the extent and source of LGT within a
19
given data set, it is very difficult to assess the impact of LGT on the final genome
20
tree. In many published analyses, there is strong reason to suspect that LGT has
21
influenced the position of certain lineages. In some cases, taxa that are thought to
22
participate in frequent transfer are drawn towards one another in the tree. This effect
23
is suggested in the case of the archaeal genus Thermoplasma, which concatenated
24
informational gene phylogenies suggest to be secondarily non-methanogenic
25
(Brochier et al., 2005), but appears as an early-branching euryarchaeal or archaeal
4
1
lineage in many published studies including Wolf et al. (2001), Beiko et al. (2005),
2
and Gophna et al. (2005). There is strong evidence for extensive LGT from the
3
thermoacidophilic crenarchaon Sulfolobus to Thermoplasma, which may produce a
4
compromise in the positioning of Thermoplasma in aggregated trees. In some cases,
5
transfer partners appear as sisters in genome trees, for instance when Arabidopsis
6
appears as a sister taxon to the cyanobacteria in genome trees of metabolic genes
7
(Charlebois et al., 2004).
8
While simulation has been used extensively in the investigation and validation
9
of methods in molecular evolution, such techniques are only now being applied to the
10
study of genome evolution (Zhaxybayeva et al., 2006; Galtier, 2007). Part of the
11
reason for this is the relative novelty of genome sequences, but another barrier to
12
meaningful genome simulation has been the difficulty in merging traditional models
13
of sequence change with evolutionary scenarios and interactions within and among
14
genomes. EvolSimulator (Beiko and Charlebois, 2007) has been developed to
15
simulate the evolutionary phenomena most relevant to the study of lateral genetic
16
transfer, including genome-specific mutational and selective regimes, gene content
17
evolution via gene duplications, losses and LGT, and organismal evolution with
18
speciation, extinction, and competition for simulated niches and habitats. Here we use
19
EvolSimulator to generate populations of genomes under different scenarios and
20
frequencies of lateral genetic transfer, to allow a precise delineation of the extent to
21
which different modes of LGT can impact inferred genome histories. The weighting
22
schemes introduced by Gophna et al. (2005), based on the observed phylogenetic
23
concordance or discordance of proteins, can have a substantial impact on the genome
24
tree, and here we assess the effectiveness of these schemes in improving the tree of
25
genomes that is recovered.
5
1
2
METHODS
3
Evolutionary Simulations
4
EvolSimulator version 2.0.4 (Beiko and Charlebois, 2007) was used to evolve
5
populations of genomes with a consistent set of constraints on genomic properties and
6
sequence substitution, but different types of LGT regime. Each simulation began with
7
a single, ancestral genome having 1000 unrelated genes (240-1500 nt in size), from
8
which a population of genomes would evolve over 5000 iterations. Point mutations
9
were assessed against the standard genetic code as described in Beiko and Charlebois
10
(2007), with amino acid acceptance probabilities proportional to the WAG matrix
11
(Whelan and Goldman, 2000). Insertion and deletion events were not simulated.
12
Genomes were permitted to drift in size by loss and gain (by duplication, and if
13
prescribed, by lateral acquisition) of genes, to as few as 500 genes or as many as
14
3500. Speciation and extinction events were balanced such that the simulation
15
maintained between 50 and 60 genomes at any given time, following an initial growth
16
phase.
17
EvolSimulator constructs a user-defined number of “niches” within which
18
genomes reside, the occupation of which is competitively determined by relative gene
19
complements. Each niche has a finite number of spaces that are identical in terms of
20
required genes, potentially limiting the number of genomes that can exploit that niche.
21
Habitats comprise one or more niches, and also impose specific gene requirements on
22
a resident genome. In these simulations, we distributed 1000 spaces evenly among
23
100 niches, and these 100 niches among 10 habitats. For a genome to spend time in a
24
niche, it must possess all of the genes required by a niche and by its enclosing habitat,
25
such necessities being randomly chosen by EvolSimulator at the start of the run. In
6
1
addition to this qualitative requirement, the quantitative usefulness of genes within the
2
current niche regulates their propensity to be retained or lost, as well as the amount of
3
time a genome may spend within that niche. One can, though here we did not, bias
4
speciation/extinction probabilities according to overall genomic fitness.
5
We explored five independent LGT scenarios in this study: (a) no LGT, (b)
6
random LGT, (c) relations-biased LGT (occurring more often with closer relatives),
7
(d) gene content-biased LGT (occurring more often between genomes with more-
8
similar ortholog constitution), and (e) habitat-restricted LGT (occurring only between
9
genomes concurrently residing in the same habitat). A successful transfer event
10
according to the biasing criterion always led to uptake of the transferred gene; if an
11
ortholog was already present in the recipient genome, it was replaced with the
12
incoming gene. Although we did not explore blends of these scenarios here, we did
13
explore three rates of LGT: in separate runs of simulation sets (b)-(e), the mean
14
number of attempted events E was set to 10, 50, and 250 nominal events per iteration.
15
The single run (a), plus three runs each of (b)-(e) with variable E, yielded a set of 13
16
evolutionary scenarios. By using the same random number seed for each simulation in
17
a set, we were able to exactly replicate the speciation and extinction history (i.e., each
18
simulation had the same reference ‘organismal’ tree) and lineage-specific mutation
19
biases, thus ensuring that differences among the 13 simulations would be due only to
20
the LGT model that was used.
21
Five sets of replicate runs of the thirteen scenarios were executed, with each
22
set of replicates employing its own seed for pseudorandom number generation, for a
23
total of 65 simulation runs. The use of a consistent random number seed across every
24
run within a given replicate ensured the same history of speciation and extinction, as
25
well as the same mutational biases within each lineage. For a complete summary of
7
1
all parameters used in the simulation, please refer to the EvolSimulator configuration
2
file in the Supplemental Material (available online at
3
http://www.systematicbiology.org).
4
Supplementary Figure 1 (also available at http://www.systematicbiology.org)
5
illustrates the complete set of speciation and extinction events for the entire 5000
6
iterations of one of our replicate runs. Extant genomes and closely related extinct
7
lineages have been assigned to eight monophyletic groups α-θ: the precise delineation
8
of these groups is arbitrary, but they all diverged in the earliest 10% of simulated
9
iterations. Colors have been assigned to phylum-level groups in order to highlight
10
cases of invasive intermingling in the inferred phylogenies shown below.
11
12
13
Inference of Genome-Scale Phylogeny
Normalized BLASTP-based phylogeny was performed in a manner similar to
14
Clarke et al., (2002), with the distance between every pair of genomes equal to 1.0
15
minus the mean normalized BLASTP 2.2.2 (Altschul et al., 1997) distance of all
16
pairwise reciprocal best matches (RBMs) between the two genomes. Only RBMs with
17
BLASTP e-values of 1.0 × 10-5 or less were used to build this matrix. The minimum
18
evolution algorithm implemented in the November 28, 2003 release of FastME
19
(Desper and Gascuel, 2002) was used to build a phylogenetic tree from the matrix of
20
all genome pairwise distances. Statistical support for each distance tree was assessed
21
by resampling the set of RBMs with replacement for each pair of genomes to generate
22
bootstrapped distance matrices, with the corresponding trees constructed using
23
FastME. Resampling was performed 100 times on each data set, to yield support
24
values for each bipartition in a given genome tree.
8
1
Phylogenetically discordant sequences (PDS) disagree with the majority
2
phylogenetic signal, and are frequently observed in real data due both to violations of
3
phylogenetic assumptions and to bona fide instances of LGT. To limit the effect of
4
phylogenetic discordance on inferred genome trees, we applied the PDS procedure
5
described in Clarke et al. (2002) to the set of genomes obtained from each simulation
6
in replicate 1 (other replicates were not examined). Each individual protein has an
7
associated PDS score, which is calculated by comparing the ranking of its similarity
8
to putative orthologs in a list of other genomes (u-values), versus a ranking based on
9
the median of all pairs of putative orthologs from every other genome (w-values). The
10
Spearman rank correlation thus obtained is compared to a large number of
11
correlations obtained by randomizing the comparisons, to obtain a p-value. The mean
12
of the PDS p-values associated with a given pair of RBM proteins could then be used
13
to weight the contribution of that pair to the overall distance measure between the two
14
relevant genomes (Gophna et al., 2005). The unweighted distance D between a pair of
15
genomes A and B with a total of n RBMs is given by the following formula:
16
1 n
DA, Bunweighted  1   S ( AB) i
n i 1
17
(1)
18
19
Where S(AB)i is the normalized BLASTP score for the ith RBM between the pair of
20
genomes. The concordance-weighted distance is modified by the PDS p-values P(AB)i
21
defined above:
22
23
DA, Bconc
1 n
1
S ( AB)i P( AB)i

nPc i 1
(2)
9
1
2
Where nPc is the sum of all P(AB)i between a pair of genomes. The discordance-
3
weighted distance between a pair of genomes is computed in a similar manner,
4
replacing all P(AB)i with (1 - P(AB)i):
5
DA, Bdisc
6
1
1
nPd
n
 S ( AB) (1  P( AB) )
i 1
i
i
(1.3)
(3)
7
8
With nPd equal to the sum of all (1 - P(AB)i) between a pair of genomes.
9
10
11
Quantifying Differences Between Inferred Genome Trees
Each inferred genome tree was compared to the true ‘organismal’ history, to
12
assess the accuracy of the inferred relationships. A range of bootstrap support
13
thresholds between 0.50 and 1.00 were used as minimal criteria for strongly supported
14
relationships in the inferred genome trees: in several analyses, conservative (0.90) and
15
liberal (0.70) thresholds were contrasted. Since the organismal reference tree is
16
completely resolved, strongly supported bipartitions present in a given genome tree
17
must either be congruent or incongruent with the reference tree. Disagreements
18
between trees were expressed in terms of the total count of concordant and discordant
19
bipartitions, and in terms of the proportion of all resolved bipartitions that were
20
concordant.
21
EEEP (Beiko and Hamilton, 2006) was used to recover the distance between
22
the organismal reference and each inferred genome tree in turn. The distance used
23
characterizes the number of subtree prune-and-regraft (SPR: Swofford and Olsen,
24
1990) operations that need to be performed on the organismal reference tree, to obtain
10
1
the inferred genome tree. Each SPR operation in an edit path implies a donor/recipient
2
pairing, and in the context of single-molecule phylogeny identifies a putative transfer
3
event from one branch to another. SPR events from genome trees may identify major
4
highways of gene sharing, either directly if a given taxon or clade is paired with a
5
major transfer partner, or indirectly if the position of a taxon in the tree is intermediate
6
between its transfer partners and its correct position in the organismal tree. It is
7
therefore useful to characterize incongruence both quantitatively in terms of edit
8
distance, and qualitatively in terms of the nature of the proposed discordant
9
relationships, given complete knowledge of the ‘true’ trees and scenario of LGT.
10
11
RESULTS
12
Species Tree, Genome Divergence, and Gene Exchange
13
Figure 1 shows the phylogenetic relationships between extant organisms at the
14
end of the replicate 1 simulations. Replicate 1 had the largest population of surviving
15
genomes at the end of the simulation, with 56 extant lineages. The other four
16
replicates had a minimum of 48 and a maximum of 54 genomes at iteration 5000.
17
Each replicate had an initial phase of rapid speciation to reach the target population
18
size of 50 genomes, but following this phase the extant genomes could be separated
19
into lineages supported basally by relatively long branches, appearing as logically
20
distinct phyla. Protein sequences that diverged at the beginning of the simulation were
21
still recognizably homologous.
22
The relationship between genome divergence time and mean normalized
23
BLASTP distance for each pair of genomes is shown in Figure 2. The mean
24
normalized BLASTP distance increases quickly in the first 250 iterations after
25
divergence, then increases more gradually to the maximum number of iterations,
11
1
albeit with high variance. Distances between pairs of genomes whose last common
2
ancestor occurred near the beginning of the simulation ranged between approximately
3
0.6 and 0.7. Phylum-level divergence between real pairs of genomes is approximately
4
0.7 – 0.75 from Clarke et al. (2002), so the maximum level of divergence among
5
genomes in this analysis is roughly equivalent to that seen among bacterial phyla.
6
When the mean number of LGT events per iteration E was set to 250, the rate
7
of LGT was sufficiently high that every gene created at the beginning of the
8
simulation was transferred at least once in its history. In the random LGT scenario
9
with E = 250, the distribution of historical transfers for a sample of 10% of the genes
10
from genome 314 was examined. At the end of the simulation (iteration 5000), each
11
gene had been transferred an average of 70 times during the course of its simulated
12
evolution. No gene had been transferred fewer than 31 times, and the maximum
13
number of transfers in the history of any given gene was 103.
14
15
16
Genome Trees Under Different LGT Scenarios
BLASTP-based genome trees are strictly bifurcating, with associated
17
bootstrap proportions (BP) reflecting the frequency that a given bipartition was
18
observed within the set of bootstrap replicate trees. Figure 3 shows, for a series of
19
thresholds, the proportion of bipartitions supported at or above that threshold that are
20
either concordant or discordant with the true genome tree, e.g. that either support or
21
conflict with relationships in the true tree (Wilkinson et al., 2005), recovered from the
22
five replicate simulations performed without LGT. In each of the five cases, there was
23
a BP threshold at and above which no discordant bipartitions were present in the
24
inferred genome tree. This minimum threshold ranged from 0.65 in replicate 2 to 0.90
25
in replicate 5, with discordance in the latter case due to a misplaced deep branch with
12
1
bootstrap support of exactly 0.85.. High BP thresholds yielded exclusively concordant
2
relationships, but had a negative impact on the number of resolved bipartitions, with
3
fewer than 40% of all bipartitions resolved at a BP threshold of 1.00 for each of the
4
five trees. Lowering the BP threshold from 1.00 to 0.50 increased the number of
5
resolved bipartitions by approximately a factor of two, at the expense of accepting
6
some discordant relationships: across the five replicates, between 3% and 16% of the
7
resolved relationships at a BP threshold of 0.50 were not consistent with the original
8
genome tree.
9
The genome tree inferred from the LGT-free simulation (E = 0) in replicate 1
10
is shown in Figure 4. The rapid increase in normalized BLASTP distances in the
11
period immediately following genome divergence (shown in Fig. 2) is evident here in
12
the relatively long branches separating recently diverged taxa. Nonetheless, seven of
13
the eight main monophyletic groupings (α, β, γ, ε, ζ, η, and θ) were correctly
14
reconstructed. While several of these groups were reconstructed with bootstrap
15
support ≥ 90%, there was some difficulty in recovering the deepest branches, with the
16
deepest-branching genome from group δ separated from the other two genomes in this
17
group in spite of a relatively long supporting internal branch in the true tree, and low
18
bootstrap support for groups ζ and θ, likely reflecting the frequent intrusion of
19
genome 205 into group θ. While genome 205 does not branch within group θ in the
20
tree shown in Figure 4, a tree computed from the same starting distance matrix but
21
using the Fitch-Margoliash method (Fitch and Margoliash, 1967) displayed this
22
intermingling of phyla. The three groups that were not recovered or recovered with
23
weak support were also the three earliest groups to diverge in Figure 1, suggesting
24
that groups of this age or older may not be recoverable. Within some groups there
25
were discrepancies in the recovered branching order, however these inconsistencies
13
1
had low associated bootstrap values. The branching order of major groups was also
2
unreliable; again incorrect relationships were associated with bootstrap values less
3
than 70%.
4
Each of the 13 genome histories from replicate 1 was used to construct a
5
bootstrapped genome phylogeny. Table 1 shows the degree of strongly supported
6
concordance and discordance that was found for each reconstructed tree relative to the
7
original genome history. As described above, the tree of genomes simulated without
8
LGT had no discordant bipartitions with bootstrap support of 70% or greater,
9
although only 28 of a possible 51 internal bipartitions had a bootstrap support at least
10
this high, and only 25 bipartitions were supported at a bootstrap threshold of 90%.
11
Among simulations where LGT occurred more often between closely related genomes
12
(relations-biased), relatively low rates of LGT yielded resolution and concordance
13
values similar to those observed in the absence of LGT. However, the degree of
14
resolution actually increased with increasing rates of LGT; when the mean rate of
15
LGT E was 250 nominal events per iteration, 40 bipartitions had bootstrap support of
16
70% of greater; five of these bipartitions were incongruent with the true tree. An
17
increase in the number of resolved bipartitions was also observed at a bootstrap
18
threshold of 90%, with the number of discordant bipartitions increasing from 1 to 2.
19
Consequently the gain in resolution appears to reinforce the simulated genome
20
phylogeny; even if the histories of shared genes do not exactly match the reference
21
tree, they can still aid in its recovery.
22
In all of the other three types of LGT scenario that were simulated, there was a
23
decrease in the number of concordant nodes between E = 10 and E = 250 attempted
24
transfers per iteration, although in some cases there were more concordant nodes at E
25
= 50 than at E = 10. Under content-biased and random LGT in this replicate, the
14
1
number of discordant bipartitions dropped to zero as E increased to 250, so the 20
2
strongly supported bipartitions that remained in each of these simulations were all in
3
agreement with the true tree. However, a low rate of habitat-biased LGT was
4
sufficient to introduce discordant relationships into the inferred tree; with E = 10 only
5
31 out of 38 bipartitions with BP ≥ 0.7 were concordant with the true tree (Fig. 5).
6
And unlike the trees built from content-biased or random LGT, the proportion of
7
strongly supported and discordant bipartitions increased with increasing E.
8
9
Nonparametric Kruskal-Wallace tests were performed across all five replicates
(which are summarized in Fig. 6) to assess the significance of differences in %
10
concordance distribution among LGT scenarios. These tests were applied
11
independently to a total of six combinations of E (10, 50, and 250) and bootstrap
12
threshold (70% in Fig. 6a and 90% in Fig. 6b). The difference in distribution of
13
percent concordance was not significant or just below an alpha threshold of 0.05 for E
14
= 10 (BP threshold of 70: p = 0.033; BP threshold of 90: p = 0.73), but p-values of
15
0.015 and 0.005 were obtained for both BP thresholds with E = 50, and p < 0.005
16
when E = 250. Pairwise post-hoc Nemenyi tests (Hollander and Wolfe, 1999) showed
17
that only habitat-biased LGT produced trees that were significantly less concordant
18
than any of the other groups. In each of the four sets of tests with E equal to 50 or 250
19
and a BP threshold of 0.7 or 0.9, the distribution of concordances from the five
20
habitat-biased replicates differed from that of the random and gene content-biased
21
trials, but was statistically indistinguishable at α < 0.05 from that of the five relations-
22
biased replicates. Since a speciation event yields two lineages that are adapted to the
23
same set of niches, recently diverged genomes have a better than random probability
24
of occupying the same habitat. This effect will be attenuated by independent habitat
15
1
switching and gene loss, but may be responsible for the similar distributions seen
2
here.
3
Across all sets of unweighted genome trees, the bootstrap support for each
4
node showed a strong relationship with the length of the branch supporting that node.
5
Figure 7 shows this relationship for three sets of genome trees, corresponding to the
6
five replicates simulated without LGT and the five replicates simulated with either
7
habitat-biased or random LGT and E equal to 250 attempted events per iteration.
8
While the relationship between branch length and bootstrap support is similar, the
9
trees simulated with random LGT have shorter branches, and therefore lower overall
10
bootstrap support. Random LGT leads to a greater average similarity among genomes,
11
compressing the tree and confounding relationships that were clearly resolved in
12
simulations lacking LGT. Conversely, habitat-directed LGT does not produce the
13
same ‘squashing’ of early branches in the tree, and the discordant nodes recovered
14
have longer supporting branches and higher bootstrap support.
15
While a single branch migration within a tree can disrupt potentially many
16
bipartitions, multiple SPR operations were needed to reconcile all of the habitat-
17
biased LGT trees (resolved at a bootstrap threshold of 70%) with the true tree of
18
genomes (Table 1). Consequently the discordance of these trees was not due to a
19
single branch migration, but to several such events.
20
21
22
Concordance and Discordance-Weighted Trees
In separate analyses, the normalized BLASTP scores between proteins from
23
replicate 1 were subjected to concordance and discordance weighting prior to distance
24
matrix reconstruction. Under a concordance weighting scheme, proteins with ranked
25
similarities to putative orthologs in other genomes will have a high weight if the
16
1
ranking is similar to the overall ranking of genome similarities. If a consistent pattern
2
of similarities exists for many proteins from a given simulation, then these proteins
3
will exert a strong influence on the overall genome similarity ranking, and will make
4
a disproportionately high contribution to the distance matrix as a consequence
5
(Gophna et al., 2005).
6
The contribution from a subset of proteins with strong adherence to the
7
genome similarity ranking might be expected to yield more well-supported
8
bipartitions than are seen in the unweighted case. Figure 8 shows that this is true for E
9
= 0, 10, and 50: the total number of bipartitions resolved at a BP threshold of 0.7
10
increased in all nine simulations (Fig. 8a). However, in four out of nine of these trials,
11
there is a decrease in the proportion of total concordant bipartitions (Fig. 8b), so the
12
additional information that emerges from concordance weighting is sometimes in
13
disagreement with the true tree. The effect of concordance weighting when E = 250 is
14
less clear: the number of resolved bipartitions decreases in the unbiased (random
15
LGT) simulation, and increases slightly in the other three simulations. In none of the
16
replicates did PDS scores correlate significantly with the number of historical
17
transfers for a given gene: many factors such as paralogy and compositional artifacts
18
are intentionally captured in the PDS weighting process (Clarke et al., 2002), and a
19
simple model of LGT frequency may not be sufficient to gauge its expected impact on
20
phylogenetic discordance.
21
The effect of discordance weighting at low levels of LGT (E = 0 and E = 10)
22
is a near-complete loss of all strongly supported bipartitions. All five simulations with
23
E ≤ 10 produced trees with a maximum of eight bipartitions with a bootstrap value of
24
70% or greater: these bipartitions supported only the most recent divergences in the
25
tree, although they were not invariably in agreement with the true tree. A drop in the
17
1
number of resolved bipartitions relative to the unweighted trees was also seen when E
2
= 50 and E = 250. In 7 out of 8 of these simulations, 100% of the resolved bipartitions
3
were concordant, with the only exception being the habitat-biased simulation at E =
4
250.
5
6
DISCUSSION
7
Effectiveness of the Normalized BLASTP Method and Refinements
8
9
By including confounding effects such as gene duplications and losses, and
changes in the underlying mutational biases of genomes, EvolSimulator produces
10
datasets that can violate the assumptions of many phylogenetic reconstruction
11
methods. Drifts in genomic G+C bias could potentially confound the BLASTP
12
analysis by making distantly related genomes appear more similar to one another if
13
they share similar genomic G+C biases: this effect could potentially be mirrored by
14
compositional or functional convergence in real genomes. In spite of this, trees
15
constructed using the normalized BLASTP method were able to correctly recover
16
(with BP ≥ 0.7) >60% of the bipartitions in the original simulated tree when LGT was
17
absent from the simulation. The earliest, closely spaced, branches in the tree, which
18
describe the relationships between the major groupings of simulated taxa, were
19
weakly supported and generally incorrect. This result provides an interesting contrast
20
with the genome trees of Gophna et al. (2005), where nearly all recovered bipartitions
21
had an associated BP of 100%. It is not uncommon for genomic phylogenies to offer
22
strong support for deep relationships (Wolf et al., 2001), but these relationships may
23
be reflective of compositional biases and unequal rates of evolution, which can
24
overwhelm what little phylogenetic signal remains at great depths (Jermiin et al.,
25
2004). The non-linearity of the relationship between evolutionary distance and
18
1
normalized BLASTP distance between pairs of genomes may influence the recovery
2
of correct relationships: further refinements to the normalized BLASTP method could
3
include transformations of the normalized BLASTP distance and reweighting of
4
distances between individual pairs of orthologs to yield linear relationships with lower
5
variance.
6
Concordance weighting of normalized BLASTP scores tended to increase the
7
overall statistical support for the recovered genome phylogeny, but with the drawback
8
of increasing the support for a few discordant bipartitions to a point above the
9
threshold of significance. In cases where true history and bias (compositional or
10
otherwise) conflict in the relationships they support, concordance weighting will
11
favour one solution over the other. It appears that in most but not all cases in this
12
simulation, the balance favoured the true vertical history. Ultimately, it may be
13
worthwhile to investigate the role of bias in detail, and examine the combined effects
14
of concordance weighting and accounting for composition using modified
15
evolutionary models (Jayasawal et al., 2007) or residue recoding (Phillips et al., 2004;
16
Susko and Roger, 2007).
17
In these simulations, discordance weighting did not emphasize major
18
pathways of LGT: in most cases the effect of weighting for discordance was to
19
drastically decrease the statistical support for most relationships in the recovered tree.
20
The most likely reason for this effect is the nature of vertical and lateral relationships
21
in these simulations: while there is only one vertical history, in every class of
22
simulation many different types of lateral relationships are expected. This is true even
23
for the habitat-directed scenario, where organisms in our simulations could switch
24
between habitats with frequencies that are likely higher than the extreme cases (e.g.,
25
Thermoplasmatales) described in the Introduction and below. In all cases, the
19
1
resulting lateral relationships would be better represented with a network rather than a
2
tree of putative lateral histories.
3
4
5
The Effects of Directed vs. Random Exchanges
Two quantities were examined to assess the decay of phylogenomic signal
6
with different rates and scenarios of LGT: the total number of strongly supported
7
bipartitions in the reconstructed genome tree, and the proportion of these bipartitions
8
that were in agreement with the true tree. When LGT events preferentially occurred
9
between closely related genomes, the overall effect was to increase the total number
10
of resolved bipartitions; most (but not all) of these gained bipartitions were
11
concordant. These events will decrease the effective time since divergence, with
12
many proteins subjected to orthologous replacement.
13
Although gene content-biased LGT might have been expected to yield results
14
similar to relations-biased LGT, in most cases the degree of resolution and
15
concordance from the content-biased simulations was most similar to that obtained
16
from random LGT. In fact, content-biased LGT and random LGT are equivalent if the
17
gene content is identical among all organisms. While gene content did vary across
18
organisms in these simulations, the variation appears to have been neither sufficiently
19
large nor consistent across lineages to distinguish the content-biased simulations from
20
the random ones. This observation highlights an interesting distinction between our
21
simulations and the expected empirical case as articulated by the Complexity
22
Hypothesis (Jain et al., 1999): in our simulations, all genes were equally transferable,
23
regardless of their importance to the organism or distribution across the simulated tree
24
of life. The Complexity Hypotheses states that proteins involved in large complexes
25
are resistant to transfer: since most of these proteins are informational and ubiquitous
20
1
(or nearly so) in living organisms, the most frequently occurring proteins are therefore
2
least likely to be transferred. Reducing the contribution of essential and ubiquitous
3
genes to the calculation of shared gene content would have enhanced the differences
4
among genomes, and likely led to LGT scenarios that were more strongly influenced
5
by phylogenetic relatedness and possibly habitat as well.
6
Unlike the other classes of simulation examined here, habitat-biased
7
simulations always produced genome trees with features that were strongly supported
8
but discordant with the true tree. While LGT events can have disruptive influences on
9
the recovered genome phylogeny, a random distribution of events should (absent
10
other biases) merely reduce the statistical support for the correct phylogeny. But if
11
gene sharing events occur preferentially between certain lineages, as is the case with
12
habitat-directed LGT, then the lateral signal may produce a strongly-supported
13
alternative to the vertical topology. The ultimate effect will likely not be a clear co-
14
location of frequently exchanging taxa in the recovered genome tree, but rather a
15
phylogenetic compromise that is influenced by both the vertical and lateral histories,
16
but in fact displays neither. Emergent features of supertrees have sometimes been
17
described as ‘signal enhancement’ (Bininda-Emonds et al, 1999), but in this case the
18
relationship that emerges is not reflective of any true signal.
19
A survey of published genome phylogenies, supermatrices and supertrees
20
shows the potential that many such effects may be influencing recovered trees. An
21
example reported in the genome phylogeny work of Gophna et al (2005) concerns the
22
positioning of the Thermoplasmatales, which are typically placed near the base of the
23
Archaea in genome phylogenies. It appears that the positioning of Thermoplasmatales
24
reflects a compromise between its vertical history (shared with other Euryarchaeota)
25
and a habitat-directed highway of gene sharing with the thermoacidophilic
21
1
crenarchaeal genus Sulfolobus. A breakdown of recovered quartets from the
2
phylogenomic analysis of Beiko et al. (2005) identified groups of Thermoplasma
3
proteins with strong affinities for both the hypothetical vertical and lateral histories,
4
with very little support for the reported positioning of this group as basal to the
5
Archaea.
6
The datasets simulated here could be used to validate other phylogenomic
7
methods as well, including concatenation, supertrees and parsimony methods such as
8
conditioned reconstruction (Lake and Rivera, 2004). All of these methods are
9
sensitive to phylogenetic incongruence, but some approaches may be better able to
10
extract the historical, ‘vertical’ signal from among a set of discordant relationships (if
11
this is indeed the goal). For instance, supertree approaches may be preferable to
12
sequence concatenation, since orthologous gene sets with weak and potentially
13
misleading phylogenetic signal can be removed from the analysis if the trees they
14
generate have weak statistical support. A concatenated alignment would need to
15
consider every site from these genes, and might therefore be expected to be more
16
susceptible to ‘compromise’ topologies.
17
Other groups potentially affected by such vertical / lateral conflicts include the
18
hyperthermophiles Aquifex aeolicus and Thermotoga maritima: there is strong
19
phylogenetic and physiological (see e.g. Cavalier-Smith, 2002) evidence that A.
20
aeolicus is ancestrally an ε-proteobacterium, and T. maritima a low G+C Firmicute.
21
Their frequent positioning together at the base of the tree appears to arise from a
22
combination of factors, including many putative highways of gene sharing among
23
thermophiles (including T. maritima with Archaeal groups such as Pyrococcus), and
24
biased nucleotide and amino acid compositions that influence the results of BLAST
25
searches and phylogenetic analyses. When photosynthetic organisms from different
22
1
domains are included in phylogenomic analyses, endosymbiotic events may influence
2
the recovered phylogeny (Charlebois et al., 2004; Gophna et al., 2005). Many more
3
strongly conflicting pathways of vertical and lateral inheritance may exist:
4
examination of environmental organisms with shared habitats using metagenomic
5
methods (Musovic et al., 2006) has the potential to clarify the role of LGT in the
6
environment.
7
While the experiments performed in this analysis by no means recapture the
8
entire complexity of microbial evolution, they illustrate the confounding effects of
9
different scenarios of lateral genetic transfer and have important consequences for our
10
ultimate ability to recover a tree (or network) of microbial life. It is perhaps startling
11
that the very high incidence of LGT in some our simulated data sets (particularly the
12
random LGT simulation with E = 250) does not completely erase the vertical
13
evolutionary signal in the data. The surprising persistence of vertical signal (albeit
14
with relatively greater loss of ancient relationships) supports the idea of laterally
15
transferred genes as "fibres in a rope" (Zhaxybayeva et al., 2004) that carry different
16
individual histories, but together constitute a cohesive picture of vertical evolution.
17
Such a picture must depend (as does a supertree) on different (locally untransferred)
18
genes carrying vertical signal in different parts of the tree, and demands further
19
elucidation through modeling. However, the introduction of bias into LGT scenarios
20
leads to the conflation of different vertical and lateral signals in the resulting genome
21
phylogeny. The true nature of LGT regimes in the wild will depend on mechanistic
22
and selective factors that influence the probability of success of individual LGT
23
events. If the majority of persistent LGT (as opposed to transient LGT: see e.g. Hao
24
and Golding, 2006) is confined to close relatives, then the vertical history or ‘tree of
25
cells’ will be reinforced by ‘xenologous’ genes that diverged more recently than the
23
1
last common ancestor of the cells that contain those genes. While these genes do not
2
match the reference phylogeny, they can contribute to its recovery and resolution.
3
However, where LGT occurs in a non-random fashion between distant relatives, our
4
ability to recover such a tree, or even indeed a significantly restricted network will be
5
severely compromised or lost.
6
7
SUPPLEMENTAL MATERIAL
8
9
The configuration files used in the above simulations, and all of the genome trees
10
used in this paper are given in the Supplemental Material (available online at
11
http://systematicbiology.org). EvolSimulator 2.0.4 can be obtained freely from
12
http://bioinformatics.org.au/evolsim.
13
14
ACKNOWLEDGMENTS
15
16
The authors would like to thank Olaf Bininda-Emonds, James McInerney, James Lake
17
and Craig Herbold for helpful comments on the manuscript. This work was supported
18
by Australian Research Council Grants DP0342987 and CE0348221 and by the
19
Genome Canada-funded project "Understanding Prokaryotic Genome Evolution and
20
Diversity" (WF Doolittle, PI).
21
22
REFERENCES
23
24
1
Altschul, S.F., T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J.
2
Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein
3
database search programs. Nucleic Acids Res. 25:3389-3402.
4
Auch, A.F., S.R. Henz, B.R. Holland, and M. Göker. 2006. Genome BLAST distance
5
phylogenies inferred from whole plastid and whole mitochondrion genome
6
sequences. BMC Bioinformatics 7:350.
7
Baldauf, S.L., A.J. Roger, I. Wenk-Siefert, and W.F. Doolittle. 2000. A
8
kingdom-level phylogeny of eukaryotes based on combined protein data.
9
Science 290:972-977.
10
11
12
13
14
15
16
17
18
19
20
Beiko, R.G. and R.L. Charlebois. 2007. A simulation test bed for hypotheses of
genome evolution. Bioinformatics 23:825-831.
Beiko, R.G. and N. Hamilton. 2006. Phylogenetic identification of lateral genetic
transfer events. BMC Evol. Biol. 6:15.
Beiko, R.G., T.J. Harlow, and M.A. Ragan. 2005. Highways of gene sharing in
prokaryotes. Proc. Natl. Acad. Sci. 104:14332-14337.
Belda, E., A. Moya, and F.J. Silva. 2005. Genome rearrangement distances and gene
order phylogeny in gamma-Proteobacteria. Mol. Biol. Evol. 22:1456-1467.
Bininda-Emonds, O.R.P. 2004. The evolution of supertrees. Trends Ecol. Evol.
19:315-322.
Bininda-Emonds, O.R.P., J.L. Gittleman, and A. Purvis. 1999. Building large trees by
21
combining phylogenetic information: A complete phylogeny of the extant
22
Carnivora (Mammalia). Biol. Rev. 74:143–175.Brochier, C., P.Forterre, and
23
S.Gribaldo. 2004. Archaeal phylogeny based on proteins of the transcription
24
and translation machineries: tackling the Methanopyrus kandleri paradox.
25
Genome Biol. 5:R17.
25
1
Cavalier-Smith, T. 2002. The neomuran origin of archaebacteria, the negibacterial
2
root of the universal tree and bacterial megaclassification. Int. J. Syst. Evol.
3
Microbiol. 52:7-76.
4
Charlebois, R.L., R.G. Beiko, and M.A. Ragan. 2004. Genome phylogenies. In
5
Organelles, genomes and eukaryote phylogeny: An evolutionary synthesis in
6
the age of genomics (eds. R.P. Hirt and D.S. Horne), pp. 189-206. CRC Press,
7
Boca Raton, FL.
8
9
Clarke, G.D.P., R.G. Beiko, M.A. Ragan, and R.L. Charlebois. 2002. Inferring
genome trees by using a filter to eliminate phylogenetically discordant
10
sequences and a distance matrix based on mean normalized BLASTP scores.
11
J. Bacteriol. 184:2072-2080.
12
Creevey, C.J., D.A. Fitzpatrick, G.K. Philip, R.J. Kinsella, M.J. O'Connell, M.M.
13
Pentony, S.A. Travers, M. Wilkinson, and J.O. McInerney. 2004. Does a tree-
14
like phylogeny only exist at the tips in the prokaryotes? Proc. Biol. Sci.
15
271:2551-2558.
16
17
18
Daubin, V., M. Gouy, and G. Perrière. 2001. Bacterial molecular phylogeny using
supertree approach. Genome Inform. 12:155-164.
Desper, R. and O. Gascuel. 2002. Fast and accurate phylogeny reconstruction
19
algorithms based on the minimum-evolution principle. J. Comput. Biol.
20
9:687-705.
21
Dutilh, B.E., M.A. Huynen, W.J. Bruno, and B. Snel. 2004. The consistent
22
phylogenetic signal in genome trees revealed by reducing the impact of noise.
23
J. Mol. Evol. 58:527-539.
24
25
Fitch, W.M. and E. Margoliash. 1967. Construction of phylogenetic trees. Science
155:279-84.
26
1
2
3
4
5
6
7
Galtier, N. 2007. A model of horizontal gene transfer and the bacterial phylogeny
problem. Syst. Biol. 56:633-642.
Gophna, U., W.F. Doolittle, and R.L. Charlebois. 2005. Weighted genome trees:
refinements and applications. J. Bacteriol. 187:1305-1316.
Hao, W. and G.B. Golding. 2006. The fate of laterally transferred genes: life in the
fast lane to adaptation or death. Genome Res. 16:636-643.
Holland, B.R., K.T. Huber, V. Moulton, and P.J. Lockhart. 2004. Using consensus
8
networks to visualize contradictory evidence for species phylogeny. Mol. Biol.
9
Evol. 21:1459-1461.
10
11
12
Hollander, M. and D.A. Wolfe. 1999. Nonparametric Statistical Methods, 2nd
Edition. New York: John Wiley & Sons.
Kunin, V., L. Goldovsky, N. Darzentas, and C.A. Ouzounis. 2005. The net of life:
13
reconstructing the microbial phylogenetic network. Genome Res. 15:954-959.
14
Jain, R., M.C. Rivera, and J.A. Lake. 1999. Horizontal gene transfer among genomes:
15
the complexity hypothesis. Proc. Natl. Acad. Sci. 96:3801-3806.
16
Jayaswal, V., J. Robinson, and L. Jermiin. 2007. Estimation of phylogeny and
17
invariant sites under the general Markov model of nucleotide sequence
18
evolution. Syst. Biol. 56:155-162.
19
Jermiin, L., S.Y. Ho, F. Ababneh, J. Robinson, and A. W. Larkum. 2004. The biasing
20
effect of compositional heterogeneity on phylogenetic estimates may be
21
underestimated. Syst. Biol. 53:638-643.
22
Lake, J.A. and M.C. Rivera. 2004. Deriving the genomic tree of life in the presence of
23
horizontal gene transfer: conditioned reconstruction. Mol. Biol. Evol.
24
21:681-690.
27
1
Musovic, S., G. Oregaard, N. Kroer, and S.J. Sørensen. 2006. Cultivation-independent
2
examination of horizontal transfer and host range of an IncP-1 plasmid among
3
gram-positive and gram-negative bacteria indigenous to the barley
4
rhizosphere. Appl. Environ. Microbiol. 72:6687-6692.
5
6
7
8
9
Phillips, M.J., F. Delsuc, and D. Penny. 2004. Genome-scale phylogeny and the
detection of systematic biases. Mol. Biol. Evol. 21:1455-1458.
Ragan, M.A. 2006. On surrogate methods for detecting lateral gene transfer. FEMS
Microbiol. Lett. 201:187-191.
Ragan, M.A., T.J. Harlow, and R.G. Beiko. 2006. Do different surrogate methods
10
detect lateral genetic transfer events of different relative ages? Trends
11
Microbiol. 14:4-8.
12
Sankoff, D., G. Leduc, N. Antoine, B. Paquin, B.F. Lang, and R. Cedergren. 1992.
13
Gene order comparisons for phylogenetic inference: evolution of the
14
mitochondrial genome. Proc. Natl. Acad. Sci. 89:6575-6579.
15
16
17
18
19
Snel, B., P. Bork, and M.A. Huynen. 1999. Genome phylogeny based on gene
content. Nat. Genet. 21:108-110.
Susko, E. and A.J. Roger. 2007. On reduced amino acid alphabets for phylogenetic
inference. Mol. Biol. Evol. 24:2139-2150.
Swofford, D.L. and G.J. Olsen. 1990. Phylogeny reconstruction. In Molecular
20
Systematics (eds. D.M.Hillis and C.Moritz). pp. 411-501. Sinauer Associates,
21
Sunderland, Massacheusetts.
22
Weisburg, W.G., S.J. Giovannoni, and C.R. Woese. 1989. The Deinococcus-Thermus
23
phylum and the effect of rRNA composition on phylogenetic tree construction.
24
Syst. Appl. Microbiol. 11:128-134.
28
1
Whelan, S. and Goldman, N. 2000. A general empirical model of protein evolution
2
derived from multiple protein families using a maximum-likelihood approach.
3
Mol. Biol. Evol. 18:691-699.
4
Wilkinson, M., D. Pisani, J.A. Cotton, and I. Corfe. 2005. Measuring support and
5
finding unsupported groups in supertrees. Syst. Biol. 54:823-831.
6
Woese, C.R., O. Kandler, and M.L. Wheelis. 1990. Towards a natural system of
7
organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc.
8
Natl. Acad. Sci. 87:4576-4579.
9
Wolf, Y.I., I.B. Rogozin, N.V. Grishin, R.L. Tatusov, and E.V. Koonin. 2001.
10
Genome trees constructed using five different approaches suggest new major
11
bacterial clades. BMC Evol. Biol. 1:8.
12
13
14
Zhaxybayeva, O., P. Lapierre, and J.P. Gogarten. 2004. Genome mosaicism and
organismal lineages. Trends Genet. 20:254-260.
Zhaxybayeva, O., J.P. Gogarten, R.L. Charlebois, W.F. Doolittle, and R.T. Papke.
15
2006. Phylogenetic analyses of cyanobacterial genomes: quantification of
16
horizontal gene transfer events. Genome Res. 16:1099-1108.
17
18
19
20
29
1
Table 1: Topological features of genome trees constructed from different LGT
2
scenarios and rates of transfer. The first two columns show the model used (scenario
3
and rate). The following three columns summarize the bipartitions in the resulting
4
genome tree that are supported by at least 70% of bootstrap replicates: the numbers of
5
such bipartitions that are concordant and discordant, followed by the percentage of
6
bipartitions that are concordant. Following this are three columns showing equivalent
7
results for bipartitions that are supported by at least 90% of bootstrap replicates. The
8
final two columns show the edit distance recovered by EEEP between the genome
9
tree inferred by the normalized BLASTP method (considering only those bipartitions
10
that are supported at a BP of 70% and 90%, respectively) and the ‘true tree’ that was
11
used to simulate the evolution of the set of genomes.
12
LGT
Scenario
None
Rate of
Support70
Transfer Conc Disc
%C
0
28
0
100
Conc
25
Support90
Disc
%C
0
100
Relations
10
50
250
28
34
35
1
4
5
96.6
89.4
87.5
24
28
29
1
2
2
Habitat
10
50
250
31
29
20
7
5
9
81.6
85.3
69.0
23
22
18
Content
10
50
250
29
34
24
3
2
0
90.6
94.4
100
Random
10
50
250
29
31
20
2
0
0
93.5
100
100
ED70
ED90
0
0
96.0
93.3
93.5
0
3
3
0
1
0
2
4
6
92.0
84.6
75.0
5
4
>7
1
4
8
27
27
22
1
1
0
96.4
96.4
100
2
1
0
0
0
0
25
29
20
1
0
0
96.2
100
100
0
0
0
0
0
0
13
14
30
1
Figure Legends
2
3
SUPPLEMENTAL FIGURE 1. Tree representation of the organismal history simulated in
4
replicate 1, including extinct and extant taxa. Branch lengths are proportional to the
5
number of iterations in the simulation, with the common ancestor at iteration 1 and
6
extant lineages extending to iteration 5000. Major groupings of taxa (supported by
7
long internal branches in the tree of extant genomes) are indicated with Greek letters.
8
9
FIGURE 1. Tree representation of the organismal history simulated in replicate 1,
10
showing only the lineages ancestral to the population of 56 taxa extant at the end of
11
the simulation (iteration 5000). Branch lengths are proportional to simulation
12
iterations. Major groupings of taxa (supported by long internal branches in the tree)
13
are indicated with Greek letters as in Supplemental Figure 1.
14
15
FIGURE 2. Relationship between the number of iterations since the divergence of each
16
pair of extant genomes, and the corresponding mean normalized BLASTP distance for
17
all shared orthologs between the pair of genomes. Each point in the graph corresponds
18
to a pair of genomes from a given replicate: all genome pairs from all replicates are
19
shown.
20
21
FIGURE 3. Proportion of bipartitions supported at or above a series of thresholds that
22
are concordant or discordant with the true genome tree, recovered from the five
23
replicate simulations performed without LGT. Replicates are ordered 1-5 from left to
24
right, and sets of bars within each replicate correspond to bootstrap thresholds
25
increasing from 0.50 to 1.00 in increments of 0.05. Shaded bars indicate the
31
1
proportion of resolved bipartitions that are concordant (i.e., are consistent with the
2
simulated genome history), while the stacked white bars indicate the corresponding
3
proportion that are discordant.
4
5
FIGURE 4. Normalized BLASTP-based genome tree inferred from the set of genomes
6
simulated without LGT in Replicate 1. Numbers at the internal nodes indicate
7
bootstrap support (from a total of 100 replicates) for the partitioning of taxa indicated
8
by that node. The colouring scheme for true monophyletic groups is consistent with
9
that used in Supplemental Figure 1 and Figure 1 in the text.
10
11
FIGURE 5. Normalized BLASTP-based genome tree inferred from the set of genomes
12
simulated with habitat-directed LGT (E = 10 attempted events per iteration) in
13
Replicate 1. Numbers at the internal nodes indicate bootstrap support (from a total of
14
100 replicates) for the partitioning of taxa indicated by that node; only nodes with ≥
15
70% bootstrap support are shown, and strongly supported nodes that are discordant
16
with the true tree are indicated with a larger font.
17
18
FIGURE 6. The percentage of strongly supported bipartitions in genome trees from all
19
replicates that are concordant. Two bootstrap thresholds (70% in panel a and 90% in
20
panel b) were used to define strong support: the mean ± standard deviation is shown
21
for each combination of BP threshold, LGT regime, and rate of LGT (= E). The
22
different types of LGT regime are abbreviated as follows: n = no LGT (E = 0); r =
23
relations-biased LGT; h = habitat-biased LGT; p = gene content-biased LGT; x =
24
random LGT (proposed events are always successful).
25
32
1
FIGURE 7. The relationship between branch length and bootstrap support of genome
2
trees recovered from data simulated under three different LGT regimes (empty circles
3
= no LGT; gray circles = habitat-directed LGT with E = 250; black circles = random
4
LGT with E = 250). Each set of data points is an aggregate of the five replicate
5
simulations for the stated evolutionary scenario. The lines above the plot show the
6
distribution of branch lengths recovered for each set of genome trees: the length of the
7
horizontal line shows the range, dashed vertical lines show the upper value of the first
8
and third quartiles, and the solid vertical line corresponds to the median branch
9
length.
10
11
FIGURE 8. The percentage of strongly supported nodes (BP ≥ 70) in variously
12
weighted genome trees from Replicate 1 that are concordant (panel a), and the total
13
count of strongly supported nodes (panel b). Each triplet of bars associated with a
14
unique combination of E and LGT regime refers to a different weighting criterion for
15
normalized BLASTP values, from left to right: concordance-weighted (darkest gray),
16
unweighted (lightest gray), and discordance-weighted (intermediate gray). LGT
17
regime abbreviations are consistent with Figure 6.
33
Download