Running Head: LGT and Genome Phylogeny

The Impact of Reticulate Evolution on Genome Phylogeny

Research Article

Robert G. Beiko1†, W. Ford Doolittle2, and Robert L. Charlebois2

1 Faculty of Computer Science, Dalhousie University, and Institute for Molecular Bioscience / ARC Centre for Bioinformatics, Brisbane, Australia
2 GenomeAtlantic, Department of Biochemistry & Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada

† Corresponding author:
Robert G. Beiko
Faculty of Computer Science, Dalhousie University
6050 University Avenue
Halifax, Nova Scotia, Canada B3H 1W5
Tel: 1 902 494 8043
Fax: 1 902 492 1517
Email: beiko@cs.dal.ca

Keywords: Evolutionary simulation, lateral genetic transfer, genome phylogeny

Abbreviations: LGT, lateral genetic transfer; PDS, phylogenetically discordant sequence; SPR, subtree prune and regraft

ABSTRACT

Genome phylogenies are used to build tree-like representations of evolutionary relationships among genomes. However, in condensing the phylogenetic signals within a set of genomes down to a single tree, these methods generally do not explicitly take into account discordant signals arising due to lateral genetic transfer. Since conflicting vertical and horizontal signals can produce compromise trees that do not reflect either type of history, it is essential to understand the sensitivity of inferred genome phylogenies to these confounding effects. Using replicated simulations of genome evolution, we show that different scenarios of lateral genetic transfer have significant impacts on the ability to recover the ‘true’ tree of genomes, even when corrections for phylogenetically discordant signals are used.
An important motivator in many phylogenetic analyses is that the branching relationships inferred from a set of orthologous sequences may serve as a direct indicator of organismal phylogeny. The best example of this is found in the use of 16S and 18S ribosomal DNA sequences for phylogenetic classification of organisms (Woese et al., 1990). The recognition that a single set of putatively orthologous sequences may not yield an accurate depiction of organismal descent due to violations of the phylogenetic model used, insufficient phylogenetic signal, cryptic paralogy or lateral genetic transfer (LGT) led to a partial abandonment of single-gene methods in prokaryotes. In their place emerged a plethora of methods that depend on much greater data availability: concatenated sequence phylogenies (Baldauf et al., 2000; Brochier et al., 2004), supertrees and networks constructed from many individual phylogenetic trees (Daubin et al., 2001; Creevey et al., 2004; Beiko et al., 2005; Holland et al., 2005), and whole-genome methods that typically simplify genetic data to yield an easily computed summary of relationships between genomes.

Many genome properties have been used as basic characters for the inference of genome-genome relationships, including gene content (Snel et al., 1999; Lake and Rivera, 2004), gene order (Sankoff et al., 1992; Belda et al., 2005) and properties of the distribution of sequence similarities between genomes (Clarke et al., 2002; Auch et al., 2006). All of the above analyses assume that convergent evolution of the character under consideration is unlikely, when compared with other genome properties such as (for example) G+C content or synonymous codon usage.
The clearest violation of this non-convergence assumption is in the reduction of parasitic genomes: since genomes tend to lose many of the same genes when undergoing genome reduction, using gene absence as a parsimoniously informative character leads to artifactual grouping of small genomes in gene content trees (Wolf et al., 2001; Kunin et al., 2005). Other, more subtle biases may influence these methods as well: if distantly related organisms have similar biases in nucleotide or amino acid usage due to, e.g., mutational biases, nutrient limitations or environmental conditions, then some of their genes may appear less distant from one another than the true divergence time would suggest (Weisburg et al., 1989). Lateral genetic transfer, which can introduce sequences into a genome from organisms with any degree of relatedness, yields an apparent convergence of gene content, similarity, or sometimes order, but in fact violates the fundamental assumption in phylogenetic methods that evolution is tree-like.

An understanding of the extent and impact of LGT is crucial to interpreting the relationships shown in a genome tree. One approach is to identify sequences that are discordant using a surrogate method, and either downweight their contribution to the genome tree, or eliminate them entirely from consideration (Clarke et al., 2002; Dutilh et al., 2004; Gophna et al., 2005). A limitation of this approach is that different methods for identifying conflicting genes tend to identify different sets of genes (Ragan, 2001; Ragan, 2006), so the choice of homology criterion and filtering method can have a substantial impact on the inferred genome history. The fundamental problem is that without an accurate estimate of the extent and source of LGT within a given data set, it is very difficult to assess the impact of LGT on the final genome tree.
In many published analyses, there is strong reason to suspect that LGT has influenced the position of certain lineages. In some cases, taxa that are thought to participate in frequent transfer are drawn towards one another in the tree. This effect is suggested in the case of the archaeal genus Thermoplasma, which concatenated informational gene phylogenies suggest to be secondarily non-methanogenic (Brochier et al., 2005), but which appears as an early-branching euryarchaeal or archaeal lineage in many published studies including Wolf et al. (2001), Beiko et al. (2005), and Gophna et al. (2005). There is strong evidence for extensive LGT from the thermoacidophilic crenarchaeon Sulfolobus to Thermoplasma, which may produce a compromise in the positioning of Thermoplasma in aggregated trees. In other cases, transfer partners appear as sisters in genome trees, for instance when Arabidopsis appears as a sister taxon to the cyanobacteria in genome trees of metabolic genes (Charlebois et al., 2004).

While simulation has been used extensively in the investigation and validation of methods in molecular evolution, such techniques are only now being applied to the study of genome evolution (Zhaxybayeva et al., 2006; Galtier, 2007). Part of the reason for this is the relative novelty of genome sequences, but another barrier to meaningful genome simulation has been the difficulty in merging traditional models of sequence change with evolutionary scenarios and interactions within and among genomes. EvolSimulator (Beiko and Charlebois, 2007) has been developed to simulate the evolutionary phenomena most relevant to the study of lateral genetic transfer, including genome-specific mutational and selective regimes, gene content evolution via gene duplications, losses and LGT, and organismal evolution with speciation, extinction, and competition for simulated niches and habitats.
Here we use EvolSimulator to generate populations of genomes under different scenarios and frequencies of lateral genetic transfer, to allow a precise delineation of the extent to which different modes of LGT can impact inferred genome histories. The weighting schemes introduced by Gophna et al. (2005), based on the observed phylogenetic concordance or discordance of proteins, can have a substantial impact on the genome tree, and here we assess the effectiveness of these schemes in improving the tree of genomes that is recovered.

METHODS

Evolutionary Simulations

EvolSimulator version 2.0.4 (Beiko and Charlebois, 2007) was used to evolve populations of genomes with a consistent set of constraints on genomic properties and sequence substitution, but different LGT regimes. Each simulation began with a single, ancestral genome having 1000 unrelated genes (240-1500 nt in size), from which a population of genomes would evolve over 5000 iterations. Point mutations were assessed against the standard genetic code as described in Beiko and Charlebois (2007), with amino acid acceptance probabilities proportional to the WAG matrix (Whelan and Goldman, 2000). Insertion and deletion events were not simulated. Genomes were permitted to drift in size by loss and gain (by duplication, and if prescribed, by lateral acquisition) of genes, to as few as 500 genes or as many as 3500. Speciation and extinction events were balanced such that the simulation maintained between 50 and 60 genomes at any given time, following an initial growth phase.

EvolSimulator constructs a user-defined number of “niches” within which genomes reside, the occupation of which is competitively determined by relative gene complements. Each niche has a finite number of spaces that are identical in terms of required genes, potentially limiting the number of genomes that can exploit that niche.
Habitats comprise one or more niches, and also impose specific gene requirements on a resident genome. In these simulations, we distributed 1000 spaces evenly among 100 niches, and these 100 niches among 10 habitats. For a genome to spend time in a niche, it must possess all of the genes required by that niche and by its enclosing habitat, such necessities being randomly chosen by EvolSimulator at the start of the run. In addition to this qualitative requirement, the quantitative usefulness of genes within the current niche regulates their propensity to be retained or lost, as well as the amount of time a genome may spend within that niche. One can, though here we did not, bias speciation/extinction probabilities according to overall genomic fitness.

We explored five independent LGT scenarios in this study: (a) no LGT, (b) random LGT, (c) relations-biased LGT (occurring more often with closer relatives), (d) gene content-biased LGT (occurring more often between genomes with more-similar ortholog constitution), and (e) habitat-restricted LGT (occurring only between genomes concurrently residing in the same habitat). A successful transfer event according to the biasing criterion always led to uptake of the transferred gene; if an ortholog was already present in the recipient genome, it was replaced with the incoming gene. Although we did not explore blends of these scenarios here, we did explore three rates of LGT: in separate runs of simulation sets (b)-(e), the mean number of attempted events E was set to 10, 50, and 250 nominal events per iteration. The single run (a), plus three runs each of (b)-(e) with variable E, yielded a set of 13 evolutionary scenarios.
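As a quick check on the scenario count, the design described above can be enumerated directly. The sketch below is illustrative only: the scenario labels and the tuple layout are shorthand introduced here, not EvolSimulator configuration syntax.

```python
from itertools import product

# Hypothetical labels for the four LGT-bearing scenarios (b)-(e) in the text.
LGT_MODES = ["random", "relations-biased", "content-biased", "habitat-restricted"]
# Mean number of attempted LGT events per iteration (E) explored in the text.
RATES = [10, 50, 250]


def enumerate_scenarios():
    """Return (mode, E) pairs: one LGT-free run plus every mode/rate pairing."""
    scenarios = [("none", 0)]
    scenarios.extend(product(LGT_MODES, RATES))
    return scenarios


scenarios = enumerate_scenarios()
print(len(scenarios))  # 1 + 4 * 3 = 13
```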
By using the same random number seed for each simulation in a set, we were able to exactly replicate the speciation and extinction history (i.e., each simulation had the same reference ‘organismal’ tree) and lineage-specific mutation biases, thus ensuring that differences among the 13 simulations would be due only to the LGT model that was used.

Five sets of replicate runs of the thirteen scenarios were executed, with each set of replicates employing its own seed for pseudorandom number generation, for a total of 65 simulation runs. The use of a consistent random number seed across every run within a given replicate ensured the same history of speciation and extinction, as well as the same mutational biases within each lineage. For a complete summary of all parameters used in the simulation, please refer to the EvolSimulator configuration file in the Supplemental Material (available online at http://www.systematicbiology.org).

Supplementary Figure 1 (also available at http://www.systematicbiology.org) illustrates the complete set of speciation and extinction events for the entire 5000 iterations of one of our replicate runs. Extant genomes and closely related extinct lineages have been assigned to eight monophyletic groups α-θ: the precise delineation of these groups is arbitrary, but they all diverged in the earliest 10% of simulated iterations. Colors have been assigned to phylum-level groups in order to highlight cases of invasive intermingling in the inferred phylogenies shown below.

Inference of Genome-Scale Phylogeny

Normalized BLASTP-based phylogeny was performed in a manner similar to Clarke et al. (2002), with the distance between every pair of genomes equal to 1.0 minus the mean normalized BLASTP 2.2.2 (Altschul et al., 1997) score of all pairwise reciprocal best matches (RBMs) between the two genomes.
Only RBMs with BLASTP e-values of 1.0 × 10⁻⁵ or less were used to build this matrix. The minimum evolution algorithm implemented in the November 28, 2003 release of FastME (Desper and Gascuel, 2002) was used to build a phylogenetic tree from the matrix of all genome pairwise distances. Statistical support for each distance tree was assessed by resampling the set of RBMs with replacement for each pair of genomes to generate bootstrapped distance matrices, with the corresponding trees constructed using FastME. Resampling was performed 100 times on each data set, to yield support values for each bipartition in a given genome tree.

Phylogenetically discordant sequences (PDS) disagree with the majority phylogenetic signal, and are frequently observed in real data due both to violations of phylogenetic assumptions and to bona fide instances of LGT. To limit the effect of phylogenetic discordance on inferred genome trees, we applied the PDS procedure described in Clarke et al. (2002) to the set of genomes obtained from each simulation in replicate 1 (other replicates were not examined). Each individual protein has an associated PDS score, which is calculated by comparing the ranking of its similarity to putative orthologs in a list of other genomes (u-values), versus a ranking based on the median of all pairs of putative orthologs from every other genome (w-values). The Spearman rank correlation thus obtained is compared to a large number of correlations obtained by randomizing the comparisons, to obtain a p-value. The mean of the PDS p-values associated with a given pair of RBM proteins could then be used to weight the contribution of that pair to the overall distance measure between the two relevant genomes (Gophna et al., 2005).
The unweighted distance D between a pair of genomes A and B with a total of n RBMs is given by the following formula:

$$D_{A,B}^{\mathrm{unweighted}} = 1 - \frac{1}{n}\sum_{i=1}^{n} S(AB)_i \tag{1}$$

where $S(AB)_i$ is the normalized BLASTP score for the i-th RBM between the pair of genomes. The concordance-weighted distance is modified by the PDS p-values $P(AB)_i$ defined above:

$$D_{A,B}^{\mathrm{conc}} = 1 - \frac{1}{nP_c}\sum_{i=1}^{n} S(AB)_i \, P(AB)_i \tag{2}$$

where $nP_c$ is the sum of all $P(AB)_i$ between a pair of genomes. The discordance-weighted distance between a pair of genomes is computed in a similar manner, replacing all $P(AB)_i$ with $(1 - P(AB)_i)$:

$$D_{A,B}^{\mathrm{disc}} = 1 - \frac{1}{nP_d}\sum_{i=1}^{n} S(AB)_i \, (1 - P(AB)_i) \tag{3}$$

with $nP_d$ equal to the sum of all $(1 - P(AB)_i)$ between a pair of genomes.

Quantifying Differences Between Inferred Genome Trees

Each inferred genome tree was compared to the true ‘organismal’ history, to assess the accuracy of the inferred relationships. A range of bootstrap support thresholds between 0.50 and 1.00 was used as minimal criteria for strongly supported relationships in the inferred genome trees: in several analyses, conservative (0.90) and liberal (0.70) thresholds were contrasted. Since the organismal reference tree is completely resolved, strongly supported bipartitions present in a given genome tree must either be congruent or incongruent with the reference tree. Disagreements between trees were expressed in terms of the total count of concordant and discordant bipartitions, and in terms of the proportion of all resolved bipartitions that were concordant.

EEEP (Beiko and Hamilton, 2006) was used to recover the distance between the organismal reference and each inferred genome tree in turn. The distance used characterizes the number of subtree prune-and-regraft (SPR: Swofford and Olsen, 1990) operations that need to be performed on the organismal reference tree, to obtain the inferred genome tree.
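The three distance definitions in Equations 1-3, together with the RBM resampling used for bootstrap support, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, the inputs are plain lists of per-RBM normalized scores and mean PDS p-values for one genome pair, and the toy values at the end are invented for demonstration.

```python
import random


def unweighted_distance(scores):
    """Eq. 1: one minus the mean normalized BLASTP score over all RBMs."""
    if not scores:
        raise ValueError("at least one RBM score is required")
    return 1.0 - sum(scores) / len(scores)


def weighted_distance(scores, pvalues, mode="conc"):
    """Eq. 2 (mode='conc') or Eq. 3 (mode='disc') for one genome pair."""
    if not scores or len(scores) != len(pvalues):
        raise ValueError("scores and pvalues must be non-empty and equal in length")
    if mode == "conc":
        weights = list(pvalues)               # weight each RBM by P(AB)_i
    elif mode == "disc":
        weights = [1.0 - p for p in pvalues]  # weight each RBM by 1 - P(AB)_i
    else:
        raise ValueError("mode must be 'conc' or 'disc'")
    total = sum(weights)                      # nP_c or nP_d in the text
    if total == 0:
        raise ValueError("weights sum to zero; distance is undefined")
    return 1.0 - sum(s * w for s, w in zip(scores, weights)) / total


def bootstrap_distances(scores, n_replicates=100, seed=None):
    """Resample one pair's RBM scores with replacement; one distance per replicate."""
    rng = random.Random(seed)
    n = len(scores)
    return [unweighted_distance(rng.choices(scores, k=n)) for _ in range(n_replicates)]


# Toy values for illustration only; they are not taken from the simulations.
scores = [0.9, 0.5]
pvalues = [1.0, 0.0]
print(round(unweighted_distance(scores), 3))                 # 0.3
print(round(weighted_distance(scores, pvalues, "conc"), 3))  # 0.1
print(round(weighted_distance(scores, pvalues, "disc"), 3))  # 0.5
```

When every RBM carries the same nonzero p-value, both weighted forms reduce to the unweighted distance, which is a convenient sanity check on any implementation.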
Each SPR operation in an edit path implies a donor/recipient pairing, and in the context of single-molecule phylogeny identifies a putative transfer event from one branch to another. SPR events from genome trees may identify major highways of gene sharing, either directly if a given taxon or clade is paired with a major transfer partner, or indirectly if the position of a taxon in the tree is intermediate between its transfer partners and its correct position in the organismal tree. It is therefore useful to characterize incongruence both quantitatively in terms of edit distance, and qualitatively in terms of the nature of the proposed discordant relationships, given complete knowledge of the ‘true’ trees and scenario of LGT.

RESULTS

Species Tree, Genome Divergence, and Gene Exchange

Figure 1 shows the phylogenetic relationships between extant organisms at the end of the replicate 1 simulations. Replicate 1 had the largest population of surviving genomes at the end of the simulation, with 56 extant lineages. The other four replicates had a minimum of 48 and a maximum of 54 genomes at iteration 5000. Each replicate had an initial phase of rapid speciation to reach the target population size of 50 genomes, but following this phase the extant genomes could be separated into lineages supported basally by relatively long branches, appearing as logically distinct phyla. Protein sequences that diverged at the beginning of the simulation were still recognizably homologous.

The relationship between genome divergence time and mean normalized BLASTP distance for each pair of genomes is shown in Figure 2. The mean normalized BLASTP distance increases quickly in the first 250 iterations after divergence, then increases more gradually to the maximum number of iterations, albeit with high variance.
Distances between pairs of genomes whose last common ancestor occurred near the beginning of the simulation ranged between approximately 0.6 and 0.7. Phylum-level divergence between real pairs of genomes is approximately 0.7-0.75 according to Clarke et al. (2002), so the maximum level of divergence among genomes in this analysis is roughly equivalent to that seen among bacterial phyla.

When the mean number of LGT events per iteration E was set to 250, the rate of LGT was sufficiently high that every gene created at the beginning of the simulation was transferred at least once in its history. In the random LGT scenario with E = 250, the distribution of historical transfers for a sample of 10% of the genes from genome 314 was examined. At the end of the simulation (iteration 5000), each gene had been transferred an average of 70 times during the course of its simulated evolution. No gene had been transferred fewer than 31 times, and the maximum number of transfers in the history of any given gene was 103.

Genome Trees Under Different LGT Scenarios

BLASTP-based genome trees are strictly bifurcating, with associated bootstrap proportions (BP) reflecting the frequency with which a given bipartition was observed within the set of bootstrap replicate trees. Figure 3 shows, for a series of thresholds and for the five replicate simulations performed without LGT, the proportion of bipartitions supported at or above each threshold that are either concordant or discordant with the true genome tree, i.e. that either support or conflict with relationships in the true tree (Wilkinson et al., 2005). In each of the five cases, there was a BP threshold at and above which no discordant bipartitions were present in the inferred genome tree.
This minimum threshold ranged from 0.65 in replicate 2 to 0.90 in replicate 5, with discordance in the latter case due to a misplaced deep branch with bootstrap support of exactly 0.85. High BP thresholds yielded exclusively concordant relationships, but had a negative impact on the number of resolved bipartitions, with fewer than 40% of all bipartitions resolved at a BP threshold of 1.00 for each of the five trees. Lowering the BP threshold from 1.00 to 0.50 increased the number of resolved bipartitions by approximately a factor of two, at the expense of accepting some discordant relationships: across the five replicates, between 3% and 16% of the resolved relationships at a BP threshold of 0.50 were not consistent with the original genome tree.

The genome tree inferred from the LGT-free simulation (E = 0) in replicate 1 is shown in Figure 4. The rapid increase in normalized BLASTP distances in the period immediately following genome divergence (shown in Fig. 2) is evident here in the relatively long branches separating recently diverged taxa. Nonetheless, seven of the eight main monophyletic groupings (α, β, γ, ε, ζ, η, and θ) were correctly reconstructed. While several of these groups were reconstructed with bootstrap support ≥ 90%, there was some difficulty in recovering the deepest branches: the deepest-branching genome from group δ was separated from the other two genomes in this group in spite of a relatively long supporting internal branch in the true tree, and bootstrap support for groups ζ and θ was low, likely reflecting the frequent intrusion of genome 205 into group θ. While genome 205 does not branch within group θ in the tree shown in Figure 4, a tree computed from the same starting distance matrix but using the Fitch-Margoliash method (Fitch and Margoliash, 1967) displayed this intermingling of phyla.
The three groups that were not recovered or recovered with weak support were also the three earliest groups to diverge in Figure 1, suggesting that groups of this age or older may not be recoverable. Within some groups there were discrepancies in the recovered branching order; however, these inconsistencies had low associated bootstrap values. The branching order of major groups was also unreliable; again, incorrect relationships were associated with bootstrap values less than 70%.

Each of the 13 genome histories from replicate 1 was used to construct a bootstrapped genome phylogeny. Table 1 shows the degree of strongly supported concordance and discordance that was found for each reconstructed tree relative to the original genome history. As described above, the tree of genomes simulated without LGT had no discordant bipartitions with bootstrap support of 70% or greater, although only 28 of a possible 51 internal bipartitions had a bootstrap support at least this high, and only 25 bipartitions were supported at a bootstrap threshold of 90%. Among simulations where LGT occurred more often between closely related genomes (relations-biased), relatively low rates of LGT yielded resolution and concordance values similar to those observed in the absence of LGT. However, the degree of resolution actually increased with increasing rates of LGT; when the mean rate of LGT E was 250 nominal events per iteration, 40 bipartitions had bootstrap support of 70% or greater; five of these bipartitions were incongruent with the true tree. An increase in the number of resolved bipartitions was also observed at a bootstrap threshold of 90%, with the number of discordant bipartitions increasing from 1 to 2. Consequently, the gain in resolution appears to reinforce the simulated genome phylogeny; even if the histories of shared genes do not exactly match the reference tree, they can still aid in its recovery.
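The concordance and discordance counts of the kind reported in Table 1 follow from a direct comparison of bipartitions. The sketch below is a minimal illustration, assuming each tree has already been reduced to a collection of bipartitions (each given as the set of taxa on one side) and that the reference tree is fully resolved, so that any supported bipartition absent from the reference is counted as discordant. The function names and the five-taxon example are hypothetical and are not drawn from the simulations.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that excludes the smallest taxon label."""
    side = frozenset(side)
    anchor = min(taxa)
    return frozenset(taxa) - side if anchor in side else side


def concordance(reference_splits, inferred_splits, taxa, threshold=0.7):
    """Count supported bipartitions that agree or conflict with the reference tree.

    reference_splits: iterable of taxon sets (one side of each bipartition).
    inferred_splits:  iterable of (taxon set, bootstrap proportion) pairs.
    Returns (concordant, discordant, proportion concordant or None).
    """
    ref = {canonical(s, taxa) for s in reference_splits}
    concordant = discordant = 0
    for side, support in inferred_splits:
        if support < threshold:
            continue  # unresolved at this threshold
        if canonical(side, taxa) in ref:
            concordant += 1
        else:
            discordant += 1
    resolved = concordant + discordant
    proportion = concordant / resolved if resolved else None
    return concordant, discordant, proportion


# Hypothetical example: reference ((A,B),C,(D,E)); inferred ((A,B),E,(C,D)).
taxa = {"A", "B", "C", "D", "E"}
reference = [{"A", "B"}, {"D", "E"}]
inferred = [({"A", "B"}, 0.95), ({"C", "D"}, 0.80)]
print(concordance(reference, inferred, taxa, threshold=0.7))  # (1, 1, 0.5)
print(concordance(reference, inferred, taxa, threshold=0.9))  # (1, 0, 1.0)
```

Canonicalizing each bipartition to one side means that a split written as {A, B} and the same split written as {C, D, E} compare as equal.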
In all of the other three types of LGT scenario that were simulated, there was a decrease in the number of concordant nodes between E = 10 and E = 250 attempted transfers per iteration, although in some cases there were more concordant nodes at E = 50 than at E = 10. Under content-biased and random LGT in this replicate, the number of discordant bipartitions dropped to zero as E increased to 250, so the 20 strongly supported bipartitions that remained in each of these simulations were all in agreement with the true tree. However, a low rate of habitat-biased LGT was sufficient to introduce discordant relationships into the inferred tree; with E = 10 only 31 out of 38 bipartitions with BP ≥ 0.7 were concordant with the true tree (Fig. 5). And unlike the trees built from content-biased or random LGT, the proportion of strongly supported and discordant bipartitions increased with increasing E.

Nonparametric Kruskal-Wallis tests were performed across all five replicates (which are summarized in Fig. 6) to assess the significance of differences in % concordance distribution among LGT scenarios. These tests were applied independently to a total of six combinations of E (10, 50, and 250) and bootstrap threshold (70% in Fig. 6a and 90% in Fig. 6b). For E = 10, the difference in distribution of percent concordance was either just below an alpha threshold of 0.05 or not significant (BP threshold of 70%: p = 0.033; BP threshold of 90%: p = 0.73), but p-values of 0.015 and 0.005 were obtained for the two BP thresholds with E = 50, and p < 0.005 when E = 250. Pairwise post-hoc Nemenyi tests (Hollander and Wolfe, 1999) showed that only habitat-biased LGT produced trees that were significantly less concordant than any of the other groups.
In each of the four sets of tests with E equal to 50 or 250 and a BP threshold of 0.7 or 0.9, the distribution of concordances from the five habitat-biased replicates differed from that of the random and gene content-biased trials, but was statistically indistinguishable at α = 0.05 from that of the five relations-biased replicates. Since a speciation event yields two lineages that are adapted to the same set of niches, recently diverged genomes have a better than random probability of occupying the same habitat. This effect will be attenuated by independent habitat switching and gene loss, but may be responsible for the similar distributions seen here.

Across all sets of unweighted genome trees, the bootstrap support for each node showed a strong relationship with the length of the branch supporting that node. Figure 7 shows this relationship for three sets of genome trees, corresponding to the five replicates simulated without LGT and the five replicates simulated with either habitat-biased or random LGT and E equal to 250 attempted events per iteration. While the relationship between branch length and bootstrap support is similar, the trees simulated with random LGT have shorter branches, and therefore lower overall bootstrap support. Random LGT leads to a greater average similarity among genomes, compressing the tree and confounding relationships that were clearly resolved in simulations lacking LGT. Conversely, habitat-directed LGT does not produce the same ‘squashing’ of early branches in the tree, and the discordant nodes recovered have longer supporting branches and higher bootstrap support.

While a single branch migration within a tree can disrupt potentially many bipartitions, multiple SPR operations were needed to reconcile all of the habitat-biased LGT trees (resolved at a bootstrap threshold of 70%) with the true tree of genomes (Table 1).
Consequently, the discordance of these trees was not due to a single branch migration, but to several such events.

Concordance and Discordance-Weighted Trees

In separate analyses, the normalized BLASTP scores between proteins from replicate 1 were subjected to concordance and discordance weighting prior to distance matrix construction. Under a concordance weighting scheme, proteins with ranked similarities to putative orthologs in other genomes will have a high weight if the ranking is similar to the overall ranking of genome similarities. If a consistent pattern of similarities exists for many proteins from a given simulation, then these proteins will exert a strong influence on the overall genome similarity ranking, and will make a disproportionately high contribution to the distance matrix as a consequence (Gophna et al., 2005).

The contribution from a subset of proteins with strong adherence to the genome similarity ranking might be expected to yield more well-supported bipartitions than are seen in the unweighted case. Figure 8 shows that this is true for E = 0, 10, and 50: the total number of bipartitions resolved at a BP threshold of 0.7 increased in all nine simulations (Fig. 8a). However, in four out of nine of these trials, there is a decrease in the proportion of total concordant bipartitions (Fig. 8b), so the additional information that emerges from concordance weighting is sometimes in disagreement with the true tree. The effect of concordance weighting when E = 250 is less clear: the number of resolved bipartitions decreases in the unbiased (random LGT) simulation, and increases slightly in the other three simulations.
In none of the replicates did PDS scores correlate significantly with the number of historical transfers for a given gene: many factors such as paralogy and compositional artifacts are intentionally captured in the PDS weighting process (Clarke et al., 2002), and a simple model of LGT frequency may not be sufficient to gauge its expected impact on phylogenetic discordance.

The effect of discordance weighting at low levels of LGT (E = 0 and E = 10) is a near-complete loss of all strongly supported bipartitions. All five simulations with E ≤ 10 produced trees with a maximum of eight bipartitions with a bootstrap value of 70% or greater: these bipartitions supported only the most recent divergences in the tree, although they were not invariably in agreement with the true tree. A drop in the number of resolved bipartitions relative to the unweighted trees was also seen when E = 50 and E = 250. In 7 out of 8 of these simulations, 100% of the resolved bipartitions were concordant, with the only exception being the habitat-biased simulation at E = 250.

DISCUSSION

Effectiveness of the Normalized BLASTP Method and Refinements

By including confounding effects such as gene duplications and losses, and changes in the underlying mutational biases of genomes, EvolSimulator produces datasets that can violate the assumptions of many phylogenetic reconstruction methods. Drifts in genomic G+C bias could potentially confound the BLASTP analysis by making distantly related genomes appear more similar to one another if they share similar genomic G+C biases: this effect could potentially be mirrored by compositional or functional convergence in real genomes. In spite of this, trees constructed using the normalized BLASTP method were able to correctly recover (with BP ≥ 0.7) more than 60% of the bipartitions in the original simulated tree when LGT was absent from the simulation.
The earliest, closely spaced branches in the tree, which describe the relationships between the major groupings of simulated taxa, were weakly supported and generally incorrect. This result provides an interesting contrast with the genome trees of Gophna et al. (2005), where nearly all recovered bipartitions had an associated BP of 100%. It is not uncommon for genomic phylogenies to offer strong support for deep relationships (Wolf et al., 2001), but these relationships may be reflective of compositional biases and unequal rates of evolution, which can overwhelm what little phylogenetic signal remains at great depths (Jermiin et al., 2004). The non-linearity of the relationship between evolutionary distance and normalized BLASTP distance between pairs of genomes may influence the recovery of correct relationships: further refinements to the normalized BLASTP method could include transformations of the normalized BLASTP distance and reweighting of distances between individual pairs of orthologs to yield linear relationships with lower variance.

Concordance weighting of normalized BLASTP scores tended to increase the overall statistical support for the recovered genome phylogeny, but with the drawback of increasing the support for a few discordant bipartitions to a point above the threshold of significance. In cases where true history and bias (compositional or otherwise) conflict in the relationships they support, concordance weighting will favour one solution over the other. It appears that in most but not all cases in this simulation, the balance favoured the true vertical history. Ultimately, it may be worthwhile to investigate the role of bias in detail, and examine the combined effects of concordance weighting and accounting for composition using modified evolutionary models (Jayasawal et al., 2007) or residue recoding (Phillips et al., 2004; Susko and Roger, 2007).
In these simulations, discordance weighting did not emphasize major pathways of LGT: in most cases the effect of weighting for discordance was to drastically decrease the statistical support for most relationships in the recovered tree. The most likely reason for this effect is the nature of vertical and lateral relationships in these simulations: while there is only one vertical history, in every class of simulation many different types of lateral relationships are expected. This is true even for the habitat-directed scenario, where organisms in our simulations could switch between habitats with frequencies that are likely higher than the extreme cases (e.g., Thermoplasmatales) described in the Introduction and below. In all cases, the resulting lateral relationships would be better represented by a network than by a tree of putative lateral histories.

The Effects of Directed vs. Random Exchanges

Two quantities were examined to assess the decay of phylogenomic signal with different rates and scenarios of LGT: the total number of strongly supported bipartitions in the reconstructed genome tree, and the proportion of these bipartitions that were in agreement with the true tree. When LGT events preferentially occurred between closely related genomes, the overall effect was to increase the total number of resolved bipartitions; most (but not all) of these gained bipartitions were concordant. Such events decrease the effective time since divergence, because many proteins are subjected to orthologous replacement.

Although gene content-biased LGT might have been expected to yield results similar to relations-biased LGT, in most cases the degree of resolution and concordance from the content-biased simulations was most similar to that obtained from random LGT. In fact, content-biased LGT and random LGT are equivalent if the gene content is identical among all organisms.
While gene content did vary across organisms in these simulations, the variation appears to have been neither sufficiently large nor consistent across lineages to distinguish the content-biased simulations from the random ones. This observation highlights an interesting distinction between our simulations and the expected empirical case as articulated by the Complexity Hypothesis (Jain et al., 1999): in our simulations, all genes were equally transferable, regardless of their importance to the organism or distribution across the simulated tree of life. The Complexity Hypothesis states that proteins involved in large complexes are resistant to transfer: since most of these proteins are informational and ubiquitous (or nearly so) in living organisms, the most frequently occurring proteins are therefore least likely to be transferred. Reducing the contribution of essential and ubiquitous genes to the calculation of shared gene content would have enhanced the differences among genomes, and likely led to LGT scenarios that were more strongly influenced by phylogenetic relatedness and possibly habitat as well.

Unlike the other classes of simulation examined here, habitat-biased simulations always produced genome trees with features that were strongly supported but discordant with the true tree. While LGT events can have disruptive influences on the recovered genome phylogeny, a random distribution of events should (absent other biases) merely reduce the statistical support for the correct phylogeny. But if gene sharing events occur preferentially between certain lineages, as is the case with habitat-directed LGT, then the lateral signal may produce a strongly supported alternative to the vertical topology.
The ultimate effect will likely not be a clear co-location of frequently exchanging taxa in the recovered genome tree, but rather a phylogenetic compromise that is influenced by both the vertical and lateral histories, but in fact displays neither. Emergent features of supertrees have sometimes been described as ‘signal enhancement’ (Bininda-Emonds et al., 1999), but in this case the relationship that emerges is not reflective of any true signal.

A survey of published genome phylogenies, supermatrices and supertrees suggests that many such effects may be influencing recovered trees. An example reported in the genome phylogeny work of Gophna et al. (2005) concerns the positioning of the Thermoplasmatales, which are typically placed near the base of the Archaea in genome phylogenies. It appears that the positioning of Thermoplasmatales reflects a compromise between its vertical history (shared with other Euryarchaeota) and a habitat-directed highway of gene sharing with the thermoacidophilic crenarchaeal genus Sulfolobus. A breakdown of recovered quartets from the phylogenomic analysis of Beiko et al. (2005) identified groups of Thermoplasma proteins with strong affinities for both the hypothetical vertical and lateral histories, with very little support for the reported positioning of this group as basal to the Archaea.

The datasets simulated here could be used to validate other phylogenomic methods as well, including concatenation, supertrees and parsimony methods such as conditioned reconstruction (Lake and Rivera, 2004). All of these methods are sensitive to phylogenetic incongruence, but some approaches may be better able to extract the historical, ‘vertical’ signal from among a set of discordant relationships (if this is indeed the goal).
For instance, supertree approaches may be preferable to sequence concatenation, since orthologous gene sets with weak and potentially misleading phylogenetic signal can be removed from the analysis if the trees they generate have weak statistical support. A concatenated alignment would need to consider every site from these genes, and might therefore be expected to be more susceptible to ‘compromise’ topologies.

Other groups potentially affected by such vertical / lateral conflicts include the hyperthermophiles Aquifex aeolicus and Thermotoga maritima: there is strong phylogenetic and physiological (see e.g. Cavalier-Smith, 2002) evidence that A. aeolicus is ancestrally an ε-proteobacterium, and T. maritima a low G+C Firmicute. Their frequent positioning together at the base of the tree appears to arise from a combination of factors, including many putative highways of gene sharing among thermophiles (including T. maritima with Archaeal groups such as Pyrococcus), and biased nucleotide and amino acid compositions that influence the results of BLAST searches and phylogenetic analyses. When photosynthetic organisms from different domains are included in phylogenomic analyses, endosymbiotic events may influence the recovered phylogeny (Charlebois et al., 2004; Gophna et al., 2005). Many more strongly conflicting pathways of vertical and lateral inheritance may exist: examination of environmental organisms with shared habitats using metagenomic methods (Musovic et al., 2006) has the potential to clarify the role of LGT in the environment.

While the experiments performed in this analysis by no means recapture the entire complexity of microbial evolution, they illustrate the confounding effects of different scenarios of lateral genetic transfer and have important consequences for our ultimate ability to recover a tree (or network) of microbial life.
It is perhaps startling that the very high incidence of LGT in some of our simulated data sets (particularly the random LGT simulation with E = 250) does not completely erase the vertical evolutionary signal in the data. The surprising persistence of vertical signal (albeit with relatively greater loss of ancient relationships) supports the idea of laterally transferred genes as "fibres in a rope" (Zhaxybayeva et al., 2004) that carry different individual histories, but together constitute a cohesive picture of vertical evolution. Such a picture must depend (as does a supertree) on different (locally untransferred) genes carrying vertical signal in different parts of the tree, and demands further elucidation through modeling. However, the introduction of bias into LGT scenarios leads to the conflation of different vertical and lateral signals in the resulting genome phylogeny. The true nature of LGT regimes in the wild will depend on mechanistic and selective factors that influence the probability of success of individual LGT events. If the majority of persistent LGT (as opposed to transient LGT: see e.g. Hao and Golding, 2006) is confined to close relatives, then the vertical history or ‘tree of cells’ will be reinforced by ‘xenologous’ genes that diverged more recently than the last common ancestor of the cells that contain those genes. While these genes do not match the reference phylogeny, they can contribute to its recovery and resolution. However, where LGT occurs in a non-random fashion between distant relatives, our ability to recover such a tree, or indeed even a significantly restricted network, will be severely compromised or lost.

SUPPLEMENTAL MATERIAL

The configuration files used in the above simulations, and all of the genome trees used in this paper are given in the Supplemental Material (available online at http://systematicbiology.org).
EvolSimulator 2.0.4 can be obtained freely from http://bioinformatics.org.au/evolsim.

ACKNOWLEDGMENTS

The authors would like to thank Olaf Bininda-Emonds, James McInerney, James Lake and Craig Herbold for helpful comments on the manuscript. This work was supported by Australian Research Council Grants DP0342987 and CE0348221 and by the Genome Canada-funded project "Understanding Prokaryotic Genome Evolution and Diversity" (WF Doolittle, PI).

REFERENCES

Altschul, S.F., T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

Auch, A.F., S.R. Henz, B.R. Holland, and M. Göker. 2006. Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences. BMC Bioinformatics 7:350.

Baldauf, S.L., A.J. Roger, I. Wenk-Siefert, and W.F. Doolittle. 2000. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972-977.

Beiko, R.G. and R.L. Charlebois. 2007. A simulation test bed for hypotheses of genome evolution. Bioinformatics 23:825-831.

Beiko, R.G. and N. Hamilton. 2006. Phylogenetic identification of lateral genetic transfer events. BMC Evol. Biol. 6:15.

Beiko, R.G., T.J. Harlow, and M.A. Ragan. 2005. Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. 104:14332-14337.

Belda, E., A. Moya, and F.J. Silva. 2005. Genome rearrangement distances and gene order phylogeny in gamma-Proteobacteria. Mol. Biol. Evol. 22:1456-1467.

Bininda-Emonds, O.R.P. 2004. The evolution of supertrees. Trends Ecol. Evol. 19:315-322.

Bininda-Emonds, O.R.P., J.L. Gittleman, and A. Purvis. 1999. Building large trees by combining phylogenetic information: A complete phylogeny of the extant Carnivora (Mammalia). Biol. Rev. 74:143-175.

Brochier, C., P. Forterre, and S. Gribaldo. 2004.
Archaeal phylogeny based on proteins of the transcription and translation machineries: tackling the Methanopyrus kandleri paradox. Genome Biol. 5:R17.

Cavalier-Smith, T. 2002. The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification. Int. J. Syst. Evol. Microbiol. 52:7-76.

Charlebois, R.L., R.G. Beiko, and M.A. Ragan. 2004. Genome phylogenies. In Organelles, genomes and eukaryote phylogeny: An evolutionary synthesis in the age of genomics (eds. R.P. Hirt and D.S. Horne), pp. 189-206. CRC Press, Boca Raton, FL.

Clarke, G.D.P., R.G. Beiko, M.A. Ragan, and R.L. Charlebois. 2002. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J. Bacteriol. 184:2072-2080.

Creevey, C.J., D.A. Fitzpatrick, G.K. Philip, R.J. Kinsella, M.J. O'Connell, M.M. Pentony, S.A. Travers, M. Wilkinson, and J.O. McInerney. 2004. Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc. Biol. Sci. 271:2551-2558.

Daubin, V., M. Gouy, and G. Perrière. 2001. Bacterial molecular phylogeny using supertree approach. Genome Inform. 12:155-164.

Desper, R. and O. Gascuel. 2002. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol. 9:687-705.

Dutilh, B.E., M.A. Huynen, W.J. Bruno, and B. Snel. 2004. The consistent phylogenetic signal in genome trees revealed by reducing the impact of noise. J. Mol. Evol. 58:527-539.

Fitch, W.M. and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279-84.

Galtier, N. 2007. A model of horizontal gene transfer and the bacterial phylogeny problem. Syst. Biol. 56:633-642.

Gophna, U., W.F. Doolittle, and R.L. Charlebois. 2005. Weighted genome trees: refinements and applications. J. Bacteriol. 187:1305-1316.

Hao, W. and G.B. Golding.
2006. The fate of laterally transferred genes: life in the fast lane to adaptation or death. Genome Res. 16:636-643.

Holland, B.R., K.T. Huber, V. Moulton, and P.J. Lockhart. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. 21:1459-1461.

Hollander, M. and D.A. Wolfe. 1999. Nonparametric Statistical Methods, 2nd Edition. New York: John Wiley & Sons.

Jain, R., M.C. Rivera, and J.A. Lake. 1999. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. 96:3801-3806.

Jayaswal, V., J. Robinson, and L. Jermiin. 2007. Estimation of phylogeny and invariant sites under the general Markov model of nucleotide sequence evolution. Syst. Biol. 56:155-162.

Jermiin, L., S.Y. Ho, F. Ababneh, J. Robinson, and A.W. Larkum. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. 53:638-643.

Kunin, V., L. Goldovsky, N. Darzentas, and C.A. Ouzounis. 2005. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 15:954-959.

Lake, J.A. and M.C. Rivera. 2004. Deriving the genomic tree of life in the presence of horizontal gene transfer: conditioned reconstruction. Mol. Biol. Evol. 21:681-690.

Musovic, S., G. Oregaard, N. Kroer, and S.J. Sørensen. 2006. Cultivation-independent examination of horizontal transfer and host range of an IncP-1 plasmid among gram-positive and gram-negative bacteria indigenous to the barley rhizosphere. Appl. Environ. Microbiol. 72:6687-6692.

Phillips, M.J., F. Delsuc, and D. Penny. 2004. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 21:1455-1458.

Ragan, M.A. 2006. On surrogate methods for detecting lateral gene transfer. FEMS Microbiol. Lett. 201:187-191.

Ragan, M.A., T.J. Harlow, and R.G. Beiko. 2006.
Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol. 14:4-8.

Sankoff, D., G. Leduc, N. Antoine, B. Paquin, B.F. Lang, and R. Cedergren. 1992. Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. 89:6575-6579.

Snel, B., P. Bork, and M.A. Huynen. 1999. Genome phylogeny based on gene content. Nat. Genet. 21:108-110.

Susko, E. and A.J. Roger. 2007. On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 24:2139-2150.

Swofford, D.L. and G.J. Olsen. 1990. Phylogeny reconstruction. In Molecular Systematics (eds. D.M. Hillis and C. Moritz), pp. 411-501. Sinauer Associates, Sunderland, Massachusetts.

Weisburg, W.G., S.J. Giovannoni, and C.R. Woese. 1989. The Deinococcus-Thermus phylum and the effect of rRNA composition on phylogenetic tree construction. Syst. Appl. Microbiol. 11:128-134.

Whelan, S. and N. Goldman. 2000. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18:691-699.

Wilkinson, M., D. Pisani, J.A. Cotton, and I. Corfe. 2005. Measuring support and finding unsupported groups in supertrees. Syst. Biol. 54:823-831.

Woese, C.R., O. Kandler, and M.L. Wheelis. 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. 87:4576-4579.

Wolf, Y.I., I.B. Rogozin, N.V. Grishin, R.L. Tatusov, and E.V. Koonin. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1:8.

Zhaxybayeva, O., P. Lapierre, and J.P. Gogarten. 2004. Genome mosaicism and organismal lineages. Trends Genet. 20:254-260.

Zhaxybayeva, O., J.P. Gogarten, R.L. Charlebois, W.F. Doolittle, and R.T. Papke. 2006.
Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res. 16:1099-1108.

Table 1: Topological features of genome trees constructed from different LGT scenarios and rates of transfer. The first two columns show the model used (scenario and rate). The following three columns summarize the bipartitions in the resulting genome tree that are supported by at least 70% of bootstrap replicates: the numbers of such bipartitions that are concordant and discordant, followed by the percentage of bipartitions that are concordant. Following this are three columns showing equivalent results for bipartitions that are supported by at least 90% of bootstrap replicates. The final two columns show the edit distance recovered by EEEP between the genome tree inferred by the normalized BLASTP method (considering only those bipartitions that are supported at a BP of 70% and 90%, respectively) and the ‘true tree’ that was used to simulate the evolution of the set of genomes.

LGT         Rate of     Support70             Support90
Scenario    Transfer    Conc   Disc   %C      Conc   Disc   %C      ED70   ED90
None          0         28     0      100     25     0      100     0      0
Relations    10         28     1      96.6    24     1      96.0    0      0
Relations    50         34     4      89.4    28     2      93.3    3      1
Relations   250         35     5      87.5    29     2      93.5    3      0
Habitat      10         31     7      81.6    23     2      92.0    5      1
Habitat      50         29     5      85.3    22     4      84.6    4      4
Habitat     250         20     9      69.0    18     6      75.0    >7     8
Content      10         29     3      90.6    27     1      96.4    2      0
Content      50         34     2      94.4    27     1      96.4    1      0
Content     250         24     0      100     22     0      100     0      0
Random       10         29     2      93.5    25     1      96.2    0      0
Random       50         31     0      100     29     0      100     0      0
Random      250         20     0      100     20     0      100     0      0

Figure Legends

SUPPLEMENTAL FIGURE 1. Tree representation of the organismal history simulated in replicate 1, including extinct and extant taxa. Branch lengths are proportional to the number of iterations in the simulation, with the common ancestor at iteration 1 and extant lineages extending to iteration 5000. Major groupings of taxa (supported by long internal branches in the tree of extant genomes) are indicated with Greek letters.
FIGURE 1. Tree representation of the organismal history simulated in replicate 1, showing only the lineages ancestral to the population of 56 taxa extant at the end of the simulation (iteration 5000). Branch lengths are proportional to simulation iterations. Major groupings of taxa (supported by long internal branches in the tree) are indicated with Greek letters as in Supplemental Figure 1.

FIGURE 2. Relationship between the number of iterations since the divergence of each pair of extant genomes, and the corresponding mean normalized BLASTP distance for all shared orthologs between the pair of genomes. Each point in the graph corresponds to a pair of genomes from a given replicate: all genome pairs from all replicates are shown.

FIGURE 3. Proportion of bipartitions supported at or above a series of thresholds that are concordant or discordant with the true genome tree, recovered from the five replicate simulations performed without LGT. Replicates are ordered 1-5 from left to right, and sets of bars within each replicate correspond to bootstrap thresholds increasing from 0.50 to 1.00 in increments of 0.05. Shaded bars indicate the proportion of resolved bipartitions that are concordant (i.e., are consistent with the simulated genome history), while the stacked white bars indicate the corresponding proportion that are discordant.

FIGURE 4. Normalized BLASTP-based genome tree inferred from the set of genomes simulated without LGT in Replicate 1. Numbers at the internal nodes indicate bootstrap support (from a total of 100 replicates) for the partitioning of taxa indicated by that node. The colouring scheme for true monophyletic groups is consistent with that used in Supplemental Figure 1 and Figure 1 in the text.

FIGURE 5. Normalized BLASTP-based genome tree inferred from the set of genomes simulated with habitat-directed LGT (E = 10 attempted events per iteration) in Replicate 1.
Numbers at the internal nodes indicate bootstrap support (from a total of 100 replicates) for the partitioning of taxa indicated by that node; only nodes with ≥ 70% bootstrap support are shown, and strongly supported nodes that are discordant with the true tree are indicated with a larger font.

FIGURE 6. The percentage of strongly supported bipartitions in genome trees from all replicates that are concordant. Two bootstrap thresholds (70% in panel a and 90% in panel b) were used to define strong support: the mean ± standard deviation is shown for each combination of BP threshold, LGT regime, and rate of LGT (= E). The different types of LGT regime are abbreviated as follows: n = no LGT (E = 0); r = relations-biased LGT; h = habitat-biased LGT; p = gene content-biased LGT; x = random LGT (proposed events are always successful).

FIGURE 7. The relationship between branch length and bootstrap support of genome trees recovered from data simulated under three different LGT regimes (empty circles = no LGT; gray circles = habitat-directed LGT with E = 250; black circles = random LGT with E = 250). Each set of data points is an aggregate of the five replicate simulations for the stated evolutionary scenario. The lines above the plot show the distribution of branch lengths recovered for each set of genome trees: the length of the horizontal line shows the range, dashed vertical lines show the upper value of the first and third quartiles, and the solid vertical line corresponds to the median branch length.

FIGURE 8. The percentage of strongly supported nodes (BP ≥ 70) in variously weighted genome trees from Replicate 1 that are concordant (panel a), and the total count of strongly supported nodes (panel b).
Each triplet of bars associated with a unique combination of E and LGT regime refers to a different weighting criterion for normalized BLASTP values, from left to right: concordance-weighted (darkest gray), unweighted (lightest gray), and discordance-weighted (intermediate gray). LGT regime abbreviations are consistent with Figure 6.
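The two summary quantities reported in Table 1 and Figure 8 (the count of strongly supported bipartitions and the percentage of them that are concordant with the true tree) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: bipartitions are represented as sets of taxon names, the helper names `canonical` and `summarize` are hypothetical, and the five-taxon example data are invented.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

Bipartition = FrozenSet[str]


def canonical(side: Iterable[str], taxa: Set[str]) -> Bipartition:
    """Represent a split by the side containing the alphabetically first taxon,
    so that A|B and B|A compare as equal."""
    side_set = frozenset(side)
    anchor = min(taxa)
    return side_set if anchor in side_set else frozenset(taxa - side_set)


def summarize(inferred: Dict[Bipartition, float],
              true_splits: Set[Bipartition],
              threshold: float = 70.0) -> Tuple[int, int, Optional[float]]:
    """Count concordant and discordant bipartitions with bootstrap support
    at or above the threshold, and the percentage that are concordant."""
    strong = [split for split, bp in inferred.items() if bp >= threshold]
    conc = sum(1 for split in strong if split in true_splits)
    disc = len(strong) - conc
    pct = round(100.0 * conc / len(strong), 1) if strong else None
    return conc, disc, pct


# Invented five-taxon example (not data from the simulations).
taxa = {"A", "B", "C", "D", "E"}
true_splits = {canonical(["A", "B"], taxa), canonical(["D", "E"], taxa)}
inferred = {
    canonical(["A", "B"], taxa): 95.0,  # in the true tree, strongly supported
    canonical(["C", "D"], taxa): 75.0,  # not in the true tree
    canonical(["D", "E"], taxa): 60.0,  # in the true tree, weakly supported
}

at_70 = summarize(inferred, true_splits, threshold=70.0)
at_90 = summarize(inferred, true_splits, threshold=90.0)
```

Raising the threshold from 70 to 90 drops the discordant split in this toy case, which mirrors the way the Support90 columns of Table 1 generally contain fewer discordant bipartitions than the Support70 columns.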