Detecting horizontal gene transfers using discrepancies in species and gene classifications
Vladimir Makarenkov and Alix Boc
Université du Québec à Montréal

Presentation summary
• Network models in phylogenetic analysis
• What is a horizontal gene transfer (HGT)?
• Description of the new method
• Examples of application
• Future work
• T-REX software

Inferring phylogenetic trees
Four main approaches:
• Distance-based methods
    • UPGMA by Michener and Sokal (1957)
    • ADDTREE by Sattath and Tversky (1977)
    • Neighbor-joining (NJ) by Saitou and Nei (1988)
    • UNJ and BioNJ methods by Gascuel (1997)
    • Fitch by Felsenstein (1997)
    • Weighted least-squares MW by Makarenkov and Leclerc (1999)
• Maximum Parsimony (Camin and Sokal 1965; Farris 1970; Fitch 1971)
• Maximum Likelihood (Felsenstein 1981)
• Bayesian approach (Rannala and Yang 1996; Huelsenbeck and Ronquist 2001)

Phylogenetic mechanisms requiring a network representation
[Figure: two diagrams with leaves labeled 1 to 5]
• Horizontal gene transfer (i.e. lateral gene transfer)
• Hybridization
• Homoplasy and gene convergence
• Gene duplication and gene loss

Software for building phylogenetic networks
• SplitsTree, Huson (1998)
• T-REX, Makarenkov (2001)
• NeighborNet, Bryant and Moulton (2002)

Methods for detecting horizontal gene transfers
• Hein (1990) and Hein et al.
(1995, 1996)
• von Haeseler and Churchill (1993)
• Page (1994); Page and Charleston (1998)
• Charleston (1998)
• Hallett and Lagergren (2001)
• Mirkin, Fenner, Galperin and Koonin (2003)
• V'yugin, Gelfand and Lyubetsky (2003)
• Boc and Makarenkov (2003); Makarenkov, Boc and Diallo (2004)

The types of horizontal gene transfer
Partial gene transfer versus complete transfer
[Figure: rooted trees on species A-F with numbered internal nodes, illustrating a partial transfer and a complete transfer; panels (a) and (b)]

The new model
Basic ideas:
[Figure: a species tree and a gene tree on species A-F]
1) Reconcile the species and gene phylogenetic trees using either a topological criterion (the Robinson and Foulds topological distance) or a metric criterion (least squares)
2) Incorporate the necessary biological rules into the mathematical model
3) Keep the time complexity of the algorithm polynomial

Partial gene transfer. Incorporating biological rules.
[Figure: rooted tree with leaves i, i1, and j, and an HGT branch (a,b) drawn between branches (x,y) and (z,w)]
Situations when a new HGT branch (a,b) can affect the evolutionary distance between species i and j, and cannot affect the distance between i1 and j.

Partial gene transfer. Incorporating biological rules (2).
[Figure: three rooted trees, panels (a), (b), and (c), each with leaves i and j and an HGT branch (a,b) between branches (x,y) and (z,w)]
Three cases in which the evolutionary distance between the species i and j is not affected by the addition of a new HGT branch (a,b).

Partial gene transfer. Incorporating biological rules (3).
[Figure: rooted tree]
No HGT can be considered when the affected branches are located on the same lineage.

Partial gene transfer. Incorporating biological rules (4).
[Figure: rooted tree with transfers LGT1 and LGT2 between Lineage 1 and Lineage 2]
No HGT can be considered when two HGTs affecting a pair of lineages intersect as shown.

Partial gene transfer. Incorporating biological rules (5).
[Figure: four panels, (a) to (d), each with leaves i and j and two HGT branches (a,b) and (a1,b1)]
• Cases (a) and (b): the path between the leaves i and j is allowed to go through both HGT branches (a,b) and (a1,b1).
• Cases (c) and (d): the path between the leaves i and j is not allowed to go through both HGT branches (a,b) and (a1,b1).

Algorithmic scheme
Step 1. Construction of the species and gene phylogenies T and T1
• Let X be a set of n taxa (i.e. species or objects).
• Infer a binary species phylogenetic tree T from the sequence or distance data (using 16S or 23S rRNAs or other genes that are not assumed to be transferred horizontally). This tree has 2n-3 branches and n leaves.
• T is explicitly rooted.
• Consider the same n taxa from X. Infer a binary gene phylogenetic tree T1.

Algorithmic scheme
Step 2. LS mapping of the gene tree into the species tree
• If the topologies of T and T1 are identical, no horizontal gene transfer is indicated.
• If the topologies of T and T1 are different, the difference may be the result of horizontal gene transfers. The gene tree T1 can be mapped into the species tree T by fitting, by least squares, the branch lengths of T to the pairwise distances in T1 (Bryant and Waddell 1998; Makarenkov and Leclerc 1999).
• Each pair of branches of the species tree T is tested for the possibility of an HGT. All branch lengths in T are reassessed according to the pairwise distances in T1.

Algorithmic scheme
Step 3. Unique and Multiple gene transfer scenarios
Once all pairs of branches in T have been tested, an ordered list L of all possible HGT connections between pairs of branches in T can be established. Each entry of L is associated with the gain in fit obtained for that particular HGT. The researcher can then either:
1. Choose the best (most probable) HGTs from this list, taking into account the HGT order in L as well as any useful knowledge about the data at hand (Multiple Scenario).
2. Or add to the species tree T the best HGT branch, then the second best (according to the LS criterion), the third, and so forth, so that each new HGT branch is computed taking into account all previously added HGTs (Unique Scenario).
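The least-squares mapping of Step 2 can be sketched on a toy case. The following is a minimal illustration, not the authors' implementation: it assumes a hypothetical 4-taxon tree with topology ((A,B),(C,D)), which has 2n-3 = 5 branches, and fits its branch lengths to a table of pairwise gene-tree distances by unconstrained ordinary least squares. The names `LEAVES`, `BRANCHES`, `path_matrix`, `solve`, and `fit_branch_lengths` are invented for this sketch, and the cited methods may weight the squared errors or constrain the lengths differently.

```python
from itertools import combinations

# Hypothetical 4-taxon tree ((A,B),(C,D)): 2n-3 = 5 branches.
# Each branch is described by the set of leaves on one of its sides.
LEAVES = ["A", "B", "C", "D"]
BRANCHES = [{"A"}, {"B"}, {"C"}, {"D"}, {"A", "B"}]


def path_matrix(leaves, branches):
    """Rows are leaf pairs, columns are branches; 1 if the branch is on the path."""
    pairs = list(combinations(leaves, 2))
    rows = [[1.0 if (i in side) != (j in side) else 0.0 for side in branches]
            for i, j in pairs]
    return pairs, rows


def solve(m, v):
    """Solve the square system m x = v by Gauss-Jordan elimination with pivoting."""
    n = len(v)
    aug = [row[:] + [v[k]] for k, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]


def fit_branch_lengths(leaves, branches, delta):
    """Unconstrained least-squares branch lengths via (A^T A) b = A^T delta.

    Returns the fitted lengths and the residual sum of squares.
    """
    pairs, a = path_matrix(leaves, branches)
    y = [delta[p] for p in pairs]
    m = len(branches)
    ata = [[sum(row[r] * row[c] for row in a) for c in range(m)] for r in range(m)]
    aty = [sum(row[r] * yk for row, yk in zip(a, y)) for r in range(m)]
    lengths = solve(ata, aty)
    fitted = [sum(x * b for x, b in zip(row, lengths)) for row in a]
    q = sum((f - yk) ** 2 for f, yk in zip(fitted, y))
    return lengths, q


# Gene-tree distances that agree with the species topology (made-up values).
concordant = {("A", "B"): 3.0, ("A", "C"): 4.5, ("A", "D"): 5.5,
              ("B", "C"): 5.5, ("B", "D"): 6.5, ("C", "D"): 7.0}
# Gene-tree distances generated from the conflicting topology ((A,C),(B,D)).
discordant = {("A", "B"): 3.5, ("A", "C"): 4.0, ("A", "D"): 5.5,
              ("B", "C"): 5.5, ("B", "D"): 6.0, ("C", "D"): 7.5}

for name, delta in (("concordant", concordant), ("discordant", discordant)):
    lengths, q = fit_branch_lengths(LEAVES, BRANCHES, delta)
    print(name, [round(b, 3) for b in lengths], round(q, 3))
```

When the gene-tree distances fit the species topology, the residual is zero and the original branch lengths are recovered. With the conflicting distances the residual is positive and the unconstrained fit gives a negative internal branch, which is the kind of discrepancy that adding an HGT branch is meant to absorb.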
Optimization problem: Least-squares
The least-squares loss function to be minimized with an unknown length l of the HGT branch (a,b):

$$Q(ab,l) = \sum_{dist(i,j) > l} \left( \min\{ d(i,a) + d(j,b);\ d(j,a) + d(i,b) \} + l - \delta(i,j) \right)^2 + \sum_{dist(i,j) \le l} \left( d(i,j) - \delta(i,j) \right)^2 \rightarrow \min$$

where
• d(i,j) is the minimum path-length distance between the leaves (i.e. taxa) i and j in the tree T
• δ(i,j) is the given dissimilarity value between i and j
• dist(i,j) = d(i,j) − min{ d(i,a) + d(j,b); d(j,a) + d(i,b) }
[Figure: rooted tree with leaves i and j and an HGT branch (a,b) between branches (x,y) and (z,w)]

Complete gene transfer
[Figure: step-by-step scenario on species A-F. Panel 1: Species Tree T with upcoming HGT1. Panel 2: Species Tree + HGT1 with upcoming HGT2. Panel 3: Species Tree + HGT2 with upcoming HGT3. Final panel: Species Tree + HGT3, identical to the Gene Tree T1]

Optimization problem: Robinson and Foulds topological distance
[Figure: trees T and T1 on leaves A-E]
The topological distance of Robinson and Foulds (1981) between two phylogenetic trees is equal to the minimum number of elementary operations, consisting of merging or splitting vertices, necessary to transform one tree into the other.

Robinson and Foulds topological distance
[Figure: tree T, an intermediate tree, and tree T1, all on leaves A-E]
The Robinson and Foulds distance between T and T1 is 2.
The HGT minimizing the Robinson and Foulds topological distance between the species and gene phylogenetic trees can be considered the best candidate to reconcile the species and gene phylogenies.

HGT detection algorithm: Complete transfer
• Test all connections between pairs of branches in the species tree T.
• Compute the RF distance or the LS coefficient for each connection. In the case of LS optimization, the length of each edge of the species tree is reassessed according to the gene distance matrix.
• The best HGT found (i.e. the HGT minimizing the optimization criterion) is added to the species tree, transforming it into another phylogenetic tree.
• Repeat while the transformed species tree is topologically different from the gene tree.
• Time complexity: O(kn^4) for k HGTs and n species.

Application example 1
Horizontal transfer of the Rubisco large subunit gene
Delwiche, C.F. and Palmer, J.D. (1996). Rampant Horizontal Transfer and Duplication of Rubisco Genes in Eubacteria and Plastids. Mol. Biol. Evol. 13:873-882.

rbcL Gene Phylogeny
[Figure: rbcL gene tree, divided into Red Type (FORM I) and Green Type (FORM I). Group labels: Proteobacteria, Red and Brown Plastids, Cyanobacteria, Glaucophyte Plastid, Green Plastids. Leaves in figure order: Rhodobacter sphaeroides I, Xanthobacter, Alcaligenes H16 chromosomal, Alcaligenes H16 plasmid, Alcaligenes 17707 chromosomal, Mn oxidizing bacterium (S|85-9a1), Cyanidium, Ahnfeltia, Antithamnion, Porphyridium, Cryptomonas, Ectocarpus, Olisthodiscus, Cylindrotheca, Prochlorococcus, Hydrogenovibrio L2, Chromatium L, Pseudomonas, Thiobacillus ferrooxidans fe1, Nitrobacter, Thiobacillus ferr. 19859, Thiobacillus denitrificans I, Endosymbiont, Synechococcus, Anabaena, Prochlorothrix, Anacystis, Synechocystis, Prochloron, Cyanophora, Euglena, Pyramimonas, Chlamydomonas, Chlorella, Bryopsis, Coleochaete, Marchantia, Pseudotsuga, Nicotiana, Oryza]

Delwiche and Palmer (1996): hypotheses of HGTs
1- Cyanobacteria → γ-Proteobacteria
2- α-Proteobacteria → Red and brown algae
3- γ-Proteobacteria → α-Proteobacteria
4- γ-Proteobacteria → β-Proteobacteria

HGTs of the rbcL gene
[Figure: species tree with seven numbered HGTs (1 to 7). Group labels: [?]-Proteobacteria, β-Proteobacteria, [?]-Proteobacteria, Cyanobacteria, Glaucophyte plastid, Red and Brown algae, Green Plastids. Leaves in figure order: Rhodobacter sphaeroides I, Xanthobacter, Mn oxidizing bacterium, Nitrobacter, Alcaligenes H16 plasmid, Alcaligenes chromosomal, Alcaligenes 17707 chromosomal, Thiobacillus denitrificans I, Endosymbiont, Pseudomonas, Thiobacillus fe1, Thiobacillus ferr. 19859, Chromatium L, Hydrogenovibrio L2, Prochlorococcus, Anabaena, Synechococcus, Anacystis, Prochlorothrix, Synechocystis, Prochloron, Cyanophora, Cylindrotheca, Olisthodiscus, Ectocarpus, Cryptomonas, Antithamnion, Ahnfeltia, Porphyridium, Cyanidium, Oryza, Nicotiana, Pseudotsuga, Marchantia, Coleochaete, Pyramimonas, Euglena, Chlamydomonas, Bryopsis, Chlorella]

HGTs of the rbcL gene: comparison
Hypotheses by Delwiche and Palmer (1996)
1- Cyanobacteria → γ-Proteobacteria
2- α-Proteobacteria → Red and brown algae
3- γ-Proteobacteria → α-Proteobacteria
4- γ-Proteobacteria → β-Proteobacteria
Solution
1. α-Proteobacteria → Red and brown algae
2. [?]-Proteobacteria → β-Proteobacteria
3. [?]-Proteobacteria → γ-Proteobacteria
4. [?]-Proteobacteria → [?]-Proteobacteria
5. γ-Proteobacteria → β-Proteobacteria
6. γ-Proteobacteria → Cyanobacteria
7. γ-Proteobacteria → β-Proteobacteria

Application example 2
Horizontal transfers of the protein rpl12e
Data taken from: Matte-Tailliez, O., Brochier, C., Forterre, P. and Philippe, H. (2002). Archaeal phylogeny based on ribosomal proteins. Mol. Biol. Evol. 19, 631-639.

Rpl12e HGTs
Assumed HGTs of the rpl12e gene involved the clusters of Crenarchaeota and Thermoplasmatales (Matte-Tailliez, 2004)
[Figure: two trees, each on the same 14 archaeal taxa: Ferroplasma acidarmanus, Thermoplasma acidophilum, Pyrobaculum aerophilum, Aeropyrum pernix, Sulfolobus solfataricus, Pyrococcus furiosus, Pyrococcus abyssi, Pyrococcus horikoshii, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, Methanosarcina barkeri, Haloarcula marismortui, Halobacterium sp.]
Panel titles: Species tree; Rpl12e gene tree

Reconciliation scenario
[Figure: species tree with five numbered HGTs (1 to 5). Leaves in figure order: Pyrobaculum aerophilum, Aeropyrum pernix, Sulfolobus solfataricus, Pyrococcus furiosus, Pyrococcus abyssi, Pyrococcus horikoshii, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, Methanosarcina barkeri, Haloarcula marismortui, Halobacterium sp., Thermoplasma acidophilum, Ferroplasma acidarmanus]

Future developments
• Maximum Likelihood model
• Maximum Parsimony model
• Validation methods (bootstrapping)
• Decreasing the running time

Bibliography
• Boc, A. and Makarenkov, V. (2003). New Efficient Algorithm for Detection of Horizontal Gene Transfer Events. Algorithms in Bioinformatics, G. Benson and R. Page (Eds.), 3rd Workshop on Algorithms in Bioinformatics, Springer-Verlag, pp. 190-201.
• Delwiche, C.F. and Palmer, J.D. (1996). Rampant Horizontal Transfer and Duplication of Rubisco Genes in Eubacteria and Plastids. Mol. Biol. Evol. 13, 873-882.
• Makarenkov, V. (2001). T-REX: reconstructing and visualizing phylogenetic trees and reticulation networks. Bioinformatics 17, 664-668.
• Makarenkov, V., Boc, A. and Diallo, A.B. (2004). Representing Lateral gene transfer in species classification. Unique scenario. IFCS'2004 proceedings, Chicago.
• Matte-Tailliez, O., Brochier, C., Forterre, P. and Philippe, H. (2002). Archaeal phylogeny based on ribosomal proteins. Mol. Biol. Evol. 19, 631-639.
• Robinson, D.F. and Foulds, L.R. (1981). Comparison of phylogenetic trees. Mathematical Biosciences 53, 131-147.

T-REX: Tree and Reticulogram Reconstruction
Downloadable from http://www.info.uqam.ca/~makarenv/trex.html
Author: Vladimir Makarenkov
Versions: Windows 9x/NT/2000/XP and Macintosh
With contributions from A. Boc, P. Casgrain, A. B. Diallo, O. Gascuel, A. Guénoche, P.-A. Landry, F.-J. Lapointe, B. Leclerc, and P. Legendre.
Reference: Makarenkov, V. 2001. T-REX: reconstructing and visualizing phylogenetic trees and reticulation networks. Bioinformatics 17: 664-668.
Methods available
• Six methods for inferring phylogenetic trees from distance data
• Three methods for reconstructing reticulograms (phylogenetic networks)
• Four methods for inferring phylogenetic trees from incomplete distance data
• Visualization and interactive manipulation of phylogenetic trees and networks

Phylogenetic tree inference methods
• ADDTREE by Sattath and Tversky (1977)
• Neighbor-joining (NJ) by Saitou and Nei (1988)
• UNJ and BioNJ methods by Gascuel (1997)
• Circular order reconstruction by Makarenkov and Leclerc (1997)
• Weighted least-squares MW by Makarenkov and Leclerc (1999)

Tree reconstruction with missing data
• Ultrametric procedure by De Soete (1984)
• Additive procedure by Landry et al. (1996)
• Triangles by Guénoche and Leclerc (2001)
• MW* by Makarenkov and Lapointe (2004)

Reticulogram reconstruction methods
• Reticulogram with detection of reticulate evolution processes, hybridization, or recombination events (Legendre and Makarenkov 2002; Makarenkov and Legendre 2004).
• Reticulogram with detection of horizontal gene transfer among species. Complete and Partial gene transfer models. Unique and Multiple scenarios (Boc and Makarenkov 2003; Makarenkov, Boc and Diallo 2004).
• Graphical representations: Hierarchical, Axial, or Radial views. Interactive manipulation of trees and reticulograms.

Bioinformatics software
Screenshot slides, by title:
• Reticulogram reconstruction methods: Horizontal gene transfer detection
• Tree reconstruction options
• T-REX output
• Results available (tree map + HGTs)
• Reticulogram: Hierarchical view
• Reticulogram: Axial view
• Reticulogram: Radial view
• Color selection option
• Copy as Bitmap or Metafile