1) Construction of Genome Trees from Conservation Profiles of Proteins. Fredj Tekaia. Ecole Phylogénomique, Carry-Le-Rouet, 11-14/12/2006, organized by INRA. Talk given on 13/12/2006. This image appeared in PLoS Computational Biology, Vol. 1, No. 7, December 2005: http://compbiol.plosjournals.org/perlserv/?request=get-toc&issn=1553-7358&volume=1&issue=7 The last part of the talk corresponds to results from a collaboration with Edouard Yeramian.

2) Outline of the talk. Species tree construction has long been a difficult task. The aim of this talk is to review traditional species tree construction methods and to emphasize their strengths and limitations. In the post-genome era, many attempts to construct species trees from whole gene content are emerging. Large-scale genome comparison studies have shed light on many evolutionary processes other than the phylogenetic process itself. We have introduced a genome tree construction based on conservation profiles. This construction method takes into account all proteins included in the considered proteomes and reduces the effects of duplication. This tree does not show the evolutionary histories of the species, but rather clusters species according to the similarity of their evolutionary histories.
In this talk I will present:
-the traditional "species tree" as constructed by Woese in 1990 and the alternative topologies that have been proposed, particularly in the light of recently acquired knowledge from genome sequences;
-the alternative species tree constructions that take into account several orthologous genes instead of only one gene type;
-the limitations of constructing species trees from gene trees;
-the necessity of considering the whole gene content of the genomes to construct species trees;
-the main evolutionary processes revealed by large-scale genome comparisons (including phylogeny, expansion, exchange and deletion);
-how to reduce the non-phylogenetic evolutionary processes, particularly duplication;
-large-scale proteome comparison and determination of protein conservation profiles;
-genome tree construction from conservation profiles: advantages and limitations.

3) The phylogenetic tree constructed by Woese in 1990, based on ribosomal RNA genes, is generally referred to as the "Species tree" or "Tree of life". This tree is supposed to show the evolutionary relationships between all organisms. These are organized into three main domains, called phylogenetic domains: Bacteria, Archaea and Eukaryotes. The tree of life has long served as a useful tool for describing the history and relationships of organisms over evolutionary time. A species is represented as a branching point, or node, on the tree, and the branches represent paths of descent from a parental node. The tree diagram carries an implicit assumption that genes are transferred vertically, from parent to child, and that all the genes in a new species come from the ancestral species. In theory, one should be able to trace the origin of each gene in a species back to its ancestor. In practice, however, the ancestral gene is rarely available, so researchers look for the gene in a closely related species. (These similar genes, which diverge slightly after a speciation event, are called orthologs.)
4) The topology of the tree of life has been, and still is, under debate. In a recent work, Martin and Embley diagrammed the main alternative topologies that have been proposed:
-the first alternative proposal came in 1998, when Mayr proposed the "two-empire" topology, separating Eukaryotes from Prokaryotes, and Eubacteria from Archaebacteria;
-in 1999, Doolittle proposed the topology of "three domains with continuous lateral gene transfer among domains";
-in 2004, Rivera and Lake proposed the "Ring of life" topology, incorporating "lateral gene transfer" but preserving the "Prokaryote-Eukaryote" divide.

5) The same year, 2004, Raoult et al. suggested introducing a new "Mimivirus" branch on the three-domain tree, and others proposed constructing the "Tree of life" from large sequence databases.

6) But already in 1998, just after the completion of the sequences of a few small genomes, Elizabeth Pennisi, in her commentary "Genome data shake tree of life", noted that: "New genome sequences are mystifying evolutionary biologists by revealing unexpected connections between microbes thought to have diverged hundreds of millions of years ago". Pennisi suggested that the "tree of life" should take into account all the information contained in the genomes.

7) Just after these comments, and almost at the same time, the first "Genome trees" appeared, based on the genome sequences available at that time. Here is the one based on the set of orthologs as deduced from species "gene content comparisons". This tree is consistent with the "three domains" tree.

8) And here is the one based on "whole proteome comparisons", which considers the degree of duplication and conservation between genomes. Here also the tree is consistent with the "three domains" tree (Eukarya, Archaea and Bacteria), with the notable difference that Archaea are closer to Bacteria than to Eukaryotes.
9) The significant number of available completely sequenced genomes, from the 3 phylogenetic domains and with different lifestyles, offers an unprecedented opportunity to explore species evolution and classification at the genome level. The abundance of genome data is raising expectations of accurately depicting the evolutionary history of all organisms. The expectation now is to construct species trees from many genes instead of only one gene, and possibly from all genes of the considered species. Note that up-to-date statistics on complete sequencing projects can be found at: http://www.genomesonline.org/gold.cgi

10) But there are serious difficulties. This schema, adapted from the book "Genomes" by Brown (second edition, 2002), shows the difficulty of constructing species trees from genes. Consider this evolutionary schema, with 2 "gene duplication" events, 2 "speciation" events, and random genetic drift resulting in one "blue" gene in species C, a "green" gene in species B and a "red" gene in species A. The topology of the gene tree would be the upper one, whereas the topology of the species tree would be the lower one. It is clear that these topologies are different! Gene trees are not the same as species trees. Be aware that a node is a mutation (duplication) event in a gene tree and a speciation event in a species tree. Estimating the species phylogeny is not easy!

11) "Before, people tended to equate rRNA trees with the [life history] tree of the organism": "From the whole genomes, one quickly comes across [genes] that don't agree with the rRNA tree."

12) To overcome these difficulties, alternative solutions have been proposed, including "integrative methods" and "whole genome phylogeny". The first proposed integrative method is the "supertree" method. See the following reference for details on the method: Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002). The (super)tree of life: procedures, problems, and prospects.
Annual Review of Ecology and Systematics, Vol. 33: 265-289. To solve the problem of sparseness, the authors built a "supertree". The supertree approach estimates phylogenies for subsets of data with good overlap, then combines these subtree estimates into a supertree. As shown in this figure:
-there are 2 trees in (a): only genes 3, 4 and 5 are common to both subtrees; 1 and 2 are present only in the first tree; 6 and 7 are present only in the second tree;
-a matrix is constructed, where rows correspond to the different elements and columns to the different nodes found in all subtrees. A supertree is constructed from this matrix.
Several questions remain, however, about this strategy. First, the supertree strategy depends fundamentally on our ability to distinguish between orthologous (derived from a speciation event) and paralogous (derived from a duplication event) gene sequences. Second, supertree approaches themselves are controversial, in part because the methodology results in a degree of disconnect between the underlying genetic data and the final tree produced. Moreover, this strategy has yet to be validated by computer simulation or well-established phylogenetic methods. Third, the supertree approach makes a fundamental assumption: that a bifurcating tree topology represents the genomic evolutionary history of species. This assumption has been called into question because of the reality of genetic exchange across species boundaries through mechanisms such as horizontal gene transfer and hybridization. Depicting genealogical relationships as networks might better represent the true underlying biology.

13) The second integrative method that has been proposed is the "phylogenomic tree". This method is based on the concatenation of a sample of genes common to the considered species. These genes are aligned, the alignments are concatenated, and traditional phylogeny construction methods are applied to construct a phylogenetic tree, called a "phylogenomic tree".
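The concatenation step described above can be sketched in a few lines of Python. This is only an illustration, assuming that each gene has already been aligned across species and is stored as a dictionary mapping species name to aligned sequence; the function name `concatenate_shared_genes`, the gene names and the toy sequences are all hypothetical and not part of the talk.

```python
def concatenate_shared_genes(alignments, species):
    """Concatenate the alignments of the genes present in every species.

    alignments: dict gene_name -> dict species_name -> aligned sequence
    species:    list of species names to include
    Returns (sorted list of shared gene names,
             dict species_name -> concatenated sequence).
    """
    # Keep only the genes for which every species has a sequence.
    shared = sorted(g for g, aln in alignments.items()
                    if all(s in aln for s in species))
    # Join the per-gene alignments, in the same gene order for every species.
    supermatrix = {s: "".join(alignments[g][s] for g in shared) for s in species}
    return shared, supermatrix


# Toy data (hypothetical): geneC is missing from species B, so it is dropped.
alignments = {
    "geneA": {"A": "ATG-CA", "B": "ATGGCA", "C": "ATGACA"},
    "geneB": {"A": "TTGA", "B": "TTGA", "C": "TCGA"},
    "geneC": {"A": "GGC", "C": "GGT"},
}
shared, supermatrix = concatenate_shared_genes(alignments, ["A", "B", "C"])
print(shared)            # ['geneA', 'geneB']
print(supermatrix["A"])  # ATG-CATTGA
```

The concatenated sequences would then be handed to a standard phylogeny program. Note how the filter on shared genes already discards geneC in this toy case: only genes present in all species contribute to the tree.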
The problem is that:
-genes do not evolve at the same rate nor in the same way;
-only a limited number of genes are shared among all species.
Also, concatenating the sequences of different genes hardly takes into account the specific evolutionary rate of each gene. Finally, building a consensus tree is strongly limited by the low number of genes shared among all organisms. In a recent review, Dagan and Martin called this type of tree "the tree of one percent": they estimate at about 1% the proportion of a genome's genes that is common to many species.

14) More generally, the previous methods suffer from difficulties inherent to phylogenetic tree construction, for example:
-the quality of global alignments of different sequences (particularly with regard to gaps);
-the different evolutionary histories of the considered genes;
-the estimation of substitution saturation;
and, more seriously, from gene sampling difficulties. To illustrate this problem in some detail, the following slides (adapted from C. Randal Linder, Bernard M.E. Moret, Luay Nakhleh and Tandy Warnow) show some examples of sampling difficulties.

15) These slides are adapted from: C. Randal Linder*, Bernard M.E. Moret†, Luay Nakhleh* and Tandy Warnow* (*University of Texas at Austin; †University of New Mexico). Through inadequate sampling on our part, or irretrievable loss of some gene lineages via extinction, we may unknowingly reconstruct incorrect species trees when attempting to reconstruct gene trees. When we use multiple gene markers for a phylogeny, this sampling problem can produce well-supported, incongruent phylogenetic trees that suggest speciations that may never have occurred. Upper figure: the paths outlined by the black lines delineate the phylogenetic history of species A, B and C, where B and C are sister to one another and their clade is sister to A. The colored lines inside the species tree represent the evolutionary history of a duplicated stretch of DNA.
Note that in this case the duplication event occurs before the origin of the ABC clade. The different gene lineages will be referred to as the red and the blue lineages. All of the red lineages are orthologs because they have evolved from a single common ancestor within the ABC clade. All of the blue lineages are also orthologs for the same reason. The blue and the red lineages are called paralogs because they descend from the two copies produced by the duplication and so represent different evolutionary histories in each species of the ABC clade (see later a note on paralogs and orthologs). Lower figure, left: note in this example that the red and blue paralogs are inherited by each lineage when a speciation event occurs, but that in lineages A and B the blue paralog is lost, and in lineage C the red paralog is lost. A number of biological reasons could lead to the loss of different copies of DNA: random drift, selection, ... The main point is that when we sample this gene we only have the red version of it in A and B and the blue version in C. If the gene lineages that we cannot detect are removed, it becomes obvious that the red versions of the gene found in species A and B share a more recent common ancestor with each other than they do with the blue lineage in species C. Lower figure, right: because of this, we end up reconstructing this set of relationships between species A, B and C, where A and B are sister. It isn't the correct species tree, but since we are unaware of our error in treating all of the sequenced genes as orthologs, we mistakenly construct this tree, perhaps with considerable bootstrap or posterior probability support.

16) Figure on the left: now consider a second hypothetical gene that also underwent a gene duplication event prior to the ABC clade. In this case, we are more fortunate. The red ortholog alone has been lost in all three species, and when we sample the extant species we only get the blue ortholog. Figure on the right: under these circumstances we reconstruct the true species tree.
17) Alternatively, all of the versions of the gene might be present in all of the species, but we might sample each species too poorly to detect all of the versions. The net result will be the same for our phylogeny. Figure on the left: with perfect knowledge and perfect sampling, both the red and blue copies are in A, B and C. Figure on the right: under these circumstances we obtain correct reconstructions, with the correct topology, for both sets of orthologs.

18) Another alternative for constructing species trees is to take into account the whole gene content of each completely sequenced organism. We then need an overall gene content similarity score between pairs of species. This construction can of course be applied only to completely sequenced genomes.

19) To construct the species tree we use the following methodology, based on a data table in which both the observations and the variables are the species; at the intersection of line j and column i, kij is a positive number. Multivariate analysis methods are particularly helpful for the discovery of fundamental evolutionary trends associated with the multidimensional structure of genomic data, as obtained from large-scale genome comparisons. The methodology can be summarized by the following procedure:
1) A data table is constructed to describe n observations relative to p variables (in our case p = n = the number of considered species). We assume that kij, at the intersection of line j and column i, is a positive number representing a score of similarity between j and i. This score should be normalized in order to ensure the additivity of the scores on a given line or column.
2) We use correspondence analysis to construct an orthogonal system in which we represent the n observations and the p variables. Note that the above data table is suitable for correspondence analysis, since the scores on a given line can be summed, as can the scores on a given column.
The original scores can then be normalized by the total score of their respective line or column. A generally interesting property of correspondence analysis is that both sets can be represented in the same factorial space. So we can look for relationships between observations, between variables, and between observations and variables. Similarity is represented by the proximity between points. Another interesting property of this procedure is that clustering of the observations can be performed using Euclidean distances between observations, calculated from their coordinates in the orthogonal system (factorial axes). It is this procedure that will be used in constructing genome trees corresponding to different data tables or matrices. In what follows we will consider a data table constructed from conservation profiles, but other examples of data tables (duplication/conservation or orthologs) have been constructed and analysed (see Tekaia, Lazcano and Dujon (1999), Tekaia and Yeramian (2005) or Tekaia and Yeramian (2002, 2006)).
3) A hierarchical method is used to cluster species according to their Euclidean distances as calculated in the factorial space.

20) To construct a data table that encompasses all pairwise species similarity scores, we surveyed 99 predicted proteomes (including 33 bacterial, 19 archaeal and 27 eukaryal species) and performed species-specific comparisons: the protein sequences of each proteome are compared to a database that includes all proteins of another species, using the blastp program with the PAM250 substitution matrix and the SEG filter. A total of 541880 proteins have been compared in this way. For a description of the methodology behind this procedure see: Tekaia & Dujon (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. J Mol Evol. 49(5):591-600.
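The three-step methodology of slide 19 (similarity table, correspondence analysis, hierarchical clustering on Euclidean distances in the factorial space) can be sketched as follows. This is a minimal illustration and not the code used for the talk: it assumes NumPy and SciPy are available, keeps all non-trivial factorial axes, and uses average linkage, which is an assumption since the talk does not name the hierarchical method. The function name `correspondence_coordinates` and the 4-species toy table are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def correspondence_coordinates(K):
    """Row principal coordinates from a correspondence analysis of table K."""
    K = np.asarray(K, dtype=float)
    P = K / K.sum()                     # table normalized by its grand total
    r = P.sum(axis=1)                   # row masses (line totals)
    c = P.sum(axis=0)                   # column masses (column totals)
    # Standardized residuals with respect to the independence model r * c
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                   # drop axes with (numerically) zero inertia
    # Coordinates of the observations on the retained factorial axes
    return (U[:, keep] * sv[keep]) / np.sqrt(r)[:, None]


# Toy similarity table for 4 hypothetical species forming two obvious pairs.
K = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])
F = correspondence_coordinates(K)

# Hierarchical clustering on Euclidean distances between factorial coordinates.
# Average linkage is an assumption made for this sketch.
tree = linkage(F, method="average", metric="euclidean")
groups = fcluster(tree, t=2, criterion="maxclust")
print(groups)  # species 0 and 1 share one label, species 2 and 3 the other
```

With all non-trivial axes retained, the Euclidean distances between rows of `F` are the chi-square distances between the line profiles of the table, which is what makes the subsequent clustering consistent with the correspondence analysis.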
21) These comparisons are performed with our local system called "Systematic Analysis of Completely Sequenced Organisms". Among other things, it allows:
-the calculation of the degree of "ancestral duplication" in each species and of the degree of "ancestral conservation" between pairs of species;
-the construction, from these comparisons, of families of paralogs and families of orthologs, based on protein reciprocal best hits and using the Partition and MCL algorithms (http://micans.org/mcl/);
-the determination of what we call the gene dictionary for each species (i.e. the orthologs of each gene);
-the determination of protein conservation profiles.

22) This is a note on "Homologs, Paralogs and Orthologs" and how to determine them at the sequence level. This figure shows an ancestral gene A and a duplication event that turns A into 2 identical genes, A and B; with evolution, A and B are no longer identical but remain similar. After the speciation event, species_1 has copies A1 and B1 of A and B, and species_2 has copies A2 and B2. In this scheme:
-A1, B1, A2, B2 are homologs;
-A1 vs B1 and A2 vs B2 are paralogs (they are in the same species and have the same ancestor);
-A1 vs A2 and B1 vs B2 are orthologs (they are in different species and have the same ancestor).
In our comparisons we considered as orthologs proteins that are reciprocally the best hit of each other. This is now considered a rough method. More accurate methods, taking into account possible synteny between genes or chromosomal segments, are now used in the determination of orthologous genes (see for example: http://sonnhammer.cgb.ki.se).

23) But when considering the whole gene content of species, we are faced with many evolutionary processes, discovered mainly through large-scale genome comparison studies. Genomes are not static collections of DNA materials.
Various biochemical and cellular processes (including point mutation, recombination, gene conversion, replication slippage, DNA repair, translocation and horizontal transfer) constantly act on genomes and drive them to evolve dynamically. Evolutionary processes include:
-Phylogeny (what a given species inherited from its ancestor);
-Expansion (including duplication and genesis);
-Exchange (horizontal gene transfer);
-Deletion (loss of genes);
-Recombination and natural selection.
In principle, in species tree construction we are interested solely in "Phylogeny"; the other evolutionary processes should be considered as noise, to be eliminated or at least reduced.

24) To overcome some of the difficulties in eliminating noisy evolutionary processes, we construct genome trees from "conservation profiles", in an attempt to reduce part of this noise.

25) Conservation profiles. Species-specific proteome comparisons allowed us to determine the conservation profiles of all 541880 proteins included in the 99 species. A conservation profile is an n-component binary vector describing a protein's conservation pattern across n species. Components are 0 or 1, indicating the absence or presence of homologs in the corresponding species. The conservation profiles can be considered as "signatures of evolutionary relationships" of the corresponding proteins. A conservation profile may be considered as the "trace of a protein evolutionary history jointly captured in a set of species". This is an interesting multidimensional evolutionary feature of the "conservation profile".

26) Here are some examples of conservation profiles:
-specific to a given species (the first, black one): a "1" is indicated at the first position and 0 elsewhere;
-common to many species (the red one): unique in 2 species and duplicated in a third one;
-a duplicated conservation profile (in only one species) (purple example);
-etc.
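The definition of a conservation profile can be made concrete with a short Python sketch. It assumes that, for each protein, the set of species in which a homolog was detected is already known (in the talk this comes from the proteome comparisons); the function name `conservation_profile`, the species labels and the protein names below are hypothetical.

```python
def conservation_profile(species_with_homolog, species_order):
    """Binary presence/absence vector of a protein across species_order."""
    return tuple(1 if s in species_with_homolog else 0 for s in species_order)


species_order = ["sp1", "sp2", "sp3", "sp4"]  # hypothetical species

# Hypothetical proteins of sp1, each with the set of species in which a
# homolog is found (a protein is always present in its own species).
hits = {
    "p1": {"sp1"},                       # specific to sp1
    "p2": {"sp1", "sp2", "sp4"},
    "p3": {"sp1", "sp2", "sp4"},         # same profile as p2 (e.g. a paralog)
    "p4": {"sp1", "sp2", "sp3", "sp4"},  # conserved in all species
}
profiles = {p: conservation_profile(s, species_order) for p, s in hits.items()}

# Drop species-specific profiles and keep one copy of each distinct profile.
distinct = sorted({v for v in profiles.values() if sum(v) > 1})
print(profiles["p1"])  # (1, 0, 0, 0)
print(distinct)        # [(1, 1, 0, 1), (1, 1, 1, 1)]
```

Storing each profile as a tuple makes it hashable, so collapsing identical profiles reduces to building a set: p2 and p3 contribute a single profile, which is how duplicated proteins stop being counted more than once.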
If we assume that different conservation profiles represent different evolutionary histories, then determining the set of distinct conservation profiles gives an approximation of the possible different evolutionary histories.

27) This slide shows the successive reduction steps that lead to the set of distinct conservation profiles. About 82% of the original set of proteins are non-specific proteins, i.e. have non-trivial conservation profiles. Of these, only 184130 (i.e. 42%) correspond to distinct conservation profiles. We considered only one representative from each set of identical conservation profiles. In this set, the effect of the duplication process is significantly reduced. Most interestingly, this set should correspond to the various evolutionary histories embedded in the original set of proteins.

28) This figure shows the distribution of the number of conservation profiles as a function of the number of "1"s (presences) in the conservation profile. On the x-axis we have the different classes from 1 to 99 (c01, ..., c99), corresponding to the number of distinct species in which a homolog may be found. On the y-axis we have the number of conservation profiles. The figure shows a steady decrease in the number of conservation profiles as the number of "1"s (presence in distinct species) increases. The maximum number of profiles corresponds to profiles including 5 or 6 "1"s.

29) Similarity between a pair of species is calculated using the set of 184130 distinct conservation profiles. In order to take into account the whole ancestral information included in this set, Jaccard similarity scores were calculated between all pairs of species: sij = N11/(N11+N01+N10), where N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between species i and j. Note that each score is normalized and varies between 0 and 1.
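The Jaccard score defined above, sij = N11/(N11+N01+N10), can be computed directly from a list of binary conservation profiles. The following Python sketch is an illustration on toy data; the function names `jaccard` and `jaccard_matrix` are hypothetical, and returning 0.0 when two species share no profile at all is a convention chosen for this sketch.

```python
def jaccard(profiles, i, j):
    """Jaccard similarity between species i and j over a list of profiles."""
    n11 = n10 = n01 = 0
    for p in profiles:
        if p[i] and p[j]:
            n11 += 1        # homolog present in both species
        elif p[i]:
            n10 += 1        # present in i only
        elif p[j]:
            n01 += 1        # present in j only
    denom = n11 + n10 + n01
    # Assumption for this sketch: score 0.0 if neither species has any "1".
    return n11 / denom if denom else 0.0


def jaccard_matrix(profiles, n_species):
    """Symmetric n_species x n_species table of pairwise Jaccard scores."""
    return [[jaccard(profiles, i, j) for j in range(n_species)]
            for i in range(n_species)]


# Four toy profiles over 3 species (columns are species 0, 1, 2).
profiles = [(1, 1, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1)]
S = jaccard_matrix(profiles, 3)
print(S[0][1])  # 1.0   (species 0 and 1 always co-occur)
print(S[0][2])  # 0.25  (N11 = 1, N10 = 2, N01 = 1)
```

Profiles in which both species are absent, the (0,0) case, are ignored by construction, so the score only reflects profiles in which at least one of the two species takes part.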
30) The Jaccard scores between all pairs of species are arranged in a data matrix. This data table (similarity matrix) is submitted to correspondence analysis, and clustering of species is performed using Euclidean distances as calculated from their factorial coordinates (see the Methodology slide).

31) We applied the methodology presented earlier in the talk and obtained the "profiles tree" shown here. A detailed discussion of this tree can be found in our paper: Tekaia F, Yeramian E. (2005). PLoS Comput Biol. 1(7):e75. The main clusters are:
-three phylogenetic groups: Eukarya (E), Archaea (A) and Bacteria (B);
-as shown by the colors of the different branches, most of the traditional clusterings are preserved.
Some clusterings are not in accordance with the traditional ones, for example:
-S. coelicolor, which is separated from the other Actinobacteria;
-T. tengcongensis, which is separated from the other Firmicutes.

32) Conclusions
-Species classification is not an easy task.
-Species tree construction should take into account the whole information included in their sequences.
-We still need methods that construct such trees.
-In the case of conservation profiles (multidimensional features of evolutionary histories), correspondence analysis might be helpful in revealing significant trends embedded in the multidimensional relationships obtained from large-scale genome comparisons.

33) -Conservation profiles represent the most conserved and meaningful evolutionary signals jointly captured in a set of species;
-they should correspond to the most accurate type of markers for species classification;
-in principle, the "profiles tree" should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals.
In considering distinct conservation profiles, variations in the size of protein families should not influence the tree building procedure (one conservation profile per set of proteins with identical conservation profiles), which significantly reduces the sensitivity to gene acquisition processes.
-The "profiles tree" presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering.
-It corresponds to the classification of the evolutionary scenarios that are embedded in the considered species.

34) Acknowledgments. The support of:
• The Institut Pasteur (Strategic Horizontal Programme on Anopheles gambiae)
• The Ministère de la Recherche Scientifique (France): ACI-IMPBIO-2004-98-GENEPHYS program.
• Bernard Dujon (Institut Pasteur) for constant support.

35) References
• Tekaia, F. and Dujon, B. (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution, 49:591-600.
• Tekaia, F., Lazcano, A. and Dujon, B. (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 12:17-25.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297:51-60.
• Tekaia, F. and Yeramian, E. (2005). Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7):e75.
• Tekaia, F. and Yeramian, E. (2006). Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics 7:307.
• Tekaia, F. and Latgé, J.P. (2005). Aspergillus fumigatus: saprophyte or pathogen? Curr Opin Microbiol. 8:385-92. Review.
• Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/~tekaia/sacso.html

36) References
• Bininda-Emonds, O.R.P. (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395:745-757.
• Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002).
The (super)tree of life: procedures, problems, and prospects. Annual Review of Ecology and Systematics, Vol. 33: 265-289.
• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology 7:118.
• Delsuc, F., Brinkmann, H. and Philippe, H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361-75. Review.
• Doolittle (1999). Science 284:2124-8.
• Driskell et al. (2004). Science 306:1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Crandall, K.A. and Buhay, J.E. (2004). Science 306:1144-1145.
• Linder, Moret, Nakhleh and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431:152-5.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi, E. (1998). Genome data shake tree of life. Science 280:672-4.
• Rivera & Lake (2004). Nature 431:152-5.
• Raoult et al. (2004). Science 306:1344-1350.
• Snel, B., Bork, P. and Huynen, M.A. (1999). Genome phylogeny based on gene content. Nature Genetics 21:108-110.
• Snel, B., Huynen, M.A. and Dutilh, B.E. (2005). Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59:191-209. Review.
• Woese et al. (1990). PNAS 87:4576-4579.