Exploring Evolutionary Trends in Proteomes

Fredj Tekaia, Edouard Yeramian
Institut Pasteur
tekaia@pasteur.fr

[Figure: species map distinguishing hyperthermophiles, thermophiles, psychrophiles, prokaryotic mesophiles and eukaryotes]

Complete genomes (http://www.genomesonline.org/)
[Figure: tree of life, with the counts 433, 36 and 46 on its branches]
2434 projects:
• 520 published (01-03-07)
• 1086 Bacteria
• 59 Archaea
• 696 eukaryotes
• 73 metagenomes

• 3 phylogenetic domains;
• Lifestyles: mesophiles, (hyper)thermophiles, psychrophiles, extreme conditions, ...
• Data-driven exploratory analyses, as opposed to model-driven methods.
• In the post-genomic era, multidimensional data resulting from large-scale genome comparisons are available.
• Multivariate analysis methods are particularly helpful for discovering the evolutionary trends associated with such data.

Methodology
[Figure: data table T (n rows i, p columns j, entries kij > 0, plus supplementary rows "sup") analysed by Correspondence Analysis and displayed on the factorial axes F1 ... Fp]
Supplementary elements are placed on the axes with the transition formula:
$$F_\alpha(i_s) = \lambda_\alpha^{-1/2} \sum_{j=1}^{p} f^{i_s}_j \, G_\alpha(j)$$
where $\lambda_\alpha$ is the eigenvalue of axis $\alpha$, $f^{i_s}_j$ the profile of the supplementary row $i_s$ and $G_\alpha(j)$ the coordinate of column $j$.

Correspondence Analysis is followed by Classification:
• orthogonal system;
• use of Euclidean distance.

1. Evolution of Proteomes: Signatures and Trends in Amino Acid Compositions
2. Genome Trees from Whole Proteome Comparisons

Evolution of Proteomes: Signatures and Trends in Amino Acid Compositions

Mining the wealth of information contained in complete genomes, to decipher the genomic characteristics linked to the adaptive evolution of organisms in extreme conditions such as high or low temperatures, has long been a matter of interest:
• Kreil DP, Ouzounis CA (2001). Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29: 1608-15.
• Tekaia F, Yeramian E, Dujon B (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis.
Gene, 297: 51-60.
• Suhre K, Claverie JM (2003). Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278: 17198-202.
• Hickey DA, Singer GA (2004). Genomic and proteomic adaptations to growth at high temperature. Genome Biol., 5: 117.
• Brocchieri L (2004). Environmental signatures in proteome properties. Proc. Natl. Acad. Sci. USA, 101: 8257-8.
• Cavicchioli R (2006). Cold-adapted archaea. Nat. Rev. Microbiol., 4: 331-43.
• Lobry JR, Necsulea A (2006). Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes. Gene, 385: 128-36.
• Zeldovich KB, Berezovsky IN, Shakhnovich EI (2007). Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3: e5.

The large number of completely sequenced genomes from organisms with different lifestyles offers an unprecedented opportunity to explore species evolution. One of the simplest analyses is the amino acid composition of proteomes.
• Which universal properties can be deduced from the amino acid compositions of proteomes?
• Are there specific properties associated with lifestyles and with phylogeny?
• What are the underlying evolutionary trends?

Outline
• Methodology;
• Species considered and data analysed;
• Species and amino acid distributions;
• Amino acid distribution and comparison with theoretical and experimental model chronologies of amino acid recruitment into the genetic code;
• Example: application to predicting candidate thermostable proteins in Aspergillus fumigatus.

Methodology
[Figure: data table T (n rows i, p columns j, entries kij > 0, plus supplementary rows "sup") analysed by Correspondence Analysis and displayed on the factorial axes F1 ... Fp]
Supplementary elements are placed on the axes with the transition formula:
$$F_\alpha(i_s) = \lambda_\alpha^{-1/2} \sum_{j=1}^{p} f^{i_s}_j \, G_\alpha(j)$$

Previous work showed:
[Figure: correspondence analysis map of 54 species, with directions labelled growth t° and GC% and the groups hyperthermophiles, thermophiles and mesophiles]
Tekaia, F., Yeramian, E. and Dujon, B. 2002. Gene 297: 51-60.
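The correspondence analysis step and the transition formula for supplementary rows can be illustrated with a small pure-Python sketch. It is not the software used for the slides: it extracts only the first factorial axis, by power iteration, and the function names `correspondence_axis` and `project_supplementary` are hypothetical. It assumes a table with no empty row or column and rows that are not exactly proportional.

```python
import math

def correspondence_axis(table, iters=200):
    """First factorial axis of a correspondence analysis (illustrative sketch).

    table: list of rows of non-negative counts k_ij (active elements only).
    Returns (eigenvalue, row coordinates F, column coordinates G).
    """
    n = float(sum(map(sum, table)))
    r = [sum(row) / n for row in table]            # row masses
    c = [sum(col) / n for col in zip(*table)]      # column masses
    rows, cols = range(len(r)), range(len(c))
    # Standardized residuals S_ij = (p_ij - r_i c_j) / sqrt(r_i c_j)
    S = [[(table[i][j] / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in cols] for i in rows]
    # Power iteration on S^T S converges to the leading right singular vector
    v = [1.0 / (j + 1) for j in cols]              # arbitrary start vector
    for _ in range(iters):
        u = [sum(S[i][j] * v[j] for j in cols) for i in rows]
        w = [sum(S[i][j] * u[i] for i in rows) for j in cols]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(S[i][j] * v[j] for j in cols) for i in rows]   # S v
    lam = sum(x * x for x in u)                    # eigenvalue of axis 1
    F = [u[i] / math.sqrt(r[i]) for i in rows]     # row principal coordinates
    G = [math.sqrt(lam) * v[j] / math.sqrt(c[j]) for j in cols]
    return lam, F, G

def project_supplementary(row, lam, G):
    """Transition formula: F(i_s) = lam^(-1/2) * sum_j f_j^(i_s) * G(j)."""
    total = float(sum(row))
    return sum((k / total) * g for k, g in zip(row, G)) / math.sqrt(lam)

# Hypothetical 3 x 3 count table, for illustration only
toy = [[20, 5, 3], [4, 18, 6], [2, 7, 25]]
lam, F, G = correspondence_axis(toy)
sup_coord = project_supplementary([10, 10, 10], lam, G)
```

Projecting an active row with `project_supplementary` returns its own coordinate in `F`, which is why supplementary rows such as group means can be overlaid on the map without influencing the axes.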
Amino acid composition of 208 proteomes (OGT: optimal growth temperature), including:
• 20 hyperthermophiles (HTH) (OGT >60°C up to 120°C),
• 7 thermophiles (TH) (OGT >50°C up to 60°C),
• 8 psychrophiles (PSYC) (OGT: -10°C up to 15°C),
• 173 mesophiles (BMES), including 53 eukaryotes (EUK).
Data table: 222 rows (208 + 14 sup) vs 23 columns (20 aa + pol, char, hyd)
Sources: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj + specific sites

Amino acid composition (%), first organisms of the 208 (organisms in columns; char, pol, hyd: charged, polar, hydrophobic; PC: pol-char):

       sc    sp    ncu   ca    mgr   fg    an    ecun
A      5.5   6.3   8.7   4.9   9.4   8.2   8.6   5.0
R      4.4   4.8   6.2   3.7   6.6   5.8   6.2   6.7
N      6.1   5.2   3.7   6.7   3.5   3.9   3.7   3.9
D      5.8   5.3   5.6   5.7   5.7   5.9   5.6   5.5
C      1.3   1.5   1.1   1.2   1.3   1.3   1.2   2.0
Q      3.9   3.8   4.3   4.4   4.1   4.0   4.0   2.3
E      6.6   6.5   6.5   6.2   5.9   6.2   6.2   8.1
G      5.0   5.0   7.2   5.0   7.4   6.7   6.8   6.5
H      2.1   2.3   2.5   2.1   2.3   2.4   2.4   1.9
I      6.6   6.1   4.4   7.1   4.4   5.1   5.0   6.7
L      9.6   9.8   8.4   9.3   8.5   8.7   9.2   9.5
K      7.4   6.4   5.1   7.2   4.8   5.1   4.6   7.1
M      2.1   2.1   2.2   1.9   2.2   2.3   2.0   3.0
F      4.5   4.6   3.4   4.5   3.5   3.8   3.7   4.8
P      4.3   4.7   6.5   4.5   6.3   5.9   6.0   3.4
S      9.0   9.4   8.3   9.3   8.0   8.1   8.4   8.0
T      5.8   5.6   6.1   6.2   5.9   6.1   6.0   4.1
W      1.0   1.1   1.4   1.0   1.5   1.5   1.5   0.8
Y      3.3   3.4   2.6   3.5   2.5   2.8   2.9   3.6
V      5.6   6.0   6.0   5.5   6.2   6.1   6.1   7.0
char   26.3  25.3  25.8  25.   25.3  25.4  24.9  29.3
pol    34.4  33.9  33.3  36.2  32.7  32.9  32.9  30.4
hyd    39.1  40.7  40.8  38.7  42.0  41.6  42.0  40.2
PC     8.1   8.6   7.5   11.2  7.4   7.5   8     1.1
...............

Mean composition (%) of the 13 group rows:

       HTH   TH    PSYC  BMES  EUK   SPEC  A     B     E     EA    EB    AB    EAB
A      7.4   9.0   8.4   8.6   6.9   7.6   6.7   9.4   6.9   6.8   7.4   8.6   8.1
R      5.8   6.3   4.6   5.1   5.4   6.1   5.4   5.8   5.7   5.7   5.5   5.3   5.4
N      3.5   3.6   4.3   4.4   4.9   4.8   4.8   4.1   4.4   4.5   4.3   3.9   4.0
D      4.7   5.3   5.7   5.4   5.4   5.1   5.4   5.4   5.3   5.5   5.5   5.0   5.4
C      0.8   0.8   1.1   1.0   1.7   1.8   1.2   1.0   2.0   1.8   1.5   1.0   1.3
Q      2.0   3.1   4.0   3.8   4.2   4.0   2.6   4.1   4.6   4.1   4.0   3.3   3.8
E      8.3   6.4   6.3   6.3   6.6   6.3   7.8   6.0   6.6   6.8   6.3   6.3   6.6
G      7.4   7.5   6.9   7.0   6.0   6.1   6.3   7.3   6.0   5.8   6.7   7.1   7.0
H      1.6   1.9   2.2   2.1   2.4   2.5   1.8   2.1   2.6   2.4   2.5   1.9   2.2
I      7.4   7.0   7.2   6.9   5.6   4.9   7.3   5.6   4.8   5.7   5.4   7.0   6.1
L      10.6  9.9   9.9   10.2  9.3   8.8   9.6   10.1  9.1   9.6   9.5   10.7  9.9
K      7.0   4.7   5.5   5.8   6.1   5.7   6.9   5.0   5.8   6.5   5.4   5.5   5.7
M      2.2   2.6   2.7   2.3   2.2   2.2   2.3   2.2   2.2   2.3   2.2   2.4   2.4
F      4.2   4.0   4.1   4.3   4.0   3.6   4.1   3.9   3.8   4.0   4.1   4.5   4.1
P      4.5   4.7   3.9   4.1   5.2   5.7   4.0   4.7   5.8   4.6   5.4   4.2   4.6
S      5.2   6.1   6.5   6.2   8.4   8.8   6.7   6.6   8.7   7.6   7.7   6.2   6.9
T      4.4   5.1   5.8   5.4   5.6   5.8   5.0   5.5   5.7   5.5   5.6   5.1   5.5
W      1.1   1.2   1.1   1.1   1.2   1.2   1.1   1.4   1.2   1.1   1.3   1.3   1.2
Y      3.9   3.6   3.2   3.2   3.1   2.9   3.9   3.0   2.9   3.2   3.1   3.3   3.0
V      8.0   7.4   6.9   6.9   6.0   6.0   7.1   6.8   5.9   6.5   6.5   7.3   7.0
char   27.4  24.6  24.2  24.6  25.9  25.8  27.2  24.3  26.0  26.8  25.1  24.0  25.2
pol    27.0  29.7  31.8  30.9  33.8  34.2  30.5  31.5  34.2  32.4  33.   29.8  31.4
hyd    45.4  45.6  44.0  44.4  40.2  39.9  42.2  44.0  39.7  40.7  41.8  46.0  43.3
PC     -0.4  5.2   7.6   6.3   7.9   8.4   3.3   7.2   8.2   5.6   7.9   5.8   6.2

Correspondence Analysis was used to explore relationships between species and amino acids.

Species-specific comparisons
[Diagram: a new proteome (NP) is compared in both directions with each proteome P1 ... Pn, using blastp, PAM250 and the SEG filter.]
• P1 vs NP: bestp1np, allp1np, segmatchp1np
• NP vs P1: bestnpp1, allnpp1, segmatchnpp1
• Pn vs NP: bestpnnp, allpnnp, segmatchpnnp
• NP vs Pn: bestnppn, allnppn, segmatchnppn
Records of the bestnppi and allnppi files: np1, size, pij, e-value, HS/IS/NS
The comparisons yield:
• Paralogs
• Orthologs
The expected number of HSPs with score at least S is given by $E = K\,m\,n\,e^{-\lambda S}$ (K and $\lambda$ are parameters set by the scoring system).
m and n are the sequence and database lengths.

[Figure: correspondence analysis map of the 208 species, with directions labelled growth t° and GC%. Groups: hyperthermophiles, thermophiles, psychrophiles, prokaryotic mesophiles, eukaryotes. Labelled species (GC% where given): Thermosynechococcus elongatus; Encephalitozoon cuniculi; Methanococcus jannaschii: 31%; Pyrococcus abyssi: 44%; Thermus thermophilus: 69%; Methanopyrus kandleri: 61%; Nocardia farcinica: 70%; Mycoplasma mycoides: 23%; Colwellia psychrerythraea; Streptomyces coelicolor: 72%; Pseudoalteromonas haloplanktis; Entamoeba histolytica (Protists); Cryptosporidium hominis; Cyanidioschyzon merolae; Leishmania major: 60%; Saccharomyces; Candida glabrata; Homo sapiens; Tetrahymena thermophila (Protists); Mus musculus; Rat; A. nidulans; Aspergillus fumigatus: 50%; A. oryzae; C. neoformans.]

Statistical characterization of the observed groups
Mean amino acid compositions of the 3 groups were compared using:
• one-way analysis of variance;
• the Newman-Keuls multiple comparison test, to detect significant differences at the probability level p<0.001.

[Figure: bar chart of mean aa composition in (hyper)thermophiles, prokaryotic mesophiles-psychrophiles and eukaryotes (*: sig. different at p<0.001). Amino acids along the axis: V (Val), Y (Tyr), E (Glu), G (Gly), I (Ile), L (Leu), A (Ala), H (His), S (Ser), Q (Gln), T (Thr), C (Cys), D (Asp), P (Pro), N (Asn), R (Arg), M (Met), K (Lys), F (Phe), W (Trp).]

[Figure: bar chart of aa physico-chemical properties in (hyper)thermophiles, prokaryotic mesophiles-psychrophiles and eukaryotes (*: sig. different at p<0.001). Properties: hyd, pol, pol-char, char.]

Amino acid signatures (p<0.001)
[Table: signatures for the groups HTH-TH, BMES-PSYC and EUK. Entries listed: V (Val), H (His), S (Ser), pol, pol-char, Y (Tyr), E (Glu), Q (Gln), T (Thr), D (Asp).]
• R (Arg), M (Met), F (Phe), K (Lys), N (Asn) and W (Trp) show no significant difference (at p<0.001).
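The group comparison rests on a one-way analysis of variance, applied to one amino acid at a time. A minimal sketch of the F statistic follows, assuming each group is simply a list of per-species percentages. The Val values below are hypothetical and are not the slide data, and the Newman-Keuls post-hoc test and the p-value lookup are deliberately left out.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way analysis of variance (illustrative sketch).

    groups: list of lists, one list of observations per group.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical Val percentages per species, for illustration only
val_percent = {
    "HTH-TH": [8.0, 7.8, 8.3],
    "BMES-PSYC": [6.9, 7.1, 6.8],
    "EUK": [6.0, 6.1, 5.9],
}
f_val = one_way_anova_f(list(val_percent.values()))
```

A large F relative to the F distribution with (groups - 1, observations - groups) degrees of freedom indicates that at least one group mean differs; the multiple comparison test then identifies which ones.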
[Amino acid signatures table, further entries: V (Val), H (His), S (Ser), pol, pol-char, G (Gly), I (Ile), L (Leu), C (Cys), hyd.]

Species evolutionary trends
[Figure: group points A, B, E, AB, EA, EB, EAB and SPEC on the growth t° / GC% map, with two trends T1 and T2. Labels: "Ancient", "Recent", [high_temperature]-[high_GC], [moderate_temperature]-[low_GC].]

Comparison with model chronologies of amino acid recruitment into the genetic code
• Comparison of the amino acid distribution with recent models:
  • Jordan et al. Nature 433: 633-638 (2005)
  • Trifonov, J. Biomol. Struct. & Dyn. 22: 1-11 (2004)
• and with ancient amino acids:
  • Miller's experiments: Science 117: 528-529 (1953)
  • Analysis of the Murchison meteorite (1983)

Model of Jordan et al. 2005: A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633-8.
• They analysed 15 sets of three-way alignments of orthologous proteins encoded by triplets of closely related genomes from 15 taxa representing all three domains of life (Bacteria, Archaea and Eukaryota), and used phylogenies to polarize amino acid substitutions.
• All amino acids with declining frequencies are thought to be among the first incorporated into the genetic code;
• conversely, all amino acids with increasing frequencies, except Ser, were probably recruited late.
Based on the observed frequencies, they subdivided the amino acids into:
• 4 strong "losers": Pro, Ala, Glu and Gly (decline in at least 13 taxa/15), "thought to be among the first incorporated into the genetic code", i.e. the most ancient aa.
• 5 strong "gainers": Cys, Met, His, Ser and Phe (accrue in 14/15 taxa), which "were probably recruited late", i.e. the most recent aa.
• 1 "weak loser": Lys (lost in 10 taxa/15).
• 4 "weak gainers": Asn, Thr, Ile (accrue in 11 taxa/15) and Val (accrues slowly in all taxa).
• In contrast, the remaining six amino acids (Asp, Arg, Gln, Trp, Leu and Tyr) evolve more erratically.
Jordan et al. 2005.
[Figure: amino acids on the growth t° / GC% map with trends T1 and T2. Labelled amino acids: Ile, Val, Gly, Glu, Phe, Asn, Met, Thr, Pro, Ser, Ala, His, Cys. Annotations: "strong losers" in T1, most ancient aa; "weak gainers"; "strong gainer" in T2 (Cys), recruited late to the genetic code.]
Jordan et al., Nature 433, 633 (2005). A universal trend of aa gain and loss in protein evolution.

Model of Trifonov, E.N. 2004. The triplet code from first principles. J. Biomol. Struct. & Dyn. 22: 1-11.
• A consensus chronology of amino acids is built on the basis of 60 different criteria, each offering a certain temporal order.
• The chronology results in the consensus order: G1 (Gly), A2 (Ala), D3 (Asp), V4 (Val), P5 (Pro), S6 (Ser), E7 (Glu), (L8 (Leu), T8 (Thr)), R10 (Arg), (I11 (Ile), Q11 (Gln), N11 (Asn)), H14 (His), K15 (Lys), C16 (Cys), F17 (Phe), Y18 (Tyr), M19 (Met), W20 (Trp).

[Figure: the same map with trends T1 and T2, each amino acid labelled with its rank in the consensus chronology: Ile11, Tyr18, Lys15, Asn11, Phe17, Val4, Gly1, Glu7, Met19, Leu8, Asp3, Thr8, Arg10, Trp20, Pro5, Ser6, His14, Gln11, Ala2, Cys16.]
Trifonov, E.N. (2004). The triplet code from first principles. J. Biomol. Struct. & Dyn. 22: 1-11.

Comparison with ancient amino acids
Miller/Urey Experiment: 1953
• By the 1950s, scientists were in hot pursuit of the origin of life. The scientific community was examining what kind of environment would be needed to allow life to begin.
• In 1953, Miller took molecules believed to represent the major components of the early Earth's atmosphere and put them into a closed system.
• Miller's experiment showed that organic compounds such as amino acids, which are essential to cellular life, could be made easily under the conditions that scientists believed to be present on the early Earth.
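The two model chronologies presented above can be set side by side with a few lines of code. The sketch below is an illustration rather than part of the original analysis: it stores the Trifonov consensus ranks and the Jordan et al. classes exactly as listed in the slides, then averages the ranks within each class. The variable names are hypothetical.

```python
from statistics import mean

# Consensus ranks from Trifonov (2004), as listed in the slides
TRIFONOV_RANK = {
    "Gly": 1, "Ala": 2, "Asp": 3, "Val": 4, "Pro": 5, "Ser": 6, "Glu": 7,
    "Leu": 8, "Thr": 8, "Arg": 10, "Ile": 11, "Gln": 11, "Asn": 11,
    "His": 14, "Lys": 15, "Cys": 16, "Phe": 17, "Tyr": 18, "Met": 19, "Trp": 20,
}

# Classes from Jordan et al. (2005), as listed in the slides
JORDAN_CLASSES = {
    "strong losers": ["Pro", "Ala", "Glu", "Gly"],
    "strong gainers": ["Cys", "Met", "His", "Ser", "Phe"],
    "weak loser": ["Lys"],
    "weak gainers": ["Asn", "Thr", "Ile", "Val"],
}

# Mean Trifonov rank per Jordan class: a low value means "early" amino acids
mean_rank = {
    name: mean(TRIFONOV_RANK[aa] for aa in members)
    for name, members in JORDAN_CLASSES.items()
}

for name, value in mean_rank.items():
    print(f"{name}: mean consensus rank {value:.2f}")
```

The strong losers average a rank of 3.75 and the strong gainers 14.4, so on these two lists the models agree that losers tend to be early and gainers late, with Ser (rank 6) the visible exception among the gainers.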
[Figure: the growth t° / GC% map with trends T1 and T2, where amino acids obtained in Miller's experiment are marked with + signs: Ile, Val, Gly, Leu, Asp, Glu, Thr, Ser, Ala, Pro (Gly and Ala carry +++).]
Miller, S.L. Science 117, 528-529 (1953). Production of aa under possible primitive earth conditions.

Murchison meteorite 09-28-1969
The Murchison meteorite fall occurred on September 28, 1969 over Murchison, Australia. Over 100 kilograms of this meteorite have been found. The meteorite is of possible cometary origin, given its high water content of 12%. The abundance of amino acids found within it has led to intense study by researchers as to its origins. More than 92 different amino acids have been identified within the Murchison meteorite to date. Nineteen of these are found on Earth. The remaining amino acids have no apparent terrestrial source.

[Figure: the growth t° / GC% map with trends T1 and T2, where amino acids found in the Murchison meteorite are marked with + signs: Ile+, Glu++, Val++, Gly+++, Leu+, Asp+, Ala++, Pro++.]
Cronin, J.R. and Pizzarello, S. (1983). Amino acids in meteorites. Adv. Space Res. 3: 5-18.

Conclusions:
• A simple description of the amino acid compositions of proteomes (free from any a priori model) revealed fundamental evolutionary properties:
  • segregation of eukaryotes;
  • segregation of hyperthermophiles;
  • non-discrimination of psychrophiles.
• Amino acid signatures for hyperthermophiles and for eukaryotes.
• The amino acid distribution is consistent with suggested model chronologies of their recruitment into the genetic code;
• Correspondence Analysis helped to reveal these properties.

General Conclusion
• Amino acids are significant markers for species evolution.
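The input to this whole first part is a table of amino acid percentages per proteome, extended with the char, pol, hyd and pol-char columns. A minimal sketch of how one such row could be computed is given below. The class memberships are an assumption: they are not stated in the slides, but were inferred because they reproduce the char, pol and hyd values of the first organism in the composition table (for example D + E + K + R + H = 26.3). The function names are hypothetical, and a non-empty proteome is assumed.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # column order used in the slides

# Assumed class membership, inferred from the table rather than stated
CHARGED = "DEKRH"
POLAR = "NCQSTYG"
HYDROPHOBIC = "AILMFPWV"

def composition(proteins):
    """Percentage of each of the 20 amino acids over a whole proteome."""
    counts = Counter(
        aa for seq in proteins for aa in seq.upper() if aa in AMINO_ACIDS
    )                                   # X, * and other symbols are skipped
    total = sum(counts.values())
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

def properties(comp):
    """Summed percentages per physico-chemical class, plus pol-char."""
    char = sum(comp[aa] for aa in CHARGED)
    pol = sum(comp[aa] for aa in POLAR)
    hyd = sum(comp[aa] for aa in HYDROPHOBIC)
    return {"char": char, "pol": pol, "hyd": hyd, "pol-char": pol - char}

# Hypothetical two-protein "proteome", for illustration only
toy_proteome = ["MKVLAG", "MSTDE"]
comp = composition(toy_proteome)
props = properties(comp)
```

One row per organism, built this way, gives the species-by-amino-acid table that correspondence analysis takes as input.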
Genome Trees from Whole Proteome Comparisons

Outline
• Species tree construction and difficulties;
• Post-genome era species tree construction;
• Conservation profiles;
• Genome tree construction based on conservation profiles;
• Conclusions;
• References.

Species tree - Tree Of Life
• 16S/18S rRNA tree (Woese 1990);
Woese and others have used rRNA comparisons to construct a "Tree Of Life" showing the evolutionary relationships of a wide variety of organisms.
The "Tree Of Life" has long served as a useful tool for describing the history and relationships of organisms over evolutionary time. One species is represented as a branching point, or node, on the tree, and the branches represent paths of descent from a parental node.

[Figure: four proposals, from Martin & Embley, Nature 431: 134-7 (2004)]
• The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS 87: 4576-4579 (1990)
• The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle. Science 284: 2124-8 (1999)
• The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95: 9720-23 (1998)
• The ring of life, incorporating lateral gene transfer but preserving the prokaryote-eukaryote divide. Rivera & Lake. Nature 431: 152-5 (2004)

• Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science 306: 1144-1145 (2004)
• Prospects for Building the Tree of Life from Large Sequence Databases. Driskell et al. Science 306: 1172-1174 (2004)
• The 1.2-Megabase Genome Sequence of Mimivirus. Raoult et al. Science 306: 1344-1350 (2004)

Pennisi, E. (1998). Genome data shake tree of life. Science 280: 672-4.
New genome sequences are mystifying evolutionary biologists by revealing unexpected connections between microbes thought to have diverged hundreds of millions of years ago. This suggests constructing species trees from their whole gene content.

[Figure: gene-content tree with the branches B, A and E]
Genome phylogeny based on gene content (1999). Snel, Bork, Huynen.
Nature Genetics 21, 108-110.
Tekaia, Lazcano & Dujon (1999). Genome Research 9: 550-7.
[Figure: genome tree with the branches B, A and E]

The abundance of genome data is raising expectations of accurately depicting the evolutionary history of all genomes.
Idea: construct a species tree from many genes instead of only one gene.

Gene tree - Species tree
[Figure: a gene tree for species A, B and C, shaped by duplication and speciation events over time, next to the species tree for A, B and C. Source: Genomes, 2nd edition, T.A. Brown, 2002.]

Problems with species tree construction
• The main difficulties include extensive incongruence between alternative phylogenies generated from single-gene data sets:
  - genes do not evolve at the same rate nor in the same way;
  - the evolutionary history inferred from one gene may differ from what another gene appears to show.

Alternative solutions: integrative methods
• "Supertree": the supertree approach estimates phylogenies for subsets of genes with good overlap, then combines these subtree estimates into a supertree.
  • It depends on the ability to distinguish between orthologs and paralogs;
  • Supertree approaches are controversial, in part because the methodology results in a degree of disconnection between the underlying genetic data and the final tree produced. Bininda-Emonds et al. 2002
• "Phylogenomic tree" (based on the concatenation of a gene sample common to the considered species S1 ... Sn):
  • genes do not evolve at the same rate nor in the same way;
  • only a limited number of genes are shared among all species. The tree of one percent (2006). Dagan and Martin. Genome Biology, 7:118.

More generally, these methods suffer from difficulties related to phylogenetic tree construction:
• global sequence alignment (quality, gaps, ...);
• different evolutionary histories of genes;
• substitution saturation; ...
and, more seriously, from gene sampling difficulties.

Gene tree - Species tree: the gene sampling problem (adapted from Linder, Moret, Nakhleh, Warnow)
[Figure: true species tree for A, B and C, with a duplicated gene (red and blue copies). Blue is lost in A and B; red is lost in C. The sampled gene tree ≠ the species tree.]
[Figure: all red orthologs have been lost in the 3 species. Luckily, sampling gives the blue orthologs, and the true species tree is reconstructed.]
[Figure: all versions of the gene are present in the 3 species. The gene trees are the same as the species tree.]

The genome tree is another alternative for constructing a species tree.
• The concept of a genome tree is based on overall gene content similarity (it considers more than single-gene information).

Methodology
Correspondence Analysis followed by Classification:
• orthogonal system;
• use of Euclidean distance.

Systematic Analysis of Completely Sequenced Organisms
• In silico species-specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999), with blastp, PAM250 and the SEG filter, of each proteome against every other proteome (Proteome1 ... Proteomen);
• 99 species (B: 53; A: 19; E: 27), i.e. 27 eucaryal, 19 archaeal and 53 bacterial species;
• a total of 541880 proteins.
Results derived from the comparisons:
• Degree of ancestral duplication and of ancestral conservation between pairs of species;
• Families of paralogs (Partition-MCL);
• Families of orthologs (Partition-MCL);
• Distribution of orthologous families according to the three domains of life;
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles.

Genome trees: data matrices
T = {Tij; i = 1,n; j = 1,n}, where n is the number of surveyed species.
Tij is the overall similarity score between species j and i.
• Ancestral duplication and ancestral conservation
T = {Tij = wij = (number of proteins in j conserved in i)/size(j); i = 1,n; j = 1,n}.
n = 99 species, and T corresponds to 541880 total proteins.

Ancestral duplication and ancestral conservation (%; first 12 rows, further columns elided):

org  SC   SP   CE   DM   AG   CA   ATH  HS   MUS  FR   PF   ECUN MJ   MTH  AF   PH   PA   APEM TA   TV   H    SSP2 PFU  STO  PYAE MA   MK   MMA  HI   ...  tnsp
SC   40.5 58.4 38.1 40.5 40.9 71.8 40.3 43.0 41.7 42.0 25.9 19.5 11.5 13.6 14.4 16.3 14.3 15.5 15.2 15.4 14.8 16.7 17.0 18.6 15.6 16.0 13.0 14.8 13.0 ...  74.4
SP   63.9 37.4 46.6 50.2 50.2 65.5 47.8 53.3 52.5 52.6 31.2 23.4 13.3 16.2 16.5 18.7 15.2 20.1 17.5 17.8 17.7 19.4 22.8 23.1 19.5 18.9 14.6 17.4 14.3 ...  79.2
CE   17.5 18.8 65.2 39.2 39.8 18.4 21.7 40.0 39.5 40.0 13.1 8.9  4.9  4.6  5.9  5.0  5.4  4.8  5.9  6.2  5.8  7.1  6.5  6.8  5.3  7.1  4.0  6.4  4.8  ...  49.7
DM   27.1 29.3 51.9 65.8 73.1 27.7 31.5 61.3 62.1 60.7 19.3 13.1 6.7  7.4  8.2  7.1  7.5  7.3  8.3  8.3  8.3  9.1  9.3  8.6  8.2  10.8 6.2  9.2  7.3  ...  76.4
AG   22.3 26.3 50.6 69.9 59.5 25.7 30.3 54.5 54.7 59.9 15.9 10.8 6.0  7.6  8.7  9.2  7.3  10.6 8.3  8.7  9.8  9.4  11.1 11.4 9.9  12.5 6.1  9.5  8.5  ...  81.0
CA   65.9 54.3 35.5 37.5 38.0 35.8 37.0 39.7 39.1 39.5 22.2 16.2 10.2 11.2 11.8 11.1 11.9 10.3 12.7 13.3 12.0 14.2 13.3 13.7 11.8 14.7 10.7 13.5 11.1 ...  72.6
ATH  23.4 25.0 27.5 29.5 30.6 24.3 83.6 32.1 31.5 32.7 16.3 11.4 6.0  8.0  8.7  9.7  7.4  9.4  8.2  8.3  10.2 9.5  12.3 11.1 9.5  9.7  6.9  8.1  8.7  ...  58.8
HS   22.9 25.0 44.6 50.3 50.2 23.2 25.6 66.7 76.8 68.7 17.2 12.0 4.8  5.1  5.6  5.2  5.5  5.2  5.3  5.6  5.5  6.2  7.0  5.9  5.8  7.4  4.6  6.6  4.4  ...  78.7
MUS  27.3 29.6 54.4 62.7 60.3 27.8 29.7 90.8 77.8 81.8 21.0 15.2 5.6  6.1  6.6  6.0  6.4  5.9  6.3  6.8  6.6  7.4  8.0  7.1  6.9  8.7  5.4  7.9  5.4  ...  93.7
FR   18.0 20.0 42.4 47.9 48.7 18.5 21.9 68.8 67.7 63.4 13.2 9.0  3.7  4.0  4.5  4.1  4.3  3.9  4.2  4.4  4.5  4.9  5.6  4.5  4.5  6.4  3.5  5.3  4.0  ...  72.8
PF   22.5 24.6 24.8 26.5 26.5 22.3 26.2 28.2 27.6 27.6 28.3 13.6 8.7  8.3  8.6  7.9  8.3  7.2  8.6  8.7  8.0  9.5  9.1  9.1  8.1  9.8  7.3  9.7  8.2  ...  42.3
ECUN 35.8 38.4 34.8 36.3 36.0 35.7 33.4 37.7 37.2 37.4 28.9 26.1 15.4 15.2 15.4 15.3 15.9 14.9 14.8 15.0 13.9 15.9 17.1 15.7 15.0 17.0 14.1 15.8 8.7  ...  48.1

Wij conservation tree
• species are clustered into 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• "whole genome" species clustering tree;
• very low resolution of deep clustering.

Genome trees: data matrices
T = {Tij; i = 1,n; j = 1,n}, where n is the number of surveyed species.
Tij is the overall similarity score between species j and i.
• Shared orthologous genes: sij = (shared orthologs between i and j)
T = {Tij = sij/size(j); i = 1,n; j = 1,n}

Note on: Homologs - Paralogs - Orthologs
[Figure: an ancestral gene A is duplicated into A and B; after speciation, Species-1 carries A1 and B1 and Species-2 carries A2 and B2.]
• Homologs: A1, B1, A2, B2
• Paralogs: A1 vs B1 and A2 vs B2
• Orthologs: A1 vs A2 and B1 vs B2

Shared orthologous genes (first 12 columns):

org   SC    SP    CE    DM    AG    CA    ATH   HS    MUS   FR    PF    ECUN
SC    0     2532  1533  1660  1671  3371  1582  1789  1733  1731  890   600
SP    2532  0     1753  1917  1907  2588  1754  2060  2032  2024  1008  645
CE    1533  1753  0     3910  3869  1611  1902  4036  3994  4047  1015  580
DM    1660  1917  3910  0     7018  1728  2094  5057  5147  5035  1106  616
AG    1671  1907  3869  7018  0     1738  2160  5016  5013  5059  1085  617
CA    3371  2588  1611  1728  1738  0     1590  1850  1824  1827  873   595
ATH   1582  1754  1902  2094  2160  1590  0     2404  2406  2399  1067  539
HS    1789  2060  4036  5057  5016  1850  2404  0     14053 10286 1185  638
MUS   1733  2032  3994  5147  5013  1824  2406  14053 0     10304 1169  632
FR    1731  2024  4047  5035  5059  1827  2399  10286 10304 0     1146  626
PF    890   1008  1015  1106  1085  873   1067  1185  1169  1146  0     453
ECUN  600   645   580   616   617   595   539   638   632   626   453   0
MJ    238   233   214   216   242   230   279   223   216   217   169   142
MTH   254   247   237   247   278   245   306   251   248   249   171   141
AF    261   255   254   260   303   248   310   260   263   265   182   151
PH    251   245   250   259   297   237   281   273   258   271   187   155
PA    267   261   255   268   311   256   312   276   273   278   189   156
APEM  212   233   228   228   251   215   242   248   237   230   165   136
TA    264   260   252   254   279   261   298   268   264   261   182   141
TV    263   255   256   249   276   258   296   260   258   270   184   138
H     255   264   258   249   284   248   318   271   267   272   173   140
SSP2  302   317   293   292   326   300   360   310   309   311   200   155
PFU   264   284   256   275   324   286   316   292   274   280   195   150
STO   281   291   273   263   313   278   329   293   282   298   196   143
PYAE  245   258   236   249   285   238   278   258   246   256   170   143
MA    303   316   298   293   368   301   369   329   326   326   200   161
MK    210   214   195   204   216   211   244   205   202   195   160   125
MMA   289   298   276   280   338   280   349   305   299   297   194   160
HI    268   273   231   243   388   268   382   259   259   267   181   86

sij orthologs tree
• 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• better resolution of deep species clustering.

• Large-scale comparative analysis of predicted proteomes revealed significant evolutionary processes.
[Figure: evolutionary processes acting between an ancestor and a species genome: Expansion* (genesis, duplication, HGT), Phylogeny*, Exchange* (HGT), selection*, Deletion* (loss).]
Expansion, Exchange and Deletion are noise. They should be eliminated or at least reduced.
To overcome some of these limitations, we consider genome tree construction from "Protein Conservation Profiles" and attempt to reduce noisy evolutionary processes.

Conservation profiles
• 99 species (B: 53; A: 19; E: 27); 541880 proteins
p 0111111000111111111000110110111101001111101111
• A "conservation profile" is an n-component binary vector describing a protein conservation pattern across n species. Components are 0 and 1, following absence or presence of homologs.
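The definition of a conservation profile translates directly into code. The sketch below is an illustration under stated assumptions: the per-protein lists of species with a detected homolog would come from the species-specific comparisons, the five-species order and the protein names are hypothetical, and the protein's own species is assumed to be marked 1, as the example profiles in the slides suggest.

```python
def conservation_profile(own_species, hit_species, species_order):
    """Binary conservation profile of one protein across the surveyed species.

    own_species: species the protein belongs to (assumed to count as present).
    hit_species: species in which a homolog was detected.
    species_order: fixed order of the n species, one profile component each.
    """
    present = set(hit_species) | {own_species}
    return "".join("1" if s in present else "0" for s in species_order)

# Hypothetical subset of species and hypothetical comparison results
species_order = ["SC", "SP", "CE", "MJ", "HI"]
hits = {
    ("SC", "prot1"): [],                        # species-specific protein
    ("SC", "prot2"): ["SP", "CE", "MJ", "HI"],  # conserved everywhere
    ("MJ", "prot3"): ["SC", "HI"],
}

profiles = {
    key: conservation_profile(key[0], found, species_order)
    for key, found in hits.items()
}
```

Stacking the profiles of all proteins gives the proteins-by-species binary table on which the rest of the method operates.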
Main interesting properties of conservation profiles:
• Conservation profiles are signatures of evolutionary relationships;
• A conservation profile is the trace of protein evolutionary histories jointly captured in a set of n species (multidimensional feature).

Protein conservation profiles
Columns: species S1 ... Sn, grouped by domain (E, A, B); rows: proteins Gk,s (protein k of species s).

G1,1   100000000000000000000000000000000000000000000000
G2,1   111111111111111111111111111111111111111111111111
G3,1   111111110011111111111111011101110101111111101111
.......................................................
Gn1,1  100001110001000000000000000000000000000000000000
G1,2   010000000000000000010100000000000111000011100011
G2,2   010000000000000000010100000000000111000011100011
........................................................
Gn2,2  111111110011111111111111011101110101111111101111
........................................................
G1,n   011110100000000000000000001000000000000000000001
G2,n   111111110011111111100011011101110101111111101111
G3,n   111111110011111111100011011101110101111111101111
........................................................
Gnp,n  100110000000000000000000000000000000000000000001

Table: 541880 proteins x 99 species
• Different conservation profiles represent different evolutionary histories.

Distinct conservation profiles
• 541880 original total proteins (99 species)
• 442460 non-specific proteins, i.e. conservation profiles (82%)
• 184130 distinct conservation profiles (42%)

100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
111111110011111111111111011101110101111111101111
010000000000000000010100000000000111000011100011
100110000000000000000000000000000000000000000001
................................................
(one representative from each set of identical conservation profiles)
• The effect of the duplication process is reduced.
• This set is indicative of the various observed evolutionary histories.
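The reduction to distinct profiles, and the Jaccard score that the slides go on to use for comparing two species over those profiles, can be sketched as follows. This is an illustration, not the authors' code. It assumes that a "species-specific" protein is one whose profile has a single 1, which is an interpretation of the slide rather than a stated rule, and the four-species profiles are hypothetical.

```python
def distinct_profiles(profiles):
    """Drop species-specific profiles, then keep one copy of each profile.

    Assumption: a profile with a single '1' marks a species-specific protein.
    """
    non_specific = [p for p in profiles if p.count("1") > 1]
    return sorted(set(non_specific))

def jaccard(profiles, i, j):
    """Jaccard score between species i and j: N11 / (N11 + N01 + N10)."""
    n11 = sum(1 for p in profiles if p[i] == "1" and p[j] == "1")
    n10 = sum(1 for p in profiles if p[i] == "1" and p[j] == "0")
    n01 = sum(1 for p in profiles if p[i] == "0" and p[j] == "1")
    denominator = n11 + n10 + n01
    return n11 / denominator if denominator else 0.0

# Hypothetical profiles over 4 species, for illustration only
raw = ["1000", "1111", "1111", "1100", "0110"]
kept = distinct_profiles(raw)
n_species = len(kept[0])

# Symmetric species-by-species similarity matrix T = {Tij = sij}
T = [[jaccard(kept, i, j) for j in range(n_species)] for i in range(n_species)]
```

Because identical profiles are collapsed before scoring, a family expanded by duplication contributes once rather than once per copy, which is the noise reduction the slides describe.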
[Figure: histogram of the distinct conservation profiles by conservation weight (sum of "1", i.e. presences; classes c01 to c99). Vertical axis: fractions (x 10000) of distinct conservation profiles, from 0 to 250.]
Presence in the 184130 distinct conservation profiles: mean = 32.2; SD = 23.3; min = 1; max = 99.

Genome tree construction: data matrices
• 184130 distinct conservation profiles, i.e. various evolutionary histories (i and j denote two species, i.e. two columns):

100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
111111110011111111111111011101110101111111101111
010000000000000000010100000000000111000011100011
100110000000000000000000000000000000000000000001
................................................

• Jaccard similarity scores between species:
$$s_{ij} = \frac{N_{11}}{N_{11} + N_{01} + N_{10}}$$
where N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between i and j.
T = {Tij = sij; i = 1,n; j = 1,n}

[Figure: profiles tree]
Tekaia F, Yeramian E. (2005). PLoS Comput Biol. 1(7): e75

Conclusions: Methodology
• Species classification is not an easy task!
• Species tree construction should take into account the whole information included in the genomes;
• Methods that take into account whole-genome information are still needed;
• The correspondence analysis method can help reveal evolutionary trends embedded in the multidimensional relationships obtained from large-scale genome comparisons.

Conclusions...
• Conservation profiles represent the most conserved and meaningful evolutionary signals jointly captured in a set of species;
• Thus they should correspond to the most accurate type of markers for species classification;
• In principle, the profiles tree derived from distinct conservation profiles should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals;
• The profiles tree presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering;
• The profiles tree corresponds to the classification of the evolutionary scenarios.

References:
• Tekaia, F. and Dujon, B. (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. J. Mol. Evol. 49: 591-600.
• Tekaia, F., Lazcano, A. and Dujon, B. (1999). The genomic tree as revealed from whole proteome comparisons. Genome Res. 9: 550-7.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002). Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297: 51-60.
• Tekaia, F. and Yeramian, E. (2005). Genome Trees from Conservation Profiles. PLoS Comput Biol. 1(7): e75.
• Tekaia, F. and Latgé, J.P. (2005). Aspergillus fumigatus: saprophyte or pathogen? Curr. Opin. Microbiol. 8: 385-92. Review.
• Tekaia, F. and Yeramian, E. (2006). Evolution of Proteomes: Fundamental signatures and global trends in amino acid composition. BMC Genomics 7: 307.
• Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/~tekaia/sacso.html

References:
• Bininda-Emonds, O.R.P. (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395: 745-757.
• Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002). The (super)Tree Of Life: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics 33: 265-289.
• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology 7: 118.
• Delsuc, F., Brinkmann, H. and Philippe, H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6: 361-75. Review.
• Doolittle (1999). Science 284: 2124-8.
• Driskell et al. (2004). Science 306: 1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Crandall, K.A. and Buhay, J.E. (2004). Science 306: 1144-1145.
• Linder, Moret, Nakhleh and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431: 134-7.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi, E. (1998). Genome data shake tree of life. Science 280: 672-4.
• Rivera & Lake (2004). Nature 431: 152-5.
• Raoult et al. (2004). Science 306: 1344-1350.
• Snel, Bork, Huynen (1999). Genome phylogeny based on gene content. Nature Genetics 21: 108-110.
• Snel, B., Huynen, M.A. and Dutilh, B.E. (2005). Genome trees and the nature of genome evolution. Annu. Rev. Microbiol. 59: 191-209. Review.
• Woese et al. (1990). PNAS 87: 4576-4579.