1) Construction of Genome Trees from Conservation Profiles of Proteins
Fredj Tekaia
e-mail: [email protected]
Ecole Phylogénomique
Organized by INRA
Talk given on 13/12/2006.
This image appeared in PLoS Computational Biology, Vol. 1, No. 7, December 2005.
The last part of the talk corresponds to results from a collaboration with Edouard Yeramian.
2) Outline of the talk
Species tree construction has long been a difficult task.
The aim of this talk is to review traditional species tree construction methods and to emphasize their
strengths and limitations.
In the post genome era many attempts to construct species trees from their whole gene content
are emerging.
Large-scale genome comparison studies shed light on many evolutionary processes other
than the phylogenetic evolutionary process.
We have introduced a Genome tree construction based on conservation profiles. This
construction method takes into account all proteins included in the considered proteomes and
reduces the duplication effects. This tree does not show the evolutionary histories of the
species, but rather clusters species according to the similarity of their evolutionary histories.
In this talk I will present:
-the traditional “species tree” as constructed by Woese in 1990 and the alternative topologies
that have been proposed particularly in the light of recently acquired knowledge from genome
-the alternative species tree constructions taking into account several orthologous genes,
instead of only one gene type;
-limitations of constructing species trees from gene trees;
-necessity of considering the whole gene content of the genomes to construct species trees;
-the main evolutionary processes revealed by large-scale genome comparisons (including
phylogeny, expansion, exchange and deletion);
-how to reduce the non-phylogenetic evolutionary processes, particularly duplication;
-large-scale proteome comparison and determination of protein conservation profiles;
-Genome tree construction from conservation profiles: advantages and limitations.
3) The phylogenetic tree constructed by Woese in 1990, based on ribosomal RNA genes, is
generally referred to as the "Species tree" or "Tree of life". This tree is supposed to show
the evolutionary relationships between all organisms. These are organized into three main
domains, called phylogenetic domains: Bacteria, Archaea and Eukaryotes.
The tree of life has long served as a useful tool for describing the history and relationships of
organisms over evolutionary time. One species is represented as a branching point, or node,
on the tree, and the branches represent paths of descent from a parental node. The tree
diagram carries an implicit assumption that genes are transferred vertically, from parent to
child, and that all the genes in a new species come from the ancestral species. In theory, one
should be able to trace the origin of each gene in a species back to its ancestor. In practice,
however, the ancestral gene is rarely available, so researchers look for the gene in a closely
related species. (These similar genes, which diverge slightly after a speciation event, are
called orthologs.)
4) The topology of the tree of life has been and is still under debate.
In a recent work, Martin and Embley diagrammed the main alternative topologies that have
been proposed:
-the first alternative proposal came in 1998, when Mayr proposed the "two-empire" topology
separating Eukaryotes from Prokaryotes and Eubacteria from Archaebacteria;
-in 1999, Doolittle proposed the topology of "three domains with continuous lateral gene
transfer among domains";
-in 2004, Rivera and Lake proposed the "Ring of life" topology incorporating "lateral gene
transfer" but preserving the "Prokaryote-Eukaryote" divide.
5) In the same year, 2004, Raoult et al. suggested introducing a new "Mimivirus" branch on the
three domains tree, and others proposed the construction of the "Tree of life" from large
sequence databases.
6) But already in 1998, just after the completion of the sequences of a few small genomes,
Elizabeth Pennisi, in her commentary "Genome data shake tree of life", noted that:
"New genome sequences are mystifying evolutionary biologists by revealing unexpected
connections between microbes thought to have diverged hundreds of millions of years ago".
Pennisi suggested the "tree of life" should take into account the whole information included in
the genomes.
7) Almost at the same time, just after these comments, the first "Genome trees" appeared, based
on the genome sequences available at that time.
Here is the one based on the set of orthologs as deduced from the species "gene content".
This tree is consistent with the "three domains" tree.
8) And here is the one based on "Whole proteome comparisons", considering the degree of
duplication and conservation between genomes.
Here also the tree is consistent with the "three domains" tree: Eukarya, Archaea and Bacteria,
with the notable difference that Archaea are closer to Bacteria than to Eukaryotes.
9) The significant number of available completely sequenced genomes from the 3
phylogenetic domains and with different lifestyles offers an unprecedented opportunity to
explore, at the genome level, species evolution and classification.
Abundance of genome data is raising expectations to accurately depict the evolutionary
history of all organisms.
The expectation is now to construct species trees from many genes instead of only one gene,
and possibly from all genes of the considered species.
Note that up-to-date statistics on complete
10) But there are serious difficulties.
This scheme, adapted from the book "Genomes" by Brown (second edition, 2002), shows the
difficulty of constructing species trees from genes.
Consider this evolutionary scheme, with 2 "gene duplication" events, 2 "speciation"
events and random genetic drift, resulting in one "blue" gene in species C, a "green" gene in
species B and a "red" gene in species A.
The topology of the gene tree would be the upper one, whereas the topology of the species
tree would be the lower one.
It is clear that these topologies are different!
Gene trees are not the same as species trees.
Be aware that a node represents a mutation (duplication) event in a gene tree and a speciation event in
a species tree.
Estimating the species phylogeny is not easy!
11) "Before, people tended to equate rRNA trees with the [life history] tree of the organism”:
"From the whole genomes, one quickly comes across [genes] that don't agree with the rRNA
12) To overcome these difficulties, alternative solutions have been proposed, including
"integrative methods" and "whole genome phylogeny".
The first proposed integrative method is the “supertree” method.
See the reference for details on the method:
Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002).
The (super)tree of life: procedures, problems, and prospects.
Annual Review of Ecology and Systematics, 33:265-289.
To solve the problem of sparseness, the authors built a "supertree".
The supertree approach estimates phylogenies for subsets of data with good overlap, then
combines these subtree estimates into a supertree.
As shown in this figure:
-there are 2 trees in (a): 1 and 2 are present only in the first tree; 6 and 7 are present only in the
second tree; only genes 3, 4 and 5 are common to both subtrees;
-a matrix is constructed, where rows correspond to the different elements and columns to the
different nodes found in all subtrees;
-a supertree is constructed from this matrix.
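The matrix step can be sketched as follows. This is a minimal illustration, not the procedure of the cited authors: the two subtree topologies are invented, and the coding used here (1 inside a node, 0 outside it, "?" when the element is missing from that subtree) is the usual matrix-representation convention, assumed rather than taken from the figure.

```python
def matrix_representation(trees):
    """Encode each internal node (clade) of each subtree as one matrix column.

    trees: list of (leaves, clades) pairs, where `leaves` is the set of
    elements present in a subtree and `clades` is a list of leaf sets,
    one per internal node of that subtree.
    Returns the sorted element list and a dict element -> row string.
    """
    elements = sorted(set().union(*(leaves for leaves, _ in trees)))
    rows = {e: "" for e in elements}
    for leaves, clades in trees:
        for clade in clades:
            for e in elements:
                if e not in leaves:
                    rows[e] += "?"   # element absent from this subtree
                elif e in clade:
                    rows[e] += "1"   # element descends from this node
                else:
                    rows[e] += "0"   # element in the subtree, outside this node
    return elements, rows


# Hypothetical subtrees sharing elements 3, 4 and 5 (topologies are invented).
tree_1 = ({1, 2, 3, 4, 5}, [{1, 2}, {3, 4}, {3, 4, 5}])
tree_2 = ({3, 4, 5, 6, 7}, [{3, 4}, {3, 4, 5}, {6, 7}])

elements, rows = matrix_representation([tree_1, tree_2])
for e in elements:
    print(e, rows[e])
```

Each row can then be treated as a character string for a standard tree-building method; the elements shared by both subtrees (3, 4 and 5) are the only rows without missing entries.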
Several questions remain, however, about this strategy. First, the supertree strategy depends
fundamentally on our ability to distinguish between orthologous (derived from a speciation
event) and paralogous (derived from a duplication event) gene sequences. Second, supertree
approaches themselves are controversial, in part because the methodology results in a degree
of disconnect between the underlying genetic data and the final tree produced. Moreover, this
strategy has yet to be validated by computer simulation or well-established phylogenetic
methods. Third, the supertree approach makes a fundamental assumption: that a bifurcating
tree topology represents the genomic evolutionary history of species. This assumption has
been called into question because of the reality of genetic exchange across species boundaries
through mechanisms such as horizontal gene transfer and hybridization. Depicting
genealogical relationships as networks might better represent the true underlying biology.
13) The second integrative method that has been proposed is the "phylogenomic tree".
This method is based on the concatenation of a sample of genes common to the considered species.
These genes are then aligned and traditional phylogeny construction methods are applied to
construct a phylogenetic tree, called a "phylogenomic tree".
The problem is that:
-genes evolve neither at the same rate nor in the same way;
-only a limited number of genes are shared among all species.
Also, concatenation of sequences of different genes hardly takes into account the specific
evolutionary rate of each gene. Finally, building a consensus tree is strongly limited by the
low number of genes shared among all organisms.
In a recent review, Dagan and Martin called this type of tree "the tree of one percent". They
estimate that only about 1% of the genes of a genome are common to many species.
14) More generally, the previous methods suffer from difficulties related to phylogenetic tree
construction, for example:
-the quality of global alignments of different sequences (particularly those related to the
-the different evolutionary histories of the considered genes;
-the estimation of substitution saturation;
and, more seriously, from gene sampling difficulties.
To illustrate this problem in some detail, the following slides (adapted from C. Randal Linder,
Bernard M.E. Moret, Luay Nakhleh, and Tandy Warnow) show some examples of sampling difficulties.
15) These slides are adapted from:
C. Randal Linder*, Bernard M.E. Moret†, Luay Nakhleh*, and Tandy Warnow*
University of Texas at Austin †University of New Mexico
Through inadequate sampling on our part or irretrievable loss of some gene lineages via
extinction we may unknowingly reconstruct incorrect species trees when attempting to
reconstruct gene trees. When we use multiple gene markers for a phylogeny this sampling
problem can produce well supported, incongruent phylogenetic trees that suggest speciations
that may never have occurred.
Upper Figure:
The paths outlined by the black lines delineate the phylogenetic history of species A, B and C,
where B and C are sister to one another and their clade is sister to A. The colored lines inside
the species tree represent the evolutionary history of a duplicated stretch of DNA. Note that
in this case the duplication event occurs before the origin of the ABC clade. The different
gene lineages will be referred to as the red and the blue lineages.
All of the red lineages are orthologs because they have evolved from a single common
ancestor within the ABC clade. All of the blue lineages are also orthologs for the same
reason. The blue and the red lineages are called paralogs because they represent different
evolutionary histories in each species of the ABC clade (see later a note on paralogs and orthologs).
Lower Figure left:
Note in this example that the red and blue paralogs are inherited by each lineage when
a speciation event occurs, but that in lineages A and B the blue paralog is lost, and in lineage
C the red paralog is lost. A number of biological reasons could lead to loss of different copies
of DNA: random drift, selection, ... The main point is that when we sample this gene we only
have the red version of it in A and B and the blue version in C.
If the gene lineages that we cannot detect are removed, it becomes obvious that the red
versions of the gene that are found in species A and B share a more recent common ancestor
than they do with the blue lineage in species C.
Lower Figure Right:
Because of this, we end up reconstructing this set of relationships between species A, B and
C, where A and B are sister.
It isn’t the correct species tree, but since we are unaware of our error in equating all of the
genes sequenced as orthologs, we mistakenly construct this tree, perhaps with considerable
bootstrap or posterior probability support.
16) Figure on the Left:
Now consider a second hypothetical gene that also underwent a gene duplication event
prior to the ABC clade. In this case, we are more fortunate. Only the red ortholog has been
lost in the three species, and when we sample the extant species we only get the blue ortholog.
Figure on the Right:
Under these circumstances we reconstruct the true species tree.
17) Alternatively, all of the versions of the gene might be in all of the species, but we might
sample each species inadequately to detect all of the versions. The net result will be the same
for our phylogeny.
Figure on the left:
With perfect knowledge and perfect sampling: both the red and blue copies are in A, B
and C.
Figure on the right:
Under these circumstances we have the following correct reconstructions for both sets of
orthologs with the correct topology.
18) Another alternative for constructing species trees is to take into account the whole gene
content of each completely sequenced organism.
We then need an overall gene content similarity score between pairs of species.
This construction can be applied of course only on completely sequenced genomes.
19) To construct the species tree we use this methodology:
-A data table is built in which both observations and variables are the species. At the intersection of a line j
and a column i, kij is a positive number.
Multivariate analysis methods are particularly helpful for the discovery of fundamental
evolutionary trends associated with the multidimensional structure of genomic data, as
obtained from the large scale genome comparisons.
The methodology behind this type of analysis can be summarized by the following procedure:
1) a data table is constructed to describe n observations relative to p variables (in our case
p = n = the number of considered species);
We assume that kij at the intersection of line j and column i is a positive number. kij represents a score of
similarity between j and i. This score should be normalized in order to ensure the additivity of
the scores on a given line or column.
2) We use Correspondence analysis to construct an orthogonal system, in which we represent
the n observations and the p variables.
Note that the above data table is suitable for Correspondence analysis, since scores on a given
line can be summed and also the scores on a given column. The original scores can then be
normalized by their corresponding total scores of their respective line or column.
A generally interesting property of Correspondence analysis is that both sets can be represented
in the same factorial space.
So we can look for relationships between observations; between variables and between
observations and variables. The similarity is represented by the neighborhood between points.
The interesting property of this procedure is that clustering of the observations can be
performed using euclidean distances between observations. Euclidean distances can be
calculated using coordinates on the orthogonal system (factorial axes).
It is this procedure that will be used in constructing genome trees corresponding to different
data tables or matrices.
In what follows we will consider a data table constructed from conservation profiles, but
other examples of data tables (duplication/conservation or orthologs) have been constructed
and analysed (see Tekaia, Lazcano and Dujon (1999), Tekaia, Yeramian and Dujon (2002),
Tekaia and Yeramian (2005) or Tekaia and Yeramian (2006)).
3) a hierarchical method is used to cluster species according to their euclidean distances as
calculated in the factorial space.
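The first two steps of this procedure can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the software used for the talk: the 4 x 4 similarity table is invented, correspondence analysis is computed through a singular value decomposition of standardized residuals (one common formulation), and only two factorial axes are kept.

```python
import numpy as np


def correspondence_analysis(table, n_axes=2):
    """Return row (observation) coordinates on the first factorial axes."""
    K = np.asarray(table, dtype=float)
    P = K / K.sum()                       # scores normalized by the grand total
    r = P.sum(axis=1)                     # line (row) masses
    c = P.sum(axis=0)                     # column masses
    expected = np.outer(r, c)             # independence model
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    F = (U * sing) / np.sqrt(r)[:, None]  # principal coordinates of the rows
    return F[:, :n_axes]


def euclidean_distances(coords):
    """Pairwise Euclidean distances between points in the factorial space."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


# Hypothetical similarity scores kij for 4 species forming two groups.
K = [[1.0, 0.8, 0.2, 0.1],
     [0.8, 1.0, 0.1, 0.2],
     [0.2, 0.1, 1.0, 0.7],
     [0.1, 0.2, 0.7, 1.0]]

coords = correspondence_analysis(K, n_axes=2)
D = euclidean_distances(coords)
print(np.round(D, 3))
```

The distance matrix D (or the coordinates themselves) can then be handed to any hierarchical clustering routine for step 3, for example `scipy.cluster.hierarchy.linkage` if SciPy is available.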
20) To construct a data table that encompasses all pairwise species similarity scores, we have
surveyed 99 predicted proteomes (including 53 bacterial, 19 archaeal and 27 eukaryal
species) and performed species specific comparisons: that is the protein sequences of each
proteome are compared to a database that includes all proteins of another species, using the
blastp program with the PAM250 substitution matrix and SEG filter.
A total of 541880 proteins have been compared in this way.
For the description of the Methodology behind this procedure see:
Tekaia & Dujon (1999). Pervasiveness of gene conservation and persistence of duplicates in
cellular genomes. J Mol Evol. 1999 49(5):591-600.
21) These comparisons are performed with our local system called
"Systematic Analysis of Completely Sequenced Organisms".
Among other things, it allows us to determine:
-the degree of "ancestral duplication" in each species and the degree of "ancestral
conservation" between pairs of species;
-families of paralogs and families of orthologs, derived from these comparisons on the basis of
reciprocal best hits and using Partition and MCL algorithms (http://micans.org/mcl/);
-what we call the gene dictionary for each species (i.e. the orthologs of
each gene);
-the protein conservation profiles.
22) This is a note on "Homologs, Paralogs and Orthologs" and how to determine them at the
sequence level.
This figure shows an ancestral gene A and a duplication event that turns A into 2 identical
genes, A and B; as they evolve, A and B are no longer identical but remain similar.
After the speciation event, species_1 carries copies A1 and B1 of A and B, and
species_2 carries copies A2 and B2 of A and B.
In this scheme:
A1, B1, A2, B2 are homologs
A1 vs B1 and A2 vs B2 are paralogs (they are in the same species and have the same ancestor)
A1 vs A2 and B1 vs B2 are orthologs (they are in different species and have the same ancestor).
In our comparisons we considered as orthologs those proteins that are reciprocally the best hit to
each other.
This is now considered a rough method. More accurate methods, taking into account possible
synteny between genes or chromosomal segments, are now used in the determination of
orthologous genes (see for example: http://sonnhammer.cgb.ki.se).
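The reciprocal-best-hit criterion mentioned in this note can be sketched as follows. The protein names reuse the A1/B1/A2/B2 labels of the scheme, with an extra invented protein C1; the scores are made up, and the dictionary layout is an assumption for illustration, not the output format of blastp.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein.

    scores: dict (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores for species_1 proteins against species_2, and back.
sp1_vs_sp2 = {("A1", "A2"): 250, ("A1", "B2"): 120,
              ("B1", "B2"): 230, ("B1", "A2"): 110,
              ("C1", "A2"): 60}
sp2_vs_sp1 = {("A2", "A1"): 250, ("A2", "B1"): 110,
              ("B2", "B1"): 230, ("B2", "A1"): 120}

orthologs = reciprocal_best_hits(sp1_vs_sp2, sp2_vs_sp1)
print(orthologs)
```

Here C1 has A2 as its best hit, but A2 prefers A1, so C1 is left without an ortholog; this is the kind of case where the criterion is only a rough approximation.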
23) But when considering the whole gene content of species, we are faced with many evolutionary
processes, discovered mainly through large-scale genome comparison studies.
Genomes are not static collections of DNA materials. Various biochemical and cellular
processes—including point mutation, recombination, gene conversion, replication slippage,
DNA repair, translocation and horizontal transfer—constantly act on genomes and drive the
genomes to evolve dynamically.
Evolutionary processes include:
-Phylogeny (deals with what a given species inherited from its ancestor);
-Expansion (including duplication and genesis);
-Exchange (horizontal gene transfer);
-Deletion (loss of genes)
and natural selection.
In principle, in species tree construction we are interested solely in "Phylogeny"; the other
evolutionary processes should be considered as noise.
They should be eliminated or at least reduced.
24) To overcome some of the difficulties in eliminating noisy evolutionary processes, we
attempt to construct genome trees from "conservation profiles" and to reduce parts of
these noisy evolutionary processes.
25) Conservation profiles
Species-specific proteome comparisons allowed us to determine the conservation profiles for the
total of 541880 proteins included in the 99 species.
A conservation profile is an n-component binary vector describing a protein conservation
pattern across n species.
Components are 0 and 1, indicating absence and presence of homologs in the corresponding species.
The conservation profiles can be considered as "signatures of evolutionary relationships” of
the corresponding proteins.
A conservation profile may be considered as the “trace of a protein evolutionary history
jointly captured in a set of species”.
This is an interesting evolutionary multidimensional feature of the “conservation profile”.
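As a small illustration of the definition, a conservation profile can be built from the set of species in which a protein has a homolog. The species and protein names below are hypothetical, and the sketch assumes that a protein counts as present in its own species.

```python
def conservation_profile(protein, species_order, homolog_species):
    """Binary vector: 1 where the protein has a homolog, 0 elsewhere.

    species_order: fixed list of the n species (defines vector positions).
    homolog_species: dict protein -> set of species with a homolog
                     (assumed to include the protein's own species).
    """
    present = homolog_species.get(protein, set())
    return tuple(1 if sp in present else 0 for sp in species_order)


# Hypothetical species and proteins.
species_order = ["sp1", "sp2", "sp3", "sp4"]
homolog_species = {
    "p1": {"sp1"},                 # found only in its own species
    "p2": {"sp1", "sp2", "sp4"},   # conserved in three species
}

profile_p1 = conservation_profile("p1", species_order, homolog_species)
profile_p2 = conservation_profile("p2", species_order, homolog_species)
print(profile_p1, profile_p2)
```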
26) Here are some examples of conservation profiles:
-specific to a given species (the first black one): a "1" is indicated at the first position and 0 elsewhere;
-common to many species (the red one): unique in 2 species and duplicated in a third one;
-a duplicated conservation profile (in only one species) (purple example).
If we assume that different conservation profiles represent different evolutionary histories and
attempt to determine the set of different conservation profiles, we will have an approximation
of possible different evolutionary histories.
27) In order to determine the different evolutionary histories, this slide shows the different
reduction steps that lead to the set of distinct conservation profiles.
About 82% of the original set of proteins are non-specific proteins, i.e. have non-trivial
conservation profiles.
Of these, only 184130 correspond to distinct conservation profiles, i.e. 42%.
We considered only one representative from each set of identical conservation profiles.
In this set, the effects of the duplication process are significantly reduced.
Most interestingly this set should correspond to the various evolutionary histories embedded
in the original set of proteins.
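The reduction steps can be sketched on a toy set of profiles. The sketch assumes that a "trivial" profile is one with a single "1" (a species-specific protein), which is how the slide reads; the profiles themselves are invented.

```python
def reduce_profiles(profiles):
    """Drop species-specific profiles, then keep one copy of each profile.

    Returns (non_specific, distinct): the profiles with more than one "1",
    and the sorted list of distinct profiles among them.
    """
    non_specific = [p for p in profiles if sum(p) > 1]
    distinct = sorted(set(non_specific))
    return non_specific, distinct


# Hypothetical profiles over 3 species; duplicates mimic protein families.
profiles = [(1, 0, 0), (1, 1, 0), (1, 1, 0), (0, 1, 1),
            (1, 1, 1), (1, 1, 1), (0, 0, 1)]

non_specific, distinct = reduce_profiles(profiles)
print(len(profiles), len(non_specific), len(distinct))
print(distinct)
```

Keeping a single representative per profile is what removes the weight of large protein families: a profile shared by two proteins or by two hundred counts once.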
28) This figure shows the distribution of the number of conservation profiles as a function of
the sum of "1" or presence in the conservation profile. (how many "1"s are in a given
conservation profile?)
On the x-axis we have the different classes from 1 to 99 (c01,..., c99), corresponding to the
number of distinct species in which a homolog may be found.
On the y-axis we have the number of conservation profiles.
The figure shows a uniform decrease of the number of conservation profiles relative to the
increase of the number of "1"s (presence in distinct species).
The maximum number of profiles corresponds to profiles including 5 or 6 "1"s.
29) Similarity between a pair of species is calculated using the set of 184130 distinct
conservation profiles. In order to take into account the whole ancestral information included
in this set, Jaccard similarity scores were calculated between all pairs of species:
sij = N11/(N11+N01+N10)
where N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between i and j.
Note that each score is normalized and varies between 0 and 1.
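The score can be computed directly from the columns of the profile set, as in this minimal sketch. The four profiles are invented, and returning 0.0 when neither species has any presence is a convention assumed here, not stated in the talk.

```python
def jaccard(profiles, i, j):
    """Jaccard similarity sij between species i and j over all profiles."""
    n11 = sum(1 for p in profiles if p[i] == 1 and p[j] == 1)
    n10 = sum(1 for p in profiles if p[i] == 1 and p[j] == 0)
    n01 = sum(1 for p in profiles if p[i] == 0 and p[j] == 1)
    denom = n11 + n01 + n10
    return n11 / denom if denom else 0.0  # assumed convention for empty columns


# Hypothetical distinct conservation profiles over 3 species.
profiles = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0)]

s01 = jaccard(profiles, 0, 1)
s02 = jaccard(profiles, 0, 2)
s12 = jaccard(profiles, 1, 2)
print(s01, s02, s12)
```

Joint absences (0,0) do not enter the score, so two species are not made more similar by the many proteins that neither of them has.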
30) All pairwise Jaccard scores between species are arranged in a data table.
This data table (similarity matrix) is submitted to correspondence analysis, and clustering of
species is performed using Euclidean distances as calculated from their factorial coordinates
(see Methodology slide).
31) We applied the Methodology presented in the beginning of the talk and obtained the
"profiles tree" shown here.
Detailed discussion of this tree can be found in our paper: Tekaia F, Yeramian E. (2005).
PLoS Comput Biol.1(7):e75.
Main clusters are:
- three phylogenetic groups: Eukarya (E), Archaea (A) and Bacteria (B).
-as shown by the colors in the different branches, most of the traditional clusterings are
Some clusterings are not in accordance with the traditional clustering, for example:
-S. coelicolor, which is separated from the other Actinobacteria;
-T. tengcongensis, which is separated from the other Firmicutes.
32) Conclusions
-Species classification is not an easy task.
-Species tree construction should take into account the whole information included in their genomes;
-We still need methods that construct such trees;
-In the case of conservation profiles (multidimensional features of evolutionary histories)
Correspondence Analysis might be helpful in revealing significant trends embedded in the
multidimensional relationships as obtained from large scale genome comparisons.
-Conservation profiles represent most conserved and meaningful evolutionary signals jointly
captured in a set of species;
-they should correspond to most accurate type of markers for species classification;
-In principle the "profiles tree" should considerably minimize genome acquisition effects and
should reflect less noisy phylogenetic signals.
When considering distinct conservation profiles, variations in the size of protein families should not
influence the tree building procedure (one conservation profile per set of proteins with
identical conservation profiles), which significantly reduces the sensitivity to gene acquisition;
-the "profiles tree" presents evidence of conservation of stable phylogenetic relationships and
reveals unconventional species clustering.
-It corresponds to the classification of the evolutionary scenarios that are embedded in the
considered species.
34) Acknowledgments:
The support of:
• The Institut Pasteur (Strategic Horizontal Programme on Anopheles gambiae)
• The Ministère de la Recherche Scientifique (France):
ACI-IMPBIO-2004–98-GENEPHYS program.
• Bernard Dujon (Institut Pasteur) for constant support.
35) References
• Tekaia, F. and Dujon, B. (1999).
Pervasiveness of gene conservation and persistence of duplicates in cellular genomes.
Journal of Molecular Evolution, 49:591-600.
• Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome tree as revealed from whole proteome
comparisons. Genome Res. 12:17-25.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002).
Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a
global picture with correspondence analysis. Gene 297: 51-60.
• Tekaia, F. and Yeramian, E. (2005).
Genome Trees from Conservation Profiles. PLoS Comput Biol.1(7):e75.
• Tekaia, F. and Yeramian, E. (2006).
Evolution of Proteomes: Fundamental signatures and global trends in amino acid
composition. BMC Genomics. 7:307.
• Tekaia F, Latgé JP. (2005). Aspergillus fumigatus: saprophyte or pathogen?
Curr Opin Microbiol. 8:385-92. Review.
• Systematic analysis of completely sequenced organisms:
36) References
• Bininda-Emonds ORP (2005). Supertree Construction in the Genomic Age.
Methods in Enzymology 395: p.745-757.
• Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002).
The (super)tree of life: procedures, problems, and prospects.
Annual Review of Ecology and Systematics, 33:265-289.
• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology, 7:118.
• Delsuc F, Brinkmann H, Philippe H. (2005). Phylogenomics and the reconstruction of the
tree of life. Nat Rev Genet. 6:361-75. Review.
• Doolittle. Science 284:2124-8. (1999).
• Driskell et al. (2004). Science, 306:1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Crandall, K.A. and Buhay, J.E. (2004). Science, 306:1144-1145.
• Linder, Moret, Nakhleh, and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431:134-7.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi, E. (1998). Genome data shake tree of life. Science, 280:672-4.
• Rivera & Lake (2004). Nature 431:152-5.
• Raoult et al. (2004). Science, 306:1344-1350.
• Snel, Bork, Huynen (1999). Genome phylogeny based on gene content.Nature Genetics 21,
• Snel B, Huynen MA, Dutilh BE (2005). Genome trees and the nature of genome
evolution. Annu Rev Microbiol. 59:191-209. Review.
• Woese et al. (1990). PNAS. 87:4576-4579.