Exploring evolutionary trends in Proteomes

Fredj Tekaia
Edouard Yeramian
Institut Pasteur
tekaia@pasteur.fr
[Figure: map of species groups: psychrophiles, eukaryotes, hyperthermophiles, thermophiles, prokaryotic mesophiles]
[Figure: Tree of life, annotated with the counts 433, 36 and 46]
http://www.genomesonline.org/
Complete genomes
2434 projects
• 520 published
(01-03-07)
• 1086 Bacteria
• 59 Archaea
• 696 eukaryotes
• 73 metagenomes
• 3 phylogenetic domains;
• Lifestyles: mesophiles; (hyper)thermophiles; psychrophiles; extreme conditions, ...
• Data driven exploratory analyses as opposed to
model driven methods.
• In the post genomic era, multidimensional data
resulting from large scale genome comparisons are
available.
• Multivariate analysis methods are particularly
helpful for the discovery of evolutionary trends
associated with such data.
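Methodology
[Figure: the data table T (n rows i, p columns j, entries kij > 0, plus supplementary rows and columns "sup") is submitted to Correspondence Analysis, which places rows and columns on the factorial axes F1 ... Fp.]
Supplementary elements are positioned with the transition formula
$F(i_s) = \lambda^{-1/2} \sum_{j=1}^{p} f^{i_s}_{j}\, G(j)$
where $\lambda$ is the eigenvalue of the axis, $f^{i_s}_{j}$ is the profile of the supplementary row $i_s$ and $G(j)$ is the coordinate of column $j$ on that axis.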
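Methodology
[Figure: the data table T (n rows i, p columns j, entries kij > 0, plus supplementary elements "sup") is submitted to Correspondence Analysis; the points obtained on the factorial axes F1 ... Fp are then submitted to Classification.]
Classification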
• orthogonal system;
• use of euclidean distance;
1. Evolution of Proteomes:
Signatures and Trends in Amino Acid
Compositions
2. Genome Trees from Whole Proteome
Comparisons
Evolution of Proteomes:
Signatures and Trends in Amino Acid
Compositions
[Figure: map of species groups: hyperthermophiles, thermophiles, psychrophiles, eukaryotes, prokaryotic mesophiles]
Mining the wealth of information contained in complete
genomes, to decipher the genomic characteristics underlying the
adaptive evolution of organisms to extreme conditions such as
high or low temperatures, has long been a matter of
interest:
• Kreil DP, Ouzounis CA (2001). Identification of thermophilic species by the amino acid compositions
deduced from their genomes. NAR 2001, 468: 1608-15.
• Tekaia F, Yeramian E, Dujon B (2002). Amino acid composition of genomes, lifestyles of organisms,
and evolutionary trends: a global picture with correspondence analysis. Gene, 297: 51-60.
• Suhre K, Claverie JM (2003). Genomic correlates of hyperthermostability, an update. J. Biol. Chem.,
278: 17198-202.
• Hickey DA, Singer GA (2004). Genomic and proteomic adaptations to growth at high temperature.
Genome Biol., 5: 117. Epub 2004.
• Brocchieri L (2004). Environmental signatures in proteome properties. Proc Natl Acad Sci U S A., 101:
8257-8.
• Cavicchioli R (2006). Cold-adapted archaea. Nat. Rev. Microbiology, 4: 331-3.
• Lobry JR, Necsulea A. (2006). Synonymous codon usage and its potential link with optimal growth
temperature in prokaryotes. Gene, 385: 128-36.
• Zeldovich KB, Berezovsky IN, Shakhnovich EI. (2007). Protein and DNA Sequence Determinants of
Thermophilic Adaptation. PLoS Comput Biol. 3:e5.
The significant number of available completely
sequenced genomes with different lifestyles offers
an unprecedented opportunity to explore species
evolution.
Among simple analyses:
amino acid composition of proteomes.
• Which universal properties can be deduced from
amino acid compositions of proteomes?
• Are there specific properties associated with
lifestyles and with phylogeny?
• What are the underlying evolutionary trends?
Outline
• Methodology;
• Species considered and data analysed;
• Species and amino acids distributions;
• Amino acids distribution and comparison with
theoretical and experimental model chronologies of
amino acids recruitment into the genetic code;
• Example: application to predicting candidate
thermostable proteins in Aspergillus fumigatus.
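Methodology
[Figure: the data table T (n rows i, p columns j, entries kij > 0, plus supplementary rows and columns "sup") is submitted to Correspondence Analysis, which places rows and columns on the factorial axes F1 ... Fp.]
Supplementary elements are positioned with the transition formula
$F(i_s) = \lambda^{-1/2} \sum_{j=1}^{p} f^{i_s}_{j}\, G(j)$
where $\lambda$ is the eigenvalue of the axis, $f^{i_s}_{j}$ is the profile of the supplementary row $i_s$ and $G(j)$ is the coordinate of column $j$ on that axis.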
Previous work showed:
[Figure: map of 54 species, with directions annotated growth t° and GC%; groups: hyperthermophiles, thermophiles, mesophiles]
Tekaia, F., Yeramian, E. and Dujon, B. 2002. Gene 297: 51-60.
Amino Acid composition of 208 proteomes
including:
• 20 hyperthermophiles (HTH) (OGT >60°C up to 120°C),
• 7 thermophiles (TH) (OGT >50°C up to 60°C),
• 8 psychrophiles (PSYC) (OGT: -10°C, up to 15°C),
• 173 mesophiles (BMES) including 53 eukaryotes (EUK)
Data table: 222 (208 + 14 sup) vs 23
(20 aa + pol, char, hyd)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj
+ specific sites
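A minimal sketch of how one row of such a data table could be computed from a proteome. The function name `composition`, the toy sequences and the three property classes are assumptions made for illustration: the slides do not state which residues count as charged, polar or hydrophobic, and the grouping used here (DEKRH / GNCQSTY / AILMFPWV) was chosen only because it reproduces, to rounding, the char, pol and hyd values given for sc in the data table.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # column order used on the slides

# Assumed property classes (not stated in the slides).
CHARGED = set("DEKRH")
POLAR = set("GNCQSTY")
HYDROPHOBIC = set("AILMFPWV")


def composition(proteins):
    """Return the % of each amino acid, plus char/pol/hyd, over all proteins."""
    counts = Counter()
    for seq in proteins:
        # Ignore anything that is not one of the 20 standard residues.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    row = {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}
    row["char"] = sum(row[aa] for aa in CHARGED)
    row["pol"] = sum(row[aa] for aa in POLAR)
    row["hyd"] = sum(row[aa] for aa in HYDROPHOBIC)
    return row


# Hypothetical two-protein "proteome", for illustration only.
toy_proteome = ["MKVLAAGE", "MDERKHGS"]
row = composition(toy_proteome)
print({k: round(v, 2) for k, v in row.items() if v > 0})
```

Each organism contributes one such row, so a table of 208 organisms by 23 variables is obtained by stacking the rows.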
Amino Acid composition (208 organisms; 8 shown)
Values in %. Columns: organisms; rows: amino acids and properties (PC: pol-char).

        sc    sp   ncu    ca   mgr    fg    an  ecun
A      5.5   6.3   8.7   4.9   9.4   8.2   8.6   5.0
R      4.4   4.8   6.2   3.7   6.6   5.8   6.2   6.7
N      6.1   5.2   3.7   6.7   3.5   3.9   3.7   3.9
D      5.8   5.3   5.6   5.7   5.7   5.9   5.6   5.5
C      1.3   1.5   1.1   1.2   1.3   1.3   1.2   2.0
Q      3.9   3.8   4.3   4.4   4.1   4.0   4.0   2.3
E      6.6   6.5   6.5   6.2   5.9   6.2   6.2   8.1
G      5.0   5.0   7.2   5.0   7.4   6.7   6.8   6.5
H      2.1   2.3   2.5   2.1   2.3   2.4   2.4   1.9
I      6.6   6.1   4.4   7.1   4.4   5.1   5.0   6.7
L      9.6   9.8   8.4   9.3   8.5   8.7   9.2   9.5
K      7.4   6.4   5.1   7.2   4.8   5.1   4.6   7.1
M      2.1   2.1   2.2   1.9   2.2   2.3   2.0   3.0
F      4.5   4.6   3.4   4.5   3.5   3.8   3.7   4.8
P      4.3   4.7   6.5   4.5   6.3   5.9   6.0   3.4
S      9.0   9.4   8.3   9.3   8.0   8.1   8.4   8.0
T      5.8   5.6   6.1   6.2   5.9   6.1   6.0   4.1
W      1.0   1.1   1.4   1.0   1.5   1.5   1.5   0.8
Y      3.3   3.4   2.6   3.5   2.5   2.8   2.9   3.6
V      5.6   6.0   6.0   5.5   6.2   6.1   6.1   7.0
char  26.3  25.3  25.8  25.0  25.3  25.4  24.9  29.3
pol   34.4  33.9  33.3  36.2  32.7  32.9  32.9  30.4
hyd   39.1  40.7  40.8  38.7  42.0  41.6  42.0  40.2
PC     8.1   8.6   7.5  11.2   7.4   7.5   8.0   1.1
...............

Additional rows of the data table (13 shown), same variables:

       HTH    TH  PSYC  BMES   EUK  SPEC     A     B     E    EA    EB    AB   EAB
A      7.4   9.0   8.4   8.6   6.9   7.6   6.7   9.4   6.9   6.8   7.4   8.6   8.1
R      5.8   6.3   4.6   5.1   5.4   6.1   5.4   5.8   5.7   5.7   5.5   5.3   5.4
N      3.5   3.6   4.3   4.4   4.9   4.8   4.8   4.1   4.4   4.5   4.3   3.9   4.0
D      4.7   5.3   5.7   5.4   5.4   5.1   5.4   5.4   5.3   5.5   5.5   5.0   5.4
C      0.8   0.8   1.1   1.0   1.7   1.8   1.2   1.0   2.0   1.8   1.5   1.0   1.3
Q      2.0   3.1   4.0   3.8   4.2   4.0   2.6   4.1   4.6   4.1   4.0   3.3   3.8
E      8.3   6.4   6.3   6.3   6.6   6.3   7.8   6.0   6.6   6.8   6.3   6.3   6.6
G      7.4   7.5   6.9   7.0   6.0   6.1   6.3   7.3   6.0   5.8   6.7   7.1   7.0
H      1.6   1.9   2.2   2.1   2.4   2.5   1.8   2.1   2.6   2.4   2.5   1.9   2.2
I      7.4   7.0   7.2   6.9   5.6   4.9   7.3   5.6   4.8   5.7   5.4   7.0   6.1
L     10.6   9.9   9.9  10.2   9.3   8.8   9.6  10.1   9.1   9.6   9.5  10.7   9.9
K      7.0   4.7   5.5   5.8   6.1   5.7   6.9   5.0   5.8   6.5   5.4   5.5   5.7
M      2.2   2.6   2.7   2.3   2.2   2.2   2.3   2.2   2.2   2.3   2.2   2.4   2.4
F      4.2   4.0   4.1   4.3   4.0   3.6   4.1   3.9   3.8   4.0   4.1   4.5   4.1
P      4.5   4.7   3.9   4.1   5.2   5.7   4.0   4.7   5.8   4.6   5.4   4.2   4.6
S      5.2   6.1   6.5   6.2   8.4   8.8   6.7   6.6   8.7   7.6   7.7   6.2   6.9
T      4.4   5.1   5.8   5.4   5.6   5.8   5.0   5.5   5.7   5.5   5.6   5.1   5.5
W      1.1   1.2   1.1   1.1   1.2   1.2   1.1   1.4   1.2   1.1   1.3   1.3   1.2
Y      3.9   3.6   3.2   3.2   3.1   2.9   3.9   3.0   2.9   3.2   3.1   3.3   3.0
V      8.0   7.4   6.9   6.9   6.0   6.0   7.1   6.8   5.9   6.5   6.5   7.3   7.0
char  27.4  24.6  24.2  24.6  25.9  25.8  27.2  24.3  26.0  26.8  25.1  24.0  25.2
pol   27.0  29.7  31.8  30.9  33.8  34.2  30.5  31.5  34.2  32.4  33.0  29.8  31.4
hyd   45.4  45.6  44.0  44.4  40.2  39.9  42.2  44.0  39.7  40.7  41.8  46.0  43.3
PC    -0.4   5.2   7.6   6.3   7.9   8.4   3.3   7.2   8.2   5.6   7.9   5.8   6.2

Correspondence Analysis was used to explore relationships between
species and amino acids.
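Correspondence analysis compares rows through their profiles (each row divided by its total) using the chi-square distance, which weights every column by the inverse of its mass. The sketch below computes that distance for three organisms (sc, sp, ncu) and three amino acid columns (A, R, N) of the amino acid composition data. The function names are illustrative assumptions, and the extraction of the factorial axes, which a full analysis would perform next, is not shown.

```python
import math


def profiles_and_masses(table):
    """Return the row profiles and column masses of a table of positive values."""
    grand_total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    col_mass = [sum(row[j] for row in table) / grand_total for j in range(n_cols)]
    profiles = [[v / sum(row) for v in row] for row in table]
    return profiles, col_mass


def chi2_distance(p, q, col_mass):
    """Chi-square distance between two row profiles."""
    return math.sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, col_mass)))


# Columns A, R, N for sc, sp and ncu (percentages from the composition data).
labels = ["sc", "sp", "ncu"]
table = [
    [5.5, 4.4, 6.1],
    [6.3, 4.8, 5.2],
    [8.7, 6.2, 3.7],
]
profiles, col_mass = profiles_and_masses(table)
d_sc_sp = chi2_distance(profiles[0], profiles[1], col_mass)
d_sc_ncu = chi2_distance(profiles[0], profiles[2], col_mass)
print(f"d(sc, sp)  = {d_sc_sp:.3f}")
print(f"d(sc, ncu) = {d_sc_ncu:.3f}")
```

On these three columns sc lies closer to sp than to ncu; with all 23 variables the distances would of course differ.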
Species specific comparisons
(blastp, pam250, SEG filter)
A new proteome (NP) is compared with each proteome P1 (proteome1) ... Pn (proteomen); the resulting files are:
• P1 and NP: bestp1np, allp1np, segmatchp1np; bestnpp1, allnpp1, segmatchnpp1
• Pn and NP: bestpnnp, allpnnp, segmatchpnnp; bestnppn, allnppn, segmatchnppn
Layout of the result files (bestnppi, allnppi):
np1 size pij e-value1 HS/IS/NS
np1 size pij e-value1 HS/IS/NS
np1 size pik e-value HS/IS/NS
• Paralogs
• Orthologs
The expected number of HSPs with score at least S is given by: E = K·m·n·e^(-λS),
where m and n are the sequence and database lengths, and K and λ are parameters of the scoring system.
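As a hedged illustration of the expected-HSP formula, the sketch below evaluates E = K·m·n·e^(-λS) for a few scores. The function name and every numeric value (K, λ, the query length and the database length) are assumptions chosen for the example; in practice K and λ depend on the scoring matrix and gap costs and are reported by the search program.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: E = K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


# Illustrative values only (assumptions, not taken from the slides).
K, lam = 0.041, 0.267
m = 350                # hypothetical query length
n = 541_880 * 400      # hypothetical database length in residues

for s in (50, 100, 150):
    print(f"S = {s:3d}  E = {expected_hsps(s, m, n, K, lam):.3e}")
```

E grows linearly with the search space m·n and decreases exponentially with the score, which is why the same alignment becomes less significant as databases grow.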
[Figure: map of the species, with directions annotated growth t° and GC%. Groups: hyperthermophiles, thermophiles, psychrophiles, prokaryotic mesophiles, eukaryotes.
Labelled species (with GC% where given): Thermosynechococcus elongatus; Encephalitozoon cuniculi; Methanococcus jannaschii: 31%; Pyrococcus abyssi: 44%; Thermus thermophilus: 69%; Methanopyrus kandleri: 61%; Nocardia farcinica: 70%; Mycoplasma mycoides: 23%; Colwellia psychrerythraea; Streptomyces coelicolor: 72%; Pseudoalteromonas haloplanktis; Entamoeba histolytica (Protists); Cryptosporidium hominis; Cyanidioschyzon merolae; Leishmania major: 60%; Saccharomyces; Candida glabrata; Homo sapiens; Tetrahymena thermophila (Protists); Mus musculus; Rat; A. nidulans; Aspergillus fumigatus: 50%; A. oryzae; C. neoformans.]
Statistical characterization of the observed groups:
Mean amino acid compositions of the 3 groups were compared
using:
-One-way analysis of variance;
-Newman-Keuls multiple comparison test to detect
significant differences at the probability level of p<0.001.
Mean aa composition in (hyper)thermophiles, prokaryotic
mesophiles-psychrophiles and eukaryotes (*: sig. different at p<0.001)
[Figure: bar chart, y-axis from 0 to 11 (%), with * marking significant differences. Amino acids on the x-axis: V (Val), Y (Tyr), E (Glu), G (Gly), I (Ile), L (Leu), A (Ala), H (His), S (Ser), Q (Gln), T (Thr), C (Cys), D (Asp), P (Pro), N (Asn), R (Arg), M (Met), K (Lys), F (Phe), W (Trp).]
AA physico-chemical properties in (hyper)thermophiles,
prokaryotic mesophiles-psychrophiles and eukaryotes (*: sig. different at p<0.001)
[Figure: bar chart, y-axis from 0 to 50 (%), with * marking significant differences. Categories on the x-axis: hyd, pol, pol-char, char.]
Amino acid signatures (p<0.001)
• HTH-TH: V (Val), H (His), S (Ser), pol, pol-char; Y (Tyr), E (Glu), Q (Gln), T (Thr), D (Asp)
• BMES-PSYC: V (Val), H (His), S (Ser), pol, pol-char
• EUK: V (Val), H (His), S (Ser), pol, pol-char; G (Gly), I (Ile), L (Leu), C (Cys), hyd
• R (Arg), M (Met), F (Phe), K (Lys), N (Asn) and W (Trp) show no significant difference (at p<0.001).
Species evolutionary trends
[Figure: map with directions annotated growth t° and GC%. Labels: [high_temperature]-[high_GC], Ancient; [moderate_temperature]-[low_GC], Recent; trends T1 and T2; points A, AB, EAB, B, EA, EB, SPEC, E.]
Comparison with model chronologies of amino
acids recruitment into the genetic code
• Comparison of amino acid distribution with recent
models of:
• Jordan et al. Nature 433: 633-638 (2005)
• Trifonov, J. Biomol. Struct. & Dyn. 22: 1-11 (2004)
• and with ancient amino acids:
• Miller’s experiments: Science 117, 528-529. (1953)
• Analysis of Murchison meteorite (1983)
Model of Jordan et al. 2005: A universal trend of
amino acid gain and loss in protein evolution. Nature.433:633-8.
• They analysed 15 sets of three-way alignments of orthologous
proteins encoded by triplets of closely related genomes from 15
taxa representing all three domains of life (Bacteria, Archaea and
Eukaryota), and used phylogenies to polarize amino acid
substitutions.
• All amino acids with declining frequencies are thought to be
among the first incorporated into the genetic code;
• conversely, all amino acids with increasing frequencies, except
Ser, were probably recruited late.
Following observed frequencies, they subdivided amino
acids into what they called:
• 4 strong “losers”: Pro, Ala, Glu, and Gly (decline in at least 13 taxa/15)
“thought to be among the first incorporated into the genetic code”
i.e most ancient aa.
• 5 strong “gainers”: Cys, Met, His, Ser and Phe (accrue in 14/15 taxa)
“were probably recruited late” i.e most recent aa.
• 1 “weak loser”: Lys (lost in 10 taxa/15).
• 4 “weak gainers”: Asn, Thr, Ile (accrue in 11 taxa/15) and Val
(accrues slowly in all taxa);
• In contrast: the remaining six amino acids (Arg, Asp, Gln, Trp, Leu and
Tyr) evolve more erratically.
Jordan et al. 2005.
[Figure: amino acids placed on the map (directions annotated growth t° and GC%; trends T1 and T2). Labelled amino acids: Ile, Val, Gly, Glu, Phe, Asn, Met, Thr, Pro, Ser, Ala, His, Cys. Annotations: “strong losers” in T1: most ancient aa; “weak gainers”; “strong gainer” in T2: Cys, recruited late to the genetic code.]
Jordan et al., Nature 433, 633 (2005).
A universal trend of aa gain and loss in protein evolution.
Model of Trifonov, E.N. 2004. The triplet code from
first principles. J. Biomol. Struct. & Dyn. 22: 1-11.
• A consensus chronology of amino acids is built on the
basis of 60 different criteria, each suggesting a certain
temporal order.
• The chronology results in the consensus order:
G1 (Gly), A2 (Ala), D3 (Asp), V4 (Val), P5 (Pro), S6 (Ser),
E7 (Glu), (L8 (Leu), T8 (Thr)), R10 (Arg), (I11 (Ile), Q11
(Gln), N11 (Asn)), H14 (His), K15 (Lys), C16 (Cys), F17
(Phe), Y18 (Tyr), M19 (Met), W20 (Trp).
[Figure: amino acids placed on the map (directions annotated growth t° and GC%; trends T1 and T2), each labelled with its rank in the consensus chronology: Gly1, Ala2, Asp3, Val4, Pro5, Ser6, Glu7, Leu8, Thr8, Arg10, Ile11, Gln11, Asn11, His14, Lys15, Cys16, Phe17, Tyr18, Met19, Trp20.]
Trifonov, E.N. (2004).
The triplet code from first principles. J. Biomol. Struct. & Dyn. 22: 1-11.
Comparison with ancient amino acids
Miller/Urey Experiment: 1953
• By the 1950s, scientists were in hot pursuit of the origin of life. The scientific community was
examining what kind of environment would be needed to allow life to begin.
• In 1953, Miller took molecules which were believed to represent the major
components of the early Earth's atmosphere and put them into a closed system
• Miller's experiment showed that organic compounds such as amino acids, which
are essential to cellular life, could be made easily under the conditions that
scientists believed to be present on the early earth.
[Figure: amino acids placed on the map (directions annotated growth t° and GC%; trends T1 and T2), with "+" marks as on the slide: Ile+, Val, Gly+++, Leu, Asp, Glu, Thr+, Ser, Ala+++, Pro+.]
Miller, S.L. Science 117, 528-529. (1953)
Production of aa under possible primitive earth conditions.
Murchison meteorite 09-28-1969
The Murchison meteorite fall occurred on September 28, 1969 over
Murchison, Australia. Over 100 kilograms of this meteorite have been
found. This meteorite is of possible cometary origin due to its high water
content of 12%.
An abundance of amino acids found within this meteorite has led to
intense study by researchers as to its origins. More than 92 different
amino acids have been identified within the Murchison meteorite to date.
Nineteen of these are found on Earth. The remaining amino acids have
no apparent terrestrial source.
[Figure: amino acids placed on the map (directions annotated growth t° and GC%; trends T1 and T2), with "+" marks as on the slide: Ile+, Glu++, Val++, Gly+++, Leu+, Asp+, Ala++, Pro++.]
Cronin, J.R. and Pizzarello, S. (1983).
Amino acids in meteorites. Adv Space Res. 3: 5-18.
Conclusions:
• Simple description of amino acid compositions of
proteomes (free from a priori model) revealed
fundamental evolutionary properties:
• segregation of eukaryotes;
• segregation of hyperthermophiles;
• non discrimination of psychrophiles.
• Amino acid signatures for hyperthermophiles and
for eukaryotes.
Conclusions...:
• Amino acids distribution is consistent with
suggested model chronologies of their recruitment
into the genetic code;
• Correspondence Analysis helped to reveal these
properties.
General Conclusion
• Amino acids are significant markers for species
evolution.
Genome Trees from Whole Proteome
Comparisons
Outline
• Species tree construction and difficulties;
• Post genome era species tree construction;
• Conservation profiles;
• Genome tree construction based on conservation
profiles;
• Conclusions;
• References.
Species tree - Tree Of Life
• 16S/18S rRNA tree (Woese 1990);
Woese and others have used rRNA comparisons to
construct a “Tree Of Life” showing the evolutionary
relationships of a wide variety of organisms.
The « Tree Of Life » has long served as a useful tool for describing
the history and relationships of organisms over evolutionary time.
One species is represented as a branching point, or node, on the tree, and
the branches represent paths of descent from a parental node.
Martin & Embley
Nature 431: 134-7. (2004)
The three-domain proposal based on the ribosomal
RNA tree. Woese et al. PNAS. 87:4576-4579. (1990)
The three-domain proposal, with continuous
lateral gene transfer among domains.
Doolittle. Science 284:2124-8. (1999)
The two-empire proposal, separating
eukaryotes from prokaryotes and
eubacteria from archaebacteria.
Mayr, E. PNAS 95:9720-23. (1998).
The ring of life, incorporating lateral gene
transfer but preserving the prokaryote
eukaryote divide.
Rivera & Lake JA. Nature 431: 152-5. (2004)
Genomic Databases and the Tree of Life
Keith A. Crandall and Jennifer E. Buhay. Science, 306: 1144-1145. (2004)
Prospects for Building the Tree of Life from Large Sequence Databases
Driskell et al. Science, 306: 1172-1174. (2004)
The 1.2-Megabase Genome Sequence of Mimivirus
Raoult et al. Science, 306: 1344-1350. (2004)
Pennisi, E. (1998). Genome data shake tree of life.
Science 280:672-4.
New genome sequences are mystifying evolutionary
biologists by revealing unexpected connections between
microbes thought to have diverged hundreds of millions
of years ago.
This suggests constructing species trees from whole gene content.
[Figure: two genome trees, each grouping the species into B, A and E]
Genome phylogeny based on gene content (1999)
Snel, Bork, Huynen. Nature Genetics 21, 108-110.
Tekaia, Lazcano & Dujon (1999)
Genome Research 9: 550-7.
Abundance of genome
data is raising expectations
to accurately depict the
evolutionary history of all
genomes.
Idea: construct a species
tree from many genes
instead of only one gene.
Gene tree - Species tree
[Figure: a gene tree (duplication events) and a species tree (speciation events) for species A, B and C, drawn along a time axis.]
Genomes, 2nd edition (2002). T.A. Brown
Problems with species tree construction
• main difficulties in species tree construction
include extensive incongruence between alternative
phylogenies generated from single-gene data sets;
-Genes don't evolve at the same rate nor in the same way;
-the evolutionary history inferred from one gene may be
different from what another gene appears to show.
Alternative solutions: integrative methods
• “supertree”
The supertree approach estimates phylogenies for subsets
of genes with good overlap, then combines these subtree
estimates into a supertree.
• Depends on the ability to distinguish between orthologs and paralogs;
• Supertree approaches are controversial, in part because the methodology
results in a degree of disconnection between the underlying genetic data
and the final tree produced.
Bininda-Emonds et al. 2002
• “phylogenomic tree”
(based on concatenation of a gene sample common to the
considered species);
[Figure: concatenated gene sample for species S1 ... Sn]
• genes don't evolve at the same rate nor in the same way;
• a limited number of genes are shared among all species;
The tree of one percent (2006)
Dagan and Martin. Genome Biology, 7:118.
More generally these methods suffer difficulties
related to the phylogenetic tree construction:
• global sequence alignment (quality, gaps,...);
• different evolutionary histories of genes;
• substitution saturation;...
and
• more seriously from gene sampling difficulties.
Gene tree - Species tree: The gene sampling problem
(adapted from Linder, Moret, Nakhleh, Warnow)
• [Figure: true species tree for A, B and C, with a blue and a red gene copy.] Blue is lost in A and B; red is lost in C. The gene tree ≠ the species tree.
• [Figure: trees for A, B and C.] All red orthologs have been lost in the 3 species. Luckily, sampling gives the blue orthologs: the true species tree is reconstructed.
• [Figure: trees for A, B and C.] All versions of the gene are present in the 3 species. Gene trees are the same as the species tree.
The genome tree is another alternative for constructing
species trees.
• The concept of genome tree is based on overall
gene content similarity.
(consider more than single gene information)
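Methodology
[Figure: the data table T (n rows i, p columns j, entries kij > 0, plus supplementary elements "sup") is submitted to Correspondence Analysis; the points obtained on the factorial axes F1 ... Fp are then submitted to Classification.]
Classification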
• orthogonal system;
• use of euclidean distance;
Systematic Analysis of Completely Sequenced
Organisms
• In silico species specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999)
(27 eucaryal, 19 archaeal and 33 bacterial species: 541880 proteins)
blastp, pam250, SEG filter
[Diagram: each proteome is compared with Proteome 1 ... Proteome n]
• 99 species
(B: 33; A: 19; E: 27)
• total of 541880 proteins
• Degree of ancestral duplication and of ancestral
conservation between pairs of species;
• Families of paralogs (Partition-MCL);
• Families of orthologs (Partition-MCL);
• Distribution of orthologous families according to the three domains of life;
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles;
Genome trees: data matrices
T = {Tij ; i=1,n; j=1,n; n is the number of surveyed species}
Tij is the overall similarity score between species j and i.
• Ancestral duplication and ancestral conservation
T = {Tij = wij = (number of proteins in j conserved in i)/size(j); i=1,n; j=1,n }.
n = 99 species and T corresponds to 541880 total proteins
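A minimal sketch of how such a matrix could be filled, assuming the comparison results have already been reduced to, for every protein, the set of species in which it has a significant match. The data structure `hits`, the species names X and Y and the function name are hypothetical; the values are expressed in percent.

```python
def conservation_matrix(hits, species):
    """Return T with T[i][j] = 100 * (proteins of j conserved in i) / size(j).

    hits[j] is a list with one entry per protein of species j; each entry is
    the set of species in which that protein has a significant match.
    """
    T = {i: {} for i in species}
    for j in species:
        size_j = len(hits[j])
        for i in species:
            conserved = sum(1 for matched in hits[j] if i in matched)
            T[i][j] = 100.0 * conserved / size_j
    return T


# Hypothetical example: species X has 4 proteins, species Y has 2.
species = ["X", "Y"]
hits = {
    "X": [{"Y"}, {"Y"}, {"Y"}, set()],
    "Y": [{"X"}, set()],
}
T = conservation_matrix(hits, species)
print(T["Y"]["X"])   # share of X proteins conserved in Y
print(T["X"]["Y"])   # share of Y proteins conserved in X
```

Because each column is normalised by the size of its own proteome, the matrix is not symmetric: Tij and Tji generally differ.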
Ancestral duplication and ancestral conservation
Wij (%): rows are species i, columns are species j (12 columns shown).

org     SC    SP    CE    DM    AG    CA   ATH    HS   MUS    FR    PF  ECUN
SC    40.5  63.9  17.5  27.1  22.3  65.9  23.4  22.9  27.3  18.0  22.5  35.8
SP    58.4  37.4  18.8  29.3  26.3  54.3  25.0  25.0  29.6  20.0  24.6  38.4
CE    38.1  46.6  65.2  51.9  50.6  35.5  27.5  44.6  54.4  42.4  24.8  34.8
DM    40.5  50.2  39.2  65.8  69.9  37.5  29.5  50.3  62.7  47.9  26.5  36.3
AG    40.9  50.2  39.8  73.1  59.5  38.0  30.6  50.2  60.3  48.7  26.5  36.0
CA    71.8  65.5  18.4  27.7  25.7  35.8  24.3  23.2  27.8  18.5  22.3  35.7
ATH   40.3  47.8  21.7  31.5  30.3  37.0  83.6  25.6  29.7  21.9  26.2  33.4
HS    43.0  53.3  40.0  61.3  54.5  39.7  32.1  66.7  90.8  68.8  28.2  37.7
MUS   41.7  52.5  39.5  62.1  54.7  39.1  31.5  76.8  77.8  67.7  27.6  37.2
FR    42.0  52.6  40.0  60.7  59.9  39.5  32.7  68.7  81.8  63.4  27.6  37.4
PF    25.9  31.2  13.1  19.3  15.9  22.2  16.3  17.2  21.0  13.2  28.3  28.9
ECUN  19.5  23.4   8.9  13.1  10.8  16.2  11.4  12.0  15.2   9.0  13.6  26.1
MJ    11.5  13.3   4.9   6.7   6.0  10.2   6.0   4.8   5.6   3.7   8.7  15.4
MTH   13.6  16.2   4.6   7.4   7.6  11.2   8.0   5.1   6.1   4.0   8.3  15.2
AF    14.4  16.5   5.9   8.2   8.7  11.8   8.7   5.6   6.6   4.5   8.6  15.4
PH    16.3  18.7   5.0   7.1   9.2  11.1   9.7   5.2   6.0   4.1   7.9  15.3
PA    14.3  15.2   5.4   7.5   7.3  11.9   7.4   5.5   6.4   4.3   8.3  15.9
APEM  15.5  20.1   4.8   7.3  10.6  10.3   9.4   5.2   5.9   3.9   7.2  14.9
TA    15.2  17.5   5.9   8.3   8.3  12.7   8.2   5.3   6.3   4.2   8.6  14.8
TV    15.4  17.8   6.2   8.3   8.7  13.3   8.3   5.6   6.8   4.4   8.7  15.0
H     14.8  17.7   5.8   8.3   9.8  12.0  10.2   5.5   6.6   4.5   8.0  13.9
SSP2  16.7  19.4   7.1   9.1   9.4  14.2   9.5   6.2   7.4   4.9   9.5  15.9
PFU   17.0  22.8   6.5   9.3  11.1  13.3  12.3   7.0   8.0   5.6   9.1  17.1
STO   18.6  23.1   6.8   8.6  11.4  13.7  11.1   5.9   7.1   4.5   9.1  15.7
PYAE  15.6  19.5   5.3   8.2   9.9  11.8   9.5   5.8   6.9   4.5   8.1  15.0
MA    16.0  18.9   7.1  10.8  12.5  14.7   9.7   7.4   8.7   6.4   9.8  17.0
MK    13.0  14.6   4.0   6.2   6.1  10.7   6.9   4.6   5.4   3.5   7.3  14.1
MMA   14.8  17.4   6.4   9.2   9.5  13.5   8.1   6.6   7.9   5.3   9.7  15.8
HI    13.0  14.3   4.8   7.3   8.5  11.1   8.7   4.4   5.4   4.0   8.2   8.7
…..
tnsp  74.4  79.2  49.7  76.4  81.0  72.6  58.8  78.7  93.7  72.8  42.3  48.1

conservation tree
•species are clustered into 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• “whole genome” species clustering tree;
• very low resolution of deep clustering;
Genome trees: data matrices
T = {Tij ; i=1,n; j=1,n; n is the number of surveyed species}
Tij is the overall similarity score between species j and i.
• Shared orthologous genes
{sij = (shared orthologs between i and j) }
T = {Tij = sij/size(j); i=1,n; j=1,n }
Note on: Homologs - Paralogs - Orthologs
[Figure: along the time axis, an ancestral gene A duplicates into A and B; after speciation, Species-1 carries A1 and B1 and Species-2 carries A2 and B2.]
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
[Figure: Sequence analysis (labels: a, b, S1, S2)]
Shared orthologous genes
sij: number of shared orthologs between species i (rows) and j (columns); 12 columns shown.

org      SC     SP     CE     DM     AG     CA    ATH     HS    MUS     FR     PF   ECUN
SC        0   2532   1533   1660   1671   3371   1582   1789   1733   1731    890    600
SP     2532      0   1753   1917   1907   2588   1754   2060   2032   2024   1008    645
CE     1533   1753      0   3910   3869   1611   1902   4036   3994   4047   1015    580
DM     1660   1917   3910      0   7018   1728   2094   5057   5147   5035   1106    616
AG     1671   1907   3869   7018      0   1738   2160   5016   5013   5059   1085    617
CA     3371   2588   1611   1728   1738      0   1590   1850   1824   1827    873    595
ATH    1582   1754   1902   2094   2160   1590      0   2404   2406   2399   1067    539
HS     1789   2060   4036   5057   5016   1850   2404      0  14053  10286   1185    638
MUS    1733   2032   3994   5147   5013   1824   2406  14053      0  10304   1169    632
FR     1731   2024   4047   5035   5059   1827   2399  10286  10304      0   1146    626
PF      890   1008   1015   1106   1085    873   1067   1185   1169   1146      0    453
ECUN    600    645    580    616    617    595    539    638    632    626    453      0
MJ      238    233    214    216    242    230    279    223    216    217    169    142
MTH     254    247    237    247    278    245    306    251    248    249    171    141
AF      261    255    254    260    303    248    310    260    263    265    182    151
PH      251    245    250    259    297    237    281    273    258    271    187    155
PA      267    261    255    268    311    256    312    276    273    278    189    156
APEM    212    233    228    228    251    215    242    248    237    230    165    136
TA      264    260    252    254    279    261    298    268    264    261    182    141
TV      263    255    256    249    276    258    296    260    258    270    184    138
H       255    264    258    249    284    248    318    271    267    272    173    140
SSP2    302    317    293    292    326    300    360    310    309    311    200    155
PFU     264    284    256    275    324    286    316    292    274    280    195    150
STO     281    291    273    263    313    278    329    293    282    298    196    143
PYAE    245    258    236    249    285    238    278    258    246    256    170    143
MA      303    316    298    293    368    301    369    329    326    326    200    161
MK      210    214    195    204    216    211    244    205    202    195    160    125
MMA     289    298    276    280    338    280    349    305    299    297    194    160
HI      268    273    231    243    388    268    382    259    259    267    181     86

orthologs tree
• 3 phylogenetic domains;
• bacterial species cluster with archaeal
species;
• similar species cluster together;
• better resolution of deep species clustering.
• Large scale comparative analysis of predicted proteomes
revealed significant evolutionary processes:
Evolutionary processes include
[Diagram: from the Ancestor to the species genome. Labels: Phylogeny*; Expansion* (genesis, duplication, HGT); Exchange* (HGT); Deletion* (loss); selection*.]
Expansion, Exchange and Deletion are noise. They should be
eliminated or at least reduced.
To overcome some of these limitations, we consider
Genome tree construction from “Protein
Conservation Profiles” and attempt to reduce
noisy evolutionary processes
Conservation profiles
• 99 species (B: 33; A: 19; E:27); 541880 proteins
p 0111111000111111111000110110111101001111101111
• A “conservation profile” is an n-component binary vector
describing a protein conservation pattern across n species.
Components are 0 and 1, following absence or presence of homologs.
Main interesting properties of conservation profiles:
• Conservation profiles are signatures of evolutionary relationships;
• A conservation profile is the trace of protein evolutionary histories
jointly captured in a set of n species (multidimensional feature);
Protein conservation profiles
(columns: species S1 ... Sn, grouped as E, A, B; rows: proteins)
G1,1   100000000000000000000000000000000000000000000000
G2,1   111111111111111111111111111111111111111111111111
G3,1   111111110011111111111111011101110101111111101111
.......................................................
Gn1,1  100001110001000000000000000000000000000000000000
G1,2   010000000000000000010100000000000111000011100011
G2,2   010000000000000000010100000000000111000011100011
........................................................
Gn2,2  111111110011111111111111011101110101111111101111
........................................................
G1,n   011110100000000000000000001000000000000000000001
G2,n   111111110011111111100011011101110101111111101111
G3,n   111111110011111111100011011101110101111111101111
........................................................
Gnp,n  100110000000000000000000000000000000000000000001
Table: 541880 proteins x 99 species
• Different conservation profiles represent different evolutionary
histories
Distinct conservation profiles
541880 original total proteins (99 species)
442460 non-specific proteins i.e conservation profiles (82%)
184130 distinct conservation profiles (42%)
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
111111110011111111111111011101110101111111101111
010000000000000000010100000000000111000011100011
100110000000000000000000000000000000000000000001
................................................
(one representative from each set of identical conservation profiles)
• Effect of the duplication process is reduced
• This set is indicative of the various observed
evolutionary histories.
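Reducing the table to distinct profiles amounts to keeping one representative of each set of identical binary vectors. A short sketch, using hypothetical 6-species profiles rather than the real 99-species ones, and also computing the conservation weight (the number of "1" in a profile); the function names are illustrative.

```python
def distinct_profiles(profiles):
    """Keep one representative per set of identical profiles (first-seen order)."""
    return list(dict.fromkeys(profiles))


def conservation_weight(profile):
    """Number of species in which the protein is present (count of '1')."""
    return profile.count("1")


# Hypothetical profiles over 6 species; duplicated profiles are collapsed.
profiles = ["100000", "111111", "110111", "010011", "110111", "111111", "010011"]

distinct = distinct_profiles(profiles)
weights = [conservation_weight(p) for p in distinct]
mean_weight = sum(weights) / len(weights)

print(distinct)
print(weights, mean_weight)
```

Collapsing identical profiles is what reduces the effect of duplication: a family of paralogs sharing one conservation pattern counts once.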
[Figure: histogram of the fractions (*10000) of distinct conservation profiles, y-axis from 0 to 250, for the conservation weights c01 to c99 (sum of "1": presence).]
Presence in the 184130 distinct conservation profiles:
Mean=32.2; SD=23.3; min=1; Max=99.
Genome tree construction: data matrices
• 184130 distinct conservation profiles (various evolutionary histories);
i and j denote two species, i.e. two columns of the profiles:
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
111111110011111111111111011101110101111111101111
010000000000000000010100000000000111000011100011
100110000000000000000000000000000000000000000001
................................................
• Jaccard similarity scores between species:
sij = N11/(N11+N01+N10),
where N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1)
and (1,0) between i and j.
T = { Tij = sij ; i=1,n; j=1,n }
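The Jaccard score can be sketched directly from its definition. The profiles below are hypothetical 3-species vectors and the function names are illustrative; positions (0,0), where the protein is absent from both species, are ignored, as in the formula.

```python
def jaccard(profiles, i, j):
    """Jaccard similarity between species i and j over a list of binary profiles."""
    n11 = n10 = n01 = 0
    for p in profiles:
        a, b = p[i], p[j]
        if a == "1" and b == "1":
            n11 += 1
        elif a == "1":
            n10 += 1
        elif b == "1":
            n01 += 1
    denom = n11 + n10 + n01
    return n11 / denom if denom else 0.0


def similarity_matrix(profiles, n_species):
    """T[i][j] = s_ij for all pairs of species."""
    return [[jaccard(profiles, i, j) for j in range(n_species)]
            for i in range(n_species)]


# Hypothetical distinct conservation profiles over 3 species.
profiles = ["110", "111", "011", "100"]
T = similarity_matrix(profiles, 3)
for row in T:
    print([round(v, 3) for v in row])
```

The resulting matrix is symmetric and can be passed to a classification method to obtain the tree.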
profiles tree
Tekaia F, Yeramian E. (2005).
PLoS Comput Biol.1(7):e75
Conclusions: Methodology
• Species classification is not an easy task!
• Species tree construction should take into account the
whole information included in the genomes;
• Methods that take into account whole genome
information are still needed;
• Correspondence analysis method might be helpful in
revealing evolutionary trends embedded in the
multidimensional relationships as obtained from large
scale genome comparisons;
Conclusions...
• Conservation profiles represent most conserved and
meaningful evolutionary signals jointly captured in a set
of species;
• Thus they should correspond to the most accurate type
of markers for species classification;
• In principle, the profiles tree derived from distinct
conservation profiles should considerably minimize
genome acquisition effects and should reflect less noisy
phylogenetic signals;
• The profiles tree presents evidence of conservation of
stable phylogenetic relationships and reveals
unconventional species clustering;
• The profiles tree corresponds to the classification of the
evolutionary scenarios.
References:
• Tekaia, F. and Dujon, B. (1999).
Pervasiveness of gene conservation and persistence of duplicates in cellular
genomes. Journal of Molecular Evolution, 49:591-600.
• Tekaia, F., Lazcano, A. and Dujon, B. (1999). Genome tree as revealed from
whole proteome comparisons. Genome Res. 9: 550-7.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002).
Amino acid composition of genomes, lifestyles of organisms, and evolutionary
trends: a global picture with correspondence analysis. Gene 297: 51-60.
• Tekaia, F. and Yeramian, E. (2005).
Genome Trees from Conservation Profiles. PLoS Comput Biol.1(7):e75.
• Tekaia F, Latgé JP. (2005). Aspergillus fumigatus: saprophyte or pathogen?
Curr Opin Microbiol. 8:385-92. Review.
• Tekaia, F. and Yeramian, E. (2006).
Evolution of Proteomes: Fundamental signatures and global trends in amino acid
composition. BMC Genomics. 7:307.
• Systematic analysis of completely sequenced organisms:
http://www.pasteur.fr/~tekaia/sacso.html
References:
• Bininda-Emonds ORP (2005). Supertree Construction in the Genomic Age.
Methods in Enzymology 395: p.745-757.
• Bininda-Emonds ORP, Gittleman JL, Steel MA (2002).
The (super)Tree Of Life: Procedures, Problems, and Prospects.
Annual Review of Ecology and Systematics, Vol. 33: 265-289.
• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology, 7:118.
• Delsuc F, Brinkmann H, Philippe H. (2005). Phylogenomics and the reconstruction of the tree of life.
Nat Rev Genet. 6:361-75. Review.
• Doolittle. Science 284:2124-8. (1999)
• Driskell et al. (2004). Science, 306: 1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Crandall KA and Buhay JE (2004). Science, 306: 1144-1145.
• Linder, Moret, Nakhleh, and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431: 134-7.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi, E.(1998). Genome data shake tree of life.Science. 280:672-4.
• Rivera & Lake JA.(2004). Nature 431: 152-5.
• Raoult et al. (2004). Science, 306: 1344-1350.
• Snel, Bork, Huynen (1999). Genome phylogeny based on gene content.Nature Genetics 21, 108-110.
• Snel B, Huynen MA, Dutilh BE (2005). Genome trees and the nature of genome evolution.Annu Rev
Microbiol.;59:191-209. Review.
• Woese et al.(1990). PNAS. 87:4576-4579.