Rooting the Universal Tree of Life Using Genomic Fusion Flux
Zhenjun Hu1, Jie Wu2, Shujiro Okuda3, Charles DeLisi*,1,2
1 Bioinformatics Program
2 Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji,
Kyoto, 611-0011, Japan
*To whom correspondence should be addressed. Email: delisi@bu.edu
In the last decade of the twentieth century, whether the root of the universal
tree can be found was much debated, because different phylogenetic analyses
gave conflicting results. The availability of genome sequences seems to have
created more confusion rather than clarifying the problem. Here we introduce a
systematic rooting strategy based on genomic fusion events that appears to help
solve the puzzle of rooting the universal tree of life. A total of 70 organisms
have been analyzed, and a rooted universal tree of 62 species has been built
using the novel rooting strategy and general taxonomy information.
Background
A universal phylogenetic tree based on comparison of small-subunit (SSU) ribosomal
RNA (rRNA) sequences was proposed[1] in 1989, about 100 years after the concept first
emerged[2], and three new principal domains were also proposed: Archaea, Bacteria and
Eucarya. However, recent studies based on comparison of multi-protein or whole-genome
sequences have confused the tree and raised the question of whether there is a
universal tree of life and whether all life comes from a single root[3-7]. The
confusion is often attributed to several factors: horizontal gene transfer (HGT), long
branch attraction (LBA)[8] and degradation of the phylogenetic signal because of
saturation of amino acid substitutions. Furthermore, although the most important point
in a phylogenetic tree is its root, there is in principle no way to root the universal
tree of life based on a collection of homologous sequences alone[9]. The root is the
oldest point in the tree and corresponds to the last common ancestor (LCA). The rRNA
phylogenetic tree is rooted using the strategy proposed by Iwabe et al.[10], which is
derived from an ancient gene duplication in the common ancestor of prokaryotes. This
method not only requires the discovery of appropriate genes for different
trees[11-14], but is also questionable because it depends on a very limited number of
genes, and there is increasing evidence for multiple gene duplication events in their
history[15-17]. In addition, the rooting strategy requires an external point of
reference as the outgroup, which may not always be available.
An alternative approach based on the fusion of dihydrofolate reductase and thymidylate
synthase has been proposed to root the eukaryote tree, and it places the eukaryotic
root between bikonts and opisthokonts. However, this approach has a similar
shortcoming: it requires that the genes fused just once and were never secondarily
split or laterally transferred within eukaryotes, which may not always be valid[18, 19].
The genome sequences of human and several model organisms have shown that more
advanced organisms often have more multidomain proteins than less complex
species[20-24]. Koonin et al.[25] found increased complexity of domain architecture in
more complex organisms for many orthologous sets of eukaryotic proteins. The
multidomain architecture also has selective advantages, as it increases the efficiency
of coupling of biochemical reactions and signal transduction[26]. This evidence
suggests a possible connection between the accretion of multidomain proteins and
organismal complexity, and that natural selection may favor the multidomain
architecture.
However, simply counting the number of multidomain proteins in different organisms will
not reveal the characteristics of vertical evolution, because some evolutionary events,
such as HGT, may transfer multidomain proteins from an organism other than the closest
ancestor[3, 27]. Moreover, evolution should be seen as a "procession of forms,"
with the focus on the process that produces the gem rather than the reflections from its
facets[27, 28]. Based on current biological knowledge, the dominant mechanisms for the
expansion of the multidomain protein repertoire are gene duplication and gene
fusion/fission[19, 25, 29-32]. It is therefore reasonable to compare the closeness of
two species to the LCA (or their evolutionary complexity) by defining a variable that
contains the process information of gene fusion/fission and gene duplication, which we
name GFF (genomic fusion flux). For two given species A and B, the GFF is defined as:
GFFA B  FA  FB
and
GFF A B 
GFFA B
( FA  FB ) / 2
where $F_A$ is the total number of multidomain proteins fused in species A but separated
in species B (Fig. 1), and $F_B$ is the total number of multidomain proteins fused
in species B but separated in species A. $GFF_{AB}$ represents the net fusion flux from
species B to species A, because $F_B$ corresponds to the fission flux from A to B.
$GFF_{AB} > 0$ indicates evolutionary accretion of multidomain proteins in A relative to
B, and vice versa. The two species are evolutionarily indistinguishable by GFF if
$GFF_{AB} = 0$: they are either too close or too far apart to compare.
$\overline{GFF}_{AB}$ is the weighted GFF, which indicates the significance of the GFF
relative to the total number of fusion and fission events.
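The two definitions above can be illustrated with a minimal Python sketch. The function names and the example counts are hypothetical and chosen only for illustration; returning 0 when both counts are zero is an assumption, since the text does not define the weighted GFF for that case.

```python
def gff(f_a: int, f_b: int) -> int:
    """Net fusion flux between species A and B: F_A - F_B."""
    return f_a - f_b


def weighted_gff(f_a: int, f_b: int) -> float:
    """GFF divided by the mean of the two fusion counts, (F_A + F_B) / 2."""
    total = f_a + f_b
    if total == 0:
        # Assumption: with no fusion events on either side, report 0
        # (the two species are indistinguishable by GFF).
        return 0.0
    return gff(f_a, f_b) / (total / 2)


# Hypothetical counts: 30 proteins fused in A but split in B, 10 the other way.
f_a, f_b = 30, 10
print(gff(f_a, f_b))           # 20
print(weighted_gff(f_a, f_b))  # 1.0
```

With this normalization the weighted value lies between -2 and 2, and swapping A and B only changes its sign.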
Fig. 1. Sketch of fusion/fission events between the orthologous proteins of
species A and B.
As shown in Fig. 1, GFF counts only the fusion events whose final product is a
multidomain protein, in order to trace the evolutionary accretion of such proteins;
that is, GFF ignores fusion events in which two independent genes are fused into a
gene coding for a single-domain protein. The fusion events are counted based on the
orthologous groups provided by the Clusters of Orthologous Groups (COG)
database[33, 34]. Each COG orthologous group presumably contains the “same” gene or
domain from multiple organisms, and multidomain proteins are split into their
constituent domains if orthologs of the latter are present as stand-alone forms in
some of the genomes. We allow paralogs of fused multidomain proteins within the same
COG group to be counted as multiple fusion events, to incorporate the accretion of
multidomain proteins contributed by gene duplication.
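The counting rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually run against the COG database: each protein is represented as a tuple of COG identifiers (one per domain), the identifiers and toy proteomes are hypothetical, and "fused in A but separated in B" is interpreted as "every domain of the protein has an ortholog in B, but no single protein of B carries two or more of them".

```python
from typing import List, Tuple

Protein = Tuple[str, ...]  # COG ids of the domains of one protein (assumed encoding)


def count_fused(a: List[Protein], b: List[Protein]) -> int:
    """Count multidomain proteins of `a` whose domains occur only separately in `b`."""
    cogs_in_b = {cog for protein in b for cog in protein}
    count = 0
    for protein in a:
        domains = set(protein)
        if len(domains) < 2:
            continue  # single-domain products are ignored by GFF
        if not domains <= cogs_in_b:
            continue  # some domain has no ortholog in b, so nothing to compare
        fused_in_b = any(len(domains & set(other)) >= 2 for other in b)
        if not fused_in_b:
            count += 1  # paralogs in `a` are counted as separate fusion events
    return count


# Hypothetical toy proteomes.
species_a = [("COG1", "COG2"), ("COG1", "COG2"), ("COG3",)]
species_b = [("COG1",), ("COG2",), ("COG3", "COG4")]

f_a = count_fused(species_a, species_b)  # two paralogous COG1-COG2 fusions
f_b = count_fused(species_b, species_a)  # COG4 is absent from species_a
print(f_a, f_b)  # 2 0
```

The two counts returned here play the role of $F_A$ and $F_B$ in the definition of GFF.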
Fig 2. Sketch of GFF-rooted universal tree of life.
Results and Discussions
A total of 70 organisms have been analyzed, of which 62 species have been incorporated
into a rooted tree of life (Fig. 2). The tree shown in Fig. 2 is fully rooted using the
rooting strategy described below, except for the Archaea branch, which is drawn with a
dashed line in Fig. 2.
Correlation between GFF and Evolution
Table 1. GFF of seven eukaryotic organisms
        hsa    ath    cel    dme    spo    sce    ecu  total
hsa       0     24    148    189     95    103     31    279
ath     -24      0     95    135     97     96     19    194
cel    -148    -95      0      7     30     35      9    134
dme    -189   -135     -7      0     12     11      5     97
spo     -95    -97    -30    -12      0      4      4     28
sce    -103    -96    -35    -11     -4      0      6     24
ecu     -31    -19     -9     -5     -4     -6      0      1
GFF is defined to measure the net fusion flux between genomes. $GFF_{AB} > 0$ indicates
accretion of multidomain proteins from B to A; A is therefore evolutionarily more
advanced than B. In other words, B is more ancient than A. However, it remains to be
shown whether this conclusion actually holds. Table 1 shows the GFF for seven
eukaryotic organisms, based on KOGs (eukaryotic orthologous groups), the extension of
the COG groups to complex, multicellular eukaryotes. Except for the last column, each
cell in Table 1 shows $GFF_{AB}$, where A is the organism of the row and B is the
organism of the column. The table is antisymmetric about the diagonal because
$GFF_{AB} = -GFF_{BA}$. The last column shows the total number of multidomain fusion
events of each organism. The data in Table 1 unambiguously give the following
evolutionary order:
ecu<sce<spo<dme<cel<ath<hsa
where the symbol “<” can be read as “more ancient than”. The result is based on
exhaustive pairwise comparison of the seven species, and the data are consistent across
all comparisons, resulting in the distinct sign pattern of Table 1 (all entries positive
above the diagonal and negative below it). The above result shows that Homo sapiens
(hsa) is the most advanced species and the microsporidian Encephalitozoon cuniculi
(ecu) is the least advanced one. The result also indicates that the budding yeast
Saccharomyces cerevisiae (sce) is more ancient than the fission yeast
Schizosaccharomyces pombe (spo), which is supported by the results of Brinkmann et
al.[5] and Bullerwell et al.[35]. The result is clearly not random: the P-value for
obtaining an ordered line of n organisms (either all ‘<’, or all ‘>’, or all ‘=’) from
random GFF values can be estimated as follows:
$$P = \frac{n!}{n!\,3^{n-1}} \times 3 = \frac{1}{3^{n-2}}$$
For the seven eukaryotic organisms, the P-value is $1/3^5 \approx 0.004$. It is also
worth noting that the same order of species is obtained simply from the total number of
fused multidomain proteins. Although this result indicates a plausible connection
between GFF and evolution, it is still not clear why the fruit fly Drosophila
melanogaster (dme) and the nematode Caenorhabditis elegans (cel) both appear less
complex than the green plant Arabidopsis thaliana (ath). As hsa, dme and cel are all
animals, it remains controversial whether plants or animals are closer to their LCA.
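The ordering and the P-value discussed above can be checked with a short Python sketch. The matrix holds the GFF values of Table 1 (without the total column); the helper `linear_order` is a hypothetical name, and the procedure (rank each species by its number of positive entries, then verify every pair) is one possible way to test for a consistent order, not necessarily the one used to produce the paper's results.

```python
NAMES = ["hsa", "ath", "cel", "dme", "spo", "sce", "ecu"]
GFF = [  # row species vs. column species, copied from Table 1
    [0, 24, 148, 189, 95, 103, 31],
    [-24, 0, 95, 135, 97, 96, 19],
    [-148, -95, 0, 7, 30, 35, 9],
    [-189, -135, -7, 0, 12, 11, 5],
    [-95, -97, -30, -12, 0, 4, 4],
    [-103, -96, -35, -11, -4, 0, 6],
    [-31, -19, -9, -5, -4, -6, 0],
]


def linear_order(names, matrix):
    """Return the names sorted most-ancient-first, or None if the signs conflict."""
    n = len(names)
    wins = [sum(matrix[i][j] > 0 for j in range(n) if j != i) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: wins[i])
    for pos, older in enumerate(ranked):
        for younger in ranked[pos + 1:]:
            # Every later species must show a positive flux against every earlier one.
            if matrix[younger][older] <= 0 or matrix[older][younger] >= 0:
                return None
    return [names[i] for i in ranked]


order = linear_order(NAMES, GFF)
print("<".join(order))  # ecu<sce<spo<dme<cel<ath<hsa

# P-value estimate from the formula in the text, 1 / 3^(n-2).
p_value = 1 / 3 ** (len(NAMES) - 2)
print(round(p_value, 4))  # 0.0041
```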
In contrast to the seven eukaryotic species, the GFF data for the thirteen archaeal
species are much messier: it is impossible to obtain a similar ordered line for the
whole archaebacterial group. The corresponding weighted GFF values are shown in Table 2:
Table 2. Weighted GFF of 13 archaebacterial organisms; see reference[36] for complete species names
        Hbs     Mac     Mth     Tac     Pho     Tvo     Afu     Mka     Mja     Pab     Pya     Sso     Ape   Total
Hbs       0   0.125   0.146    0.14   0.104   0.116   0.248   0.104   0.168   0.299       0   0.245   0.225      64
Mac  -0.125       0   0.055   0.077   0.042   0.044   0.391   0.094   0.276   0.197  -0.031  -0.052   0.068     160
Mth  -0.146  -0.055       0   0.049   0.044   0.123   0.242   0.396   0.294   0.366   0.089   0.194    0.08      59
Tac   -0.14  -0.077  -0.049       0   0.037   0.091   0.023   0.222   0.215   0.133   0.038  -0.036   0.053      22
Pho  -0.104  -0.042  -0.044  -0.037       0   0.333   0.021   0.312   0.213   0.145  -0.095   0.091  -0.167      32
Tvo  -0.116  -0.044  -0.123  -0.091  -0.333       0   0.023   0.222   0.185   0.178  -0.075  -0.107       0      22
Afu  -0.248  -0.391  -0.242  -0.023  -0.021  -0.023       0   0.021   0.037   0.227  -0.125  -0.101   0.049      65
Mka  -0.104  -0.094  -0.396  -0.222  -0.312  -0.222  -0.021       0    0.08   0.255   -0.19   0.091  -0.042      32
Mja  -0.168  -0.276  -0.294  -0.215  -0.213  -0.185  -0.037   -0.08       0   0.485   -0.27  -0.026  -0.068      43
Pab  -0.299  -0.197  -0.366  -0.133  -0.145  -0.178  -0.227  -0.255  -0.485       0   -0.37   -0.07  -0.154      23
Pya       0   0.031  -0.089  -0.038   0.095   0.075   0.125    0.19    0.27    0.37       0   0.185   0.085      31
Sso  -0.245   0.052  -0.194   0.036  -0.091   0.107   0.101  -0.091   0.026    0.07  -0.185       0    0.12      34
Ape  -0.225  -0.068   -0.08  -0.053   0.167       0  -0.049   0.042   0.068   0.154  -0.085   -0.12       0      16
After examining the taxonomy, we noticed that three species of the archaebacterial
group shown in Table 2 (Pya, Sso and Ape) belong to Crenarchaeota (indicated by
light-green color) and the remaining ten belong to Euryarchaeota (indicated by
light-blue color). When the Crenarchaeota are treated as an independent group, we
obtain the same pattern for both groups as for the eukaryotic species (Table 1); that
is, the three Crenarchaeota organisms can be lined up as:
Ape<Sso<Pya
and the ten Euryarchaeota organisms can be lined up as:
Pab<Mja<Mka<Afu<Tvo<Pho<Tac<Mth<Mac<Hbs
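The effect of splitting the archaeal group can be reproduced from the weighted GFF values of Table 2 with the following Python sketch. The matrix is copied from Table 2 (without the total column); the helper name `linear_order` and the win-counting procedure are illustrative assumptions rather than the authors' implementation.

```python
NAMES = ["Hbs", "Mac", "Mth", "Tac", "Pho", "Tvo", "Afu",
         "Mka", "Mja", "Pab", "Pya", "Sso", "Ape"]
W = [  # weighted GFF, row species vs. column species (Table 2)
    [0, 0.125, 0.146, 0.14, 0.104, 0.116, 0.248, 0.104, 0.168, 0.299, 0, 0.245, 0.225],
    [-0.125, 0, 0.055, 0.077, 0.042, 0.044, 0.391, 0.094, 0.276, 0.197, -0.031, -0.052, 0.068],
    [-0.146, -0.055, 0, 0.049, 0.044, 0.123, 0.242, 0.396, 0.294, 0.366, 0.089, 0.194, 0.08],
    [-0.14, -0.077, -0.049, 0, 0.037, 0.091, 0.023, 0.222, 0.215, 0.133, 0.038, -0.036, 0.053],
    [-0.104, -0.042, -0.044, -0.037, 0, 0.333, 0.021, 0.312, 0.213, 0.145, -0.095, 0.091, -0.167],
    [-0.116, -0.044, -0.123, -0.091, -0.333, 0, 0.023, 0.222, 0.185, 0.178, -0.075, -0.107, 0],
    [-0.248, -0.391, -0.242, -0.023, -0.021, -0.023, 0, 0.021, 0.037, 0.227, -0.125, -0.101, 0.049],
    [-0.104, -0.094, -0.396, -0.222, -0.312, -0.222, -0.021, 0, 0.08, 0.255, -0.19, 0.091, -0.042],
    [-0.168, -0.276, -0.294, -0.215, -0.213, -0.185, -0.037, -0.08, 0, 0.485, -0.27, -0.026, -0.068],
    [-0.299, -0.197, -0.366, -0.133, -0.145, -0.178, -0.227, -0.255, -0.485, 0, -0.37, -0.07, -0.154],
    [0, 0.031, -0.089, -0.038, 0.095, 0.075, 0.125, 0.19, 0.27, 0.37, 0, 0.185, 0.085],
    [-0.245, 0.052, -0.194, 0.036, -0.091, 0.107, 0.101, -0.091, 0.026, 0.07, -0.185, 0, 0.12],
    [-0.225, -0.068, -0.08, -0.053, 0.167, 0, -0.049, 0.042, 0.068, 0.154, -0.085, -0.12, 0],
]
INDEX = {name: i for i, name in enumerate(NAMES)}


def linear_order(subset):
    """Return `subset` sorted most-ancient-first, or None if no strict order exists."""
    idx = [INDEX[name] for name in subset]
    wins = {i: sum(W[i][j] > 0 for j in idx if j != i) for i in idx}
    ranked = sorted(idx, key=wins.get)
    for pos, older in enumerate(ranked):
        for younger in ranked[pos + 1:]:
            # A zero (tie) or a sign conflict breaks the ordered line.
            if W[younger][older] <= 0 or W[older][younger] >= 0:
                return None
    return [NAMES[i] for i in ranked]


EURYARCHAEOTA = NAMES[:10]
CRENARCHAEOTA = NAMES[10:]
print(linear_order(NAMES))          # None: the whole group cannot be lined up
print(linear_order(CRENARCHAEOTA))  # ['Ape', 'Sso', 'Pya']
print(linear_order(EURYARCHAEOTA))
```

In this data the whole group fails for more than one reason: Hbs and Pya tie at zero, and the signs also contain a cycle (for example Pya above Mac, Mac above Mth, Mth above Pya).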
This dramatic change indicates that GFF is sensitive to genealogical information,
which further strengthens its correlation with evolution, because in general an
organism’s complexity depends not only on its own rate of evolution but also on that
of the phylum it belongs to. However, whether the correlation is correct still needs to
be verified. For this purpose, we compared the branch order determined by GFF with the
corresponding order in the rRNA universal tree[37], which is generally regarded as the
gold standard. Taxonomy information is used to match the branch names in the rRNA tree
(reproduced at the bottom right corner of Fig. 2 for comparison) with the species used
in this study. Taxonomy information is based on the NCBI Taxonomy Browser[38], and a
brief version with all taxonomy information used in this paper can be found at
reference[36]. Table 3 lists these mappings:
Table 3. Name matching between rRNA tree and GFF-rooted tree
Number and name in rRNA tree    Species in GFF-rooted tree
1: Thermotogales                Thermotoga maritima (Tma)
4: The purple bacteria          Proteobacteria
5: Gram-positive bacteria       Firmicutes
7: Pyrodictium                  Aeropyrum pernix (Ape)
8: Thermoproteus                Pyrobaculum aerophilum (Pya)
9: Thermococcales               Pyrococcus abyssi (Pab), Pyrococcus horikoshii (Pho)
10: Methanococcales             Methanocaldococcus jannaschii (Mja)
11: Methanobacteriales          Methanothermobacter thermautotrophicus (Mth)
12: Methanomicrobiales          Methanosarcina acetivorans C2A (Mac)
13: The extreme halophiles      Halobacterium sp. NRC-1 (Hbs)
14: The animals                 Homo sapiens (Hsa), Caenorhabditis elegans (Cel), Drosophila melanogaster (Dme)
16: Plant                       Arabidopsis thaliana (Ath)
17: Fungi                       Saccharomyces cerevisiae (Sce), Schizosaccharomyces pombe (Spo)
Our results indicate that Desulfurococcales, represented by Aeropyrum pernix (Ape), is
closer to the LCA than Thermoproteales, represented by Pyrobaculum aerophilum (Pya),
which agrees with the Crenarchaeota branch in the rRNA universal tree. In the
Euryarchaeota branch, Pyrococcus horikoshii (Pho) and Pyrococcus abyssi (Pab) are in
the same class (Thermococci), and our results put Pho between Thermoplasma acidophilum
(Tac) and Thermoplasma volcanium (Tvo), which are both Thermoplasmata[2]. This results
in the following contradiction:
Thermoplasmata<Thermococci<Thermoplasmata
On the other hand, our results also show Pab<Mja, which leads to the conclusion that
Thermococci<Methanococci. This agrees with the rRNA tree, which indicates that
Thermococci (represented by Thermococcales) branches earlier than Methanococci
(represented by Methanococcales). The GFF also indicates that Pab<Pho, which means
that Pab is closer to the LCA of the Thermococci branch. It is therefore reasonable to
use Pab to represent Thermococci when rooting the Euryarchaeota tree. Similarly, Tvo is
used to represent the Thermoplasmata branch, as Tvo<Tac. With these considerations, our
results for Euryarchaeota become:
Pab<Mja<Mka<Afu<Tvo<Mth<Mac<Hbs
which is in full agreement with the Euryarchaeota branch in the rRNA tree (see Fig. 2
for details).
Rooting Strategy
Experience teaches that the complex tends to arise from the simple, and GFF seems to be
a good measure of this trend in genome evolution. The fact that the "organism"
represented by the root of a tree is in effect a modern organism[27] makes it natural to
use the species closest to the root of a tree to represent that root. The pattern of
GFF distribution shown in Table 1 and Table 2 reveals the close correlation between GFF
and evolution. The successful comparison of branch order between our results and the
rRNA tree, especially for the Euryarchaeota tree, suggests a systematic rooting
strategy using GFF. The strategy is based on combining GFF with genealogical
information. In this paper we prefer taxonomy to phylogeny as the source of
genealogical data, because phylogenies generally differ from one another, while
taxonomies based on molecular sequences will remain indispensable[3]. For a similar
reason, we use the NCBI taxonomy database[38] to root the tree of life with GFF,
because it attempts to incorporate phylogenetic and taxonomic knowledge from a variety
of sources, although it is not a phylogenetic or taxonomic authority.
Fig. 3. Sketch of the rooting strategy, using the Gammaproteobacteria tree as the
example. The hierarchical structure of the taxonomy is shown on the left. The table of
corresponding GFF values is at the upper right. The yellow rectangle marks the GFF
values for the Enterobacteriales branch. Red spots in the tree represent specific
organisms and green spots represent branching points.
In brief, the rooting strategy uses the taxonomy to determine the tree topology and GFF
to determine the branch order. Figure 3 uses Gammaproteobacteria as the example: given
the hierarchical structure for 11 organisms (some taxonomic levels have been omitted),
we start with the organisms in the deepest taxon (Escherichia): Escherichia coli K12
(Eco), Escherichia coli O157:H7 (Ecs) and Escherichia coli O157:H7 EDL933 (Ecz). The
weighted GFF data (table at the upper right of Fig. 3) show that
Ecs<Ecz<Eco
Therefore Ecs is used as the root of Escherichia when comparing Escherichia against
the other branches at the same taxonomic level: Buchnera, Salmonella and Yersinia.
Because each of the other three branches has only one organism, that organism is used
directly as the root of its branch. The weighted GFF of the roots of the four branches
shows that:
Buc<Ecs<Ype<Sty
It is therefore concluded that Buchnera is the “oldest” branch, and its root, Buchnera
sp. (Buc), is used as the root of the Enterobacteriales branch. This strategy, similar
to the neighbor-joining method, can be applied recursively to find the root of
Gammaproteobacteria, which again is Buc. The corresponding rooted Gammaproteobacteria
tree is drawn at the bottom right of Fig. 3. It is worth pointing out that Haemophilus
influenzae (Hin) and Pasteurella multocida (Pmu) in the Oceanospirillales branch cannot
be distinguished by GFF, as their GFF is 0. However, Pmu is used to represent the root
of Oceanospirillales because it can be uniquely ordered in the Gammaproteobacteria
tree, so that all root species of the Gammaproteobacteria can be lined up, while Hin
cannot. In addition, the GFF of all species within the Enterobacteriales branch (yellow
rectangle in Fig. 3) exhibits the same distinct pattern as in Table 1 and Table 2,
which indicates that GFF is insensitive to the taxonomy within a certain range of
phylogenetic breadth, in contrast to its sensitivity to the Euryarchaeota and
Crenarchaeota branches. On the other hand, if the root of one branch cannot be lined up
with the roots of the other branches, the branch is removed from the tree, as its order
cannot be determined from GFF alone. This can happen if there are not enough data for a
given branch. For example, if we did not have the genome data for Buc, Ecs and Ecz,
then Eco would be used as the root of the Enterobacteriales branch; however, the GFF
data in Fig. 3 show that Sty>Eco>Pae while Pae>Sty. In this case, the
Enterobacteriales branch would have to be removed from the Gammaproteobacteria tree.
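The recursive procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the taxonomy is a nested tuple, the pairwise GFF is supplied as a function, and the numeric scores are hypothetical values chosen only so that the toy data reproduce the orders Ecs<Ecz<Eco and Buc<Ecs<Ype<Sty quoted in the text (the actual GFF values are in Fig. 3). The sketch returns None when the roots of the sub-branches cannot be lined up, instead of removing the single offending branch, and it does not handle the tie-breaking described for Pmu and Hin.

```python
def linear_order(species, gff):
    """Return `species` sorted most-ancient-first, or None if they cannot be lined up."""
    ranked = sorted(species, key=lambda s: sum(gff(s, t) > 0 for t in species if t != s))
    for pos, older in enumerate(ranked):
        for younger in ranked[pos + 1:]:
            if gff(younger, older) <= 0:
                return None  # tie or conflicting sign
    return ranked


def find_root(taxon, gff):
    """Return the species representing the root of `taxon`.

    `taxon` is either a species code (str) or a pair (name, list of child taxa).
    """
    if isinstance(taxon, str):
        return taxon  # a single organism is the root of its own branch
    _name, children = taxon
    roots = [find_root(child, gff) for child in children]
    if any(root is None for root in roots):
        return None
    order = linear_order(roots, gff)
    return order[0] if order else None


# Hypothetical scores; only their order matters for this illustration.
SCORE = {"Buc": 1, "Ecs": 2, "Ecz": 3, "Eco": 4, "Ype": 5, "Sty": 6}


def toy_gff(a, b):
    return SCORE[a] - SCORE[b]


TAXONOMY = ("Enterobacteriales", [
    ("Escherichia", ["Eco", "Ecs", "Ecz"]),
    ("Buchnera", ["Buc"]),
    ("Salmonella", ["Sty"]),
    ("Yersinia", ["Ype"]),
])

print(find_root(TAXONOMY[1][0], toy_gff))  # Ecs
print(find_root(TAXONOMY, toy_gff))        # Buc
```

Because only the root of each child taxon is compared at the next level up, the number of pairwise comparisons needed at a node depends on its number of children rather than on the number of species below it.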
Rooted Tree of Life
The tree of life with 62 species is sketched in Fig. 2, with the rRNA universal tree
reproduced at the bottom right corner for comparison. The number in each yellow circle
in the GFF-rooted tree corresponds to the branch number in the rRNA tree. Branch length
and orientation carry no meaning; only the vertical branch order of the sketched
universal tree is meaningful. In general, for illustration, each species (red spot) is
positioned so that its relative vertical position reflects its GFF within the branch,
and likewise for the branch order (green spots). The tree is strictly rooted using the
strategy presented above, except for the order of the Archaea branch among the three
domains, which is discussed later.
Of the 70 species analyzed in this study, eight have been removed from the tree shown
in Fig. 2. The Actinobacteria branch is removed from the Bacteria stem because its
root, Corynebacterium glutamicum (Cgl), cannot be lined up with the other seven branch
roots, although four species within it can be lined up well. The Cyanobacteria and
Deinococcus-Thermus branches are both removed from the Bacteria stem for the same
reason. The Microsporidia branch is removed from the Fungi stem because of the
surprisingly small total number of multidomain proteins in Ecu (Table 1).
Apart from the branches mentioned above, our study gives the Bacteria stem seven
ordered branches and puts Spirochaetes at the bottom of the tree. Interestingly, the
universal tree based on a combined data set of 23 proteins from 45 species also placed
Spirochaetes as the lowest-branching Bacteria[4]. More interestingly, the same study
puts Aquificae as the lowest branch after removing 9 HGT proteins, while in our study
Aquificae branches from the Bacteria stem immediately after Spirochaetes. Spirochaetes
as the lowest branch of the Bacteria stem is consistent with hypotheses of extensive
horizontal gene transfer between Spirochaetes and Archaea. For example, Archaea and
Spirochaetes share a novel class I type lysyl-tRNA synthetase to the exclusion of
eukaryotes and most other bacteria[39]. In addition, the spirochetes are also notable
for the number of cases in which their aminoacyl-tRNA synthetases are of the archaeal
and eukaryotic genre. On the other hand, the closeness of Aquificae and Thermotogae to
the base of the Bacteria stem fits the hypotheses of a thermophilic origin of life for
prokaryotes[40]. The GFF-rooted tree also moves the Firmicutes close to the root of the
Bacteria domain, which is supported by a review of protein phylogenies and signature
sequences concluding that the root of the prokaryotic tree lies between archaebacteria
and gram-positive bacteria (Firmicutes)[41].
Compared with the rRNA universal tree, both trees agree that Firmicutes branches from
the Bacteria stem earlier than Proteobacteria. However, our results suggest that
Thermotogae branches after the Firmicutes, while in the rRNA tree it branches earlier
than Firmicutes. This difference is possibly introduced by an artifact in our method,
similar to LBA, caused by the limited number of species in the corresponding branch.
Recall that the Thermococci branch in the Euryarchaeota stem is correctly moved down to
the bottom by the GFF of Pab (without the GFF of Pab, our strategy would put the
Thermococci branch between the Thermoplasmatales and Methanobacteria branches, which is
inconsistent with the rRNA tree). Since the Thermotogae branch contains only one
species, Thermotoga maritima (Tma), it is quite possible that an undiscovered organism
(or a known organism whose whole genome is not yet available) in the Thermotogae branch
would correctly move it down, below Firmicutes. A similar hypothesis may also explain
why the GFF-rooted tree may incorrectly indicate that Euryarchaeota branches earlier
than Crenarchaeota, although the view that Crenarchaeota is the ancestral lineage of
the two archaebacterial groups is not supported by the signature sequence present in
the EF-1α/Tu protein[41, 42]. A similar reason may also account for the different
branch order of animals (Metazoa) and fungi in the rRNA tree and the GFF-rooted tree.
Table 3. Comparison of weighted GFF between three domains
        Bbu     Aae     Sau     Tma     Jhp     Fnu     Ctr     Pab     Mja     Mka     Afu     Tvo     Sce     Spo   total
Spo   0.186   0.179    0.39   0.457   0.228   0.059       0   0.312   0.248   0.305   0.132   0.426   0.198       0      86
Sce   0.046   0.195    0.43   0.239   0.346  -0.016   0.021   0.465   0.084   0.333   0.298   0.633       0  -0.198      76
Tvo  -0.121  -0.169   -0.13  -0.254   -0.16  -0.306  -0.143   0.178   0.185   0.222   0.023       0  -0.633  -0.426      22
Afu   0.105  -0.157    0.15    0.17   0.065  -0.139  -0.141   0.227   0.037   0.021       0  -0.023  -0.298  -0.132      65
Mka   0.047   0.174   -0.04  -0.027    -0.1  -0.268  -0.192   0.255    0.08       0  -0.021  -0.222  -0.333  -0.305      32
Mja   0.037     0.3    -0.1  -0.119       0  -0.215  -0.159   0.485       0   -0.08  -0.037  -0.185  -0.084  -0.248      43
Pab  -0.118  -0.367   -0.12  -0.156  -0.275  -0.274  -0.233       0   -0.49  -0.255  -0.227  -0.178  -0.465  -0.312      23
Ctr   0.129   0.211    0.18   0.131   0.083   0.057       0   0.233   0.159   0.192   0.141   0.143  -0.021       0      20
Fnu    0.23   0.207    0.19   0.154   0.256       0  -0.057   0.274   0.215   0.268   0.139   0.306   0.016  -0.059      50
Jhp   0.308   0.062    0.09   0.058       0  -0.256  -0.083   0.275       0     0.1  -0.065    0.16  -0.346  -0.228      28
Tma   0.192   0.077     0.1       0  -0.058  -0.154  -0.131   0.156   0.119   0.027   -0.17   0.254  -0.239  -0.457      41
Sau    0.29   0.063       0  -0.101  -0.093  -0.185  -0.179   0.123   0.099   0.044  -0.146   0.125  -0.433  -0.389      58
Aae    0.25       0   -0.06  -0.077  -0.062  -0.207  -0.211   0.367    -0.3  -0.174   0.157   0.169  -0.195  -0.179      37
Bbu       0   -0.25   -0.29  -0.192  -0.308   -0.23  -0.129   0.118   -0.04  -0.047  -0.105   0.121  -0.046  -0.186      11
The GFF data between the three domains are shown in Table 3; the full data set can be
found at reference[36]. The yellow rectangle represents the Eukaryota domain, the cyan
rectangle represents the Archaea domain (we used the Euryarchaeota branch) and the
green rectangle represents the Bacteria domain. It can be seen that almost all
organisms serving as branch roots in the Archaea and Bacteria domains are more
“ancient” than the Eukaryota domain represented by Sce and Spo, although our rooting
strategy only requires Pab, Bbu<Sce. This result clearly distinguishes the Eukaryota
domain from both the Bacteria and Archaea domains and puts the root of the universal
tree between the Bacteria and Archaea domains. In addition, our study puts the root of
Eukaryota near the Fungi branch, which is in line with results from other research
groups[43], including the one that used a single fusion event to root the eukaryotic
tree[44]. Meanwhile, the difference between the Bacteria and Archaea domains is not so
clear from the GFF data in Table 3. In fact, the GFF of Pab indicates that the Archaea
domain should branch earlier than the Bacteria domain under our rooting strategy, which
is clearly inconsistent with the rRNA tree. However, we still put the Archaea domain
between Eukaryota and Bacteria, using a dashed line for the Archaea branch to indicate
the possible problem with the current GFF data sets, for the following reasons:
First, the inconsistency in the GFF data is weak. Apart from Pab, other GFF data in the
Archaea domain, such as those of Mja and Mka, show that Archaea does branch later than
Bacteria. Second, it may not be correct to use Pab as the root of Archaea, given the
lack of genome data for organisms in the Crenarchaeota branch. Furthermore, a recent
study has shown that the new archaeal kingdom Nanoarchaeota branches even earlier than
Crenarchaeota[45]. Therefore the inconsistency in the branch order of the Archaea
domain may be introduced by an incorrect archaeal root. We expect that more genome data
from the Archaea domain will resolve this puzzle.
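The comparison of domain roots discussed above can be made concrete with three weighted GFF values taken from the domain comparison table: Sce against Pab (0.465), Sce against Bbu (0.046) and Bbu against Pab (0.118). The following Python sketch only reads off the sign of each value; the helper name `older` is hypothetical.

```python
# Weighted GFF, keyed as (row species, column species), from the domain comparison table.
W = {
    ("Sce", "Pab"): 0.465,
    ("Sce", "Bbu"): 0.046,
    ("Bbu", "Pab"): 0.118,
}


def older(a, b):
    """Return the more ancient species of the pair, or None if the weighted GFF is 0."""
    value = W[(a, b)] if (a, b) in W else -W[(b, a)]
    if value == 0:
        return None
    # A positive value means `a` has accreted more fusions, so `b` is more ancient.
    return b if value > 0 else a


print(older("Sce", "Pab"))  # Pab
print(older("Sce", "Bbu"))  # Bbu
print(older("Pab", "Bbu"))  # Pab
```

Read this way, the three values give Pab<Bbu<Sce, which is the ordering behind the statement that the GFF of Pab places Archaea earlier than Bacteria.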
The tree shown in Fig. 2 is merely a demonstration of the rooting strategy proposed in
this paper. The tree topology provided by the taxonomy will certainly affect the
rooting results. For example, if we place Archaea as the sister group of Bacteria, the
inconsistency in the GFF data of the Archaea domain mentioned above simply disappears,
because the ancientness of Archaea relative to Bacteria, as determined by GFF,
naturally explains the close relationship between Archaea and Eukaryota. This
archaebacterial/eubacterial sisterhood is in fact the second most promising topology of
the rooted universal tree based on a broader survey of homologous genes in the three
domains (Fig. 7 in reference[9]) and is also supported by the phylogenetic analysis of
the SRP54 sequences[5]. All these arguments imply that the prokaryotes are a natural
group.
Conclusion
The evolutionary process is irreversible. Nevertheless, it may leave evidence that
allows us to trace its history. The GFF of multidomain proteins has been shown to be
one good marker of such evidence. The rooting strategy based on GFF is systematic: it
can be used to root any tree as long as the organisms within it have a certain number
of multidomain proteins. The rooting strategy is also extensible: adding more organisms
makes the tree more accurate.
Our conclusions not only reveal a novel and promising rooting alternative, but also
show the close correlation between evolution and the accretion of multidomain proteins.
Furthermore, the fact that the tree can be rooted using GFF indirectly supports the
proposal of a universal tree of life initiated by Darwin, despite the confusion,
introduced mainly by HGT and LBA, that has led to the suggestion that life should be
treated as a network rather than a single tree[3, 7].
References
1. Woese, C.R., Bacterial evolution. Microbiol Rev, 1987.
51(2): p. 221-71.
2. Darwin, C., F. Burkhardt, and S. Smith, The
correspondence of Charles Darwin. Vol. 6. 1990,
Cambridge [Cambridgeshire] ; New York: Cambridge
University Press. 1856-1857.
3. Doolittle, W.F., Phylogenetic classification and the
universal tree. Science, 1999. 284(5423): p. 2124-9.
4. Brown, J.R., et al., Universal trees based on large
combined protein sequence data sets. Nat Genet, 2001.
28(3): p. 281-5.
5. Brinkmann, H. and H. Philippe, Archaea sister group of
Bacteria? Indications from tree reconstruction
artifacts in ancient phylogenies. Mol Biol Evol, 1999.
16(6): p. 817-25.
6. Kurland, C.G., B. Canback, and O.G. Berg, Horizontal
gene transfer: a critical view. Proc Natl Acad Sci U S
A, 2003. 100(17): p. 9658-62.
7. Pennisi, E., Is it time to uproot the tree of life?
Science, 1999. 284(5418): p. 1305-7.
8. Felsenstein, J., Cases in which parsimony and
compatibility methods will be positively misleading.
Systematic Zoology, 1978. 27: p. 401-410.
9. Doolittle, W.F. and J.R. Brown, Tempo, mode, the
progenote, and the universal root. Proc Natl Acad Sci
U S A, 1994. 91(15): p. 6721-8.
10. Iwabe, N., et al., Evolutionary relationship of
archaebacteria, eubacteria, and eukaryotes inferred
from phylogenetic trees of duplicated genes. Proc Natl
Acad Sci U S A, 1989. 86(23): p. 9355-9.
11. Baldauf, S.L., J.D. Palmer, and W.F. Doolittle, The
root of the universal tree and the origin of
eukaryotes based on elongation factor phylogeny. Proc
Natl Acad Sci U S A, 1996. 93(15): p. 7749-54.
12. Brown, J.R. and W.F. Doolittle, Root of the universal
tree of life based on ancient aminoacyl-tRNA
synthetase gene duplications. Proc Natl Acad Sci U S A,
1995. 92(7): p. 2441-5.
13. Mathews, S. and M.J. Donoghue, The root of angiosperm
phylogeny inferred from duplicate phytochrome genes.
Science, 1999. 286(5441): p. 947-50.
14. Embley, T.M. and R.P. Hirt, Early branching eukaryotes?
Curr Opin Genet Dev, 1998. 8(6): p. 624-9.
15. Hirt, R.P., et al., Microsporidia are related to Fungi:
evidence from the largest subunit of RNA polymerase II
and other proteins. Proc Natl Acad Sci U S A, 1999.
96(2): p. 580-5.
16. Hilario, E. and J.P. Gogarten, Horizontal transfer of
ATPase genes--the tree of life becomes a net of life.
Biosystems, 1993. 31(2-3): p. 111-9.
17. Philippe, H. and P. Forterre, The rooting of the
universal tree of life is not reliable. J Mol Evol,
1999. 49(4): p. 509-23.
18. Wolf, Y.I., A.S. Kondrashov, and E.V. Koonin,
Interkingdom gene fusions. Genome Biol, 2000. 1(6): p.
RESEARCH0013.
19. Yanai, I., Y.I. Wolf, and E.V. Koonin, Evolution of
gene fusions: horizontal transfer versus independent
events. Genome Biol, 2002. 3(5): p. research0024.
20. Goffeau, A., et al., Life with 6000 genes. Science,
1996. 274(5287): p. 546, 563-7.
21. Myers, E.W., et al., A whole-genome assembly of
Drosophila. Science, 2000. 287(5461): p. 2196-204.
22. Lander, E.S., et al., Initial sequencing and analysis
of the human genome. Nature, 2001. 409(6822): p. 860-921.
23. Pennisi, E., Genomics. Sequence tells mouse, human
genome secrets. Science, 2002. 298(5600): p. 1863-5.
24. Venter, J.C., et al., The sequence of the human genome.
Science, 2001. 291(5507): p. 1304-51.
25. Koonin, E.V., Y.I. Wolf, and G.P. Karev, The structure
of the protein universe and genome evolution. Nature,
2002. 420(6912): p. 218-23.
26. Marcotte, E.M., et al., Detecting protein function and
protein-protein interactions from genome sequences.
Science, 1999. 285(5428): p. 751-3.
27. Woese, C.R., On the evolution of cells. Proc Natl Acad
Sci U S A, 2002. 99(13): p. 8742-7.
28. Whitehead, A.N., Process and reality, an essay in
cosmology; Gifford lectures delivered in the
University of Edinburgh during the session 1927-28.
1929, New York, Cambridge, Eng.: The Macmillan Company;
University Press. xii, 547.
29. Chothia, C., et al., Evolution of the protein
repertoire. Science, 2003. 300(5626): p. 1701-3.
30. Yanai, I., A. Derti, and C. DeLisi, Genes linked by
fusion events are generally of the same functional
category: a systematic analysis of 30 microbial
genomes. Proc Natl Acad Sci U S A, 2001. 98(14): p.
7940-5.
31. Reizer, J. and M.H. Saier, Jr., Modular multidomain
phosphoryl transfer proteins of bacteria. Curr Opin
Struct Biol, 1997. 7(3): p. 407-15.
32. Trifonov, E.N., et al., Distinct stages of protein
evolution as suggested by protein sequence analysis. J
Mol Evol, 2001. 53(4-5): p. 394-401.
33. Tatusov, R.L., et al., The COG database: an updated
version includes eukaryotes. BMC Bioinformatics, 2003.
4(1): p. 41.
34. Tatusov, R.L., E.V. Koonin, and D.J. Lipman, A genomic
perspective on protein families. Science, 1997.
278(5338): p. 631-7.
35. Bullerwell, C.E., et al., A comparison of three
fission yeast mitochondrial genomes. Nucleic Acids Res,
2003. 31(2): p. 759-68.
36. Supporting_materials, http://bioinfo.bu.edu/GFF.
37. Woese, C.R., O. Kandler, and M.L. Wheelis, Towards a
natural system of organisms: proposal for the domains
Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S
A, 1990. 87(12): p. 4576-9.
38. Wheeler, D.L., et al., Database resources of the
National Center for Biotechnology Information: update.
Nucleic Acids Res, 2004. 32(1): p. D35-40.
39. Ibba, M., et al., A euryarchaeal lysyl-tRNA synthetase:
resemblance to class I synthetases. Science, 1997.
278(5340): p. 1119-22.
40. Woese, C.R., et al., Aminoacyl-tRNA synthetases, the
genetic code, and the evolutionary process. Microbiol
Mol Biol Rev, 2000. 64(1): p. 202-36.
41. Gupta, R.S., Protein phylogenies and signature
sequences: A reappraisal of evolutionary relationships
among archaebacteria, eubacteria, and eukaryotes.
Microbiol Mol Biol Rev, 1998. 62(4): p. 1435-91.
42. Rivera, M.C. and J.A. Lake, Evidence that eukaryotes
and eocyte prokaryotes are immediate relatives.
Science, 1992. 257(5066): p. 74-6.
43. Baldauf, S.L., The deep roots of eukaryotes. Science,
2003. 300(5626): p. 1703-6.
44. Stechmann, A. and T. Cavalier-Smith, Rooting the
eukaryote tree by using a derived gene fusion. Science,
2002. 297(5578): p. 89-91.
45. Waters, E., et al., The genome of Nanoarchaeum
equitans: insights into early archaeal evolution and
derived parasitism. Proc Natl Acad Sci U S A, 2003.
100(22): p. 12984-8.