Rooting the Universal Tree of Life Using Genomic Fusion Flux

Zhenjun Hu1, Jie Wu2, Shujiro Okuda3, Charles DeLisi*,1,2

1 Bioinformatics Program
2 Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
3 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, 611-0011, Japan

*To whom correspondence should be addressed. Email: delisi@bu.edu

In the last decade of the last century there was a great debate over whether the root of the universal tree can be found at all, owing to the conflicting results of different phylogenetic analyses. The availability of genome sequences seems to have created more confusion rather than clarifying the problem. Here we introduce a systematic rooting strategy based on genomic fusion events that appears to help substantially in rooting the universal tree of life. A total of 70 organisms were analyzed, and a rooted universal tree of 62 species was built using the novel rooting strategy and general taxonomy information.

Background

The concept of a universal phylogenetic tree first emerged about 100 years ago[2]. In 1989, such a tree was proposed on the basis of small-subunit (SSU) ribosomal RNA (rRNA) sequence comparison[1], together with three new principal domains: Archaea, Bacteria and Eucarya. However, recent studies based on multi-protein and whole-genome sequence comparisons have produced conflicting trees and raised the question of whether there is a universal tree of life and whether all life descends from a single root[3-7]. The confusion is commonly attributed to several factors: horizontal gene transfer (HGT), long branch attraction (LBA)[8], and degradation of the phylogenetic signal through saturation of amino acid substitutions. Furthermore, although the most important point in a phylogenetic tree is its root, there is in principle no way to root the universal tree of life from a collection of homologous sequences alone[9].
The root is the oldest point in the tree and corresponds to the last common ancestor (LCA). The rRNA phylogenetic tree is rooted using the strategy proposed by Iwabe et al.[10], which is derived from an ancient gene duplication in the common ancestor of prokaryotes. This method not only requires the discovery of appropriate genes for different trees[11-14], but is also questionable because it depends on a very limited number of genes, and there is increasing evidence for multiple gene duplication events in their history[15-17]. In addition, the rooting strategy requires an external point of reference as the outgroup, which may not always be available. An alternative approach, based on the fusion of dihydrofolate reductase and thymidylate synthase, has been proposed to root the eukaryote tree, and it places the eukaryotic root between bikonts and opisthokonts. However, this approach has a similar shortcoming: it requires that the genes fused just once and were never secondarily split or laterally transferred within eukaryotes, which may not always hold[18, 19].

The genome sequences of human and several model organisms have shown that more advanced organisms often have more multidomain proteins than less complex species[20-24]. Koonin et al.[25] found increased complexity of domain architecture in more complex organisms for many orthologous sets of eukaryotic proteins. Multidomain architecture also carries selective advantages, as it increases the efficiency of coupling of biochemical reactions and signal transduction[26]. This evidence suggests a possible connection between the accretion of multidomain proteins and organismal complexity, and that natural selection may favor multidomain architecture.
However, simply counting the number of multidomain proteins in different organisms will not reveal the characteristics of vertical evolution, because some evolutionary events, such as HGT, may transfer multidomain proteins from an organism other than the closest ancestor[3, 27]. Meanwhile, evolution should be seen as a "procession of forms," with the focus on the process that produces the gem rather than on the reflections from its facets[27, 28]. Based on current biological knowledge, the dominant mechanisms for the expansion of the multidomain protein repertoire are gene duplication and gene fusion/fission[19, 25, 29-32]. It is therefore reasonable to compare two species’ closeness to the LCA (or evolutionary complexity) by defining a variable that captures the process information of gene fusion/fission and gene duplication, which we name GFF (genomic fusion flux). For two given species A and B, the GFF and the weighted GFF are defined as:

$$\mathrm{GFF}_{A \leftarrow B} = F_A - F_B \qquad \text{and} \qquad \overline{\mathrm{GFF}}_{A \leftarrow B} = \frac{\mathrm{GFF}_{A \leftarrow B}}{(F_A + F_B)/2}$$

where $F_A$ is the total number of multidomain proteins fused in species A but separated in species B (Fig. 1), and $F_B$ is the total number of multidomain proteins fused in species B but separated in species A. $\mathrm{GFF}_{A \leftarrow B}$ represents the net fusion flux from species B to species A, because $F_B$ corresponds to the fission flux from A to B. $\mathrm{GFF}_{A \leftarrow B} > 0$ indicates evolutionary accretion of multidomain proteins in A relative to B, and $\mathrm{GFF}_{A \leftarrow B} < 0$ the reverse. The two species are evolutionarily indistinguishable by GFF if $\mathrm{GFF}_{A \leftarrow B} = 0$: they are either too close or too distant to compare. The weighted GFF, $\overline{\mathrm{GFF}}_{A \leftarrow B}$, indicates the significance of the GFF relative to the total number of fusion and fission events.

Fig. 1. Sketch of fusion/fission events between the orthologous proteins of species A and B.

As shown in Fig.
1, GFF counts only the fusion events whose final products are multidomain proteins, in order to trace their evolutionary accretion; that is, GFF ignores fusion events in which two independent genes are fused into a gene coding for a single-domain protein. The fusion events are counted based on the orthologous groups provided by the Clusters of Orthologous Groups (COG) database[33, 34]. Each COG presumably contains the “same” gene or domain from multiple organisms, and multidomain proteins are split into their constituent domains if orthologs of the latter are present as standalone forms in some of the genomes. We allow the paralogs of fused multidomain proteins within the same COG to be counted as multiple fusion events, to incorporate the accretion of multidomain proteins contributed by gene duplication.

Fig. 2. Sketch of GFF-rooted universal tree of life.

Results and Discussions

A total of 70 organisms were analyzed, of which 62 species were incorporated in a rooted tree of life (Fig. 2). The tree shown in Fig. 2 is fully rooted using the rooting strategy described below, except for the Archaea branch, which is drawn with a dashed line.

Correlation between GFF and Evolution

Table 1. GFF of seven eukaryotic multicellular organisms

organism     hsa     ath     cel     dme     spo     sce     ecu   total
hsa            0      24     148     189      95     103      31     279
ath          -24       0      95     135      97      96      19     194
cel         -148     -95       0       7      30      35       9     134
dme         -189    -135      -7       0      12      11       5      97
spo          -95     -97     -30     -12       0       4       4      28
sce         -103     -96     -35     -11      -4       0       6      24
ecu          -31     -19      -9      -5      -4      -6       0       1

GFF is defined to measure the net fusion flux between genomes. $\mathrm{GFF}_{A \leftarrow B} > 0$ indicates accretion of multidomain proteins from B to A; therefore A is evolutionarily more advanced than B. In other words, B is more ancient than A. However, it remains to be verified whether this conclusion holds. Table 1 shows the GFF for seven eukaryotic organisms, based on the extension of the COGs to complex, multicellular eukaryotes known as KOGs (eukaryotic orthologous groups).
Except for the last column, each cell in Table 1 shows the GFF of the row organism relative to the column organism, that is, $\mathrm{GFF}_{A \leftarrow B}$ with A the row organism and B the column organism. The table is antisymmetric about the diagonal because $\mathrm{GFF}_{A \leftarrow B} = -\mathrm{GFF}_{B \leftarrow A}$. The last column shows the total number of multidomain fusion events of each organism. The data in Table 1 unambiguously yield the following evolutionary order:

ecu<sce<spo<dme<cel<ath<hsa

where the symbol “<” can be read as “more ancient than”. The result is based on exhaustive pairwise comparison of the seven species, and the data are consistent across all comparisons, which gives Table 1 its distinctive sign pattern: all entries above the diagonal are positive and all entries below it are negative. The above result shows that Homo sapiens (hsa) is the most advanced species and the microsporidian Encephalitozoon cuniculi (ecu) is the least advanced. It also indicates that the budding yeast Saccharomyces cerevisiae (sce) is more ancient than the fission yeast Schizosaccharomyces pombe (spo), which is supported by the results of Brinkmann et al.[5] and Bullerwell et al.[35].

This result is clearly not random. The P-value for obtaining an ordered line of n organisms (either all ‘<’, all ‘>’ or all ‘=’) from random GFF values can be estimated as follows:

$$P = \frac{n!}{n!\,3^{\,n-1}} \times 3 = \left(\frac{1}{3}\right)^{n-2}$$

For seven eukaryotic organisms, the P-value is $1/3^{5} \approx 0.004$. It is also worth noting that the same species order can be obtained simply from the total number of fused multidomain proteins (last column of Table 1). Although this result indicates a plausible connection between GFF and evolution, it is still not clear why the fruit fly Drosophila melanogaster (dme) and the nematode Caenorhabditis elegans (cel) are both less complex than the green plant Arabidopsis thaliana (ath). As hsa, dme and cel are all animals, it remains controversial whether plants or animals are closer to their LCA.
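
The ordering and the significance estimate above can be reproduced directly from Table 1. The following is a minimal sketch, not the authors' code: the matrix is transcribed from Table 1, and the helper names `is_antisymmetric`, `line_up` and `p_ordered` are illustrative choices. `line_up` relies on the fact that, in an antisymmetric matrix, the species can be placed in one strict order exactly when the numbers of negative entries per row are 0, 1, ..., n-1.

```python
# Table 1, transcribed row by row: GFF[a][b] is the GFF of row organism a
# relative to column organism b (positive means a is "more advanced").
SPECIES = ["hsa", "ath", "cel", "dme", "spo", "sce", "ecu"]
GFF = [
    [0, 24, 148, 189, 95, 103, 31],
    [-24, 0, 95, 135, 97, 96, 19],
    [-148, -95, 0, 7, 30, 35, 9],
    [-189, -135, -7, 0, 12, 11, 5],
    [-95, -97, -30, -12, 0, 4, 4],
    [-103, -96, -35, -11, -4, 0, 6],
    [-31, -19, -9, -5, -4, -6, 0],
]


def is_antisymmetric(m):
    """Check that m[i][j] == -m[j][i] for every pair of indices."""
    n = len(m)
    return all(m[i][j] == -m[j][i] for i in range(n) for j in range(n))


def line_up(names, m):
    """Return names from most ancient to most advanced, or None.

    Assumes m is antisymmetric. A species that is more ancient than k
    others has k negative entries in its row, so a strict linear order
    exists only if the per-row counts are exactly 0, 1, ..., n-1.
    """
    n = len(names)
    negatives = [sum(1 for v in row if v < 0) for row in m]
    if sorted(negatives) != list(range(n)):
        return None
    # Most negatives first: that species is more ancient than all others.
    return [name for _, name in sorted(zip(negatives, names), reverse=True)]


def p_ordered(n):
    """Closed form quoted in the text: P = (1/3)^(n-2)."""
    return (1 / 3) ** (n - 2)


if __name__ == "__main__":
    print(is_antisymmetric(GFF))
    print("<".join(line_up(SPECIES, GFF)))
    print(round(p_ordered(7), 4))
```

With these values, `line_up(SPECIES, GFF)` returns ecu, sce, spo, dme, cel, ath, hsa, and `p_ordered(7)` evaluates to 1/243, about 0.004.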
In contrast to the seven eukaryotic species, the GFF data for the thirteen archaeal species are much messier: it is impossible to obtain a similar ordered line for the whole archaebacterial group. The corresponding weighted GFF values are shown in Table 2.

Table 2. Weighted GFF of 13 archaebacterial organisms; see reference[36] for complete species names

        Hbs     Mac     Mth     Tac     Pho     Tvo     Afu     Mka     Mja     Pab     Pya     Sso     Ape   Total
Hbs       0   0.125   0.146    0.14   0.104   0.116   0.248   0.104   0.168   0.299       0   0.245   0.225      64
Mac  -0.125       0   0.055   0.077   0.042   0.044   0.391   0.094   0.276   0.197  -0.031  -0.052   0.068     160
Mth  -0.146  -0.055       0   0.049   0.044   0.123   0.242   0.396   0.294   0.366   0.089   0.194    0.08      59
Tac   -0.14  -0.077  -0.049       0   0.037   0.091   0.023   0.222   0.215   0.133   0.038  -0.036   0.053      22
Pho  -0.104  -0.042  -0.044  -0.037       0   0.333   0.021   0.312   0.213   0.145  -0.095   0.091  -0.167      32
Tvo  -0.116  -0.044  -0.123  -0.091  -0.333       0   0.023   0.222   0.185   0.178  -0.075  -0.107       0      22
Afu  -0.248  -0.391  -0.242  -0.023  -0.021  -0.023       0   0.021   0.037   0.227  -0.125  -0.101   0.049      65
Mka  -0.104  -0.094  -0.396  -0.222  -0.312  -0.222  -0.021       0    0.08   0.255   -0.19   0.091  -0.042      32
Mja  -0.168  -0.276  -0.294  -0.215  -0.213  -0.185  -0.037   -0.08       0   0.485   -0.27  -0.026  -0.068      43
Pab  -0.299  -0.197  -0.366  -0.133  -0.145  -0.178  -0.227  -0.255  -0.485       0   -0.37   -0.07  -0.154      23
Pya       0   0.031  -0.089  -0.038   0.095   0.075   0.125    0.19    0.27    0.37       0   0.185   0.085      31
Sso  -0.245   0.052  -0.194   0.036  -0.091   0.107   0.101  -0.091   0.026    0.07  -0.185       0    0.12      34
Ape  -0.225  -0.068   -0.08  -0.053   0.167       0  -0.049   0.042   0.068   0.154  -0.085   -0.12       0      16

After examining the taxonomy, we noticed that three species of the archaebacterial group shown in Table 2 (Pya, Sso and Ape) belong to Crenarchaeota (indicated by light-green color) and the remaining ten belong to Euryarchaeota (indicated by light-blue color).
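
One way to see why the thirteen species cannot be lined up is to look for intransitive triples in Table 2. The sketch below is illustrative only: it transcribes six entries of Table 2, and the helpers `wgff`, `more_advanced` and `find_cycle` are hypothetical names introduced here. A strict linear order requires that no three species form a cycle a>b>c>a and that no pair has a weighted GFF of 0.

```python
from itertools import permutations

# Selected weighted-GFF entries from Table 2, keyed as (row, column).
# Only the entries needed here are transcribed; the opposite direction
# is obtained by antisymmetry.
W = {
    ("Mac", "Mth"): 0.055,
    ("Mac", "Sso"): -0.052,
    ("Mth", "Sso"): 0.194,
    ("Pya", "Sso"): 0.185,
    ("Pya", "Ape"): 0.085,
    ("Sso", "Ape"): 0.12,
}


def wgff(a, b):
    """Weighted GFF of a relative to b, using antisymmetry when needed."""
    if a == b:
        return 0.0
    if (a, b) in W:
        return W[(a, b)]
    return -W[(b, a)]


def more_advanced(a, b):
    """True if a is more advanced than b (positive weighted GFF)."""
    return wgff(a, b) > 0


def find_cycle(species):
    """Return a triple (a, b, c) with a > b > c > a, or None if none exists."""
    for a, b, c in permutations(species, 3):
        if more_advanced(a, b) and more_advanced(b, c) and more_advanced(c, a):
            return (a, b, c)
    return None


if __name__ == "__main__":
    print(find_cycle(["Mac", "Mth", "Sso"]))  # mixed Euryarchaeota/Crenarchaeota
    print(find_cycle(["Pya", "Sso", "Ape"]))  # Crenarchaeota only
```

With these entries, Mac, Mth and Sso form a cycle (Mac>Mth, Mth>Sso, Sso>Mac), mixing two Euryarchaeota with a Crenarchaeota species, while Pya, Sso and Ape contain none. Ties such as the 0 between Hbs and Pya are a second obstacle that this check does not cover.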
When the Crenarchaeota are treated as an independent group, we obtain the same pattern for both groups as for the eukaryotic species (Table 1); that is, the three Crenarchaeota organisms can be lined up as:

Ape<Sso<Pya

and the ten Euryarchaeota organisms can be lined up as:

Pab<Mja<Mka<Afu<Tvo<Pho<Tac<Mth<Mac<Hbs

This dramatic change indicates that GFF is sensitive to genealogical information, which further strengthens its correlation with evolution, because in general an organism's complexity depends not only on its own rate of evolution but also on that of the phylum it belongs to. However, it still needs to be verified whether the correlation is correct. For this purpose, we compared the branch order determined by GFF with the corresponding order in the rRNA universal tree[37], which is generally regarded as the gold standard. Taxonomy information is used to match the branch names in the rRNA tree (reproduced at the bottom right corner of Fig. 2 for comparison) with the species used in this study. The taxonomy information is based on the NCBI Taxonomy Browser[38], and a brief version with all taxonomy information used in this paper can be found at reference[36]. Table 3 lists these mappings.

Table 3. Name matching between rRNA tree and GFF-rooted tree

Number and name in rRNA tree    Species in GFF-rooted tree
1: Thermotogales                Thermotoga maritima (Tma)
4: The purple bacteria          Proteobacteria
5: Gram-positive bacteria       Firmicutes
7: Pyrodictium                  Aeropyrum pernix (Ape)
8: Thermoproteus                Pyrobaculum aerophilum (Pya)
9: Thermococcales               Pyrococcus abyssi (Pab), Pyrococcus horikoshii (Pho)
10: Methanococcales             Methanocaldococcus jannaschii (Mja)
11: Methanobacteriales          Methanothermobacter thermautotrophicus (Mth)
12: Methanomicrobiales          Methanosarcina acetivorans C2A (Mac)
13: The extreme halophiles      Halobacterium sp. NRC-1 (Hbs)
14: The animals                 Homo sapiens (Hsa), Caenorhabditis elegans (Cel), Drosophila melanogaster (Dme)
16: Plant                       Arabidopsis thaliana (Ath)
17: Fungi                       Saccharomyces cerevisiae (Sce), Schizosaccharomyces pombe (Spo)

Our results indicate that Desulfurococcales, represented by Aeropyrum pernix (Ape), are closer to the LCA than Thermoproteales, represented by Pyrobaculum aerophilum (Pya), which agrees with the Crenarchaeota branch of the rRNA universal tree. Turning to the Euryarchaeota branch, Pyrococcus horikoshii (Pho) and Pyrococcus abyssi (Pab) are in the same phylum (Thermococci), yet our results place Pho between Thermoplasma acidophilum (Tac) and Thermoplasma volcanium (Tvo), which are both Thermoplasmata[2]. This results in the following contradiction:

Thermoplasmata<Thermococci<Thermoplasmata

On the other hand, our results also show Pab<Mja, which leads to the conclusion that Thermococci<Methanococci. This agrees with the rRNA tree, in which Thermococci (represented by Thermococcales) branch earlier than Methanococci (represented by Methanococcales). The GFF also indicates that Pab<Pho, which means that Pab is closer to the LCA of the Thermococci branch. It is therefore reasonable to use Pab to represent Thermococci when rooting the Euryarchaeota tree. Similarly, Tvo is used to represent the Thermoplasmata branch, as Tvo<Tac. With these considerations, our result for Euryarchaeota becomes:

Pab<Mja<Mka<Afu<Tvo<Mth<Mac<Hbs

which is in full agreement with the Euryarchaeota branch of the rRNA tree (see Fig. 2 for details).

Rooting Strategy

Experience teaches that the complex tends to arise from the simple, and GFF seems to be a good measure of this trend in genome evolution. The fact that the "organism" represented by the root of a tree is in effect a modern organism[27] makes it natural to use the species closest to the root of a tree to represent that root.
The pattern of GFF distribution shown in both Table 1 and Table 2 reveals the close correlation between GFF and evolution. The successful comparison of branch order between our result and the rRNA tree, especially for the Euryarchaeota tree, prompts us to propose a systematic rooting strategy using GFF. The strategy combines GFF with genealogical information. In this paper we prefer taxonomy to phylogeny as the source of genealogical data, because phylogenies in general differ from one another, while taxonomies based on molecular sequences will remain indispensable[3]. For a similar reason, we use the NCBI taxonomy database[38] to root the tree of life using GFF, because it attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, although it is not a phylogenetic or taxonomic authority.

Fig. 3. Sketch of rooting strategy. The Gammaproteobacteria tree is used as the example. The hierarchical structure of the taxonomy is shown on the left. The upper right is the table of corresponding GFF values. The yellow rectangle lists the GFF for the Enterobacteriales branch. Red spots in the tree represent specific organisms and green spots represent branching points.

Briefly, the rooting strategy uses the taxonomy to determine the tree topology and GFF to determine the branch order. Figure 3 uses Gammaproteobacteria as the example: given the hierarchical structure for 11 organisms (some taxonomic levels have been omitted), we start with the organisms in the deepest taxon (Escherichia): Escherichia coli K12 (Eco), Escherichia coli O157:H7 (Ecs) and Escherichia coli O157:H7 EDL933 (Ecz). The weighted GFF data (table at the upper right of Fig. 3) show that

Ecs<Ecz<Eco

Therefore Ecs is used as the root of Escherichia when comparing Escherichia against the other branches at the same taxonomic level: Buchnera, Salmonella and Yersinia.
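
The recursive use of this rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the taxonomy is a hand-written nested tuple covering only the Enterobacteriales part of Fig. 3, the set `OLDER_THAN` encodes only the orderings quoted in the text for this example (Ecs<Ecz<Eco within Escherichia, and Buc<Ecs<Ype<Sty among the four branch representatives), and the function names are hypothetical. Where the strategy removes a branch whose root cannot be lined up, the sketch simply raises an error.

```python
# Pairwise outcomes written as (older, younger), expanded from the chains
# Ecs<Ecz<Eco and Buc<Ecs<Ype<Sty quoted in the text. Only the signs are
# used; the weighted GFF values of Fig. 3 are not reproduced here.
OLDER_THAN = {
    ("Ecs", "Ecz"), ("Ecz", "Eco"), ("Ecs", "Eco"),
    ("Buc", "Ecs"), ("Ecs", "Ype"), ("Ype", "Sty"),
    ("Buc", "Ype"), ("Buc", "Sty"), ("Ecs", "Sty"),
}

# A taxon is either a species code (str) or a (name, children) tuple.
# This hand-written fragment is an assumption made for illustration.
ENTEROBACTERIALES = (
    "Enterobacteriales",
    [
        ("Escherichia", ["Eco", "Ecs", "Ecz"]),
        ("Buchnera", ["Buc"]),
        ("Salmonella", ["Sty"]),
        ("Yersinia", ["Ype"]),
    ],
)


def line_up(candidates):
    """Return candidates ordered oldest first, or None if no strict order exists."""
    n = len(candidates)
    # For each candidate, count how many of the others it is older than.
    older_count = {
        c: sum((c, o) in OLDER_THAN for o in candidates if o != c)
        for c in candidates
    }
    if sorted(older_count.values()) != list(range(n)):
        return None
    return sorted(candidates, key=lambda c: -older_count[c])


def root_of(taxon):
    """Recursively pick the representative (oldest) species of a taxon."""
    if isinstance(taxon, str):
        return taxon
    name, children = taxon
    roots = [root_of(child) for child in children]
    order = line_up(roots)
    if order is None:
        raise ValueError(f"roots of {name} cannot be lined up: {roots}")
    return order[0]


if __name__ == "__main__":
    print(root_of(ENTEROBACTERIALES))
```

The taxonomy fixes which groups are compared at each level, and the pairwise comparisons only decide the order within a level, which mirrors the division of labor between taxonomy and GFF in the strategy.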
Because each of the other three branches has only one organism, that organism is used directly as the root of its branch. The weighted GFF of the roots of the four branches shows that:

Buc<Ecs<Ype<Sty

It is therefore concluded that Buchnera is the “oldest” branch, and its root, Buchnera sp. (Buc), is used as the root of the Enterobacteriales branch. This strategy, similar to the neighbor-joining method, can be applied recursively to find the root of Gammaproteobacteria, which again is Buc. The corresponding rooted Gammaproteobacteria tree is drawn at the bottom right of Fig. 3. It is worth pointing out that Haemophilus influenzae (Hin) and Pasteurella multocida (Pmu) in the Oceanospirillales branch cannot be distinguished by GFF, as their GFF=0. However, Pmu is used to represent the root of Oceanospirillales because, unlike Hin, it can be uniquely ordered in the Gammaproteobacteria tree, so that all root species of Gammaproteobacteria can be lined up. In addition, the GFF of all species within the Enterobacteriales branch (yellow rectangle in Fig. 3) exhibits the same distinctive pattern as shown in both Table 1 and Table 2, which indicates that GFF is insensitive to the taxonomy within a certain range of phylogenetic breadth, in contrast to its sensitivity to the split between the Euryarchaeota and Crenarchaeota branches.

On the other hand, if the root of one branch cannot be lined up with the roots of the other branches, the branch is removed from the tree, as its order cannot be determined from GFF alone. This can happen when not enough data are available for a given branch. For example, if we did not have the genome data for Buc, Ecs and Ecz, then Eco would be used as the root of the Enterobacteriales branch; however, the GFF data in Fig. 3 show that Sty>Eco>Pae while Pae>Sty. In this case, the Enterobacteriales branch would have to be removed from the Gammaproteobacteria tree.

Rooted Tree of Life

The tree of life with 62 species is sketched in Fig.
2 with the rRNA universal tree reproduced at the bottom right corner for comparison. The number in each yellow circle in the GFF-rooted tree corresponds to the branch number in the rRNA tree. Branch length and orientation carry no meaning; only the vertical branch order of the sketched universal tree is meaningful. In general, for illustration purposes, each species (red spot) is positioned so that its relative vertical position reflects the corresponding GFF within the branch, and likewise for the branch order (green spots). The tree is strictly rooted using the strategy presented above, except for the order of the Archaea branch among the three domains, which is discussed later.

Of the 70 species analyzed in this study, eight were removed from the tree shown in Fig. 2. The Actinobacteria branch was removed from the Bacteria stem because its root, Corynebacterium glutamicum (Cgl), cannot be lined up with the other seven branch roots, although the four species within it line up well. The Cyanobacteria branch and the Deinococcus-Thermus branch were both removed from the Bacteria stem for the same reason. The Microsporidia branch was removed from the Fungi stem owing to the surprisingly small total number of multidomain proteins in Ecu (Table 1). Apart from the branches removed above, our study gives the Bacteria stem seven ordered branches and places Spirochaetes at the bottom of the tree. Interestingly, the universal tree based on a combined data set of 23 proteins from 45 species also placed Spirochaetes as the lowest-branching Bacteria[4]. More interestingly, the same study places Aquificae as the lowest branch after removing 9 HGT proteins, while in our study Aquificae branches out from the Bacteria stem immediately after Spirochaetes. Spirochaetes as the lowest branch of the Bacteria stem is consistent with hypotheses of extensive horizontal gene transfer between Spirochaetes and Archaea.
For example, Archaea and Spirochaetes share a novel class I-type lysyl-tRNA synthetase to the exclusion of eukaryotes and most other bacteria[39]. In addition, the spirochetes are notable for the number of cases in which their aminoacyl-tRNA synthetases are of the archaeal and eukaryotic genre. On the other hand, the closeness of Aquificae and Thermotogae to the basal position of the Bacteria stem is consistent with hypotheses of a thermophilic origin of life for prokaryotes[40]. The GFF-rooted tree also moves the Firmicutes close to the root of the Bacteria domain, which is supported by a review of protein phylogenies and signature sequences concluding that the root of the prokaryotic tree lies between archaebacteria and gram-positive bacteria (Firmicutes)[41].

Compared with the rRNA universal tree, both trees agree that Firmicutes branch out from the Bacteria stem earlier than Proteobacteria. However, our results suggest that Thermotogae branches out after the Firmicutes, while in the rRNA tree it branches out earlier than Firmicutes. This difference is possibly an artifact of our method, similar to LBA, caused by the limited number of species in the corresponding branch. Recall that the Thermococci branch in the Euryarchaeota stem is correctly moved down to the bottom by the GFF of Pab (without the GFF of Pab, our strategy would put the Thermococci branch between the Thermoplasmatales and Methanobacteria branches, which is inconsistent with the rRNA tree). Since the Thermotogae branch contains only one species, Thermotoga maritima (Tma), it is quite possible that an as yet undiscovered organism (or one discovered but lacking a whole-genome sequence) in the Thermotogae branch would correctly move it down below the Firmicutes.
A similar hypothesis can also explain why the GFF-rooted tree may incorrectly indicate that Euryarchaeota branch out earlier than Crenarchaeota, although the view that Crenarchaeota is the ancestral lineage of the two archaebacterial groups is not supported by the signature sequence present in the EF-1α/Tu protein[41, 42]. A similar reason may also account for the different branch order of animals (Metazoa) and fungi between the rRNA tree and the GFF-rooted tree.

Table 3. Comparison of weighted GFF between three domains

        Bbu     Aae     Sau     Tma     Jhp     Fnu     Ctr     Pab     Mja     Mka     Afu     Tvo     Sce     Spo   Total
Spo   0.186   0.179    0.39   0.457   0.228   0.059       0   0.312   0.248   0.305   0.132   0.426   0.198       0      86
Sce   0.046   0.195    0.43   0.239   0.346  -0.016   0.021   0.465   0.084   0.333   0.298   0.633       0  -0.198      76
Tvo  -0.121  -0.169   -0.13  -0.254   -0.16  -0.306  -0.143   0.178   0.185   0.222   0.023       0  -0.633  -0.426      22
Afu   0.105  -0.157    0.15    0.17   0.065  -0.139  -0.141   0.227   0.037   0.021       0  -0.023  -0.298  -0.132      65
Mka   0.047   0.174   -0.04  -0.027    -0.1  -0.268  -0.192   0.255    0.08       0  -0.021  -0.222  -0.333  -0.305      32
Mja   0.037     0.3    -0.1  -0.119       0  -0.215  -0.159   0.485       0   -0.08  -0.037  -0.185  -0.084  -0.248      43
Pab  -0.118  -0.367   -0.12  -0.156  -0.275  -0.274  -0.233       0   -0.49  -0.255  -0.227  -0.178  -0.465  -0.312      23
Ctr   0.129   0.211    0.18   0.131   0.083   0.057       0   0.233   0.159   0.192   0.141   0.143  -0.021       0      20
Fnu    0.23   0.207    0.19   0.154   0.256       0  -0.057   0.274   0.215   0.268   0.139   0.306   0.016  -0.059      50
Jhp   0.308   0.062    0.09   0.058       0  -0.256  -0.083   0.275       0     0.1  -0.065    0.16  -0.346  -0.228      28
Tma   0.192   0.077     0.1       0  -0.058  -0.154  -0.131   0.156   0.119   0.027   -0.17   0.254  -0.239  -0.457      41
Sau    0.29   0.063       0  -0.101  -0.093  -0.185  -0.179   0.123   0.099   0.044  -0.146   0.125  -0.433  -0.389      58
Aae    0.25       0   -0.06  -0.077  -0.062  -0.207  -0.211   0.367    -0.3  -0.174   0.157   0.169  -0.195  -0.179      37
Bbu       0   -0.25   -0.29  -0.192  -0.308   -0.23  -0.129   0.118   -0.04  -0.047  -0.105   0.121  -0.046  -0.186      11

The GFF data between the three domains are shown in Table 3; the full data set can be found at reference[36].
The yellow rectangle represents the Eukaryota domain, the cyan rectangle represents the Archaea domain (we used the Euryarchaeota branch) and the green rectangle represents the Bacteria domain. It can be seen that almost all organisms serving as branch roots in the Archaea and Bacteria domains are more “ancient” than the Eukaryota domain represented by Sce and Spo, although our rooting strategy only requires Pab, Bbu<Sce. This result clearly distinguishes the Eukaryota domain from both the Bacteria and Archaea domains and puts the root of the universal tree between the Bacteria and Archaea domains. In addition, our study puts the root of Eukaryota near the Fungi branch, which is in line with results from other research groups[43], including one that used a single fusion event to root the eukaryotic tree[44].

Meanwhile, the difference between the Bacteria and Archaea domains is not so clear from the GFF data in Table 3. In fact, the GFF of Pab indicates that the Archaea domain should branch out earlier than the Bacteria domain under our rooting strategy, which is clearly inconsistent with the rRNA tree. Nevertheless, we still place the Archaea domain between Eukaryota and Bacteria, using a dashed line for the Archaea branch to indicate the possible problem with the current GFF data sets, for the following reasons. First, the inconsistency in the GFF data is weak: apart from Pab, other GFF data in the Archaea domain, such as those of Mja and Mka, show that Archaea does branch later than Bacteria. Second, it may not be correct to use Pab as the root of Archaea, given the lack of genome data for organisms in the Crenarchaeota branch. Furthermore, a recent study has shown that the new archaeal kingdom Nanoarchaeota branches even earlier than Crenarchaeota[45]. Therefore the inconsistency in the branch order of the Archaea domain may be introduced by the wrong archaeal root. We expect that more genome data in the Archaea domain will resolve this puzzle.

The tree shown in Fig.
2 is merely a demonstration of the rooting strategy we propose in this paper. The tree topology provided by the taxonomy will certainly affect the rooting results. For example, if we place the Archaea as the sister group of Bacteria, then the inconsistency of the GFF data for the Archaea domain mentioned above simply disappears, because the ancientness of Archaea relative to Bacteria as determined by GFF naturally explains the close relationship between Archaea and Eukaryota. This archaebacterial/eubacterial sisterhood is in fact the second most promising topology of the rooted universal tree based on a broader survey of homologous genes in the three domains (Fig. 7 in reference[9]) and is also supported by the phylogenetic analysis of the SRP54 sequences[5]. All these arguments imply that the prokaryotes are a natural group.

Conclusion

The evolutionary process is irreversible. Nevertheless, it may leave some evidence that lets us trace back its history. The GFF of multidomain proteins has been shown to be one good marker of such evidence. The rooting strategy based on GFF is systematic: it can be used to root any tree as long as the organisms within the tree have a certain number of multidomain proteins. The rooting strategy is also extensible: including more organisms will make the tree more accurate. Our conclusion not only presents a novel and promising rooting alternative, but also reveals the close correlation between evolution and the accretion of multidomain proteins. Furthermore, the fact that the tree can be rooted using GFF indirectly supports the proposal of a universal tree of life initiated by Darwin, despite the confusion, introduced mainly by HGT and LBA, which has led to the suggestion that life should be treated as a network rather than a single tree[3, 7].

References

1. Woese, C.R., Bacterial evolution. Microbiol Rev, 1987. 51(2): p. 221-71.
2. Darwin, C., F. Burkhardt, and S. Smith, The correspondence of Charles Darwin. Vol. 6. 1990, Cambridge [Cambridgeshire]; New York: Cambridge University Press. 1856-1857.
3. Doolittle, W.F., Phylogenetic classification and the universal tree. Science, 1999. 284(5423): p. 2124-9.
4. Brown, J.R., et al., Universal trees based on large combined protein sequence data sets. Nat Genet, 2001. 28(3): p. 281-5.
5. Brinkmann, H. and H. Philippe, Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol Biol Evol, 1999. 16(6): p. 817-25.
6. Kurland, C.G., B. Canback, and O.G. Berg, Horizontal gene transfer: a critical view. Proc Natl Acad Sci U S A, 2003. 100(17): p. 9658-62.
7. Pennisi, E., Is it time to uproot the tree of life? Science, 1999. 284(5418): p. 1305-7.
8. Felsenstein, J., Cases in which parsimony and compatibility methods will be positively misleading. Systematic Zoology, 1978. 27: p. 401-410.
9. Doolittle, W.F. and J.R. Brown, Tempo, mode, the progenote, and the universal root. Proc Natl Acad Sci U S A, 1994. 91(15): p. 6721-8.
10. Iwabe, N., et al., Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc Natl Acad Sci U S A, 1989. 86(23): p. 9355-9.
11. Baldauf, S.L., J.D. Palmer, and W.F. Doolittle, The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proc Natl Acad Sci U S A, 1996. 93(15): p. 7749-54.
12. Brown, J.R. and W.F. Doolittle, Root of the universal tree of life based on ancient aminoacyl-tRNA synthetase gene duplications. Proc Natl Acad Sci U S A, 1995. 92(7): p. 2441-5.
13. Mathews, S. and M.J. Donoghue, The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science, 1999. 286(5441): p. 947-50.
14. Embley, T.M. and R.P. Hirt, Early branching eukaryotes? Curr Opin Genet Dev, 1998. 8(6): p. 624-9.
15. Hirt, R.P., et al., Microsporidia are related to Fungi: evidence from the largest subunit of RNA polymerase II and other proteins. Proc Natl Acad Sci U S A, 1999. 96(2): p. 580-5.
16. Hilario, E. and J.P. Gogarten, Horizontal transfer of ATPase genes--the tree of life becomes a net of life. Biosystems, 1993. 31(2-3): p. 111-9.
17. Philippe, H. and P. Forterre, The rooting of the universal tree of life is not reliable. J Mol Evol, 1999. 49(4): p. 509-23.
18. Wolf, Y.I., A.S. Kondrashov, and E.V. Koonin, Interkingdom gene fusions. Genome Biol, 2000. 1(6): p. RESEARCH0013.
19. Yanai, I., Y.I. Wolf, and E.V. Koonin, Evolution of gene fusions: horizontal transfer versus independent events. Genome Biol, 2002. 3(5): p. research0024.
20. Goffeau, A., et al., Life with 6000 genes. Science, 1996. 274(5287): p. 546, 563-7.
21. Myers, E.W., et al., A whole-genome assembly of Drosophila. Science, 2000. 287(5461): p. 2196-204.
22. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001. 409(6822): p. 860-921.
23. Pennisi, E., Genomics. Sequence tells mouse, human genome secrets. Science, 2002. 298(5600): p. 1863-5.
24. Venter, J.C., et al., The sequence of the human genome. Science, 2001. 291(5507): p. 1304-51.
25. Koonin, E.V., Y.I. Wolf, and G.P. Karev, The structure of the protein universe and genome evolution. Nature, 2002. 420(6912): p. 218-23.
26. Marcotte, E.M., et al., Detecting protein function and protein-protein interactions from genome sequences. Science, 1999. 285(5428): p. 751-3.
27. Woese, C.R., On the evolution of cells. Proc Natl Acad Sci U S A, 2002. 99(13): p. 8742-7.
28. Whitehead, A.N., Process and reality, an essay in cosmology; Gifford lectures delivered in the University of Edinburgh during the session 1927-28. 1929, New York, Cambridge, Eng.: The Macmillan Company; University Press. xii, 547.
29. Chothia, C., et al., Evolution of the protein repertoire. Science, 2003. 300(5626): p. 1701-3.
30. Yanai, I., A. Derti, and C. DeLisi, Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proc Natl Acad Sci U S A, 2001. 98(14): p. 7940-5.
31. Reizer, J. and M.H. Saier, Jr., Modular multidomain phosphoryl transfer proteins of bacteria. Curr Opin Struct Biol, 1997. 7(3): p. 407-15.
32. Trifonov, E.N., et al., Distinct stages of protein evolution as suggested by protein sequence analysis. J Mol Evol, 2001. 53(4-5): p. 394-401.
33. Tatusov, R.L., et al., The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 2003. 4(1): p. 41.
34. Tatusov, R.L., E.V. Koonin, and D.J. Lipman, A genomic perspective on protein families. Science, 1997. 278(5338): p. 631-7.
35. Bullerwell, C.E., et al., A comparison of three fission yeast mitochondrial genomes. Nucleic Acids Res, 2003. 31(2): p. 759-68.
36. Supporting_materials, http://bioinfo.bu.edu/GFF.
37. Woese, C.R., O. Kandler, and M.L. Wheelis, Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A, 1990. 87(12): p. 4576-9.
38. Wheeler, D.L., et al., Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res, 2004. 32(1): p. D35-40.
39. Ibba, M., et al., A euryarchaeal lysyl-tRNA synthetase: resemblance to class I synthetases. Science, 1997. 278(5340): p. 1119-22.
40. Woese, C.R., et al., Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiol Mol Biol Rev, 2000. 64(1): p. 202-36.
41. Gupta, R.S., Protein phylogenies and signature sequences: A reappraisal of evolutionary relationships among archaebacteria, eubacteria, and eukaryotes. Microbiol Mol Biol Rev, 1998. 62(4): p. 1435-91.
42. Rivera, M.C. and J.A. Lake, Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science, 1992. 257(5066): p. 74-6.
43. Baldauf, S.L., The deep roots of eukaryotes. Science, 2003. 300(5626): p. 1703-6.
44. Stechmann, A. and T. Cavalier-Smith, Rooting the eukaryote tree by using a derived gene fusion. Science, 2002. 297(5578): p. 89-91.
45. Waters, E., et al., The genome of Nanoarchaeum equitans: insights into early archaeal evolution and derived parasitism. Proc Natl Acad Sci U S A, 2003. 100(22): p. 12984-8.