The novel RpoB families discovered by the Global Ocean Survey (GOS) are distinct from those from NCLDVs
Dongying Wu, Jonathan A. Eisen
Abstract:
By analyzing protein-coding genes such as recA and rpoB from the Global Ocean Survey (GOS), we have identified novel deep branches on the tree of life that had not been discovered before. Meanwhile, it has been proposed that Nucleocytoplasmic Large DNA Viruses (NCLDVs) represent a fourth domain of life in addition to Bacteria, Archaea and Eukaryotes. By studying the phylogenies of major subfamilies of the DNA-directed RNA polymerase subunit II (RpoB), we confirm that the novel GOS RpoB families we discovered are quite distinct from other RpoB homologs, including those from NCLDVs. Our findings indicate the existence of deep-branching lineages of life that are waiting to be discovered by metagenomic studies.
Introduction:
In the 1970s, Carl Woese et al. discovered Archaebacteria as the third major branch on the tree of life, in addition to Bacteria and Eukaryotes, by analyzing small subunit rRNA (ss-rRNA) sequences [1]. Since then, ss-rRNA has been widely used as the universal phylogenetic marker for expanding our understanding of the tree of life. Small subunit rRNAs are present and conserved in all cellular lineages. Universal primers can be designed for PCR amplification of ss-rRNA genes in organisms from all three domains, so PCR has been widely used to sample rRNA genes in a culture-independent manner [2,3]. Accelerated by advances in sequencing technologies, environmental ss-rRNA PCR surveys have generated huge volumes of ss-rRNA sequences that have extended our knowledge of the evolutionary relationships among all cellular organisms [4].
As powerful as ss-rRNA PCR technology is, the small subunit rRNA still has some limitations as a phylogenetic marker. Even the best-designed universal primers are not truly universal, and from time to time they fail to amplify some lineages. Phylogenetic analysis based only on ss-rRNA can be misleading because of lateral gene transfer, different rates of evolution in different lineages, and convergent evolution in distantly related species [5,6,7,8]. Phylogenetic analysis of protein-coding genes has been developed but remains limited in scope because of the difficulty of PCR primer design [9]. The primers for PCR amplification of a highly conserved protein domain across different species have to be degenerate, so large-scale surveys of protein-coding genes are not practical.
Metagenomics, the direct sequencing of organisms present in an environment, generates sequence data in a relatively unbiased manner [10]. Metagenomics not only addresses the ss-rRNA PCR bias problem, but also provides a robust way of sampling protein-coding genes. Metagenomics-based phylogenetic characterization of entire communities has been exemplified by our analysis of the Sargasso Sea metagenomic data using both protein-coding and rRNA sequences in phylotyping [11]. Furthermore, by analyzing protein-coding phylogenetic marker genes such as recombinase A (recA) and DNA-directed RNA polymerase beta subunit (rpoB) in the Global Ocean Survey (GOS) sequencing data [12,13], we have identified novel deep branches on the tree of life that had not been discovered before [14].
The novel RecA and RpoB homologs we identified in the metagenomic data set could be from uncharacterized viruses or could represent ancient paralogs. It is also possible that they are from major, deeply branching lineages on the tree of life that have so far escaped detection [14]. Coincidentally, Boyer et al. hypothesize that there exists a fourth domain of life in addition to the known Bacteria, Archaea and Eukaryotes [15].
They proposed that this novel domain includes the nucleocytoplasmic large DNA viruses (NCLDVs). NCLDVs are viruses of eukaryotes and include poxviruses, asfarviruses, asco-iridoviruses, phycodnaviruses, marseilleviruses and mimiviruses [15,16]. By combining phylogenetic trees built from marker genes such as RpoB homologs with the study of phyletic patterns, Boyer et al. propose that NCLDVs represent a distinct domain of life [15]. Apart from poxviruses, many NCLDVs were not included in the RpoB superfamily tree built in our GOS study [14]. Some of the NCLDV RNA polymerase II gene sequences were not available at the time of our analysis, while others were excluded because they were singletons at the RpoB gene family classification step [14]. We therefore carried out a new study of RpoB homologs, including the novel GOS RpoB families and the latest available NCLDV RNA polymerase II sequences, to determine the relationships between the novel GOS lineages and NCLDVs.
Methods:
As previously described [14], the Lek clustering algorithm was used to cluster 784 RpoB homologs from the NRAA database and published microbial genomes together with 1875 RpoB homologs from the GOS data set. A total of 1816 GOS sequences and 778 RpoB homologs from NRAA and microbial genomes fell into 17 clusters containing more than two members [14]. Representative amino acid sequences from each of the RpoB subfamilies were selected manually for tree building. In addition, NCLDV RNA polymerase II beta subunits with the following NCBI GI numbers were selected: 282935501, 9628161, 73852908, 116326824, 134287220, 115298554, 15079139, 62421225, 13358406, 56692710, 45686052, 311977620 and 310831534 [14,15]. The RpoB homologs were aligned with MUSCLE [17]. The alignments were examined, and columns with more than 80% gaps, as well as very poorly aligned columns, were trimmed to ensure alignment quality. A maximum likelihood tree was built from the trimmed alignments using PHYML [18]. The JTT substitution model was applied, with both the proportion of invariable sites and the gamma distribution parameter estimated by PHYML. Bootstrap values were calculated from 100 replicates.
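The gap-based part of the trimming step can be illustrated with a minimal Python sketch. It is not the script used in this study: the function name trim_gappy_columns, the dictionary input format and the toy sequences are assumptions made for illustration, and the manual removal of poorly aligned columns is not reproduced. Only the rule stated above (drop columns with more than 80% gaps) is implemented.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.8, gap_chars="-."):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    `alignment` maps sequence names to aligned strings of equal length
    (a hypothetical input format chosen for this sketch).
    """
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    rows = list(alignment.values())
    n_seqs = len(rows)
    # Keep a column unless strictly more than max_gap_fraction of it is gaps.
    keep = [
        col
        for col in range(n_cols)
        if sum(row[col] in gap_chars for row in rows) / n_seqs <= max_gap_fraction
    ]
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}


# Toy alignment (invented sequences): the third column is all gaps and is removed;
# the last column is 60% gaps and is retained.
toy = {
    "seqA": "MK-LV-",
    "seqB": "MK-LVA",
    "seqC": "M--LV-",
    "seqD": "MR-IVA",
    "seqE": "MK-LV-",
}
trimmed = trim_gappy_columns(toy)
for name, seq in trimmed.items():
    print(name, seq)
```

The threshold is applied as a strict "more than" test, so a column with exactly 80% gaps would be kept under this reading of the rule.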
Results:
From the RpoB superfamily tree we built, we identify the following major clades: Bacteria and plastids (RpoB); Archaea; Eukaryotes (Rpa2, Rpb2, Rpc2); yeast linear plasmids (killer plasmids); poxviruses; asco-iridoviruses; and two unknown lineages that contain only GOS sequences. Mimivirus, marseillevirus, phycodnavirus, asfarvirus and the Cafeteria roenbergensis giant virus are dispersed as deep branches on the tree with very weak bootstrap support (Figure 1).
The two novel families of GOS RpoB are not close to any known RpoB homologs, including NCLDV RNA polymerases. Both are deep-branching lineages in the phylogenetic tree. We previously thought that Unknown 2 might represent a eukaryotic or archaeal lineage [14]. Given its position in the current RpoB tree, we think a viral origin of the Unknown 2 family is also a possibility. The origin of Unknown 1 remains uncertain.
The Nucleocytoplasmic Large DNA Viruses do not form a distinct clade in our study. Our tree is inconsistent with the RpoB tree built by Boyer et al., which served as evidence that NCLDVs are a fourth domain of life [15]. The difference may arise because we included a broader range of RpoB family members, with the addition of killer plasmid homologs and GOS-only novel families [14]. Different alignment masking may also contribute to the difference between the topologies of the two trees. As a point of caution, an RpoB phylogenetic tree alone cannot tell whether NCLDVs are from one domain or not, because the sequences of the NCLDV RNA polymerases are so diverse among themselves and so different from the others that they are vulnerable to long branch attraction in tree building [19].
Regardless of the phylogeny of NCLDVs, the GOS-only RpoB families still stand as novel deep-branching clades. They could be from novel uncharacterized viruses; if so, they would extend our understanding of the diversity within the viral world. We are also entertaining the possibility that these novel sequences are from unknown major branches of cellular organisms on the tree of life. We are exploring more metagenomic data with additional phylogenetic markers to expand our understanding of the phylogenetic space of life, because metagenomic studies have proved to be an effective way to discover novel lineages.
Figure Legends:
Figure 1. Maximum likelihood tree of representative RpoB superfamily members. A PHYML tree was built for members of the two novel RpoB subfamilies that contain only GOS sequences (cyan), RpoB representatives from Bacteria (purple), and RNA polymerase II beta subunits from Archaea (green), Eukaryotes (blue) and Nucleocytoplasmic Large DNA Viruses (NCLDV, red). The number preceding each non-NCLDV sequence accession is the Lek cluster identification number [14]. The tree was built with PHYML [18] and drawn with Dendroscope [20]. Bootstrap values are based on 100 replicates.
References:
1. Woese CR, Fox GE (1977) Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 74: 5088-5090.
2. Medlin L, Elwood HJ, Stickel S, Sogin ML (1988) The characterization of enzymatically amplified eukaryotic 16S-like ribosomal RNA-coding regions. Gene 71: 491-500.
3. Weisburg W, Barns S, Pelletier D, Lane D (1991) 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 173: 697-703.
4. Hugenholtz P, Pitulle C, Hershberger KL, Pace NR (1998) Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol 180: 366-376.
5. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, et al. (2005) Opinion: Reevaluating prokaryotic species. Nat Rev Microbiol 3: 733-739.
6. Achtman M, Wagner M (2008) Microbial diversity and the genetic nature of microbial species. Nat Rev Microbiol 6: 431-440.
7. Beiko RG, Doolittle WF, Charlebois RL (2008) The impact of reticulate evolution on genome phylogeny. Syst Biol 57: 844-856.
8. Hasegawa M, Hashimoto T (1993) Ribosomal RNA trees misleading? Nature 361: 23.
9. Sandler SJ, Hugenholtz P, Schleper C, DeLong EF, Pace NR, et al. (1999) Diversity of radA genes from cultured and uncultured archaea: comparative analysis of putative RadA proteins and their use as a phylogenetic marker. J Bacteriol 181: 907-915.
10. Morgan JL, Darling AE, Eisen JA (2010) Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 5: e10209.
11. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66-74.
12. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. (2007) The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol 5: e77.
13. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families. PLoS Biol 5: e16.
14. Wu D, Wu M, Halpern A, Rusch D, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE: Submitted.
15. Boyer M, Madoui M-A, Gimenez G, La Scola B, Raoult D (2010) Phylogenetic and Phyletic Studies of Informational Genes in Genomes Highlight Existence of a 4th Domain of Life Including Giant Viruses. PLoS ONE 5: e15530.
16. Iyer LM, Balaji S, Koonin EV, Aravind L (2006) Evolutionary genomics of nucleocytoplasmic large DNA viruses. Virus Res 117: 156-184.
17. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113.
18. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696-704.
19. Dacks JB, Marinets A, Ford Doolittle W, Cavalier-Smith T, Logsdon JM, Jr. (2002) Analyses of RNA Polymerase II genes from free-living protists: phylogeny, long branch attraction, and the eukaryotic big bang. Mol Biol Evol 19: 830-840.
20. Huson DH, Richter DC, Rausch C, Dezulian T, Franz M, et al. (2007) Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 8: 460.