Comparative and evolutionary analysis of genomes from Rickettsia-related endosymbionts

B. Franz Lang
Tom Doak, Michael Lynch, Hans-Dieter Görtz, Henner Brinkmann, Hervé Philippe and G. Burger

My principal interest is in mitochondria: their genes, proteins, and functions, and where they come from. This implies that I am most interested in bacteria whose ancestors gave rise to mitochondria. They, like bacterial endosymbionts, undergo continual and very rapid evolutionary change.

Life isn't static; it evolves, and so do pathways. Hence my interest in how pathways change, get damaged and repaired (sometimes by adopting alien genes), and are sometimes eliminated over evolutionary time.

Following the evolution of Rickettsia-related endosymbionts provides a model for what occurred when mitochondria entered the eukaryotic cell. The basis for inference is broadly sampled, complete and accurate genome sequences and annotations, used to map the evolution of pathways onto a phylogenetic tree (which has to be correct with high confidence).

As you will notice, this is a massive undertaking, and much of what I talk about is work in progress.

Outline of presentation

I will discuss the following:
• what are Rickettsia-like bacteria - history
• why certain of them are more interesting than others
• conceptual view of mitochondrial and eukaryotic origins
• phylogenetic concepts, how to infer biologically meaningful (correct) phylogenies
• results from phylogenomic analyses to locate mitochondrial and rickettsial origins
• annotating genes and assigning EC numbers with AutoFact
• some results on the highly reduced Holospora
• using pathway hole inference to update annotations, and the continuing problem of genome sequence quality

Rickettsia, Wolbachia, Ehrlichia, Orientia … history

Rickettsia-like bacteria (including Wolbachia, Ehrlichia, Orientia …) are well-known obligate intracellular pathogens of animals that undergo progressive, reductive genome evolution.
Instead of producing all metabolites themselves, they take some from their host, continuously inventing new transporters (even for stealing ATP). In turn, we suspect that they produce certain components (e.g., biotin) that are shared with the host, creating a sort of perverse dependence on the intruder.

Investigating their genetics and related functional pathways at a truly biochemical level is technically very difficult; there is no bacterial model that allows effective lab work. Inferences are almost all paper biochemistry based on genome sequences (in the foreseeable future also including transcriptome data). Unfortunately, once introduced, erroneous interpretations and annotations are copied and perpetuated.

Holospora, Caedibacter, Ichthyophthirius … history

The search for Rickettsia-like models that are more easily investigated has led to endosymbionts of unicellular eukaryotes (ciliates), such as Holospora and Caedibacter. Other, more recent additions are the fish pathogen Ichthyophthirius, which causes 'sudden, catastrophic death of aquarium fish', and an endosymbiont of Stachyamoeba (an amoeboid, excavate protist) that we found and sequenced just two weeks ago.

The molecular basis of cellular infection by Holospora has been intensely studied in the H.-D. Görtz and M. Fujishima labs, hence our interest in sequencing its genome. Holospora invades ciliates via the food vacuole, escapes into the cytoplasm, and enters one of the nuclei (micro- or macronucleus), where it propagates.

This project was interesting to us because of the potential for comparative genome analyses among several known endosymbionts and for their phylogenomic analysis in conjunction with mitochondria (i.e., identification of mitochondrial origins).

A partial Holospora genome sequence is analyzed in Lang et al. (2005) Jpn. J. Protozool.
38: 171-181

Holospora genome – history

Phylogenetic position of Holospora: at the base of the Rickettsia/Ehrlichia/Wolbachia cluster of animal endosymbionts, which together branch at the base (outside) of the α-Proteobacteria. Yet: might a long-branch attraction (LBA) artifact be moving them toward the distant outgroup?

In this paper, we reached the conclusion that the Holospora genome is much more derived than those of its endosymbiont neighbors. At a bit more than 1 Mbp, it has lost a number of cellular functions and pathways, including oxidative phosphorylation, and with it most of the key genes used for inferring the evolutionary origin of mitochondria. It further contains a high number of insertion elements, which makes genome assembly very difficult.

Conclusion: finish the genome sequence, but also find Holospora relatives that are minimally derived and slowly evolving.

History end – new start

In the end, for lack of funds, the Holospora genome remained uncompleted until, more recently, the M. Lynch group came to our rescue. They are now very close to completing Holospora obtusa (~1.4 Mbp, linear), and are getting close with Caedibacter caryophila as well. Likewise, we are about to complete the genome sequence of the Stachyamoeba endosymbiont (~1.8 Mbp).

On the origin of mitochondria and bacterial endosymbionts

The symbiotic introduction of mitochondria is a key event in eukaryotic evolution: a sizeable contribution of genetic material (~10% or more, depending on the species), essential for understanding the nature of the eukaryotic cell. It occurred a billion or more years ago, so phylogenetic inferences aimed at resolving eukaryotic origins are exceedingly difficult.

The origin of mitochondria and Rickettsia-like bacteria is somewhere close to (but not within) α-Proteobacteria. Yet our insights remain plagued by phylogenetic artifacts; published analyses are poor if not misleading.
Genome sequences from diverse bacterial species (among them Holospora and Caedibacter) are the most promising way to overcome the current impasse.

Obtaining statistically significant and biologically meaningful results requires broad taxon sampling and data from preferentially slowly evolving species:
• minimally derived mtDNAs plus nuclear genomes from, for instance, relatives of jakobid flagellates (e.g., Reclinomonas americana), and
• a large variety of genomes from free-living α-Proteobacteria and endosymbionts close to mitochondria.

What are jakobids? (e.g., Reclinomonas americana and Andalucia godoyi)

Why R. americana and A. godoyi?

A nuclear genome project is ongoing for Reclinomonas and in preparation for Andalucia. Among jakobids, they have the slowest-evolving mt sequences, and Andalucia has even a few more mt genes than the previous record set by Reclinomonas.

How to infer biologically meaningful (correct) phylogenies

To obtain statistically significant and biologically meaningful results, use the most realistic phylogenetic models (CAT …), ideally derived from and adapted to the data to be analyzed (CAT+GTR). This requires a lot of sequence data: multiple gene or protein sequences.

What is CAT/PhyloBayes?

We know that many a.a. sequence positions have specific profiles that do not fit global evolutionary models such as WAG.

(Figure: examples of site-specific amino-acid profiles: A/S, A/S/T, A/P, A/N/Q)

CAT (PhyloBayes) models this site-wise heterogeneity. Its use increases phylogenetic signal and reduces the impact of artifacts (e.g., LBA). Even better, CAT+GTR infers profiles from the data (yet it is very slow …).

Value of mitochondrial genes in phylogenetics

Extant eukaryotic lineages all have (or sometimes had) mitochondria, a parallel genetic universe with distinct phylogenetic markers. This provides a comparative view and confirmation of nuclear gene phylogenies, back to the time point where the mitochondrial endosymbiont was introduced.
Moreover, it allows known bacterial relatives to be identified.

Although the bacterium-derived mitochondrial genome is small (13 to 30 protein genes), nuclear genes of clearly α-proteobacterial origin and with evident mitochondrial function may be added.

(Figure annotation: ~3,300 > 10,000 a.a.)

Problem: nuclear genomes are hybrid monsters containing genes transferred from organelles, and more!

But nuclear genomes are wildly mosaic!
- Organelle genomes undergo massive gene loss, plus transfer to the nucleus.
- Nuclear genomes therefore include proteobacterial (or cyanobacterial) genes.
In addition, nuclear genes may be acquired by lateral transfer from various sources.

The challenge: incongruent gene/genome/species phylogenies, often difficult to identify and resolve.

Eukaryote-eukaryote endosymbiosis further increases genomic mosaicism?

(Figure from Keeling et al. 2004)

Are we misled by eukaryote-eukaryote endosymbioses?

Almost unavoidably! Phylogenies including data from stramenopiles, haptophytes, cryptophytes, chlorarachniophytes, or any other group of species with secondary plastids are a priori suspect; definitely so when phylogenies with plastid genes (including the nucleus-encoded ones) differ from other nuclear and mitochondrial gene trees.

For the planned analysis, we will therefore use only mitochondrial and nuclear genes from jakobids, a group without known photosynthetic members and with the highest number of mtDNA-encoded (i.e., unproblematic) genes.

(I) Origin of mitochondria from within Proteobacteria

As a start, we have analyzed genomes from all > 500 Proteobacteria at GenBank (i) to check whether the bacterial textbook topology (with rRNA data) is reproduced (it is), and (ii) to confidently identify and exclude genes that have a tendency for lateral transfer and/or are plagued by paralogy.

New unpublished data: Holospora, Caedibacter, Stachyamoeba-endo

About half of the analyzed genes are totally unproblematic; transporters are virtually always questionable, as are many of the tRNA synthetases.
Trees with paralogs and transferred genes removed are almost identical to trees with all proteins included; i.e., contrary to the belief of some, phylogenetic issues are minor even when genes with occasional transfers are not removed.

Phylogenomic analysis, α-Proteobacteria plus mitochondria

Dataset with 10,800 aligned a.a. positions, except for Holospora, which has about half; PhyloBayes analysis (CAT, GTR).

Endosymbionts and mitochondria branch together, but outside α-Proteobacteria. There is strong potential for an LBA artifact: these fast-evolving species (the only exception being Caedibacter) may be attracted (i) to each other and (ii) to the distant outgroup.

What happens when all fast-evolving species are removed?

Caedibacter and Stachyamoeba-endo now clearly branch within Rhodospirillales (confirmed with an independent dataset without the genes used here).

By inference, endosymbionts plus mitochondria potentially derive from within the Rhodospirillum/Magnetospirillum clade. But beware of more LBA artifacts, e.g., between mitochondria and Rickettsias!

Phylogenomic analysis, α-Proteobacteria plus mitochondria – what next?
• include more sequences from slowly evolving relatives of Caedibacter and many more free-living Rhodospirillales.
• apply better phylogenetic models, adapted to A+T-rich and fast-evolving genomes.
• eliminate fast-evolving (or heterotachous) sequence positions, which requires a much larger dataset (20,000-30,000 a.a.) to compensate for the loss of sequence information.
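The last point, eliminating fast-evolving positions, can be illustrated with a minimal sketch. It is not the procedure used for the analyses above: it ranks alignment columns by Shannon entropy as a crude, tree-free stand-in for site rate (real site rates are estimated on a phylogeny, and entropy does not capture heterotachy at all). The function names, the gap symbols, and the toy alignment are all hypothetical.

```python
from collections import Counter
from math import log2

GAP_CHARS = "-?X"  # assumption: symbols treated as missing data


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c not in GAP_CHARS]
    if not residues:
        return 0.0
    total = len(residues)
    return -sum((n / total) * log2(n / total) for n in Counter(residues).values())


def drop_most_variable_columns(alignment, n_drop):
    """Remove the n_drop highest-entropy columns from an alignment.

    alignment: dict of taxon name -> aligned sequence (all the same length).
    Returns (filtered alignment, list of kept column indices).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    entropy = [column_entropy([s[i] for s in seqs]) for i in range(length)]
    # Rank columns from most to least variable (the sort is stable for ties).
    ranked = sorted(range(length), key=lambda i: entropy[i], reverse=True)
    dropped = set(ranked[:n_drop])
    kept = [i for i in range(length) if i not in dropped]
    filtered = {name: "".join(seq[i] for i in kept) for name, seq in alignment.items()}
    return filtered, kept


# Toy alignment (invented data): columns 2 and 3 are the variable ones.
toy = {"taxonA": "MKAAL", "taxonB": "MKSTL", "taxonC": "MKAPL", "taxonD": "MKSNL"}
filtered, kept = drop_most_variable_columns(toy, n_drop=2)
print(kept)
print(filtered["taxonA"])
```

The sketch also shows why a larger dataset is needed: every removed column shortens the alignment, so the remaining signal has to come from more genes.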
(II) Analyzing genes and metabolic pathways

• initial prediction of protein-coding genes (e.g., Glimmer, or simply conceptual ORFs)
• re-annotation with AutoFact (BLAST against several reference databases such as UniRef, KEGG, COG, Pfam, and SMART, optimized by scoring; an HMM profile search instead of BLAST would be better and is under development)
• to gain sensitivity and be more certain in picking orthologs, it would be even better to combine AutoFact with comparative bacterial genome annotation that uses synteny information (e.g., MaGe at Genoscope, currently under exploration)
• extract relevant data from AutoFact as input for Pathway Tools (assign E.C. numbers)
• infer pathways, initial round

(Figure: example of AutoFact result, with EC number from KEGG)
(Figure: example of database collection including Holospora and Ich)
(Figure: example of Holospora pathway overview graph; mousing over objects provides details)
(Figure: example of Holospora biotin synthesis I pathway)

Analyzing genes and metabolic pathways, second round

• check for pathway holes; if a pathway is incomplete, recheck the presence of the respective genes in the genome (HMM profile searches for highest sensitivity)
• apply pathway comparisons among species to identify other potential inconsistencies; search for the missing genes in the genome sequence

Manual curation of this step would be overwhelming. We are therefore working on scripting and automation, using HMM profile searches. For this we need to build models from proteins of closely related reference bacteria (preferentially Rhodospirillales), hence the importance of knowing phylogenetic relationships and origins.

Analyzing genes and metabolic pathways … optimizing HMM profile models

For HMM profiles, one needs to start with a multiple alignment (e.g., MUSCLE).
We optimize this alignment with iterated rounds of hmmalign (criterion: best E-value), and then eliminate sequences that are too similar to each other, based on a phylogenetic distance matrix, which in the end further improves the sensitivity of the resulting HMM model. This approach works best when many sequences are available, hence the urgent need for more Rhodospirillales.

Continuing problems with sequence quality

When going through the process of finding missing genes, we noted that some had simply been missed, and that others contain frameshifts and were therefore not considered. Frameshifts might indicate that a species is on its way to dropping a function or a whole pathway, or that there is a sequencing error. Such errors were common with early Sanger technology and now resurge with pyrosequencing. For instance, in our current 454 project on Stachyamoeba-endo we have many potential errors in homopolymer stretches, and these are not flagged at all. A potential solution is adding Illumina sequences for error correction.

Conclusions

The Caedibacter/Holospora group of bacterial endosymbionts diverges from within Rhodospirillales, a deep divergence in α-Proteobacteria.

Caedibacter/Holospora diverge prior to mitochondria and the Rickettsia/Wolbachia/Ehrlichia (RWE) group of pathogens, with the new Stachyamoeba-endo as its most slowly evolving member.

Mitochondria appear to be a sister group to RWE endosymbionts. Yet this topology may be caused by a phylogenetic LBA artifact (?).

Holospora is highly derived and fast-evolving. It specifically lost oxidative phosphorylation but, curiously, retained both complete alternative pathways for biotin synthesis (a means for host dependence?).

Our results indicate a need for genome projects on broadly sampled relatives of Caedibacter and other slowly evolving endosymbionts, and on more free-living Rhodospirillales, to better resolve evolutionary relationships and the evolution of metabolic pathways.
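Returning to the homopolymer problem raised under "Continuing problems with sequence quality": since the assembly does not flag these stretches, a first pass could simply list them and check whether a candidate frameshift falls inside one. The following is a minimal sketch under stated assumptions, not part of our pipeline: the function names, the length threshold of 5, and the example sequence are all invented for illustration.

```python
import re


def find_homopolymer_runs(sequence, min_length=5):
    """List runs of a single base that are at least min_length long.

    Returns tuples (start, end, base, length); start is 0-based, end exclusive.
    min_length=5 is an arbitrary illustrative threshold.
    """
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.end(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(sequence.upper())]


def in_homopolymer(position, runs, flank=0):
    """True if a 0-based position lies within (or within flank bases of) a run."""
    return any(start - flank <= position < end + flank for start, end, _, _ in runs)


# Invented example sequence: six Ts starting at index 3, four As at index 12.
contig = "ACG" + "TTTTTT" + "ACG" + "AAAA" + "C"
runs = find_homopolymer_runs(contig)
print(runs)
print(in_homopolymer(5, runs), in_homopolymer(13, runs))
```

A frameshift that coincides with such a run is a candidate sequencing error rather than evidence of gene loss, and is the kind of position where Illumina reads would be most useful for correction.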
Lab members and collaborators

Michael Lynch, Tom Doak
Gertraud Burger, Lise Forget (Montreal)
Henner Brinkmann, Hervé Philippe (Montreal)
Andrew Roger, Alastair Simpson, Mike Gray (Halifax)
Iñaki Ruiz-Trillo (Barcelona)
… numerous others unnamed …

Thanks!

This work was possible thanks to generous and long-standing financial support by:
• Canadian Institutes of Health Research (CIHR)
• Canadian Institute for Advanced Research (CIfAR)
• Canada Research Chairs Program
• Genome Quebec/Atlantic/Canada
• Eusko Jaurlaritza / Gobierno Vasco

EST data: TBestDB

Thanks also to the National Human Genome Research Institute (NHGRI/NIH) for endorsing a multi-taxon genome sequencing initiative to gain insights into how multicellularity evolved. This initiative, the UNICellular Opisthokont Research iNitiative ('UNICORN'), will generate genomic data from some unicellular relatives of both animals and fungi.

G. Burger, M.W. Gray, P.W. Holland, N. King, B.F. Lang, A.J. Roger, I. Ruiz-Trillo

For more information see Ruiz-Trillo et al., Trends in Genetics 23 (2007). To use these data for analyses at the genomic level, please contact members of the UNICORN project, either for collaboration or for approval of use.

Status of genome projects

In sequencing pipeline or close to finished: Allomyces, Spizellomyces, Mortierella, Amoebidium, Sphaeroforma, Capsaspora, Amastigomonas, Proterospongia, Reclinomonas, Ministeria

DNA purification phase: Andalucia, Malawimonas