PPT - Bioinformatics Research Group at SRI International

Comparative and evolutionary analysis of
genomes from Rickettsia-related
endosymbionts
B. Franz Lang
Tom Doak, Michael Lynch, Hans-Dieter Görtz, Henner Brinkmann, Hervé
Philippe and G. Burger
My principal interest is in mitochondria: their genes, proteins,
and functions, and where they come from. I am therefore
most interested in bacteria whose ancestors gave rise to
mitochondria. Like bacterial endosymbionts, mitochondria undergo
continual, very rapid evolutionary change. Life isn't static; it
evolves, and so do pathways. Hence my interest in how they
change, get damaged and repaired (sometimes by adopting alien
genes), and are sometimes eliminated over evolutionary time.
Following the evolution of Rickettsia-related endosymbionts
serves as a model for what occurred when mitochondria entered the
eukaryotic cell.
The basis for inferences is broadly sampled, perfect genome
sequences and annotations, used to map the evolution of pathways onto
a phylogenetic tree (which has to be correct with high confidence).
As you will notice, this is a massive undertaking, and much of
what I talk about is work in progress.
Outline of presentation
I will discuss the following:
• what are Rickettsia-like bacteria - history
• why certain of them are more interesting than others
• conceptual view of mitochondrial and eukaryotic origins
• phylogenetic concepts, how to infer biologically meaningful
(correct) phylogenies
• results from phylogenomic analyses to locate mitochondrial
and rickettsial origins
• annotating genes and assigning EC numbers with AutoFact
• some results on the highly reduced Holospora
• using pathway hole inference to update annotations, and the
continuing problem of genome sequence quality
Rickettsia, Wolbachia, Ehrlichia, Orientia …
history
Rickettsia-like bacteria (including Wolbachia, Ehrlichia,
Orientia …) are well-known obligate intracellular pathogens
of animals that undergo progressive, reductive genome
evolution. Instead of producing all metabolites by
themselves, they take some from their host, continuously
inventing new transporters (even for stealing ATP). In turn,
we suspect that they produce certain components (e.g.,
biotin) that are shared with the host, creating some sort of
perverse dependence on the intruder.
Rickettsia, Wolbachia, Ehrlichia, Orientia …
history
Investigating their genetics and related functional pathways
at a truly biochemical level is technically most difficult;
there is no bacterial model that allows effective lab work.
Inferences are almost all paper biochemistry based on
genome sequences (in the future foreseeably including
transcriptome data). Unfortunately, once introduced,
erroneous interpretations and annotations are copied and
perpetuated.
Holospora, Caedibacter, Ichthyophthirius …
history
The search for Rickettsia-like models that are more
easily investigated led to endosymbionts of unicellular
eukaryotes (ciliates) such as Holospora and
Caedibacter. Other, more recent additions are the fish
pathogen Ichthyophthirius, causing ‘sudden, catastrophic
death of aquarium fish’, and an endosymbiont of
Stachyamoeba (an amoeboid, excavate protist) that we
found and sequenced just two weeks ago.
Holospora, Caedibacter, Ichthyophthirius …
history
The molecular basis of cellular infection by Holospora has been intensely
studied in the H-D. Görtz and M. Fujishima labs – thus our interest in
sequencing its genome. Holospora invades ciliates via the food vacuole,
escapes into the cytoplasm, and enters one of the nuclei (micro- or
macronucleus), where it propagates.
This project was interesting to us because of the potential of comparative
genome analyses among several known endosymbionts and their
phylogenomic analysis, in conjunction with mitochondria (i.e.,
identification of mitochondrial origins).
A partial Holospora genome sequence is analyzed in Lang et al. (2005) Jpn.
J. Protozool. 38: 171-181.
Holospora genome – history
Phylogenetic position of Holospora at the base of the Rickettsia/Ehrlichia/Wolbachia
cluster of animal endosymbionts; together at the base (outside) of the α-Proteobacteria.
Yet a long-branch attraction (LBA) artifact may be moving them towards the distant outgroup?
Holospora genome – history
In this paper, we reached the conclusion that the Holospora genome
is much more derived than those of its endosymbiont neighbors. At a bit
more than 1 Mbp, it has lost a number of cellular functions and
pathways, including oxidative phosphorylation, along with most of the key
genes used for inferring the evolutionary origin of mitochondria.
It further contains a high number of insertion elements, which
makes genome assembly very difficult.
Conclusion: finish the genome sequence, but also find Holospora relatives
that are minimally derived and slowly evolving.
History end – new start.
In the end, for lack of funds, the Holospora genome
remained incomplete until the M. Lynch group came
to our rescue more recently. They are now very close to
completing Holospora obtusa (~1.4 Mbp, linear) and
are getting close with Caedibacter caryophila as well.
Likewise, we are about to complete the genome sequence
of the Stachyamoeba endosymbiont (~1.8 Mbp).
On the origin of mitochondria and
bacterial endosymbionts
The symbiotic introduction of mitochondria is a key event in
eukaryotic evolution: a sizeable contribution of genetic
material (~10% or more, depending on the species) that is essential for
understanding the nature of the eukaryotic cell.
It occurred a billion or more years ago, so phylogenetic
inferences aimed at resolving eukaryotic origins are exceedingly
difficult. The origin of mitochondria and Rickettsia-like bacteria
is somewhere close to (but not within) α-Proteobacteria.
Yet our insights remain plagued by phylogenetic artifacts;
published analyses are poor if not misleading. Genome
sequences from diverse bacterial species (among them
Holospora and Caedibacter) are the most promising way to
overcome the current impasse.
On the origin of mitochondria and
bacterial endosymbionts
Obtaining statistically significant and biologically
meaningful results requires broad
taxon sampling and data from preferentially
slowly-evolving species:
• minimally derived mtDNAs plus nuclear genomes
from, for instance, relatives of jakobid flagellates
(e.g., Reclinomonas americana), and
• a large variety of genomes from free-living
α-Proteobacteria and endosymbionts close to
mitochondria.
What are jakobids?
(e.g., Reclinomonas americana and Andalucia godoyi)
Why R. americana and A. godoyi?
Ongoing nuclear genome project on
Reclinomonas, and in preparation for Andalucia.
Among jakobids, they have the slowest-evolving
mt sequences, and Andalucia even has a few
more mt genes than the previous record set by
Reclinomonas.
How to infer biologically meaningful
(correct) phylogenies
To obtain statistically significant and biologically
meaningful results, use the most realistic phylogenetic
models (CAT …), which are ideally derived from and
adapted to the data to be analyzed (CAT+GTR).
This requires a lot of sequence data: multiple gene
or protein sequences.
What is CAT/PhyloBayes?
We know that many a.a. sequence positions have specific profiles
that do not fit global evolutionary models such as WAG.
(Figure: examples of site-specific amino acid profiles: A/S, A/S/T, A/P, A/N/Q)
CAT (PhyloBayes) models this site-wise heterogeneity.
Its use increases phylogenetic signal and reduces the impact of
artifacts (e.g., LBA). Even better, CAT + GTR infers profiles from
the data (yet it is very slow …).
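To make the idea of site-specific profiles concrete, here is a minimal sketch that tabulates which amino acids occur in each column of a toy alignment. It only illustrates the observation that motivates CAT; it is not the CAT model and not PhyloBayes code. The function name `site_profiles`, the `min_freq` cutoff, and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

def site_profiles(alignment, min_freq=0.1):
    """For each alignment column, list the residues whose frequency is
    at least min_freq (gap characters '-' are ignored)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    profiles = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        counts = Counter(residues)
        total = sum(counts.values())
        kept = sorted(aa for aa, n in counts.items()
                      if total and n / total >= min_freq)
        profiles.append("/".join(kept))
    return profiles

# Hypothetical four-sequence alignment, chosen so that each column
# shows a different restricted set of amino acids.
toy_alignment = [
    "AAAN",
    "SSPQ",
    "ATAA",
    "SAPN",
]
print(site_profiles(toy_alignment))
```

A single global exchange matrix such as WAG treats all four columns alike, whereas a site-heterogeneous model can assign each column its own equilibrium profile.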
Value of mitochondrial genes in phylogenetics
Extant eukaryotic lineages all have (or sometimes had) mitochondria, a
parallel genetic universe with distinct phylogenetic markers – providing a
comparative view and confirmation of nuclear gene phylogenies back to
the time point where the mitochondrial endosymbiont was introduced.
Moreover, it allows known bacterial relatives to be identified.
Although the bacterium-derived mitochondrial genome is small (13 to 30
protein genes), nuclear genes of clearly α-proteobacterial origin and with
evidently mitochondrial function may be added.
~3,300 → >10,000 a.a.
Problem: nuclear genomes are hybrid monsters containing genes
transferred from organelles and more!
But nuclear genomes are wildly mosaic!
- Organelle genomes undergo massive gene loss, plus transfer to the nucleus.
- Nuclear genomes therefore include proteobacterial (or cyanobacterial) genes.
In addition, nuclear genes may be acquired by lateral transfer, from various sources.
The challenge: incongruent gene/genome/species phylogenies, often
difficult to identify and resolve.
Eukaryote-eukaryote endosymbiosis further increases
genomic mosaicism
From Keeling et al. 2004
Are we misled by eukaryote-eukaryote
endosymbioses?
Almost unavoidably!
Phylogenies including data from stramenopiles, haptophytes,
cryptophytes, chlorarachniophytes, … any secondary-plastid-containing
group of species are a priori suspect; definitely so
when phylogenies with plastid genes (including the nucleus-encoded
ones) differ from other nuclear and mitochondrial gene trees.
For the planned analysis, we will therefore use only mitochondrial
and nuclear genes from jakobids, without known photosynthetic
members, and with the highest number of mtDNA-encoded (i.e.,
unproblematic) genes.
(I) Origin of mitochondria from within Proteobacteria
As a start, we have analyzed genomes from all > 500 Proteobacteria at
GenBank (i) to check whether the bacterial textbook topology (based on rRNA data) is
reproduced (it is), and (ii) to confidently identify and exclude genes that have a
tendency for lateral transfer and/or are plagued by paralogy. New
unpublished data: Holospora, Caedibacter, Stachyamoeba-endo.
~ 1/2 of the analyzed genes are totally unproblematic; transporters are virtually
always questionable, as are many of the tRNA synthetases.
Trees with paralogs/transferred genes removed versus all proteins included
are almost identical; i.e., contrary to the belief of some, phylogenetic issues are
minor even when genes with occasional transfers are not removed.
Phylogenomic analysis,
α-Proteobacteria plus
mitochondria.
Dataset with 10,800 aligned a.a.
positions (about half that number
for Holospora); PhyloBayes
analysis (CAT, GTR).
Endosymbionts + Mitochondria
branch together, but outside
α-Proteobacteria.
Strong potential for an LBA
artifact: these fast-evolving
species (the only exception is
Caedibacter) may be attracted (i)
to each other and (ii) to the
distant outgroup.
→ What happens when all
fast-evolving species are
removed?
Phylogenomic analysis,
α-Proteobacteria plus
mitochondria.
What happens when all fast-evolving
species are removed?
Caedibacter and Stachyamoeba-endo
now clearly branch within
Rhodospirillales (confirmed with
an independent dataset w/o the
genes used here).
By inference, endosymbionts plus
mitochondria potentially derive
from within the Rhodospirillum/
Magnetospirillum clade.
But beware of more LBA artifacts,
e.g. mitochondria – Rickettsias!
Phylogenomic analysis,
α-Proteobacteria plus mitochondria – what next?
• Include more sequences from slowly-evolving relatives of Caedibacter
and many more free-living Rhodospirillales.
• Apply better phylogenetic models, adapted to A+T rich and fast-evolving
genomes.
• Eliminate fast-evolving (or heterotachous) sequence positions, which
requires a much larger dataset (20-30,000 a.a.) to compensate for the loss
of sequence information.
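As a rough illustration of the last point, the sketch below removes the most variable columns from an alignment and shows how much shorter the dataset becomes. Shannon entropy per column is used here only as a crude stand-in for evolutionary rate; a real analysis would rank positions by model-based rate estimates, and entropy says nothing about heterotachy. The function names, the `fraction` parameter, and the toy alignment are assumptions.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def drop_variable_columns(alignment, fraction=0.25):
    """Remove the given fraction of columns with the highest entropy.
    Returns (filtered alignment, indices of the columns that were kept)."""
    length = len(alignment[0])
    entropies = [column_entropy([seq[i] for seq in alignment])
                 for i in range(length)]
    n_drop = int(length * fraction)
    ranked = sorted(range(length), key=lambda i: entropies[i], reverse=True)
    dropped = set(ranked[:n_drop])
    kept = [i for i in range(length) if i not in dropped]
    filtered = ["".join(seq[i] for i in kept) for seq in alignment]
    return filtered, kept

# Hypothetical alignment: column 1 is the most variable one.
toy = ["MKAL", "MRAL", "MQAL", "MSAV"]
filtered, kept = drop_variable_columns(toy, fraction=0.25)
print(filtered, kept)
```

Every removed column is lost information, which is why the filtering step only becomes practical with a much larger starting dataset.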
(II) Analyzing genes and metabolic pathways
• Initial prediction of protein-coding genes (e.g., Glimmer, or
simply conceptual ORFs).
• Re-annotation with AutoFact (Blast against several reference
databases such as UniRef, KEGG, COG, Pfam, SMART, and
optimize by scoring; HMM profile search instead of Blast
would be better and is under development).
• To gain sensitivity and be more certain in picking orthologs,
it would be even better to combine AutoFact with
comparative bacterial genome annotation that uses synteny
information (e.g., MaGe at Genoscope, currently under
exploration).
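The "conceptual ORFs" option in the first step can be sketched in a few lines. This is a deliberately naive illustration and not what Glimmer does: it reports every ATG-to-stop stretch in all six reading frames, ignores alternative bacterial start codons such as GTG and TTG, and uses no coding statistics. The function names and the `min_codons` default are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for every ATG...stop ORF.
    Coordinates are 0-based, end-exclusive, and refer to the scanned
    strand (the reverse complement for strand '-'). min_codons counts
    codons from the ATG up to, but not including, the stop codon."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs

# Hypothetical toy sequence: one short ORF (ATG + 5 codons + stop).
toy_dna = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy_dna, min_codons=5))
```

In a fast-evolving, A+T rich genome such a scan over-predicts heavily, which is one reason the re-annotation step against reference databases matters.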
Analyzing genes and metabolic pathways
• Extract relevant data from AutoFact as food for Pathway Tools
(assign E.C. numbers).
• Infer pathways, initial round.
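The extraction step could look roughly like the sketch below, which pulls EC numbers out of free-text annotation descriptions. The input layout (gene identifier, tab, description) is an assumption made for illustration and is not AutoFact's documented output format; the same applies to the function name and the example lines.

```python
import re

# Matches "EC 2.8.1.6" or "EC:2.7.1.-"; partial EC numbers with "-" are allowed.
EC_PATTERN = re.compile(r"\bEC[:\s]\s*(\d+(?:\.(?:\d+|-)){3})")

def extract_ec_numbers(lines):
    """Map gene id -> sorted list of EC numbers found in its description.
    Genes without any EC number are left out of the result."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        gene_id, _, description = line.partition("\t")
        ecs = EC_PATTERN.findall(description)
        if ecs:
            table[gene_id] = sorted(set(ecs))
    return table

# Hypothetical annotation lines (identifiers are made up).
example = [
    "orf0001\tbiotin synthase (EC 2.8.1.6)",
    "orf0002\thypothetical protein",
    "orf0003\tbifunctional enzyme [EC:1.1.1.1] [EC:2.7.1.-]",
]
print(extract_ec_numbers(example))
```

The resulting gene-to-EC table is the kind of input a pathway inference step needs; how it is then handed to Pathway Tools is not shown here.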
Analyzing genes and metabolic pathways
(Example of AutoFact result, with EC number from Kegg)
Analyzing genes and metabolic pathways
(example of database collection including Holospora and Ich)
Analyzing genes and metabolic pathways
(Example of Holospora pathway overview graph; mousing over objects provides details)
Analyzing genes and metabolic pathways
(Example of Holospora biotin synthesis I pathway)
Analyzing genes and metabolic pathways,
second round
• Check for pathway holes; if a pathway is incomplete, recheck the presence of
the respective genes in the genome (HMM profile searches for
highest sensitivity).
• Apply pathway comparisons among species to identify other
potential inconsistencies; search for missing genes in the genome sequence.
Using manual curation for this step would be overwhelming. We
therefore work on scripting and automation, using HMM
profile searches. For this we need to build models from
proteins of closely related reference bacteria (preferentially
Rhodospirillales) – thus the importance of knowing
phylogenetic relationships and origins.
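A scripted pathway-hole check can be as simple as a set comparison, as in the sketch below. The pathway names and EC numbers are arbitrary placeholders, not real pathway definitions or Holospora results, and the rule of ignoring pathways with no enzyme present at all is an assumption made for this example.

```python
def find_pathway_holes(pathways, annotated_ecs):
    """Return {pathway: missing EC numbers} for pathways that are only
    partly covered by the annotation. Complete pathways and pathways
    with no enzyme present at all are not reported."""
    annotated = set(annotated_ecs)
    report = {}
    for name, required in pathways.items():
        required = set(required)
        missing = sorted(required - annotated)
        present = len(required) - len(missing)
        if missing and present > 0:
            report[name] = missing
    return report

# Placeholder pathway definitions and annotations (not real data).
toy_pathways = {
    "toy_pathway_A": ["1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "toy_pathway_B": ["4.4.4.4", "5.5.5.5"],
    "toy_pathway_C": ["6.6.6.6"],
}
toy_annotation = {"1.1.1.1", "3.3.3.3", "6.6.6.6"}
print(find_pathway_holes(toy_pathways, toy_annotation))
```

Each reported hole then becomes a target for a sensitive HMM profile search against the genome sequence.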
Analyzing genes and metabolic pathways …
optimizing HMM profile models
For HMM profiles, one needs to start with a multiple alignment
(e.g., MUSCLE). We optimize this alignment with iterated rounds
of hmmalign (criterion: best E-value), and then eliminate sequences
that are too closely related, based on a phylogenetic distance matrix, which
in the end further improves the sensitivity of the resulting
HMM model.
This approach works best when many sequences are available,
thus the urgent need for more Rhodospirillales.
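The redundancy-removal step can be illustrated with a simple greedy filter over a pairwise distance matrix. This is only one possible way to do it; the greedy strategy, the `min_dist` threshold, and the toy distances are assumptions, and the result depends on the order of the input sequences.

```python
def drop_close_sequences(names, dist, min_dist=0.05):
    """Greedy filter: keep a sequence only if its distance to every
    sequence already kept is at least min_dist.
    dist is a square, symmetric matrix indexed like names."""
    kept = []
    for i in range(len(names)):
        if all(dist[i][j] >= min_dist for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

# Hypothetical distances: a/b and c/d are near-identical pairs.
toy_names = ["a", "b", "c", "d"]
toy_dist = [
    [0.00, 0.01, 0.30, 0.40],
    [0.01, 0.00, 0.31, 0.41],
    [0.30, 0.31, 0.00, 0.02],
    [0.40, 0.41, 0.02, 0.00],
]
print(drop_close_sequences(toy_names, toy_dist))
```

Thinning near-duplicates keeps a few closely related sequences from dominating the profile, which only pays off when enough diverse sequences remain.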
Continuing problems with sequence quality
When going through the process of finding missing genes, we
noted that some have simply been missed, and that others
contain frameshifts and were therefore not considered.
Frameshifts might indicate that a species is on its way to
dropping a function or a whole pathway, or that there is a sequencing
error. Such errors were common with early Sanger technology and now
resurge with pyrosequencing. For instance, in our current 454
project on Stachy-endo we have lots of potential errors in
homopolymer stretches, and these are not flagged at all. A
potential solution is adding Illumina sequences for error
correction.
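Since homopolymer stretches are not flagged by the assembly, a first pass can at least list them so that frameshifts falling inside such runs are treated as suspect. The sketch below is a minimal illustration; the function name, the `min_len` default, and the toy sequence are assumptions, and it does not decide whether a given run actually contains an error.

```python
import re

def homopolymer_runs(seq, min_len=5):
    """Return (start, end, base) for every run of a single base that is
    at least min_len long; coordinates are 0-based, end-exclusive."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(seq.upper())]

# Hypothetical toy sequence with a 6-base A run and a 5-base T run.
toy_seq = "ATG" + "AAAAAA" + "CG" + "TTTTT" + "GC"
print(homopolymer_runs(toy_seq, min_len=5))
```

Intersecting these coordinates with the positions of putative frameshifts gives a short list of sites worth checking against independent reads.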
Conclusions

• The Caedibacter/Holospora group of bacterial endosymbionts diverges
from within Rhodospirillales, a deep divergence in α-Proteobacteria.

• Caedibacter/Holospora diverge prior to mitochondria and the
Rickettsia/Wolbachia/Ehrlichia (RWE) group of pathogens, with the new
Stachyamoeba-endo as its most slowly evolving member.

• Mitochondria appear to be a sister group to RWE endosymbionts. Yet
this topology may be caused by a phylogenetic LBA artifact (?).

• Holospora is highly derived and fast-evolving. It specifically lost oxidative
phosphorylation but, curiously, retained two complete, alternative
pathways for biotin synthesis (a means for host dependence?).

• Our results indicate a need for genome projects on broadly sampled
relatives of Caedibacter and other slowly-evolving endosymbionts, and on
more free-living Rhodospirillales, to better resolve evolutionary
relationships and the evolution of metabolic pathways.
Lab members and collaborators
Michael Lynch, Tom Doak
Gertraud Burger, Lise Forget (Montreal)
Henner Brinkmann, Hervé Philippe (Montreal)
Andrew Roger, Alastair Simpson, Mike Gray (Halifax)
Iñaki Ruiz-Trillo (Barcelona)
… numerous others unnamed …
Thanks !
This work was possible thanks to generous and long-standing
financial support by the
Canadian Institutes of Health Research (CIHR)
Canadian Institute for Advanced Research (CIfAR)
Canada Research Chairs Program
Genome Quebec/Atlantic/Canada
Eusko Jaurlaritza / Gobierno Vasco
EST data → TBestDB
Thanks also
to the National Human Genome Research Institute (NHGRI/NIH) for
endorsing a multi-taxon genome sequencing initiative to gain insights into how
multicellularity evolved. This initiative, the UNICellular Opisthokont
Research iNitiative ('UNICORN'), will generate genomic data from some
unicellular relatives of both animals and fungi.
G. Burger, M.W. Gray, P.W. Holland, N. King, B.F. Lang, A.J. Roger, I. Ruiz-Trillo
For more information see
Ruiz-Trillo et al., Trends in Genetics 23 (2007).
To use these data for analyses at the genomic level,
please contact members of the UNICORN project, either
for collaboration or for approval of use.
Status of genome projects
In sequencing pipeline or close to finished:
Allomyces, Spizellomyces, Mortierella, Amoebidium, Sphaeroforma,
Capsaspora, Amastigomonas, Proterospongia, Reclinomonas, Ministeria
DNA purification phase:
Andalucia, Malawimonas