PhD Thesis proposal form
Discipline
Biology
Doctoral School
Gènes, Génomes, Cellules
Thesis subject title:

Inferring the genomic evolutionary history of transposable elements.
Laboratory name and web site:
Laboratoire Évolution, Génomes et Spéciation (Lab Evolution, Genomes, and Speciation), CNRS
UPR 9034, Gif-sur-Yvette, France - http://www.legs.cnrs-gif.fr/

PhD supervisor (contact person):
Supervisor #1: Pierre Capy, Professor UPS, pierre.capy@legs.cnrs-gif.fr
Supervisor #2: Aurélie Hua-Van, Associate Prof UPS, ahuavan@legs.cnrs-gif.fr
Supervisor #3: Arnaud Le Rouzic, CNRS Researcher, lerouzic@legs.cnrs-gif.fr

Thesis proposal (max 1500 words):
Transposable elements (TEs) are repeated DNA sequences that are able to self-replicate in genomes.
Sequences from various TE families are present in virtually all known species, and they often
represent a significant part of the non-genic DNA. TEs are selfish DNA sequences: their presence
and their activity can promote cancers and genetic diseases, but they also constitute a source of
genetic diversity and evolvability. Understanding their impact and their role in molecular evolution is
thus a major concern for a better understanding of genome sequences.
In spite of considerable technical progress in DNA sequencing, the repeated fraction of genomes
(often dubbed "junk DNA") still remains poorly described and understood. Eukaryotic genomes can
contain up to 80% TE-derived sequences, most of them being inactive and degraded. Active copies
contain an open reading frame, coding for proteins promoting transposition. Non-autonomous copies,
usually more abundant than autonomous elements, do not code for the transposition machinery but
can be amplified in trans by autonomous partners in a way reminiscent of parasitic interactions in a
complex genomic ecosystem. Genomes thus contain ancient traces of former transposition activity,
offering the possibility of reconstructing molecular evolutionary history for millions of years.
The aim of this PhD project is to set up a conceptual and bioinformatic framework to connect
genome sequence data and theoretical models of TE evolution. It is articulated in three main
directions, extending prior work performed in our lab:
(i) The development of a bioinformatic pipeline for collection, identification, alignment and
phylogenetic reconstruction of TE sequences from genome databases. This work will require the
combination of already existing software and newly developed models or software modules. Two
steps are anticipated to be challenging: the automatic alignment of TEs (characterized by
non-homologous flanking sequences and frequent deletions), and the phylogenetic reconstruction
assuming a specific evolutionary model, which is expected to differ from that of “regular” genes.
(ii) The formalization of TE evolutionary models, in order to connect existing population
genetic models available in the lab to DNA sequence evolution. Such models should predict the
pattern of TE sequence divergence according to various scenarios, including e.g. the maintenance of
active elements in a transposition – selection – deletion equilibrium, or recurrent invasion by
horizontal transfers.
(iii) The development of statistical tools and software aiming at estimating evolutionarily
relevant parameters (such as the past dynamics of transposition activity) and their confidence
intervals from genome sequences. This project thus requires the definition of a probabilistic
framework from the evolutionary models, in order to be able to feed the model with automatically
collected data. Beyond the mere reconstruction of transposition history, such models will make it
possible to contrast evolutionary properties of similar TEs in different species, and of different
TEs in the same species, and thus to shed light on the importance of specific host-TE interactions
(as well as e.g. the impact of demographic events) on genome evolution.
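As an illustration of the kind of model mentioned in (ii), the sketch below iterates a deliberately
simple deterministic recursion for the mean TE copy number per genome, n' = n + n(u - v) - s n^2,
and compares the simulated value with the analytical equilibrium n* = (u - v) / s. The recursion
itself, the parameter names (u for the transposition rate, v for the deletion rate, s for the
strength of selection against copies) and the numerical values are assumptions chosen for
illustration only; this is not the population genetic model developed in the lab.

```python
# Toy model (an assumption for illustration, not the lab's model):
# mean copy number n changes each generation through transposition (u),
# deletion (v), and selection whose strength grows with n (s).


def equilibrium_copy_number(u, v, s):
    """Non-zero fixed point of n' = n + n*(u - v) - s*n**2, or 0 if u <= v."""
    if s <= 0:
        raise ValueError("s must be positive")
    return max((u - v) / s, 0.0)


def simulate_copy_number(n0, u, v, s, generations):
    """Iterate the toy recursion; return the trajectory of mean copy number."""
    n = float(n0)
    trajectory = [n]
    for _ in range(generations):
        n = n + n * (u - v) - s * n * n
        n = max(n, 0.0)  # a copy number cannot be negative
        trajectory.append(n)
    return trajectory


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to make the example run.
    u, v, s = 0.01, 0.001, 0.0001
    trajectory = simulate_copy_number(1.0, u, v, s, 5000)
    print(f"simulated: {trajectory[-1]:.3f}")
    print(f"analytical equilibrium: {equilibrium_copy_number(u, v, s):.3f}")
```

In such a toy setting, a single invading copy spreads when u > v and the copy number then settles
at the transposition - selection - deletion equilibrium; the project aims at connecting scenarios
of this kind to the sequence divergence patterns observed in genomes.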
In spite of its ambitious nature, this project is likely to lead to significant results in the time frame of
a PhD. Around 18 months will be necessary for developing and testing bioinformatic and statistical
tools, this work being facilitated by the availability of preliminary results (including manually
processed data sets to be compared with the output of automatic treatments). Six months will be
devoted to the collection of TE sequence data from public databases according to a carefully
designed sampling strategy (number of organisms, diversity of TE families, choice of some species
that will be examined in depth...). In the last part of the PhD, the candidate will analyze the results and
draft research papers. Results will be published in international, peer-reviewed journals, and software
will be released under free licenses to ensure its availability to the scientific community.
The PhD candidate will be hosted at the LEGS in Gif-sur-Yvette, around 3 km from the main campus
of University Paris-Sud. He/she will be part of the “ELEGEM” group, led by Aurélie Hua-Van.
Previous work carried out in the group demonstrates our internationally recognized expertise in
genome evolution, bioinformatics, statistics, theoretical population genetics, and transposable
element biology. For the bioinformatics work, the candidate will have access to the computational
resources of the lab, including a server (bought in 2012) dedicated to heavy computation. PhD students
in our lab are encouraged to participate in at least one national or international conference every year.

Publications of the laboratory in the field (max 5):
Le Rouzic, A., Deceliere, G. 2005. Models of the population genetics of transposable elements.
Genetical Research 85:171-181.
Le Rouzic, A., Boutin, T., Capy, P. 2007. Long-term evolution of transposable elements. PNAS
104:19375-19380.
Le Rouzic, A., Dupas, S., Capy, P. 2007. Genome ecosystem and transposable elements species.
Gene 390:214-220.
Wicker T, Sabot F, Hua-Van A, et al. 2007. A unified classification system for eukaryotic
transposable elements. Nat Rev Genet. 8:973-982.
Hua-Van, A., Le Rouzic, A., Boutin, T.S., Filée, J., Capy, P. 2011. The struggle for life of the
genome's selfish architects. Biol Direct. 6:19.

Specific requirements to apply, if any:
We are looking for a candidate with a strong background in bioinformatics, statistics, and genome analysis.
The project involves scripting and software development; prior knowledge of bash, Python, R, and
C++ is necessary. We expect the candidate to show strong interest in evolutionary biology, and solid
experience in population genetics and evolutionary modeling will be welcome. The student will be
part of a research group involving researchers with various scientific backgrounds; curiosity and good
communication skills (including fluent English) will ensure the success of the project.