PhD Thesis proposal form

Discipline: Biology
Doctoral School: Gènes, Génomes, Cellules
Thesis subject title: Inferring the genomic evolutionary history of transposable elements.
Laboratory name and web site: Laboratoire Évolution, Génomes et Spéciation (Lab Evolution, Genomes, and Speciation), CNRS UPR 9034, Gif-sur-Yvette, France - http://www.legs.cnrs-gif.fr/
PhD supervisor (contact person):
Supervisor #1: Pierre Capy, Professor UPS, pierre.capy@legs.cnrs-gif.fr
Supervisor #2: Aurélie Hua-Van, Associate Prof UPS, ahuavan@legs.cnrs-gif.fr
Supervisor #3: Arnaud Le Rouzic, CNRS Researcher, lerouzic@legs.cnrs-gif.fr

Thesis proposal (max 1500 words):

Transposable elements (TEs) are repeated DNA sequences that are able to self-replicate in genomes. Sequences from various TE families are present in virtually all known species, and they often represent a significant part of the non-genic DNA. TEs are selfish DNA sequences: their presence and activity can promote cancers and genetic diseases, but they also constitute a source of genetic diversity and evolvability. Understanding their impact and their role in molecular evolution is thus a major concern for a better understanding of genome sequences.

In spite of considerable technical progress in DNA sequencing, the repeated fraction of genomes (often dubbed "junk DNA") remains poorly described and understood. Eukaryotic genomes can contain up to 80% TE-derived sequences, most of them inactive and degraded. Active copies contain an open reading frame coding for proteins that promote transposition. Non-autonomous copies, usually more abundant than autonomous elements, do not code for the transposition machinery but can be amplified in trans by autonomous partners, in a way reminiscent of parasitic interactions in a complex genomic ecosystem. Genomes thus contain ancient traces of former transposition activity, offering the possibility of reconstructing molecular evolutionary history over millions of years.
The aim of this PhD project is to set up a conceptual and bioinformatic framework connecting genome sequence data and theoretical models of TE evolution. It is organized along three main directions, extending prior work performed in our lab:

(i) The development of a bioinformatic pipeline for the collection, identification, alignment and phylogenetic reconstruction of TE sequences from genome databases. This work will require combining existing software with newly developed models or software modules. Two steps are anticipated to be challenging: the automatic alignment of TEs (characterized by non-homologous flanking sequences and frequent deletions), and the phylogenetic reconstruction under a TE-specific evolutionary model, which is expected to differ from the models used for "regular" genes.

(ii) The formalization of TE evolutionary models, in order to connect the population genetic models already available in the lab to DNA sequence evolution. Such models should predict the pattern of TE sequence divergence under various scenarios, including, for example, the maintenance of active elements in a transposition-selection-deletion equilibrium, or recurrent invasion by horizontal transfers.

(iii) The development of statistical tools and software for estimating evolutionarily relevant parameters (such as the past dynamics of transposition activity) and their confidence intervals from genome sequences.

This project thus requires deriving a probabilistic framework from the evolutionary models, so that the models can be fitted to automatically collected data. Beyond the mere reconstruction of transposition history, such models will make it possible to contrast the evolutionary properties of similar TEs in different species, and of different TEs in the same species, and thus to shed light on the importance of specific host-TE interactions (as well as, for example, the impact of demographic events) on genome evolution.
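As a purely illustrative sketch of the kind of estimate targeted in direction (iii), the Python fragment below dates TE copies from their divergence to a family consensus sequence. It assumes, for simplicity, a Jukes-Cantor substitution model, a constant neutral substitution rate, and a consensus that approximates the sequence of each copy at insertion time. The function names (`p_distance`, `jc69_distance`, `copy_ages`), the toy alignment and the rate value are hypothetical. The project itself is expected to rely on a TE-specific evolutionary model rather than on this simplification.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap or an ambiguous base in either sequence are skipped.
    """
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jc69_distance(p):
    """Jukes-Cantor (1969) corrected distance, in substitutions per site."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("divergence is saturated under JC69")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def copy_ages(consensus, copies, rate):
    """Estimated age of each copy, in the time unit of `rate`.

    Assumption: the consensus stands for the sequence at insertion time,
    so the distance accumulates along a single branch (age = d / rate).
    """
    return [jc69_distance(p_distance(consensus, c)) / rate for c in copies]


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    consensus = "ACGTACGTACGTACGT"
    copies = ["ACGTACGTACGTACGT", "ACGTACGAACGTACGT", "ACG-ACGTTCGTACGA"]
    # Hypothetical neutral rate, in substitutions per site per year.
    for age in copy_ages(consensus, copies, rate=1e-8):
        print(f"{age:.3g} years")
```

The distribution of such ages across the copies of a family gives a first, rough picture of past transposition activity. Confidence intervals and more realistic models (deletions, selection, horizontal transfers) are deliberately left out of this sketch.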
In spite of its ambitious nature, this project is likely to lead to significant results within the time frame of a PhD. Around 18 months will be necessary for developing and testing the bioinformatic and statistical tools; this work will be facilitated by the availability of preliminary results (including manually processed data sets to be compared with the output of automatic treatments). Six months will be devoted to the collection of TE sequence data from public databases according to a carefully designed sampling strategy (number of organisms, diversity of TE families, choice of some species to be examined in depth...). In the last part of the PhD, the candidate will analyze the results and draft research papers. Results will be published in international, peer-reviewed journals, and software will be released under free licenses to ensure its availability to the scientific community.

The PhD candidate will be hosted at the LEGS in Gif-sur-Yvette, around 3 km from the main campus of University Paris-Sud. He/she will be part of the "ELEGEM" group, led by Aurélie Hua-Van. Previous work carried out in the group demonstrates our internationally recognized expertise in genome evolution, bioinformatics, statistics, theoretical population genetics, and transposable element biology. For the bioinformatics work, the candidate will have access to the computational resources of the lab, including a server (bought in 2012) devoted to heavy calculation. PhD students in our lab are encouraged to participate in at least one national or international conference every year.

Publications of the laboratory in the field (max 5):
Le Rouzic, A. and Deceliere, G. 2005. Models of the population genetics of transposable elements. Genetical Research 85:171-181.
Le Rouzic, A., Boutin, T., Capy, P. 2007. Long-term evolution of transposable elements. PNAS 104:19375-19380.
Le Rouzic, A., Dupas, S., Capy, P. 2007. Genome ecosystem and transposable elements species. Gene 390:214-220.
Wicker, T., Sabot, F., Hua-Van, A., et al. 2007. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8:973-982.
Hua-Van, A., Le Rouzic, A., Boutin, T.S., Filée, J., Capy, P. 2011. The struggle for life of the genome's selfish architects. Biology Direct 6:19.

Specific requirements to apply, if any:
We are looking for a candidate with a strong background in bioinformatics, statistics, and genome analysis. The project involves scripting and software development, so prior knowledge of bash, Python, R, and C++ is necessary. We expect the candidate to show a strong interest in evolutionary biology; solid experience in population genetics and evolutionary modeling will be welcome. The student will be part of a research group involving researchers with various scientific backgrounds; curiosity and good communication skills (including fluent English) will help ensure the success of the project.