Multi-Species Gene Prediction with PhyloHMMs

One of the major disciplines in biology is genetics, the science of genes and heredity. Within genetics, genomics is the study of the genomes of different organisms. Scientists have made a great deal of progress in genetic mapping, thanks in large part to government sponsorship of the Human Genome Project. While this project has been quite successful in mapping the human genome, a significant amount of work remains to be done in this field.

Computational prediction of protein-coding gene structure is an important method being employed by researchers. These prediction methods are never 100% sensitive or 100% specific because their performance varies across different input data. One way to improve these measures is to use consensus-generating methods that combine multiple individual predictions into a single prediction. Another important concept is that gene structure is evolutionarily conserved across closely related species.

Using this knowledge, I propose to implement a novel consensus-generating method that integrates a mixture of gene predictions. I will do this by developing a PhyloHMM finite state transducer framework to model gene structure alphabets at ancestral nodes in a phylogenetic tree. I will implement algorithms for model estimation and for finding the optimal consensus gene structure using dynamic programming. I also plan to use an open source software package called xRate, which is an interpreter for phylo-grammars. I will evaluate my results based on their fit to existing experimental data.

Gene Prediction

I will first explain some background on gene prediction. Biologists have made excellent progress in genome annotation since drafts of the human genome were first analyzed. The aim of genome annotation is to determine the biochemical function of nucleotides in the genome (Brent, 2008).
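To make the sensitivity and specificity measures concrete, the following is a minimal sketch of how they might be computed at the nucleotide level. It assumes the convention common in gene-prediction benchmarks, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP). The function name, the "E"/"I" label strings, and the example data are hypothetical and only serve as an illustration.

```python
def nucleotide_accuracy(predicted, reference, coding="E"):
    """Return (sensitivity, specificity) for two per-nucleotide label strings.

    Assumed convention: sensitivity = TP / (TP + FN),
    specificity = TP / (TP + FP), counted over coding positions.
    """
    if len(predicted) != len(reference):
        raise ValueError("label strings must have equal length")
    pairs = list(zip(predicted, reference))
    tp = sum(p == coding and r == coding for p, r in pairs)  # correctly predicted coding
    fp = sum(p == coding and r != coding for p, r in pairs)  # predicted coding, actually not
    fn = sum(p != coding and r == coding for p, r in pairs)  # missed coding positions
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: "E" marks an exonic (coding) position, "I" anything else.
reference = "IIEEEEIIII"
predicted = "IEEEEIIIII"
sn, sp = nucleotide_accuracy(predicted, reference)
```

In this toy case the prediction is shifted by one position, so it recovers three of the four coding nucleotides and calls one non-coding nucleotide coding. Both measures fall below 1 even though the predicted exon has the correct length.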
Gene prediction models can be improved by integrating multiple sources of gene evidence. One program that implements this idea is Evigan, an automated gene annotation program (Liu, Mackey, Roos, & Pereira, 2008). My project will take several concepts from Evigan, including the use of a hidden Markov model (HMM). An HMM incorporates ideas from probability and the theory of computation, specifically in its resemblance to a finite state transducer. A finite state transducer has an input tape, an output tape, a symbol alphabet, and a set of transitions (Bradley & Holmes, 2007). An HMM is similar, but it also has a set of “hidden” states that are inferred from its observed outputs together with its transitions and transition probabilities.

My project will make use of PhyloHMMs, which combine the ideas of HMMs and phylogenetics. Phylogenetics studies the evolutionary history of lineages of organisms. One of the big questions for biologists is how to reconstruct that history (Cracraft, 1974). PhyloHMMs are often used to accomplish this. They are probabilistic models that consider both the way substitutions occur through evolutionary history at each site of a genome and the way this process changes from one site to the next (Siepel & Haussler, 2005). They work as two combined Markov processes: one that operates along the genome and one that operates along the branches of a phylogenetic tree. These models are essential for creating accurate gene predictions in my project.

xRate Software

My project will use the xRate software package to create a phylo-grammar that implements a hierarchical dynamic Bayesian network. The phylo-grammar will use the phylogenies of different species to enforce correlations in gene structure along the phylogenetic tree. I will use the non-DNA alphabet described by Evigan (Liu et al., 2008) and a transition matrix that captures the correlation between exon and intron states.
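The idea of two combined Markov processes can be illustrated with a small, self-contained sketch. This is not xRate grammar syntax and not the model I will ultimately estimate; it is a toy under stated assumptions. The two-symbol alphabet ("E" for exon, "I" for intron), the three-species tree, the per-branch flip probabilities, and the transition probabilities are all hypothetical values chosen for illustration. Felsenstein pruning handles the process along the tree, and the Viterbi algorithm handles the process along the genome.

```python
import math

LABELS = ("E", "I")  # hypothetical two-symbol alphabet: exon, intron

# Hypothetical tree as nested pairs; leaves are species names.
TREE = (("human", "chimp"), "mouse")
# Hypothetical probability that the label flips along the branch above each node.
FLIP = {"human": 0.05, "chimp": 0.05, "mouse": 0.15, ("human", "chimp"): 0.10}
# Hypothetical transition probabilities between adjacent genome positions.
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
START = {"E": 0.5, "I": 0.5}


def pruning(node, column):
    """P(observed leaf labels below node | label at node), by Felsenstein pruning."""
    if isinstance(node, str):  # leaf: the predicted label is observed directly
        return {x: 1.0 if column[node] == x else 0.0 for x in LABELS}
    like = {x: 1.0 for x in LABELS}
    for child in node:
        below = pruning(child, column)
        for x in LABELS:
            # Sum over the child's label y, weighted by the branch flip probability.
            like[x] *= sum(
                ((1.0 - FLIP[child]) if x == y else FLIP[child]) * below[y]
                for y in LABELS
            )
    return like


def to_columns(predictions):
    """Turn {species: label string} into one {species: label} dict per position."""
    length = len(next(iter(predictions.values())))
    return [{sp: seq[i] for sp, seq in predictions.items()} for i in range(length)]


def safe_log(p):
    return math.log(max(p, 1e-300))  # guard against log(0)


def viterbi(columns):
    """Most probable sequence of root labels along the genome, in log space."""
    emit = [pruning(TREE, col) for col in columns]
    score = {s: safe_log(START[s]) + safe_log(emit[0][s]) for s in LABELS}
    back = []
    for e in emit[1:]:
        new, ptr = {}, {}
        for s in LABELS:
            prev = max(LABELS, key=lambda r: score[r] + safe_log(TRANS[r][s]))
            ptr[s] = prev
            new[s] = score[prev] + safe_log(TRANS[prev][s]) + safe_log(e[s])
        back.append(ptr)
        score = new
    state = max(LABELS, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # trace the best path back to the first position
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


# Hypothetical per-species predictions over six aligned positions.
predictions = {"human": "EEEIII", "chimp": "EEEIII", "mouse": "EEIIII"}
consensus = viterbi(to_columns(predictions))
```

For this toy input the consensus is EEEIII: at the third position the two closely related species agree on an exon label, and that outweighs the single disagreeing prediction on the longer mouse branch. A real model would replace the fixed flip probabilities with parameters estimated from data and would use a richer gene-structure alphabet.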
The intellectual merit of this project stems from my ability to model complex algorithms and create detailed grammars to generate the desired results. These ideas are heavily based in the theory of computation and computational science.

Results and Impact

With this project, I hope to combine my understanding of computer science and programming with research in genomics, computational probability, and bioinformatics. I will carry out this project with the intent of generating better gene predictions by combining existing predictions with probabilistic models. Hopefully, the results will show that combining phylogenies and HMMs produces better gene predictions, with higher sensitivity and specificity. It is possible that problems in the implementation of this project could lead to unforeseen outcomes. In that case, the issues that arise will be well documented and will still provide insight for future projects in this area.

Works Cited

Bradley, R.K., & Holmes, I. (2007). Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics, 23(23), 3258-3262.

Brent, M.R. (2008). Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nature Reviews Genetics, 9, 62-73.

Cracraft, J. (1974). Phylogenetic Models and Classification. Systematic Zoology, 23(1), 71-90.

Liu, Q., Mackey, A., Roos, D.S., & Pereira, F.C.N. (2008). Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction. Bioinformatics, 24(5), 597-605.

Makarov, V. (2002). Computer Programs for Eukaryotic Gene Prediction. Briefings in Bioinformatics, 3(2), 195-199.

Meyer, I.M., & Durbin, R. (2002). Comparative Ab Initio Prediction of Gene Structures Using Pair HMMs. Bioinformatics, 18(10), 1309-1318.

Mossel, E., & Roch, S. (2006). Learning Nonsingular Phylogenies and Hidden Markov Models. The Annals of Applied Probability, 16(2), 583-614.

Siepel, A., & Haussler, D. (2005). Phylogenetic Hidden Markov Models. Statistics for Biology and Health, 3, 325-351.