Technical Prospectus

Multi-Species Gene Prediction with PhyloHMMs
One of the major disciplines in the field of biology is genetics, which is the science of genes and
heredity. One discipline within genetics is genomics, which is the study of the genomes of
different organisms. Scientists have made a great deal of progress in genetic mapping, due in
large part to government sponsorship of the Human Genome Project. While this project has
been quite successful in mapping the human genome, there is still a significant amount of work
to be done in this field. Computational prediction of protein-coding gene structure is an
important method being employed by researchers. These prediction methods are never 100%
sensitive or 100% specific because their performance varies with the input data. One way to
improve these measures is to use consensus-generating methods that combine multiple individual
predictions into a single prediction. Another important concept is that gene structure is
evolutionarily conserved across closely related species. Using this knowledge, I propose to
implement a novel, consensus-generating method to integrate a mixture of gene predictions. I
will do this by developing a PhyloHMM finite state transducer framework to model gene
structure alphabets at ancestral nodes in a phylogenetic tree. I will implement algorithms for
model estimation and for finding the optimal consensus gene structure using dynamic
programming. I also plan to use an open source software package called xRate, which is an
interpreter for phylo-grammars. I will evaluate my results based on how well they agree with
existing experimental data.
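To make the two measures and the idea of a consensus concrete, here is a minimal Python sketch. It is not the proposed PhyloHMM method: a simple per-base majority vote stands in as a baseline consensus. The label alphabet (E for exon, I for intron, N for intergenic) and the function names are assumptions made for illustration. Specificity is computed as TP / (TP + FP), a convention often used in the gene-finding literature; this differs from the usual statistical definition TN / (TN + FP).

```python
from collections import Counter


def majority_consensus(predictions):
    """Per-position majority vote over equally long label strings."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        # Ties are broken in favor of the label seen first in the column.
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


def sensitivity_specificity(predicted, reference, coding="E"):
    """Nucleotide-level sensitivity and (gene-finding style) specificity."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == coding and r == coding for p, r in pairs)
    fp = sum(p == coding and r != coding for p, r in pairs)
    fn = sum(p != coding and r == coding for p, r in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Three hypothetical predictions for the same six-base region.
    predictions = ["EEIIEE", "EEEIEE", "NEIIEN"]
    reference = "EEIIEN"  # hypothetical experimentally supported annotation
    consensus = majority_consensus(predictions)
    print(consensus, sensitivity_specificity(consensus, reference))
```

A majority vote treats every predictor and every position independently, which is exactly the limitation that a model with sequence and phylogenetic structure is meant to address.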
Gene Prediction
I will first explain some background on gene prediction. Biologists have made excellent progress
in genome annotation since drafts of the human genome were first analyzed. The aim of genome
annotation is to determine the biochemical function of nucleotides in the genome (Brent, 2008).
Gene prediction models can be improved by integrating multiple sources of gene evidence. One
program that implements this idea is Evigan, which is an automated gene annotation program
(Liu, Mackey, Roos, & Pereira, 2008). My project will take several concepts from Evigan,
including the use of a hidden Markov model (HMM). An HMM draws on ideas from
probability and the theory of computation, and it resembles a finite state transducer. A
finite state transducer has an input tape, an output tape, a symbol alphabet, and a set of
transitions (Bradley & Holmes, 2007). An HMM is similar, but its states are “hidden”: they are
not observed directly and must be inferred from the observed symbols using the model's
transition and emission probabilities. My project
will make use of phyloHMMs, which combine the ideas of HMMs and phylogenetics.
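The inference of hidden states described above is typically done with dynamic programming. The following Python sketch shows Viterbi decoding for a toy two-state HMM. The states (E for exon, I for intron), the nucleotide observations, and every probability value are invented for illustration; they are not parameters from Evigan or from this project.

```python
import math

# Hypothetical toy model: exons are GC-rich, introns are AT-rich.
STATES = "EI"
START_P = {"E": 0.5, "I": 0.5}
TRANS_P = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT_P = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (log space).

    Assumes every probability is strictly positive, so math.log is safe.
    """
    if not observations:
        return ""
    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = []  # back[t][s]: best predecessor of state s at time t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("GCGCATAT", STATES, START_P, TRANS_P, EMIT_P))  # EEEEIIII
```

In a phyloHMM the emission term for each state would come from a phylogenetic model of an alignment column rather than from a fixed table like `EMIT_P`.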
Phylogenetics studies the evolutionary history of lineages of organisms. One of the big questions for
biologists is how to reconstruct that history (Cracraft, 1974). PhyloHMMs are often used to
accomplish this. They are probabilistic models that consider the way substitutions occur through
evolutionary history at each site of a genome and the way this process changes from one site to
the next (Siepel & Haussler, 2005). They work as two combined Markov processes: one that
operates along a genome and one that operates along the branches of a phylogenetic tree. These
models are essential for creating accurate gene predictions in my project.
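The Markov process that operates along the branches of a tree can be illustrated with Felsenstein's pruning algorithm, which computes the likelihood of the labels observed at the leaves by summing over all states at the ancestral nodes. The Python sketch below makes several simplifying assumptions that are not part of the proposal: a three-letter gene-structure alphabet (E, I, N), a symmetric substitution model with a single rate, a uniform prior at the root, and a made-up three-species tree with invented branch lengths.

```python
import math

ALPHABET = "EIN"  # hypothetical labels: exon, intron, intergenic

# A leaf is a species name; an internal node is a list of
# (child, branch_length) pairs. Topology and lengths are invented.
TREE = [([("human", 0.1), ("chimp", 0.1)], 0.2), ("mouse", 0.5)]


def transition_prob(parent, child, t, rate=1.0):
    """P(child label | parent label) after time t, symmetric k-state model."""
    k = len(ALPHABET)
    decay = math.exp(-k * rate * t)
    if parent == child:
        return 1.0 / k + (k - 1) / k * decay
    return (1.0 - decay) / k


def partial_likelihoods(node, column):
    """Map each state s to P(labels observed below node | node is in s)."""
    if isinstance(node, str):  # leaf: the observed label has probability 1
        return {s: 1.0 if column[node] == s else 0.0 for s in ALPHABET}
    result = {s: 1.0 for s in ALPHABET}
    for child, branch_length in node:
        below = partial_likelihoods(child, column)
        for s in ALPHABET:
            # Sum over the unobserved state at the child end of the branch.
            result[s] *= sum(transition_prob(s, c, branch_length) * below[c]
                             for c in ALPHABET)
    return result


def column_likelihood(tree, column):
    """Likelihood of one column of labels, with a uniform root prior."""
    root = partial_likelihoods(tree, column)
    return sum(root[s] for s in ALPHABET) / len(ALPHABET)


if __name__ == "__main__":
    conserved = {"human": "E", "chimp": "E", "mouse": "E"}
    divergent = {"human": "E", "chimp": "I", "mouse": "N"}
    print(column_likelihood(TREE, conserved))
    print(column_likelihood(TREE, divergent))
```

Under this toy model a column in which all species share the same label receives a higher likelihood than one in which they disagree, which is the kind of cross-species correlation a phyloHMM exploits.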
xRate Software
My project will use the xRate software package to create a phylo-grammar that will implement a
hierarchical dynamic Bayesian network. The phylo-grammar will use different species
phylogenies to enforce correlations in gene structure along the phylogenetic tree. I will use the
non-DNA alphabet described by Evigan (Liu et al., 2008) and a transition matrix that will
capture correlations between exon and intron states. The intellectual merit of this project stems from
my ability to model complex algorithms and create detailed grammars to generate the desired
results. These ideas are heavily based in the theory of computation and computational science.
Results and Impact
With this project, I hope to combine my understanding of computer science and programming
with research in the fields of genomics, computational probability, and bioinformatics. I will
carry out this project with the intent of generating better gene predictions by combining
existing predictions with probabilistic models. I hope the results will show that combining
phylogenies and HMMs produces better gene predictions, with higher sensitivity and
specificity. It is possible that problems in the implementation of this project could
lead to unforeseen outcomes. In that case, the issues that arise will be well documented and will
still provide insight for future projects in this area.
Works Cited
Bradley, R.K., & Holmes, I. (2007). Transducers: an emerging probabilistic framework for
modeling indels on trees. Bioinformatics, 23(23), 3258-3262.
Brent, M.R. (2008). Steady progress and recent breakthroughs in the accuracy of
automated genome annotation. Nature Reviews Genetics, 9, 62-73.
Cracraft, J. (1974). Phylogenetic Models and Classification. Systematic Zoology, 23(1),
71-90.
Liu, Q., Mackey, A., Roos, D.S., & Pereira, F.C.N. (2008). Evigan: a hidden variable model for
integrating gene evidence for eukaryotic gene prediction. Bioinformatics, 24(5), 597-605.
Makarov, V. (2002). Computer Programs for Eukaryotic Gene Prediction. Briefings in
Bioinformatics, 3(2), 195-199.
Meyer, I. M., & Durbin, R. (2002). Comparative Ab Initio Prediction of Gene Structures Using
Pair HMMs. Bioinformatics, 18(10), 1309-1318.
Mossel, E., & Roch, S. (2006). Learning Nonsingular Phylogenies and Hidden Markov
Models. The Annals of Applied Probability, 16(2), 583-614.
Siepel, A., & Haussler, D. (2005). Phylogenetic Hidden Markov Models. Statistics for
Biology and Health, 3, 325-351.