Analysing Protein Energetics with Statistical Machine Learning

Computer simulations continue to shed light on the phenomena of protein folding and function. Protein
modelling and structure prediction from sequence face two major challenges. The first is the difficulty of
efficient sampling in the enormous conformational space, which is especially critical for molecular dynamics
and Markov Chain Monte Carlo simulations. The second challenge is the development of the energy function
describing molecular interactions for the problem at hand. The microscopic size of protein molecules
makes it impossible to measure these interactions directly, and so known protein structures themselves
have become the best available experimental evidence.
This project will address both of these challenges. The overall goal of this research is to
advance knowledge of protein energetics and improve on established modelling techniques
that utilize empirical knowledge-based potentials. It will address outstanding questions
surrounding one of the most intriguing problems of molecular biology: how proteins adopt a
unique functional native structure. The project will draw on Contrastive Divergence, a
novel statistical machine learning technique, to infer the interaction potentials from a subset
of known protein structures. It will also build upon a successful CSC MSc project from last
year, which implemented a novel Bayesian method for sampling the conformational space of
molecular systems, known as Nested Sampling. This technique allows us to directly
investigate the macroscopic states of the protein folding pathway and evaluate the
associated free energies.
Contrastive Divergence (CD) [1] is a very general methodology for the iterative optimization of interaction
parameters. The methodology requires a dataset of known equilibrium conformations and a
Metropolis Monte Carlo procedure to produce perturbed conformations. In the Contrastive Divergence
approach, unlike the traditional approach to statistical potentials, no assumptions are made regarding
the a priori distribution of conformations in the absence of interaction. After originating in the machine
learning and artificial intelligence community just a few years ago, CD learning has mostly been used in
fields such as computer vision and microchip design that are very far from biological applications. We have
recently completed the first successful biophysical applications, which demonstrate ‘proof of principle’ for
the proposed research [2, 3, 4].
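The CD procedure described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the protein model or the settings used in [2, 3, 4]: the "conformation" is a single coordinate x, the energy is a hypothetical harmonic well E(x) = theta * x^2 / 2 at unit temperature, and the function names, learning rate and step size are chosen purely for illustration. Each iteration perturbs every data point with a few Metropolis steps under the current parameter, then moves the parameter so that the perturbed conformations become less favourable relative to the data.

```python
import math
import random


def energy(x, theta):
    """Hypothetical toy energy: harmonic well with stiffness theta."""
    return 0.5 * theta * x * x


def energy_grad_theta(x):
    """Derivative of the toy energy with respect to theta."""
    return 0.5 * x * x


def metropolis_steps(x, theta, n_steps, step_size, rng):
    """Perturb x with a few Metropolis moves at unit temperature."""
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)
        delta = energy(proposal, theta) - energy(x, theta)
        # Always accept downhill moves; accept uphill moves with exp(-delta).
        if delta <= 0.0 or rng.random() < math.exp(-delta):
            x = proposal
    return x


def contrastive_divergence(data, theta0, n_iter=300, lr=0.5, k=1,
                           step_size=1.5, seed=0):
    """Estimate theta by CD-k, starting each chain at a data point."""
    rng = random.Random(seed)
    theta = theta0
    n = len(data)
    grad_data = sum(energy_grad_theta(x) for x in data) / n
    for _ in range(n_iter):
        perturbed = [metropolis_steps(x, theta, k, step_size, rng)
                     for x in data]
        grad_pert = sum(energy_grad_theta(x) for x in perturbed) / n
        # CD update: difference of energy gradients, perturbed minus data.
        theta += lr * (grad_pert - grad_data)
        theta = max(theta, 1e-6)  # keep the toy energy well defined
    return theta


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic "equilibrium" data drawn from the toy model with theta = 1.
    data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    print(contrastive_divergence(data, theta0=3.0))
```

With synthetic data drawn at theta = 1, the estimate should settle close to 1 from a starting value of 3. Note that the update uses only the data and briefly perturbed copies of it; no reference distribution of non-interacting conformations is required.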
The first phase of this project will focus on building an appropriate parameterized model of a protein. In
particular, the research will assess the dielectric permittivity of proteins and the Kauzmann hydrophobicity
coefficient, along with other parameters of interest. Estimates of these parameters are hotly debated in
the scientific community. The optimization of these interaction parameters will provide valuable insights
into protein energetics, and help select the model of protein interactions that best corresponds to native
protein structures. The second phase will concentrate on learning the model parameters with Contrastive
Divergence and creating empirical potentials that integrate electrostatic and hydrophobic interactions,
depending on amino acid type. The outcome of the learning procedure will provide feedback about the
quality of the selected model. In the third phase, as a control, the model interactions and their parameters
will be used in the context of protein folding simulations. This phase will evaluate the stability of model
proteins and the corresponding energy landscape, using Nested Sampling [5].
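The way Nested Sampling [5] yields free energies can be sketched on a toy problem. The Python example below is an illustration under stated assumptions and not the implementation from the MSc project: the system is a single coordinate with a hypothetical harmonic energy, the prior is uniform on a box, and the constrained random walk with a step size set from the spread of the live points is one simple heuristic among many. The live set is contracted by repeatedly discarding the highest-energy point, and the recorded energies and prior-mass weights are then reused to estimate the partition function at any inverse temperature beta.

```python
import math
import random


def energy(x):
    """Hypothetical toy energy: a 1D harmonic well (not a protein model)."""
    return 0.5 * x * x


def nested_sampling(n_live=300, n_iter=2400, half_width=5.0,
                    walk_steps=20, seed=0):
    """Return (energy, prior-mass weight) pairs for a uniform prior on
    [-half_width, half_width]."""
    rng = random.Random(seed)
    live = [rng.uniform(-half_width, half_width) for _ in range(n_live)]
    samples = []
    x_prev = 1.0  # prior mass enclosed by the current energy contour
    for i in range(1, n_iter + 1):
        worst = max(range(n_live), key=lambda j: energy(live[j]))
        e_star = energy(live[worst])
        x_curr = math.exp(-i / n_live)  # expected shrinkage per iteration
        samples.append((e_star, x_prev - x_curr))
        x_prev = x_curr
        # Replace the discarded point: start from another live point and
        # random-walk under the hard constraint energy < e_star.
        step = max(live) - min(live)
        start = rng.randrange(n_live - 1)
        if start >= worst:
            start += 1
        x = live[start]
        for _ in range(walk_steps):
            proposal = x + rng.uniform(-step, step)
            if abs(proposal) <= half_width and energy(proposal) < e_star:
                x = proposal
        live[worst] = x
    # Share the remaining prior mass equally among the final live points.
    for x in live:
        samples.append((energy(x), x_prev / n_live))
    return samples


def partition_function(samples, beta):
    """Z(beta) as a weighted sum of Boltzmann factors over the samples."""
    return sum(w * math.exp(-beta * e) for e, w in samples)


def free_energy(samples, beta):
    """Free energy relative to the uniform prior, F = -ln Z / beta."""
    return -math.log(partition_function(samples, beta)) / beta


if __name__ == "__main__":
    samples = nested_sampling()
    for beta in (1.0, 4.0):
        print(beta, free_energy(samples, beta))
```

A single run serves every temperature, because the stored energy levels and weights do not depend on beta. For this toy, the result can be compared with the analytic value Z(beta) = sqrt(2*pi/beta) / (2 * half_width), which holds to good accuracy when the box is much wider than the well.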
[1] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput,
14(8):1771–800, 2002.
[2] A. A. Podtelezhnikov, Z. Ghahramani, and D. L. Wild. Learning about protein hydrogen bonding by
minimizing contrastive divergence. Proteins, 66(3):588–99, 2007.
[3] A. A. Podtelezhnikov and D. L. Wild. Crankite: A fast polypeptide backbone conformation sampler.
Source Code Biol Med, 3:12, 2008.
[4] A. A. Podtelezhnikov and D. L. Wild. Reconstruction and stability of secondary structure elements in
the context of protein structure prediction. Biophys J, 96:4399–4408, 2009.
[5] J. Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–860, 2006.