Analysing Protein Energetics with Statistical Machine Learning

Computer simulations continue to shed light on the phenomena of protein folding and function. Protein modelling and structure prediction from sequence face two major challenges. The first is the difficulty of efficient sampling in the enormous conformational space, which is especially critical for molecular dynamics and Markov Chain Monte Carlo simulations. The second challenge is the development of the energy function describing molecular interactions for the problem at hand. The microscopic size of protein molecules makes it impossible to measure these interactions directly, and so known protein structures themselves have become the best available experimental evidence. This project will address both of these challenges.

The overall goal of this research is to advance knowledge of protein energetics and improve on established modelling techniques that utilize empirical knowledge-based potentials. It will address outstanding questions surrounding one of the most intriguing problems of molecular biology: how proteins adopt a unique functional native structure.

The project will draw on Contrastive Divergence, a novel statistical machine learning technique, to infer the interaction potentials from a subset of known protein structures. It will also build upon a successful CSC MSc project from last year, which implemented a novel Bayesian method for sampling the conformational space of molecular systems, known as Nested Sampling. This technique allows us to directly investigate the macroscopic states of the protein folding pathway and evaluate the associated free energies.

Contrastive Divergence (CD) [1] is a very general methodology for the iterative optimization of interaction parameters. The methodology requires a dataset of known equilibrium conformations, and a Metropolis Monte Carlo procedure to produce perturbed conformations.
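As a rough illustration of the two ingredients just named (a dataset of equilibrium conformations and a Metropolis Monte Carlo perturbation), the following minimal sketch applies a CD-style update to a deliberately simple toy system. It is not the protein model of this project. It assumes a one-dimensional "conformation" x with a harmonic energy E(x; k) = k x^2 / 2 in units where kT = 1, so that equilibrium data are Gaussian with variance 1/k. All function names, the learning rate, the step size and the iteration counts are hypothetical choices made for the example.

```python
import math
import random


def energy(x, k):
    """Toy harmonic energy E(x; k) = k * x^2 / 2, in units where kT = 1."""
    return 0.5 * k * x * x


def d_energy_dk(x):
    """Derivative of the toy energy with respect to the parameter k."""
    return 0.5 * x * x


def metropolis_perturb(x, k, n_steps, step_size, rng):
    """Run a few Metropolis steps from x under the current parameter k."""
    for _ in range(n_steps):
        trial = x + rng.uniform(-step_size, step_size)
        delta = energy(trial, k) - energy(x, k)
        # Standard Metropolis criterion: always accept downhill moves,
        # accept uphill moves with probability exp(-delta).
        if delta <= 0.0 or rng.random() < math.exp(-delta):
            x = trial
    return x


def learn_stiffness(data, k_init=1.0, learning_rate=5.0,
                    n_iterations=200, n_steps=3, step_size=0.5, seed=0):
    """Estimate k with a CD-style update (illustrative settings only)."""
    rng = random.Random(seed)
    k = k_init
    # Average energy derivative over the known equilibrium data.
    grad_data = sum(d_energy_dk(x) for x in data) / len(data)
    for _ in range(n_iterations):
        # Perturb every data point with a short Monte Carlo run.
        perturbed = [metropolis_perturb(x, k, n_steps, step_size, rng)
                     for x in data]
        grad_pert = sum(d_energy_dk(x) for x in perturbed) / len(perturbed)
        # If perturbed conformations drift to larger x^2 than the data,
        # the potential is too soft, so k is increased (and vice versa).
        k += learning_rate * (grad_pert - grad_data)
        k = max(k, 1e-6)  # keep the toy potential confining
    return k


def make_toy_data(k_true=4.0, n=1000, seed=1):
    """Draw equilibrium samples of the toy system: Gaussian, variance 1/k."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(k_true)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


if __name__ == "__main__":
    data = make_toy_data()
    print("estimated k:", learn_stiffness(data))
```

Under these assumptions the estimate should settle close to the stiffness used to generate the data. The update needs only short Monte Carlo runs started from the data, rather than fully equilibrated simulations at each parameter value.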
In the Contrastive Divergence approach, unlike the traditional approach to statistical potentials, no assumptions are made regarding the a priori distribution of conformations in the absence of interaction. After originating in the machine learning and artificial intelligence community just a few years ago, CD learning has mostly been used in fields such as computer vision and microchip design that are very far from biological applications. We have recently completed the first successful biophysical applications, which demonstrate ‘proof of principle’ for the proposed research [2, 3, 4].

The first phase of this project will focus on building an appropriate parameterized model of a protein. In particular, the research will assess the dielectric permittivity of proteins and the Kauzmann hydrophobicity coefficient, along with other parameters of interest. Estimates of these parameters are hotly debated in the scientific community. The optimization of these interaction parameters will provide valuable insights into protein energetics, and help select the model of protein interactions that best corresponds to native protein structures.

The second phase will concentrate on learning the model parameters with Contrastive Divergence and creating empirical potentials that integrate electrostatic and hydrophobic interactions, depending on amino acid type. The outcome of the learning procedure will provide feedback about the quality of the selected model.

In the third phase, as a control, the model interactions and their parameters will be used in the context of protein folding simulations. This phase will evaluate the stability of model proteins and the corresponding energy landscape, using Nested Sampling [5].

[1] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput, 14(8):1771–800, 2002.
[2] A. A. Podtelezhnikov, Z. Ghahramani, and D. L. Wild. Learning about protein hydrogen bonding by minimizing contrastive divergence. Proteins, 66(3):588–99, 2007.
[3] A. A. Podtelezhnikov and D. L. Wild. Crankite: A fast polypeptide backbone conformation sampler. Source Code Biol Med, 3:12, 2008.
[4] A. A. Podtelezhnikov and D. L. Wild. Reconstruction and stability of secondary structure elements in the context of protein structure prediction. Biophys J, 96:4399–4408, 2009.
[5] J. Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–860, 2006.