David Bernick
BME230
Identifying Functional signatures in Proteins – a computational approach
2/16/2016

Abstract

Residues within a protein sequence contribute in varying degrees to the structure that the protein adopts and to the ability of the molecule to provide value to the organism. Natural selection evaluates the function (value) of a protein while also imposing selection pressure on the underlying scaffold (structure) that supports that function. In this way, nature selects for protein function and structure interdependently.

Tools exist today to computationally evaluate and predict protein structure, but few techniques exist to computationally evaluate or classify function. Protein design tools (Rosetta, SAM, etc.) can create computational ensembles of putative sequences that fit a query backbone. These tools select near-optimal sequences by minimizing the score (energy) that a sequence would have when placed on the query fold. The information used by these tools includes structure fragments drawn from known protein structures, as well as scoring of atomic-scale interactions. Function-specific information is not available to this process. The resulting ensembles therefore provide a position-specific distribution of residues that can adopt a protein fold, independent of any functional constraints. This study will use these computational ensembles as a background distribution to highlight function in natural sequences.

Previous use of structural ensembles by Pei (2003) [i], Larson (2003) [ii, iii, iv], Koehl (2002) [v], and Kuhlman (2000) [vi] has contributed to improvements in function prediction and homology searches, and to examining the energetic optimality of natural protein structures. Koehl used a Euclidean distance score to measure the difference in position-specific distributions between aligned positions.
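To make the notion of a position-specific distribution and a Euclidean distance between two such distributions concrete, here is a minimal Python sketch. It assumes gap-free sequences that are already aligned to the same length; the function names and the toy sequences are hypothetical and are not taken from Koehl's work or from any design tool.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_distributions(sequences):
    """Return one residue-frequency dict per column of an aligned sequence set."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    profiles = []
    for col in range(length):
        # Count only the 20 standard residues; gaps and unknowns are skipped.
        counts = Counter(seq[col] for seq in sequences if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"column {col} contains no standard residues")
        profiles.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profiles


def euclidean_distance(p, q):
    """Euclidean distance between two residue distributions at one position."""
    return math.sqrt(sum((p[aa] - q[aa]) ** 2 for aa in AMINO_ACIDS))


# Toy data (hypothetical): three natural and three designed sequences.
natural = position_distributions(["ACD", "ACE", "GCD"])
designed = position_distributions(["ACD", "LCD", "VCD"])
for i, (n, d) in enumerate(zip(natural, designed)):
    print(i, round(euclidean_distance(n, d), 3))
```

A position where both sets agree completely scores zero, and the score grows as the two residue distributions diverge.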
This study will extend this idea to measure both the structural and the functional importance of every position by considering a triangle of three distributions at each position. The points of the triangle are the distribution found in nature, the distribution found in our computational ensemble, and the background distribution found across all proteins. The lengths of the three sides can then be taken as measures of function, of structure, and of the vector sum of function and structure. As a measure of confidence, these lengths can be converted to Z-scores, normalizing for the variation seen across all positions in the protein.

Additionally, this project can evaluate methods of creating these structural ensembles. Ideally, a method will produce a good approximation of the true residue diversity available at every position. Any noise introduced into the background distribution would increase the variance in the structure distribution found across all positions in the protein. This variation would be seen in the scores for both function and structure. The method that yields the smaller variance in structure and function scores would then be preferred.

Two methods are under consideration for creating structural ensembles. The first builds an ensemble of decoys constructed at an elevated annealing temperature with a flexible protein backbone. The second derives the ensemble computationally from natural structures that adopt the query protein fold, using a fixed-backbone protocol. Both techniques attempt to explore the sequence diversity available to the specific fold.

Previous efforts by this author have shown some success with this technique using SH3, PH and TIM barrel domains. This study will examine at least 15 domains, with the intention of eventually scaling up to include all known domains.
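The triangle scoring described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's implementation: it assumes Euclidean side lengths, reads the nature-to-ensemble side as "function", the ensemble-to-background side as "structure", and the nature-to-background side as their vector sum, and uses the population standard deviation for the Z-scores. All function names are hypothetical.

```python
import math
import statistics

SIDES = ("function", "structure", "combined")


def distance(p, q):
    """Euclidean distance between two residue distributions given as dicts."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def triangle_sides(natural, ensemble, background):
    """Side lengths of the triangle at one position (side labels are assumed)."""
    return {
        "function": distance(natural, ensemble),      # nature vs. design ensemble
        "structure": distance(ensemble, background),  # ensemble vs. all proteins
        "combined": distance(natural, background),    # nature vs. all proteins
    }


def z_scores(values):
    """Normalize values by their mean and population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # no variation across positions
    return [(v - mean) / sd for v in values]


def score_protein(natural_profile, ensemble_profile, background):
    """Per-position Z-scores for each triangle side across one protein."""
    sides = [triangle_sides(n, e, background)
             for n, e in zip(natural_profile, ensemble_profile)]
    return {name: z_scores([s[name] for s in sides]) for name in SIDES}


def side_variances(natural_profile, ensemble_profile, background):
    """Variance of the raw side lengths, for comparing ensemble methods."""
    sides = [triangle_sides(n, e, background)
             for n, e in zip(natural_profile, ensemble_profile)]
    return {name: statistics.pvariance([s[name] for s in sides]) for name in SIDES}
```

The variance comparison is computed on the raw side lengths rather than on the Z-scores, because Z-scores have unit variance by construction and could not distinguish one ensemble method from another.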
References

[i] Pei J, Dokholyan NV, Shakhnovich EI, and Grishin NV. Using protein design for homology detection and active site searches. (2003) PNAS 100 no. 20:11361-11366
[ii] Larson SM, Garg A, Desjarlais JR, and Pande VS. Increased Detection of Structural Templates Using Alignments of Designed Sequences. (2003) Proteins: Structure, Function and Genetics 51:390-396
[iii] Larson SM, Garg A, Desjarlais JR, and Pande VS. Increased Detection of Structural Templates Using Alignments of Designed Sequences. (2003) Proteins: Structure, Function and Genetics 51:390-396
[iv] Larson SM and Pande VS. Sequence Optimization for Native State Stability Determines the Evolution and Folding Kinetics of a Small Protein. (2003) J. Mol. Biol. 332:275-286
[v] Koehl P and Levitt M. Protein Topology and stability define the space of allowed sequences. (2002) PNAS 99 no. 3
[vi] Kuhlman B and Baker D. Native protein sequences are close to optimal for their structures. (2000) PNAS 97:10383-10388