The Side-Chain Positioning Problem
Carl Kingsford, Princeton University
Joint work with Bernard Chazelle and Mona Singh

Proteins
Many functions: structural, messaging, catalytic, …
Sequence of amino acids strung together on a backbone
Each amino acid has a flexible side-chain
Proteins fold. Function depends highly on 3D shape.
[Figure: protein structure showing the backbone and side-chains; residues labeled V, R, R, C]

Side-chain Positioning Problem
Given:
• fixed backbone
• amino acid sequence (e.g. IILVPACW…)
Find the 3D positions for the side-chains that minimize the energy of the structure.
Assume lowest energy is best.

Side-chain Positioning Applications
Homology modeling: use the known backbone of a similar protein to predict a new structure.
Unknown: KNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHII
          NV CKNG  NCY S S + ITDCR  G+SKYPNC YKT+   KHII
Known:   ENVTCKNGKKNCYKSTSALHITDCRLKGNSKYPNCDYKTSDYQKHII

Rotamers
Each amino acid has some number of statistically preferred side-chain positions.
These are called rotamers.
The continuum of positions is well approximated by rotamers.
[Figure: 3 rotamers of Arginine]

An Equivalent Graph Problem
For a protein with p side-chains, build a p-partite graph:
• part $V_i$ for each side-chain i
• node u for each rotamer
• edge {u,v} if u interacts with v
Weights:
• E(u) = self-energy
• E(u,v) = interaction energy
[Figure: parts $V_1$, $V_2$; each part is a position, each node is a rotamer; n nodes in total]

Feasible Solution
Feasible solution: one node from each part.
cost(feasible) = cost of induced subgraph, i.e. the sum of E(u) over the chosen nodes plus the sum of E(u,v) over the chosen pairs.
Hard to approximate within a factor of cn, where n is the number of nodes.

Determining the Energy
• Energy of a protein conformation is the sum of several energy terms: electrostatics, van der Waals, bond lengths, bond angles, hydrogen bonds, dihedral angles
• No triangle inequality

Plan of Attack
1. Formulate as a quadratic integer program
2. Relax into a semidefinite program
3. Solve the SDP in polynomial time
4. Round solution vectors to choice of rotamers

Quadratic Integer Program
minimize   $\sum_u E(u)\,x_u + \sum_{\{u,v\}} E(u,v)\,x_u x_v$
subject to $\sum_{u \in V_j} x_u = 1$ for each position j
           $\sum_{u \in V_j} x_u x_v = x_v$ for each position j, node v
           $x_u \in \{0,1\}$

Relax Into Vector Program
Use $x_u = x_u^2$ for $x_u \in \{0,1\}$
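to write as a pure quadratic program.
Variables: n-dimensional vectors $x_u$
minimize   $\sum_u E(u)\,x_u^T x_u + \sum_{\{u,v\}} E(u,v)\,x_u^T x_v$
subject to $\sum_{u \in V_j} x_u^T x_u = 1$ for each position j
           $\sum_{u \in V_j} x_u^T x_v = x_v^T x_v$ for each node v, position j

Rewrite As Semidefinite Program
$X = (x_{uv})$ is PSD, with $x_{uv} = x_u^T x_v$
minimize   $\sum_u E(u)\,x_{uu} + \sum_{\{u,v\}} E(u,v)\,x_{uv}$
subject to $\sum_{u \in V_j} x_{uu} = 1$ for each position j
           $\sum_{u \in V_j} x_{uv} = x_{vv}$ for each node v, position j
           $x_{uv} \ge 0$

Constraints & Dummy Position
Insert a new position $V_0$ with a single node $u_0$. No edges, no node cost.
• position constraints: the sum of the node variables in each position is 1
• flow constraints: the sum of the edge variables between a node and the nodes of any one position equals that node variable
[Figure: positions $V_0$, $V_i$, $V_j$ with variables $x_{u_0}$, $x_{uv}$, $x_{vv}$]

Geometry of the Solution Vectors
Lemma. For each position j, $\sum_{u \in V_j} x_u = x_{u_0}$.
Proof. Let $y = \sum_{u \in V_j} x_u$. Simple algebra shows that:
• Length of y is 1
• Length of $x_{u_0}$ is 1
• Length of projection of y onto $x_{u_0}$ is 1
Hence $y = x_{u_0}$.

Solution Vectors Lie on a Sphere
Each solution vector lies on a sphere of radius ½ centered at $x_{u_0}/2$:
$\|x_u\|^2 = x_{u_0}^T x_u$, because $x_{uu} = x_{u u_0}$ by the flow constraint at the dummy position, so $\|x_u - x_{u_0}/2\|^2 = \|x_u\|^2 - x_{u_0}^T x_u + 1/4 = 1/4$.
Note. Length of projection of $x_u$ onto $x_{u_0}$ is the length of vector $x_u$ squared.

How do we round the solution of the SDP relaxation?
Convert fractional solutions into feasible 0/1 solutions:
• Projection rounding
• Perron-Frobenius rounding

Projection Rounding
Since $\sum_{u \in V_j} x_{uu} = 1$, the $x_{uu}$ give a probability distribution at each position.
Pick node u with probability $x_{uu}$.
$x_{uu}$ = length of the projection of $x_u$ onto $x_{u_0}$.
[Figure: vectors $x_u$, $x_v$ and their projections onto $x_{u_0}$]

Drift for Projection Rounding
Drift = expected difference between fractional & rounded solutions:
$\Delta_{uv} = E(u,v)\,(x_{uv} - \Pr[uv])$
Comes entirely from pairwise interactions.
In fact, since positions are rounded independently, $\Delta_{uv} = E(u,v)\,(x_{uv} - x_{uu} x_{vv})$.
Because the $x_u$ are on a sphere, $x_{uv} - x_{uu} x_{vv} = y_u^T y_v$ with $\|y_u\|^2 = x_{uu}(1 - x_{uu})$, where $y_u = x_u - x_{uu}\,x_{u_0}$ is the component of $x_u$ orthogonal to $x_{u_0}$.
By Cauchy-Schwarz, $|\Delta_{uv}| \le |E(u,v)|\,\sqrt{x_{uu}(1 - x_{uu})\,x_{vv}(1 - x_{vv})}$.

Perron-Frobenius Rounding
q = 0/1 characteristic n-vector of the optimal solution, e.g.
$q = (0\ 1\ 0 \mid 0\ 1\ 0 \mid 0\ 0\ 1 \mid 0\ 0\ 1\ 0\ 0)^T$, where each block is one position and sums to 1.
Optimal integral $X^* = qq^T$, so rank($X^*$) = 1.
Idea: approximate the fractional X by a rank 1 matrix $qq^T$.
Ideally we would sample from the matrix itself, but we settle for sampling from q.
q needs to contain probability distributions for each position.
How do we choose q?

Possible Choices for q
Lemma. Any nonnegative vector q with L1-norm p in the image space of X contains the required set of probability distributions.
Proof.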
$X = W^T W$, where $W = [x_1\ x_2\ \dots\ x_n]$.
Let $1_i$ be the characteristic vector for position i.
Suppose $q = Xy$ for some y. Then
$1_i^T q = 1_i^T W^T W y = (W 1_i)^T W y = x_{u_0}^T W y$,
since $W 1_i = \sum_{u \in V_i} x_u = x_{u_0}$ (the vectors in each position sum to the dummy vector).
The final value is independent of i, so every position has the same sum. Because q is nonnegative with L1-norm p over p positions, each position sums to 1.

A Choice for q
By spectral decomposition, $X = \sum_i \lambda_i z_i z_i^T$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$.
Take $q = p\,z_1 / \|z_1\|_1$.
$z_1$ is in the image space of X, since $X z_1 = \lambda_1 z_1$.
By the Perron-Frobenius theorem for nonnegative matrices, q ≥ 0.
By the Lemma, q contains the needed probability distributions.

Computational Results
• 30 random graphs
• 60 nodes, 15 positions
• edge probability ½
• weights uniformly from [0,1]
Compare solutions from:
• Simple LP
• SDP fractional
• Projection rounded
• Perron-Frobenius rounded

Future Work
Can the rounding schemes be applied to other problems?
Can the semidefinite program be sped up?
─ Can only routinely solve graphs with ≤ 120 nodes (reasonable protein problems contain 1000 to 5000 nodes)
─ $x_{uv} \ge 0$ constraints are the bottleneck
Can the requirement of a fixed backbone be relaxed?
We've worked quite a bit with real proteins using an LP approach.
Seems an SDP formulation might be useful.

More Information
The Side-Chain Positioning Problem: A Semidefinite Programming Formulation with New Rounding Schemes, B. Chazelle, C. Kingsford, M. Singh, Proc. ACM FCRC'2003, Principles of Computing and Knowledge: Paris Kanellakis Memorial Workshop (2003).
http://www.cs.princeton.edu/~carlk/papers.html
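
Illustration (not from the talk)
The two rounding schemes described in the slides can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation. All function and variable names are hypothetical. No SDP is solved: the fractional matrix X is built as a convex combination of three integral solutions, which satisfies the position and flow constraints by construction. The dummy position is omitted for brevity, and the energies are made-up toy values.

```python
import numpy as np

def integral_solution(choice, parts, n):
    """0/1 characteristic n-vector with one chosen node per position."""
    q = np.zeros(n)
    for pos, k in enumerate(choice):
        q[parts[pos][k]] = 1.0
    return q

def cost(chosen, node_energy, edge_energy):
    """Cost of the induced subgraph: self-energies plus pairwise energies."""
    total = sum(node_energy[u] for u in chosen)
    total += sum(edge_energy[u, v]
                 for i, u in enumerate(chosen) for v in chosen[i + 1:])
    return float(total)

def projection_rounding(X, parts, rng):
    """Pick node u in each position with probability x_uu (diagonal of X)."""
    d = np.diag(X)
    return [int(rng.choice(part, p=d[part] / d[part].sum())) for part in parts]

def perron_frobenius_vector(X, parts):
    """Principal eigenvector of X, scaled to L1-norm p (p = number of positions)."""
    _, vecs = np.linalg.eigh(X)   # eigenvalues are returned in ascending order
    z1 = vecs[:, -1]              # eigenvector of the largest eigenvalue
    if z1.sum() < 0:              # eigenvectors are only defined up to sign
        z1 = -z1
    return len(parts) * z1 / np.abs(z1).sum()

def perron_frobenius_rounding(X, parts, rng):
    """Pick one node per position from the distributions contained in q."""
    q = np.clip(perron_frobenius_vector(X, parts), 0.0, None)
    return [int(rng.choice(part, p=q[part] / q[part].sum())) for part in parts]

# Toy instance (assumption): 3 positions with 2 rotamers each, n = 6 nodes.
parts = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
n = 6
choices = [(0, 0, 0), (0, 1, 1), (1, 1, 0)]   # three integral solutions
weights = [0.5, 0.3, 0.2]                     # convex weights, sum to 1

X = np.zeros((n, n))
for w, c in zip(weights, choices):
    q_int = integral_solution(c, parts, n)
    X += w * np.outer(q_int, q_int)

# Made-up energies: E(u) = u, E(u,v) = 1 for every pair u != v.
node_energy = np.arange(n, dtype=float)
edge_energy = np.ones((n, n)) - np.eye(n)

rng = np.random.default_rng(0)
for name, scheme in [("projection", projection_rounding),
                     ("Perron-Frobenius", perron_frobenius_rounding)]:
    chosen = scheme(X, parts, rng)
    print(name, chosen, cost(chosen, node_energy, edge_energy))
```

On this toy X the scaled principal eigenvector sums to 1 within each position, as the lemma on possible choices for q predicts, so it can be sampled position by position exactly like the diagonal of X.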