Robotics Algorithms for the Study of Protein Structure and Motion

Jean-Claude Latombe



Computer Science Department

Stanford University

Based on Itay Lotan’s PhD

Unfolded (denatured) state

Many pathways

Folded (native) state

Folded State

Loops connect α-helices and β-strands

Protein Sequence and Structure

[Figure labels: amino acid (residue), peptide bonds]

φ-ψ Kinematic Linkage Model

Conformational space

Molecule ≡ Robot

Why Study Proteins?

▪ They perform many vital functions, e.g.:

• catalysis of reactions

• storage of energy

• transmission of signals

• building blocks of muscles

▪ They are linked to key biological problems that raise major computational challenges, mostly due to their large sizes (100s to several 1000s of atoms), their many kinematic degrees of freedom, and their huge number (millions)

Two Problems

▪ Structure determination from electron density maps

• Inverse kinematics techniques

[Itay Lotan, Henry van den Bedem, Ashley Deacon (Joint Center for Structural Genomics)]

▪ Energy maintenance during Monte Carlo simulation

• Distance computation techniques

[Itay Lotan, Fabian Schwarzer, and Danny Halperin (Tel Aviv University)]

Structure Determination:

X-Ray Crystallography

Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID

• 1.0Å < d < 2.3Å: ~90% completeness

• 2.3Å ≤ d < 3.0Å: ~67% completeness (varies widely) [1]

JCSG: 43% of data sets ≥ 2.3Å

▪ Manually completing a model:

• Labor intensive, time consuming

• Existing tools are highly interactive

→ Model completion is the high-throughput bottleneck

[1] Badger (2003) Acta Cryst. D59

The Completion Problem

▪ Input:

• Electron-density map

• Partial structure (folded)

• Two anchor residues, Anchor 1 and Anchor 2 (3 atoms each)

• Amino-acid sequence of missing fragment (typically 4 – 15 residues long)

▪ Output:

• Ranked conformations Q of fragment that:

- Respect the closure constraint (inverse kinematics)

- Maximize target function T(Q) measuring fit with electron-density map

- Have no atomic clashes

Two-Stage IK Method

1. Candidate generation → Closed fragments

2. Candidate refinement → Optimize fit with EDM

Stage 1: Candidate Generation

1. Generate a random conformation of fragment (only one end attached to anchor)

2. Close fragment (i.e., bring other end to second anchor) using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)

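Closure Distance

Closure distance, summed over the three backbone atoms of the closing residue:

$$S = \|N - N'\|^2 + \|C_\alpha - C_\alpha'\|^2 + \|C - C'\|^2$$

where each pair (e.g., $N$ and $N'$) is the same backbone atom at the fixed end and at the moving end of the fragment.

For each angle in turn, compute the value $q_i$ that minimizes $S$ (+ bias toward avoiding steric clashes)

A.A. Canutescu and R.L. Dunbrack Jr. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Prot. Sci. 12:963–972, 2003.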
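The CCD closing step can be illustrated on a toy problem. The sketch below is a simplification, not the loop-closure code used in this work: it assumes a planar chain of revolute joints and a single end point to close (rather than the three backbone atoms N, Cα, C), and it omits the steric-clash bias. The names `forward` and `ccd_close` are hypothetical.

```python
import math

def forward(lengths, angles):
    """Joint positions of a planar chain; angles are relative joint angles."""
    x = y = heading = 0.0
    points = [(x, y)]
    for length, angle in zip(lengths, angles):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points

def ccd_close(lengths, angles, target, tol=1e-6, max_sweeps=500):
    """Cyclic coordinate descent: adjust one joint at a time so that the
    moving end gets as close as possible to the target (the fixed end)."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in range(len(angles)):
            pts = forward(lengths, angles)
            px, py = pts[i]      # pivot of joint i
            ex, ey = pts[-1]     # current moving end
            # Rotating about the pivot, the distance to the target is minimal
            # when (end - pivot) points in the direction of (target - pivot).
            a_end = math.atan2(ey - py, ex - px)
            a_tgt = math.atan2(target[1] - py, target[0] - px)
            angles[i] += a_tgt - a_end
        ex, ey = forward(lengths, angles)[-1]
        if math.hypot(ex - target[0], ey - target[1]) < tol:
            break
    return angles
```

Each single-joint update has a closed-form solution, which is what makes CCD cheap per iteration; the same property holds in 3-D for the three-atom closure distance S, though the formula for the optimal angle is different there.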

Exact Inverse Kinematics

Repeat for each conformation of a closed fragment:

1. Pick 3 amino acids at random (3 pairs of φ-ψ angles)

2. Apply exact IK solver to generate all IK solutions [Coutsias et al., 2004]

[Figure: TM0813, with residues GLU-83 and GLY-96 labeled]

Stage 2: Candidate Refinement

▪ Target function T(Q) measuring quality of the fit with the EDM

▪ Minimize T while retaining closure

▪ Closed conformations lie on a self-motion manifold of lower dimension

[Figure: a 1-D self-motion manifold in the space (q1, q2, q3), with the null space and the increments dq1, dq2, dq3 at a point of the manifold]

Closure and Null Space

▪ $dX = J\,dQ$, where $J$ is the $6 \times n$ Jacobian matrix ($n > 6$)

▪ Null space $\{dQ \mid J\,dQ = 0\}$ has dimension $n - 6$

▪ $N$: orthonormal basis of the null space

▪ $dQ = N N^T \nabla T(Q)$

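Computation of N

SVD of J:

$$dX = J\,dQ = U\,\Sigma\,V^T\,dQ, \qquad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_6)$$

with $U$ of size $6 \times 6$, $\Sigma$ of size $6 \times 6$, and $V^T$ of size $6 \times n$.

The $n - 6$ rows of $N^T$ form the basis $N$ of the null space, obtained by Gram-Schmidt orthogonalization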
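The projection $N N^T \nabla T$ can also be computed without forming $N$ explicitly, since the null space is the orthogonal complement of the row space of $J$. The sketch below is a swapped-in variant, not the SVD-based computation described above: it orthonormalizes the rows of $J$ by Gram-Schmidt and subtracts their components from the gradient. The function names are hypothetical and the code is pure Python for readability.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_row_space(J, eps=1e-12):
    """Gram-Schmidt on the rows of J; returns an orthonormal basis of the row space."""
    basis = []
    for row in J:
        v = list(row)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:              # skip rows that are linearly dependent
            basis.append([vi / norm for vi in v])
    return basis

def project_to_null_space(J, grad):
    """Return the component of grad lying in {dQ | J dQ = 0},
    i.e. the same vector as N N^T grad for an orthonormal null-space basis N."""
    d = list(grad)
    for b in orthonormal_row_space(J):
        c = dot(d, b)
        d = [di - c * bi for di, bi in zip(d, b)]
    return d
```

For example, with a hypothetical 2 × 4 Jacobian `J = [[1, 0, 1, 0], [0, 1, 1, 2]]`, the vector returned by `project_to_null_space(J, grad)` satisfies `J dQ = 0` up to rounding, so a small step along it leaves the closure constraint unchanged to first order.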

Refinement Procedure

Repeat until minimum of T is reached:

1. Compute J and N at current Q

2. Compute $\nabla T$ at current Q (analytical expression of $\nabla T$ + linear-time recursive computation [Abe et al., Comput. Chem., 1984])

3. Move by small increment along $dQ = N N^T \nabla T$

(+ Monte Carlo / simulated annealing protocol to deal with local minima)
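The refinement loop can be tried on a toy case with a 1-D self-motion manifold. The sketch below is a set of assumptions, not the actual refinement code: a 3-link planar arm stands in for the fragment (so J is 2 × 3 and the null space has dimension 1), the target function `T` is a made-up quadratic in the first angle rather than an EDM fit, and the step is taken along the negative projected gradient so that T decreases (the sign convention is an assumption). No Monte Carlo or annealing is included.

```python
import math

LENGTHS = (1.0, 1.0, 1.0)   # hypothetical link lengths

def end_point(q):
    h = x = y = 0.0
    for length, angle in zip(LENGTHS, q):
        h += angle
        x += length * math.cos(h)
        y += length * math.sin(h)
    return x, y

def jacobian(q):
    """2 x 3 Jacobian of the end point with respect to the joint angles."""
    headings, h = [], 0.0
    for angle in q:
        h += angle
        headings.append(h)
    row_x = [-sum(L * math.sin(hj) for L, hj in zip(LENGTHS[i:], headings[i:]))
             for i in range(3)]
    row_y = [sum(L * math.cos(hj) for L, hj in zip(LENGTHS[i:], headings[i:]))
             for i in range(3)]
    return row_x, row_y

def null_direction(q):
    """Unit vector spanning the 1-D null space: cross product of the rows of J.
    Undefined at singular configurations (all links collinear)."""
    a, b = jacobian(q)
    n = [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]
    norm = math.sqrt(sum(c * c for c in n))
    return [c / norm for c in n]

def T(q, q1_target=0.8):            # made-up target function
    return (q[0] - q1_target) ** 2

def grad_T(q, q1_target=0.8):
    return [2.0 * (q[0] - q1_target), 0.0, 0.0]

def refine(q, step=0.01, iterations=500):
    q = list(q)
    for _ in range(iterations):
        n = null_direction(q)                                 # step 1: J and N
        slope = sum(ni * gi for ni, gi in zip(n, grad_T(q)))  # step 2: N^T grad T
        q = [qi - step * slope * ni for qi, ni in zip(q, n)]  # step 3: small move
    return q
```

Because each move is only first-order tangent to the manifold, the end point drifts slightly; a practical implementation would presumably keep the increments small or re-close the chain periodically.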

[Figure: TM0813, with residues GLU-83 and GLY-96 labeled]

Tests #1: Artificial Gaps

▪ TM1621 (234 residues) and TM0423 (376 residues), SCOP classification α/β

▪ Complete structures (gold standard) resolved with EDM at 1.6Å resolution

▪ Compute EDM at 2, 2.5, and 2.8Å resolution

▪ Remove fragments and rebuild

TM1621

103 Fragments from TM1621 at 2.5Å

Short Fragments:

100% < 1.0Å aaRMSD

Long Fragments:

• 12 residues: 96% < 1.0Å aaRMSD

• 15 residues: 88% < 1.0Å aaRMSD

Produced by H. van den Bedem

Example: TM0423

PDB: 1KQ3, 376 res.

2.0Å resolution

12 residue gap

Best: 0.3Å aaRMSD

Tests #2: True Gaps

▪ Structure computed by RESOLVE

▪ Gaps completed independently (gold standard)

▪ Example: TM1742 (271 residues)

• 2.4Å resolution; 5 gaps left by RESOLVE

Length       4       5       5       7       10
Top scorer   0.22Å   0.78Å   0.36Å   0.72Å   0.43Å

Produced by H. van den Bedem

TM1621

▪ Green: manually completed conformation

▪ Cyan: conformation computed by stage 1

▪ Magenta: conformation computed by stage 2

▪ The aaRMSD improved by 2.4Å to 0.31Å

Current/Future Work

▪ Software actively being used at the JCSG

▪ What about multi-modal loops?

▪ TM0755: data at 1.8Å

▪ 8-residue fragment crystallized in 2 conformations

▪ Overlapping density: difficult to interpret manually

[Figure: conformations A and B of the fragment, with residues A316 Ser and A323 His labeled]

Algorithm successfully identified and built both conformations

▪ Fuzziness in EDM can then be exploited

▪ Use EDM to infer a probability measure over the conformation space of the loop

Amylosucrase

J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran.

J. Comp. Chemistry, 25:956-967, 2004

Energy Maintenance during Monte Carlo Simulation

Joint work with Itay Lotan, Fabian Schwarzer, and Dan Halperin (Computer Science Department, Tel Aviv University)

Monte Carlo Simulation (MCS)

▪ Random walk through conformation space

▪ At each attempted step:

• Perturb current conformation at random

• Accept step with probability:

$$P(\text{accept}) = \min\left(1,\; e^{-\Delta E / kT}\right)$$

▪ The conformations generated by an arbitrarily long MCS are Boltzmann distributed, i.e.,

$$\#\text{conformations in } V \;\sim\; \int_V e^{-E/kT}\,dV$$
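The acceptance rule above is the standard Metropolis criterion. The sketch below illustrates it on a one-dimensional toy "conformation" with a caller-supplied energy function; the names `acceptance_probability` and `metropolis_walk`, the uniform perturbation, and the 1-D state are assumptions made for illustration, not the simulation code of this work.

```python
import math
import random

def acceptance_probability(delta_e, kT):
    """min(1, exp(-dE / kT)): downhill moves are always accepted."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / kT)

def metropolis_walk(energy, x0, kT, steps, step_size, rng):
    """Random walk whose samples approach the Boltzmann distribution of `energy`."""
    x, e = x0, energy(x0)
    samples = []
    for _ in range(steps):
        x_new = x + rng.uniform(-step_size, step_size)   # random perturbation
        e_new = energy(x_new)
        if rng.random() < acceptance_probability(e_new - e, kT):
            x, e = x_new, e_new                          # accept the step
        samples.append(x)                                # rejected steps repeat x
    return samples
```

A typical call would be `metropolis_walk(lambda x: 0.5 * x * x, 0.0, 1.0, 20000, 1.0, random.Random(0))`. Note that the energy is evaluated once per attempted step, which is why efficient energy maintenance matters for runs of millions of steps.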

Monte Carlo Simulation (MCS)

▪ Used to:

• sample meaningful distributions of conformations

• generate energetically plausible motion pathways

▪ A simulation run may consist of millions of steps

→ energy must be evaluated a large number of times

Problem: How to maintain energy efficiently?

Energy Function

▪ E = Σ bonded terms + Σ non-bonded terms + Σ solvation terms

▪ Bonded terms: O(n)

▪ Non-bonded terms, e.g., Van der Waals and electrostatic

- Depend on distances between pairs of atoms

- O(n²) → expensive to compute

▪ Solvation terms

- May require computing molecular surface

Non-Bonded Terms

▪ Energy terms go to 0 when distance increases

→ Cutoff distance (6 - 12Å)

▪ vdW forces prevent atoms from bunching up

→ Only O(n) interacting pairs [Halperin & Overmars, 98]

Problem: How to find interacting pairs without enumerating all atom pairs?

[Figure: cutoff distance d_cutoff around an atom]

Grid Method

▪ Subdivide 3-space into cubic cells

▪ Compute cell that contains each atom center

▪ Represent grid as hashtable

▪ Θ(n) time to build grid

▪ O(1) time to find interacting pairs for each atom

▪ Θ(n) to find all interacting pairs of atoms [Halperin & Overmars, 98]

→ Asymptotically optimal in worst-case
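A minimal sketch of the grid method follows, assuming the cell side equals the cutoff distance so that every interacting pair lies in the same cell or in two adjacent cells. The function names and the choice of a Python dictionary as the hash table are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def build_grid(centers, cell):
    """Hash each atom index into the cubic cell that contains its center."""
    grid = defaultdict(list)
    for idx, (x, y, z) in enumerate(centers):
        grid[(int(x // cell), int(y // cell), int(z // cell))].append(idx)
    return grid

def interacting_pairs(centers, cutoff):
    """All index pairs (i, j), i < j, whose centers are within the cutoff."""
    grid = build_grid(centers, cutoff)       # cell side = cutoff distance
    pairs = set()
    for (cx, cy, cz), members in grid.items():
        # Only the 27 cells around (and including) the current one can hold partners.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbors = grid.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbors:
                    if i < j and dist2(centers[i], centers[j]) <= cutoff ** 2:
                        pairs.add((i, j))
    return pairs
```

The linear bound relies on the property stated earlier: since atoms cannot bunch up, each cell holds a bounded number of atom centers, so the inner loops do constant work per atom.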

Can we do better on average?

▪ Few DOFs are changed at each MC step

[Figure: plot from a simulation of 100,000 attempted steps; x-axis: number k of DOF changes (0, 5, 10, 20, 30)]

▪ Proteins are long kinematic chains

Long sub-chains stay rigid at each step

→ Many interacting pairs of atoms are unchanged

→ Many partial energy sums remain constant

Problem: How to find new interacting pairs and retrieve unchanged partial sums?

Two New Data Structures

1. ChainTree → Fast detection of interacting atom pairs

2. EnergyTree → Retrieval of unchanged partial energy sums

ChainTree (Twofold Hierarchy: BVs + Transforms)

[Figure: hierarchy built over the links and joints of the chain, with transforms labeled T_AB, T_JK, and T_NO]

Updating the ChainTree

Update path to root:

– Recompute transforms that “shortcut” the DOF change

– Recompute BVs that contain the DOF change

– $O(k \log_2(2n/k))$ work for k changes
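The "update path to root" idea can be illustrated with the transform half of the hierarchy alone. The sketch below is a heavy simplification, not the ChainTree itself: it assumes a planar chain, a balanced tree over a power-of-two number of links, and no bounding volumes. The class `TransformTree` and its methods are hypothetical names.

```python
import math

def link_transform(angle, length):
    """Planar rigid transform (theta, tx, ty): rotate by the joint angle,
    then advance along the link."""
    return (angle, length * math.cos(angle), length * math.sin(angle))

def compose(a, b):
    """Transform equivalent to applying b in the frame reached by a."""
    ta, xa, ya = a
    tb, xb, yb = b
    c, s = math.cos(ta), math.sin(ta)
    return (ta + tb, xa + c * xb - s * yb, ya + s * xb + c * yb)

class TransformTree:
    def __init__(self, angles, lengths):
        self.n = len(angles)                      # assumed to be a power of two
        self.lengths = list(lengths)
        self.nodes = [None] * (2 * self.n)        # nodes[1] is the root
        for i, (q, length) in enumerate(zip(angles, lengths)):
            self.nodes[self.n + i] = link_transform(q, length)
        for k in range(self.n - 1, 0, -1):        # each node shortcuts its sub-chain
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])

    def set_angle(self, i, angle):
        """Change one DOF; recompute only the shortcuts on the path to the root."""
        k = self.n + i
        self.nodes[k] = link_transform(angle, self.lengths[i])
        recomputed = 0
        k //= 2
        while k >= 1:
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])
            recomputed += 1
            k //= 2
        return recomputed

    def end_frame(self):
        return self.nodes[1]
```

With 8 links, a single call to `set_angle` recomputes 3 internal nodes instead of recomposing all 8 link transforms; nodes off the path keep their cached transforms, which mirrors the observation that long sub-chains stay rigid at each step.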

Finding Interacting Pairs

▪ Do not search inside rigid sub-chains (unmarked nodes)

▪ Do not test two nodes with no marked node between them

→ New interacting pairs

EnergyTree

[Figure: tree whose nodes are labeled with partial energy sums, e.g., E(J,L), E(K,L), E(L,L), E(M,M), E(N,N)]

Complexity

▪ n: total number of DOFs

▪ k: number of DOF changes at each MCS step

▪ k << n

▪ Complexity of:

• updating ChainTree: $O(k \log_2(2n/k))$

• finding interacting pairs: $O(n^{4/3})$, but performs much better in practice

Experimental Setup

▪ Energy function:

• Van der Waals

• Electrostatic

• Attraction between native contacts

• Cutoff at 12Å

▪ 300,000 steps MCS with Grid and ChainTree

• Steps are the same with both methods

▪ Early rejection for large vdW terms

Results: 1-DOF change

[Figure: speedup vs. number of amino acids (68, 144, 374, 755); speedup labels: 12.5, 7.8, 5.8, 3.5]

Results: 5-DOF change

[Figure: speedup vs. number of amino acids (68, 144, 374, 755); speedup labels: 5.9, 2.2, 4.5, 3.4]

Two-Pass ChainTree (ChainTree+)

1st pass: small cutoff distance to detect steric clashes

2nd pass: normal cutoff distance

[Figure: tests around native state; label: >5]

Interaction with Solvent

▪ Implicit solvent model: solvent as continuous medium, interface is solvent-accessible surface

E. Eyal, D. Halperin. Dynamic Maintenance of Molecular Surfaces under

Conformational Changes. http://www.give.nl/movie/publications/telaviv/EH04.pdf

Summary

▪ Inverse kinematics techniques → Improve structure determination from fuzzy electron density maps

▪ Collision detection techniques → Speed up energy maintenance during Monte Carlo simulation

About Computational Biology

▪ Computational Biology is more than mimicking nature (e.g., performing Molecular Dynamics simulation)

▪ One of its goals is to achieve algorithmic efficiency by exploiting properties of molecules, e.g.:

• Atoms cannot bunch up together

• Forces have relatively short ranges

• Proteins are long kinematic chains
