SMA5233 Particle Methods and Molecular Dynamics Lecture 5: Applications in Biomolecular Simulation and Drug Design A/P Chen Yu Zong Tel: 6516-6877 Email: phacyz@nus.edu.sg http://bidd.nus.edu.sg Room 08-14, level 8, S16 National University of Singapore Proteins Proteins are life’s machines, tools and structures – Many jobs, many shapes, many sizes 2 Proteins Proteins are life’s machines, tools and structures – Nature reuses designs for similar jobs 1enh 1hdd 1bw5 1f43 1du6 1ftt 1cqt 3 Proteins Proteins are hetero-polymers of specific sequence M K L V D Y A G E – There are 20 common polymeric units (amino acids) Composed of a variety of basic chemical moieties – Chain lengths range from 40 amino acids on up 4 Proteins Proteins are hetero-polymers that adopt a unique fold M K L V D Y A G E 5 Proteins Protein folding as a reaction Transition state Bad Free Energy Reactants Products Good 6 Proteins Protein folding … Transition state Bad Free Energy Denatured / Partially Unfolded Native Good 7 Proteins Folded proteins Transition state Bad Free Energy Denatured / Partially Unfolded Native Good Folded, active, functional, biologically relevant state (ensemble of conformers) 8 Proteins Folded proteins Transition state Bad Free Energy Denatured / Partially Unfolded Native Good Static, 3D coordinates of some proteins’ atoms are available from x-ray crystallography & NMR 9 Proteins Folded proteins Transition state Bad Free Energy Denatured / Partially Unfolded Native Good Static, 3D coordinates of some proteins’ atoms are available from PDB http://www.pdb.org 10 Proteins Folded proteins are complex and dynamic molecules Transition state Bad Denatured / Free Partially Unfolded Energy Native Good 11 Proteins Folded proteins are complex and dynamic molecules Transition state Bad Denatured / Free Partially Unfolded Energy Native Good 12 Molecular Dynamics MD provides atomic resolution of native dynamics PDB ID: 3chy, E. coli CheY 1.66 Å X-ray crystallography 13 Molecular Dynamics MD provides atomic resolution of native dynamics PDB ID: 3chy, E. coli CheY 1.66 Å X-ray crystallography 14 Molecular Dynamics MD provides atomic resolution of native dynamics 3chy, hydrogens added 15 Molecular Dynamics MD provides atomic resolution of native dynamics 3chy, waters added (i.e. solvated) 16 Molecular Dynamics MD provides atomic resolution of native dynamics 3chy, waters and hydrogens hidden 17 Molecular Dynamics MD provides atomic resolution of native dynamics 18 native state simulation of 3chy at 298 Kelvin, waters and hydrogens hidden Proteins Folding & unfolding at atomic resolution Transition state Bad Free Energy Denatured / Partially Unfolded Native Good Disordered, non-functional, heterogeneous ensemble of conformers 19 Proteins Protein folding, why we care how it happens Transition state Free Energy mutation mutation Denatured / Partially Unfolded Native mutation Many diseases are related to protein folding and / or 20 misfolding in response to genetic mutation. Proteins Protein folding, why we care how it happens Transition state Free Energy mutation mutation Denatured / Partially Unfolded Native mutation We need to comprehend folding to build nano-scale biomachines (that could produce energy, etc…) 21 Proteins Protein folding takes > 10 μs (often much longer) Transition state Bad Denatured / Free Partially Unfolded Energy Native Good 22 Proteins Protein folding takes > 10 μs (often much longer) Transition state Bad Denatured / Free Partially Unfolded Energy Native Good 23 Proteins Protein folding is the reverse of protein unfolding Transition state Bad Denatured / Free Partially Unfolded Energy Native Good 24 Proteins Protein unfolding is relatively invariant to temperature Transition state Bad Denatured / Native Free Partially Unfolded Energy Temperature Good 25 Molecular Dynamics MD provides atomic resolution of folding / unfolding 26 unfolding simulation (reversed) of 3chy at 498 Kelvin, waters & hydrogens hidden Forces Involved in the Protein Folding Electrostatic interactions van der Waals interactions Hydrogen bonds Hydrophobic interactions (Hydrophobic molecules associate with each other in water solvent as if water molecules is the repellent to them. It is like oil/water separation. The presence of water is important for this interaction.) 27 Energy Functions used in Molecular Simulation Φ r Θ Bond stretching term Angle bending term Vtotal K b r r0 2 bonds K 0 2 angles K 1 cosn dihedrals Cij Dij 12 10 van der Waals r Hbonds rij ij i , j pairs Dihedral term H-bonding term Van der Waals term O r H The most time demanding part. Aij Bij qi q j 12 6 r electrosta tic r r ij ij i , j pairs ij Electrostatic term + r r ー 28 System for MD Simulations Without water molecules With water molecules # of atoms: 304 # of atoms: 304 + 7,377 = 7,681 29 MD Requires Huge Computational Cost Time step of MD (Δt) is limited up to about 1 fsec (10-15 sec). ← The size of Δt should be approximately one-tenth the time of the fastest motion in the system. For simulation of a protein, because bond stretching motions of light atoms (ex. O-H, C-H), whose periods are about 10-14 sec, are the fastest motions in the system for biomolecular simulations, Δt is usually set to about 1 fsec. Huge number of water molecules have to be used in biomolecular MD simulations. ← The number of atom-pairs evaluated for non-bonded interactions (van der Waals, electrostatic interactions) increases in order of N 2 (N is the number of atoms). It is difficult to simulate for long time. Usually a few tens of nanoseconds simulation is performed. 30 Time Scales of Protein Motions and MD Permeation of an ion in Porin channel Elastic vibrations of proteins α-Helix folding β-Hairpin folding Bond stretching Protein folding 10-15 10-12 10-9 10-6 10-3 100 (fs) (ps) (ns) (μs) (ms) (s) MD Time It is still difficult to simulate a whole process of a protein folding using the conventional MD method. 31 Much Faster, Much Larger! Special-purpose computer – Calculation of non-bonded interactions is performed using the special chip that is developed only for this purpose. – For example; MDM (Molecular Dynamics Machine) or MD-Grape: RIKEN MD Engine: Taisho Pharmaceutical Co., and Fuji Xerox Co. Parallelization – A single job is divided into several smaller ones and they are calculated on multi CPUs simultaneously. – Today, almost MD programs for biomolecular simulations (ex. AMBER, CHARMm, GROMOS, NAMD, MARBLE, etc) can run on parallel computers. 32 Brownian Dynamics (BD) The dynamic contributions of the solvent are incorporated as a dissipative random force (Einstein’s derivation on 1905). Therefore, water molecules are not treated explicitly. Since BD algorithm is derived under the conditions that solvent damping is large and the inertial memory is lost in a very short time, longer timesteps can be used. BD method is suitable for long time simulation. 33 System for BD Simulations Without water molecules With water molecules # of atoms: 304 # of atoms: 304 + 7,377 = 7,681 34 Algorithm of BD The Langevin equation can be expressed as d 2 ri dr mi 2 i i Fi R i dt dt (7) Here, ri and mi represent the position and mass of atom i, respectively. ζi is a frictional coefficient and is determined by the Stokes’ law, that is, ζi = 6πaiStokesη in which aiStokes is a Stokes radius of atom i and η is the viscosity of water. Fi is the systematic force on atom i. Ri is a random force on atom i having a zero mean <Ri(t)> = 0 and a variance <Ri(t)Rj(t)> = 6ζikTδijδ(t); this derives from the effects of solvent. For the overdamped limit, we set the left of eq.7 to zero, i d ri Fi R i dt (8) The integrated equation of eq. 8 is called Brownian dynamics; ri (t t ) ri (t ) Fi (t ) i t 2kBT i t ωi (9) where Δt is a time step and ωi is a random noise vector obtained from Gaussian distribution. 35 Computational Time of BD Computational time required for 1 nsec simulation of a peptide Algorithm Computer # of atoms Time (sec) Efficiency MD Pentium4 2.8 GHz 7,681 2,057 1.00 BD Pentium4 2.8 GHz 304 38.8 53.0 BD +MTS† Pentium4 2.8 GHz 304 12.8 161 BD +MTS† IBM Regatta 8 CPU 304 3.4 605 †MTS(Multiple time step) algorithm: This method reduces the frequency of calculation of the most time-demanding part (nonbonded energy terms). 36 Folding Simulation of an α-Helical Peptide using BD The fraction of native contacts 1.0 0.8 0.6 0.4 0.2 0 0 100 200 300 Simulation time (nsec) 400 37 The fraction of native contacts Folding Simulation of an β-Hairpin Peptide using BD 1.0 0.8 0.6 0.4 0.2 0 0 100 200 300 Simulation time (nsec) 400 38 Time Scales of Protein Motions and BD Permeation of an ion in Porin channel Elastic vibrations of proteins α-Helix folding β-Hairpin folding Bond stretching Protein folding 10-15 10-12 10-9 10-6 10-3 (fs) (ps) (ns) (μs) (ms) MD BD 100 Time (s) BD method allows us to simulate for long time. 39 EXAMPLE: Unfolding of Staphylococcal protein A through high temperature MD simulations D. Alonso and V. Daggett, PNAS 2000; 97: 133-138. 40 Simulation Methodology Starting structure: NMR structures 1edk and 1bbd ENCAD program The protein was initially minimized 1,000 steps in vacuo. The minimized protein was then solvated with water in a box (approximately 1 g/ml) extending a minimum of 10 Å from the protein The box dimensions were then increased uniformly to yield the experimental liquid water density for the temperature of interest (0.997 g/ml at 298 K and 0.829 at 498 K) 41 Simulation Methodology (cont) The systems were then equilibrated by minimizing the water for 2,000 steps, minimizing water and protein for 100 steps, performing MD of the water for 4,000 steps, minimizing the water for 2,000 steps, minimizing the protein for 500 steps, and minimizing the protein and water for 1,000 steps. Production MD simulations were then run using a 2-fs time step for several ns (T=298K, T=498K) 42 RMSD and RG as a function of simulation time 43 Snapshots along the unfolding trajectory 44 EXAMPLE 2: Identification of the N-terminal peptide binding site of GRP94 GRP94 - Glucose regulated protein 94 VSV8 peptide - derived from vesicular stomatitis virus Gidalevitz T, Biswas C, Ding H, Schneidman-Duhovny D, Wolfson HJ, Stevens F, Radford S, Argon Y. J Biol Chem. 2004 45 Biological and Drug Design Motivation The complex between the two molecules highly stimulates the response of the T-cells of the immune system. The grp94 protein alone does not have this property. The activity that stimulates the immune response is due to the ability of grp94 to bind different peptides. Characterization of peptide binding site is highly important for drug design. Either the peptides or their derived non-peptide inhibitors can be developed into drugs for treating immune related or immune-regulating diseases 46 GRP94 molecule There was no structure of grp94 protein. Homology modeling was used to predict a structure using another protein with 52% identity. Recently the structure of grp94 was published. The RMSD between the crystal structure and the model is 1.3A. 47 Docking and Structure Optimization PatchDock was applied to dock the two molecules, without any binding site constraints followed by MD or MM simulation to optimize the docked structure. Docking results were clustered in the two cavities: 48 GRP94 molecule There is a binding site for inhibitors between the helices. There is another cavity produced by beta sheet on the opposite side. 49 Advantages of Fully Atomic Models Detailed level of description of protein and solvent Disadvantages of Fully Atomic Models Computationally very costly Cannot reach the long time and length scales of biological interest Can use enhanced sampling techniques (see talks by Andrij and Arturo) Can use simplified (coarse-grained) models 50 Molecular Dynamics Scalable, parallel MD & analysis software: ilmm in lucem Molecular Mechanics 1. 1 Beck, Alonso, Daggett, (2004) University of Washington, Seattle 51 Molecular Dynamics ilmm is written in C (ANSI / POSIX) 64 bit math POSIX threads / MPI POSIX threads (multiprocessor machines) CPU CPU Message Passing Interface (multiple machines) + CPU CPU VERY high bandwidth Software design philosophy: – Kernel Compiles user’s molecular mechanics programs Schedules execution across processor and machines – Modules, e.g. Molecular Dynamics Analysis 52 Molecular Dynamics ilmm is written in C (ANSI / POSIX) 64 bit math POSIX threads / MPI POSIX threads (multiprocessor machines) CPU CPU Message Passing Interface (multiple machines) + CPU CPU VERY high bandwidth Software design philosophy: – Kernel Compiles user’s molecular mechanics programs Schedules execution across processor and machines – Modules, e.g. Molecular Dynamics Analysis 53 Dynameomics Simulate representative protein from all folds coverage population 1 150 folds represent ~ 75% of known protein structures fold fold 1. Day R., Beck D. A. C., Armen R., Daggett V. Protein Science (2003) 10: 2150-2160. 54 Dynameomics Simulate representative protein from all folds – Native (folded) dynamics 20 nanosecond simulation at 298 Kelvin – Folding / unfolding pathway 3 x 2 ns simulations at 498 K 2 x 20 ns simulations at 498 K – Each target requires 6 simulations = MANY CPU HOURS 55 Dynameomics NERSC DOE INCITE award – 2,000,000 + hours – 906 simulations of 151 protein folds on Seaborg – One to two simulations per node (8 – 16 CPUs / simulation) – Opportunity to tune ilmm for maximum performance 56 Dynameomics Load balancing – Even distribution of non-bonded pairs to processors ~20% faster 57 Dynameomics Parallel efficiency – Threaded computations on 16 CPU IBM Nighthawk 1 t (1) p, number of processors parallel efficiency, e( p) p t ( p) t(p), run-time using p processors parallel efficiency 1 0.8 0.6 0.4 0.2 0 1 2 4 8 12 16 CPU CPU CPU CPU CPU CPU 58 Dynameomics Simulate representative from top 151 folds – 151 folds represent about 75% of known proteins ~ 11 μs of combined sim. time from 906 sims! ~ 2 terabytes of data (w/ 40 to 60% compression!) ~ 75 / 151 have been analyzed Validated against experiment where possible 59 Dynameomics Now what? – Simulate the top 1130 folds (>90%) More CPU time – Share simulation data from top 151 folds w/ world: www.dynameomics.org Coordinates, analyses, available via WWW MicrosoftSQL database w/ On-Line Analytical Processing (OLAP) End-user queries of coordinate data, analyses, etc. – Data mining More CPU time, clever statistical algorithms, etc. 60