GROMACS Tutorial
Umbrella Sampling
Justin A. Lemkul
Department of Pharmaceutical Sciences, University of Maryland, Baltimore

This tutorial will guide the user through the process of setting up and running pulling simulations necessary to calculate binding energy between two species. The tutorial assumes the user has already successfully completed the Lysozyme tutorial, some other tutorial, or is otherwise well-versed in basic GROMACS simulation methods and topology organization. Special attention will be paid to the methods for properly building the system and settings for the pull code itself.

The binding energy (ΔG_bind) is derived from the potential of mean force (PMF), extracted from a series of umbrella sampling simulations. A series of initial configurations is generated, each corresponding to a location wherein the molecule of interest (generally referred to as a "ligand") is harmonically restrained at increasing center-of-mass (COM) distance from a reference molecule using an umbrella biasing potential. This restraint allows the ligand to sample the configurational space in a defined region along a reaction coordinate between it and its reference molecule or binding partner. The windows must allow for slight overlap of the ligand positions for proper reconstruction of the PMF curve.

The steps for such a procedure (and the ones utilized in this tutorial) are as follows:

1. Generate a series of configurations along a single degree of freedom (reaction coordinate)
2. Extract frames from the trajectory in step 1 that correspond to the desired COM spacing
3. Run umbrella sampling simulations on each configuration to restrain it within a window corresponding to the chosen COM distance
4. Use the Weighted Histogram Analysis Method (WHAM) to extract the PMF and calculate ΔG_bind

The tutorial assumes that the reader is using GROMACS version 4.5.3 or later.
My original work (from which this workflow was derived) was conducted with version 4.0.5, but in principle the workflow can be applied to any version in the 4.0.x or 4.5.x series. The pull code was completely rewritten after version 3.3.3, such that none of the information contained herein (beyond the basic theory of the technique) is applicable to any GROMACS version prior to 4.0. For the GROMACS 2013 workshop at the University of Virginia, it is assumed that you are using GROMACS version 4.6.3.

Step One: Prepare the Topology

Generating a molecular topology for an umbrella sampling simulation is just like generating one for any other simulation. Obtain the coordinate file of the structure of interest, and generate the topology with pdb2gmx. Some systems will require special consideration (e.g., protein-ligand complexes or membrane proteins). For protein-ligand systems, please consult this tutorial, and for membrane proteins, I recommend my own tutorial on the topic. The principles of umbrella sampling are easily extendable to these systems, though we will consider only protein molecules in this tutorial.

The system we will consider here is the dissociation of a single peptide from the growing end of an Aβ42 protofibril, and is based on simulations we recently published. The structure file of the wild-type Aβ42 protofibril used in those simulations, acetylated at the N-terminus of each chain, can be found here. The original PDB accession code is 2BEG. Run the structure through pdb2gmx:

    pdb2gmx -f input.pdb -ignh -ter -o complex.gro

Choose the GROMOS96 53A6 parameter set, "None" for the N-termini, and "COO-" for the C-termini.

Modify topol_Protein_chain_B.itp to include the following lines (at the end of the file):

    #ifdef POSRES_B
    #include "posre_Protein_chain_B.itp"
    #endif

We will be using chain B as an immobile reference later on in the pulling simulations, hence the need to position-restrain this chain only, and none of the others.
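The edit to topol_Protein_chain_B.itp can also be scripted. The following is a minimal sketch, assuming the file names written by pdb2gmx above; the function name add_posres_block is hypothetical (it is not a GROMACS tool), and it skips the append if the #ifdef line is already present so that it can be re-run safely.

```shell
#!/bin/sh
# Append a conditional position-restraint include to a chain topology.
# add_posres_block is a hypothetical helper name, not part of GROMACS.
add_posres_block() {
    itp="$1"      # chain topology, e.g. topol_Protein_chain_B.itp
    macro="$2"    # preprocessor macro, e.g. POSRES_B
    posre="$3"    # restraint file, e.g. posre_Protein_chain_B.itp

    # Do nothing if the block was already added in a previous run.
    if grep -q "^#ifdef ${macro}\$" "$itp"; then
        return 0
    fi

    printf '\n#ifdef %s\n#include "%s"\n#endif\n' "$macro" "$posre" >> "$itp"
}

# Usage:
#   add_posres_block topol_Protein_chain_B.itp POSRES_B posre_Protein_chain_B.itp
```

The restraints only take effect in runs whose .mdp file defines the macro (for example, define = -DPOSRES_B), so the other simulations are unaffected by this block.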
Step Two: Define the Unit Cell

Defining the unit cell for a pulling simulation is not unlike defining the unit cell for any other simulation. There is, however, one major consideration. One must allow enough space in the pulling direction to allow for a continuous pull without interacting with the periodic images of the system. That is, the minimum image convention must be continually satisfied, and, in addition, the pull distance must always be less than one-half the length of the box vector along which the pulling is being conducted.

Why, you may ask? GROMACS calculates distances while simultaneously taking periodicity into account. Thus, if you have a 10-nm box and you pull over a distance greater than 5.0 nm, the periodic distance becomes the reference distance for the pulling, and this distance is actually less than 5.0 nm! This fact will significantly affect results, since the distance you think you are pulling is not what is actually calculated.

We will be pulling a total distance of 5.0 nm in a 12.0-nm box, to avoid the complications described above. The center of mass of the protofibril will be placed at (3.280, 2.181, 2.4775) in a box of dimensions 6.560 x 4.362 x 12 nm. Use editconf to place the protofibril at this location:

    editconf -f complex.gro -o newbox.gro -center 3.280 2.181 2.4775 -box 6.560 4.362 12

You can visualize the location of the protofibril within the box using, for example, VMD. Load the structure in VMD and open the Tk console. Type the following (note that > indicates the Tk prompt, not something you actually type):

    > pbc box

You should see something like the following in the VMD window:

[Figure: the protofibril positioned within the unit cell, as displayed in VMD]

Step Three: Adding Solvent and Ions

This step is conducted much like any other simulation. Refer to the Lysozyme tutorial for a more detailed description of what is going on here if you are unsure.
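Before solvating, it may be worth confirming that the box from Step Two satisfies the half-box rule. The sketch below is only an arithmetic check; the function name check_pull_distance is hypothetical, and the first argument is assumed to be the largest COM distance you expect to reach along the pulling axis.

```shell
#!/bin/sh
# Succeed only if the largest COM distance stays below half the box length
# along the pulling axis. check_pull_distance is a hypothetical helper.
check_pull_distance() {
    max_dist_nm="$1"   # largest COM distance expected along the pull axis (nm)
    box_nm="$2"        # box vector length along the same axis (nm)

    # awk is used because POSIX sh has no floating-point arithmetic.
    awk -v d="$max_dist_nm" -v b="$box_nm" 'BEGIN { exit !(d + 0 < b / 2) }'
}

# Usage:
#   check_pull_distance 5.0 12.0 && echo "box is long enough"
```

With the values used in this tutorial (5.0 nm in a 12.0-nm box) the check passes, whereas the 10-nm example from Step Two would fail for any distance of 5.0 nm or more.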
First, we will add water with genbox:

    genbox -cp newbox.gro -cs spc216.gro -o solv.gro -p topol.top

Next, we will add ions using genion, utilizing this .mdp file. We are going to be conducting these simulations in the presence of 100 mM NaCl, on top of neutralizing counterions:

    grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr
    genion -s ions.tpr -o solv_ions.gro -p topol.top -pname NA -nname CL -neutral -conc 0.1

Step Four: Energy Minimization and Equilibration

The energy minimization and equilibration steps are going to be conducted just like any other protein-in-water system. Here, we will perform steepest descents minimization followed by NPT equilibration. The .mdp file for minimization can be found here, and the one for NPT equilibration can be found here. Invoke grompp and mdrun, as usual:

    grompp -f minim.mdp -c solv_ions.gro -p topol.top -o em.tpr
    mdrun -v -deffnm em
    grompp -f npt.mdp -c em.gro -p topol.top -o npt.tpr
    mdrun -deffnm npt

Because these procedures are time-consuming, they are likely best run in parallel, i.e.:

    mdrun -nt X -deffnm npt

In the above command, "X" represents the desired number of threads over which the parallel calculation is conducted.

Step Five: Generating Configurations

To conduct umbrella sampling, one must generate a series of configurations along a reaction coordinate, ζ. Some of these configurations will serve as the starting configurations for the umbrella sampling windows, which are run in independent simulations. The figure below illustrates these principles.

The top image illustrates the pulling simulation we will run now, conducted in order to generate a series of configurations along the reaction coordinate. These configurations are extracted after the simulation is complete (dashed arrows in between the top and middle images).
The middle image corresponds to the independent simulations conducted within each sampling window, with the center of mass of the free peptide restrained in that window by an umbrella biasing potential. The bottom image shows the ideal result as a histogram of configurations, with neighboring windows overlapping such that a continuous energy function can later be derived from these simulations.

For this example, the reaction coordinate is the z-axis. To generate these configurations, we must pull peptide A away from the protofibril. We will pull over the course of 500 ps of MD, saving snapshots every 1 ps. This setup has been established based on trial-and-error to obtain a reasonable distribution of configurations. In other systems, it may be necessary to save configurations more often, or sufficient to save configurations less often. The idea is to capture enough configurations along the reaction coordinate to obtain regular spacing of the umbrella sampling windows, in terms of center-of-mass distance between peptides A and B, the latter of which is our reference group.

The .mdp file for this pulling can be found here. A brief explanation of the pulling options used is as follows:

    ; Pull code
    pull            = umbrella
    pull_geometry   = distance
    pull_dim        = N N Y
    pull_start      = yes       ; define initial COM distance > 0
    pull_ngroups    = 1
    pull_group0     = Chain_B
    pull_group1     = Chain_A
    pull_rate1      = 0.01      ; 0.01 nm per ps = 10 nm per ns
    pull_k1         = 1000      ; kJ mol^-1 nm^-2

pull = umbrella: using a harmonic potential to pull. IMPORTANT: This procedure is NOT umbrella sampling. I used a harmonic potential in order to make qualitative observations about the dissociation pathway in this study. The harmonic potential allows the force to vary according to the nature of the interactions of peptide A with peptide B. That is, the force will build up until certain critical interactions are broken. See our paper for details.
For the purposes of generating the initial configurations for umbrella sampling, you can actually use any combination of pull settings (pull and pull_geometry), but when it comes time for the actual umbrella sampling (in the next step) you MUST be using pull = umbrella. It is very important that you do not apply extremely fast pulling rates or extremely strong force constants, which can seriously deform elements of your system. Please refer to our paper (particularly the Supporting Information) for how we chose to validate the pull rate used.

pull_geometry = distance: see the note in the .mdp file; you can also use position or direction, but changes will have to be made to other pulling parameters.

pull_dim = N N Y: we are pulling only in the z-dimension. Thus, x and y are set to "no" (N) and z is set to "yes" (Y).

pull_start = yes: the initial COM distance is the reference distance for the first frame. This is useful because if we are attempting to pull 5.0 nm, converting the initial COM distance to zero (i.e., pull_start = no) makes this interpretation difficult.

pull_ngroups = 1: we are only applying a pulling force to one group.

pull_group0 = Chain_B: reference group for pulling.

pull_group1 = Chain_A: group to which pulling force is applied.

pull_rate1 = 0.01: the rate at which the "dummy particle" attached to our pull group is moved. This type of pulling is also called "constant velocity" due to the fact that this rate is fixed.

pull_k1 = 1000: the force constant for pulling.

Remember that #ifdef POSRES_B statement we added to topol_Protein_chain_B.itp a while ago? We're going to use it now. By restraining peptide B of the protofibril, we are able to more easily pull peptide A away. Due to the extensive non-covalent interactions between chains A and B, if we did not restrain chain B, we would end up simply towing the whole complex along the simulation box, which wouldn't accomplish much.

We will need to define some custom index groups for this pulling simulation.
Use make_ndx:

    make_ndx -f npt.gro

(> indicates the make_ndx prompt)

    > r 1-27
    > name 19 Chain_A
    > r 28-54
    > name 20 Chain_B
    > q

Now, run the continuous pulling simulation:

    grompp -f md_pull.mdp -c npt.gro -p topol.top -n index.ndx -t npt.cpt -o pull.tpr
    mdrun -s pull.tpr

Again, this procedure will take some time, so run it in parallel if you have the resources available to you. Once this simulation is complete, we will need to extract useful frames for defining the umbrella sampling windows. The easiest way I have found to do this is the following:

1. Define the spacing of the windows (generally 0.1 to 0.2 nm)
2. Extract all the frames from the pulling trajectory that was just produced
3. Measure the COM distance of each of these frames between the reference and pull group
4. Use the selected frames for umbrella sampling input

To extract the frames from your trajectory (traj.xtc), use trjconv:

    trjconv -s pull.tpr -f traj.xtc -o conf.gro -sep

A series of coordinate files (conf0.gro, conf1.gro, etc) will be produced, corresponding to each of the frames saved in the continuous pulling simulation. To iteratively call g_dist on all of these (501!) frames that were generated, I have written a Perl script that takes care of this task. It will print a file called "summary_distances.dat" that contains this information. The script can be found here. We will need to make use of the index file again, as well as a text file called "groups.txt," which will be used to select our analysis groups non-interactively. The contents of groups.txt should be:

    19
    20

The groups.txt file can be created with a plain text editor. Once you have this file, change the .txt file extension of distances.txt (linked above) to .pl and execute the script:

    perl distances.pl

Look at the contents of summary_distances.dat to see the progression of COM distance between chain A and chain B over time. Make note of the configurations to be used for umbrella sampling, based on the desired spacing.
That is, if you want 0.2-nm spacing, you might find the following lines in summary_distances.dat:

    50     0.600
    ...
    100    0.800

You would then use conf50.gro and conf100.gro as the starting configurations of two adjacent umbrella sampling windows. Make note of all the configurations you wish to use before continuing. For the purposes of this tutorial, identifying configurations with 0.2-nm spacing will suffice, although in the original work a different (more detailed) spacing was used.

Step Six: Umbrella Sampling Simulations

After having identified the initial configurations of the sampling windows, we can now conduct actual umbrella sampling simulations. We will need to generate a number of input files in order to conduct each of the necessary simulations. For example, if you have identified 25 configurations along the reaction coordinate, that means you will need 25 different input files for 25 independent simulations. You will simply have to call grompp to process this .mdp file for each of the conf.gro files you identified in the previous step.

Many of the pulling parameters are the same as in the previous step, with the notable exception of pull_rate1, which has now been set to zero. We don't want to move the configuration along the reaction coordinate; instead we want to restrain it within a defined window of configurational space. Setting pull_start = yes means that the initial COM distance is the reference distance, and we do not have to define a reference (pull_init1) separately for each configuration.

In this example, we will be sampling COM distances from 0.5 to 5.0 nm along the z-axis using roughly 0.2-nm spacing.
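Picking frames by eye from summary_distances.dat becomes tedious with several hundred frames. The following is a minimal sketch of one way to automate the choice, assuming the file holds two whitespace-separated columns (frame index, COM distance in nm) as in the excerpt above. The function name select_windows and its arguments are hypothetical; it simply reports, for each evenly spaced target distance, the frame whose COM distance is closest.

```shell
#!/bin/sh
# Pick the frame closest to each evenly spaced target COM distance.
# select_windows is a hypothetical helper, not part of GROMACS.
select_windows() {
    summary="$1"   # two columns: frame index, COM distance (nm)
    start="$2"     # first target distance (nm)
    stop="$3"      # upper limit for target distances (nm)
    step="$4"      # window spacing (nm)

    awk -v start="$start" -v stop="$stop" -v step="$step" '
        BEGIN { n = 0 }
        $1 ~ /^[#@;]/ { next }                       # skip comment lines
        NF >= 2 { frame[n] = $1; dist[n] = $2 + 0; n++ }
        END {
            # Number of steps that fit between start and stop; the small
            # offset guards against floating-point round-off.
            nsteps = int((stop - start) / step + 1e-9)
            for (w = 0; w <= nsteps; w++) {
                target = start + w * step
                best = -1
                for (i = 0; i < n; i++) {
                    diff = dist[i] - target
                    if (diff < 0) diff = -diff
                    if (best < 0 || diff < bestdiff) { best = i; bestdiff = diff }
                }
                # Output: frame index, target distance, actual distance
                if (best >= 0)
                    printf "%d %.3f %.3f\n", frame[best], target, dist[best]
            }
        }
    ' "$summary"
}

# Usage:
#   select_windows summary_distances.dat 0.5 5.0 0.2
```

Each output line names a frame index i, so the corresponding starting structure is conf{i}.gro. Comparing the second and third columns shows how far the chosen frame is from the intended spacing; if the same frame is reported for two targets, the pulling trajectory did not sample that region finely enough.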
The example grompp commands shown below may or may not be literally correct (the frame numbers may differ), but will serve as an example of how to run grompp on separate coordinate files to generate all 23 inputs (note as well that 23 is the number of windows required to obtain 0.2-nm spacing over roughly 4.5 nm; in our original work, 31 asymmetric windows were used).

You will also note that I have set gen_vel = no in the .mdp file. I have found that allowing the initial forces to govern the dynamics in each window is sufficient for a large, robust system such as this one. If this is not the case in systems with which you work, you will likely want to set gen_vel = yes and allow some time for equilibration in each sampling window.

    grompp -f md_umbrella.mdp -c conf0.gro -p topol.top -n index.ndx -o umbrella0.tpr
    ...
    grompp -f md_umbrella.mdp -c conf450.gro -p topol.top -n index.ndx -o umbrella22.tpr

Now, each input file should be passed to mdrun for the actual data collection simulation. Once all of the simulations are complete, you can proceed to data analysis. One note on proper execution of the simulations: do not use the -deffnm option of mdrun without also specifying -pf and -px filenames. In this case, using -deffnm alone will cause both the pullf.xvg and pullx.xvg output to be written to the same file (whatever name is specified by -deffnm). Using -pf and -px will override the setting of the -deffnm flag.

Step Seven: Data Analysis

The most common analysis conducted for umbrella sampling simulations is the extraction of the potential of mean force (PMF), which will yield the ΔG for the binding/unbinding process. The value of ΔG is simply the difference between the highest and lowest values of the PMF curve, as long as the values of the PMF converge to a stable value at large COM distance. A common method for extracting the PMF is the Weighted Histogram Analysis Method (WHAM), included in GROMACS as the g_wham utility.
The input to g_wham consists of two files, one that lists the names of the .tpr files of each window, and the other that lists the names of either the pullf.xvg or pullx.xvg files from each window. For example, a simple tpr-files.dat might consist of:

    umbrella0.tpr
    umbrella1.tpr
    ...
    umbrella22.tpr

And analogously for the list of pullf.xvg or pullx.xvg files, in either pullf-files.dat or pullx-files.dat. Note that the files must have unique names (i.e., pullf0.xvg, pullf1.xvg, etc) or else g_wham will fail. We then run g_wham:

    g_wham -it tpr-files.dat -if pullf-files.dat -o -hist -unit kCal

The g_wham utility will then open each of the umbrella.tpr and pullf.xvg files sequentially and run the WHAM analysis on them. The -unit kCal option indicates that the output will be in kcal mol^-1, but you can also get results in kJ mol^-1 or kBT. Note that you may have to discard the first several hundred ps of the trajectory as equilibration (using g_wham -b), since we generated our starting configurations from a non-equilibrium simulation. Once the PMF converges, you should know how much time was required to equilibrate your system. You should end up with a profile.xvg file that looks like the following:

[Figure: PMF curve from profile.xvg]

Please note that the result you obtain may be different, since the spacing recommended in this tutorial is different from the spacing I actually used to generate this data in the original study. The overall shape of the curve should be similar, and the value of ΔG (calculated as the difference between the plateau region of the PMF curve and the energy minimum of the curve) should be close to -50.5 kcal mol^-1.

The other output from the g_wham command will be a file called histo.xvg, which contains the histograms of the configurations within the umbrella sampling windows. These histograms will determine whether or not there is sufficient overlap between adjacent windows.
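Overlap is usually judged by plotting histo.xvg, but a rough numerical check can flag problem windows quickly. The sketch below is an illustration rather than a standard GROMACS analysis: the function name histogram_overlap is hypothetical, the overlap measure (shared counts divided by the smaller of the two window totals) is my own simple choice, and the code assumes that column 1 of the file is the reaction coordinate and that each further column holds the counts of one window, ordered along the coordinate.

```shell
#!/bin/sh
# Report a simple overlap fraction for each pair of adjacent windows.
# histogram_overlap is a hypothetical helper, not part of GROMACS.
# Assumed layout: column 1 = reaction coordinate, columns 2..N = one
# histogram per window, ordered along the coordinate.
histogram_overlap() {
    awk '
        /^[#@]/ || NF < 3 { next }     # skip xvg header lines and blanks
        {
            if (NF > ncol) ncol = NF
            for (c = 2; c <= NF; c++) total[c] += $c
            for (c = 2; c < NF; c++) {
                a = $c + 0
                b = $(c + 1) + 0
                shared[c] += (a < b) ? a : b   # counts common to both windows
            }
        }
        END {
            for (c = 2; c < ncol; c++) {
                smaller = (total[c] < total[c + 1]) ? total[c] : total[c + 1]
                frac = (smaller > 0) ? shared[c] / smaller : 0
                # Output: window index, next window index, overlap fraction
                printf "%d %d %.3f\n", c - 2, c - 1, frac
            }
        }
    ' "$1"
}

# Usage:
#   histogram_overlap histo.xvg
```

Pairs whose fraction is close to zero are candidates for an additional window placed between them. The number is only a screening aid; the histograms themselves should still be inspected.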
For the types of simulations conducted as part of this tutorial, you may obtain something like the following:

[Figure: histograms of configurations in each umbrella sampling window, from histo.xvg]

The above histogram shows reasonable overlap between windows from about 1.2 to 5 nm of COM spacing; the gap around 1 nm (green and blue curves) indicates that more sampling windows are likely necessary to obtain good results from the WHAM algorithm. As it stands now, there is very little overlap between these two windows.

Summary

You have now hopefully been successful in conducting umbrella sampling simulations by generating a series of configurations along a reaction coordinate, running biasing simulations, and extracting the PMF. The .mdp files provided here serve as examples only, and should not be considered broadly applicable to all systems. Review the literature and the GROMACS manual for adjustments to these files for efficiency and accuracy purposes.

If you have suggestions for improving this tutorial, if you notice a mistake, or if anything else is unclear, please feel free to email me. Please note: this is not an invitation to email me for GROMACS problems. I do not advertise myself as a private tutor or personal help service. That's what the gmx-users list is for. I may help you there, but only in the context of providing service to the community as a whole, not just the end user.

Happy simulating!