GROMACS Tutorial Umbrella Sampling

Justin A. Lemkul
Department of Pharmaceutical Sciences, University of Maryland, Baltimore
This tutorial will guide the user through the process of setting up and running the pulling simulations necessary to calculate
the binding energy between two species. The tutorial assumes the user has already successfully completed the Lysozyme
tutorial, some other tutorial, or is otherwise well-versed in basic GROMACS simulation methods and topology
organization. Special attention will be paid to the methods for properly building the system and settings for the pull code
itself.
The binding energy (ΔGbind) is derived from the potential of mean force (PMF), extracted from a series of umbrella
sampling simulations. A series of initial configurations is generated, each corresponding to a location wherein the
molecule of interest (generally referred to as a "ligand") is harmonically restrained at increasing center-of-mass (COM)
distance from a reference molecule using an umbrella biasing potential. This restraint allows the ligand to sample the
configurational space in a defined region along a reaction coordinate between it and its reference molecule or binding
partner. The windows must allow for slight overlap of the ligand positions for proper reconstruction of the PMF curve.
The steps for such a procedure (and the ones utilized in this tutorial) are as follows:
1. Generate a series of configurations along a single degree of freedom (reaction coordinate)
2. Extract frames from the trajectory in step 1 that correspond to the desired COM spacing
3. Run umbrella sampling simulations on each configuration to restrain it within a window corresponding to the chosen
COM distance
4. Use the Weighted Histogram Analysis Method (WHAM) to extract the PMF and calculate ΔGbind
The tutorial assumes that the reader is using GROMACS version 4.5.3 or later. My original work (from which this workflow
was derived) was conducted with version 4.0.5, but the workflow can in principle be applied to any version in the 4.0.x or 4.5.x series.
The pull code was completely rewritten after version 3.3.3, such that none of the information contained herein (beyond
the basic theory of the technique) is applicable to any GROMACS version prior to 4.0. For the GROMACS 2013 workshop
at the University of Virginia, it is assumed that you are using GROMACS version 4.6.3.
Step One: Prepare the Topology
Generating a molecular topology for an umbrella sampling simulation is just like any other simulation. Obtain the
coordinate file of the structure of interest, and generate the topology with pdb2gmx. Some systems will require special
consideration (e.g., protein-ligand complexes, membrane proteins, etc.). For protein-ligand systems, please consult this
tutorial, and for membrane proteins, I recommend my own tutorial on the topic. The principles of umbrella sampling are
easily extendable to these systems, though we will consider only protein molecules in this tutorial.
The system we will consider here is the dissociation of a single peptide from the growing end of an Aβ42 protofibril, and is
based on simulations we recently published. The structure file of the wild-type Aβ42 protofibril used in those simulations,
acetylated at the N-terminus of each chain, can be found here. The original PDB accession code is 2BEG.
Run the structure through pdb2gmx:
pdb2gmx -f input.pdb -ignh -ter -o complex.gro
Choose the GROMOS96 53A6 parameter set, "None" for the N-termini, and "COO-" for the C-termini. Modify
topol_Protein_chain_B.itp to include the following lines (at the end of the file):
#ifdef POSRES_B
#include "posre_Protein_chain_B.itp"
#endif
We will be using chain B as an immobile reference later on in the pulling simulations, hence the need to specially
position-restrain this chain only, and none of the others.
Step Two: Define the Unit Cell
Defining the unit cell for a pulling simulation is not unlike defining the unit cell for any other simulation. There is, however,
one major consideration. One must allow enough space in the pulling direction to allow for a continuous pull without
interacting with the periodic images of the system. That is, the minimum image convention must be continually satisfied,
and, in addition, the pull distance must always be less than one-half the length of the box vector along which the pulling is
being conducted. Why, you may ask?
GROMACS calculates distances while simultaneously taking periodicity into account. Thus, if you have a 10-nm box and
you pull over a distance greater than 5.0 nm, the periodic distance becomes the reference distance for the pulling, and
this distance is actually less than 5.0 nm! This fact will significantly affect results, since the distance you think you are
pulling is not what is actually calculated.
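The half-box limit described above can be illustrated with a minimal Python sketch. It is not GROMACS code; the function name periodic_distance_1d is hypothetical, and the sketch assumes a rectangular box and a distance measured along a single axis.

```python
def periodic_distance_1d(z_ref: float, z_pull: float, box_length: float) -> float:
    """Shortest separation along one axis under the minimum image convention."""
    dz = z_pull - z_ref
    # Shift by whole box lengths so the result lies within half a box length.
    dz -= box_length * round(dz / box_length)
    return abs(dz)


box = 10.0  # nm, the hypothetical 10-nm box from the text
for intended in (2.0, 4.0, 6.0, 8.0):
    measured = periodic_distance_1d(0.0, intended, box)
    print(f"intended {intended:.1f} nm -> measured {measured:.1f} nm")

# The setup used in this tutorial: a 5.0-nm pull along a 12.0-nm box vector.
print("tutorial pull stays below half the box:", 5.0 < 12.0 / 2)
```

Once the intended separation exceeds 5.0 nm in the 10-nm box, the measured value shrinks again, which is the failure mode the text warns about.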
We will be pulling a total distance of 5.0 nm in a 12.0-nm box, to avoid the complications described above. The center of
mass of the protofibril will be placed at (3.280, 2.181, 2.4775) in a box of dimensions 6.560 x 4.362 x 12 nm. Use editconf to
place the protofibril at this location:
editconf -f complex.gro -o newbox.gro -center 3.280 2.181 2.4775 -box 6.560 4.362 12
You can visualize the location of the protofibril within the box using, for example, VMD. Load the structure in VMD and
open the Tk console. Type the following (note that > indicates the Tk prompt, not something you actually type):
> pbc box
You should see something like the following in the VMD window:
Step Three: Adding Solvent and Ions
This step is conducted much like any other simulation. Refer to the Lysozyme tutorial for a more detailed description of
what is going on here if you are unsure. First, we will add water with genbox:
genbox -cp newbox.gro -cs spc216.gro -o solv.gro -p topol.top
Next, we will add ions using genion, utilizing this .mdp file. We are going to be conducting these simulations in the
presence of 100 mM NaCl, on top of neutralizing counterions:
grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr
genion -s ions.tpr -o solv_ions.gro -p topol.top -pname NA -nname CL -neutral -conc 0.1
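As a rough sanity check on the -conc 0.1 setting, the number of salt pairs that 100 mM corresponds to can be estimated from the box volume. This is only a back-of-the-envelope sketch: the function name salt_pairs is hypothetical, it uses the full box dimensions given to editconf, and the count reported by genion itself may differ (for example because of the neutralizing counterions).

```python
AVOGADRO = 6.02214076e23  # particles per mole
NM3_TO_LITRE = 1e-24      # 1 nm^3 = 1e-24 L


def salt_pairs(box_nm, conc_molar):
    """Estimate the number of ion pairs needed for a given molar concentration."""
    volume_litre = box_nm[0] * box_nm[1] * box_nm[2] * NM3_TO_LITRE
    return round(conc_molar * volume_litre * AVOGADRO)


# Box dimensions passed to editconf earlier in this tutorial (nm).
n_pairs = salt_pairs((6.560, 4.362, 12.0), 0.1)
print(f"approximately {n_pairs} NaCl pairs for 100 mM")
```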
Step Four: Energy Minimization and Equilibration
The energy minimization and equilibration steps are going to be conducted just like any other protein-in-water system.
Here, we will perform steepest descents minimization followed by NPT equilibration. The .mdp file for minimization can
be found here, and the one for NPT equilibration can be found here.
Invoke grompp and mdrun, as usual:
grompp -f minim.mdp -c solv_ions.gro -p topol.top -o em.tpr
mdrun -v -deffnm em
grompp -f npt.mdp -c em.gro -p topol.top -o npt.tpr
mdrun -deffnm npt
Because these procedures are time-consuming, they are likely best run in parallel, e.g.:
mdrun -nt X -deffnm npt
In the above command, "X" represents the desired number of threads over which the parallel calculation is conducted.
Step Five: Generating Configurations
To conduct umbrella sampling, one must generate a series of configurations along a reaction coordinate, ζ. Some of
these configurations will serve as the starting configurations for the umbrella sampling windows, which are run in
independent simulations. The figure below illustrates these principles. The top image illustrates the pulling simulation we
will run now, conducted in order to generate a series of configurations along the reaction coordinate. These
configurations are extracted after the simulation is complete (dashed arrows in between the top and middle images). The
middle image corresponds to the independent simulations conducted within each sampling window, with the center of
mass of the free peptide restrained in that window by an umbrella biasing potential. The bottom image shows the ideal
result as a histogram of configurations, with neighboring windows overlapping such that a continuous energy function can
later be derived from these simulations.
For this example, the reaction coordinate is the z-axis. To generate these configurations, we must pull peptide A away
from the protofibril. We will pull over the course of 500 ps of MD, saving snapshots every 1 ps. This setup has been
established based on trial-and-error to obtain a reasonable distribution of configurations. In other systems, it may be
necessary to save configurations more often, or sufficient to save configurations less often. The idea is to capture enough
configurations along the reaction coordinate to obtain regular spacing of the umbrella sampling windows, in terms of
center-of-mass distance between peptides A and B, the latter of which is our reference group.
The .mdp file for this pulling can be found here. A brief explanation of the pulling options used is as follows:
; Pull code
pull            = umbrella
pull_geometry   = distance
pull_dim        = N N Y
pull_start      = yes       ; define initial COM distance > 0
pull_ngroups    = 1
pull_group0     = Chain_B
pull_group1     = Chain_A
pull_rate1      = 0.01      ; 0.01 nm per ps = 10 nm per ns
pull_k1         = 1000      ; kJ mol^-1 nm^-2
pull = umbrella: using a harmonic potential to pull. IMPORTANT: This procedure is NOT umbrella sampling. I
used a harmonic potential in order to make qualitative observations about the dissociation pathway in this study.
The harmonic potential allows the force to vary according to the nature of the interactions of peptide A with peptide
B. That is, the force will build up until certain critical interactions are broken. See our paper for details. For the
purposes of generating the initial configurations for umbrella sampling, you can actually use any combination of pull
settings (pull and pull_geometry), but when it comes time for the actual umbrella sampling (in the next step) you
MUST be using pull = umbrella. It is very important that you do not apply extremely fast pulling rates or extremely
strong force constants, which can seriously deform elements of your system. Please refer to our paper (particularly the
Supporting Information) for how we chose to validate the pull rate used.
pull_geometry = distance: see the note in the .mdp file; you can also use position or direction, but changes will
have to be made to other pulling parameters.
pull_dim = N N Y: we are pulling only in the z-dimension. Thus, x and y are set to "no" (N) and z is set to "yes" (Y).
pull_start = yes: the initial COM distance is the reference distance for the first frame. This is useful because if we
are attempting to pull 5.0 nm, converting the initial COM distance to zero (i.e., pull_start = no) makes this
interpretation difficult.
pull_ngroups = 1: we are only applying a pulling force to one group.
pull_group0 = Chain_B: reference group for pulling.
pull_group1 = Chain_A: group to which pulling force is applied.
pull_rate1 = 0.01: the rate at which the "dummy particle" attached to our pull group is moved. This type of pulling
is also called "constant velocity" due to the fact that this rate is fixed.
pull_k1 = 1000: the force constant for pulling.
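The numbers implied by these settings can be checked with a short Python sketch. It only reproduces the arithmetic (it does not run any pulling), the helper names are hypothetical, and the 0.1-nm lag used at the end is an arbitrary illustrative value rather than something taken from the simulation.

```python
PULL_RATE = 0.01    # nm per ps, pull_rate1 in the .mdp file
PULL_K = 1000.0     # kJ mol^-1 nm^-2, pull_k1 in the .mdp file
SIM_TIME = 500.0    # ps of pulling
SAVE_EVERY = 1.0    # ps between saved snapshots


def reference_shift(time_ps: float, rate: float = PULL_RATE) -> float:
    """Displacement (nm) of the moving reference point after time_ps."""
    return rate * time_ps


def harmonic_energy(deviation_nm: float, k: float = PULL_K) -> float:
    """Harmonic bias energy (kJ/mol) for a given deviation from the reference."""
    return 0.5 * k * deviation_nm ** 2


total_shift = reference_shift(SIM_TIME)
n_frames = int(SIM_TIME / SAVE_EVERY) + 1  # +1 counts the frame at t = 0
print(f"reference point moves {total_shift:.1f} nm; {n_frames} frames saved")
print(f"bias energy at a 0.1-nm lag: {harmonic_energy(0.1):.1f} kJ/mol")
```

With these settings the reference point travels the full 5.0 nm in 500 ps, and saving every 1 ps yields 501 frames.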
Remember that #ifdef POSRES_B statement we added to topol_Protein_chain_B.itp a while ago? We're going to use it now. By
restraining peptide B of the protofibril, we are able to more easily pull peptide A away. Due to the extensive non-covalent
interactions between chains A and B, if we did not restrain chain B, we would end up simply towing the whole complex
along the simulation box, which wouldn't accomplish much.
We will need to define some custom index groups for this pulling simulation. Use make_ndx:
make_ndx -f npt.gro
(> indicates the make_ndx prompt)
> r 1-27
> name 19 Chain_A
> r 28-54
> name 20 Chain_B
> q
Now, run the continuous pulling simulation:
grompp -f md_pull.mdp -c npt.gro -p topol.top -n index.ndx -t npt.cpt -o pull.tpr
mdrun -s pull.tpr
Again, this procedure will take some time, so run it in parallel if you have the resources available to you. Once this
simulation is complete, we will need to extract useful frames for defining the umbrella sampling windows. The easiest
way I have found to do this is the following:
1. Define the spacing of the windows (generally 0.1 - 0.2 nm)
2. Extract all the frames from the pulling trajectory that was just produced
3. Measure the COM distance of each of these frames between the reference and pull group
4. Use the selected frames for umbrella sampling input
To extract the frames from your trajectory (traj.xtc), use trjconv:
trjconv -s pull.tpr -f traj.xtc -o conf.gro -sep
A series of coordinate files (conf0.gro, conf1.gro, etc) will be produced, corresponding to each of the frames saved in the
continuous pulling simulation. To iteratively call g_dist on all of these (501!) frames that were generated, I have written a
Perl script that takes care of this task. It will print a file called "summary_distances.dat" that contains this information. The
script can be found here. We will need to make use of the index file again, as well as a text file called "groups.txt," which
will be used to select our analysis groups non-interactively. The contents of groups.txt should be:
19
20
The groups.txt file can be created with a plain text editor. Once you have this file, change the .txt file extension of
distances.txt (linked above) to .pl and execute the script:
perl distances.pl
Look at the contents of summary_distances.dat to see the progression of COM distance between chain A and chain B over time.
Make note of the configurations to be used for umbrella sampling, based on the desired spacing. That is, if you want 0.2-nm
spacing, you might find the following lines in summary_distances.dat:
50     0.600
...
100    0.800
You would then use conf50.gro and conf100.gro as the starting configurations of two adjacent umbrella sampling
windows. Make note of all the configurations you wish to use before continuing. For the purposes of this tutorial,
identifying configurations with 0.2-nm spacing will suffice, although in the original work a different (more detailed) spacing
was used.
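Picking frames by eye becomes tedious with 501 entries, so the selection can be sketched in Python. This is an illustrative helper, not part of the tutorial's Perl script: the function names are hypothetical, the summary file is assumed to hold two whitespace-separated columns (frame number, COM distance in nm), and the demonstration data are synthetic values that merely follow the pattern of the two example lines above.

```python
def read_summary(path):
    """Read (frame, distance) pairs from a two-column text file."""
    records = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                records.append((int(fields[0]), float(fields[1])))
    return records


def pick_window_frames(records, spacing, start=None):
    """For each target distance, choose the frame whose COM distance is closest."""
    if not records:
        return []
    distances = [dist for _, dist in records]
    first = min(distances) if start is None else start
    upper = max(distances)
    chosen = []
    step = 0
    while first + step * spacing <= upper + 1e-9:
        target = first + step * spacing
        frame, dist = min(records, key=lambda rec: abs(rec[1] - target))
        if not chosen or chosen[-1][0] != frame:  # avoid repeating a frame
            chosen.append((frame, dist))
        step += 1
    return chosen


# Synthetic data only: distance grows by 0.004 nm per frame from 0.4 nm.
demo = [(i, 0.4 + 0.004 * i) for i in range(201)]
for frame, dist in pick_window_frames(demo, spacing=0.2, start=0.6):
    print(f"conf{frame}.gro  {dist:.3f} nm")
```

For real data one would call pick_window_frames(read_summary("summary_distances.dat"), 0.2) and inspect the result, since a real pulling trajectory does not advance uniformly.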
Step Six: Umbrella Sampling Simulations
After having identified the initial configurations of the sampling windows, we can now conduct actual umbrella sampling
simulations. We will need to generate a number of input files in order to conduct each of the necessary simulations. For
example, if you have identified 25 configurations along the reaction coordinate, that means you will need 25 different
input files for 25 independent simulations. You will simply have to call grompp to process this .mdp file for each of the
conf.gro files you identified in the previous step. Many of the pulling parameters are the same as in the previous step, with
the notable exception of pull_rate1, which has now been set to zero. We don't want to move the configuration along the
reaction coordinate; instead we want to restrain it within a defined window of configurational space. Setting pull_start =
yes means that the initial COM distance is the reference distance, and we do not have to define a reference (pull_init1)
separately for each configuration.
In this example, we will be sampling COM distances from 0.5 - 5.0 nm along the z-axis using roughly 0.2-nm spacing. The
following example commands may or may not be literally correct (the frame numbers may differ), but will serve as an
example as to how to run grompp on separate coordinate files to generate all 23 inputs (note as well that 23 is the number
of windows required to obtain 0.2-nm spacing over roughly 4.5 nm; in our original work, 31 asymmetric windows were
used).
You will also note that I have set gen_vel = no in the .mdp file. I have found that allowing the initial forces to govern the
dynamics in each window is sufficient for a large, robust system such as this one. If this is not the case in systems with
which you work, you will likely want to set gen_vel = yes and allow some time for equilibration in each sampling window.
grompp -f md_umbrella.mdp -c conf0.gro -p topol.top -n index.ndx -o umbrella0.tpr
...
grompp -f md_umbrella.mdp -c conf450.gro -p topol.top -n index.ndx -o umbrella22.tpr
Now, each input file should be passed to mdrun for the actual data collection simulation. Once all of the simulations are
complete, you can proceed to data analysis. One note on proper execution of the simulations: do not use the -deffnm
option of mdrun without also specifying -pf and -px filenames. Using -deffnm alone will cause both the pullf.xvg and pullx.xvg
output to be written to the same file (whatever is specified by -deffnm). Using -pf and -px will override the setting
of the -deffnm flag.
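Typing 23 pairs of grompp and mdrun commands by hand is error-prone, so the command lines can be generated instead. The following Python sketch only builds the command strings and prints them; it does not execute anything. The function name window_commands is hypothetical, the frame numbers in the example are placeholders, and the pullfN.xvg / pullxN.xvg naming is one possible choice that keeps the output files unique.

```python
def window_commands(frames, mdp="md_umbrella.mdp", top="topol.top", ndx="index.ndx"):
    """Build grompp and mdrun command lines for each selected frame."""
    commands = []
    for window, frame in enumerate(frames):
        name = f"umbrella{window}"
        commands.append(
            f"grompp -f {mdp} -c conf{frame}.gro -p {top} -n {ndx} -o {name}.tpr"
        )
        # -pf and -px give the pull force and pull position files distinct names.
        commands.append(
            f"mdrun -deffnm {name} -pf pullf{window}.xvg -px pullx{window}.xvg"
        )
    return commands


# Placeholder frame numbers; substitute the frames chosen in the previous step.
for command in window_commands([0, 50, 100]):
    print(command)
```

The printed lines can be reviewed and then pasted into a shell or a job script.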
Step Seven: Data Analysis
The most common analysis conducted for umbrella sampling simulations is the extraction of the potential of mean force
(PMF), which will yield the ΔG for the binding/unbinding process. The value of ΔG is simply the difference between the
highest and lowest values of the PMF curve, as long as the values of the PMF converge to a stable value at large COM
distance. A common method for extracting PMF is the Weighted Histogram Analysis Method (WHAM), included in
GROMACS as the g_wham utility. The input to g_wham consists of two files, one that lists the names of the .tpr files of
each window, and the other that lists the names of either the pullf.xvg or pullx.xvg files from each window. For example, a
simple tpr-files.dat might consist of:
umbrella0.tpr
umbrella1.tpr
...
umbrella22.tpr
And analogously for the list of pullf.xvg or pullx.xvg files, in either pullf-files.dat or pullx-files.dat. Note that the files must
have unique names (e.g., pullf0.xvg, pullf1.xvg, etc.) or else g_wham will fail. We then run g_wham:
g_wham -it tpr-files.dat -if pullf-files.dat -o -hist -unit kCal
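The two list files passed to g_wham can be written with a few lines of Python rather than by hand. This is a sketch under assumptions: the function name write_wham_lists is hypothetical, and it presumes the windows were named umbrellaN.tpr and pullfN.xvg with matching indices, so that line N of one file corresponds to line N of the other. The demonstration writes into a temporary directory so that nothing in the working directory is touched.

```python
import tempfile
from pathlib import Path


def write_wham_lists(n_windows, directory="."):
    """Write tpr-files.dat and pullf-files.dat with one file name per line."""
    out = Path(directory)
    tpr_names = [f"umbrella{i}.tpr" for i in range(n_windows)]
    pullf_names = [f"pullf{i}.xvg" for i in range(n_windows)]
    (out / "tpr-files.dat").write_text("\n".join(tpr_names) + "\n")
    (out / "pullf-files.dat").write_text("\n".join(pullf_names) + "\n")
    return tpr_names, pullf_names


# Demonstration with three windows in a throwaway directory.
with tempfile.TemporaryDirectory() as scratch:
    write_wham_lists(3, scratch)
    listing = (Path(scratch) / "tpr-files.dat").read_text()
    pull_listing = (Path(scratch) / "pullf-files.dat").read_text()
print(listing, end="")
```

For the 23 windows of this tutorial one would call write_wham_lists(23) in the directory holding the simulation output.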
The g_wham utility will then open each of the umbrella.tpr and pullf.xvg files sequentially and run the WHAM analysis on
them. The -unit kCal option indicates that the output will be in kcal mol^-1, but you can also get results in kJ mol^-1 or kBT.
Note that you may have to discard the first several hundred ps of the trajectory as equilibration (using g_wham -b), since
we generated our starting configurations from a non-equilibrium simulation. Once the PMF converges, you should know
how much time was required to equilibrate your system. You should end up with a profile.xvg file that looks like the
following:
Please note that the result you obtain may be different, since the spacing recommended in this tutorial is different from the
spacing I actually used to generate this data in the original study. The overall shape of the curve should be similar, and
the value of ΔG (calculated as the difference between the plateau region of the PMF curve and the energy minimum of the
curve) should be close to -50.5 kcal mol^-1.
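The plateau-minus-minimum calculation can be sketched in Python as follows. The helper names and the plateau_from cutoff are hypothetical: where the plateau begins has to be judged from your own curve. The sketch assumes profile.xvg holds two columns (COM distance, PMF) preceded by header lines starting with # or @, and the sample values below are invented purely to exercise the code; they are not results.

```python
def read_xvg(lines):
    """Parse two-column data, skipping '#' comments and '@' plot directives."""
    data = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped[0] in "#@":
            continue
        fields = stripped.split()
        data.append((float(fields[0]), float(fields[1])))
    return data


def delta_g(profile, plateau_from):
    """PMF minimum minus the mean PMF over the plateau region."""
    plateau = [energy for x, energy in profile if x >= plateau_from]
    if not plateau:
        raise ValueError("no points at or beyond plateau_from")
    minimum = min(energy for _, energy in profile)
    return minimum - sum(plateau) / len(plateau)


# Invented sample data, not a real PMF.
sample = ["# example", "@ title x", "0.5 0.0", "1.0 20.0", "2.0 45.0", "4.0 50.0", "5.0 50.0"]
dg = delta_g(read_xvg(sample), plateau_from=4.0)
print(f"estimated binding free energy: {dg:.1f} (units of the input file)")
```

With a real file, one would pass open("profile.xvg") to read_xvg and choose plateau_from after confirming that the curve has actually flattened.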
The other output from the g_wham command will be a file called histo.xvg, which contains the histograms of the
configurations within the umbrella sampling windows. These histograms will determine whether or not there is sufficient
overlap between adjacent windows. For the types of simulations conducted as part of this tutorial, you may obtain
something like the following:
The above histogram shows reasonable overlap between windows from about 1.2 - 5 nm of COM spacing; the limited overlap
around 1 nm (green and blue curves) indicates that more sampling windows are likely necessary to obtain good results
from the WHAM algorithm. As it stands now, there is very little overlap between these two windows.
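Overlap can also be checked numerically instead of by eye. The sketch below is one simple way to do it and is not what g_wham does internally: each window's histogram is normalized to unit area and the shared area of neighbouring windows is summed. The function names and the 0.1 threshold are hypothetical, the histograms are assumed to share the same bins (as the columns of histo.xvg do), and the counts shown are made up for illustration.

```python
def overlap_fraction(hist_a, hist_b):
    """Shared area of two histograms on identical bins, each normalized to 1."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(min(a / total_a, b / total_b) for a, b in zip(hist_a, hist_b))


def weak_pairs(histograms, threshold):
    """Return (i, i+1, overlap) for adjacent windows overlapping less than threshold."""
    flagged = []
    for i in range(len(histograms) - 1):
        shared = overlap_fraction(histograms[i], histograms[i + 1])
        if shared < threshold:
            flagged.append((i, i + 1, shared))
    return flagged


# Made-up counts for three windows on eight shared bins.
w0 = [0, 5, 10, 5, 0, 0, 0, 0]
w1 = [0, 0, 0, 5, 10, 5, 0, 0]
w2 = [0, 0, 0, 0, 0, 0, 10, 10]
for left, right, shared in weak_pairs([w0, w1, w2], threshold=0.1):
    print(f"windows {left} and {right} overlap by only {shared:.2f}; consider adding a window")
```

Any pair flagged this way is a candidate for an additional window placed between the two.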
Summary
You have now hopefully been successful in conducting umbrella sampling simulations by generating a series of
configurations along a reaction coordinate, running biasing simulations, and extracting the PMF. The .mdp files provided
here serve as examples only, and should not be considered broadly applicable to all systems. Review the literature and
the GROMACS manual for adjustments to these files for efficiency and accuracy purposes.
If you have suggestions for improving this tutorial, if you notice a mistake, or if anything else is unclear, please feel free to
email me. Please note: this is not an invitation to email me for GROMACS problems. I do not advertise myself as a private
tutor or personal help service. That's what the gmx-users list is for. I may help you there, but only in the context of
providing service to the community as a whole, not just the end user.
Happy simulating!