Protein-ligand docking

8/29/22, 11:21 AM
Protein-ligand docking
Galaxy Training!
Protein-ligand docking
Authors:
Simon Bray
Overview
 Questions:
What is cheminformatics?
What is protein-ligand docking?
How can I perform a simple docking workflow in Galaxy?
 Objectives:
Create a small compound library using the ChEMBL database
Dock a variety of ligands to the active site of the Hsp90 protein
 Requirements:
Introduction to Galaxy Analyses
 Time estimation: 3 hours
 Level: Intermediate   
 Supporting Materials:
 Workflows
 FAQs
 Available on these Galaxies
 Last modification: Nov 17, 2021
 License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International
License. The GTN Framework is licensed under MIT.
https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/cheminformatics/tutorial.html#download-data
Introduction
Cheminformatics is the use of computational techniques and information about molecules to solve
problems in chemistry. This involves a number of steps: retrieving data on chemical compounds,
sorting data for properties which are of interest, and extracting new information. This tutorial will
provide a brief overview of all of these, centered around protein-ligand docking, a molecular
modelling technique. The purpose of protein-ligand docking is to find the optimal binding
between a small molecule (ligand) and a protein. It is generally applied to the drug discovery and
development process with the aim of finding a potential drug candidate. First, a target protein is
identified. This protein is usually linked to a disease and is known to bind small molecules. Second,
a ‘library’ of possible ligands is assembled. Ligands are small molecules that bind to a protein and
may interfere with protein function. Each of the compounds in the library is then ‘docked’ into the
protein to find the optimal binding position and energy.
Docking is a form of molecular modelling, but several simplifications are made in comparison to
methods such as molecular dynamics. Most significantly, the receptor is generally considered to be
rigid, with covalent bond lengths and angles held constant. Charges and protonation states are
also not permitted to change. While these approximations reduce accuracy to some extent, they
increase computational speed, which is necessary to screen a large compound library in a realistic
amount of time.
In this tutorial, you will perform docking of ligands into the N-terminus of Hsp90 (heat shock
protein 90). The tools used for docking are based on the open-source software AutoDock Vina
(Trott and Olson 2009).
 Biological background 
Agenda
In this tutorial, we will cover:
1. Download data
   1. Get data
2. Separating protein and ligand structures
   1. Creating and processing the compound library
3. Prepare files for docking
4. Docking
5. Optional: cheminformatics tools applied to the compound library
6. Post-processing and plotting
Download data
For this exercise, we need two datasets: a protein structure and a library of compounds. We will
download the former directly from the Protein Data Bank; the latter will be created by searching
the ChEMBL database (Gaulton et al. 2016).
Get data
 Hands-on: Data upload 
1. Create a new history for this tutorial
2. Search Galaxy for the Get PDB  tool. Request the accession code 2brc .
3. Rename the dataset to ‘Hsp90 structure’
4. Check that the datatype is correct (PDB file).
 Tip: Changing the datatype 
Figure 1: Structure of Hsp90 N-terminus, as recorded on the PDB. Visualization produced
using VMD (Humphrey et al. 1996).
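For readers who would like to fetch the same structure outside Galaxy, the following is a minimal sketch. It assumes that the RCSB file server serves entries under `https://files.rcsb.org/download/<ID>.pdb`; the helper names `rcsb_pdb_url` and `fetch_pdb` are illustrative and are not part of the Get PDB tool.

```python
from urllib.request import urlretrieve


def rcsb_pdb_url(accession):
    """Build a download URL for a PDB entry (assumed RCSB file-server layout)."""
    return f"https://files.rcsb.org/download/{accession.upper()}.pdb"


def fetch_pdb(accession, destination):
    """Download the entry to a local file. Requires network access."""
    path, _headers = urlretrieve(rcsb_pdb_url(accession), destination)
    return path


# Accession code used in this tutorial
url = rcsb_pdb_url("2brc")
print(url)
```

Calling `fetch_pdb("2brc", "hsp90.pdb")` would then save the structure locally; inside Galaxy the Get PDB tool does this for you.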
Separating protein and ligand structures
You can view the contents of the downloaded PDB file by pressing the ‘View data’ icon in the
history pane. After the header section (about 500 lines), the atoms of the protein and their
coordinates are listed. The lines begin with ATOM . At the end of the file, the atomic coordinates of
the ligand and the solvent water molecules are also listed, labelled HETATM . We will use the grep
tool to separate these molecules into separate files, and then convert the ligand file into SDF/MOL
format using the ‘Compound conversion’ tool, which is based on OpenBabel, an open-source
library for analyzing chemical data (O’Boyle et al. 2011).
 What is grep? 
 Hands-on: Separate protein and ligand 
1. Search in textfiles (grep)  with the following parameters:
 “Select lines from”: Downloaded PDB file ‘Hsp90 structure’
 “that”: Don't match
 “Regular Expression”: HETATM
All other parameters can be left as their defaults.
Rename the dataset ‘Protein (PDB)’.
The result is a file with all non-protein ( HETATM ) atoms removed.
2. Search in textfiles (grep)  with the following parameters. Here, we use grep again
to produce a file with only non-protein atoms.
 “Select lines from”: Downloaded PDB file ‘Hsp90 structure’
 “that”: Match
 “Regular Expression”: CT5 (the name of the ligand in the PDB file)
All other parameters can be left as their defaults.
Rename the dataset ‘Ligand (PDB)’.
This produces a file which only contains ligand atoms.
3. Compound conversion  with the following parameters:
 “Molecular input file”: Ligand PDB file created in step 2.
 “Output format”: MDL MOL format (sdf, mol)
 “Add hydrogens appropriate for pH”: 7.4
All other parameters can be left as their defaults.
Change the datatype to ‘mol’ and rename the dataset ‘Ligand (MOL)’.
Applying this tool will generate a representation of the structure of the ligand in MOL
format.
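The two grep steps above can be sketched in a few lines of Python. This is only an illustration of the filtering logic: the sample records below are made up (the coordinates are not taken from 2brc), the `grep` helper is hypothetical, and it uses plain substring matching rather than regular expressions.

```python
# Made-up PDB-style records for illustration only
SAMPLE_PDB = """\
HEADER    CHAPERONE
ATOM      1  N   GLU A  16       5.000   6.000   7.000  1.00 20.00           N
ATOM      2  CA  GLU A  16       6.000   7.000   8.000  1.00 20.00           C
HETATM 1001  C1  CT5 A1224       1.000   2.000   3.000  1.00 20.00           C
HETATM 1002  O1  CT5 A1224       2.000   3.000   4.000  1.00 20.00           O
HETATM 1003  O   HOH A2001       9.000   9.000   9.000  1.00 20.00           O
END
"""


def grep(lines, pattern, invert=False):
    """Keep lines containing pattern, or lines lacking it when invert is True."""
    return [line for line in lines if (pattern in line) != invert]


lines = SAMPLE_PDB.splitlines()
protein_lines = grep(lines, "HETATM", invert=True)  # step 1: "Don't match"
ligand_lines = grep(lines, "CT5")                   # step 2: "Match"

print(len(protein_lines), "lines kept for the protein")
print(len(ligand_lines), "lines kept for the ligand")
```

Note that matching on `CT5` keeps any line mentioning the ligand name, so in a real PDB file some header records may be selected along with the `HETATM` lines.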
At this stage, separate protein and ligand files have been created. Next, we want to generate a
compound library we can use for docking.
Creating and processing the compound library
In this step we will create a compound library, using data from the ChEMBL database.
 What chemical databases are available? 
We will generate our compound library by searching ChEMBL for compounds which have a similar
structure to the ligand in the PDB file we downloaded in the first step. There is a Galaxy tool for
accessing ChEMBL which requires data input in SMILES format; thus, the first step is to convert the
‘Ligand’ PDB file to a SMILES file. Then the search is performed, returning a SMILES file. For
docking, we would like to convert to SDF format, which we can do once again using the
‘Compound conversion’ tool.
 Hands-on: Generate compound library 
1. Compound conversion  with the following parameters:
 “Molecular input file”: ‘Ligand’ PDB file
 “Output format”: SMILES format (SMI)
Leave all other options as default.
2. Rename the output of the ‘compound conversion’ step to ‘Ligand SMILES’.
3. Search ChEMBL database  with the following parameters:
 “SMILES input type”: File
 “Input file”: ‘Ligand SMILES’ file
 “Search type”: Similarity
 “Tanimoto cutoff score”: 40
 “Filter for Lipinski’s Rule of Five”: Yes
All other parameters can be left as their defaults.
 Question 
Why are compounds filtered for Lipinski’s Rule of Five?
 Solution 
Optional: experiment with different combinations of options - adding different filters,
adjusting the Tanimoto coefficient.
4. Check that the datatype is correct (smi). This step is essential, as Galaxy does not
automatically detect the datatype for SMILES files.
 Tip: Changing the datatype 
5. Rename dataset ‘Compound library’.
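To make the Lipinski filter used in step 3 more concrete, here is a minimal sketch of a Rule of Five check. The four thresholds are the commonly quoted ones (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The function name and the descriptor values are hypothetical, and how strictly the ChEMBL tool applies the rule is not stated here, so treating any violation as a failure is an assumption.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return the list of Rule of Five criteria that a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    return violations


# Hypothetical descriptor values, not taken from ChEMBL
small_molecule = rule_of_five_violations(320.4, 2.1, 2, 5)
large_molecule = rule_of_five_violations(812.0, 6.3, 4, 12)

print("small molecule passes:", not small_molecule)
print("large molecule violations:", large_molecule)
```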
 Tip: Problems using the ChEMBL tool? 
Some other tools, which are not used in this tutorial, can help to develop a
more focused compound library. For example, the ‘Natural product likeness calculator’ and ‘Drug-likeness’ tools assign a score to compounds based on how similar they are to typical natural
products and drugs respectively; these scores could then be used to filter the library. If you are interested,
you can try them out on the library just generated.
 Tip: Generating a compound library 
 What are SMILES and SDF formats? 
Prepare files for docking
A processing step now needs to be applied to the protein structure and the docking candidates: each of the structures needs to be converted to PDBQT format before using the AutoDock Vina
docking tool.
Further, docking requires the coordinates of a binding site to be defined. Effectively, this defines a
‘box’ within which the docking software searches for an optimal binding pose. In this case, we
already know the location of the binding site, since the downloaded PDB structure contained a
bound ligand. There is a tool in Galaxy which can be used to automatically create a configuration
file for docking when ligand coordinates are already known.
 Hands-on: Generate PDBQT and config files for docking 
1. Prepare receptor  with the following parameters:
 “Select a PDB file”: ‘Protein’ PDB file.
2. Compound conversion  with the following parameters:
 “Molecular input file”: ‘Compound library’ file.
 “Output format”: SDF
 “Generate 3D coordinates”: Yes
 “Add hydrogens appropriate for pH”: 7.4
Leave all other options unchanged.
Rename to ‘Prepared ligands’
3. Calculate the box parameters for an AutoDock Vina job  with the following
parameters:
 “Input ligand or pocket”: Ligand (MOL) file.
 “x-axis buffer”: 5
 “y-axis buffer”: 5
 “z-axis buffer”: 5
 “Random seed”: 1
Rename to ‘Docking config file’.
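The box calculation in step 3 can be pictured with the sketch below. It assumes that the box is centred on the midpoint of the ligand coordinates and that the buffer is added on each side of the ligand along every axis; the Galaxy tool may define the buffer differently. The keys `center_x`, `size_x` and `seed` follow the AutoDock Vina configuration file format, while the function names and coordinates are hypothetical.

```python
def vina_box(coords, buffer=5.0, seed=1):
    """Derive box centre and size from ligand atom coordinates (x, y, z)."""
    xs, ys, zs = zip(*coords)
    config = {}
    for axis, values in zip("xyz", (xs, ys, zs)):
        low, high = min(values), max(values)
        config[f"center_{axis}"] = (low + high) / 2
        # Assumption: the buffer pads the ligand extent on both sides
        config[f"size_{axis}"] = (high - low) + 2 * buffer
    config["seed"] = seed
    return config


def format_config(config):
    """Render the parameters as 'key = value' lines, as in a Vina config file."""
    return "\n".join(f"{key} = {value}" for key, value in config.items())


# Hypothetical ligand coordinates in angstroms
ligand_coords = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]
box = vina_box(ligand_coords, buffer=5.0, seed=1)
print(format_config(box))
```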
Perhaps you are interested in a system which does not have a ligand within the binding site (an
apoprotein). In this case you need to run the fpocket tool to identify potential binding sites in the
protein structure - take a look at the following (optional) section.
 How to find the binding site of an apoprotein? 
Docking
Now that the protein and the ligand library have been correctly prepared and formatted, docking
can be performed.
 Hands-on: Perform docking 
1. Docking  with the following parameters:
 “Receptor”: ‘Protein PDBQT’ file.
 “Ligands”: ‘Prepared ligands’ file.
 “Specify pH value for ligand protonation”: 7.4
 “Specify parameters”: ‘Upload a config file to specify parameters’
 “Box configuration”: ‘Docking config file’
 “Exhaustiveness”: leave blank (it was specified in the previous step)
The output is a collection containing one SDF output file for each ligand. Each file holds
multiple docking poses together with their scores. We will now process these files to
extract the scores from the SD-files and select the best score for each ligand.
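Selecting the best score per ligand is a simple reduction, sketched below. AutoDock Vina scores are binding energy estimates, so a lower (more negative) value indicates a better pose. The ligand names and scores are hypothetical, as is the `best_scores` helper.

```python
def best_scores(poses):
    """Map each ligand to its lowest (best) docking score."""
    best = {}
    for ligand, score in poses:
        if ligand not in best or score < best[ligand]:
            best[ligand] = score
    return best


# Hypothetical (ligand, score) pairs, one per docking pose
poses = [
    ("ligand_A", -8.4),
    ("ligand_A", -5.7),
    ("ligand_B", -6.1),
    ("ligand_B", -6.9),
]
best = best_scores(poses)
for ligand, score in sorted(best.items(), key=lambda item: item[1]):
    print(ligand, score)
```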
Optional: cheminformatics tools applied to the compound library
The ChemicalToolbox contains a large number of cheminformatics tools. This section will
demonstrate some of the useful functionalities available. If you are just interested in docking, feel
free to skip this section - or, just try out the tools which look particularly interesting.
(This section can also be completed while waiting for the docking, which can take some time to
complete.)
Visualization
It can be useful to visualize the compounds generated. There is a tool available for this in Galaxy
based on OpenBabel.
 Hands-on: Visualization of chemical structures 
1. Visualisation  with the following parameters:
 “Molecular input file”: Compound library
 “Embed molecule as CML”: No
 “Draw all carbon atoms”: No
 “Use thicker lines”: No
 “Property to display under the molecule”: Molecule title
 “Sort the displayed molecules by”: Molecular weight
 “Format of the resultant picture”: SVG
This produces an SVG image of all the structures generated, ordered by molecular weight.
Figure 2: Structures of the compounds from ChEMBL.
Calculation of fingerprints and clustering
In this step, we will group similar molecules together. A key tool in cheminformatics for measuring
molecular similarity is fingerprinting, which entails extracting chemical properties of molecules and
storing them as a bitstring. These bitstrings can be easily compared computationally, for example
with a clustering method. The fingerprinting tools in Galaxy are based on the Chemfp tools (Dalke
2013).
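As a small illustration of how bitstrings are compared, the sketch below computes the Tanimoto coefficient, that is, the number of bits set in both fingerprints divided by the number of bits set in either. The three-bit fingerprints are toy examples (real fingerprints such as FP2 are far longer), and the `tanimoto` helper is written here for illustration rather than taken from chemfp.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two equal-length bitstrings such as '101'."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    shared = sum(a == "1" and b == "1" for a, b in zip(bits_a, bits_b))
    either = sum(a == "1" or b == "1" for a, b in zip(bits_a, bits_b))
    return shared / either if either else 0.0


# Toy fingerprints; bit order: (phenyl, amine, carboxylic acid)
fingerprints = {
    "mol_1": "110",
    "mol_2": "100",
    "mol_3": "001",
}
similarity_12 = tanimoto(fingerprints["mol_1"], fingerprints["mol_2"])
similarity_13 = tanimoto(fingerprints["mol_1"], fingerprints["mol_3"])
print(similarity_12, similarity_13)
```

Here `mol_1` and `mol_2` share one of two features, while `mol_1` and `mol_3` share none.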
Before clustering, let’s label each compound. To do so, add a second column to the SMILES
compound library containing a label for each molecule. The Ligand SMILES file is currently labelled with
something like /data/dnb02/galaxy_db/files/010/406/dataset_10406067.dat (the exact name
will vary), and we would like to give it a more useful name. When labelling is complete, we can
concatenate (join together) the library file with the original SMILES file for the ligand from the PDB
file.
 Hands-on: Calculate molecular fingerprints 
1. Replace  with the following parameters:
 “File to process”: Ligand SMILES .
 “Find pattern”: add the current label of the SMILES here. You can find it by
clicking the ‘view’ button next to the Ligand SMILES dataset - it will look
something like /data/dnb02/galaxy_db/files/010/406/dataset_10406067.dat .
 “Replace with”: ligand
2. Concatenate datasets  with the following parameters:
 “Datasets to concatenate”: Output of the previous step.
Click on Insert Dataset and in the new selection box which appears, select
‘Compound library’.
Run the step and rename the output dataset ‘Labelled compound library’.
3. Molecule to fingerprint  with the following parameters:
 “Molecule file”: ‘Labelled compound library’ file.
 “Type of fingerprint”: Open Babel FP2 fingerprints
Rename to ‘Fingerprints’.
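Steps 1 and 2 amount to a text replacement followed by joining two files, which can be sketched as follows. Everything in the example data is a placeholder: the SMILES strings are not the real ligand or ChEMBL hits, and the file path and compound label are made up.

```python
def relabel(smiles_text, old_label, new_label):
    """Replace the automatically generated label with a readable one."""
    return smiles_text.replace(old_label, new_label)


def concatenate(*datasets):
    """Join datasets, making sure each one ends with a newline."""
    return "".join(text if text.endswith("\n") else text + "\n" for text in datasets)


# Placeholder data: SMILES string, a tab, then the label
ligand_smiles = "c1ccccc1O\t/tmp/example/dataset_1.dat"
compound_library = "CCO\tEXAMPLE_COMPOUND_1\n"

labelled_ligand = relabel(ligand_smiles, "/tmp/example/dataset_1.dat", "ligand")
labelled_library = concatenate(labelled_ligand, compound_library)
print(labelled_library, end="")
```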
Taylor-Butina clustering (Butina 1999) provides a classification of the compounds into different
groups or clusters, based on their structural similarity. This method shows us how similar the
compounds are to the original ligand, and after docking, we can compare the results to the
proposed clusters to observe if there is any correlation.
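The idea behind this kind of clustering can be sketched as follows. This is a simplified variant written for illustration, not the implementation used by the Galaxy tool: molecules whose Tanimoto similarity reaches the threshold count as neighbours, the molecule with the most unassigned neighbours becomes a cluster centroid, and its neighbours join that cluster. Fingerprints are represented as sets of 'on' bit positions, and all names and values are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def butina_clusters(fingerprints, threshold=0.8):
    """Greedy centroid-based clustering; returns (centroid, members) pairs."""
    names = sorted(fingerprints)
    neighbours = {
        name: {
            other for other in names
            if other != name
            and tanimoto(fingerprints[name], fingerprints[other]) >= threshold
        }
        for name in names
    }
    unassigned = set(names)
    clusters = []
    while unassigned:
        # Pick the molecule with most unassigned neighbours (ties: by name)
        centroid = max(sorted(unassigned),
                       key=lambda name: len(neighbours[name] & unassigned))
        members = neighbours[centroid] & unassigned
        clusters.append((centroid, sorted(members)))
        unassigned -= members | {centroid}
    return clusters


# Hypothetical fingerprints
fingerprints = {
    "ligand": {1, 2, 3, 4, 5},
    "mol_a": {1, 2, 3, 4, 5, 6},
    "mol_b": {1, 2, 3, 4, 5, 7},
    "mol_c": {10, 11},
}
clusters = butina_clusters(fingerprints, threshold=0.8)
print(clusters)
```

In this toy data set, `mol_a` and `mol_b` are each similar to `ligand` but not to each other, so they are grouped around `ligand`, while `mol_c` forms a cluster of its own.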
Figure 3: A simple fingerprinting system. Each 1 or 0 in the bitstring corresponds to the
presence or absence of a particular feature in the molecule. In this case, the presence of
phenyl, amine and carboxylic acid groups is encoded.
 Hands-on: Cluster molecules using molecular fingerprints 
1. Taylor-Butina clustering  with the following parameters:
 “Fingerprint dataset”: ‘Fingerprints’ file.
 “threshold”: 0.8
2. NxN clustering  with the following parameters:
 “Fingerprint dataset”: ‘Fingerprints’ file.
 “threshold”: 0.0
 “Format of the resulting picture”: SVG
 “Output options”: Image
The image produced by the NxN clustering shows the clustering in the form of a dendrogram,
where individual molecules are represented as vertical lines and merged into clusters. Merges are
represented by horizontal lines. The y-axis represents the similarity of data points to each other;
thus, the lower a cluster is merged, the more similar the data points are which it contains. Clusters
in the dendrogram are colored differently. For example, all molecules connected in red are similar
enough to be grouped into the same cluster.
Figure 4: Dendrogram produced by NxN clustering. The library used to produce this
image is generated with a Tanimoto cutoff of 80; here 15 search results are shown, plus
the original ligand contained in the PDB file.
 Further investigation (optional) 
Post-processing and plotting
From our collection of SD-files, we first extract all stored values into tabular format and then
combine the files together to create a single tabular file.
 Hands-on: Process SD-files 
1. Extract values from an SD-file  with the following parameters:
 “Input SD-file”: Collection of SD-files generated by the docking step. (Remember
to select the ‘collection’ icon!)
 “Include the property name as header”: Yes
 “Include SMILES as column in output”: Yes
 “Include molecule name as column in output”: Yes
Leave all other parameters unchanged.
2. Collapse Collection  with the following parameters:
 “Collection of files to collapse into single dataset”: Collection of tabular files
generated by the previous step.
 “Keep one header line”: Yes
 “Append File name”: No
 Tip: Selecting a dataset collection as input 
3. Compound conversion  with the following parameters:
 “Molecular input file”: choose one of the SD-files from the collection generated
by the docking step.
 “Output format”: Protein Data Bank format (pdb)
 “Split multi-molecule files into a collection”: Yes
Leave all other parameters unchanged.
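The extraction in step 1 relies on the way SD-files store values: each record ends with `$$$$`, and each stored value appears as a `> <NAME>` line followed by the value. The sketch below parses such data items into a tab-separated table. The sample records contain no atoms, the property names `SCORE` and `RMSD_LB` are assumptions about what the docking output contains, and the values are illustrative.

```python
# Minimal illustrative SD records (no atoms; made-up values)
SAMPLE_SD = """\
pose_1
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SCORE>
-8.4

> <RMSD_LB>
0.0

$$$$
pose_2
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SCORE>
-5.7

> <RMSD_LB>
2.1

$$$$
"""


def parse_sd_properties(sd_text):
    """Return one dict per record with the title and all data items."""
    records = []
    for block in sd_text.split("$$$$"):
        lines = block.strip("\n").splitlines()
        if not lines:
            continue
        props = {"name": lines[0].strip()}
        for i, line in enumerate(lines):
            if line.startswith(">") and "<" in line:
                start = line.index("<")
                key = line[start + 1:line.index(">", start)]
                props[key] = lines[i + 1].strip() if i + 1 < len(lines) else ""
        records.append(props)
    return records


def to_tabular(records, columns):
    """Render the selected columns as tab-separated text with a header."""
    rows = ["\t".join(record.get(col, "") for col in columns) for record in records]
    return "\n".join(["\t".join(columns)] + rows)


records = parse_sd_properties(SAMPLE_SD)
table = to_tabular(records, ["name", "SCORE", "RMSD_LB"])
print(table)
```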
We now have a tabular file available which contains all poses calculated for all ligands docked,
together with scores and RMSD values for the deviation of each pose from the optimum. We also
have PDB files for some of the docking poses which can be inspected using the NGLViewer
visualization embedded in Galaxy.
Further investigation (optional)
Use the NGLviewer to inspect the protein and various ligand poses generated by docking. This can
be done using either the visualization of NGLViewer in Galaxy, or via the NGL website.
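When comparing poses, it can help to recall what an RMSD value measures. The sketch below computes a plain root-mean-square deviation between two poses with the same atom ordering. AutoDock Vina reports lower-bound and upper-bound RMSD variants with their own definitions, so this is only a simplified illustration; the coordinates are hypothetical.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses with matching atom order."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))


# Hypothetical two-atom poses, coordinates in angstroms
best_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
other_pose = [(0.0, 0.0, 2.0), (1.0, 2.0, 0.0)]
deviation = rmsd(best_pose, other_pose)
print(deviation)
```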
Figure 6: Two docking poses for a ligand bound to the active site of Hsp90. One (docking
score -8.4) can be seen to be bound deeper in the active site than the other (docking
score -5.7), which is reflected in the difference between the docking scores.
 Key points
Docking allows ‘virtual screening’ of drug candidates
Molecular fingerprints encode features into a bitstring
The ChemicalToolbox contains many tools for cheminformatics analysis
Frequently Asked Questions
Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the
Computational chemistry topic to see if your question is listed there. If not, please ask your
question on the GTN Gitter Channel or the Galaxy Help Forum.
Useful literature
Further information, including links to documentation and original publications, regarding the
tools, analysis techniques and the interpretation of results described in this tutorial can be found
here.
References
1. Humphrey, W., A. Dalke, and K. Schulten, 1996 VMD – Visual Molecular Dynamics. Journal
of Molecular Graphics 14: 33–38.
2. Butina, D., 1999 Unsupervised data base clustering based on daylight’s fingerprint and
Tanimoto similarity: A fast and automated way to cluster small and large data sets.
Journal of Chemical Information and Computer Sciences 39: 747–750.
3. Lipinski, C. A., 2004 Lead- and drug-like compounds: the rule-of-five revolution. Drug
Discovery Today: Technologies 1: 337–341. 10.1016/j.ddtec.2004.11.007
4. Pearl, L. H., and C. Prodromou, 2006 Structure and Mechanism of the Hsp90 Molecular
Chaperone Machinery. Annual Review of Biochemistry 75: 271–294.
10.1146/annurev.biochem.75.103004.142738
5. Le Guilloux, V., P. Schmidtke, and P. Tuffery, 2009 Fpocket: an open source platform for
ligand pocket detection. BMC bioinformatics 10: 168.
6. Trott, O., and A. J. Olson, 2009 AutoDock Vina: Improving the speed and accuracy of
docking with a new scoring function, efficient optimization, and multithreading.
Journal of Computational Chemistry NA–NA. 10.1002/jcc.21334
7. O’Boyle, N. M., M. Banck, C. A. James, C. Morley, T. Vandermeersch et al., 2011 Open Babel:
An open chemical toolbox. Journal of Cheminformatics 3: 10.1186/1758-2946-3-33
8. Dalke, A., 2013 The FPS fingerprint format and chemfp toolkit. Journal of cheminformatics
5: P36.
9. Gaulton, A., A. Hersey, M. Nowotka, A. P. Bento, J. Chambers et al., 2016 The ChEMBL
database in 2017. Nucleic acids research 45: D945–D954.
Feedback
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Citing this Tutorial
1. Simon Bray, 2021 Protein-ligand docking (Galaxy Training Materials).
https://training.galaxyproject.org/training-material/topics/computational-chemistry/tutorials/cheminformatics/tutorial.html
Online; accessed Mon Aug 29 2022
2. Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems
10.1016/j.cels.2018.05.012
 Congratulations on successfully completing this
tutorial!