DEPLOYMENT OF A DOCKING PLATFORM FOR IN SILICO DRUG DISCOVERY ON GRIDS
Document identifier: drug discovery - application to EGAAP v2.0
Date: 12/02/2016
Activity: EGEE NA4: applications identification and support activity; SIMDAT
Authors: V. Breton, M. Hofmann, T. Schwede, M. Podvinec, N. Jacq
Document status: DRAFT
Document link: https://edms.cern.ch/document
Abstract: This document proposes the deployment of a high-throughput virtual screening platform for
in silico drug discovery against neglected diseases. The platform will be jointly
developed by the SIMDAT and EGEE projects in collaboration with the SwissBioGRID initiative, the
Swiss Institute of Bioinformatics (Biozentrum Basel), the INSTRUIRE regional grid in Auvergne
and the CampusGRID Bonn Aachen regional GRID.
IST-2002-508833
INTERNAL
1
Document Log
Issue  Date        Comment                                                                                          Author
0-1    29.09.2004  First version                                                                                    V. Breton, N. Jacq
0-2    14.10.2004  Added CampusGrid & Dengue information; modified some of the wording                              M. Hofmann
0-3    19.10.2004  Added SwissBioGrid & Dengue information; modified some of the wording                            T. Schwede, M. Podvinec
0-4    26.10.2004  Modified version number of the document; added requirements part; modified some of the wording  V. Breton, N. Jacq
2.0    08.11.2004  Renumbering, final edition                                                                       V. Breton
Document Change Record
Issue  Item  Reason for Change
CONTENT
1. INTRODUCTION
1.1. PURPOSE
1.2. APPLICATION AREA
1.3. REFERENCES
1.4. DOCUMENT EVOLUTION PROCEDURE
1.5. TERMINOLOGY
2. EXECUTIVE SUMMARY
3. RATIONALE
3.1. ADDRESSING MALARIA ON A GRID
3.2. ADDRESSING DENGUE FEVER BY GRID-BASED VIRTUAL SCREENING
3.3. DESCRIPTION OF THE APPLICATION
4. IMPLEMENTATION
4.1. PROPOSED ROADMAP
4.2. RESOURCE EVALUATION
4.3. REQUIREMENTS
4.3.1. Job submission
4.3.2. Data management
4.3.3. Information system
4.3.4. Storage
4.3.5. Operation
5. PARTICIPANTS
1. INTRODUCTION
1.1. PURPOSE
This document proposes the deployment of a docking platform for in silico drug
discovery against neglected diseases. The platform will be jointly developed by the SIMDAT and EGEE
projects in collaboration with the SwissBioGRID initiative, the Swiss Institute of Bioinformatics at the
Biozentrum Basel, and the INSTRUIRE regional grid in Auvergne. It will also be linked to the newly
established project "CampusGRID Bonn Aachen", a GRID project in the field of life and medical
sciences funded by the Ministry of Science and Research of the state of North Rhine-Westphalia.
1.2. APPLICATION AREA
The document addresses issues related to Drug Discovery in a grid environment.
1.3. REFERENCES
[1] FlexX: Rarey M., Kramer B., Lengauer T., Klebe G. A fast flexible docking method using an
incremental construction algorithm. J. Mol. Biol. (1996) 261, 470-489.
[2] Autodock: Morris G. M., Goodsell D. S., Halliday R. S., Huey R., Hart W. E., Belew R. K.,
Olson A. J. Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free
Energy Function. J. Computational Chemistry (1998) 19, 1639-1662.
[3] Dock: Ewing T. J., Makino S., Skillman A. G., Kuntz I. D. DOCK 4.0: search strategies for
automated molecular docking of flexible molecule databases. J. Comput. Aided Mol. Des. (2001)
15(5), 411-428.
[4] ZINC: see http://zinc.docking.org
1.4. DOCUMENT EVOLUTION PROCEDURE
This document will be updated incrementally as the collaboration evolves.
Comments should be sent to the authors.
1.5. TERMINOLOGY
Glossary
DD   Drug Discovery
HTS  High-throughput Screening
HTD  High-throughput Docking
DHF  Dengue Hemorrhagic Fever
Definitions
2. EXECUTIVE SUMMARY
Grid technology promises more effective means of managing information and enhancing
knowledge-based processes in precisely the kind of environment that is well established in
pharmaceutical research and development.
This document proposes deploying in silico screening on a grid infrastructure, using
a large library of compounds (approx. 2 million compounds) and shared docking algorithms, with a
view to drug discovery for two neglected diseases, malaria and dengue fever.
The project involves two European FP6 projects, SIMDAT and EGEE, the SwissBioGrid initiative, the
Swiss Institute of Bioinformatics (Biozentrum Basel), the INSTRUIRE regional grid and the
CampusGRID Bonn Aachen regional GRID.
3. RATIONALE
The challenges in Drug Discovery (DD) today lie in understanding diseases, elucidating
pathways and exploiting information at the molecular and cellular level, which should shorten the
life cycle of DD projects and increase the probability of success after lead optimization.
Information Technology can transform the DD process by providing comprehensive and reliable data and
information, seamlessly integrated for easy navigation, together with simulation of biomolecular
processes, data mining and e-collaboration for dispersed teams. Grids are creating a fertile ground
for the development of such new services. Grid technology provides the collaborative IT environment
needed to speed up drug development, from molecular biology research to goal-oriented field
work. A pharmaceutical Grid should be a shared in silico resource to guarantee and preserve
knowledge in the areas of discovery, development, manufacturing, marketing and sales of new drug
therapies, and should cover three dimensions:
- a resource that provides extremely large CPU power to perform computing-intensive tasks in a
  transparent way by means of an automated job submission and distribution facility
- a resource that provides transparent and secure access to storage and archiving of large
  amounts of data in an automated and self-organized mode
- a resource that connects, analyses and structures data and information in a transparent mode
  according to pre-defined rules (science or business process based)
Drug discovery efforts are traditionally associated with pharmaceutical companies. The path from
the initial discovery of a biological target to the development of a single lead compound
into a drug takes up to 12 years in a highly sequential process. Pharmaceutical GRIDs hold the power
to change the DD process in significant ways: they can speed it up by enabling large-scale
computational approaches previously considered infeasible, such as high-throughput docking
(HTD). More importantly, due to their innate characteristics, public GRIDs foster collaboration
between academic labs.
The motivation is to enhance the ability of both the pharmaceutical industry and academic
research institutions to share diverse, complex and distributed information on a given disease for
collaborative exploration and mutual benefit. The goal is to lower the barrier to such substantive
interactions in order to produce cheaper drugs and insecticides against diseases that hamper
development in the third world, and to increase the return on investment for new drugs in the
developed countries.
Two neglected diseases have been identified as targets to prove the feasibility of the concept of a
Pharmaceutical Grid: dengue and malaria. The deployment on EGEE and the INSTRUIRE grid will
focus on addressing malaria. The Swiss Institute of Bioinformatics at the Biozentrum Basel will focus
on HTD against dengue fever targets using computational resources provided by the
SwissBioGrid initiative. CampusGRID Bonn Aachen will focus on technical aspects and
benchmarking of GRID-enabled virtual screening tools.
3.1. ADDRESSING MALARIA ON A GRID
The number of cases of and deaths from malaria is increasing in many parts of the world. Each year
malaria causes 300 to 500 million new infections, 1 to 3 million deaths and, in Africa, a 1 to 4%
loss of gross domestic product (at least $12 billion).
For both vector control and chemotherapy, knowing the gene sequences of Anopheles and Plasmodium
species should lead to discovery of targets against which new insecticides or anti-malarial drugs can
be produced. However, such discoveries are likely to be patented and only developed at prices
unaffordable to governments or villagers in tropical countries. Genomics research is crucial to
attacking malaria, but this research must be strongly coupled to ground studies in order to guide
disease controllers, especially those working in countries with annual health budgets of less than $10
per person. Evaluation of the impact of new drugs and new vaccines requires careful monitoring of
clinical tests, especially in areas of high malaria transmission where it is important to distinguish
recurrence of parasites, due to recrudescence of incompletely cured infections, from re-infection due
to new mosquito bites. Disease controllers also need to monitor goal-oriented
field work over the long term.
3.2. ADDRESSING DENGUE FEVER BY GRID-BASED VIRTUAL SCREENING
Dengue and dengue hemorrhagic fever (DHF) are caused by one of four closely related, but
antigenically distinct, virus serotypes (DEN-1, DEN-2, DEN-3, and DEN-4), of the genus Flavivirus.
Infection with one of these serotypes does not provide cross-protective immunity; thus, persons living
in a dengue-endemic area can have four dengue infections during their lifetimes. Dengue is primarily a
disease of the tropics, and the viruses that cause it are maintained in a cycle that involves humans and
Aedes aegypti, a domestic, day-biting mosquito that prefers to feed on humans. Infection with dengue
viruses produces a spectrum of clinical illness ranging from a non-specific viral syndrome to severe
and fatal hemorrhagic disease. Important risk factors for DHF include the strain and serotype of the
infecting virus, as well as the age, immune status, and genetic predisposition of the patient. The
search for inhibitors of Dengue propagation and proliferation could target viral functions in both
hosts, humans and mosquitoes.
The genomes of several Dengue strains have been sequenced and possible targets with enzymatic
function have already been identified. For three of these target proteins, experimental 3-dimensional
structures are available. For one additional target, a comparative model based on a homologous
protein from human hepatitis virus is available. We will use the GRID infrastructure to perform HTD
based on these three-dimensional structures of the target proteins. The most promising compounds
identified by computational screening will be transferred to interested parties with the capability to
further analyze their value in a Drug Discovery program aimed at treating Dengue. Such parties may
include institutions like the Novartis Institute for Tropical Diseases in Singapore.
3.3. DESCRIPTION OF THE APPLICATION
The proposed application is the deployment of a docking platform. This deployment would involve two
European projects, SIMDAT and EGEE, the SwissBioGrid initiative, the Swiss Institute of
Bioinformatics (Biozentrum Basel), and two regional initiatives, the INSTRUIRE regional grid in
Auvergne and CampusGRID Bonn Aachen, a life and medical GRID initiative of the state of North
Rhine-Westphalia.
This platform consists of a library of compounds and shared docking algorithms deployed on several
grid nodes. Such a deployment would allow massive in silico screening comparable to that performed
within the framework of the French Telethon or by the United Devices Grid (www.grid.org). Hits
coming out of the in silico screening of target proteins will have to be validated, and promising
compounds transferred to parties able to analyze their value in a Drug Discovery program
(bioassays, toxicity, efficacy).
In the future, hits will be made available in a common knowledge space. This knowledge space should
attract prominent biologists, life scientists, and members of the pharmaceutical industry to help steer
the deployment of grids that will enable the scientific community to achieve substantially higher
effective performance and functionality than is possible today. The long-term aim of the application
is to build an open-source Drug Discovery grid for neglected diseases.
4. IMPLEMENTATION
The application corresponds to the initial phase of a Drug Discovery fabric, i.e. high-throughput in
silico screening/docking of small compounds from available libraries against molecular drug targets to
identify hits as very early drug candidates. Implementation requires the following elements:
- Docking algorithms. Quite a few docking algorithms are available today, but their
  deployment on a grid raises licensing issues. Dock [3] and Autodock [2] are available for
  free for non-commercial projects. The FlexX [1] docking software package was
  developed at Fraunhofer SCAI and is now distributed on a commercial basis by BioSolveIT.
  SCAI will give access to FlexX within the framework of this collaboration.
- A library of compounds. Existing compound libraries are often privately owned (typically
  by companies that sell compound libraries). One of our major goals for the future is
  therefore a public library of drug-like compounds to support academic drug discovery
  approaches.
- A grid infrastructure allowing the deployment of large-scale computations on large sets of
  data.
The grid infrastructure includes a set of services to prepare, submit and handle jobs on the
infrastructure.
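Preparing jobs for such an infrastructure largely amounts to cutting the compound library into batches, one batch per grid job. The Python sketch below illustrates this under stated assumptions: the function name, the identifier format and the batch size are hypothetical and are not taken from the project.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_compounds(compound_ids: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of compound identifiers, one batch per grid job."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(compound_ids)
    while True:
        # Take up to batch_size identifiers; an empty batch means the library is exhausted.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical identifiers; a real library uses its own naming scheme.
library = [f"compound-{i:07d}" for i in range(10)]
jobs = list(batch_compounds(library, batch_size=4))
```

Because the function consumes an iterator lazily, a library of millions of compounds never has to be held in memory at once; with a batch size of 1, each compound becomes its own job.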
4.1. PROPOSED ROADMAP
As a first step, the docking process on the grid needs to be validated on a known public data set. The
reliability and confidence level of the software stack will be evaluated on well-defined
benchmark data sets, e.g. the GOLD benchmark. At the same time, we will deploy the large library of
compounds for the two use cases and evaluate the performance of the infrastructure. Subsequently,
activity will focus on the malaria and dengue target proteins, with the goal of demonstrating the
impact of the grid on drug discovery for neglected diseases. The challenge is to improve the docking
process in an open-source environment on a large library of compounds.
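The text does not say how the benchmark validation will be scored. A common convention in docking studies, assumed here purely for illustration, is to compare each docked ligand pose with the experimental pose by root-mean-square deviation (RMSD) and to count a docking as successful when the RMSD is at most 2 Å. The Python sketch below assumes both poses list the same atoms in the same order, in the receptor's coordinate frame; all names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coordinate = Tuple[float, float, float]


def rmsd(pose: Sequence[Coordinate], reference: Sequence[Coordinate]) -> float:
    """Root-mean-square deviation between two poses with identical atom ordering."""
    if not pose or len(pose) != len(reference):
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(pose, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(squared / len(pose))


def success_rate(rmsds: Sequence[float], threshold: float = 2.0) -> float:
    """Fraction of benchmark complexes docked within the RMSD threshold (in Angstrom)."""
    if not rmsds:
        raise ValueError("at least one RMSD value is required")
    return sum(1 for value in rmsds if value <= threshold) / len(rmsds)
```

Running the same benchmark on a local machine and on the grid and comparing the two success rates would be one way to check that the distributed deployment reproduces the reference results.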
The docking software initially comprises packages available under open-source or no-cost licences
(Dock, Autodock) and FlexX, which will be provided by SCAI within the framework of this
collaboration. Two compound libraries will be evaluated for use in this study: the open-source
collection ZINC [4], and the SCAI library of drug-like compounds provided by Marc Zimmermann
(SCAI) within the framework of this collaboration.
We propose to join the Biomed Virtual Organization (VO). The LPC Clermont-Ferrand and Fraunhofer
SCAI accept or will accept this VO. To protect the intellectual property rights (IPR) on FlexX, the
algorithm will be shipped with each job via the input sandbox. At present, we cannot install the
compound database and the algorithms on our nodes.
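Shipping the docking software through the input sandbox means that each job description lists the files to send along with the job. The Python sketch below renders such a description, assuming the Job Description Language (JDL) of the EGEE workload management system; the attribute names reflect that language as commonly documented, while the function name and all file names are hypothetical placeholders.

```python
from typing import Sequence


def render_jdl(executable: str, arguments: str,
               input_files: Sequence[str], output_files: Sequence[str],
               vo: str) -> str:
    """Render a minimal JDL job description as text."""

    def quoted_list(items: Sequence[str]) -> str:
        # JDL lists are written as {"a", "b"}.
        return "{" + ", ".join(f'"{item}"' for item in items) + "}"

    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        # Files in the input sandbox travel with the job instead of being installed on the node.
        f"InputSandbox = {quoted_list(input_files)};",
        f"OutputSandbox = {quoted_list(output_files)};",
        f'VirtualOrganisation = "{vo}";',
    ]
    return "\n".join(lines)


# Hypothetical file names, for illustration only.
jdl_text = render_jdl(
    executable="run_docking.sh",
    arguments="target.pdb ligands_0001.mol2",
    input_files=["run_docking.sh", "docking_engine.tar.gz", "target.pdb", "ligands_0001.mol2"],
    output_files=["std.out", "std.err", "scores_0001.txt"],
    vo="biomed",
)
```

The input sandbox is intended for small files, so this approach suits the docking executable and a handful of ligands per job; a multi-million compound library would more plausibly be registered through the data management services listed in the requirements.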
4.2. RESOURCE EVALUATION
Grid deployment started during the summer on the GILDA testbed. The partners are
experienced in the deployment of grid applications used in complex Drug Discovery processes.
The following estimates are based on tests with Autodock:
- During the first step, one docking process will require 300 small jobs of 30 seconds. The
  CPU usage will be approximately 25 h per week. For each job, the input size is 8 MB and the
  output size is 1 MB, so the transferred data volume will be around 27 GB per week.
- During the second step, one docking process will consist of 2,000,000 small jobs of 30
  seconds. The CPU usage will be approximately 8 300 h for one target protein. The transferred
  data volume will be around 9 TB for one target protein.
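The estimates above are products of the per-job figures. The Python sketch below reproduces that arithmetic; the class and field names are illustrative, and applying the first step's 8 MB input and 1 MB output per job to the second step is an assumption, since no per-job sizes are given for it. Under these assumptions one first-step docking process costs 2.5 CPU hours and moves 2.7 GB, so the quoted 25 h and 27 GB per week correspond to about ten docking processes per week. For the second step the plain product comes out at roughly twice the quoted 8 300 h and 9 TB, which suggests that those figures rest on a further assumption not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockingStep:
    """Per-job figures for one docking process (illustrative names)."""
    jobs: int
    seconds_per_job: float
    input_mb: float
    output_mb: float

    def cpu_hours(self) -> float:
        # Total CPU time of all jobs, converted from seconds to hours.
        return self.jobs * self.seconds_per_job / 3600.0

    def transfer_gb(self) -> float:
        # Input plus output volume of all jobs, using 1 GB = 1000 MB.
        return self.jobs * (self.input_mb + self.output_mb) / 1000.0


step1 = DockingStep(jobs=300, seconds_per_job=30, input_mb=8, output_mb=1)
# Assumption: the same per-job data sizes apply to the second step.
step2 = DockingStep(jobs=2_000_000, seconds_per_job=30, input_mb=8, output_mb=1)

# Number of first-step docking processes implied by the quoted 25 CPU hours per week.
runs_per_week = 25 / step1.cpu_hours()
weekly_transfer_gb = runs_per_week * step1.transfer_gb()
```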
4.3. REQUIREMENTS
The main requirements for our application are listed below. They are based on the EGEE list
available at http://egee-na4.ct.infn.it/requirements/list.php. Several are essential. For instance, we
absolutely need to protect IPR on FlexX and the compounds. We also need to manage millions of jobs and
files.
4.3.1. Job submission
4.3.1.1. Multiple data jobs
4.3.1.2. Compound jobs execution (pipelining)
4.3.1.3. Job access to data
4.3.1.4. Scalability
4.3.2. Data management
4.3.2.1. Fine grain control of access rights
4.3.2.2. Group of files
4.3.2.3. Access rights delegation
4.3.2.4. Data replication control
4.3.2.5. Data registration, retrieval, and deletion
4.3.2.6. Scalability
4.3.3. Information system
4.3.3.1. Jobs information and status notification
4.3.3.2. Top level information system index
4.3.3.3. Resource brokers index
4.3.3.4. Grid resource browsing
4.3.3.5. Licensed software management
4.3.4. Storage
4.3.4.1. On-disk encryption
4.3.4.2. Hook on data privacy manager
4.3.4.3. Communications encryption
4.3.4.4. Outbound connectivity
4.3.5. Operation
4.3.5.1. VO creation
4.3.5.2. User control
4.3.5.3. User login
4.3.5.4. Software package publication
4.3.5.5. Robustness
4.3.5.6. Multiple VOs registration
5. PARTICIPANTS
The two grid projects (the SIMDAT and EGEE European projects), in collaboration with the
SwissBioGrid initiative, the Swiss Institute of Bioinformatics (Biozentrum Basel), the INSTRUIRE
regional grid and CampusGRID Bonn Aachen, will join their efforts to develop and host
this initial prototype together:
- EGEE will provide human resources to deploy docking algorithms on its infrastructure, to
  create and run a Drug Discovery VO. EGEE will provide the needed services for the VO.
- SIMDAT will provide human resources (especially GRID expertise) in the field of GRID
  based data access, bioinformatics tool application and ontology-based services.
- The SwissBioGrid initiative will provide a computational platform, as well as human
  resources with expertise in the field of GRID infrastructure.
- The Swiss Institute of Bioinformatics at the Biozentrum Basel will provide human resources
  with expertise in the field of GRID based bioinformatics tools, and will be responsible for
  the scientific aspects of the Dengue docking project.
- The INSTRUIRE regional grid in Auvergne will provide its infrastructure during the year
  2005.
- CampusGRID Bonn Aachen will perform benchmarking studies comparing the different
  virtual screening tools proposed for this project.
The laboratories involved in this application are:

Laboratoire de Physique Corpusculaire of Clermont-Ferrand - CNRS/IN2P3
CNRS-IN2P3 will give access to the Clermont-Ferrand resource.
Vincent Breton (IN2P3), Head of the PCSV team at LPC, NA4 leader for EGEE
Nicolas Jacq (IN2P3), PhD student in bioinformatics in the PCSV team, technical contact point
Jean Salzemann (IN2P3)
Emmanuel Medernach (IN2P3)

SCAI Fraunhofer
Martin Hofmann, Head of the Department of Bioinformatics at SCAI
Kai Kumpf, group leader GRID applications and technical contact point to SIMDAT (pharma activity)
Marc Zimmermann, group leader chemoinformatics at Fraunhofer SCAI
Horst Schwichtenberg, principal investigator and project leader EGEE at SCAI
Astrid Maass, research scientist in the group of Horst Schwichtenberg

Biozentrum Basel / Swiss Institute of Bioinformatics
Torsten Schwede, group leader at the Biozentrum Basel and Swiss Institute of Bioinformatics
Michael Podvinec, project leader of the Dengue Project
Konstantin Arnold, research programmer in the group of Torsten Schwede

SwissBioGrid Initiative
Marie-Christine Sawley, Head of CSCS
Patrick Wieghardt, SBG Project Manager
Sergio Maffioletti, SBG Technical Project Manager