DEPLOYMENT OF A DOCKING PLATFORM FOR IN SILICO DRUG DISCOVERY ON GRIDS

Document identifier: drug discovery - application to EGAAP v2.0
Date: 12/02/2016
Activity: EGEE NA4: applications identification and support activity, SIMDAT
Authors: V. Breton, M. Hofmann, T. Schwede, M. Podvinec, N. Jacq
Document status: DRAFT
Document link: https://edms.cern.ch/document

Abstract: This document proposes the deployment of a high-throughput virtual screening platform with a view to in silico drug discovery for neglected diseases. This platform will be jointly developed by the SIMDAT and EGEE projects in collaboration with the SwissBioGRID initiative, the Swiss Institute of Bioinformatics (Biozentrum Basel), the INSTRUIRE regional grid in Auvergne and the CampusGRID Bonn Aachen regional GRID.

IST-2002-508833 INTERNAL

Document Log

Issue  Date        Comment                                                                                         Author
0-1    29.09.2004  First version                                                                                   V. Breton, N. Jacq
0-2    14.10.2004  Added CampusGrid & Dengue information; modified some of the wording                             M. Hofmann
0-3    19.10.2004  Added SwissBioGrid & Dengue information; modified some of the wording                           T. Schwede, M. Podvinec
0-4    26.10.2004  Modified version number of the document; added requirements part; modified some of the wording  V. Breton, N. Jacq
2.0    8.11.04     Renumbering, final edition                                                                      V. Breton

Document Change Record

Issue  Item  Reason for Change

CONTENT

1. INTRODUCTION
   1.1. PURPOSE
   1.2. APPLICATION AREA
   1.3. REFERENCES
   1.4. DOCUMENT EVOLUTION PROCEDURE
   1.5. TERMINOLOGY
2. EXECUTIVE SUMMARY
3. RATIONALE
   3.1. ADDRESSING MALARIA ON A GRID
   3.2. ADDRESSING DENGUE FEVER BY GRID-BASED VIRTUAL SCREENING
   3.3. DESCRIPTION OF THE APPLICATION
4. IMPLEMENTATION
   4.1. PROPOSED ROADMAP
   4.2. RESOURCE EVALUATION
   4.3. REQUIREMENTS
      4.3.1. Job submission
      4.3.2. Data management
      4.3.3. Information system
      4.3.4. Storage
      4.3.5. Operation
5. PARTICIPANTS

1. INTRODUCTION

1.1. PURPOSE

This document proposes the deployment of a docking platform with a view to in silico drug discovery for neglected diseases. This platform will be jointly developed by the SIMDAT and EGEE projects in collaboration with the SwissBioGRID initiative, the Swiss Institute of Bioinformatics at the Biozentrum Basel, and the INSTRUIRE regional grid in Auvergne. It will also be linked to the newly established project "CampusGRID Bonn Aachen", a GRID project in the field of life and medical sciences funded by the Ministry of Science and Research of the state of North Rhine-Westphalia.

1.2. APPLICATION AREA

The document addresses issues related to Drug Discovery in a grid environment.

1.3. REFERENCES

[1] FlexX: A fast flexible docking method using an incremental construction algorithm. Rarey M., Kramer B., Lengauer T., Klebe G. J. Mol. Biol. (1996) 261, 470-489.
[2] Autodock: Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. Morris G. M., Goodsell D. S., Halliday R. S., Huey R., Hart W. E., Belew R. K., Olson A. J. J. Computational Chemistry (1998) 19, 1639-1662.
[3] Dock: DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases. Ewing T. J., Makino S., Skillman A. G., Kuntz I. D. J. Comput. Aided Mol. Des. (2001) 15(5), 411-428.
[4] ZINC: see http://zinc.docking.org

1.4. DOCUMENT EVOLUTION PROCEDURE

This document will be updated incrementally as the collaboration evolves. Comments should be sent to the authors.

1.5. TERMINOLOGY

Glossary

DD   Drug Discovery
HTS  High-throughput Screening
HTD  High-throughput Docking
DHF  Dengue Hemorrhagic Fever

Definitions

2. EXECUTIVE SUMMARY

Grid technology holds out the promise of more effective means to manage information and enhance knowledge-based processes in just the sort of environment that is well established in pharmaceutical research and development. The present document describes a proposal to deploy in silico screening on a grid infrastructure, using a large library of compounds (approximately 2 million) and shared docking algorithms, with a view to drug discovery for two neglected diseases, Malaria and Dengue Fever. The project involves two European FP6 projects, SIMDAT and EGEE, the SwissBioGrid initiative, the Swiss Institute of Bioinformatics (Biozentrum Basel), the INSTRUIRE regional grid and the CampusGRID Bonn Aachen regional GRID.

3. RATIONALE

The challenges in Drug Discovery (DD) today lie in understanding diseases, elucidating pathways and exploiting information at the molecular and cellular level, which should shorten the life cycle of DD projects and increase the probability of success after lead optimization. Information Technology can transform the DD process through comprehensive and reliable data and information seamlessly integrated for easy navigation, simulation of biomolecular processes and data mining, as well as e-collaboration for dispersed teams. Grids are creating a fertile ground for the development of such new services.
Grid technology provides the collaborative IT environment to enable a quicker drug development process, from molecular biology research to goal-oriented field work. A pharmaceutical Grid should be a shared in silico resource to guarantee and preserve knowledge in the areas of discovery, development, manufacturing, marketing and sales of new drug therapies, and should cover three dimensions:
- a resource that provides extremely large CPU power to perform computing-intensive tasks in a transparent way by means of an automated job submission and distribution facility
- a resource that provides transparent and secure access to storage and archiving of large amounts of data in an automated and self-organized mode
- a resource that connects, analyses and structures data and information in a transparent mode according to pre-defined rules (science or business process based)

Drug discovery efforts are traditionally associated with pharmaceutical companies. The path from the initial discovery of a biological target to the development of a single lead compound into a drug takes up to 12 years in a highly sequential process. Pharmaceutical GRIDs hold the power to change the DD process in significant ways. They can speed up the DD process by enabling large-scale computational approaches previously considered infeasible, such as high-throughput docking (HTD). More importantly, due to their innate characteristics, public GRIDs foster collaboration between academic labs.

The motivating perspective is to enhance the ability of both the pharmaceutical industry and academic research institutions to share diverse, complex and distributed information on a given disease for collaborative exploration and mutual benefit. The goal is to lower the barrier to such substantive interactions in order to produce cheaper drugs and insecticides against diseases affecting third world development, and to increase the return on investment for new drugs in the developed countries.
Two neglected diseases have been identified as targets to prove the feasibility of the concept of a Pharmaceutical Grid: Dengue and Malaria. The deployment on EGEE and the INSTRUIRE grid will focus on addressing malaria. The Swiss Institute of Bioinformatics at the Biozentrum Basel will focus on HTD on Dengue fever targets, using computational resources provided by the SwissBioGrid initiative. CampusGRID Bonn Aachen will focus on technical aspects and benchmarking of GRID-enabled virtual screening tools.

3.1. ADDRESSING MALARIA ON A GRID

The number of malaria cases and deaths is increasing in many parts of the world. Malaria causes 300 to 500 million new infections, 1 to 3 million deaths and a 1 to 4% loss of gross domestic product (at least $12 billion) annually in Africa. For both vector control and chemotherapy, knowing the gene sequences of Anopheles and Plasmodium species should lead to the discovery of targets against which new insecticides or anti-malarial drugs can be produced. However, such discoveries are likely to be patented and only developed at prices unaffordable to governments or villagers in tropical countries.

Genomics research is crucial to attacking malaria, but this research must be strongly coupled to ground studies in order to guide disease controllers, especially those working in countries with annual health budgets of less than $10 per person. Evaluating the impact of new drugs and new vaccines requires careful monitoring of clinical tests, especially in areas of high malaria transmission, where it is important to distinguish recurrence of parasites due to recrudescence of incompletely cured infections from re-infection due to new mosquito bites. Disease controllers also need to monitor goal-oriented field work over the long term.

3.2. ADDRESSING DENGUE FEVER BY GRID-BASED VIRTUAL SCREENING

Dengue and dengue hemorrhagic fever (DHF) are caused by one of four closely related, but antigenically distinct, virus serotypes (DEN-1, DEN-2, DEN-3, and DEN-4) of the genus Flavivirus. Infection with one of these serotypes does not provide cross-protective immunity; thus, persons living in a dengue-endemic area can have four dengue infections during their lifetimes. Dengue is primarily a disease of the tropics, and the viruses that cause it are maintained in a cycle that involves humans and Aedes aegypti, a domestic, day-biting mosquito that prefers to feed on humans. Infection with dengue viruses produces a spectrum of clinical illness ranging from a non-specific viral syndrome to severe and fatal hemorrhagic disease. Important risk factors for DHF include the strain and serotype of the infecting virus, as well as the age, immune status, and genetic predisposition of the patient.

Research on inhibitors of Dengue propagation and proliferation could target viral functions in both hosts, humans and mosquitoes. The genomes of several Dengue strains have been sequenced, and possible targets with enzymatic function have already been identified. For three of these target proteins, experimental three-dimensional structures are available. For one additional target, a comparative model based on a homologous protein from human hepatitis virus is available. We will use the GRID infrastructure to perform HTD based on these three-dimensional structures of the target proteins. The most promising compounds identified by computational screening will be transferred to interested parties with the capability to further analyze their value in a Drug Discovery program aimed at treating Dengue. Such parties may include institutions like the Novartis Institute for Tropical Diseases in Singapore.

3.3. DESCRIPTION OF THE APPLICATION

The proposed application is the deployment of a docking platform.
This deployment would involve two European projects, SIMDAT and EGEE, the SwissBioGrid initiative, the Swiss Institute of Bioinformatics (Biozentrum Basel), and two regional initiatives: the INSTRUIRE regional grid in Auvergne and CampusGRID Bonn Aachen, a life and medical sciences GRID initiative of the state of North Rhine-Westphalia. The platform consists of a library of compounds and shared docking algorithms deployed on several grid nodes. Such a deployment would allow massive in silico screening comparable to that performed within the framework of the French Telethon or by the United Devices Grid (www.grid.org).

Hits coming out of in silico screening of target proteins will have to be validated, and promising compounds transferred to parties able to analyze their value in a Drug Discovery program (bioassays, toxicity, efficacy). In the future, hits will be made available in a common knowledge space. This knowledge space should attract prominent biologists, life scientists, and members of the pharmaceutical industry to help steer the deployment of grids that will enable the scientific community to achieve substantially higher effective performance and functionality than is possible today. The long-term aim of the application is to build an open-source Drug Discovery grid for neglected diseases.

4. IMPLEMENTATION

The application corresponds to the initial phase of a Drug Discovery fabric, i.e. high-throughput in silico screening/docking of small compounds in available libraries against molecular drug targets to identify hits as very early drug candidates. Implementation requires the following elements:
- Docking algorithms. Quite a few docking algorithms are available today, but their deployment on a grid raises licensing issues. Dock [3] and Autodock [2] are available for free for non-commercial projects. The FlexX [1] docking software package was developed at Fraunhofer SCAI and is now distributed on a commercial basis by BioSolvIT. SCAI will give access to FlexX within the framework of this collaboration.
- A library of compounds. Existing compound libraries are often privately owned (typically by companies that sell compound libraries). One of our major goals for the future is therefore a public library of drug-like compounds to support academic drug discovery approaches.
- A grid infrastructure allowing the deployment of large-scale computations on large sets of data. The grid infrastructure includes a set of services to prepare, submit and handle jobs on the infrastructure.

4.1. PROPOSED ROADMAP

In a first step, the docking process on the grid needs to be validated on a known public data set. The reliability and confidence level of the software stack will be evaluated on well-defined benchmark data sets, e.g. the GOLD benchmark. At the same time, we will deploy the large library of compounds for the two use cases and evaluate the performance of the infrastructure. Subsequently, activity will focus on the Malaria and Dengue target proteins, with the goal of demonstrating the impact of the grid on drug discovery for neglected diseases.

The challenge is to improve the docking process in an open-source environment on a large library of compounds. Initially, the docking algorithms will be software available under open-source or no-cost licences (Dock, Autodock), together with FlexX, which SCAI will provide within the framework of this collaboration. Two compound libraries will be evaluated for use in this study: the open-source collection ZINC [4], and the SCAI library of drug-like compounds provided by Marc Zimmermann (SCAI) within the framework of this collaboration.

We propose to join the Biomed Virtual Organization (VO). The LPC Clermont-Ferrand and Fraunhofer SCAI accept or will accept this VO.
To protect the IPR on FlexX used in this study, the algorithm will be sent through the input sandbox. At present, the compound database and the algorithms cannot be installed on our nodes.

4.2. RESOURCE EVALUATION

Grid deployment started during the summer on the GILDA testbed. The partners are experienced in the deployment of grid applications used in complex Drug Discovery processes. The following estimates are based on tests with Autodock:
- During the first step, one docking process will require 300 small jobs of 30 seconds. The CPU usage will be approximately 25 h per week. For each job, the input size is 8 MB and the output size is 1 MB, so the transferred data size will be around 27 GB weekly.
- During the second step, one docking process will consist of 2,000,000 small jobs of 30 seconds. The CPU usage will be approximately 8,300 h for one target protein. The transferred data size will be around 9 TB for one target protein.

4.3. REQUIREMENTS

The main requirements for our application are listed below. They are based on the EGEE list available at http://egee-na4.ct.infn.it/requirements/list.php. Several are essential. For instance, we absolutely need to protect the IPR on FlexX and the compounds. We also need to manage millions of jobs and files.

4.3.1. Job submission
4.3.1.1. Multiple data jobs
4.3.1.2. Compound jobs execution (pipelining)
4.3.1.3. Job access to data
4.3.1.4. Scalability

4.3.2. Data management
4.3.2.1. Fine grain control of access rights
4.3.2.2. Group of files
4.3.2.3. Access rights delegation
4.3.2.4. Data replication control
4.3.2.5. Data registration, retrieval, and deletion
4.3.2.6. Scalability

4.3.3. Information system
4.3.3.1. Jobs information and status notification
4.3.3.2. Top level information system index
4.3.3.3. Resource brokers index
4.3.3.4. Grid resource browsing
4.3.3.5. Licensed software management

4.3.4. Storage
4.3.4.1. On-disk encryption
4.3.4.2. Hook on data privacy manager
4.3.4.3. Communications encryption
4.3.4.4. Outbound connectivity

4.3.5. Operation
4.3.5.1. VO creation
4.3.5.2. User control
4.3.5.3. User login
4.3.5.4. Software package publication
4.3.5.5. Robustness
4.3.5.6. Multiple VOs registration

5. PARTICIPANTS

The two grid projects (the SIMDAT and EGEE European projects), in collaboration with the SwissBioGrid initiative, the Swiss Institute of Bioinformatics (Biozentrum Basel), the INSTRUIRE regional grid and CampusGRID Bonn Aachen, will join efforts to develop and host this initial prototype together:
- EGEE will provide human resources to deploy docking algorithms on its infrastructure and to create and run a Drug Discovery VO. EGEE will provide the services needed by the VO.
- SIMDAT will provide human resources (especially GRID expertise) in the field of GRID-based data access, bioinformatics tool application and ontology-based services.
- The SwissBioGrid initiative will provide a computational platform, as well as human resources with expertise in the field of GRID infrastructure.
- The Swiss Institute of Bioinformatics at the Biozentrum Basel will provide human resources with expertise in the field of GRID-based bioinformatics tools, and will be responsible for the scientific aspects of the Dengue docking project.
- The INSTRUIRE regional grid in Auvergne will provide its infrastructure during the year 2005.
- CampusGRID Bonn Aachen will perform benchmarking studies comparing the different virtual screening tools proposed for this project.

The laboratories involved in this application are:

Laboratoire de Physique Corpusculaire of Clermont-Ferrand, CNRS/IN2P3
CNRS-IN2P3 will give access to the Clermont-Ferrand resource.
- Vincent Breton (IN2P3), Head of the PCSV team at LPC, NA4 leader for EGEE
- Nicolas Jacq (IN2P3), PhD student in bioinformatics in the PCSV team, technical contact point
- Jean Salzemann (IN2P3)
- Emmanuel Medernach (IN2P3)

Fraunhofer SCAI
- Martin Hofmann, Head of the Department of Bioinformatics at SCAI
- Kai Kumpf, group leader GRID applications and technical contact point to SIMDAT (pharma activity)
- Marc Zimmermann, group leader chemoinformatics at Fraunhofer SCAI
- Horst Schwichtenberg, principal investigator and project leader EGEE at SCAI
- Astrid Maass, research scientist in the group of Horst Schwichtenberg

Biozentrum Basel / Swiss Institute of Bioinformatics
- Torsten Schwede, group leader at the Biozentrum Basel and Swiss Institute of Bioinformatics
- Michael Podvinec, project leader of the Dengue Project
- Konstantin Arnold, research programmer in the group of Torsten Schwede

SwissBioGrid Initiative
- Marie-Christine Sawley, Head of CSCS
- Patrick Wieghardt, SBG Project Manager
- Sergio Maffioletti, SBG Technical Project Manager
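
Note on the resource estimates of Section 4.2. The totals given there can be cross-checked by multiplying out the per-job figures (30 seconds of CPU, 8 MB of input, 1 MB of output). The following Python sketch does only that. The class and variable names are hypothetical, decimal units (1 GB = 1000 MB) are an assumption, and every job is assumed to take exactly the quoted 30 seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockingRun:
    """One docking process; defaults are the per-job figures from Section 4.2."""

    jobs: int
    seconds_per_job: float = 30.0  # CPU time per job
    input_mb: float = 8.0          # input size per job
    output_mb: float = 1.0         # output size per job

    def cpu_hours(self) -> float:
        """Total CPU time in hours, assuming every job takes seconds_per_job."""
        return self.jobs * self.seconds_per_job / 3600.0

    def transferred_gb(self) -> float:
        """Total input plus output volume in GB (assumed decimal: 1 GB = 1000 MB)."""
        return self.jobs * (self.input_mb + self.output_mb) / 1000.0


# Hypothetical labels for the two steps described in Section 4.2.
validation_run = DockingRun(jobs=300)
screening_run = DockingRun(jobs=2_000_000)

if __name__ == "__main__":
    for label, run in (("validation", validation_run), ("screening", screening_run)):
        print(f"{label}: {run.cpu_hours():,.1f} CPU h, {run.transferred_gb():,.1f} GB")
```

Under these assumptions one first-step run costs 2.5 CPU hours and 2.7 GB, so the weekly figures of 25 h and 27 GB correspond to ten such runs per week; that run count is an inference, not a number stated in the text. The same product for 2,000,000 jobs gives about 16,700 CPU hours and 18 TB, roughly twice the 8,300 h and 9 TB quoted for the second step. The text does not say which per-job figure differs in that step, so the defaults above should be adjusted once it is known.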