MAINTENANCE MANUAL for DRUTEX

Using Workflow technology to identify new bacterial drug-targets

Group A, M.Sc. Bioinformatics, Newcastle University
March, 2010

Maintenance Manual Authorization Memorandum

I have carefully assessed the Maintenance Manual for the DRUTEX system. This document has been completed using the Maintenance Manual template from Maura Lilienfeld, CHM Team, U.S. Department of Housing and Urban Development, 2005 (http://www.hud.gov/offices/cio/sdm/devlife/tempcheck.cfm).

MANAGEMENT CERTIFICATION - Please check the appropriate statement.

______ The document is accepted.
______ The document is accepted pending the changes noted.
______ The document is not accepted.

We fully accept the changes as needed improvements and authorize initiation of work to proceed. Based on our authority and judgment, the continued operation of this system is authorized.

_______________________________ NAME Project Leader _____________________ DATE
_______________________________ NAME Operations Division Director _____________________ DATE
_______________________________ NAME Program Area/Sponsor Representative _____________________ DATE
_______________________________ NAME Program Area/Sponsor Director _____________________ DATE

MAINTENANCE MANUAL TABLE OF CONTENTS

1.0 GENERAL INFORMATION
    1.1 System Overview
    1.2 Points of Contact
        1.2.1 Technical support
2.0 SYSTEM DESCRIPTION
    2.1 System Architecture
    2.2 Security
3.0 ENVIRONMENT
    3.1 Taverna
    3.2 Biocatalogue
    3.3 myExperiment
    3.4 Compatibility
    3.5 Support Software Environment
    3.6 Setting up the system/reinstallation
4.0 SYSTEM MAINTENANCE PROCEDURES
    4.1 Responsibilities
    4.2 Performance Verification Procedure/Quality control
    4.3 Handling performance problems/errors
        4.3.1 System raises an error message
        4.3.2 System is producing unexpected output
        4.3.3 Program producing unexpected output
5.0 INFORMATION ABOUT EACH WORKFLOW UNIT
    5.1 Workflow 1: Read in two files in GenBank or EMBL format
        5.1.1 Overview
        5.1.2 Detailed description
    5.2 Workflow 2: Compare relatedness of two strains
        5.2.1 Overview
        5.2.2 Detailed description
    5.3 Workflow 3: Compare proteins to find those unique to pathogen
        5.3.1 Overview
        5.3.2 Detailed description
    5.4 Workflow 4: Pathogen proteins that are potential drug targets
        5.4.1 Overview
        5.4.2 Detailed description
    5.5 Workflow 5: Pin-point pathogen enzymes in Kegg diagrams
        5.5.1 Overview
        5.5.2 Detailed description

The Maintenance Manual presents information on the DRUTEX system. It is written for personnel who are responsible for the maintenance of the system and who need to understand the operating environment, security, and control requirements. It describes the programs in technical detail to assist the maintenance programmer.

1.0 GENERAL INFORMATION

1.1 System Overview

The system aims to identify new targets for existing drugs. It first compares the genomes of a pathogen and a non-pathogen to check that they are closely related. It then finds the proteins in the pathogen that are unique (not present in the non-pathogen). These proteins are probably the cause of its pathogenicity. The unique proteins are compared for sequence similarity to proteins that are targets of existing drugs. Those with a high similarity to targets of existing drugs are the potential new bacterial drug-targets. Such a discovery would be significant because pathogens are becoming resistant to the antibiotics currently available.

Outline of its working:

1) The system reads in at least two files (EMBL or GenBank format): the source (non-pathogen) and the target (pathogen).
2) The system tests whether the two strains are related closely enough to produce a meaningful comparison.
3) The system outputs a list of proteins encoded by the target genome that do not have sequence similarity to those encoded by the source genome.
4) The system compares the unique protein list to a list of protein sequences that are known to be the targets of existing drugs.
5) The system produces a list of those proteins in the target organism that may be targets for known drugs, based on protein similarity.
6) The system pin-points the position of those proteins that are enzymes in Kegg pathway diagrams.

1.2 Points of Contact

1.2.1 Technical support

For further assistance or information about the maintenance of this product, please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456.

2.0 SYSTEM DESCRIPTION

2.1 System Architecture

The workflow runs on Taverna Server. e-Drugfinders can start the workflow on Taverna Server through a web interface where the sequence files can be uploaded. For each part of the workflow, Taverna Server accesses an appropriate internet web service and submits a job to it. The web service processes the job and sends the results back to Taverna Server. When the workflow is complete, Taverna Server displays the results through the web interface. It also connects to e-Drugfinders’ Ondex warehouse and updates it with the new potential drug targets it discovers.

The software consists of workflows created using the Taverna platform. Each workflow, once executed, supplies an output that is accepted as an input by the next workflow in the sequence.

Figure representing System Architecture

2.2 Security

Although there are currently no provisions for keeping this system secure, the company implementing the software is free to apply them.
Usage of this product for illegal purposes might lead to [...]. SPLUGE will accept no liability for loss or damage of the product.

3.0 ENVIRONMENT

3.1 Taverna

Taverna Workbench is an open source tool for designing and executing workflows. It allows users to connect a number of bioinformatics services into one process. Workflows are written in the Scufl workflow language. The myExperiment social web site is a repository for Taverna workflows and has special support for Scufl workflows. http://www.taverna.org.uk/.

3.2 Biocatalogue

The BioCatalogue is a curated registry of biological Web Services. It was launched in June 2008 at the Intelligent Systems for Molecular Biology Conference. The project is a collaboration between the myGrid project at the University of Manchester led by Carole Goble and the European Bioinformatics Institute led by Rodrigo Lopez. http://www.biocatalogue.org/.

3.3 myExperiment

myExperiment is a social networking website for scientists sharing Research Objects such as scientific workflows and experiment plans. It was launched in November 2007. http://www.myexperiment.org/.

3.4 Compatibility

System Requirements for Taverna:

Taverna is freely available online and works with all web browsers and the following operating systems: Windows XP, Windows Vista, Windows 7, and Mac OS X 10.4 and higher.
1 GB of memory is recommended for Taverna to work efficiently. It may run with less memory, but performance will be slower.
Java 1.5 or higher must be installed.
The GraphViz application must be installed (not required for Windows users).

For more details, refer to the link: http://www.taverna.org.uk/download/taverna-2-1/systemrequirements/

DRUTEX should be capable of working on all of these operating systems, although testing of this is still in progress.

3.5 Support Software Environment

M-GCAT, a genome alignment tool, is used in Workflow 2 to align the two genomes. Java has been used to write custom code for the workflows. Glassfish uses a derivative of Apache Tomcat as the servlet container for serving Web content, with an added component called Grizzly which uses Java NIO for scalability and speed.

3.6 Setting up the system/reinstallation

All instructions for getting DRUTEX running on your system are outlined in section III, Setup, in the User Manual. For further details, please go to the website: www.taverna.org/ or, alternatively, follow the step-by-step instructions on the installation CD.

4.0 SYSTEM MAINTENANCE PROCEDURES

This section provides information about the specific procedures necessary for the programmer to maintain the collective software units that make up the system.

4.1 Responsibilities

The software should be maintained by a person qualified in computing. If that person is unable to fix a problem, SPLUGE will cover the costs of any replacement within 1 year of the warranty period.

4.2 Performance Verification Procedure/Quality control

We recommend that you perform system maintenance as a prerequisite to using the system. This ensures that the system is functioning optimally before it is used to generate biological data. To do this, follow the steps outlined on the maintenance CD. The CD will guide you through a procedure that uses standardized test input values to test whether the system is working as expected.

4.3 Handling performance problems/errors

4.3.1 System raises an error message

Possible causes are:

Unexpected input in the dialogue box asking for a threshold value.

Incorrect file path or file format. If you have confirmed that the file paths and formats are correct and the error persists, please reinstall the software by following the instructions in section 3.6 Setting up the system/reinstallation.

Blast/Kegg servers are down. For the Taverna workflow to run, it must access two external servers, Blast (http://blast.ncbi.nlm.nih.gov/) and Kegg (http://www.genome.jp/kegg/).
Should you think these servers might be causing a problem, please navigate to the relevant website to confirm whether they are working.

For any other errors please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456.

4.3.2 System is producing unexpected output

In the general case, please run the individual workflow components separately to test whether each one works in isolation. The next step is to connect up all the workflows that are working. This will allow you to isolate the problem to a particular workflow. If you are unable to progress further with a solution to the problem, please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456, quoting the particular workflow the problem has been isolated to.

4.3.3 Program producing unexpected output

Please refer to section 4.2 Performance Verification Procedure/Quality control. This will ask you to test the system with standardized input to see whether the output is as expected. If it is not, please contact support.

5.0 INFORMATION ABOUT EACH WORKFLOW UNIT

This section provides a detailed description of each software unit. This allows the maintainer to understand how each workflow functions so that, if there is a problem with the workflow proper, each component workflow can be isolated and tested individually.

5.1 Workflow 1: Read in two files in GenBank or EMBL format

5.1.1 Overview

This workflow reads in two files in either GenBank or EMBL format and parses them into two output files containing only the genomic sequences. This is the first step in the workflow proper. The inputs are 2 files in GenBank or EMBL format. One file is a genome from a non-pathogenic bacterium; the other is from a (supposedly) related bacterium that is pathogenic. The outputs are 2 files containing the sequences only.

5.1.2 Detailed description

The base directory contains the list of proteins in the target organism.
The remaining two inputs are the actual EMBL files for the target and source organisms. The ReadEMBLDatabase service uses BioJava to retrieve the protein sequences from each EMBL file. Each protein from the target genome is written to a separate file so that it can be Blasted against the source database in Workflow 2. The list of protein sequences from the source organism in the original file will later represent this database. The output from the workflow is an integer: the number of proteins in the target organism, and therefore the total number of files to be Blasted against the source database.

5.2 Workflow 2: Compare relatedness of two strains

5.2.1 Overview

This workflow compares the genomes of the non-pathogen and the pathogen to see whether the two strains are related. The threshold level of similarity can be set by the user in the dialogue box that appears when the workflow is run. This is the second step in the workflow proper. The inputs are two files containing sequences in GenBank or EMBL format. The output is ‘1’ or ‘0’. A ‘1’ indicates that the two strains are sufficiently similar, and the workflow can progress to the next stage, which compares the proteins of the two genomes. A ‘0’ indicates that the two strains are too distantly related for a meaningful comparison, so the system exits.

5.2.2 Detailed description

The inputs are the file paths to FastA files containing the full genome sequences from each organism. The remaining parameters are:

INI_Path: the full path to the configuration file used by M-GCAT. A number of regular expressions parse the file.
MGCAT_Path: the path to the M-GCAT executable.
INI_File: the name of the configuration file (not the path).
Working_Dir: the directory in which the output of M-GCAT will be stored.

M-GCAT is then run; it aligns the two genomes and produces a log file. The service Read_Text_File_2 reads the log file and extracts the genome similarity.
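The extracted similarity feeds the pass/fail decision that gives Workflow 2 its ‘1’ or ‘0’ output. The following is a minimal Python sketch of that decision, not the DRUTEX implementation: the function name `thresholder`, its parameter names, and the numeric scale are illustrative assumptions, and it assumes that a similarity at or above the user's threshold counts as a pass.

```python
def thresholder(similarity: float, threshold: float) -> int:
    """Return 1 if the genomes are similar enough to continue, else 0.

    Assumption: `similarity` and `threshold` use the same numeric scale
    (for example, both percentages). The names are hypothetical.
    """
    return 1 if similarity >= threshold else 0


# Example: a measured similarity of 92.5 against a user threshold of 90.0
print(thresholder(92.5, 90.0))  # prints 1, so the workflow would continue
print(thresholder(75.0, 90.0))  # prints 0, so the system would exit
```

When debugging Workflow 2 in isolation, a check like this can be run by hand on the similarity value read from the M-GCAT log to confirm that the workflow's ‘1’/‘0’ output matches what the threshold implies.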
A dialogue box asks the user to choose the minimum genome similarity required. Finally, the Thresholder tests whether the similarity is greater than or equal to the threshold set by the user.

5.3 Workflow 3: Compare proteins to find those unique to pathogen

5.3.1 Overview

This workflow takes three inputs (the e-value threshold, the number of files, and basedir, the working directory) and extracts the proteins that are unique, i.e. that exist in the pathogen (accounting for pathogenicity) but not in the source (non-pathogen). The output is a file containing a list naming the proteins that are unique to the pathogen.

5.3.2 Detailed description

This is a work-in-progress. This workflow runs on Unix systems but currently has problems running on Windows; these are being worked on.

5.4 Workflow 4: Pathogen proteins that are potential drug targets

5.4.1 Overview

This workflow compares the unique protein list to a list of protein sequences that are known to be the targets of existing drugs. This results in a list of pathogen proteins that are potential targets for known drugs, based on protein similarity that passes a certain e-value threshold. This is the fourth step in the workflow proper. The input is a list of proteins that are unique to the pathogen. The output is a list naming the subset of proteins unique to the pathogen that also show global sequence similarity to the targets of known drugs.

5.4.2 Detailed description

The inputs to the workflow are:

path_proteins: a single path to a file containing a list of file paths of single protein sequences, in FastA format, that are unique to the pathogen.
drugs_file: a path to a file, supplied by e-Drugfinders, containing the amino acid sequences in FastA format of all known drug-affected proteins.
DBSavePath: a path to the working directory storing the files required for the workflow to run.
FormatDB_Path and Blastall_path: strings specifying paths to the location of the local BLAST database and Blast output.

We create a local Blast database, then run the nested workflows. The first nested workflow runs a Blast search of each protein unique to the pathogen against the database. The output from this goes into the second nested workflow. A dialogue box prompts the user for a threshold value, which is then used to filter the Blast results. The only output is a flattened list of GI numbers of proteins that are similar to the drug-affected proteins according to the threshold input by the user, which is the minimum e-value they expect.

5.5 Workflow 5: Pin-point pathogen enzymes in Kegg diagrams

5.5.1 Overview

This workflow takes the list from the previous workflow, which names pathogen proteins that are potential drug targets, and pin-points their position in Kegg pathway diagrams. If the proteins are not enzymes, then no Kegg pathway will apply. This is the final step in the workflow proper. The inputs are 1) a three-letter code representing the species’ name, 2) the species id, and 3) background and foreground colours (green and red, respectively). If the protein is an enzyme, the outputs are a list of Kegg Pathway IDs indicating the pathways in which the enzyme participates and a coloured image pin-pointing the position of the enzyme in these pathways. If the protein is not an enzyme, then the workflow will return an empty list.

5.5.2 Detailed description

In the nested workflow we concatenate the species name and id into a single query string for use by Kegg. For example:

hsa (species name) + string1_value (colon) = hsa:
hsa: + 1487 (gene id) = hsa:1487 (query string)

The workflow uses the query string to check in Kegg whether there is a corresponding enzyme. If one is present, the position of the enzyme is pin-pointed in the Kegg image with a green background and a red foreground.
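The concatenation step above can be reproduced outside Taverna when checking that the query string passed to Kegg is well formed. The following is a minimal Python sketch rather than the workflow's own code: the function name `build_kegg_query` and its parameter names are hypothetical, and the check for empty parts is an added assumption.

```python
def build_kegg_query(species_code, gene_id):
    """Join a species code and a gene id with a colon, as in hsa:1487.

    Hypothetical helper for illustration; it mirrors the two
    concatenation steps described for the nested workflow.
    """
    species = str(species_code).strip()
    gene = str(gene_id).strip()
    if not species or not gene:
        raise ValueError("both a species code and a gene id are required")
    prefix = species + ":"   # first step: hsa + colon = hsa:
    return prefix + gene     # second step: hsa: + 1487 = hsa:1487


print(build_kegg_query("hsa", 1487))  # prints hsa:1487
```

Comparing the string built here with the one the workflow sends is a quick way to rule out a malformed query before suspecting the Kegg service itself.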
The workflow also takes the query string to find the names of the Kegg Pathway IDs in which the enzyme is involved.

5.6 Combined Workflows 4-6