Maintenance Manual - Newcastle University Student Publishing

MAINTENANCE
MANUAL
for
DRUTEX
Using Workflow technology to identify new bacterial drug-targets
Group A, M.Sc. Bioinformatics, Newcastle University
March, 2010
Authorization Memorandum
I have carefully assessed the Maintenance Manual for the DRUTEX system. This document has been
completed using the Maintenance Manual template from Maura Lilienfeld, CHM Team, U.S. Department
of Housing and Urban Development, 2005 (http://www.hud.gov/offices/cio/sdm/devlife/tempcheck.cfm).
MANAGEMENT CERTIFICATION - Please check the appropriate statement.
______ The document is accepted.
______ The document is accepted pending the changes noted.
______ The document is not accepted.
We fully accept the changes as needed improvements and authorize initiation of work to proceed. Based
on our authority and judgment, the continued operation of this system is authorized.
_______________________________
NAME
Project Leader
_____________________
DATE
_______________________________
NAME
Operations Division Director
_____________________
DATE
_______________________________
NAME
Program Area/Sponsor Representative
_____________________
DATE
_______________________________
NAME
Program Area/Sponsor Director
_____________________
DATE
MAINTENANCE MANUAL
TABLE OF CONTENTS
1.0 GENERAL INFORMATION
1.1 System Overview
1.2 Points of Contact
1.2.1 Technical support
2.0 SYSTEM DESCRIPTION
2.1 System Architecture
2.2 Security
3.0 ENVIRONMENT
3.1 Taverna
3.2 BioCatalogue
3.3 myExperiment
3.4 Compatibility
3.5 Support Software Environment
3.6 Setting up the system/reinstallation
4.0 SYSTEM MAINTENANCE PROCEDURES
4.1 Responsibilities
4.2 Performance Verification Procedure/Quality control
4.3 Handling performance problems/errors
4.3.1 System raises an error message
4.3.2 System is producing unexpected output
4.3.3 Program producing unexpected output
5.0 INFORMATION ABOUT EACH WORKFLOW UNIT
5.1 Workflow 1: Read in two files in GenBank or EMBL format
5.1.1 Overview
5.1.2 Detailed description
5.2 Workflow 2: Compare relatedness of two strains
5.2.1 Overview
5.2.2 Detailed description
5.3 Workflow 3: Compare proteins to find those unique to pathogen
5.3.1 Overview
5.3.2 Detailed description
5.4 Workflow 4: Pathogen proteins that are potential drug targets
5.4.1 Overview
5.4.2 Detailed description
5.5 Workflow 5: Pin-point pathogen enzymes in Kegg diagrams
5.5.1 Overview
5.5.2 Detailed description
The Maintenance Manual presents information on the DRUTEX system. It is written for personnel who
are responsible for the maintenance of the system and who need to understand the operating
environment, security, and control requirements. It describes the programs in technical detail to assist the
maintenance programmer.
1.0
GENERAL INFORMATION
1.1
System Overview
The system aims to identify new targets for existing drugs. It first compares the genomes of a pathogen
and a non-pathogen to check that they are closely related. It then finds the proteins that are unique to
the pathogen (not present in the non-pathogen). These proteins are probably the cause of its
pathogenicity. They are compared for sequence similarity with proteins that are targets of existing
drugs. Those with a high similarity to targets of existing drugs are the potential new bacterial
drug-targets. This would be a significant discovery because we are running out of antibiotics as
pathogens become resistant to them.
Outline of its working:
1) The system reads in at least two files (EMBL or GenBank format): the source (non-pathogen)
and the target (pathogen).
2) The system tests whether the two strains are closely related enough to produce a meaningful
comparison.
3) The system outputs a list of proteins encoded by the target genome that do not have
sequence similarity to those encoded by the source genome.
4) The system compares the unique protein list to a list of protein sequences that are known
to be targets of existing drugs.
5) The system produces a list of those proteins in the target organism that may be targets for
known drugs, based on protein similarity.
6) The system pin-points the position of those proteins that are enzymes in Kegg pathway diagrams.
1.2
Points of Contact
1.2.1 Technical support
For further assistance or information about the maintenance of this product, please email your query to
J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456.
2.0
SYSTEM DESCRIPTION
2.1
System Architecture
The workflow runs on Taverna Server. e-Drugfinders can start the workflow on Taverna Server through a
web interface where the sequence files can be uploaded. For each part of the workflow, Taverna
Server accesses an appropriate web service and submits a job to it. The web service processes the
job and sends the results back to Taverna Server. When the workflow is complete, Taverna Server
displays the results through the web interface. It also connects to e-Drugfinders’ Ondex warehouse and
updates it with any new potential drug targets it discovers. The software consists of workflows
created using the Taverna platform. Each workflow, once executed, supplies an output that is
accepted as an input by the next workflow in the sequence.
Figure representing System Architecture
2.2
Security
Although there are currently no provisions for keeping this system secure, the company implementing the
software is free to apply them. Usage of this product for illegal purposes might lead to [...]. SPLUGE
will accept no liability for loss of or damage to the product.
3.0
ENVIRONMENT
3.1
Taverna
Taverna Workbench is an open source tool for designing and executing workflows. It allows users to
connect a number of bioinformatics services into one process. Workflows are written in the Scufl
workflow language; Taverna itself is written in Java. The myExperiment social web site is a repository
for Taverna workflows and has special support for Scufl workflows. http://www.taverna.org.uk/.
3.2
BioCatalogue
The BioCatalogue is a curated registry of biological Web Services. It was launched in June 2008 at the
Intelligent Systems for Molecular Biology Conference. The project is a collaboration between the myGrid
project at the University of Manchester led by Carole Goble and the European Bioinformatics Institute led
by Rodrigo Lopez. http://www.biocatalogue.org/.
3.3
myExperiment
myExperiment is a social networking website for scientists sharing Research Objects such as scientific
workflows and experiment plans. It was launched in November 2007. http://www.myexperiment.org/.
3.4
Compatibility
System Requirements for Taverna:
• This tool is freely available online and works with all web browsers and operating systems
(Windows XP, Windows Vista, Windows 7, Mac OS X 10.4 and higher).
• Your system should have 1 GB of memory for Taverna to work efficiently. It may work with less
memory, but performance will be slower than expected.
• Java 1.5 or higher must be installed.
• The GraphViz application must be installed (not required for Windows users).
For more details, refer to the link: http://www.taverna.org.uk/download/taverna-2-1/systemrequirements/
DRUTEX should be capable of working on all operating systems, although testing of this is still in
progress.
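The Java and GraphViz requirements listed above can be checked before installation. The following is a minimal sketch in Python and is not part of DRUTEX. The function names are hypothetical. It assumes that `java -version` prints a banner containing a string such as `version "1.6.0_18"` and that GraphViz provides an executable named `dot` on the PATH.

```python
import re
import shutil
import subprocess


def parse_java_version(banner):
    """Return (major, minor) parsed from the `java -version` banner, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2) or 0)


def java_is_sufficient(banner, minimum=(1, 5)):
    """True if the banner reports at least the minimum version (Java 1.5)."""
    version = parse_java_version(banner)
    return version is not None and version >= minimum


def check_environment():
    """Return a list of human-readable problems found on this machine."""
    problems = []
    java = shutil.which("java")
    if java is None:
        problems.append("java not found on PATH")
    else:
        # `java -version` normally writes its banner to stderr.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        if not java_is_sufficient(result.stderr + result.stdout):
            problems.append("Java 1.5 or higher is required")
    if shutil.which("dot") is None:
        problems.append("GraphViz 'dot' not found (not required on Windows)")
    return problems
```

Calling `check_environment()` returns an empty list when both requirements appear to be met. The memory recommendation is not checked here.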
3.5
Support Software Environment
• M-GCAT, a genome alignment tool, is used in Workflow 2 to align the two genomes.
• Java was used to write the project's custom code.
• Glassfish uses a derivative of Apache Tomcat as the servlet container for serving Web content,
with an added component called Grizzly which uses Java NIO for scalability and speed.
3.6
Setting up the system/reinstallation
All instructions for getting DRUTEX running on your system are outlined in Section III (Setup) of the
User Manual. For further details, please go to the website: www.taverna.org/ or, alternatively, follow
the step-by-step instructions on the installation CD.
4.0
SYSTEM MAINTENANCE PROCEDURES
This section provides information about the specific procedures necessary for the programmer to maintain
the collective software units that make up the system.
4.1
Responsibilities
The software should be maintained by a person qualified in computing. If that person is not able to
fix the problem, SPLUGE will cover the costs of any replacement within 1 year of the warranty
period.
4.2
Performance Verification Procedure/Quality control
We recommend that you perform system maintenance before using the system. This will
ensure that the system is functioning optimally before it is used to generate biological data. To do this,
follow the steps on the maintenance CD. The CD will guide you through a procedure
which uses standardized test input values to test whether the system is working as expected.
4.3
Handling performance problems/errors
4.3.1 System raises an error message
Unexpected input in the dialogue box asking for a threshold value. Check that a valid numeric value was
entered.
Incorrect file path or file format. If you have confirmed that the file paths and formats are correct and
the error persists, please reinstall the software by following the instructions in section 3.6, Setting up
the system/reinstallation.
Blast/Kegg servers are down. For the Taverna workflow to run, it must access two external servers, Blast
(http://blast.ncbi.nlm.nih.gov/) and Kegg (http://www.genome.jp/kegg/). If you suspect that these servers
are causing a problem, please visit the relevant website to confirm whether it is working.
For any other errors please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on
0191 123456.
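The check on the Blast and Kegg servers described above can also be scripted. The following is a minimal sketch, not part of DRUTEX: the function names and the 10-second timeout are assumptions, and a server that answers HTTP requests may still have a web service that is unavailable.

```python
import urllib.error
import urllib.request

# URLs taken from section 4.3.1 of this manual.
SERVERS = {
    "Blast": "http://blast.ncbi.nlm.nih.gov/",
    "Kegg": "http://www.genome.jp/kegg/",
}


def is_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False


def check_servers(servers=SERVERS, opener=urllib.request.urlopen):
    """Map each server name to True (reachable) or False (not reachable)."""
    return {name: is_reachable(url, opener=opener) for name, url in servers.items()}
```

Calling `check_servers()` with no arguments contacts both sites and returns a dictionary such as `{"Blast": True, "Kegg": False}`, depending on the state of the servers at that moment. The `opener` parameter exists only so that the check can be exercised without network access.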
4.3.2 System is producing unexpected output
In the general case, first run each workflow component on its own to test whether it works in
isolation. Then connect up all the workflows that are working. This will allow you to isolate
the problem to a particular workflow. If you are unable to progress further with a solution to the problem,
please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456,
quoting the particular workflow the problem has been isolated to.
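The isolation procedure above can be sketched as follows. This is an illustration only: the function name is hypothetical, each workflow is represented by a plain Python function, and it assumes that a known-good output is available for every stage (for example from the standardized test inputs of section 4.2).

```python
def find_failing_stage(stages, first_input, expected_outputs):
    """Return the name of the first stage whose output is wrong, or None.

    stages: list of (name, function) pairs; each function takes the output
            of the previous stage as its only argument.
    expected_outputs: known-good outputs, one per stage, in the same order.
    """
    value = first_input
    for (name, run), expected in zip(stages, expected_outputs):
        try:
            actual = run(value)
        except Exception:
            # A stage that raises an error is reported as the failing stage.
            return name
        if actual != expected:
            return name
        # Pass the known-good value on, so each stage is judged in isolation.
        value = expected
    return None
```

With stand-in stages such as `[("Workflow 1", f1), ("Workflow 2", f2)]`, the function returns the name to quote when contacting support, or `None` if every stage behaves as expected.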
4.3.3 Program producing unexpected output
Please navigate to section 4.2 Performance Verification Procedure. This will ask you to test the system
with standardized input to see whether the output is expected. If this is not the case, please contact
support.
5.0
INFORMATION ABOUT EACH WORKFLOW UNIT
This section provides a detailed description of each software unit. This allows the maintainer of the
workflow to understand how each workflow functions so that, if there is a problem with the workflow
proper, each component workflow can be isolated and tested individually.
5.1
Workflow 1: Read in two files in GenBank or EMBL format
5.1.1 Overview
This workflow reads in two files in either GenBank or EMBL format and parses them
into two output files containing only the genomic sequences.
This is the first step in the workflow proper. The inputs are 2 files in GenBank or EMBL format. One file
is a genome from a non-pathogenic bacterium; the other is from a (supposedly) related bacterium that is
pathogenic. The outputs are 2 files containing the sequences only.
5.1.2 Detailed description
The workflow takes three inputs. The base directory contains the list of proteins in the target organism.
The remaining two inputs are the actual EMBL files for the target and source organisms. The
ReadEMBLDatabase service uses BioJava to retrieve the protein sequences from each EMBL file. Each
protein from the target genome is written to a separate file so that it can be Blasted against the source
database in Workflow 2. The list of protein sequences from the source organism in the original file will
later represent this database. The output of the workflow is an integer: the number of proteins in the
target organism, and therefore the total number of files to be Blasted against the source database.
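The splitting step described above can be illustrated with a short sketch. It is written in Python and is not the ReadEMBLDatabase service, which uses BioJava. The function names and the output file naming (`protein_<n>.fasta`) are hypothetical, and the sketch assumes that protein sequences appear in `/translation="..."` qualifiers of the feature table, as they do in standard EMBL and GenBank flat files.

```python
import re
from pathlib import Path


def extract_translations(text):
    """Return the protein sequences found in /translation qualifiers."""
    # EMBL feature lines start with "FT"; GenBank feature lines are only indented.
    lines = [line[2:] if line.startswith("FT") else line for line in text.splitlines()]
    joined = "\n".join(lines)
    sequences = re.findall(r'/translation="([^"]*)"', joined)
    # Qualifier values wrap over several lines, so remove all whitespace.
    return [re.sub(r"\s+", "", sequence) for sequence in sequences]


def split_proteins(flat_file, base_dir):
    """Write each protein to its own FASTA file and return the protein count."""
    proteins = extract_translations(Path(flat_file).read_text())
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, sequence in enumerate(proteins, start=1):
        target = out_dir / f"protein_{index}.fasta"
        target.write_text(f">protein_{index}\n{sequence}\n")
    return len(proteins)
```

The integer returned by `split_proteins` corresponds to the single output of Workflow 1: the number of files that will later be Blasted.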
5.2
Workflow 2: Compare relatedness of two strains
5.2.1 Overview
This workflow compares the genomes of the non-pathogen and the pathogen to see whether the two strains
are related. The threshold level of similarity can be set by the user in the dialogue box that appears
when the workflow is run.
This is the second step in the workflow proper. The inputs are two files containing sequences in GenBank
or EMBL format. The output is either ‘1’ or ‘0’. A ‘1’ indicates that the two strains are sufficiently
similar, and the workflow can progress to the next stage, which compares the proteins of the two
genomes. A ‘0’ indicates that the two strains are too distantly related for a meaningful comparison, so
the system exits.
5.2.2 Detailed description
The inputs are the file paths to FastA files containing the full genome sequences from each organism.
INI_Path is the full path to the configuration file used by M-GCAT. A number of regular expressions
parse the file. MGCAT_Path is the path to the M-GCAT executable. INI_File is the name of the
configuration file (not the path). Working_Dir is the directory in which the output of M-GCAT will be
stored. The workflow runs M-GCAT, which aligns the two genomes and produces a log file. The service
Read_Text_File_2 reads the log file and extracts the genome similarity. A dialogue box asks the user to
choose the minimum genome similarity required. Finally, the Thresholder tests whether the similarity is
greater than or equal to the threshold set by the user.
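The final test performed by the Thresholder can be sketched as below. This is an illustration in Python rather than the Taverna service itself. The function names are hypothetical, and the sketch assumes that similarity is expressed as a percentage between 0 and 100, which the manual does not state. The input check reflects the "unexpected input in the dialogue box" error listed in section 4.3.1.

```python
def parse_threshold(raw):
    """Convert the text typed into the dialogue box into a number.

    Raises ValueError for non-numeric text or values outside 0-100
    (the 0-100 range is an assumption, see the text above).
    """
    value = float(raw.strip())  # float() raises ValueError for non-numeric text
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"threshold out of range: {raw!r}")
    return value


def thresholder(similarity, threshold):
    """Return 1 if similarity is greater than or equal to threshold, else 0."""
    return 1 if similarity >= threshold else 0
```

For example, `thresholder(85.0, parse_threshold("80"))` gives 1, so the workflow would continue, whereas a similarity of 79.9 with the same threshold gives 0 and the system would exit.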
5.3
Workflow 3: Compare proteins to find those unique to pathogen
5.3.1 Overview
This workflow takes three inputs: the e-value threshold, the number of files and basedir (the working
directory). It extracts the proteins that are unique, i.e. those that exist in the pathogen (accounting
for pathogenicity) but not in the source (non-pathogen).
The output is a file containing a list naming the proteins that are unique to the pathogen.
5.3.2 Detailed description
This workflow is a work in progress. It runs on Unix systems but has problems running on
Windows, which our experts are working on.
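Since the detailed description is not yet available, the selection rule implied by the overview can only be sketched under assumptions. The Python below is hypothetical and is not the workflow's implementation. It assumes that, for each target protein, the lowest e-value of any Blast hit against the source database is already known (or `None` when there is no hit), and that a protein counts as unique when it has no hit at or below the e-value threshold.

```python
def unique_to_pathogen(best_evalues, evalue_threshold):
    """Return the names of target proteins with no significant source hit.

    best_evalues: maps each target protein name to the lowest e-value of any
                  hit against the source database, or None if there was no hit.
    evalue_threshold: hits with an e-value at or below this count as similar.
    """
    unique = []
    for name, evalue in best_evalues.items():
        if evalue is None or evalue > evalue_threshold:
            unique.append(name)
    return sorted(unique)
```

For instance, with a threshold of 1e-5, a protein whose best hit has an e-value of 0.5 is kept as unique, while one with a best hit of 1e-50 is discarded.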
5.4
Workflow 4: Pathogen proteins that are potential drug targets
5.4.1 Overview
This workflow compares the unique protein list to a list of protein sequences that are known to be
targets of existing drugs. This results in a list of pathogen proteins which are potential targets for
known drugs, based on protein similarity above a certain e-value threshold.
This is the fourth step in the workflow proper. The input is a list of proteins that are unique to the
pathogen. The output is a list naming the subset of proteins unique to the
pathogen that also show global sequence similarity to the targets of known drugs.
5.4.2 Detailed description
The inputs to the workflow are:
• path_proteins: a single path to a file containing a list of file paths of single protein
sequences that are unique to the pathogen in FastA format,
• drugs_file: a path to a file containing the amino acid sequences in FastA format of all known
drug-affected proteins, which was supplied by e-Drugfinders,
• DBSavePath: a path to the working directory storing the files required for the workflow to run.
FormatDB_Path and Blastall_path contain only strings specifying paths to the location of the local
Blast database and Blast output. The workflow creates a local Blast database, then runs two nested
workflows. The first nested workflow runs a Blast search of each protein unique to the pathogen against
the database. The output from this goes into the second nested workflow. A dialogue box prompts the
user for a threshold value, which is then used to filter the Blast results.
The only output is a flattened list of the GI numbers of proteins which are similar to the drug-affected
proteins according to the threshold entered by the user, which is the minimum e-value they expect.
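The filtering step in the second nested workflow can be illustrated as follows. This Python sketch is not the workflow's own code and rests on several assumptions: that the Blast results are in the 12-column tab-separated tabular format (query ID in column 1, e-value in column 11), that query IDs have the form `gi|<number>|...`, and that a hit is kept when its e-value is at or below the threshold, as is conventional for Blast. The function name is hypothetical.

```python
import re

# Matches the GI number in an identifier such as "gi|15645|ref|NP_000001.1|".
GI_PATTERN = re.compile(r"gi\|(\d+)")


def filter_hits(tabular_lines, max_evalue):
    """Return a flat list of GI numbers of query proteins with a kept hit.

    tabular_lines: lines of Blast tabular output (assumed format, see above).
    max_evalue: hits with an e-value above this value are discarded.
    """
    gi_numbers = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        match = GI_PATTERN.search(query_id)
        if match and evalue <= max_evalue and match.group(1) not in gi_numbers:
            gi_numbers.append(match.group(1))
    return gi_numbers
```

Each GI number appears once in the result even if the protein matches several drug-affected proteins, which keeps the list flat.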
5.5
Workflow 5: Pin-point pathogen enzymes in Kegg diagrams
5.5.1 Overview
This workflow will take the list from the previous workflow that names pathogen proteins that are
potential drug targets and pin-point their position in Kegg pathway diagrams. If the proteins are not
enzymes then no Kegg pathway will apply.
This is the final step in the workflow proper. The inputs are 1) a three-letter code representing the species’
name, 2) the species id, and 3) background and foreground colours (green and red, respectively). If the
protein is an enzyme, the outputs are a list of Kegg Pathway IDs indicating the pathways in which the
enzyme participates and a coloured image pin-pointing the position of the enzyme in these pathways. If
the protein is not an enzyme, then the workflow will return an empty list.
5.5.2 Detailed description
The nested workflow concatenates the species name and the id into a single query string for use by
Kegg. For example:
hsa (species name) + string1_value (colon) = hsa:
hsa: + 1487 (gene id) = hsa:1487 (query string)
The workflow uses the query string to check in Kegg whether there is a corresponding enzyme. If one is
present, the position of the enzyme is pin-pointed in the Kegg image with a green background and a red
foreground. The workflow also uses the query string to find the IDs of the Kegg pathways in
which the enzyme is involved.
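The concatenation step at the start of this description can be written as a small helper. The Python below is a sketch only; the function name is hypothetical, and the checks on the species code (three letters, as stated in the overview) and on the id (digits only, as in the example) are assumptions rather than documented behaviour of the workflow.

```python
def build_kegg_query(species_code, gene_id):
    """Join a species code and an id with a colon, e.g. hsa + 1487.

    Raises ValueError if the species code is not three letters or the id
    is not numeric (both checks are assumptions, see the text above).
    """
    code = str(species_code).strip().lower()
    identifier = str(gene_id).strip()
    if len(code) != 3 or not code.isalpha():
        raise ValueError(f"expected a three-letter species code, got {species_code!r}")
    if not identifier.isdigit():
        raise ValueError(f"expected a numeric id, got {gene_id!r}")
    return f"{code}:{identifier}"
```

With the values from the example, `build_kegg_query("hsa", 1487)` produces the query string `hsa:1487`.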
5.6
Combined Workflows 4-6