Design Document

(Group A)
Team Members:
Duncan White
Gregg Iceton
Jamie Hunter
Jannetta Steyn
Meenu Pipat
Richard Crossland
Yu Jiang
CONTENTS
S.No.  Topic
1      Essential Minimum Requirements of the Software System
2      Tools required for the System
3      System Architectural Design
4      Outline of Workflows
5      Functional Requirements
1. Minimum Requirements of the Software-System:
A system to read two files: one as a source and another as a target
A system to test whether two strains are closely related enough
A system to produce a list of unique proteins in the target
A system to compare these unique proteins to already existing drug targets
A system to list unique proteins in an organism that can possibly be used as drug targets
A system to pinpoint the position of possible drug targets in KEGG pathway diagrams
2. Tools required for the Service-based Software System:
Taverna: Taverna is an open-source tool for designing and executing workflows. It is integrated
with myExperiment, BioCatalogue, Moby, BioMart, Soaplab and R. We have used its latest
version, Taverna 2.1, which is freely available online, to incorporate and use services from
BioCatalogue.
BioCatalogue: BioCatalogue is a BBSRC-funded project and a joint venture of EMBL-EBI and
the myGrid project at the University of Manchester. It is freely accessible and provides many
biological services. It also allows discovery, monitoring, submission and annotation of web
services. It is available at: http://www.biocatalogue.org/
myExperiment: This is a social-networking website where one can upload workflows and
share them with others. The site can also be used to download workflows from others that
already provide the required functionality, rather than developing such workflows from scratch.
It is available at: http://www.myexperiment.org/
For testing purposes, input sequences can be obtained from sources such as NCBI, GenBank and EMBL.
3. System Architectural Design
Figure: how the service-based software system is accessed
The workflow will run on Taverna Server. e-Drugfinders can start the workflow on Taverna
Server by using a web interface where the sequence files can be uploaded. For each part of the
workflow, Taverna Server accesses an appropriate internet web service and submits a job to it.
The web service processes the job and sends the results back to Taverna Server. When the
workflow is complete, Taverna Server will display the results through the web interface. It will
also connect to e-Drugfinders’ Ondex warehouse and update it with new potential drug targets
it discovers. The software consists of workflows created using the Taverna
platform. Each workflow, once executed, supplies an output that is accepted as an input by the
next workflow in the sequence.
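The chaining described above, where each workflow's output becomes the next workflow's input, can be illustrated with a minimal sketch. This is not Taverna code: the function name run_pipeline and the idea of modelling each workflow as a plain Python callable are assumptions made purely for illustration.

```python
from functools import reduce
from typing import Any, Callable, Sequence


def run_pipeline(stages: Sequence[Callable[[Any], Any]], initial_input: Any) -> Any:
    """Feed the input through each stage in order.

    Each stage stands in for one workflow (hypothetical modelling):
    the value returned by one stage is passed as the argument to the next.
    """
    return reduce(lambda data, stage: stage(data), stages, initial_input)
```

For example, run_pipeline([parse, align, check], files) would call parse(files), pass its result to align, and pass that result to check, where parse, align and check are placeholder names for the individual workflows.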
4. Outline of Workflows
R1 Reading and Parsing Genomes
[Figure: annotated source genome sequence and annotated target genome sequence -> Parsing -> DNA sequence of source and DNA sequence of target]
The first component of the workflow reads in two annotated whole genome sequences in either
EMBL or GenBank format. The parsing program is able to read either format and parse
the sequence data from the file.
This program will need to be written, as no appropriate web service is available in the Taverna
platform to parse the files. The program will read each line of the file passed to it, scanning for a
regular expression that marks the start of the section of the file containing the sequence data.
This sequence data will be extracted from the file and stored. Once the sequence data has been
extracted, each sequence is passed to the next component.
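The line-by-line scan described above could look roughly like the following sketch. It assumes the standard flat-file layouts, in which the sequence section starts at a line beginning with "SQ" (EMBL) or "ORIGIN" (GenBank) and ends at a line beginning with "//". The function name extract_sequence is hypothetical, and the sketch handles only a single record per file.

```python
import re
from typing import Iterable

# Marks the start of the sequence section: "SQ" (EMBL) or "ORIGIN" (GenBank).
SEQUENCE_START = re.compile(r"^(SQ\s|ORIGIN)")
# Both formats terminate a record with a line starting with "//".
RECORD_END = re.compile(r"^//")


def extract_sequence(lines: Iterable[str]) -> str:
    """Return the sequence of the first record as one upper-case string."""
    in_sequence = False
    chunks = []
    for line in lines:
        if not in_sequence:
            # Skip annotation lines until the sequence section begins.
            if SEQUENCE_START.match(line):
                in_sequence = True
            continue
        if RECORD_END.match(line):
            break
        # Drop position numbers and spacing, keeping only the letters.
        chunks.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(chunks).upper()
```

A file can be passed in directly, for example extract_sequence(open(path)), where path is the uploaded source or target file. If no sequence section is found, the function returns an empty string.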
R2 Similarity Comparison of Genome Sequences
[Figure: DNA sequence of source and DNA sequence of target -> Genome alignment (e.g. BLAST) -> Filter similarity value -> Similarity check against the threshold similarity value]
The second component of the workflow accepts the two outputs containing the genome
sequences for the source and target genomes parsed by component 1. The two genomes are
aligned using a method such as BLAST.
The output of the alignment is split into a list, and the similarity value is parsed from the output
of the alignment.
The similarity value of the global sequence alignment is passed to a service that checks it
against a threshold value. This similarity-checking service requests a minimum percentage
similarity threshold from the user, defining the lower limit of how similar the sequences
should be. If the alignment of the two sequences meets or exceeds the specified threshold value,
the workflow can continue to the next stage. If the threshold value is not met, a message is
displayed to notify the user.
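The parse-and-check step above can be sketched as follows. The design does not fix the alignment output format, so this sketch assumes BLAST tabular output, in which the third tab-separated column is the percentage identity, and it assumes the first hit line carries the similarity value of interest. The function names are hypothetical.

```python
def parse_percent_identity(alignment_output: str) -> float:
    """Split the output into a list of lines and read the similarity value.

    Assumption: tab-separated BLAST tabular output, percent identity
    in the third column, first non-comment line is the hit of interest.
    """
    lines = [
        line
        for line in alignment_output.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    if not lines:
        raise ValueError("alignment output contains no hits")
    return float(lines[0].split("\t")[2])


def passes_threshold(similarity: float, threshold: float) -> bool:
    """True when the similarity meets or exceeds the user's threshold."""
    if not 0.0 <= threshold <= 100.0:
        raise ValueError("threshold must be a percentage between 0 and 100")
    return similarity >= threshold
```

When passes_threshold returns False, the calling workflow would stop and display the notification message instead of continuing to the next stage.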
R3 Identification of Unique Proteins
[Figure: source genome proteins and target genome proteins -> BLAST search -> proteins unique to target genome]
This component analyses the sufficiently similar genomes to discover proteins that are present
in the target (pathogenic) genome, but that are not present in the source (non-pathogenic)
genome.
This component will run BLAST searches of each protein encoded on the pathogenic genome
against all the proteins encoded on the other, non-pathogenic genome. If a BLAST search for a
particular protein on the target does not return any results indicating a similar protein on the
source genome, then it is taken that the protein is unique to the target genome.
The output of this workflow is a list of proteins that are encoded by the pathogen’s genome that
do not have sequence similarity to proteins in the non-pathogen’s genome.
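The selection rule described above, where a target protein with no BLAST hits in the source genome is treated as unique, can be sketched as below. The BLAST searches themselves are not shown. The function name unique_to_target and the hits_in_source mapping (target protein ID to the list of similar source proteins found for it) are hypothetical names introduced for illustration.

```python
from typing import Dict, Iterable, List


def unique_to_target(
    target_ids: Iterable[str], hits_in_source: Dict[str, List[str]]
) -> List[str]:
    """Return target proteins whose BLAST search found no similar source protein.

    A protein that is missing from the mapping, or mapped to an empty
    list, is taken to be unique to the target genome.
    """
    return [pid for pid in target_ids if not hits_in_source.get(pid)]
```

The order of the input IDs is preserved, so the resulting list can be passed directly to the next workflow.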
R4-5 Novel protein vs. known drug targets
[Figure: Unique_Protein_List and Drug_List -> Add input sequence to drug list -> Analyze via ClustalW -> Filter out the proteins of low similarity; outputs: Targets and Alignment]
This workflow will take in a list of protein IDs and align them against a single query protein (in
FASTA format). The user is requested to input a minimum alignment score, and those
alignments which exceed this score are output along with the target protein ID.
First, each unique protein is added, as the query protein sequence, to the drug list. It is then
compared with the targets of existing drugs using the ClustalW service in Taverna, which
outputs the detailed protein alignment. Finally, a filter service in Taverna removes the proteins
of low similarity, so that the drug list yields a list of target proteins based on protein similarity.
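The final filtering step can be sketched as follows. The ClustalW call is not shown. The sketch assumes the alignment scores have already been collected into a mapping from target protein ID to score, and the function name filter_alignments is hypothetical. Following the wording above, only scores that strictly exceed the minimum are kept.

```python
from typing import Dict, List, Tuple


def filter_alignments(
    scores: Dict[str, float], minimum_score: float
) -> List[Tuple[str, float]]:
    """Keep the targets whose alignment score exceeds the user's minimum.

    Returns (target protein ID, score) pairs, best score first.
    """
    kept = [(pid, score) for pid, score in scores.items() if score > minimum_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Sorting by score is a presentation choice made here so that the strongest candidate drug targets appear first. It is not something the design specifies.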
R6 KEGG pathway diagrams with enzymes coloured
[Figure: Genes and Colour -> Retrieve pathways from KEGG -> Pathway images with enzymes in the colour specified]
This workflow finds the pathways in which all the genes in the list are involved and finds all
enzymes from the list of genes. For each pathway, it draws the diagram and colours the enzyme
boxes in the colours specified.
KEGG provides, via its web service, a method for finding pathways for a specified list of
proteins/genes. These genes can then be coloured on the pathway diagrams. Using this service
from a Taverna workflow, we are able to satisfy requirement six.
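The pathway selection logic can be illustrated without calling KEGG at all. The sketch below assumes the gene-to-pathway memberships have already been retrieved and are held in a plain mapping. The function names, the mapping and the identifiers in the usage note are all hypothetical, and the KEGG web service calls themselves are deliberately not shown.

```python
from typing import Dict, Iterable, List, Set


def shared_pathways(
    genes: Iterable[str], gene_to_pathways: Dict[str, List[str]]
) -> Set[str]:
    """Return the pathways in which every gene in the list is involved."""
    pathway_sets = [set(gene_to_pathways.get(gene, ())) for gene in genes]
    if not pathway_sets:
        return set()
    # A pathway qualifies only if it appears in every gene's set.
    return set.intersection(*pathway_sets)


def colour_by_gene(genes: List[str], colours: List[str]) -> Dict[str, str]:
    """Pair each gene with the colour specified for it, by position."""
    if len(genes) != len(colours):
        raise ValueError("one colour is required per gene")
    return dict(zip(genes, colours))
```

For example, with gene_a in pathway_1 and pathway_2, and gene_b in pathway_2 and pathway_3, shared_pathways returns only pathway_2. The colour mapping would then be handed to the pathway colouring service for each pathway found.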
5. Functional Requirements: