(Group A)
Team Members: Duncan White, Gregg Iceton, Jamie Hunter, Jannetta Steyn, Meenu Pipat, Richard Crossland, Yu Jiang

CONTENTS
S.No.  Topic
1      Essential Minimum Requirements of the Software System
2      Tools required for the System
3      System Architectural Design
4      Outline of Workflows
5      Functional Requirements

1. Minimum Requirements of the Software-System:
A system to read two files: one as a source and another as a target
A system to test whether two strains are closely enough related
A system to produce a list of unique proteins in the target
A system to compare these unique proteins to existing drug targets
A system to list unique proteins in an organism that could be used as drug targets
A system to pinpoint the position of possible drug targets in KEGG pathway diagrams

2. Tools required for the Service-based Software System:
Taverna: Taverna is an open-source tool for designing and executing workflows. It is integrated with myExperiment, BioCatalogue, Moby, BioMart, Soaplab and R. We have used its latest version, Taverna 2.1, which is freely available online, to incorporate and use services from BioCatalogue.

BioCatalogue: BioCatalogue is a BBSRC-funded project and a joint venture of EMBL-EBI and the myGrid project at the University of Manchester. It is freely accessible and provides many biological services. It also allows discovery, monitoring, submission and annotation of web services. It is available at: http://www.biocatalogue.org/

MyExperiment: This is a social-networking website where users can upload workflows and share them with others. The site can also be used to download workflows from others that already provide the required functionality, rather than developing such workflows from scratch. It is available at: http://www.myexperiment.org/

For testing purposes, input sequences can be obtained from sources such as NCBI, GenBank and EMBL.

3.
System Architectural Design
Figure representing access to the service-based software system

The workflow will run on Taverna Server. e-Drugfinders can start the workflow on Taverna Server through a web interface where the sequence files can be uploaded. For each part of the workflow, Taverna Server accesses an appropriate web service and submits a job to it. The web service processes the job and sends the results back to Taverna Server. When the workflow is complete, Taverna Server will display the results through the web interface. It will also connect to e-Drugfinders’ Ondex warehouse and update it with any new potential drug targets it discovers.

The software comprises workflows created using the Taverna platform. Each workflow, once executed, supplies an output that is accepted as an input by the next workflow in the sequence.

4. Outline of Workflows

R1 Reading and Parsing Genomes
Diagram: Annotated source genome sequence; Annotated target genome sequence; Parsing; DNA sequence of source; DNA sequence of target

The first component of the workflow reads in two annotated whole genome sequences in either the EMBL or GenBank format. The parsing program can read either format and parse the sequence data from the file. This program will have to be written, as no appropriate web service is available in the Taverna platform to parse the files. The program will read each line of the file passed to it, scanning for a regular expression that marks the section of the file containing the sequence data. This sequence data will be extracted from the file and stored. Once the sequence data has been extracted, each sequence is passed to the next component.

R2 Similarity Comparison of Genome Sequences
Diagram: DNA sequence of source; DNA sequence of target; Genome alignment (e.g.
BLAST); Filter similarity value; Threshold similarity value; Similarity check

The second component of the workflow accepts the two outputs of component 1, which contain the parsed genome sequences for the source and target genomes. The two genomes are aligned using a method such as BLAST. The output of the alignment is split into a list, and the similarity value is parsed from it. The similarity value of the global sequence alignment is then passed to a service that checks it against a threshold value. This similarity checking service requests a minimum percentage similarity threshold from the user, defining the lower limit of how similar the sequences should be. If the alignment of the two sequences meets or exceeds the specified threshold, the workflow can continue to the next stage. If the threshold is not met, a message is displayed to notify the user.

R3 Identification of Unique Proteins
Diagram: Source genome proteins; Target genome proteins; BLAST search; Proteins unique to target genome

This component analyses the sufficiently similar genomes to discover proteins that are present in the target (pathogenic) genome but not in the source (non-pathogenic) genome. It will run a BLAST search of each protein encoded on the pathogenic genome against all the proteins encoded on the non-pathogenic genome. If a BLAST search for a particular target protein returns no results indicating a similar protein on the source genome, the protein is taken to be unique to the target genome. The output of this workflow is a list of proteins that are encoded by the pathogen’s genome and have no sequence similarity to proteins in the non-pathogen’s genome.

R4-5 Novel protein vs.
known drug targets
Diagram: Unique_Protein_List; Drug_List; Add input sequence to drug list; Analyze via ClustalW; Filter out the proteins of low similarity; Targets Alignment

This workflow will take in a list of protein IDs and align them against a single query protein (in FASTA format). The user is asked to input a minimum alignment score, and those alignments that exceed this score are output along with the target protein ID. First, the unique protein list is added to the drug list as the query protein sequences. It is then compared with the targets of existing drugs via the ClustalW service in Taverna. The detailed result, the protein alignment, is then listed. After passing through the filter service in Taverna, the drug list yields a list of target proteins based on protein similarity.

R6 KEGG pathway diagrams with enzymes coloured
Diagram: Genes; Colour; Retrieve pathways from KEGG; Pathway images with enzymes in colour specified

This workflow finds the pathways in which all the genes in the list are involved and finds all enzymes from the list of genes. For each pathway, it draws the diagram and colours the enzyme boxes in the colours specified. KEGG provides, via its web service, a method for finding pathways for a specified list of proteins/genes. These genes can then be coloured on the pathway diagrams. Using this service from a Taverna workflow, we are able to satisfy requirement six.

5. Functional Requirements:
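As an illustration of the parsing step outlined for R1 in Section 4, the following is a minimal Python sketch rather than the program the team will write. It assumes the standard flat-file layouts, in which the sequence block starts at an "ORIGIN" line (GenBank) or an "SQ" line (EMBL) and ends at a line containing only "//". The names extract_sequence, SEQUENCE_START and RECORD_END are hypothetical and chosen only for this example.

```python
import re

# Assumption: the sequence block starts at "ORIGIN" (GenBank) or "SQ " (EMBL).
SEQUENCE_START = re.compile(r"^(ORIGIN|SQ\s)")
# Assumption: both formats end a record with a line containing only "//".
RECORD_END = "//"


def extract_sequence(lines):
    """Return the DNA sequence of the first record in an EMBL or GenBank file.

    `lines` is any iterable of text lines, for example an open file.
    An empty string is returned if no sequence block is found.
    """
    in_sequence = False
    chunks = []
    for line in lines:
        if not in_sequence:
            # Skip annotation lines until the sequence marker is seen.
            if SEQUENCE_START.match(line):
                in_sequence = True
            continue
        if line.strip() == RECORD_END:
            break
        # Drop position numbers and spaces, keeping only the bases.
        chunks.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(chunks).lower()


# Tiny made-up records, used only to demonstrate the function.
genbank_demo = ["LOCUS       DEMO", "ORIGIN", "        1 gatcctccat atacaacggt", "//"]
embl_demo = ["ID   DEMO;", "SQ   Sequence 20 BP;", "     gatcctccat atacaacggt        20", "//"]
print(extract_sequence(genbank_demo))
print(extract_sequence(embl_demo))
```

Handling both formats with one start pattern keeps the sketch short; a real parser would probably also need to handle files containing several records and to validate the characters it keeps.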