1. Introduction to myGrid
- myGrid is a computer science pilot project working in the field of bioinformatics
- large project involving European Bioinformatics Institute (EBI), IT Innovations, Universities of Sheffield, Nottingham, Manchester, Newcastle and Southampton and some industrial partners
- the project formally ends in June 2005, but other projects are developing our basic infrastructure further

2. Introduction to bioinformatics
- bioinformatics is a research field investigating how to store, process and publish biological information
- the products of bioinformatics research include large databases and access and analysis services
- users of bioinformatics commonly wish to automate processes involving multiple databases, access services and analyses
- myGrid was set up to investigate the use of workflows to automate common processes in bioinformatics
- this talk first introduces bioinformatics and workflow ideas before going on to talk about portal work

3. Data in bioinformatics
- common data in bioinformatics involves sequences (eg DNA, RNA, protein sequences)
- for a stored sequence to be useful, we need to know some meta-information
  o species and chromosome
  o interesting features found in the sequence
  o who produced the sequence, and with what equipment
- sequence records in flat-files are common, containing the sequence plus meta-information, sometimes with hyperlinks to web-pages displaying other useful, related records
- 3 global databases (EMBL, GenBank and DDBJ) aim to store all DNA sequence records generated by scientists around the world
- submission of sequences to these databases is a necessary pre-condition of publication for many journals
- smaller, specialist databases exist, eg Nottingham Arabidopsis Stock Centre (NASC) holds the entire genome of the Arabidopsis plant
- there are millions of sequences in eg EMBL, so efficient algorithms are needed to make use of them

4. Using bioinformatics data
- database providers host access services for databases
  o fetch sequence for given ID
  o look up sequences in the database similar to a given sequence
  o search by keyword (protein name, species etc)
- database providers/other service providers/specialist labs host analysis services
  o look for important features of a sequence
  o predict proteins which might be produced by expressing a DNA sequence
- users of bioinformatics data include
  o wet-lab biologist who wants to analyse data gathered experimentally
  o bioinformatician who searches for and analyses data in large, publicly available databases (performing in silico experiments)

5. Service interfaces
- web-pages
- command line tools
- programming language libraries
- SOAP web-services with interfaces defined by WSDL

6. Tasks involving services
- bioinformaticians often perform composite tasks involving combinations of services
- eg given an initial sequence measured in an experiment, look up similar sequences in a database, look up the genes involved in the sequence, look up the function of these genes
- with a wide variety of interfaces, this can involve cutting and pasting between web-pages, and between files used as input and output for command-line tools
- this can be a repetitive, time-consuming and hence error-prone process, especially since some experiments can produce 1000s of gene sequences that have to be analysed
- commonly, processes are automated by (eg perl) scripts, which can fetch web-pages designed for human use, extract data and pass it on to command-line tools
- small changes in the interface of the web-pages break these scripts. We need a better automation technology!

7. Workflows
- myGrid is investigating the use of workflow technology to automate common, repetitive tasks in bioinformatics
- workflows have named input ports and output ports
- workflows contain instances of processors, which also have named input ports and output ports
- users build workflows by adding workflow inputs, adding workflow outputs, adding instances of processors and connecting ports with edges to show how data should flow in the workflow
- once built, users enact workflows multiple times with different input values, producing sets of results (and other info, such as the parameters services are invoked with)

8. myGrid workflow technology
- workflows are enacted by the Freefluo workflow enactor, which has a plugin mechanism to allow different service types to be included in the same workflow
- Taverna workflow workbench provides facilities to build and enact workflows and to store results (eg to disk)
- myGrid Information Repository (MIR) is a web-service which can be used to store workflows, results, intermediate values and personalization information (eg these results were produced by this person)

9. Including services in workflows
- to include a service in a workflow, a processor must be plugged in to Freefluo which can query that service for the names of its inputs and outputs, and which can invoke the service when the workflow is being enacted
- a generic processor can be written to invoke an operation on a SOAP web-service whose interface is defined by WSDL
- specific processors can be written which call functions in a client API for a service
- SOAPlab is a web-service which exposes command-line tools. It is configured with a set of tools and configuration files expressing how input and output work for each tool
- GOWlab does a similar job for web-pages

10. Portal work in myGrid
- Taverna/Freefluo are quite mature, with a growing user base, and are focussed on being production quality systems, so there is not much scope to hack around with the interface
- MIR had its first useful release in November, and is integrated into Taverna
- however, there are some issues/limitations with this system
  o Taverna provides facilities to enact a workflow given a set of input data, producing results which can be saved to disk or MIR. It does not provide functionality to run further workflows from produced results
  o together, Taverna/Freefluo/MIR form a complex system which takes time and effort to install and learn
- alternative interface strategies are being investigated through JSR-168 portlets in Gridsphere

11. Text services workflows
- if workflow enactment produces a Swiss-Prot protein sequence record, we can extract from this the PubMed ID of the first paper referring to this record
- might add extra stages to the workflow which look up related papers
- might re-run these stages as a separate workflow on the related papers to find even more related papers
- using workflow results as inputs to other workflows is not well-supported in Taverna, so we prototype it in the portal

12. A simpler interface to myGrid
- Taverna/Freefluo/MIR provide functionality suitable for an expert bioinformatician to build workflows, enact them and store results
  o requires some software installation and configuration
  o requires learning a complex interface
- many potential users of myGrid don't have the time/expertise to do this, but may wish to use workflows written by more expert users
- we can exploit the fact that all scientists have a web-browser installed on their desktop and are used to using web-based interfaces
- portlets allow
  o user to select a workflow stored in the MIR
  o input form to be auto-generated for the workflow, into which the user can enter input values
  o workflow to be enacted using these inputs, and the results of this enactment to be stored back into the MIR
  o these, and previously stored results, to be browsed
  o (to come) results management, eg delete workflows, results etc that we no longer want

13. More information about myGrid
- general
  o www.mygrid.org.uk
  o twiki.mygrid.org.uk
- portal
  o Stefan Rennick Egglestone (sre@cs.nott.ac.uk)
  o Ian Roberts (i.roberts@dcs.shef.ac.uk)
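
As a rough illustration of the workflow model described in section 7 (named input and output ports, processor instances, edges between ports, repeated enactment), here is a minimal Python sketch. It is not Taverna or Freefluo code and does not reflect their APIs: the names Processor, Workflow, enact and build_example are hypothetical, and the two toy processors (reverse complement and GC fraction) merely stand in for real bioinformatics services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical model for illustration only; not the Freefluo or Taverna API.
# A port is addressed as (owner, port_name); owner "" means the workflow itself.
Port = Tuple[str, str]


@dataclass
class Processor:
    name: str
    inputs: List[str]   # named input ports
    outputs: List[str]  # named output ports
    run: Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Workflow:
    inputs: List[str]
    outputs: List[str]
    processors: Dict[str, Processor] = field(default_factory=dict)
    edges: List[Tuple[Port, Port]] = field(default_factory=list)

    def add(self, processor: Processor) -> None:
        self.processors[processor.name] = processor

    def connect(self, src: Port, dst: Port) -> None:
        self.edges.append((src, dst))

    def enact(self, values: Dict[str, object]) -> Dict[str, object]:
        # Values currently available on source ports, seeded with workflow inputs.
        available = {("", name): values[name] for name in self.inputs}
        pending = dict(self.processors)
        while pending:
            progressed = False
            for name, proc in list(pending.items()):
                # Collect the values arriving on this processor's input ports.
                args = {dst[1]: available[src]
                        for src, dst in self.edges
                        if dst[0] == name and src in available}
                if set(args) == set(proc.inputs):
                    result = proc.run(args)
                    for port in proc.outputs:
                        available[(name, port)] = result[port]
                    del pending[name]
                    progressed = True
            if not progressed:
                raise ValueError("workflow has unconnected inputs or a cycle")
        # Follow the edges that end on workflow output ports.
        return {dst[1]: available[src]
                for src, dst in self.edges
                if dst[0] == "" and dst[1] in self.outputs}


def reverse_complement(args: Dict[str, object]) -> Dict[str, object]:
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    seq = str(args["sequence"])
    return {"sequence": "".join(comp[base] for base in reversed(seq))}


def gc_fraction(args: Dict[str, object]) -> Dict[str, object]:
    seq = str(args["sequence"])
    return {"fraction": (seq.count("G") + seq.count("C")) / len(seq)}


def build_example() -> Workflow:
    wf = Workflow(inputs=["dna"], outputs=["reversed", "gc"])
    wf.add(Processor("revcomp", ["sequence"], ["sequence"], reverse_complement))
    wf.add(Processor("gc", ["sequence"], ["fraction"], gc_fraction))
    wf.connect(("", "dna"), ("revcomp", "sequence"))
    wf.connect(("revcomp", "sequence"), ("gc", "sequence"))
    wf.connect(("revcomp", "sequence"), ("", "reversed"))
    wf.connect(("gc", "fraction"), ("", "gc"))
    return wf


if __name__ == "__main__":
    workflow = build_example()
    # The same workflow is enacted several times with different input values.
    for dna in ["ATGC", "AAAT"]:
        print(dna, workflow.enact({"dna": dna}))
```

The sketch keeps the workflow definition (ports, processors, edges) separate from enactment, which is what would let a portal auto-generate an input form from the declared workflow inputs and feed one workflow's results into another.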