1. Introduction to myGrid

myGrid is a computer science pilot project working in the field of bioinformatics

large project involving the European Bioinformatics Institute (EBI), IT Innovations,
the Universities of Sheffield, Nottingham, Manchester, Newcastle and Southampton, and
some industrial partners

the project officially ends in June 2005, but other projects are developing our basic
infrastructure further
2. Introduction to bioinformatics

bioinformatics is a research field investigating how to store, process and publish
biological information

the products of bioinformatics research include large databases, plus access and
analysis services for the data they hold

users of bioinformatics commonly wish to automate processes involving multiple
databases, access services and analyses

myGrid was set up to investigate the use of workflows to automate common processes
in bioinformatics

this talk will first introduce bioinformatics and workflow ideas before going on to
talk about portal work
3. Data in bioinformatics

common data in bioinformatics involves sequences (eg DNA, RNA, protein
sequences)

for a stored sequence to be useful, we need to know some meta-information
o species and chromosome
o interesting features found in the sequence
o who produced sequence, and with what equipment

sequence records in flat-files are common, containing the sequence plus meta-information,
sometimes with hyperlinks to web-pages displaying related records

3 global databases (EMBL, GenBank and DDBJ) aim to store all DNA sequence
records generated by scientists around the world

submission of sequences to these databases is a necessary pre-condition of
publication for many journals

smaller, specialist databases exist, eg Nottingham Arabidopsis Stock Centre (NASC)
holds entire genome of the arabidopsis plant

millions of sequences in eg EMBL – need efficient algorithms to make use of them
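
As a rough illustration of the flat-file idea, the sketch below parses a simplified record written in the style of an EMBL entry: two-letter line codes carry the meta-information, a sequence block follows, and `//` ends the record. The sample record, its accession and the function name are invented for illustration, and real records carry many more line types than this handles.

```python
# Hypothetical, simplified record in the style of an EMBL flat-file entry.
SAMPLE_RECORD = """\
ID   X00001; SV 1; linear; genomic DNA; STD; PLN; 24 BP.
AC   X00001;
DE   Made-up example sequence
OS   Arabidopsis thaliana
SQ   Sequence 24 BP;
     atggcgtacg ttagcatgca tgca        24
//
"""

def parse_record(text):
    """Split a record into meta-information lines and the sequence itself."""
    meta = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: keep letters only, dropping spaces and base counts.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("SQ"):      # everything after this line is sequence
            in_sequence = True
        elif len(line) > 5:
            code, value = line[:2], line[5:].strip()
            meta.setdefault(code, []).append(value)
    return meta, "".join(sequence_parts)

meta, sequence = parse_record(SAMPLE_RECORD)
print(meta["OS"][0], len(sequence))
```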
4. Using bioinformatics data

database providers host access services for databases
o fetch sequence for given ID
o look up similar sequences in database to a given sequence
o search by keyword (protein name, species etc)

database providers/other service providers/specialist labs host analysis services
o look for important features of sequence
o predict proteins which might be produced by expressing DNA sequence

users of bioinformatics data include
o wet-lab biologist who wants to analyse data gathered experimentally
o bioinformatician who searches for and analyses data in large, publicly
available databases (performing in silico experiments)
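
The three kinds of access service listed above can be sketched against a tiny in-memory database. Everything here is an assumption made for illustration: the records, IDs and function names are invented, and `difflib` string similarity merely stands in for the specialised similarity-search algorithms that real providers use.

```python
from difflib import SequenceMatcher

# Hypothetical records: ID -> (description, sequence).
DATABASE = {
    "SEQ1": ("kinase, Arabidopsis thaliana", "atggcgtacgttagc"),
    "SEQ2": ("kinase, Homo sapiens", "atggcgtacgatagc"),
    "SEQ3": ("transporter, Mus musculus", "ccccttttggggaaa"),
}

def fetch(seq_id):
    """Fetch the sequence stored under a given ID."""
    return DATABASE[seq_id][1]

def find_similar(query, threshold=0.8):
    """Return IDs whose sequence resembles the query, most similar first."""
    scored = [
        (SequenceMatcher(None, query, seq).ratio(), seq_id)
        for seq_id, (_, seq) in DATABASE.items()
    ]
    return [seq_id for score, seq_id in sorted(scored, reverse=True)
            if score >= threshold]

def search_keyword(keyword):
    """Return IDs whose description mentions the keyword (case-insensitive)."""
    return sorted(seq_id for seq_id, (desc, _) in DATABASE.items()
                  if keyword.lower() in desc.lower())

print(fetch("SEQ1"))
print(find_similar("atggcgtacgttagc"))
print(search_keyword("kinase"))
```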
5. Service interfaces

web-pages

command line tools

programming language libraries

SOAP web-services with interfaces defined by WSDL
6. Tasks involving services

bioinformaticians often perform composite tasks involving combinations of services

eg given initial sequence measured in experiment, look up similar sequences in
database, look up genes involved in sequence, look up function of these genes

with wide variety of interfaces, this can involve cutting and pasting between
web-pages, and between files used as input and output for command-line tools

this can be a repetitive, time consuming and hence error-prone process, especially
since some experiments can produce 1000s of gene sequences that have to be
analysed

commonly, processes are automated by (eg perl) scripts, which can fetch web-pages
designed for human use, extract data and pass it on to command-line tools.

small changes in the interface of the web-pages break these scripts. We need a better
automation technology!
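
To see why such scripts are brittle, consider the sketch below (written in Python rather than perl, with invented page markup and an invented accession). The pattern is tied to the exact markup of the page, so a purely cosmetic change by the provider makes the extraction fail even though the underlying data is unchanged.

```python
import re

# Pattern written against the page layout as it looked when the script was written.
ACCESSION_PATTERN = re.compile(r"<b>Accession:</b> (\w+)")

def extract_accession(html):
    """Return the accession scraped from a results page, or None if not found."""
    match = ACCESSION_PATTERN.search(html)
    return match.group(1) if match else None

# Hypothetical page before and after a cosmetic redesign.
old_page = "<p><b>Accession:</b> X00001</p>"
new_page = '<p><span class="label">Accession:</span> X00001</p>'

print(extract_accession(old_page))  # the accession is found
print(extract_accession(new_page))  # None: same data, but the script no longer finds it
```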
7. Workflows

myGrid is investigating the use of workflow technology to automate common,
repetitive tasks in bioinformatics

workflows have named input ports and output ports

workflows contain instances of processors, which also have named input ports and
output ports

users build workflows by adding workflow inputs, adding workflow outputs, adding
instances of processors and connecting ports with edges to show how data should
flow in the workflow

once built, users enact workflows multiple times with different input values,
producing sets of results (and other info, such as the parameters services are invoked with)
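
The workflow model described above can be sketched in a few lines. This is only an illustrative toy, not the Freefluo or Taverna data model: the class names, the `(processor, port)` representation of edges and the two example processors are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Port = Tuple[str, str]  # (processor name, port name); "" stands for the workflow itself

@dataclass
class Processor:
    name: str
    inputs: List[str]
    outputs: List[str]
    run: Callable[[Dict[str, object]], Dict[str, object]]

@dataclass
class Workflow:
    inputs: List[str]
    outputs: List[str]
    processors: Dict[str, Processor] = field(default_factory=dict)
    edges: List[Tuple[Port, Port]] = field(default_factory=list)  # (source, sink)

    def add(self, processor):
        self.processors[processor.name] = processor

    def connect(self, source, sink):
        self.edges.append((source, sink))

    def enact(self, input_values):
        """Run each processor once all of its input ports have received a value."""
        produced = {("", name): input_values[name] for name in self.inputs}
        received = {}
        pending = dict(self.processors)
        while True:
            # Push every available value along its outgoing edges.
            for source, sink in self.edges:
                if source in produced:
                    received[sink] = produced[source]
            ready = [p for p in pending.values()
                     if all((p.name, port) in received for port in p.inputs)]
            if not ready:
                break
            for p in ready:
                result = p.run({port: received[(p.name, port)] for port in p.inputs})
                for port in p.outputs:
                    produced[(p.name, port)] = result[port]
                del pending[p.name]
        return {name: received[("", name)] for name in self.outputs}

# Build a two-step workflow, then enact it with an input value.
wf = Workflow(inputs=["dna"], outputs=["rna", "length"])
wf.add(Processor("transcribe", ["dna"], ["rna"],
                 lambda i: {"rna": i["dna"].replace("t", "u")}))
wf.add(Processor("count", ["seq"], ["n"], lambda i: {"n": len(i["seq"])}))
wf.connect(("", "dna"), ("transcribe", "dna"))
wf.connect(("transcribe", "rna"), ("count", "seq"))
wf.connect(("transcribe", "rna"), ("", "rna"))
wf.connect(("count", "n"), ("", "length"))

print(wf.enact({"dna": "atgc"}))
```

The same `wf` object can be enacted again with a different input value, which mirrors the build-once, enact-many pattern.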
8. myGrid workflow technology

workflows are enacted by the Freefluo workflow enactor, which has a plugin
mechanism to allow different service types to be included in the same workflow

Taverna workflow workbench provides facilities to build and enact workflows and to
store results (eg to disk)

myGrid Information Repository (MIR) is a web-service which can be used to store
workflows, results, intermediate values and personalization information (eg these
results were produced by this person)
9. Including services in workflows

to include a service in a workflow, a processor must be plugged-in to Freefluo which
can query that service for the names of its inputs and outputs, and which can invoke
the service when the workflow is being enacted

a generic processor can be written to invoke an operation on a SOAP web-service
whose interface is defined by WSDL

specific processors can be written which call functions in a client API for a service

SOAPlab is a web-service which exposes command-line tools. It is configured with a
set of tools and configuration files expressing how input and output works for each
tool

GOWlab does a similar job for web-pages
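
A plug-in of this kind has to answer two questions (what are the service's inputs, and what are its outputs?) and perform one action (invoke the service). The sketch below expresses that as a Python abstract class with a registry keyed by service type. It is not Freefluo's actual plug-in API, which is not shown in these notes; all names are hypothetical, and a local Python function stands in for a remote service.

```python
from abc import ABC, abstractmethod
import inspect

class Processor(ABC):
    """What an enactor needs from any service type (hypothetical interface)."""

    @abstractmethod
    def input_names(self): ...

    @abstractmethod
    def output_names(self): ...

    @abstractmethod
    def invoke(self, inputs): ...

class FunctionProcessor(Processor):
    """Wraps a local function; a real plug-in would query and call a remote service."""

    def __init__(self, func, output_name):
        self.func = func
        self.output_name = output_name

    def input_names(self):
        # Discover input names by inspecting the function, much as a generic
        # SOAP processor might read operation parameters from a WSDL document.
        return list(inspect.signature(self.func).parameters)

    def output_names(self):
        return [self.output_name]

    def invoke(self, inputs):
        return {self.output_name: self.func(**inputs)}

# Registry mapping a service type to the processor class that handles it.
PLUGINS = {"function": FunctionProcessor}

def gc_content(sequence):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "gcGC" for base in sequence) / len(sequence)

processor = PLUGINS["function"](gc_content, "gc")
print(processor.input_names(), processor.invoke({"sequence": "atgc"}))
```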
10. Portal work in myGrid

Taverna/Freefluo quite mature with growing user base and focussed on being
production quality systems – not much scope to hack around with interface

MIR had first useful release in November, and is integrated into Taverna

however, there are some issues/limitations with this system
o Taverna provides facilities to enact a workflow given a set of input data,
producing results which can be saved to disk or MIR. It does not provide
functionality to run further workflows from produced results
o together, Taverna/Freefluo/MIR forms complex system which takes time and
effort to install and learn

alternative interface strategies being investigated through JSR-168 portlets in
Gridsphere
11. Text services workflows

if workflow enactment produces a Swiss-Prot protein sequence record, can extract from
this the PubMed ID of the first paper referring to this record

might add extra stages to workflow which look up related papers

might re-run these stages as a separate workflow on related papers to find even more
related papers

using workflow results as inputs to other workflows is not well-supported in Taverna
– so prototype it in portal
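
Extracting the PubMed ID can be sketched as below. The sketch assumes that literature cross-references appear on `RX` lines containing `PubMed=<id>;`, as in Swiss-Prot flat-file entries; the record fragment and its IDs are invented, and the function name is hypothetical.

```python
import re

# Invented fragment of a protein record; only the reference lines matter here.
RECORD = """\
ID   EXAMPLE_HUMAN
RN   [1]
RX   PubMed=1234567;
RN   [2]
RX   PubMed=7654321;
//
"""

# Match the PubMed cross-reference on an RX line; "." never crosses a line break.
PUBMED_PATTERN = re.compile(r"^RX\s.*?PubMed=(\d+);", re.MULTILINE)

def first_pubmed_id(record):
    """Return the PubMed ID of the first reference in the record, or None."""
    match = PUBMED_PATTERN.search(record)
    return match.group(1) if match else None

print(first_pubmed_id(RECORD))
```

The returned ID is the kind of value that would be fed into a follow-on workflow looking up related papers.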
12. A simpler interface to myGrid

Taverna/Freefluo/MIR provide functionality suitable for expert bioinformatician to
build workflows, enact them and store results

requires some software installation and configuration

requires learning complex interface

many potential users of myGrid don’t have the time/expertise to do this, but may wish
to use workflows written by more expert users

we can exploit the fact that all scientists have a web-browser installed on their
desktop and are used to using web-based interfaces

portlets allow
o user to select a workflow stored in the MIR
o input form to be auto-generated for workflow, into which the user can enter
input values
o workflow to be enacted using these inputs, and the results of this enactment to
be stored back into the MIR
o these, and previously stored results, to be browsed
o <to come> results management – eg delete workflows, results etc that we no
longer want
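
Auto-generating an input form amounts to emitting one field per workflow input. The sketch below does this in Python for illustration only: the actual portlets are JSR-168 components running in Gridsphere, and the function name, field layout and action URL here are assumptions.

```python
from html import escape

def render_input_form(workflow_name, input_names, action="/enact"):
    """Build an HTML form with one text field per workflow input port."""
    lines = [f'<form method="post" action="{escape(action)}">',
             f"  <h2>{escape(workflow_name)}</h2>"]
    for name in input_names:
        safe = escape(name)  # escape so odd port names cannot break the markup
        lines.append(f'  <label>{safe} <input type="text" name="{safe}"></label>')
    lines.append('  <input type="submit" value="Enact workflow">')
    lines.append("</form>")
    return "\n".join(lines)

print(render_input_form("Find related papers", ["pubmed_id", "max_results"]))
```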
13. More information about myGrid

general
o www.mygrid.org.uk
o twiki.mygrid.org.uk

portal
o Stefan Rennick Egglestone (sre@cs.nott.ac.uk)
o Ian Roberts (i.roberts@dcs.shef.ac.uk)