Providing Web Service Coordination to Bioinformaticians Matthew Addis

IT Innovation Centre
4 December 2003
e-Science Workflow Services workshop
Edinburgh
Contents
• What problem are we trying to solve?
• Our approach
• What we’ve built
• Who’s using/developing our stuff
• Demonstrations
• What’s coming next
• Downloading and using our software
• Semantics and integration into myGrid
• Questions
myGrid
• EPSRC eScience project
• 3+ years, 15 people
• Almost two-thirds of the way through
What sort of biology problems is myGrid aiming to help solve?
• Graves’ Disease
• Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland resulting in hyperthyroidism
• Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos
What sort of biology problems is myGrid aiming to help solve?
• Graves’ Disease is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system.
• What is the molecular basis for this autoimmune response?
[Diagram labels: Pituitary Gland, TSH, TSH Receptor, Thyroid Cell, Thyroid Hormones Released, -ve feedback effect]
A biologist’s approach to the problem
• Combine lab biology and in-silico experiments
• Exploratory
• Ad-hoc
• Hypothesis driven
• Bespoke processes
What the scientist would like
• Easy to use tools
– Let the biologist concentrate on the science, not the technicalities of composing and invoking services
– User can work at their chosen level of abstraction
– Combine remote services + local tools
– User interaction (breakpoints, visualisation, filtering)
• ‘Workflow’ lifecycle
– Authoring, enacting, validating, modifying
– Publishing and sharing, which involves annotation, discovery and personalisation
• Provenance
– What, where, when, how, who, why
• Data
– lists and sets (which are potentially large)
– Images, text, html
A typical approach to in-silico experiments
Courtesy of Mark Wilkinson (BioMOBY)
Data isn’t just numbers
Overall, in-silico experiments are a tricky business…
• Seamless access to bioinformatics data sources and tools is not easy
– Data formats
– Data access mechanisms
– Data annotations and interpretations
– Analysis techniques and implementations
– Multiple service providers
• Relatively few standards
– GO
– DAS
– BioMOBY, I3C
• EBI hosts 50+ tools
– Homology & Similarity
– Prot. Function. Analysis
– Structural Analysis
– Sequence Analysis
– Miscellaneous Tools
• EBI hosts 30+ databases
– Nucleotide Databases
– Protein Databases
– Proteome Analysis
– Structure Databases
– Microarray Database
– Literature Databases
But don’t worry, XML and Web Services will save us, right?
• Many existing Web Services and plenty more on the way…
• Providers have a natural interest in delivering their services in whatever way will help their services to be used
SoapLab
• Web Service access to 100+ apps and tools
• For each application
– CreateJob
– Run
– WaitFor
– GetResults
– Destroy
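The five per-application operations listed above suggest a job lifecycle: create, run, wait, fetch results, destroy. A minimal Python sketch of that call order follows. Only the five operation names come from the slide; the `FakeSoaplabApp` class, its method signatures, and the toy `reverse_sequence` tool are hypothetical stand-ins, not the real SoapLab interface.

```python
import itertools


class FakeSoaplabApp:
    """Hypothetical in-memory stand-in for one wrapped application.

    Only the five operation names come from the slide; the signatures
    and behaviour here are assumptions made for illustration.
    """

    def __init__(self, tool):
        self.tool = tool          # callable standing in for the wrapped tool
        self.jobs = {}            # job id -> job record
        self._ids = itertools.count(1)

    def create_job(self, inputs):
        job_id = f"job-{next(self._ids)}"
        self.jobs[job_id] = {"inputs": inputs, "state": "CREATED", "results": None}
        return job_id

    def run(self, job_id):
        job = self.jobs[job_id]
        # This stub runs the tool synchronously, inside the call.
        job["results"] = self.tool(**job["inputs"])
        job["state"] = "COMPLETED"

    def wait_for(self, job_id):
        # Nothing to wait for in this synchronous stub; just report the state.
        return self.jobs[job_id]["state"]

    def get_results(self, job_id):
        return self.jobs[job_id]["results"]

    def destroy(self, job_id):
        del self.jobs[job_id]


def run_once(app, inputs):
    """Drive one job through CreateJob, Run, WaitFor, GetResults, Destroy."""
    job_id = app.create_job(inputs)
    try:
        app.run(job_id)
        app.wait_for(job_id)
        return app.get_results(job_id)
    finally:
        app.destroy(job_id)  # always release the job's stored state


def reverse_sequence(sequence):
    """Toy tool (an assumption): reverse a sequence string."""
    return {"outseq": sequence[::-1]}
```

Calling `run_once(FakeSoaplabApp(reverse_sequence), {"sequence": "ACGT"})` walks one job through all five steps. The `finally` clause reflects a common concern with stateful services: the job should be destroyed even if a step fails.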
Talisman
• Portal building tool, but has a Web Service interface that takes XML scripts which define a series of activities to perform
The exploding world of Web Service Standards
[WS-SecureConversation]
[WS-Acknowledgement]
[WS-Security]
[WS-ActiveProfile]
[WS-SecurityPolicy]
[WS-Addressing]
[WS-Transaction]
[WS-Attachments]
[WS-TransmissionControl]
[WS-Authorization]
[WS-Trust]
[WS-AtomicTransaction]
[BPEL4WS]
[WS-BusinessActivity]
[WS-Choreography]
[WS-CAF]
[WS-Policy]
[WSRP]
[WS-Callback]
[WS-PolicyAssertions]
[WSXL]
[WS-PolicyAttachment]
[WS-Coordination]
[WS-Provisioning]
[WS-EndpointResolution]
[WS-Federation]
[WS-Privacy]
[WS-MessageData]
[WS-Inspection]
[WS-Referral]
[WS-MetadataExchange]
[WS-Manageability]
[WS-Reliability]
[WS-ReliableMessaging]
[WS-PassiveProfile]
[WS-Routing]
[WS-Referral]
Apache
Web Service architectures and stacks
W3C and others
Adoption and maturity
Web Services Roadmap
In summary, Web Services bring a new set of problems
• Web Services is far from mature
– Standards evolve, tools lag behind
– But, the world is moving this way
• Bioinformatics Web Services aren’t easy to find, understand and use
– Lack of community directories and common standards for describing services
– Multiple application programming models: stateful, script driven, parameterised
– XML is often used simply to wrap legacy data structures
Don’t worry, Converchoreograorchestroordination will save us, right?
XML Coverpages
Process Modelling Languages
ebPML.org
Workflow patterns
W.M.P. van der Aalst, Eindhoven
Some scientific workflow tools exist, but tend not to use Web Services
Open source tools are only just starting to emerge
• Enhydra Shark
• Codehaus Werkflow
• OpenSymphony OSWorkflow
• jBpm
• wfmOpen
• OFBiz Workflow Engine
• ObjectWeb Bonita
• Bigbross Bossa
• XFlow
• Taverna
• PowerFolder
• Breeze
• Open Business Engine
• OpenWFE
• Freefluo
• ZBuilder
• Shocks
In summary, Web Service coordination brings yet more problems…
• Wrong level of abstraction for the scientist
• Standards are very much shifting sands
• Very few freely available tools
• Little support for e-Science
• Workflow vs. dataflow
The approach we’re taking
• Build something that people can use now
• Provide a platform for research into the benefits of new technologies (e.g. Semantic Web) in e-Science
• Deliver tools and specifications in a form that can be easily taken further by others
What we’ve built
• Taverna
– build, edit and browse workflows
– easy import of services
– integrated execution using enactor
• FreeFluo
– Control flow and data flow, data sets, nested flows
– Local apps, web services
– provenance and status reporting
• Deployment
– available as easy to install desktop toolset
– integrated within myGrid workbench
– Enactor available as a Web Service and a Grid Service
Architecture
[Diagram labels: Taverna Workbench; Scufl language parser; Freefluo Enactor Core; Processor (×4); Web Service; Soaplab; Local App]
Enactor
• General purpose core that uses a directed graph model
• ‘Processors’ encapsulate how to use services or local applications
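One way to picture a directed-graph core with pluggable processors is the sketch below. It is not FreeFluo's actual design or API: the `Processor` class, the `enact` function, and the use of Python's standard `graphlib` module (Python 3.9+) to order the graph are all assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter


class Processor:
    """Hypothetical processor: hides how one piece of work is invoked.

    Here it wraps a local Python function; a different processor could
    wrap a remote service call behind the same invoke() method.
    """

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = tuple(inputs)   # names of upstream nodes feeding this one

    def invoke(self, *args):
        return self.func(*args)


def enact(processors, workflow_inputs):
    """Run processors in dependency order; return every node's output by name."""
    by_name = {p.name: p for p in processors}
    # Only processor-to-processor edges go into the graph;
    # workflow inputs are available from the start.
    graph = {p.name: [i for i in p.inputs if i in by_name] for p in processors}
    results = dict(workflow_inputs)
    for name in TopologicalSorter(graph).static_order():
        proc = by_name[name]
        results[name] = proc.invoke(*(results[i] for i in proc.inputs))
    return results
```

For example, three processors wired as `seq → upper → length` and `(upper, length) → report` can be passed to `enact` in any order; the topological sort ensures each one runs only after its inputs exist.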
Data flow
• Types
– Transport: XML
– Formats: BSML, AGAVE, FASTA
– Semantic: Protein, DNA sequence
– Multimedia: Images, 3D models, text
– Collections: sets, lists
• Taverna/Freefluo is only concerned with how to deal with collections and how to display results
– Sets, lists, MIME types
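As an illustration of what "dealing with collections" could mean, the sketch below applies a single-item operation across a possibly nested list and tags each result with a MIME type that a results viewer could use to pick a renderer. The `DataItem` type, the `map_over` helper, and the toy `gc_content` operation are hypothetical names, not part of Taverna or FreeFluo, and only lists (not sets) are handled.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Union


@dataclass(frozen=True)
class DataItem:
    """A value plus the MIME type a results viewer could use to render it."""
    value: Any
    mime_type: str = "text/plain"


Nested = Union[DataItem, List["Nested"]]


def map_over(func: Callable[[DataItem], DataItem], data: Nested) -> Nested:
    """Apply a single-item function to every item, preserving list nesting."""
    if isinstance(data, list):
        return [map_over(func, element) for element in data]
    return func(data)


def gc_content(item: DataItem) -> DataItem:
    """Toy single-item operation (an assumption): fraction of G/C bases."""
    seq = item.value.upper()
    fraction = (seq.count("G") + seq.count("C")) / len(seq)
    return DataItem(f"{fraction:.2f}", "text/plain")
```

The design point is that the operation itself knows nothing about collections; the iteration is supplied around it, so the same operation works on one sequence or on a list of lists.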
Taverna workflow workbench
Simple workflow
Control flow and data flow
Running the workflow
Viewing Results
Provenance
Deployment on scientist’s desktop
[Diagram labels: Scientist; Community Service Directory Service; Composition tools; Author; Workflow Enactment; Find; Application; Publish; Bind; Application Developer; License; Application Service; Bind; Resource Management; Service Provider]
Deployment at service provider
[Diagram labels: Community Service Directory Service; Find; Publish; Bind; Client; Scientist; negotiation, fulfillment, settlement; Expert; Workflow Enactment; Application; Author; Composition tools; Application; Service Provider]
Demonstrations
• Trivial workflow
– Currency conversion video
• Real workflow
– Graves’ disease workflow
– Video
Who’s using/developing it?
Downloading and using our software
• Taverna
– Graphical workflow authoring tool
http://taverna.sourceforge.net
– LGPL open source on SourceForge
– User and developer documentation
• Scufl language specification
• Videos and examples
• FreeFluo
– Workflow enactment engine
– http://freefluo.sourceforge.net
– LGPL open source on SourceForge
What’s coming next
• Large datasets
– Streaming to and from local files, XPath
– Passing data using pointers
– Direct exchange, data staging
– Protocols: FTP, SOAP attachments
• Long running workflows
– Persisting state
– Breakpoints, suspend, resume
• User interaction
– Inspection and filtering of intermediate data
• Workflow portal and enactor running at a service provider
In-silico experiments in a scientific context
• Personalisation
– Who else has asked this question & can I use/adapt their approach?
– I want to annotate and publish my process for use by others
– I want to store and access my personal datasets
• Provenance
– Which type, version and provider of BLAST did I use?
– What was the workflow and the results at each stage?
– I want to publish my results, workflows and provenance
– Ownership, immutable and auditable data
• Change management and notification
– When was P12345 last updated?
– Has PDB changed since I last ran this workflow?
– Has the data provenance changed?
– Are there new or alternative services that I can use?
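The provenance and change-management questions above imply recording, for each service invocation, what was run, which version and provider, who ran it, when, and on which data. A minimal sketch of such a record follows. The `ProvenanceRecord` type, its field names, and the `has_changed` helper are assumptions for illustration, not myGrid's provenance model.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def digest(data: str) -> str:
    """Content hash, so later changes to a dataset can be detected."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


@dataclass(frozen=True)   # frozen: a record cannot be altered once written
class ProvenanceRecord:
    service: str          # what was run, e.g. "BLAST"
    version: str          # which version of the service
    provider: str         # where it ran
    user: str             # who ran it
    input_digest: str     # hash of the input data
    output_digest: str    # hash of the result
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # when it ran
    )


def has_changed(record: ProvenanceRecord, current_input: str) -> bool:
    """True if the input data differs from what this record was computed on."""
    return digest(current_input) != record.input_digest
```

Storing a hash rather than the data itself keeps records small while still answering "has this entry changed since I last ran the workflow?" by re-hashing the current copy.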
Integration of workflow into myGrid
[Figure labels: myView on the mIR; Workflow; Metadata about workflow; note about workflow]
Semantic description of services and workflows
• Services and workflows in registry have RDF and OWL descriptions
• Selection by the types of inputs they use, outputs they produce, the bioinformatics tasks they perform…
• Querying using RDQL over RDF UDDI registry for operational metadata
• Matching using FaCT OWL classification for concept-based metadata
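Selection by input type, output type and task can be pictured with plain Python data standing in for the registry. The descriptions on the slide are RDF/OWL queried with RDQL; the `ServiceDescription` type, the `find_services` helper, and any registry entries used with them are hypothetical. This sketch does exact string matching only, with no ontology reasoning of the kind a classifier would provide.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    inputs: FrozenSet[str]    # semantic types consumed
    outputs: FrozenSet[str]   # semantic types produced
    task: str                 # bioinformatics task performed


def find_services(
    registry: Iterable[ServiceDescription],
    consumes: Optional[str] = None,
    produces: Optional[str] = None,
    task: Optional[str] = None,
) -> List[ServiceDescription]:
    """Return services matching every criterion that was supplied."""
    matches = []
    for svc in registry:
        if consumes is not None and consumes not in svc.inputs:
            continue
        if produces is not None and produces not in svc.outputs:
            continue
        if task is not None and task != svc.task:
            continue
        matches.append(svc)
    return matches
```

A concept-based matcher would go further than this: asked for services that consume a "sequence", it could also return those described as consuming a "protein sequence", because the ontology records that one is a kind of the other.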
User
Value chain
[Diagram labels: Biology Problems; in-silico Processes; Tools; Jobs and Data; Raw Resources; Reasoning; Knowledge-based services; Orchestration Services; Application services; Web Services. Examples shown: "What is the structure of this protein?"; Get protein sequence; Find similar sequences; View or predict structure; SWISSPROT, PDB, RASMOL]
Semantics needed for
Inputs, outputs
Function
Resources used
Process for using service
Interoperability, Higher level ontologies
Reasoning services, Discovery services
A few words on semantics
Questions?
Taverna: http://taverna.sourceforge.net
FreeFluo: http://freefluo.sourceforge.net