Providing Web Service Coordination to Bioinformaticians
Matthew Addis, IT Innovation Centre
4 December 2003, e-Science Workflow Services workshop, Edinburgh

Contents
• What problem are we trying to solve?
• Our approach
• What we’ve built
• Who’s using/developing our stuff
• Demonstrations
• What’s coming next
• Downloading and using our software
• Semantics and integration into myGrid
• Questions

myGrid
• EPSRC eScience project
• 3+ years, 15 people
• Almost 2/3 of the way through

What sort of biology problems is myGrid aiming to help solve?
• Graves’ Disease
• Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland, resulting in hyperthyroidism
• Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos

What sort of biology problems is myGrid aiming to help solve?
• Graves’ Disease is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system.
• What is the molecular basis for this autoimmune response?
[Diagram labels: Pituitary Gland; TSH; -ve feedback effect; TSH Receptor; Thyroid Cell]
[Diagram label: Thyroid Hormones Released]

A biologist’s approach to the problem
• Combine lab biology and in-silico experiments
• Exploratory
• Ad-hoc
• Hypothesis driven
• Bespoke processes

What the scientist would like
• Easy to use tools
  – Let the biologist concentrate on the science, not the technicalities of composing and invoking services
  – User can work at their chosen level of abstraction
  – Combine remote services + local tools
  – User interaction (breakpoints, visualisation, filtering)
• ‘Workflow’ lifecycle
  – Authoring, enacting, validating, modifying
  – Publishing and sharing, which involves annotation, discovery and personalisation
• Provenance
  – What, where, when, how, who, why
• Data
  – Lists and sets (which are potentially large)
  – Images, text, HTML

A typical approach to in-silico experiments
[Figure courtesy of Mark Wilkinson (BioMOBY)]

Data isn’t just numbers

Overall, in-silico experimentation is a tricky business…
• Seamless access to bioinformatics data sources and tools is not easy
  – Data formats
  – Data access mechanisms
  – Data annotations and interpretations
  – Analysis techniques and implementations
  – Multiple service providers
• Relatively few standards
  – GO
  – DAS
  – BioMOBY, I3C
• EBI hosts 50+ tools
  – Homology & Similarity
  – Protein Function Analysis
  – Structural Analysis
  – Sequence Analysis
  – Miscellaneous Tools
• EBI hosts 30+ databases
  – Nucleotide Databases
  – Protein Databases
  – Proteome Analysis
  – Structure Databases
  – Microarray Database
  – Literature Databases

But don’t worry, XML and Web Services will save us, right?
• Many existing Web Services and plenty more on the way…
• Providers have a natural interest in delivering their services in whatever way will help their services to be used

SoapLab
• Web Service access to 100+ apps and tools
• For each application
  – CreateJob
  – Run
  – WaitFor
  – GetResults
  – Destroy

Talisman
• Portal building tool, but has a Web Service interface that takes XML scripts which define a series of activities to perform

The exploding world of Web Service Standards
[WS-SecureConversation] [WS-Acknowledgement] [WS-Security] [WS-ActiveProfile] [WS-SecurityPolicy] [WS-Addressing] [WS-Transaction] [WS-Attachments] [WS-TransmissionControl] [WS-Authorization] [WS-Trust] [WS-AtomicTransaction] [BPEL4WS] [WS-BusinessActivity] [WS-Choreography] [WS-CAF] [WS-Policy] [WSRP] [WS-Callback] [WS-PolicyAssertions] [WSXL] [WS-PolicyAttachment] [WS-Coordination] [WS-Provisioning] [WS-EndpointResolution] [WS-Federation] [WS-Privacy] [WS-MessageData] [WS-Inspection] [WS-Referral] [WS-MetadataExchange] [WS-Manageability] [WS-Reliability] [WS-ReliableMessaging] [WS-PassiveProfile] [WS-Routing]

Web Service architectures and stacks
[Figures: Apache; W3C and others]

Adoption and maturity
[Figure: Web Services Roadmap]

In summary, Web Services bring a new set of problems
• Web Services are far from mature
  – Standards evolve, tools lag behind
  – But the world is moving this way
• Bioinformatics Web Services aren’t easy to find, understand and use
  – Lack of community directories and common standards for describing services
  – Multiple application programming models
    • Stateful
    • Script driven
    • Parameterised
  – XML is often used simply to wrap legacy data structures

Don’t worry, Converchoreograorchestroordination will save us, right?
[Figures: XML Coverpages; Process Modelling Languages (ebPML.org); Workflow patterns (W.M.P. van der Aalst, Eindhoven)]

Some scientific workflow tools exist, but tend not to use Web Services

Open source tools are only just starting to emerge
• Enhydra Shark
• Codehaus Werkflow
• OpenSymphony OSWorkflow
• jBpm
• wfmOpen
• OFBiz Workflow Engine
• ObjectWeb Bonita
• Bigbross Bossa
• XFlow
• Taverna
• PowerFolder
• Breeze
• Open Business Engine
• OpenWFE
• Freefluo
• ZBuilder
• Shocks

In summary, Web Service coordination brings yet more problems…
• Wrong level of abstraction for the scientist
• Standards are very much shifting sands
• Very few freely available tools
• Little support for e-Science
• Workflow vs. Dataflow

The approach we’re taking
• Build something that people can use now
• Provide a platform for research into the benefits of new technologies (e.g. Semantic Web) in e-Science
• Deliver tools and specifications in a form that can be easily taken further by others

What we’ve built
• Taverna
  – Build, edit and browse workflows
  – Easy import of services
  – Integrated execution using enactor
• FreeFluo
  – Control flow and data flow, data sets, nested flows
  – Local apps, web services
  – Provenance and status reporting
• Deployment
  – Available as easy to install desktop toolset
  – Integrated within myGrid workbench
  – Enactor available as a Web Service and a Grid Service

Architecture
[Diagram labels: Taverna Workbench; Scufl language parser; Freefluo Enactor Core; Processors; Web Service; Soaplab; Local App]

Enactor
• General purpose core that uses a directed graph model
• ‘Processors’ encapsulate how to use services or local applications

Data flow
• Types
  – Transport: XML
  – Formats: BSML, AGAVE, FASTA
  – Semantic: Protein, DNA sequence
  – Multimedia: Images, 3D models, text
  – Collections: sets, lists
• Taverna/Freefluo is only concerned with how to deal with collections and how to display results
  – Sets, lists, MIME types

[Screenshot slides: Taverna workflow workbench; Simple workflow; Control flow and data flow; Running the workflow; Viewing Results; Provenance]

Deployment on scientist’s desktop
[Diagram labels: Scientist; Community Service Directory Service; Composition tools; Author; Workflow Enactment; Find; Application; Publish; Bind; Application Developer]
[Diagram labels, continued: License; Application Service; Bind; Resource Management; Service Provider]

Deployment at service provider
[Diagram labels: Community Service Directory Service; Find; Publish; Bind; Client; Scientist; negotiation, fulfillment, settlement; Expert; Workflow Enactment; Application Author; Composition tools; Application Service Provider]

Demonstrations
• Trivial workflow
  – Currency conversion video
• Real workflow
  – Graves’ disease workflow
  – Video

Who’s using/developing it?

Downloading and using our software
• Taverna
  – Graphical workflow authoring tool
  – http://taverna.sourceforge.net
  – LGPL open source on SourceForge
  – User and developer documentation
    • Scufl language specification
    • Videos and examples
• FreeFluo
  – Workflow enactment engine
  – http://freefluo.sourceforge.net
  – LGPL open source on SourceForge

What’s coming next
• Large datasets
  – Streaming to and from local files, XPath
  – Passing data using pointers
  – Direct exchange, data staging
  – Protocols: FTP, SOAP attachments
• Long running workflows
  – Persisting state
  – Breakpoints, suspend, resume
• User interaction
  – Inspection and filtering of intermediate data
• Workflow portal and enactor running at a service provider

In-silico experiments in a scientific context
• Personalisation
  – Who else has asked this question & can I use/adapt their approach?
  – I want to annotate and publish my process for use by others
  – I want to store and access my personal datasets
• Provenance
  – Which type, version and provider of BLAST did I use?
  – What were the workflow and the results at each stage?
  – I want to publish my results, workflows and provenance
  – Ownership, immutable and auditable data
• Change management and notification
  – When was P12345 last updated?
  – Has PDB changed since I last ran this workflow?
  – Has the data provenance changed?
  – Are there new or alternative services that I can use?
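
The provenance questions above (which type, version and provider of a service was used, and what the result was at each stage) suggest keeping one record per step as a workflow runs. The following is a minimal sketch in Python, not FreeFluo's or Taverna's actual API: the `Processor`, `ProvenanceRecord` and `run_workflow` names, the stubbed steps and the provider strings are all hypothetical and chosen only for illustration. It assumes a directed graph of steps executed in dependency order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Processor:
    """Hypothetical workflow step wrapping a service or local tool."""
    name: str
    provider: str                 # who hosts the service
    version: str                  # which version was invoked
    func: Callable[..., Any]      # the actual call (stubbed in the example)
    inputs: List[str] = field(default_factory=list)  # upstream step names


@dataclass
class ProvenanceRecord:
    """One entry per invocation: what ran, where, when and for whom."""
    step: str
    provider: str
    version: str
    user: str
    inputs: Dict[str, Any]
    output: Any
    started: datetime
    finished: datetime


def run_workflow(
    steps: List[Processor], user: str
) -> Tuple[Dict[str, Any], List[ProvenanceRecord]]:
    by_name = {s.name: s for s in steps}
    # Map each step to its predecessors; static_order() yields predecessors first.
    order = TopologicalSorter({s.name: s.inputs for s in steps}).static_order()
    results: Dict[str, Any] = {}
    log: List[ProvenanceRecord] = []
    for name in order:
        step = by_name[name]
        # Collect upstream outputs in the order the step declares them.
        args = {dep: results[dep] for dep in step.inputs}
        started = datetime.now(timezone.utc)
        results[name] = step.func(*args.values())
        finished = datetime.now(timezone.utc)
        log.append(ProvenanceRecord(name, step.provider, step.version, user,
                                    args, results[name], started, finished))
    return results, log


# Example: three stubbed steps chained into a linear workflow.
workflow = [
    Processor("get_sequence", "local", "1.0", lambda: "MKTAYIAK"),
    Processor("reverse", "local", "1.0", lambda seq: seq[::-1], ["get_sequence"]),
    Processor("length", "example-provider", "2.1", len, ["reverse"]),
]
results, log = run_workflow(workflow, user="scientist")
for rec in log:
    print(rec.step, rec.provider, rec.version, rec.output)
```

Keeping the records separate from the results means the log can be published or audited on its own; making it immutable and persisting it across long-running workflows are left out of this sketch.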
Integration of workflow into myGrid
[Screenshot labels: myView on the mIR; Workflow; Metadata about workflow; note about workflow]

Semantic description of services and workflows
• Services and workflows in registry have RDF and OWL descriptions
• Selection by the types of inputs they use, outputs they produce, the bioinformatics tasks they perform…
• Querying using RDQL over RDF UDDI registry for operational metadata
• Matching using FaCT OWL classification for concept-based metadata

Value chain
[Diagram labels: User; What is the structure of this protein?; Tools; Biology Problems; Reasoning; Knowledge-based services; Get protein sequence; View or predict structure; Orchestration Services; in-silico Processes; Find similar sequences, SWISSPROT, PDB, RASMOL; Application services; Jobs and Data; Web Services; Raw Resources]
[Diagram labels, “Semantics needed for”: Inputs, outputs; Function; Resources used; Process for using service; Interoperability, Higher level ontologies; Reasoning services, Discovery services]

A few words on semantics

Questions?
Taverna: http://taverna.sourceforge.net
FreeFluo: http://freefluo.sourceforge.net