Under the Hood of a Workflow Manager
Triana
Matthew Shields, Cardiff University
BiodiversityWorld GRID workshop, NeSC, 30 June - 1 July

Outline
- What is Workflow management?
- Why should I care?
- Current State of the Art
- Workflow Languages
- Other Projects
- Triana, Architecture & Services
- Extending Triana for BDWorld
- Conclusion

What is Workflow Management?
- Concept comes from business world
- Many years of research and practice
- Process capture and reuse
- Repeatability, provenance, audit trails & accountability
- Domain expert knowledge capture
- Analysis and optimization

What Can a Workflow Manager do for Me?
- Scientific workflow has a different focus from business workflow
  - Large-scale data collection
  - Querying
  - Analysis
  - Visualization
- Similar goals
  - Component & workflow reuse
  - Knowledge capture
- Additional goals
  - Simplified application/experiment design
  - Environment/complexity abstraction

State of the Art
- Schedule workflow tasks (Grid/distributed environment)
- Monitor/Control execution
- Active visualization and computational steering
- User interaction
- Pause and restart
- Data provenance
- Component and sub-workflow reuse
- Analysis and optimization

Workflow Languages
- No current agreed standard
- Most projects use DAG or Petri-Net
- Data vs control flow
- Dependency vs scripting language
- Many XML schema
- Business workflow standards - BPEL
  - Not good enough fit
- GGF WFM-RG
  - Attempting to solicit agreement on standards

Workflow Management Projects
- myGrid/Taverna - Southampton & others
  - XML/DAG based workflow language
  - Initially WS choreography tool - now incorporates local tools/components
  - Grid integration with databases via OGSA Distributed Query Processor
  - myGrid Project main users - Bioinformatics
- Kepler - SDSC
  - Based on Ptolemy - modeling, simulation & design of real time & concurrent systems
  - Concurrent dataflow
  - Actors (components), Directors (workflow engines)
  - Local, Web Service & Grid Service actors
  - Ecology, biology, chemistry, oceanography, and the geosciences

WM Projects 2
- Karajan/Commodity Grid (CoG) Kit, Argonne & Berkeley
  - Scripting workflow language for Grid tasks
  - Integration with Globus Toolkit GT3 & GT4
  - Pure control flow
  - Data flow performed by data tasks - GridFTP
- And many more… See
  - http://www.gridworkflow.org/snips/gridworkflow/
  - http://www.extreme.indiana.edu/swf-survey/

Triana
- Cardiff University! PPARC funded
- Java based Scientific Workflow Tool or PSE
- Originally designed for Signal Processing
- Now domain independent
  - Bioinformatics - obviously!
  - Signal Processing - gravitational wave detection & radio astronomy
  - Design optimisation
  - Data mining
  - Medical imaging
  - Distributed Audio Processing

Triana Components
- Local Java components
- Service-oriented Components
  - Web services as components (WSRF coming soon)
    - Web service workflow
  - Peer-to-peer services as components
    - Distributed service workflow
- Grid-oriented Components
  - Grid file and job primitives as components
    - Complex Grid workflow
  - Legacy code components via GridMonSteer
- Mix and Match composition

Workflow
- Inherently data flow based
  - Control flow through “messages”
- XML/DCG workflow format
- Internally workflow language independent
  - Migration to standards based language
- Simple Parent/Child relationship between tasks
- Context based implied actions
  - Local file -> local file = file copy
  - Local file -> remote file = file transfer
- Import/Export other workflow formats
  - Pegasus/EGEE read/write DAGMan format

Triana Architecture
- A Graphical Grid Computing Environment or Portal
- Service Based Computing: deployment, discovery and communication with distributed services, e.g. P2P and (GSI) Web services
- Grid Computing: Job Submission, File services
[Diagram: Triana sits on two interfaces. GAP Interface bindings: P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes), Web Services (UDDI, SOAP). GAT Interface bindings: Condor, Unicore, Globus, RLS, SSH, GridFTP, PBS, SGE, GRMS, .NET, GridLab, LDR, WSRF, other. Also labelled: Grid services.]

Triana in a SO World
- Service Discovery
  - Dynamic?
  - Decentralized?
- Communication
  - Message Format: SOAP?
  - Transport Protocol: TCP? UDP?
[Diagram: "hello" is sent across the network to an en_fr BabelFish service (babelfish.altavista.com), which returns "bonjour".]

GAP Interface
- A simple service based API for
  - Service Deployment
  - Service Discovery
  - Pipe Based Communication
- Static application interface with multiple middleware bindings
  - P2PS
  - JXTA
  - Web services
[Diagram: GAP Interface over P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes) and Web Services (UDDI, SOAP).]

WSPeer
- High Level Interface to Web Services
  - Discovery
  - Invocation
  - Deployment
  - Hosting
- Abstract from usual Web Service Discovery and Communication Mechanisms (i.e. UDDI and HTTP)
  - P2PS Web Service Discovery?
- Uses Apache AXIS as SOAP Engine
- Extends Capabilities of Apache AXIS
  - Stubless Invocation (including complex types)
  - Non Standard Transports (i.e. P2PS)

WSPeer
[Diagram: an application uses WSPeer to deploy, publish, locate and invoke services through two bindings. The WSPeer HTTP/UDDI binding publishes to and locates through UDDI and launches an HTTP server; the WSPeer P2PS binding deploys, publishes, locates and invokes over P2PS.]

Extending Triana for BDWorld
- BDWorld proxy components talk to Web Services
- Workflow Design Assistant (WfDA)
  - Selection and composition of BDWorld workflows from available services
  - Uses Meta Data Repository (MDR) & Meta Data Agent (MDA)
  - MDR contains mapping from proxies to resources
  - WfDA captures domain knowledge in constraints
  - Constraints used to limit the possible components at each stage of composition
  - Simplifies valid workflow creation

Conclusion
- A workflow manager should:
  - Simplify scientific experimentation
  - Enable reuse at multiple levels
    - Component
    - Sub-workflow/Compound components
    - Collaboration
  - Abstract component and environment complexities
    - Think of all components as a service that performs a known task
    - Implied/Context based operations - file copy/move
- Put the scientist back in control of the science, not the computing
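
The implied, context based operations named in the slides (local file -> local file = file copy, local file -> remote file = file transfer) can be pictured as a small dispatch on where the two endpoints live. The sketch below is purely illustrative and is written in Python for brevity; it is not Triana's Java implementation. The names FileRef and implied_action are hypothetical, and treating every connection with a remote endpoint as a transfer is an assumption that goes beyond the two cases the slides list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical file endpoint in a workflow; host is None for a local file."""
    path: str
    host: Optional[str] = None

    @property
    def is_local(self) -> bool:
        return self.host is None


def implied_action(source: FileRef, target: FileRef) -> str:
    """Infer the operation implied by connecting source to target."""
    if source.is_local and target.is_local:
        # Case from the slides: local file -> local file = file copy
        return "file copy"
    # Case from the slides: local file -> remote file = file transfer.
    # Assumption: any other pairing with a remote endpoint is also a transfer.
    return "file transfer"
```

With a rule like this, the scientist only connects two file components; which mechanism moves the data (for example GridFTP) is left to the environment, which is the kind of complexity abstraction the conclusion argues for.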