Scientific Workflow Requirements
Carole Goble, University of Manchester, UK
Bertram Ludaescher, SDSC, USA

Attendees included
  Bob Mann, Anthony Mayer, Austin Tate, Bertram Ludäscher, Geoffrey Fox, Jeffrey Grethe, Matthew Shields, Mike Wilde, Simon Cox, Carole Goble, Antoon Goderis, Earl Ecklund, Alan Bundy, Albert Burger, Jessica Chen-Burger
  And a bunch more whose names we didn’t get

Scientific Workflow Requirements
  - Characterise scientific workflows and identify their requirements
  - Compare/contrast with business workflow requirements
  - Some science stakeholders: neuroscience, astronomy, engineering
  - Few business stakeholders

Goals
  Identify
  - Those requirements which are fundamental and crucial
  - Those requirements which are desirable but optional
  - Those characteristics that are found in business workflows but are inappropriate or unnecessary for scientific workflows

Result
  - Inform the selection of appropriate workflow languages
  - Suggest the commonalities and dissimilarities between different workflows for various problems or communities
  - Inform workflow models, lifecycles and architectures, such as workflow creation, registration, enactment and termination

Methodology
  - Matrixes of requirements against application workflows
    - System requirements
    - Functional requirements
    - Language requirements
  - Post-it harvesting
  - Retrospective rationalisation

A Scientist Writes
  “Work in my problem solving environment so that I don’t need to change the way I work.”

User facing
  - Reflect the modelling paradigm of the scientist
    - Varies between experiments, disciplines
  - Which user would that be then?
    - Creators, users, auditors, validators (“I know it’s right if I see it, but I can’t write it”)
    - Biologists compared to bioinformaticians, and transitioning between
    - Different users, different environments
  - Appropriate levels of abstraction
  - User models -> workflow models
  - Simple to use & intuitive creation, deployment, execution and debugging environments

Supporting Scientific Practice
  - Incrementally exploratory, prototypical (TYPE A)
    - Got the data, now get the Nature paper before the next guy
  - Large scale production (TYPE B)
    - Got the idea, get the data for very many experiments, and even many teams, communities, blah blah
  - Migration from TYPE A to TYPE B
    - Capture of TYPE A for later non-interactive replay in a parameterised fashion
  - Workflow creation paradigms: by example, plagiarism, drag and drop
  - Provenance tracking

Cool tools, right tools
  - I love my vi editor
  - Diagramming tools, text tools
  - Works on all workflows; use which you like when you like
  - Good tools! Easy tools! Friendly tools!
  - For the domain user (which user?), not the computer scientist ☺

Cat skinning
  - Multiple scripting language support
  - Multiple ways to write a workflow
  - One size does not fit all

Transparency and control
  - Looking under the hood and inside the box: observe, trace, compare, muse, fettle & fiddle
  - What should be transparent?
    - Do users need to know what format data is in, or just that it is an image?
  - Unveil at different levels of detail, through the wedding cakes, stacks
  - Opaque to some users some of the time, drillable by others some of the time
    - Role, authorisation, policy
  - Scientist knows best

User interaction
  - Creation, Discovery, Enactment
  - Single user interaction with workflow execution
    - Choice between paths of execution in specific states
    - Parameter modification mid-run
  - Collaborative multi-user interaction in creation
    - Reusing workflows -> Modularisation
    - Reusing wfs with different parameters and datasets
    - Joining up wfs from different areas, different disciplines and across scales
    - E-science crosses disciplines!!
    - No support for “extreme team wf creation”
  - Collaborative multi-user interaction in execution?

Legacy and Extensibility
  - Ingesting legacy and external applications & services
    - May not run on every platform, may need an emulator
    - Heterogeneity – of types, platforms etc.
  - Include arbitrary services available within the user’s domain or hacked up by the users
    - Simon’s piece of Matlab hackery – dark matter services
  - On the fly development and assimilation
    - Suspending the workflow, or prompting the user
    - For the prototypical exploratory workflows largely
  - Discovering, reusing, wrapping external services
  - Massaging, lubrication, facilitating, gluing without programming!
  - Easy to extend to meet specific or unique requirements

More on workflow sorts
  - Batch vs interactive
  - Dataflow vs control flow vs state driven
  - Incrementally exploratory prototypical vs large scale production (and migration from former to latter)

Workflow lifecycles
  - Prototypical workflow development to production run
  - Different parts of the lifecycle might need different environments and policies
  - Different sorts of users will interact at different points in the lifecycle

Security, trust and validation
  - Guarantees
    - That a provisioned service is what it says it is and follows all notification mandates
  - Models of soundness at different levels, well-behavedness
  - 500 lambs follow 10-15 shepherds (or wolves?)
  - Validate at the right time, not every time
  - Confidence in someone else’s stuff
    - I can look at it to check it but I can’t write it

Business vs Scientific
  - It’s all the same and it’s all different
  - Use cases and scenarios needed
  - Classify business and scientific workflow against Matthew’s Stack
  - Drivers
    - Science workflow driven by scientific questions, outcomes and vanity
    - Business workflow driven by business processes & goals and $£€
  - Granularity
    - Business languages for coarse grain of swf
    - Scientists hack at fine grain level

Business vs Scientific
  - Individualism vs Corporations
  - Ratios -- more creators than users in science?
  - What is the Scientific Business Process?

A techy writes
  - Formal underpinning in CS theory
    - What is the underlying formal theoretic model?
  - What is the natural scripting language?
    - Dataflow is functional & parallel
    - Control flow is imperative & sequential?
  - SWF creation as programming. What are the languages?

Next Steps
  - Write this up!
  - Harvest some business use cases from Forrester report style sources (and get Tony Hey to pay)
  - Collect scientific workflow examples
  - Develop matrixes of system, functional and language requirements against these examples

Er … that’s it!
Fin
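
Footnote to "A techy writes": the dataflow vs control flow contrast can be made concrete with a toy example. This is only a sketch under stated assumptions. The step names (fetch_image, calibrate, summarise), their arithmetic and the two-source input are hypothetical and do not come from any workflow system discussed at the session. The control flow version fixes the order of every step; the dataflow version states only what each branch depends on, so independent branches may run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workflow steps; names and arithmetic are illustrative only.
def fetch_image(source):
    """Pretend to fetch pixel data for a named source."""
    return [len(source), 2, 3]

def calibrate(pixels):
    """Pretend calibration: scale every pixel value."""
    return [p * 2 for p in pixels]

def summarise(*branches):
    """Join step: needs the output of every branch."""
    return sum(sum(b) for b in branches)

def run_control_flow(sources):
    """Imperative and sequential: the script dictates the order of every step."""
    results = []
    for s in sources:
        pixels = fetch_image(s)
        results.append(calibrate(pixels))
    return summarise(*results)

def run_dataflow(sources):
    """Functional: each branch depends only on its own input,
    so the branches can be evaluated in parallel before the join."""
    with ThreadPoolExecutor() as pool:
        branches = list(pool.map(lambda s: calibrate(fetch_image(s)), sources))
    return summarise(*branches)

if __name__ == "__main__":
    srcs = ("ab", "cde")
    print(run_control_flow(srcs), run_dataflow(srcs))
```

Both versions compute the same result; the difference is in who decides the ordering, the script author or the data dependencies. Which of the two is the "natural" scripting model for scientists is exactly the open question on the slide.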