Scientific Workflows

Scientific Workflows Based on
Dataflow Process Networks
(or from Ptolemy to Kepler)
(or Workflow Considered Harmful …)
Bertram Ludäscher
San Diego Supercomputer Center
ludaesch@SDSC.edu
NeSCR Dec-3 -2003 Bertram Ludaescher
Overview
1. Scientific Workflow (SWF) Examples
2. SWF Requirements & Characteristics
3. Workflow standards considered harmful for SWF!?
4. Dataflow Process Networks (Ptolemy II)
5. Scientific Workflows (Kepler = Ptolemy II + X)
Acknowledgements I
• NSF, NIH, DOE
• GEOsciences Network (NSF)
– www.geongrid.org
• Biomedical Informatics Research Network (NIH)
– www.nbirn.net
• Science Environment for Ecological Knowledge (NSF)
– seek.ecoinformatics.org
• Scientific Data Management Center (DOE)
– sdm.lbl.gov/sdmcenter/
Acknowledgements II
• Ilkay Altintas (SDM)
• Chad Berkley (SEEK)
• Shawn Bowers (SEEK)
• Jeffrey Grethe (BIRN)
• Christopher H. Brooks (Ptolemy II)
• Zhengang Cheng (SDM)
• Efrat Jaeger (GEON)
• Matt Jones (SEEK)
• Edward A. Lee (Ptolemy II)
• Kai Lin (GEON)
• Bertram Ludaescher (BIRN, GEON, SDM, SEEK)
• Stephen Neuendorffer (Ptolemy II)
• Mladen Vouk (SDM)
• Yang Zhao (Ptolemy II)
• …
• Coming soon!?:
– ROADNet, myGrid, GriPhyN, ...
Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
Promoter Identification Workflow in Ptolemy-II (SSDBM’03)
[Figure callout: Execution Semantics]
GARP Invasive Species Pipeline
[Figure: GARP pipeline diagram. Recoverable labels:
Data products: (a) species presence & absence points (native range; invasion area); (b) environmental layers (native range; invasion area); (c) integrated layers (native range; invasion area); (d) training sample and test sample; (e) GARP rule set; (f) native range prediction map and invasion area prediction map; (g) model quality parameter; (h) selected prediction maps.
Steps: EcoGrid Query (against registered EcoGrid databases), Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, User (selection), Generate Metadata, Archive to EcoGrid.]
Source: NSF SEEK (Deana Pennington et al., UNM)
Rock & Mineral Classification Workflow
A Look Inside Classification
• Finer granularity
• Extracted from the mineral composition and this level’s diagram coordinates.
• Diagrams information and transitions between them.
• Classifier: Locates the point’s region.
• SVG to polygons.
• Displays the point in the diagram for this level.
Source: NIH BIRN (Jeffrey Grethe, UCSD)
SWF Requirements & Characteristics
• Scientist friendly "problem solving environment"
– WF design
– WF execution
– WF steering and UI
• pause; revise; resume; rollback (cf. SCIRun)
– repositories of reusable components
– data and WF provenance (virtual data concept)
• logging, cache reuse/partial re-derive, reports, …
– Conceptual modeling support
• complex data (semantics) support
• “wiring” support (cf. web service composition)
• planning support
SWF Requirements & Characteristics
• "Modeling" support
– Abstraction, hierarchical modeling
– Models of Computation (MoC)
– component interaction; combination of MoCs (cf. CCA)
– WF multi-grain/granola: powder to boulders (and back)
• Boolean (N)AND, (N)OR,… vs. chaining together Grid-apps
– Rich data structures and type systems
• End user "programming" support
– high-level programming constructs
• e.g. map/3 for iteration, filter, select, branch, merge, ...
– data transformations
– legacy tool integration (plug-ins)
– data streaming
• How to tame (e.g., starve a dataflow; then resume)?
→ Zauberlehrling’s (the sorcerer’s apprentice’s) problem
SWF Requirements & Characteristics
• Grid-enabling SWFs
– transparent use of (remote) resources
– big data
– big computation requirements
– early/late binding of logical to physical resources, …
– planning, scheduling, …
→ cf. Chimera, Pegasus, DAGMan, Condor(-G)
Scientific Workflows: Some Findings
• More dataflow than (business) workflow
– but some branching, looping, merging, …
– not: documents/objects undergoing modifications
– instead often: dataset-out = analysis(dataset-in)
• Need for “programming extension”
– Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations (compute/transform alternations)
• Need for rich user interaction & workflow steering:
– pause / revise / resume
– select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
→ data provenance (“virtual data” concept)
A ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
Business Workflows
• Business Workflows
– show their office automation ancestry
– documents and “work-tasks” are passed
– no data streaming, no data-intensive pipelines
– lots of standards to choose from: WfMC, WSFL, BPML, BPEL4WS, XPDL, …
– but often no clear execution semantics for constructs as simple as this:
Source: Expressiveness and Suitability of Languages for Control Flow
Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
On Workflow Standards…
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
Workflow “Standards” Debunked
Source: Don’t go with the flow: Web services composition standards exposed, W.M.P. van der Aalst, Trends & Controversies, Jan/Feb 2003 issue of IEEE Intelligent Systems, “Web Services - Been there done that?”
But never mind the standards discussion:
Many Scientific Workflows are Dataflows!
(Check YOUR examples …)
Commercial Workflow/Dataflow Systems
SCIRun: Component-Based Problem Solving
Environments for Large-Scale Scientific Computing
• SCIRun: problem solving environment for interactive construction, debugging, and steering of large-scale scientific computations
• Component model, based on generalized dataflow programming
• Contact: Steve Parker (cs.utah.edu); SciDAC/SDM collaboration
Workflow and distributed computation grid created with Kensington Discovery Edition from InforSense.
Dataflow Process Networks:
Putting Computation Models first!
[Figure: two actors with typed i/o ports, connected by a FIFO channel; advanced push/pull]
• Synchronous Dataflow Network (SDF)
– Statically schedulable single-threaded dataflow
• Can execute multi-threaded, but the firing-sequence is known in advance
– Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
– Multi-threaded dynamically scheduled dataflow
– More expressive than SDF (dynamic token rate prevents static scheduling)
– Natural streaming model
• Other Execution Models (“Domains”)
– Implemented through different “Directors”
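The actor/FIFO picture above can be approximated in a few lines of Haskell, using lazy lists as stand-ins for FIFO channels. This is only a sketch under that assumption and not how Ptolemy II directors are implemented; all names (`Channel`, `pairSum`, `keepMultiplesOf3`) are hypothetical. It contrasts a fixed-rate actor (SDF-like, so its firing sequence could be scheduled statically) with an actor whose output rate depends on the data (PN-like).

```haskell
module Main where

-- Assumption: a lazy list models an unbounded FIFO channel between actors.
type Channel a = [a]

-- Source actor: emits the tokens 1, 2, 3, ... on demand.
source :: Channel Int
source = [1 ..]

-- SDF-style actor: fixed rates, consumes 2 tokens and produces 1 per firing.
pairSum :: Channel Int -> Channel Int
pairSum (x : y : rest) = x + y : pairSum rest
pairSum _              = []

-- PN-style actor: data-dependent rate, produces 0 or 1 token per input token.
keepMultiplesOf3 :: Channel Int -> Channel Int
keepMultiplesOf3 = filter (\n -> n `mod` 3 == 0)

-- Wiring: source -> pairSum -> keepMultiplesOf3
network :: Channel Int
network = keepMultiplesOf3 (pairSum source)

main :: IO ()
main = print (take 5 network)  -- prints [3,15,27,39,51]
```

Because evaluation is demand-driven, only as many tokens are produced as the consumer asks for, which is one simple way to keep an unbounded stream under control.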
Dataflow Process Networks and Ptolemy-II
see! read! try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Why Ptolemy-II?
• PTII Objective:
– “The focus is on assembly of concurrent components. The key
underlying principle in the project is the use of well-defined
models of computation that govern the interaction between
components. A major problem area being addressed is the use of
heterogeneous mixtures of models of computation.”
• Data & Process oriented:
– Dataflow process networks
• Natural Data Streaming Support
• End user “WF console” (Vergil GUI)
• PRAGMATICS
– mature, actively maintained, well-documented
– open source system
– leverage “sister projects” activities (e.g. SEEK, SDM, BIRN,…)
Marrying & Divorcing Control- & Dataflow
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Another Goodie: Ptolemy-II Type System
Support for Multiple Workflow Granularities
Boulders
Plumbing
Powder
Sand
Abstraction: Sand to Rocks
Scientific Workflows = Dataflow Process Networks + X
Kepler = Ptolemy-II + X
• X=…
– Database plug-ins
– Legacy application plug-ins (via command line, as web services, …)
– Grid extensions:
• Actors as web/grid services
• 3rd party data transfer, high-throughput data streaming
• Dealing with thousands of files (cf. astrophysics, astronomy, HEP, … examples)
• Data and service repositories, discovery
– Extended type system (structural & semantic extensions)
– Programming extensions (declarative/FP) and rich user interactions/workflow steering
– Rich data transformations (compute/transform alternations)
– Data provenance
• (semi-)automatic meta-data creation
Status update / specific tasks for Kepler
$DONE, %ONGOING, *NEW
• User interaction, workflow steering
– $ Pause/revise/resume
– $ BrowserUI actor (browser as a 0-learning display and selection tool)
• Distributed execution
– $ Dynamically port-specializing WSDL actor
– * Dynamically specializing Grid service actor
• Port & actor type extensions (SEEK leverage)
– * Structural types (XML Schema)
– * Semantic types (OWL) incl. unit types w/ automatic conversion
• Programming extensions
– % Data transformation actors (XSLT, XQuery, Python, Perl,…)
– * map, zip, zipWith, …, loop, switch “patterns”
• Specialized Data Sources
– $ EML (SEEK),
– % MS Access (GEON), *JDBC,
– *XML, *NetCDF, …
Some specific tasks for Kepler
(all NEW)
• Design & develop transparent, Grid-enabled PNs:
– Communication protocol details
– Grid-actor extensions and/or
– Grid-Process Network director (G-PN)
– Host/Source-location becomes actor parameter
• add “active-inline” parameter display for grid-actors (@exec-loc), channels (@transport-protocol), source-actors (@{src-loc|catalog-loc})
• Activity Monitoring
– Add “activity status” display (green, yellow, red) to replace PtII animation (needed for concurrently executing PN!)
• Registration & Deployment mechanisms
– Actor/Data/Workflow repository (=composite actors)
– Shows up as (configurable) actor library
– OGSA Service Registry approach? (SEEK leverage; UDDI complex & limited says MattJ)
• http://www-unix.globus.org/toolkit/draft-ggf-ogsi-gridservice-33_2003-06-27.pdf
• Extensions to deal with failures (fault tolerance)
Example: Database actors for Ptolemy II
(Kepler-GEON; Efrat Jaeger)
Database Actors
• Database Connection actor:
• Database Query actor:
Database Actors Example
Example: Web service-enabling Ptolemy II
(Kepler-SDM; Ilkay Altintas)
A Generic Web Service Actor
Configure – select WSDL url from repository
Configure – select service operation
Set Parameters and Commit Specialized Actor
Web Service Actor after Instantiation
Composing Third-Party Web Services
Output of previous web service
User interaction & Transformations
Input of next web service
Results of the Execution
User I/O via standard browser!
Run Window / WF Deployment
Composing Legacy Applications (here: Phylogeny):
Shell / Command-Line Actors
Example: Grid-enabling Ptolemy II
(Kepler-SEEK, Chad Berkley; Kepler-SDM, Ilkay Altintas; … myGrid?, … GriPhyN?, … OGS{I|A}-[DAI] ...)
Transparently Grid-Enabling PTII: Handles
Logical token transfer (3) requires get_handle (1, 2); then exec_handle (4, 5, 6, 7) for completion.
[Figure: actors A and B in PTII space; grid service hosts GA and GB in Grid space; arrows numbered 1 to 7]
1. A → GA: get_handle
2. GA → A: return &X
3. A → B: send &X
4. B → GB: request &X
5. GB → GA: request &X
6. GA → GB: send *X
7. GB → B: send done(&X)
Example:
&X = “GA.17”
*X = <some_huge_file>
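The seven-step handle exchange can be sketched as a small pure model. This is an illustration only, assuming that each grid host is just a key/value store and that a handle is a (host, id) pair; the names `Handle`, `Grid`, `getHandle` and `execHandle` are hypothetical and do not describe an existing Kepler or Ptolemy II API. The point it shows is that actors A and B only pass the small handle &X, while the bulky payload *X moves directly between the hosts GA and GB.

```haskell
module Main where

import qualified Data.Map as Map

-- A handle names a data object by grid host and local id, e.g. "GA.17".
data Handle = Handle { host :: String, localId :: Int } deriving (Eq, Ord)

instance Show Handle where
  show (Handle h i) = h ++ "." ++ show i

-- Assumption: each grid host is modelled as a store from handles to payloads.
type Store = Map.Map Handle String
type Grid  = Map.Map String Store   -- host name -> that host's store

-- Steps 1-2: actor A registers its output at a host and gets back &X.
getHandle :: String -> Int -> String -> Grid -> (Handle, Grid)
getHandle h i payload grid =
  let hdl = Handle h i
  in (hdl, Map.insertWith Map.union h (Map.singleton hdl payload) grid)

-- Steps 4-7: the destination host pulls *X from the host named in &X.
-- Returns Nothing if the handle is unknown at its source host.
execHandle :: String -> Handle -> Grid -> Maybe Grid
execHandle dest hdl grid = do
  srcStore <- Map.lookup (host hdl) grid
  payload  <- Map.lookup hdl srcStore
  return (Map.insertWith Map.union dest (Map.singleton hdl payload) grid)

main :: IO ()
main = do
  let (x, g1) = getHandle "GA" 17 "<some_huge_file>" Map.empty  -- steps 1-2
  putStrLn ("A -> B: send &X = " ++ show x)                      -- step 3
  case execHandle "GB" x g1 of                                   -- steps 4-7
    Nothing -> putStrLn "transfer failed: unknown handle"
    Just g2 -> print (Map.lookup "GB" g2 >>= Map.lookup x)
```

A real implementation would replace the in-memory maps with third-party transfers between hosts; the sketch only fixes the order of the messages.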
Transparently Grid-Enabling PTII
• Different phases
– Register designed WF (could include external validation service)
– Find suitable grid service hosts for actors
– Pre-stage execution
– Execute (w/ provenance)
• Interactively steer (pause; revise; resume)
• Batch process; re-run parts later
– Register/store data products and execution logs
• Kepler implementation choices:
– Grid-actors (no change of Director necessary!?) and/or
– Grid-(PN)-director (also need to change actors!?)
– Add grid service host id as actor parameter: A@GA
– Similar for data: myDB@GA
“C-z ; bg &” – Detach your WF execution!
• Currently in PTII
– tight coupling of WF execution and PTII Java client (also Vergil GUI)
• To-do for Kepler:
– detaching WF console (Vergil) from a Grid-aware execution engine
Grid-PN Director!
[Figure callouts: transport protocol parameter; data location parameter; host location parameter]
Semantic Type-enabling Ptolemy II
(OWL – here we go… ;-)
(Kepler-SEEK; Shawn Bowers)
Semantic Type Extensions
• Take concepts and relationships from an ontology to
“semantically type” the data-in/out ports
• Application: e.g., design support:
– smart/semi-automatic wiring, generation of “massaging actors”
[Figure: actor m1 (normalize) with input port p3 and output port p4]
p3: Takes Abundance Count Measurements for Life Stages
p4: Returns Mortality Rate Derived Measurements for Life Stages
Semantic Types
• The semantic type signature
– Type expressions over the (OWL) ontology
[Figure: actor m1 (normalize) with input port p3 and output port p4]
SemType m1 ::
Observation & itemMeasured.AbundanceCount &
hasContext.appliesTo.LifeStageProperty
->
DerivedObservation & itemMeasured.MortalityRate &
hasContext.appliesTo.LifeStageProperty
Extended Type System (here: OWL Semantic Types)
SemType m1 ::
Observation & itemMeasured.AbundanceCount &
hasContext.appliesTo.LifeStageProperty
->
DerivedObservation & itemMeasured.MortalityRate &
hasContext.appliesTo.LifeStageProperty
Substructure association:
XML raw-data =(X)Query=> object model =link => OWL ontology
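One way to read the semantic type signature is as a wiring check: an output port may feed an input port when everything the input requires is implied by what the output guarantees. The following is a minimal sketch of that idea, assuming a toy ontology given as explicit subconcept pairs and treating path expressions such as `itemMeasured.MortalityRate` as opaque strings. It is not the SEEK/Kepler implementation and does no OWL reasoning; `isA`, `subsumedBy` and `canWire` are hypothetical names.

```haskell
module Main where

-- Toy ontology as (subconcept, superconcept) pairs; assumed to be acyclic.
isA :: [(String, String)]
isA = [("DerivedObservation", "Observation")]

-- Reflexive, transitive subsumption over the toy ontology.
subsumedBy :: String -> String -> Bool
subsumedBy c d =
  c == d || any (\(sub, sup) -> sub == c && subsumedBy sup d) isA

-- A semantic type is a conjunction (&) of concept or path expressions.
type SemType = [String]

-- Output type of m1, taken from the signature on the slide.
m1Out :: SemType
m1Out =
  [ "DerivedObservation"
  , "itemMeasured.MortalityRate"
  , "hasContext.appliesTo.LifeStageProperty" ]

-- Two hypothetical downstream input ports.
needsMortality, needsAbundance :: SemType
needsMortality = ["Observation", "itemMeasured.MortalityRate"]
needsAbundance = ["Observation", "itemMeasured.AbundanceCount"]

-- Wiring is allowed if every required conjunct is implied by some
-- conjunct that the output guarantees.
canWire :: SemType -> SemType -> Bool
canWire out inp = all (\need -> any (`subsumedBy` need) out) inp

main :: IO ()
main = do
  print (canWire m1Out needsMortality)  -- prints True
  print (canWire m1Out needsAbundance)  -- prints False
```

A failed check is where a design tool could propose a “massaging actor” instead of rejecting the connection outright.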
Programming Extensions
(some lessons from SciDAC/SSDBM demo)
Promoter Identification Workflow in Ptolemy-II (SSDBM’03)
[Figure callouts on the workflow diagram:]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit (appears twice)
• hand-crafted Web-service actor
• No data transformations available
• Complex backward control-flow
Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                 -- start with some gene-id
d1 = genBankG d0             -- get its gene sequence from GenBank
d2 = blast d1                -- BLAST to get a list of potential promoters
d3 = map genBankP d2         -- get list of promoter sequences
d4 = map promoterRegion d3   -- compute list of promoter regions and ...
d5 = map transfac d4         -- ... get transcription factor binding sites
d6 = zip d2 d4               -- create list of pairs promoter-id/region
d7 = map gpr2str d6          -- pretty print into a list of strings
d8 = concat d7               -- concat into a single "file"
d9 = putStr d8               -- output that file
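To make the specification above executable, the six services need bodies. The sketch below fills them with placeholder stubs that only manipulate strings; the stub bodies and all constructor names other than `Gid` are assumptions made for illustration, and none of them contacts GenBank, BLAST or TRANSFAC. The pipeline d0 to d8 is copied from the specification unchanged.

```haskell
module Main where

newtype GeneId         = Gid String
newtype GeneSeq        = GeneSeq String
newtype PromoterId     = Pid String
newtype PromoterSeq    = PromoterSeq String
newtype PromoterRegion = PromoterRegion String
newtype TFBS           = TFBS String

-- Stub services (assumed behaviour, for illustration only).
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("seq-of-" ++ g)

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq-of-" ++ p)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "-p" ++ show n) | n <- [1 :: Int, 2]]

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion ("region(" ++ s ++ ")")

transfac :: PromoterRegion -> [TFBS]
transfac (PromoterRegion r) = [TFBS ("tfbs@" ++ r)]

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ "\t" ++ r ++ "\n"

main :: IO ()
main = do
  let d0 = Gid "7"                 -- start with some gene-id
      d1 = genBankG d0             -- gene sequence (stub)
      d2 = blast d1                -- list of potential promoters (stub)
      d3 = map genBankP d2         -- list of promoter sequences
      d4 = map promoterRegion d3   -- list of promoter regions
      d5 = map transfac d4         -- binding sites per region
      d6 = zip d2 d4               -- pairs promoter-id/region
      d7 = map gpr2str d6          -- one line of text per pair
      d8 = concat d7               -- a single "file"
  putStr d8
  putStrLn (show (length (concat d5)) ++ " binding sites (stub data)")
```

Swapping a stub for a real service call changes one definition and leaves the dataflow untouched.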
Cleaned up Process Network PIW
• Back to purely functional dataflow process network (= also a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
→ no control-flow spaghetti
→ data-intensive apps
→ free concurrent execution
→ free type checking
→ automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Figure callouts: map(f)-style iterators; powerful type checking; generic, declarative “programming” constructs; generic data transformation actors; forward-only, abstractable subworkflow piw(GeneId)]
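The step from piw(GeneId) to PIW := map(piw) over [GeneId] can be written down directly. In this sketch the three services are passed in as arguments so that the example stays self-contained; the string-based stand-ins in `demoPiw` are assumptions for illustration only, and the names `demoPiw` and `piwAll` are hypothetical.

```haskell
module Main where

-- Forward-only subworkflow: gene id -> list of (promoter, region) pairs.
-- The services are parameters: getSeq plays the role of genBankG,
-- blastIt of blast, and region of promoterRegion composed with genBankP.
piw :: (g -> s) -> (s -> [p]) -> (p -> r) -> g -> [(p, r)]
piw getSeq blastIt region geneId =
  [ (p, region p) | p <- blastIt (getSeq geneId) ]

-- Stand-in services on plain strings (assumed, not real service calls).
demoPiw :: String -> [(String, String)]
demoPiw =
  piw (\gid -> "seq" ++ gid)
      (\s -> [s ++ "-p1", s ++ "-p2"])
      (\p -> "region(" ++ p ++ ")")

-- PIW := map(piw) over [GeneId]; no extra control flow is needed.
piwAll :: [String] -> [[(String, String)]]
piwAll = map demoPiw

main :: IO ()
main = mapM_ print (piwAll ["7", "8"])
```

Since `piw` has no backward edges, lifting it over a list needs nothing beyond `map`.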
Optimization by Declarative Rewriting I
• PIW as a declarative, referentially transparent functional process
→ optimization via functional rewriting possible
e.g. map(f o g) = map(f) o map(g)
[Figure callouts: map(f o g) instead of map(f) o map(g); combination of map and zip]
• Details:
– Technical report & PIW specification in Haskell
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
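The rewriting rule map(f o g) = map(f) o map(g) can be checked on a small example. The functions `f` and `g` below are arbitrary placeholders chosen for illustration; the check also runs on a prefix of an infinite list to suggest that the same law carries over to streams.

```haskell
module Main where

-- Two arbitrary pure steps standing in for workflow actors.
f :: Int -> Int
f = (* 2)

g :: Int -> Int
g = (+ 1)

-- Two traversals with an intermediate list between them.
unfused :: [Int] -> [Int]
unfused = map f . map g

-- One traversal, no intermediate list.
fused :: [Int] -> [Int]
fused = map (f . g)

main :: IO ()
main = do
  let xs = [1 .. 10]
  print (unfused xs)                                      -- prints [4,6,8,10,12,14,16,18,20,22]
  print (unfused xs == fused xs)                          -- prints True
  print (take 3 (unfused [1 ..]) == take 3 (fused [1 ..]))  -- prints True
```

The rewrite is only safe because `f` and `g` are free of side effects, which is what referential transparency provides.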
Optimizing II: Streams & Pipelines
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
• Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g → mapS (f • g)
Summary
• Many (most of ours anyways) scientific workflows are dataflows
– lots of workflow “standards” (messy and not focused on SWF problems)
– should we start a new wave of dataflow standards??
• Importance of clear semantics for
– different MoCs (models of computation: PN, SDF, DE, CT, …)
– component composition across MoCs
– component interaction
→ Ptolemy II directors
• Kepler:
– Based on extensible Ptolemy II system
– Cross-project activity (SEEK, SDM, Ptolemy II, GEON, BIRN, and counting)
– Plug-in / interface with your SWF planner, execution engine, grid-WF tool!
Your Projects & Icons <HERE>
A Note on the Style of these Slides
Due to lack of time, most of the following slides are “by reference” only ;-)
– …Each speaker was given four minutes to present his paper, as there were so
many scheduled -- 198 from 64 different countries. To help expedite the
proceedings, all reports had to be distributed and studied beforehand, while the
lecturer would speak only in numerals, calling attention in this fashion to the
salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation
immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and
therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only
22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that
matter; Hazelton countered this objection with the crushing retort that, either
way, 22. I turned to the number key in his paper and discovered that 22 meant the
end of the world… [The Futurological Congress, Stanislaw Lem, translated from
the Polish by Michael Kandel, Futura 1977]
F I N: Words to/from the Wise
FYI: Flow-based programming has been re-discovered/re-invented several times by
different communities. Here is an “IBM practitioner’s view”:
– Flow-based Programming, http://www.jpaulmorrison.com/fbp/
…
In "Flow-Based Programming" (FBP), applications are defined as networks of "black box" processes, which
exchange data across predefined connections. These black box processes can be reconnected endlessly to form
different applications without having to be changed internally. It is thus naturally component-oriented. To describe this
capability, the distinguished IBM engineer, Nate Edwards, coined the term "configurable modularity", which he calls
the basis of all true engineered systems.
When using FBP, the application developer works with flows of data, being processed asynchronously, rather than the
conventional single hierarchy of sequential, procedural code. It is thus a good fit with multiprocessor computers, and
also with modern embedded software. In many ways, an FBP application resembles more closely a real-life factory,
where items travel from station to station, undergoing various transformations. Think of a soft drink bottling factory,
where bottles are filled at one station, capped at the next and labelled at yet another one. FBP is therefore highly
visual: it is quite hard to work with an FBP application without having the picture laid out on one's desk, or up on a
screen! For an example, see Sample DrawFlow Diagram.
Strangely though, in spite of being at the leading edge of application development, it is also simple enough that trainee
programmers can pick it up, and it is a much better match with the primitives of data processing than the conventional
primitives of procedural languages. The key, of course (and perhaps the reason why it hasn't caught on more widely), is
that it involves a significant paradigm shift that changes the way you look at programming, and once you have made
this transition, you find you can never go back!
FBP seems to dovetail neatly with a concept that I call "smart data". There is a section on this in stuff about the author.
A new web page on this topic has just been uploaded - see "Smart Data" and Business Data Types - and we will be
publishing more as it develops.
…