The KEPLER Scientific Workflow System

Bertram Ludäscher
Ilkay Altintas
… & the Kepler Team
San Diego Supercomputer Center
University of California, San Diego
SDM Center AHM, LBL, August 3-5, 2004
Outline
• Project Overview
– from Ptolemy II to Kepler
• Workflow Modeling Issues
– from Dataflow to Control-flow (CCA et al)
• Current Kepler Features
– from plumbing to distributed execution
• Example Workflows
– from bioinformatics to geoinformatics
• Future Plans
– from today to tomorrow ;-)
Kepler, B. Ludäscher, SDSC
What is a Scientific Workflow (SWF)?
• Goals:
– automate a scientist’s repetitive steps (data analysis, data
transformation, computational steps, …)
– can encompass data generation, aggregation, analysis,
visualization (WF granularity)
– design, test, share, deploy, execute, reuse, … SWFs
• Typical requirements/characteristics:
– data-intensive and/or compute-intensive
– plumbing-intensive
– dataflow-oriented
– distribution (data, processing)
– user-interaction “in the middle”, …
– … vs. (C-z; bg; fg)-ing (“detach” and reconnect)
– advanced programming constructs (map(f), zip, takewhile, …)
– logging, provenance, “registering back” (intermediate) products…
• … easy to recognize a SWF when you see one!
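The “advanced programming constructs” listed above come from functional programming. As a rough illustration only (plain Python rather than Kepler actors; the labels, readings, `calibrate` step and threshold are made up for this sketch), they behave as follows:

```python
from itertools import takewhile

# Hypothetical sensor labels and readings (illustrative data only).
labels = ["s1", "s2", "s3", "s4"]
readings = [2.0, 3.5, 9.0, 1.0]


def calibrate(x):
    """Example per-item step f, as used with map(f)."""
    return x * 10


# map(f): apply the same step to every item of a stream.
calibrated = list(map(calibrate, readings))

# zip: pair up two streams item by item.
paired = list(zip(labels, calibrated))

# takewhile: pass items through until the condition first fails.
below_limit = list(takewhile(lambda pair: pair[1] < 50, paired))

print(calibrated)
print(paired)
print(below_limit)
```

Note that takewhile stops at the first item that fails the test, so the last reading is dropped even though it is below the limit.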
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
Source: NIH BIRN (Jeffrey Grethe, UCSD)
Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Figure: GARP analysis pipeline. Steps shown: EcoGrid Query, Registered EcoGrid Database, Sample Data, Data Calculation, Layer Integration, Map Generation, Validation, Generate Metadata, Archive to EcoGrid, User. Data products shown: (a) species presence & absence points (native range, invasion area); (b) environmental layers (native range, invasion area); (c) integrated layers (native range, invasion area); (d) training sample and test sample; (e) GARP rule set; (f) native range prediction map and invasion area prediction map; (g) model quality parameter; (h) selected prediction maps.]
Source: NSF SEEK (Deana Pennington et al., UNM)
Starting Point for SDMCenter/SPA + SEEK: Ptolemy II
read! see! try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
An Early Example: Promoter Identification
SSDBM, AD 2003
• Scientist models application as a “workflow” of connected components (“actors”)
• If all components exist, the workflow can be automated/executed
• Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
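To make “pipelined” execution concrete, here is a minimal sketch in the spirit of a process network: each actor runs in its own thread and blocks when its input queue is empty. This is plain Python, not Ptolemy II or Kepler code; the actor functions, the `DONE` end-of-stream marker and `run_pipeline` are assumptions of this sketch.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def source(out_q, items):
    """Actor with no inputs: emits each item, then the end marker."""
    for item in items:
        out_q.put(item)
    out_q.put(DONE)


def transform(in_q, out_q, fn):
    """Actor that blocks on its input queue and applies fn to each token."""
    while True:
        token = in_q.get()  # blocks while the input is empty
        if token is DONE:
            out_q.put(DONE)
            return
        out_q.put(fn(token))


def sink(in_q, results):
    """Actor with no outputs: collects tokens until the end marker."""
    while True:
        token = in_q.get()
        if token is DONE:
            return
        results.append(token)


def run_pipeline(items):
    """Plays the role of the director: wires the queues, one thread per actor."""
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    results = []
    actors = [
        threading.Thread(target=source, args=(q1, items)),
        threading.Thread(target=transform, args=(q1, q2, str.upper)),
        threading.Thread(target=transform, args=(q2, q3, lambda s: s[::-1])),
        threading.Thread(target=sink, args=(q3, results)),
    ]
    for thread in actors:
        thread.start()
    for thread in actors:
        thread.join()
    return results


print(run_pipeline(["acgt", "ttga"]))
```

Because every stage runs concurrently, the second stage can start on the first token while the source is still emitting later ones; token order is preserved by the FIFO queues.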
Why Ptolemy II (and thus KEPLER)?
• Ptolemy II Objective:
– “The focus is on assembly of concurrent components. The key underlying principle
in the project is the use of well-defined models of computation that govern the
interaction between components. A major problem area being addressed is the use
of heterogeneous mixtures of models of computation.”
• Dataflow Process Networks w/ natural pipelining/streaming support
• User-Orientation
– Workflow design & exec console (Vergil GUI)
– “Application/Glue-Ware”
• excellent modeling and design support
• run-time support, monitoring, …
• not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
• but middle-/underware is conveniently accessible through actors!
• PRAGMATICS
– Ptolemy II is mature, continuously extended & improved, well-documented
(500+pp)
– open source system
– Ptolemy II folks actively participate in KEPLER
KEPLER: An Open Collaboration
• “Founding projects”:
– DOE SDM/SPA and NSF SEEK
• Open Source (BSD-style license)
• Intensive Communications:
– Web-archived mailing lists
– IRC (!)
• Co-development:
– via shared CVS repository
– joining as a new co-developer (currently):
• get a CVS account (read-only)
• local development + contribution via existing KEPLER member
• be voted “in” as a member/co-developer
• Software & social engineering
– How to better accommodate new groups/communities?
– How to better accommodate different usage/contribution models (core
dev … special purpose extender … user)?
KEPLER/CSP:
Contributors, Sponsors, Projects
(or loosely coupled Communicating Sequential Persons ;-)
Ilkay Altintas SDM, Resurgence
Kim Baldridge Resurgence, NMI
Chad Berkley SEEK
Shawn Bowers SEEK
Terence Critchlow SDM
Tobin Fricke ROADNet
Jeffrey Grethe BIRN
Christopher H. Brooks Ptolemy II
Zhengang Cheng SDM
Dan Higgins SEEK
Efrat Jaeger GEON
Matt Jones SEEK
Werner Krebs EOL
Edward A. Lee Ptolemy II
Kai Lin GEON
Bertram Ludäscher BIRN, SDM, SEEK, GEON
Mark Miller EOL
Steve Mock NMI
Steve Neuendorffer Ptolemy II
Jing Tao SEEK
Mladen Vouk SDM
Xiaowen Xin SDM
Yang Zhao Ptolemy II
Bing Zhu SEEK
History
• Gabriel (1986-1991)
– Written in Lisp
– Aimed at signal processing
– Synchronous dataflow (SDF) block diagrams
– Parallel schedulers
– Code generators for DSPs
– Hardware/software co-simulators
• Ptolemy Classic (1990-1997)
– Written in C++
– Multiple models of computation
– Hierarchical heterogeneity
– Dataflow variants: BDF, DDF, PN
– C/VHDL/DSP code generators
– Optimizing SDF schedulers
– Higher-order components
• Ptolemy II (1996-2022)
– Written in Java
– Domain polymorphism
– Multithreaded
– Network integrated
– Modal models
– Sophisticated type system
– CT, HDF, CI, GR, etc.
• PtPlot (1997-??)
– Java plotting package
• Tycho (1996-1998)
– Itcl/Tk GUI framework
• Diva (1998-2000)
– Java GUI framework
• Copernicus (code generator)
• KEPLER (2003-2028)
– scientific workflow extensions
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflows
KEPLER = “Ptolemy II + X” for Scientific Workflows
Source (Ptolemy): Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
KEPLER then …
… and KEPLER today …
… so, you see, scientific workflows need domain- and data-polymorphic actors & must scale to HPC!
What’s a scientific workflow?
What’s a polymorphic actor?
What is HPC?
BTW: Kepler is NOT a GUI (Vergil is)
The KEPLER/Ptolemy II GUI (Vergil)
“Directors” define the component interaction & execution semantics
Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)
Actor-Oriented Design
• Object orientation:
– diagram labels: class name; data; methods; call, return
– What flows through an object is sequential control (cf. CCA, MPI)
• Actor/Dataflow orientation:
– diagram labels: actor name; data (state); parameters; ports; input data, output data
– What flows through an object is a stream of data tokens (in SWFs/KEPLER also references!!)
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
Object-Oriented vs. Actor-Oriented Interfaces
Object Oriented:
TextToSpeech
initialize(): void
notify(): void
isReady(): boolean
getSpeech(): double[]
OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition.
Actor/Dataflow Oriented:
AO interface definition says “Give me text and I’ll give you speech”
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
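The contrast can be sketched in a few lines of Python. This is an illustration only, not Ptolemy II code: `fake_synthesize` is a stand-in for real speech synthesis, and passing the text through `notify(text)` is an assumption of this sketch (the slide's `notify()` takes no argument).

```python
def fake_synthesize(text):
    """Stand-in for real synthesis: one float sample per character."""
    return [ord(ch) / 255.0 for ch in text]


class TextToSpeechObject:
    """OO style: callers must know the call order, which the interface omits."""

    def __init__(self):
        self._initialized = False
        self._text = None

    def initialize(self):
        self._initialized = True

    def notify(self, text):
        if not self._initialized:
            raise RuntimeError("initialize() must be called before notify()")
        self._text = text

    def is_ready(self):
        return self._initialized and self._text is not None

    def get_speech(self):
        if not self.is_ready():
            raise RuntimeError("no text has been supplied yet")
        return fake_synthesize(self._text)


def text_to_speech_actor(text_tokens):
    """AO style: text tokens in, speech tokens out; no call order to learn."""
    for text in text_tokens:
        yield fake_synthesize(text)


# OO usage: correct only if the three calls happen in this order.
tts = TextToSpeechObject()
tts.initialize()
tts.notify("hi")
oo_result = tts.get_speech()

# AO usage: connect a stream of text to the input and read the output.
ao_result = list(text_to_speech_actor(["hi"]))
print(oo_result == ao_result[0])
```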
Ptolemy II: Actor-Oriented Modeling
• Component (“actor”) interaction semantics not hard-wired
inside components, but “factored out” in a “director”
• Different directors for different modeling and execution
needs (… can even be combined!)
Better abstraction, modeling, component reuse, …
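A minimal sketch of this factoring, assuming two made-up directors (the class names, `fire`, `run` and the firing log are this sketch's own, not the Ptolemy II API): the actors stay unchanged while the director decides the firing order.

```python
class Actor:
    """A component that only knows how to transform one token."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def fire(self, token):
        return self.fn(token)


class TokenAtATimeDirector:
    """Pushes each token through the whole chain before taking the next."""

    def run(self, actors, tokens, log):
        out = []
        for token in tokens:
            for actor in actors:
                log.append(actor.name)
                token = actor.fire(token)
            out.append(token)
        return out


class StageAtATimeDirector:
    """Runs one actor over all tokens before moving to the next actor."""

    def run(self, actors, tokens, log):
        current = list(tokens)
        for actor in actors:
            following = []
            for token in current:
                log.append(actor.name)
                following.append(actor.fire(token))
            current = following
        return current


chain = [Actor("double", lambda x: 2 * x), Actor("inc", lambda x: x + 1)]
log_a, log_b = [], []
result_a = TokenAtATimeDirector().run(chain, [1, 2], log_a)
result_b = StageAtATimeDirector().run(chain, [1, 2], log_b)
print(result_a, log_a)
print(result_b, log_b)
```

Both directors produce the same results for these stateless actors; only the order of firings in the log differs, which is the part the director owns.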
Behavioral Polymorphism in Ptolemy
«Interface» Receiver
+get() : Token
+getContainer() : IOPort
+hasRoom() : boolean
+hasToken() : boolean
+put(t : Token)
+setContainer(port : IOPort)
These polymorphic methods implement the communication semantics of a domain in Ptolemy II. The receiver instance used in communication is supplied by the director, not by the component. (cf. CCA, WS-??, [G]BPL4??, … !)
Behavioral polymorphism is the idea that components can be defined to operate with multiple models of computation and multiple middleware frameworks.
[Figure labels: Director; producer actor; IOPort; Receiver; consumer actor]
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
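A rough Python rendering of the idea, assuming two made-up receiver kinds (`FifoReceiver`, `SingleSlotReceiver`) and a simplified `Director` with a hypothetical `new_receiver` method; only `put`, `get`, `has_token` and `has_room` mirror methods named on the slide.

```python
from abc import ABC, abstractmethod
from collections import deque


class Receiver(ABC):
    """Python rendering of part of the Receiver interface on the slide."""

    @abstractmethod
    def put(self, token): ...

    @abstractmethod
    def get(self): ...

    @abstractmethod
    def has_token(self): ...

    @abstractmethod
    def has_room(self): ...


class FifoReceiver(Receiver):
    """Unbounded queue: the producer can always put."""

    def __init__(self):
        self._queue = deque()

    def put(self, token):
        self._queue.append(token)

    def get(self):
        return self._queue.popleft()

    def has_token(self):
        return bool(self._queue)

    def has_room(self):
        return True


class SingleSlotReceiver(Receiver):
    """Holds at most one token; a second put without a get is an error."""

    def __init__(self):
        self._slot = []

    def put(self, token):
        if not self.has_room():
            raise RuntimeError("receiver is full")
        self._slot.append(token)

    def get(self):
        return self._slot.pop()

    def has_token(self):
        return bool(self._slot)

    def has_room(self):
        return not self._slot


class Director:
    """The director, not the actor, decides which receiver a port gets."""

    def __init__(self, receiver_class):
        self._receiver_class = receiver_class

    def new_receiver(self):
        return self._receiver_class()


def produce(receiver, tokens):
    """Producer actor code is identical under both directors."""
    sent = 0
    for token in tokens:
        if not receiver.has_room():
            break
        receiver.put(token)
        sent += 1
    return sent


fifo_sent = produce(Director(FifoReceiver).new_receiver(), [1, 2, 3])
slot_sent = produce(Director(SingleSlotReceiver).new_receiver(), [1, 2, 3])
print(fifo_sent, slot_sent)
```

The same `produce` code sends three tokens under one director and one token under the other, because the communication semantics live in the receiver.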
Component Composition & Interaction
[Figure: linked components labeled DIR1, DIR2, DIR3, DIR4, ???]
• Components linked via ports
• Dataflow (and msg/ctl-flow)
• Where is the component interaction semantics defined??
– each component is its own director!
• But still useful for special applications, e.g. parallel programs (MPI, …)
Source: GRIST/SC4DEVO workshop, July 2004, Caltech
Data/Control-Flow Spectrum
“clean” data(=ctl)-flow
special tokens flow
message passing, control flow
• Data (tokens) flow
– (almost) no other side effects
– WYSIWYG (usually)
• References flow
– token reference type may be “http-get”, “ftp-get”, “hsi put”…
– generic handling still possible
• Application specific tokens flow
– e.g. current Nimrod job management in Resurgence
– “invisible contract” between components
– Director is unaware of what’s going on … (sounds familiar? ;-)
• Specific message-passing protocols (e.g., CSP, MPI)
– for systems of tightly coupled components
CCA via special (“look the other way”) Director(s)?
CCA!?
• Dataflow in CCA
• a CCA “convention” can be used to accommodate actor-oriented/dataflow modeling
• CCA/Message Passing in KEPLER
• Kepler/Ptolemy can be extended to accommodate message
passing semantics (CSP is already in Ptolemy II)
Domains and Directors: Semantics for Component Interaction
• CI – Push/pull component interaction
• CSP – concurrent threads with rendezvous
• CT – continuous-time modeling
• DE – discrete-event systems
• DDE – distributed discrete events
• FSM – finite state machines
• DT – discrete time (cycle driven)
• Giotto – synchronous periodic
• GR – 2-D and 3-D graphics
• PN – process networks
• SDF – synchronous dataflow
• SR – synchronous/reactive
• TM – timed multitasking
[Callouts on slide: “For (finer-grained) concurrent jobs!?”; “For (coarse grained) Scientific Workflows!”]
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
Polymorphic Actor Components Working Across Data Types and Domains
• Actor Data Polymorphism:
– Add numbers (int, float, double, Complex)
– Add strings (concatenation)
– Add complex types (arrays, records, matrices)
– Add user-defined types
• Actor Behavioral Polymorphism:
– In dataflow, add when all connected inputs have data
– In a time-triggered model, add when the clock ticks
– In discrete-event, add when any connected input has
data, and add in zero time
– In process networks, execute an infinite loop in a thread
that blocks when reading empty inputs
– In CSP, execute an infinite loop that performs
rendezvous on input or output
– In push/pull, ports are push or pull (declared or inferred)
and behave accordingly
– In real-time CORBA, priorities are associated with ports
and a dispatcher determines when to add
By not choosing among these when defining the component, we get a huge increment in component reusability. But how do we ensure that the component will work in all these circumstances?
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
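Data polymorphism for an “Add” component can be sketched as below. This is plain Python rather than the Ptolemy II type system; the `Money` type is hypothetical, and element-by-element addition of lists is an assumption of this sketch.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Money:
    """Hypothetical user-defined token type that defines its own addition."""

    cents: int

    def __add__(self, other):
        return Money(self.cents + other.cents)


def add_pair(a, b):
    """Add two tokens; lists are added element by element."""
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            raise ValueError("arrays must have the same length")
        return [add_pair(x, y) for x, y in zip(a, b)]
    return a + b


def add_actor(*inputs):
    """Fire once: combine one token from every connected input."""
    return reduce(add_pair, inputs)


print(add_actor(1, 2.5))                   # numbers
print(add_actor(1 + 2j, 3))                # complex numbers
print(add_actor("ab", "cd"))               # strings (concatenation)
print(add_actor([1, 2], [10, 20]))         # arrays
print(add_actor(Money(150), Money(250)))   # user-defined type
```

The actor body never names a concrete type; it relies on each token type knowing how to add itself, which is what lets one component serve all of the cases in the list.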
Directors and Combining Different Component Interaction Semantics
Possible app. in SWF:
• time-series aware …
• parameter-sweep aware …
• XY aware …
… execution models
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
A Few Specific Kepler Features and
Example Workflows
Web Services → Actors (WS Harvester)
→ “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components
Recent Actor Additions
Digression: Who are the clients?
• Domain scientists
– C/Perl/Python/Java/WS/DB-enabled ones
– others (the rest of us?)
• Goal: make life better for both categories!
– Workflow automation
– Plumbing support
– Execution monitoring, steering, runtime revision
(pause-inspect-modify-resume cycle)
GEON Mineral Classification Workflow
… inside the Classifier
BrowserUI actor w/
SVG client display
GEON Dataset Generation & Registration
(and co-development in KEPLER)
% Makefile
$> ant run
Matt et al. (SEEK)
SQL database access (JDBC)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al. (Ptolemy)
GEON Data Registration UI
GEON Data Registration in KEPLER
Registered Resources show up in Vergil
(joint SEEK, SPA, GEON, … Registry!?)
Data Analysis: Biodiversity Indices
Traffic info for a list of highways: Uses
iterate (higher-order “map”) actor to access
highway info web service repeatedly, sending
out one email per highway.
Re-engineered PIW w/ Iteration Constructs
AD 2004
map(GenbankWS)
Input: {“NM_001924”, “NM020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
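The map construct above can be sketched as a higher-order function that lifts a one-item service to a whole collection. This is an illustration only: `genbank_ws_stub` is a hypothetical stand-in that makes no network call, and its canned values are the truncated strings from the slide, not real sequences.

```python
def genbank_ws_stub(accession):
    """Hypothetical stand-in for the GenBank web service actor."""
    canned = {
        "NM_001924": "CAGT…AATATGAC",  # truncated on the slide
        "NM020375": "GGGGA…CAAAGA",    # truncated on the slide
    }
    return canned[accession]


def map_actor(service):
    """Higher-order actor: apply a one-item service to every item."""

    def fire(collection):
        return [service(item) for item in collection]

    return fire


fetch_all = map_actor(genbank_ws_stub)
sequences = fetch_all(["NM_001924", "NM020375"])
print(sequences)
```

The wrapped service stays unaware of collections; the iteration lives in one reusable construct instead of being rebuilt with ad hoc loop wiring in each workflow.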
Streaming Real-time Data
Straightforward Example:
Laser Strainmeter Channels in;
Scientific Workflow;
Earth-tide signal out
Seismic Waveforms
ORB
Job Management (here: NIMROD)
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum
mechanics) – Fall/Winter’04
KEPLER Today
• Support for SWF life cycle
– Design, share, prototype, run, monitor, deploy, …
• Coarse-grained scientific workflows, e.g.,
– web service actors, grid actors, command-line actors, …
• Fine grained workflows and simulations, e.g.,
– Database access, XSLT transformations, …
• Kepler Extensions
– SDM Center/SPA: support for data- and compute-intensive workflows!
– real-time data streaming (ROADNet)
– other special and generic extensions (e.g. GEON, SEEK)
• Status
– first release (alpha) was in May 2004
– nightly builds w/ version tests
– “Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …)
– Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)
KEPLER Tomorrow
• Application-driven extensions:
– access to/integration with other IDMAF components
• SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …
– support for execution of new SWF domains
• Astrophysics: TSI/Blondin (SPA/NCSU)
• Nuclear Physics: Swesty (SPA/LLNL)
• …
• Generic extensions:
– additional support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …)
– (C-z; bg; fg)-ing (“detach” and reconnect)
– workflow deployment models
• Additional “domain awareness” (e.g. via new directors)
– time series, parameter sweeps, job scheduling, …
– hybrid type system with semantic types
• Consolidation
– More installers, regular releases, improved documentation, …
KEPLER & SPA
First alpha releases since May 2004
http://kepler.ecoinformatics.org
https://www-casc.llnl.gov/sdm/
Hybrid Types (Structure + Semantics)
• Services can be semantically compatible, but
structurally incompatible
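[Figure: a Source Service with port Ps and a Target Service with port Pt, joined by a “Desired Connection” from Ps to Pt. Labels: Ontologies (OWL); Semantic Type Ps is Compatible (⊑) with Semantic Type Pt; Structural Type Ps is Incompatible (⋠) with Structural Type Pt; further labels “(Ps)” and “(≺)” appear next to Structural Type Pt.]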
Source: [Bowers-Ludaescher, DILS’04]
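A much-simplified sketch of the two checks, under stated assumptions: the mini-ontology, concept names, port descriptions and field-by-field structural test are all invented for illustration and are not the formalism of the cited paper.

```python
# Hypothetical mini-ontology: each concept maps to its parent concept.
ONTOLOGY = {
    "SpeciesCount": "Measurement",
    "Measurement": "Observation",
    "Observation": None,
}


def semantically_compatible(source_concept, target_concept):
    """True if source_concept equals or is a subconcept of target_concept."""
    concept = source_concept
    while concept is not None:
        if concept == target_concept:
            return True
        concept = ONTOLOGY.get(concept)
    return False


def structurally_compatible(source_struct, target_struct):
    """True if the source supplies every target field with the same type."""
    return all(
        source_struct.get(name) == field_type
        for name, field_type in target_struct.items()
    )


def check_connection(source_port, target_port):
    """Return (semantic_ok, structural_ok) for a desired connection."""
    return (
        semantically_compatible(source_port["semantic"], target_port["semantic"]),
        structurally_compatible(source_port["structure"], target_port["structure"]),
    )


ps = {"semantic": "SpeciesCount", "structure": {"count": "int", "site": "string"}}
pt = {"semantic": "Measurement", "structure": {"value": "double", "location": "string"}}
print(check_connection(ps, pt))
```

Here the ports agree on meaning but not on shape, which is the situation the slide describes: the connection is desirable, but some structural transformation would be needed before it can be made.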