Kepler: Towards a Grid-Enabled System for Scientific Workflows

Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher*, Steve Mock
*ludaesch@SDSC.EDU
San Diego Supercomputer Center (SDSC)
University of California, San Diego (UCSD)

Outline
• Motivation: Scientific Workflows (SEEK, SDM, GEON, ...)
• Current Features of the Kepler Scientific Workflows System
• Extending Kepler:
  – Grid-Enabling Kepler:
    • 3rd party transfer
  – WF planning & optimization
    • Shipping and Handling Algebra (SHA)
    • Web Service Composition as Declarative Query Plans
  – Semantic Types for Scientific Workflows
• Conclusions

B. Ludäscher et al. – Grid-Enabling Kepler

Kepler Team, Projects, Sponsors
• Ilkay Altintas (SDM)
• Chad Berkley (SEEK)
• Shawn Bowers (SEEK)
• Jeffrey Grethe (BIRN)
• Christopher H. Brooks (Ptolemy II)
• Zhengang Cheng (SDM)
• Efrat Jaeger (GEON)
• Matt Jones (SEEK)
• Edward A. Lee (Ptolemy II)
• Kai Lin (GEON)
• Bertram Ludäscher (BIRN, GEON, SDM, SEEK)
• Steve Mock (NMI)
• Steve Neuendorffer (Ptolemy II)
• Jing Tao (SEEK)
• Mladen Vouk (SDM)
• Yang Zhao (Ptolemy II)
• …

Example: SEEK – Science Environment for Ecological Knowledge (large NSF ITR)
• Analysis & Modeling System
  – Design and execution of ecological models and analysis
  – End user focus
  – application-/upperware
• Semantic Mediation System
  – Data integration of hard-to-relate sources and processes
  – Semantic Types and Ontologies
  – upper middleware
• EcoGrid
  – Access to ecology data and tools
  – middle-/underware
Architecture Overview (cf.
Cyberinfrastructure)

Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Figure: GARP workflow. Steps: EcoGrid Query (against registered EcoGrid databases), Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, User, Generate Metadata, Archive to EcoGrid. Data products: (a) species presence & absence points, (b) environmental layers, (c) integrated layers, each for native range and invasion area; (d) training sample and test sample; (e) GARP rule set; (f) native range and invasion area prediction maps; (g) model quality parameter; (h) selected prediction maps.]
Source: NSF SEEK (Deana Pennington et al., UNM)

Genomics Example: Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)

Source: NIH BIRN (Jeffrey Grethe, UCSD)

Scientific “Workflows”: Some Findings
• More dataflow than (business control-/) workflow
  – DiscoveryNet, Kepler, SCIRun, Scitegic, Taverna, Triana, …
• Need for “programming extension”
  – Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations (WS1 → DT → WS2)
• Need for rich user interaction & workflow steering:
  – pause / revise / resume
  – select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products and provenance

In a Flux: Workflow “Standards”
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html

Commercial & Open Source Scientific “Workflow” (well, Dataflow) Systems
• Kensington Discovery Edition from InforSense
• Triana
• Taverna

SCIRun: Problem Solving Environments for Large-Scale Scientific Computing
• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
• New collaboration under Kepler/SDM
• Component model, based on generalized dataflow programming
Steve Parker (cs.utah.edu)

Our Starting Point: Ptolemy II & Dataflow Process Networks
see! read! try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Why Ptolemy II?
• Ptolemy II Objective:
  – “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
• Data & Process oriented: Dataflow process networks
• Natural Data Streaming Support
• User-Orientation
  – “application-ware”, not middle-/under-ware
  – Workflow design & exec console (Vergil GUI)
• PRAGMATICS
  – mature, actively maintained, well-documented (500+pp)
  – open source system
  – developed across multiple projects (NSF/ITRs SEEK and GEON, DOE SciDAC SDM, …)
  – hoping to leverage e-sister projects (e.g. Taverna, …)

Dataflow Process Networks: Putting Computation Models (“Orchestration”) first!
[Figure: two actors with typed i/o ports, connected by a FIFO channel; label: advanced push/pull]
• Synchronous Dataflow Network (SDF)
  – Statically schedulable single-threaded dataflow
    • Can execute multi-threaded, but the firing sequence is known in advance
  – Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
  – Multi-threaded dynamically scheduled dataflow
  – More expressive than SDF (dynamic token rate prevents static scheduling)
  – Natural streaming model
• Other Execution Models (“Domains”)
  – Implemented through different “Directors”

Actor-/Dataflow Orientation vs Object-/Control flow Orientation
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Marrying or Divorcing Control- & Dataflow
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Overview: Scientific Workflows in Kepler
• Modeling and Workflow Design
• Web services = individual components (“actors”)
• “Minute-Made” Application Integration:
  – Plugging in and harvesting web service components is easy and fast
• Rich SWF modeling semantics (“directors”):
  – Different and precise dataflow models of computation
  – Clear and composable component interaction semantics
  Web service composition and application integration tool
• Coming soon:
  – Shrink-wrapped, pre-packaged “Kepler-to-Go”
  – Structural and semantic typing (better design support)
  – Grid-enabled web services (for big data, big computations, …)
  – Different deployment models (web service, web site, applet, …)

The KEPLER GUI: Vergil (Steve Neuendorffer, Ptolemy II)
Drag and drop utilities, director and actor libraries.

Running a Genomics WF (Ilkay Altintas, SDM)

Support for Multiple Workflow Granularities
[Figure labels: Boulders, Plumbing, Powder, Sand. Abstraction: Sand to Rocks]
Directors and Combining Different Component Interaction Semantics
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Application Examples: Mineral Classification with Kepler … (Efrat Jaeger, GEON)

… inside the Classifier

Standard BrowserUI: Client-Side SVG

SWF Reengineering (Ashraf, Efrat, Kai, GEON)

DataMapper Sub-Workflow

Result launched via BrowserUI actor (coupling with ESRI’s ArcIMS)

Distributed Workflows in KEPLER
• Web and Grid Service plug-ins
  – WSDL (now) and Grid services (stay tuned …)
  – ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
  – SSH, SCP, SDSC SRB, OGS?-???… coming
• WS Harvester
  – Import query-defined WS operations as Kepler actors
• XSLT and XQuery Data Transformers
  – to link not “designed-to-fit” web services
• WS-deployment interface (planned)

Generic Web Service Actor (Ilkay Altintas)
Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.
Configure - select service operation

Set Parameters and Commit

Specialized WS Actor (after instantiation)

Web Service Harvester (Ilkay Altintas, SDM)
• Imports the web services in a repository into the actor library.
• Has the capability to search for web services based on a keyword.

Composing 3rd-Party WSs (NMI, Steve Mock)
Output of previous web service
User interaction & Transformations
Input of next web service

A Special Generic Ingestion Actor for EML Data (SEEK, Chad Berkley)
• Ingests any data format described by EML metadata
• Converts raw data to Ptolemy format
• Data can then be operated on with other actors

Wrapping Legacy Applications

Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)

Promoter Identification Workflow in Ptolemy-II [SSDBM’03]
Execution Semantics
[Annotations on the workflow screenshot:]
• designed to fit (two places)
• hand-crafted control solution; also: forces sequential execution!
• hand-crafted Web-service actor
• No data transformations available
• Complex backward control-flow

Promoter Identification Workflow in FP

    genBankG       :: GeneId -> GeneSeq
    genBankP       :: PromoterId -> PromoterSeq
    blast          :: GeneSeq -> [PromoterId]
    promoterRegion :: PromoterSeq -> PromoterRegion
    transfac       :: PromoterRegion -> [TFBS]
    gpr2str        :: (PromoterId, PromoterRegion) -> String

    d0 = Gid "7"               -- start with some gene-id
    d1 = genBankG d0           -- get its gene sequence from GenBank
    d2 = blast d1              -- BLAST to get a list of potential promoters
    d3 = map genBankP d2       -- get list of promoter sequences
    d4 = map promoterRegion d3 -- compute list of promoter regions and ...
    d5 = map transfac d4       -- ... get transcription factor binding sites
    d6 = zip d2 d4             -- create list of pairs promoter-id/region
    d7 = map gpr2str d6        -- pretty print into a list of strings
    d8 = concat d7             -- concat into a single "file"
    d9 = putStr d8             -- output that file

Cleaned up Process Network PIW
• Back to purely functional dataflow process network (= also a data streaming model!)
• Re-introducing map(f) to PtolemyII (was there in PT Classic)
  – no control-flow spaghetti
  – data-intensive apps
  – free concurrent execution
  – free type checking
  – automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Annotations on the workflow screenshot:]
• map(f)-style iterators
• Powerful type checking
• Generic, declarative “programming” constructs
• Generic data transformation actors
• Forward-only, abstractable subworkflow piw(GeneId)

Optimization by Declarative Rewriting I
• PIW as a declarative, referentially transparent functional process
  – optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
• Technical report & PIW specification in Haskell:
  http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
[Annotations on the workflow screenshot:]
• map(f o g) instead of map(f) o map(g)
• Combination of map and zip

Optimizing II: Streams & Pipelines
• Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f ∘ mapS g ⇒ mapS (f ∘ g)
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney

Middle/Underware Access: Querying Databases
• Database connection actor:
  – Opens a database connection and passes it to all actors accessing this database.
• Database query actor:
  – A generic actor that queries a database and provides its result.
• DBConnection type and DBConnectionToken:
  – A new IOPort type and a token to distinguish a database connection from any general type.
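The idea of a dedicated connection token can be illustrated with a small Haskell sketch. This is not Kepler's (Java) implementation: the type and function names below are hypothetical, the "database" is an in-memory association list, and a query is reduced to a table name so that the example stays self-contained. It only shows how a distinct token type lets a query actor reject anything that is not a database connection.

```haskell
-- Hypothetical stand-in for a live connection: table name -> rows.
-- A real actor would hold a driver-level connection handle instead.
newtype DBConnection = DBConnection [(String, [[String]])]

-- General-purpose tokens versus the dedicated connection token.
data Token
  = StringToken String
  | IntToken Int
  | DBConnectionToken DBConnection

-- Connection actor (assumed shape): turns connection information into a
-- connection token. Here the "information" is simply the table contents.
openDBConnection :: [(String, [[String]])] -> Token
openDBConnection tables = DBConnectionToken (DBConnection tables)

-- Query actor (assumed shape): accepts only a connection token on its
-- connection port; any other token is reported as a type error.
queryActor :: Token -> String -> Either String [[String]]
queryActor (DBConnectionToken (DBConnection tables)) table =
  maybe (Left ("no such table: " ++ table)) Right (lookup table tables)
queryActor _ _ = Left "expected a DBConnectionToken"

main :: IO ()
main = do
  let conn = openDBConnection [("minerals", [["quartz", "SiO2"], ["calcite", "CaCO3"]])]
  print (queryActor conn "minerals")
  print (queryActor (StringToken "not a connection") "minerals")
```

The same connection token can be passed to several query actors, which is the point of opening the connection once and sharing it.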
Database Connection Actor
• OpenDBConnection actor:
  – Input: database connection information
  – Output: DBConnectionToken (reference to a DB connection instance, via a DBConnection output port)

Database Query Actor
• Database Query actor:
  – Input: SQL query string and a DB connection token
  – Parameters:
    • output type: XML, Record, or String
    • tuple-at-a-time vs set-at-a-time
  – Process:
    • execute query
    • produce results according to parameters

Querying Example

An (oversimplified) Model of the Grid
• Hosts: {h1, h2, h3, …}
• Data@Hosts: d1@{hi}, d2@{hj}, …
• Functions@Hosts: f1@{hi}, f2@{hj}, …
• Given: data/workflow: X → f → Y → g → Z
• … as a functional plan: […; Y := f(X); Z := g(Y); …]
• … as a logic plan: […; f(X,Y) ∧ g(Y,Z); …]
• Find Host Assignment: di → hi, fj → hj for all di, fj … s.t. […; d3@h3 := f@h2(d1@h1), …] is a valid plan

Shipping and Handling Algebra (SHA)
Logical view: f@A, x@B, y@C
plan Y@C = F@A of X@B =
1. [ X@B to A, Y@A := F@A(X@A), Y@A to C ]
2. [ F@A => B, Y@B := F@B(X@B), Y@B to C ]
3. [ X@B to C, F@A => C, Y@C := F@C(X@C) ]
Physical view: SHA Plans
[Figure: three diagrams (1), (2), (3) over f@A, x@b, y@c, one per plan]

Grid-Enabling PTII: Handles
Logical token transfer (3) requires get_handle(1,2); then exec_handle(4,5,6,7) for completion.
1. A → GA: get_handle
2. GA → A: return &X
3. A → B: send &X
4. B → GB: request &X
5. GB → GA: request &X
6. GA → GB: send *X
7. GB → B: send done(&X)
[Figure: A and B in Kepler space, GA and GB in Grid space, with arrows numbered 1 to 7]
Example: &X = “GA.17”, *X = <some_huge_file>
Candidate Formalisms:
• GridFTP
• SSH, SCP
• SDSC SRB
• OGS?-??? … WSRF?

Extensions: Semantic Type
• Take concepts and relationships from an ontology to “semantically type” the data-in/out ports
• Application: e.g., design support:
  – smart/semi-automatic wiring, generation of “massaging actors”
[Figure: actor m1 (normalize) with ports p3 and p4]
Port p3: Takes Abundance Count Measurements for Life Stages
Port p4: Returns Mortality Rate Derived Measurements for Life Stages

Semantic Types
• The semantic type signature
  – Type expressions over the (OWL) ontology
[Figure: actor m1 (normalize) with ports p3 and p4]

    SemType m1 ::
         Observation & itemMeasured.AbundanceCount & hasContext.appliesTo.LifeStageProperty
      -> DerivedObservation & itemMeasured.MortalityRate & hasContext.appliesTo.LifeStageProperty

Extended Type System (here: OWL Semantic Types)
SemType m1 as given above.
Substructure association: XML raw-data =(X)Query=> object model =link=> OWL ontology

Semantic Types for Scientific Workflows

Deriving Data Transformations from Semantic Service Registration [Bowers-Ludaescher, DILS’04]

Structural and Semantic Mappings [Bowers-Ludaescher, DILS’04]

Workflow Planning as Planning Queries with Limited Access Patterns
• User query Q: answer(ISBN, Author, Title) ← book(ISBN, Author, Title), catalog(ISBN, Author), not library(ISBN).
• Limited (web service) Access Patterns (API)
  – Src1.books: in: ISBN out: Author, Title
  – Src1.books: in: Author out: ISBN, Title
  – Src2.catalog: in: {} out: ISBN, Author
  – Src3.library: in: {} out: ISBN
• Q is not executable, but feasible (equivalent to executable Q’: catalog ; book ; not library)
ICDE (poster), EDBT, PODS (papers), [Nash-Ludaescher,2004]
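The reordering of Q into the executable Q’ can be sketched in Haskell as a greedy search over the query body. This is a minimal illustration, not the algorithm of the cited papers: it covers only the case where a plain reordering of the literals suffices, whereas feasibility in general is a broader notion. The `Literal` type, the `patterns` table, and the function names are assumptions made for this example; argument positions are 0-based.

```haskell
import Data.List (nub)
import Data.Maybe (fromMaybe)

type Var = String

-- A body literal: relation name, argument variables, negation flag.
data Literal = Literal { rel :: String, args :: [Var], negated :: Bool }

-- An access pattern lists the argument positions that must be bound on input.
type Pattern = [Int]

-- Access patterns from the slide, keyed by relation name.
patterns :: [(String, [Pattern])]
patterns =
  [ ("book",    [[0], [1]])  -- in: ISBN, or in: Author
  , ("catalog", [[]])        -- no inputs required
  , ("library", [[]])        -- no inputs required
  ]

-- A literal is callable if some pattern has all input positions bound;
-- a negated literal additionally needs every variable bound.
callable :: [Var] -> Literal -> Bool
callable bound lit =
  any inputsBound pats && (not (negated lit) || all (`elem` bound) (args lit))
  where
    pats = fromMaybe [] (lookup (rel lit) patterns)
    inputsBound pat = all (\i -> (args lit !! i) `elem` bound) pat

-- Greedy reordering: repeatedly emit the first callable literal and add
-- its variables to the bound set. Nothing means no such order exists.
order :: [Var] -> [Literal] -> Maybe [Literal]
order _ [] = Just []
order bound lits =
  case break (callable bound) lits of
    (_, []) -> Nothing
    (before, l : after) ->
      (l :) <$> order (nub (bound ++ args l)) (before ++ after)

main :: IO ()
main = do
  let q = [ Literal "book"    ["ISBN", "Author", "Title"] False
          , Literal "catalog" ["ISBN", "Author"]          False
          , Literal "library" ["ISBN"]                    True ]
  print (fmap (map rel) (order [] q))
```

For the query on the slide, `book` cannot be called first because neither ISBN nor Author is bound, so the sketch emits `catalog`, then `book`, then the negated `library`, which matches Q’. Because binding more variables never makes a literal uncallable, the greedy choice does not miss an executable order when one exists.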
Conclusions
• Summary
  – Kepler Scientific Workflow System
  – Open source, cross-project collaboration (SEEK, GEON, SDM, …)
  – Actor & Dataflow-oriented Modeling, Design, Execution (Ptolemy II heritage)
  – Prototyping, static analysis, web services, data transformations
• Next Steps
  – First official release (“Kepler-to-Go”) April/May ’04
    • e-Science meeting NeSC, Edinburgh
  – Grid-enabling
    • 3rd party transfer, planning, optimization, …
  – Semantic Typing [DILS’04]
  – Provenance, Fault tolerance, …
  – Link-Up w/ e.g. Taverna, Pegasus, …
  – Become a member or co-developer (You!)