The Chimera Virtual Data System
www.griphyn.org/chimera
Presented by Mike Wilde
Workflow Workshop, 3 December 2003
e-Science Institute, Edinburgh

Acknowledgements
- GriPhyN (the Grid Physics Network) is supported by the National Science Foundation, Information Technology Research Program
- The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
- The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
- This talk was also delivered at the Data Provenance and Annotation Workshop, 1 Dec 2003

The Virtual Data Concept
Enhance scientific productivity through:
- Discovery and application of datasets and programs
- Enabling use of a worldwide data grid as a scientific workstation
Virtual Data enables this approach by creating datasets from workflow "recipes" and recording their provenance.
Provenance ~= Virtual Data

Provenance System Goals
Producing data from transformations with uniform, precise data interface descriptions enables:
- Discovery: finding and understanding datasets and transformations
- Workflow: structured paradigm for organizing, locating, specifying, & producing scientific datasets
  - Forming new workflow
  - Building new workflow from existing patterns
  - Managing change
- Planning: automated to make the Grid transparent
- Audit: explanation and validation via provenance

Virtual Data Grid Vision
[Figure: architecture diagram. Labels: Production Manager, Researcher, Science Review; composition, planning, derivation, discovery, sharing; workflow planner, workflow executor (DAGman), request planner, request executor (Condor-G, GRAM), request predictor (Prophesy), Grid Monitor, Grid Operations; Data Grid, virtual data catalog, virtual data index, replica location service, storage element, Data Transport, Storage Resource Mgmt; raw data, simulation data, detector, simulation, analysis.]
[Figure labels, continued: virtual data catalog, discovery, Computing Grid.]

Usage Models and Cases
- Domains where it's valuable (and where it's not)? Cost-benefit ratios?
- Batch models
  - Cluster-finding laboratory: code and data changes, track results.
- Interactive models
  - Using provenance within interactive dialogs in graphical and textual tools
  - Moving back and forth between interactive and batch modes
- Discovery
- Understand / review / audit
- Compose
- Passive Provenance: recording
- Active Provenance: declaration

Virtual Data Example: Galaxy Cluster Search
[Figure: workflow DAG over Sloan data, and a plot titled "Galaxy cluster size distribution": Number of Clusters (1 to 100000) against Number of Galaxies (1 to 100).]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

Virtual Data Application: High Energy Physics Data Analysis
[Figure: diagram of derived datasets, each labeled by its parameter set. Labels:
  mass = 200
  mass = 200, decay = bb
  mass = 200, decay = ZZ
  mass = 200, decay = WW
  mass = 200, event = 8
  mass = 200, plot = 1
  mass = 200, decay = WW, stability = 3
  mass = 200, decay = WW, stability = 1
  mass = 200, decay = WW, event = 8
  mass = 200, decay = WW, plot = 1
  mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
  mass = 200, decay = WW, stability = 1, event = 8
  mass = 200, decay = WW, stability = 1, plot = 1]
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

Provenance Scenario
[Figure: data-flow graph in which simulate -t 10 …, reformat -f fz …, conv -I esd -o aod …, summarize -t 10 … and psearch -t 10 … read and write file1 through file8.]
- Update workflow following changes
- Manage workflow
- Explain provenance, e.g.
  for file8:
    psearch -t 10 -i file1 file3 file4 file5 file7 -o file8
    simulate -t 10 -o file1 file2
    reformat -f fz -i file2 -o file3 file4 file5
    summarize -t 10 -i file6 -o file7
    conv -l esd -o aod -i file2 -o file6
- On-demand data generation

Fundamental Units
- Transformations
  - Interface Declarations
  - Action Declarations
  - Call declaration
  - Invocation
- Datasets
  - Contents
  - Representation
  - Location

VDL: Virtual Data Language Describes Data Transformations
- Transformation
  - Abstract template of program invocation
  - Similar to "function definition"
- Derivation
  - "Function call" to a transformation
  - Stores past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
- Invocation
  - Record of a Derivation execution

Example Transformation
[Diagram: $a1 -> t1 -> $a2]

    TR t1( out a2, in a1,
           none pa = "500", none env = "100000" ) {
      argument = "-p "${pa};
      argument = "-f "${a1};
      argument = "-x -y";
      argument stdout = ${a2};
      profile env.MAXMEM = ${env};
    }

Example Transformation Calls (Derivations)

    DV d1->t1 (
      env="20000", pa="600",
      a2=@{out:run1.exp15.T1932.summary},
      a1=@{in:run1.exp15.T1932.raw}
    );

    DV d2->t1 (
      a1=@{in:run1.exp16.T1918.raw},
      a2=@{out:run1.exp16.T1918.summary}
    );

Workflow from File Dependencies
[Diagram: file1 -> x1 -> file2 -> x2 -> file3]

    TR tr1(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    TR tr2(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
    DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

Example Invocation
- Completion status and resource usage
- Attributes of executable transformation
- Attributes of input and output files

Example Workflow
[Diagram nodes: preprocess, findrange (twice), analyze]
- Complex structure
  - Fan-in
  - Fan-out
  - "left" and "right" can run in parallel
- Uses input file
  - Register with RC
- Complex file dependencies
  - Glues workflow

Workflow step "preprocess"
- TR preprocess turns f.a into f.b1 and f.b2

    TR preprocess( output b[], input a ) {
      argument = "-a top";
      argument = " -i "${input:a};
      argument = " -o " ${output:b};
    }

- Makes use of the "list" feature of VDL
  - Generates 0..N output files.
  - The number of files depends on the caller.

Workflow step "findrange"
- Turns two inputs into one output

    TR findrange( output b, input a1, input a2,
                  none name="findrange", none p="0.0" ) {
      argument = "-a "${name};
      argument = " -i " ${a1} " " ${a2};
      argument = " -o " ${b};
      argument = " -p " ${p};
    }

- Uses the default argument feature

Can also use list[] parameters

    TR findrange( output b, input a[],
                  none name="findrange", none p="0.0" ) {
      argument = "-a "${name};
      argument = " -i " ${" "|a};
      argument = " -o " ${b};
      argument = " -p " ${p};
    }

Workflow step "analyze"
- Combines intermediary results

    TR analyze( output b, input a[] ) {
      argument = "-a bottom";
      argument = " -i " ${a};
      argument = " -o " ${b};
    }

Complete VDL workflow
- Generate appropriate derivations

    DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ],
                        a=@{in:"f.a"} );
    DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"},
                        a1=@{in:"f.b1"}, name="left", p="0.5" );
    DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"},
                         a1=@{in:"f.b1"}, name="right" );
    DV bottom->analyze( b=@{out:"f.d"},
                        a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );

Compound Transformations
- Using compound TR
  - Permits composition of complex TRs from basic ones
  - Calls are independent
    > unless linked through LFN
  - A Call is effectively an anonymous derivation
    > Late instantiation at workflow generation time
  - Permits bundling of repetitive
    workflows
  - Model: Function calls nested within a function definition

Compound Transformations (cont)
- TR diamond encapsulates "diamond" workflows:

    TR diamond( out fd, io fc1, io fc2, io fb1, io fb2, in fa, p1, p2 ) {
      call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT",
                      p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT",
                      p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
    }

Compound Transformations (cont)
- Multiple DVs allow easy generator scripts:

    DV d1->diamond( fd=@{out:"f.00005"},
                    fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
                    fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"},
                    fa=@{io:"f.00000"}, p2="100", p1="0" );
    DV d2->diamond( fd=@{out:"f.0000B"},
                    fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
                    fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"},
                    fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
    ...
    DV d70->diamond( fd=@{out:"f.001A3"},
                     fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
                     fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"},
                     fa=@{io:"f.0019E"}, p2="800", p1="18" );

Dataset Requirements
- File
- Set of files
- Set of files with relational index
- Relational query or spreadsheet range
- XML Element, e.g. <FORM> <Title…> </FORM>
- Object closure
- New user-defined dataset type

Possible Dataset Type Model
- Types used for
  - Managing dataset representation
  - Determining argument conformance in invocations
  - Discovery of datasets and transformations
- Two parallel type hierarchies separate representation and semantics
  - Representational: organizes and specifies families of dataset representation
  - Logical: organizes and specifies application-specific semantics of datasets

Example Dataset Types (Nonleaf Types are Superclasses)
Representational: FileDataset, File, MultiFileSet, FileSet, TarFileSet
Logical:
EventCollection, RawEventSet, SimulatedEventSet, MonteCarlo Simulation, DiscreteEvent Simulation

Dataset Representation Descriptor
- Defines a dataset's physical layout
- Permits transformations to access datasets
- Structure is defined by dataset type (examples)
  - File: <lfn>, e.g. <evt.02>
  - MultiFileSet: <lfn+>, e.g. <evt.03, evt.04, evt05>
  - TarFileSet: <lfn,taropts>, e.g. <evts.1998, "-b50 -z">
  - Relation: <<odbc><select .*>>, e.g.
      <server name="db.mcs.anl.gov" db="hepdb" id="uchep"/>
      <query request="select * from evt where eid>2897 and eid<3945" />
- Stored in dataset catalog
- Format constrained by DS type def

Provenance Schema
[Figure: entity-relationship diagram.]
Entities and example attributes:
- Type: name=type2, repres=<...>
- Dataset: name=foo, type=type2
- Transformation: type-signature = prog1( in type1 X, out type2 Y )
- Derivation: type-signature = prog1( in type1 fnn, out type2 foo )
- Invocation: when=10am, time=20 secs, locn=U.Chicago
- Replica: locn=U.Chicago
- Metadata
Relationship labels: instance of; contains arguments of; reads/writes/creates/deletes; physical replica of; invocation of; describes

Observations
- A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
- Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation

Vision for Provenance in the Large
- Universal knowledge management and production systems
- Vendors integrate the provenance tracking protocol into data processing products
- Ability to run anywhere "in the Grid"

Virtual Data Grid Vision
[Figure: the architecture diagram shown earlier under this title, repeated.]

Systems requirements: Services and Interfaces
- Provenance databases, servers, virtual machines, workflow composers
- Provenance navigation portals and webs
- Embedded tracing systems, esp. within interactive tools: SPSS, ROOT, Excel, etc.
- Catalog integration: replica catalogs, metadata catalogs, transformation catalogs, integrity, coherence, interoperability
- Interaction between provenance systems and workflow systems

Provenance Servers
- OGSA-based Grid services
  - Discovery, security, resource management
- Supports code and data discovery and workflow management
- Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
- Derivations can reference remote transformations and datasets
- Structured object namespaces & object-level access control enable large VO collaboration

Provenance Hyperlinks
[Figure: Personal VDS, Group VDS and Collaboration VDS servers whose DV, DS and TR objects link to one another across servers.]

Indexing Provenance Servers to Support Discovery
[Figure: the same Personal, Group and Collaboration VDS servers, with a Personal Index per personal server, a Group Index, a Collaboration-level index, and a Collaboration-wide index.]

Challenges
- What's the unit of change? Dataset? File? Object?
- Relations to the worlds of HDF, CDF, FITS, many others
- Does a dataset type have multiple dimensions?
- Dataset names/handles
- Unification of processing models: App, SQL, Service
- Closure and reflection: Are transformations and workflows datasets?
- Can we track provenance of annotations?
- Version management: mutability, timestamps
- Garbage collection, retention, pruning
- Distribution: what standards and naming protocols are needed? Catalogs, schemas?
- Theoretical models? Unification of fine-grained and coarse-grained models?
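
Supplementary sketch: workflow from file dependencies
The "Workflow from File Dependencies" and "Provenance Scenario" slides can be illustrated with a small Python sketch. This is not Chimera code and makes no claim about how Chimera is implemented: the Derivation class and the producers, depends_on and explain helpers are hypothetical names introduced here. Only the derivation names (top, left, right, bottom) and logical file names (f.a through f.d) are taken from the "Complete VDL workflow" slide. Under those assumptions, the sketch links derivations through shared logical file names and lists the steps needed to produce a requested file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """Hypothetical stand-in for a VDL DV: a named call with input and output files."""
    name: str
    transformation: str
    inputs: tuple
    outputs: tuple


# The diamond workflow from the "Complete VDL workflow" slide.
DERIVATIONS = [
    Derivation("top", "preprocess", ("f.a",), ("f.b1", "f.b2")),
    Derivation("left", "findrange", ("f.b1", "f.b2"), ("f.c1",)),
    Derivation("right", "findrange", ("f.b1", "f.b2"), ("f.c2",)),
    Derivation("bottom", "analyze", ("f.c1", "f.c2"), ("f.d",)),
]


def producers(derivations):
    """Map each logical file name to the single derivation that writes it."""
    table = {}
    for dv in derivations:
        for lfn in dv.outputs:
            if lfn in table:
                raise ValueError(f"{lfn} is written by both {table[lfn].name} and {dv.name}")
            table[lfn] = dv
    return table


def depends_on(derivations):
    """Derive DAG edges: a derivation depends on whichever derivation writes its inputs."""
    made_by = producers(derivations)
    return {
        dv.name: sorted({made_by[lfn].name for lfn in dv.inputs if lfn in made_by})
        for dv in derivations
    }


def explain(lfn, derivations):
    """Return the derivation names needed to produce lfn, parents before children."""
    made_by = producers(derivations)
    order, seen = [], set()

    def visit(name):
        dv = made_by.get(name)
        if dv is None or dv.name in seen:  # raw input, or step already scheduled
            return
        seen.add(dv.name)
        for parent in dv.inputs:
            visit(parent)
        order.append(dv.name)

    visit(lfn)
    return order


if __name__ == "__main__":
    print(depends_on(DERIVATIONS))
    print(explain("f.d", DERIVATIONS))
```

Asking for f.d yields the order top, left, right, bottom, while asking for f.a yields an empty list because no derivation writes it. The same lookup serves both uses named on the slides: run the listed steps to generate a file on demand, or read them back to explain how an existing file was derived.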