The Chimera Virtual Data System
www.griphyn.org/chimera

Presented by Mike Wilde
Workflow Workshop, 3 December 2003
e-Science Institute, Edinburgh

Acknowledgements

  GriPhyN, the Grid Physics Network, is supported by the National Science Foundation, Information Technology Research Program
  The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
  The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
  This talk was also delivered at the Data Provenance and Annotation Workshop, 1 Dec 2003

Provenance System Goals

  Producing data from transformations with uniform, precise data interface descriptions enables:
  Discovery: finding and understanding datasets and transformations
  Workflow: structured paradigm for organizing, locating, specifying, and producing scientific datasets
    – Forming new workflow
    – Building new workflow from existing patterns
    – Managing change
  Planning: automated to make the Grid transparent
  Audit: explanation and validation via provenance

Virtual Data Grid Vision

  [Figure: Virtual Data Grid architecture diagram]
  Actors: Researcher, Production Manager, Science Review
  Components: virtual data catalog, virtual data index, workflow planner, request planner, request predictor (Prophesy), workflow executor (DAGMan), request executor (Condor-G, GRAM), Grid Monitor, replica location service, storage elements, detector, simulation, data analysis
  Layers: Grid Operations, Data Grid, Computing Grid, Data Transport, Storage Resource Mgmt
  Flow labels: discovery, sharing, composition, planning, derivation, raw data

Virtual Data Example: Galaxy Cluster Search

  [Figure: galaxy cluster search DAG over Sloan data, and a plot of the galaxy cluster size distribution: Number of Clusters (1 to 100000) against Number of Galaxies (1 to 100)]
  Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

Virtual Data Application: High Energy Physics Data Analysis

  [Figure: tree of derived datasets, each node labeled with the parameters that define it. Node labels:]
    mass = 200
    mass = 200, decay = bb
    mass = 200, decay = ZZ
    mass = 200, decay = WW
    mass = 200, event = 8
    mass = 200, plot = 1
    mass = 200, decay = WW, stability = 3
    mass = 200, decay = WW, stability = 1
    mass = 200, decay = WW, event = 8
    mass = 200, decay = WW, plot = 1
    mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
    mass = 200, decay = WW, stability = 1, event = 8
    mass = 200, decay = WW, stability = 1, plot = 1
  Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

Provenance Scenario

  [Figure: workflow graph linking the commands below through file1 to file8]
    simulate -t 10 …
    reformat -f fz …
    conv -I esd -o aod …
    summarize -t 10 …
    psearch -t 10 …
  Manage workflow
  Update workflow following changes
  On-demand data generation
  Explain provenance, e.g. for file8:
    psearch -t 10 -i file1 file3 file4 file5 file7 -o file8
    simulate -t 10 -o file1 file2
    reformat -f fz -i file2 -o file3 file4 file5
    summarize -t 10 -i file6 -o file7
    conv -I esd -o aod -i file2 -o file6

Fundamental Units

  Transformations
    – Interface Declarations
    – Action Declarations
    – Call declaration
    – Invocation
  Datasets
    – Contents
    – Representation
    – Location

VDL: Virtual Data Language Describes Data Transformations

  Transformation
    – Abstract template of program invocation
    – Similar to "function definition"
  Derivation
    – “Function call” to a transformation
    – Stores past and future:
      > A record of how data products were generated
      > A recipe of how data products can be generated
  Invocation
    – Record of a Derivation execution

Example Transformation

    TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
      argument = "-p "${pa};
      argument = "-f "${a1};
      argument = "-x -y";
      argument stdout = ${a2};
      profile env.MAXMEM = ${env};
    }

  [Figure: $a1 -> t1 -> $a2]

Example Transformation Calls
(Derivations)

    DV d1->t1 (
      env="20000",
      pa="600",
      a2=@{out:run1.exp15.T1932.summary},
      a1=@{in:run1.exp15.T1932.raw}
    );

    DV d2->t1 (
      a1=@{in:run1.exp16.T1918.raw},
      a2=@{out:run1.exp16.T1918.summary}
    );

Workflow from File Dependencies

    TR tr1(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }

    TR tr2(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }

    DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
    DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

  [Figure: file1 -> x1 -> file2 -> x2 -> file3]

Example Invocation

  Completion status and resource usage
  Attributes of executable transformation
  Attributes of input and output files

Example Workflow

  [Figure: DAG in which preprocess feeds two findrange jobs, which both feed analyze]
  Complex structure
    – Fan-in
    – Fan-out
    – "left" and "right" can run in parallel
  Uses input file
    – Register with RC
  Complex file dependencies
    – Glues workflow

Compound Transformations (cont)

  TR diamond encapsulates “diamond” workflows:

    TR diamond( out fd, io fc1, io fc2, io fb1, io fb2, in fa, p1, p2 ) {
      call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
    }

Compound Transformations (cont)

  Multiple DVs allow easy generator scripts:

    DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
    DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
    ...
    DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );

Dataset Requirements

  File
  Set of files
  Relational query or spreadsheet range
  Object closure
  XML Element, e.g. <FORM> <Title…> </FORM>
  Set of files with relational index
  New user-defined dataset type

Possible Dataset Type Model

  Types used for
    – Managing dataset representation
    – Determining argument conformance in invocations
    – Discovery of datasets and transformations
  Two parallel type hierarchies separate representation and semantics
    – Representational: organizes and specifies families of dataset representation
    – Logical: organizes and specifies application-specific semantics of datasets

Example Dataset Types (Nonleaf Types are Superclasses)

  [Figure: two type trees]
  Representational: FileDataset, File, MultiFileSet, FileSet, TarFileSet
  Logical: EventCollection, RawEventSet, SimulatedEventSet, MonteCarloSimulation, DiscreteEventSimulation

Dataset Representation Descriptor

  Defines a dataset’s physical layout
  Permits transformations to access datasets
  Structure is defined by dataset type (examples)
    – File: <lfn>
        <evt.02>
    – MultiFileSet: <lfn+>
        <evt.03, evt.04, evt.05>
    – TarFileSet: <lfn,taropts>
        <evts.1998, "-b50 -z">
    – Relation: <<odbc><select .*>>
        <server name="db.mcs.anl.gov" db="hepdb" id="uchep"/>
        <query request="select * from evt where eid>2897 and eid<3945" />
  Stored in dataset catalog
  Format constrained by DS type def

Provenance Schema

  [Figure: entity-relationship diagram of the provenance schema; recoverable entities and link labels follow]
  Type: name=type2, repres=<...>
  Dataset: name=foo, type=type2
  Transformation: type-signature = prog1( in type1 X, out type2 Y )
  Derivation: type-signature = prog1( in type1 fnn, out type2 foo )
  Replica: locn=U.Chicago
  Link labels: instance of; contains arguments of; reads/writes/creates/deletes; describes physical replica of
  Invocation: when=10am, time=20 secs, locn=U.Chicago
  Metadata
  Link labels: reads/writes/creates/deletes; invocation of; describes

Observations

  A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
  Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation

Vision for Provenance in the Large

  Universal knowledge management and production systems
  Vendors integrate the provenance tracking protocol into data processing products
  Ability to run anywhere “in the Grid”

Virtual Data Grid Vision

  [Figure: the Virtual Data Grid architecture diagram, repeated from earlier in the talk]

Systems requirements: Services and Interfaces

  Provenance databases, servers, virtual machines, workflow composers
  Provenance navigation portals and webs
  Embedded tracing systems, esp. within interactive tools: SPSS, ROOT, Excel, etc.
  Catalog integration: replica catalogs, metadata catalogs, transformation catalogs, integrity, coherence, interoperability
  Interaction between provenance systems and workflow systems

Provenance Servers

  OGSA-based Grid services
    – Discovery, security, resource management
  Supports code and data discovery and workflow management
  Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
  Derivations can reference remote transformations and datasets
  Structured object namespaces and object-level access control enable large VO collaboration

Indexing Provenance Servers to Support Discovery

  [Figure: personal, group, and collaboration VDS servers, each holding TR, DS, and DV objects, indexed by personal indexes, a group index, and a collaboration-wide (collaboration-level) index]

Challenges

  What’s the unit of change? Dataset? File? Object?
  Relations to the worlds of HDF, CDF, FITS, many others
  Does a dataset type have multiple dimensions?
  Dataset names/handles
  Unification of processing models: App, SQL, Service
  Closure and reflection: Are transformations and workflows datasets?
  Can we track provenance of annotations?
  Version management: mutability, timestamps
  Garbage collection, retention, pruning
  Distribution: what standards and naming protocols are needed? Catalogs, schemas?
  Theoretical models? Unification of fine-grained and coarse-grained models?
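
Sketch: workflow and provenance from file dependencies

  The Provenance Scenario and Workflow from File Dependencies slides rest on one idea: if every derivation declares the files it reads and writes, the workflow graph, the provenance of any file, and the set of files made stale by a change can all be computed from those declarations. The Python below is a minimal sketch of that idea, not Chimera's implementation. The DERIVATIONS table and the function names are hypothetical; only the command and file names are taken from the scenario slide.

```python
# Hypothetical in-memory stand-in for a virtual data catalog: each derivation
# lists the logical files it reads ("in") and writes ("out").
DERIVATIONS = {
    "simulate":  {"in": [], "out": ["file1", "file2"]},
    "reformat":  {"in": ["file2"], "out": ["file3", "file4", "file5"]},
    "conv":      {"in": ["file2"], "out": ["file6"]},
    "summarize": {"in": ["file6"], "out": ["file7"]},
    "psearch":   {"in": ["file1", "file3", "file4", "file5", "file7"],
                  "out": ["file8"]},
}


def producers(derivations):
    """Map each output file to the name of the derivation that writes it."""
    made_by = {}
    for name, dv in derivations.items():
        for f in dv["out"]:
            made_by[f] = name
    return made_by


def provenance(target, derivations):
    """List the derivations needed to produce `target`, inputs before users."""
    made_by = producers(derivations)
    order, seen = [], set()

    def visit(f):
        dv = made_by.get(f)
        if dv is None or dv in seen:  # raw input, or already scheduled
            return
        seen.add(dv)
        for dep in derivations[dv]["in"]:
            visit(dep)
        order.append(dv)  # post-order: appended after everything it reads

    visit(target)
    return order


def stale_after_change(changed, derivations):
    """Return the files that must be regenerated when `changed` is modified."""
    stale, frontier = set(), [changed]
    while frontier:
        f = frontier.pop()
        for dv in derivations.values():
            if f in dv["in"]:
                for out in dv["out"]:
                    if out not in stale:
                        stale.add(out)
                        frontier.append(out)
    return stale
```

  Calling provenance("file8", DERIVATIONS) walks upstream from file8 and returns the five commands in an order in which they could be run; stale_after_change("file6", DERIVATIONS) walks downstream and reports file7 and file8. Both walks use only the declared inputs and outputs, so nothing about the programs themselves needs to be known.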