GriPhyN and Data Provenance
The Grid Physics Network Virtual Data System

DOE Data Management Workshop
SLAC, 17 March 2004

Mike Wilde
Argonne National Laboratory
Mathematics and Computer Science Division


GriPhyN: Grid Physics Network Mission

- Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation
- Virtual Data enables this approach by creating datasets from workflow "recipes" and recording their provenance
- GriPhyN works to "cross the chasm": application and computer scientists create and field-test paradigms and toolkits together

DOE Data Management    www.griphyn.org/chimera    17 Mar 2004


Virtual Data Scenario

[Figure: workflow graph connecting simulate, reformat, conv, summarize and psearch through file1 ... file8]

- On-demand data generation
- Update workflow following changes
- Manage workflow
- Explain provenance, e.g. for file8:

    psearch -t 10 -i file3 file4 file5 -o file8
    summarize -t 10 -i file6 -o file7
    reformat -f fz -i file2 -o file3 file4 file5
    conv -I esd -o aod -i file2 -o file6
    simulate -t 10 -o file1 file2


Grid3 – The Laboratory

Supported by the National Science Foundation and the Department of Energy.
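
The "explain provenance" step of the Virtual Data Scenario slide can be sketched in a few lines of Python. This is only an illustration, not Chimera code: the `Derivation` record and the `explain` function are hypothetical names, and the input/output lists are copied from the command lines shown on that slide. The sketch walks backward from a target file to every derivation that contributed to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """Hypothetical record: one program invocation with its files."""
    command: str
    inputs: tuple
    outputs: tuple


# Records transcribed from the Virtual Data Scenario slide.
DERIVATIONS = [
    Derivation("simulate -t 10", (), ("file1", "file2")),
    Derivation("reformat -f fz", ("file2",), ("file3", "file4", "file5")),
    Derivation("conv -I esd -o aod", ("file2",), ("file6",)),
    Derivation("summarize -t 10", ("file6",), ("file7",)),
    Derivation("psearch -t 10", ("file3", "file4", "file5"), ("file8",)),
]


def explain(target, derivations):
    """Return the derivations needed to produce `target`, producers first."""
    producer = {out: d for d in derivations for out in d.outputs}
    seen = set()
    lineage = []

    def visit(name):
        d = producer.get(name)
        if d is None or d in seen:
            return  # raw input, or derivation already recorded
        seen.add(d)
        for inp in d.inputs:
            visit(inp)
        lineage.append(d)  # appended only after all of its ancestors

    visit(target)
    return lineage


if __name__ == "__main__":
    for step in explain("file8", DERIVATIONS):
        print(step.command, "->", " ".join(step.outputs))
```

The same lineage, read in order, is also a recipe for regenerating the target on demand, which is the dual use of a derivation record that the talk describes.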


VDL: Virtual Data Language Describes Data Transformations

- Transformation
  - Abstract template of program invocation
  - Similar to "function definition"
- Derivation
  - "Function call" to a Transformation
  - Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
- Invocation
  - Record of a Derivation execution
- These XML documents reside in a "virtual data catalog" (VDC), a relational database


VDL Describes Workflow via Data Dependencies

    TR tr1(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }

    TR tr2(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2}; }

    DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});

    DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});

[Figure: file1 -> x1 -> file2 -> x2 -> file3]


Workflow example

[Figure: preprocess feeds two findrange jobs, which feed analyze]

- Graph structure
  - Fan-in
  - Fan-out
  - "left" and "right" can run in parallel
- Needs external input file
  - Located via replica catalog
- Data file dependencies
  - Form graph structure


Complete VDL workflow

Generate appropriate derivations

    DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
    DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
    DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
    DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );


Compound Transformations Enable Functional Abstractions

Compound TR encapsulates an entire sub-graph:

    TR rangeAnalysis (in fa, p1, p2,
                      out fd, io fc1, io fc2, io fb1, io fb2)
    {
      call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
    }


Derivation scripts

Representation of virtual data provenance:

    DV d1->diamond( fd=@{out:"f.00005"},
        fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
        fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"},
        fa=@{io:"f.00000"}, p2="100", p1="0" );

    DV d2->diamond( fd=@{out:"f.0000B"},
        fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
        fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"},
        fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
    ...
    DV d70->diamond( fd=@{out:"f.001A3"},
        fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
        fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"},
        fa=@{io:"f.0019E"}, p2="800", p1="18" );


Invocation Provenance

- Completion status and resource usage
- Attributes of executable transformation
- Attributes of input and output files


Executing VDL Workflows

[Figure labels: Abstract workflow; local planner; Global planner "Pegasus"; "jit" planner (research); Grid Info; Concrete DAG; DAGman / Condor-G]


GriPhyN-iVDGL Applications to date

- ATLAS, BTeV, CMS – HEP event simulation
- Argonne Computational Biology – sequence comparison and result capture
- LIGO – Pulsar search
- Sloan Digital Sky Survey – cluster finding; near-earth object search planned
- Quarknet – science education – cosmic rays, HEP analysis


Genome Analysis Database Update

[Figure labels: End Users; Hit and Run; Public; Registered Groups; Collaborators; Jetspeed Interface to the Server; GADU - G Server; UofWisc; Jazz/ANL; Grid3; Chimera, Condor, Globus; Data Flow and Storage at various levels; Automatic Workflows Created as per User Request or Project; Grid]

Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS. Described in GGF10 workshop paper.


Virtual Data Example: Galaxy Cluster Search

[Figure: DAG built over Sloan data]
[Figure: Galaxy cluster size distribution; Number of Clusters (1 to 100000) versus Number of Galaxies (1 to 100), logarithmic axes]

Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper.


Cluster Search Workflow Graph and Execution Trace

[Figure: workflow graph]
[Figure: Workflow jobs vs time]


Virtual Data Application: High Energy Physics Data Analysis

[Figure: graph of derived datasets, each node labeled with its parameters:
    mass = 200
    mass = 200, decay = bb
    mass = 200, decay = ZZ
    mass = 200, decay = WW
    mass = 200, event = 8
    mass = 200, plot = 1
    mass = 200, decay = WW, stability = 3
    mass = 200, decay = WW, stability = 1
    mass = 200, decay = WW, event = 8
    mass = 200, decay = WW, plot = 1
    mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
    mass = 200, decay = WW, stability = 1, event = 8
    mass = 200, decay = WW, stability = 1, plot = 1]

Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper.


Using Virtual Data for Science Education

- The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
- It's an experiment to give students the means to:
  - discover and apply datasets, algorithms, and data analysis methods
  - collaborate by developing new ones and sharing results and observations
  - learn data analysis methods that will ready and excite them for a scientific career
- And in later steps, we may actually use the Grid!
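
Returning to the diamond example of the "Complete VDL workflow" slide: the point that data file dependencies form the graph structure can be illustrated with a short Python sketch. It is not how Chimera or Pegasus is implemented; `build_dag` and `levels` are hypothetical helpers, and the file lists are transcribed from the four `DV` statements. Jobs are linked whenever one job's output file is another's input, files with no producer are treated as external inputs (to be located through a replica catalog), and jobs in the same level have no mutual dependencies and so could run in parallel.

```python
# Inputs and outputs transcribed from the four DV statements of the
# diamond workflow; the dict layout itself is an assumption of this sketch.
DERIVATIONS = {
    "top": {"tr": "preprocess", "in": ["f.a"], "out": ["f.b1", "f.b2"]},
    "left": {"tr": "findrange", "in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right": {"tr": "findrange", "in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"tr": "analyze", "in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def build_dag(derivations):
    """Infer job-to-job edges purely from shared logical file names."""
    producer = {f: name for name, dv in derivations.items() for f in dv["out"]}
    parents = {name: set() for name in derivations}
    external = set()
    for name, dv in derivations.items():
        for f in dv["in"]:
            if f in producer:
                parents[name].add(producer[f])
            else:
                external.add(f)  # nobody produces it: must already exist
    return parents, external


def levels(parents):
    """Group jobs into waves; all jobs in one wave are independent."""
    remaining = {name: set(p) for name, p in parents.items()}
    done, result = set(), []
    while remaining:
        ready = sorted(n for n, p in remaining.items() if p <= done)
        if not ready:
            raise ValueError("cycle in derivation graph")
        result.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return result


if __name__ == "__main__":
    parents, external = build_dag(DERIVATIONS)
    print("external inputs:", sorted(external))
    for i, wave in enumerate(levels(parents)):
        print("wave", i, wave)
```

No explicit ordering is ever declared in the input; the fan-out after `top` and the fan-in before `bottom` both fall out of the file names alone.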


Quarknet Virtual Data Project

[Figure: three school sites, each with a Cosmic Ray Detector, Locally Collected Data, and Student/Teacher Teams:
    Yale / Middletown High Collaboration, Hartford, Connecticut
    Foothills High School, Great Falls, Montana
    Central High School, Reston, Virginia
 connected through Standard Web access to the Quarknet Virtual Data Portal, which holds Student Data, Algorithms, Results, Notes, and communications, backed by the Virtual Data Catalog and the Virtual Data Toolkit]

- Student teacher teams sharing data, methods, programs, and knowledge
- Enabling collaboration-intensive science discovery with virtual data tools and methods


Detector Performance Study


Example: BTeV Event Simulation


Search by Metadata


Deriving a new dataset

…to find mass of "z" particle:


Workflow for missing energy calculations


Virtual Provenance: list of derivations and files

    <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
         dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
      <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
      <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
      <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
    </job>
    <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
         dv-namespace="Quarknet.HEPSRCH" …
      <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum" …
    </job>
    <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3" …
      <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …
      <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
      <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
      <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
      <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
      <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
    </job>
    <!--list of all files used -->
    <filename file="ecal.pct" link="inout"/>
    <filename file="electron10GeV.avg" link="inout"/>
    <filename file="electron10GeV.sum" link="inout"/>
    <filename file="hcal.pct" link="inout"/>…

(excerpted for display)


Virtual Provenance in XML: control flow graph

[XML excerpt: a list of <child ref="..."> elements, each enclosing one or more <parent ref="..."/> elements.
    Child refs shown: ID000003, ID000004, ID000005, ID000009, ID000010, ID000012, ID000013, ID000014
    Parent refs shown: ID000001, ID000002, ID000003, ID000004, ID000006, ID000008, ID000009, ID000010, ID000011, ID000012, ID000013]

(excerpted for display…)


And writing the results up in a "poster"

[Figure: Poster describing analysis]


Observations

- A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
- Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
- The real world is messy – finding the right abstractions is hard, and handling "legacy" applications is even harder
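
As a rough illustration of how the job listing on the "Virtual Provenance: list of derivations and files" slide could be queried, the Python sketch below parses a small XML fragment with the standard library. The fragment is assembled from the two complete `<job>` entries shown on that slide, with the elided parts left out; the enclosing `<adag>` root element and the `file_provenance` helper are assumptions made so that the example is well-formed and self-contained.

```python
import xml.etree.ElementTree as ET

# Two <job> entries taken from the slide; <adag> as the enclosing root
# element is an assumption made only to obtain a well-formed document.
DAX_FRAGMENT = """
<adag>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5">
    <uses file="run1a.esm" link="output"/>
    <uses file="run1a.event" link="input"/>
  </job>
  <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3">
    <uses file="run1a.muon" link="input"/>
    <uses file="run1a.total" link="output"/>
    <uses file="run1a.ecal" link="input"/>
    <uses file="run1a.hcal" link="input"/>
    <uses file="run1a.mis" link="input"/>
  </job>
</adag>
"""


def file_provenance(xml_text):
    """Map each output file to the job that writes it and that job's inputs."""
    root = ET.fromstring(xml_text)
    table = {}
    for job in root.iter("job"):
        uses = job.findall("uses")
        inputs = sorted(u.get("file") for u in uses if u.get("link") == "input")
        for u in uses:
            if u.get("link") == "output":
                table[u.get("file")] = {
                    "job": job.get("id"),
                    "name": job.get("name"),
                    "inputs": inputs,
                }
    return table


if __name__ == "__main__":
    for out_file, info in file_provenance(DAX_FRAGMENT).items():
        print(out_file, "<-", info["name"], info["inputs"])
```

Only the `link` attribute of each `<uses>` element is needed to separate inputs from outputs, which is why the `dontRegister` and `dontTransfer` attributes are omitted here.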


Vision for Provenance in the Large

- Universal knowledge management and production systems
- Vendors integrate the provenance tracking protocol into data processing products
- Ability to run anywhere "in the Grid"


Virtual Data Grid Vision

[Figure labels: Researcher; Production Manager; Science Review; sharing; composition; discovery; planning; derivation; virtual data catalog; virtual data index; workflow planner; request planner; request predictor (Prophesy); workflow executor (DAGman); request executor (Condor-G, GRAM); Grid Operations; Grid Monitor; Computing Grid; Data Grid; Data Transport; Storage Resource Mgmt; replica location service; storage element; detector; raw data; simulation; simulation data; analysis]


Planned Dataset Model

[Figure: kinds of dataset
    File
    Set of files
    Relational query or spreadsheet range
    Object closure
    XML Element (<FORM <Title…> /FORM>)
    New user-defined dataset type: Set of files with relational index]

Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao


Planned Dataset Type Model

(Nonleaf Types are Superclasses)

    Representational
      FileDataset
        File
        MultiFileSet
          FileSet
          TarFileSet
    Logical
      EventCollection
        RawEventSet
        SimulatedEventSet
          MonteCarlo Simulation
          DiscreteEvent Simulation


Provenance Server Plans

- OGSA-based Grid services
  - Discovery, security, resource management
- Supports code and data discovery and workflow management
- Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
- Derivations can reference remote transformations and datasets
- Structured object namespaces & object-level access control enable large VO collaboration
- Generalize transforms to describe service calls, database queries and language interpreters


Provenance Hyperlinks

[Figure: DV, TR and DS objects held in Personal VDS, Group VDS and Collaboration VDS servers, with links between objects crossing from one server to another]


Indexing Servers to Support Discovery

[Figure: the same Personal VDS, Group VDS and Collaboration VDS servers, with Personal Index, Group Index and Collaboration-level index servers over them, up to a Collaboration-wide index]


For Information and Software

- Virtual Data System
  - www.griphyn.org/chimera - Chimera Virtual Data System: Overview, papers, software
- Grids and Grid Software
  - www.ivdgl.org/grid2003 - Using Grid3
  - www.griphyn.org/vdt - Virtual Data Toolkit
  - www.globus.org - The Globus Toolkit
  - www.cs.wisc.edu/condor - The Condor Project
  - www.ppdg.net - Particle Physics Data Grid


Acknowledgements: Virtual Data is a Large Team Effort

- The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
- The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
- Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, and their wonderful teams


Acknowledgements

- GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation
- The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM