GriPhyN and Data
Provenance
The Grid Physics Network
Virtual Data System
DOE Data Management Workshop
SLAC, 17 March 2004
Mike Wilde
Argonne National Laboratory
Mathematics and Computer Science Division
GriPhyN:
Grid Physics Network Mission
Enhance scientific productivity through
discovery and processing of datasets, using
the grid as a scientific workstation
Virtual Data enables this approach by
creating datasets from workflow “recipes” and
recording their provenance.
GriPhyN works to “cross the chasm”: application and computer
scientists create and field-test paradigms and toolkits together
DOE Data Management
www.griphyn.org/chimera
17 Mar 2004
Virtual Data Scenario
(Figure: workflow graph in which simulate, reformat, conv, summarize, and psearch are linked through file1 to file8.)
Manage workflow
Update workflow following changes
On-demand data generation
Explain provenance, e.g. for file8:
psearch –t 10 –i file3 file4 file5 –o file8
summarize –t 10 –i file6 –o file7
reformat –f fz –i file2 –o file3 file4 file5
conv –l esd –o aod –i file2 –o file6
simulate –t 10 –o file1 file2
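The "explain provenance" step can be pictured as a backward walk over recorded commands. The sketch below is only an illustration in Python, not the Chimera implementation: the tuple layout and the function name `explain` are assumptions, and the inputs and outputs are transcribed from the -i and -o arguments printed above.

```python
from collections import deque

# (command, input files, output files), transcribed from the slide's command list.
DERIVATIONS = [
    ("simulate -t 10", [], ["file1", "file2"]),
    ("reformat -f fz", ["file2"], ["file3", "file4", "file5"]),
    ("conv -l esd -o aod", ["file2"], ["file6"]),
    ("summarize -t 10", ["file6"], ["file7"]),
    ("psearch -t 10", ["file3", "file4", "file5"], ["file8"]),
]


def explain(target, derivations=DERIVATIONS):
    """Return the commands that contributed to `target`, nearest producer first."""
    # Index every output file by the command that produced it.
    producer = {}
    for command, inputs, outputs in derivations:
        for name in outputs:
            producer[name] = (command, inputs)

    lineage, seen, queue = [], set(), deque([target])
    while queue:
        name = queue.popleft()
        if name not in producer:
            continue  # primary input: no recorded derivation
        command, inputs = producer[name]
        if command in seen:
            continue  # already reported via another output file
        seen.add(command)
        lineage.append(command)
        queue.extend(inputs)
    return lineage


if __name__ == "__main__":
    for step in explain("file8"):
        print(step)
```

With the arguments as printed, the walk from file8 reaches psearch, reformat, and simulate; a walk from file7 reaches summarize, conv, and simulate.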
Grid3 – The Laboratory
Supported by the National
Science Foundation and the
Department of Energy.
VDL: Virtual Data Language
Describes Data Transformations

Transformation
– Abstract template of program invocation
– Similar to "function definition"

Derivation
– “Function call” to a Transformation
– Store past and future:
> A record of how data products were generated
> A recipe of how data products can be generated

Invocation
– Record of a Derivation execution

These XML documents reside in a “virtual
data catalog” (VDC), a relational database
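One way to picture the three record types is as plain data classes. This is a minimal sketch under stated assumptions: the class and field names are hypothetical, the catalog is an in-memory stand-in for the relational VDC, and the tr1/x1 names are borrowed from the VDL example elsewhere in this deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transformation:
    """Abstract template of a program invocation (a "function definition")."""
    name: str
    inputs: List[str]
    outputs: List[str]


@dataclass
class Derivation:
    """A "function call": binds formal arguments to logical file names."""
    name: str
    transformation: Transformation
    bindings: Dict[str, str]

    def input_files(self) -> List[str]:
        return [self.bindings[arg] for arg in self.transformation.inputs]

    def output_files(self) -> List[str]:
        return [self.bindings[arg] for arg in self.transformation.outputs]


@dataclass
class Invocation:
    """Record of one execution of a derivation (fields are illustrative)."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


@dataclass
class VirtualDataCatalog:
    """In-memory stand-in; the real VDC is a relational database."""
    derivations: Dict[str, Derivation] = field(default_factory=dict)
    invocations: List[Invocation] = field(default_factory=list)

    def add(self, derivation: Derivation) -> None:
        self.derivations[derivation.name] = derivation

    def record(self, invocation: Invocation) -> None:
        self.invocations.append(invocation)

    def has_run(self, derivation_name: str) -> bool:
        """True if the derivation is a record of the past, not only a recipe."""
        return any(iv.derivation.name == derivation_name for iv in self.invocations)


tr1 = Transformation("tr1", inputs=["a1"], outputs=["a2"])
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
vdc = VirtualDataCatalog()
vdc.add(x1)
```

A derivation with no invocation is a recipe for data that can be generated; once an invocation is recorded, the same derivation documents how the data was generated.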
VDL Describes Workflow
via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
(Figure: file1 -> x1 -> file2 -> x2 -> file3)
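The ordering of x1 and x2 is never stated explicitly; it follows from file2 being an output of one derivation and an input of the other. A small Python sketch of that inference, with a hypothetical dictionary layout standing in for parsed VDL:

```python
def dependency_edges(derivations):
    """Infer (producer, consumer) edges from shared logical file names.

    `derivations` maps a derivation name to (input files, output files).
    """
    produced_by = {}
    for name, (_, outputs) in derivations.items():
        for filename in outputs:
            produced_by[filename] = name

    edges = set()
    for name, (inputs, _) in derivations.items():
        for filename in inputs:
            producer = produced_by.get(filename)
            if producer is not None and producer != name:
                edges.add((producer, name))
    return sorted(edges)


# The two derivations from the slide.
WORKFLOW = {
    "x1": (["file1"], ["file2"]),
    "x2": (["file2"], ["file3"]),
}

if __name__ == "__main__":
    print(dependency_edges(WORKFLOW))  # [('x1', 'x2')]
```

file1 has no producer in this workflow, so it yields no edge and is treated as an external input.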
Workflow example
Graph structure
– Fan-in
– Fan-out
– "left" and "right" can run in parallel
Needs external input file
– Located via replica catalog
Data file dependencies
– Form graph structure
(Figure: diamond graph: preprocess -> findrange ("left"), findrange ("right") -> analyze)
Complete VDL workflow

Generate appropriate derivations
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ],
  a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"},
  a2=@{in:"f.b2"}, a1=@{in:"f.b1"},
  name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"},
  a2=@{in:"f.b2"}, a1=@{in:"f.b1"},
  name="right" );
DV bottom->analyze( b=@{out:"f.d"},
  a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
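The four derivations form a diamond, and the stages that can run concurrently fall out of a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the predecessor table is written by hand from the file names above and is an illustration, not output of the VDL tools.

```python
from graphlib import TopologicalSorter

# Each derivation mapped to the derivations whose output files it reads:
# left and right read f.b1 and f.b2 (written by top);
# bottom reads f.c1 (written by left) and f.c2 (written by right).
PREDECESSORS = {
    "top": set(),
    "left": {"top"},
    "right": {"top"},
    "bottom": {"left", "right"},
}


def parallel_stages(predecessors):
    """Group derivations into stages; members of one stage can run in parallel."""
    sorter = TopologicalSorter(predecessors)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    print(parallel_stages(PREDECESSORS))
    # [['top'], ['left', 'right'], ['bottom']]
```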
Compound Transformations
Enable Functional Abstractions

Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis( in fa, p1, p2,
                  out fd, io fc1,
                  io fc2, io fb1, io fb2 )
{
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2},
                  name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2},
                  name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
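Read as a function, the compound transformation expands one set of actual arguments into four calls on simple transformations. A rough Python analogue, in which the function name and the (transformation, arguments) tuples are illustrative assumptions rather than anything the VDL tools emit:

```python
def range_analysis(fa, p1, p2, fd, fc1, fc2, fb1, fb2):
    """Expand the compound TR into its four calls, in declaration order."""
    return [
        ("preprocess", {"a": fa, "b": [fb1, fb2]}),
        ("findrange", {"a1": fb1, "a2": fb2, "name": "LEFT", "p": p1, "b": fc1}),
        ("findrange", {"a1": fb1, "a2": fb2, "name": "RIGHT", "p": p2, "b": fc2}),
        ("analyze", {"a": [fc1, fc2], "b": fd}),
    ]


# Actual arguments matching derivation d1 in the derivation-script slide.
calls = range_analysis(
    fa="f.00000", p1="0", p2="100", fd="f.00005",
    fc1="f.00004", fc2="f.00003", fb1="f.00002", fb2="f.00001",
)

if __name__ == "__main__":
    for transformation, arguments in calls:
        print(transformation, arguments)
```

The intermediate files (fb1, fb2, fc1, fc2) appear both as outputs of one call and inputs of a later one, which is why the compound signature declares them `io`.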
Derivation scripts

Representation of virtual data provenance:
DV d1->diamond( fd=@{out:"f.00005"},
fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"},
fa=@{io:"f.00000"}, p2="100", p1="0" );
DV d2->diamond( fd=@{out:"f.0000B"},
fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"},
fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
...
DV d70->diamond( fd=@{out:"f.001A3"},
fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"},
fa=@{io:"f.0019E"}, p2="800", p1="18" );
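The file names in d1, d2, and d70 are consistent with each derivation taking six consecutive hexadecimal-numbered logical files (d70 starts at 6 × 69 = 414 = 0x19E). A script of this shape could therefore be generated; the following Python sketch is an assumption about how that might be done, not the generator actually used, and it takes p1 and p2 as caller-supplied strings because the slide does not say how they were chosen.

```python
def diamond_derivation(index, p1, p2):
    """Render derivation d<index> over six consecutive hex-numbered files."""
    base = 6 * (index - 1)
    # Order matches the slide: fa is the lowest-numbered file, fd the highest.
    fa, fb2, fb1, fc2, fc1, fd = (f"f.{base + k:05X}" for k in range(6))
    return (
        f'DV d{index}->diamond( fd=@{{out:"{fd}"}},\n'
        f'  fc1=@{{io:"{fc1}"}}, fc2=@{{io:"{fc2}"}},\n'
        f'  fb1=@{{io:"{fb1}"}}, fb2=@{{io:"{fb2}"}},\n'
        f'  fa=@{{io:"{fa}"}}, p2="{p2}", p1="{p1}" );'
    )


if __name__ == "__main__":
    print(diamond_derivation(1, p1="0", p2="100"))
    print(diamond_derivation(70, p1="18", p2="800"))
```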
Invocation Provenance
Completion status and
resource usage
Attributes of executable
transformation
Attributes of input and output
files
Executing VDL Workflows
(Figure: an abstract workflow is turned into a concrete DAG by a planner that uses Grid info; the concrete DAG is run by DAGman / Condor-G.)
Planners: local planner; global planner “Pegasus”; “jit” planner (research)
GriPhyN-iVDGL
Applications to date





ATLAS, BTeV, CMS – HEP event simulation
Argonne Computational Biology – sequence
comparison and result capture
LIGO – Pulsar search
Sloan Digital Sky Survey – cluster finding;
near-earth object search planned
Quarknet – science education – cosmic
rays, HEP analysis
Genome Analysis Database Update
(Figure: GADU architecture diagram. Labels: End Users (Hit and Run, Public, Registered, Groups, Collaborators); Jetspeed Interface to the Server; GADU-G Server; Grid resources UofWisc, Jazz/ANL, Grid3; Chimera, Condor, Globus.)
Data flow and storage at various levels
Automatic workflows created as per user request or project
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS
Described in GGF10 workshop paper.
Virtual Data Example:
Galaxy Cluster Search
(Figure: workflow DAG over Sloan data, and a plot of the galaxy cluster size distribution: Number of Clusters, 1 to 100000, versus Number of Galaxies, 1 to 100.)
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago.
Described in SC2002 paper
Cluster Search
Workflow Graph
and Execution Trace
Workflow jobs vs time
Virtual Data Application:
High Energy Physics
Data Analysis
(Figure: graph of virtual data products, each node labeled by the parameters that define it.)
mass = 200
mass = 200, decay = bb
mass = 200, decay = ZZ
mass = 200, decay = WW
mass = 200, event = 8
mass = 200, plot = 1
mass = 200, decay = WW, stability = 3
mass = 200, decay = WW, stability = 1
mass = 200, decay = WW, event = 8
mass = 200, decay = WW, plot = 1
mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
mass = 200, decay = WW, stability = 1, event = 8
mass = 200, decay = WW, stability = 1, plot = 1
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Ref: CHEP 2002 paper
Using Virtual Data for
Science Education


The QuarkNet-Trillium collaboration is using Grid
virtual data tools and methods to enrich science
education
It's an experiment to give students the means to:
– discover and apply datasets, algorithms, and data
analysis methods
– collaborate by developing new ones and sharing results
and observations
– learn data analysis methods that will ready and excite
them for a scientific career

And in later steps, we may actually use the Grid!
Quarknet Virtual Data Project
(Figure: three school sites, each with a cosmic ray detector, locally collected data, and student/teacher teams, connect through standard Web access to the Quarknet Virtual Data Portal. The portal is shown with the Virtual Data Catalog, the Virtual Data Toolkit, and student data, algorithms, results, notes, and communications.)
Sites:
– Yale / Middletown High Collaboration, Hartford, Connecticut
– Foothills High School, Great Falls, Montana
– Central High School, Reston, Virginia
Quarknet Virtual Data Portal
Student/teacher teams sharing data, methods, programs, and knowledge
Enabling collaboration-intensive science discovery with virtual data tools and methods
Detector Performance Study
Example: BTeV Event Simulation
Search by
Metadata
Deriving a new dataset
…to find mass of “z” particle:
Workflow for
missing energy calculations
Virtual Provenance:
list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
  dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
  <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
  <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
  dv-namespace="Quarknet.HEPSRCH" …
  <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"
</job>
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
  <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …
  <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<!--list of all files used -->
<filename file="ecal.pct" link="inout"/>
<filename file="electron10GeV.avg" link="inout"/>
<filename file="electron10GeV.sum" link="inout"/>
<filename file="hcal.pct" link="inout"/>….
(excerpted for display)
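Because each job lists its files in `<uses>` elements with a `link` attribute, the inputs and outputs of every job can be read back with a standard XML parser. A minimal Python sketch follows; the sample reproduces job ID000001 from the excerpt, while the enclosing `<adag>` root element and the function name are assumptions added so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Job ID000001 from the excerpt, wrapped in an assumed root element.
SAMPLE = """<adag>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
       dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
    <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
    <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
  </job>
</adag>"""


def files_by_job(xml_text):
    """Map each job id to its transformation name, input files, and output files."""
    root = ET.fromstring(xml_text)
    result = {}
    for job in root.iter("job"):
        uses = job.findall("uses")
        result[job.get("id")] = {
            "name": job.get("name"),
            "inputs": [u.get("file") for u in uses if u.get("link") == "input"],
            "outputs": [u.get("file") for u in uses if u.get("link") == "output"],
        }
    return result


if __name__ == "__main__":
    print(files_by_job(SAMPLE))
```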
Virtual Provenance in XML:
control flow graph
<child ref="ID000003"> <parent ref="ID000002"/> </child>
<child ref="ID000004"> <parent ref="ID000003"/> </child>
<child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001
<child ref="ID000009"> <parent ref="ID000008"/> </child>
<child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006
<child ref="ID000012"> <parent ref="ID000011"/> </child>
<child ref="ID000013"> <parent ref="ID000011"/> </child>
<child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012
                       <parent ref="ID000013"/>… </child>…
(excerpted for display…)
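The `<child>`/`<parent>` elements encode the control-flow graph directly, so the full set of jobs that must precede a given job can be collected by following parent references. The Python sketch below uses only entries that appear complete in the excerpt; the `<adag>` wrapper and the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Four complete entries from the excerpt, wrapped in an assumed root element.
SAMPLE = """<adag>
  <child ref="ID000003"><parent ref="ID000002"/></child>
  <child ref="ID000004"><parent ref="ID000003"/></child>
  <child ref="ID000012"><parent ref="ID000011"/></child>
  <child ref="ID000013"><parent ref="ID000011"/></child>
</adag>"""


def parent_map(xml_text):
    """Map each child job id to the list of its direct parent job ids."""
    root = ET.fromstring(xml_text)
    return {
        child.get("ref"): [p.get("ref") for p in child.findall("parent")]
        for child in root.iter("child")
    }


def ancestors(job_id, parents):
    """Return every job that must finish before `job_id` can start."""
    found, stack = set(), list(parents.get(job_id, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, []))
    return found


if __name__ == "__main__":
    graph = parent_map(SAMPLE)
    print(sorted(ancestors("ID000004", graph)))  # ['ID000002', 'ID000003']
```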
And writing the results up in a “poster”
(Figure: poster describing analysis)
Observations



A provenance approach based on interface definition
and data flow declaration fits well with Grid
requirements for code and data transportability and
heterogeneity
Working in a provenance-managed system has many
fringe benefits: uniformity, precision, structure,
communication, documentation
The real world is messy – finding the right abstractions
is hard, and handling “legacy” applications is even
harder
Vision for Provenance in the Large



Universal knowledge management and
production systems
Vendors integrate the provenance tracking
protocol into data processing products
Ability to run anywhere “in the Grid”
Virtual Data Grid Vision
(Figure: Virtual Data Grid architecture diagram. Labels, grouped by kind:)
Roles: Researcher, Production Manager, Science Review
Activities: discovery, composition, planning, derivation, sharing
Catalogs: virtual data catalog, virtual data index
Planning and execution: workflow planner, request planner, request predictor (Prophesy), workflow executor (DAGman), request executor (Condor-G, GRAM)
Grid services: Grid Monitor, Grid Operations, replica location service, Storage Resource Mgmt, Data Transport, storage element
Data: detector, raw data, simulation, simulation data, analysis
Regions: Computing Grid, Data Grid
Planned Dataset Model
Dataset types:
– File
– Set of files
– Relational query or spreadsheet range
– Object closure
– XML Element (e.g. <FORM> <Title…> </FORM>)
– New user-defined dataset type: Set of files with relational index
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao
Planned Dataset Type Model
(Figure: dataset type hierarchy; nonleaf types are superclasses.)
Representational: FileDataset, File, MultiFileSet, FileSet, TarFileSet
Logical: EventCollection, RawEventSet, SimulatedEventSet, MonteCarlo Simulation, DiscreteEvent Simulation
Provenance Server Plans

OGSA-based Grid services
– Discovery, security, resource management





Supports code and data discovery
and workflow management
Object names (TR, DS, TY, DV, IV) can be used as
global cross-server links
Derivations can reference remote transformations
and datasets
Structured object namespaces & object-level access
control enable large VO collaboration
Generalize transforms to describe service calls,
database queries and language interpreters
Provenance Hyperlinks
(Figure: DV, DS, and TR objects held in Personal VDS, Group VDS, and Collaboration VDS servers, with links between objects on different servers.)
Indexing Servers
to Support Discovery
(Figure: the DV, DS, and TR objects in Personal VDS, Group VDS, and Collaboration VDS servers are covered by Personal Indexes, a Group Index, and a collaboration-wide index.)
For Information and Software

Virtual Data System
– www.griphyn.org/chimera - Chimera Virtual Data
System: Overview, papers, software

Grids and Grid Software
– www.ivdgl.org/grid2003 - Using Grid3
– www.griphyn.org/vdt - Virtual Data Toolkit
– www.globus.org – The Globus Toolkit
– www.cs.wisc.edu/condor - The Condor Project
– www.ppdg.net – Particle Physics Data Grid
Acknowledgements:
Virtual Data is a Large Team Effort
The Chimera Virtual Data System
is the work of Ian Foster, Jens Voeckler,
Mike Wilde and Yong Zhao
The Pegasus Planner is the work of Ewa
Deelman, Gaurang Mehta, and Karan Vahi
Applications described are the work of many
people, including: James Annis, Rick
Cavanaugh, Dan Engh, Rob Gardner,
Albert Lazzarini, Natalia Maltsev, and their
wonderful teams
Acknowledgements
GriPhyN, iVDGL, and QuarkNet
(in part) are supported by the
National Science Foundation
The Globus Alliance, PPDG, and QuarkNet are
supported in part by the US Department of
Energy, Office of Science; by the NASA
Information Power Grid program; and by IBM