The Chimera Virtual Data System
www.griphyn.org/chimera
Presented by Mike Wilde
Workflow Workshop
3 December 2003
e-Science Institute, Edinburgh
Acknowledgements
GriPhyN, the Grid Physics Network, is supported by The National Science Foundation, Information Technology Research Program
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
This talk was also delivered at the Data Provenance and Annotation Workshop, 1 Dec 2003
The Virtual Data Concept
Enhance scientific productivity through:
• Discovery and application of datasets and programs
• Enabling use of a worldwide data grid as a scientific workstation
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
Provenance ~= Virtual Data
Provenance System Goals
Producing data from transformations with uniform, precise data interface descriptions enables…
• Discovery: finding and understanding datasets and transformations
• Workflow: structured paradigm for organizing, locating, specifying, & producing scientific datasets
  – Forming new workflow
  – Building new workflow from existing patterns
  – Managing change
• Planning: automated to make the Grid transparent
• Audit: explanation and validation via provenance
Virtual Data Grid Vision
[Figure: architecture diagram connecting a Researcher, a Production Manager, and Science Review to a Data Grid and a Computing Grid. Labeled components: discovery, sharing, composition; workflow planner; workflow executor (DAGman); request planner; request executor (Condor-G, GRAM); request predictor (Prophesy); Grid Monitor; Grid Operations; Data Transport; Storage Resource Mgmt; replica location service; virtual data catalogs; virtual data index; storage elements; detector, simulation, and analysis. Flow labels: raw data, simulation data, planning, derivation.]
Usage Models and Cases
• Domains where it's valuable (and where it's not)? Cost benefit ratios?
• Batch models
  – Cluster finding laboratory: code and data changes, track results.
• Interactive models
  – Using provenance within interactive dialogs in graphical and textual tools
  – Moving back and forth between interactive and batch modes
• Discovery
• Understand / review / audit
• Compose
• Passive Provenance: recording
• Active Provenance: declaration
Virtual Data Example: Galaxy Cluster Search
[Figure: workflow DAG over Sloan Data, and a plot of the galaxy cluster size distribution (x-axis: Number of Galaxies, 1 to 100; y-axis: Number of Clusters, 1 to 100000).]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago
Virtual Data Application: High Energy Physics Data Analysis
[Figure: graph of derived datasets, each node labeled with the parameters that produced it. Every node carries mass = 200; nodes add decay = bb, decay = ZZ, or decay = WW, and further combinations of stability = 1 or 3, event = 8, plot = 1, and LowPt = 20, HighPt = 10000.]
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida
Provenance Scenario
[Figure: data-flow graph in which simulate, reformat, conv, summarize, and psearch read and write file1 through file8.]
• On-demand data generation
• Update workflow following changes
• Manage workflow
• Explain provenance, e.g. for file8:
psearch -t 10 -i file1 file3 file4 file5 file7 -o file8
simulate -t 10 -o file1 file2
reformat -f fz -i file2 -o file3 file4 file5
summarize -t 10 -i file6 -o file7
conv -l esd -o aod -i file2 -o file6
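The explanation for file8 can be read as a backward walk over recorded inputs and outputs. The following is a minimal Python sketch of that idea, not Chimera's implementation: the `Step` record and the `explain` function are hypothetical names, and the input/output lists are copied from the five commands above.

```python
from typing import NamedTuple, Tuple

class Step(NamedTuple):
    """One recorded command: its name, input files, and output files."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]

# Inputs and outputs as listed in the scenario's five commands.
STEPS = [
    Step("psearch", ("file1", "file3", "file4", "file5", "file7"), ("file8",)),
    Step("simulate", (), ("file1", "file2")),
    Step("reformat", ("file2",), ("file3", "file4", "file5")),
    Step("summarize", ("file6",), ("file7",)),
    Step("conv", ("file2",), ("file6",)),
]

def explain(target, steps=STEPS):
    """Return the steps needed to produce `target`, producers first."""
    producer = {out: step for step in steps for out in step.outputs}
    ordered, seen = [], set()

    def visit(lfn):
        step = producer.get(lfn)
        if step is None or step.name in seen:
            return  # a primary input, or a step already scheduled
        seen.add(step.name)
        for dep in step.inputs:
            visit(dep)
        ordered.append(step.name)  # appended only after all its producers

    visit(target)
    return ordered

print(explain("file8"))
# ['simulate', 'reformat', 'conv', 'summarize', 'psearch']
```

The same walk answers the "on-demand data generation" case: the returned list is one valid order in which to rerun steps when file8 does not yet exist.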
Fundamental Units
• Transformations
  – Interface Declarations
  – Action Declarations
  – Call declaration
  – Invocation
• Datasets
  – Contents
  – Representation
  – Location
VDL: Virtual Data Language
Describes Data Transformations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition"
• Derivation
  – “Function call” to a transformation
  – Stores past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation
  – Record of a Derivation execution
Example Transformation
TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
  argument = "-p "${pa};
  argument = "-f "${a1};
  argument = "-x -y";
  argument stdout = ${a2};
  profile env.MAXMEM = ${env};
}
[Figure: $a1 flows into t1, which produces $a2]
Example Transformation Calls (Derivations)
DV d1->t1 (
  env="20000", pa="600",
  a2=@{out:run1.exp15.T1932.summary},
  a1=@{in:run1.exp15.T1932.raw}
);
DV d2->t1 (
  a1=@{in:run1.exp16.T1918.raw},
  a2=@{out:run1.exp16.T1918.summary}
);
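The "function definition" and "function call" analogy can be made concrete in Python. This is only an illustrative sketch: the dictionary layout returned by `t1` is an assumption made for the example, not how Chimera represents a derivation. The point is that d1 overrides `pa` and `env`, while d2 falls back on the defaults declared in TR t1.

```python
def t1(a2, a1, pa="500", env="100000"):
    """Python stand-in for TR t1: fill the argument template with bindings."""
    return {
        "argv": ["-p", pa, "-f", a1, "-x", "-y"],  # the three argument lines
        "stdout": a2,                               # argument stdout = ${a2}
        "env": {"MAXMEM": env},                     # profile env.MAXMEM = ${env}
    }

# DV d1 binds every parameter, in any order (keyword arguments).
d1 = t1(env="20000", pa="600",
        a2="run1.exp15.T1932.summary", a1="run1.exp15.T1932.raw")

# DV d2 binds only the files, so pa and env keep their declared defaults.
d2 = t1(a1="run1.exp16.T1918.raw", a2="run1.exp16.T1918.summary")

print(d1["argv"])
print(d2["argv"], d2["env"])
```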
Workflow from File Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2}; }
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Figure: file1 feeds x1, which writes file2; file2 feeds x2, which writes file3]
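No edge between x1 and x2 is declared anywhere; the ordering follows from file2 being an output of one derivation and an input of the other. A minimal Python sketch of that inference, where `infer_edges` and the dictionary layout are hypothetical names chosen for illustration:

```python
def infer_edges(derivations):
    """derivations maps a DV name to (input LFNs, output LFNs).

    Returns sorted (producer, consumer) pairs implied by shared file names.
    """
    producer = {}
    for name, (_, outputs) in derivations.items():
        for lfn in outputs:
            producer[lfn] = name

    edges = set()
    for name, (inputs, _) in derivations.items():
        for lfn in inputs:
            # A file nobody produces is a primary input and adds no edge.
            if lfn in producer and producer[lfn] != name:
                edges.add((producer[lfn], name))
    return sorted(edges)

DERIVATIONS = {
    "x1": (["file1"], ["file2"]),
    "x2": (["file2"], ["file3"]),
}

print(infer_edges(DERIVATIONS))  # [('x1', 'x2')]
```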
Example Invocation
• Completion status and resource usage
• Attributes of executable transformation
• Attributes of input and output files
Example Workflow
[Figure: preprocess feeds two findrange steps, which both feed analyze]
• Complex structure
  – Fan-in
  – Fan-out
  – "left" and "right" can run in parallel
• Uses input file
  – Register with RC
• Complex file dependencies
  – Glues workflow
Workflow step "preprocess"
• TR preprocess turns f.a into f.b1 and f.b2
TR preprocess( output b[], input a ) {
  argument = "-a top";
  argument = " -i "${input:a};
  argument = " -o " ${output:b};
}
• Makes use of the "list" feature of VDL
  – Generates 0..N output files.
  – Number of files depends on the caller.
Workflow step "findrange"
• Turns two inputs into one output
TR findrange( output b, input a1, input a2,
    none name="findrange", none p="0.0" ) {
  argument = "-a "${name};
  argument = " -i " ${a1} " " ${a2};
  argument = " -o " ${b};
  argument = " -p " ${p};
}
• Uses the default argument feature
Can also use list[] parameters
TR findrange( output b, input a[],
    none name="findrange", none p="0.0" ) {
  argument = "-a "${name};
  argument = " -i " ${" "|a};
  argument = " -o " ${b};
  argument = " -p " ${p};
}
Workflow step "analyze"
• Combines intermediary results
TR analyze( output b, input a[] ) {
  argument = "-a bottom";
  argument = " -i " ${a};
  argument = " -o " ${b};
}
Complete VDL workflow
• Generate appropriate derivations
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"},
    a2=@{in:"f.b2"}, a1=@{in:"f.b1"},
    name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"},
    a2=@{in:"f.b2"}, a1=@{in:"f.b1"},
    name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
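The four derivations fix the shape of the diamond purely through their file names, including the fact that "left" and "right" can run in parallel. As a rough illustration (not the Pegasus planner or DAGman), the sketch below groups the derivations into stages with Python's standard `graphlib`; the `DIAMOND` table and `stages` function are hypothetical names, and the file lists are copied from the DV statements.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Inputs and outputs of each derivation, as declared in the DV statements.
DIAMOND = {
    "top":    {"in": ["f.a"],          "out": ["f.b1", "f.b2"]},
    "left":   {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right":  {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}

def stages(derivations):
    """Group derivations into stages; members of a stage are independent."""
    producer = {lfn: dv for dv, io in derivations.items() for lfn in io["out"]}
    # Map each derivation to the set of derivations it depends on.
    graph = {
        dv: {producer[lfn] for lfn in io["in"] if lfn in producer}
        for dv, io in derivations.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        result.append(ready)
        sorter.done(*ready)
    return result

print(stages(DIAMOND))  # [['top'], ['left', 'right'], ['bottom']]
```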
Compound Transformations
• Using compound TR
  – Permits composition of complex TRs from basic ones
  – Calls are independent
    > unless linked through LFN
  – A Call is effectively an anonymous derivation
    > Late instantiation at workflow generation time
  – Permits bundling of repetitive workflows
  – Model: Function calls nested within a function definition
Compound Transformations (cont)
• TR diamond encapsulates “diamond” workflows:
TR diamond( out fd, io fc1, io fc2, io fb1, io fb2, in fa, p1, p2 ) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2},
      name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2},
      name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
Compound Transformations (cont)
• Multiple DVs allow easy generator scripts:
DV d1->diamond( fd=@{out:"f.00005"},
    fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
    fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"},
    fa=@{io:"f.00000"}, p2="100", p1="0" );
DV d2->diamond( fd=@{out:"f.0000B"},
    fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
    fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"},
    fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
...
DV d70->diamond( fd=@{out:"f.001A3"},
    fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
    fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"},
    fa=@{io:"f.0019E"}, p2="800", p1="18" );
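A generator script for these statements could look roughly like the Python sketch below. It relies on one pattern visible in the examples: each diamond uses six consecutive file names numbered in five-digit uppercase hexadecimal, so d70 starts at 6 × 69 = 414 = 0x19E. The function name `diamond_dv` is hypothetical, and because the slides do not say how p1 and p2 are chosen, the caller supplies them.

```python
def diamond_dv(index, p1, p2):
    """Render the DV statement for the index-th diamond (1-based).

    Assumes six consecutive hex-numbered LFNs per diamond, as in the
    d1, d2 and d70 examples; p1 and p2 are passed through unchanged.
    """
    base = 6 * (index - 1)

    def lfn(offset):
        return "f.%05X" % (base + offset)

    return (
        f'DV d{index}->diamond( fd=@{{out:"{lfn(5)}"}},\n'
        f'    fc1=@{{io:"{lfn(4)}"}}, fc2=@{{io:"{lfn(3)}"}},\n'
        f'    fb1=@{{io:"{lfn(2)}"}}, fb2=@{{io:"{lfn(1)}"}},\n'
        f'    fa=@{{io:"{lfn(0)}"}}, p2="{p2}", p1="{p1}" );'
    )

print(diamond_dv(70, p1="18", p2="800"))
```

Calling `diamond_dv(70, p1="18", p2="800")` reproduces the file names of the d70 statement shown above.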
Dataset Requirements
[Figure: kinds of dataset to be supported]
• File
• Set of files
• Relational query or spreadsheet range
• XML Element, e.g. <FORM> <Title…> </FORM>
• Object closure
• New user-defined dataset type: Set of files with relational index
Possible Dataset Type Model
• Types used for
  – Managing dataset representation
  – Determining argument conformance in invocations
  – Discovery of datasets and transformations
• Two parallel type hierarchies separate representation and semantics
  – Representational: organizes and specifies families of dataset representation
  – Logical: organizes and specifies application-specific semantics of datasets
Example Dataset Types (Nonleaf Types are Superclasses)
[Figure: two type trees.
Representational: FileDataset, File, FileSet, MultiFileSet, TarFileSet.
Logical: EventCollection, RawEventSet, SimulatedEventSet, MonteCarlo Simulation, DiscreteEvent Simulation.]
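To show how two parallel hierarchies could drive argument conformance, here is a small Python sketch. It rests on assumptions that the extracted slide text does not confirm: the parent links in `PARENT` are inferred from the type names, a dataset type is modeled as a (representational, logical) pair, and `is_a` and `conforms` are hypothetical helpers.

```python
# Assumed parent links; the original figure's exact tree is not recoverable.
PARENT = {
    # Representational hierarchy
    "File": "FileDataset",
    "FileSet": "FileDataset",
    "MultiFileSet": "FileSet",
    "TarFileSet": "FileSet",
    # Logical hierarchy
    "RawEventSet": "EventCollection",
    "SimulatedEventSet": "EventCollection",
    "MonteCarloSimulation": "SimulatedEventSet",
    "DiscreteEventSimulation": "SimulatedEventSet",
}

def is_a(type_name, ancestor):
    """True if type_name equals ancestor or descends from it."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = PARENT.get(type_name)
    return False

def conforms(actual, formal):
    """Both arguments are (representational, logical) pairs.

    An actual dataset conforms to a formal parameter only if it
    conforms in both hierarchies independently.
    """
    return is_a(actual[0], formal[0]) and is_a(actual[1], formal[1])

formal = ("FileSet", "EventCollection")
print(conforms(("TarFileSet", "MonteCarloSimulation"), formal))  # True
print(conforms(("File", "RawEventSet"), formal))                 # False
```

The second call fails only on the representational side: a single File is not a FileSet, even though RawEventSet is an EventCollection.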
Dataset Representation Descriptor
• Defines a dataset’s physical layout
• Permits transformations to access datasets
• Structure is defined by dataset type (examples)
  – File: <lfn>                   e.g. <evt.02>
  – MultiFileSet: <lfn+>          e.g. <evt.03, evt.04, evt.05>
  – TarFileSet: <lfn,taropts>     e.g. <evts.1998, "-b50 -z">
  – Relation: <<odbc><select .*>> e.g.
      <server name="db.mcs.anl.gov" db="hepdb" id="uchep"/>
      <query request="select * from evt where eid>2897 and eid<3945" />
• Stored in dataset catalog
• Format constrained by DS type def
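"Format constrained by DS type def" suggests a check of each descriptor against the shape its type declares. The Python sketch below covers only the three file-based types, using the example descriptors from the slide. `VALIDATORS` and `check_descriptor` are hypothetical names, and treating a descriptor as a plain sequence of strings is an assumption made for the example.

```python
def _is_lfn(value):
    """A logical file name is modeled here as any non-empty string."""
    return isinstance(value, str) and value != ""

# One shape check per dataset type, following the <...> patterns above.
VALIDATORS = {
    "File": lambda d: len(d) == 1 and _is_lfn(d[0]),                   # <lfn>
    "MultiFileSet": lambda d: len(d) >= 1 and all(map(_is_lfn, d)),    # <lfn+>
    "TarFileSet": lambda d: (len(d) == 2 and _is_lfn(d[0])
                             and isinstance(d[1], str)),               # <lfn,taropts>
}

def check_descriptor(ds_type, descriptor):
    """Return True if the descriptor has the shape its type requires."""
    validator = VALIDATORS.get(ds_type)
    if validator is None:
        raise KeyError(f"no descriptor rule for dataset type {ds_type!r}")
    return validator(tuple(descriptor))

print(check_descriptor("File", ["evt.02"]))
print(check_descriptor("MultiFileSet", ["evt.03", "evt.04", "evt.05"]))
print(check_descriptor("TarFileSet", ["evts.1998", "-b50 -z"]))
```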
Provenance Schema
[Figure: entity-relationship diagram. Entities and sample attributes:]
– Type: name=type2, repres=<...>
– Dataset: name=foo, type=type2
– Transformation: type-signature = prog1( in type1 X, out type2 Y )
– Derivation: type-signature = prog1( in type1 fnn, out type2 foo )
– Invocation: when=10am, time=20 secs, locn=U.Chicago
– Replica: locn=U.Chicago
– Metadata
[Relationship labels: instance of; contains; arguments of; reads/writes/creates/deletes; invocation of; physical replica of; describes]
Observations
• A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
• Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
Vision for Provenance in the Large
• Universal knowledge management and production systems
• Vendors integrate the provenance tracking protocol into data processing products
• Ability to run anywhere “in the Grid”
Virtual Data Grid Vision
[Figure: the architecture diagram from the earlier "Virtual Data Grid Vision" slide, repeated.]
Systems requirements: Services and Interfaces
• Provenance databases, servers, virtual machines, workflow composers
• Provenance navigation portals and webs
• Embedded tracing systems, esp. within interactive tools: SPSS, ROOT, Excel, etc.
• Catalog integration: replica catalogs, metadata catalogs, transformation catalogs, integrity, coherence, interoperability.
• Interaction between provenance systems and workflow systems
Provenance Servers
• OGSA-based Grid services
  – Discovery, security, resource management
• Supports code and data discovery and workflow management
• Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
• Derivations can reference remote transformations and datasets
• Structured object namespaces & object-level access control enable large VO collaboration
Provenance Hyperlinks
[Figure: DV, DS, and TR objects held in Personal VDS, Group VDS, and Collaboration VDS servers, with links between objects on different servers.]
Indexing Provenance Servers to Support Discovery
[Figure: Personal VDS, Group VDS, and Collaboration VDS servers holding DV, DS, and TR objects, covered by Personal Indexes, a Group Index, a Collaboration-level index, and a Collaboration-wide index.]
Challenges
• What’s the unit of change? Dataset? File? Object?
  – Relations to the worlds of HDF, CDF, FITS, many others
  – Does a dataset type have multiple dimensions?
  – Dataset names/handles
• Unification of processing models: App, SQL, Service
• Closure and reflection: Are transformations and workflows datasets? Can we track provenance of annotations?
• Version management: mutability, timestamps
• Garbage collection, retention, pruning
• Distribution: what standards and naming protocols are needed? Catalogs, schemas?
• Theoretical models? Unification of fine-grain and coarse-grained models?