The Way Things Go




* e-Science is a complex activity
* Scientific knowledge is comprehensible only in the context of those activities
* Adopt the Rube Goldberg view
(image: Rube Goldberg)
National Center for Supercomputing Applications
Grand challenge: systems-scale science

“... modeling complex systems will be a major research challenge for the 21st century”
- National Science Foundation

* Observation and modeling of multiple systems at multiple scales
* Linking data and tools from different disciplines
* to get a valid global result!
Building current practices up isn't working

* Heterogeneous tools, data formats
* Little global coordination of research
* Little funding for sustained stewardship of tools and data
(image: M.C. Escher, “Tower of Babel” (1928))
Proposed solutions aren't working

* e-Journals – not machine-interpretable
* Collaboration tools
  * scientists just use email like everyone else
* Portals and digital libraries – typically:
  * centralized
  * domain-specific
* The Grid – can orchestrate complex processing jobs, but that's not science
Only networks work at scale

* Desktop (single researcher): ad hoc data mgt, single-user apps
* Network (workgroup, community): community tools, resources, control
* Global: no global practice, tools, control
How do we get there?

[Diagram labels: model, refine, predict, observe; critical interface; data]
* e-Science means managing
  * Process, and
  * Data
* Current approaches favor one or the other
* Information is getting lost
Trends: process data
process
Workflow
* provenance
* the grid
* portals
Interactive
* desktop apps
* e-notebooks
* digital libraries
Batch
* formats
* mainframes
Data
Metadata
* rules
* ontologies
data
Semantics
Key technologies

* Semantic web: data/metadata
  * Provides means of merging descriptive information even if it only partially agrees (e.g., comes from two different communities)
* Workflow: process
  * Describes complex procedures independently of how they are executed
* Provenance: process + data/metadata
  * Links workflow, data, and any ancillary descriptive information (e.g., attribution)
Semantics: data to knowledge
(from abstract to concrete; cf. Reagan Moore)

* Knowledge: ontologies, rules, models, etc. (a.k.a. semantics)
  * reached from information by learning, inference
* Information: collections, tags, attributes, etc. (a.k.a. metadata)
  * reached from data by aggregation, annotation
* Data: streams, arrays, swaths, etc. (a.k.a. files)
Semantic web: RDF triple

[Diagram: subject -- predicate --> object]
* Declarative: asserts a fact
* Subject and object URIs identify arbitrary entities (things, people, concepts, events)
* Predicate identifies the relationship between them
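
The triple structure can be sketched in a few lines of Python. This is only an illustration, not the API of any RDF library: the `Triple` type, the `describe` helper, and the `http://example.org/` URIs are all assumptions made up for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One declarative statement: subject --predicate--> object."""
    subject: str    # URI of the thing being described
    predicate: str  # URI naming the relationship
    obj: str        # URI of the related thing


# Hypothetical namespace and entities, for illustration only.
EX = "http://example.org/"
fact = Triple(EX + "rex", EX + "hasBreed", EX + "collie")


def describe(t: Triple) -> str:
    """Render the triple as one line in an N-Triples-like notation."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


print(describe(fact))
```

The triple carries no instructions, only an assertion; anything that reads it decides for itself what to do with the fact.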
Triples form an open network

[Diagram: example graph with a hasBreed arc]
* Subject nodes aren't “owned” by any single agent or container
* Any actor can add arcs to the implicit, total, world graph
* Any two graphs can be joined
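
Joining two graphs can be sketched as a set union over triples. The graphs, the `ex:` names, and the helpers `join` and `arcs_from` below are hypothetical; real RDF merging also has to keep blank nodes from different sources apart, which this sketch ignores.

```python
# Two graphs published independently by hypothetical sources.
# Each triple is (subject, predicate, object).
kennel_graph = {
    ("ex:rex", "ex:hasBreed", "ex:collie"),
}
vet_graph = {
    ("ex:rex", "ex:hasOwner", "ex:alice"),
    ("ex:rex", "ex:hasBreed", "ex:collie"),  # same fact, stated twice
}


def join(*graphs):
    """Merge graphs by set union; identical triples collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def arcs_from(graph, subject):
    """List the (predicate, object) arcs leaving a subject node."""
    return sorted((p, o) for s, p, o in graph if s == subject)


merged = join(kennel_graph, vet_graph)
print(arcs_from(merged, "ex:rex"))
```

Neither source had to know about the other: because both use the same identifier for the subject, the arcs simply accumulate on one node.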
Non satis scire
(to know is not enough)

* Semantic web “layer cake”
* Where do we manage process?
  * User interface?
  * Applications?
* “Semantic Grid” (D. De Roure, C. Goble)
(source: World Wide Web Consortium)
Workflow: process description

* Describe complex operations as networks of simpler operations
* Abstract operation execution from description
* Can be shared (but may not be portable)
(images: Taverna, Kepler)
Anatomy of a workflow

[Diagram labels: execution model (usu. implicit), “Module”, control flow]
* Declarative: says what to do
* Modules identify arbitrary procedures
* Arcs identify flow of control and/or data (data flow is usually implicit)
Workflow systems

* Modules representing units of computation
* Language for specifying WF
  * modules
  * control flow
* Engine for executing WF
(image: D2K; source: NCSA)
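
The three parts listed above (modules, a description language, an engine) can be sketched with the Python standard library. This is a toy under stated assumptions, not how D2K, Taverna, or Kepler work: the module names, the dictionary-based description, and the `run` function are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Modules: units of computation. Each takes a dict of upstream results.
# These three modules are hypothetical examples.
MODULES = {
    "load": lambda inputs: [3, 1, 2],
    "sort": lambda inputs: sorted(inputs["load"]),
    "top": lambda inputs: inputs["sort"][-1],
}

# Description: for each module, the modules it depends on.
# It says what to do, not how or where to execute it.
WORKFLOW = {
    "load": [],
    "sort": ["load"],
    "top": ["sort"],
}


def run(workflow, modules):
    """Engine: execute modules in an order that respects the arcs."""
    results = {}
    for name in TopologicalSorter(workflow).static_order():
        inputs = {dep: results[dep] for dep in workflow[name]}
        results[name] = modules[name](inputs)
    return results


results = run(WORKFLOW, MODULES)
print(results["top"])
```

Because the description is plain data, a different engine (for example, one that dispatches modules to remote machines) could execute the same `WORKFLOW` without changing it.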
Work vs. workflow systems

* Scientists are not WF modules
* Science work also involves
  * social organization incl. funding
  * field and “wet lab” manual work
  * discourse: review, validation
(source: CNRS/UCSD)
Provenance: what happened

* Answers critical questions
  * What led to this result?
  * When and how were observations made, conclusions reached?
* Is a causal network of events
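
If provenance is stored as a causal network, the question "What led to this result?" becomes a backward graph traversal. The sketch below assumes a hypothetical network in which each item maps to its direct causes; the item names and the `what_led_to` function are illustrative only.

```python
from collections import deque

# Hypothetical causal network: each item maps to its direct causes.
CAUSES = {
    "figure": ["analysis"],
    "analysis": ["cleaned_data", "model"],
    "cleaned_data": ["raw_observations"],
    "model": [],
    "raw_observations": [],
}


def what_led_to(result, causes):
    """Return every direct or indirect cause of `result`."""
    seen = set()
    queue = deque(causes.get(result, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            # Walk one step further back along the causal arcs.
            queue.extend(causes.get(node, []))
    return seen


print(sorted(what_led_to("figure", CAUSES)))
```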
Complementary incomplete notions of provenance

* Artifact-centric (e.g., digital libraries)
  * “lineage” = events in lifecycle of artifact, e.g., custody
  * IRs focus on curation events (not antecedent processes)
* Process-centric (e.g., workflow)
  * computational events (e.g., service invocations)
  * control flow
  * artifacts are either not mentioned or opaque (tool-specific)
Provenance Challenges 1 & 2

* IPAW 2006, HPDC 2007
* 20 teams, 1 workflow, 9 queries
  * major players
* Interoperability?
  * lots of manual work required
  * call for standards
(source: gridprovenance.org)
Artifact + process provenance = “open provenance”

* Can describe any process, not just WF execution (e.g., science!)
* Allows alternate accounts by different observers
* Rules for inferring transitive causal relationships
(source: Luc Moreau et al)
Open Provenance Model
(source: Luc Moreau et al)

* 3 node types – artifact, process, agent
* 5 arc types – used, generated, triggered, derived, controlled – and inference rules
* Generic – extensibility via annotation
* Choice of granularity and focus (e.g., artifact or process-centric)
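
A minimal sketch of the node and arc types above, using the short arc names from this slide. It is not the OPM toolkit API: the `Graph` class, the example nodes, and the direction convention (each arc points from effect to cause) are assumptions, and the single inference shown (a process that used an artifact generated by another process is taken to be triggered by it) stands in for the model's fuller rule set.

```python
NODE_TYPES = {"artifact", "process", "agent"}

# Arc name -> (type of the effect node, type of the cause node).
ARC_TYPES = {
    "used": ("process", "artifact"),
    "generated": ("artifact", "process"),
    "triggered": ("process", "process"),
    "derived": ("artifact", "artifact"),
    "controlled": ("process", "agent"),
}


class Graph:
    """Toy provenance graph that checks node and arc types."""

    def __init__(self):
        self.nodes = {}    # node name -> node type
        self.arcs = set()  # (arc name, effect, cause)

    def add_node(self, name, kind):
        if kind not in NODE_TYPES:
            raise ValueError(f"unknown node type: {kind}")
        self.nodes[name] = kind

    def add_arc(self, arc, effect, cause):
        actual = (self.nodes[effect], self.nodes[cause])
        if actual != ARC_TYPES[arc]:
            raise TypeError(f"{arc} cannot link {actual[0]} to {actual[1]}")
        self.arcs.add((arc, effect, cause))

    def infer_triggered(self):
        """If p2 used an artifact generated by p1, infer p2 triggered by p1."""
        used = {(e, c) for arc, e, c in self.arcs if arc == "used"}
        made = {(e, c) for arc, e, c in self.arcs if arc == "generated"}
        return {("triggered", p2, p1)
                for p2, a in used for a2, p1 in made if a == a2}


# Hypothetical example: alice runs a cleaning step, then a plot is made.
g = Graph()
for name, kind in [("raw", "artifact"), ("cleaned", "artifact"),
                   ("clean", "process"), ("plot", "process"),
                   ("alice", "agent")]:
    g.add_node(name, kind)
g.add_arc("used", "clean", "raw")
g.add_arc("generated", "cleaned", "clean")
g.add_arc("used", "plot", "cleaned")
g.add_arc("controlled", "clean", "alice")
print(g.infer_triggered())
```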
NCSA Provenance Infrastructure

[Layered diagram, top to bottom]
* Visualization, interaction: desktop, portal, etc.
* Tracking, modeling, presentation: OPM toolkit, Open Provenance Model
* Abstraction, inference, storage: Tupelo Semantic Content Repository (contexts, stores)
Tupelo: semantic content
(tupeloproject.org)

* Abstracts content from storage implementations (e.g., Sesame, Mulgara)
* Provides location-independent addressing of content and metadata
* Supports transparent mirroring, caching, failover, etc.
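
The idea of addressing content by identifier while hiding which store holds it can be sketched as follows. None of this is Tupelo's actual API: `MemoryStore`, `MirroredContext`, and the `urn:example:` identifier are hypothetical stand-ins chosen to show mirroring and failover behind a single interface.

```python
class MemoryStore:
    """Stand-in for one storage backend; can be switched off to simulate failure."""

    def __init__(self):
        self.data = {}
        self.available = True

    def put(self, uri, content):
        if not self.available:
            raise ConnectionError("store unavailable")
        self.data[uri] = content

    def get(self, uri):
        if not self.available:
            raise ConnectionError("store unavailable")
        return self.data[uri]


class MirroredContext:
    """Writes to every reachable store; reads from the first one that answers."""

    def __init__(self, stores):
        self.stores = list(stores)

    def put(self, uri, content):
        written = 0
        for store in self.stores:
            try:
                store.put(uri, content)
                written += 1
            except ConnectionError:
                continue  # mirror what we can
        if written == 0:
            raise ConnectionError("no store reachable")

    def get(self, uri):
        for store in self.stores:
            try:
                return store.get(uri)
            except (ConnectionError, KeyError):
                continue  # fail over to the next store
        raise KeyError(uri)


primary, backup = MemoryStore(), MemoryStore()
context = MirroredContext([primary, backup])
context.put("urn:example:dataset1", b"observations")

primary.available = False  # simulate an outage
recovered = context.get("urn:example:dataset1")
print(recovered)
```

The caller only ever names the content; which store answered is invisible, which is the point of location-independent addressing.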
CyberIntegrator: workflow by example

* Records what users do as provenance
  * source, intermediate, and final artifacts
  * steps and parameters
* Can re-enact interaction as a workflow
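
The record-then-re-enact idea can be illustrated with a small recorder. This is a sketch of the concept only, not CyberIntegrator's implementation: the `Recorder` class and the `scale` and `total` steps are hypothetical.

```python
class Recorder:
    """Runs steps interactively while logging them for later replay."""

    def __init__(self):
        self.artifacts = {}  # name -> value (source, intermediate, final)
        self.steps = []      # (function, input names, output name, parameters)

    def add_source(self, name, value):
        self.artifacts[name] = value

    def do(self, func, inputs, output, **params):
        """Execute one step now and record it as provenance."""
        args = [self.artifacts[name] for name in inputs]
        self.artifacts[output] = func(*args, **params)
        self.steps.append((func, tuple(inputs), output, params))
        return self.artifacts[output]

    def reenact(self, **sources):
        """Replay the recorded steps as a workflow over new source artifacts."""
        artifacts = dict(sources)
        for func, inputs, output, params in self.steps:
            args = [artifacts[name] for name in inputs]
            artifacts[output] = func(*args, **params)
        return artifacts


# Hypothetical analysis steps.
def scale(values, factor):
    return [v * factor for v in values]


def total(values):
    return sum(values)


session = Recorder()
session.add_source("readings", [1, 2, 3])
session.do(scale, ["readings"], "scaled", factor=2)
session.do(total, ["scaled"], "sum")

replay = session.reenact(readings=[10, 20])
print(session.artifacts["sum"], replay["sum"])
```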
MAEviz: analysis/viz app, workflow “behind the scenes”

* GIS app. platform
* Earthquake hazard analysis plug-in
* Data catalog
  * built environment
  * fragility/hazard models
* Driven by workflow -> provenance
CyberCollaboratory: collaboration + provenance

* User interaction with tools generates events
* Events are captured using the OPM and published to Tupelo
* Non-portal apps can browse / use provenance
Summary

* “The way things go” is critical to e-Science at scale
* Provenance is an open causal network
* New infrastructure supports provenance
Resources / acknowledgements

* Grid Provenance Challenge
  * http://twiki.gridprovenance.org/
* NCSA technologies
  * Tupelo: http://tupeloproject.org/
  * CyberIntegrator: http://isda.ncsa.uiuc.edu/
  * MAEviz: http://maeviz.cee.uiuc.edu/
  * CyberCollaboratory: http://ecid.ncsa.uiuc.edu/cybercollab/
* Acknowledgements: Jim Myers, Luc Moreau, Juliana Freire, Patrick Paulson, Simon Miles, Bob McGrath, and more ...