VHIRL-ConnectivityViaProv-Fy2015

Connecting RDSI, NeCTAR and ANDS via Provenance - VHIRL
Ryan Fraser, Nicholas Carr
Outline
● What are VLs?
● What is provenance?
● How do we represent VLs using standardised provenance?
What are VLs?
From https://nectar.org.au/virtual-laboratories-1, they are:
● data repositories and computational tools and streamlining research workflows
Connecting the commons with VHIRL and Provenance
What is provenance?
From http://en.wikipedia.org/wiki/Provenance#Computer_Science:
“Computer science uses the term provenance to mean
the lineage of data or processes, as per data
provenance. However there is a field of informatics
research within computer science called provenance
that studies how provenance of data and processes
should be characterised, stored and used. Semantic
web standards bodies, such as the World Wide Web
Consortium, ratified a standard for provenance
representation in 2014, known as PROV.”
Do you make decisions? Yes. Should someone remember how you made those decisions? Yes
= PROV
Data Services
Data Layers discovered
Layers consist of numerous remote data services
PROV: a) Service captures data service information (hosted on RDS)
Subset Selected for processing
a) Captures subset details of data selected
Compute/Storage Services
PROV: Captures job details, login info, where/what/when/how computed etc
Flexibility in what compute provider to utilise
Includes all relevant NeCTAR details for cloud processing
Available Toolboxes
TCRM – estimate wind speed from cyclone and severe wind
ANUGA – estimate inundation from riverine floods, tsunami, dam break and storm surge
PROV: Captures code utilised along with “how” it is used (template/input files)
Example for tsunami inundation
PROV: Captures location (PID) of where input files/scripts are persisted
Processing Services
PROV: Captures location (PID) of where input files/scripts are persisted
The steps so far have been building an environment to run a processing script
Either write your own script...
...or build from existing templates
...when you’re done, it will be submitted for processing on the Cloud!
PROV: Finalised outputs are persisted with PIDs on RDS and captured in the PROV information
PROV: After the job is completed, the finalised PROV record is published to the provenance store
PROV record endpoints could be registered in ANDS RDA alongside the output data!
Components of the Virtual Hazard Impact & Risk Laboratory (VHIRL)
Diagram labels, grouped by layer:
Data Services: Magnetics, Gravity, Bathymetry, DEM, Landsat
Processing Services: eScript, ANUGA, TCRM
Compute Services: NCI Petascale, NCI Cloud, NeCTAR Cloud, Amazon Cloud, Desktop
Enablers: Service Orchestration, Data Analytics, Auth., Provenance Metadata Model
Virtual Laboratories/Apps: Coastal Inundation, Tsunami Inundation Scenario, Cyclone Wind, Surface Wave Propagation (earthquake), Cyclone Wind Path Calculation
Basic scientific data processing model - 1
Input Data → Process → Output Data
Background: How do we represent VLs using standardised provenance?
Basic scientific data processing model - 2
Input Data, Code, Config (input item Roles) → Process → Output Data
Basic scientific data processing model - 3, PROV
Entity: Input Data, Code, Config, Output Data
Activity: Process (used Input Data, Code and Config)
Agent: Who, or who/which system
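The Entity / Activity / Agent mapping on this slide can be sketched as a small PROV-JSON-style document. This is only an illustration, not VHIRL's or PROMS's actual output: the `ex:` prefix, the identifiers and the `build_prov_document` helper are all hypothetical, and only the PROV relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`) come from the PROV standard.

```python
import json


def build_prov_document(input_data, code, config, output_data, process, agent):
    """Return a PROV-JSON-style dict describing one run of a process.

    All identifiers are supplied by the caller; nothing here is a real
    VHIRL identifier.
    """
    doc = {
        "prefix": {"ex": "http://example.org/"},  # hypothetical namespace
        "entity": {},
        "activity": {process: {}},
        "agent": {agent: {}},
        "used": {},
        "wasGeneratedBy": {},
        "wasAssociatedWith": {},
    }

    # Each input is an Entity that the Activity "used", with a role.
    inputs = {"input data": input_data, "code": code, "config": config}
    for i, (role, entity_id) in enumerate(inputs.items(), start=1):
        doc["entity"][entity_id] = {}
        doc["used"][f"_:u{i}"] = {
            "prov:activity": process,
            "prov:entity": entity_id,
            "prov:role": role,
        }

    # The output is an Entity generated by the Activity.
    doc["entity"][output_data] = {}
    doc["wasGeneratedBy"]["_:g1"] = {
        "prov:entity": output_data,
        "prov:activity": process,
    }

    # The Agent (who / which system) is associated with the Activity.
    doc["wasAssociatedWith"]["_:a1"] = {
        "prov:activity": process,
        "prov:agent": agent,
    }
    return doc


doc = build_prov_document(
    "ex:input-data", "ex:code", "ex:config",
    "ex:output-data", "ex:process", "ex:user",
)
print(json.dumps(doc, indent=2))
```

Keeping the role on each `used` record is what distinguishes data, code and config as inputs to the same process.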
Basic scientific data processing model - 4, PROMS
Reporting System X (R.S.) produces Report N
Legend: R.S. (Reporting System), Report, Entity, Activity, Agent
Basic scientific data processing model - 5, Storage
Reporting System X and Reporting System Y each produce many Reports (Report N, ..., Report M)
Reports are reported to and stored in the Organisational Provenance Store
Legend: R.S. (Reporting System), Report, Entity, Activity, Agent
Data Management
managed data: VL ID’d and persisted
web service data: cited using PROMS-O format
user supplied data: soon to be VL ID’d and persisted, with minimal metadata recorded too
managed code: SSSC ID’d and persisted
user supplied code: perhaps SSSC ID’d and persisted, perhaps VL managed
output data: soon to be VL ID’d and persisted, if required, perhaps with time limits
Virtual Labs Service Citation Example
Data Management
web service data: cited using PROMS-O format
Citation fields:
[{ref}] {service title} {service endpoint URI} {query} {time queried} {cached copy ID}
Example:
[1]
“Subset of elevation”
http://pid.csiro.au/service/anuga-thredds
“bussleton.nc?var=elevation&spatial=bb&north=-33.06495205829679&south=-33.551573283840156&west=114.84967874597227&east=115.70661233971667&temporal=all&time_start=&time_end=&horizStride”
“2014-12-15T13:15:11”
http://pid.csiro.au/dataset/abcd1234
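The citation fields on this slide ({ref}, {service title}, {service endpoint URI}, {query}, {time queried}, {cached copy ID}) can be sketched as a simple record. This is a minimal illustration only: the `ServiceCitation` class and its plain-string layout are assumptions, not the PROMS-O serialisation itself, and the query is shortened for readability.

```python
from dataclasses import dataclass


@dataclass
class ServiceCitation:
    """Hypothetical holder for the citation fields named on the slide."""
    ref: int
    service_title: str
    service_endpoint_uri: str
    query: str
    time_queried: str
    cached_copy_id: str

    def format(self) -> str:
        # Plain one-line layout chosen for illustration only.
        return (
            f'[{self.ref}] "{self.service_title}" {self.service_endpoint_uri} '
            f'"{self.query}" "{self.time_queried}" {self.cached_copy_id}'
        )


citation = ServiceCitation(
    ref=1,
    service_title="Subset of elevation",
    service_endpoint_uri="http://pid.csiro.au/service/anuga-thredds",
    query="bussleton.nc?var=elevation&spatial=bb",  # shortened from the slide
    time_queried="2014-12-15T13:15:11",
    cached_copy_id="http://pid.csiro.au/dataset/abcd1234",
)
print(citation.format())
```

Recording the query and the time queried alongside the endpoint is what makes a subset of a live web service citable; the cached copy ID points at the persisted result of that query.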