Connecting RDSI, NeCTAR and ANDS via Provenance - VHIRL
Ryan Fraser, Nicholas Carr

Outline
● What are VLs?
● What is provenance?
● How do we represent VLs using standardised provenance?

What are VLs?
From https://nectar.org.au/virtual-laboratories-1, they are:
● data repositories and computational tools and streamlining research workflows

Connecting the commons with VHIRL and Provenance

What is provenance?
From http://en.wikipedia.org/wiki/Provenance#Computer_Science:
“Computer science uses the term provenance to mean the lineage of data or processes, as per data provenance. However there is a field of informatics research within computer science called provenance that studies how provenance of data and processes should be characterised, stored and used. Semantic web standards bodies, such as the World Wide Web Consortium, ratified a standard for provenance representation in 2014, known as PROV.”
Do you make decisions? Yes.
Should someone remember how you made those decisions? Yes = PROV

Data Services
● Data layers discovered; layers consist of numerous remote data services
  PROV: Captures data service information (hosted on RDS)
● Subset selected for processing
  PROV: Captures subset details of data selected

Compute/Storage Services
● Flexibility in what compute provider to utilise
● Includes all relevant NeCTAR details for cloud processing
PROV: Captures job details, login info, where/what/when/how computed, etc.

Available Toolboxes
● TCRM: estimate wind speed from cyclone and severe wind
● ANUGA: estimate inundation from riverine floods, tsunami, dam break and storm surge
PROV: Captures code utilised along with “how” it is used (template/input files)

Example for tsunami inundation
PROV: Captures location (PID) of where input files/scripts are persisted

Processing Services
● The steps so far have been building an environment to run a processing script
● Either write your own script... or build from existing templates
● ...when you’re done, it will be submitted for processing on the Cloud!
PROV: Captures location (PID) of where input files/scripts are persisted

PROV: Finalised outputs are persisted with PIDs on RDS and captured in the PROV information
PROV: After the job is completed, the finalised PROV record is published to the provenance store
PROV record endpoints could be registered in ANDS RDA alongside output data!

Components of the Virtual Hazard Impact & Risk Laboratory (VHIRL)
[Diagram: component layers and their labels]
● Data Services: Magnetics, Gravity, Bathymetry, DEM, Landsat
● Processing Services: eScript, ANUGA, TCRM
● Compute Services: NCI Petascale, NCI Cloud, NeCTAR Cloud, Amazon Cloud, Desktop
● Enablers: Service Orchestration, Auth., Provenance Metadata Model, Data Analytics
● Virtual Laboratories/Apps: Coastal Inundation, Tsunami Inundation Scenario, Cyclone Wind, Surface Wave Propagation (earthquake), Cyclone Wind Path Calculation

Background: How do we represent VLs using standardised provenance?

Basic scientific data processing model - 1
[Diagram: Input Data -> Process -> Output Data]

Basic scientific data processing model - 2
[Diagram: Input Data, Code and Config are input items with roles -> Process -> Output Data]

Basic scientific data processing model - 3, PROV
[Diagram: the Process used Input Data, Code and Config and produced Output Data; "Who / which system" is attached to the model. Legend: Entity, Activity, Agent]

Basic scientific data processing model - 4, PROMS
[Diagram: Reporting System X (R.S.) issues Report N. Legend: R.S., Report, Entity, Activity, Agent]

Basic scientific data processing model - 5, Storage
[Diagram: Reports N and M from Reporting Systems X and Y are reported and stored in the Organisational Provenance Store. Legend: R.S., Report, Entity, Activity, Agent]

Data Management
● managed data: VL ID’d and persisted
● web service data: cited using PROMS-O format
● user supplied data: soon to be VL ID’d and persisted, with minimal metadata recorded too
● managed code: SSSC ID’d and persisted
● user supplied code: perhaps SSSC ID’d and persisted, perhaps VL managed
● output data: soon to be VL ID’d and persisted, if required, perhaps with time limits

Virtual Labs Service Citation Example
Web service data is cited using PROMS-O format:
[{ref}] {service title} {service endpoint URI} {query} {time queried} {cached copy ID}

[1] “Subset of elevation”
http://pid.csiro.au/service/anuga-thredds
“bussleton.nc?var=elevation&spatial=bb&north=-33.06495205829679&south=-33.551573283840156&west=114.84967874597227&east=115.70661233971667&temporal=all&time_start=&time_end=&horizStride”
“2014-12-15T13:15:11”
http://pid.csiro.au/dataset/abcd1234
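
The service citation fields shown above (ref, service title, service endpoint URI, query, time queried, cached copy ID) can be sketched as a small record type. This is only an illustrative sketch, not the PROMS-O serialisation itself: the class name `ServiceCitation`, its methods `render` and `bounding_box`, and the plain-text output layout are all assumptions made for this example, and the coordinates in the query are shortened from the slide's values.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs


@dataclass(frozen=True)
class ServiceCitation:
    """Hypothetical record for one web-service data citation.

    Field order follows the slide's template; the class itself is an
    assumption for illustration, not part of PROMS-O.
    """
    ref: int
    service_title: str
    service_endpoint_uri: str
    query: str
    time_queried: str    # ISO 8601 timestamp, kept as text
    cached_copy_id: str  # PID of the cached copy of the response

    def render(self) -> str:
        """Render the citation as one plain-text line (assumed layout)."""
        return (
            f'[{self.ref}] "{self.service_title}" {self.service_endpoint_uri} '
            f'"{self.query}" "{self.time_queried}" {self.cached_copy_id}'
        )

    def bounding_box(self) -> dict:
        """Extract north/south/west/east from the query string as floats."""
        # The query looks like "file.nc?key=value&..."; parse the part after "?".
        params = parse_qs(self.query.partition("?")[2])
        return {k: float(params[k][0]) for k in ("north", "south", "west", "east")}


citation = ServiceCitation(
    ref=1,
    service_title="Subset of elevation",
    service_endpoint_uri="http://pid.csiro.au/service/anuga-thredds",
    # Coordinates shortened from the slide's example for readability.
    query="bussleton.nc?var=elevation&spatial=bb"
          "&north=-33.06&south=-33.55&west=114.85&east=115.71",
    time_queried="2014-12-15T13:15:11",
    cached_copy_id="http://pid.csiro.au/dataset/abcd1234",
)

print(citation.render())
print(citation.bounding_box())
```

Keeping the query string verbatim, next to the time queried and the cached copy ID, is what lets a reader later repeat the request or fall back to the persisted copy if the live service has changed.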