Citation and Recognition of contributions using Semantic Provenance Knowledge Captured in the OPeNDAP Software Framework
Patrick West1 (westp@rpi.edu), James Michaelis1 (michaj6@rpi.edu), Tim Lebo1 (lebot@rpi.edu), Deborah L. McGuinness1 (dlm@cs.rpi.edu), Peter Fox1 (pfox@cs.rpi.edu)
(1Rensselaer Polytechnic Institute, 110 8th St., Troy, NY, 12180 United States)
Abstract
Providing proper citation and attribution for published data, derived data products, and the
software tools used to generate them has always been an important aspect of scientific
research. However, this type of detailed citation and attribution is often lacking. This is in
part because it often requires manual markup, since the tools used to access, manipulate,
transform and visualize data do not typically generate this type of provenance information
dynamically. In addition, the tools themselves lack the information needed for them to be
properly cited.
The OPeNDAP Hyrax Software Framework is a tool that provides access to and the ability to
constrain, manipulate and transform different types of data from different data formats into a
common format, the DAP (Data Access Protocol), in order to derive new data products. A
user, or another software client, specifies an HTTP URL in order to access a particular piece of
data, and appropriately transform it to suit a specific purpose of use. The resulting data
products, however, do not contain any information about what data was used to create them, or
the software process used to generate them, let alone information that would allow the proper
citation and attribution to downstream researchers and tool developers.
We will present our approach to provenance capture in Hyrax including a mechanism that can
be used to report back to the hosting site any derived products, such as publications and
reports, using the W3C PROV recommendation pingback service. We will demonstrate our
utilization of Semantic Web and Web standards, the development of an information model
that extends the PROV model for provenance capture, and the development of the pingback
service. We will present our findings, as well as our practices for providing provenance
information, visualization of the provenance information, and the development of pingback
services, to better enable scientists and tool developers to be recognized and properly cited
for their contributions.
Motivations and Challenges
• Proper data management hinges on recording and maintaining “steps” applied to create data.
• Consumers require methods to assess whether available data is fit for their usage.
  • Was this dataset produced by a trustworthy source?
• Producers are often expected to justify their efforts in generating new datasets.
  • Who is using our data?
  • What are they using it for? And why?
HOWEVER, most current-generation data analysis and manipulation tools fail to capture appropriate meta-information to address these needs.
Use Cases
• A PROV pingback-enabled community collaborates to categorize the points in a LiDAR scan of Disneyland.
  • A client accesses a data point from a LiDAR scan of Disneyland
  • The client categorizes the point as “water”, which is a new derivation of that point
  • The client pings back about this new derivation
• A researcher generates a data product using OPeNDAP and uses it in a derivation. Another researcher, visualizing that derivation, wishes to access the provenance of the data product. What were the original data sources? Can they use them?
• A scientist wishes to discover any derivations of data sources they created.
• OPeNDAP servers are widely used, but are rarely recognized.
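The pingback step of the first use case can be sketched in a few lines of Python. This is a minimal illustration, not the poster's implementation: the helper name `build_pingback_request` is hypothetical, and the URIs are the ones from the poster's "Pingback of derived data" exchange. The request is only built here, not sent.

```python
from urllib.request import Request


def build_pingback_request(pingback_uri, provenance_uris):
    """Build (but do not send) a pingback POST carrying provenance URIs.

    Hypothetical helper for illustration only.
    """
    # A text/uri-list body carries one URI per line, terminated by CRLF.
    body = "\r\n".join(provenance_uris) + "\r\n"
    return Request(
        pingback_uri,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/uri-list"},
        method="POST",
    )


request = build_pingback_request(
    "http://opendap.tw.rpi.edu/disney/pingback",
    [
        "http://coyote.example.org/diagram_abc123/provenance",
        "http://coyote.example.org/journal_article_def456/provenance",
    ],
)
```

Passing `request` to `urllib.request.urlopen` would send it; in the exchange shown on the poster, the server acknowledges a pingback with `204 No Content`.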
W3C PROV Recommendation
[Figure: Simple concepts and properties representing the concepts of the PROV Model.]
[Figure: Representing the OPeNDAP provenance trace for use case 1 using the PROV Model.]
OPeNDAP Back-End Server Design
Our first attempt captured provenance after the fact, but in the end not enough information was available.
Our second attempt collects the information on the go. It does not force module developers to implement anything, but provides hooks that allow them to.
Initial request for data
Host: opendap.tw.rpi.edu
Client: coyote.example.com
C: GET http://opendap.tw.rpi.edu/opendap/CA_OrangeCo_2011_000402.nc.ascii?constraint
S: 200 OK
S: Link: <http://opendap.tw.rpi.edu/disney/provenance_record>; rel="http://www.w3.org/ns/prov#has_provenance"
S: Link: <http://opendap.tw.rpi.edu/disney/pingback>; rel="http://www.w3.org/ns/prov#pingback"
(CA_OrangeCo_2011_000402 ascii representation)
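A client can discover the provenance record and the pingback endpoint from the `Link` headers returned with the initial data response. The sketch below assumes the headers use the simple `<target>; rel="relation"` form; the function name `links_by_relation` is hypothetical, and this is not a complete parser for every `Link` header variant.

```python
import re

PROV_HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance"
PROV_PINGBACK = "http://www.w3.org/ns/prov#pingback"

# Matches one link of the simple form <target>; rel="relation".
_LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def links_by_relation(link_header_values):
    """Map each rel value to the list of target URIs that carry it."""
    found = {}
    for value in link_header_values:
        for target, rel in _LINK_PATTERN.findall(value):
            found.setdefault(rel, []).append(target)
    return found


# Header values as returned with the initial request for data.
link_headers = [
    '<http://opendap.tw.rpi.edu/disney/provenance_record>; '
    'rel="http://www.w3.org/ns/prov#has_provenance"',
    '<http://opendap.tw.rpi.edu/disney/pingback>; '
    'rel="http://www.w3.org/ns/prov#pingback"',
]
links = links_by_relation(link_headers)
provenance_uris = links.get(PROV_HAS_PROVENANCE, [])
pingback_uris = links.get(PROV_PINGBACK, [])
```

Keying the result by relation URI lets the client ask for either the provenance record or the pingback endpoint without depending on header order.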
Sponsors:
Tetherless World Constellation
Acknowledgements:
OPeNDAP.org for their support and being open source, especially James Gallagher and Nathan Potter.
Poster: IN31C-3738
Glossary:
OPeNDAP – Open-source Project for a Network Data Access Protocol
Provenance – information about entities, activities, and people involved in
producing a piece of data or thing, which can be used to form assessments about its
quality, reliability or trustworthiness
RPI – Rensselaer Polytechnic Institute
TWC – Tetherless World Constellation at Rensselaer Polytechnic Institute
Pingback of derived data
Host: opendap.tw.rpi.edu
Client: coyote.example.com
C: POST http://opendap.tw.rpi.edu/disney/pingback HTTP/1.1
C: Content-Type: text/uri-list
C:
C: http://coyote.example.org/diagram_abc123/provenance
C: http://coyote.example.org/journal_article_def456/provenance
S: 204 No Content
RDF Representation of provenance collected in first use case
:CA_OrangeCo_2011_000402.nc.ascii
    rdf:type prov:Entity;
    prov:wasDerivedFrom :NC_File;
    prov:wasGeneratedBy :BES_Process;
.
:BES_Process
    rdf:type prov:Activity;
    prov:qualifiedAssociation [
        a prov:Association;
        prov:agent :BES_Agent;
        prov:hadPlan :BES_Plan;
        rdfs:comment "Execution of BES Server"@en
    ];
.
:BES_Agent
    rdf:type prov:Agent;
    foaf:name "BES Server"
.
:BES_Plan
    rdf:type prov:Plan, prov:Collection;
    prov:qualifiedInfluence [
        a prov:Influence;
        prov:entity opendap:NC_Module;
        prov:hadRole opendap:Read;
        opendap:order 1;
    ];
    prov:qualifiedInfluence [
        a prov:Influence;
        prov:entity opendap:DAP_Module;
        prov:hadRole opendap:Constrain;
        opendap:order 2;
    ];
    prov:qualifiedInfluence [
        a prov:Influence;
        prov:entity opendap:ASCII_Module;
        prov:hadRole opendap:Transmit;
        opendap:order 3;
    ];
.
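To illustrate how a consumer might read this trace, the sketch below walks from the derived entity back to its source, the generating activity, and the responsible agent. It is a minimal illustration under stated assumptions: the triples are transcribed by hand as plain Python tuples rather than parsed with an RDF library, `_:assoc` is a made-up label for the anonymous association node, and the helper names are hypothetical.

```python
# Triples transcribed from the poster's trace; prefixed names kept as strings.
# "_:assoc" is an assumed label for the anonymous prov:Association node.
TRACE = [
    (":CA_OrangeCo_2011_000402.nc.ascii", "rdf:type", "prov:Entity"),
    (":CA_OrangeCo_2011_000402.nc.ascii", "prov:wasDerivedFrom", ":NC_File"),
    (":CA_OrangeCo_2011_000402.nc.ascii", "prov:wasGeneratedBy", ":BES_Process"),
    (":BES_Process", "rdf:type", "prov:Activity"),
    (":BES_Process", "prov:qualifiedAssociation", "_:assoc"),
    ("_:assoc", "rdf:type", "prov:Association"),
    ("_:assoc", "prov:agent", ":BES_Agent"),
    ("_:assoc", "prov:hadPlan", ":BES_Plan"),
    (":BES_Agent", "rdf:type", "prov:Agent"),
    (":BES_Agent", "foaf:name", "BES Server"),
]


def objects(triples, subject, predicate):
    """Return every object of the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def summarize(triples, entity):
    """Follow the trace from an entity to its sources, activities, and agents."""
    summary = {
        "sources": objects(triples, entity, "prov:wasDerivedFrom"),
        "activities": [],
        "agents": [],
    }
    for activity in objects(triples, entity, "prov:wasGeneratedBy"):
        summary["activities"].append(activity)
        for assoc in objects(triples, activity, "prov:qualifiedAssociation"):
            for agent in objects(triples, assoc, "prov:agent"):
                # Prefer the human-readable name when the trace provides one.
                names = objects(triples, agent, "foaf:name")
                summary["agents"].extend(names or [agent])
    return summary


summary = summarize(TRACE, ":CA_OrangeCo_2011_000402.nc.ascii")
```

Here `summary` holds `:NC_File` as the source, `:BES_Process` as the activity, and "BES Server" as the agent.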
Major goal to provide visualizations of provenance
The initial visualization is of the provenance trace, allowing users to get back to the original data, the actions taken on the data, and the agents that performed those actions.
Take Away
• Tool developers want and deserve credit for the tools that are used in the derivation of data
• Users of derived data increasingly want to discover how the products were generated and what original data was used, to determine whether they can use the original or derived data
• Credit should be given to data creators, data curators, data providers, and data users
Be able to see the software modules that acted on the data in producing the derived data.
All the way to seeing who implemented the code that was used to act on the data.
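The plan in the RDF trace records each software module with a role and an `opendap:order` value, which is enough to list the modules in the order they acted on the data. The sketch below is only an illustration: the `Influence` type and `pipeline` function are hypothetical, and the entries are copied by hand from the trace rather than read from an RDF store.

```python
from typing import NamedTuple


class Influence(NamedTuple):
    """One qualified influence on the plan (hypothetical helper type)."""
    module: str
    role: str
    order: int


# Entries copied from the plan in the trace, deliberately left unsorted.
PLAN_INFLUENCES = [
    Influence("opendap:ASCII_Module", "opendap:Transmit", 3),
    Influence("opendap:NC_Module", "opendap:Read", 1),
    Influence("opendap:DAP_Module", "opendap:Constrain", 2),
]


def pipeline(influences):
    """Return one display line per module, sorted by its recorded order."""
    ordered = sorted(influences, key=lambda item: item.order)
    return [f"{i.order}. {i.module} ({i.role})" for i in ordered]


steps = pipeline(PLAN_INFLUENCES)
```

Sorting on the recorded order rather than on list position matters because RDF statements carry no inherent sequence.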