Citation and Recognition of contributions using Semantic Provenance Knowledge Captured in the OPeNDAP Software Framework

Patrick West1, James Michaelis1, Tim Lebo1, Deborah L. McGuinness1, Peter Fox1
(1Rensselaer Polytechnic Institute, 110 8th St., Troy, NY, 12180 United States)

Abstract

Providing proper citation and attribution for published data, derived data products, and the software tools used to generate them has always been an important aspect of scientific research. However, this type of detailed citation and attribution is often lacking. This is in part because it often requires manual markup, since the tools used to access, manipulate, transform and visualize data do not typically generate this type of provenance information dynamically. In addition, the tools lack the information needed to be properly cited themselves.

The OPeNDAP Hyrax Software Framework is a tool that provides access to, and the ability to constrain, manipulate and transform, different types of data from different data formats into a common format, the DAP (Data Access Protocol), in order to derive new data products. A user, or another software client, specifies an HTTP URL in order to access a particular piece of data and transform it to suit a specific purpose of use. The resulting data products, however, do not contain any information about what data was used to create them or the software process used to generate them, let alone information that would allow proper citation and attribution by downstream researchers and tool developers.

We will present our approach to provenance capture in Hyrax, including a mechanism that can be used to report back to the hosting site any derived products, such as publications and reports, using the W3C PROV recommendation pingback service. We will demonstrate our utilization of Semantic Web and Web standards, the development of an information model that extends the PROV model for provenance capture, and the development of the pingback service. We will present our findings, as well as our practices for providing provenance information, visualization of the provenance information, and the development of pingback services, to better enable scientists and tool developers to be recognized and properly cited for their contributions.

Motivations and Challenges

• Proper data management hinges on recording and maintaining the “steps” applied to create data.
• Consumers require methods to assess whether available data is fit for their usage. Was this dataset produced by a trustworthy source?
• Producers are often expected to justify their efforts in generating new datasets.
    • Who is using our data?
    • What are they using it for? And why?

HOWEVER, most current-generation data analysis and manipulation tools fail to capture appropriate meta-information to address these needs.

Use Cases

• A PROV pingback-enabled community collaborates to categorize the points in a LiDAR scan of Disneyland.
    • A client accesses a data point from a LiDAR scan of Disneyland.
    • The client categorizes the point as “water”, which is a new derivation of that point.
    • The client pings back about this new derivation.
• A researcher generates a data product using OPeNDAP and uses it in a derivation. Another researcher, visualizing that derivation, wishes to access the provenance of the data product. What were the original data sources? Can they use them?
• A scientist wishes to discover any derivations of data sources they created.
• OPeNDAP servers are widely used, but are rarely recognized.

W3C PROV Recommendation

Figure: Representation of the OPeNDAP provenance trace for use case 1 using the PROV model.
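The questions "Who is using our data?" and the use case of discovering derivations both come down to the hosting site remembering what clients report back. The following is a minimal sketch of that bookkeeping, not the Hyrax implementation. It assumes pingbacks arrive as text/uri-list bodies (one provenance URI per line), and the class name `PingbackRegistry`, its methods, and the example URIs are all hypothetical.

```python
from collections import defaultdict


def parse_uri_list(body):
    """Return the URIs in a text/uri-list body, skipping blank lines and '#' comments."""
    uris = []
    for line in body.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            uris.append(line)
    return uris


class PingbackRegistry:
    """Hypothetical in-memory record of provenance URIs reported for each resource."""

    def __init__(self):
        self._reports = defaultdict(list)

    def record(self, resource, body):
        """Store the URIs from one pingback body; return how many were new."""
        added = 0
        for uri in parse_uri_list(body):
            if uri not in self._reports[resource]:
                self._reports[resource].append(uri)
                added += 1
        return added

    def derivations_of(self, resource):
        """List the provenance URIs reported so far for a resource."""
        return list(self._reports.get(resource, []))


# Hypothetical identifiers, for illustration only.
point = "http://data.example.org/lidar/point/42"
report = "http://client.example.org/point42_water/provenance\r\n"

registry = PingbackRegistry()
first = registry.record(point, report)
second = registry.record(point, report)  # a repeated report adds nothing
known = registry.derivations_of(point)
```

A data producer could then call `derivations_of` for each of their resources to find the provenance records of downstream products. A real service would need persistent storage and some validation of the reported URIs.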
OPeNDAP Back-End Server Design

Figure: Simple concepts and properties representing the concepts of the PROV model.

The first attempt was to capture provenance after the fact, but in the end not enough information was available. The second attempt collects the information on the go. Module developers are not forced to implement anything, but hooks are provided that allow them to.

Initial request for data

    Host: opendap.tw.rpi.edu
    Client: coyote.example.com

    C: GET http://opendap.tw.rpi.edu/opendap/CA_OrangeCo_2011_000402.nc.ascii?constraint

    S: 200 OK
    S: Link: <http://opendap.tw.rpi.edu/disney/provenance_record>; rel="http://www.w3.org/ns/prov#has_provenance"
    S: Link: <http://opendap.tw.rpi.edu/disney/pingback>; rel="http://www.w3.org/ns/prov#pingback"

    (CA_OrangeCo_2011_000402 ascii representation)

Pingback of derived data

    Host: opendap.tw.rpi.edu
    Client: coyote.example.com

    C: POST http://opendap.tw.rpi.edu/disney/pingback HTTP/1.1
    C: Content-Type: text/uri-list
    C:
    C: http://coyote.example.org/diagram_abc123/provenance
    C: http://coyote.example.org/journal_article_def456/provenance

    S: 204 No Content

RDF Representation of provenance collected in first use case

    :CA_OrangeCo_2011_000402.nc.ascii
        rdf:type prov:Entity;
        prov:wasDerivedFrom :NC_File;
        prov:wasGeneratedBy :BES_Process;
    .

    :BES_Process
        rdf:type prov:Activity;
        prov:qualifiedAssociation [
            a prov:Association;
            prov:agent :BES_Agent;
            prov:hadPlan :BES_Plan;
            rdfs:comment "Execution of BES Server"@en
        ];
    .

    :BES_Agent
        rdf:type prov:Agent;
        foaf:name "BES Server"
    .

    :BES_Plan
        rdf:type prov:Plan, prov:Collection;
        prov:qualifiedInfluence [
            a prov:Influence;
            prov:entity opendap:NC_Module;
            prov:hadRole opendap:Read;
            opendap:order 1;
        ];
        prov:qualifiedInfluence [
            a prov:Influence;
            prov:entity opendap:DAP_Module;
            prov:hadRole opendap:Constrain;
            opendap:order 2;
        ];
        prov:qualifiedInfluence [
            a prov:Influence;
            prov:entity opendap:ASCII_Module;
            prov:hadRole opendap:Transmit;
            opendap:order 3;
        ];
    .

A major goal is to provide visualizations of provenance. The initial visualization is of the provenance trace, allowing users to get back to the original data, the actions taken on the data, and the agents that performed those actions. Users should be able to see the software modules that acted on the data in producing the derived data, all the way to seeing who implemented the code that was used to act on the data.

Take Away

• Tool developers want and deserve credit for the tools that are used in the derivation of data.
• Users of derived data increasingly want to discover how the products were generated and what original data was used, to determine whether they can use the original or derived data.
• Credit is given to data creators, data curators, data providers, and data users.

Glossary:
OPeNDAP - Open-source Project for a Network Data Access Protocol
Provenance - information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness
RPI - Rensselaer Polytechnic Institute
TWC - Tetherless World Constellation at Rensselaer Polytechnic Institute

Sponsors: Tetherless World Constellation
Acknowledgements: OPeNDAP.org for their support and being open source, especially James Gallagher and Nathan Potter.
Poster: IN31C-3738
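The two exchanges shown under "Initial request for data" and "Pingback of derived data" can be sketched from the client's side. This is an illustrative sketch only, not the code used by any OPeNDAP client. It assumes standard Link header syntax (a semicolon between the target URI and `rel`, straight double quotes, one relation per `rel`), and the helper names `parse_link_headers` and `build_pingback_body` are hypothetical. No network call is made: the header values and URIs are copied from the transcripts.

```python
import re

PROV_HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance"
PROV_PINGBACK = "http://www.w3.org/ns/prov#pingback"

# Matches one link-value of the form: <target-uri>; rel="relation"
# Simplification: other link parameters and multi-valued rel are not handled.
_LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_headers(header_values):
    """Map each rel value to the list of target URIs that carry it."""
    links = {}
    for value in header_values:
        for target, rel in _LINK_RE.findall(value):
            links.setdefault(rel, []).append(target)
    return links


def build_pingback_body(provenance_uris):
    """Serialize URIs as text/uri-list: one URI per line, each ending in CRLF."""
    return "".join(uri + "\r\n" for uri in provenance_uris)


# Link header values from the server's response to the initial GET.
response_links = [
    '<http://opendap.tw.rpi.edu/disney/provenance_record>; '
    'rel="http://www.w3.org/ns/prov#has_provenance"',
    '<http://opendap.tw.rpi.edu/disney/pingback>; '
    'rel="http://www.w3.org/ns/prov#pingback"',
]

links = parse_link_headers(response_links)
provenance_uri = links[PROV_HAS_PROVENANCE][0]  # where to fetch the provenance record
pingback_uri = links[PROV_PINGBACK][0]          # where to POST the pingback

# Provenance URIs of the client's derived products, as in the pingback transcript.
body = build_pingback_body([
    "http://coyote.example.org/diagram_abc123/provenance",
    "http://coyote.example.org/journal_article_def456/provenance",
])
```

A client would then POST `body` to `pingback_uri` with `Content-Type: text/uri-list`. In the transcript the server acknowledges with `204 No Content`.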