What is Provenance?

Department of Computer Science
Software Engineering Research
Group, Berlin, Germany
PROVENANCE
Welcome to this Presentation
Abdul Saboor
Presentation Agenda
• What is Provenance?
• Why Provenance Is Important and the Two Major Strands of Provenance
• Provenance and Linked Data
• Provenance Data Model
• Provenance Vocabularies
• The Open Provenance Model
• Provenance Data Quality Assessment
• Summary - Scientific and Technical Challenges of Provenance
What is Provenance?
Provenance
• Recording the history of data and its place of origin
Provenance Dictionary Definitions
1. Merriam-Webster Online Dictionary: origin, source
2. Oxford English Dictionary: the place of origin or earliest
known history of something; origin, derivation.
Provenance Definitions
1. Provenance refers to the sources of information, such as
entities and processes, involved in producing or delivering
an artifact. (Yolanda)
2. Provenance is a description of how things came to be, and
how they came to be in the state they are in today.
Statements about the provenance can themselves be
considered to have provenance. (Jim M)
What is Provenance?
Provenance Working Definitions
3. Provenance of a resource is a record that describes
entities and processes involved in producing and
delivering or otherwise influencing that resource.
Provenance provides a critical foundation for assessing
authenticity, enabling trust, and allowing reproducibility.
Provenance assertions are a form of contextual metadata
and can themselves become important records with their
own provenance. (W3C)
Provenance Web Definition
4. On the web, provenance would include information about
the creation and publication of web resources as well as
information about access of those resources, and
activities related to their discussion, linking, and reuse.
What is Provenance?
Provenance Definitions
5. Provenance is documentation of the set of artifacts,
processes, and agents that have caused an artifact to be,
and of the contexts of these entities. Provenance
provides a critical foundation for assessing authenticity,
enabling trust, and allowing reproducibility. Assertions
of provenance can themselves become important records
with their own provenance. (Jim M)
What Kind of History?
• Data creator / data publisher
• Data creation date
• Data modifier and modification date
• Data description
• Etc.
Why is Provenance Important?
The need for provenance in data integration and reuse
• Data comes from diverse sources
• Varying quality
• Different scope
• Different assumptions
Two Major Strands of Provenance
Data and Workflow Provenance
Data Provenance
Information describing how data has moved through a network of databases is
referred to as “fine-grain” or “data” provenance. Fine-grain provenance can be
further categorized into where-, how- and why-provenance. A query execution
simply copies data elements from some source to some target database, and
where-provenance identifies the source elements from which the data in the
target was copied. Why-provenance provides justification for the data elements
appearing in the output, and how-provenance describes how some parts of the
input influenced certain parts of the output.
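The distinction between where- and why-provenance can be made concrete with a small sketch. The tables, row identifiers and the join query below are invented for illustration; they are assumptions, not taken from the slides or from any particular database system.

```python
# Hypothetical source tables: each row carries an identifier so that
# output values can be traced back to it.
employees = [
    {"id": "emp:1", "name": "Ada", "city": "Berlin"},
    {"id": "emp:2", "name": "Bob", "city": "Madrid"},
]
offices = [
    {"id": "off:1", "label": "HQ", "city": "Berlin"},
]

def names_in_office_cities(employees, offices):
    """Select names of employees living in a city with an office,
    recording provenance for every output value."""
    results = []
    for emp in employees:
        for office in offices:
            if emp["city"] == office["city"]:
                results.append({
                    "value": emp["name"],
                    # where-provenance: the source cell the value was copied from
                    "where": (emp["id"], "name"),
                    # why-provenance: the source rows that justify the output
                    "why": {emp["id"], office["id"]},
                })
    return results

result = names_in_office_cities(employees, offices)
for row in result:
    print(row["value"], row["where"], sorted(row["why"]))
```

The value "Ada" is copied from a single cell (its where-provenance), but its presence in the output is justified by two rows, the employee row and the matching office row (its why-provenance).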
Workflow Provenance
Information describing how derived data has been calculated from raw
observations is referred to as “coarse-grain” or “workflow” provenance.
The widespread use of workflow tools for processing scientific data
facilitates capturing provenance information. The workflow process
describes all the steps involved in producing a given data set and hence
captures its provenance information.
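A minimal sketch of how a workflow tool might capture coarse-grain provenance is shown here. The runner, the step names and the record fields are all hypothetical; real workflow systems record far more detail.

```python
from datetime import datetime, timezone

def run_workflow(steps, raw):
    """Run the steps in order and log one provenance record per step."""
    log, data = [], raw
    for name, func in steps:
        output = func(data)
        log.append({
            "step": name,
            "input": data,
            "output": output,
            "finished_at": datetime.now(timezone.utc).isoformat(),
        })
        data = output
    return data, log

# Hypothetical pipeline deriving a mean temperature from raw observations.
steps = [
    ("drop_missing", lambda xs: [x for x in xs if x is not None]),
    ("celsius_to_kelvin", lambda xs: [x + 273.15 for x in xs]),
    ("mean", lambda xs: sum(xs) / len(xs)),
]
result, provenance_log = run_workflow(steps, [20.0, None, 22.0])
for record in provenance_log:
    print(record["step"], "->", record["output"])
```

Because every step is logged with its input and output, the log describes all the steps that produced the final value, which is the sense in which the workflow captures the provenance of its result.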
Provenance Dimensions - 1
Content of Provenance Information
Attribution - provenance as the sources or entities that were used to create a
new result
    Responsibility - knowing who endorses a particular piece of information or result
    Origin - recorded vs reconstructed, verified vs non-verified, asserted vs inferred
Process - provenance as the process that yielded an artifact
    Reproducibility (e.g. workflows, mashups, text extraction)
    Data Access (e.g. access time, accessed server, party responsible for accessed
    server)
Evolution and versioning
    Republishing (e.g. re-tweeting, re-blogging, re-publishing)
    Updates (e.g. a document with content from various sources and that changes
    over time)
Justification for decisions - includes argumentation, hypotheses, why-not
questions
Entailment - given the results to a particular query, what tuples led to those
results
Provenance Dimensions - 2
Management of Provenance Information
Publication - Making provenance information available (expose, distribute)
Access - Finding and querying provenance information
Dissemination control - track policies specified by the creator for when/how an
artifact can be used
    Access control - incorporate access control policies to access
    provenance information
    Licensing - stating what rights the object creators and users have
    based on provenance
    Law enforcement (e.g. enforcing privacy policies on the use of
    personal information)
Scale - how to operate with large amounts of provenance information
Use of Provenance Information
Understanding - end-user consumption of provenance
    abstraction, multiple levels of description, summary
    presentation, visualization
Provenance Dimensions - 3
Interoperability - combining provenance produced by multiple different
systems
Comparison - finding what is common in the provenance of two or more
entities
Accountability - the ability to check the provenance of an object with respect
to some expectation
    Verification - of a set of requirements
    Compliance - with a set of policies
Trust - making trust judgments based on provenance
    Information quality - choosing among competing evidence from diverse sources (e.g.
    linked data use cases)
    Incorporating reputation and reliability ratings with attribution information
Imperfections - reasoning about provenance information that is not complete
or correct
    Incomplete provenance
    Uncertain/probabilistic provenance
    Erroneous provenance
    Fraudulent provenance
Debugging
Web of Data
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of Data, 11/09
The Linked Data Paradigm
• How can we exploit all the available data?
    • Data can be reused and remixed
    • Common, flexible and usable APIs
    • Standard vocabularies to describe interlinked datasets
    • Various tools
    • Understanding the Semantic Web vision
Provenance and Linked Data
• Provenance provides the ability to:
    • Trace the sources of various kinds of data
    • Enable the exploration of relationships between datasets, their
      authors and affiliations
• Provenance analysis provides insight into how data
  is produced and exploited
• Provenance creates a notion of information quality
    • Is a certain dataset consistent and up to date?
    • Is the connection between two datasets meaningful?
    • Is a given dataset relevant for a particular domain?
• Provenance helps establish information trustworthiness
• Provenance provides data views relating to some
  criteria
The Provenance Data Model
Institutional Level - Metadata associated with the origin in terms of its data
attributes (e.g. AuthorName, Title, URL, etc.)
Experimental Protocol Level - The origin of datasets (e.g. History area, region,
organisation or institution)
Data Analysis and Significance Level - The statistical analysis methodology applied
to datasets for selecting relevant attributes (e.g. whether datasets are divided into
parts, output values, versions, etc.)
Dataset Description Level - Who published the datasets, and the vocabulary of
interlinked datasets such as Dublin Core, voiD, PRV, etc.
The Provenance Related Vocabularies
• DC – Dublin Core
• FOAF – Friend of a Friend
• SIOC – Semantically-Interlinked Online Communities
• WOT – Web of Trust Schema
• OMV – Ontology Metadata Vocabulary
• SWP – Semantic Web Publishing
• VoiD – Vocabulary of Interlinked Datasets
• PRV – Provenance Vocabulary
• PML – Proof Markup Language
• PAV – SWAN provenance ontology
• OUZO – Provenance ontology
• CS – Changeset Vocabulary
• Etc.
Provenance Related Metadata
Provenance-related metadata is either directly attached
to a data item or its host document, or it is
available as additional data on the web.
For example, attached metadata includes RDF statements
about the RDF graph that contains those statements, the
author name and creation date of blog entries added
to a syndication feed, or information about an image.
Detached metadata can be represented in RDF
using vocabularies.
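As an illustration, detached provenance metadata for a graph can be written down as plain subject-predicate-object triples using Dublin Core terms. The sketch below avoids any RDF library; the `example.org` URIs, the helper names and the date are made up, while `creator` and `created` are existing Dublin Core terms.

```python
# Namespace of the Dublin Core terms vocabulary.
DCTERMS = "http://purl.org/dc/terms/"

def describe(resource, creator, created):
    """Return provenance statements about `resource` as (s, p, o) triples."""
    return [
        (resource, DCTERMS + "creator", creator),
        (resource, DCTERMS + "created", created),
    ]

def to_ntriples(triples):
    """Serialize triples in N-Triples style; objects starting with
    'http' are treated as URIs, everything else as plain literals."""
    lines = []
    for s, p, o in triples:
        obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

# Hypothetical graph URI, author URI and creation date.
graph_uri = "http://example.org/graphs/blog-entry-1"
triples = describe(graph_uri, "http://example.org/people/alice", "2010-11-01")
ntriples = to_ntriples(triples)
print(ntriples)
```

If these statements were stored inside the graph they describe, they would count as attached metadata; published separately, they are detached metadata.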
A Provenance Architecture for the Web of Data
[Figure: provenance architecture for the Web of Data. Recoverable labels: Application Layer; authoritative agencies are required to certify and keep data provenance secure.]
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of Data, 11/09
Main Action Points
[Figure: main action points. Recoverable labels: Provenance Vocabularies; Represent and reason with trust and information quality; Awareness of Data Providers; W3C Provenance Incubator Group; Extend emerging Linked Data vocabularies (VOiD); Tools for Data Providers; Generalization of Provenance Metadata; Provenance Authoritative Agencies; Linked Data Standards (VOiD); Provenance Visualization.]
Adapted from Cetinia, iSOCO Innovation Lab, J.M.G Perez, Provenance: eScience to the Web of Data, 11/09
The Open Provenance Model
• The Open Provenance Model (OPM) describes how data is
  produced or transformed into a new state. It can also
  represent the transition of one or more data items from
  an old to a new state.
• OPM is a graph model for provenance: it describes
  a graph whose edges denote the relationships
  between the occurrences represented by the nodes.
• The main purpose of OPM is to support the
  assessment of various data qualities such as
  reliability, accuracy and timeliness.
OPM Classifies Nodes into Three Kinds
Artifacts
Artifacts are pieces of data of fixed value and context that may
represent an entity in a given state. Edges can also carry annotations
providing information on how one occurrence caused another.
Process
Processes are performed on artifacts in order to produce other artifacts.
Agents
Agents are the entities that control a process, such as a
user.
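The three node kinds and the causal edges between them can be sketched as a small graph structure. The edge names below (`used`, `wasGeneratedBy`, `wasControlledBy`, `wasTriggeredBy`, `wasDerivedFrom`) come from the OPM specification rather than from these slides; the class, the node identifiers and the traversal helper are hypothetical.

```python
from dataclasses import dataclass, field

# Edge types and the node kinds they connect, written as (effect, cause).
EDGE_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}

@dataclass
class OpmGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node kind
    edges: list = field(default_factory=list)  # (effect, relation, cause)

    def add_node(self, node_id, kind):
        if kind not in ("artifact", "process", "agent"):
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, effect, relation, cause):
        kinds = (self.nodes[effect], self.nodes[cause])
        if kinds != EDGE_TYPES[relation]:
            raise TypeError(f"{relation} cannot connect {kinds[0]} to {kinds[1]}")
        self.edges.append((effect, relation, cause))

    def causes_of(self, node_id):
        """Return every node reachable by following edges from effect to cause."""
        seen, stack = set(), [node_id]
        while stack:
            current = stack.pop()
            for effect, _, cause in self.edges:
                if effect == current and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen

graph = OpmGraph()
graph.add_node("raw.csv", "artifact")
graph.add_node("clean.csv", "artifact")
graph.add_node("cleaning", "process")
graph.add_node("alice", "agent")
graph.add_edge("cleaning", "used", "raw.csv")
graph.add_edge("clean.csv", "wasGeneratedBy", "cleaning")
graph.add_edge("cleaning", "wasControlledBy", "alice")
graph.add_edge("clean.csv", "wasDerivedFrom", "raw.csv")
print(sorted(graph.causes_of("clean.csv")))
```

Edges point from effect to cause, so walking them from `clean.csv` collects everything that contributed to it: the process that generated it, the artifact that process used, and the agent that controlled the process.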
Model of Web Data Provenance
Provenance Graph - describes the provenance of data items:
Nodes - provenance elements (pieces of provenance information)
Edges - relate provenance elements to each other
Sub-graphs - related data items, if possible
Main Focus of Provenance of Web Data
• Provenance models define:
    • Types of provenance elements (roles)
    • Relationships between those elements
Adapted from Olaf Hartig, Humboldt University Berlin, Provenance Information in the Web of Data, 04/09
Provenance Data Quality Assessment
The Quality of Information
• The main objective is assessing the quality of datasets
• Quality of datasets from a multidimensional perspective
Categories          Criteria
Intrinsic           Objectivity, Believability, Accuracy
Contextual          Completeness, Relevance, Timeliness
Representational    Understandable, Concise, Precise
Accessibility       Availability, Security & Licensing, Constraints (Format & Procedures)
• The relevance of each criterion is determined by preferences and
  by the tasks performed on the available datasets
Provenance Data Quality
• Data Trustworthiness
    • Data Authenticity
    • Data Reliability
• Quality of Data Provenance has three dimensions:
    • Correctness
    • Completeness
    • Relevancy
• Dimensions of Believability
    • Trustworthiness of source
        • Data Lineage – The origin of data
        • Related artifacts and actors
    • Reasonableness of data
        • Possibility – The extent to which a
          data value is possible
        • Consistency – The extent to which a
          data value is consistent with other
          values of the same data
Provenance Data Quality
• Quality of Datasets
    • Timeliness
    • Consistency between datasets
        • Consistency over sources – The extent to which a data value is
          consistent with other values of the same data
        • Consistency over time – The extent to which the data value is
          consistent with past data values
    • Stable and meaningful data
• Temporality of Data
    • Transaction and valid times closeness – The extent to which a data value
      is credible based on the proximity of transaction time to valid times.
    • Transaction time overlap – The extent to which a data value is
      derived from data values with overlapping valid times.
Trust Evaluation
Some questions that must be considered when evaluating
trust in provenance data:
1. Who created the content (author or attribution)?
2. Was the content manipulated? If yes, by what
process or source?
3. Who is providing the content (repositories)?
Quality of Data Assessment
• Assign numeric values to the quality criteria of datasets,
  or use scoring/rating systems
• Proactive Approach
    • Precision vs Practicality
Manual Approach
• Questionnaire-based system
Semi-Automatic Approach
• Rating-based system
• Reputation-based system
Reasons for Assessment
Main Reasons
• Provenance of assessed data on the web
• Primary Objectives
    • Identify methods and approaches to automatically
      assess the quality of data on the web
    • Or identify methods to automatically assess individual
      quality criteria of web data
A Generalized Assessment Approach
Step 1 - Generate a provenance graph for the data item
Step 2 - Annotate the provenance graph with impact values
Step 3 - Execute the assessment function/program
(script)
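The three steps can be sketched end to end as follows. This is only one possible reading of the approach: the flat list standing in for a provenance graph, the rule table, the numeric impact values and the averaging function are all assumptions made for illustration.

```python
def build_graph(metadata):
    """Step 1: turn recorded metadata into provenance elements."""
    return [{"type": key, "value": value} for key, value in metadata.items()]

def annotate(graph, impact_rules):
    """Step 2: attach an impact value to every element that has a rule;
    elements without a rule get None (annotation missing)."""
    for element in graph:
        rule = impact_rules.get(element["type"])
        element["impact"] = rule(element["value"]) if rule else None
    return graph

def assess(graph):
    """Step 3: average the available impact values; None if there are none."""
    values = [e["impact"] for e in graph if e["impact"] is not None]
    return sum(values) / len(values) if values else None

# Hypothetical rules: a known creator is rated higher than an unknown one.
TRUSTED_CREATORS = {"alice"}
impact_rules = {
    "creator": lambda name: 0.9 if name in TRUSTED_CREATORS else 0.3,
    "source": lambda uri: 0.5,
}
metadata = {
    "creator": "alice",
    "source": "http://example.org/raw",
    "licence": "unknown",
}
graph = annotate(build_graph(metadata), impact_rules)
score = assess(graph)
print(round(score, 2))
```

The `licence` element has no rule and is simply skipped, which shows why the assessment function has to be designed together with the impact values it consumes.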
Generate a Provenance Graph
1. What types of provenance elements are necessarily
required?
2. What level of detail is necessarily required
(i.e. granularity)?
3. Where and how do we get provenance information?
• Two complementary options
    • Recording
    • Analyzing the metadata
Annotation with Impact Values
1. How might each provenance element influence
the quality of data?
• Each type of element has to be analyzed systematically
2. What kinds of impact values are necessary, and how
can the influence be represented through impact values?
• Impact values do not have to be numeric
• This also depends on the assessment functions
3. How do we determine the impact values?
Determine the Impact Values
1. From provenance information
2. From user input
• Rating-based systems, or reputation-based systems
• Configuration options
3. Through content analysis
• Comparison of data contents
• Adoption of information retrieval methods
• Adoption of data cleansing techniques
4. Through context analysis
• Further metadata
• Domain knowledge
Annotation with Impact Values
• How might each provenance element influence
  the quality of data?
[Table: provenance element types and their impact values. Recoverable labels: Provenance Element Type; Creation Date; Impact Values; Creation time; Creation Guidelines; Source data items; Data creator; Weights; Expiry time.]
Assessment Function(s)
1. What does the assessment function look like?
• Develop the function together with the impact values
• Take incompleteness into consideration
    • The provenance graph could be fragmentary
    • Annotations could be missing
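One way an assessment function can take incompleteness into account is to return "unknown" instead of guessing. The sketch below scores timeliness from a creation time and an expiry time; the linear decay formula and the function name are assumptions for illustration, not the method of any cited paper.

```python
from datetime import datetime

def timeliness(created, expires, now):
    """Score in [0, 1]: 1.0 at creation time, 0.0 at or after expiry.
    Returns None when a required annotation is missing."""
    if created is None or expires is None:
        return None  # fragmentary provenance: report "unknown"
    lifetime = (expires - created).total_seconds()
    if lifetime <= 0:
        return 0.0
    age = (now - created).total_seconds()
    return min(1.0, max(0.0, 1.0 - age / lifetime))

# Hypothetical annotations for one data item.
created = datetime(2024, 1, 1)
expires = datetime(2024, 1, 11)
now = datetime(2024, 1, 6)

print(timeliness(created, expires, now))  # halfway through its lifetime
print(timeliness(created, None, now))     # expiry annotation missing
```

Returning `None` keeps a missing annotation distinguishable from a genuinely poor score, so a caller can decide whether to ignore the criterion or fall back to a default.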
Scientific and Technical Challenges of Provenance - 1
(SUMMARY)
Provenance information needs to be:
• Represented
• Captured and recorded
• Stored and secured, queried and reasoned about
• Visualized and browsed
Scientific and Technical Challenges of Provenance - 2
• Vocabularies for representation of provenance contents
    • Need representation of processes (workflows), entity roles, data
      collections, meta-assertions, etc.
    • The Open Provenance Model (OPM)
• Granularity of provenance records
    • How much detail is useful and manageable/scalable in practice?
    • The size of provenance can be orders of magnitude larger than
      the base data.
• Provenance evaluation for information quality and trust
  management
Scientific and Technical Challenges of Provenance - 2a
• Evaluation and updates
    • Shelf life and timeliness of data
        • Determine when data becomes obsolete based on provenance
          information
    • Versioning of data sources
    • Relate updates of data based on provenance information
• Provenance-aware visualization, navigation and
  resource consumption
Scientific and Technical Challenges of Provenance and Trust - 3
• Policies based on provenance information
    • Association-based policies
        • Source is cited in Spiegel
        • Source is cited in Wikipedia
    • Bias-based policies
        • Source is an oil company
    • Distrust policies
        • Source is a blog
• Policies may be restricted to a context
    • Topic of search, topics of pages, tags of page
• Trust policies may be shared across users
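The policy kinds listed above can be sketched as small rules evaluated against a source's provenance. Everything in this example is hypothetical: the `Policy` class, the numeric adjustments, the base score and the field names of the source record are assumptions chosen only to show how context-restricted policies might combine.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Policy:
    name: str
    matches: Callable[[dict], bool]     # test on the source's provenance
    delta: float                        # hypothetical trust adjustment
    topics: Optional[frozenset] = None  # None means "applies in any context"

def trust_score(source, topic, policies, base=0.5):
    """Apply every policy that matches the source in the given context."""
    score, applied = base, []
    for policy in policies:
        in_context = policy.topics is None or topic in policy.topics
        if in_context and policy.matches(source):
            score += policy.delta
            applied.append(policy.name)
    return min(1.0, max(0.0, score)), applied

policies = [
    # Association-based: being cited in Wikipedia raises trust.
    Policy("cited-in-wikipedia", lambda s: "Wikipedia" in s.get("cited_in", ()), 0.2),
    # Bias-based: restricted to a context (here, one search topic).
    Policy("oil-company", lambda s: s.get("sector") == "oil", -0.3, frozenset({"climate"})),
    # Distrust: blogs are trusted less.
    Policy("is-blog", lambda s: s.get("kind") == "blog", -0.2),
]

source = {"kind": "company", "sector": "oil", "cited_in": {"Wikipedia"}}
climate_score, climate_applied = trust_score(source, "climate", policies)
sports_score, sports_applied = trust_score(source, "sports", policies)
print(round(climate_score, 2), climate_applied)
print(round(sports_score, 2), sports_applied)
```

Since the policy list is plain data, it could be shared across users; the bias-based policy only fires for the topic it is restricted to.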
Thank you for your attention!
Any Questions?
Freie Universität Berlin
Computer Science Department
Software Engineering Research Group
Takustr. 9, Berlin, Germany.
References
1. W3C Website, What is provenance? Modified November 2010,
http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
2. W3C Website, A Working Definition of Provenance, Modified November 2010,
http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance#A_Working_Definition_of_Provenance
3. O. Hartig. Provenance information in the Web of data. In Proceedings of LDOW 2009 (Madrid, Spain, April
2009).
4. O. Hartig and J. Zhao. Using web data provenance for quality assessment. Proceedings of the 1st Int.
Workshop on the Role of Semantic Web in Provenance
5. D. Brickley and L. Miller. FOAF Vocabulary Specification, November 2007. http://xmlns.com/foaf/spec
6. U. Bojars and J. G. Breslin. SIOC Core Ontology Specification, Revision 1.30, Jan. 2009.
http://rdfs.org/sioc/spec/
7. L. Moreau, J. Freire, J. Futrelle, R. E. McGrath, J. Myers, and P. Paulson. The open
provenance model: An overview. In IPAW, pages 323–326, 2008.
8. L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, vol. 45,
no. 4, p. 211-218, 2002.
9. You-Wei Cheah and Beth Plale. Provenance Analysis: Towards quality provenance. In Proceedings of the 8th IEEE
International Conference on eScience, Chicago, Illinois, Oct. 2012.
http://www.ci.uchicago.edu/escience2012/pdf/Provenance_Analysis-Towards_Quality_Provenance.pdf
10. Y. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD
Record, 34(3):31–36, 2005.
11. N. Prat and S. Madnick. Evaluating and aggregating data believability across quality sub-dimensions and
data lineage. In Proceedings of WITS 2007 (Montreal, Canada, December 2007), p. 169-174.
12. Y. Simmhan, B. Plale, and D. Gannon. A Survey of Data Provenance in e-Science. SIGMOD Record, Computer
Science Department, Indiana University. Vol. 34, no. 3, p. 31–36, ACM, Sept. 2005.
13. P. Buneman, S. Khanna, and W. C. Tan. Data Provenance: Some Basic Issues. In Proceedings of the 20th
Conference on Foundations of Software Technology and Theoretical Computer Science (FST TCS), p. 87-93,
Springer, Dec. 2000.
14. N. Prat and S. Madnick. Measuring data believability: A provenance approach. Proceedings of HICSS-41
(Big Island, HI, January 2008), IEEE, p. 1-10.
15. Jose Manuel Gomez-Perez. Invited Lectures on Programmable web and the web of data, November 2009,
URJC, Campus de Mostoles, Departmental II, Salon de grados, Madrid, Spain,
http://www.cetinia.urjc.es/es/node/331
16. http://www.w3.org/2005/Incubator/prov/wiki/images/0/02/Provenance-XG-Overview.pdf
17. http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Dimensions
18. http://www.w3.org/2005/Incubator/prov/wiki/W3C_Provenance_Incubator_Group_Wiki