
ANALYSIS OF DATA PROVENANCE
ACROSS VARIOUS APPLICATIONS
A Project
Presented to the faculty of the Department of Computer Science
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Computer Science
by
Praneet Mysore
SPRING
2013
ANALYSIS OF DATA PROVENANCE
ACROSS VARIOUS APPLICATIONS
A Project
by
Praneet Mysore
Approved by:
__________________________________, Committee Chair
Isaac Ghansah, Ph.D.
__________________________________, Second Reader
Robert A. Buckley
____________________________
Date
Student: Praneet Mysore
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
__________________________, Graduate Coordinator
Behnam Arad, Ph.D.
Department of Computer Science
___________________
Date
Abstract
of
ANALYSIS OF DATA PROVENANCE
ACROSS VARIOUS APPLICATIONS
by
Praneet Mysore
Data Provenance refers to the line of descent or the ancestry of information. It encompasses
the origin of the data, along with key events that occur over the course of its lifecycle, as
well as important details associated with the data's creation, processing and archiving.
Such information is instrumental in determining how secure and trustworthy the data is.
This is the primary reason why provenance is an important aspect of data security. In
applications like digital forensics, provenance helps maintain a proper chain of custody by
providing information about who collected the evidence, what procedure was followed,
when and why it was collected, and where it was stored.
This report discusses provenance models and some existing environments and
applications where provenance is used, such as a multi-level secure environment, sensor
networks, and electronic data transfer.
Other environments where provenance can be useful such as cloud and grid computing
are not part of this report.
_______________________, Committee Chair
Isaac Ghansah, Ph.D.
_______________________
Date
ACKNOWLEDGEMENTS
I express my heartfelt gratitude to my advisor, Dr. Isaac Ghansah for his helpful
comments and feedback throughout the course of this project. I genuinely appreciate the
valuable time and effort Dr. Ghansah dedicated to this project despite his hectic
schedule.
I sincerely thank Professor Robert A. Buckley, my second reader, for going through my
project report and making some highly beneficial suggestions on content. I also thank Dr.
Behnam Arad for his valuable comments on the format of this report.
Finally, words alone cannot express the love I have for my parents, Narasimha Rao and
Mridula Mysore and my brother, Praveen Mysore, for their endless support throughout
my Master's program. Their love and support have kept me immensely motivated to do my
best.
TABLE OF CONTENTS
Page
Acknowledgements ..................................................................................................... vi
List of Tables ............................................................................................................... ix
List of Figures ............................................................................................................... x
Chapter
1. INTRODUCTION…………………………………………………………...........1
1.1. Background…………………………………………………………….....1
1.1.1 Bunge’s Ontology .........................................................................1
1.1.2 W7 Model ....................................................................................3
1.2. Data Provenance .........................................................................................7
1.3. Report Organization………………………………………………………8
2. OVERVIEW OF EXISTING APPROACHES ........................................................9
2.1 Introduction………………………………………………………………..9
2.2 Existing Models to Represent Provenance Information ........................... 10
2.3 Active Conceptual Modeling with Provenance………………………..... 25
3. SOME APPLICATIONS OF PROVENANCE .....................................................30
3.1 Introduction………………………………………………………………30
3.2 Supporting Information Assurance in Multi-level Secure Environment... 33
3.2.1 Introduction ................................................................................. 33
3.2.2 Message-Structure Overview…………………………………... 34
3.2.3 Wrappers……………………………………………………….. 36
3.2.4 Messages……………………………………………………….. 37
3.2.5 Data Provenance……………………………………………….. 38
3.2.6 Addressing Information Assurance Attributes………………… 46
3.2.7 Example Analysis……………………………………………… 49
3.2.8 Monitoring and Analyzing Workflows………………………....52
3.3 Provenance Mechanism to tackle Packet Dropping in Sensor
Networks…………………………………………………………………..54
3.3.1 Introduction……………………………………………………... 54
3.3.2 Overview of the Scheme………………………………………... 58
3.3.3 Secure Provenance Transmission Mechanism………………….. 64
3.3.4 Packet-Dropping Adversary Identification Scheme……………...67
3.4 Provenance of Electronic Data Transfer……………………………..…… 71
3.4.1 Introduction……………………………………………………... 71
3.4.2 Lifecycle of Provenance in Computer Systems…………………. 72
3.4.3 Open Model for Process Documentation……………………...… 75
3.4.4 Querying the Provenance of Electronic Data……………………. 78
3.4.5 Example: Provenance in Healthcare Management………………. 79
4. CONCLUSION………………………………………………………………... 84
4.1 Summary………………………………………………………………... 84
4.2 Future Work…………………………………………………………….. 85
Glossary…………………………………………………………………………... 86
References………………….………………………………………………...........88
LIST OF TABLES
Tables                                                                               Page
1. Definition of the seven W’s in the W7 Model ........................................ 4
2. Application of the W7 Model in Wikipedia ........................................... 6
3. DP Records for Example Scenario .................................................... 49
LIST OF FIGURES
Figures                                                                              Page
1. Overview of the W7 Model ........................................................... 5
2. Provenance Model ................................................................... 13
3. Manipulation and Query Facilities .................................................. 19
4. Storage and Recording Model ........................................................ 21
5. A Sample Provenance-based Conceptual Model ......................................... 26
6. Envelope Structure with Data Provenance ............................................ 35
7. Adding Wrappers & De-Wrappers to minimize impact on workflow ....................... 37
8. Data Provenance Record ............................................................. 39
9. Multiple DP records with Forwarded Message ......................................... 42
10. Truncated DP Record ............................................................... 43
11. DP Record with Proxy Owner ........................................................ 44
12. DP Record with an Included Attachment ............................................. 45
13. Dashboard for Monitoring and Analyzing Workflows .................................. 53
14. Provenance Examples for a Sensor Network .......................................... 61
15. Provenance Encoding at Sensor Node and Decoding at Base Station ................... 64
16. Provenance Lifecycle .............................................................. 74
17. Categories of P-assertions Made by Computational Services ......................... 76
18. Provenance DAG of a Donation Decision ............................................. 81
CHAPTER 1
INTRODUCTION
1.1 Background
Data is the core entity that drives today's digital world, so it is imperative that we keep our
data secure. The security of data is of paramount importance because it largely determines
the amount of trust we can place in the data. While it is important to protect our data,
which may reside in various storage systems such as databases or travel across networks,
it is equally important to know exactly how much security needs to be provided before we
can safely and confidently assert that our data is secure and trustworthy.
Provenance is the aspect of data that tells us the story behind the data. In general terms, it
aims to give a comprehensive account of the history associated with the data. We discuss
provenance in detail in the following subsections. Data provenance was conceptualized
largely on the basis of an ontological view known as ‘Bunge’s Ontology’.
1.1.1 Bunge’s Ontology
Devised by the Argentine physicist Mario Bunge, this ontology forms the basis for
defining the fundamentals of data provenance.
The core element of Bunge’s ontology is a ‘thing’. The ‘state’ of a thing is nothing but
the set of property values of the thing at a given time.
Bunge’s ontology postulates the following statement:
“Everything changes, and every change is a change of the state of things, i.e., the change
of properties of things. A change of state is termed as an event.”
Therefore, it can be inferred that an event occurs when a thing acquires, loses or changes
the value of a given property. Data can also be considered as a ‘thing’. Hence, a set of
things is analogous to pieces of data [1].
Action, Agent, Time and Space are all constructs related to events. An event, on a thing,
occurs when another thing, often a human or software agent, acts upon it. An event
occurs at a particular point in time, in a particular space. Based on these constructs of
event and state comes the concept of ‘history’. The history of data is a sequence of events
that happen to the data over the course of its lifetime [1].
Bunge’s theory regarding history and events is a perfect match for defining data
provenance and its semantics since data provenance is often referred to as the pedigree or
history of data.
In this manner, the constructs in Bunge’s ontology including history, event, action, etc.
lay a theoretical foundation for defining provenance and its components.
Based on this ontology, we can devise a model known as the W7 Model, which is
instrumental in providing a sense of completeness to the semantics of data provenance.
1.1.2 The W7 Model
Using this model, provenance can be represented as a combination of seven W’s namely,
What, When, Where, How, Who, Which and Why. This model is widely perceived to be
generic and extensible enough to capture the essence of the semantics of provenance
across various domains.
Definition of Provenance in the W7 Model is as follows:
“Provenance of some data D is a set of n-tuples: p(D) = {< What, When, Where, How,
Who, Which, Why >}.”
The definitions of these seven W’s, which are analogous to the constructs in Bunge’s
Ontology, are tabulated as follows:
Table 1: Definition of the seven W’s in the W7 Model [1]

PROVENANCE ELEMENT    CONSTRUCT IN BUNGE’S ONTOLOGY    DEFINITION
What                  Event                            An event, such as a change of state, that happens to data during its lifetime
How                   Action                           An action that triggers the occurrence of an event
When                  Time                             Timestamp or duration of an event
Where                 Space                            Location(s) associated with an event
Who                   Agent                            Person(s) or organizations involved in the occurrence of the event
Which                 Tools                            Software programs or any hardware tools used in the event’s occurrence
Why                   Reason                           Reasons giving an accurate account of why an event has occurred
As we can infer from the table, ‘What’ denotes an event that affected data during its
lifetime. ‘When’ refers to the time at which the event occurred. ‘Where’ tells the location
of the event. ‘How’ depicts the action leading up to the event. ‘Who’ tells about the
agent(s) involved in the event. ‘Which’ are the software programs or tools used in the
event, and ‘Why’ represents the reasons for the occurrence of events. Hence, the name
W7 is given to this model.
A diagrammatic depiction of the W7 model is given below. In the figure, the boxes
represent the concepts and the bubbles represent the relationships between those
concepts. The seven W’s are inter-connected in such a way that they form a self-explanatory flow of operation of the model. [1]
Figure 1: Overview of the W7 Model [1]
From the above figure, it can be inferred that ‘What’ (i.e., Event) is the anchor of the W7
model. The other six W’s all revolve around ‘What’ and are determined based on the
occurrence of an event.
This implies that whenever a piece of data present in a database is modified in any
manner, the modification needs to be thoroughly investigated and information about the
seven W’s needs to be gathered and stored, so that the following is known in detail: what
type of modification was performed, who performed it, why and how it was performed,
from where the data was taken, where the modified data is stored, and at what time the
modification took place. [1]
By collecting all this information and storing it for future reference, a concrete idea can
be obtained regarding the trustworthiness of the data.
In order to describe the application of the W7 model in a common real-life scenario, we
consider applying it to the context of a Wikipedia page.
Table 2: Application of the W7 Model in Wikipedia [1]

PROVENANCE ELEMENT    APPLICATION TO WIKIPEDIA PAGE
What                  Creation, Modification, Destruction, Quality Assessment, Access-Rights Change
How                   Sentence insertion/updation/deletion, reference insertion/updation/deletion, Revert (to a previous version)
Who                   Administrators, registered editors, and anonymous editors
When                  Timestamp of events
Where                 IP Address of the editor
Which                 Software used to work on the page
Why                   User comments (feedback and suggestions)
Implementing data provenance in Wikipedia requires very little manual effort. The
MediaWiki software used by Wikipedia is set to automatically capture the what, who,
how, when, where, and which. Only the why provenance demands manual input [1].
Applying the W7 Model to Wikipedia enables us to capture and store provenance of
Wikipedia pages in a structured and comprehensive manner.
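To make the mapping in Table 2 concrete, the following short Python sketch models a single W7 provenance record for a hypothetical Wikipedia edit. The class and the example values are purely illustrative; they are not drawn from MediaWiki's actual data structures or from any real page history.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class W7Record:
        """One provenance tuple <What, When, Where, How, Who, Which, Why> for a data item."""
        what: str        # event that happened to the data
        when: datetime   # timestamp of the event
        where: str       # location associated with the event (here, the editor's IP address)
        how: str         # action that triggered the event
        who: str         # agent(s) involved in the event
        which: str       # software or hardware tool used
        why: str         # stated reason for the event

    # Hypothetical provenance entry for one edit of a Wikipedia page
    edit = W7Record(
        what="Modification",
        when=datetime(2013, 3, 14, 9, 26, tzinfo=timezone.utc),
        where="192.0.2.17",
        how="Sentence insertion",
        who="registered editor 'ExampleUser'",
        which="MediaWiki 1.20",
        why="Added a missing citation (user comment)",
    )

    print(edit)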
1.2 Data Provenance:
With Bunge’s Ontology as the fundamental basis, and using the semantics provided by
the W7 model, we can say that data provenance conveys the basic idea that the history of
data needs to be captured in the hope that it is comprehensive enough to be useful in the
future. It is the background knowledge that enables a piece of data to be interpreted
correctly and supports learning [1].
Although a vast amount of research has already been done on the concept of provenance,
according to several researchers, it is still unclear as to what the scope of provenance is
and how much information it could include. With the rapidly growing use of databases,
there is not only a need to make the data more secure, but also to ensure that it is
trustworthy.
This report discusses the concept of data provenance in detail by classifying the various
existing approaches of capturing and interpreting provenance information. The various
applications and the future scope of data provenance will also be analyzed.
1.3 Report Organization:
Chapter 1: Introduction
This chapter discusses the background of the concept of data provenance and briefly
touches on the basics about provenance.
Chapter 2: Overview of Existing Approaches
This chapter delves further into the depths of the concept of provenance by offering an
overview of the various existing models that are used to represent provenance. Active
Conceptual Modeling, an enhancement to the existing provenance models, is also
discussed.
Chapter 3: Various Applications of Provenance
This chapter throws light on some of the most prominent applications of data provenance.
A detailed discussion of the way provenance is implemented in these applications and
how it contributes towards enhancing the trustworthiness of data is provided.
Chapter 4: Conclusion
This chapter briefly summarizes the research on data provenance presented throughout
this report and the way it is implemented across various applications, and it provides
scope for future work by identifying some possible ways in which the implementation of
data provenance could be further enhanced.
Following Chapter 4 are the Glossary and References.
CHAPTER 2
OVERVIEW OF EXISTING APPROACHES
2.1 Introduction
In data warehousing, e-science and several other application areas, data provenance is a
very valuable asset. The provenance of a data item includes information about the source
data items along with the processes and events responsible for its creation, subsequently
leading to its current representation. However, the diversity of data representation models
and application domains has led to the inception of a number of formal definitions of
provenance. Most of those definitions are limited to a specific application domain, data
representation model or data processing facility. Unsurprisingly, the associated
implementations are also restricted to a certain application domain and depend on a
special data model for their representation.
In this chapter, a selection of such data provenance models and prototypes is examined,
and a general categorization scheme for these provenance models is provided. This
categorization scheme is then used to study the properties of the existing approaches. It
helps distinguish between different kinds of provenance information and can thereby
lead to a greater understanding of provenance in general.
2.2 Existing Models To Represent Provenance Information
Scientists in various fields often use data from ‘curated’ databases in their experimental
analysis. Most of the data stored in a curated database is the result of a set of manual
transformations and derivations. Hence, those who use data from a curated database are
often interested in information about the data sources and transformations that were
applied to the data from these sources. This information is used either to assess the
quality and thereby the authenticity of the data, or to re-examine some data-derivation
processes to see if re-running a certain set of tests or experiments is required [3].
Data warehouses are used to integrate the data collected from various sources, each
having different representations. Thereafter, the integrated data is analyzed thoroughly to
check for anomalies. This analysis could also benefit from any obtainable information
about the original data sources and the set of transformations that were applied in order to
generate and store the integrated data in the data warehouse.
These are just some of the several reasons why data provenance is so important in most
applications. Besides provenance information related to the data, it is also important to
include the storage requirements and transformation requirements for all kinds of
provenance information across these diverse application domains. Although a broad
variety of applications would benefit from provenance information, the type of
provenance data, the manipulation facilities and querying facilities needed, differ from
one application to another. [3]
For that reason, the differences and similarities between the provenance needs of various
applications and data models are identified, and a general scheme for the categorization
of provenance is presented in what follows.
Provenance can be viewed in two different ways. In one view, the provenance of a data
item can be described as the processes that lead to its creation whereas in the other view,
the objective is to focus on the source data from which a given data item is derived. The
term “Source Provenance” represents the latter, whereas the term “Transformation
Provenance” represents the former. In other words, we can say that transformation refers
to the creation process itself and the terms source and result refer to the input and output
of a transformation respectively.
The existing research directions can be classified into two distinct categories based on
their approach to provenance recording. One research direction focuses on computing
provenance information when data is created, while the other computes provenance data
when it is requested. These approaches can be termed as ‘eager’ and ‘lazy’ respectively.
Most of the eager approaches are based on source data items and transformations, while
most of the lazy approaches rely on inversion or input tracing of transformations. [3]
There is a close relation between data provenance and temporal data management. Much
like in temporal data management, in provenance also, the previous versions of a data
item are queried and accessed.
So provenance management systems may benefit from existing storage methods and
query optimizations for temporal databases. Hence, the identification methods used in
temporal data management may be applicable to provenance management as well [3].
The sections that follow deal with data provenance from a conceptual point of view and
define a general categorization scheme for provenance management systems. Several
functionalities of a provenance management system are defined, and these functionalities
are ordered in a hierarchy of categories.
The three main categories of this categorization scheme are the provenance model, the
query and manipulation functionality, and the storage model and recording strategy [3].
An overview figure for each main category (Figures 2, 3 and 4) is presented. Boxes in
these figures represent categories and ellipses represent functionalities.
Figure 2: Provenance Model [3]
The provenance model embodies the expressiveness of the provenance management
system in defining the provenance of a data item. As specified earlier, the provenance of
a data item can be divided into Transformation Provenance and Source Provenance.
Source provenance is information about the data that was involved in the creation of a
data item. Source provenance can be defined in terms of three distinct concepts: the
original source, the contributing source and the input source.
The input source includes all data items that were used in the creation of a particular data
item. The positive contributing source includes all the other data items that are essential
for the creation of a particular data item. The original source contains all data items that
include data that is copied to the resulting data item [3].
For example, assume we manage the provenance of data in a relational database with two
relations R1 and R2 and handle data items at the tuple level, and consider executing the
SQL query SELECT R1.name FROM R1, R2 WHERE R1.id = R2.id against this
database. The input source of a resulting tuple T includes all the tuples in R1 and R2. The
positive contributing source of T contains all tuples T′ from relation R1 and T″ from
relation R2 with T.name = T′.name and T′.id = T″.id. Lastly, the original source of T
includes all tuples T′ from relation R1 with T.name = T′.name. [3]
Note that the following subset relationship holds:
input source ⊇ positive contributing source ⊇ original source
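As an illustration of these three notions of source provenance, the sketch below computes them for the tuple-level example above, with the two relations represented as Python lists of dictionaries. The relations, values, and helper function are assumptions made purely for the example.

    # Tuple-level source provenance for the query
    #   SELECT R1.name FROM R1, R2 WHERE R1.id = R2.id
    # over two toy relations. All names and values are illustrative.

    R1 = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
    R2 = [{"id": 1, "cost": 10}, {"id": 3, "cost": 30}]

    results = [{"name": t1["name"]} for t1 in R1 for t2 in R2 if t1["id"] == t2["id"]]

    def source_provenance(result_tuple):
        # Input source: every tuple that was available to the query.
        input_source = R1 + R2
        # Positive contributing source: the joining tuples whose presence was
        # necessary for this particular result tuple.
        positive = [t1 for t1 in R1 if t1["name"] == result_tuple["name"]] + \
                   [t2 for t2 in R2
                    if any(t1["id"] == t2["id"] and t1["name"] == result_tuple["name"]
                           for t1 in R1)]
        # Original source: the tuples from which the result's values were copied.
        original = [t1 for t1 in R1 if t1["name"] == result_tuple["name"]]
        return input_source, positive, original

    inp, pos, orig = source_provenance(results[0])
    print(len(inp), len(pos), len(orig))   # the sets shrink: input, then positive, then original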
Some applications would benefit from information about data items that do not actually
exist in the source, but would affect the creation of a resulting data item, if they were
included. The term ‘negative contributing source’ is used for this concept.
Unlike in the case of positive contributing source, the task of accurately defining this
contrasting concept is not as straightforward.
It appears to be reasonable either to include all the data items that would prohibit the
creation of the result or to include all the possible combinations of data items that would
prohibit the creation of the result. In most data repositories, the amount of data stored in
the repository is only a very small fraction of the data that could be stored in the
repository. Owing to this reason, it is not realistically possible to store the negative
contributing source of a data item in the repository [3].
Not only does a provenance management system take into consideration which kinds of
sources should be part of the source provenance, but it can also record information about
each source data item. A source can be represented in one of four ways: as the original
data, as metadata attached to the source, as a source hierarchy structure, or as a
combination of any or all of these representations.
A provenance management system may record source data items not just at one level of
detail but at multiple levels of detail. For example, the source of a tuple data item in a
relational view would, by default, include all tuples from a relation R. However, if the
provenance model is capable of handling multiple levels of detail, the source can be
represented as the relation R instead.
Managing provenance information at different levels of detail is a relatively more
sensible approach as it provides more flexibility and can in turn result in smaller storage
overhead.
One possible downside is that this requires a more complex provenance model.
Transformation provenance is the information about the transformations that were
involved in the creation of a data item. To make a clear separation between a concrete
execution of a process and the process itself, the term ‘transformation’ is used for the
former and ‘transformation class’ is used for the latter. A transformation is not limited to
being an automatic process; it may also be a manual or semi-automatic process with user
interaction wherever necessary. The transformation provenance of a data item
could include metadata like author of the transformation, the user who executed the
transformation and the total execution time.
Examples of transformations include SQL statements that are used to create views,
workflow descriptions of a workflow management system, and executable (.exe) files
with command-line parameters [3].
Another vital part of the provenance model is the world model, which could be either
closed or open in nature. In closed world models, the provenance management system
controls transformations and data items, whereas, in open world models, the provenance
management has limited or no control over the executed transformations and data items.
In other words, the execution, manipulation, creation and deletion of data items can be
done without any notification. From the perspective of the provenance management
system, the world thus exhibits uncertain behavior.
This uncertain behavior makes provenance recording rather complex and, at times, makes
it impossible to record accurate provenance information.
The closed world and open world models are widely considered the two extreme ends of
a spectrum. In fact, several other possible world models can depict representations that
can be considered neither closed nor open.
A provenance management system should also be able to recognize if data items from
two different data repositories represent the same real world object. For example, it is
possible to store the same data item in several databases [3].
As real world objects tend to change over time, it is imperative to have mechanisms that
make it possible to check whether two data items are different versions of the same real
world object. This is even more important when updates to the repositories are not
controlled by the provenance management system. This is because in this case, the
information about source data items recorded by the system might be incorrect, as these
data items might have been changed or deleted by an update.
There are various methods to identify duplicates. One method would be to check if the
data item and the duplicate represent exactly the same piece of information. This is called
value-based duplicate identification.
If data items have a key property associated with them, then the most feasible alternative
would be to identify duplicates by using their key property.
Let us consider the example where the data items in consideration are tuples in a
relational database. Two tuples are defined as duplicates if they have the same attribute
values or if they hold the same key attribute values. In this case, using the primary key
constraints of a relational database for identification could become a problem when no
further restrictions are introduced, as the primary key uniqueness is usually confined to
one relation and primary keys can be changed by updates.
Many data models usually have either an implicit or an explicit hierarchical structure.
This hierarchy in combination with a certain key property equivalence or value
equivalence, could prove to be instrumental in identifying a given data item. For instance,
if the provenance of a certain tag in a given XML-document is recorded, duplicates can
be defined by the name of the tag and the position of the tag in the hierarchy of the
document [3].
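A minimal sketch of the two identification methods just described, assuming that data items from the two repositories are represented as Python dictionaries; the records and the choice of key attribute are invented for illustration.

    # Value-based versus key-based duplicate identification for data items
    # coming from two different repositories (illustrative records only).

    def value_based_duplicates(a: dict, b: dict) -> bool:
        """Duplicates if every attribute value is identical."""
        return a == b

    def key_based_duplicates(a: dict, b: dict, key: str) -> bool:
        """Duplicates if they agree on the designated key attribute."""
        return a.get(key) == b.get(key)

    local  = {"patient_id": 42, "name": "J. Doe", "updated": "2013-01-05"}
    remote = {"patient_id": 42, "name": "J. Doe", "updated": "2013-02-11"}

    print(value_based_duplicates(local, remote))              # False: one attribute differs
    print(key_based_duplicates(local, remote, "patient_id"))  # True: likely two versions of one object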
Figure 3: Manipulation and Query Facilities [3]
A provenance management system must be able to provide facilities to manipulate and
query provenance information and data items, in order to be applicable in a real world
scenario. It would be incomplete if manipulation and querying of data items were
discussed without integration of provenance information. A provenance management
system must also be able to provide mechanisms for merging a selection of individual
transformations into one complex transformation and vice-versa. If this is facilitated, then
provenance data can be used to recreate result data items, which cannot be accessed or
are expensive to access, by executing the transformations that were used to create the
result data item. In addition to this, if a provenance management system possesses the
capability to compute the inverse of a transformation, then the inversion can be used to
recreate source data items from result data items.
Split and merge operations can be applied to individual data items as well. It is
understood that, the split operation involves dividing a higher-level data item into its
lower-level parts whereas the merge operation combines lower-level data items into a
higher-level data item. While the understanding of the split operation is reasonably clear,
the merge operation possibly raises some questions that need answering in order to make
optimal use of it. Some of those questions are: What is the result of the merge operation
on a subset of the lower-level data items that form a higher-level data item? How can this
result be distinguished from the result of a merge operation on the set as a whole? For a
provenance management system that records provenance information for different data
models, it is best to provide facilities for converting the representation of data items from
a given data model to another, in order to make optimal use of its capabilities.
Regarding the storage strategy employed by a provenance management system,
provenance information either can be attached to the physical representation of a data
item or can be stored in a separate data repository. It is interesting to note that, a
provenance management system can support more than one storage strategy and can also
offer mechanisms for changing the storage strategy for data items. Overall, it can be said
that the feasibility of implementing the manipulation operations taken into consideration
largely depends on the properties of the chosen provenance model and world model. [3]
Figure 4: Storage and Recording Model [3]
The various techniques a provenance management system uses for the purposes of storing
provenance information, recording provenance information and propagating provenance
information recorded for source data items are all included in the Storage and Recording
Model. Storage strategy explains the relationship between the provenance data and the
target data that is to be used for the purpose of provenance recording [3].
No coupling, tight coupling and loose coupling are the three main types of storage
strategies that could be used in this model. Any of these three can be adopted, depending
on the underlying requirements.
The no-coupling strategy involves storing of only provenance information in one or many
repositories. The tight-coupling strategy involves the storage of provenance directly
associated with the data for which provenance is recorded. The loose-coupling strategy
uses a mixture of these two strategies, which means that both provenance and data are
stored in one single, but logically separated, storage system.
Most of the annotation-based approaches use either tight-coupling or loose-coupling
strategy. This can be made to work either by attaching provenance annotations to data
items or by storing annotations in the same data repository, but segregated from the
corresponding data items. On the other hand, approaches that are service-based in nature,
involve recording provenance for several data repositories in a distributed environment.
These types of approaches usually deal with a highly heterogeneous environment with
limited control over the execution of processes and manipulation of data items [3]. This
makes the recording of provenance information quite a herculean task.
By using a no-coupling storage strategy in a closed world data model, a certain degree of
control could be gained over provenance information. In theory, all possible data models
could be used to store provenance information, but not every combination of storage
model and storage strategy is reasonable enough, in all kinds of situations. [3]
In a scenario where provenance is recorded for a transformation that uses source data
items with attached provenance information, it is unclear as to how provenance
information is propagated to the result data items. The three possible answers to this
question are no-propagation, restricted propagation and complete propagation. With no-propagation, the provenance of source data items of a transformation is ignored when
creating provenance data for result data items. Contrary to this, in complete propagation,
the result data items of a transformation inherit all provenance data from source data
items, according to the kind of source used. With restricted propagation, a result data
item inherits a part of provenance from the source data items, i.e., provenance that was
generated during the last t transformations [3].
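The sketch below illustrates how the three propagation strategies might differ, under the assumption that each provenance entry carries the number of the transformation step that produced it; the data structures and function are hypothetical.

    # How source provenance might propagate to a result data item under the
    # three strategies named above. Each entry is tagged with the step that
    # produced it; all structures are illustrative.

    def propagate(source_provenance, strategy, current_step, window=1):
        if strategy == "none":
            return []                          # result starts with empty provenance
        if strategy == "complete":
            return list(source_provenance)     # result inherits everything
        if strategy == "restricted":
            # keep only provenance generated during the last `window` transformations
            return [p for p in source_provenance if p["step"] >= current_step - window]
        raise ValueError(strategy)

    history = [{"step": 1, "op": "load"}, {"step": 2, "op": "clean"}, {"step": 3, "op": "join"}]
    print(propagate(history, "restricted", current_step=3, window=1))   # keeps steps 2 and 3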
The provenance recording strategy determines the stage at which provenance data is
recorded. User-controlled recording, eager recording, no recording and system-controlled recording are the various types of recording strategies taken into consideration.
In user-controlled recording, the user decides at what point and for which data item the
provenance information is supposed to be recorded. Eager recording involves recording
provenance simultaneously with every transformation as it executes. With the no-recording
strategy, provenance is generated only at query time, whereas in system-controlled
recording, the creation of provenance data is regulated by the provenance management
system. Such a system could use strategies such as recording the provenance data once a
day or recording the provenance after every t transformations. [4]
Developing a provenance management system for open world models is a rather
intriguing problem [3]. A formal model designed with the help of this categorization
scheme, can possibly form the basis for a provenance management system that handles
not only various storage models, but also different types of source and transformation
provenance.
Some of the problems faced when dealing with provenance are related to the integration
of data. For instance, the concept of semantic identity needed to recognize duplicates or
versions of data items in an open world model has been researched thoroughly in various
publications. A provenance management system handling different kinds of data items
stored in distributed repositories needs to integrate this data to gain a unified view of it.
This makes it evident that data integration systems might benefit greatly from including
provenance management in their operation. Provenance data could also be used to help a
user make an accurate
assessment of the overall quality of the integrated data.
Thus, a categorization scheme for different types of provenance has been presented. This
scheme helps us gain a systematic overview of the capabilities and limitations of these
models.
This investigation can be extended further to cover the various problems encountered
during implementation of the mentioned techniques and to analyze the complexity of
various combinations of functionality. With the help of such an investigation, a formal
language for the management of provenance data can also be defined. This language
should include generation, querying and manipulation of provenance data, as and when
required. Unlike existing approaches, this language should cover not only different data
models, but also manage different types of provenance information. It should also
include certain language constructs for converting between different data models and
kinds of provenance data. [3]
2.3 Active Conceptual Modeling with Provenance
One of the major problems in current data modeling practices is that database design
approaches have generally viewed data models as representing only a snapshot of the
world and hence tend to overlook seemingly minor differences in information, as well as
the causes and other details of those differences, during the task of data
modeling [8]. The solution to this problem lies in ‘Active Conceptual Modeling’. It
describes all aspects of the world including its activities and changes under various
perspectives, thereby providing a multi-level and multi-perspective view of reality.
Active conceptual modeling primarily deals with capturing provenance knowledge in
terms of what change the data might undergo, during the stage of conceptual modeling.
Moreover, it takes its cue from the W7 Model. Therefore, it is imperative to identify
provenance components such as “where”, “when”, “how”, “who”, and “why” behind the
“what” to provide sufficient insight into the changes.
The W7 model is a generic model of data provenance and is intended to be easily
adaptable to represent either domain-specific or application-specific provenance
requirements in conceptual modeling. Nowadays, provenance knowledge is indispensable
in various applications. It is essentially critical in the domain of homeland security where,
given some background intelligence information, provenance regarding the information
such as how and when it was collected and by whom, is required to evaluate the quality
of the information in order to avoid false intelligence [2]. Consider the homeland security
application described in the conceptual schema given in the figure below:
Figure 5: A Sample Provenance-based Conceptual Model [2]
Nowadays, organizations and/or ordinary citizens are often called upon to report
suspicious activities that might possibly indicate terrorist threats. As a successful
example, the Pan American Flight School reported that Zacarias Moussaoui seemed
overly inquisitive about the operation of the plane’s doors and control panel, which led
to Moussaoui’s arrest prior to 9/11 [2].
However, there is always a genuine possibility that intelligence information such as
threat reports may be false, out-of-date, and from unreliable sources, which calls for a
provenance-based schema in order to eradicate such occurrences. Hence, various
provenance events are recorded, such as the creation and transformation of the threat
reports at the conceptual level. By doing this, the conceptual schema is made to be
‘active’.
It is a relatively straightforward task to capture the “when”, “who”, and “how” aspects
associated with data creation based on the semantics specified in the W7 model. The
“timestamp” attribute captures when the creation event occurs (see Figure 5). “Who”, in this
case, describes individuals or organizations involved in the event including those who
report the threat as well as agents who record the report. The “how” aspect of the event is
recorded by instantiating the “method” attribute in the W7 model into two different
instances namely, “reporting method” and “recording method”. When data is transformed
or updated, information regarding who made a certain change at a particular time is
captured.
The “input” attribute provides information regarding how a report is updated. It normally
records the previous version of the report and may even include more when the report is
updated by combining information from other sources. The reason why the information is
updated, is also captured by specifying the “goal” attribute.
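A hypothetical, Python-flavored rendering of such provenance for a threat report is sketched below. The field names follow the attributes discussed above (timestamp, who, reporting/recording method, input, goal), while the identifiers and values are invented for illustration.

    # Illustrative provenance entries for the creation and later update of a
    # threat report; none of the names or identifiers refer to real data.

    creation_event = {
        "event": "creation",
        "timestamp": "2001-08-15T10:30:00Z",
        "who": {"reported_by": "flight school staff", "recorded_by": "field agent A"},
        "how": {"reporting_method": "phone call", "recording_method": "field report form"},
    }

    update_event = {
        "event": "transformation",
        "timestamp": "2001-08-20T14:05:00Z",
        "who": {"updated_by": "analyst B"},
        "input": ["report-0017-v1", "memo-2001-213"],   # previous version plus a merged source
        "goal": "corroborate flight-training details before escalation",
    }

    threat_report = {"id": "report-0017", "provenance": [creation_event, update_event]}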
For the reasons mentioned, the task of recording data provenance is imperative in the
domain of homeland security as it facilitates the following:
- Information quality: To enforce national security, the right people must collect the right
information from the right sources to identify the right security threats all in a foolproof
manner. Capturing information such as, who reported the threat via what reporting
method, assists in evaluating data reliability. Provenance regarding how the report was
recorded or who participated in updating it, also helps ensure that the information is
trustworthy, before we take any action.
- Information currency: Some types of intelligence information may have a very short
shelf-life. As an example, after Saddam Hussein fled Baghdad, information about him
being spotted at a specific location changed six to eight times a day. Capturing
provenance such as: when the report of his being spotted was created and updated could
be used to avoid being misled by old or out-of-date information.
- Pattern recognition: Provenance also helps uncover certain unusual behavioral patterns,
which would in turn be extremely helpful for predicting and preventing potential terrorist
threats. As an example, a sudden increase in the number of threat reports from people in
the same region within a short span of time might give a slight indication of a terrorist
plot. In addition, the “who” part of our provenance information could help us in
identifying key reliable sources and forestall unreliable sources from feeding false
intelligence. [2]
CHAPTER 3
SOME APPLICATIONS OF PROVENANCE
3.1 Introduction
As mentioned earlier, there is a very close relationship between provenance, security and,
consequently, the trustworthiness of data. Data that is proven insecure is dubious at best,
as it cannot be trusted. Provenance helps determine the level of trust that can be placed in
the given data in any situation. It thereby indirectly provides security to the data in the
sense that whenever some aspect of the data changes in any manner, those changes are
logged and safely stored for future reference. This provides a method to easily identify
the modified data and revert the changes, if necessary.
There are several real-world applications where provenance is of utmost importance. For
instance, in a cloud storage environment, there is uncontrolled movement of data from
one place to another. Due to this, determining the origin of the data and keeping track of
the modifications it undergoes from time to time, becomes highly essential. Ensuring
regulatory compliance of data within a cloud environment is also necessary. [25]
Both the National Geographic Society's Genographic Project and the DNA Shoah project
track the processing of DNA samples. The participants of these projects, who submit
their DNA samples for testing, want strong assurances that unauthorized parties, such as
insurance companies or anti-Semitic organizations, will not be able to gain access to the
provenance information of the samples.
The US Sarbanes-Oxley Act states that the officers of companies that issue incorrect
financial statements are subject to imprisonment. Due to this act, officers have become
proactive in tracking the path of their financial reports during their development,
including the origins of input data and the corresponding authors. The US Health
Insurance Portability and Accountability Act also mandates the logging of access and
change histories for medical records. [24]
However, without appropriate guarantees, as data crosses both application and
organizational boundaries and passes through untrusted environments such as a cloud
environment, its associated provenance information becomes vulnerable to alteration and
cannot be completely trusted. Therefore, the task of securing provenance information,
thereby making sure that its integrity and trustworthiness are preserved, is of high
importance.
Making provenance records trustworthy is challenging. It is imperative to guarantee
completeness, so it is assured that all relevant actions performed on a document are
thoroughly captured. There are a few cross-platform, low overhead architectures that
have been proposed for this purpose. These architectures contain a provenance tracking
system that tracks all the changes made in the provenance information pertaining to
certain data from time to time.
They use cryptographic hashes and semantically secure encryption schemes (schemes that
are IND-CPA secure, i.e., indistinguishable under chosen-plaintext attack), meaning, in
cryptographic terms, that knowledge of the ciphertext and the length of a message reveals
no additional information about the plaintext of that message [24]. Such protection, when
applied to provenance information, makes it secure and trustworthy.
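As a rough sketch of the idea (not the specific architecture of [24]), the Python fragment below hash-chains a provenance entry and encrypts it with Fernet, an authenticated symmetric scheme from the third-party cryptography package that provides IND-CPA security; all field values are illustrative.

    import hashlib, json
    from cryptography.fernet import Fernet   # pip install cryptography

    # One provenance entry for a document action (illustrative fields only).
    entry = {"actor": "analyst-7", "action": "edit", "doc": "report.docx",
             "time": "2013-04-02T11:20:00Z"}
    serialized = json.dumps(entry, sort_keys=True).encode()

    # A hash chain ties each entry to its predecessor, so tampering with any
    # past entry invalidates every later hash.
    previous_hash = b"\x00" * 32
    entry_hash = hashlib.sha256(previous_hash + serialized).hexdigest()

    # Semantically secure encryption keeps the entry confidential while it
    # crosses untrusted environments.
    key = Fernet.generate_key()
    ciphertext = Fernet(key).encrypt(serialized)

    print(entry_hash)
    print(len(ciphertext), "byte ciphertext")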
In this chapter, we discuss some applications in which provenance is used, such as a
multi-level secure environment, sensor networks, and electronic data transfer. In a
multi-level secure environment, where data is passed among users with different levels of
security clearance, users at one level may not be able to access data belonging to users
with a higher clearance level; provenance information about that data helps those
restricted users verify that the data they received is authentic. In sensor networks, there
is usually a continuous stream of data transfer taking place. In such a setting, where data
transmission is fast and continuous, there is a possibility of data packets being dropped at
random, leaving the door open to the malicious packet-dropping attack. In order to detect
the occurrence of such an attack, provenance information would prove to be very helpful.
On a similar note, provenance can be helpful in electronic data transfer as well.
Electronic data does not always contain all the necessary historical information that
would help end-users validate the accuracy of that data. Hence, there is a dire need to
capture the additional fragments of provenance information, which account for the seven
W’s concerning that data.
3.2 Supporting Information Assurance in a Multi-level Secure Environment
3.2.1 Introduction
Multilevel security can be defined as the application of a computer system to process
information at different levels of security, thereby permitting simultaneous access by
users with different security clearances and preventing any unauthorized users from
gaining access to sensitive information.
In multi-level secure systems, it is not always possible to pass data source, transformation,
and processing information across various levels of security constraints. A framework
that is designed to make this process easier is hereby discussed. This framework captures
provenance information in a multi-level secure environment, while ensuring that the data
satisfies the basic information assurance attributes like availability, authenticity,
confidentiality, integrity and non-repudiation. The amount of trust that can be bestowed
upon any system should essentially be based upon a foundation of repeatable
measurements [10].
Hence, this framework ensures that data provenance supports these information
assurance attributes by combining the subjective trust and objective trust in data into a
"Figure of Merit" value that can cross security boundaries. The architecture associated
with this framework facilitates adding information to an existing message system
to provide data provenance.
The information can be added with the use of wrappers, to ensure that there is minimal
impact on the existing workflow. The intention is to describe the original message
system and the DP section as two separate pieces. This simplifies the addition and
removal of any provenance information. Separating the two components also provides
flexibility in implementation.
Some existing real world implementations may provide the desired fields by either
changing the message format used for a SOA system, or by augmenting an existing SOA-based workflow. This system is designed to work with both peer-to-peer as well as
message/workflow services. In a Service Oriented Architecture (SOA), client
applications talk directly to the SOA servers and processes communicate using protocols
like SOAP or REST. This system has a routing service that supports both explicit
destinations and role-based destinations. Moreover, the framework is language-independent.
3.2.2 Message-Structure Overview
This framework assumes that messaging is based on XML-serialization, supported by
transport protocols such as SOAP and REST and security standards such as WSS.
Figure 6: Envelope Structure with Data Provenance [10]
The above figure shows a structure of a message envelope where the DP records are
encapsulated inside the Information Assurance Verification Section.
According to the convention, a dotted line at the bottom of a block is used to show the
relationship between a signature and what it verifies.
An alternative approach is to encapsulate DP information outside of the Information
Assurance Verification Section and rely on the XML stack to do the verification using
out-of-band XML headers. This not only raises portability issues, but also requires access
to "raw" and unmodified XML headers if we want to verify authenticity and
confidentiality ourselves, which in turn poses the threat of a replay attack.
Keeping DP records encapsulated inside the Information Assurance Verification Section
means that an extension of the message body class can be created which can handle the
data manipulation directly outside of the XML stack, thereby providing greater
flexibility. For simplification purposes, a static, message-based system of information
collection is used in the architecture, as opposed to a dynamic, actor-based data collection
system.
3.2.3 Wrappers
In order to minimize the impact on the existing workflow, the use of 'wrappers' and
'de-wrappers' is advisable. Wrappers add the appropriate DP information, and de-wrappers
strip the DP information before the message reaches its destination.
Instead of adding wrappers and de-wrappers, if the underlying workflow is altered, then
the processes may examine the provenance of the data by themselves and use that
information in their processing. The use of wrappers and de-wrappers is depicted in the
following figure.
Figure 7: Adding Wrappers & De-Wrappers to minimize impact on workflow [10]
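To complement the figure, the minimal sketch below shows the wrapper idea with messages represented as Python dictionaries; the envelope layout and field names are invented for illustration and do not reproduce the XML structure of Figure 6.

    # A wrapper adds a DP section around an existing workflow message; the
    # de-wrapper strips it off before delivery, so the workflow itself never
    # has to change. Message layout and field names are illustrative.

    def wrap(message: dict, dp_record: dict) -> dict:
        return {"body": message, "data_provenance": [dp_record]}

    def dewrap(envelope: dict):
        return envelope["body"], envelope.get("data_provenance", [])

    msg = {"message_id": "M1", "text": "quarterly figures attached"}
    envelope = wrap(msg, {"sender": "Alice", "timestamp": "2013-04-02T09:00:00Z"})
    original, provenance = dewrap(envelope)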
3.2.4 Messages
The message body contains the data that is being transferred. It can be in the form of text,
images, etc. Every message has a unique Message-ID value.
Attachments can also be considered as messages. Therefore, they also have a unique
Message-ID associated with them. All messages have two parts namely, the invariant
part, and the variant part.
The value of the invariant part will never alter. Precisely speaking, any system that
retrieves the invariant part and calculates a one-way hash of that part will always get the
same value under all conditions.
Therefore, any XML encoding system must always produce the serialization of the data in
the exact same format, irrespective of the implementation method. The variant part of the
message may change. For example, the routing information may change as the message is
in the process of being forwarded from one place to another.
3.2.5 Data Provenance
The DP system should allow flexible implementations so that multiple SOA systems can
exchange data over the SOA enterprise bus. Some systems may act like routers in
forwarding messages to the proper workflow recipient. For instance, the sending system
may send the information to a role, such as an ‘Analyst’. A workflow system may then
decide on the next available analyst, and forward the message to that individual. The DP
system should also support the use of gateways, and protocols that encapsulate data.
Most importantly, it should support Multi-Level Secure systems and should be able to work
with encrypted data. DP records can be sent along with the messages and workflow.
Alternatively, they can be sent to a service, for a retrieval-based implementation.
Consider a single DP record of a message with no attachments. This corresponds to a
sender transmitting a message to a receiver. There are two different perspectives of any
single message transmission - outgoing and incoming. The outgoing perspective is the
intended transport DP characteristics from the perspective of the sender. The incoming is
the observed transport properties from the receiver's perspective. The appropriate party
signs each perspective.
The receiving perspective includes the intended outgoing perspective as well. Therefore,
the receiving party signs the DP record from the sending party. This is shown in the
figure below:
Figure 8: Data Provenance Record [10]
The sender’s DP section includes the following pieces of information:
• Message-ID - a unique identifier that allows the retrieval of the message to verify the
DP record
• Outgoing Security Attributes
• Timestamp - (Optional) useful for availability and non-repudiation analysis
• Owner of Signature - supports signature verification
• Hash of Invariant part of message and Hash Algorithm used - allows integrity testing
• Security Label - to simplify data classification of a multi-part message
It is important to note that the user and/or the application signs the DP record, whereas the
XML stack produces the signature in the XML envelope. Therefore, the certificates and the signing
algorithm might differ. The signature for the DP record might be done by a multi-purpose
private key, or a dedicated key may be used and it can be associated with an individual or
an embedded crypto system in automated systems. [10]
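The sketch below shows how a sender-side DP record containing the fields listed above might be assembled. It is only an illustration: the field values are invented, and an HMAC stands in for the public-key signature that the framework described in [10] would use.

    import hashlib, hmac, json

    def sender_dp_record(message_id, invariant_part: bytes, security_label,
                         signing_key: bytes, timestamp=None):
        """Assemble a sender-side DP record; an HMAC stands in for a real signature."""
        record = {
            "message_id": message_id,
            "outgoing_security_attributes": {"transport": "TLS"},
            "timestamp": timestamp,                       # optional field
            "owner_of_signature": "alice@example.org",    # illustrative identity
            "hash_algorithm": "SHA-256",
            "invariant_hash": hashlib.sha256(invariant_part).hexdigest(),
            "security_label": security_label,
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["signature"] = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
        return record

    rec = sender_dp_record("M1", b"<body>quarterly figures</body>", "UNCLASSIFIED",
                           signing_key=b"demo-key", timestamp="2013-04-02T09:00:00Z")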
When receiving the message, the receiver adds DP information for analysis. The data that
is signed by the receiver includes the sender's signed information. The data and
functioning of the receiver's DP record is similar to the Sender's DP record for the most
part. However, it does not include the Message ID because that is included in the Sender's
DP record and need not be repeated.
Having the receiver sign the Sender’s DP record provides non-repudiation in case the
sender denies sending the message. The hash value and algorithm are included in case the
sender and receiver use different hash algorithms (or if the sender does not provide the
information). After a message is transmitted (that is, it goes from the sender to receiver),
the DP record is completed. If a message remains unchanged throughout the
transmission, it may not be necessary to retain the routing information when the message
is being forwarded.
When forwarding an unchanged message, each hop provides a DP record. The following
scenario is an example:
Alice sends a message titled ‘M1’ to Bob, who forwards the same to Carol, who in turn
forwards it to Dave. In this case, there will be three DP records sent along with message
‘M1’ to Dave. The following figure depicts the above scenario and gives us a clear idea of
what is going on.
Figure 9: Multiple DP records with Forwarded Message [10]
As mentioned earlier, since the message is unchanged, it is not necessary to retain routing
information. Therefore, Dave can accept the message from Carol, but create a DP record
with Alice in the Sender's section and delete the other records. An example of a truncated
DP record is shown in Figure 10. If the invariant part of the message does not include the
intended receiver (i.e., the To: field), it will be impossible to detect misrouted messages.
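A minimal sketch of that truncation, with DP records reduced to sender/receiver pairs in Python dictionaries; the helper functions and record layout are invented for illustration.

    # DP-record accumulation and truncation for an unchanged, forwarded message
    # (Alice -> Bob -> Carol -> Dave). Records are illustrative dictionaries.

    def forward(dp_chain, sender, receiver):
        """Each hop appends one DP record for the transmission it performed."""
        return dp_chain + [{"sender": sender, "receiver": receiver}]

    def truncate(dp_chain):
        """For an unchanged message, keep a single record from the original
        sender to the final receiver and drop the intermediate hops."""
        return [{"sender": dp_chain[0]["sender"], "receiver": dp_chain[-1]["receiver"]}]

    chain = []
    chain = forward(chain, "Alice", "Bob")
    chain = forward(chain, "Bob", "Carol")
    chain = forward(chain, "Carol", "Dave")
    print(len(chain))        # three DP records travel with message M1
    print(truncate(chain))   # [{'sender': 'Alice', 'receiver': 'Dave'}]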
Figure 10: Truncated DP Record [10]
In a multi-level secure environment, there will be cases where a message is transmitted
from a higher security level to a lower security level. In such cases, it may be necessary
to remove all trace of the original source for security reasons. This can be done by
creating a proxy owner.
Note that this message could be modified, and the receiver of the message does not know
its original source. In addition to this, DP records can be created for plaintext or for
encapsulated and encrypted messages. This allows a system to provide non-repudiation that it saw an encrypted message, without revealing the actual message contents. The following figure depicts the creation of a proxy owner.
Figure 11: DP Record with Proxy Owner [10]
In the case of messages that are changed during transmission, routing information must
be retained. If a device or person receives a message and forwards it as an attachment to another message, a new message will be created with a new Message-ID.
This message must identify the previous message by its Message-ID so that any DP
record associated with the included attachment / message can be found. The following
figure shows an example of an attachment being included in a new message.
Figure 12: DP Record with an Included Attachment [10]
As the above figure depicts, the workflow is: Alice sends message ‘M1’ to Bob; Bob creates a new message ‘M2’, attaches ‘M1’ to it, and sends it to Carol.
This architecture also supports DP analysis of complex messages. For example, Alice
sends a message M1 to Bob, who adds a note to M1, and sends it, with the original
message M1 as an attachment, to Carol who forwards it to Dave.
In this scenario, Dave's system must be capable of performing a DP analysis of the entire
workflow. DP analysis is a complex task. Therefore, it will require system-level support
to address issues like:
• Expired or Compromised Certificates
• Unavailability of Messages (Message not archived or inaccessible due to security
classification)
• Modification of DP Records.
Additionally, if the implementation has a message storage system, storage-based retrieval
can be done, where messages can be retrieved whenever needed, instead of being sent in
the workflow. In such cases, DP records associated with each message can be sent along
with the message that is retrieved. The message storage system simply becomes another
entity in the workflow, and creates DP records for each message received and
transmitted. [10]
3.2.6 Addressing Information Assurance Attributes
This architecture provides objective measurements. That is, if two individuals evaluate
two different pieces of information that have identical means of transmission and
workflow, they will end up with identical values. It is possible to assign confidence in
underlying technology and algorithms used. For example, one hash algorithm may be
superior to another. Therefore, if two people have the same confidence in the algorithmic
strength of a hash algorithm, they will get similar degrees of trust.
Let us briefly consider how DP information can be used towards each of the Information
Assurance attributes. These attributes can be used to address various types of attacks.
Since the sender signs each DP record, authenticity and integrity can be determined as
follows:
• Examine the DP record created by the sender of the message, and get the hash algorithm
and the hash value of the message associated with the DP record.
• Calculate the hash of the message using that algorithm. If the hash values do not agree, the integrity of the message is suspect.
• Assure the authenticity of the DP record by verifying the signature. In the event that the DP record from the sender uses an inferior hash algorithm (e.g., MD5 or SHA-1 vs. SHA-256), additional DP records in the chain of transfer can be examined, providing additional attributes used in the calculation of authenticity using subjective values.
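A minimal sketch of these integrity and authenticity checks, assuming a record laid out as in the earlier sketch and a hypothetical verify_signature helper:

    import hashlib

    def check_integrity_and_authenticity(record, message_bytes, verify_signature):
        # Recompute the hash with the algorithm named in the sender's DP record.
        algo = record["hash_algorithm"]
        recomputed = hashlib.new(algo, message_bytes).hexdigest()
        integrity_ok = (recomputed == record["invariant_hash"])

        # Verify the signature over the DP record to establish authenticity.
        authenticity_ok = verify_signature(record)

        # A weak hash algorithm does not fail the check outright, but flags the
        # record so further DP records in the chain can be examined.
        weak_hash = algo.lower() in ("md5", "sha1", "sha-1")
        return integrity_ok, authenticity_ok, weak_hash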
While confidentiality is difficult to prove, one can verify that a message was intended to
be confidential by ensuring that it was encrypted before being sent or attached. If at any
point the message was sent unencrypted, then there may have been an unintentional
exposure of confidential information. It may also indicate an implementation error if the
sending and receiving characteristics from the perspectives of the sender and receiver do
not agree.
The DP records directly provide non-repudiation as they are all signed. However,
someone can send a message, and then claim that their key was compromised. It is
possible to examine the sequence of events to determine if the retraction occurred while
the key was considered secure or not. The timestamp is useful in dealing with
compromised and expired certificates.
Availability is a system-level property, but this architecture can be used to detect some
attacks on availability by comparing actual transmission time with historical transmission
time. This works with both reliable transmission models like TCP and unreliable
transmission models like UDP.
Another way to detect attacks is to rely on report numbers to detect 'skipped' reports.
Additional information facilitates more detailed availability analysis.
For example, if a receiver includes the number of retries needed to receive a message in
the DP information then DP analysis can use this information, and past knowledge, to
identify availability issues.
The DP analysis can also detect some replay attacks. If a message is intercepted and
retransmitted later then the timestamp may be useful in detecting it. If the receiver or DP
analysis does keep track of messages, and can detect messages that have arrived earlier,
then it becomes easier to detect a true replay attack.
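As an illustration of these timing- and history-based checks, a minimal sketch (the record fields, the historical baseline and the tolerance are hypothetical):

    def flag_availability_and_replay(dp_records, historical_mean, tolerance, seen_ids):
        """dp_records: dicts with 'message_id', 'sent_at', 'received_at' fields."""
        alerts = []
        for rec in dp_records:
            transit = rec["received_at"] - rec["sent_at"]
            # Unusually long transit time compared with historical behaviour may
            # indicate an availability problem.
            if transit > historical_mean + tolerance:
                alerts.append(("availability", rec["message_id"]))
            # A message identifier seen before suggests a replayed message.
            if rec["message_id"] in seen_ids:
                alerts.append(("replay", rec["message_id"]))
            seen_ids.add(rec["message_id"])
        return alerts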
3.2.7 Example Analysis
Data Provenance can be performed on complex messages. Consider the following
scenario:
• Alice creates message M1 (in this case, an image) and sends it to Bob.
• Bob creates message M2 with his analysis of M1, attaches M1, and sends it to Carol.
• Carol uses the information in M2 and M1 to create another message M3 (in this case, a
report), and forwards it to Dave.
The transmission includes messages M3, and the referred messages M2 and M1. Dave
receives the message, and fills in his section of the DP records. The following table
summarizes Dave’s DP records.
Table 3: DP Records for Example Scenario
DP Record of M3 from Carol to Dave
DP Record of M2 from Carol to Dave
DP Record of M2 from Bob to Carol
DP Record of M1 from Carol to Dave
DP Record of M1 from Bob to Carol
DP Record of M1 from Alice to Bob
If the DP analysis is concerned about the authenticity of the message, the hash algorithm used by the message creator (Alice) can be applied to each of the messages, and the result compared to the value in the DP record. If there is a discrepancy, the system can
examine each DP record associated with that message. If there was a discrepancy in
message M1, there is enough information to determine when the discrepancy occurred.
For example, if the hash value when Carol received M1 from Bob differed from the value
when Carol sent M1 to Dave, then message M1 changed while Carol was examining it. If
the hash value when Carol sent M1 to Dave is different from the value when Dave
received it, then that means M1 was modified in transit between Carol and Dave. There
may be additional records from systems that forward messages that can provide
additional non-repudiation.
The prospect of including subjective information also raises a few concerns. Subjective
information constitutes personal opinions. That is, Alice may trust Carol more than Dave
based on experience, rumors, and hunches. One approach for supporting the inclusion of
subjective information in this architecture is to provide the ability for users to enter their subjective values for various entities. The architecture can then propagate this
information and make it available for inspection and analysis.
A person can analyze the data and form their opinion. Some of the relevant questions are:
How confident are we that the document creator actually has the key, and that nobody else has it? If a certificate authority signs the certificate, how confident are we in the
certificate authority? How confident are we in the process used for granting a signature?
As mentioned earlier, in a Multi-Level Secure environment, data source and processing
information cannot always be passed across the different levels of security boundaries.
To overcome this restriction, the best approach is to take the objective and subjective
information that is captured as part of DP information and combine it to generate ‘Figure
of Merit’ values that accurately depict the amount of trust that can be placed in the data. While
the notion of trust has been widely studied, its definition and usage varies across various
domains and application areas. As the data crosses a security domain, if the DP
information cannot be passed on, then this DP information is replaced by Figure of Merit
values.
The Figure of Merit values are calculated from both objective and subjective
components. The objective values (as calculated at time t) are independent of the analyst
and can be incorporated automatically in the determination of Figure of Merit. Subjective
values (as calculated at time t) depend on the analyst doing the analysis and may depend
on the context of the workflow. Note that both objective and subjective values specified
by an analyst may evolve over time. [10]
The values associated with the Information Assurance attributes such as authenticity,
confidentiality, integrity, non-repudiation and availability are objective values, while the
confidence in Certification Authorities that provide certificates involved in the workflow
can have objective as well as subjective components. The confidence in the components
of the workflow is subjective.
A number of different approaches can be taken for generating the Figure of Merit value
from the DP information; which approach is selected depends on the type of DP analysis being done as well as on personal preference. The Figure of Merit provides summarization, so, together with the notion of wrappers and the ability to send DP records on a separate channel, it helps make systems scalable.
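Since the exact formula is user-definable, one simple possibility is a weighted combination of the objective attribute values scaled by the analyst's subjective confidences, for example (purely illustrative, not the formula of [10]):

    def figure_of_merit(objective, subjective, weights):
        """objective, subjective: dicts of attribute -> value in [0, 1].
        weights: dict of attribute -> weight; normalized below."""
        combined = {}
        for attr in objective:
            subj = subjective.get(attr, 1.0)          # default: full confidence
            combined[attr] = objective[attr] * subj   # scale objective by subjective trust
        total_w = sum(weights.get(a, 1.0) for a in combined)
        return sum(combined[a] * weights.get(a, 1.0) for a in combined) / total_w

    # Example: high integrity/authenticity, weaker confidence in the CA chain.
    fom = figure_of_merit(
        {"integrity": 0.95, "authenticity": 0.9, "availability": 0.8},
        {"authenticity": 0.7},
        {"integrity": 2.0, "authenticity": 2.0, "availability": 1.0})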
3.2.8 Monitoring and Analyzing Workflows
Let us consider some issues around monitoring and analyzing Information Assurance
attributes of a workflow in real-time.
Both the intrinsic Information Assurance attributes of a workflow as well as its actual
execution in the deployed system are to be considered. The workflow, as designed,
provides a baseline for entitlement, i.e., it provides the maximum value that can be
achieved in the deployed system.
For example, if the workflow is structured to send data in clear-text, then confidentiality
is not ensured, and thus, the deployed system will not guarantee confidentiality. On the
other hand, if the workflow as designed, is structured to encrypt the data, then the
baseline entitlement guarantees confidentiality.
This DP architecture allows values for the Information Assurance attributes to be calculated and displayed for each message and its transmission as the workflow is executed. A dashboard displays the values for the Information Assurance attributes.
It can also be used to do What-If Analysis by incorporating subjective trust values.
Moreover, it can be used to do after-the-fact analysis to understand vulnerabilities and
determine the impact of compromised assets. The following figure illustrates a dashboard
that allows an analyst to monitor as well as analyze the execution of a representative
workflow.
Figure 13: Dashboard for Monitoring and Analyzing Workflows [10]
To summarize, an architectural framework has been provided for incorporating Data
Provenance to support Information Assurance attributes in a multi-level secure
environment. This architecture allows us to capture objective trust attributes that can be
combined with subjective trust data to calculate a "Figure of Merit" value. The exact
formulae used in computing these values are user-definable and may depend on the
scenario and type of analysis being done. This Figure of Merit value can be used to hide
the detailed Data Provenance information so that it may cross security boundaries in a
multi-level secure environment.
3.3 Provenance Mechanism to tackle Packet Dropping in Sensor Networks
3.3.1 Introduction
Provenance provides the assurance of data trustworthiness, which is highly desired to
guarantee accurate decisions in mission critical applications, such as Supervisory Control
and Data Acquisition (SCADA) systems. The 2009 report on National Cyber Security Research and Development Challenges, published by the Institute for Information Infrastructure Protection (I3P), which strongly recommends research initiatives on the efficient implementation of provenance in real-time systems, also emphasizes the importance of provenance for streaming data. However, existing research on provenance
has mainly focused on the tasks of modeling, collection, and querying, leaving the
aspects of trustworthiness and security issues relatively unexplored.
Here, a framework is investigated that is proposed to transmit provenance information
along with sensor data, hiding it over inter-packet delays (the delays between the
transmissions of sensor data items). The embedding of provenance information within a
host medium makes the technique reminiscent of digital watermarking, wherein
information is embedded into a digital signal, which may be used to verify its authenticity
or the identity of its owners. The reason for adopting a watermarking-based scheme rather than one based on traditional security solutions, such as cryptography and digital signatures, is discussed later.
Moreover, a suitable justification is provided for the design choices of using inter-packet
delays (IPD) as the watermark carrier, employing a direct-sequence spread spectrum
based technique to support multi-user communication over the same medium.
The proliferation of the Internet, embedded systems and sensor networks has greatly contributed to the wide development of streaming applications. Examples include real-time location-based services, sensor networks monitoring environmental characteristics, controlling automated systems, power grids, etc. The data that drives these systems is produced by a variety of sources, ranging from individual sensors to very different systems altogether, and is processed by multiple intermediate agents. This diversity of data sources heightens the importance of data provenance in ensuring the secure and predictable operation of data-streaming applications like sensor networks.
Malicious Packet-Dropping attack is a major security threat to the data traffic in sensor
networks, since it reduces the overall network throughput and may hinder the propagation
of sensitive data. In this attack, a malicious node drops packets at random during
transmission, to prevent their further propagation. It selectively drops packets and
forwards the remaining data traffic. Due to this behavior, this attack is also called a ‘Selective Forwarding Attack’. [12]
Dealing with this attack is challenging, to say the least, since there can be a variety of
other reasons that cause data or packet loss in such systems. To name a few, the unreliable nature of wireless communication and the inherent resource constraints of sensor networks may also cause communication failure and data loss. Moreover, transient network congestion can result in packet loss. Power scarcity can make a node unavailable, and there may be a communication failure due to physical damage as well. These possibilities can raise false alarms and mislead us into an incorrect decision regarding the presence of such a malicious attack. The mass
deployment of tiny sensors, often in unattended and hostile environments makes them
susceptible to such attacks. This attack can result in a significant loss of sensitive data
and can degrade legitimate network throughput.
One approach to defend against a malicious packet-dropping attack is Multipath Routing. However, multipath routing suffers from several drawbacks, such as high communication overhead as the number of paths increases, inability to identify the malicious node, etc. Traditional transport layer protocols also fail to guarantee that packets are not
maliciously dropped in sensor networks. They are not designed to deal with these kinds
of malicious attacks.
A data-provenance based mechanism is hereby proposed, to detect the presence of such
an attack and identify the source of the attack i.e. the malicious node.
As mentioned earlier, this scheme utilizes inter-packet delay characteristics for
embedding provenance information. This scheme consists of three phases:
1) Packet Loss Detection
2) Identification of Attack Presence
3) Localizing the Malicious Node/Link
The packet-loss is detected based on the distribution of inter-packet delays. The presence
of an attack is determined by comparing the empirical average packet-loss rate with the
natural packet-loss rate of the data flow path. To isolate the malicious link, more
provenance information is transmitted along with the sensor data. [12]
We have two goals to accomplish –
1) To transfer provenance along with the sensor-data in a bandwidth efficient
manner, while ensuring that the quality of the data remains intact.
2) To detect the packet dropping attack and thereby identify the malicious node
using the provenance transmission technique.
To accomplish these goals, a unique strategy is proposed, where provenance is securely
embedded as a list of unordered nodes over the same medium. This is a key design choice
in this proposed scheme for secure provenance transmission.
3.3.2. Overview of the Scheme:
Consider a typical deployment of wireless sensor networks, consisting of a large number
of nodes. Sensor nodes are stationary after deployment, though routing paths may change
due to node failure, resource optimization, etc. The routing infrastructure is assumed to
have a certain lower bound on the time before the routing paths change in the network.
The network is modeled as a graph G (N, E) where N is the set of nodes in the network
and E is the set of edges between the nodes.
There exists a powerful base station (BS) that acts as sink/root and connects the network
to the outside infrastructure such as Internet. All nodes form a tree rooted at the BS and
report the tree topology to BS once after the deployment or any change in topology.
Since the network does not change frequently, such communication will not incur significant overhead. Sensory data from the children are aggregated at the cluster head (a.k.a. Aggregator) and routed to the applications through the routing tree rooted at the BS.
It is assumed that the BS cannot be compromised and it has a secure mechanism to
broadcast authentic messages into the network. Each sensor has a unique identifier and
shares a secret key Ki with the BS. Each node is also assigned a Pseudo Noise (PN)
sequence of fixed length Lp which acts as the provenance information for that node.
The sensor network allows multiple distinguishable data flows where source nodes
generate data periodically. A node may also receive data from other nodes to forward
towards the BS. While transmitting, a node may send the sensed data, may pass an
aggregated data item computed from multiple sensors’ readings or act as an intermediate
routing node. Each data packet in the transmission stream contains an attribute value and
provenance for that particular attribute. The data packet is also timestamped by the source
with the generation time. As will be seen later, packet timestamp is very crucial for
provenance embedding and decoding process and hence Message Authentication Code
(MAC) is used to maintain its integrity and authenticity. The MAC is computed on data
value and timestamp to ensure the same properties for data.
The provenance of a data item, in this context, includes information about its origin and
how it is transmitted to the BS.
The notion of provenance is formally defined as follows:
The provenance 𝑝𝑑 for a data item d is a rooted tree satisfying the properties:
(1) 𝑝𝑑 is a subgraph of the sensor network G (N, E)
(2) The root node of 𝑝𝑑 is the BS, expressed as 𝑛𝑏
(3) For 𝑛i , 𝑛𝑗 in N, included in 𝑝𝑑, 𝑛𝑖 is a child of 𝑛𝑗 if and only if 𝑛𝑖 participated in the
distributed calculation of d and/or passed data information to 𝑛𝑗 [12]
Figure 14 shows two different kinds of provenance. In 14(a), a data item d is generated at leaf node 𝑛1 and the internal nodes simply pass it to the BS.
Such internal nodes are called ‘Simple Nodes’ and this kind of provenance is ‘Simple
Provenance’. Simple provenance can be represented as a simple path. In 14(b), the
internal node 𝑛1 generates the data d by aggregating data 𝑑1... 𝑑4 from 𝑛𝑙1... 𝑛𝑙4 and
passes d towards BS. Here, 𝑛1 is an ‘Aggregator’ and the provenance is called
‘Aggregate Provenance’, which is represented as a tree.
Figure 14: Provenance Examples for a Sensor Network [13]
One key design choice in this proposed scheme is to utilize the Spread Spectrum
Watermarking technique, to pass on the provenance of multiple sensor nodes over the
medium. Spread Spectrum Watermarking is a transmission technique where a
narrowband data signal is spread over a much larger bandwidth so that the signal energy
in any single frequency is undetectable.
In the context of this scheme, the set of IPDs is considered the Communication Channel and provenance information is the signal transmitted through it.
Provenance information is spread over many IPDs such that the amount of information
present in one container is small. Consequently, any unauthorized party needs to add very
high amplitude noise to all of the containers to destroy provenance. Thus, the use of the
spread spectrum technique for watermarking provides strong security against different
attacks.
In specific terms, the ‘Direct Sequence’ Spread Spectrum (DSSS) technique is used.
“ DSSS phase-modulates a sine wave pseudo-randomly with a continuous string of
pseudo-noise (PN) code symbols called ‘chips’, each of which has a much shorter
duration than an information bit. That is, each information bit is modulated by a
sequence of much faster chips. Therefore, the chip rate is much higher than the
information signal bit rate. ” [15]
DSSS uses a signal structure in which, the sequence of PN symbols produced by the
transmitter is already known by the receiver. The receiver can use the same PN sequence
in order to reconstruct the information signal.
This ‘noise’ signal (the spreading sequence) in a DSSS transmission is a pseudorandom sequence of 1 and −1
values, at a frequency much higher than that of the original signal. The resulting signal
resembles white noise. However, this noise-like signal can be used to exactly reconstruct
the original data at the receiving end, by multiplying it by the same pseudorandom
sequence (because 1 × 1 = 1, and −1 × −1 = 1). This process is called "de-spreading".
It involves a mathematical correlation of the transmitted PN sequence with the PN
sequence that the receiver believes the transmitter is using.
The components of DSSS system are:
• The original data signal d(t), as a series of +1 and -1.
• A PN sequence px(t), encoded like the data signal. Nc is the number of bits per symbol
and is called PN length.
Spreading: The transmitter multiplies the data with the PN code to produce the spread signal
s(t) = d(t) px(t)
Despreading: The received signal r(t) is a combination of the transmitted signal and noise in the communication channel. Thus r(t) = s(t) + n(t), where n(t) is white Gaussian noise.
To retrieve the original signal, the correlation between r(t) and the PN sequence pr(t) at
the receiver is computed using the following formula:
R(𝜏) = (1 / Nc) Σ (from t = T to T + Nc) r(t) pr(t + 𝜏)
Now, if px(t) = pr(t) and 𝜏 = 0 i.e. px(t) is synchronized with pr(t), then the original
signal can be retrieved. Otherwise, the data signal cannot be retrieved. Therefore, a receiver that does not have the transmitter's PN sequence cannot reproduce the originally transmitted data. This fact is the basis for allowing multiple transmitters to share a
channel.
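A minimal numeric sketch of DSSS spreading and despreading (illustrative only; carrier modulation and chip timing are omitted):

    import random

    def spread(bits, pn):
        # Each data bit (+1/-1) is multiplied by the entire PN chip sequence.
        return [b * c for b in bits for c in pn]

    def despread(chips, pn):
        # Correlate each block of Nc chips with the PN sequence; the sign of the
        # correlation recovers the original bit.
        nc, recovered = len(pn), []
        for i in range(0, len(chips), nc):
            corr = sum(r * c for r, c in zip(chips[i:i + nc], pn)) / nc
            recovered.append(1 if corr > 0 else -1)
        return recovered

    pn = [random.choice((1, -1)) for _ in range(31)]      # shared PN sequence (Nc = 31)
    data = [1, -1, -1, 1]
    received = [c + random.gauss(0, 0.5) for c in spread(data, pn)]
    print(despread(received, pn))                         # recovers [1, -1, -1, 1] despite the noise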
In case of multiuser communication in DSSS, spreaded signals produced by multiple
users are summed up and transmitted over the channel. Multiuser communication
introduces noise to the signal of interest and interferes with the desired signal in
proportion to the number of users.
3.3.3 Secure Provenance Transmission Mechanism
In the IEEE paper Secure Provenance Transmission for Streaming Data, by S. Sultana, M. Shehab and E. Bertino [13], a novel approach is proposed for securely transmitting the provenance of sensor data. In this mechanism, provenance is
watermarked over the delay between consecutive sensor data items.
Figure 15: Provenance Encoding at Sensor Node and Decoding at Base Station [13]
A set of (Lp + 1) data packets is used to embed provenance over the IPDs. Thus, the
sequence of Lp IPDs, 𝒟𝒮 = {Δ[1], Δ[2], …, Δ[Lp]}, is the medium where we hide provenance. Δ[j] represents the inter-packet delay between the j-th and (j+1)-th data items. The
process also uses the secret key 𝐾𝑖 (1 ≤ 𝑖 ≤ 𝑛, where 𝑛 is the number of nodes in the
network), a locally generated random number 𝛼𝑖 (known as impact factor) and the
provenance information pni. 𝛼𝑖 is a random real number generated according to a normal
distribution (𝜇, 𝜎). 𝜇 and 𝜎 are pre-determined and known to the BS and all the nodes.
The PN sequence consists of a sequence of +1’s and -1’s and is characterized by a zero
mean. [13]
The provenance encoding process at a node 𝑛𝑖 is summarized as follows:
Step E1 – Generation of Delay Perturbations:
By using provenance information pni and impact factor 𝛼𝑖, the node generates a set of
delay perturbations, 𝒱𝑖 = {𝑣i[1], 𝑣𝑖[2], ..., 𝑣𝑖[𝐿𝑝]}, as a sequence of real numbers. Thus,
𝒱𝑖 = {𝑣𝑖[1], 𝑣𝑖[2], ..., 𝑣𝑖[𝐿𝑝]} = 𝛼𝑖 × pni = {(𝛼𝑖 × 𝑝𝑛𝑖[1]), ..., (𝛼𝑖 × 𝑝𝑛𝑖[𝐿𝑝])}
Here, 𝑣𝑖[𝑗] corresponds to the provenance bit 𝑝𝑛𝑖[𝑗].
Step E2 – Bit Selection:
On the arrival of the (j+1)-th data packet, the node records the IPD Δ[𝑗] and assigns a delay perturbation 𝑣𝑖[𝑘𝑗] ∈ 𝒱𝑖 to it. The selection process uses the secret key 𝐾𝑖 and the packet timestamp 𝑡𝑠[𝑗 + 1] as follows:
𝑘𝑗 = 𝐻(𝐾𝑖 || 𝐻(𝑡𝑠[𝑗 + 1] || 𝐾𝑖)) 𝑚𝑜𝑑 𝐿𝑝
Here, 𝐻 is a lightweight, secure hash function and || is the concatenation operator.
Step E3 – Provenance Embedding:
The IPD Δ[𝑗] is increased by 𝑣𝑖[𝑘𝑗] time units. As 𝑣𝑖[𝑘𝑗] corresponds to provenance bit 𝑝𝑛𝑖[𝑘𝑗], a provenance bit is embedded over an IPD. Provenance bits are watermarked over IPDs by manipulating them with the corresponding delay perturbations, termed ‘watermark delays’. This way, 𝒟𝒮 is transformed into the watermarked version 𝒟𝒮𝑤.
The sensor dataset is thereby transmitted towards the BS while reflecting the
watermarked IPDs. Throughout the propagation, each intermediate node watermarks its own provenance in the same way, further increasing the IPDs.
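Steps E1-E3 might be sketched as follows (a simplified illustration in Python; the hash-based bit selection follows the reconstruction given above, and all names are hypothetical):

    import hashlib

    def embed_provenance(ipds, pn, key, timestamps, alpha):
        """ipds: the Lp inter-packet delays; pn: this node's PN bits (+1/-1);
        timestamps: packet timestamps (length Lp + 1); alpha: impact factor."""
        perturbations = [alpha * bit for bit in pn]                 # Step E1: v_i = alpha * pn_i
        watermarked = list(ipds)
        for j in range(len(ipds)):
            inner = hashlib.sha256((str(timestamps[j + 1]) + key).encode()).hexdigest()
            k = int(hashlib.sha256((key + inner).encode()).hexdigest(), 16) % len(pn)   # Step E2
            watermarked[j] += perturbations[k]                      # Step E3: embed over the IPD
        return watermarked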
Data packets may also experience different propagation delays or attacks aimed at
destroying the provenance information. At the end, the BS receives the dataset along with
watermarked IPDs 𝒟𝒮𝑤, which can be interpreted as the sum of delays imposed by the
intermediate nodes, attackers and difference between consecutive propagation delays
along the data path. Thus, 𝒟𝒮𝑤 represents the DSSS encoded signal in this context.
The provenance retrieval process at the BS approximates the provenance from this DSSS
signal based on an optimal threshold 𝑇∗. The threshold, corresponding to the network
diameter and PN length, is calculated once after the deployment of the network.
For retrieval purposes, the BS also requires the set of secret keys {𝐾1, 𝐾2... 𝐾𝑛} and PN
sequences {pn1, pn2... pnn}. The retrieval process at the BS follows two steps:
Step R1 - Bit Selection:
The IPDs for the incoming packets are recorded at the BS. For each node, the IPDs are
reordered according to the algorithm used in E2, which produces a node-specific sequence CSi.
Step R2 - Threshold-based Decoding: For any node ni, the BS computes the cross-correlation Ri between CSi and the provenance pni and decides whether pni was embedded by comparing Ri with the threshold T*. As the BS does not know which nodes participated in the data flow, it performs the bit selection and threshold comparison for all nodes. Based on the threshold comparison results, it deduces the participation of nodes in a data flow.
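The retrieval side can be sketched in the same simplified style (illustrative; here CSi is built by accumulating the recorded IPDs under the same bit-selection function as the sender):

    import hashlib

    def detect_node(ipds_at_bs, pn, key, timestamps, threshold):
        """Decide whether the node owning (pn, key) embedded its provenance (R1-R2)."""
        # Step R1: reproduce the sender's bit-selection ordering to build CS_i.
        cs = [0.0] * len(pn)
        for j, delay in enumerate(ipds_at_bs):
            inner = hashlib.sha256((str(timestamps[j + 1]) + key).encode()).hexdigest()
            k = int(hashlib.sha256((key + inner).encode()).hexdigest(), 16) % len(pn)
            cs[k] += delay
        # Step R2: correlate CS_i with the PN sequence and compare with T*.
        correlation = sum(c * b for c, b in zip(cs, pn)) / len(pn)
        return correlation > threshold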
3.3.4 Packet-Dropping Adversary Identification Scheme
This scheme is developed using the provenance transmission technique. In this scheme,
the malicious packet dropping attack is detected and the malicious node/link is identified.
This scheme relies upon the distribution of provenance embedded inter-packet delays and
consists of the following phases:
• Packet Loss Detection
• Identification of Attack Presence
• Localizing the Malicious Node/Link
The BS initiates the process for each sensor data flow by running the packet loss identification mechanism based on the inter-packet delays of the flow and the extracted provenance. As mentioned earlier, since there can be various reasons for packet loss, the
BS waits until a sufficient number of packet losses occur and then calculates the average
packet loss rate. [12]
A comparison of this loss rate with the natural packet loss rate of the path confirms the
event of malicious packet dropping. If a packet drop attack is suspected, the BS signals
the data source and the intermediate nodes to start the mechanism for pinpointing the
malicious node/link. The details of the mechanism are presented below:
A. Packet Loss Detection
Both the provenance information and the IPDs are used to detect a packet loss. The BS
can observe the data flows transmitted by all sensor nodes and obtain their timing
characteristics. Since provenance is embedded over the IPDs, the watermarked IPDs
follow a different distribution than the regular IPDs.
After the receipt of a few initial groups of (Lp + 1) packets from a data flow, the BS can approximate the distribution of the watermarked IPDs for that flow. Afterwards, the BS analyzes each IPD to check whether it follows the estimated distribution. If the data is dropped by a node that must be traversed to reach the BS, the attack ends the journey of the data.
Here, the IPD observed by the BS will be large enough to go beyond the distribution and
be detected as a packet loss. For packets containing sequence numbers, any out-of-order packet can confirm the detection of a packet loss. On the other hand, if the data packet
is dropped by an intermediate router within a cluster, it cannot interfere with the data
from other nodes that are to be aggregated at the cluster head and transmitted towards the
BS. [12]
Consequently, the IPD-based check will not be effective in such attack scenarios.
Fortunately, in this case, the provenance retrieved at the BS does not include the simple
path containing the malicious node from the source to the aggregator. Thus, the
dissimilarity of this provenance with the provenance of earlier rounds exposes the fact of
a packet loss.
B. Identification of Attack Presence
For performing this task, the BS collects Ga groups of (Lp+1) packets, where Ga > 0 is a real number. Assume the BS identifies m packet losses. Thus, the average packet loss rate
Lavg can be calculated as, Lavg = m / [Ga * (Lp + 1) + m]. The natural packet loss-rate is
calculated by,
Ln = Σ (from i = 1 to h) 𝜌i · Π (from j = 1 to i−1) (1 − 𝜌j)
Here, h is the number of hops in the path, 𝜌i is the natural loss rate of the link between
nodes ni and ni+1. By comparing Lavg with Ln, the BS confirms a packet-dropping attack: if Lavg > Ln, a malicious node surely exists in the flow path. [12]
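A small sketch of this comparison (illustrative):

    def natural_loss_rate(rho):
        """rho[i] = natural loss rate of link i along an h-hop path."""
        ln, survive = 0.0, 1.0
        for r in rho:
            ln += r * survive          # lost on this link, having survived earlier links
            survive *= (1.0 - r)
        return ln

    def attack_suspected(m, ga, lp, rho):
        lavg = m / (ga * (lp + 1) + m)     # empirical average packet-loss rate
        return lavg > natural_loss_rate(rho)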
C. Localizing the Malicious Node/Link:
For identifying the malicious link, more information is needed in the provenance apart
from just the nodeID. The data payload is used to carry this additional provenance data.
Whenever the BS detects the attack, it notifies the source and intermediate nodes in the
path about it. While forwarding the data packet, each node adds information, including the hash of the data, the timestamp, etc., of the last data packet it received through this path. Hence the format of a data packet at a node ni is:
mt = <data || timestamp || Pi>
Pi = {ni || H(mt−1) || Pi−1}Ki
Here, mt represents the current data packet and mt-1 is the most recent packet before the
current one. Pi denotes the provenance report at the node ni. H is a collision resistant hash
function and {D}k denotes a message D authenticated by a secret key k, using a message
authentication code (MAC). The data and timestamp are authenticated using MAC to
ensure integrity. Upon receiving a data packet containing the hash chain of provenance,
the BS can sequentially verify each provenance report embedded in it. Assume the flow
path i.e. the data provenance is {n1, n2... ni-1, ni, BS}, where n1 is the source node and
BS is the base station. The link between the nodes ni and ni+1 is represented as li.
For some i < d, if the provenance report from each intermediate node nj, where j lies within [1, i], contains the recent values for the timestamp and the hash of the data, but the provenance report from ni+1 contains the older values, then the BS identifies the link li as faulty.
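A sketch of the per-node provenance report and packet format described above (the MAC is realized here with HMAC-SHA256; the byte layout is an illustrative assumption):

    import hashlib, hmac

    def provenance_report(node_id, prev_packet_bytes, prev_report, key):
        """P_i = {n_i || H(m_{t-1}) || P_{i-1}} authenticated with the node's secret key (bytes)."""
        body = node_id.encode() + hashlib.sha256(prev_packet_bytes).digest() + prev_report
        mac = hmac.new(key, body, hashlib.sha256).digest()
        return body + mac

    def packet(data, timestamp, report):
        # m_t = <data || timestamp || P_i>
        return data + str(timestamp).encode() + report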
After observing Gl groups of (Lp+1) packets, the BS calculates the average packet loss rate of all the links of the path. Gl is also a real number greater than 0. If the loss rate of a
link li is significantly higher than the natural packet loss rate 𝜌i, then the BS labels the
link as a malicious link.
To quantify the term ‘significantly’, a per-link drop-rate threshold is introduced, denoted by 𝜏, where 𝜏 > 𝜌i. If the empirical packet loss rate of a link li is greater than 𝜏, then li is termed a malicious link. In this way, the malicious packet-dropping attack in sensor networks can be tackled using data provenance. [12]
3.4 Provenance of Electronic Data Transfer
3.4.1 Introduction
The IT landscape is evolving as illustrated by applications that are open, dynamically
composed, and that discover results and services on the fly. Against this vastly growing
and challenging background, it is crucial for users to be able to have confidence in the
results produced by such applications.
If the provenance of data produced by computer systems could be determined, then users,
in their daily applications, would be able to interpret and judge the quality of data better.
In this attempt, a provenance lifecycle is introduced and an open approach is proposed
based on some key principles underpinning existing provenance systems in computer
systems. To accomplish this vision, computer applications need to be transformed into
‘Provenance-aware applications’, for which the provenance of data may be retrieved,
analyzed and reasoned over.
3.4.2 Lifecycle of Provenance in Computer Systems
Both the scientific and business communities have adopted a predominantly service-oriented architectural (SOA) style of implementation, which allows services to be discovered and composed dynamically. While SOA-based applications are more dynamic and open in their functionality, they must also satisfy new, evolving requirements in both business and e-science.
Ideally speaking, end users of E-science applications would be able to perform tasks like
reproducing their results by replaying previous computations, understanding why two
seemingly identical runs with the same inputs produce different results, and determining
which data sets, algorithms, or services were involved in the derivation of their results.
Some users, reviewers, auditors, or even regulators need to verify that the process that led
to some result is compliant with specific regulations or methodologies that ought to be
followed. [16]
This is usually done because it needs to be proven that the results obtained were derived independently from services or databases with proper license restrictions. In addition to
this, they must also establish that data was captured at the source by using some highly
precise and technically accurate instruments.
However, they either cannot do so or can do it only imperfectly or incompletely at best,
because the underpinning principles have not yet been thoroughly investigated and
systems have not yet been defined to cater to such specific requirements. One key
observation in this regard is that electronic data does not always contain the necessary
historical information that would help end-users, reviewers, or regulators make the
necessary verifications in order to accurately validate the data.
Hence, there is a dire need to capture those additional fragments of missing information
that describe exactly what occurred at execution time. Such extra information is termed ‘Process Documentation’.
Figure 16: Provenance Lifecycle [16]
As shown in the above figure, provenance-aware applications create process
documentation and store it in a ‘provenance store’ database, the role of which is to offer a
long-term persistent, secure storage of process documentation.
It is a logical role, which accommodates various physical deployments. For instance, a
provenance store can be a single, autonomous service or, to be more scalable, it can be a
collection of distributed stores.
After the process documentation is recorded, the provenance of resultant data can be
retrieved by querying the provenance store, and can be analyzed according to the user’s
needs.
Over time, the provenance store and its contents may need to be managed or curated.
Hence, it can be summarized that the provenance life cycle consists of four different
phases: creating, recording, querying and managing. [16]
3.4.3 Open Model for Process Documentation
For many applications, process documentation cannot be generated in one single
iteration. Its generation must proceed in conjunction with the execution. This is why it is imperative to differentiate a specific item involved in documenting a specific part of a process from the whole of the comprehensive documentation. The former, referred
to as a p-assertion, is seen as an assertion made by an individual service involved in the
process. Thus, the documentation of a process consists of a set of p-assertions made
solely by the services involved in the process.
In order to minimize its impact on the performance of the application, documentation
needs to be structured in such a way that it can be constructed and recorded
autonomously by services. If this is not ensured, synchronizations will be required
between these services to agree upon how and where to document execution. This will
ultimately result in a significant loss in application performance. To satisfy this design requirement, various kinds of p-assertions have been identified, which applications are expected to adopt in order to document their execution.
Figure 17 below illustrates a computational service sending and receiving messages, and
creating p-assertions describing its involvement in such activity.
Figure 17: Categories of P-assertions Made by Computational Services [16]
Interaction P-assertions:
In SOAs, interactions mainly occur via messages that are exchanged between numerous
services. By thoroughly capturing all these interactions, execution can be analyzed, its
validity can be verified, and it can be compared with other executions as well. In order to
facilitate this, process documentation often includes interaction p-assertions, where an
interaction p-assertion is a description of the contents of a message by a service that has
sent or received that message.
Relationship P-assertions:
Irrespective of whether a service returns a result directly or calls other services for that
purpose, the relationship between its inputs and outputs cannot be explicitly represented
in the messages themselves. It can only be understood by performing a comprehensive
analysis of the business logic behind the concerned services.
To promote flexibility and generality, no assumption is made about the technology used
by services to implement their business logic, such as source code, workflow language,
etc. Instead, a requirement is placed on services to provide some information, in the form
of relationship p-assertions.
A relationship p-assertion is a description, asserted by a service, of how it obtained output
data sent in an interaction by applying some function, or algorithm, to input data from
other interactions. In Figure 17, output message M3 was obtained by applying function f1
to input M1. [16]
Service-state P-assertions:
Apart from the data flow in a given process, the internal service states may also be
necessary in order to understand non-functional characteristics of execution, such as the
performance or accuracy of services.
Thereby the nature of the result they compute can also be determined. Hence, a service
state p-assertion may be defined as documentation provided by a service about its internal
state in the context of a specific interaction.
Service state p-assertions can be extremely varied in nature. They can include the amount
of disk and CPU time a service used in a computation, its local time when an action
occurred, the floating-point precision of the results it produced, or application-specific
state descriptions. [16]
Provenance-aware applications need to be able to work together despite their innate
diversities. For this reason, it is important that the process documentation they all
produce is structured according to a shared data model. The novelty of this approach is
the openness of the proposed model of documentation, which is conceived to be
independent of application technologies [17]. Taken together, these characteristics allow
process documentation to be produced autonomously by application services, and be
expressed in an open format, over which provenance queries can be expressed [18].
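To make the three categories concrete, a service might record them with structures along these lines (field names are illustrative assumptions, not the data model of [17]):

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class InteractionPAssertion:          # contents of a sent or received message
        interaction_id: str
        role: str                         # "sender" or "receiver"
        message: str

    @dataclass
    class RelationshipPAssertion:         # how output data was derived from input data
        output_interaction: str
        input_interactions: List[str]
        function: str                     # e.g. "f1" applied to M1 to obtain M3

    @dataclass
    class ServiceStatePAssertion:         # internal state in the context of an interaction
        interaction_id: str
        state: dict                       # e.g. {"cpu_time_ms": 42, "user": "alice"}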
3.4.4 Querying the Provenance of Electronic Data
Provenance queries are user-tailored queries over process documentation. They are aimed
at obtaining the provenance of electronic data. The primary challenge in this regard, is to
characterize the data item in which the user is interested.
As data can be mutable, its provenance can vary according to the particular point of
execution from which a user wishes to find it. A provenance query, therefore, needs to
identify a data item with respect to sending or receiving a message. The complete details
of everything that ultimately caused a data item to be the way it is, could potentially be
very large in size.
For instance, the complete provenance of the results of an experiment would possibly
include a description of the process that produced the materials used in the experiment,
the provenance of any source materials used in producing those materials, the devices and
software used in the experiment and their settings, etc. This way, if documentation were
available, details of processes leading back to the beginning of time or at least the epoch
of provenance awareness would also be included. [16]
Thus, it is highly essential for users to express their scope of interest in a process, using a
provenance query. Such a query essentially performs a reverse graph traversal over the
data flow DAG and terminates at the exact point catering to the query-specified scope.
The query output would be a subset of the DAG. Scoping can be based on types of
relationships, intermediate results, services, or sub-processes. [17]
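Conceptually, such a scoped provenance query is a reverse traversal of the data-flow DAG, for example (illustrative; 'edges' maps each data item to the relationships and items it is based on):

    def provenance_query(item, edges, in_scope):
        """Return the sub-DAG reachable backwards from 'item', pruned by a
        user-supplied scoping predicate in_scope(relationship, source_item)."""
        result, stack = {}, [item]
        while stack:
            current = stack.pop()
            if current in result:
                continue
            kept = [(rel, src) for rel, src in edges.get(current, [])
                    if in_scope(rel, src)]
            result[current] = kept
            stack.extend(src for _, src in kept)
        return result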
3.4.5 Example: Provenance in Healthcare Management
In order to illustrate the proposed approach, let us consider a healthcare management
application.
The focus here is the Organ Transplant Management (OTM) system, which manages all the activities related to organ transplants across multiple Catalan hospitals and their regulatory authority.
OTM is made up of a complex process, involving not only the surgery itself, but also a
wide range of other activities, such as data collection and patient organ analysis, which
all have to comply with a set of regulatory rules.
Currently, OTM is supported by an IT infrastructure that maintains records allowing
medical personnel to view and edit a given patient’s local file within a given institution or
laboratory. [16]
However, the system does not connect records, nor does it capture all the dependencies
between them. It does not allow external auditors or patients’ families to analyze or
understand how decisions are reached. [19]
If OTM is made provenance-aware, powerful queries that were not previously possible could be supported with ease. Some of those queries include finding all doctors involved in a decision, finding the blood test results that were involved in a donation decision, and finding all data that led to a decision being taken. Such functionality
can be made available not only to the medical profession but also to regulators or
families.
For easier understanding, the discussion is limited to a small, simplified subset of the
OTM Workflow, namely the process leading to the decision of donating an organ. [16]
Figure 18: Provenance DAG of a Donation Decision [16]
As a hospitalized patient’s health declines and in anticipation of a potential organ
donation, one of the attending doctors requests the full health record for the patient and
sends a blood sample for analysis. Through a user interface (UI), these requests are made
by the attending doctor and passed on to a software component (Donor Data Collector)
responsible for collecting all the expected results.
After brain death is observed and logged into the system, if all requested data and
analysis results have been obtained, a doctor is asked to make a decision about the
donation of an organ. The decision, i.e., the outcome of the doctor’s medical judgment
based on the collected data, is explained in a report that is submitted as the decision’s
justification.
Figure 18 displays the components involved in this scenario and their interactions. The
UI sends requests (I1, I2, I3) to the Donor Data Collector service, which gets data from
the patient records database (I4, I5) and analysis results from the laboratory (I6, I7), and
finally requests a decision (I8, I9). To make OTM provenance-aware, it is augmented
with a capability to produce an explicit representation of the process actually taking
place. This includes p-assertions for all interactions (I1 to I9), relationship p-assertions
capturing dependencies between data, and service state p-assertions. [16]
Figure 18 shows the DAG that represents the provenance of a donation decision, made of relationship p-assertions produced by provenance-aware OTM. DAG nodes
denote data items, whereas DAG edges (in blue) represent relationships such as data
dependencies (is based on, is justified by) or causal relationships (in response to, is
caused by). Each data item is annotated by the interaction in which it occurs.
Furthermore, the UI asserts a service state p-assertion, for each of its interactions, about
the user who is logged into the system.
Over such documentation, provenance queries can be issued that navigate the provenance
graph and prune it according to the querier’s needs. For instance, from the graph, it can
be derived that users X and Y are both causing a donation decision to be reached. [16]
Figure 18 is merely a snapshot of one scenario, but in real-life examples, with vast amounts of documentation, users benefit from a powerful and accurate provenance query
facility.
Hence, this kind of open approach allows complex distributed applications, possibly
involving multiple technologies (such as Web Services, Command-line Executables,
Monolithic Executables), to be documented. It also allows complex provenance queries
to be expressed, identifying data and scoping processes independently of technologies
used.
CHAPTER 4
CONCLUSION
4.1 Summary
This report mainly focused on exploring the implementation details of Data Provenance
across some applications as well as observing the various models proposed to represent
the functionality of data provenance. This report has primarily provided a basic
understanding of the concept of data provenance and a review of the various aspects of
provenance such as the generalized W7 Model, and some proposed conceptual models
that could be developed based on the W7 Model for satisfying a wide range of
requirements. Then, the technique of Active Conceptual Modeling was observed, which
is proposed in an attempt to make the conceptual schema active while capturing
provenance in a database. Following on was a detailed discussion of the implementation
of data provenance across some common, real-world applications. Firstly, it was
observed how provenance supports information assurance attributes like availability,
confidentiality and integrity in a multi-level secure environment. Then, a provenance-based mechanism was described, which can tackle the malicious packet-dropping attack in the field of sensor networks. Next, the implementation of provenance collection and
querying in electronic data was explained. The example application of a Healthcare
Management System was discussed to further clarify the implementation.
4.2 Future Work
This report has offered a rather generalized perspective of the concept of Data
Provenance attributed to some applications. Apart from the applications that are
discussed in this report, data provenance holds paramount importance in several other
areas. Future work would involve delving deeper into the details of capturing, storing and querying provenance information in areas such as grid computing, cloud computing and web applications, among others, in order to gain a better idea of the versatility of the concept of provenance. By implementing Active Conceptual Modeling across a wider range of applications than at present, such as online website development, phishing attacks on websites could be reduced more efficiently. Virtual Data System (VDS) and myGrid are execution environments for
scientific workflows, which also provide support for provenance. They follow their
respective workflow language, which allows them to obtain compact process
documentation. By adopting an open data model for process documentation, such
systems could be integrated into heterogeneous applications for which provenance
queries could be executed seamlessly [16]. Phantom Lineage, the concept that deals with
the ways to store provenance of missing or deleted data, also requires further
consideration. Finally, it can be said that data provenance is still a relatively new and
exploratory field. Hence, a deeper understanding of provenance is essential to identify
novel ways to recognize its full potential.
GLOSSARY
XML
Extensible Markup Language
SOA
Service-Oriented Architecture
SOAP
Simple Object Access Protocol
REST
Representational State Transfer
DP
Data Provenance
WSS
Web Services Security
MD
Message Digest
SHA
Secure Hash Algorithm
TCP
Transmission Control Protocol
UDP
User Datagram Protocol
IPD
Inter-Packet Delay
DAG
Directed Acyclic Graph
BS
Base Station
OTM
Organ Transplant Management
REFERENCES
[1] A New Perspective on the Semantics of Data Provenance, by Sudha Ram and Jun Liu.
Link: http://ceur-ws.org/Vol-526/InvitedPaper_1.pdf
[2] Understanding the Semantics of Data Provenance to Support Active Conceptual
Modeling, by Sudha Ram and Jun Liu.
Link: http://kartik.eller.arizona.edu/ACML_Provenance_final.pdf
[3] Data Provenance: A Categorization of Existing Approaches, by Boris Glavic and
Klaus Dittrich
Link: http://www.cs.toronto.edu/~glavic/files/pdfs/GD07.pdf
[4] Research Problems in Data Provenance, Wang-Chiew Tan.
Link: http://users.soe.ucsc.edu/~wctan/papers/2004/ieee.pdf
[5] Archiving Scientific Data, by Peter Buneman, Sanjeev Khanna, Keishi Tajima,
Wang-Chiew Tan.
Link: http://homepages.inf.ed.ac.uk/opb/papers/sigmod2002.pdf
[6] Why and Where: A characterization of Provenance, by Peter Buneman, Sanjeev
Khanna, and Wang Chiew Tan.
Link: http://db.cis.upenn.edu/DL/whywhere.pdf
[7] The virtual data grid: a new model and architecture for data-intensive collaboration,
by Ian T. Foster
Link: http://www.ci.uchicago.edu/swift/papers/ModelAndArchForDataCollab2003.pdf
[8] Modeling Temporal Dynamics for Business Systems, by G. Allen and S. March.
Link: http://wenku.baidu.com/view/ae750a651ed9ad51f01df242.html
[9] Future direction of conceptual modeling, by P. P. Chen, B. Thalheim, and L. Wong
Link: http://www.springerlink.com/content/9r6hd2c344ceqm85/
[10] Data Provenance Architecture to Support Information Assurance in a Multi-level
Secure Environment, by Abha Moitra, Bruce Barnett, Andrew Crapo and Stephen J Dill.
Link: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05379854
[11] A Survey of Trust in Computer Science and the Semantic Web, by D. Artz and Y. Gil. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 5, pp. 58-71, 2007.
Link: http://www.isi.edu/~gil/papers/jws-trust-07.pdf
[12] A Provenance-based Mechanism to Identify Malicious Packet Dropping Adversaries
in Sensor Networks, by Salmin Sultana, Elisa Bertino, and Mohamed Shehab.
Link: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5961507
[13] Secure Provenance Transmission for Streaming Data, by S. Sultana, M. Shehab, and
E. Bertino.
Link: http://ieeeexplore.com/stamp/stamp.jsp?tp=&arnumber=6152110
[14] National Cyber Security Research and Development Challenges - Related to
Economics, Physical Infrastructure and Human Behavior, 2009
Link: http://www.carlisle.army.mil/DIME/documents/i3pnationalcybersecurity.pdf
[15] http://en.wikipedia.org/wiki/Direct-sequence_spread_spectrum
[16] Provenance of Electronic Data, by Luc Moreau, Paul Groth, Simon Miles, Javier
Salceda, John Ibbotson, Sheng Jiang, Steve Munroe,Omer Rana, Andreas Schreiber,
Victor Tan, and Laszlo Varga
Link: http://eprints.soton.ac.uk/270862/1/cacm08.pdf
[17] An Architecture for Provenance Systems, by Paul Groth, Sheng Jiang, Simon Miles,
Steve Munroe, Victor Tan, Sofia Tsasakou, and Luc Moreau
Link: http://eprints.soton.ac.uk/263216/1/provenanceArchitecture10.pdf
[18] The Requirements of Recording and Using Provenance in E-science Experiments, by
Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau
Link: http://eprints.soton.ac.uk/260269/1/pasoa04requirements.pdf
[19] Applying Provenance in Distributed Organ Transplant Management, by Sergio
Alvarez, Javier Vazquez-Salceda, Tamas Kifor, Laszlo Varga, and Steven Willmott.
Link: http://www.gridprovenance.org/publications/IPAW-OTM-EHCR.pdf
[20] Tracking Provenance in a Virtual Data Grid, by Ben Clifford, Ian Foster, Jens
Volcker, Michael Wilde and Yong Zhao.
Link: ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1407.pdf
[21] Mining Taverna’s Semantic Web of Provenance, by Jun Zhao, Carole Goble, Robert
Stevens and Daniele Turi.
Link: http://onlinelibrary.wiley.com/doi/10.1002/cpe.1231/pdf
[22] Passing the Provenance Challenge, Margo Seltzer, David A. Holland, Uri Braun, and
Kiran-Kumar Muniswamy-Reddy.
Link: http://www.eecs.harvard.edu/~kiran/pubs/ccpe.pdf
[23] A Survey of Data Provenance in E-science, by Yogesh Simmhan, Beth Plale and
Dennis Gannon
Link: http://pti.iu.edu/sites/default/files/simmhanSIGMODrecord05.pdf
[24] The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance,
by Ragib Hasan, Radu Sion, Marianne Winslet
Link: http://www.usenix.org/event/fast09/tech/full_papers/hasan/hasan.pdf
[25] Secure Provenance for Cloud Storage, by Masoud Valafar, Kevin Butler
Link: www.ieee-security.org/TC/SP2011/posters/Secure_Provenance_for_Cloud_Storage