ANALYSIS OF DATA PROVENANCE ACROSS VARIOUS APPLICATIONS A Project Presented to the faculty of the Department of Computer Science California State University, Sacramento Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Computer Science by Praneet Mysore SPRING 2013 ANALYSIS OF DATA PROVENANCE ACROSS VARIOUS APPLICATIONS A Project by Praneet Mysore Approved by: __________________________________, Committee Chair Isaac Ghansah, Ph.D. __________________________________, Second Reader Robert A. Buckley ____________________________ Date ii Student: Praneet Mysore I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project. __________________________, Graduate Coordinator Behnam Arad, Ph.D. Department of Computer Science iii ___________________ Date Abstract of ANALYSIS OF DATA PROVENANCE ACROSS VARIOUS APPLICATIONS by Praneet Mysore Data Provenance refers to the line of descent or the ancestry of information. It constitutes the origin of that data, along with some key events that occur over the course of its lifecycle. Additionally, some important details associated with the creation of that data, its processing and archiving are also a part of it. Such information is instrumental in determining how secure and trustworthy the data is. This is the primary reason why provenance is one important aspect of data security. Even in applications like Digital Forensics, provenance helps maintain a proper chain of custody by providing information about who collected the evidence, what procedure was followed, when and why it was collected and where it was stored. This report discusses provenance models and some existing environments and applications where provenance is used such as, a multi-level secure environment, sensor networks, and electronic data transfer. iv Other environments where provenance can be useful such as cloud and grid computing are not part of this report. _______________________, Committee Chair Isaac Ghansah, Ph.D. _______________________ Date v ACKNOWLEDGEMENTS I express my heartfelt gratitude to my advisor, Dr. Isaac Ghansah for his helpful comments and feedback throughout the course of this project. I genuinely appreciate the valuable time and effort Dr. Ghansah had dedicated towards this project despite his hectic schedule. I sincerely thank Professor Robert A. Buckley, my second reader, for going through my project report and making some highly beneficial suggestions on content. I also thank Dr. Behnam Arad for his valuable comments on the format of this report. Finally, words alone cannot express the love I have for my parents, Narasimha Rao and Mridula Mysore and my brother, Praveen Mysore, for their endless support throughout my Masters program. Their love and support has kept me immensely motivated to do my best. vi TABLE OF CONTENTS Page Acknowledgements ..................................................................................................... vi List of Tables ............................................................................................................... ix List of Figures ............................................................................................................... x Chapter 1. INTRODUCTION…………………………………………………………...........1 1.1. 
Background…………………………………………………………….....1 1.1.1 Bunge’s Ontology .........................................................................1 1.1.2 W7 Model ....................................................................................3 1.2. Data Provenance .........................................................................................7 1.3. Report Organization………………………………………………………8 2. OVERVIEW OF EXISTING APPROACHES ........................................................9 2.1 Introduction………………………………………………………………..9 2.2 Existing Models to Represent Provenance Information ........................... 10 2.3 Active Conceptual Modeling with Provenance………………………..... 25 3. SOME APPLICATIONS OF PROVENANCE .....................................................30 3.1 Introduction………………………………………………………………30 3.2 Supporting Information Assurance in Multi-level Secure Environment... 33 3.2.1 Introduction ................................................................................. 33 3.2.2 Message-Structure Overview…………………………………... 34 3.2.3 Wrappers……………………………………………………….. 36 3.2.4 Messages……………………………………………………….. 37 3.2.5 Data Provenance……………………………………………….. 38 3.2.6 Addressing Information Assurance Attributes………………… 46 3.2.7 Example Analysis……………………………………………… 49 3.2.8 Monitoring and Analyzing Workflows………………………....52 vii 3.3 Provenance Mechanism to tackle Packet Dropping in Sensor Networks…………………………………………………………………..54 3.3.1 Introduction……………………………………………………... 54 3.3.2 Overview of the Scheme………………………………………... 58 3.3.3 Secure Provenance Transmission Mechanism………………….. 64 3.3.4 Packet-Dropping Adversary Identification Scheme……………...67 3.4 Provenance of Electronic Data Transfer……………………………..…… 71 3.4.1 Introduction……………………………………………………... 71 3.4.2 Lifecycle of Provenance in Computer Systems…………………. 72 3.4.3 Open Model for Process Documentation……………………...… 75 3.4.4 Querying the Provenance of Electronic Data……………………. 78 3.4.5 Example: Provenance in Healthcare Management………………. 79 4. CONCLUSION………………………………………………………………... 84 4.1 Summary………………………………………………………………... 84 4.2 Future Work…………………………………………………………….. 85 Glossary…………………………………………………………………………... 86 References………………….………………………………………………...........88 viii LIST OF TABLES Tables Page 1. Definition of seven W’s in the W7 Model……… . .……………………………4 2. Application of the W7 Model in the Wikipedia……. ………………………… 6 3. DP Records for Example Scenario……….……………………………………49 ix LIST OF FIGURES Figures Page 1. Overview of the W7 Model………………… ....................................... .………5 2. Provenance Model……………………………….…………………………… 13 3. Manipulation and Query Facilities……………………. ………….………… 19 4. Storage and Recording Model……………………. ………………………… 21 5. A Sample Provenance-based Conceptual Model .............................................. 26 6. Envelope Structure with Data Provenance…………………………………….35 7. Adding Wrappers & De-Wrappers to minimize impact on workflow………...37 8. Data Provenance Record ............................................................................... …39 9. Multiple DP records with Forwarded Message ................................................ 42 10. Truncated DP Record ........................................................................................ 43 11. DP Record with Proxy Owner .......................................................................... 44 12. DP Record with an Included Attachment ......................................................... 45 13. 
Dashboard for Monitoring and Analyzing Workflows ..................................... 53 14. Provenance Examples for a Sensor Network .................................................... 61 15. Provenance Encoding at Sensor Node and Decoding at Base Station .............. 64 16. Provenance Lifecycle ........................................................................................ 74 17. Categories of P-assertions Made by Computational Services .......................... 76 18. Provenance DAG of a Donation Decision ........................................................ 81 x 1 CHAPTER 1 INTRODUCTION 1.1 Background Data is the core entity that drives the digital world in today’s times. Owing to that fact, it is imperative that we ensure that our data is secure. The security of data is of paramount importance because it essentially determines the amount of trust we can put into the data. So, while it is important to provide security to our data, which may be present in various storage systems such as databases or even networks for that matter, it is also important to know exactly how much security needs to be provided and how much is needed to safely and confidently assert that our data is secure and trustworthy. Provenance is that aspect of data, which tells us the story behind the data. In general terms, it aims to give a comprehensive account of the history associated with the data. We discuss provenance in explicit detail, in the following subsections. Data Provenance was conceptualized largely based on a basic ontological view, known as ‘Bunge’s Ontology’. 1.1.1 Bunge’s Ontology Devised by the Argentine physicist Mario Bunge, this ontology forms the basis for perfectly defining the fundamentals of data provenance. 2 The core element of Bunge’s ontology is a ‘thing’. The ‘state’ of a thing is nothing but the set of property values of the thing at a given time. Bunge’s ontology postulates the following statement: “Everything changes, and every change is a change of the state of things, i.e., the change of properties of things. A change of state is termed as an event.” Therefore, it can be inferred that an event occurs when a thing acquires, loses or changes the value of a given property. Data can also be considered as a ‘thing’. Hence, a set of things is analogous to pieces of data [1]. Action, Agent, Time and Space are all constructs related to events. An event, on a thing, occurs when another thing, often a human or software agent, acts upon it. An event occurs at a particular point in time, in a particular space. Based on these constructs of event and state, comes the concept of ‘history’. History of data is a sequence of events that happens to the data over the course of its lifetime [1]. Bunge’s theory regarding history and events is a perfect match for defining data provenance and its semantics since data provenance is often referred to as the pedigree or history of data. In this manner, the constructs in Bunge’s ontology including history, event, action, etc. lay a theoretical foundation for defining provenance and its components. 3 Based on this ontology, we can devise a model known as the W7 Model, which is instrumental in providing a sense of completeness to the semantics of data provenance. 1.1.2 The W7 Model Using this model, provenance can be represented as a combination of seven W’s namely, What, When, Where, How, Who, Which and Why. 
This model is widely perceived to be generic and extensible enough to capture the essence of the semantics of provenance across various domains. Provenance in the W7 Model is defined as follows: "Provenance of some data D is a set of n-tuples: p(D) = {< What, When, Where, How, Who, Which, Why >}." The definitions of these seven W's, which are analogous to the constructs in Bunge's Ontology, are tabulated as follows:

Table 1: Definition of the seven W's in the W7 Model [1]
PROVENANCE ELEMENT | CONSTRUCT IN BUNGE'S ONTOLOGY | DEFINITION
What | Event | An event, such as a change of state, that happens to data during its lifetime
How | Action | An action that triggers the occurrence of an event
When | Time | The timestamp or duration of an event
Where | Space | The location(s) associated with an event
Who | Agent | The person(s) or organizations involved in the occurrence of the event
Which | Tools | The software programs or hardware tools used in the event's occurrence
Why | Reason | The reasons giving an accurate account of why an event has occurred

As the table shows, 'What' denotes an event that affected data during its lifetime. 'When' refers to the time at which the event occurred. 'Where' tells the location of the event. 'How' depicts the action leading up to the event. 'Who' tells about the agent(s) involved in the event. 'Which' identifies the software programs or tools used in the event, and 'Why' represents the reasons for the occurrence of the event. Hence the name W7 is given to this model.

A diagrammatic depiction of the W7 model is given below. In the figure, the boxes represent the concepts and the bubbles represent the relationships between those concepts. The seven W's are inter-connected in such a way that they form a self-explanatory flow of operation of the model. [1]

Figure 1: Overview of the W7 Model [1]

From the above figure, it can be inferred that 'What' (i.e., Event) is the anchor of the W7 model. The other six W's all revolve around 'What' and are determined based on the occurrence of an event. Therefore, whenever a piece of data present in a database is modified in any manner, the modification needs to be investigated and information about the seven W's needs to be gathered and stored, so that the following is known in detail: what type of modification was performed, who performed it, why it was performed, how it was done, from where the data was taken, where the modified data is stored, and at what time the modification took place [1]. By collecting all this information and storing it for future reference, a concrete idea can be obtained regarding the trustworthiness of the data.

In order to describe the application of the W7 model in a common real-life scenario, we consider applying it to the context of a Wikipedia page.

Table 2: Application of the W7 Model in the Wikipedia [1]
PROVENANCE ELEMENT | APPLICATION TO WIKIPEDIA PAGE
What | Creation, Modification, Destruction, Quality Assessment, Access-Rights Change
How | Sentence insertion/update/deletion, reference insertion/update/deletion, Revert (to a previous version)
Who | Administrators, registered editors, and anonymous editors
When | Timestamp of events
Where | IP Address of the editor
Which | Software used to work on the page
Why | User comments (feedback and suggestions)

Implementing data provenance in Wikipedia requires very little manual effort. The MediaWiki software used by Wikipedia automatically captures the what, who, how, when, where, and which; a short illustrative example of this mapping follows.
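As a concrete illustration, the following short Python sketch represents a single hypothetical page revision as a W7 tuple. The class name, field names and revision data are invented here purely for illustration and are not part of the model in [1].

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class W7Record:
        # One provenance tuple <What, When, Where, How, Who, Which, Why> for a single event.
        what: str                  # the event, e.g. a modification of the page
        when: str                  # timestamp of the event
        where: str                 # location, e.g. the editor's IP address
        how: str                   # the action that triggered the event
        who: str                   # the agent(s) involved
        which: str                 # the software or hardware tools used
        why: Optional[str] = None  # the reason; for Wikipedia, the manually entered edit summary

    # A hypothetical revision of a Wikipedia page, expressed as a W7 tuple.
    revision = W7Record(
        what="Modification",
        when="2013-03-14T09:26:53Z",
        where="203.0.113.42",
        how="Sentence insertion",
        who="Registered editor 'ExampleUser'",
        which="MediaWiki",
        why=None,  # left empty unless the editor supplies an edit summary
    )

    # The provenance of the page is the set of such tuples accumulated over its lifetime.
    page_provenance = [revision]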
Only the why provenance demands manual input [1]. Applying the W7 Model to the Wikipedia enables us to capture and store provenance of Wikipedia pages in a structured and comprehensive manner. 7 1.2 Data Provenance: With Bunge’s Ontology as the fundamental basis, and by using the semantics provided through the W7 model, we can say that data provenance conveys the basic idea that data needs to be captured in the hope that it is comprehensive enough to be useful in the future. It is the background knowledge that enables a piece of data to be interpreted correctly and supports learning [1]. Although a vast amount of research has already been done on the concept of provenance, according to several researchers, it is still unclear as to what the scope of provenance is and how much information it could include. With the rapidly growing use of databases, there is not only a need to make the data more secure, but also to ensure that it is trustworthy. This report discusses the concept of data provenance in detail by classifying the various existing approaches of capturing and interpreting provenance information. The various applications and the future scope of data provenance will also be analyzed. 8 1.3 Report Organization: Chapter 1: Introduction This chapter discusses the background of the concept of data provenance and briefly touches on the basics about provenance. Chapter 2: Overview of Existing Approaches This chapter delves further into the depths of the concept of provenance by offering an overview of the various existing models that are used to represent provenance. Active Conceptual Modeling, an enhancement to the existing provenance models, is also discussed. Chapter 3: Various Applications of Provenance This chapter throws light on some of the most prominent applications of data provenance. A detailed discussion of the way provenance is implemented in these applications and how it contributes towards enhancing the trustworthiness of data is provided. Chapter 4: Conclusion This chapter briefly summarizes the research made on data provenance throughout this report and the way it is implemented across various applications, thereby providing a scope for future work by identifying some possible ways in which the implementation of data provenance could be further enhanced. Following Chapter 4, are Glossary and References. 9 CHAPTER 2 OVERVIEW OF EXISTING APPROACHES 2.1 Introduction In data warehousing, e-science and several other application areas, data provenance is a very valuable asset. The provenance of a data item includes information about the source data items along with the processes and events responsible for its creation, subsequently leading to its current representation. However, the diversity of data representation models and application domains has led to the inception of a number of formal definitions of provenance. Most of those definitions are limited to a specific application domain, data representation model or data processing facility. Unsurprisingly, the associated implementations are also restricted to a certain application domain and depend on a special data model for their representation. In this chapter, a limited selection of such data provenance models and prototypes are observed. A general categorization scheme for these provenance models is provided. Thereby, this categorization scheme is used to study the properties of the existing approaches. 
This categorization would eventually help distinguish between different kinds of provenance information and thereby could lead to a greater understanding of provenance in general 10 2.2 Existing Models To Represent Provenance Information Scientists in various fields often use data from ‘curated’ databases in their experimental analysis. Most of the data stored in a curated database is the result of a set of manual transformations and derivations. Hence, those who use data from a curated database are often interested in information about the data sources and transformations that were applied to the data from these sources. This information is used either to assess the quality and thereby the authenticity of the data, or to re-examine some data-derivation processes to see if re-running a certain set of tests or experiments is required [3]. Data warehouses are used to integrate the data collected from various sources, each having different representations. Thereafter, the integrated data is analyzed thoroughly to check for anomalies. This analysis could also benefit from any obtainable information about the original data sources and the set of transformations that were applied in order to generate and store the integrated data in the data warehouse. These are just some of the several reasons why data provenance is so important in most applications. Besides provenance information related to the data, it is also important to include the storage requirements and transformation requirements for all kinds of provenance information across these diverse application domains. Although a broad variety of applications would benefit from provenance information, the type of provenance data, the manipulation facilities and querying facilities needed, differ from one application to another. [3] 11 For that reason, the differences and similarities between the various application and data model provenance needs are identified and a general scheme for the categorization of provenance is presented henceforth. Provenance can be viewed in two different ways. In one view, the provenance of a data item can be described as the processes that lead to its creation whereas in the other view, the objective is to focus on the source data from which a given data item is derived. The term “Source Provenance” represents the latter, whereas the term “Transformation Provenance” represents the former. In other words, we can say that transformation refers to the creation process itself and the terms source and result refer to the input and output of a transformation respectively. The existing research directions can be classified into two distinct categories based on their approach to provenance recording. One research direction focuses on computing provenance information when data is created, while the other computes provenance data when it is requested. These approaches can be termed as ‘eager’ and ‘lazy’ respectively. Most of the eager approaches are based on source data items and transformations, while most of the lazy approaches rely on inversion or input tracing of transformations. [3] There is a close relation between data provenance and temporal data management. Much like in temporal data management, in provenance also, the previous versions of a data item are queried and accessed. 12 So provenance management systems may benefit from existing storage methods and query optimizations for temporal databases. 
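Before moving on, the eager/lazy distinction introduced above can be made concrete with a toy sketch. The functions and the doubling transformation below are invented for illustration and are not taken from [3]; the lazy variant only works because the transformation chosen here is invertible.

    # Eager recording: provenance is captured at the moment the result is created.
    provenance_store = {}

    def transform_eager(result_id, value, source_id):
        result = value * 2  # the (trivial, invertible) transformation
        provenance_store[result_id] = {"sources": [source_id], "transformation": "double"}
        return result

    # Lazy recording: nothing is stored up front; provenance is derived on request
    # by inverting the transformation that produced the result.
    def provenance_lazy(result_value):
        return {"reconstructed_source_value": result_value / 2, "transformation": "double"}

    r = transform_eager("item-7", 21, source_id="item-3")
    print(provenance_store["item-7"])  # available immediately, at the cost of extra storage
    print(provenance_lazy(r))          # computed only when asked for, at the cost of query-time work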
Returning to the parallel with temporal databases, the identification methods used in temporal data management may be applicable to provenance management as well [3].

The sections that follow deal with data provenance from a conceptual point of view and define a general categorization scheme for provenance management systems. Several functionalities of a provenance management system are defined, and these functionalities are ordered in a hierarchy of categories. The three main categories of the categorization scheme are the provenance model, the query and manipulation functionality, and the storage model and recording strategy [3]. An overview figure for each main category (Figures 2, 3 and 4) is presented. Boxes in these figures represent categories and ellipses represent functionalities.

Figure 2: Provenance Model [3]

The provenance model embodies the expressiveness of the provenance management system in defining the provenance of a data item. As specified earlier, the provenance of a data item can be divided into Transformation Provenance and Source Provenance. Source provenance is information about the data that was involved in the creation of a data item. Source provenance can be defined in three distinct conceptual terms: the original source, the contributing source and the input source.

The input source includes all data items that were used in the creation of a particular data item. The positive contributing source includes all the data items that are essential for the creation of a particular data item. The original source contains all data items whose data is copied to the resulting data item [3]. For example, assume we manage the provenance of data in a relational database with two relations R1 and R2 and handle data items at tuple level, and consider executing the SQL query SELECT R1.name FROM R1, R2 WHERE R1.id = R2.id against this database. The input source of a resulting tuple T includes all the tuples in R1 and R2. The positive contributing source of T contains all tuples T' from relation R1 and T'' from relation R2 with T.name = T'.name and T'.id = T''.id. Finally, the original source of T includes all tuples T' from relation R1 with T.name = T'.name [3]. Note that the following containment relationship holds: input source ⊇ positive contributing source ⊇ original source. A small sketch illustrating these three notions on concrete example data is given at the end of this discussion.

Some applications would benefit from information about data items that do not actually exist in the source but would affect the creation of a resulting data item if they were included. The term 'negative contributing source' is used for this concept. Unlike the positive contributing source, this contrasting concept is not as straightforward to define accurately. It appears reasonable either to include all the data items that would prohibit the creation of the result, or to include all the possible combinations of data items that would prohibit the creation of the result. In most data repositories, the data actually stored is only a very small fraction of the data that could be stored, so it is not realistically possible to store the negative contributing source of a data item in the repository [3].

A provenance management system must consider not only which kinds of sources should be part of the source provenance, but also how information about each source data item is recorded.
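The following minimal Python sketch, with two toy relations standing in for R1 and R2, illustrates the three source notions for the query above. The data and helper names are invented for illustration and are not drawn from [3].

    # Toy relations, with provenance handled at tuple level.
    R1 = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
    R2 = [{"id": 1, "dept": "HR"},    {"id": 3, "dept": "IT"}]

    # Result of: SELECT R1.name FROM R1, R2 WHERE R1.id = R2.id
    result = [{"name": t1["name"]} for t1 in R1 for t2 in R2 if t1["id"] == t2["id"]]

    def sources_of(T):
        # Input source: every tuple consulted by the query.
        input_source = R1 + R2
        # Positive contributing source: the R1/R2 tuples that satisfy the join
        # and produce T's name value.
        pos_r1 = [t1 for t1 in R1
                  if t1["name"] == T["name"] and any(t1["id"] == t2["id"] for t2 in R2)]
        pos_r2 = [t2 for t2 in R2
                  if any(t1["id"] == t2["id"] and t1["name"] == T["name"] for t1 in R1)]
        # Original source: the R1 tuples whose name value was copied into T.
        original = [t1 for t1 in R1 if t1["name"] == T["name"]]
        return input_source, pos_r1 + pos_r2, original

    inp, pos, orig = sources_of(result[0])
    # For this data, the containment noted above holds: every tuple in orig is in pos,
    # and every tuple in pos is in inp.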
A source can be represented is one of these four ways: As the original data, as metadata attached to the source, as a source hierarchy structure or as a combination of any or all of these representations. A provenance management system has the innate ability to record source data items at not just one level of detail but in multiple levels of detail. For example, the source of a tuple data item in relational view would by default, include all tuples from a relation R. However, if the provenance model has the capability of handling multiple levels of detail, the source can be represented as a relation R instead. Managing provenance information at different levels of detail is a relatively more sensible approach as it provides more flexibility and can in turn result in smaller storage overhead. 16 One possible downside is that, this requires a more complex provenance model. Transformation provenance is the information about the transformations that were involved in the creation of a data item. To make a clear separation between a concrete execution of a process and the process itself, the term ‘transformation class’ is used for the former and ‘transformation’ is used for the latter. A transformation is not limited to be an automatic process. It might as well be a manual process or a semi-automatic process with user interaction, wherever necessary. The transformation provenance of a data item could include metadata like author of the transformation, the user who executed the transformation and the total execution time. Some of the several examples for transformations include SQL statements that are used to create views, workflow descriptions of a workflow management system, and executable (.exe) files with command-line parameters [3]. Another vital part of the provenance model is the world model, which could be either closed or open in nature. In closed world models, the provenance management system controls transformations and data items, whereas, in open world models, the provenance management has limited or no control over the executed transformations and data items. In other words, the execution, manipulation, creation and deletion of data items can be done without any notification. Judging by the way the provenance management system views this; the world has an uncertain behavior. 17 This uncertain behavior makes the process of provenance recording rather complex and at times, even impossible to record accurate provenance information. The closed world and open world models are widely considered the two extreme ends of a spectrum. In fact, several other possible world models can depict representations that can be considered neither closed nor open. A provenance management system should also be able to recognize if data items from two different data repositories represent the same real world object. For example, it is possible to store the same data item in several databases [3]. As real world objects tend to change over time, it is imperative to have mechanisms that make it plausible for checking if two data items are different versions of the same real world object. This is even more important when updates to the repositories are not controlled by the provenance management system. This is because in this case, the information about source data items recorded by the system might be incorrect, as these data items might have been changed or deleted by an update. There are various methods to identify duplicates. 
One method would be to check if the data item and the duplicate represent exactly the same piece of information. This is called value-based duplicate identification. 18 If data items have a key property associated with them, then the most feasible alternative would be to identify duplicates by using their key property. Let us consider the example where the data items in consideration are tuples in a relational database. Two tuples are defined as duplicates if they have the same attribute values or if they hold the same key attribute values. In this case, using the primary key constraints of a relational database for identification could become a problem when no further restrictions are introduced, as the primary key uniqueness is usually confined to one relation and primary keys can be changed by updates. Many data models usually have either an implicit or an explicit hierarchical structure. This hierarchy in combination with a certain key property equivalence or value equivalence, could prove to be instrumental in identifying a given data item. For instance, if the provenance of a certain tag in a given XML-document is recorded, duplicates can be defined by the name of the tag and the position of the tag in the hierarchy of the document [3]. 19 Figure 3: Manipulation and Query Facilities [3] A provenance management system must be able to provide facilities to manipulate and query provenance information and data items, in order to be applicable in a real world scenario. It would be incomplete if manipulation and querying of data items were discussed without integration of provenance information. A provenance management system must also be able to provide mechanisms for merging a selection of individual transformations into one complex transformation and vice-versa. If this is facilitated, then provenance data can be used to recreate result data items, which cannot be accessed or are expensive to access, by executing the transformations that were used to create the result data item. In addition to this, if a provenance management system possesses the capability to compute the inverse of a transformation, then the inversion can be used to recreate source data items from result data items. 20 Split and merge operations can be applied to individual data items as well. It is understood that, the split operation involves dividing a higher-level data item into its lower-level parts whereas the merge operation combines lower-level data items into a higher-level data item. While the understanding of the split operation is reasonably clear, the merge operation possibly raises some questions that need answering in order to make optimal use of it. Some of those questions are: What is the result of the merge operation on a subset of the lower-level data items that form a higher-level data item? How can this result be distinguished from the result of a merge operation on the set as a whole? For a provenance management system that records provenance information for different data models, it is best to provide facilities for converting the representation of data items from a given data model to the other, in order to optimally utilize its ability. Regarding the storage strategy employed by a provenance management system, provenance information either can be attached to the physical representation of a data item or can be stored in a separate data repository. 
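As a simple illustration of these two options, consider the structures below; they are invented for this report and are not taken from [3].

    # Option 1: provenance attached directly to the data item, as an annotation.
    annotated_item = {
        "id": "tuple-42",
        "value": {"name": "Alice"},
        "provenance": {"sources": ["tuple-7"], "transformation": "projection"},
    }

    # Option 2: provenance kept in a separate repository, keyed by the item's identifier.
    data_repository = {"tuple-42": {"name": "Alice"}}
    provenance_repository = {"tuple-42": {"sources": ["tuple-7"], "transformation": "projection"}}

    def provenance_of(item_id):
        # With the separate-repository strategy, data and provenance are joined on demand.
        return provenance_repository.get(item_id)

The first option keeps data and provenance together at the cost of touching the data representation; the second leaves the data untouched but requires an extra lookup whenever provenance is needed.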
It is interesting to note that, a provenance management system can support more than one storage strategy and can also offer mechanisms for changing the storage strategy for data items. Overall, it can be said that the feasibility of implementing the manipulation operations taken into consideration largely depends on the properties of the chosen provenance model and world model. [3] 21 Figure 4: Storage and Recording Model [3] The various techniques a provenance management system uses for the purposes of storing provenance information, recording provenance information and propagating provenance information recorded for source data items are all included in the Storage and Recording Model. Storage strategy explains the relationship between the provenance data and the target data that is to be used for the purpose of provenance recording [3]. No coupling, tight coupling and loose coupling are the three main types of recording strategies that could be used in this model. Any of these three can be adopted, depending on the underlying requirements. 22 The no-coupling strategy involves storing of only provenance information in one or many repositories. The tight-coupling strategy involves the storage of provenance directly associated with the data for which provenance is recorded. The loose-coupling strategy uses a mixture of these two strategies, which means that both provenance and data are stored in one single, but logically separated, storage system. Most of the annotation-based approaches use either tight-coupling or loose-coupling strategy. This can be made to work either by attaching provenance annotations to data items or by storing annotations in the same data repository, but segregated from the corresponding data items. On the other hand, approaches that are service-based in nature, involve recording provenance for several data repositories in a distributed environment. These types of approaches usually deal with a highly heterogeneous environment with limited control over the execution of processes and manipulation of data items [3]. This makes the recording of provenance information quite a herculean task. By using a no-coupling storage strategy in a closed world data model, a certain degree of control could be gained over provenance information. In theory, all possible data models could be used to store provenance information, but not every combination of storage model and storage strategy is reasonable enough, in all kinds of situations. [3] 23 In a scenario where provenance is recorded for a transformation that uses source data items with attached provenance information, it is unclear as to how provenance information is propagated to the result data items. The three possible answers to this question are no-propagation, restricted propagation and complete propagation. With nopropagation, the provenance of source data items of a transformation is ignored when creating provenance data for result data items. Contrary to this, in complete propagation, the result data items of a transformation inherit all provenance data from source data items, according to the kind of source used. With restricted propagation, a result data item inherits a part of provenance from the source data items, i.e., provenance that was generated during the last t transformations [3]. The provenance recording strategy determines the stage at which provenance data is recorded. 
User-controlled recording, Eager recording, No recording and Systemcontrolled recording are the various types of recording strategies taken into consideration. In user-controlled recording, the user decides at what point and for which data item the provenance information is supposed to be recorded. Eager recording involves recording provenance simultaneously along with every transformation on the go. No recording strategy generates provenance at query time, whereas in system-controlled recording, data creation is regulated by the provenance management system. Such a system could use improvised strategies like, recording the provenance data once a day or recording the provenance after every t transformations. [4] 24 Developing a provenance management system for open world models is a rather intriguing problem [3]. A formal model designed with the help of this categorization scheme, can possibly form the basis for a provenance management system that handles not only various storage models, but also different types of source and transformation provenance. Some of the problems faced when dealing with provenance are related to the integration of data. Let us consider a scenario where the concept of semantic identity needed to recognize duplicates or versions of data items in an open world model was thoroughly researched by various publications. A provenance management system, handling different kinds of data items stored in distributed repositories, needs to integrate this data to gain a unified view on the data. This scenario makes it obvious that data integration systems might benefit greatly by including provenance management into their operational scheme of things. Provenance data could also be used to help a user to make an accurate assessment of the overall quality of the integrated data. Thus, a categorization scheme for different types of provenance has been observed. This scheme helps us gain a systematic overview of the capabilities and limitations of these models. 25 This investigation can be extended further to cover the various problems encountered during implementation of the mentioned techniques and to analyze the complexity of various combinations of functionality. With the help of such an investigation, a formal language for the management of provenance data can also be defined. This language should include generation, querying and manipulation of provenance data, as and when required. Unlike existing approaches, this language should cover not only different data models, but also manage different types of provenance information. It will do best to include certain language constructs for converting between different data models and kinds of provenance data. [3] 2.3 Active Conceptual Modeling with Provenance One of the major problems in current data modeling practices is that database design approaches have generally viewed data models as representing only a snapshot of the world and hence suggest overlooking some seemingly minute differences in information as well as the causes and other details of those differences, during the task of data modeling [8]. The solution to this problem lies in ‘Active Conceptual Modeling’. It describes all aspects of the world including its activities and changes under various perspectives, thereby providing a multi-level and multi-perspective view of reality. Active conceptual modeling primarily deals with capturing provenance knowledge in terms of what change the data might undergo, during the stage of conceptual modeling. 
26 Moreover, it takes cue from the W7 Model. Therefore, it is imperative to identify provenance components such as “where”, “when”, “how”, “who”, and “why” behind the “what” to provide sufficient insight into the changes. The W7 model is a generic model of data provenance and is intended to be easily adaptable to represent either domain-specific or application-specific provenance requirements in conceptual modeling. Nowadays, provenance knowledge is indispensable in various applications. It is essentially critical in the domain of homeland security where, given some background intelligence information, provenance regarding the information such as how and when it was collected and by whom, is required to evaluate the quality of the information in order to avoid false intelligence [2]. Consider the homeland security application described in the conceptual schema given in the figure below: Figure 5: A Sample Provenance-based Conceptual Model [2] 27 In today’s times, organizations and/or ordinary citizens are often called upon to report suspicious activities that might possibly indicate terrorist threats. As a successful example, the Pan American Flight School reported that Zacarias Moussaoui seemed overly inquisitive about the operation of the plane’s doors and control panel, which leads to Moussaoui’s subsequent arrest prior to 9/11 [2]. However, there is always a genuine possibility that, intelligence information such as threat reports may be false, out-of-date, and from unreliable sources, which calls for a provenance-based schema in order to eradicate such occurrences. Hence, various provenance events are recorded, such as the creation and transformation of the threat reports at the conceptual level. By doing this, the conceptual schema is made to be ‘active’. It is a relatively straightforward task to capture the “when”, “who”, and “how” aspects associated with data creation based on the semantics specified in the W7 model. The “timestamp” attribute captures when the creation event occurs (see Fig. 9). “Who”, in this case, describes individuals or organizations involved in the event including those who report the threat as well as agents who record the report. The “how” aspect of the event is recorded by instantiating the “method” attribute in the W7 model into two different instances namely, “reporting method” and “recording method”. When data is transformed or updated, information regarding who made a certain change at a particular time is captured. 28 The “input” attribute provides information regarding how a report is updated. It normally records the previous version of the report and may even include more when the report is updated by combining information from other sources. The reason why the information is updated, is also captured by specifying the “goal” attribute. For the reasons mentioned, the task of recording data provenance is imperative in the domain of homeland security as it facilitates the following: - Information quality: To enforce national security, the right people must collect the right information from the right sources to identify the right security threats all in a foolproof manner. Capturing information such as, who reported the threat via what reporting method, assists in evaluating data reliability. Provenance regarding how the report was recorded or who participated in its updation, also helps ensure that the information is trustworthy, before we take any action. 
- Information currency: Some types of intelligence information may have a very short shelf-life. For example, after Saddam Hussein fled Baghdad, information about him being spotted at a specific location changed six to eight times a day. Capturing provenance such as when the report of his being spotted was created and updated helps avoid being misled by old or out-of-date information.

- Pattern recognition: Provenance also helps uncover unusual behavioral patterns, which in turn is extremely helpful for predicting and preventing potential terrorist threats. For example, a sudden increase in the number of threat reports from people in the same region within a short span of time might indicate a terrorist plot. In addition, the "who" part of the provenance information could help in identifying key reliable sources and forestalling unreliable sources from feeding false intelligence. [2]

CHAPTER 3
SOME APPLICATIONS OF PROVENANCE

3.1 Introduction

As mentioned earlier, there is a very close relationship between provenance, security and, consequently, the trustworthiness of data. Data that cannot be shown to be secure is dubious at best, because it cannot be trusted. Provenance helps determine the level of trust that can be bestowed upon the given data in any situation. It thereby indirectly provides security to the data, in the sense that whenever some aspect of the data changes in any manner, those changes are logged and safely stored for future reference. This in turn provides a method to easily identify the modified data and revert the changes, if necessary.

There are several real-world applications where provenance is of utmost importance. For instance, in a cloud storage environment there is uncontrolled movement of data from one place to another. Because of this, determining the origin of the data and keeping track of the modifications it undergoes from time to time becomes highly essential. Ensuring regulatory compliance of data within a cloud environment is also necessary. [25]

Both the National Geographic Society's Genographic Project and the DNA Shoah project track the processing of DNA samples. The participants of these projects, who submit their DNA samples for testing, want strong assurances that unauthorized parties, such as insurance companies or anti-Semitic organizations, will not be able to gain access to the provenance information of the samples. The US Sarbanes-Oxley Act states that the officers of companies that issue incorrect financial statements are subject to imprisonment. Due to this act, officers have become proactive in tracking the path of their financial reports during their development, including the origins of input data and the corresponding authors. The US Health Insurance Portability and Accountability Act likewise mandates the logging of access and change histories for medical records. [24]

However, without appropriate guarantees, as data crosses both application and organizational boundaries and passes through untrusted environments such as a cloud environment, its associated provenance information becomes vulnerable to alteration and cannot be completely trusted. Therefore, the task of securing provenance information, so that its integrity and trustworthiness are preserved, is of high importance. Making provenance records trustworthy is challenging. It is imperative to guarantee completeness, so that all relevant actions performed on a document are captured.
There are a few cross-platform, low overhead architectures that have been proposed for this purpose. These architectures contain a provenance tracking system that tracks all the changes made in the provenance information pertaining to certain data from time to time. 32 They use cryptographic hashes and semantically secure encryption schemes such as the IND-CPA (Indistinguishable under Chosen Plaintext Attack) meaning, in cryptographic terms, that any knowledge of the cipher text and the length of a message cannot reveal any additional information about the plaintext of that message [24]. Such a phenomenon, when applied to provenance information, makes it secure and trustworthy. In this chapter, we discuss some applications in which provenance is used, such as a multi-level secure environment, sensor networks and electronic data transfer. In a multilevel, secure environment, where data is passed across users with multiple levels of security clearance, when one level of users cannot access the data belonging to users with a higher security clearance level, provenance information about that data helps those restricted users to verify that the data they received is authentic. In sensor networks, there is a usually continuous stream of data transfer taking place. In such an area where data transmission is fast and continuous, there is a possibility of data packets being dropped at random, leaving the door open to the malicious packet-dropping attack. In order to detect the occurrence of such an attack, provenance information would prove to be very helpful. On a similar note, provenance can be helpful in electronic data transfer as well. Electronic data does not always contain all the necessary historical information that would help end-users validate the accuracy of that data. Hence, there is a dire need to capture the additional fragments of provenance information, which accounts to the seven W’s concerning that data. 33 3.2 Supporting Information Assurance in a Multi-level Secure Environment 3.2.1 Introduction Multilevel security can be defined as the application of a computer system to process information at different levels of security, thereby permitting simultaneous access by users with different security clearances and preventing any unauthorized users from gaining access to sensitive information. In multi-level secure systems, it is not always possible to pass data source transformation and processing information across various levels of security constraints. A framework that is designed to make this process easier is hereby discussed. This framework captures provenance information in a multi-level secure environment, while ensuring that the data satisfies the basic information assurance attributes like availability, authenticity, confidentiality, integrity and non-repudiation. The amount of trust that can be bestowed upon any system should essentially be based upon a foundation of repeatable measurements [10]. Hence, this framework ensures that data provenance not only supports these information assurance attributes by combining the subjective trust and objective trust in data as a "Figure of Merit" value that can cross security boundaries. The architecture associated with this framework, facilitates the adding of information to an existing message system to provide data provenance. 34 The information can be added with the use of wrappers, to ensure that there is minimal impact on the existing workflow. 
The intention is to describe the original message system and the DP section as two separate pieces. This simplifies the addition and removal of any provenance information. Separating the two components also provides flexibility in implementation. Some existing real world implementations may provide the desired fields by either changing the message format used for a SOA system, or by augmenting an existing SOAbased workflow. This system is designed to work with both peer-to-peer as well as message/workflow services. In a Service Oriented Architecture (SOA), client applications talk directly to the SOA servers and processes communicate using protocols like SOAP or REST. This system has a routing service that supports both explicit destinations and role-based destinations. Moreover, the framework is languageindependent. 3.2.2 Message-Structure Overview This framework assumes that messaging is based on XML-serialization, supported by transport protocols such as SOAP and REST and security standards such as WSS. 35 Figure 6: Envelope Structure with Data Provenance [10] The above figure shows a structure of a message envelope where the DP records are encapsulated inside the Information Assurance Verification Section. According to the convention, a dotted line at the bottom of a block is used to show the relationship between a signature and what it verifies. 36 An alternative approach is to encapsulate DP information outside of the Information Assurance Verification Section and rely on XML stack to do the verification using Outof-band XML Headers. This not only raises portability issues, but also requires access to "raw" and unmodified XML headers if we want to verify authenticity and confidentiality by ourselves, but this in turn poses the threat of a replay attack. Keeping DP records encapsulated inside the Information Assurance Verification Section means that, an extension of the message body class can be created which can handle the data manipulation directly outside of the XML stack, thereby providing greater flexibility. For simplification purposes, a static, message-based system of information collection is used in the architecture, as opposed to a dynamic, actor-based data collection system. 3.2.3 Wrappers In order to minimize the impact on the existing workflow, the use of 'wrappers' and ‘dewrappers’ is advisable. Wrappers add appropriate DP information and 'de-wrappers' that strip the DP information before the message reaches its destination. Instead of adding wrappers and de-wrappers, if the underlying workflow is altered, then the processes may examine the provenance of the data by themselves and use that information in their processing. The use of wrappers and de-wrappers is depicted in the following figure. 37 Figure 7: Adding Wrappers & De-Wrappers to minimize impact on workflow [10] 3.2.4 Messages The message body contains the data that is being transferred. It can be in the form of text, images, etc. Every message has a unique Message-ID value. Attachments can also be considered as messages. Therefore, they also have a unique Message-ID associated with them. All messages have two parts namely, the invariant part, and the variant part. 38 The value of the invariant part will never alter. Precisely speaking, any system that retrieves the invariant part and calculates a one-way hash of that part will always get the same value under all conditions. 
Therefore, any XML encoding system will always provide the serialization of the data in the exact same format, irrespective of the implementation method. The variant part of the message may change. For example, the routing information may change as the message is in the process of being forwarded from one place to another. 3.2.5 Data Provenance The DP system should allow flexible implementations so that multiple SOA systems can exchange data over the SOA enterprise bus. Some systems may act like routers in forwarding messages to the proper workflow recipient. For instance, the sending system may send the information to a role, such as an ‘Analyst’. A workflow system may then decide on the next available analyst, and forward the message to that individual. The DP system should also support the use of gateways, and protocols that encapsulate data. Most importantly, it should support Multi-Level Secure systems and should be able work with encrypted data. DP records can be sent along with the messages and workflow. Alternatively, they can be sent to a service, for a retrieval-based implementation. 39 Consider a single DP record of a message with no attachments. This corresponds to a sender transmitting a message to a receiver. There are two different perspectives of any single message transmission - outgoing and incoming. The outgoing perspective is the intended transport DP characteristics from the perspective of the sender. The incoming is the observed transport properties from the receiver's perspective. The appropriate party signs each perspective. The receiving perspective includes the intended outgoing perspective as well. Therefore, the receiving party signs the DP record from the sending party. This is shown in the figure below: Figure 8: Data Provenance Record [10] 40 The sender’s DP section includes the following pieces of information: • Message-ID - a unique identifier that allows the retrieval of the message to verify the DP record • Outgoing Security Attributes • Timestamp - (Optional) useful for availability and non-repudiation analysis • Owner of Signature - supports signature verification • Hash of Invariant part of message and Hash Algorithm used - allows integrity testing • Security Label - to simplify data classification of a multi-part message It is important to note that the user and/or the application signs DP record. The XML stack does the signature in the XML envelope. Therefore, the certificates and the signing algorithm might differ. The signature for the DP record might be done by a multi-purpose private key, or a dedicated key may be used and it can be associated with an individual or an embedded crypto system in automated systems. [10] When receiving the message, the receiver adds DP information for analysis. The data that is signed by the receiver includes the sender's signed information. The data and functioning of the receiver's DP record is similar to the Sender's DP record for the most part. However, it does not include the Message ID because that is included in the Sender's DP record and need not be repeated. 41 Having the receiver sign the Sender’s DP record provides non-repudiation in case the sender denies sending the message. The hash value and algorithm is included in case the sender and receiver use different hash algorithms (or if the sender does not provide the information). After a message is transmitted (that is, it goes from the sender to receiver), the DP record is completed. 
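The following Python sketch loosely mirrors the record structure described above. The class and field names, and the sign helper (a stand-in for a real public-key signature such as XML-DSig), are illustrative only and are not taken from [10].

    import hashlib, hmac, json
    from dataclasses import dataclass, asdict

    def invariant_hash(invariant_part, algorithm="sha256"):
        # One-way hash of the invariant part of the message, canonically serialized,
        # so every party computes the same value.
        data = json.dumps(invariant_part, sort_keys=True).encode()
        return hashlib.new(algorithm, data).hexdigest()

    def sign(payload, signer_key):
        # Placeholder for the signer's digital signature over the record.
        return hmac.new(signer_key, json.dumps(payload, sort_keys=True).encode(), "sha256").hexdigest()

    @dataclass
    class SenderDP:
        message_id: str
        outgoing_security_attributes: str
        timestamp: str
        signature_owner: str
        hash_algorithm: str
        hash_of_invariant_part: str
        security_label: str

    invariant = {"from": "alice", "to": "bob", "body": "contents of M1"}
    sender_dp = SenderDP("M1", "encrypted", "2013-03-14T10:00:00Z", "alice",
                         "sha256", invariant_hash(invariant), "UNCLASSIFIED")
    sender_signature = sign(asdict(sender_dp), signer_key=b"alice-signing-key")

    # The receiver records the observed transport properties and signs the sender's
    # DP record together with its own observations.
    receiver_dp = {
        "incoming_security_attributes": "encrypted",
        "timestamp": "2013-03-14T10:00:02Z",
        "signature_owner": "bob",
        "hash_algorithm": "sha256",
        "hash_of_invariant_part": invariant_hash(invariant),
        "sender_record": asdict(sender_dp),
        "sender_signature": sender_signature,
    }
    receiver_signature = sign(receiver_dp, signer_key=b"bob-signing-key")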
If a message remains unchanged throughout the transmission, it may not be necessary to retain the routing information when the message is being forwarded. When forwarding an unchanged message, each hop provides a DP record. The following scenario is an example: Alice sends a message titled ‘Ml’ to Bob, who forwards the same to Carol, who in turn forwards it to Dave. In this case, there will be three DP records sent along with message ‘Ml’ to Dave. The following figure depicts the above scenario and gives us a clear idea of what is going on. 42 Figure 9: Multiple DP records with Forwarded Message [10] As mentioned earlier, since the message is unchanged, it is not necessary to retain routing information. Therefore, Dave can accept the message from Carol, but create a DP record with Alice in the Sender's section and delete the other records. An example of a truncated DP record is shown in Figure 10. If the invariant part of the message does not include the intended receiver (i.e.; the To: field), it will be impossible to detect misrouted messages. 43 Figure 10: Truncated DP Record [10] In a multi-level secure environment, there will be cases where a message is transmitted from a higher security level to a lower security level. In such cases, it may be necessary to remove all trace of the original source for security reasons. This can be done by creating a proxy owner. Note that this message could be modified, and the receiver of the message does not know its original source. In addition to this, DP records can be created for plaintext or for encapsulated and encrypted messages. This allows a system to provide non-repudiation that they saw an encrypted message, without revealing the actual message contents to them. The following figure depicts the creation of a proxy owner. 44 Figure 11: DP Record with Proxy Owner [10] In the case of messages that are changed during transmission, routing information must be retained. If a device or person receives a message, and that message is forwarded as an attachment by attaching it as part of another message, then as a result, a new message will be created with a new Message-ID. This message must identify the previous message by its Message-ID so that any DP record associated with the included attachment / message can be found. The following figure shows an example of an attachment being included in a new message. 45 Figure 12: DP Record with an Included Attachment [10] As the above figure depicts, the workflow is: Alice sends message ‘Ml’ to Bob, Bob creates a new message ‘M2’’. He then attaches ‘Ml’ to it and sends it to Carol. This architecture also supports DP analysis of complex messages. For example, Alice sends a message M1 to Bob, who adds a note to M1, and sends it, with the original message M1 as an attachment, to Carol who forwards it to Dave. 46 In this scenario, Dave's system must be capable of performing a DP analysis of the entire workflow. DP analysis is a complex task. Therefore, it will require system level support to address issues like: • Expired or Compromised Certificates • Unavailability of Messages (Message not archived or inaccessible due to security classification) • Modification of DP Records. Additionally, if the implementation has a message storage system, storage-based retrieval can be done, where messages can be retrieved whenever needed, instead of being sent in the workflow. In such cases, DP records associated with each message can be sent along with the message that is retrieved. 
The message storage system simply becomes another entity in the workflow, and creates DP records for each message received and transmitted. [10] 3.2.6 Addressing Information Assurance Attributes This architecture provides objective measurements. That is, if two individuals evaluate two different pieces of information that have identical means of transmission and workflow, they will end up with identical values. It is possible to assign confidence in underlying technology and algorithms used. For example, one hash algorithm may be superior to another. Therefore, if two people have the same confidence in the algorithmic strength of a hash algorithm, they will get similar degrees of trust. 47 Let us briefly consider how DP information can be used towards each of the Information Assurance attributes. These attributes can be used to address various types of attacks. Since the sender signs each DP record, authenticity and integrity can be determined as follows: • Examine the DP record created by the sender of the message, and get the hash algorithm and the hash value of the message associated with the DP record. • Calculate the hash of the message using the algorithm. If the hash values do not agree, the integrity of the message can be suspected. • Assure the authenticity of the DP record by verifying the signature. In the event that the DP record from the sender uses an inferior hash algorithm (MD5 or SHA-1 vs. SHA256), additional DP records in the chain of transfer can be examined, providing additional attributes used in the calculation of authenticity using subjective values. While confidentiality is difficult to prove, one can verify that a message was intended to be confidential by ensuring that it was encrypted before being sent or attached. If at any point the message was sent unencrypted, then there may have been an unintentional exposure of confidential information. It may also indicate an implementation error if the sending and receiving characteristics from the perception of the sender and receiver do not agree. 48 The DP records directly provide non-repudiation as they are all signed. However, someone can send a message, and then claim that their key was compromised. It is possible to examine the sequence of events to determine if the retraction occurred while the key was considered secure or not. The timestamp is useful in dealing with compromised and expired certificates. Availability is a system level property but our architecture can be used to detect some attacks on availability by comparing actual transmission time with historical transmission time. This works with both reliable transmission models like TCP and unreliable transmission models like UDP. Another way to detect attacks is to rely on report numbers to detect 'skipped' reports. Additional information facilitates more detailed availability analysis. For example, if a receiver includes the number of retries needed to receive a message in the DP information then DP analysis can use this information, and past knowledge, to identify availability issues. The DP analysis can also detect some replay attacks. If a message is intercepted and retransmitted later then the timestamp may be useful in detecting it. If the receiver or DP analysis does keep track of messages, and can detect messages that have arrived earlier, then it becomes easier to detect a true replay attack. 49 3.2.7 Example Analysis Data Provenance can be performed on complex messages. 
Consider the following scenario: • Alice creates message M1 (in this case, an image) and sends it to Bob. • Bob creates message M2 with his analysis of M1, attaches M1, and sends it to Carol. • Carol uses the information in M2 and M1 to create another message M3 (in this case, a report), and forwards it to Dave. The transmission includes messages M3, and the referred messages M2 and M1. Dave receives the message, and fills in his section of the DP records. The following table summarizes Dave’s DP records. Table 3: DP Records for Example Scenario DP Record of M3 from Carol to Dave DP Record of M2 from Carol to Dave DP Record of M2 from Bob to Carol DP Record of M1 from Carol to Dave DP Record of M1 from Bob to Carol DP Record of M1 from Alice to Bob If the DP analysis was concerned about the authenticity of the message, the hash algorithm used by the message creator (Alice) can be applied to each of the messages, 50 and compared to the value in the DP record. If there is a discrepancy, the system can examine each DP record associated with that message. If there was a discrepancy in message M1, there is enough information to determine when the discrepancy occurred. For example, if the hash value when Carol received M1 from Bob differed from the value when Carol sent M1 to Dave, then message M1 changed while Carol was examining it. If the hash value when Carol sent M1 to Dave is different from the value when Dave received it, then that means M1 was modified in transit between Carol and Dave. There may be additional records from systems that forward messages that can provide additional non-repudiation. The prospect of including subjective information also raises a few concerns. Subjective information constitutes personal opinions. That is, Alice may trust Carol more than Dave based on experience, rumors, and hunches. One approach for supporting the inclusion of subjective information in this architecture is to provide the ability where users can enter their subjective values for various entities. The architecture can then propagate this information and make it available for inspection and analysis. A person can analyze the data and form their opinion. Some of the relevant questions are: How confident are we that the document creator actually has the key; and nobody else has the key? If a certificate authority signs the certificate, how confident are we in the certificate authority? How confident are we in the process used for granting a signature? 51 As mentioned earlier, in a Multi-Level Secure environment, data source and processing information cannot always be passed across the different levels of security boundaries. To overcome this restriction, the best approach is to take the objective and subjective information that is captured as part of DP information and combine it to generate ‘Figure of Merit’ values that accurately depict the amount of trust can be put in the data. While the notion of trust has been widely studied, its definition and usage varies across various domains and application areas. As the data crosses a security domain, if the DP information cannot be passed on, then this DP information is replaced by Figure of Merit values. The Figure of Merit values are calculated from both objective and subjective components. The objective values (as calculated at time t) are independent of the analyst and can be incorporated automatically in the determination of Figure of Merit. 
Subjective values (as calculated at time t) depend on the analyst doing the analysis and may depend on the context of the workflow. Note that both objective and subjective values specified by an analyst may evolve over time. [10] The values associated with the Information Assurance attributes such as authenticity, confidentiality, integrity, non-repudiation and availability are objective values, while the confidence in Certification Authorities that provide certificates involved in the workflow can have objective as well as subjective components. The confidence in the components of the workflow is subjective. 52 A number of different approaches can be taken for generating the Figure of Merit value from the DP information and which approach is selected depends on the type of DP analysis being done as well as personal preferences. Figure of Merit provides summarization, so it along with the notion of wrappers and the ability to send DP records on a separate channel help in making systems scalable. 3.2.8 Monitoring and Analyzing Workflows Let us consider some issues around monitoring and analyzing Information Assurance attributes of a workflow in real-time. Both the intrinsic Information Assurance attributes of a workflow as well as its actual execution in the deployed system are to be considered. The workflow, as designed, provides a baseline for entitlement, i.e., it provides the maximum value that can be achieved in the deployed system. For example, if the workflow is structured to send data in clear-text, then confidentiality is not ensured, and thus, the deployed system will not guarantee confidentiality. On the other hand, if the workflow as designed, is structured to encrypt the data, then the baseline entitlement guarantees confidentiality. This DP Architecture allows the calculating, displaying of values for Information Assurance attributes for each message, and its transmission as the workflow is executed. The dashboard architecture displays the values for the Information Assurance attributes. 53 It can also be used to do What-If Analysis by incorporating subjective trust values. Moreover, it can be used to do after the fact analysis to understand vulnerabilities and determine the impact of compromised assets. The following figure illustrates a dashboard that allows an analyst to monitor as well as analyze the execution of a representative workflow. Figure 13: Dashboard for Monitoring and Analyzing Workflows [10] 54 To summarize, an architectural framework has been provided for incorporating Data Provenance to support Information Assurance attributes in a multi-level secure environment. This architecture allows us to capture objective trust attributes that can be combined with subjective trust data to calculate a "Figure of Merit" value. The exact formulae used in computing these values are user-definable and may depend on the scenario and type of analysis being done. This Figure of Merit value can be used to hide the detailed Data Provenance information so that it may cross security boundaries in a multi-level secure environment. 3.3 Provenance Mechanism to tackle Packet Dropping in Sensor Networks 3.3.1 Introduction Provenance provides the assurance of data trustworthiness, which is highly desired to guarantee accurate decisions in mission critical applications, such as Supervisory Control and Data Acquisition (SCADA) systems. 
The 2009 report published by the Institute for Information Infrastructure Protection (I3P) on National Cyber Security Research and Development Challenges, in which the research initiatives on efficient implementation of provenance in real-time systems are highly recommended, also emphasizes on the importance of provenance for streaming data. However, existing research on provenance has mainly focused on the tasks of modeling, collection, and querying, leaving the aspects of trustworthiness and security issues relatively unexplored. 55 Here, a framework is investigated that is proposed to transmit provenance information along with sensor data, hiding it over inter-packet delays (the delays between the transmissions of sensor data items). The embedding of provenance information within a host medium makes the technique reminiscent of digital watermarking, wherein information is embedded into a digital signal, which may be used to verify its authenticity or the identity of its owners. The reason behind adopting watermarking based scheme rather than one that is based on traditional security solutions like cryptography and digital signature is discussed later. Moreover, a suitable justification is provided for the design choices of using inter-packet delays (IPD) as the watermark carrier, employing a direct-sequence spread spectrum based technique to support multi-user communication over the same medium. The proliferation of internet, embedded systems and sensor networks has greatly contributed to the wide development of streaming applications. Examples include, realtime location-based services, sensor networks monitoring environmental characteristics, controlling automated systems, power grids etc. The data that drives these systems is produced by a variety of sources, ranging from individual sensors up to very different systems altogether, processed by multiple intermediate agents. This diversity of data sources accelerates the importance of data provenance to ensure secure and predictable operation of data-streaming applications like sensor networks. 56 Malicious Packet-Dropping attack is a major security threat to the data traffic in sensor networks, since it reduces the overall network throughput and may hinder the propagation of sensitive data. In this attack, a malicious node drops packets at random during transmission, to prevent their further propagation. It selectively drops packets and forwards the remaining data traffic. Due to this nature, this attack is also called as ‘Selective Forwarding Attack’. [12] Dealing with this attack is challenging, to say the least, since there can be a variety of other reasons that cause data or packet-loss in such systems. To name a few, the unreliable wireless communication feature and the inherent resource-constraints of sensor networks may also be the possible causes for communication failure and data loss. Moreover, transient network congestion can also result in packet-loss. Power-scarcity can also make a node unavailable and there may be a communication failure due to physical damage as well. Thereby, these possibilities raise a false alarm and mislead us to an incorrect decision regarding the presence of such a malicious attack. The mass deployment of tiny sensors, often in unattended and hostile environments makes them susceptible to such attacks. This attack can result in a significant loss of sensitive data and can degrade legitimate network throughput. 57 One approach to defend a malicious packet dropping attack is Multipath Routing. 
However, multipath routing suffers from several drawbacks such as, high communication overhead with an increase in the number of paths, inability to identify the malicious node etc. Traditional transport layer protocols also fail to guarantee that packets are not maliciously dropped in sensor networks. They are not designed to deal with these kinds of malicious attacks. A data-provenance based mechanism is hereby proposed, to detect the presence of such an attack and identify the source of the attack i.e. the malicious node. As mentioned earlier, this scheme utilizes inter-packet delay characteristics for embedding provenance information. This scheme consists of three phases: 1) Packet Loss Detection 2) Identification of Attack Presence 3) Localizing the Malicious Node/Link The packet-loss is detected based on the distribution of inter-packet delays. The presence of an attack is determined by comparing the empirical average packet-loss rate with the natural packet-loss rate of the data flow path. To isolate the malicious link, more provenance information is transmitted along with the sensor data. [12] 58 We have two goals to accomplish – 1) To transfer provenance along with the sensor-data in a bandwidth efficient manner, while ensuring that the quality of the data remains intact. 2) To detect the packet dropping attack and thereby identify the malicious node using the provenance transmission technique. To accomplish these goals, a unique strategy is proposed, where provenance is securely embedded as a list of unordered nodes over the same medium. This is a key design choice in this proposed scheme for secure provenance transmission. 3.3.2. Overview of the Scheme: Consider a typical deployment of wireless sensor networks, consisting of a large number of nodes. Sensor nodes are stationary after deployment, though routing paths may change due to node failure, resource optimization, etc. The routing infrastructure is assumed to have a certain lower bound on the time before the routing paths change in the network. The network is modeled as a graph G (N, E) where N is the set of nodes in the network and E is the set of edges between the nodes. There exists a powerful base station (BS) that acts as sink/root and connects the network to the outside infrastructure such as Internet. All nodes form a tree rooted at the BS and report the tree topology to BS once after the deployment or any change in topology. 59 Since the network does not change so frequently, such a communication will not incur significant overhead. Sensory data from the children are aggregated at cluster-head a.k.a. Aggregator and routed to the applications through the routing tree rooted at the BS. It is assumed that the BS cannot be compromised and it has a secure mechanism to broadcast authentic messages into the network. Each sensor has a unique identifier and shares a secret key Ki with the BS. Each node is also assigned a Pseudo Noise (PN) sequence of fixed length Lp which acts as the provenance information for that node. The sensor network allows multiple distinguishable data flows where source nodes generate data periodically. A node may also receive data from other nodes to forward towards the BS. While transmitting, a node may send the sensed data, may pass an aggregated data item computed from multiple sensors’ readings or act as an intermediate routing node. Each data packet in the transmission stream contains an attribute value and provenance for that particular attribute. 
The data packet is also timestamped by the source with the generation time. As will be seen later, the packet timestamp is crucial for the provenance embedding and decoding process, and hence a Message Authentication Code (MAC) is used to maintain its integrity and authenticity. The MAC is computed over the data value and the timestamp to ensure the same properties for the data. The provenance of a data item, in this context, includes information about its origin and how it is transmitted to the BS.

The notion of provenance is formally defined as follows. The provenance pd for a data item d is a rooted tree satisfying the following properties:
(1) pd is a subgraph of the sensor network G (N, E)
(2) The root node of pd is the BS, expressed as nb
(3) For ni, nj in N and included in pd, ni is a child of nj if and only if ni participated in the distributed calculation of d and/or passed data information to nj [12]

Figure 14 shows two different kinds of provenance. In 14(a), a data item d is generated at leaf node n1 and the internal nodes simply pass it to the BS. Such internal nodes are called ‘Simple Nodes’ and this kind of provenance is ‘Simple Provenance’. Simple provenance can be represented as a simple path. In 14(b), the internal node n1 generates the data d by aggregating data d1, ..., d4 from nodes nl1, ..., nl4 and passes d towards the BS. Here, n1 is an ‘Aggregator’ and the provenance is called ‘Aggregate Provenance’, which is represented as a tree.

Figure 14: Provenance Examples for a Sensor Network [13]

One key design choice in this proposed scheme is to utilize the Spread Spectrum Watermarking technique to pass on the provenance of multiple sensor nodes over the medium. Spread Spectrum Watermarking is a transmission technique where a narrowband data signal is spread over a much larger bandwidth so that the signal energy in any single frequency is undetectable. In the context of this scheme, the set of IPDs is considered as the communication channel and the provenance information is the signal transmitted through it.

Provenance information is spread over many IPDs such that the amount of information present in any one container is small. Consequently, an unauthorized party would need to add very high amplitude noise to all of the containers to destroy the provenance. Thus, the use of the spread spectrum technique for watermarking provides strong security against different attacks. In specific terms, the ‘Direct Sequence’ Spread Spectrum (DSSS) technique is used. “DSSS phase-modulates a sine wave pseudo-randomly with a continuous string of pseudo-noise (PN) code symbols called ‘chips’, each of which has a much shorter duration than an information bit. That is, each information bit is modulated by a sequence of much faster chips. Therefore, the chip rate is much higher than the information signal bit rate.” [15]

DSSS uses a signal structure in which the sequence of PN symbols produced by the transmitter is already known by the receiver. The receiver can use the same PN sequence to reconstruct the information signal. This noise signal in a DSSS transmission is a pseudorandom sequence of 1 and −1 values, at a frequency much higher than that of the original signal. The resulting signal resembles white noise. However, this noise-like signal can be used to exactly reconstruct the original data at the receiving end, by multiplying it by the same pseudorandom sequence (because 1 × 1 = 1, and −1 × −1 = 1). This process is called “de-spreading”.
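The following toy run, a sketch under assumed parameters (8 chips per bit, Gaussian channel noise), illustrates the spreading and de-spreading just described: correlating with the correct PN sequence recovers each data bit, while a wrong sequence does not.

    import random

    random.seed(1)
    Nc = 8                                              # chips per information bit (PN length)
    pn = [random.choice((+1, -1)) for _ in range(Nc)]   # PN sequence shared by sender and receiver

    def spread(bits):
        """Each data bit (+1/-1) is multiplied by the whole chip sequence."""
        return [b * c for b in bits for c in pn]

    def despread(signal, pn_seq):
        """Correlate each Nc-chip block with the PN sequence and take the sign."""
        bits = []
        for i in range(0, len(signal), Nc):
            corr = sum(s * c for s, c in zip(signal[i:i + Nc], pn_seq)) / Nc
            bits.append(+1 if corr > 0 else -1)
        return bits

    data = [+1, -1, +1]
    received = [chip + random.gauss(0, 0.5) for chip in spread(data)]   # r(t) = s(t) + n(t)
    print(despread(received, pn))                       # recovers [+1, -1, +1]
    wrong_pn = [random.choice((+1, -1)) for _ in range(Nc)]
    print(despread(received, wrong_pn))                 # usually garbage without the right PN sequence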
De-spreading involves a mathematical correlation of the transmitted PN sequence with the PN sequence that the receiver believes the transmitter is using. The components of a DSSS system are:
• The original data signal d(t), as a series of +1 and -1 values.
• A PN sequence px(t), encoded like the data signal. Nc is the number of chips per information bit and is called the PN length.

Spreading: The transmitter multiplies the data with the PN code to produce the spread signal, s(t) = d(t) px(t).

Despreading: The received signal r(t) is a combination of the transmitted signal and noise in the communication channel. Thus r(t) = s(t) + n(t), where n(t) is white Gaussian noise. To retrieve the original signal, the correlation between r(t) and the PN sequence pr(t) at the receiver is computed using the following formula:

R(τ) = (1 / Nc) Σ_{t=T}^{T+Nc} r(t) pr(t + τ)

Now, if px(t) = pr(t) and τ = 0, i.e. px(t) is synchronized with pr(t), then the original signal can be retrieved. Otherwise, the data signal cannot be retrieved. Therefore, a receiver that does not have the PN sequence of the transmitter cannot reproduce the originally transmitted data. This fact is the basis for allowing multiple transmitters to share a channel. In the case of multiuser communication in DSSS, the spread signals produced by multiple users are summed up and transmitted over the channel. Multiuser communication introduces noise to the signal of interest and interferes with the desired signal in proportion to the number of users.

3.3.3 Secure Provenance Transmission Mechanism

In the paper Secure Provenance Transmission for Streaming Data by S. Sultana, M. Shehab and E. Bertino [13], a novel approach is proposed for securely transmitting the provenance of sensor data. In this mechanism, provenance is watermarked over the delay between consecutive sensor data items.

Figure 15: Provenance Encoding at Sensor Node and Decoding at Base Station [13]

A set of (Lp + 1) data packets is used to embed provenance over the IPDs. Thus, the sequence of Lp IPDs, DS = {Δ[1], Δ[2], ..., Δ[Lp]}, is the medium where the provenance is hidden. Δ[j] represents the inter-packet delay between the j-th and (j+1)-th data items. The process also uses the secret key Ki (1 ≤ i ≤ n, where n is the number of nodes in the network), a locally generated random number αi (known as the impact factor) and the provenance information pni. αi is a random real number generated according to a normal distribution with parameters (μ, σ); μ and σ are pre-determined and known to the BS and all the nodes. The PN sequence consists of a sequence of +1's and -1's and is characterized by a zero mean. [13]

The provenance encoding process at a node ni is summarized as follows:

Step E1 – Generation of Delay Perturbations: Using the provenance information pni and the impact factor αi, the node generates a set of delay perturbations, Vi = {vi[1], vi[2], ..., vi[Lp]}, as a sequence of real numbers:

Vi = αi × pni = {(αi × pni[1]), ..., (αi × pni[Lp])}

Here, vi[j] corresponds to the provenance bit pni[j].

Step E2 – Bit Selection: On the arrival of the (j+1)-th data packet, the node records the IPD Δ[j] and assigns a delay perturbation vi[kj] ∈ Vi to it. The index kj is selected using the secret Ki and the packet timestamp ts[j+1] as follows:

kj = H(Ki || H(ts[j+1] || Ki)) mod Lp

Here, H is a lightweight, secure hash function and || is the concatenation operator.

Step E3 – Provenance Embedding: IPD Δ[j] is increased by vi[kj] time units. As vi[kj] corresponds to provenance bit pni[kj], a provenance bit is embedded over an IPD.
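Steps E1-E3 can be summarized in a few lines of Python. The byte encodings, the choice of SHA-256 for the lightweight hash H, and the time units are assumptions; only the overall structure (perturbation generation, keyed bit selection, additive embedding) follows the description above.

    import hashlib, random

    def encode_provenance(ipds, pn, key, timestamps, mu=2.0, sigma=0.5):
        """Embed a node's PN sequence over a window of Lp inter-packet delays.

        ipds:       the Lp inter-packet delays Delta[1..Lp]
        pn:         the node's +1/-1 PN sequence of length Lp
        key:        secret K_i shared with the base station (bytes)
        timestamps: packet timestamps; ts[j+1] drives the bit selection for Delta[j]
        """
        Lp = len(pn)
        alpha = random.gauss(mu, sigma)                      # E1: impact factor
        v = [alpha * chip for chip in pn]                    # E1: delay perturbations V_i
        watermarked = []
        for j, delta in enumerate(ipds):                     # E2: keyed selection of index k_j
            inner = hashlib.sha256(str(timestamps[j + 1]).encode() + key).digest()
            k_j = int.from_bytes(hashlib.sha256(key + inner).digest(), "big") % Lp
            watermarked.append(delta + v[k_j])               # E3: embed one provenance bit per IPD
        return watermarked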
Provenance bits are watermarked over IPDs by manipulating them with corresponding delay perturbations, termed as ‘watermark delays’. This way, 𝒟𝒮 is transformed into the watermarked version 𝒟𝒮𝑤. The sensor dataset is thereby transmitted towards the BS while reflecting the watermarked IPDs. Throughout the propagation, each intermediate node watermarks its provenance as well as increasing the IPDs. Data packets may also experience different propagation delays or attacks aimed at destroying the provenance information. At the end, the BS receives the dataset along with watermarked IPDs 𝒟𝒮𝑤, which can be interpreted as the sum of delays imposed by the intermediate nodes, attackers and difference between consecutive propagation delays along the data path. Thus, 𝒟𝒮𝑤 represents the DSSS encoded signal in this context. The provenance retrieval process at the BS approximates the provenance from this DSSS signal based on an optimal threshold 𝑇∗. The threshold, corresponding to the network diameter and PN length, is calculated once after the deployment of the network. 67 For retrieval purposes, the BS also requires the set of secret keys {𝐾1, 𝐾2... 𝐾𝑛} and PN sequences {pn1, pn2... pnn}. The retrieval process at the BS follows two steps: Step R1 - Bit Selection: The IPDs for the incoming packets are recorded at the BS. For each node, the IPDs are reordered according to the algorithm used in E2, which produces a node specific sequence CSi. Step R2 - Threshold based Decoding: For any node ni, the BS computes the cross correlation Ri between CSi and provenance pni and takes decision on whether pni was embedded by a comparison of Ri with threshold T*. As the BS does not know which nodes participated in the data flow, it performs the Bit selection and Threshold Comparison for all nodes. Based on the threshold comparison result, it deduces the participation of nodes in a data flow. 3.3.4 Packet-Dropping Adversary Identification Scheme This scheme is developed using the provenance transmission technique. In this scheme, the malicious packet dropping attack is detected and the malicious node/link is identified. This scheme relies upon the distribution of provenance embedded inter-packet delays and consists of the following phases: Packet Loss Detection Identification of Attack Presence Localizing the Malicious Node/Link 68 The BS initiates the process for each sensor data flow by leading the packet loss identification mechanism based on the inter-packet delays of the flow and the extracted provenance. As mentioned earlier, since there can be various reasons for packet-loss, the BS waits until a sufficient number of packet losses occur and then calculates the average packet loss rate. [12] A comparison of this loss rate with the natural packet loss rate of the path confirms the event of malicious packet dropping. If a packet drop attack is suspected, the BS signals the data source and the intermediate nodes to start the mechanism for pinpointing the malicious node/link. The details of the mechanism are presented below: A. Packet Loss Detection Both the provenance information and the IPDs are used to detect a packet loss. The BS can observe the data flows transmitted by all sensor nodes and obtain their timing characteristics. Since provenance is embedded over the IPDs, the watermarked IPDs follow a different distribution than the regular IPDs. After the receipt of a few initial group of (Lp+ 1) packets from a data flow, the BS can approximate the distribution of the watermarked IPDs for that flow. 
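One simple way to operationalize that approximation, assuming the watermarked IPDs are roughly normally distributed, is to estimate their mean and standard deviation from the initial packets and flag later IPDs that fall far outside that range; the three-standard-deviation cutoff below is an arbitrary illustrative choice.

    from statistics import mean, stdev

    def fit_ipd_model(initial_ipds):
        """Estimate the watermarked-IPD distribution from the first few groups of packets."""
        return mean(initial_ipds), stdev(initial_ipds)

    def looks_like_packet_loss(ipd, model, k=3.0):
        """An IPD far beyond the fitted distribution suggests a dropped packet upstream."""
        mu, sigma = model
        return ipd > mu + k * sigma

    model = fit_ipd_model([1.8, 2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 2.1])
    print(looks_like_packet_loss(2.2, model))   # False: consistent with the fitted distribution
    print(looks_like_packet_loss(4.6, model))   # True: gap large enough to indicate a loss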
Afterwards, the BS analyzes each IPD to check whether it follows the estimated distribution. If the data is dropped by a node that must necessarily be traversed to reach the BS, the attack ends the journey of the data. In this case, the IPD observed by the BS will be large enough to fall outside the distribution and be detected as a packet loss. For packets containing sequence numbers, any out-of-order packet can confirm the detection of a packet loss. On the other hand, if the data packet is dropped by an intermediate router within a cluster, the drop cannot interfere with the data from other nodes that are aggregated at the cluster head and transmitted towards the BS. [12] Consequently, the IPD-based check will not be effective in such attack scenarios. Fortunately, in this case, the provenance retrieved at the BS does not include the simple path containing the malicious node from the source to the aggregator. Thus, the dissimilarity of this provenance with the provenance of earlier rounds exposes the packet loss.

B. Identification of Attack Presence

For this task, the BS collects Ga groups of (Lp+1) packets, where Ga > 0 is a real number. Assume the BS identifies m packet losses. The average packet loss rate Lavg can then be calculated as

Lavg = m / [Ga * (Lp + 1) + m]

The natural packet loss rate is calculated by

Ln = Σ_{i=1}^{h} ρi Π_{j=1}^{i-1} (1 − ρj)

Here, h is the number of hops in the path and ρi is the natural loss rate of the link between nodes ni and ni+1. Comparing Lavg with Ln, the BS confirms a packet dropping attack: if Lavg > Ln, a malicious node surely exists in the flow path. [12]

C. Localizing the Malicious Node/Link

For identifying the malicious link, more information is needed in the provenance than just the node ID. The data payload is used to carry this additional provenance data. Whenever the BS detects the attack, it notifies the source and intermediate nodes in the path. While forwarding the data packet, each node adds information including the hash of the data, the timestamp, etc. of the last data packet it received through this path. Hence the format of a data packet at a node ni is

mt = <data || timestamp || Pi>, with Pi = {ni || H(mt-1) || Pi-1}Ki

Here, mt represents the current data packet and mt-1 is the most recent packet before the current one. Pi denotes the provenance report at node ni. H is a collision-resistant hash function and {D}k denotes a message D authenticated by a secret key k, using a message authentication code (MAC). The data and timestamp are authenticated using a MAC to ensure integrity. Upon receiving a data packet containing the hash chain of provenance, the BS can sequentially verify each provenance report embedded in it. Assume the flow path, i.e. the data provenance, is {n1, n2, ..., ni-1, ni, BS}, where n1 is the source node and BS is the base station. The link between the nodes ni and ni+1 is represented as li. For some position i along the path, if the provenance report from each intermediate node nj with j in [1, i] contains the recent values for the timestamp and the hash of the data, but the provenance report from ni+1 contains the older values, then the BS identifies the link li as faulty. After observing Gl groups of (Lp+1) packets, the BS calculates the average packet loss rate of all the links of the path; Gl is also a real number greater than 0. If the loss rate of a link li is significantly higher than the natural packet loss rate ρi, then the BS labels the link as malicious.
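The two loss rates above are straightforward to compute; a small numeric sketch follows, with made-up per-link natural loss rates. Whether a link's empirical rate is "significantly" higher is then decided against a per-link threshold, which the text introduces next as τ.

    def average_loss_rate(m, Ga, Lp):
        """Empirical loss rate after Ga groups of (Lp+1) packets with m observed losses."""
        return m / (Ga * (Lp + 1) + m)

    def natural_loss_rate(rho):
        """Ln = sum over hops i of rho_i * product over j < i of (1 - rho_j)."""
        ln, survive = 0.0, 1.0
        for r in rho:                      # rho[i] is the natural loss rate of link i
            ln += r * survive
            survive *= (1 - r)
        return ln

    rho = [0.01, 0.02, 0.015]              # illustrative per-link natural loss rates
    l_avg = average_loss_rate(m=9, Ga=5, Lp=16)
    print(l_avg > natural_loss_rate(rho))  # True here, so a packet-dropping attack is suspected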
To quantify the term significantly, we introduce a per-link drop rate threshold, denoted by 𝜏, where 𝜏 > 𝜌i. If the empirical packet loss rate of a link li is greater than 𝜏, then li is termed as a malicious link. In this way, malicious packet dropping attack in sensor networks can be tackled using data provenance. [12] 3.4 Provenance of Electronic Data Transfer 3.4.1 Introduction The IT landscape is evolving as illustrated by applications that are open, dynamically composed, and that discover results and services on the fly. Against this vastly growing and challenging background, it is crucial for users to be able to have confidence in the results produced by such applications. 72 If the provenance of data produced by computer systems could be determined, then users, in their daily applications, would be able to interpret and judge the quality of data better. In this attempt, a provenance lifecycle is introduced and an open approach is proposed based on some key principles underpinning existing provenance systems in computer systems. To accomplish this vision, computer applications need to be transformed into ‘Provenance-aware applications’, for which the provenance of data may be retrieved, analyzed and reasoned over. 3.4.2 Lifecycle of Provenance in Computer Systems Both the scientific and business communities have adopted a predominantly serviceoriented architectural (SOA) style of implementation, which allows services to be discovered and composed dynamically. Now, while SOA-based applications are more dynamic and open in their functionality, they must however satisfy new, evolving requirements both in Business as well as in E-science. Ideally speaking, end users of E-science applications would be able to perform tasks like reproducing their results by replaying previous computations, understanding why two seemingly identical runs with the same inputs produce different results, and determining which data sets, algorithms, or services were involved in the derivation of their results. Some users, reviewers, auditors, or even regulators need to verify that the process that led to some result is compliant with specific regulations or methodologies that ought to be followed. [16] 73 This is usually done because, it needs to be proven that the results obtained were derived independently from services or databases with proper license restrictions. In addition to this, they must also establish that data was captured at the source by using some highly precise and technically accurate instruments. However, they either cannot do so or can do it only imperfectly or incompletely at best, because the underpinning principles have not yet been thoroughly investigated and systems have not yet been defined to cater such specific requirements. One key observation in this regard is that electronic data does not always contain the necessary historical information that would help end-users, reviewers, or regulators make the necessary verifications in order to accurately validate the data. Hence, there is a dire need to capture those additional fragments of missing information that describes what actually occurred at execution time, in an exact manner. Such extra information can be named as ‘Process Documentation’. 74 Figure 16: Provenance Lifecycle [16] As shown in the above figure, provenance-aware applications create process documentation and store it in a ‘provenance store’ database, the role of which is to offer a long-term persistent, secure storage of process documentation. 
It is a logical role, which accommodates various physical deployments. For instance, a provenance store can be a single, autonomous service or, to be more scalable, it can be a collection of distributed stores. After the process documentation is recorded, the provenance of resultant data can be retrieved by querying the provenance store, and can be analyzed according to the user’s needs. 75 Over time, the provenance store and its contents may need to be managed or curated. Hence, it can be summarized that, the provenance life cycle consists of four different phases: creating, recording, querying and managing. [16] 3.4.3 Open Model for Process Documentation For many applications, process documentation cannot be generated in one single iteration. Its generation must be in conjuncture with the execution. This is why, it is imperative to differentiate a specific item involved in documenting a specific part of a process, from the whole process of comprehensive documentation. The former, referred to as a p-assertion, is seen as an assertion made by an individual service involved in the process. Thus, the documentation of a process consists of a set of p-assertions made solely by the services involved in the process. In order to minimize its impact on the performance of the application, documentation needs to be structured in such a way that it can be constructed and recorded autonomously by services. If this is not ensured, synchronizations will be required between these services to agree upon how and where to document execution. This will ultimately result in a huge loss in performance of the application. So, since it is highly essential to satisfy this design requirement, various kinds of p-assertions have been identified, which applications are expected to adopt in order to document their execution. Figure 17 below illustrates a computational service sending and receiving messages, and creating p-assertions describing its involvement in such activity. 76 Figure 17: Categories of P-assertions Made by Computational Services [16] Interaction P-assertions: In SOAs, interactions mainly occur via messages that are exchanged between numerous services. By thoroughly capturing all these interactions, execution can be analyzed, its validity can be verified, and it can be compared with other executions as well. In order to facilitate this, process documentation often includes interaction p-assertions, where an interaction p-assertion is a description of the contents of a message by a service that has sent or received that message. 77 Relationship P-assertions: Irrespective of whether a service returns a result directly or calls other services for that purpose, the relationship between its inputs and outputs, cannot be explicitly represented in the messages themselves. It can only be understood by performing a comprehensive analysis of the business logic behind the concerned services. To promote flexibility and generality, no assumption is made about the technology used by services to implement their business logic, such as source code, workflow language, etc. Instead, a requirement is placed on services to provide some information, in the form of relationship p-assertions. A relationship p-assertion is a description, asserted by a service, of how it obtained output data sent in an interaction by applying some function, or algorithm, to input data from other interactions. In Figure 17, output message M3 was obtained by applying function f1 to input M1. 
[16] Service-state P-assertions: Apart from the data flow in a given process, the internal service states may also be necessary in order to understand non-functional characteristics of execution, such as the performance or accuracy of services. 78 Thereby the nature of the result they compute can also be determined. Hence, a service state p-assertion may be defined as documentation provided by a service about its internal state in the context of a specific interaction. Service state p-assertions can be extremely varied in nature. They can include the amount of disk and CPU time a service used in a computation, its local time when an action occurred, the floating-point precision of the results it produced, or application-specific state descriptions. [16] Provenance-aware applications need to be able to work together despite their innate diversities. For this reason, it is important that the process documentation they all produce is structured according to a shared data model. The novelty of this approach is the openness of the proposed model of documentation, which is conceived to be independent of application technologies [17]. Taken together, these characteristics allow process documentation to be produced autonomously by application services, and be expressed in an open format, over which provenance queries can be expressed [18]. 3.4.4 Querying the Provenance of Electronic Data Provenance queries are user-tailored queries over process documentation. They are aimed at obtaining the provenance of electronic data. The primary challenge in this regard, is to characterize the data item in which the user is interested. 79 As data can be mutable, its provenance can vary according to the particular point of execution from which a user wishes to find it. A provenance query, therefore, needs to identify a data item with respect to sending or receiving a message. The complete details of everything that ultimately caused a data item to be the way it is, could potentially be very large in size. For instance, the complete provenance of the results of an experiment would possibly include a description of the process that produced the materials used in the experiment, the provenance of any source materials used in producing those materials, the devices and software used in the experiment and their settings, etc. This way, if documentation were available, details of processes leading back to the beginning of time or at least the epoch of provenance awareness would also be included. [16] Thus, it is highly essential for users to express their scope of interest in a process, using a provenance query. Such a query essentially performs a reverse graph traversal over the data flow DAG and terminates at the exact point catering to the query-specified scope. The query output would be a subset of the DAG. Scoping can be based on types of relationships, intermediate results, services, or sub-processes. [17] 3.4.5 Example: Provenance in Healthcare Management In order to illustrate the proposed approach, let us consider a healthcare management application. 80 The Organ Transplant Management (OTM) system that manages all the activities related to organ transplants across multiple Catalan hospitals. Their regulatory authority is the focus here. OTM is made up of a complex process, involving not only the surgery itself, but also a wide range of other activities, such as data collection and patient organ analysis, which all have to comply with a set of regulatory rules. 
Currently, OTM is supported by an IT infrastructure that maintains records allowing medical personnel to view and edit a given patient’s local file within a given institution or laboratory. [16] However, the system does not connect records, nor does it capture all the dependencies between them. It does not allow external auditors or patients’ families to analyze or understand how decisions are reached. [19] If OTM is made to be provenance-aware, powerful queries that were not possible to execute earlier could be supported with ease. Some of those queries include, finding all doctors involved in a decision, finding the blood test results that were involved in a donation decision, finding all data that led to a decision to be taken. Such functionality can be made available not only to the medical profession but also to regulators or families. 81 For easier understanding, the discussion is limited to a small, simplified subset of the OTM Workflow, namely the process leading to the decision of donating an organ. [16] Figure 18: Provenance DAG of a Donation Decision [16] As a hospitalized patient’s health declines and in anticipation of a potential organ donation, one of the attending doctors requests the full health record for the patient and sends a blood sample for analysis. Through a user interface (UI), these requests are made by the attending doctor and passed on to a software component (Donor Data Collector) responsible for collecting all the expected results. 82 After brain death is observed and logged into the system, if all requested data and analysis results have been obtained, a doctor is asked to make a decision about the donation of an organ. The decision, i.e., the outcome of the doctor’s medical judgment based on the collected data, is explained in a report that is submitted as the decision’s justification. Figure 18 displays the components involved in this scenario and their interactions. The UI sends requests (I1, I2, I3) to the Donor Data Collector service, which gets data from the patient records database (I4, I5) and analysis results from the laboratory (I6, I7), and finally requests a decision (I8, I9). To make OTM provenance-aware, it is augmented with a capability to produce an explicit representation of the process actually taking place. This includes p-assertions for all interactions (I1 to I9), relationship p-assertions capturing dependencies between data, and service state p-assertions. [16] In Figure 18, the DAG that represents the provenance of a donation decision can be seen, made of relationships p-assertions produced by provenance-aware OTM. DAG nodes denote data items, whereas DAG edges (in blue) represent relationships such as data dependencies (is based on, is justified by) or causal relationships (in response to, is caused by). Each data item is annotated by the interaction in which it occurs. Furthermore, the UI asserts a service state p-assertion, for each of its interactions, about the user who is logged into the system. 83 Over such documentation, provenance queries can be issued that navigate the provenance graph and prune it according to the querier’s needs. For instance, from the graph, it can be derived that users X and Y are both causing a donation decision to be reached. [16] Figure 18 is merely a snapshot of one scenario, but in real life examples, with vast amount of documentation, users benefit from a powerful and accurate provenance query facility. 
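A toy rendering of that query facility is sketched below: the relationship p-assertions of Figure 18 are flattened into an edge map from each data item to the items it is based on or caused by, and a provenance query becomes a scoped reverse traversal of that DAG. The node names only paraphrase Figure 18, and the scoping rule is an invented example.

    from collections import deque

    # data item -> items it is based on / caused by, per the relationship p-assertions
    otm_dag = {
        "donation_decision":    ["decision_request", "justification_report"],
        "justification_report": ["patient_record", "blood_test_result"],
        "decision_request":     ["patient_record", "blood_test_result", "brain_death_log"],
        "blood_test_result":    ["blood_sample_request"],
        "patient_record":       ["record_request"],
        "blood_sample_request": ["ui_request_I2"],
        "record_request":       ["ui_request_I1"],
    }

    def provenance_query(dag, start, in_scope=lambda item: True):
        """Reverse traversal from `start` towards its causes, pruned by a scope predicate."""
        result, queue, seen = {}, deque([start]), {start}
        while queue:
            item = queue.popleft()
            parents = [p for p in dag.get(item, []) if in_scope(p)]
            result[item] = parents
            for p in parents:
                if p not in seen:
                    seen.add(p)
                    queue.append(p)
        return result

    full = provenance_query(otm_dag, "donation_decision")
    # Scope the query to the clinical data only, excluding the UI-level interactions.
    clinical = provenance_query(otm_dag, "donation_decision",
                                in_scope=lambda item: not item.startswith("ui_request"))
    print(sorted(full))
    print(sorted(clinical))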
Hence, this kind of open approach allows complex distributed applications, possibly involving multiple technologies (such as Web Services, Command-line Executables, Monolithic Executables), to be documented. It also allows complex provenance queries to be expressed, identifying data and scoping processes independently of technologies used. 84 CHAPTER 4 CONCLUSION 4.1 Summary This report mainly focused on exploring the implementation details of Data Provenance across some applications as well as observing the various models proposed to represent the functionality of data provenance. This report has primarily provided a basic understanding of the concept of data provenance and a review of the various aspects of provenance such as the generalized W7 Model, and some proposed conceptual models that could be developed based on the W7 Model for satisfying a wide range of requirements. Then, the technique of Active Conceptual Modeling was observed, which is proposed in an attempt to make the conceptual schema active while capturing provenance in a database. Following on was a detailed discussion of the implementation of data provenance across some common, real-world applications. Firstly, it was observed how provenance supports information assurance attributes like availability, confidentiality and integrity in a multi-level secure environment. Then, a provenancebased mechanism was described, which can tackle the malicious packet-dropping attack in the field of Sensor Networks. Then the implementation of provenance collection and querying in electronic data was explained. The example application of a Healthcare Management System was discussed to further clarify the implementation. 85 4.2 Future Work This report has offered a rather generalized perspective of the concept of Data Provenance attributed to some applications. Apart from the applications that are discussed in this report, data provenance holds paramount importance in several other areas. The future work that could be conducted, would involve delving deeper into the details of capturing, storing and querying provenance information in areas such as, grid computing, cloud computing, web applications among some others in order to gain a better idea about the versatility of the concept of provenance. By implementing Active Conceptual Modeling across a wider range of applications than currently in use, such as in online website development for one, phishing attacks on websites can be reduced more efficiently. Virtual Data System (VDS) and myGrid are execution environments for scientific workflows, which also provide support for provenance. They follow their respective workflow language, which allows them to obtain compact process documentation. By adopting an open data model for process documentation, such systems could be integrated into heterogeneous applications for which provenance queries could be executed seamlessly [16]. Phantom Lineage, the concept that deals with the ways to store provenance of missing or deleted data, also requires further consideration. Finally, it can be said that data provenance is still a relatively new and exploratory field. Hence, a deeper understanding of provenance is essential to identify novel ways to recognize its full potential. 
GLOSSARY

XML - Extensible Markup Language
SOA - Service-Oriented Architecture
SOAP - Simple Object Access Protocol
REST - Representational State Transfer
DP - Data Provenance
WSS - Web Services Security
MD - Message Digest
SHA - Secure Hash Algorithm
TCP - Transmission Control Protocol
UDP - User Datagram Protocol
IPD - Inter-Packet Delay
DAG - Directed Acyclic Graph
BS - Base Station
OTM - Organ Transplant Management

REFERENCES

[1] A New Perspective on the Semantics of Data Provenance, by Sudha Ram and Jun Liu. Link: http://ceur-ws.org/Vol-526/InvitedPaper_1.pdf
[2] Understanding the Semantics of Data Provenance to Support Active Conceptual Modeling, by Sudha Ram and Jun Liu. Link: http://kartik.eller.arizona.edu/ACML_Provenance_final.pdf
[3] Data Provenance: A Categorization of Existing Approaches, by Boris Glavic and Klaus Dittrich. Link: http://www.cs.toronto.edu/~glavic/files/pdfs/GD07.pdf
[4] Research Problems in Data Provenance, by Wang-Chiew Tan. Link: http://users.soe.ucsc.edu/~wctan/papers/2004/ieee.pdf
[5] Archiving Scientific Data, by Peter Buneman, Sanjeev Khanna, Keishi Tajima, and Wang-Chiew Tan. Link: http://homepages.inf.ed.ac.uk/opb/papers/sigmod2002.pdf
[6] Why and Where: A Characterization of Provenance, by Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Link: http://db.cis.upenn.edu/DL/whywhere.pdf
[7] The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration, by Ian T. Foster. Link: http://www.ci.uchicago.edu/swift/papers/ModelAndArchForDataCollab2003.pdf
[8] Modeling Temporal Dynamics for Business Systems, by G. Allen and S. March. Link: http://wenku.baidu.com/view/ae750a651ed9ad51f01df242.html
[9] Future Direction of Conceptual Modeling, by P. P. Chen, B. Thalheim, and L. Wong. Link: http://www.springerlink.com/content/9r6hd2c344ceqm85/
[10] Data Provenance Architecture to Support Information Assurance in a Multi-level Secure Environment, by Abha Moitra, Bruce Barnett, Andrew Crapo, and Stephen J. Dill. Link: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05379854
[11] A Survey of Trust in Computer Science and the Semantic Web, by D. Artz and Y. Gil. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 5, pp. 58-71, 2007. Link: http://www.isi.edu/~gil/papers/jws-trust-07.pdf
[12] A Provenance-based Mechanism to Identify Malicious Packet Dropping Adversaries in Sensor Networks, by Salmin Sultana, Elisa Bertino, and Mohamed Shehab. Link: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5961507
[13] Secure Provenance Transmission for Streaming Data, by S. Sultana, M. Shehab, and E. Bertino.
Link: http://ieeeexplore.com/stamp/stamp.jsp?tp=&arnumber=6152110 [14] National Cyber Security Research and Development Challenges - Related to Economics, Physical Infrastructure and Human Behavior, 2009 Link: http://www.carlisle.army.mil/DIME/documents/i3pnationalcybersecurity.pdf [15] http://en.wikipedia.org/wiki/Direct-sequence_spread_spectrum [16] Provenance of Electronic Data, by Luc Moreau, Paul Groth, Simon Miles, Javier Salceda, John Ibbotson, Sheng Jiang, Steve Munroe,Omer Rana, Andreas Schreiber, Victor Tan, and Laszlo Varga Link: http://eprints.soton.ac.uk/270862/1/cacm08.pdf [17] An Architecture for Provenance Systems, by Paul Groth, Sheng Jiang, Simon Miles, Steve Munroe, Victor Tan, Sofia Tsasakou, and Luc Moreau Link: http://eprints.soton.ac.uk/263216/1/provenanceArchitecture10.pdf [18] The Requirements of Recording and Using Provenance in E-science Experiments, by Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau Link: http://eprints.soton.ac.uk/260269/1/pasoa04requirements.pdf 91 [19] Applying Provenance in Distributed Organ Transplant Management, by Sergio Alvarez, Javier Vazquez-Salceda, Tamas Kifor, Laszlo Varga, and Steven Willmott. Link: http://www.gridprovenance.org/publications/IPAW-OTM-EHCR.pdf [20] Tracking Provenance in a Virtual Data Grid, by Ben Clifford, Ian Foster, Jens Volcker, Michael Wilde and Yong Zhao. Link: ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1407.pdf [21] Mining Taverna’s Semantic Web of Provenance, by Jun Zhao, Carole Goble, Robert Stevens and Daniele Turi. Link: http://onlinelibrary.wiley.com/doi/10.1002/cpe.1231/pdf [22] Passing the Provenance Challenge, Margo Seltzer, David A. Holland, Uri Braun, and Kiran-Kumar Muniswamy-Reddy. Link: http://www.eecs.harvard.edu/~kiran/pubs/ccpe.pdf [23] A Survey of Data Provenance in E-science, by Yogesh Simmhan, Beth Plale and Dennis Gannon Link: http://pti.iu.edu/sites/default/files/simmhanSIGMODrecord05.pdf 92 [24] The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance, by Ragib Hasan, Radu Sion, Marianne Winslet Link: http://www.usenix.org/event/fast09/tech/full_papers/hasan/hasan.pdf [25] Secure Provenance for Cloud Storage, by Masoud Valafar, Kevin Butler Link: www.ieee-security.org/TC/SP2011/posters/Secure_Provenance_for_Cloud_Storage