PERICLES - Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics [Digital Preservation]

DELIVERABLE 4.1
INITIAL VERSION OF ENVIRONMENT INFORMATION EXTRACTION TOOLS

GRANT AGREEMENT: 601138
SCHEME: FP7 ICT 2011.4.3
Start date of project: 1 February 2013
Duration: 48 months

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

Dissemination level
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)

Revision History
V#       Date        Description / Reason of change                                                        Author
V0.1     2013.12.03  First template and outline                                                            FC, AH
V0.2     2014.01.16  Updated outline, created executive summary                                            FC
V0.3     2014.01.21  Updated outline, new sections                                                         FC, MCH
V0.4     2014.04.01  Updated outline, additions and fixes                                                  FC, SW
V0.5     2014.05.02  Updated outline, initial integration of substantial new material                      FC, AE, AH, MCH, SW
Draft 1  2014.05.19  First draft with new section and additions                                            FC, SD, AE, AH
Draft 2  2014.06.17  Second draft: improved SEI definition, new SOA, simplified outline, context update    FC, AH, AE, SK, TM
Draft 3  2014.06.30  Complete draft for internal review: finished chapters 1-2, work on 4.7 and chapter 9  FC, AE, AH, SK
Draft 4  2014.07.04  Completed all chapters, updated definition of SEI                                     FC, AE, JL
Draft 5  2014.07.14  First external review draft; addresses internal reviewers' comments, small addition to lifecycle  FC, MCH
Draft 6  2014.07.15  Moved to DOCX template                                                                FC
Draft 7  2014.07.17  Formatting and other fixes                                                            SK
Draft 8  2014.07.29  Added explicit references to use case requirements (D2.3.1) and improved requirement analysis section  FC
V1.0     2014.07.29  Final version                                                                         FC

Authors and Contributors

Authors
Partner                                    Name
University of Liverpool                    Fabio Corubolo (FC)
University of Göttingen                    Anna Eggers (AE)
University of Liverpool                    Adil Hasan (AH)
King's College London                      Mark Hedges (MCH)
University of Borås                        Sándor Darányi (SD)
King's College London                      Simon Waddington (SW)
Centre for Research & Technology, Hellas   Stratos Kontopoulos (SK)

Contributors
Partner                                    Name
Centre for Research & Technology, Hellas   Tasos Maronidis (TM)
University of Göttingen                    Jens Ludwig (JL)
Tate                                       Patricia Falcao (PF)
Tate                                       Pip Laurenson (PL)

Contents

Glossary
1 Executive Summary
2 Introduction & Rationale
  2.1 Context of this Deliverable Production
    2.1.1 Relation to other work packages
    2.1.2 Relation to the other work package tasks
  2.2 What to expect from this Document
  2.3 Document Structure
  2.4 Task 4.1 outline from the DOW
3 DO information for preservation & reuse
  3.1 Metadata
  3.2 Significant Properties
  3.3 Context
  3.4 Environment information
4 Significant Environment Information
  4.1 Definition of dependency and other entities in PERICLES
  4.2 Definition of Significant Environment Information
  4.3 Measuring significance
  4.4 SEI in the digital object lifecycle
    4.4.1 Post hoc curation and digital forensics
  4.5 SEI in the PERICLES case studies
    4.5.1 Software Based Artworks
    4.5.2 Space science scenario
  4.6 Modelling SEI with an LRM ontology extension
    4.6.1 Examples in support of weight and SEI modelling
    4.6.2 Modelling SEI with LRM - A preliminary approach
5 Digital Ecosystem (DE) Metaphor and SEI
  5.1 State of the art about digital ecosystems
    5.1.1 The DE metaphor in PERICLES research activities
    5.1.2 Interaction
  5.2 A parallel between SEI and environment in Biological Ecosystems
    5.2.1 The metaphor applied to the SBA example
6 Mathematical approaches for SEI analysis
  6.1 Problem description
  6.2 Solution: sensor networks
  6.3 Contextual anomalies
7 The PERICLES Extraction Tool
  7.1 Introduction
  7.2 General scenario for SEI capture
  7.3 Provenance, related work
  7.4 PET general description and feature overview
  7.5 Requirements analysis and software specification
  7.6 Software Architecture
  7.7 Extracting SEI by observing the use environment
  7.8 Modules for environment extraction
  7.9 Development process and techniques
  7.10 Scalability
8 Experimental results and evaluation
  8.1 Space science
    8.1.1 PET on operations: anomaly dependency information
    8.1.2 PET on extracting results of scientific calculations
    8.1.3 SOM and anomaly detection on sensor data
  8.2 Software Based Art: system information, dependencies
    8.2.1 System information snapshot
    8.2.2 Extracting font dependencies
    8.2.3 Experiments on SBA: Brutalism
  8.3 Self-organising maps on video data
  8.4 First internal evaluation (London, January 2014)
9 Conclusions and future work
  9.1 Conclusions
  9.2 Future work
    9.2.1 Ideas for future PERICLES tasks
    9.2.2 Other Ideas not assigned to specific PERICLES tasks
10 Bibliography
Appendix A: List of requirements
Appendix B: PET tool architecture in detail
Appendix C: Ideas for further PET developments

Glossary

Abbreviation / Acronym   Meaning
BE      Biological Ecosystem
CLI     Command Line Interface
Dx.y    Deliverable x.y from PERICLES
DE      Digital Ecosystem
DF      Digital Forensics
DO      Digital Object
DOW     Description of Work
DP      Digital Preservation
EI      Environment Information
ESOM    Emergent Self-Organizing Maps
GUI     Graphical User Interface
LRM     Linked Resource Model
PET     PERICLES Extraction Tool
RTD     Research and Technology Development
SBA     Software Based Art/Artwork
SEI     Significant Environment Information
SOM     Self-Organizing Maps
SP      Significant Properties
SW      Software
Tx.y    Task x.y from PERICLES
UAG     User Advisory Group
UC      Use Cases
WPy     Work Package number y from PERICLES

Figures

Figure 1. Relationships between different tasks in WP4.
Figure 2. The problem of information loss for external data.
Figure 3. Information extraction and encapsulation to avoid external information loss.
Figure 4. A view on Digital Object and related information (…).
Figure 5. The OAIS information model.
Figure 6. From [1], the PREMIS 2 data model.
Figure 7. From [27], proposed changes for the PREMIS 3 standard (…).
Figure 8. A possible example of a dependency graph.
Figure 9. SEI influences on SP of DO.
Figure 10. Extension to the LRM for representing weights and purposes.
Figure 11. Calibration data dependency modelling.
Figure 12. The biological metaphor.
Figure 13. PET snapshot.
Figure 14. PET architecture sketch.
Figure 15. PET detailed architecture.
Figure 16. PET snapshot showing Extraction Modules and their configuration.
Figure 17.
Screenshot showing changes in the 'handover sheet' (…).
Figure 18. Trace of document use (…).
Figure 19. Octave script used for the example.
Figure 20. Screenshot of the PET showing a calculation result extraction.
Figure 21. An unsupervised pipeline for labelling outliers (…).
Figure 22. The U-matrix of a toroid emergent self-organizing map (…).
Figure 23. SOM map for the video data labels.
Figure 24. Close-up of the SOM map with labels.

1 Executive Summary

This deliverable defines what constitutes significant environment information, supporting novel approaches for its collection and taking into account a wider scope of collectable information, to improve information use and reuse both in the present and in the future.

Digital objects (DOs) live in an environment containing a variety of information that can be relevant to their use. In order to take all the necessary information into account, we take a very broad view of the DO environment, considering information from anywhere, at any time, and of any type. We recognise that some of the environment information external to the organisation in charge of the data could change or disappear at any given time. Collecting such information at creation and use time is thus highly relevant for preserving the long-term usability of the data.

Looking anywhere, we start by analysing how such information has been described in related work, considering common definitions of metadata, context, significant properties and environment, and we come to the conclusion that we need to consider the broadest set of information, which we term environment information. Our intuition is that the focus should be on what matters for a DO's use and reuse, to support it in the long term, when the environment is likely to change. How can we represent such prerequisites to the use of a DO? We adopt the concept of dependency, common to the PERICLES RTD work packages, and extend it to express the prerequisites of use. Building on the existing definitions, we introduce the concept of Significant Environment Information (SEI), which takes into account the dependencies of the digital object on external information for specific purposes, and carries significance weights that express the importance of such dependencies for each specific purpose. Such information naturally forms a graph structure (the concept of a dependency graph is common to the RTD work packages) that in our case can support an enhanced appraisal process. The graph allows us to deduce other significant information to be extracted and to infer relationships between different objects. Our approach facilitates the gathering of information that is potentially not covered by established standards, and enables better long-term preservation, in particular for complex digital objects.
From there we extend the definition in time, considering the importance of collecting SEI during every phase of the digital object lifecycle: at creation (sheer curation), during primary use, and post hoc (e.g. using digital forensics). This allows us to observe the use of DOs and helps infer their SEI dependencies. We present examples of SEI from the PERICLES stakeholders, and present an extension of the LRM model (the abstract model being developed in WP3) to represent SEI. We investigate the analogy of Digital Ecosystems and Environments with the Biological domain, in order to find approaches from the more mature Biological domain that can be applied to Digital Ecosystems and Environments, and present mathematical approaches for SEI analysis.

Our PERICLES Extraction Tool (PET) is built on these concepts. It is a modular and generic open source (final approval pending at the moment of writing) framework for the extraction of SEI from the system environments where digital objects are created and used. PET is built to be domain agnostic, and supports extension by external modules. The modules can easily make use of a variety of existing tools (such as Apache Tika, mediainfo, mdls, and others) in order to address domain-specific needs. The tool automates novel techniques to collect SEI, and supports sheer curation: a continuous, transparent monitoring and collection process that the user (e.g. the scientist or artist in our use cases) would otherwise have to find time to perform manually. We present experimental results supporting the approach, based on the PERICLES use cases. The results from the PET tool will be further analysed in later tasks of the work package.

2 Introduction & Rationale

A digital object's (DO) environment contains a variety of information that needs to be captured to support its use and reuse in the long term. In this deliverable we identify the forms that this information may take, and propose a software framework to support its collection.

The environment where Digital Objects are created, processed and used is an active research topic in the area of long-term preservation [1][2]. Current approaches for collecting information from a DO's environment are mostly standards driven, based on the large number of metadata standards available [3]. Standards are usually managed by committees deciding on their contents, based on community feedback. While this is of course a well-tested and effective method, our focus is on what could be improved in metadata collection. By analysing the current situation and considering what could be improved, we found that, notable exceptions aside, most metadata standards do not include information about the environment, but rather descriptions of the objects themselves, and are stretched to their limits in supporting specialised use cases. We consider environment information to have a wide scope, including all the information that can be important to a DO's use and reuse.

In recent approaches more attention is paid to the importance of information residing in the environment for long-term preservation purposes. For example, the TIMBUS [4] project investigated the extraction of context information of business activities from the environment. The aim in [4] is to ensure that business processes will be re-playable, by capturing their context; this is parallel to, but different from, the objectives we have in this deliverable.
We decided to focus our research on DOs, their uses by user communities, and the requirements that these uses create in the environment of the DOs, in order to support the collection of a wider set of environment information. We consider that this use-driven approach will allow us to address information that would otherwise likely be ignored because it does not fit existing standards, or because it is dependent on the specifics of the use case.

We start by analysing how such information has been described in related work, considering common definitions of metadata, context, significant properties and environment, and we come to the conclusion that we need to consider the broadest set of information, which we term environment information. Our intuition is that the focus should be on what matters for a DO's use and reuse, to support it in the long term, when the environment is likely to change. To represent environment information we introduce the term Significant Environment Information (SEI), formally described in chapter 4, roughly covering the information needed to make use of a DO when considering a specific purpose and user community. We propose to use dependencies to express the link between the DO and its SEI, weighted by significance and qualified by the purpose of use. Such information naturally forms a graph structure (the concept of a dependency graph is common to the RTD work packages) that in our case can support an enhanced appraisal process. The graph allows us to deduce other significant information to be extracted and to infer relationships between different objects. Our approach facilitates the gathering of information that is potentially not covered by established standards, and enables better long-term preservation, in particular for complex digital objects. The direct environment of an investigated DO draws the boundaries of this perspective, whereas WP5 investigates a top-down approach, considering the dependencies in the whole digital ecosystem.

We also investigate the analogy of Digital Ecosystems and Environments with the Biological domain, in order to understand whether there are approaches from the more mature Biological domain that can be applied to Digital Ecosystems and Environments, and present mathematical approaches for SEI analysis.

As the digital environment continually changes, significant information could be lost if it were not extracted at the moment of occurrence. To address this risk we investigate in Section 4.4 a sheer curation approach, in which the environment is observed for events and significant information is extracted continuously based on these environment events. Sheer curation is an approach where curation activities take place in the environment of use and creation of DOs. This technique offers great potential for collecting additional environment information, as it provides direct access to the digital object's creation and use environment, where we can directly observe dependencies of use of DOs.

Our PERICLES Extraction Tool (PET) builds on these concepts. It is a modular and generic open source (final approval pending at the moment of writing) framework for the extraction of SEI from the system environments where digital objects are created and used.
The PET operates in a sheer curation scenario, where it observes the environment and continuously collects environment information. Furthermore, it provides a post hoc extraction mode to capture a single snapshot of the current environment information. The tool is designed to work in combination with an information encapsulation tool, to be developed in the succeeding task 4.2 of this work package, making it possible to encapsulate the extracted information directly with the related digital object. Experiments applying the tool, and substantiating the theoretical part, based on the PERICLES use case areas of digital art and space science, are also presented in this deliverable; they confirm the applicability of the tool to the different use cases. A mathematical approach for the detection of anomalies by analysing significant environment information is also investigated.

Significant environment information will enable many new usages for digital objects that are not supported by currently established techniques. Evaluating and extracting potential significant environment information can facilitate unpredictable future reuses. This concept has the potential to become a fundamental part of long-term preservation activities.

2.1 Context of this Deliverable Production

2.1.1 Relation to other work packages

This deliverable draws from, and contributes to, work in different work packages, as briefly illustrated here.

WP2: The rapid case studies and use case scenarios were used in this deliverable to draw initial requirements for the PET tool (section 7.5), and also to validate the concept of SEI in very diverse cases, where the case studies were analysed and SEI was determined for them. In particular, these can be found in sections 4.5 (examples of SEI from the case studies), 5.2.1 (Digital Ecosystem metaphor applied to Software Based Art) and 8 (experiments of PET extraction).

WP3: The focus on the dependency model shared by the RTD WPs has brought much useful cross-fertilisation. The importance of the purpose when considering dependencies was one of our contributions to the concepts in the LRM. The modelling of SEI and weights has been built as an extension to the LRM from WP3 (see section 4.6), and will help to define a common language and allow better reuse of our discoveries.

WP5: While our focus is on the scope of the DO environment, which builds from the DO up to the entities related to it, WP5 explores the concept of digital ecosystem, starting from the perspective of the system down to the DOs and other entities. This can be seen as follows: WP4 takes into account the properties of DOs, with a strong focus on their properties and environment, while in WP5 the focus is on the system properties specific to the organisations in custody of the data. Given this significant difference in point of view, there is overlap between the two concepts, and we are for this reason using the same terminology and the same types of entities across the work packages, as illustrated in section 4.1, where the D5.1 definitions are used to avoid duplication.

WP6-7: The PET is designed to be a client-side tool, as it needs to observe the environment of use; but its modules and tools could potentially be integrated on the server side in WP6-7.
The extracted SEI is currently stored in the user's local storage (in the program folder) and is never sent to a remote server, for privacy reasons. It is still conceivable that specific SEI could be sent to a remote server: given the exchangeable storage engine we implemented in PET, this could be a place where WP6-7 could implement the integration.

2.1.2 Relation to the other work package tasks

Task 4.1 will directly or indirectly feed into all the successive work package tasks, as illustrated by the following list and in Figure 1.

● T4.2 "Environment information encapsulation", where the information collected in T4.1 (and later task analysis results) will be encapsulated within the Digital Object to create richer, more self-describing DOs (see also sections 3.1 and 9.2.1).
● T4.3 "Semantic content and use-context analysis" will also use information from T4.1: semantic analysis will be performed on the extracted information to extract meaningful features.
● T4.4 "Contextualised content interpretation" will use the T4.1 and T4.3 information to perform an analysis and interpretation of the extracted features in time.

Figure 1. Relationships between different tasks in WP4.

2.2 What to expect from this Document

This deliverable defines significant environment information by broadening the existing definitions to cover information relevant to support better use and reuse of DOs that has up to now been largely ignored. The definition of SEI is based on an analysis of current approaches and standards. By analysing the Digital Ecosystem metaphor and its relation to SEI, we anticipate good potential pointers for future research. We also describe the methods and tools we have developed (the PET tool) to identify and extract such information by observing DOs in their creation and use environment, while presenting mathematical methods for the analysis of SEI. We present examples of SEI in PERICLES' two use case domains, and later provide an analysis of experimental results on those use cases.

This deliverable focuses on the following aspects, from the point of view of a DO environment:

● What can be defined as environment information in the perspective of the (re)use of the data in digital preservation? An analysis of current standards and approaches provides the basis of our definition (Ch. 3). We argue that metadata standards-driven approaches cannot cover, in some cases, all the information needed for DO reuse.
● Significant environment information is defined here to cover a broader set of information, not constrained by strict definitions; it is based on the purpose of use and weighted by its importance (Ch. 4).
● Volatile significant environment information exists only at the moment of interest and is difficult or impossible to reconstruct afterwards; for this reason it is important to collect it in the different phases (creation and use time) of the DO lifecycle.
● SEI comes from a variety of different sources, possibly under different domains.
A generic approach is therefore important in order to provide tools that can cover multiple domains.
● We describe various techniques that can be applied for the extraction of significant environment information.

We are also investigating, in collaboration with the other RTD work packages, the issues of modelling dependencies between such information.

2.3 Document Structure

The document is structured as follows:

1. Introduction
Chapter 1: Executive summary
Chapter 2: Introduction, project context, structure
2. Theory
Chapter 3: State of the art – metadata, significant properties, context, environment
Chapter 4: Significant environment information – dependencies, sheer curation, weighted graphs, examples to substantiate the theory
3. Models
Chapter 5: Biological ecosystem – digital ecosystem metaphor, pointers for future work
Chapter 6: Mathematical SEI analysis, anomaly detection
4. Software
Chapter 7: The PERICLES Extraction Tool – description, requirements and features, architecture design, development process
5. Experiments
Chapter 8: Practical experimentation in the space science and art domains
6. Conclusion
Chapter 9: Conclusion, future work

2.4 Task 4.1 outline from the DOW

A digital object's environment contains a variety of information that needs to be captured if the object is to be preserved and used in the future. This task is concerned with identifying the forms that this information may take, and with developing methods to extract the information from the environment in a scalable and reliable manner.

T4.1.1 Environment information specifications
● Identify the nature and source of the information necessary to preserve the critical features of the environment in which digital objects are created, curated and used (such as representation information, significant properties, provenance, policies, context, etc.), using information from the use cases (media and science) in WP2 and the existing expertise of the partners.
● Analyse the types and roles of resources involved in the extraction of environment information, to inform the modelling work in WP3 (resource modelling, context modelling, content modelling, change management).
● Identify models of semantic and mathematical approaches that fit the identified media-dependent environment variables.

T4.1.2 Methods for environment information extraction
● Investigate the application of techniques (e.g. weighted graphs, digital forensics, sheer curation) for extracting environment information such as that identified in T4.1.1.
● Develop a toolkit for extracting environment information that implements some of these techniques.
● Apply the semantic and mathematical models identified in T4.1.1 to the extracted information.

3 DO information for preservation & reuse

In this chapter, we examine previous work on identifying and representing the information for a DO that is relevant to support the reuse of that object, in the long term, across different user communities and for different purposes. We structure this examination by beginning with information that comes from the DO itself, then move beyond the DO with the aim of identifying a broader set of information that needs to be taken into account to better support DO reuse.
We recognise that this classification is one among many; the aim, however, is to show one thread that leads us to the topic of Significant Environment Information, introduced in chapter 4.

3.1 Metadata

Metadata can be defined as the information necessary to find and use a DO during its lifetime [6]. This definition covers a wide variety of information, and was further refined by the Consultative Committee for Space Data Systems in their reference model for an Open Archival Information System (OAIS) [7]. This refinement covered the information necessary for the long-term storage of DOs, and identified a number of high-level metadata categories, as follows. The Descriptive Information (DI) consists of information necessary to understand the DO, for example its name, a description of its intended use, when and where it was created, etc. The Preservation Description Information (PDI) consists of all the information necessary to ensure that the DO can be preserved, including fixity (e.g. a checksum), access rights, a unique identifier, context information (described in more detail in the following subsection) and provenance, which describes how the object was created. The final category arises from the fact that the OAIS manages not the DO itself, but information packages consisting of the DO as well as the DI, PDI and the information required to interpret the contents of the DO (which is described by the Representation Information (RI)). The Packaging Information (PI) category describes how the information package is arranged such that individual elements can be accessed.

Standard file formats have standard structural metadata (e.g. MPEG-21 [8]), and de facto standards (e.g. the Text Encoding Initiative [9]) exist for popular formats. The situation regarding standardisation of the descriptive part of the RI is more complex, due to the different needs of different communities, although many approaches contain the Dublin Core metadata element set [10] as a core. A catalogue of metadata standards for different communities can be found on the Digital Curation Centre website [11].

Metadata may be treated as a separate entity, as it can be accessed without accessing the DO, but a lack of metadata adversely affects the access to, or reuse of, the DO. While such information is essential for the reuse of the DO, it is not in general sufficient; information concerning the external relationships of a DO, whether to other DOs, stakeholder communities, or other aspects of the environment within which a DO is created or curated, also needs to be taken into account to ensure that the DO can be used fully and appropriately.

Metadata may be held internally in a DO, e.g. in the header of a structured file, or externally, e.g. in a database or in separate data structures such as file system extended attributes. The location of metadata is an important factor for Long-term Digital Preservation (LTDP), as it affects the availability of the information when, for example, the data is moved to an external location; in such an event, external metadata is in danger of being lost. As a short motivating example of potentially useful metadata stored in an external location, consider Apple OS X Spotlight metadata. When data is downloaded with a web browser, OS X stores provenance information about the URL from which the data was downloaded.
Such provenance information (which can be useful to understand the origin of the data) is part of the OS indexing engine database, and, if proper action is not taken, it is lost as soon as the data is archived to an external location (for example, when archiving it to a ZIP file).

Figure 2. The problem of information loss for external data.

The issue is illustrated in Figure 2, where the relationship between a DO and the information needed to use it is implicit, and is lost when the DO is transported to a new location. This can be solved in two steps:
1. We first need to identify and extract the useful information from the DO environment (external information). We specifically address this issue with the definition of Significant Environment Information in chapter 4, and with the PET software, described in chapter 7.
2. We will address the encapsulation of such information together with the DO in the next task in this work package, Task 4.2, so that the information is not lost when DOs are moved or migrated (as shown in Figure 3).

This is one of the reasons why we consider the extraction, and subsequent encapsulation, of such external information an important step to guarantee the long-term use of DOs.

Figure 3. Information extraction and encapsulation to avoid external information loss.

Table 1. An extract of our early analysis of metadata standards and their information content across standards.

As part of the work to analyse existing standards and determine useful information to collect, as well as existing gaps, we have built a comparative table illustrating the different standards and their features, as shown in Table 1. The table presented here is an extract of the original table and therefore not complete; the complete table is available online (http://goo.gl/sJ1Z1m) and on the PERICLES WIKI. This table illustrates the scope of a number of current metadata standards, and drove us to aim at the collection of a broader set of information, defined by the uses, as opposed to taking a standards-driven approach.

Our first attempt at determining what to capture was based on the different existing standards and their classifications. After some work, it was clear that a lot of time would be required for the classification and mapping of existing standards, and that this would in any case result in subjective output. We do not mean to imply that such classifications are not useful in many cases; but as a number of standards and classifications already exist, we determined that a different approach would be beneficial and would fill a gap in the current approaches. The standards were based on particular uses, and so offer some clues in those domains as to what constitutes the useful environment information to collect. Still, domain-specific standards are not easy to map across domains, while generic standards cover only a subset of what is needed, leaving users with the task of defining a metadata schema for their own domain. For this reason we opted for a bootstrap, practical approach.
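To make the OS X example above concrete, the following minimal sketch (our illustration, not part of PET) reads the Spotlight 'where from' attribute of a downloaded file via the standard mdls command-line tool and saves it in a sidecar file next to the object. kMDItemWhereFroms is the actual Spotlight attribute name; the sidecar file naming is our own assumption.

```python
import json
import subprocess
import sys

def spotlight_wherefroms(path: str):
    """Query the Spotlight index for the download URL(s) of a file.

    Runs the OS X `mdls` tool; returns the raw attribute value,
    or None if the attribute is not set.
    """
    out = subprocess.run(
        ["mdls", "-name", "kMDItemWhereFroms", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return None if "(null)" in out else out

def encapsulate(path: str) -> None:
    """Write the external metadata into a sidecar JSON file next to the DO.

    A stand-in for the T4.2 encapsulation step: once stored alongside the
    object, archiving both files (e.g. into a ZIP) no longer loses the
    provenance held only in the OS indexing database.
    """
    value = spotlight_wherefroms(path)
    if value is not None:
        with open(path + ".provenance.json", "w") as f:
            json.dump({"kMDItemWhereFroms": value}, f, indent=2)

if __name__ == "__main__":
    encapsulate(sys.argv[1])  # e.g. a file downloaded with a web browser
```

Moving only the file to another system loses the attribute, while moving it together with its sidecar preserves the provenance: precisely the two-step extract-and-encapsulate pattern of Figures 2 and 3.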
We considered that what really matters is the capability to make use of the information, as we describe here and in the following chapters, and decided that this is what can help and guide our selection of the environment information. Starting from the use of the data and looking for the information it requires will, we assume, allow us to gather a broader scope of information, information that could be partially ignored by a standards-driven approach; it covers what is necessary for the use and reuse of the information by the user communities, and lets us observe its change over time. One possible interpretation of the scope of the different types of information is given in Figure 4.

Figure 4. A view on Digital Object and related information, from the narrowest to the broadest.

3.2 Significant Properties

The concept of significant properties (SP) has been much discussed in Digital Preservation (DP) over the past decade (see for example [12, 13, 1]), in particular in the context of maintaining authenticity under format migrations, given that some characteristics are bound to change as formats are migrated. The issue here was to identify which properties of an object are significant for maintaining its authenticity. Early work in this direction may be found in [12], where SP are introduced as a "canonical form for a class of digital objects that, to some extent, captures the essential characteristics of that type of object in a highly determined fashion". Later work [13] investigated ways of classifying the properties: "Significant Properties, also referred to as "significant characteristics" or "essence", are essential attributes of a digital object which affect its appearance, behaviour, quality and usability. They can be grouped into categories such as content, context, appearance (e.g. layout, colour), behaviour (e.g. interaction, functionality) and structure (e.g. pagination, sections)." The concept has been adopted by standards such as [1], describing SP as "Characteristics of a particular object subjectively determined to be important to maintain through preservation actions." Such characteristics may be specific to an individual DO, but can also be associated with categories of DO.

An important aspect of SP is that significance is not absolute; a property is significant only relative to (e.g.) an intended purpose [14], or a stakeholder [2], or some other way of identifying a viewpoint. This intuition is highly relevant to the work we describe in chapter 4, where significance is considered a key property for collecting relevant information from the environment.

While the concept of SP is useful for digital preservation, in its application it has usually been restricted to internal properties of a DO, for example the size and colour space of an image, or the formatting of text documents, rather than the potentially valuable information that is external to the object itself. There have been some indications of a broader conception: [13] identifies context as a category of SP, [15] refers to the need to preserve properties of the environment in which a DO is rendered, and [2] introduces the notion of characteristics of the environment.
The latter associates environments with functions or purposes; this differs from what we are aiming at, which is to describe the significance of information from a DO's environment in relation to the purpose the user is pursuing (such as editing the object, processing the object, etc.). We thus see the purpose as qualifying the significance, not the environment – a piece of information is significant for a specific purpose, but not for some other purpose (within the same environment).

In this deliverable we often draw on Software Based Artworks (SBA) as examples, because they are well suited to demonstrating that just preserving intrinsic metadata cannot always preserve SPs. [16] describes the special role the environment plays for born-digital media art: "The digital environment contains both: the core elements of the artistic software as well as the operating system and modules required by the artistic software." In [17] the important question "What are the Spatial or Environmental Parameters of a Work?" is asked to identify one of the areas of focus for the SPs of SBAs. Also, [18] outlines the "significance of the hardware/software systems for video and software based artworks". One of her case studies, concerning the artwork Becoming by Michael Craig-Martin, is a good example to illustrate the dependency of an SBA's SPs on the environment: "The randomness was created using Lingo, the programming language for Director. This randomness makes use of the computer's system clock, and the speed at which images become visible or invisible is also dependent on the speed of the computer."

3.3 Context

Context is a term with many definitions, a basic dictionary definition being "the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood" [19]. This clearly relates context to the purpose of understanding information, and this is a key feature of context in relation to digital objects. Context encompasses a broader range of information than metadata; it describes the setting that enables an understanding of a DO [20], including for
The context parameters cover an wide set of parameters, from the legal, business to the system, and technological ones, with the aim of supporting the execution of processes in the long term. In PERICLES, DP and digital forensics are approached from the angle of preserving content in context, where context can be spatial, temporal, conceptual and/or external, and all of these aspects apply to space and digital art, the domains in study. We note in passing that the dependence of content on 2 From OAIS: “Context Information: The information that documents the relationships of the Content Information to its environment. This includes why the Content Information was created and how it relates to other Content Information objects.” © PERICLES Consortium Page 19 / 86 DELIVERABLE 4.1 INITIAL VERSION OF ENVIRONMENT INFORMATION EXTRACTION TOOLS context as an insight has been around for quite some time3, and this insight returns in the way dependencies play a key role in PERICLES models of content and context. Context Definition from the Multimedia Analysis Perspective In semantic multimedia analysis the ultimate goal is to obtain high-level semantic interpretations of a digital object. A plethora of techniques have been devised for semantic content description and categorization in many application domains. Although some of these techniques display satisfactory performance, they make use of no contextual information that arises from the specific application. In recent years though, the notion of context has been gaining increasing interest in image/video understanding. Modelling the context and integrating the contextual information with the raw information of the source digital media has proven to improve the performance of the state-of-theart techniques in image/video categorization [22]. In an attempt to model the context information, many approaches have been proposed and published in the literature for semantic image understanding [23], [24], [25] or video indexing [26]. Furthermore, the use of context in conjunction with the domain knowledge, that is information about the application domain in general, has boosted the efficiency of content analysis. From the multimedia analysis perspective, context is usually considered as the auxiliary information that can help us cope with the erroneous interpretations produced by the analysis modules that rely solely on the audio-visual content. This information may be related to the environment where the multimedia item is used, the purpose of its use, the reason for its creation, or even the equipment used for its creation. For instance, in the media case study, a painting might be exhibited in a museum sector referring to a specific time era (e.g., Renaissance, Rococo, etc.), where different, more or less abstract concepts might be depicted. This environment information, along with artistspecific information i.e., who is the creator of the painting as well as contextual information acquired from his/her other works, may also affect the extraction of semantic information about the artwork. Moreover, information regarding a potential reason why a piece of work has been created, e.g., an exhibition devoted to a certain concept, for instance “representation of spring”, may also contribute to semantic extraction (object categorization, scene recognition, concept recognition, etc.). As another example, in a music artwork, information about the equipment used for its creation (e.g. 
whether it has been performed with orchestral instruments or electronically generated using computer software) may be integrated with the pure content for annotation purposes.

Using a rough categorization, context can be classified into four main types: spatial, temporal, conceptual and external, according to its origin, nature and what it manifests. These types are briefly described here:

● Spatial context is related to the spatial organisation and structure of a digital object, e.g. an image, and it usually originates from the pixels surrounding an object in an image, or the relative location, orientation and scale of multiple objects within an image.
● Temporal context is encoded by the temporal relation among a collection of digital objects, for example in the form of the optical flow or frame proximity among consecutive frames in a video.
● At a higher level of abstraction, conceptual context might represent the co-occurrence of multiple concepts in a digital object (e.g., objects in an image), and in some sense it expresses the correlation among several concepts.
● External context includes all those types of information that come from sources other than the digital object itself, and is intimately related to the descriptive information previously identified in terms of metadata.

It is worth mentioning, though, that this is only a rough classification enabling an easier perception of the inherently abstract notion of context. As a matter of fact, there is usually overlap between the several classes, so a strict classification of contextual cues is rather far-fetched. Moreover, it is clear that not all of these types are always suited to every kind of audio-visual content. For instance, an image cannot be interpreted in terms of temporal context, while an audio file cannot be interpreted within spatial context.

3.4 Environment information

The widest set of information in our view is the environment. We consider environment information to include all the entities (DOs, metadata, policies, rights, services, etc.) useful to correctly access, render and use the DO. The definition supports the use of unrelated DOs and conforms to the definition of environment used by PREMIS [1].

Figure 6. From [1], the PREMIS 2 data model.

In the context of OAIS, the term 'Representation information'⁴ and its specialisation 'Other representation information'⁵, together with the information in the PDI, seem to include some of the information we would classify as environment information. The point of view in OAIS is still that of supporting the understanding of the object, and does not qualify the different uses and purposes for
the information.

⁴ "Representation information: The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard. Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing."

⁵ "Other representation information: Representation Information which cannot easily be classified as Semantic or Structural. For example software, algorithms, encryption, written instructions and many other things may be needed to understand the Content Data Object, all of which therefore would be, by definition, Representation Information, yet would not obviously be either Structure or Semantics. Information defining how the Structure and the Semantic Information relate to each other, or software needed to process a database file would also be regarded as Other Representation Information."

In a way, we also propose a simpler definition, which does not aim at a classification of the different types of information (structural, semantic, other, and PDI plus subcategories); instead, we focus on the user and the use of the information, from the creation context and eventually from different communities.

PREMIS builds on the OAIS reference model and defines a core set of metadata semantic units that are necessary for preserving DOs. The set is a restricted subset of all the potential metadata and only consists of metadata common to all types of DOs. The current, published set (PREMIS 2) defines a data model of interrelated entities (see Figure 6). The Object entity allows information about the DO environment to be recorded, amongst other information. The Rights entity covers the information on rights and permissions for the DO. The Events entity covers actions that alter the object whilst in the repository. The Agents entity covers the people, organizations or software services that may have roles in the series of events that alter the DO, or in the rights statements. The Intellectual Entity allows a collection of digital objects to be treated as a single unit.

Dependency relationships are defined in PREMIS 2 as arising "when one object requires another to support its function, delivery, or coherence of content. An object may require a font, style sheet, DTD, schema, or other file that is not formally part of the object itself but is necessary to render it."

The PREMIS working group undertook an investigation of the environment information metadata, based on feedback from their user groups, which found the existing support to be difficult to use. The group reported their findings in [27], which entailed promoting environment information to a first-class entity, rather than a subordinate element of the DO, for the next version of PREMIS (PREMIS 3). They advocate the use of the Object entity to describe the environment, which allows relationships between different environment entities, as illustrated in Figure 7. This approach neatly supports the PERICLES view of the environment, although PERICLES makes a distinction between the general environment and the environment significant for a particular set of purposes (termed the Significant Environment Information for a DO), which is described in the following chapter.

Figure 7. From [27], proposed changes for the PREMIS 3 standard to make environment a first class object (light grey).
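To make the dependency notion quoted above concrete, the sketch below renders the font example as a simplified record. This is a schematic illustration only: the field names are our own shorthand, not the exact PREMIS schema elements, and the identifiers are invented.

```python
# Schematic, non-normative rendering of a PREMIS-2-style dependency:
# one object requires another that is not formally part of it but is
# necessary to render it (field names simplified, identifiers invented).
artwork_record = {
    "objectIdentifier": "example:sba-0001",
    "relationships": [
        {
            "type": "dependency",
            "subType": "requires",
            "relatedObject": "example:font-helvetica",  # hypothetical font object
            "note": "font needed to render the work as designed",
        }
    ],
}
```

Under the PREMIS 3 proposal of Figure 7, the font would itself become a first-class environment Object with its own identifier and relationships, rather than remaining a note inside the depending object's record.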
We consider the environment information for a DO to be the widest set of entities related to it. By definition this includes all other DOs, services and information that can relate to the DO, as well as any other information from the environment that is useful for any of its possible uses. We consider this a wider, although related, set than the one described in OAIS as Representation Information, and we take a different focus from PREMIS 3, as we are not focused on software environments. Another important distinction is that, in general, we look at the environment as something defined from a DO upwards, and thus we see environments as defined on the DO. This is a different point of view from the one taken in WP5, which adopts an ecosystem perspective, looking from the system and institution point of view downwards to the different entities, services, processes, users and digital objects. Furthermore, we consider that a part of the environment information will only be observable in the live environment of creation and use, which drives our choice of a sheer curation approach (described in the next chapter). While looking at the DO environment, we consider the user an important part of it, and for that reason we observe the interaction between users and their communities, the DOs, and the rest of the environment. We think that this perspective will allow us to capture information based on the pragmatic, sometimes neglected aspects of the real requirements for making use of DOs. This will also help us in the task of inferring dependencies that are not explicit, and of determining relevant information based on the real use of the DOs. As environment information will be a very wide set of information, it will be important to qualify what information is significant and what is not, as we introduce in the next chapter.
4 Significant Environment Information
Based on the definition of environment information from the previous section, we now propose our definition of Significant Environment Information (SEI). To this end we introduce the term “dependency”, needed to express usage constraints for the DOs that exist in the environment under consideration. Furthermore, we propose some methods to measure SEI in the context of sheer curation, and investigate SEI for usage scenarios related to our use case partners. Our respective findings will be presented as a full paper at iPRES 2014 [28].
4.1 Definition of dependency and other entities in PERICLES
In PERICLES, dependencies are a concept shared by the three RTD Work Packages (WP3-5), together with other entities that can appear in a DO environment and, for WP4 and WP5, in the general ecosystem. Although many of these entities are common and shared, it is useful to make a distinction in their use, based on the point of view taken by each work package. In this deliverable, the point of view taken is that of observing a DO and its environment, aiming to support their use and reuse and to support DO appraisal, while making very few assumptions about the whole ecosystem (the institutional or system aspects of the environment). This allows us to focus on the properties that are independent of the system aspects and do not rely on the existence of the system or institution.
For this reason this deliverable looks at the less structured aspects of the information and its processes. In WP5, the view taken is closer to the institution, and more attention is paid to the structure of the ecosystem, its processes and its other entities. Looking in particular at dependency, we propose a definition that has a different scope from the one introduced in D5.1. Here we take the perspective of the use made of DOs, to express what information is needed when a user wants to make a specific use of them. By looking at what uses are possible for a DO, and what the prerequisites for them are, we aim to discover the dependencies of a particular object when considering a particular use or purpose. As this deliverable, D3.2 and D5.1 are concurrent developments, we refer here to the most recent drafts of those deliverables available at the time of writing. The focus of the definition currently in D5.1 is on assessing the impact of change; the definition proposed there is the following: “There are the objects A and B. An object A has a dependency on B when some changes in B have a significant impact on the state of A. And there is also a dependency if changes in B can impact the ability to perform function X on A.” In the current incarnation of D3.2, dependencies are defined as links between entities (based on the PROV-O ontology definition) with an intention, which is described by a plan. A plan can have “preconditions (when is it required to trigger the propagation of a change?) and impact (how depending resources will be impacted)”. The definition of dependency in D3.2 is focused on the change aspects of dependencies, as is the case in the WP5 definition. In later work we plan to investigate the possibility of a single, common definition covering all the aspects of dependency used in the project; for now, however, we considered it important to make progress with the properties that express the need for other digital objects, as these better fit the objectives of this deliverable. It is important to note that the definition per se is more conceptual than practical, and does not preclude us from making use of the LRM model from WP3. In section 4.6 we present our early work on extending the LRM model to support the description of SEI.
Dependency definition
We define dependency as a directed relationship from entity A (source) to entity B (target) that expresses the necessity of B, as prerequisite or precondition, in order to make use of A. Put differently, we view a dependency as: in order to use the source entity A, you need the target entity B. In this definition, a dependency is always qualified by a use: one can assume there can be more than one use of an entity. This assumes that users will naturally have an objective, a purpose of use, when accessing a resource. As we will see later in this deliverable, although we cannot know in advance the future uses of an entity, we consider that dependencies covering different user communities will likely cover a broad set of uses, and that monitoring the change in use by user communities will be the best option for addressing future uses. Even if some dependent entities can be replaced by other entities in an environment, for example by using another word processor to edit a text fragment, our focus is on the currently established dependencies in the environment. A minimal code sketch of this dependency notion follows below.
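The following Python fragment is a minimal sketch of the dependency notion just defined: a directed, purpose-qualified relationship from a source entity to a target entity. All identifiers are illustrative assumptions, not part of any PERICLES tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dependency:
    """Directed dependency: in order to use `source`, you need `target`.

    Every dependency is qualified by the intended use (`purpose`)."""
    source: str   # entity being used, e.g. a DO identifier
    target: str   # entity that is required for that use
    purpose: str  # the use that makes the target necessary

# Example: to interpret 'anomaly.doc' one also needs 'errorcodes.txt'.
dep = Dependency(source="anomaly.doc", target="errorcodes.txt",
                 purpose="interpret error codes")
```

The same structure extends naturally to the disjunctive and type-level dependencies discussed next.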
A disjunctive dependency expresses the need for one entity from a set of entities (entity A OR entity B etc.); in this case the dependency would be between the entity and another entity representing the set of alternative entities.
Dependencies at the type and instance level
We defined dependencies to connect different entities, but there are distinctions to be made based on the type of the entities involved. Dependencies from or to:
● Instance level:
○ Software or services: these link an entity to a service, a process or a software component that is necessary (or useful) in order to make use of some data, for example in the case of an application that makes use of a web service. This can be considered a form of dynamic data.
○ Static data (single digital objects and their dependencies): to make sense of file ‘anomaly.doc’ you need to also have access to ‘errorcodes.txt’ and ‘whereisthat.csv’.
● Type level:
○ Software or services: to use data in format X you also usually need application Y or service Z.
○ Static data: to interpret this type of data you usually need this other type of data.
We think that the distinction between type and instance level dependencies will be particularly relevant for later work on dependencies, and it has to be made explicit, as SEI will likely involve both kinds of entities. It is likely that future work will address the creation of type-level dependencies through the observation over time of instance-level dependencies. It is of course possible to further classify dependencies by a number of facets, such as re-playable or not re-playable dependencies; time-dependent or not time-dependent dependencies; dependencies for the different aspects, both from the technical environment and from the human, not strictly system-bound, environment; semantic dependencies; etc.
For a detailed classification of dependencies, we refer to the work carried out in the scope of D5.1, which includes the definition of a number of entities in the ecosystem. For a more detailed state of the art on dependency and its definitions, we refer again to Deliverable 3.2.
DOs and other entities
We again refer to WP5 for a more detailed definition of the ecosystem entities, and for the definition of Digital Object. The focus in this deliverable is on the DO and service entities and their types, and we consider that a DO can represent both proper data and a description of, and reference to, existing services or software. We also consider the possibility of referring to subparts of a DO, and to the type of a DO, such as some specific information contained in a DO’s metadata, or the type of that information, when talking about dependencies between DOs. This allows us to describe dependencies concisely without getting into the details of DO identifiers and subpart or type identifiers.
4.2 Definition of Significant Environment Information
Based on the broad definition of environment in the previous chapter, and that of dependency, we define: Environment Information for a source DO is the set of dependencies from the source DO, together with their target DOs, for any type of use. Depending on the situation, the target of a dependency can be an existing DO, a new DO representing some extracted information, a reference to such objects, or finally a type of object, and should be considered part of the environment information.
We further define purpose (or intended use) as one specific use or activity applied to the source DO by a given user community. It is possible to imagine a hierarchy of purposes, where a higher-level purpose (for example, ‘render with faithful appearance’) leads to a set of detailed purposes (such as ‘accurate colour reproduction’, ‘accurate font reproduction’, etc.).
We further define significance weight, with respect to a purpose, as a value expressing the importance of each environment information dependency for that particular purpose. The significance weight will be a property of each dependency between the DO and the DO expressing the environment information for a specific purpose. We explore methods and models for the significance weights in sections 4.3 and 4.6.
Finally, we define ‘Significant Environment Information’ (SEI) for a source DO, with respect to a given purpose (or purposes), as the set of environment information qualified with significance weights. This includes both the dependency relationships (with purposes and weights) and the information that is the target of the dependencies. Less formally, what we are aiming at is to determine “more or less all you need to have” when interacting with a DO for a specific purpose, and the relative significance of each of these information units (dependencies).
Once SEI is determined for a collection of DOs, the different dependencies can form a graph structure, as illustrated in Figure 8, where DOs in the collection can have relationships between each other (when a DO in the collection depends on another DO in the collection for a specific purpose). This graph of significant information from the environment can serve as the basis for appraising the set of DOs that should be maintained, together with the relationships in the SEI (for example, by applying a simple threshold to the significance weight, in order to support the use of the DO in the future). In this graph we can imagine that weights will be assigned both to the data and to the SEI. The combination of the DO weights and their propagation through the dependency weights should allow an optimal set of DOs to appraise to be determined. This will be explored in future deliverables of this project, in particular within T5.5.
Comparing this definition to existing definitions of SPs of the environment, for example the one described in [1], we note that the former is aimed at the collection of SEI for a DO to support the different purposes a user can have with respect to it, while the latter defines the significant properties of an environment in itself. The information we aim to collect is constituted by qualified relationships to other DOs, as opposed to properties of the environment.
Figure 8. A possible example of a dependency graph.
We can observe that SEI is a specialisation of the dependency definition, taking into account the purpose and the weight of the dependency. This means that SEI is a specialised subset of the dependencies for a DO. As mentioned before, the perspective we take is that of observing the current use of a DO in its use environment, which is often before it enters a Digital Preservation system. This will allow us to better determine which dependencies and which targets of the dependency relationships (be they information or services) are significant for the uses of the DO.
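As an illustration of the graph structure and the threshold-based appraisal mentioned above, here is a minimal Python sketch. The DO names, purposes and the 0.7 threshold are invented for illustration; they are not values defined by the project.

```python
from collections import defaultdict

# source DO -> list of (target DO, purpose, significance weight)
sei_graph = defaultdict(list)

def add_dependency(source, target, purpose, weight):
    assert 0.0 <= weight <= 1.0  # weights are assumed to lie on a 0-1 scale
    sei_graph[source].append((target, purpose, weight))

add_dependency("calibrated_data", "raw_data", "recalibrate", 1.0)
add_dependency("calibrated_data", "calibration_formula", "recalibrate", 1.0)
add_dependency("calibrated_data", "raw_data", "read", 0.0)

def appraise(source, purpose, threshold=0.7):
    """Select the targets significant enough to maintain alongside `source`."""
    return [target for (target, p, w) in sei_graph[source]
            if p == purpose and w >= threshold]

print(appraise("calibrated_data", "recalibrate"))
# ['raw_data', 'calibration_formula'] -- nothing is selected for 'read'
```

Propagating weights through such a graph, as envisaged for T5.5, would then amount to a traversal combining DO weights with dependency weights.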
We consider that knowing the significant information necessary to support current uses will allow us to cover, or at least know more precisely, the needs for the long term as well, as long as we try to support different user communities. This is because different user communities will have different purposes and requirements, so this is a good approximation of knowing the needs of a future community (which we cannot know in advance, anyway).
4.3 Measuring significance
Currently, we are focusing on collecting a wide array of environment information, based on the dependencies this information can have on the DO and its estimated relevance. We are also trying to infer object dependencies that have an implied significance by looking at use data, as described later. Still, a very relevant part of what we need is measuring the significance of the collected data. Although we do not have experimental results for this part yet, we have clear ideas on how to define and collect it. It is answering the question ‘what for – for what purpose?’ that should help us define what is significant.
Collections of data often have more than one use, and determining what information is significant depends on the use of the data. For example, the calibration of the solar measurement instrument will require calibration data, which may be a subset of the complete collection of data, as well as the applications necessary to read and analyse the calibration data. For a given collection, not all of its environment information may be necessary for every potential use. To represent this, we propose assigning weights to each relation between the collection DOs and their environment information. Weights take a value between 0 and 1 inclusive, where a weight of 1 indicates that the information is essential for all intended uses of the data. Monitoring the access of information, as well as regular reviewing of the information required for each use, would provide the opportunity to update the weights and could also accommodate new uses of the data.
Weights could be determined by direct collection, that is, by asking the users to provide such weights together with their current purpose of use, once the dependencies have been determined. Another possibility would be that of observing the frequency of data use to determine the significance weight, for example by observing how often a particular object is used in conjunction with another, in cases where such a scenario can be applied (when the usage data can be collected across multiple users). For an individual DO these weights would include factors based on the cost of collecting the information (e.g. a subscription fee may be required to access the information contained in the resource). A threshold defined by the user community, archive and content providers would determine which pieces of information make up the SEI. For example, in the case of a subscription fee for accessing information contained in a third-party repository, one could define the weight as (1 - cost/budget), where the budget could be the total funds allocated to the archival of this DO. The user community, content providers and archive will need to determine how the weight is defined. Related work is being carried out in current FP7 projects such as the 4C project6 and the SCIDIP-ES project [29].
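The two weighting strategies just mentioned, cost-based and frequency-based, can be sketched as follows. The function names are our own assumptions; only the (1 - cost/budget) formula comes from the text above.

```python
def cost_based_weight(cost, budget):
    """The example formula above, (1 - cost/budget), clamped to the 0-1 scale."""
    if budget <= 0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - cost / budget))

def co_use_weight(times_used_together, times_source_used):
    """Frequency-based estimate: share of the source DO's observed uses in
    which the target was accessed as well (assumes usage data is collected
    across multiple users)."""
    if times_source_used == 0:
        return 0.0
    return times_used_together / times_source_used

# A 200 EUR subscription against a 1000 EUR archival budget, and a target
# seen in 45 of 50 observed uses of the source DO:
print(cost_based_weight(cost=200, budget=1000))  # 0.8
print(co_use_weight(45, 50))                     # 0.9
```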
Particular patterns in data usage could also help determine the current user activity, and from there automate the inference of the dependency’s purpose of use. Other factors can also be taken into consideration in calculating the weight, to express value, such as the cost in time and money to collect the information, as well as whether the information is proprietary (which may limit its accessibility); there may also be licensing and privacy constraints that restrict where the data can be accessed from. Any factor that influences access to the information may contribute to its weight. Significance is useful from the long-term preservation perspective, for example to support critical analysis of the science data, as it will be a useful representation of the point of view of the stakeholders and the importance of the information to them. It can also provide a key to the understanding of the information and the relationships between its different parts.
4.4 SEI in the digital object lifecycle
In recent years, there have been various efforts within the digital curation community to establish new methods of carrying out curation activities from the outset of the digital lifecycle. A major constraint that militates against this is that data creators (such as researchers) typically have time only to meet their own short-term goals, and, even when willing, may have insufficient resources, whether in terms of time, expertise or infrastructure, to spend on making their datasets preservable, or reusable by others (e.g. [30]). Moreover, the very volume of information that may be useful can preclude this as a practical approach, and in any case the researcher may be unaware of the utility, or even the existence, of much of this information.
6 http://www.4cproject.eu/
One approach to this challenge has been termed sheer curation (by Alistair Miles of the Science and Technology Facilities Council, UK), and describes a situation in which curation activities are integrated into the workflow of the researchers creating or capturing data. The word ‘sheer’ here is used to describe the ‘lightweight and virtually transparent’ way in which these curation activities are integrated, with minimal disruption (see [31]). Sheer curation is based on the principle that effective data management at the point of creation and initial use lays a firm foundation for subsequent data publication, sharing, reuse, curation and preservation activities; the sheer curation model has not, however, been extensively discussed in the scientific literature. The term has sometimes been interpreted as motivating the performance of curatorial tasks by data creators and initial users of data, by promoting the use of tools and good practice that add immediate value to the data. This is, in particular, the take of [32], which discusses the role of such an approach in the distributed, community-based curation of business data. However, this interpretation does not really address the challenges outlined above, and a more common understanding of sheer curation depends on data capture being embedded within the data creators’ working practices in such a way that it is automatic and invisible to them.
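As a rough illustration of such automatic, invisible capture, the sketch below logs file-system events in a researcher's working directory while they work. It assumes the third-party watchdog package and an invented path; the PERICLES Extraction Tool, described in chapter 7, is the project's actual implementation of this idea.

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class CaptureHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        # Record the raw observation; inferring dependencies and their
        # significance happens later, in a separate analysis step.
        print(time.time(), event.event_type, event.src_path)

observer = Observer()
observer.schedule(CaptureHandler(), path="/home/researcher/data", recursive=True)
observer.start()  # capture runs in the background, invisible to the researcher
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```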
The SCARP project [33], for example, during which the term ‘sheer curation’ was coined, carried out a number of case studies in which digital curators engaged with researchers in a range of disciplines, with the aim of improving data curation through a close understanding of the researchers’ practice [34][35]. In [36] the concept of sheer curation is extended further to take account of process and provenance as well as the data itself. The work examined a number of use cases in which scientists processed data through various stages using different tools in turn; however, as this processing was not carried out in any formally controlled way (e.g. by a workflow management system), it would have been impossible for a generic preservation environment to understand the significance of the various digital objects produced from the information available, as the story of the experiment was represented implicitly in a variety of opaque sources of information, such as the location of files in the directory hierarchy, metadata embedded in binary files, filenames, and log files. This was addressed by capturing information about changes on the file system as these changes occurred, when a variety of contextual information was still available; the provenance graph was constructed from this dynamically, using software that embedded the knowledge and expertise of the scientists. The BW-e Labs project [37][38] is an example of sheer curation and the collection of context information in laboratory environments. The project stores context metadata, captured at the laboratory equipment during experiments, together with the experiment measurements, to improve reuse and collaboration between scientists.
The most effective way to capture SEI is through observation in the environment of creation and use of the object. We look at the interaction between the DO, the environment and the user, with a time dimension. This allows us to infer dependencies that are not explicit, and to determine relevant information useful for the use and reuse of the DO. In terms of the DCC lifecycle [42], we are addressing the ‘create’ and ‘use and reuse’ phases, examining the creation and use/reuse contexts and trying to extract SEI from them.
There is a close analogy between what we term sheer curation and modern models used in the records management community. In the traditional approach towards record-keeping – the so-called ‘life cycle model’ – archivists are only involved subsequent to the period of active use of a record within an organisation, when an object is transferred to a formal archive or otherwise disposed of. Partly in response to the move towards digital rather than paper records, record-keeping practice has increasingly adopted the ‘Records Continuum’ model, in which a record is regarded as existing in a continuum rather than passing through a series of fixed life-cycle stages, and archival practices are involved throughout, from the time the record is first created [85]. In this way, contextual information available at the time the record is created, and during its period of active use, may be captured and subsequently exploited to support archiving, in much the same way as the metadata captured during data creation and reuse in sheer curation models. This metadata may be thought of as constituting records that document (e.g.)
a scientific experiment, thus forming a ‘metadata continuum’.
4.4.1 Post hoc curation and digital forensics
Sheer curation may be contrasted with post hoc curation, which takes place only after the period during which the digital objects are created and primarily used. When a DO is ingested into an archive or repository, it is certainly possible to extract metadata from it – indeed, many tools already exist to analyse content on ingest – however, much of the contextual information that was available in the environments in which the DO was created and used may no longer be available at ingest, or may be represented only implicitly. This may result in a reduced ability to understand the semantics of the DO, or to reuse it effectively.
Digital forensics is one particular approach to such post hoc curation, and involves the recovery and analysis of data from digital devices. It began in law enforcement as a way of investigating computer crime, but has since been applied to a variety of related areas, such as investigating internal corporate issues, and also to digital preservation. At the heart of digital forensics is the recognition that archives exist as physical (the storage media), logical (digital artifact), and conceptual (as recognized by a person) objects [39]. Physical storage media can degrade over time, and so there is the need to preserve the logical object (data) stored on the media. However, physical media can also provide additional information relating to the context of the data’s creation. For example, the data might be the final draft of a document, but the media may also contain deleted files relating to previous drafts of the document; it can also give information relating to the environment in which the document was created, such as the operating system. Although digital forensic techniques were originally designed for application to computer hard drives, they can similarly be applied to any form of physical storage media, such as optical disks (e.g. compact discs or DVDs), USB storage devices (e.g. flash drives), digital cameras, iPods or mobile phones.
Key to digital forensics is the creation of a disk image: a bit-for-bit copy of the whole storage medium, identical to the original. As such, a disk image is different from a logical copy of the storage media, which may contain copies of the files and folders of the storage media but is not identical at the bit level [40]. Creating a disk image means that it can then be viewed, analysed and tested in exactly the same way as the original storage media, but without interfering with, or modifying, the original. For example, in some cases even accessing the storage media would result in modification (such as the indexing process performed by Mac OS X), and for this reason digital forensics practitioners recommend the use of a ‘write blocker’ during the access of original storage media (e.g. when creating a disk image). The disk image can then be mounted in order to view the constituent files and directories (as could be done with the original storage device), or it can be analysed using specialist digital forensics tools (to identify deleted files and environmental information relating to the creation of the data).
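A toy sketch of the imaging step, with hash verification to show that the copy is identical at the bit level, is given below. The device path is hypothetical, and real practice relies on dedicated forensic tools and hardware write blockers rather than a script like this.

```python
import hashlib

def image_device(device_path, image_path, chunk_size=1024 * 1024):
    """Copy a raw device byte for byte and return the SHA-256 of its contents."""
    digest = hashlib.sha256()
    with open(device_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()

# Recomputing the hash over 'evidence.img' should give the same value,
# demonstrating a bit-for-bit copy of the original medium.
print(image_device("/dev/sdb", "evidence.img"))
```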
Both the creation of disk images and the use of digital forensic techniques lead to policy issues for preservation. First, there is the issue of storing the potentially large disk images, which may be many gigabytes or even terabytes in size. Second, there is the issue raised by the potentially personal, identifying or confidential information recovered from the storage media, some of which may come from previously deleted files. In this case, preserving institutions need to have policies relating both to the collection of such information from donors, and to restricting access to, or redacting, confidential information [41]. Digital forensics approaches will be addressed in more detail at a later stage of the project.
4.5 SEI in the PERICLES case studies
In the following sections we illustrate the concept of SEI with examples from the Digital Art and Space Science domains (the main areas of interest for the PERICLES stakeholders). Practical applications of these examples can be found in the experimental results section.
4.5.1 Software Based Artworks
The following use case example illustrates the SEI investigation inspired by the Software Based Art scenario from the Tate gallery. In this example a Software Based Artwork (SBA) is to be migrated to a new computer system for the purpose of an exhibition. The software component of the SBA causes a strong dependency on the computer system environment. A description of SBAs and an extensive study of their SPs can be found in [17] and [18]. We assume there is a computer system with a validated SBA installation, which should be preserved so as to be able to configure and emulate the computer system environment as closely as possible for future exhibitions. Preserving only the SBA as a DO cannot solve the problem, as the original appearance and behaviour of the software cannot be reconstructed based only on the metadata that belongs to the DO. The context of executing the SBA’s software for the exhibition installation includes, for example, further dependencies such as external libraries and applications, and data dependencies (data used at run-time by the SBA). However, we have to look further into the whole environment to identify all the information that could be important for this scenario, such as context-external running processes that can affect the availability of resources, or external network dependencies. The determination, extraction and preservation of SEI are essential to solving the problem of enabling a faithful future emulation of the original system. An investigation of the environment information with influence on the SPs of the DO helps to identify the SEI for this use case. This is illustrated in Figure 9.
Figure 9. SEI influences on SP of DO.
An example of SEI influencing the SPs is when software changes its execution speed based on the system resources, since program procedures can adapt their execution speed to the available resources, depending on the programming style. This turns information about system resources into SEI for the purpose of “maintaining execution speed”. Information about display settings, such as colour profile and resolution, the fonts used, and the graphics card and its driver, is SEI that can affect the SBA appearance (the “render DO with faithful appearance” purpose). Changes to programming-language-related software can result in execution bugs or a different speed of execution. Peripheral drivers and settings, or response times that are dependent on the execution speed, can affect the user interaction experience with the SBA.
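A rough sketch of capturing such environment information as a snapshot follows. It uses the Python standard library plus the third-party psutil package (an assumption on our part); the PERICLES Extraction Tool described in chapter 7 is the project's actual approach to automated extraction.

```python
import json
import platform
import psutil

# Snapshot of system specification, resources and running processes: the
# kind of environment information that can become SEI for an SBA.
snapshot = {
    "os": platform.platform(),
    "machine": platform.machine(),
    "cpu_count": psutil.cpu_count(logical=True),
    "total_memory_bytes": psutil.virtual_memory().total,
    "running_processes": sorted({p.info["name"] or "?"
                                 for p in psutil.process_iter(["name"])}),
}

# Preserved alongside the SBA, such a snapshot supports later emulation,
# recompilation, or diagnosis of behavioural changes.
print(json.dumps(snapshot, indent=2))
```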
In order to determine the SEI, each SBA has to be analysed individually, with regard to the use purpose and based on the properties of the artwork and the artist’s view of its SPs. Typical SEI for emulating the environment of an SBA includes information about computer system specifications, available resources, required resources, installed software and software dependencies. Other relevant dependencies to capture can be, for example, all the files that are used during the SBA execution, and peripheral dependencies, which can be identified by analysing the peripheral calls of the SBA. System resource requirements can be estimated on the basis of resource usage, but hardware changes come with the risk of affecting the software behaviour and should be avoided. Another SEI purpose, with a different set of significance weights, arises if the SBA has to be recompiled because of a migration to another platform or to fix malfunctions. Here the SBA behaviour has to be validated by comparison with behaviour patterns measured continuously on the original system in a sheer curation setting. Examples of such measurements are processing timings, log outputs, operating system calls, calls to libraries and dependent external software, peripheral calls and commands, resource usage, user interaction, and video and audio recordings. The latter two can also be used for validating the appearance of the artwork. If the SBA has a component of randomness, it is more difficult to evaluate its behaviour based on the measured patterns. Furthermore, information about the original development environment can be useful for a recompilation, and for identifying the source of a malfunction.
Consequently, there is a great amount of accessible and potentially useful environment information. The decision of what to extract is often complicated by the fact that it is hard to foresee all further uses. Therefore, the whole environment (instead of only a current context) has to be considered, in order to identify the significant information to be extracted and preserved. With the PERICLES Extraction Tool, which will be described in chapter 7, we address ways to extract SEI from the environment in an automated fashion. In some cases the SEI, though significant, lies outside the scope of the information that can be captured automatically by our tool, and will require different approaches. One such example can be found in [17] (our emphasis): “Many software-based artworks depend on external links or dependencies. This might include links to websites, data sets, live events, sculptural components, or interfaces.”
4.5.2 Space science scenario
As one of the two main use cases, the PERICLES project is considering capturing and preserving information relating to measurements of the solar spectrum carried out by the SOLAR payload [43] of the International Space Station. The information includes operational data concerning the planning and execution of experiments, engineering documentation relating to the payload and ground systems, calibration data, and scientific measurements of experiments performed by solar scientists. The ultimate aim of SOLAR is to produce a fully calibrated set of solar observations, together with appropriate metadata. We now consider three examples to illustrate the capture and use of SEI.
In order to validate the experimental observations of the SOLAR instrument, it is necessary to understand the impact of many complex extraneous factors on the instrument.
For example, vehicles visiting the ISS can affect the trajectory of the ISS itself and cause pollution and temperature changes. Such effects are often only uncovered by a long-term analysis of the data by the scientists. Hence, there is a need to capture as much of the environment as possible at the time the observations are made, to enable such analysis. This includes capturing a wide range of complex environment information relating to the instrument, the operational data, the payload sensors and events on the ISS itself. In this case, the purpose of the SEI is to enable critical analysis of the solar observations by the scientists. The significance weights reflect the influence the captured DOs have on the critical analysis task. These weights can change over time, as additional environmental factors with an impact on the scientific data may be uncovered. The SEI (at a given time) will therefore reflect the DOs that are relevant to critical analysis, with an appropriate weighting.
In order to validate the solar measurements made by the SOLAR instrument, frequent comparisons are made with data collected independently by other scientific teams. Often the techniques and instruments are different, which provides a good way to ensure the results are not subject to unwanted effects caused by the experimental methods used. The data from other teams, and the comparisons that have been made, are a valuable part of the environment metadata for the SOLAR data. The PERICLES Extraction Tool can capture the validation experiments themselves, and appropriate metadata will be created. This would include validation scripts and dependencies between subsets of the data, and would constitute (part of) the environment information. The purpose associated with the SEI is the validation of the scientific data by the science community. The significance weights reflect the value of specific data objects in the validation of the SOLAR dataset. The SEI can assist scientists in assessing the quality and reliability of the data produced.
A third example relating to the PERICLES science case study concerns the operational data for the SOLAR experiment, which is primarily created, managed and used by the mission operators, who operate the experiments on the ISS remotely from the ground station. The operations data includes the planning, telemetry and operations logs. Given the huge complexity and volume of the space mission information, a major issue for the operators is information overload. An important task for the operators is to resolve anomalies, which occur when the normal operational parameters of the instrument are exceeded, such as by overheating. Identifying and resolving anomalies often requires extensive research in the archived operations data and documentation. The problem the operators often have is not that the information is unavailable, but that there is so much documentation that it is difficult and time-consuming to locate the information that relates to a specific problem (i.e. further contextual information). Only specific parts of the documentation are relevant, but this has to be learnt from experience, as this type of information is not formally documented. In this case, the digital object to be preserved is the catalogue of known anomalies, and the environment information is the aggregation of all operations data.
The purpose of the SEI is the identification of a specific anomaly. In this case, the significance weights indicate the relevance of a specific DO, such as a piece of documentation for the instrument or an excerpt from the archived telemetry, to the particular anomaly. Thus the SEI provides a way to indicate all the environment information relevant to identifying and debugging a specific anomaly.
4.6 Modelling SEI with an LRM ontology extension
Defining what factors make up the weights is a task for the creators, users and managers of the data, who will be able to use past experience and other criteria to quantify what to collect. The SEI is defined by the following triple:
● the dependency relationship between the source and target DOs,
● the weight of the dependency, and
● the purpose that defines the dependency.
We consider that weights could be the result of a number of factors, combined by a formula that computes a final significance weight; for example, when computing the value of a dependency, one may want to take into account the cost of maintaining it, and its benefit. Weights could also be connected to single DOs or collections of them to express their significance; so the concept of weight could be applied to dependencies as well as to single DOs or entities. It is furthermore conceivable to assign SEI to classes of DOs, instead of single instances, where a class could represent a particular type of DO or follow any type of classification. In this case, SEI could express, for example, the fact that solving anomalies requires the use of mission records and technical manuals (this could be based on statistical data from single instances).
In addition, the SEI may change over time due to external factors (such as a requirement from a funding agency to include a copyright licence, or a need to include the software to access the DO). This introduces the need for a temporal element, which can be represented by versioning the SEI as well as including a timestamp of when the SEI was defined. Since changes in weights over time are likely to be a useful indicator of change in user community interest, keeping track of these changes is important. This would also allow us to address changes in policies that may be reflected in changes to how the composite significance weight is calculated. By versioning the SEI and its weights, we would be able to address these use cases. As early work in this direction, in this section we give some simple examples of SEI, with attention to weights, that should be modelled, and a preliminary approach to its modelling based on the LRM.
4.6.1 Examples in support of weight and SEI modelling
The dependency weights are not the same for all use purposes of a digital object. To substantiate the hypothesis that it makes sense to recalculate them for each purpose, we present three short examples.
Example 1
In many cases building an application can require different libraries for different purposes. For example, a compiled application can be built in ‘optimised mode’ or in ‘debug mode’, and in each case the compilation may require different libraries to serve the different uses. Applications built to work on different platforms may also require different libraries in order to function on those platforms (e.g.
when using the SQLite database library, the Windows build is necessary for Windows operating systems, whereas the Linux library is required for Linux operating systems). There may also be libraries that add specific functionality to the application that is only required for a subset of uses (for example, the ROOT Data Analysis Framework can be built with or without many third-party libraries such as Python, the GNU Scientific Library, etc.). These different configurations are usually encoded in the build programs, which allow the user to select the configuration depending on the use.
Example 2
In this example the digital object under consideration might be any image that is viewed and manipulated in the environment. We consider two purposes: 1. viewing the image; 2. manipulating the image. Several image viewing tools are installed in the environment, along with one image manipulation program called GIMP. Considering the dependency of the image on GIMP, a change of the weighting based on the purpose can be observed. The dependency has a very low weight for the purpose of viewing the image, as there are alternative tools installed, which probably fit the purpose better. For the purpose of manipulating the image, for example with a GIMP-specific filter, the weight becomes very high, as there is no alternative to using GIMP. Returning to the image example, the dependency on GIMP for the purpose of viewing the image could also be very high if no other usable tool is installed on the system. Upon the installation of another image viewing tool, the weight becomes very low, as the dependency is now replaceable.
Example 3
With reference to the scenario of calibrating space data (described briefly in section 4.5.2), and more generally when facing sensor data, raw data coming from instruments has to be calibrated, according to some specific calibration formula and using some calibration parameters, in order to generate calibrated data that can then be used and processed directly. In this simplified example, in the case where a user has the purpose of re-calibrating the calibrated data, the dependencies on the raw data, the calibration parameters and the formula will be strong ones, with a significance weight set to 1 (assuming a scale of 0-1 for significance weights, with the weight representing data significance for the purpose), indicating the absolute need for those data items for that specific purpose. If the user wants to access the calibrated data with the purpose of reading it, the user has no need to access the raw data or the calibration formula and parameters. This will be represented by dependencies with a significance weight of 0.
4.6.2 Modelling SEI with LRM - A preliminary approach
The Linked Resources Model (LRM)7 is an operational middle-layer OWL ontology for modelling the dependencies between digital resources handled by the PERICLES preservation ecosystems. This subsection presents an initial approach to modelling SEI with the LRM. However, since representing SEI requires appropriate constructs for representing weights and purposes, which are currently not included in the core LRM, the latter was extended with the needed classes and properties.
This is a good example of cross-WP interplay, since this work depends on WP3 outcomes (the LRM), while, on the other hand, the derived results will then be “fed into” WP3, possibly extending the LRM with additional functionality. Note that, as the title implies, the work presented here is still at a preliminary stage, with plenty of room for improvements and extensions, most of which are already well underway.
LRM Extensions
Since the LRM currently does not support representing weight-related concepts, the Weighting Ontology [44] (WO) was chosen as an appropriate ontology to import into the LRM. WO constitutes a model for formalizing multi-purpose weight-related notions. The steps for extending the LRM to represent SEI were as follows:
● Class wo:Weight8 was “attached” to pk:Digital-resource and pk:Dependency through property wo:weight.
● Class Purpose is introduced as a new class, attached to wo:Weight through the newly introduced property has-purpose. The reason for associating the purpose with the weight is that each weight corresponds to a different purpose of use for the DO or dependency. Purposes may consist of sub-purposes forming hierarchies; this is achieved via the transitive properties has-sub-purpose and has-super-purpose.
● Each weight has a numerical value [0..1] and a scale.
7 The LRM is thoroughly presented in D3.2 (due M18).
8 The prefix in front of the class name denotes the namespace, i.e. the vocabulary the class is initially defined in. Namespace “wo” corresponds to WO, “pk” corresponds to the LRM, “prov” corresponds to the PROV-O ontology, while the absence of a namespace implies newly introduced constructs.
The following diagram (Figure 10) demonstrates the interplay between the key classes mentioned thus far (the rest of the classes and properties are omitted for brevity):
Figure 10. Extension to the LRM for representing weights and purposes.
Example
For demonstration purposes, we illustrate here the LRM-based representation of example #3 presented previously. According to the example, calibrated data depends on raw data and calibration parameters. This can be represented as a dependency in the “LRM sense” (see object “datacalibration” in Figure 11). The weight of the dependency depends on the purpose of use:
● Calibration parameters and raw data are not needed for reading (using) the calibrated data. Thus, when the purpose of use is “read” (object “purpose-1”), the associated weight (object “weight-1”) has a value of 0.0.
● On the other hand, recalibrating the calibrated data requires the existence of calibration parameters and raw data. Therefore, when the purpose of use is “recalibrate” (object “purpose-2”), the associated weight (object “weight-2”) has a value of 1.0.
Figure 11. Calibration data dependency modelling.
The figure represents the example described above, where the weights represent ‘importance’ and the scale is 0-1, while the yellow text boxes indicate the numerical values of the weights in the specific case, as well as the literal values of the purposes.
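For readers who prefer triples to diagrams, the following rdflib sketch expresses the same example. The namespace URIs are placeholders of our own, standing in for the authoritative LRM and WO vocabularies, and the has-purpose/has-value property names mirror the constructs introduced above without claiming to be their exact identifiers.

```python
from rdflib import Graph, Literal, Namespace, RDF

WO = Namespace("http://example.org/wo#")   # placeholder for the Weighting Ontology
PK = Namespace("http://example.org/lrm#")  # placeholder for the LRM
EX = Namespace("http://example.org/sei#")  # newly introduced constructs

g = Graph()
dep = EX["data-calibration"]               # the dependency of calibrated data
g.add((dep, RDF.type, PK.Dependency))      # on raw data and calibration parameters

# Purpose "read" (purpose-1) carries weight 0.0 ...
g.add((EX["weight-1"], RDF.type, WO.Weight))
g.add((EX["weight-1"], EX["has-purpose"], Literal("read")))
g.add((EX["weight-1"], EX["has-value"], Literal(0.0)))
g.add((dep, WO.weight, EX["weight-1"]))

# ... while purpose "recalibrate" (purpose-2) carries weight 1.0.
g.add((EX["weight-2"], RDF.type, WO.Weight))
g.add((EX["weight-2"], EX["has-purpose"], Literal("recalibrate")))
g.add((EX["weight-2"], EX["has-value"], Literal(1.0)))
g.add((dep, WO.weight, EX["weight-2"]))

print(g.serialize(format="turtle"))
```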
5 Digital Ecosystem (DE) Metaphor and SEI
In this chapter we describe our exploration of the concept of the Digital Ecosystem (DE) and its relationship with SEI from a fundamentally different perspective, cross-checking whether the DE, as a metaphor for the conceptual framework of DP, survives scrutiny. To this end we postulated that a Biological Ecosystem (BE) and a DE are similar enough to share functionally equivalent key concepts, and that genetics is to BEs what DP is to DEs, since both safeguard their respective ecosystems by reconstructing them from inherited information. In the course of building our working glossary, this comparison between the two ecosystems yielded a list of concepts defined in similar ways, so that they could be considered to function in both systems. We suggest that the parallel applies beyond a simple metaphor and can be exploited to derive a better understanding of digital preservation scenarios. Our respective findings are briefly explained in the next section, and were submitted as an iPRES 2014 poster [45].
5.1 State of the art about digital ecosystems
Lately, in a morphological attempt to compare new phenomena to known ones, several authors have proposed using the concept of digital ecosystems [46, 47] or preservation ecosystems [48, 49, 50, 51]. To contribute to DP theory development, we compared biological ecosystems with digital ecosystems. We suggest that DP amounts to reproducing a digital object as a reuse case, where the significant properties of the DO, in tandem with its SEI, reconstruct a copy of the functional original. Accordingly, SPs and their SEI are the digital counterparts of essential genetic information embedded in, and interacting with, the environment in BEs. From this viewpoint we can look at DP in terms of functional genetic analysis, a method that has proved remarkably successful in molecular biology. Such a high-level look may reveal further commonalities between genetic-biochemical vs. linguistic encoding of communicated content. By so doing, one can expect that better DE models, perhaps more naturalistic ones, might emerge.
Underlying our theoretical stance is the argument that a set of key DP concepts, conceptualised within a DP embedding, have respective mappings to equally key concepts in BEs rendered in terms of functional genetics. The comparison leads to a glossary of mutually mapped concepts, not discussed here, that is inspirational for the testing of new ideas in DP. Examples of these concepts include environment, niche, sub-niche, fitness, ecosystem, organism, species, sequential instructions at the gene vs. chromosome level, DNA, evolution, mutation, behaviour, and several more.
Related research shows that many active research programs exploit equivalences between biological objects and digital objects, up to and including, in the position taken by strong artificial life, the assumption of indistinguishability. The latter follows from the recognition that life is not dependent on any particular underlying medium, but is instead a property of information processing structures [52]. Irrespective of abstract unification, computer science assumes biological concepts exemplified as digital objects in genetic algorithms [53], neural networks [54], ant-trails [55], swarms [56], and artificial immune systems [57].
The application areas of this methodological perspective extend to literary science as a genomic project [58], or narrative genomics as a content sequencing approach [59, 60]; and, from a more ecological perspective, to digital ecosystems [46, 47], digital business ecosystems [61], digital preservation ecosystems [48, 49, 50] and digital forensics [62], apart from DP explicitly using DNA-based encoding of media content as one way to its very long-term preservation [63, 64].
A framework incorporating biological concepts and terminology can result in confusion if the context is not well specified. DP is uniquely in a position of having to deal simultaneously with separate contexts arising out of the above subject areas, while at the same time conforming in its own right to the idea of the ecosystem, containing digital objects which replicate, behave, consume resources, mutate, get selected, and evolve. This leads to a meta-view of biology-based informational concepts with DP at its centre. In this sense, we felt encouraged to consider DOs as “digital organisms” filling specific niches in a DE.
5.1.1 The DE metaphor in PERICLES research activities
Analytical research approaches to DE content in PERICLES use two knowledge representation paradigms: graphs, prominently underlying the LRM and Topic Maps, and matrices feeding into machine learning methods. The two methods are expected to enable lossless, i.e. relationship-preserving, ontology conversion to, and construction from, matrices. The key concept in both is relations between the objects, in biology expressing an interaction.
5.1.2 Interaction
The study of biological ecosystems, such as in genetics, benefited from the concept of interaction maps. Interaction in causal systems is tentatively defined as follows: “Let P be a process that, in the absence of interactions with other processes would remain uniform with respect to a characteristic Q, which it would manifest consistently over an interval that includes both of the space-time points A and B (A − B). Then, a mark (consisting of a modification of Q into Q*), which has been introduced into process P by means of a single local interaction at a point A, is transmitted to point B if [and only if] P manifests the modification Q* at B and at all stages of the process between A and B without additional interactions.” ([65], with more definitions available). Examples of its variants in evolutionary genetics are given in [66]9. In PERICLES we use this concept in the sense of measurable contact between two entities, such as by values in the cells of a feature-object matrix. Feature-feature, object-object, feature-environment and object-environment relationships can all be considered interactions; therefore the way a DO and its SEI interact can receive a formal treatment. Moreover, dependency expressed by weighted graphs is another expression of interaction, making our research strategy parsimonious. Interaction maps can be computed by many methods, as shown both in bioinformatics [67] and in the automatic categorization of DOs [68]. More importantly, just as one can compare evolving weighted graphs over time to express changes in the structure of SEI vs. DO interactions, one can do the same with evolving interaction maps, in compliance with the evolving semantics topical constraint of PERICLES.
Finally, a series of interaction maps can also be animated, leading to a generic change-tracking tool in DP, for example for analysing provenance. This means that interactions pertain to all levels in an observed DE, and change tracking by the same conceptual toolkit can address different components of the system.
9 See also: “A long-standing dispute in evolutionary biology concerns the levels at which selection can occur (Brandon 1996, Keller 1999, Godfrey-Smith and Kerr 2001, Okasha 2004). As it turns out, there is no one process termed “selection.” Instead there are two processes—“replication” and “environment interaction”. Some authors see this dispute as concerning the levels at which replication can take place. Dawkins argues that replication in biological evolution occurs exclusively at the level of the genetic material. Hence, he is a gene replicationist. Other authors take the levels of selection dispute to concern environment interaction and insist that environment interaction can take place at a variety of levels from single genes, cells and organisms to colonies, demes and possibly entire species. Organisms certainly interact with their environments in ways that bias the transmission of their genes, but then so do entities that are both less inclusive than entire organisms (e.g., sperm cells) and more inclusive (e.g., beehives).” (http://plato.stanford.edu/entries/replication/).
5.2 A parallel between SEI and environment in Biological Ecosystems
In the PERICLES project, departing from the concept of an entity fitting its niche, we analysed the kinds of information extractable from a DO and its environment to improve its chances of being useful in the long term. From this perspective we came to the conclusion that it is possible to extend the SP framework beyond the DO to its environment. We defined this additional source of knowledge, in section 4.2, as the SEI for a DO. SEI is a broad super-set of the existing SPs, from which we adopt the concept of intended purpose, but extended to the whole DO environment and not just to the DO’s intrinsic properties. By adopting the practice of sheer curation, i.e. a situation in which curation activities are integrated into the workflow of the researchers creating or capturing data [49, 50], we considered how to examine the environment to discover and extract those dependencies that are crucial for different users for the use and reuse of DOs. Sheer curation is a perfect parallel to the situation in biology where organisms cannot be observed reliably outside of their niches; observing them elsewhere results in an unavoidable loss of important information. In our working hypothesis, the DO was the organism, with the binary data representing it considered as its DNA, and the natural environment where this organism lived corresponded to the system environment. To map their connectedness, a software agent, for example our PERICLES Extraction Tool, observed and collected information about interactions between the DO and its immediate surroundings. By observing such interactions one can obtain a series of observations for further analysis and recognise functional dependencies.
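As a toy illustration of how such observations might be analysed, the sketch below turns recorded use sessions into object-object interaction counts, from which candidate dependencies can be read off. The session data and the 0.8 cut-off are invented for illustration only.

```python
from collections import Counter
from itertools import combinations

# Each observation: the set of objects touched during one recorded use session.
sessions = [
    {"sba.exe", "libgl.so", "config.xml"},
    {"sba.exe", "libgl.so"},
    {"sba.exe", "libgl.so", "assets.dat"},
]

pair_counts = Counter()
use_counts = Counter()
for session in sessions:
    use_counts.update(session)
    pair_counts.update(frozenset(p) for p in combinations(sorted(session), 2))

# Objects that co-occur in most sessions with 'sba.exe' suggest a dependency.
for pair, n in pair_counts.items():
    if "sba.exe" in pair:
        (other,) = pair - {"sba.exe"}
        strength = n / use_counts["sba.exe"]
        if strength >= 0.8:
            print(f"sba.exe -> {other} (interaction strength {strength:.2f})")
```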
Examples of what can be necessary to make use of the object in the future, depending on the purpose, include system properties (hardware and software configuration), resource usage, implicit dependencies on documentation, use of external data, processes and services (including provenance information), and so on. Such information cannot be reliably reconstructed after the DO is archived. It has to be extracted from the "live" system while the user is present, and preserved together with the DO. As a concrete example, every piece of software has a distinct way of representing and processing DOs, and merely changing the software version, let alone the software itself, can have an impact on the DO's properties.

By now it should be apparent that SEI is a very close analogue of a niche in BEs, also with regard to its definition by functional analysis, which determines what information is necessary for a particular function of an organism or a DO. Since SEI, as an extension of SPs, is purpose-dependent, we can extract such informational dependencies so that for every purpose we have a specific sub-niche, with a general niche that is the super-set of the dependencies for all of the considered purposes. Getting down to details, we note in passing that although ablations would allow a more detailed determination of the SEI, one cannot run such processes automatically, because our environment includes end users who have time constraints and are unwilling to run the same experiments on very similar data several times. For this reason we need to observe what is happening and infer higher-level dependencies and their significance on that basis. Also, the niche we identify will consist only of the information that matters, not of all the available information.

Similarly to interaction maps, dependency graphs for the representation of SEI aim to determine concisely the significant interactions between purpose (function) on the one hand, and DO and environment properties as dependencies on the other. For instance, we may collect provenance information with the aim of inferring functional and documentation-related dependencies between DOs, but with the objective of not preserving complete provenance information per se (as not everything may be significant). A typical use case that requires the extraction of SEI is the creation of a new environment similar to the old one in such a way that it still fits the SP of the DO, just as the genome of a fossil could be revived under the right laboratory circumstances. A matching real-life scenario is offered in the next section.

5.2.1 The metaphor applied to the SBA example

To substantiate the practical applicability of the introduced biological metaphor, below we apply it to the DP example of preserving an SBA, as described in 4.6.1. SBAs are defined as artworks with a software component, which makes their behaviour and appearance highly dependent on their DE. In this example, we consider the situation where an SBA has to be installed on a new DE for the purpose of an exhibition. To this end it is important to ensure that the SPs of the SBA (described in more detail in [17] and [18]), which are influenced by the conditions of the DE, are matched by the new system environment. This situation is comparable to that of an organism moved from its niche in the natural BE to a newly created BE.
The niche can be emulated by replicating the original surroundings as precisely as possible, with natural flora, fauna and resource settings, or by an artificially modelled environment with substituted components, for example when the organism is migrated to a zoo. To ensure its wellbeing and support its natural behaviour, the SEI of the niche that influences these conditions has to be preserved.

Figure 12. The biological metaphor.

Many components and processes that constitute the ecosystem, biological as well as digital, have a great impact on the SPs of its entities. For example, the behaviour of an SBA item depends on the available system resources, just as an organism depends on food availability: upgrading computer system memory can affect the execution speed of software methods as much as increasing food availability in the niche affects the foraging rate of an organism. Consequently, for the relocation of an SBA object, as for that of an organism, one has to determine what the SEI is, i.e. we have to extract it from the original ecosystem for an accurate emulation of the DE (Figure 12).

Even if the SBA does not depend on other running software, it is affected by the interaction of that software with its DE, similarly to how organisms affect other living entities by manipulating certain ecosystem features. In the context of emulation, the peripherals used, their configuration and their drivers have to be preserved as well, to conserve the user interaction experience; this is again similar to the situation where the interaction of the organism with other BE entities can be affected by a feature change in the BE. The graphics drivers, libraries and display configurations, such as resolution and contrast, all belong to SEI because they influence the display of the SBA and thereby the perception of the viewer, just as an observer's picture of an organism is affected by the settings of the camera. Here the observer is considered part of the ecosystem.

In a BE, behaviour patterns of organisms can be observed e.g. during foraging, mating, and social interactions. Likewise, for the move to the new environment it is important to validate the behaviour of the moved object in order to test the reconstructed niche components. This can be done by comparing behaviour patterns with those measured in the original niche (see the sketch below). Patterns for an SBA item can be extracted from measurements of system resource usage, external API calls, peripheral calls and process timings, plus the analysis of log messages. Furthermore, video recordings and documentation of the SBA installation can be used as well. Changes to other ecosystem entities that interact with the SBA (for example a runtime environment change, another version of the programming language installation, or a different storage backend) can result in execution bugs that are comparable to the abnormal behaviour of a living being caused by changes in its social interactions with other beings.

Information about SBA development is useful for migration, since it can become necessary to recompile new binaries from the source code, depending on the programming language used and the new system environment. This situation is comparable to an organism that is bred or cloned for better adaptation to changed niche conditions, so that the outcome preserves the quality and resilience of the migrated object.
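As an illustration of such behaviour-pattern validation, the following minimal Java sketch (hypothetical data and tolerance; not PET code) compares a resource-usage trace recorded in the original environment with one measured after relocation:

```java
/** Minimal sketch: validate a reconstructed environment by comparing a
 *  behaviour pattern (here, a CPU-usage trace sampled at fixed intervals)
 *  against the pattern measured in the original niche. */
public class BehaviourValidator {

    /** True if every sample deviates by less than `tolerance` (relative)
     *  from the corresponding sample of the original trace. */
    static boolean matches(double[] original, double[] relocated, double tolerance) {
        if (original.length != relocated.length) return false;
        for (int i = 0; i < original.length; i++) {
            double deviation = Math.abs(relocated[i] - original[i]) / Math.max(original[i], 1e-9);
            if (deviation > tolerance) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        double[] original  = {12.0, 55.3, 48.7, 90.1};  // % CPU in the original DE
        double[] relocated = {13.1, 57.0, 47.9, 88.5};  // % CPU after relocation
        System.out.println("Niche reconstruction valid: "
                + matches(original, relocated, 0.10));   // 10% tolerance
    }
}
```

In practice the compared patterns would also cover API calls, peripheral calls and process timings, as listed above, and the tolerance would depend on the SPs of the artwork.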
We present a comparison table in Table 2 with the results of our investigations.

Table 2. A comparison of the two types of ecosystems (DE and BE) and their components.

SBA in DE – SEI | Organism in BE – SEI | Important for
Available system resources | Food (energy), territory (space) | Resources for environment reconstruction
Used system resources | Current environmental quantities | Resource status
Process timings | Behavioural events | Behaviour pattern validation
Software dependencies: drivers, libraries, configuration, peripheral settings | Essential interactions between organisms, i.e. symbiotic partners, trophic associations | Local interaction dependencies
Running software without dependency | Other organisms that live in the same ecosystem but in a different niche | System interaction dependencies
Development and programming language installation | Growth, maturation (i.e. metamorphosis) and reproduction | Influences on creation, running, adaptability
Graphic settings | Setting of a camera trap to observe organism activity | Appearance, perspective, overview
PERICLES Extraction Tool – sheer curation | Integrated environmental monitoring system resulting in a documentary film about the habitat of the organism | SEI monitoring and extraction techniques

6 Mathematical approaches for SEI analysis

With reference to the importance of anomaly detection in space data (section 4.2), related work is now either on-going or being planned in PERICLES in three directions:
1. SEI extraction of anomaly solution dependencies by PET (as described in section 8.1.1 of this deliverable);
2. Automatic anomaly detection by different methods;
3. Semantic analysis of anomaly reports.
Here we briefly discuss the theoretical background of automatic anomaly detection, with experimental results detailed in Chapter 8.

6.1 Problem description

The data set to be analysed originates from measurements made on the International Space Station (ISS). Our task is to find outliers in the time series of measurements, including early signs that would lead to these outliers. Since the outliers are anomalies, they occur in situations that are not normal for the operation of the space station. This problem is an application of the methodology known as anomaly detection to the supervision of complex systems.

6.2 Solution: sensor networks

Data streaming from a network of sensors has unique characteristics. Anomalies in the collected data can mean either that one or more sensors are faulty, or that the sensors are detecting unusual events worthy of further analysis. We must be able to identify both cases [69]. A sensor network might include sensors that collect different types of data, such as binary, categorical, integer, continuous, audio, or video. Hence measurements at given time stamps often have a heterogeneous structure. The environment of sensor deployment and the communication channel may also introduce noise, and missing values can occur as a result. Given these characteristics, it is no surprise that anomaly detection in sensor networks has its own unique set of challenges. In a best-case scenario, anomaly detection techniques are expected to operate online, as opposed to offline batch processing.
Online processing is made even harder by resource constraints on computational power: any deployed algorithm must have low time and space complexity.

6.3 Contextual anomalies

In general, anomalies fall into three major categories [70]:
● Point anomalies, which are single data records deviating from others;
● Contextual anomalies, which occur with respect to a context;
● Collective anomalies, where a subset of the data instances causes the anomaly.

In sensor network data, the temporal nature of the measurement introduces the context; hence we are looking for a solution that addresses contextual anomalies in the above classification. In a contextual data set, each data instance is characterized by two types of attributes [69]:
1. Contextual attributes, which define the context or neighbourhood for a particular instance. In sensor network data we are facing a time series; hence time is a contextual attribute which determines the position of an instance in the entire sequence.
2. Behavioural attributes, which describe the non-contextual characteristics of an instance. In a sensor network, these are the measurement data streaming from the individual sensors at a given point in time.

Anomalies are identified using the values of the behavioural attributes within a specific context. A data instance might be normal in certain contexts, whereas a similar or identical instance could be a contextual anomaly in another context. This is key to understanding contextual and behavioural attributes in a contextual anomaly detection technique.

Just as in the broader discipline of machine learning, we differentiate between labelled and unlabelled anomaly detection, which leads to two main paradigms:
1. Supervised learning: an algorithm is given samples that are labelled. For example, the samples might be microarray data from cells, and the labels indicate whether the sample cells are cancerous. The algorithm takes these labelled samples and uses them to induce a classifier. This classifier is a function that assigns labels to samples, including samples that the algorithm has never previously seen;
2. Unsupervised learning: in this scenario, the task is to find structure in the samples without labels. For instance, finding clusters of similar instances in a growing collection of sensor measurements might identify outlying elements that are anomalous.

Supervised anomaly detection suffers from class imbalance: typically only a tiny fraction of the examples is labelled as anomalous. Hence a trivial rejector, an algorithm that attaches a non-anomalous label to every instance, might score nearly a hundred percent accuracy, yet be entirely useless at identifying the anomalies. Performance measures must take class imbalance into account, and so must the learning algorithms. An additional problem with labelling is that some anomalies in the data have loose labels. For instance, instead of identifying three anomalies in sequence in a time series of multivariate measurements, we might only get an approximate identification of the time interval in which the anomalies occurred. This amounts to inherently high label noise, which renders most convex supervised learning algorithms unsuitable; therefore non-convex algorithms must be considered.
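To illustrate the unsupervised, contextual setting in the simplest terms, the following Java sketch (illustrative only; not the method used in PERICLES) flags a measurement as a contextual anomaly when it deviates strongly from the statistics of a sliding window over its recent neighbourhood:

```java
/** Minimal sketch: flag contextual anomalies in a univariate time series by
 *  comparing each value against the mean and standard deviation of the
 *  preceding window, i.e. its temporal context. */
public class ContextualDetectorSketch {

    static boolean isAnomalous(double[] series, int t, int window, double k) {
        int from = Math.max(0, t - window);
        int n = t - from;
        if (n < 2) return false;                    // not enough context yet
        double mean = 0, var = 0;
        for (int i = from; i < t; i++) mean += series[i];
        mean /= n;
        for (int i = from; i < t; i++) var += (series[i] - mean) * (series[i] - mean);
        double sd = Math.sqrt(var / n);
        return Math.abs(series[t] - mean) > k * Math.max(sd, 1e-9);
    }

    public static void main(String[] args) {
        double[] series = {20.1, 20.3, 19.9, 20.2, 20.0, 35.7, 20.1};
        for (int t = 0; t < series.length; t++)
            if (isAnomalous(series, t, 5, 4.0))     // 4-sigma rule over a window of 5
                System.out.println("Contextual anomaly at t=" + t + ": " + series[t]);
    }
}
```

The same value 20.1 is normal at t=0 and t=6, while 35.7 is flagged at t=5 purely because of its context; no labelled training set is required.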
Most anomaly detection algorithms detect point anomalies, and it is often easier to transform contextual and collective anomalies into point anomaly problems by additional pre-processing. As contextual anomalies are individual data instances just like point anomalies, but are anomalous only with respect to a specific context, the straightforward approach is to apply a known point anomaly detection technique within a context. Following this train of thought, the reduction to point anomaly detection has two steps. First, identify a context for the data instances using the contextual attributes (time stamps in the case of sensor networks). Secondly, calculate the anomaly score or label the test instance within this context using a known point anomaly detection technique.

A transformation technique for time-series data uses phase spaces, a form of windowing on multivariate data [71]. This involves converting a time series into a set of vectors by unfolding the time series into a phase space using a time-delay embedding process. The temporal relations are embedded in the phase vector across all instances. This results in a traditional feature space that can be used to train support vector machines (SVM), one-class SVMs in particular.

Mapping a contextual problem to point anomaly detection is not always possible, easy, or straightforward; this is often the case for time series. Direct methods must then be considered, which normally extend time-series modelling techniques. A simple example is multivariate regression, which tries to predict the next value in the time series; a failed prediction indicates an anomaly. It is not always clear, however, which attributes should be predicted.

With these considerations we designed an experiment, which has produced encouraging results (discussed in Chapter 8).

7 The PERICLES Extraction Tool

7.1 Introduction

We have developed the PERICLES Extraction Tool (PET) entirely10 within the scope of the PERICLES project, for the extraction of information from the environment where DOs are created and modified. While other projects have looked at sheer curation for very specific domains and use cases [22, 24, 25], we have built a generic, modular framework that can be adapted to support different use cases and domains through specific modules and configuration profiles. Our tool is focused on information extraction, while others target different aspects of information curation. We have also addressed the context of unstructured workflows, where the user is not adopting any workflow system, making it important to observe the flow of events in an agnostic framework.

In a nutshell, PET works by analysing the use of the data from within the creator or consumer environment, extracting information useful for the later reuse of the data that cannot be derived in later phases of the data lifecycle, for example at ingest time. PET is open source (Apache licensed, final approval pending at the moment of writing) software written in Java and developed to be platform independent, soon to be available on GitHub at the address: https://github.com/pericles-project/pet.
It implements various information extraction techniques as plug-in Extraction Modules, either as complete implementations or, where possible, by reusing existing external tools and libraries. Environment monitoring is supported by specialized monitoring daemons and by the continuous extraction of relevant information triggered by environment events related to the creation and alteration of DOs, such as the alteration of an observed file or directory, the opening or closing of a specific file, and other system calls. The tool can be used in a sheer curation scenario, running in the system background under the full control of, but without disrupting, the user. Furthermore, a snapshot extraction mode exists for capturing the current state of the environment, which is mainly designed to extract information that does not change frequently, such as system resource specifications.

An advantage of using PET is that established information extraction tools can be integrated as modules. A user who previously had to use many different tools to extract the information and metadata for a scenario can use our tool as a framework instead, and will get all the required information in one standard format, saved via a selectable storage interface. Furthermore, this approach makes it possible to enrich the established set of information with additional information extracted by other PET modules. A tutorial for the use of PET has also been created and is available11.

10 Future versions of the tool could contain source code from developers external to PERICLES, given the open source software licence.
11 http://goo.gl/hvT5rm

7.2 General scenario for SEI capture

In order to further clarify the tool description, we briefly describe here the general scenario for information capture that we are aiming at with our PERICLES Extraction Tool. In this scenario we observe and collect environment information from a user's computer as he interacts with DOs for different purposes. The tool is installed with the agreement, and under the full control, of the user. We want to look individually at the environment changes as the user e.g. calibrates some data, runs unstructured analysis workflows, creates new DOs, and in different ways makes use of the data by access, interaction, and transformation.

We have the following overall objectives, each of which depends on the previous one:
1. Use the PET tool to collect environment information when the DOs are used, based on specific environment information profiles (based on the use case);
2. Analyse the information collected to infer new relationships between DOs;
3. Assign values to the dependencies based on the purpose and significance (significance weights).

In the rest of the chapter we describe PET in its current development status, which covers mostly the first objective and starts to address the second.

7.3 Provenance and related work

Provenance information is a type of metadata that is used to represent the history of the transformations and events of a DO. As part of our scenario, some of the data collected will be in the form of provenance information. We are exploring how this processing of the DO's history can help us infer dependency relationships, as described in more detail in paragraph 5.3.
Our tool's final aim is to collect SEI, represented by the relationships between the original DO and the information needed for making specific use of the object (such information can be extracted, or just kept as a reference if it already exists as a DO). In contrast, provenance addresses the history of the DO: how it was created, from what information, and how it has been transformed. Unlike provenance information, the dependencies we target are not tied to single events, nor are they reports of what has happened. However, it will still be possible to use our tool for collecting useful provenance information, although that is not our main focus.

During the development of PET, we considered various provenance collection tools, to investigate whether they could be helpful for our use cases. One such example, SPADE [72], is a cross-platform tool for collecting provenance information. Its architecture [73] is similar in some ways to the one we independently designed for PET, with "reporters" that have a role similar to that of our modules. However, SPADE and its modules are focused on collecting provenance information, and do not cover the variety of information we are addressing with the PET tool. We are also trying to limit the amount of information to the degree useful for determining SEI, and we found that the existing SPADE modules are not a good match for this (although some of the techniques used are similar).

In [74] the scenario is different: a scientist is running experiments using a formal workflow system (Taverna [75]), and the aim is to preserve the process used by the experiment. While this is of course a worthy approach, it differs from our intent, as it is based on a scenario where the user formally defines the workflow and other relevant information. In our case, we attempt to capture the process in an existing environment, where a formal workflow may not be defined.

The TIMBUS project [4] encompasses the extraction of context information about business processes to support their long-term preservation. The collected environment information should "allow redeployment of the system at a point in the future." Furthermore, TIMBUS investigates the dependencies related to this use case. Long-term preservation of business processes could be an excellent use case purpose for SEI extraction. Combining the recently released TIMBUS extractors12 with the PERICLES Extraction Tool would be a worthwhile task for further development, improving PET's support for business-process-related scenarios. Conversely, some of the PET extraction modules could also be useful in the TIMBUS scenario. Currently, the extractors available in the TIMBUS context population tool13 differ from, and do not overlap with, those provided by PET; they cover software environment dependencies (Debian packages, DLL dependencies on Windows), physical contexts (sensors for location and weather), informal business processes through the analysis of log files, and network and service contexts. The PET tool addresses a different perspective, based on sheer curation and on observing user interaction with the data in order to determine different use dependencies, as opposed to contextual information for process re-instantiation.
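To make the contrast concrete, the following minimal Java sketch (hypothetical record format; not PET's actual data model) shows how dependency relationships, rather than the raw event history, can be derived from provenance-style observations: a file written by a process is assumed to depend on the files that process has read:

```java
import java.util.*;

/** Minimal sketch: derive "output depends on input" relationships from
 *  observed (process, file, action) events instead of preserving the raw
 *  event history itself. */
public class DependencyInferenceSketch {

    static class Event {
        final String process, file, action;
        Event(String process, String file, String action) {
            this.process = process; this.file = file; this.action = action;
        }
    }

    static Set<String> inferDependencies(List<Event> events) {
        Map<String, Set<String>> reads = new HashMap<String, Set<String>>();
        Set<String> deps = new LinkedHashSet<String>();
        for (Event e : events) {
            if ("read".equals(e.action)) {
                if (!reads.containsKey(e.process)) reads.put(e.process, new HashSet<String>());
                reads.get(e.process).add(e.file);
            } else if ("write".equals(e.action) && reads.containsKey(e.process)) {
                for (String input : reads.get(e.process))
                    deps.add(e.file + " depends on " + input);
            }
        }
        return deps;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("octave", "calibration.m", "read"),
            new Event("octave", "raw_data.csv", "read"),
            new Event("octave", "results.txt", "write"));
        for (String d : inferDependencies(events)) System.out.println(d);
        // results.txt depends on calibration.m
        // results.txt depends on raw_data.csv
    }
}
```

The inferred edges are what SEI collection retains; the individual events could then be discarded if they are not significant in themselves.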
12 See: http://timbusproject.net/resources/blogs-news-items-etc/timbusnewsletter/timbusnewsletter32/260timbusnewsletter32
13 https://opensourceprojects.eu/p/timbus/context-population/extractors/debian-software/

7.4 PET general description and feature overview

The tool (Figure 13) aims to be generic: it is not created with a single user community or use case in mind, but can be specialized with domain-specific modules and configuration. A customized selection of Extraction Modules can be made for each purpose, and most of the modules can be configured to adapt to different use cases. For highly specialized use cases it is possible to develop custom Extraction Modules and include them in the tool. For this purpose, a template exists that sketches and documents the development of an Extraction Module.

In the extraction process, two different types of environment information can be distinguished: file-dependent information, and environment information that is valid for all files in the environment. Both types can be either volatile, meaning that they change frequently or upon specific events, or constant, remaining unchanged for a long time or forever. An example of volatile information is the current system resource usage; if it was not extracted at the moment of interest, it cannot be extracted afterwards. The system resource specification, by contrast, is constant information that only changes if new system hardware is installed.

A system environment contains a plethora of potentially useful information, so discovering the most relevant and significant subset is important. With PET it is possible to create extraction profiles for different purposes to manage this information diversity. Profiles contain a set of investigated files belonging to digital objects, and a set of extraction modules configured to fit the purpose. Future developments will include significance evaluation based on weighted graphs for the creation and selection of extraction profiles. A connection to the information encapsulation tool developed in Task 4.2 will also be provided, to allow the encapsulation of the extracted information together with the related DOs.

It is important to note that the major aim of the tool has been to enable the collection of the relevant information from the live environment, in response to relevant events. The raw data collected will be further analysed by the tool in later tasks of the PERICLES project to derive higher-level SEI. The extracted SEI can be useful for various use and reuse scenarios and other specific purposes. We are also investigating techniques to encapsulate the extracted SEI together with the related DOs in Task 4.2, in order to avoid data loss. These techniques will be implemented in a further PERICLES tool, which will interact directly with PET.

Figure 13. PET snapshot.

PET feature overview

The following list outlines the main features of the PET tool:
● Extracts information that is usually ignored by current metadata extractors.
● The environment information extracted from outside the digital object enables further reuse possibilities.
● Extracts information at the right time and place: within the production environment at the time of occurrence.
● Supports continuous extraction in a sheer curation scenario.
● Visualizes information change over time.
● Information snapshot extractions allow a quick overview of the extractable environment information.
● Open source: Apache 2 license.
● Platform independent (requires Java 7).
● Modular and extendable architecture that supports specialized needs.
● Use profiles allow parallel usage for different scenarios.
● Supports configuration exchange with other users.
● Provides a graphical user interface, but can also run without graphics in console mode.
● Provides an exchangeable storage backend.
● Saves results in a standardized format: JSON or XML.
● Provides different views for browsing the extraction results.

7.5 Requirements analysis and software specification

This section works out the most relevant requirements for the 4.1.2 extraction techniques and framework. The aim is to develop a software specification to drive the design of a framework architecture that addresses the results of research in Task 4.1 (while taking into account the development planned in T4.2) and our current use case requirements from WP2. The fundamental requirements are outlined in the DOW. Additional requirements are derived from the rapid case studies of the use case partners, from the analysis of user requirements in D2.3.1, and from discussions with our partners. Further requirements emerged during the development process as feedback from experiments, demonstrations and a workshop, and as ideas resulting from the prototyping. A complete list of the requirements is reported in Appendix A.

Discussions with our partners during the development process, based on demonstrations of early software prototypes, use case experiments, and a workshop, provided us with feedback that we used to refine the specification and to implement specialized extraction techniques. These sessions also produced the idea for the two use case scenarios that are specially adapted to the extraction tool: the Software Based Artwork scenario, which we developed in coordination with Tate, and the SOLAR Operator scenario, which was built together with our PERICLES partners B.USOC and SpaceApps. Both are described earlier in this deliverable, in section 4.5.

From these extracted requirements we created the specification for the software architecture. Note that the development process ran in parallel with the related research, so we chose an agile approach and iteratively adjusted the specification a few times. We specified two extraction modes: (a) a continuous extraction mode to meet the requirements of a sheer curation scenario, and (b) a snapshot extraction mode to extract information post hoc, as needed for e.g. most digital forensic techniques. In order to support different scenarios with high configurability, we specified that the extraction techniques be developed as plug-in modules, using the JSON standard for their configuration (a sketch of such a configuration follows below). For each scenario a selection of these techniques should be made and saved in a profile. This also opens the future possibility of suggesting a configuration via weighted graphs, even though these are not implemented yet. The environment monitoring daemons were likewise specified as modular, to fit the different use cases. An event management component handles the events triggered by the daemons and invokes the related extractions.
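As an illustration of this configuration style, the sketch below serializes a hypothetical module configuration to JSON using the Jackson API, which PET employs for serialization (the field names are invented for the example and do not reflect PET's actual schema; the import shown is for Jackson 2.x):

```java
import java.util.*;
import com.fasterxml.jackson.databind.ObjectMapper;

/** Minimal sketch: a plug-in module configuration serialized to JSON.
 *  Field names are hypothetical, not PET's actual schema. */
public class ModuleConfigSketch {

    public static void main(String[] args) throws Exception {
        Map<String, Object> config = new LinkedHashMap<String, Object>();
        config.put("moduleName", "DirectoryMonitor");
        config.put("monitoredDirectory", "/home/user/experiment/output");
        config.put("events", Arrays.asList("CREATE", "MODIFY", "DELETE"));
        config.put("continuous", true);

        // Jackson turns the map into a JSON document that can be stored in
        // a profile and exchanged with other users.
        String json = new ObjectMapper().writerWithDefaultPrettyPrinter()
                                        .writeValueAsString(config);
        System.out.println(json);
    }
}
```

Because the configuration is plain JSON, a profile can be edited by hand, versioned, and shared between PET installations.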
We decided to use the Java programming language in order to develop the tool independently of the operating system, which also addresses user requirement UR-AM-BDA-02. Furthermore, we specified an exchangeable storage for the extracted information, with an interface for it, to cope with different environments and needs. By default the extracted information is saved in JSON format, which facilitates its use by other tools, such as the T4.2 encapsulation tool. Finally, we specified a command line interface and, optionally, a graphical user interface, but the aim was to develop the tool in such a way that every configuration could also be made manually in configuration files. The configuration is always saved for future executions, so that the extraction process can be automated.

The final architecture design based on this specification is described in detail in the next section, together with a list of the extraction techniques actually implemented.

7.6 Software Architecture

In this section we first provide a general description of the software architecture of PET. The software components, implementation techniques, and the flow of interaction between components are then described in more detail. Figure 14 illustrates a simplified sketch of the main components of PET.

Figure 14. PET architecture sketch.

Extraction Modules define how information can be extracted from the system environment, including customized algorithms and system calls for different operating systems, and the usage of external tools and libraries. They can be configured to fit customized needs. The tool's Extractor executes the Extraction Modules during an information extraction run, and they return the extracted information. PET manages lists of Digital Object-Parts, which represent important files in the tool's structure. They are organized into Profiles by the user, or by templates that support specific purposes. Profiles also hold a set of configured Extraction Modules, which extract the SEI related to the Profile's Digital Object-Parts. The last components displayed in Figure 14 are the Environment Monitor Daemons, which observe the computer system environment for designated events and, when such events occur, trigger extractions by other Extraction Modules or other event handling processes. Figure 15 shows a more detailed schema of the PET architecture; please refer to Appendix B for a detailed description of each part of the architecture.

A typical flow of user interaction, and of interaction between the tool's components, follows. The user executes the tool for the first time, without defining start commands. A screenshot of the PET tool with its modules can be seen in Figure 16. Consequently, the Extraction Controller Builder builds a default Extraction Controller, which initializes the other controllers and the CLI and GUI user interfaces. The user sees a default Profile in the GUI, but creates an empty new Profile for his own purposes. He uses the GUI to add files, which are parts of important Digital Objects, to that Profile. Internally, the following process is executed: the paths of the files to be added are passed to the Profile Controller, where they are validated and wrapped in a Part data structure. These Parts are added to the selected Profile, and the interfaces are then updated.
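Before continuing the walkthrough, the following minimal Java sketch illustrates the plug-in abstraction just described (the names are hypothetical; the real PET interfaces differ in detail):

```java
import java.util.*;

/** Minimal sketch of the plug-in architecture: modules extract information
 *  and an Extractor runs them over a Profile's files. Hypothetical names. */
interface ExtractionModule {
    /** True for per-file information, false for environment-wide information. */
    boolean isFileDependent();
    /** Extract information; `file` is null for file-independent modules. */
    Map<String, String> extract(String file);
}

class Extractor {
    /** One snapshot extraction run over the profile's Parts and modules. */
    void run(List<String> parts, List<ExtractionModule> modules) {
        for (ExtractionModule m : modules) {
            if (m.isFileDependent()) {
                for (String part : parts)
                    store(part, m.extract(part));        // result saved per Part
            } else {
                store("environment", m.extract(null));   // result saved per Profile
            }
        }
    }
    void store(String key, Map<String, String> info) {
        System.out.println(key + " -> " + info);         // stand-in for the storage backend
    }
}

public class ExtractorDemo {
    public static void main(String[] args) {
        ExtractionModule fileNameLength = new ExtractionModule() {
            public boolean isFileDependent() { return true; }
            public Map<String, String> extract(String file) {
                Map<String, String> r = new HashMap<String, String>();
                r.put("nameLength", String.valueOf(file.length()));  // dummy metric
                return r;
            }
        };
        new Extractor().run(Arrays.asList("artwork.exe", "config.xml"),
                            Collections.singletonList(fileNameLength));
    }
}
```

Real modules would implement, for example, the sigar-based resource snapshots or the lsof monitoring described in section 7.8.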
During the next step, the user adds Extraction Modules that fit the use case, using the corresponding GUI dialog. Internally, the following steps take place: to display the dialog, the GUI requests the list of available Extraction Modules. This list is provided by the Module Controller, which scanned for all Extraction Module classes at tool start and created a set of Generic Modules. After the user selects which modules to create, Extraction Module instances are created from the Generic Modules and added to the selected Profile.

Figure 15. PET detailed architecture.

Most Extraction Modules need a configuration before they can be executed. The user browses the configurations of the added modules and adjusts them to fit the scenario. Each configuration is saved as a Module Configuration class. Now all configurations are ready for the first extraction. The user decides to run a single snapshot extraction, to get an overview of the environment information. The Extractor therefore executes all Extraction Modules of the Profile. The file-independent modules return more general information, which is stored in the Environment class of the Profile. Each file-dependent module is executed once for each Part of the Profile, because it returns different pieces of information depending on the file that the Part represents. The extraction results are saved into the related Part class, and after the extraction run the Storage Controller serializes and saves them to the currently configured Storage. Daemon modules are ignored during snapshot extraction. After the extraction run, the GUI is updated and the user can browse the results.

The user then decides to start the continuous extraction mode to capture information during a working session, during which the files added to the tool will be altered. He closes the GUI and starts working. The tool's components act as follows: first, the Extractor starts the daemon Extraction Modules of the Profile, which begin monitoring the computer system environment. Furthermore, the File Monitor is started and observes the Profile's files for alterations or deletions. When the monitoring components detect an event they want to report, they create Events and pass them to the Event Controller, where they are handled and trigger update extractions. After the working session, the user stops the continuous extraction and browses the results.

The user regards the Profile he created as useful and wants a colleague to use the same one. He therefore exports the Profile as a Profile Template and sends the generated template files to his colleague, who can then import them to create the same Profile on his own PET installation. Internally, all Extraction Modules of the Profile and their Module Configurations are serialized by the Configuration Saver (with the aid of the Jackson API) to JSON objects and saved to files in a template directory. The Profile Controller on the other PET installation uses the Configuration Saver to deserialize the Extraction Modules and recreate the Profile. The Profile's files are not saved to Profile Templates, as they are probably not present in other environments; another user therefore has to add his or her own files before using the Profile.
Figure 16. PET snapshot showing Extraction Modules and their configuration.

Finally, the user shuts PET down, but expects to get the same configuration back at the next tool start. At shutdown, the Extraction Controller initiates the shutdown of each tool component and saves all configurations. The Profile Controller starts saving each Profile; for this, the Configuration Saver is used to save all Extraction Modules and all Parts into the Profile's output directory. Furthermore, the Extraction Controller uses the Configuration Saver to save more general tool options, such as whether the GUI is running and whether the continuous extraction mode is enabled. These are all loaded at the next start.

7.7 Extracting SEI by observing the use environment

The abstract problem to be addressed is that a user of an archive may have access to a large corpus of preserved documents and data, but without additional use environment information this data is of limited use in practical situations. By tracking and analysing the use environment, this informal knowledge can be captured, curated, and linked to the existing material in the archive. This enhances the extent to which the content can be reused in the future.

In order to address this type of problem, we are developing a promising technique based on observation of the environment (specifically, of the software currently executing on the observed system) when some specific action is performed, in order to extract use environment information. Such information can represent implicit knowledge of the user and can be important for supporting future use. By observing the software currently running, and by collecting and analysing the system calls and files used by an application with a configurable set of parameters, this approach allows us to infer dependencies between observed files. A reasonable amount of general information is gathered all the time, while the analysis of activities goes into more depth when an interesting set of parameters indicates that a particular activity is likely being executed.

We first describe a simple scenario that illustrates how such collection of SEI should take place. Suppose a scientist is calibrating data using some specific scripts, as described in section 4.5.2. The PET tool runs on the scientist's machine, monitoring the environment for events that can be important for the information collection.
● The execution of a specific script triggers the event "data calibration", indicating that the user is calibrating this set of data using this script with these parameters;
● Based on the event information and the state of the system, the tool starts examining the system in more detail, for example through a more detailed examination of the parameters and input data for the script, or by observing other target applications such as office software;
● A series of events and environment information is collected. This is used to infer the activity being executed (the user's purpose) and the dependencies between DOs (by looking at patterns of usage and the concurrent use of different DOs by specific software processes, at dependencies based on the script and its input and output parameters, or at other heuristics);
● By using a larger series of this collected data, we may be able to assign a significance value to the dependencies (for example by looking at how often a DO of type X is used in conjunction with a DO of type Y);
● The collected data could also be stored and kept for improving the analysis, for example by using more complex and time-consuming algorithms.

We are now able to partially address the first three steps of this process, as illustrated by the experimental results in 8.1.1. The dependencies can be mapped, automatically when possible, into a graph structure, where the edges are weighted to illustrate the importance of each dependency for the execution of an activity, as described in Chapter 6. The most important dependencies can then be identified on the basis of the dependency graph, which helps to determine the SEI to be extracted for similar scenarios. This has not been practically implemented, but is illustrated in Chapter 6 and will form the basis for later work on the appraisal of the collected information.

7.8 Modules for environment extraction

This section describes the different techniques we use to extract environment information. The extraction module implementations can generally be divided into three groups: (a) implementations using the Java SE libraries provided with the programming language, (b) implementations calling operating system APIs, and (c) implementations integrating external programs or libraries.

The sigar library [78] is useful for capturing a lot of environment information independently of the operating system, for example system specifications and system resource usage. Based on this library, we developed extraction modules for capturing an information snapshot of the CPU specification, the available network interfaces, file storage information, file system information, network information and other system resources. Furthermore, modules for the continuous extraction of volatile information were developed. Examples of such information are, among others, memory usage, CPU usage, process statistics, process parameters, swap usage, the FQDN14, TCP calls, and information that can be extracted by system commands emulated by sigar, such as uptime and who.

Another type of module (daemon modules) monitors the environment for specific events and reports them by triggering new extractions or by adding new files to the list of monitored files. In particular, two modules were created to support the technique described in section 7.7, by monitoring the resources used by running software. The 'lsof' module iteratively runs the lsof15 command, available and often included in most UNIX variants, listing open files and sockets. We are aware of other powerful commands for monitoring open files and sockets on Unix variants (such as dtruss16 and fs_usage17 for OS X, and strace18 for Linux); we opted for lsof for a number of reasons (a sketch of the polling approach follows the list):
● lsof is cross-platform, so one command works on Linux, OS X and other Unix systems, while the other commands are platform specific;
● lsof does not require administrative privileges, which the other commands do;
● lsof is in practice quite reliable at reporting such events.
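The polling approach just described can be approximated in a few lines of Java. The following simplified sketch (not PET's actual module) diffs successive lsof snapshots for a watched command to detect open and close events:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

/** Minimal sketch: poll `lsof` and report newly opened / closed files for a
 *  watched command by diffing successive snapshots. */
public class LsofMonitorSketch {

    static Set<String> snapshot(String command) throws Exception {
        // -c <name> restricts lsof to processes whose command begins with <name>
        Process p = new ProcessBuilder("lsof", "-c", command).start();
        Set<String> open = new HashSet<String>();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 9 && !"COMMAND".equals(cols[0]))
                open.add(cols[cols.length - 1]);         // last column is NAME
        }
        return open;
    }

    public static void main(String[] args) throws Exception {
        Set<String> previous = new HashSet<String>();
        while (true) {
            Set<String> current = snapshot("octave");    // example watched command
            for (String f : current)
                if (!previous.contains(f)) System.out.println("OPENED " + f);
            for (String f : previous)
                if (!current.contains(f)) System.out.println("CLOSED " + f);
            previous = current;
            Thread.sleep(2000);                          // polling interval
        }
    }
}
```

Polling can of course miss files that are opened and closed between two snapshots; this is an inherent limitation compared with tracing tools such as strace.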
In the module configuration, the user can define a set of commands to watch; when an event satisfying the monitoring conditions happens, the module generates a new event in the event system (which can react by storing the event, or by adding a file to the list of monitored files). Such events are, for example, the opening or closing of files or sockets by a specific application. This module has been used in the scenario described in detail in 8.1.1. A similar but less effective solution was implemented for Windows, where the lsof command is not available, as a module based on the 'handle' command19. This module reports open files through repeated execution of the handle command, similarly to the lsof module. The handle command, although free, has a restrictive license20 that does not allow redistribution; for this reason, users wishing to use the module have to download the handle command manually and copy it into the native command directory.

14 Fully Qualified Domain Name, see http://en.wikipedia.org/wiki/Fully_qualified_domain_name
15 LiSt Open Files, see http://en.wikipedia.org/wiki/Lsof
16 https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/dtruss.1m.html
17 https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/fs_usage.1.html
18 http://en.wikipedia.org/wiki/Strace
19 http://technet.microsoft.com/en-us/sysinternals/bb896655.aspx
20 http://technet.microsoft.com/en-us/sysinternals/bb847944

A module for capturing a snapshot of the currently installed software was developed by calling the operating-system-dependent software management component. It is an example of a module that uses customized operating system calls. Another module of this kind extracts information about the currently used graphics card.

Several extraction modules were developed using only the Java SE API and work independently of external libraries and commands. The directory monitor module, which monitors directories for the creation, modification or deletion of files, uses the Java Watch Service. General file storage information is extracted by calling Files.getFileStore(file), while POSIX (Portable Operating System Interface for Unix) file information is extracted using Files.readAttributes(file, PosixFileAttributes.class). Information about the Java installation, and more general operating system properties, is extracted by calling System.getProperties(). The javax.xml package was used to develop a module that extracts information from XML files with XPath expressions. java.util.regex was used for a module that extracts information from plain text files with regular expressions, as well as for a module that extracts information from log files via regular expressions. Two modules were developed using java.awt: one that captures screenshots to save graphical elements, and a second that extracts the system's graphic properties. A module that calculates the checksum of a file uses java.security.

Additionally, the following modules integrate external tools for information extraction or environment monitoring: an Apache Tika file identification module, a MediaInfo module, an OS X Spotlight module, a module for extracting MS Office document dependencies using the Office Dependency Discovery Tool [79], and a chrome-cli [80] module for monitoring open tabs in the Chrome browser. For the extraction of font dependencies of PDF files, the PDFBox library is used.
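For illustration, the core of a directory monitor along these lines can be written with the standard Java 7 WatchService as follows (a simplified sketch, not the actual PET module):

```java
import java.nio.file.*;

/** Minimal sketch: watch a directory for file creation, modification and
 *  deletion events using the Java 7 WatchService (java.nio.file). */
public class DirectoryMonitorSketch {

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);

        while (true) {
            WatchKey key = watcher.take();           // blocks until an event occurs
            for (WatchEvent<?> event : key.pollEvents()) {
                // In PET terms, this is where an Event would be passed to the
                // Event Controller to trigger an update extraction.
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) break;                 // directory no longer accessible
        }
    }
}
```

The same java.nio.file package also provides the Files.getFileStore and Files.readAttributes calls mentioned above.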
In addition, a generic module was created to execute any OS command with a specific configuration, allowing users to create new modules just by defining the command configuration, with no programming necessary.

7.9 Development process and techniques

This section describes the processes and techniques we used for the software project. An agile approach allowed us to start development at an early stage and to evolve the development ideas during the parallel theoretical research. During the initial requirements analysis, we asked our use case partners for user scenarios and also created our own user scenarios, which we discussed and evaluated with them. The main intention was to produce and circulate ideas about what the extraction tool should do and which purposes it could be useful for. We wrote a requirements and features document based on these user scenarios, which served as the basis for the specification of the first prototype.

The first prototype served as a demonstration test-bed to consolidate the ideas and adjust the specification. We created a screencast demonstration, which was distributed to the partners, and gave a live demonstration of the prototype at a project meeting (October 2013). This brought us a lot of feedback, which was used to create the specification for the second, more mature prototype. To evaluate the tool, we organized a practical two-hour workshop at one of our project meetings (January 2014), where the partners tested and used the tool on their own systems, as reported in section 8.4. A second evaluation, in the scope of the SBA scenario, was held recently (July 2014) and will be reported once results are available.

The following list provides an extract of the technologies and techniques used in this software project:
● Programming language: Java 721
● Libraries: Swing22, JUnit23, reflections, JDOM24, JCommander25, Jackson26, sigar27
● External tools: exiftool28, Windows file utility29, Windows MediaInfo30, OfficeDDT31, Tika32, NanoHTTPD33, elasticsearch34
● IDE: Eclipse35 (plugins: EclEmma36 for code coverage, FindBugs37, Subversive38 for svn, m2e39 for maven)
● Development techniques: svn40, mvn41, JDoc42, design patterns43, unit tests44

7.10 Scalability

The issue of scalability should have a marginal impact on the PET tool, given the assumptions on its use. The observations carried out by the tool involve the user as a key element, and the data we are trying to collect relates to user interaction. We are not trying to observe automated workflows and processing, as such workflows are already formalised. So, by design, we do not forecast scalability issues for the PET tool: the limiting factor on the amount of data that PET handles is the human factor, since a user will manually handle only a limited amount of data. Furthermore, the PET tool helps with the appraisal of the data, so it can itself ease scalability issues by reducing the amount of data that is necessary when considering a particular context.

21 "Java Platform SE 7 - Oracle Documentation." <http://docs.oracle.com/javase/7/docs/api/>
22 "javax.swing (Java Platform SE 7) - Oracle Documentation." <http://docs.oracle.com/javase/7/docs/api/javax/swing/package-summary.html>
23 "JUnit." <http://junit.org/>
24 "JDOM." <http://www.jdom.org/>
25 "JCommander." <http://jcommander.org/>
26 "Jackson JSON Processor - Home." <http://jackson.codehaus.org/>
27 "SIGAR API (System Information Gatherer and Reporter ..." <http://www.hyperic.com/products/sigar>
28 "ExifTool by Phil Harvey - SNO." <http://www.sno.phy.queensu.ca/~phil/exiftool/>
29 File utility for Windows: <http://gnuwin32.sourceforge.net/packages/file.htm>
30 "MediaInfo - Download." <http://mediaarea.net/en/MediaInfo/Download/Mac_OS>
31 "Dependency Discovery Tool - SourceForge." <http://sourceforge.net/projects/officeddt/>
32 "Apache Tika - Apache Tika." <http://tika.apache.org/>
33 "NanoHttpd/nanohttpd · GitHub." <https://github.com/NanoHttpd/nanohttpd>
34 "Elasticsearch.org Open Source Distributed Real Time ..." <http://www.elasticsearch.org/>
35 "Eclipse Luna." <http://www.eclipse.org/>
36 "EclEmma - Java Code Coverage for Eclipse." <http://www.eclemma.org/>
37 "FindBugs™ - Find Bugs in Java Programs." <http://findbugs.sourceforge.net/>
38 "Eclipse Subversive - Subversion (SVN) Team Provider." <http://www.eclipse.org/subversive/>
39 "m2eclipse." <https://www.eclipse.org/m2e/>
40 "Apache Subversion." <http://subversion.apache.org/>
41 "Maven Ant Tasks - Mvn." <http://maven.apache.org/ant-tasks/examples/mvn.html>
42 "Javadoc Tool Home Page - Oracle." <http://www.oracle.com/technetwork/java/javase/documentation/javadoc-137458.html>
43 Gamma, Erich et al. Design Patterns: Elements of Reusable Object-Oriented Software. Pearson, 1994.
44 "Unit Tests - Extreme Programming." <http://www.extremeprogramming.org/rules/unittests.html>

8 Experimental results and evaluation

We describe here the experiments we set up to validate the functionality and important aspects of the frameworks. These cover the PET tool, as described in sections 8.1.1, 8.1.2 and 8.2, and the SOM-based anomaly detection described in Chapter 6, in sections 8.1.3 and 8.3. In all the PET experiments, some common background steps are assumed:
1. The PET tool is installed, configured and started on the machine where the DOs are used.
2. The user interacts with the system while PET collects EI in the background.
3. The environment information, DO events and changes are stored for future use and analysis.

8.1 Space science

8.1.1 PET on operations: anomaly dependency information

As described in the third example of paragraph 4.5.2, operators dealing with anomalies usually find their solution by searching through a multitude of documents. These can include, for example, solutions from previous anomalies, telemetry, console logs, meeting notes, emails, etc. Such data, although present in storage, requires experience to select; this is a task requiring specific knowledge that is usually passed from operator to operator. For this reason we address the collection of such dependencies between anomalies and mission documentation, in order to preserve useful information that is otherwise not captured, using some of the techniques envisaged in section 7.7.

In more detail, when an anomaly occurs, the issue is recorded on the 'handover sheet'. Different procedures are executed to solve the issue, and very frequently the operators need to access the relevant documentation. We have set up a simplified experiment to demonstrate what types of significant environment information can be collected in this scenario.
In order to support this scenario, we set up a PET profile that tracks the use of relevant software on specific files, using the PET software monitor; this gives us a trace of the documents that were in use at any given moment, as illustrated in Figure 18. At the same time, it is possible to observe the 'handover sheet' and track the reported start and end times of an anomaly (as shown in Figure 17, where a new issue is written in the document).

Figure 17. Screenshot showing changes in the 'handover sheet' tracked by the PET tool, used to determine anomaly time.

Connecting the documentation trace with the 'handover sheet' tracking allows us to infer the 'anomaly solving time span' (indicated in red in Figure 18) and to assume a dependency between the solution to the anomaly and the documentation used between the start and end of the anomaly.

Figure 18. Trace of document use (based on open and close times) collected from process monitoring (blue), with the anomaly solving time (red) collected using file change monitoring.

8.1.2 PET on extracting results of scientific calculations

The following experiment illustrates how SEI extraction can be useful for examining scientific calculations. The experiment uses an extraction module that extracts whole lines from files. It is configured to monitor an output directory of the open source tool GNU Octave [84] and to extract calculation results with the aid of a regular expression. The extraction module was originally intended for the extraction of particular log messages from a log directory. The scientist uses PET to track the development of the results of an Octave script execution over time, in relation to the script lines that are relevant for the result. This makes it possible to understand changes in the results in relation to changes in the script's formulas.

First, the user configures the module by specifying the output directory and the regular expression used to find the result line; in this example it is simply "B", the name of the result variable. Then the sheer curation mode of PET is started to monitor the directory, which triggers an initial extraction. At the time of this first extraction the script had not yet been executed. We used the script in Figure 19 for this experiment.

Figure 19. Octave script used for the example.

Then the scientist starts his normal Octave workflow and executes the script. PET detects the file changes in the configured output directory and triggers a new extraction by the selected module. The screenshot in Figure 20 displays the results of the first and second extractions.

Figure 20. Screenshot of PET showing a calculation result extraction.

The result of the first extraction shows the locations of the script's result variable B in the not-yet-executed script, which also lies in the observed output directory. In the result of the second extraction, the line of the output file containing the result variable B, together with its line number, can be seen. This is the extracted result of the scientific calculation. Since the location of the result variable in the original script was also extracted, the dependencies between results and their locations in the script become easier to understand.
A continuous extraction over long periods of time makes it possible to observe changes in the results in relation to changes in the script formulas. PET does need a highly customised configuration for this example, but that is precisely what enables it to adapt to specialised scenarios.

8.1.3 SOM and anomaly detection on sensor data

We received approximately 20 GB of raw instrument data, coming from the mission control of the International Space Station, from SpaceApps and BUSOC45. A total of 383 features were measured; each row of the data is a time-stamped measurement of these features, with a time resolution of approximately one second. There are about 16 million rows in total. The feature space is heterogeneous: apart from integer and real-valued features, there are Boolean entries (on/off, used/not used, and other similar pairs) and also categorical variables. There are many missing entries, but the rows follow a predictable pattern of which features will be missing. The pattern alternates between two basic types of rows, which we denote A and B. Type A rows have values for the first 329 features, and type B rows have values for the remaining features. Type A rows occasionally miss all features except the first few. Extensive pre-processing of the feature space is therefore inevitable.

45 http://www.busoc.be/

Labels identifying anomalies are not directly available. We were given one-minute resolution indications of the times at which anomalies occurred, whereas measurements are taken every second. The labels were provided by the Belgian User Support and Operations Centre (BUSOC). The following line is an example indicating an anomaly:

041/07:20 AIB failure without reboot anomaly AIB failure without reboot

It is unclear at which second the above anomaly occurred, or indeed whether it occurred exactly during this minute or shortly before or after it. Such approximate labelling is not useful for automatic classification; anomalies are detected much more efficiently by unsupervised methods, as outlined in the next section.

8.1.3.1 PRE-PROCESSING AND VECTORIZATION

The majority of learning algorithms expect numerical data; hence we must map Boolean and categorical variables to numerical values. Boolean variables are mapped to 0 and 1, whereas categorical variables are mapped to consecutive positive integers. We split the data set in two: rows belonging to the A pattern go into the first subset, and rows belonging to the B pattern into the second. This separation solves the problem of missing values. Separate learning pipelines deal with the two subsets, leading to an ensemble of models; we believe this approach is simpler and more efficient than interpolating missing elements in order to obtain uniformly dense data instances (a sketch of these steps follows Figure 21 below).

With this split, the difference between entries in the A subset is 1 second, and the difference between entries in the B subset is 2 seconds. We processed the data of a single day, 2012-02-10, which has two labelled anomalies, at 7.20 AM and at 10.06 AM, and a total of 127,098 entries. We restricted our attention to the type A subset; type A entries with missing values were discarded, leaving 86,321 entries.

Figure 21. An unsupervised pipeline for labelling outliers using χ² feature filtering, X-means clustering and the local density cluster-based outlier factor algorithm.
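The concrete pipeline was assembled from standard components; purely as an illustration of the mapping and splitting steps just described, here is a small Python/pandas sketch with invented column names and values.

```python
import pandas as pd

def vectorize(frame, bool_cols, cat_cols):
    """Map Booleans to 0/1 and categoricals to consecutive positive integers."""
    out = frame.copy()
    for col in bool_cols:
        out[col] = out[col].map({"off": 0, "on": 1})          # one of the Boolean pairs
    for col in cat_cols:
        out[col] = out[col].astype("category").cat.codes + 1  # 1, 2, 3, ...
    return out

raw = pd.DataFrame({
    "heater": ["on", "off", "on"],
    "mode":   ["nominal", "safe", "nominal"],
    "temp":   [21.4, None, 22.0],   # missing values follow the row pattern
})
data = vectorize(raw, bool_cols=["heater"], cat_cols=["mode"])

# Split by missing-value pattern instead of interpolating: type A rows carry
# the first 329 features, type B rows the rest (simplified here to one column).
is_type_a = data["temp"].notna()
subset_a, subset_b = data[is_type_a], data[~is_type_a]
```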
We trained an unsupervised ensemble consisting of X-means clustering and a local density cluster-based outlier factor algorithm, and applied a threshold to the output of the latter to label outliers. This gave 100% accuracy on the training data. We used these labels for plotting a self-organizing map (see next section). To reduce the number of dimensions, we used a χ² feature filter against the automatically extracted labels; the eventual dimension of the feature space was thirty. The selected features were normalized to the range between 0 and 1, to match the range of the random features by which the self-organizing map is initialized. This part of the pre-processing pipeline is shown in Figure 21.

8.1.3.2 SELF-ORGANIZING MAPS (SOM)

Self-organizing maps are a topology-preserving embedding of high-dimensional data onto two-dimensional surfaces, such as a plane or a torus. Artificial neurons are arranged in a grid, and each neuron is assigned a weight vector with the same number of dimensions as the data to be embedded. An iterative procedure adjusts the weight vectors in each step: first the neuron closest to a data point is sought, then its weight vector and its neighbours' weight vectors are pulled closer to the data point. After a few iterations, called epochs, the topology of the dataset emerges in the grid in an unsupervised fashion. If the number of nodes k in the grid is much smaller than the number of data points n, SOM reduces to a clustering algorithm: multiple data points will share a best matching neuron in the grid.

Emergent self-organizing maps (ESOMs) are a variant of SOM in which k > n. In this arrangement, a data point will not only have a unique best matching neuron, but the neuron's neighbourhood will also 'belong' to that data point. Clustering structure will still show: some areas of the map will be denser than others. The extra information comes from the neighbourhoods of the best matching neurons. These interim neurons provide a smooth chain of transitions between the neurons that are assigned a data point; it is almost as if they interpolated the space between data points. If a data point sits in isolation far from the others, this shows in the map because its neighbourhood will be sparse. These gaps gain meaning depending on the nature of the data.

Self-organizing maps are notoriously slow to train, and ESOMs even more so. Using the data without labels, we trained an emergent self-organizing map with Somoclu, an open source cluster-oriented high-performance implementation of self-organizing maps that accelerates computations on multicore CPUs, GPUs, and even on multiple nodes [82]46.

8.1.3.3 VISUALIZING ANOMALIES

We used an initial learning rate of 0.1, linearly decreasing to 0.01. The initial influence radius was 30, which also decreased linearly, to 1. The map had 100 x 60 neurons on a toroid grid.

Figure 22. The U-matrix of a toroid emergent self-organizing map after ten epochs of training on the feature space of stream data. The individual numbers are neurons whose weight vector matches a data instance; multiple instances may map to the same neuron. The numbers indicate class membership: 0 refers to normal instances, and 1 to anomalous ones. The other neurons reflect the distances in the original high-dimensional space.

We plotted the U-matrix after 10 training epochs using Databionic's ESOM Tools [83], a third-party software.
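A minimal sketch of this training run, assuming the Python interface of the somoclu package (the parameter names follow its documented API, but versions may differ):

```python
import numpy as np
import somoclu

# Placeholder for the 86,321 x 30 pre-processed, 0-1 normalized feature matrix
data = np.random.rand(86321, 30).astype(np.float32)

# 100 x 60 neurons on a toroid grid, as described in section 8.1.3.3
som = somoclu.Somoclu(n_columns=100, n_rows=60, maptype="toroid")

# Ten epochs; radius 30 -> 1 and learning rate 0.1 -> 0.01, both decreasing linearly
som.train(data, epochs=10, radius0=30, radiusN=1, radiuscooling="linear",
          scale0=0.1, scaleN=0.01, scalecooling="linear")

som.view_umatrix(bestmatches=True)  # a U-matrix comparable to Figure 22
```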
We used the class information extracted by X-means clustering and the local density-based outlier factor algorithm to label the best matching nodes in the grid (Figure 22). The lighter colours in the map indicate higher distances, and thus separation. The anomalous instances form a clear cluster in the middle of the figure, further splitting into four subgroups. Interestingly, the regular instances also show clear groups of similar elements; some even seem to circle the anomalous group. As the processing results from the test sample show, scalable automatic anomaly detection is feasible and a promising candidate for the real-time filtering of streamed data from the ISS.

46 Before PERICLES, early work on Somoclu was supported by an Amazon Web Services Education Machine Learning Grant.

8.2 Software Based Art: system information, dependencies

This experiment is about collecting dependencies from an SBA, as described in paragraph 4.1. The PET tool is executed on the SBA and extracts a range of information useful for the understanding and future use of the SBA. For the SBA scenario, PET is especially useful for capturing a snapshot of the system specification and graphical configuration. Furthermore, system resource usage can be monitored and extracted in the sheer curation mode, to be used for comparing behaviour patterns between different SBA installations. However, PET has its limits in this scenario, as it does not provide mechanisms for analysing such patterns. Also, information that lies outside the computer system environment of the SBA is not reachable for PET, such as the actual SBA installation details that are captured manually (for example the model of the display and how it is installed).

8.2.1 System information snapshot

PET with a snapshot extraction, as described in section 4.5.1, can extract various pieces of information pertinent to the scenario of emulating an SBA environment. This mainly includes information that does not change continuously, such as the system's hardware specification or the installed graphics drivers.

Table 3. Extract of results for the SBA example showing SEI captured in a snapshot extraction.

Extraction Module: CPU specification snapshot
  model: Intel Core(TM) i5-3470 CPU @ 3.20GHz
  totalCores: 4
Extraction Module: Graphic System properties snapshot
  font_family_names: "Bitstream", "Charter", "Cantarell", ...
  displayInformation: isDefaultDisplay=true, refreshRate=60, ...
Extraction Module: Operating System properties
  user_language: En
  os_name: Linux
Extraction Module: Java installation information
  java_home: /opt/java/jre
  java_vendor: Oracle Corporation
  java_version: 1.7.0_15

Table 3 shows an extract of the results of a snapshot extraction executed by PET. The significant information includes the system hardware specification (CPU), system graphic settings (e.g. installed fonts and display information), and information about the operating system and the development toolkit used to program the SBA's software. In addition, information that changes constantly has to be captured continuously in PET's sheer curation mode; for the SBA scenario this is mainly the use of system resources.
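PET collects such snapshots through Java extraction modules (several building on the SIGAR library referenced earlier); the following standard-library Python sketch is only a rough analogue, producing a much smaller snapshot in the spirit of Table 3.

```python
import json
import locale
import os
import platform

def system_snapshot():
    """A tiny snapshot of environment properties, in the spirit of Table 3."""
    return {
        "os_name": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "total_cores": os.cpu_count(),
        "user_language": locale.getdefaultlocale()[0],  # may be None on some systems
    }

print(json.dumps(system_snapshot(), indent=2))
```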
A measurement of resource usage values over a long period of time can be analysed to identify behaviour patterns, which can then be used to validate the correct behaviour of a new software installation. One example of such a measurement is CPU usage, captured by PET's "CPU usage monitoring" module, whereby the changes over time can be traced. Another example of runtime information that can be collected, and that is useful for assessing the dependencies of an SBA, is file-system and network usage information (all the files and network connections used during the execution of the SBA), which the PET tool can collect with a specific extraction profile. The extraction results enable the configuration and emulation of a new environment for an SBA, as described in the example in section 4.6.1.

8.2.2 Extracting font dependencies

The PDF format offers the ability to embed the fonts used in a document, to guarantee faithful reproduction even when the DO is moved to an environment that does not include them. However, it is still possible to create PDF documents without including the necessary fonts (for example, this can be a user choice, or a font can be blacklisted in the application creating the PDF). To recognize such external font dependencies, which are particularly relevant in the case of a PDF file used in an SBA, a dedicated module analyses PDF files and extracts a list of used but not embedded fonts. This list determines dependencies between the DO and the listed fonts, relevant for accurate rendering (a sketch of such a check follows below).
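The module itself is part of PET's Java codebase; as a sketch of the underlying check only, assuming the pypdf Python package, the following fragment walks the font resources of each page and reports fonts whose descriptor carries no embedded font file.

```python
from pypdf import PdfReader

FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")

def non_embedded_fonts(pdf_path):
    """List fonts used but not embedded (simplified; composite Type0 fonts would
    need an extra look into /DescendantFonts, omitted here for brevity)."""
    missing = set()
    for page in PdfReader(pdf_path).pages:
        resources = page.get("/Resources")
        fonts = resources.get_object().get("/Font") if resources else None
        if not fonts:
            continue
        for ref in fonts.get_object().values():
            font = ref.get_object()
            descriptor = font.get("/FontDescriptor")
            descriptor = descriptor.get_object() if descriptor else None
            if not descriptor or not any(k in descriptor for k in FONT_FILE_KEYS):
                missing.add(str(font.get("/BaseFont")))
    return sorted(missing)

print(non_embedded_fonts("artwork.pdf"))  # hypothetical input file
```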
8.2.3 Experiments on SBA: Brutalism

The following experiment was run on a real software-based artwork, Jose Carlos Martinat's "Brutalismo, Ambiente de Stereo Realidad" from 2007. In [18] the artwork is described as "a sculpture with a software element which searches the Internet for phrases incorporating the word 'Brutalismo'. The sentences found are then printed out by industrial printers similar to cash machine printers. The printers then cut the till roll so that scraps of paper gather on the floor around the sculpture."

We executed PET on a virtual machine running an image of the artwork, with the intention of gaining a deeper understanding of the artwork. The experiment was done in two separate sessions, using remote desktop technology installed on the SBA virtual machine image. First, PET was downloaded and extracted into the Download folder; all the data it creates and modifies is contained within the extracted folder. Since an appropriate version of the Java VM was not installed on the system, a local version of Java was also extracted into the Download folder. This has only a local, reversible effect, keeping the original system installation of Java untouched. The situation brought up the consideration that it should be made clear to users that the local installation of PET and Java does not interfere with the existing Java installed on the system, and is reversible by simply removing the respective folders.

We ran a simple extraction of the environment information with PET, by executing a "snapshot" similar to the experiment described in 8.2.1. The extracted information includes a list of running processes with details of each, and other important environment information. The running process list includes:
● the mysql daemon process;
● the inicio process: "/bin/bash", "/home/arturo/inicio.sh", run as the root user.

We also looked for software written in Java on the system (.class, .jar, .java files), as we understand that the original SBA was using that language, but were not able to find any. We also looked for running processes that could be implementing the SBA, and we think the only process important for this SBA is the MySQL process.

The "Brutalism" artwork contains printers. A couple of short scripts in the user folder seem to implement the actual printing (there are at least two versions of them); the running process is "inicio.sh", as in the process list. The script is simple (a hypothetical reconstruction of its logic is sketched at the end of this subsection):
1. Select a random row from a table of lines to print, stored in a running MySQL database. The lines are not removed from the table, so there is the possibility of repetition, but that also means there will always be new lines to extract.
2. The line is sent to the printer; there seems to be one version of the script using line printers, and another using USB printers.

It would be relatively simple, and probably sensible, to extract the contents of the MySQL database to a plain text file, to make it simpler to update the system in the future. The software logic, as far as we could see in the VM, is really quite simple, and so not problematic to reproduce with a future technology, should the need arise. The data accessed for the printing seems to be all local, and no actual connection to an external server seems to be made, as this would have to be performed by the script. Of course, it would be important to validate this with the creators to make sure all details are covered. Initial discussions with the artist established that it is important to display current data from live searches. Currently the database acts as a valuable record of the search results and should be exported after each display. The Java software will have to be updated periodically to cope with any external service changes.

We conclude that PET was helpful in investigating the artwork for a deeper understanding. Still, given the very customised nature of SBAs, our experiment confirmed our suspicion that this type of use case is likely to require more manual intervention than more standardised cases, such as the one we examined in section 8.1. The execution of monitoring daemons over a longer period of time would be an interesting follow-on experiment, offering the potential for a deeper investigation of the artwork's behaviour.
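As referenced in the script description above, here is a hypothetical reconstruction of the printing loop; the database, table and column names are guesses, and the original is a shell script rather than Python.

```python
# Hypothetical reconstruction only: the names of the database, table and
# columns are invented; the artwork's actual "inicio.sh" is a shell script.
import subprocess
import time

import mysql.connector  # assumes the MySQL Connector/Python package

db = mysql.connector.connect(user="arturo", database="brutalismo")
cursor = db.cursor()

while True:
    # 1. Pick a random row; rows are never removed, so repetition is possible.
    cursor.execute("SELECT text FROM lines ORDER BY RAND() LIMIT 1")
    (line,) = cursor.fetchone()
    # 2. Send the line to the system print spooler (line or USB printer).
    subprocess.run(["lp"], input=line.encode(), check=True)
    time.sleep(5)  # pacing between printouts: a guess
```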
8.3 Self-organising maps on video data

Emergent self-organizing maps point towards contextual learning models, but they are still primarily content-based. As an initial effort in content-based media mining, we worked with a video provided by Tate and pre-processed by CERTH. The data consisted of 4632 key-frames with extracted features: colour structure and edge histogram features were included, resulting in a 112-dimensional space. Most frames were labelled according to their content, with labels from the following list:

Artworks
City
Crowds – Audiences
General gallery spaces
Guitars – Instruments
Landscape
Night
Tate Britain
Tate Liverpool
Tate Modern
Text
TV – Monitors

We trained a toroid emergent self-organizing map of 320x200 nodes (Figure 23) using Somoclu [82].

Figure 23. SOM map for the video data labels.

The initial learning rate was 1.0, decreasing linearly over ten epochs to 0.1. The initial radius of the neighbourhood was a hundred neurons, and it also decreased linearly, to one. The neighbourhood function was a non-compact Gaussian. The global structure of the map shows various clusters of labels (Figure 24): different labels tend to separate into local regions, and these local regions are thus homogeneous. Ideally, all labels belonging to one category would occupy one part of the map; given the training data and the low number of examples, the quality of the map is already high. As a next step, we intend to extend the analysis to a contextual variant, across numerous videos, to see how usage patterns influence the clustering structure.

Figure 24. Close-up of the SOM map with labels.

8.4 First internal evaluation (London, January 2014)

In January 2014, in the context of a project meeting, we ran a workshop and hands-on evaluation of the PET tool prototype with our project partners, to collect feedback, bug reports and feature requests, and to validate the approach taken. This workshop allowed us to gather many ideas and much feedback, covering new modules and module improvements, design and implementation improvements, ideas for new features, and software project management suggestions. Furthermore, ideas for real usage scenarios came up at the meeting, which were tested live on the systems of our use case partners during the late development stage.

9 Conclusions and future work

9.1 Conclusions

We presented our work on determining what information, from the widest set of the Environment Information, is significant to collect in order to support the long-term use of Digital Objects. Pursuing this objective, we defined Significant Environment Information, a new concept that takes into account the purpose of use of a DO, and its significance, while expressing a dependency relationship between the information entities involved. In doing so, we use the same conceptual framework shared by the PERICLES RTD work packages, by using dependencies and proposing an early, preliminary extension of the LRM to support our use of the model. We also consider that SEI will naturally form a graph structure that could help support an enhanced appraisal process; this will be explored in later tasks. The graph allows deducing other significant information to be extracted, and inferring relationships between different objects. Our approach facilitates the gathering of information that is potentially not covered by established standards, and enables better long-term preservation, in particular for complex digital objects. We also presented ways to determine significance weights and their relation to the DO lifecycle, along with some examples of SEI from our stakeholders. We think that collecting such information at creation and use time is very relevant for preserving the long-term use of the data. We also explored the similarities between concepts from the Biological domain and the Digital Ecosystem and Environment metaphors, presented potential approaches from the more mature Biological domain that could be applied to them, and presented mathematical approaches for the analysis of SEI.
Finally, we presented the tool we are developing to collect such information, the PET tool, together with its methods of extraction, and showed experimental results to support the importance of such information. We believe the importance of the contribution also lies in the way the information is collected: it is domain agnostic and aims at collection in the context of spontaneous workflows, with minimal input from the user and very limited assumptions about the system structure. We presented experimental results supporting the approach, based on the PERICLES use cases. The results from the PET tool will be further analysed in later tasks of the work package.

9.2 Future work

We plan to continue our work on exploring new methods of automated information collection, and on improving the filtering and inference of dependencies, in the remaining tasks of PERICLES. Below is a list of future work, both related to PERICLES tasks and comprising more general ideas that could be explored but are not currently part of specific tasks.

9.2.1 Ideas for future PERICLES tasks

9.2.1.1 ENCAPSULATION OF SEI TOGETHER WITH THE RELATED DO IN T4.2

Task 4.2 investigates information encapsulation techniques that will allow DOs to be encapsulated together with the extracted SEI belonging to them. This approach prevents the risk of data loss for SEI in case the DO leaves its original environment for any reason, as illustrated in section 3.1. An information encapsulation tool will be developed during Task 4.2, which will work together with the PERICLES Extraction Tool to encapsulate the extracted information directly in a sheer curation scenario. Furthermore, the idea of focusing on the use purpose of a DO will be extended by selecting a suitable information encapsulation technique based on the usage scenario. The techniques include packaging techniques, as used for preservation purposes, as well as digital watermarking, steganography and other embedding techniques. To avoid redundancy in the embedding of the data, it is likely that this task will also expand on the concepts of identifying recurring instances of embedded information, and of providing ways to represent and identify them with limited redundancy across data collections.

9.2.1.2 ECOSYSTEM DEPENDENCY CAPTURE - WP5

In the context of the WP5 modelling of the ecosystem, we are aware that a DO environment and an ecosystem share entities, as mentioned in section 5.1. As part of future work in WP5, it will be interesting to explore the possibility of extending the PET tool to help discover, extract and model the ecosystem entities and their dependencies. We can imagine, and will evaluate the feasibility of the approach in later WP5 work, that the PET tool could be extended with modules that help discover ecosystem entities and dependencies by exploring the context of use of the system, and thereby help with the modelling of the ecosystem. Another interesting possibility is that of extending PET to monitor the ecosystem entities and dependencies for change; this would allow such changes to be reported, so that the model can be updated and its consistency verified.

9.2.1.3 USING SEI FOR APPRAISAL - T5.5

We have already mentioned how the SEI, forming a weighted graph, can be used to enhance the appraisal process.
We believe there are interesting possibilities to explore in this respect, such as using values as weights both for dependencies and for DOs, spreading the DO weights through the dependencies, and selecting the best set of objects based on both the dependency and the object values. Another interesting aspect to explore is the use of dependency weights for non-binary appraisal, that is, using the weights to decide on a preservation strategy for DOs: for example, the most valuable DOs and SEI can be preserved on more redundant, more expensive storage, while lower weights direct objects to less redundant storage. It is very likely that the work on this task will build on and extend the concepts and ideas we have started to explore in this deliverable.

9.2.1.4 USER COMMUNITY WATCH AND PROFILING - T5.4.3

Motivating example: consider one or more publications that are produced by a scientific community based on a set of experimental data (e.g. the SOLAR dataset). In practice, the dataset produced by the SOLAR [43] experiments will be cross-referenced with other datasets, with the aim of providing a definitive reference dataset for the solar spectrum. The dataset is likely to be downloaded and reused by many other scientists in different domains, such as:
● solar scientists running similar experiments to measure the solar spectrum;
● other solar scientists;
● scientists in other domains, such as climate and environment science, as well as in domains that cannot yet be anticipated.

One approach to user community monitoring is based on usage tracking and user profiling, as described in section 6.5.1. An alternative approach is to track the scientific outputs that are based on a given dataset or other output (this could be considered part of the environment of a given paper). There is currently a trend towards treating datasets in a similar way to publications, in terms of assigning DOIs so that they can then be cited. Thus, it would be possible to track community and semantic evolution both through citations of the datasets and through the associated scientific publications. Academic publication metadata can be harvested from large OA repositories to provide a basis for capturing and analysing the communities of use, including their evolution and the associated semantics. There are two aspects to this:
1. Analysis of a corpus of historic publications, to model the evolution of user communities and to test hypotheses about preservation and changing semantics with a real set of data.
2. Based on the experience in 1, development of a preservation watch and planning tool that can monitor current content, detect changes and adapt preservation plans accordingly.

9.2.1.5 LRM EXTENSION TO REPRESENT SEI - T4.3

As illustrated in section 4.6, we have started preliminary work on extending the LRM to support SEI-related concepts. This is still early work, based on a pre-final version of the LRM, and will need to be refined in later tasks. The aim is to create a simple and compact reusable model for representing SEI, possibly encapsulating the latter together with the DO (Task 4.2), as also outlined in section 9.2.1.1.
Potential directions to be investigated from here on include, but are not limited to, the following:
● Once weights have been assigned a value, a set of rules residing "on top" of the ontological LRM model could determine which dependencies are important (and which could, consequently, be removed or "downgraded").
● This can then lead to reasoning, through which it could be automatically pre-calculated (inferred) whether newly added dependencies are significant, without having to know their actual weight values; it will be enough to have prior knowledge of which types of dependencies are considered significant for each distinct purpose.

9.2.2 Other ideas not assigned to specific PERICLES tasks

Use of SEI for updating metadata standards

Currently, a community faced with archiving their data has to address the problem of knowing just what to collect. Usually, they look at previous examples or standards to help them determine what metadata they need. Once they have a schema, it is evaluated; then it is implemented, and people realise that some parts are missing, so they extend the schema to include them. It would be desirable to shorten this cycle by providing communities with an aid or guide that would help them define what to collect. By exploring what constitutes the environment and what constitutes context, we can start to generate questions (like "have you thought about how the data was collected: are there any pieces of information from that stage that would be necessary for a potential user?"). Of course, in quite a few communities standards exist, but as the use of the data evolves, the standards may no longer map completely to the uses.

Use environment capture analysis

This has been illustrated in sections 4.5.2, 7.7 and 8.1.1, where the scenario is introduced and initial methods to address it are described. In this example we propose user tracking and the analysis of events as they occur. This might involve the use of the PET tool in some cases, and the linking of these events to enable the analysis of specific workflows, such as the debugging of anomalies. Such a tool could track the symptoms of the problem, the effectiveness of mitigating actions, and the pieces of relevant documentation. This use information could then be recorded and archived with the main parts of the documentation. More complex issues that are not addressed in the current tool, such as the 'noise' that can be reported by the event tracking, could be addressed in the future. This 'noise' can arise, for example, because users often multitask, so unrelated documentation may have been in use that is not relevant to the main task; documentation that was quickly opened and closed may also indicate, in some cases, that the document was not relevant. It would also be interesting to address more fine-grained tracking, for example by including which pages of a document have been consulted.
10 Bibliography

1. PREMIS Editorial Committee. (2008). PREMIS Data Dictionary for Preservation Metadata, version 2.0. Retrieved June 13, 2008.
2. Dappert, A., & Farquhar, A. (2009). Significance is in the eye of the stakeholder. Research and Advanced Technology for Digital Libraries, 297-308. Springer Berlin Heidelberg.
3. List of Metadata Standards | Digital Curation Centre. Retrieved June 30, 2014, from http://www.dcc.ac.uk/resources/metadata-standards/list.
4. (2011). TIMBUS Project. Retrieved June 30, 2014, from http://timbusproject.net/.
5. (2012). TIMBUS Project Deliverable 4.5, "Business Process Contexts", from http://timbusproject.net/.
6. National Information Standards Organization. (2004). Understanding Metadata. NISO Press.
7. CCSDS. (2012). Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-M-2, Magenta Book.
8. (2010). MPEG-21 | MPEG. Retrieved June 30, 2014, from http://mpeg.chiariglione.org/standards/mpeg-21.
9. Text Encoding Initiative: TEI. Retrieved June 30, 2014, from http://www.tei-c.org/.
10. Dublin Core Metadata Initiative. (2008). Dublin Core metadata element set, version 1.1.
11. (2013). Disciplinary Metadata | Digital Curation Centre. Retrieved June 30, 2014, from http://www.dcc.ac.uk/resources/metadata-standards.
12. Lynch, C. (1999). Canonicalization: A fundamental tool to facilitate preservation and management of digital information. D-Lib Magazine, 5(9).
13. Hedstrom, M., & Lee, C. A. (2002). Significant properties of digital objects: definitions, applications, implications. Proceedings of the DLM-Forum (Vol. 200, pp. 218-227).
14. Knight, G. (2008). Deciding factors: Issues that influence decision-making on significant properties. JISC. Retrieved January 7, 2010.
15. Knight, G. (2010). Significant Properties Data Dictionary. InSPECT project report. Arts and Humanities Data Service/The National Archives. At http://www.significantproperties.org.uk/sigprop-dictionary.pdf.
16. Lurk, T. (2008, January). Virtualisation as conservation measure. In Archiving Conference (Vol. 2008, No. 1, pp. 221-225). Society for Imaging Science and Technology.
17. Laurenson, P. (2014). Old media, new media? Significant difference and the conservation of software-based art. In Graham, B. (Ed.), New Collecting: Exhibiting and Audiences after New Media Art, Chapter 3. University of Sunderland, UK.
18. Falcão, P. (2010). Developing a Risk Assessment Tool for the conservation of software-based artworks. Thesis, Hochschule der Künste Bern.
19. (2013). Context: definition of context in Oxford dictionary (British ... Retrieved June 30, 2014, from http://www.oxforddictionaries.com/definition/english/context.
20. Chowdhury, G. (2010). From digital libraries to digital preservation research: the importance of users and context. Journal of Documentation, 66(2), 207-223.
21. Kari, J., & Savolainen, R. (2007). Relationships between information seeking and context: A qualitative study of Internet searching and the goals of personal development. Library & Information Science Research, 29(1), 47-69.
22. Smeaton, A. F. (2006). Content vs. context for multimedia semantics: The case of SenseCam image structuring. Semantic Multimedia (pp. 1-10). Springer Berlin Heidelberg.
23. Luo, J., Savakis, A. E., & Singhal, A. (2005). A Bayesian network-based framework for semantic image understanding. Pattern Recognition, 38(6), 919-934.
24. Blaschke, T. (2003). Object-based contextual image classification built on image segmentation.
Advances in Techniques for Analysis of Remotely Sensed Data, 2003 IEEE Workshop on. IEEE.
25. Singhal, A., Luo, J., & Zhu, W. (2003). Probabilistic spatial context models for scene content understanding. Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on. IEEE.
26. Naphide, H., & Huang, T. S. (2001). A probabilistic framework for semantic video indexing, filtering, and retrieval. Multimedia, IEEE Transactions on, 3(1), 141-151.
27. Dappert, A., Peyrard, S., Chou, C. C., & Delve, J. (2013). Describing and Preserving Digital Object Environments. New Review of Information Networking, 18(2), 106-173.
28. Corubolo, F., Eggers, A. G., Hasan, A., Hedges, M., Waddington, S., & Ludwig, J. (2014). A pragmatic approach to significant environment information collection to support object reuse. In iPRES 2014 proceedings.
29. (2011). SCIDIP-ES: SCIence Data Infrastructure for Preservation ... Retrieved June 30, 2014, from http://www.scidip-es.eu/.
30. Key Perspectives. (2010). Data dimensions: disciplinary differences in research data sharing, reuse and long-term viability. SCARP Synthesis Study, Digital Curation Centre, UK. Available at: http://www.dcc.ac.uk/sites/default/files/documents/publications/SCARP-Synthesis.pdf.
31. (2008). Zoological Case Studies in Digital Curation – DCC SCARP ... Retrieved June 30, 2014, from http://alimanfoo.wordpress.com/2007/06/27/zoological-case-studies-in-digital-curation-dcc-scarp-imagestore/.
32. Curry, E., Freitas, A., & O'Riáin, S. (2010). The role of community-driven data curation for enterprises. Linking Enterprise Data, 25-47.
33. (2010). SCARP | Digital Curation Centre. Retrieved June 30, 2014, from http://www.dcc.ac.uk/projects/scarp.
34. Lyon, L., Rusbridge, C., Neilson, C., & Whyte, A. (2009). Disciplinary Approaches to Sharing, Curation, Reuse and Preservation. Digital Curation Centre, UK. Available at: http://www.dcc.ac.uk/sites/default/files/documents/scarp/SCARP-FinalReport-Final-SENT.pdf.
35. Whyte, A., Job, D., Giles, S., & Lawrie, S. (2008). Meeting curation challenges in a neuroimaging group. International Journal of Digital Curation, 3(1), 171-181.
36. Hedges, M., & Blanke, T. (2013). Digital Libraries for Experimental Data: Capturing Process through Sheer Curation. Research and Advanced Technology for Digital Libraries, 108-119. Springer Berlin Heidelberg.
37. Razum, M., Einwächter, S., Fridman, R., Herrmann, M., Krüger, M., Pohl, N., et al. (2003). Research Data Management in the Lab. Geochem. Geophys. Geosyst., 4(1), 1010.
38. (2012). eSciDoc - Research Data Management - BW-eLabs. Retrieved June 30, 2014, from http://www.bw-elabs.org/datenmanagement/1_escidoc/index.en.html.
39. Thibodeau, K. (2002). Overview of technological approaches to digital preservation and challenges in coming years. The State of Digital Preservation: An International Perspective, 4-31.
40. Woods, K., Lee, C. A., & Garfinkel, S. (2011). Extending digital repository architectures to support disk image preservation and access. Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries. ACM.
41. Garfinkel, S. L., & Shelat, A. (2003).
Remembrance of data passed: A study of disk sanitization practices. IEEE Security & Privacy, 1(1), 17-27.
42. Higgins, S. (2008). The DCC curation lifecycle model. International Journal of Digital Curation, 3(1), 134-140.
43. SOLAR. http://www.esa.int/Our_Activities/Human_Spaceflight/Columbus/SOLAR
44. (2011). The Weighting Ontology 0.1 - Namespace Document. Retrieved June 30, 2014, from http://smiy.sourceforge.net/wo/spec/weightingontology.html.
45. Pocklington et al. (2014). A Biological Perspective on Digital Ecosystems and Digital Preservation. Submitted to iPRES 2014.
46. Hadzic, M., Chang, E., & Dillon, T. (2007). Methodology framework for the design of digital ecosystems. Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on. IEEE.
47. McMeekin, D. A., Hadzic, M., & Chang, E. (2009). Sequence diagrams: an aid for digital ecosystem developers. Digital Ecosystems and Technologies, 2009. DEST'09. 3rd IEEE International Conference on. IEEE.
48. Kulovits, H., Kraxner, M., Plangg, M., Becker, C., & Bechhofer, S. Open Preservation Data: Controlled vocabularies and ontologies for preservation ecosystems. Proceedings of the 10th International Conference on Preservation of Digital Objects. Biblioteca Nacional de Portugal.
49. Doorn, P., & Roorda, D. (2010). The ecology of longevity – the relevance of evolutionary theory for digital preservation. In Proceedings of DH-10.
50. Dobreva, M., & Ruusalepp, R. (2012). Digital preservation: interoperability ad modum. In: Chowdhury, G. G. (2002). Digital libraries and reference services: present and future. Journal of Documentation, 58(3), 258-283.
51. (2011). SCAPE - Scalable Preservation Environments. Retrieved June 30, 2014, from http://www.scape-project.eu/.
52. Fernando, C., Kampis, G., & Szathmáry, E. (2011). Evolvability of natural and artificial systems. Procedia Computer Science, 7, 73-76.
53. Goldberg, D. E., & Holland, J. H. (1988). Genetic algorithms and machine learning. Machine Learning, 3(2), 95-99.
54. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554-2558.
55. Dorigo, M., Caro, G., & Gambardella, L. (1999). Ant algorithms for discrete optimization. Artificial Life, 5(2), 137-172.
56. Kennedy, J. (1997). The particle swarm: social adaptation of knowledge. Evolutionary Computation, 1997. IEEE International Conference on. IEEE.
57. Smith, R. E., Timmis, J., Stepney, S., & Neal, M. Conceptual frameworks for artificial immune systems. International Journal of Unconventional Computing.
58. Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press.
59. Darányi, S., Wittek, P., & Forró, L. (2012). Toward sequencing "narrative DNA": Tale types, motif strings and memetic pathways. Proceedings of CMN-12, Istanbul, May 26-27, 2012, 2-10. At: http://narrative.csail.mit.edu/ws12/proceedings.pdf.
60. Ofek, N., Darányi, S., & Rokach, L. (2013). Linking Motif Sequences with Tale Types by Machine Learning. OASIcs-OpenAccess Series in Informatics. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
61. Nachira, F., Dini, P., & Nicolai, A. (2007).
A network of digital business ecosystems for Europe: roots, processes and perspectives. European Commission, Bruxelles, Introductory Paper.
62. John, J. L. (2012). Digital Forensics and Preservation. Digital Preservation Coalition.
63. Church, G. M., Gao, Y., & Kosuri, S. (2012). Next-generation digital information storage in DNA. Science, 337(6102), 1628.
64. Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E. M., Sipos, B., et al. (2013). Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature.
65. Causal Processes (Stanford Encyclopedia of Philosophy). Retrieved June 30, 2014, from http://plato.stanford.edu/entries/causation-process/.
66. Evolutionary Genetics (Stanford Encyclopedia of Philosophy). Retrieved June 30, 2014, from http://plato.stanford.edu/entries/evolutionary-genetics/.
67. Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
68. Darányi, S., & Wittek, P. (2013). Demonstrating conceptual dynamics in an evolving text collection. Journal of the American Society for Information Science and Technology, 64(12), 2564-2572.
69. Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly Detection: A Survey. ACM Computing Surveys (CSUR), 41(3).
70. Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data Mining Use Cases and Business Analytics Applications. Chapman and Hall.
71. Ma, J., & Perkins, S. (2003). Time-series novelty detection using one-class support vector machines. In Proceedings of IJCNN-03, International Joint Conference on Neural Networks, volume 3, pp. 1741-1745.
72. (2010). Data-provenance - Google Code. Retrieved June 30, 2014, from http://code.google.com/p/data-provenance/.
73. Gehani, A., & Tariq, D. (2012). SPADE: Support for provenance auditing in distributed environments. Proceedings of the 13th International Middleware Conference. Springer-Verlag New York, Inc.
74. Strodl, S., Mayer, R., Rauber, A., & Draws, A. (2013). Digital Preservation of a Process and its Application to e-Science Experiments. Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES 2013). Springer.
75. (2008). Taverna - open source and domain independent Workflow ... Retrieved June 30, 2014, from http://www.taverna.org.uk/.
76. (2009). Elasticsearch.org Open Source Distributed Real Time ... Retrieved June 30, 2014, from http://www.elasticsearch.org/.
77. (2006). MapDB. Retrieved June 30, 2014, from http://www.mapdb.org/.
78. (2012). Home - Sigar - Hyperic Support. Retrieved June 30, 2014, from https://support.hyperic.com/display/SIGAR/Home.
79. (2012). Dependency Discovery Tool - SourceForge. Retrieved June 30, 2014, from http://sourceforge.net/projects/officeddt/.
80. (2014). prasmussen/chrome-cli · GitHub. Retrieved June 30, 2014, from https://github.com/prasmussen/chrome-cli.
81. (2002). FreeMind - SourceForge. Retrieved June 30, 2014, from http://freemind.sourceforge.net/.
82. Wittek, P. (2013). Somoclu: An Efficient Distributed Library for Self-Organizing Maps.
arXiv:1305.1422. http://arxiv.org/abs/1305.1422
83. Ultsch, A., & Mörchen, F. (2005). ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM.
84. (2005). GNU Octave. Retrieved June 30, 2014, from http://www.gnu.org/software/octave/.
85. McKemmish, S. (1997). Yesterday, today and tomorrow: a continuum of responsibility. In Proceedings of the Records Management Association of Australia 14th National Convention, pp. 15-17.

Appendix A: List of requirements

Requirements analysis from D2.3.1

This is the list of requirement identifiers, described in more detail in D2.3.1, addressed by the PET tool. The bold typeface identifiers are the most relevant; those in '[]' are partially addressed, and those in '()' can potentially be addressed by PET but currently are not.

Born-digital Archives (BDA)

UR-AM-BDA-02, UR-AM-BDA-03, [UR-AM-BDA-04], [UR-AM-BDA-05], UR-AM-BDA-10, UR-AM-BDA-11, (UR-AM-BDA-25), UR-AM-BDA-26, UR-AM-BDA-27, UR-AM-BDA-28, UR-AM-BDA-29, UR-AM-BDA-30, [UR-AM-BDA-31], (UR-AM-BDA-32), UR-AM-BDA-37.

In this use scenario, PET can help with the issues related to the extraction of metadata and environment information from Born-Digital Archives, supporting the initial appraisal and documentation of the archive contents, including information that could be lost if not captured in the native context (such as system and file-system specific information).

Software Based Art (SBA)

UR-AM-SBA-01, UR-AM-SBA-14, UR-AM-SBA-15, UR-AM-SBA-21

PET can help analyse the technical environment of an SBA, and also monitor and help analyse the runtime requirements of an SBA, as described in the experiments in Chapter 8 of this deliverable.

Media Production (MPR)

UR-AM-MPR-09, [UR-AM-MPR-12], [UR-AM-MPR-26], [UR-AM-MPR-41], UR-AM-MPR-37, UR-AM-MPR-39, UR-AM-MPR-40

In the case of MPR, the PET tool can again help with metadata extraction; the successive, linked task of metadata embedding will then allow the relevant metadata to be embedded inside the data.

Digital Video Art (DVA)

[UR-AM-DVA-04], UR-AM-DVA-05, [UR-AM-DVA-07]

Science Requirements

[UR-SC-POL-01], [UR-SC-POL-02], [UR-SC-POL-19], UR-SC-DAT-03, [UR-SC-DAT-06], [UR-SC-DAT-12], UR-SC-DAT-17, [UR-SC-DAT-20], (UR-SC-DAT-26), UR-SC-DAT-28, UR-SC-DAT-29, UR-SC-DAT-39, UR-SC-DAT-41, UR-SC-DAT-44, UR-SC-PRO-13, (UR-SC-PRO-14), UR-SC-PRO-16, [UR-SC-PRO-18], [UR-SC-PRO-21], UR-SC-PRO-24, (UR-SC-PRO-26), UR-SC-PRO-27, UR-SC-PRO-30, UR-SC-PRO-44, (UR-SC-TLS-03), [UR-SC-TLS-06]

Cross-Domain Requirements

UR-CO-DAT-01, UR-CO-DAT-04, [UR-CO-DAT-05], (UR-CO-PRO-6), UR-CO-PRO-13, UR-CO-PRO-15, [UR-CO-TLS-03]

Requirements from DoW, rapid case studies and other feedback

General requirements from the DoW:
1. The aim of the tool is the extraction of significant environment information.
2. The software should implement techniques related to digital forensics, sheer curation, and weighted graphs to help prioritize the configuration of the extraction process.
3. The software should be based on, as well as support, the related research of Task 4.1.
4. The use cases of our partners should be considered.

From the first requirement we derived the following specification points:
● [Obligatory] Extraction of environment information that sits outside of the object.
● [Optional] Support extraction of metadata from the digital object itself.

For the second requirement we decided to support a sheer curation scenario and to design the architecture in a way that allows digital forensic techniques and weighted graphs to be included in the framework. The following specification points were derived:
● Sheer curation: allow continuous extraction. This requires detecting when to extract, and therewith environment monitoring.
● Sheer curation: do not disturb the user. The tool should run in the background with minimal notification and configuration requirements.
● Sheer curation: take resource usage into account during environment extraction, in order to limit the impact on system performance.
● Sheer curation: to avoid privacy issues, the user should have full control over the extraction process and the extracted information.
● Digital forensics: allow the integration of post-hoc extraction techniques.
● Weighted graphs: allow a high configurability of the extraction process.
● Weighted graphs: provide an interface for the configuration.

The following requirements derived from an analysis of the rapid case studies of our use case partners, and from user scenarios provided by the partners:
● The need to capture information from different environments. This implies that the extraction techniques have to be implemented in a way that can be used on different operating systems.
● Remote extraction is not feasible for security reasons.
● Consider that the digital object can be altered in another environment and re-enter the original environment (e-mail out/in).
● Automation of the extraction process.

Furthermore, the scenarios provided us with ideas of which information is significant to extract from the environment, based on the research about SEI. The following list is a subset of the potentially significant information:
● Software dependencies.
● Information from log files.
● Information from configuration files and XML files.
● Original file environment information (position in folder structure; author; modification and other dates; permissions).
● Algorithms (applications, workflows) used for processing data.
● Detailed configuration parameters of the software used to process data.
● Information stored in specialist software or databases.
● Information from web services.
● User information from user directory services such as LDAP.
● Usage information.
● Policy information.
● Relationships between objects.
● Execution environment features such as fonts, system properties and configuration, etc.
● Information about programming language installations.

Some scenarios require highly specialized information. The following requirements for the software are a conclusion of this discovery:
● The possibility to choose and configure which techniques to use for a scenario.
● The possibility to create profiles of this configuration, to be able to support different scenarios at the same time.
● Modularity of the extraction techniques, and the possibility for the user to easily create and add their own customized modules to the tool.

Appendix B: PET tool architecture in detail

1. The user can give some basic configuration commands as arguments at start. To use the same commands automatically at every tool start, it is possible to write these commands into a configuration file.
The tool provides a command line interface (CLI) and a graphical user interface (GUI) for interaction with the user; the use of the GUI is optional, as it is possible to execute the tool without a graphical desktop. The user has full control over the extraction process and can decide which extraction modules to use, which files to consider, and which (subsets of the) extracted information to keep.

2. The Extraction Controller Builder is responsible for building the Extraction Controller, the "heart" of the application, based on the given user commands. It follows the builder design pattern47 and is executed only once, to build a single instance at tool start.

3. The Extraction Controller is the main controlling class of the application. It has access to all other controllers and is responsible for the application flow. At tool start it initializes all other controllers, and it shuts them down at the end of the tool's execution. All specialized controller components communicate with other controllers exclusively via the Extraction Controller, so this main controller is responsible for updating the states of the other components and for serving as an intermediary.

47 http://en.wikipedia.org/wiki/Builder_pattern

4. Profiles are created by the user to organize the extraction components, e.g. based on use purposes. They contain a list of configured Extraction Modules, and a list of Extraction Result Collections that collect the information extracted by the modules and keep references to the important files this information relates to. Both components are described in the following two points.

5. Extraction Modules implement the techniques that designate how and when information is extracted. They provide different implementations of algorithms to be executed in the computer system environment of different operating systems. There are three kinds of Extraction Modules: file-dependent, file-independent and daemons. File-dependent modules take as argument a path to a file and extract information that is valid only for this file, whereas file-independent modules extract environment information that is valid for all files within the environment. Daemon modules, on the other hand, don't extract information, but instead monitor the environment for the occurrence of designated events. It is also possible to develop customized modules for extracting specialized information or for monitoring specific events, and to easily plug them into the application; a class template is provided to support developers in this (a minimal sketch of the three module kinds closes this appendix).

6. Extraction Result Collections are the data structures that keep the extracted information. Each collection belongs to one of two sub-classes: Environment or Part. An Environment collects all extracted file-independent information belonging to a Profile, whereby each Profile has exactly one Environment. Parts keep the extracted information that is valid only for a specific file, together with a path to this file. They can be seen as a file-part of a Digital Object; in order to increase flexibility, we intentionally did not implement a Digital Object as a data structure.

7. The Profile Controller manages all Profiles and Profile Templates, which can be used for the fast creation of a preconfigured Profile. It is possible to export existing Profiles as Profile Templates, in order to pass them to other PET users.
8. The Module Controller searches (with the help of Java reflection48) for all available module classes and creates a list of generic extraction modules, used to create Extraction Module instances for Profiles. After their creation, most Extraction Modules have to be configured before they can be executed.

9. The Extractor is responsible for executing the Extraction Modules and for saving the Extraction Results into the right Extraction Result Collections. It supports two extraction modes: (a) a snapshot extraction, which executes each Extraction Module of each Profile to capture the current information state; (b) a continuous extraction mode, in which a new extraction is initiated by the Event Controller whenever an event is detected by the environment monitoring daemons (the File Monitor and the Daemon Modules).

10. The Event Controller receives all events detected by the monitoring daemons and controls the event handling. It uses a queue to handle the events in the order in which they emerge.

11. The monitoring daemons are the File Monitor (see 12) and the Daemon Modules (see 5).

12. The File Monitor is responsible for observing the files added to the Profiles for changes. If a modification of one of the files is detected, a new extraction is initiated for all modules related to this file. In case of a file deletion, all Profiles that include this file as a Part are informed and will remove the file from their list. Contrary to the exchangeable daemon modules, this is an inherent component of the application.

13. The Configuration Saver saves the state of the application to configuration files at the end of the tool's execution, and loads the state at the next start of the tool. The Profiles are saved with all their added files and their modules with their configurations. Furthermore, the current extraction mode and general usage options are saved.

48 http://en.wikipedia.org/wiki/Reflection_(computer_programming)

14. The Storage Controller allows generic access to the exchangeable Storage. It provides methods for saving extracted information to, and loading it from, the Storage.

15. The Storage saves and loads metadata using modular storage support. Three storage interfaces are currently implemented: the default is a simple flat file-system storage with JSON mapping, a second uses elasticsearch [76] and a third uses MapDB [77].

16. PET works together with an information encapsulation tool, also developed during the PERICLES project, to be able to encapsulate the extracted information together with its related files in a sheer curation scenario.

17. The weighted graphs described in chapter 6.3 could be implemented for suggesting information to be extracted based on the use cases.
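As referenced in point 5, the following sketch reduces the three module kinds to a minimal class hierarchy. PET itself is written in Java; Python is used here purely for brevity, and all names are illustrative rather than PET's actual class names.

```python
from abc import ABC, abstractmethod

class ExtractionModule(ABC):
    """Base of all modules; concrete classes are discovered via reflection
    by the Module Controller and instantiated per Profile."""

class FileDependentModule(ExtractionModule):
    @abstractmethod
    def extract(self, file_path):
        """Return information valid only for the given file (stored in a Part)."""

class FileIndependentModule(ExtractionModule):
    @abstractmethod
    def extract(self):
        """Return environment information valid for all files (the Environment)."""

class DaemonModule(ExtractionModule):
    @abstractmethod
    def start(self, event_queue):
        """Extract nothing; monitor the environment and push events that
        trigger a new extraction via the Event Controller."""
```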
Appendix C: Ideas for further PET developments
In this section we collect a list of issues and feature ideas for further PET developments, guided by close discussion with stakeholders. Although it will likely be impossible to implement all of these in the course of PERICLES, we hope to come back to the list later in the project and to inspire contributions from the open source community. Therefore, we plan to maintain an updated list of ideas on the PET repository website (https://github.com/pericles-project/pet).
● Address the creation context of Digital Objects by adding newly created Digital Objects based on environment events detected by the Environment Monitoring Daemons. We have already tested the concept with the Directory Monitoring Daemon by adding the newly created files of an observed directory after receiving a file creation event.
● An option to trigger the execution of defined Extraction Modules repeatedly at a configurable time interval, for example every 5 minutes, instead of only in response to an environment event (a minimal sketch appears after this list).
● Explore the possibility of supporting extraction using the PREMIS 3 vocabulary and LRM dependency definitions.
● Implementation of the investigated weighted graphs and their integration into PET. The FreeMind [81] open source tool (MIT license) could be used to visualize the weighted graph.
● Automated inference of dependencies from the monitored environment events.
● Extraction Module configuration templates could be developed, similar to the Profile Templates, to export and ship single configured modules.
● GUI refactoring: the help and event tabs are Profile-independent and should be shown outside the Profile area.
● Further development of the "General Native Command Module", which allows the execution of customized terminal commands as an Extraction Module. Support for extracting parameters from the command output would be useful (a second sketch appears after this list).
● The Information Tree is the main GUI display for extraction results. Two other display methods are available, both of which allow filtering the information by the Extraction Module that extracted it. It would be useful to enable such filtering for the Information Tree as well; the same combo box for selecting the Extraction Module could be used for all three displays.
● Some intelligent redundancy management for the extraction results could be implemented.
● Some operating system files could naturally be excluded from extraction and monitoring, for example using digital forensics databases such as the National Software Reference Library (NSRL) Hashsets and Diskprints (http://www.nsrl.nist.gov/). This would be helpful for handling large numbers of files.
● Currently a "Part" is the data structure that represents an important file to be investigated during the extraction process. This concept could be extended to also support directories as "Parts", which would allow all files created in a directory in the future to be included in the investigation.
● The configuration of Extraction Modules could be supported by the CLI. At the moment the user has to modify the JSON configuration file directly to avoid the use of the GUI.
● At the moment there could be conflicts if two Extraction Modules get the same name. A UUID for Extraction Modules would be useful to avoid this.
● Think about whether there is any good method to show the extracted information at the CLI. The problem is hard to solve because of the great amount of extracted information.
● The Extraction Modules could be extended by a variable that indicates their current state. A state could indicate problems such as "Further configuration needed".
● Extensively test on Windows (the current version was developed and mainly tested on Linux and OS X, with limited testing on Windows).
● The configuration of Extraction Modules is currently possible using a GUI editor that manipulates a JSON file. An improvement would be the generation of a configuration GUI.
● Currently all existing Extraction Modules are loaded into the default Profile at the first tool start. This is good for demonstrations, but less good for real usage. It would be better to load no modules and to display a message telling the user that Extraction Modules should be added to the Profile.
● Develop further Extraction Modules:
○ Exif information extraction
○ Maven / Ant dependency extraction
○ Extraction of installed and used drivers
○ Extraction of information about installed programming languages (already existing for Java)
○ Extraction of provenance and comments from version control systems
○ IDE (integrated development environment) information extraction
○ IDE project information extraction
○ Dependency extraction from IDEs
○ Inclusion of the TIMBUS extractors as modules to enable the extraction of business activity context information
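Two of the ideas above lend themselves to short sketches. The first concerns timer-driven extraction: running defined Extraction Modules at a configurable interval could build on the standard Java ScheduledExecutorService, as below. The integration point is hypothetical; a real implementation would invoke the Extractor rather than print a message.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class ScheduledExtractionExample {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            // Trigger a snapshot extraction every 5 minutes, independently
            // of environment events.
            Runnable extraction = () ->
                    System.out.println("running configured extraction modules");
            scheduler.scheduleAtFixedRate(extraction, 0, 5, TimeUnit.MINUTES);
        }
    }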
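The second sketch concerns the "General Native Command Module": executing a configurable terminal command and extracting a parameter from its output with a regular expression. The command and the regex convention shown are assumptions made for illustration, not PET's actual configuration format.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class NativeCommandExample {
        /* Runs the given command and returns the first capture group of the
           pattern found in its output, or null if nothing matches. */
        static String runAndExtract(String[] command, String regex)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // merge stderr into stdout
            Process process = pb.start();
            Pattern pattern = Pattern.compile(regex);
            String extracted = null;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    Matcher m = pattern.matcher(line);
                    if (extracted == null && m.find()) {
                        extracted = m.group(1);
                    }
                }
            }
            process.waitFor();
            return extracted;
        }

        public static void main(String[] args) throws Exception {
            // Example: extract the kernel release from "uname -r" on Linux.
            System.out.println(runAndExtract(new String[]{"uname", "-r"}, "(\\S+)"));
        }
    }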