PERICLES - Promoting and Enhancing Reuse of Information
throughout the Content Lifecycle taking account of Evolving
Semantics
[Digital Preservation]
DELIVERABLE 4.1
Initial version of environment information extraction tools
GRANT AGREEMENT: 601138
SCHEME FP7 ICT 2011.4.3
Start date of project: 1 February 2013
Duration: 48 months
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

Dissemination level
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)
Revision History

V#      | Date       | Description / Reason of change                                                             | Author
V0.1    | 2013.12.03 | First template and outline                                                                 | FC, AH
V0.2    | 2014.01.16 | Updated outline, created executive summary                                                 | FC
V0.3    | 2014.01.21 | Updated outline, new sections                                                              | FC, MCH
V0.4    | 2014.04.01 | Updated outline, additions and fixes                                                       | FC, SW
V0.5    | 2014.05.02 | Updated outline, initial integration of substantial new material                           | FC, AE, AH, MCH, SW
Draft 1 | 2014.05.19 | First draft with new section and additions                                                 | FC, SD, AE, AH
Draft 2 | 2014.06.17 | Second draft: improved SEI definition, new SOA, simplified outline, Context update         | FC, AH, AE, SK, TM
Draft 3 | 2014.06.30 | Complete draft for internal review. Finished chapters 1-2, work on 4.7, work on chapter 9  | FC, AE, AH, SK
Draft 4 | 2014.07.04 | Completed all chapters, updated definition of SEI                                          | FC, AE, JL
Draft 5 | 2014.07.14 | First external review draft. Addresses internal reviewer's comments, small addition to lifecycle | FC, MCH
Draft 6 | 2014.07.15 | Moved to DOCX template                                                                     | FC
Draft 7 | 2014.07.17 | Formatting and other fixes                                                                 | SK
Draft 8 | 2014.07.29 | Added explicit references to use case requirements (D2.3.1) and improved requirement analysis section | FC
V1.0    | 2014.07.29 | Final version                                                                              | FC
Authors and Contributors

Authors
Partner                                    Name
University of Liverpool                    Fabio Corubolo (FC)
University of Göttingen                    Anna Eggers (AE)
University of Liverpool                    Adil Hasan (AH)
King's College London                      Mark Hedges (MH)
University of Borås                        Sándor Darányi (SD)
King's College London                      Simon Waddington (SW)
Centre for Research & Technology, Hellas   Stratos Kontopoulos (SK)

Contributors
Partner                                    Name
Centre for Research & Technology, Hellas   Tasos Maronidis (TM)
University of Göttingen                    Jens Ludwig (JL)
Tate                                       Patricia Falcao (PF)
Tate                                       Pip Laurenson (PL)
Contents

Glossary
1 Executive Summary
2 Introduction & Rationale
  2.1 Context of this Deliverable Production
    2.1.1 Relation to other work packages
    2.1.2 Relation to the other work package tasks
  2.2 What to expect from this Document
  2.3 Document Structure
  2.4 Task 4.1 outline from the DOW
3 DO information for preservation & reuse
  3.1 Metadata
  3.2 Significant Properties
  3.3 Context
  3.4 Environment information
4 Significant Environment Information
  4.1 Definition of dependency and other entities in PERICLES
  4.2 Definition of Significant Environment Information
  4.3 Measuring significance
  4.4 SEI in the digital object lifecycle
    4.4.1 Post hoc curation and digital forensics
  4.5 SEI in the PERICLES case studies
    4.5.1 Software Based Artworks
    4.5.2 Space science scenario
  4.6 Modelling SEI with an LRM ontology extension
    4.6.1 Examples in support of weight and SEI modelling
    4.6.2 Modelling SEI with LRM - A preliminary approach
5 Digital Ecosystem (DE) Metaphor and SEI
  5.1 State of the art about digital ecosystems
    5.1.1 The DE metaphor in PERICLES research activities
    5.1.2 Interaction
  5.2 A parallel between SEI and environment in Biological Ecosystems
    5.2.1 The metaphor applied to the SBA example
6 Mathematical approaches for SEI analysis
  6.1 Problem description
  6.2 Solution: sensor networks
  6.3 Contextual anomalies
7 The PERICLES Extraction Tool
  7.1 Introduction
  7.2 General scenario for SEI capture
  7.3 Provenance, related work
  7.4 PET general description and feature overview
  7.5 Requirements analysis and software specification
  7.6 Software Architecture
  7.7 Extracting SEI by observing the use environment
  7.8 Modules for environment extraction
  7.9 Development process and techniques
  7.10 Scalability
8 Experimental results and evaluation
  8.1 Space science
    8.1.1 PET on operations: anomaly dependency information
    8.1.2 PET on extracting results of scientific calculations
    8.1.3 SOM and anomaly detection on sensor data
  8.2 Software Based Art: system information, dependencies
    8.2.1 System information snapshot
    8.2.2 Extracting font dependencies
    8.2.3 Experiments on SBA: Brutalism
  8.3 Self-organising maps on video data
  8.4 First internal evaluation (London, January 2014)
9 Conclusions and future work
  9.1 Conclusions
  9.2 Future work
    9.2.1 Ideas for future PERICLES tasks
    9.2.2 Other Ideas not assigned to specific PERICLES tasks
10 Bibliography
Appendix A: List of requirements
Appendix B: PET tool architecture in detail
Appendix C: Ideas for further PET developments
Glossary

Abbreviation / Acronym   Meaning
BE      Biological Ecosystem
CLI     Command Line Interface
Dx.y    Deliverable x.y from PERICLES
DE      Digital Ecosystem
DF      Digital Forensics
DO      Digital Object
DOW     Description of Work
DP      Digital Preservation
EI      Environment Information
ESOM    Emergent Self-Organizing Maps
GUI     Graphical User Interface
LRM     Linked Resource Model
PET     PERICLES Extraction Tool
RTD     Research and Technology Development
SBA     Software Based Art/Artwork
SEI     Significant Environment Information
SOM     Self-Organizing Maps
SP      Significant Properties
SW      Software
Tx.y    Task x.y from PERICLES
UAG     User Advisory Group
UC      Use Cases
WPy     Work Package number y from PERICLES
Figures

Figure 1. Relationships between different tasks in WP4.
Figure 2. The problem of information loss for external data.
Figure 3. Information extraction and encapsulation to avoid external information loss.
Figure 4. A view on Digital Object and related information (…)
Figure 5. The OAIS information model.
Figure 6. From [1], the PREMIS 2 data model.
Figure 7. From [27], proposed changes for the PREMIS 3 standard (…)
Figure 8. A possible example of a dependency graph.
Figure 9. SEI influences on SP of DO.
Figure 10. Extension to the LRM for representing weights and purposes.
Figure 11. Calibration data dependency modelling.
Figure 12. The biological metaphor.
Figure 13. PET snapshot.
Figure 14. PET architecture sketch.
Figure 15. PET detailed architecture.
Figure 16. PET snapshot showing Extraction Modules and their configuration.
Figure 17. Screenshot showing changes in the 'handover sheet' (…)
Figure 18. Trace of document use (…)
Figure 19. Octave script used for the example.
Figure 20. Screenshot of the PET showing a calculation result extraction.
Figure 21. An unsupervised pipeline for labelling outliers (…)
Figure 22. The U-matrix of a toroid emergent self-organizing map (…)
Figure 23. SOM map for the video data labels.
Figure 24. Close-up of the SOM map with labels.
1 Executive Summary
This deliverable defines what constitutes significant environment information, supporting novel
approaches for its collection and taking into account a wider scope of collectable information, in
order to improve the use and reuse of information, both in the present and in the future.
Digital objects (DOs) live in an environment containing a variety of information that can be relevant
to their use. In order to take into account all the information needed, we take a very broad view on
the DO environment, considering information from anywhere, at any time, and of any type. We
recognise that some of the environment information external to the organisation in charge of the
data could change or disappear at any given time. Collecting such information at creation and use
time is thus very relevant for preserving the long-term use of the data.
Looking anywhere, we start by analysing how such information has been described in related work,
considering how common definitions of metadata, context, significant properties and environment
describe information from a DO environment, and we come to the conclusion that we need to
consider the broadest set of information, which we term environment information. Our intuition is
that the focus should be on what matters for a DO's use and reuse, to support it in the long term, when the environment is likely to change.
How can we represent such prerequisites to the use of DOs? We adopt the concept of dependency,
common in PERICLES RTD work packages, and extend it to express the prerequisites of use.
Building on the existing definitions, we introduce the concept of Significant Environment
Information (SEI), which takes into account the dependencies of the digital object on external
information for specific purposes, with significance weights that express the importance of such
dependencies for the specific purpose. Such information naturally forms a graph structure (the
concept of a dependency graph is common to the RTD work packages) that in our case can
support an enhanced appraisal process. The graph allows us to deduce other significant information
to be extracted and to infer relationships between different objects. Our approach facilitates the
gathering of information that is potentially not covered by established standards, and enables
better long-term preservation, in particular for complex digital objects.
From there we expand the definition in time, considering the importance of collecting SEI during any
phase of the digital object lifecycle, from creation (sheer curation), through primary use, to post hoc
collection (e.g. using digital forensics). This will allow us to observe the use of DOs and help infer
their SEI dependencies. We present examples of SEI from the PERICLES stakeholders, and present an
extension of the LRM model (the abstract model being developed in WP3) to represent SEI. We
investigate the analogy of Digital Ecosystems and Environments with the Biological domain, in order
to find approaches from the more mature Biological domain that can be applied to Digital Ecosystems
and Environments, and present mathematical approaches for SEI analysis.
Our PERICLES Extraction Tool (PET) is built on these concepts. It is a modular and generic open source
(final approval pending at the moment of writing) framework for the extraction of SEI from system
environments where digital objects are created and used. PET is built to be domain agnostic, and
supports extension by external modules. The modules can easily make use of a variety of existing
tools (such as Apache Tika, mediainfo, mdls, and others) in order to address domain-specific needs.
The tool automates novel techniques to collect SEI, and supports sheer curation, a continuous,
transparent monitoring and collection process that the user (e.g. the scientist or artist in our use
cases) would otherwise have to find time to perform manually. We present experimental results
supporting the approach, based on the PERICLES use cases. The results from the PET tool will be
further analysed in later tasks of the work package.
2 Introduction & Rationale
A digital object's (DO) environment contains a variety of information that needs to be captured to
support its use and reuse in the long-term. In this deliverable we identify the forms that this
information may take, and propose a software framework to support its collection.
The environment where Digital Objects are created, processed and used is an active research topic in
the area of long-term preservation [1][2]. Current approaches for collecting information from a DO's
environment are mostly standards driven, based on the large number of metadata standards available
[3]. Standards are usually managed by committees that decide on their contents based on community
feedback. While this is of course a well-tested and effective method, our focus is on what could be
improved in metadata collection. By analysing the current situation, we found that, notable
exceptions aside, most metadata standards do not include information about the environment, but
rather descriptions of the objects themselves, and are stretched to their limits in supporting
specialised use cases. We consider environment information to have a wide scope, including all the
information that can be important to a DO's use and reuse.
In recent approaches more attention is paid to the importance of information residing in the
environment for long-term preservation purposes. For example, the TIMBUS [4] project investigated
the extraction of context information about business activities from the environment. The aim in [4]
is to ensure that business processes will be re-playable by capturing their context; this is parallel to,
but different from, the objectives of this deliverable.
We decided to focus our research on DOs, their uses by user communities and the requirements that
these uses create in the environment of the DOs, in order to support the collection of a wider set of
environment information. We consider that this use-driven approach will allow us to address
information that would otherwise likely be ignored, either because it does not fit existing standards
or because it is dependent on the specifics of the use case.
We start by analysing how such information has been described in related work, considering how
common definitions of metadata, context, significant properties and environment describe
information from a DO environment, and we come to the conclusion that we need to consider the
broadest set of information, which we term environment information. Our intuition is that the
focus should be on what matters for a DO's use and reuse, to support it in the long term, when the
environment is likely to change.
To represent environment information we introduce the term Significant Environment Information
(SEI), formally described in chapter 4, roughly covering the information needed to make use of a DO,
when considering a specific purpose and user community. We propose to use dependencies to
express the link between the DO and its SEI, weighted by the significance, and qualified by the
purpose of use. Such information naturally forms a graph structure (the concept of a dependency
graph is common to the RTD work packages) that in our case can support an enhanced
appraisal process. The graph allows us to deduce other significant information to be extracted and to
infer relationships between different objects. Our approach facilitates the gathering of information
that is potentially not covered by established standards, and enables better long-term
preservation, in particular for complex digital objects. The direct environment of an investigated DO
draws the boundaries of this perspective, whereas WP5 investigates a top-down approach,
considering the dependencies in the whole digital ecosystem.
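To make the notion of weighted, purpose-qualified dependencies more concrete, the following minimal Python sketch shows one possible shape for such a dependency graph and how it could rank what to extract for a given purpose. The objects, purposes and weights are invented for illustration; this is neither the LRM representation (see section 4.6) nor PET code.

```python
from collections import defaultdict

class SEIGraph:
    """Toy model of purpose-qualified, weighted SEI dependencies (illustrative only)."""

    def __init__(self):
        # edges[digital_object] -> list of (environment_entity, purpose, weight)
        self.edges = defaultdict(list)

    def add_dependency(self, do, entity, purpose, weight):
        """Record that `do` depends on `entity` for `purpose`, with significance `weight` in [0, 1]."""
        self.edges[do].append((entity, purpose, weight))

    def significant_for(self, do, purpose, threshold=0.5):
        """Return environment entities worth extracting for this purpose, most significant first."""
        hits = [(w, e) for (e, p, w) in self.edges[do] if p == purpose and w >= threshold]
        return [e for w, e in sorted(hits, reverse=True)]

g = SEIGraph()
g.add_dependency("report.pdf", "font:Helvetica", purpose="render", weight=0.9)
g.add_dependency("report.pdf", "calibration-data.csv", purpose="reanalyse", weight=1.0)
g.add_dependency("report.pdf", "font:Helvetica", purpose="text-mining", weight=0.1)

print(g.significant_for("report.pdf", "render"))       # ['font:Helvetica']
print(g.significant_for("report.pdf", "text-mining"))  # [] (font not significant here)
```

The same dependency can carry different weights for different purposes, which is exactly what allows appraisal to be driven by the intended use rather than by a fixed standard.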
We also investigate the analogy between Digital Ecosystems and Environments with the Biological
domain in order to understand if there are approaches from the more mature Biological domain that
can be applied to the Digital Ecosystem and Environments, and present mathematical approaches for
SEI analysis.
As the digital environment continually changes, the significant information could be lost if it were not
extracted at the moment of occurrence. To address this risk, we investigate in Section 4.4 a sheer
curation approach, in which the environment is observed for events and significant information is
extracted continuously based on these environment events. Sheer curation is an approach where
curation activities take place in the environment of use and creation of DOs. This technique offers
great potential for collecting additional environment information, as it provides direct access to the
digital object's creation and use environment, where we can directly observe dependencies of use of
DOs.
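As a rough illustration of this sheer curation idea, the sketch below polls a directory and reacts to change events, which is the moment at which volatile environment information would need to be captured. It is a simplified stand-in, not the mechanism actually used by PET; the watched path and polling interval are arbitrary assumptions.

```python
import os
import time

def snapshot(directory):
    """Map each file in the directory to its last-modification time."""
    state = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            state[path] = os.stat(path).st_mtime
    return state

def watch(directory, interval=2.0):
    """Poll the directory; on each change event, extract environment information immediately."""
    before = snapshot(directory)
    while True:
        time.sleep(interval)
        after = snapshot(directory)
        for path in set(before) | set(after):
            if before.get(path) != after.get(path):
                # Placeholder for real extraction: this is where the volatile
                # environment state (running processes, current user, open
                # documents, ...) would be captured at the moment of use.
                print("event:", path, "-> extract SEI now")
        before = after

# watch("/tmp/watched")  # uncomment to run against a real directory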
Our PERICLES Extraction Tool (PET) builds on these concepts. It is a modular and generic open
source (final approval pending at the moment of writing) framework for the extraction of SEI from
system environments where digital objects are created and used. PET operates in a sheer
curation scenario, where it observes the environment and continuously collects environment
information. Furthermore, it provides a post hoc extraction mode to capture a single snapshot of the
current environment information. The tool is designed to work in combination with an information
encapsulation tool, to be developed in the succeeding task 4.2 of this work package, enabling the
extracted information to be encapsulated directly with the related digital object. Experiments
applying the tool, as well as substantiating the theoretical part, based on the PERICLES use case
areas of digital art and space science, are also presented in this deliverable, and confirm the
applicability of the tool to the different use cases. A mathematical approach for the detection of
anomalies by analysing significant environment information is also investigated.
Significant environment information will enable many new uses for digital objects that are not
supported by currently established techniques. Evaluating and extracting potential significant
environment information can facilitate unpredictable future re-uses. This concept has the potential
to become a fundamental part of long-term preservation activities.
2.1 Context of this Deliverable Production
2.1.1 Relation to other work packages
This deliverable draws from, and contributes to, work in different work packages, as briefly
illustrated here.
WP2: The rapid case studies and use case scenarios were used in this deliverable to draw initial
requirements for the PET tool (section 7.5), and also to validate the concept of SEI in very diverse
cases, where the case studies were analysed and SEI was determined for them. In particular, these
can be found in sections 4.5 (examples of SEI from the case studies), 5.2.1 (Digital Ecosystem
metaphor applied to Software Based Art) and 8 (experiments of PET extraction).
WP3: The focus on the dependency model shared by the RTD WPs has brought a lot of very useful
cross-fertilisation. The importance of purpose when considering dependencies was one of our
contributions to the concepts in the LRM. The modelling of SEI and weights has been built as an
extension to the LRM from WP3 (see section 4.6); this will help to define a common language and
allow better reuse of our discoveries.
WP5: While our focus is on the scope of the DO environment, building from the DO up to the entities
related to it, WP5 explores the concept of the digital ecosystem, starting from the perspective of the
system down to the DOs and other entities. This can be seen as, for WP4, taking into
account the properties of DOs with a strong focus on their properties and environment, while in WP5
the focus is on the system properties specific to the organisations in custody of the data. Given this
significant difference in point of view, there is overlap between the two concepts, and we are for
this reason using the same terminology and the same types of entities across the work packages, as
illustrated in section 4.1, where the D5.1 definitions are used to avoid duplication.
WP6-7: The PET is designed as a client-side tool, as it needs to observe the environment of use;
but its modules and tools could potentially be integrated on the server side in WP6-7. The extracted
SEI is currently stored on the user's local storage (in the programme folder) and never sent to a
remote server, for privacy reasons. It is still conceivable that specific SEI could be sent to a remote
server: given the exchangeable storage engine we implemented in PET, this could be a place where
WP6-7 could implement the integration.
2.1.2 Relation to the other work package tasks
Task 4.1 will directly or indirectly feed into all the subsequent work package tasks, as illustrated by the
following list and in Figure 1.
● T4.2 "Environment information encapsulation", where the information collected in T4.1
(and later task analysis results) will be encapsulated within the Digital Object to create richer,
more self-describing DOs (see also sections 3.1 and 9.2.1).
● T4.3 "Semantic content and use-context analysis", which will also use information from T4.1,
performing semantic analysis on the extracted information to extract meaningful features.
● T4.4 "Contextualised content interpretation", which will use the T4.1 and T4.3 information to
perform an analysis and interpretation of the extracted features in time.
[Figure 1 here shows the task pipeline: T4.1 "Identification and extraction of the environment" passes timestamped environment information to T4.2 "Environment information encapsulation" and to T4.3 "Semantic content and use-context analysis" (feature extraction), which passes timestamped features to T4.4 "Contextualised content interpretation" (interpretation).]

Figure 1. Relationships between different tasks in WP4.
2.2 What to expect from this Document
This deliverable defines significant environment information by broadening the existing
definitions to cover information that is relevant to support better use and reuse of DOs but has up to
now been largely ignored. The definition of SEI is based on an analysis of current approaches and
standards. By analysing the Digital Ecosystem metaphor and its relation to SEI, we anticipate
promising pointers for future research. We also describe the methods and tools we have
developed (the PET tool) to identify and extract such information by observing DOs in their creation
and use environment, and we present mathematical methods for the analysis of SEI.
We will present examples of SEI in the two PERICLES use case domains, and later provide an analysis
of experimental results on those use cases.
This deliverable focuses on the following aspects, from the point of view of a DO environment:
● What can be defined as environment information in the perspective of the (re)use of the
data in digital preservation? An analysis of current standards and approaches provides the
basis of our definition (Ch. 3). We argue that metadata standards driven approaches cannot
cover, in some cases, all the information needed for DO reuse.
● Significant environment information is defined here to cover a broader set of information,
not constrained by strict definitions, based on the purpose of use and weighted by its
importance (Ch. 4).
● Volatile significant environment information exists only at the moment of interest and is
difficult or impossible to reconstruct afterwards; for this reason it is important to collect it
in different phases (creation and use time) of the DO lifecycle.
● SEI comes from a variety of different sources, possibly under different domains. A generic
approach is important in order to provide tools that can cover multiple domains.
● We describe various techniques that can be applied for the extraction of significant
environment information.
We are also investigating, in collaboration with the other RTD work packages, the issues of modelling
dependencies between such information.
2.3 Document Structure
The document is structured as follows:
1. Introduction
Chapter 1: Executive summary
Chapter 2: Introduction, project context, structure
2. Theory
Chapter 3: State of the art – Metadata, significant properties, context, environment
Chapter 4: Significant environment information – Dependencies, sheer curation,
weighted graphs, examples to substantiate the theory
3. Models
Chapter 5: Biological ecosystem – Digital ecosystem metaphor, pointers for future
work
Chapter 6: Mathematical SEI analysis, anomaly detection
4. Software
Chapter 7: The PERICLES Extraction Tool – Description, requirements and features,
architecture design, development process;
5. Experiments
Chapter 8: Practical experimentation in the space science and art domains
6. Conclusion
Chapter 9: Conclusion, future work
2.4 Task 4.1 outline from the DOW
A digital object's environment contains a variety of information that needs to be captured if the
object is to be preserved and used in the future. This task is concerned with identifying the forms
that this information may take, and with developing methods to extract the information from the
environment in a scalable and reliable manner.
T4.1.1 Environment information specifications
● Identify the nature and source of the information necessary to preserve the critical features
of the environment in which digital objects are created, curated and used (such as
representation information, significant properties, provenance, policies, context, etc.), using
information from the use cases (media and science) in WP2 and the existing expertise of the
partners.
● Analyse the types and roles of resources involved in the extraction of environment
information, to inform the modelling work in WP3 (resource modelling, context modelling,
content modelling, change management).
● Identify models of semantic and mathematical approaches that fit the identified
media-dependent environment variables.
T4.1.2 Methods for environment information extraction
● Investigate the application of techniques (e.g. weighted graphs, digital forensics, sheer
curation) for extracting environment information such as that identified in T4.1.1.
● Develop a toolkit for extracting environment information that implements some of these
techniques.
● Apply the semantic and mathematical models identified in T4.1.1 to the extracted
information.
3 DO information for preservation & reuse
In this chapter, we examine previous work on identifying and representing the information for a DO
that is relevant to support the reuse of that object in the long term, across different user
communities and for different purposes. We structure this examination by beginning with
information that comes from the DO itself, then move beyond the DO with the aim of identifying a
broader set of information that needs to be taken into account to better support DO reuse. We
recognise that this classification is one among many; the aim is however to show one thread that
leads us to the topic of Significant Environment Information, introduced in chapter 4.
3.1 Metadata
Metadata can be defined as the information necessary to find and use a DO during its lifetime [6].
This definition covers a wide variety of information, and was further refined by the Consultative
Committee for Space Data Systems in their reference model for an Open Archival Information System
(OAIS) [7]. This refinement covered the information necessary for the long-term storage of DOs, and
identified a number of high-level metadata categories, as follows.
The Descriptive Information (DI) consists of information necessary to understand the DO, for example
its name, a description of its intended use, when and where it was created, etc. The Preservation
Description Information (PDI) consists of all the information necessary to ensure that the DO can be
preserved, including fixity (e.g. a checksum), access rights, unique identifier, context information
(described in more detail in the following subsection) and provenance, which describes how the
object was created. The final category arises from the fact that the OAIS manages not the DO itself,
but information packages which consist of the DO as well as the DI, PDI and information required to
interpret the contents of the DO (which is described by the Representation Information (RI)). The
Packaging Information (PI) category describes how the information package is arranged such that
individual elements can be accessed.
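As a compact summary of these OAIS categories, the following sketch arranges them as a hypothetical data structure; the field names and the default packaging value are our own assumptions, not a normative OAIS serialisation.

```python
from dataclasses import dataclass, field

@dataclass
class PreservationDescription:           # PDI
    fixity: str                          # e.g. a checksum
    identifier: str                      # unique identifier
    provenance: list = field(default_factory=list)  # how the object was created
    access_rights: str = ""
    context: str = ""

@dataclass
class InformationPackage:
    content: bytes                       # the DO itself
    descriptive: dict                    # DI: name, intended use, creation date...
    pdi: PreservationDescription
    representation: dict                 # RI: format standard, rendering software...
    packaging: str = "zip"               # PI: how the package is arranged (assumed)
```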
Standard file formats have standard structural metadata (e.g. MPEG21 [8]), and de facto standards
(e.g. the Text Encoding Initiative [9]) exist for popular formats. The situation on standardisation for
the descriptive part of the RI is more complex due to the different needs of different communities,
although many approaches contain the Dublin Core metadata element set [10] as a core. A catalogue
of metadata standards for different communities can be found on the Digital Curation Centre
website [11].
Metadata may be treated as a separate entity, as it can be accessed without accessing the DO, but
the lack of metadata adversely affects the access to or reuse of the DO. While such information is
essential for the reuse of the DO it is not in general sufficient; information concerning the external
relationships of a DO, whether to other DOs, stakeholder communities, or other aspects of the
environment within which a DO is created or curated, also needs to be taken into account to ensure
that the DO can be used fully and appropriately.
Metadata may be held internally in a DO, e.g. in the header of a structured file, or externally, e.g. in a
database or in separate data structures such as file system extended attributes. The location of
metadata is an important factor for Long-term Digital Preservation (LTDP), as it has
consequences for the availability of the information when, for example, the data is moved to an
external location; in such an event, external metadata is in danger of being lost. To give a short
motivating example of potentially useful metadata that is stored in an external location, we take the
Apple OS X Spotlight metadata. When downloading data from a web browser, OS X
stores provenance information about the URL from which the data was downloaded. Such provenance
information (which can be useful to understand the origin of the data) is lost, if proper action is not
taken, as soon as the data is archived to an external location (for example, into a ZIP file), as it is part
of the OS indexing engine database.
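For this example, the download URL is recorded under the Spotlight attribute kMDItemWhereFroms and can be queried on OS X with the stock mdls tool; a minimal sketch (the file path is hypothetical):

```python
import subprocess

def where_from(path):
    """Query Spotlight for the URL(s) a file was downloaded from (OS X only)."""
    out = subprocess.run(
        ["mdls", "-name", "kMDItemWhereFroms", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

print(where_from("/Users/alice/Downloads/report.pdf"))  # hypothetical path
# Archiving the file into a ZIP drops this externally held attribute, which is
# exactly the external-information loss illustrated in Figure 2.
```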
Figure 2. The problem of information loss for external data.
The issue is illustrated in Figure 2, where the relationship between the information that is needed to
use a DO and the DO itself is implicit, and is lost when the DO is transported to a new location. This
can be solved in two steps:
1. We first need to identify and extract the useful information from the DO environment
(external information). We are specifically addressing this issue with the definition of
Significant Environment Information in chapter 4, and with the PET software, described in
chapter 7.
2. We will address the encapsulation of such information together with the DO in the next task
in this work package, Task 4.2, so that the information is not lost when DOs are moved or
migrated (as shown in Figure 3).
This is one of the reasons why we consider the extraction, and subsequent encapsulation, of such
external information an important step towards guaranteeing the long-term use of DOs.
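Purely to sketch the idea of Figure 3 (the actual encapsulation approach is the subject of Task 4.2), external metadata could travel with the DO as a sidecar entry inside the archive; the file names and JSON layout below are our own invention:

```python
import zipfile

def archive_with_sidecar(do_path, extracted_info, zip_path):
    """Package a DO together with its extracted environment information."""
    with zipfile.ZipFile(zip_path, "w") as z:
        z.write(do_path)                            # the digital object itself
        z.writestr(do_path + ".environment.json",   # sidecar carrying the extracted SEI
                   extracted_info)

# archive_with_sidecar("report.pdf",
#                      '{"whereFroms": ["http://example.org/report.pdf"]}',
#                      "report_with_sei.zip")
```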
Figure 3. Information extraction and encapsulation to avoid external information loss.
Table 1. An extract of our early analysis of metadata standards and their information content across standards.
As part of the work to analyse existing standards and determine useful information to collect, as well
as existing gaps, we have built a comparative table illustrating the different standards and features,
as illustrated in Table 1. The table presented here is an extract of the original table and thus not
complete; the complete table is available online¹ and on the PERICLES WIKI.

¹ http://goo.gl/sJ1Z1m
This table illustrates the scope of a number of current metadata standards, and drove us to aim at
the collection of a broader set of information, defined by the uses, as opposed to taking a standards-driven
approach. Our first attempt at determining what to capture was based on the different
existing standards and their classifications. After some work, it was clear that a lot of
time would be required for the classification and mapping of existing standards, and that the result
would in any case be subjective. We do not mean to imply that such classifications are not useful in
many cases, but as a number of standards and classifications already exist, we determined that a
different approach would be beneficial and would fill a gap in the current approaches.
The standards were based on particular uses and so offer some clues, in those domains, as to what
constitutes the useful environment information to collect. Still, domain-specific standards do not
map easily across domains, while generic standards, being generic, cover only a subset of what is
needed, leaving users with the task of defining a metadata schema for their own domain. For this
reason we opted for a bootstrapped, practical approach.
We considered that what really matters is the capability of making use of the information, as we will
describe here and in the following chapters, and decided that this is what can guide our
selection of the environment information.
Starting from the use of the data and looking for the required information will, we assume, allow us
to gather a broader scope of information that could be partially ignored when using a standards-driven
approach, covering what is necessary for the use and reuse of the information by the user
communities, and to observe its change over time. One possible interpretation of the scope of the
different types of information is given in Figure 4.
Figure 4. A view on Digital Object and related information, from the narrowest to the broadest.
3.2 Significant Properties
The concept of significant properties (SP) has been much discussed in Digital Preservation (DP) over
the past decade (see for example [12, 13, 1]), in particular in the context of maintaining authenticity
under format migrations, given that some characteristics are bound to change as formats are
migrated. The issue here was to identify which properties of an object are significant for maintaining
its authenticity.
Early work in this direction may be found in [12], where SP are introduced as a “canonical form for a
class of digital objects that, to some extent, captures the essential characteristics of that type of
object in a highly determined fashion”. Later work [13] investigated ways of classifying the
properties: “Significant Properties, also referred to as “significant characteristics” or “essence”, are
essential attributes of a digital object which affect its appearance, behaviour, quality and usability.
They can be grouped into categories such as content, context, appearance (e.g. layout, colour),
behaviour (e.g. interaction, functionality) and structure (e.g. pagination, sections).” The concept has
been adopted by standards such as [1], describing SP as “Characteristics of a particular object
subjectively determined to be important to maintain through preservation actions.” Such
characteristics may be specific to an individual DO, but can also be associated with categories of DO.
An important aspect of SP is that significance is not absolute; a property is significant only relative to
(e.g.) an intended purpose [14], or a stakeholder [2], or some other way of identifying a viewpoint.
This intuition is highly relevant to the work we will describe in chapter 4, where significance is
considered a key property for collecting relevant information from the environment.
While the concept of SP is useful for digital preservation, in its application it has usually been
restricted to internal properties of a DO, for example the size and colour space of an image, or the
formatting of text documents, rather than the potentially valuable information that is external to the
object itself. There have been some indications of a broader conception: [13] identifies context as a
category of SP, [15] refers to the need to preserve properties of the environment in which a DO is
rendered, and [2] introduces the notion of characteristics of the environment. The latter associates
environments with functions or purposes; this differs from what we are aiming at, which is to
describe the significance of information from a DO's environment in relation to the purpose the user
is pursuing (such as editing the object, processing the object, etc.). We thus see the purpose as
qualifying the significance, not the environment: a piece of information is significant for a specific
purpose, but not for some other purpose (within the same environment).
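By way of contrast with environment information, such internal SPs are easy to extract with standard tooling; a sketch using the (assumed available) Pillow imaging library:

```python
from PIL import Image  # third-party: pip install Pillow

def image_significant_properties(path):
    """Read two classic internal significant properties of an image."""
    with Image.open(path) as img:
        return {"size": img.size,          # (width, height) in pixels
                "colour_space": img.mode}  # e.g. 'RGB', 'CMYK', 'L'

# print(image_significant_properties("artwork.png"))  # hypothetical file
```

No equally generic extractor exists for the external, purpose-dependent information discussed in this deliverable, which is the gap the PET tool addresses.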
In this deliverable we often draw on Software Based Artworks (SBA) as examples, because they
demonstrate well that just preserving intrinsic metadata cannot always preserve SPs. [16]
describes the special role the environment has for born-digital media art: "The digital
environment contains both: the core elements of the artistic software as well as the operating
system and modules required by the artistic software."
In [17] the important question "What are the Spatial or Environmental Parameters of a Work?" is
asked to identify one of the areas of focus for SPs of SBAs. Also, [18] outlines the "significance of the
hardware/software systems for video and software based artworks". One of her case studies,
concerning the artwork Becoming by Michael Craig-Martin, is a good example to illustrate the
dependency of an SBA's SPs on the environment: "The randomness was created using Lingo, the
programming language for Director. This randomness makes use of the computer's system clock, and
the speed at which images become visible or invisible is also dependent on the speed of the
computer."
3.3 Context
Context is a term with many definitions, a basic dictionary definition being "the circumstances that
form the setting for an event, statement, or idea, and in terms of which it can be fully understood"
[19]. This clearly relates context to the purpose of understanding information, and this is a key
feature of context in relation to digital objects. Context encompasses a broader range of information
than metadata; it describes the setting that enables an understanding of a DO [20], including for
example other DOs, metadata, significant properties, relationships, and policies governing the
curation or use of the DO. In [21] context is defined even more broadly as ‘all those things which are
not an inherent part of information phenomena, but which nevertheless bear some relation to
these’, where the nature of the ‘relation’ is left unspecified.
Figure 5. The OAIS information model.
The OAIS model (Figure 5) views context as the relationship between a DO (equivalent to the Content
Information in OAIS terms) and its environment². In this view, the environment is considered to be
necessary for using the DO, although this does not take into account two factors that we consider
essential for our purposes: firstly, the possible variety of different uses to which a DO may be put,
which will in general differ in the demands of 'necessity' they make on the environment; and
secondly, the variable strengths of the relationship with different aspects of the environment.
In TIMBUS [5], context is explored from the point of view of supporting business processes in the long
term, describing a meta-model based on enterprise modelling frameworks. The context parameters
cover a wide set, from the legal and business ones to the system and technological ones, with
the aim of supporting the execution of processes in the long term.
In PERICLES, DP and digital forensics are approached from the angle of preserving content in context,
where context can be spatial, temporal, conceptual and/or external; all of these aspects apply to
space and digital art, the domains under study. We note in passing that the dependence of content on
context as an insight has been around for quite some time³, and this insight returns in the way
dependencies play a key role in PERICLES models of content and context.

² From OAIS: "Context Information: The information that documents the relationships of the Content Information to its environment. This includes why the Content Information was created and how it relates to other Content Information objects."
³ E.g. in linguistics in the works of Zellig Harris (1909-1992). His distributional hypothesis has led to today's successful research on distributional or statistical semantics, underlying progress in information retrieval and machine learning.
Context Definition from the Multimedia Analysis Perspective
In semantic multimedia analysis the ultimate goal is to obtain high-level semantic interpretations of a
digital object. A plethora of techniques has been devised for semantic content description and
categorization in many application domains. Although some of these techniques display satisfactory
performance, they make no use of the contextual information that arises from the specific application. In
recent years, though, the notion of context has been gaining increasing interest in image/video
understanding. Modelling the context and integrating the contextual information with the raw
information of the source digital media has proven to improve the performance of state-of-the-art
techniques in image/video categorization [22]. In an attempt to model context information,
many approaches have been proposed and published in the literature for semantic image
understanding [23], [24], [25] or video indexing [26]. Furthermore, the use of context in conjunction
with domain knowledge, that is, information about the application domain in general, has
boosted the efficiency of content analysis.
From the multimedia analysis perspective, context is usually considered as the auxiliary information
that can help us cope with the erroneous interpretations produced by analysis modules that rely
solely on the audio-visual content. This information may be related to the environment where the
multimedia item is used, the purpose of its use, the reason for its creation, or even the equipment
used for its creation. For instance, in the media case study, a painting might be exhibited in a
museum sector referring to a specific time era (e.g., Renaissance, Rococo, etc.), where different,
more or less abstract concepts might be depicted. This environment information, along with
artist-specific information, i.e., who the creator of the painting is, as well as contextual information
acquired from his/her other works, may also affect the extraction of semantic information about the
artwork. Moreover, information regarding a potential reason why a piece of work has been created,
e.g., an exhibition devoted to a certain concept, for instance "representation of spring", may also
contribute to semantic extraction (object categorization, scene recognition, concept recognition,
etc.). As another example, in a music artwork, information about the equipment used for its creation
(e.g. whether it has been performed by orchestral instruments or electronically generated
using computer software) may be integrated with the pure content for annotation purposes.
Using a rough categorization, context could be classified into four main types: spatial, temporal,
conceptual and external, according to its origin, nature and what it manifests. These types are briefly
described here:
● Spatial context is related to the spatial organisation and structure of a digital object, e.g. an
image, and it usually originates from the pixels surrounding an object in an image or the
relative location, orientation and scale of multiple objects within an image.
● Temporal context is encoded by the temporal relation among a collection of digital objects,
for example in the form of the optical flow or frame proximity among consecutive frames in a
video.
● In a higher level of abstraction, conceptual context might represent the co-occurrence of
multiple concepts in a digital object (e.g., objects in an image) and in some sense it expresses
the correlation among several concepts.
● External context includes all those types of information that come from sources other than
the digital object itself, and is actually intimately related to descriptive information as
previously identified in terms of metadata.
It is worth mentioning though that this is only a rough classification enabling an easier perception of
the inherently abstract notion of context. As a matter of fact, there is usually an overlap between the
several classes; thereby a strict classification of contextual cues is rather far-fetched. Moreover, it is
clear that not all of these types are always suited for every kind of audio-visual content. For instance,
an image cannot be interpreted in terms of temporal context, while an audio file cannot be
interpreted within spatial context.
3.4 Environment information
The widest set of information in our view is the environment. We consider environment information
to include all the entities (DOs, metadata, policies, rights, services, etc.) useful to correctly access,
render and use the DO. The definition supports the use of unrelated DOs and conforms to the
definition of environment used by PREMIS [1].
Figure 6. From [1], the PREMIS 2 data model.
In the context of OAIS, the term 'Representation Information'⁴ and its specialisation 'Other
Representation Information'⁵, together with the information in the PDI, seem to include some of the
information we would classify as environment information. The point of view in OAIS is still that of
supporting the understanding of the object, and does not qualify the different uses and purposes for
the information. In a way, we also propose a simpler definition, which does not aim at a classification
of the different types of information (structural, semantic, other, and PDI plus sub-categories), while
we focus on the user and the use of the information from the creation context, and eventually from
different communities.

⁴ "Representation information: The information that maps a Data Object into more meaningful concepts. An example of Representation Information for a bit sequence which is a FITS file might consist of the FITS standard which defines the format plus a dictionary which defines the meaning in the file of keywords which are not part of the standard. Another example is JPEG software which is used to render a JPEG file; rendering the JPEG file as bits is not very meaningful to humans but the software, which embodies an understanding of the JPEG standard, maps the bits into pixels which can then be rendered as an image for human viewing."
⁵ "Other representation information: Representation Information which cannot easily be classified as Semantic or Structural. For example software, algorithms, encryption, written instructions and many other things may be needed to understand the Content Data Object, all of which therefore would be, by definition, Representation Information, yet would not obviously be either Structure or Semantics. Information defining how the Structure and the Semantic Information relate to each other, or software needed to process a database file would also be regarded as Other Representation Information."
PREMIS builds on the OAIS reference model and defines a core set of metadata semantic units that
are necessary for preserving DOs. The set is a restricted subset of all the potential metadata and only
consists of metadata common to all types of DOs. The currently published set (PREMIS 2) defines a
data model consisting of four entities (see Figure 6): the Object entity allows information about the
DO environment to be recorded amongst other information. The Rights entity covers the information
on rights and permissions for the DO. The Events entity covers actions that alter the object whilst in
the repository. The Agents covers the people, organizations or software services relevant that may
have roles in the series of events that alter the DO or in the rights statements. The Intellectual Entity
allows a collection of digital objects to be treated as a single unit.
Dependency relationships are defined in PREMIS 2 as “when one object requires another to support
its function, delivery, or coherence of content. An object may require a font, style sheet, DTD,
schema, or other file that is not formally part of the object itself but is necessary to render it.”
Prompted by feedback from its user groups, who found the existing support difficult to use, the PREMIS working group investigated the environment information metadata. The group's findings, reported in [27], recommend promoting environment information to a first-class entity, rather than a subordinate element of the DO, for the next version of PREMIS (PREMIS 3).
They advocate the use of the Object entity to describe the environment, which allows relationships
between different environment entities, as illustrated in Figure 7. This approach neatly supports the
PERICLES view of the environment although PERICLES makes a distinction between the general
environment and the environment significant for a particular set of purposes (termed the Significant
Environment Information for a DO), which is described in the following chapter.
Figure 7. From [27], proposed changes for the PREMIS 3 standard to make environment a first class object (light
grey).
We consider the environment information for a DO to be the widest set of entities that is related to
it. This would include by definition all other DOs, information, services and other information that
can relate to the DO, but also other information from the environment that is useful for any of its
possible uses.
We consider this a wider set than, though related to, the one described in OAIS as Representation Information, and we take a different focus from PREMIS 3, as we are not focused on software environments. Another important distinction is that, in general, we define the environment from the DO upwards. This is a different point of view from the one taken in WP5, which takes an ecosystem perspective,
thus looking from the system and institution point of view downwards to the different entities,
services, processes, users and digital objects.
Furthermore, we consider that part of the environment information will only be observable in the live environment of creation and use, which is what drives our choice of a sheer curation approach (described in the next chapter). When looking at the DO environment, we regard the user as an important part of it, and therefore we observe the interaction between users and their communities, the DOs, and the rest of the environment. We think that this perspective will allow us to capture information based on the pragmatic, sometimes neglected aspects of the real requirements for making use of DOs. It will also help us to infer dependencies that are not explicit and to determine relevant information based on the real use of DOs.
As environment information will be a very wide set of information, it will be important to qualify which information is significant and which is not, as we introduce in the next chapter.
4 Significant Environment Information
Based on the definition of environment information in the previous section, we now propose our definition of Significant Environment Information (SEI). To this end we introduce the term “dependency”, needed to express usage constraints for the DOs that exist in the environment under consideration. Furthermore, we propose methods to measure SEI in the context of sheer curation, and investigate SEI for usage scenarios related to our use case partners. These findings will be presented as a full paper at iPRES 2014 [28].
4.1 Definition of dependency and other entities in
PERICLES
In PERICLES, dependency is a concept shared by the three RTD work packages (WP3-5), together with other entities that can appear in a DO environment and, in WP4 and WP5, in the general ecosystem. Although many of these entities are common and shared, it is useful to distinguish their uses, based on the point of view taken by each work package.
In this deliverable, the point of view taken is that of observing a DO and its environment, aiming to support its use and reuse and to support DO appraisal, while making very few assumptions about the whole ecosystem (the institutional or system aspects of the environment). This allows us to focus on properties that are independent of the system aspects and do not rely on the existence of the system or institution. For this reason this deliverable looks at the less structured aspects of the information and its processes. In WP5, the view taken is closer to the institution, and more attention is paid to the structure of the ecosystem, its processes and its other entities.
Looking in particular at dependency, we propose a definition with a different scope from the one introduced in D5.1. Here we take the perspective of the use made of DOs, to express what information is needed when a user wants to make a specific use of a DO. By looking at what uses are possible for a DO, and what their prerequisites are, we aim to discover the dependencies of a particular object when considering a particular use or purpose.
As this deliverable, D3.2 and D5.1 are being developed concurrently, we refer here to the most recent drafts of those deliverables available at the time of writing.
The focus of the definition currently in D5.1 is on assessing the impact of change; the definition proposed there is the following:
“There are the objects A and B. An object A has a dependency on B when some changes in B have a
significant impact on the state of A. And there is also a dependency if changes in B can impact the
ability to perform function X on A.”
In the current incarnation of D3.2, dependencies are defined as links between entities (based on the
PROV-O ontology definition) with an intention, which is described by a plan. A plan can have
“preconditions (when is it required to trigger the propagation of a change?) and impact (how
depending resources will be impacted)”.
The definition of dependency in D3.2 is focused on the change aspects of dependencies, as is the case in the WP5 definition. We plan to investigate, in later work, the possibility of a single, common definition covering all the aspects of dependency used in the project; for now, however, we considered it important to make progress with the properties that express the need for other digital objects, which fit this deliverable's objectives better. It is important to note that the definition per se is more conceptual than practical, and does not preclude us from making use of the LRM model from WP3.
In section 4.6 we present our early work on extending the LRM model to support the description of SEI.
Dependency definition
We define dependency as a directed relationship from entity A (source) to entity B (target) that expresses the necessity/prerequisite/precondition of B in order to make use of A.
In other words, we view a dependency as: in order to use the source entity A, you need the target entity B.
In this definition, a dependency is always qualified by a use: one can assume there can be more than one use of an entity. This assumes that users will naturally have an objective, a purpose of use, when accessing a resource. As we will see later in this deliverable, although we cannot know in advance the future uses of an entity, we consider that dependencies covering different user communities will likely cover a broad set of uses, and that monitoring the change in use by user communities will be the best option for addressing future uses. Even if some dependent entities can be replaced by other entities in an environment, for example by using another word processor to edit a text fragment, our focus is on the dependencies currently established in the environment. A disjunctive dependency expresses the need for one entity from a set of entities (entity A OR entity B, etc.); in this case the dependency holds between the entity and another entity representing the set of alternative entities.
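As an illustrative sketch only (the class and attribute names below are our own, hypothetical choices, not project code), a purpose-qualified dependency and a disjunctive alternative set could be represented as follows:

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Entity:
        """Any entity in the DO environment: a digital object, service, etc."""
        identifier: str

    @dataclass(frozen=True)
    class AlternativeSet(Entity):
        """An entity standing for a set of interchangeable entities
        (entity A OR entity B, etc.), as in a disjunctive dependency."""
        alternatives: frozenset = field(default_factory=frozenset)

    @dataclass(frozen=True)
    class Dependency:
        """Directed relationship: to use `source` for `purpose`, you need `target`."""
        source: Entity
        target: Entity   # may be an AlternativeSet (disjunctive dependency)
        purpose: str     # the use that qualifies the dependency

    # Example: to make sense of 'anomaly.doc' you also need 'errorcodes.txt'.
    dep = Dependency(source=Entity("anomaly.doc"),
                     target=Entity("errorcodes.txt"),
                     purpose="interpret error references")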
Dependencies at the type and instance level
We have defined dependencies as connecting different entities, but distinctions need to be made based on the types of the entities involved:
Dependencies from or to:
● Instance level:
○ Software or services: these link an entity to a service, a process or a software component that is necessary (or useful) in order to make use of some data; for example, an application that makes use of a web service. This can be considered a form of dynamic data.
○ Static data (single digital objects and their dependencies): to make sense of the file ‘anomaly.doc’ you need to also have access to ‘errorcodes.txt’ and ‘whereisthat.csv’.
● Type level:
○ Software or services: to use data in format X you usually also need application Y or service Z.
○ Static data: to interpret this type of data you usually need this other type of data.
We think that the distinction between type-level and instance-level dependencies will be particularly relevant for later work on dependencies, and that it has to be made explicit, as SEI will likely involve both kinds of entities. Future work may address the derivation of type-level dependencies from the observation of instance-level dependencies over time, as sketched below.
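As a hedged illustration of that last point, the following sketch (invented data and names) aggregates observed instance-level dependencies into candidate type-level dependencies by counting how often pairs of types co-occur:

    from collections import Counter

    # Observed instance-level dependencies:
    # (source instance, source type, target instance, target type)
    observations = [
        ("report.doc", "application/msword", "Word 97",  "word-processor"),
        ("notes.doc",  "application/msword", "Word 97",  "word-processor"),
        ("img.tif",    "image/tiff",         "GIMP 2.8", "image-editor"),
    ]

    # Count how often each (source type, target type) pair is observed.
    pair_counts = Counter((src_type, tgt_type)
                          for _, src_type, _, tgt_type in observations)

    # Promote pairs seen at least `threshold` times to type-level candidates.
    threshold = 2
    type_dependencies = [pair for pair, n in pair_counts.items() if n >= threshold]
    print(type_dependencies)  # [('application/msword', 'word-processor')]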
It is of course possible to further classify dependencies along a number of facets, such as re-playable versus non-re-playable; time-dependent versus time-independent; dependencies on different aspects, from the technical environment as well as from the human or not strictly system-bound environment; semantic dependencies; and so on.
For a detailed classification of dependencies, we refer to the work carried out in the scope of D5.1, which includes the definition of a number of entities in the ecosystem. For a more detailed state of the art on dependency and its definitions, we refer again to deliverable D3.2.
DOs and other entities
We again refer to WP5 for a more detailed definition of the ecosystem entities, and for the definition
of Digital Object.
The focus in this deliverable is on the DO and service entities and their types; we consider that a DO can represent both data proper and a description of, or reference to, existing services or software. When talking about dependencies between DOs, we also allow references to subparts of a DO and to DO types, such as some specific information contained in a DO's metadata, or the type of that information. This allows us to describe dependencies concisely without going into the details of DO identifiers and subpart or type identifiers.
4.2 Definition of Significant Environment Information
Based on the broad definition of environment in the previous chapter, and that of dependency, we
define:
Environment information for a source DO is the set of dependencies from the source DO, together with their target DOs, for any type of use. Depending on the situation, the target of a dependency can be an existing DO, a new DO representing some extracted information, a reference to such objects, or a type of object, and should be considered part of the environment information.
We further define purpose (or intended use) as one specific use or activity applied to the source DO by a given user community. It is possible to imagine a hierarchy of purposes, where a higher-level purpose (for example, ‘render with faithful appearance’) leads to a set of more detailed purposes (such as ‘accurate colour reproduction’, ‘accurate font reproduction’, etc.).
We further define significance weight, with respect to a purpose, as a value expressing the importance of a given environment information dependency for that particular purpose. The significance weight is a property of each dependency between the source DO and the DO expressing the environment information for a specific purpose. We explore methods and models for significance weights in sections 4.3 and 4.6.
Finally, we define ‘Significant Environment Information’ (SEI) for a source DO, with respect to one or more given purposes, as the set of environment information qualified with significance weights. This includes both the dependency relationships (with purposes and weights) and the information that is the target of each dependency.
Less formally, what we are aiming to determine is “more or less all you need to have” when interacting with a DO for a specific purpose, and the relative significance of each of these information units (dependencies).
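A minimal sketch of this structure, under the assumption of a single numerical weight per (dependency, purpose) pair and with hypothetical identifiers, might be:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SEIDependency:
        """One unit of Significant Environment Information: a dependency,
        the purpose that qualifies it, and its significance weight."""
        source: str    # identifier of the source DO
        target: str    # identifier of the target DO (or DO type)
        purpose: str   # the intended use this dependency supports
        weight: float  # significance in [0, 1] for that purpose

        def __post_init__(self):
            if not 0.0 <= self.weight <= 1.0:
                raise ValueError("significance weight must lie in [0, 1]")

    sei = SEIDependency(source="calibrated-data", target="raw-data",
                        purpose="recalibrate", weight=1.0)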
Once SEI is determined for a collection of DOs, the different dependencies form a graph structure, as illustrated in Figure 8, in which DOs in the collection can have relationships with each other (when one DO in the collection depends on another DO in the collection for a specific purpose). This graph of significant information from the environment can serve as the basis for appraising the set of DOs that should be maintained, together with the relationships in the SEI (for example, by applying a simple threshold to the significance weights, in order to support the future use of the DO). In this graph we can imagine that weights will be assigned both to the data and to the SEI. The combination of the DO weights and their propagation through the dependency weights
should allow an optimal set of DOs for appraisal to be determined. This will be explored in future deliverables of this project, in particular within T5.5.
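A minimal sketch of such threshold-based appraisal over a dependency graph, using the networkx library with invented example data (an illustration of the idea, not the project's appraisal method), follows:

    import networkx as nx

    # Directed dependency graph: edge u -> v means "to use u you need v",
    # with a significance weight per edge for one fixed purpose.
    g = nx.DiGraph()
    g.add_edge("anomaly-catalogue", "telemetry-archive", weight=0.9)
    g.add_edge("anomaly-catalogue", "ops-manual",        weight=0.6)
    g.add_edge("ops-manual",        "glossary",          weight=0.3)

    def appraise(graph, roots, threshold):
        """Collect all DOs reachable from `roots` through dependencies
        whose significance weight meets the threshold."""
        keep, frontier = set(roots), list(roots)
        while frontier:
            node = frontier.pop()
            for _, target, data in graph.out_edges(node, data=True):
                if data["weight"] >= threshold and target not in keep:
                    keep.add(target)
                    frontier.append(target)
        return keep

    print(appraise(g, ["anomaly-catalogue"], threshold=0.5))
    # {'anomaly-catalogue', 'telemetry-archive', 'ops-manual'}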
Comparing this definition with existing definitions of significant properties (SP) of the environment, for example the one described in [1], we note that ours is aimed at collecting SEI for a DO in order to support the different purposes a user can have with respect to it, while the latter defines the significant properties of an environment in itself. The information we aim to collect consists of qualified relationships to other DOs, as opposed to properties of the environment.
Figure 8. A possible example of a dependency graph.
We can observe that SEI is a specialisation of the dependency definition, taking into account the
purpose and the weight of the dependency. This means that SEI is a specialised subset of the
dependencies for a DO.
As mentioned before, the perspective we take is that of observing the current use of a DO in its use environment, often before it enters a digital preservation system. This allows us to better determine which dependencies, and which targets of the dependency relationships (be they information or services), are significant for the uses of the DO.
We consider that knowing the significant information necessary to support current uses will allow us to cover, or at least to know more precisely, long-term needs as well, provided we try to support different user communities. Different user communities have different purposes and requirements, so this is a good approximation of the needs of a future community (which we cannot know in advance anyway).
4.3 Measuring significance
Currently, we are focusing on collecting a wide array of environment information, based on the dependencies this information can have on the DO and its estimated relevance. We are also trying to infer object dependencies that have an implied significance by looking at use data, as described later. Still, a very relevant part of what we need is to measure the significance of the collected data. Although we do not yet have experimental results for this part, we have clear ideas on how to define and collect it: it is answering the question ‘what for – for what purpose?’ that should help us define what is significant.
Collections of data often have more than one use. Determining what information is significant
depends on the use of the data. For example, the calibration of the solar measurement instrument
will require calibration data, which may be a subset of the complete collection of data, as well as
applications necessary to read and analyse the calibration data. For a given collection, not all of its
environment information may be necessary for every potential use. To represent this, we propose
assigning weights to each relation between the collection DOs and their environment information.
Weights take a value between 0 and 1 inclusive, where a weight of 1 indicates that the information is essential for all intended uses of the data. Monitoring access to the information, as well as regularly reviewing the information required for each use, would provide the opportunity to update the weights and could also accommodate new uses of the data.
Weights could be determined by direct collection, that is, by asking users to provide such weights together with their current purpose of use, once the dependencies have been determined. Another possibility is to observe the frequency of data use to determine the significance weight, for example by observing how often a particular object is used in conjunction with another, in cases where such a scenario applies (when usage data can be collected across multiple users). For an individual DO these weights could include factors based on the cost of collecting the information (e.g. a subscription fee may be required to access the information contained in the resource). A threshold defined by the user community, archive and content providers would determine which pieces of information make up the SEI. For example, in the case of a subscription fee for accessing information contained in a third-party repository, one could define the weight as (1 - cost/budget), where the budget could be the total funds allocated to the archival of this DO. The user community, content providers and archive will need to determine how the weight is defined; relevant cost models are being investigated in current FP7 projects such as the 4C project6 and the SCIDIP-ES project [29].
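As a worked sketch of the cost-based weight above (with invented numbers), the value can be computed and clamped to the valid [0, 1] range as follows:

    def cost_based_weight(cost: float, budget: float) -> float:
        """Weight = 1 - cost/budget, clamped to [0, 1]. A cost of 0 gives
        weight 1 (no barrier to collecting the information); a cost at or
        above the budget gives weight 0."""
        if budget <= 0:
            raise ValueError("budget must be positive")
        return max(0.0, min(1.0, 1.0 - cost / budget))

    # E.g. a 200 EUR subscription fee against a 1000 EUR archival budget:
    print(cost_based_weight(200, 1000))  # 0.8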
Particular patterns in data usage could also help determine the current user activity, and from there automate the inference of the dependency's purpose of use.
Other factors can also be taken into consideration when calculating the weight, to express value, such as the cost in time and money of collecting the information, and whether the information is proprietary (which may limit its accessibility); there may also be licensing and privacy constraints that restrict where the data can be accessed from. Any factor that influences use of the information may contribute to its weight.
Significance is useful from the long-term preservation perspective, for example to support critical analysis of science data, as it represents the stakeholders' point of view and the importance of the information to them. It can also provide a key to understanding the information and the relationships between its different parts.
4.4 SEI in the digital object lifecycle
In recent years, there have been various efforts within the digital curation community to establish new methods of carrying out curation activities from the outset of the digital lifecycle. A major constraint that militates against this is that data creators (such as researchers) typically have time only to meet their own short-term goals, and, even when willing, may have insufficient resources, whether in terms of time, expertise or infrastructure, to spend on making their datasets preservable, or reusable by others (e.g. [30]). Moreover, the sheer volume of potentially useful information can preclude this as a practical approach, and in any case the researcher may be unaware of the utility, or even the existence, of much of this information.
6 http://www.4cproject.eu/
One approach to this challenge has been termed sheer curation (by Alistair Miles of the Science and
Technology Facilities Council, UK), and describes a situation in which curation activities are integrated
into the workflow of the researchers creating or capturing data. The word ‘sheer’ here is used to
describe the ‘lightweight and virtually transparent' way in which these curation activities are
integrated, with minimal disruption (see [31]).
Sheer curation is based on the principle that effective data management at the point of creation and initial use lays a firm foundation for subsequent data publication, sharing, reuse, curation and preservation activities; however, the sheer curation model has not been extensively discussed in the scientific literature. The term has sometimes been interpreted as motivating the performance of curatorial tasks by data creators and initial users, by promoting the use of tools and good practice that add immediate value to the data. This is, in particular, the take of [32], which discusses the role of such an approach in the distributed, community-based curation of business data.
However, this interpretation does not really address the challenges outlined above, and a more
common understanding of sheer curation depends on data capture being embedded within the data
creators’ working practices in such a way that it is automatic and invisible to them. For example, the
SCARP project [33], during which the term ‘sheer curation’ was coined, carried out a number of case
studies in which digital curators engaged with researchers in a range of disciplines, with the aim of
improving data curation through a close understanding of the researchers’ practice [34][35].
In [36] the concept of sheer curation is extended further to take account of process and provenance
as well as the data itself. The work examined a number of use cases in which scientists processed
data through various stages using different tools in turn; however, as this processing was not carried
out in any formally controlled way (e.g. by a workflow management system), it would have been
impossible for a generic preservation environment to understand the significance of the various
digital objects produced from the information available, as the story of the experiment was
represented implicitly in a variety of opaque sources of information, such as the location of files in
the directory hierarchy, metadata embedded in binary files, filenames, and log files. This was
addressed by capturing information about changes on the file system as these changes occurred,
when a variety of contextual information was still available, and the provenance graph was
constructed from this dynamically using software that embedded the knowledge and expertise of the
scientists.
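To make the capture step concrete, a minimal sketch of such file-system observation, using the Python watchdog library with an entirely hypothetical path (this is not the tooling used in [36]), could look like this:

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    class ProvenanceRecorder(FileSystemEventHandler):
        """Record file-system changes while creation context is available."""
        def __init__(self):
            self.events = []  # in a real system: persisted provenance statements

        def on_any_event(self, event):
            # Capture what happened, to what, and when; later analysis can
            # reconstruct a provenance graph from these observations.
            self.events.append((time.time(), event.event_type, event.src_path))

    recorder = ProvenanceRecorder()
    observer = Observer()
    observer.schedule(recorder, path="/data/experiment", recursive=True)
    observer.start()
    try:
        time.sleep(60)  # observe the working directory for a minute
    finally:
        observer.stop()
        observer.join()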
The BW-e Labs project [37][38] is an example of sheer curation and of the collection of context information in laboratory environments. The project stores context metadata captured at the laboratory equipment during experiments, together with the experiment measurements, to improve reuse and collaboration between scientists.
The most effective way to capture SEI is through observation in the environment where the object is created and used. We look at the interaction between the DO, the environment and the user, including its time dimension. This allows us to infer dependencies that are not explicit and to determine information relevant to the use and reuse of the DO. In terms of the DCC lifecycle [42], we are addressing the ‘create’ and ‘use and reuse’ phases, examining the creation and use/reuse contexts and trying to extract SEI from them.
There is a close analogy between what we term sheer curation and modern models used in the
records management community. In the traditional approach towards record-keeping – the so-called
‘life cycle model’ – archivists are only involved subsequent to the period of active use of a record
within an organisation, when an object is transferred to a formal archive or otherwise disposed of.
Partly in response to the move towards digital rather than paper records, record-keeping practice has
increasingly adopted the ‘Records Continuum’ model, in which a record is regarded as existing in a
continuum rather than passing through a series of fixed life-cycle stages, and archival practices are
involved throughout, from the time the record is first created [85]. In this way, contextual
information available at the time the record is created and during its period of active use may be
captured and subsequently exploited to support archiving, in much the same way as the metadata
captured during data creation and reuse in sheer curation models. This metadata may be thought of
as constituting records that document (e.g.) a scientific experiment, thus forming a ‘metadata
continuum’.
4.4.1 Post hoc curation and digital forensics
Sheer curation may be contrasted with post hoc curation, which takes place only after the period
during which the digital objects are created and primarily used. When a DO is ingested into an
archive or repository, it is certainly possible to extract metadata from it – indeed, many tools already
exist to analyse content on ingest – however, much of the contextual information that was available
in the environments in which the DO was created and used may no longer be available at ingest, or
may be represented only implicitly. This may result in a reduced ability to understand the semantics
of the DO, or to reuse it effectively.
Digital forensics is one particular approach to such post hoc curation, and involves the recovery and
analysis of data from digital devices. It began in law enforcement as a way of investigating computer
crime, but since then has been applied to a variety of related areas, such as investigating internal
corporate issues, and also to digital preservation.
At the heart of digital forensics is the recognition that archives exist as physical (the storage media),
logical (digital artifact), and conceptual (as recognized by a person) objects [39].
Physical storage media can degrade over time, hence the need to preserve the logical object (data) stored on the media. However, physical media can also provide additional information relating to the context of the data's creation. For example, the data might be the final draft of a document, but the media may also contain deleted files relating to previous drafts; it can also give information about the environment in which the document was created, such as the operating system.
Although digital forensic techniques were originally designed for application to computer hard drives, they can similarly be applied to any form of physical storage media, such as optical disks (e.g. compact discs or DVDs), USB storage devices (e.g. flash drives), digital cameras, iPods or mobile phones.
Key to digital forensics is the creation of a disk image: a bit-for-bit copy of the whole storage medium, identical to the original. As such, a disk image is different from a logical copy of the storage medium, which may contain copies of its files and folders but is not identical at the bit level [40].
Creating a disk image means that the image can be viewed, analysed and tested in exactly the same way as the original storage medium, without interfering with or modifying the original. In some cases even accessing the storage medium will modify it (for example, through the indexing process performed by Mac OS X); for this reason digital forensics practitioners recommend the use of a ‘write blocker’ when accessing original storage media (e.g. when creating a disk image).
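Bit-level identity between an image and the original medium is conventionally demonstrated by comparing cryptographic hashes. A minimal sketch (hypothetical paths; real forensic workflows use dedicated imaging tools and write blockers) is:

    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        """Stream a (potentially very large) file through SHA-256."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # The image is a faithful bit-for-bit copy only if both hashes match.
    # (On a real system the source would be a raw device such as /dev/sdb,
    # accessed through a write blocker.)
    assert sha256_of("/evidence/disk.img") == sha256_of("/dev/sdb")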
The disk image can then be mounted in order to view the constituent files and directories (as could
be done with the original storage device), or it can be analysed using specialist digital forensics tools
(to identify deleted files and environmental information relating to the creation of the data).
Both the creation of disk images and the use of digital forensic techniques raise policy issues for preservation. First, there is the issue of storing the potentially large disk images, which may be many gigabytes or even terabytes in size. Second, there is the issue raised by the potentially personal,
identifying or confidential information recovered from the storage media, some of which may be
from previously deleted files. In this case, preserving institutions need to have policies relating to
both the collection of such information from donors, and also to restrict access to, or redact,
confidential information [41]. Digital forensics approaches will be addressed in more detail at a later
stage of the project.
4.5 SEI in the PERICLES case studies
In the following sections we illustrate the concept of SEI with examples from the Digital Art and
Space Science domains (the main areas of interest for the PERICLES stakeholders). Practical
applications of these examples can be found in the experimental results section.
4.5.1 Software Based Artworks
The following use case example illustrates the SEI investigation, inspired by the software-based art scenario from the Tate gallery. In this example a Software Based Artwork (SBA) is to be migrated to a new computer system for the purpose of an exhibition. The software component of the SBA creates a strong dependency on the computer system environment. A description of SBAs and an extensive study of their SPs can be found in [17] and [18].
We assume there is a computer system with a validated SBA installation, which should be preserved so that the computer system environment can be configured and emulated as closely as possible for future exhibitions. Preserving only the SBA as a DO cannot solve the problem, as the original appearance and behaviour of the software cannot be reconstructed from the metadata belonging to the DO alone. The context of executing the SBA's software for the exhibition installation includes, for example, further dependencies such as external libraries and applications, and data dependencies (data used at run time by the SBA). However, we have to look further into the whole environment to identify all the information that could be important for this scenario, such as context-external running processes that can affect the availability of resources, or external network dependencies. The determination, extraction and preservation of SEI are essential to enable a faithful future emulation of the original system. An investigation of the environment information that influences the SPs of the DO helps to identify the SEI for this use case. This is illustrated in Figure 9.
Figure 9. SEI influences on SP of DO.
An example of SEI influencing the SPs is software whose execution speed changes with the available system resources, since, depending on the programming style, program procedures can adapt their execution speed to the resources available. This turns information about system resources
into SEI for the purpose of “maintaining execution speed”. Information about display settings, such as colour profile and resolution, the fonts used, and the graphics card and its driver, is SEI that can affect the SBA's appearance (the “render DO with faithful appearance” purpose). Changes to programming-language-related software can result in execution bugs or different execution speeds. Peripheral drivers or settings, or response times that depend on the execution speed, can affect the user's interaction experience with the SBA.
In order to determine the SEI, each SBA has to be analysed individually, with regard to the use purpose and based on the properties of the artwork and the artist's views on the SPs of the artwork. Typical SEI for emulating the environment of an SBA includes information about computer system specifications, available resources, required resources, installed software and software dependencies. Other relevant dependencies to capture are, for example, all the files used during the SBA's execution, and peripheral dependencies, which can be identified by analysing the peripheral calls of the SBA. System resource requirements can be estimated on the basis of resource usage, but hardware changes carry the risk of affecting the software's behaviour and should be avoided.
Another SEI purpose, with a different set of significance weights, arises if the SBA has to be recompiled because of a migration to another platform or to fix malfunctions. Here the SBA's behaviour has to be validated by comparison with behaviour patterns measured continuously on the original system in a sheer curation setting. Examples of such measurements are processing timings, log outputs, operating system calls, calls to libraries and dependent external software, peripheral calls and commands, resource usage, user interaction, and video and audio recordings. The latter two can also be used for validating the appearance of the artwork. If the SBA has a component of randomness, it is more difficult to evaluate its behaviour based on the measured patterns. Furthermore, information about the original development environment can be useful for a recompilation, and for identifying the source of a malfunction.
Consequently, there is a great amount of accessible and potentially useful environment information.
The decision of what to extract is often complicated by the fact that it is hard to foresee all further
uses. Therefore, the whole environment (instead of only a current context) has to be considered, in
order to identify the significant information to be extracted and preserved.
With the PERICLES Extraction Tool, which will be described in chapter 7, we address ways to extract
SEI from the environment in an automated fashion. In some cases the SEI, though significant, lies
outside of the scope of the information that can be captured automatically by our tool, and will
require different approaches. One such example can be found in [17] (our emphasis): “Many
software-based artworks depend on external links or dependencies. This might include links to
websites, data sets, live events, sculptural components, or interfaces.”
4.5.2 Space science scenario
As one of the two main use cases, the PERICLES project is considering capturing and preserving
information relating to measurements of the solar spectrum being carried out by the SOLAR payload
[43] of the International Space Station. The information includes operational data concerning the
planning and execution of experiments, engineering documentation relating to the payload and
ground systems, calibration data, and scientific measurements of experiments performed by solar
scientists. The ultimate aim of SOLAR is to produce a fully calibrated set of solar observations,
together with appropriate metadata.
We now consider three examples to illustrate the capture and use of SEI. In order to validate the
experimental observations of the SOLAR instrument, it is necessary to understand the impact of
many complex extraneous factors on the instrument. For example, vehicles visiting the ISS can affect
the trajectory of the ISS itself and cause pollution and temperature changes. Such effects are often
only uncovered by a long-term analysis of the data by the scientists. Hence, there is a need to
capture as much of the environment as possible at the time the observations are made to enable
such analysis. This includes capturing a wide range of complex environment information relating to
the instrument, the operational data, the payload sensors and events on the ISS itself. In this case,
the purpose of SEI is to enable critical analysis of the solar observations by the scientists. The
significance weights reflect the influence the DOs captured have on the critical analysis task. These
weights can change over time as additional environmental factors may be uncovered that have an
impact on the scientific data. The SEI (at a given time) will therefore reflect the DOs that are relevant
to critical analysis with an appropriate weighting.
In order to validate the solar measurements made by the SOLAR instrument, frequent comparisons
are made with data collected independently by other scientific teams. Often the techniques and
instruments are different, which provides a good way to ensure the results are not subject to
unwanted effects caused by the experimental methods used. The data from other teams and the
comparisons that have been made are a valuable part of the environment metadata for the SOLAR
data. The PERICLES Extraction Tool can capture the validation experiments themselves, and appropriate metadata will be created. This would include validation scripts and dependencies between subsets of the data, and would constitute (part of) the environment information. The purpose associated with the SEI is the validation of the scientific data by the science
community. The significance weights reflect the value of specific data objects in the validation of the
SOLAR dataset. The SEI can assist scientists in assessing the quality and reliability of the data
produced.
A third example from the PERICLES science case study concerns the operational data for the SOLAR experiment, which is primarily created, managed and used by the mission operators, who operate the experiments on the ISS remotely from the ground station. The operations data includes the planning, telemetry and operations logs. Given the huge complexity and volume of the space mission
information, a major issue for the operators is information overload. An important task for the
operators is to resolve anomalies, which occur when the normal operational parameters of the
instrument are exceeded, such as overheating. Identifying and resolving anomalies often requires
extensive research in the archived operations data and documentation. The problem the operators often face is not that the information is unavailable, but that there is so much documentation that it is difficult and time-consuming to locate the information relating to a specific problem (i.e. further contextual information). Only specific parts of the documentation are relevant, but this has to be learnt from experience, as this type of information is not formally documented.
In this case, the digital object to be preserved is the catalogue of known anomalies and the
environment information is the aggregation of all operations data. The purpose for the SEI is the
identification of a specific anomaly. In this case, the significance weights indicate the relevance of a
specific DO, such as a piece of documentation for the instrument or an excerpt from the archived
telemetry to the particular anomaly. Thus the SEI provides a way to indicate all the environment
information relevant to identifying and debugging a specific anomaly.
4.6 Modelling SEI with an LRM ontology extension
Defining what factors make up the weights is a task for the creators, users and managers of the data
who will be able to use past experience and other criteria to quantify what to collect. The SEI is
defined by the following triple:
● The dependency relationship between the source and target DOs,
● The weight of the dependency, and
● The purpose that defines the dependency.
We consider that weights could be the result of a number of factors, combined by a formula that computes a final significance weight; for example, when computing the value of a dependency one may want to take into account both the cost of maintaining it and its benefit. Weights could also express the significance of single DOs or collections of them, so the concept of weight can be applied to dependencies as well as to single DOs or entities. It is furthermore conceivable to assign SEI to classes of DOs, instead of single instances, where a class could represent a particular type of DO or follow any kind of classification. In this case, SEI could express, for example, the fact that solving anomalies requires the use of mission records and technical manuals (this could be based on statistical data from single instances).
In addition, the SEI may change over time due to external factors (such as a requirement from a funding agency to include a copyright licence, or a need to include the software to access the DO). This introduces the need for a temporal element, which can be represented by versioning the SEI and including a timestamp recording when the SEI was defined.
Since changes in weights over time are likely to be a useful indicator of change in user community interest, keeping track of these changes is important. This would also allow us to address changes in policies that are reflected in changes to how the composite significance weight is calculated. By versioning the SEI and its weights, we would be able to address these use cases. As early work in this direction, in this section we give some simple examples of SEI, with attention to weights, that should be modelled, and a preliminary approach to modelling it based on the LRM.
4.6.1 Examples in support of weight and SEI modelling
Dependency weights are not the same for all use purposes of a digital object. To substantiate the hypothesis that it makes sense to recalculate them for each purpose, we present three short examples.
Example 1
In many cases building an application can require different libraries for different purposes. For example, a compiled application can be built in ‘optimised mode’ or in ‘debug mode’, and in each case the compilation may require different libraries to serve the different uses. Applications built to work on different platforms may likewise require different libraries in order to function (e.g. when using the SQLite database library, the Windows build is necessary for Windows operating systems, whereas the Linux library is required for Linux). There may also be libraries that add specific functionality to the application that is only required for a subset of uses (for example, the ROOT Data Analysis Framework can be built with or without many third-party libraries such as Python, the GNU Scientific Library, etc.). These different configurations are usually encoded in the build programs, which allow the user to select the configuration depending on the use.
Example 2
In this example the digital object under consideration is an image that is viewed and manipulated in a given environment.
We consider two purposes:
1. Viewing the image.
2. Manipulating the image.
Several image viewing tools are installed in the environment, along with one image manipulation program, GIMP. Looking at the image's dependency on GIMP, a change of weighting based on the purpose can be observed. The dependency has a very low weight for the purpose of viewing the
image, as alternative tools are installed that probably fulfil that purpose better. For the purpose of manipulating the image, for example with a GIMP-specific filter, the weight becomes very high, as there is no alternative to using GIMP.
Conversely, the dependency on GIMP for the purpose of viewing the image could be very high if no other suitable tool is installed on the system. Upon the installation of another image viewing tool, the weight becomes very low, as the dependency is now replaceable.
Example 3
With reference to the scenario of calibrating space data (described briefly in section 4.5.2), and more generally when dealing with sensor data, raw data coming from instruments has to be calibrated, according to a specific calibration formula and using calibration parameters, in order to generate calibrated data that can then be used and processed directly. In this simplified example, when a user has the purpose of re-calibrating the calibrated data, the dependencies on the raw data, the calibration parameters and the formula are strong, with a significance weight of 1 (assuming a 0-1 scale for significance weights, the weight representing the data's significance for the purpose), indicating the absolute need for those data items for that specific purpose.
If the user wants to access the calibrated data with the purpose of simply reading it, there is no need to access the raw data, the calibration formula or the parameters. This is represented by dependencies with a significance weight of 0.
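Written out in the terms of section 4.2 (all identifiers are our own illustrations), this example amounts to:

    # Hypothetical rendering of example 3 as (source, target, purpose, weight).
    sei = [
        # Purpose "recalibrate": raw data, parameters and formula are essential.
        ("calibrated-data", "raw-data",               "recalibrate", 1.0),
        ("calibrated-data", "calibration-parameters", "recalibrate", 1.0),
        ("calibrated-data", "calibration-formula",    "recalibrate", 1.0),
        # Purpose "read": none of them are needed.
        ("calibrated-data", "raw-data",               "read",        0.0),
        ("calibrated-data", "calibration-parameters", "read",        0.0),
        ("calibrated-data", "calibration-formula",    "read",        0.0),
    ]

    def needed(sei, source, purpose, threshold=0.5):
        """Targets whose significance meets the threshold for this purpose."""
        return [t for s, t, p, w in sei
                if s == source and p == purpose and w >= threshold]

    print(needed(sei, "calibrated-data", "recalibrate"))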
4.6.2 Modelling SEI with LRM - A preliminary approach
The Linked Resources Model (LRM)7 is an operational middle-layer OWL ontology for modelling the
dependencies between digital resources handled by the PERICLES preservation ecosystems. This
subsection presents an initial approach to modelling SEI with the LRM. However, since representing
SEI requires appropriate constructs for representing weights and purposes, which are currently not
included in the core LRM, the latter was extended with the needed classes and properties. This is a
good example of cross-WP interplay, since this work depends on WP3 outcomes (the LRM), while, on
the other hand, the derived results will then be “fed into” WP3, possibly extending the LRM with
additional functionality. Note that, as the title implies, the work presented here is still at a
preliminary stage with plenty of room for improvements and extensions, most of which are already
well underway.
LRM Extensions
Since the LRM does not currently support representing weight-related concepts, the Weighting Ontology (WO) [44] was chosen as an appropriate ontology to import into the LRM. WO constitutes a model for formalising weight-related notions for multiple purposes. The steps for extending the LRM to represent SEI were as follows:
● Class wo:Weight8 was “attached” to pk:Digital-resource and pk:Dependency through property wo:weight.
● Class Purpose is introduced as a new class, attached to wo:Weight through the newly introduced property has-purpose. The reason for associating the purpose with the weight is that each weight corresponds to a different purpose of use for the DO or dependency.
● Purposes may consist of sub-purposes forming hierarchies. This is achieved via the transitive properties has-sub-purpose and has-super-purpose.
● Each weight has a numerical value in [0..1] and a scale.

7 The LRM is thoroughly presented in D3.2 (due M18).
8 The prefix in front of the class name denotes the namespace, i.e. the vocabulary the class is initially defined in. Namespace “wo” corresponds to WO, “pk” corresponds to the LRM, “prov” corresponds to the PROV-O ontology, while the absence of a namespace implies newly introduced constructs.

The following diagram (Figure 10) demonstrates the interplay between the key classes mentioned thus far (the remaining classes and properties are omitted for brevity):
Figure 10. Extension to the LRM for representing weights and purposes.
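To make the extension concrete, the following rdflib sketch builds the schema-level statements described above. The namespace URIs and the ext: identifiers are illustrative guesses, not the official PERICLES or WO vocabularies:

    from rdflib import Graph, Namespace, RDF, RDFS, OWL

    # Illustrative namespace URIs; the official LRM/WO URIs may differ.
    PK  = Namespace("http://example.org/pericles/lrm#")
    WO  = Namespace("http://example.org/wo#")
    EXT = Namespace("http://example.org/pericles/sei-ext#")

    g = Graph()
    g.bind("pk", PK)
    g.bind("wo", WO)
    g.bind("ext", EXT)

    # wo:Weight is attached to pk:Digital-resource (and pk:Dependency)
    # through wo:weight. (A faithful OWL model would declare the domain
    # as a union of the two classes; elided here for brevity.)
    g.add((WO.weight, RDF.type, OWL.ObjectProperty))
    g.add((WO.weight, RDFS.domain, PK["Digital-resource"]))
    g.add((WO.weight, RDFS.range, WO.Weight))

    # New class Purpose, attached to wo:Weight through has-purpose.
    g.add((EXT.Purpose, RDF.type, OWL.Class))
    g.add((EXT["has-purpose"], RDF.type, OWL.ObjectProperty))
    g.add((EXT["has-purpose"], RDFS.domain, WO.Weight))
    g.add((EXT["has-purpose"], RDFS.range, EXT.Purpose))

    # Sub-purpose hierarchies via transitive properties.
    g.add((EXT["has-sub-purpose"], RDF.type, OWL.TransitiveProperty))
    g.add((EXT["has-super-purpose"], RDF.type, OWL.TransitiveProperty))

    print(g.serialize(format="turtle"))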
Example
For demonstration purposes, we illustrate here the LRM-based representation of example #3 presented previously. According to the example, calibrated data depends on raw data and calibration parameters. This can be represented as a dependency in the “LRM sense” (see object “data-calibration” in Figure 11). The weight of the dependency depends on the purpose of use:
● Calibration parameters and raw data are not needed for reading (using) the calibrated data. Thus, when the purpose of use is “read” (object “purpose-1”), the associated weight (object “weight-1”) has a value of 0.0.
● On the other hand, recalibrating the calibrated data requires the existence of the calibration parameters and raw data. Therefore, when the purpose of use is “recalibrate” (object “purpose-2”), the associated weight (object “weight-2”) has a value of 1.0.
Figure 11. Calibration data dependency modelling.
The figure represents the example described above, where the weights represent
‘importance’ and the scale is 0-1, while the yellow text boxes indicate the numerical values
of weights in the specific case, as well as the literal values of purposes.
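Carrying the same sketch through to the instance level, example #3 could be asserted roughly as follows. The property wo:weight_value follows our reading of WO [44]; all URIs and object names remain illustrative:

    from rdflib import Graph, Literal, Namespace, RDF, XSD

    PK  = Namespace("http://example.org/pericles/lrm#")
    WO  = Namespace("http://example.org/wo#")
    EXT = Namespace("http://example.org/pericles/sei-ext#")
    EX  = Namespace("http://example.org/data/")

    g = Graph()
    g.add((EX["data-calibration"], RDF.type, PK.Dependency))

    def add_weight(g, weight_id, purpose_id, value):
        """Attach a purpose-qualified weight to the dependency."""
        w, p = EX[weight_id], EX[purpose_id]
        g.add((EX["data-calibration"], WO.weight, w))
        g.add((w, WO.weight_value, Literal(value, datatype=XSD.decimal)))
        g.add((w, EXT["has-purpose"], p))

    add_weight(g, "weight-1", "purpose-read",        0.0)  # reading: not needed
    add_weight(g, "weight-2", "purpose-recalibrate", 1.0)  # recalibrating: essential

    print(g.serialize(format="turtle"))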
5 Digital Ecosystem (DE) Metaphor and SEI
In this chapter we describe our exploration of the concept of the Digital Ecosystem (DE) and its relationship with SEI from a fundamentally different perspective, checking whether DE as a metaphor for the conceptual framework of DP survives scrutiny. To this end we postulated that a Biological Ecosystem (BE) and a DE are similar enough to share functionally equivalent key concepts, and that genetics is to BEs what DP is to DEs, since both safeguard their respective ecosystems by reconstructing them from inherited information. In the course of building our working glossary, this comparison between the two ecosystems yielded a list of concepts defined in sufficiently similar ways that they can be considered to function in both systems.
We suggest that the parallel applies beyond a simple metaphor and can be exploited to derive a
better understanding of digital preservation scenarios.
Our findings are briefly explained in the next section, and were submitted as an iPRES 2014 poster [45].
5.1 State of the art about digital ecosystems
Recently, in a morphological attempt to compare new phenomena with known ones, several authors have proposed using the concept of digital ecosystems [46, 47] or preservation ecosystems [48, 49, 50, 51]. To contribute to the development of DP theory, we compared biological ecosystems with digital ecosystems. We suggest that DP amounts to reproducing a digital object as a reuse case, in which the significant properties of the DO, in tandem with its SEI, reconstruct a copy of the functional original. Accordingly, SPs and their SEI are the digital counterparts of essential genetic information embedded in, and interacting with, the environment in BEs.
From this viewpoint we can look at DP in terms of functional genetic analysis, a method that has proved remarkably successful in molecular biology. Such a high-level look may reveal further commonalities between the genetic-biochemical and linguistic encodings of communicated content, from which better, perhaps more naturalistic, DE models might emerge.
Underlying our theoretical stance is the argument that a set of key DP concepts, conceptualised within a DP embedding, map onto equally key concepts in BEs rendered in terms of functional genetics. The comparison leads to a glossary of mutually mapped concepts, not discussed here, that are inspirational for testing new ideas in DP. Examples of these concepts include environment, niche, sub-niche, fitness, ecosystem, organism, species, sequential instructions at the gene and chromosome levels, DNA, evolution, mutation, behaviour, and several more.
Related research shows that many active research programs exploit equivalences between biological
objects and digital objects, up to and including, in the position taken by strong artificial life, the
assumption of indistinguishability. The latter follows from the recognition that life is not dependent
on any particular underlying medium, but is instead a property of information processing structures
[52]. Irrespective of abstract unification, computer science adopts biological concepts, exemplified as digital objects, in genetic algorithms [53], neural networks [54], ant trails [55], swarms [56] and artificial immune systems [57]. The application areas of this methodological perspective extend to literary science as a genomic project [58], and narrative genomics as a content sequencing approach [59, 60]; and, from a more ecological perspective, to digital ecosystems [46, 47], digital business ecosystems [61], digital preservation ecosystems [48, 49, 50] and digital forensics [62], apart from DP explicitly using DNA-based encoding of media content as one route to very long-term preservation [63, 64].
A framework incorporating biological concepts and terminology can result in confusion if the context is not well specified. DP is uniquely positioned in having to deal simultaneously with separate contexts arising out of the above subject areas, while at the same time conforming in its own right to the idea of an ecosystem containing digital objects that replicate, behave, consume resources, mutate, are selected, and evolve; this leads to a meta-view of biology-based informational concepts with DP at its centre. In this sense, we felt encouraged to consider DOs as “digital organisms” filling specific niches in a DE.
5.1.1 The DE metaphor in PERICLES research activities
Analytical research approaches to DE content in PERICLES use two knowledge representation paradigms: graphs, prominently underlying the LRM and Topic Maps, and matrices feeding into machine learning methods. The two are expected to enable lossless, i.e. relationship-preserving, ontology conversion to, and construction from, matrices. The key concept in both is the relations between objects, which in biology express an interaction.
5.1.2 Interaction
The study of biological ecosystems, for example in genetics, has benefited from the concept of interaction maps. Interaction in causal systems is tentatively defined as follows: “Let P be a process that, in the
absence of interactions with other processes would remain uniform with respect to a characteristic
Q, which it would manifest consistently over an interval that includes both of the space-time points A
and B (A − B). Then, a mark (consisting of a modification of Q into Q*), which has been introduced
into process P by means of a single local interaction at a point A, is transmitted to point B if [and only
if] P manifests the modification Q* at B and at all stages of the process between A and B without
additional interactions.” ([65], with more definitions available). Examples for its variants in
evolutionary genetics are given in [66]9. In PERICLES we use this concept in the sense of measurable contact between two entities, such as values in the cells of a feature-object matrix. Feature-feature, object-object, feature-environment and object-environment relationships can all be considered interactions; therefore the way a DO and its SEI interact can receive a formal treatment. Moreover, dependency expressed by weighted graphs is another expression of interaction, making our research strategy parsimonious.
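As a simple illustration of the matrix view (with invented data, not project results), interactions between DOs and environment features can be recorded as cell values, from which a weighted dependency graph can be read off:

    import numpy as np

    # Rows: digital objects; columns: environment features.
    objects  = ["SBA-installer", "config.xml", "render-service"]
    features = ["graphics-driver", "font-pack", "network-access"]

    # Cell (i, j) records the measured strength of interaction between
    # object i and feature j (e.g. observed frequency of joint use).
    interactions = np.array([
        [0.9, 0.4, 0.0],
        [0.0, 0.0, 0.1],
        [0.7, 0.0, 0.8],
    ])

    # Reading the matrix as a weighted bipartite dependency graph:
    edges = [(objects[i], features[j], float(interactions[i, j]))
             for i in range(len(objects))
             for j in range(len(features))
             if interactions[i, j] > 0]
    print(edges)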
Interaction maps can be computed by many methods, as shown both in bioinformatics [67] and
automatic categorization of DOs [68]. More importantly, as much as one can compare evolving
weighted graphs over time to express changes in the structure of SEI vs. DO interactions, one can do
the same by evolving interaction maps, in compliance with the evolving semantics topical constraint
of PERICLES. Finally, a series of interaction maps can also be animated, leading to a generic change-tracking tool in DP, for example for analysing provenance. This means that interactions pertain to all levels in an observed DE, and change tracking by the same conceptual toolkit can address different components of the system.

9 See also: “A long-standing dispute in evolutionary biology concerns the levels at which selection can occur (Brandon 1996, Keller 1999, Godfrey-Smith and Kerr 2001, Okasha 2004). As it turns out, there is no one process termed “selection.” Instead there are two processes—“replication” and “environment interaction”. Some authors see this dispute as concerning the levels at which replication can take place. Dawkins argues that replication in biological evolution occurs exclusively at the level of the genetic material. Hence, he is a gene replicationist. Other authors take the levels of selection dispute to concern environment interaction and insist that environment interaction can take place at a variety of levels from single genes, cells and organisms to colonies, demes and possibly entire species. Organisms certainly interact with their environments in ways that bias the transmission of their genes, but then so do entities that are both less inclusive than entire organisms (e.g., sperm cells) and more inclusive (e.g., beehives).” (http://plato.stanford.edu/entries/replication/)
5.2 A parallel between SEI and environment in Biological Ecosystems
In the PERICLES project, starting from the concept of an entity fitting its niche, we analysed the kinds of information extractable from a DO and its environment to improve its chances of being useful in the long term. From this perspective we concluded that it is possible to extend the SP framework beyond the DO to its environment. In section 4.2 we defined this additional source of knowledge, SEI, for a DO. SEI is a broad superset of the existing SPs, from which we adopt the concept of intended purpose, but extended to the whole DO environment rather than just to the DO's intrinsic properties. By adopting the practice of sheer curation, i.e. a situation in which curation activities are integrated into the workflow of the researchers creating or capturing data [49, 50], we considered how to examine the environment to discover and extract those dependencies that are crucial for different users for the use and reuse of DOs. Sheer curation closely parallels the situation in biology where organisms cannot be observed reliably outside of their niches; removing them from their environment results in an unavoidable loss of important information. In our working hypothesis, the DO was the
organism, with the binary data representing it considered as its DNA, and the natural environment
where this organism lived corresponded to the system environment. To map their connectedness, a software agent, such as our PERICLES Extraction Tool, observed and collected information about interactions between the DO and its immediate surroundings. By observing such interactions one can obtain a series of observations for further analysis and recognise functional dependencies.
Examples of what can be necessary to make use of the object in the future, depending on the purpose, include system properties (hardware and software configuration), resource usage, implicit dependencies on documentation, use of external data, processes and services (including provenance information), etc. Such information cannot be reliably reconstructed after the DO is archived. It has to be extracted from the “live” system while the user is present, and preserved together with the DO. As a concrete example, every piece of software has a distinct way to represent and process DOs, and merely changing the software version, let alone the software itself, can have an impact on the DO's properties.
By now it should be apparent that the SEI is a very close analogue of the niche in BEs, also with regard to its definition by functional analysis, which determines what information is necessary for a particular function of an organism or a DO. Since SEI, as an extension of SPs, is purpose-dependent, we can extract such informational dependencies so that every considered purpose has a specific sub-niche, with a general niche being the superset of the dependencies for all of the considered purposes.
Getting down to details, we note in passing that although ablations would allow a more detailed determination of the SEI, one cannot run such processes automatically, because our environment needs to include end users, who have time constraints and are unwilling to run the same experiments on very similar data several times. For this reason we need to observe what is happening and infer higher-level dependencies and their significance on that basis. Also, the niche we identify will consist only of the information that matters, not of all the available information.
Similarly to interaction maps, dependency graphs for the concise representation of SEI aim to determine the significant interactions between purpose (function) on the one hand, and DO and environment properties as dependencies on the other. For instance, we may collect provenance information with the aim of inferring functional and documentation-related dependencies between DOs,
but with the objective of not preserving complete provenance information per se (as not everything
may be significant).
A typical use case that requires the extraction of SEI is the creation of a new environment similar to the old one in such a way that it still fits the SP of the DO, just as the genome of a fossil could be revived under the right laboratory conditions. A matching real-life scenario is offered in the next section.
5.2.1 The metaphor applied to the SBA example
To substantiate the practical applicability of the introduced biological metaphor, we apply it below to the DP example of preserving an SBA, as described in 4.6.1.
SBAs are defined as artworks with a software component, which makes their behaviour and appearance highly dependent on their DE. In this example, we consider the situation where an SBA has to be installed on a new DE for the purpose of an exhibition. To this end it is important to ensure that the SPs of the SBA, which are described in more detail in [17] and [18] and are influenced by the conditions of the DE, are matched by the new system environment.
This situation is comparable to that of an organism moved from its niche in the natural BE to a newly created BE. The niche can be emulated by replicating the original surroundings as precisely as possible, with natural flora, fauna and resource settings, or by an artificially modelled niche with substituted components, for example when the organism is migrated to a zoo. To ensure its well-being and support its natural behaviour, the SEI of the niche that influences these conditions has to be preserved.
Figure 12. The biological metaphor.
Many components and processes that constitute the ecosystem, biological as well as digital, have a great impact on the SPs of its entities. For example, the behaviour of an SBA item depends on the available system resources, just as an organism depends on food availability: upgrading computer system memory can affect the execution speed of software methods as much as increasing food availability in the niche will affect the foraging rate of an organism. Consequently, for the relocation of an SBA object, as well as for that of an organism, one has to determine what the SEI is,
i.e. we have to extract it from the original ecosystem for an accurate emulation of the DE (Figure 12). Even if the SBA does not depend on other running software, it is affected by the interaction of that software with its DE, similarly to how organisms affect other living entities by manipulating certain ecosystem features. In the context of emulation, the peripherals used, their configuration and their drivers have to be preserved as well, to conserve the user interaction experience; this is again similar to the situation where the interaction of the organism with other BE entities can be affected by a feature change in the BE. The graphic drivers, libraries and display configurations, such as resolution and contrast, all belong to the SEI because they influence the display of the SBA and thereby the perception of the viewer, just as an observer's perception of an organism captured in a photograph can be affected by the settings of his camera. Here the observer is considered part of the ecosystem.
In a BE, behaviour patterns of organisms can be observed, e.g., during foraging, mating, and social interactions. Likewise, for the move to the new environment it is important to validate the behaviour of the moved object in order to test the reconstructed niche components. This can be done by comparing behaviour patterns with those measured in the original niche. Patterns for an SBA item can be extracted from measurements of system resource usage, external API calls, peripheral calls and process timings, plus the analysis of log messages. Furthermore, video recordings and documentation of the SBA installation can be used as well.
Changes to other ecosystem entities that interact with the SBA (for example a runtime environment change, another version of the programming language installation, or a different storage backend) can result in execution bugs that are comparable to the abnormal behaviour of a living being caused by changes in its social interactions with other beings. Information about SBA development is useful for migration, since it can become necessary to recompile new binaries from the source code, depending on the programming language used and the new system environment. This situation is comparable to an organism that is bred or cloned for better adaptation to changed niche conditions, so that the outcome preserves the quality and resilience of the migrated object.
We present a comparison table in Table 2 with the results of our investigations.
Table 2. A comparison of the two types of ecosystems (DE and BE) and their components.
SBA in DE – SEI | Organism in BE – SEI | Important for
Available system resources | Food (energy), territory (space) | Resources for environment reconstruction
Used system resources | Current environmental quantities | Resource status
Process timings | Behavioural events | Behaviour pattern validation
Software dependencies: drivers, libraries, configuration, peripheral settings | Essential interactions between organisms, i.e. symbiotic partners, trophic associations | Local interaction dependencies
Running software without dependency | Other organisms that live in the same ecosystem but in a different niche | System interaction dependencies
Development and programming language installation | Growth, maturation (i.e. metamorphosis), and reproduction | Influences on creation, running, adaptability
Graphic settings | Setting of a camera trap to observe organism activity | Appearance, perspective, overview
PERICLES Extraction Tool – sheer curation | Integrated environmental monitoring system resulting in a documentary film about the habitat of the organism | SEI monitoring and extraction techniques
6 Mathematical approaches for SEI analysis
With reference to the importance of anomaly detection in space data (section 4.2), related work is now either ongoing or planned in PERICLES in three directions:
1. SEI extraction of anomaly solution dependency by PET (as described in section 8.1.1 of this
deliverable);
2. Automatic anomaly detection by different methods;
3. Semantic analysis of anomaly reports.
Here we briefly discuss the theoretical background of automatic anomaly detection, with experimental results detailed in Chapter 8.
6.1 Problem description
The data set to be analysed originates from measurements made on the International Space Station (ISS). Our task is to find outliers in the time series of the measurements, including early signs that lead up to these outliers. The outliers are anomalies: they occur in situations which are not normal for the operation of the space station. This problem is an application of a methodology known as anomaly detection to the supervision of complex systems.
6.2 Solution: sensor networks
Data streaming from a network of sensors has unique characteristics. Anomalies in such collected data can either mean that one or more sensors are faulty, or that the sensors are detecting unusual events worthy of further analysis. We must be able to identify either type of fault [69].
A sensor network might include sensors that collect different types of data, such as binary,
categorical, integer, continuous, audio, or video. Hence measurements at given time stamps often
have a heterogeneous structure. It may also happen that the environment of sensor deployment and
the communication channel introduce noise, and missing values can also occur as a result.
Given these characteristics, it is not a surprise that anomaly detection in sensor networks has its own unique set of challenges. Ideally, anomaly detection techniques are expected to operate online, as opposed to offline batch processing. Online processing is made even harder by constraints on computational resources: any deployed algorithm must have low time and space complexity.
6.3 Contextual anomalies
In general, anomalies fall into three major categories [70]:
● Point anomalies, which are single data records deviating from others;
● Contextual anomalies, which occur with respect to a context;
● Collective anomalies, where a subset of the data instances causes the anomaly.
In sensor network data, the temporal nature of the measurement introduces the context; hence we
are looking for a solution that addresses contextual anomalies in the above classification. In a
contextual data set, each data instance is characterized by two types of attributes [69]:
1. Contextual attributes which define the context or neighbourhood for a particular instance. In
sensor network data, we are facing a time series; hence time is a contextual attribute which
determines the position of an instance within the entire sequence.
2. Behavioural attributes describe the non-contextual characteristics of an instance. In a sensor
network, these are the measurement data streaming from the individual sensors at a given
point in time.
Anomalies are identified using the values for the behavioural attributes within a specific context. A
data instance might be normal in certain contexts, whereas a similar or identical instance could be a
contextual anomaly in another context. This is key to understanding contextual and behavioural
attributes for a contextual anomaly detection technique.
Just as in the broader discipline of machine learning, we differentiate between labelled and unlabelled anomaly detection, which leads to two different paradigms:
1. Supervised learning: an algorithm is given samples that are labelled. For example, the
samples might be microarray data from cells, and the labels indicate whether the sample
cells are cancerous. The algorithm takes these labelled samples and uses them to induce a
classifier. This classifier is a function that assigns labels to samples including the samples that
have never been previously seen by the algorithm;
2. Unsupervised learning: in this scenario, the task is to find structure in the samples without
labels. For instance, finding clusters of similar instances in a growing collection of sensor
measurements might identify outlying elements that are anomalous.
Supervised anomaly detection suffers from class imbalance: typically only a tiny fraction of the examples is labelled as anomalous. Hence a trivial rejector, an algorithm that attaches a non-anomalous label to every instance, might score near one hundred percent accuracy, yet be entirely ineffective at identifying the anomalies. Performance measurements must take class imbalance into account, and so must the learning algorithms.
An additional problem with labelling is that some anomalies in the data have loose labels. For instance, instead of identifying three anomalies in sequence in a time series of multivariate measurements, we might only get an approximate identification of the time interval in which the anomalies occurred. This amounts to inherently high label noise, which renders most convex supervised learning algorithms unsuitable; therefore non-convex algorithms must be considered.
Most anomaly detection algorithms detect point anomalies, so it is often easier to transform contextual and collective anomaly problems into point anomaly problems by additional pre-processing. As
contextual anomalies are individual data instances just like point anomalies, but are anomalous only
with respect to a specific context, the straightforward approach is to apply a known point anomaly
detection technique within a context.
Following this train of thought, the reduction to point anomaly detection has two steps. First, identify
a context for the data instances using the contextual attributes – time stamps in the case of sensor
networks. Secondly, calculate the anomaly score or label the test instance within this context using a
known point anomaly detection technique.
A transformation technique for time-series data uses phase spaces, a form of windowing on
multivariate data [71]. This involves converting a time-series into a set of vectors by unfolding the
time-series into a phase space using a time-delay embedding process. The temporal relations are
embedded in the phase vector across all instances. This results in a traditional feature space that can
be used to train support vector machines (SVM), one-class SVMs in particular.
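As an illustration of the embedding step, here is a minimal sketch (illustrative names; a univariate series is assumed for simplicity) of unfolding a time series into phase vectors, which could then be fed to a one-class SVM:

    // Hypothetical sketch: time-delay embedding of a univariate series.
    // Each phase vector contains d values taken at intervals of tau steps.
    public static double[][] phaseSpace(double[] series, int d, int tau) {
        int n = series.length - (d - 1) * tau; // number of phase vectors
        double[][] vectors = new double[n][d];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) {
                vectors[i][j] = series[i + j * tau];
            }
        }
        return vectors; // rows form the feature space for a one-class SVM
    }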
Mapping a contextual problem to point anomaly detection is not always possible or straightforward; this is often the case for time series. Direct methods must then be considered, which normally extend time-series modelling techniques. A simple example is multivariate regression, which tries to predict the next value in the time series; a failed prediction indicates an anomaly. It is not always clear, however, which attributes should be predicted.
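A minimal sketch of this prediction-residual idea follows; the predictor below is a deliberately trivial stand-in (a short moving average), where a real system would use a fitted regression model:

    // Hypothetical sketch: flag an anomaly when the prediction residual
    // exceeds a threshold. predictNext() stands in for any fitted model.
    static boolean isAnomalous(double[] history, double observed, double threshold) {
        return Math.abs(observed - predictNext(history)) > threshold;
    }

    // Trivial stand-in predictor: mean of the last (up to) five values.
    static double predictNext(double[] history) {
        int k = Math.min(5, history.length);
        double sum = 0;
        for (int i = history.length - k; i < history.length; i++) {
            sum += history[i];
        }
        return sum / k;
    }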
With these considerations in mind, we designed an experiment, which has produced encouraging results (discussed in Chapter 8).
7 The PERICLES Extraction Tool
7.1 Introduction
We have developed the PERICLES Extraction Tool (PET) entirely10 in the scope of the PERICLES
project, for the extraction of information from the environment where DOs are created and
modified. While different projects looked at sheer curation for very specific domains and use cases
[22, 24, 25], we have built a generic, modular framework that can be adapted to support different
use cases and domains through specific modules and configuration profiles. Our tool is focused on
information extraction, while others target different aspects of information curation. We have also
addressed the context of unstructured workflows, where the user is not adopting any workflow
system, making it important to observe the flow of events in an agnostic framework.
In a nutshell, PET works by analysing the use of the data from within the creator or consumer environment, extracting information useful for the later reuse of the data that cannot be derived in later phases of the data lifecycle, for example at ingest time.
PET is open source (Apache licensed, final approval pending at the moment of writing) software written in Java and developed to be platform independent, soon to be available on GitHub at the address https://github.com/pericles-project/pet. It implements various information extraction techniques as plug-in Extraction Modules, either as complete implementations or, where possible, by reusing already existing external tools and libraries. Environment monitoring is supported by specialized monitoring daemons and by the continuous extraction of relevant information triggered by environment events related to the creation and alteration of DOs, such as the alteration of an observed file or directory, the opening or closing of a specific file, and other system calls. The tool can be used in a sheer curation scenario, running in the system background under the full control of - but without disrupting - the user. Furthermore, a snapshot extraction mode exists for capturing the current state of the environment, mainly designed to extract information that doesn't change frequently, such as system resource specifications.
An advantage of the use of PET is that established information extraction tools can be integrated as modules. A user who previously had to use many different tools to extract the information and metadata for a scenario can use our tool as a framework instead, and will get all required information in one standard format, saved via a selectable storage interface. Furthermore, this approach makes it possible to enrich the established set of information with additional information extracted by other PET modules. A tutorial for the use of PET has also been created and is available11.
7.2 General scenario for SEI capture
In order to further clarify the tool description, we briefly describe here a general scenario for the
information capture that we are aiming at with our PERICLES Extraction Tool. In this scenario we
observe and collect environment information from a user's computer as he interacts with DOs for
different purposes. The tool is installed with the agreement and under the full control of the user.
We want to look individually at the environment changes as the user, for example, calibrates some data, runs unstructured analysis workflows, creates new DOs, and in different ways makes use of the data through access, interaction, and transformation.

10 Future versions of the tool could contain source code from developers external to PERICLES, given the open source software license.
11 http://goo.gl/hvT5rm
We have the following overall objectives that we want to accomplish, each of which depends on the
previous one:
1. Use the PET tool to collect environment information when the DOs are used, based on specific environment information profiles (depending on the use case);
2. Analyse the information collected to infer new relationships between DOs;
3. Assign values to the dependencies based on the purpose and significance (significance weights).
In the rest of the chapter we describe PET in its current development status, which covers mostly the
first objective and starts to address the second.
7.3 Provenance, related work
Provenance information is a type of metadata that is used to represent the history of the
transformations and events for a DO. As part of our scenario, some of the data collected will be in
the form of provenance information. We are exploring how this processing of the DO’s history can
help us infer dependency relationships, as described in more detail in paragraph 5.3.
Our tool’s final aim is to collect SEI, represented by the relationships between the original DO and the
information needed for making specific use of the object (such information can be extracted or just
kept as a reference, if already existing as a DO). In contrast, provenance addresses the history of the
DO, i.e. how it has been created, with what information, and how it has been transformed. Unlike provenance information, such dependencies are not related to single events, nor are they reports of what has happened. However, it will still be possible to use our tool for collecting useful
provenance information, although it is not our main focus. During the development of PET, we have
considered various provenance collection tools, to investigate if they could be helpful for our use
cases.
One such example, SPADE [72], is a cross-platform tool for collecting provenance information. Its architecture [73] is similar in some ways to the one we independently designed for PET, with “reporters” that have a role similar to that of our modules. However, SPADE and its modules are focused on collecting provenance information, and do not cover the variety of information we are addressing with the PET tool. We also try to limit the amount of information collected to the degree useful for determining SEI, and we found that the existing modules are not a good match (although some of the techniques used have similarities).
In [74] the scenario is different: a scientist is running experiments using a formal workflow system (Taverna [75]), and the aim is to preserve the process used in the experiment. While this is of course a worthy approach, it differs from our intent, as it is based on a scenario where the user formally defines the workflow and other relevant information. In our case, we attempt to capture the process in an existing environment, where a formal workflow may not be defined.
The TIMBUS project [4] encompasses the extraction of context information of business processes to
support their long-term preservation. The collected environment information should “allow
redeployment of the system at a point in the future.” Furthermore, TIMBUS investigates the dependencies related to this use case.
Long-term preservation of business processes could be an excellent use case for SEI extraction.
Integrating the recently released TIMBUS extractors12 with the PERICLES Extraction Tool would be an enriching task for further development, improving PET's support of business-process-related scenarios. Conversely, some of the PET extraction modules could also be useful in supporting the TIMBUS scenario. Currently, the extractors available in the TIMBUS context population tool13 are different from, and non-overlapping with, those provided by PET: they cover software environment dependencies (Debian packages, DLL dependencies on Windows), physical contexts (sensor location, weather), informal business processes derived by analysis of log files, and network and service contexts. The PET tool addresses a different perspective, based on sheer curation and observing user interaction with the data in order to determine different use dependencies, as opposed to contextual information for process re-instantiation.
7.4 PET general description and feature overview
The tool (Figure 13) aims to be generic: it is not created with a single user community or use case in mind, but can be specialized with domain-specific modules and configuration. A customized selection of Extraction Modules can be made for each purpose, and most of the modules can be configured to adapt to different use cases. For highly specialized use cases it is possible to develop custom Extraction Modules and include them in the tool. For this purpose, a template exists that sketches and describes the development of an Extraction Module for the developer.
In the extraction process, two different types of environment information can be distinguished: file-dependent information, and environment information that is valid for all files in the environment. Both of these information types can either be volatile, meaning that they change frequently or upon specific events, or remain constant for a long time or forever. An example of volatile information is the current system resource usage; if it wasn't extracted at the moment of interest, it can't be extracted afterwards. The system resource specification, by contrast, is constant information that only changes if new system hardware is installed.
A system environment contains a plethora of potentially useful information, so discovering the most
relevant and significant subset is important. With PET it is possible to create extraction profiles for
different purposes to manage the information diversity. Profiles contain a set of investigated files
belonging to digital objects, and a set of extraction modules configured to fit the purpose. Future
developments will include significance evaluation based on weighted graphs for the creation and
selection of extraction profiles. A connection to the information encapsulation tool developed in Task
4.2 will also be provided, to allow the encapsulation of the extracted information together with
related DOs.
It is important to note that the major aim of the tool has been to enable the collection of the relevant information from the live environment, in response to relevant events. The raw data collected will be further analysed by the tool in later tasks of the PERICLES project to derive higher-level SEI. The extracted SEI can be useful for various use and reuse scenarios and other specific purposes. We are also investigating techniques to encapsulate the extracted SEI together with the related DOs in Task 4.2, in order to avoid data loss. These techniques will be implemented in a further PERICLES tool, which will interact directly with the PET.
12 See: http://timbusproject.net/resources/blogs-news-items-etc/timbusnewsletter/timbusnewsletter32/260timbusnewsletter32.
13 https://opensourceprojects.eu/p/timbus/context-population/extractors/debian-software/
Figure 13. PET snapshot.
PET feature overview
The following list outlines the main features of the PET tool:
● Extracts information that is usually ignored by current metadata extractors.
● The extracted environment information from outside the digital object enables further reuse possibilities.
● Extracts information at the right time and place: within the production environment at the time of occurrence.
● Supports continuous extraction in a sheer curation scenario.
● Visualizes information change over time.
● Information snapshot extractions allow getting a quick overview of extractable environment information.
● Open source: Apache 2 license.
● Platform independent (needs Java 7).
● Modular and extendable architecture that supports specialized needs.
● Use profiles allow the parallel usage for different scenarios.
● Supports configuration exchange with other users.
● Provides a graphical user interface, but can also run without graphics in console mode.
● Provides an exchangeable storage backend.
● Saves results in a standardized format: JSON or XML.
● Provides different views for browsing the extraction results.
7.5 Requirements analysis and software specification
This section works out the most relevant requirements for the 4.1.2 extraction techniques and framework. The aim is to develop a software specification to drive the design of a framework architecture that addresses the results of research in Task 4.1 (while taking into account the development planned in T4.2) and our current use case requirements from WP2. The fundamental requirements are outlined in the DOW. Additional requirements are derived from the rapid case studies of the use case partners, the analysis of user requirements in D2.3.1 and from the results of discussions with our partners. Further requirements emerged during the development process as feedback from experiments, demonstrations and a workshop, and as ideas resulting from the prototyping.
A complete list of the requirements is reported in Appendix A.
Discussions with our partners during the development process, based on demonstrations of early software prototypes, use case experiments, and a workshop, provided us with feedback that we used to refine the specification and to implement specialized extraction techniques. These sessions also gave rise to the ideas for the two use case scenarios that are specially adapted to the extraction tool: the Software Based Artwork scenario, which we developed in coordination with Tate, and the SOLAR Operator scenario, which was built together with our PERICLES partners B.USOC and SpaceApps. Both are described previously in this deliverable, in section 4.5.
From these extracted requirements we created the specification for the software architecture. Note that the development process ran in parallel with the related research, so we chose an agile approach and iteratively adjusted the specification a few times.
We specified two extraction modes: (a) a continuous extraction mode to meet the requirements of a sheer curation scenario, and (b) a snapshot extraction mode to extract information post hoc, as is needed for, e.g., most digital forensic techniques. In order to support different scenarios with high configurability, we specified developing the extraction techniques as plug-in modules and using the JSON standard for their configuration. For each scenario a selection of these techniques should be made and saved in a profile. This also opens the future possibility of suggesting a configuration via weighted graphs, even though these are not implemented yet. Environment monitoring by daemons was likewise specified to be modular, to fit the different use cases. An event management component handles the events triggered by the daemons and invokes related extractions.
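For illustration only, a profile with configured extraction modules in this style could look as follows; the module names, keys and values here are hypothetical, and the actual PET schema may differ:

    {
      "profileName": "sheer-curation-example",
      "modules": [
        { "module": "DirectoryMonitorDaemon",
          "config": { "path": "/home/user/data", "recursive": true } },
        { "module": "RegexLineExtractionModule",
          "config": { "file": "output.log", "pattern": "^result: (.*)$" } }
      ]
    }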
We decided to use the Java programming language to be able to develop the tool independently of the operating system, which addresses user requirement UR-AM-BDA-02. Furthermore, we specified an exchangeable storage for the extracted information and provided an interface for it, to cope with different environments and needs. By default the extracted information should be saved in JSON format, which facilitates its use by other tools, such as the T4.2 encapsulation tool.
Finally, we specified providing a command line interface and, optionally, a graphical user interface, but the aim was to develop the tool in such a way that every configuration could also be made manually in configuration files. The configuration should always be saved for future executions, to be able to automate the extraction process.
The final architecture design based on this specification is described in detail in the next section.
Also, a list of the actually implemented extraction techniques can be found there.
7.6 Software Architecture
In this section we firstly provide a general description of the software architecture of PET. The
software components, implementation techniques, and the components' interaction flow are then described in more detail. Figure 14 illustrates a simplified sketch of the main components of PET.
Figure 14. PET architecture sketch.
Extraction Modules define how information can be extracted from the system environment, including customized algorithms and system calls for different operating systems and the usage of external tools and libraries. They can be configured to fit customized needs. The tool's Extractor executes the Extraction Modules during an information extraction run, and they return the extracted information.
PET manages lists of Digital Object-Parts, which are the representations of important files in the tool structure. They are organized in Profiles by the user or by templates that support specific purposes. Profiles also keep a set of configured Extraction Modules, which extract the SEI related to the Profile's Digital Object-Parts.
The last components displayed in Figure 14 are the Environment Monitor Daemons, which observe the computer system environment for designated events and, when these events occur, trigger extractions by other Extraction Modules or other event-handling processes.
Figure 15 shows a more detailed schema of the PET architecture; please refer to Appendix B for a detailed description of each part of the architecture.
A typical flow for user interaction and for interaction between the tool's components follows. The user executes the tool for the first time, without defining start commands. A screenshot of the PET tool with its modules can be seen in Figure 16. Consequently, the Extraction Controller Builder builds a default Extraction Controller, which initializes the other controllers and the CLI and GUI user interfaces. The user views a default Profile in the GUI, but creates an empty new Profile for his own purposes. Using the GUI, he adds files, which are parts of important Digital Objects, to that Profile. Internally, the following process is executed: the paths of the files to be added are passed to the Profile Controller, where they are validated and added to a Part data structure. These Parts are added to the selected Profile and the interfaces are then updated.
During the next step, the user adds Extraction Modules that fit the use case, using the corresponding GUI dialog. Internally, the following steps take place: to display the dialog, the GUI requests the list of available Extraction Modules. This is provided by the Module Controller, which at tool start searched for all Extraction Module classes and created a set of Generic Modules. After the user selects which modules to create, Extraction Module instances are created from the Generic Modules and added to the selected Profile.
Figure 15. PET detailed architecture.
Most of the Extraction Modules need a configuration before they can be executed. The user browses the configurations of the added modules and adjusts them to fit the scenario. The configuration is saved as a Module Configuration class.
Now all configurations are ready for the first extraction. The user decides to run a single snapshot
extraction, to get an overview of the environment information. Therefore the Extractor executes all
Extraction Modules of the Profile. The file-independent modules return more general information,
which is stored in the Environment class of the Profile. Each file-dependent module is executed once
for each Part of the Profile, because it returns different pieces of information, depending on the file
that is represented by the Part. The extraction result is saved into the related Part class, and, after the extraction run, the Storage Controller is used to serialize the results and save them to the currently configured Storage. Daemon modules are ignored during snapshot extraction. After the extraction run, the GUI is updated and the user can browse the results.
The user decides to start the continuous extraction mode for capturing information during a working
session, during which the files that are added to the tool will be altered. He closes the GUI and starts
working. The tool components act as follows: first the Extractor starts the daemon Extraction
Modules of the Profile, which begin monitoring the computer system environment. Furthermore, the
File Monitor is started and observes the Profile's files for alterations or deletions. When the monitoring components detect an occurrence that they want to report, they create Events and pass them to the Event Controller, where the Events are handled and trigger update extractions. After the working session, the user closes the continuous extraction and browses the results.
The user regards his created Profile as useful and wants a colleague to use the same one. Therefore, he exports the Profile as a Profile Template and sends the generated template files to his colleague, who can then import them to create the same Profile on his PET installation. Internally, all Extraction Modules of the Profile and their Module Configurations are serialized to JSON objects by the Configuration Saver (with the aid of the Jackson API) and saved to files in a template directory. The Profile Controller of the other PET installation uses the Configuration Saver to deserialize the Extraction Modules and recreate the Profile. The Profile's files are not saved to Profile Templates, as they are probably not present in other environments, so another user has to add his/her own files before using the Profile.
Figure 16. PET snapshot showing Extraction Modules and their configuration.
Finally, the user shuts PET down, but expects to get the same configuration back at the next tool start. At shutdown, the Extraction Controller initiates the shutdown of each tool component and saves all configurations. The Profile Controller starts saving each Profile; for this, the Configuration Saver is used to save all Extraction Modules and all Parts into the Profile's output directory. Furthermore, the Extraction Controller uses the Configuration Saver to save more general tool options, such as whether the GUI is running and whether the continuous extraction mode is enabled. These will all be loaded at the next start.
7.7 Extracting SEI by observing the use environment
The abstract problem to be addressed is that a user of an archive may, on the one hand, have access to a large corpus of preserved documents and data, but without additional use environment information this data is of limited use in practical situations. By tracking and analysing the use environment, the informal knowledge can be captured and curated, and linked to the existing material in the archive. This enhances the extent to which the content can be reused in the future.
In order to address this type of problem, we are developing a promising technique based on the observation of the environment (specifically, of the software currently executing on the observed system) when some specific action is performed, in order to extract use environment information. Such information can represent implicit knowledge of the user and can be important for supporting future use.
By observing the currently running software and collecting and analysing the system calls and files used by an application, using a configurable set of parameters, this approach allows us to infer dependencies between observed files. It allows a reasonable amount of general information to be gathered all the time, while the activities are analysed in more depth when an interesting set of parameters indicates the likelihood that a particular activity is being executed.
We first describe here a simple scenario that will allow us to illustrate how such collection of SEI
should take place. Suppose a scientist is calibrating data, using some specific scripts, as described in
section 4.5.2. The PET tool runs on the scientist's machine, monitoring the environment for events that are important for the information collection.
● The execution of a specific script triggers the event “data calibration”, indicating that the user is calibrating this set of data using this script with these parameters;
● Based on the event information and the state of the system, the tool will first start examining the system in more detail, for example by starting a more detailed examination of the parameters and input data for the script, or observing other target applications such as office software;
● A series of events and environment information is collected. This will be used to infer the activity being executed (the user's purpose), and the dependencies between DOs (by looking at patterns of usage and concurrent use of different DOs from specific software processes, dependencies based on the script, input and output parameters, or based on other heuristics);
● By using a larger series of this collected data, we may be able to assign a significance value to the dependencies, for example by looking at how often a DO of type X is used in conjunction with a DO of type Y; a possible weighting scheme is sketched after this list;
● The collected data could also be stored and kept for improving the analysis, for example by using more complex and time-consuming algorithms.
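The significance step in the fourth point could, for instance, be based on simple co-occurrence counting. The following minimal sketch (hypothetical names, not the implemented algorithm) derives a weight for a DO-DO dependency from the fraction of observation windows in which both DOs were in use:

    // Hypothetical sketch: significance weights from co-occurrence counts.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SignificanceEstimator {
        private final Map<String, Integer> pairCounts = new HashMap<String, Integer>();
        private int windows = 0;

        // Record one observation window listing the DOs in concurrent use.
        public void recordWindow(Set<String> openObjects) {
            windows++;
            List<String> objs = new ArrayList<String>(openObjects);
            Collections.sort(objs); // canonical order for pair keys
            for (int i = 0; i < objs.size(); i++) {
                for (int j = i + 1; j < objs.size(); j++) {
                    String key = objs.get(i) + "|" + objs.get(j);
                    Integer count = pairCounts.get(key);
                    pairCounts.put(key, count == null ? 1 : count + 1);
                }
            }
        }

        // Weight: fraction of windows in which both DOs were used together.
        public double weight(String a, String b) {
            String key = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
            Integer count = pairCounts.get(key);
            return (windows == 0 || count == null) ? 0.0 : count / (double) windows;
        }
    }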
We are now able to partially address the first three steps of this process, as illustrated in the experimental results in 8.1.1. The dependencies can be mapped, automatically when possible, into a graph structure, where the edges are weighted to illustrate the importance of each dependency for the execution of an activity, as described in Chapter 6. The most important dependencies can then be identified on the basis of the dependency graph, which helps to determine the SEI to be extracted for similar scenarios. This has not been practically implemented yet, but is illustrated in Chapter 6, and will form the basis for later work on the appraisal of the collected information.
7.8 Modules for environment extraction
This section describes the different techniques we are using to extract environment information. The types of extraction module implementations can generally be divided into three groups: (a) implementations using the Java SE libraries provided with the programming language, (b) implementations calling operating system APIs, and (c) implementations integrating external programs or libraries.
The sigar library [78] is useful for capturing a great deal of environment information independently of the operating system, for example system specifications and system resource usage. Based on this library, we developed extraction modules for capturing an information snapshot of the CPU specification, the available network interfaces, file storage information, file system information, network information and other system resources. Furthermore, modules for the continuous extraction of volatile information were developed. Examples of such information are, among others, memory usage, CPU usage, process statistics, process parameters, swap usage, the FQDN14, TCP calls, and information that can be extracted by system commands emulated by sigar, such as uptime and who.

14 Fully Qualified Domain Name, see http://en.wikipedia.org/wiki/Fully_qualified_domain_name
Another type of module (daemon modules) monitors the environment for specific events and reports them by triggering new extractions or adding new files to the list of monitored files. In particular, two modules were created to support the technique described in the previous paragraph 7.7, by monitoring the resources used by running software.
The ‘lsof’ module iteratively runs the lsof15 command, available and often included in most UNIX variants, to list the open files and sockets of processes. As far as we are aware, there are other powerful commands that allow monitoring open files and sockets in Unix variants (such as dtruss16 and fs_usage17 for OS X, and strace18 for Linux); we opted for lsof for a number of reasons:
● lsof is cross-platform, so one command works on Linux, OS X and other Unix variants, while the other commands are platform specific;
● lsof does not require administrative privileges, as the other commands do;
● lsof is in practice quite reliable at reporting such events.

15 LiSt Open Files, see http://en.wikipedia.org/wiki/Lsof
16 https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/dtruss.1m.html
17 https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/fs_usage.1.html
18 http://en.wikipedia.org/wiki/Strace
In the module configuration, the user can define a set of commands to watch; when an occurrence satisfying the monitoring conditions happens, the module generates a new event that is reported to the event system (which can react by storing the event, or by adding a file to the list of monitored files). Examples of such events are the opening or closing of files or sockets by a specific application. This module has been used in the scenario described in detail in 8.1.1.
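A minimal sketch of the underlying polling approach follows (illustrative only; the actual module additionally handles configuration, filtering and event generation):

    // Hypothetical sketch: poll lsof for files opened by a given command
    // and report paths not seen in previous polls.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.HashSet;
    import java.util.Set;

    public class LsofPoller {
        private final Set<String> known = new HashSet<String>();

        public Set<String> pollNewFiles(String command) throws Exception {
            // -c selects processes whose command name begins with 'command';
            // -F n requests machine-readable output with 'n' (name) fields.
            Process p = new ProcessBuilder("lsof", "-c", command, "-Fn").start();
            Set<String> current = new HashSet<String>();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    if (line.startsWith("n")) current.add(line.substring(1));
                }
            }
            current.removeAll(known); // keep only newly observed files
            known.addAll(current);
            return current;           // candidates for "file opened" events
        }
    }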
A similar, but less effective, solution was also implemented for Windows, where the lsof command is not available: a module based on the ‘handle’ command19. This module reports open files based on the repeated execution of the handle command, similarly to the lsof module. The handle command, although free, has a restrictive license20 that does not allow redistribution; for this reason, a user wishing to use the module has to manually download the handle command and copy it into the native command directory.

19 http://technet.microsoft.com/en-us/sysinternals/bb896655.aspx
20 http://technet.microsoft.com/en-us/sysinternals/bb847944
A module for capturing a snapshot of the currently installed software was developed by calling the operating-system-dependent software management component. It is an example of a module that uses customized operating system calls. Another module of this kind extracts information about the currently used graphics card.
Several extraction modules were developed using only the Java SE API, and work independently of external libraries and commands.
The directory monitor module, which monitors directories for the creation, modification or deletion of files, uses the Java Watch Service. General file storage information is extracted by calling Files.getFileStore(file), while POSIX (Portable Operating System Interface for Unix) file information is extracted using Files.readAttributes(file, PosixFileAttributes.class).
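For illustration, a minimal, self-contained sketch of directory monitoring with the Java 7 WatchService follows (the real module integrates with PET's event system rather than printing to the console):

    // Minimal WatchService sketch: report create/modify/delete events
    // for entries of a single watched directory.
    import java.nio.file.*;

    public class DirectoryWatchSketch {
        public static void main(String[] args) throws Exception {
            Path dir = Paths.get(args[0]);
            WatchService watcher = FileSystems.getDefault().newWatchService();
            dir.register(watcher,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_DELETE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until events arrive
                for (WatchEvent<?> event : key.pollEvents()) {
                    System.out.println(event.kind() + ": " + event.context());
                }
                if (!key.reset()) break; // directory no longer accessible
            }
        }
    }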
Information about the Java installation and more general operating system properties is extracted by calling System.getProperties(). The javax.xml package was used to develop a module that extracts information from XML files with XPath expressions. java.util.regex was used to extract information from normal text files with regular expressions, as well as in a module that extracts information from log files via regular expressions. Two modules were developed using java.awt: one that captures screenshots to save graphical elements, and a second that extracts the system's graphic properties. A module to calculate the checksum of a file uses java.security.
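As an illustration of the latter, a checksum computation using only java.security and java.nio could look like this (a sketch with a hypothetical class name; the module's actual algorithm choice and output format may differ):

    // Sketch: compute a hexadecimal SHA-256 checksum of a file.
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    public class ChecksumSketch {
        public static String sha256(String path) throws Exception {
            byte[] data = Files.readAllBytes(Paths.get(path));
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }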
Additionally, the following modules exist that integrate external tools for information extraction or environment monitoring: an Apache Tika file identification module, a MediaInfo module, an OS X Spotlight module, a module for extracting MS Office document dependencies using the Office Dependency Discovery Tool [79], and a chrome-cli module [80] for monitoring open tabs in the Chrome browser. For the extraction of font dependencies of PDF files, the PDFBox library is used. In addition, a generic module was created to execute any OS command with a specific configuration, allowing users to create new modules by just defining the command configuration, with no programming necessary.
7.9 Development process and techniques
This section describes the processes and techniques we used for the software project. An agile
approach allowed us to start the development at an early stage and to evolve the development ideas
during the parallel theoretical research.
During the initial requirements analysis, we asked our use case partners for user scenarios and also created our own user scenarios, which we discussed and evaluated with the use case partners. The main intention was to produce and circulate ideas of what the extraction tool should do and which purposes it could be useful for. We wrote a requirements and features document based on these user scenarios, which served as the basis for creating the specifications for the first prototype.
The first prototype served as a demonstration test-bed to consolidate the ideas and adjust the specification. We created a screencast demonstration, which was distributed to the partners, and gave a live demonstration of the prototype at a project meeting (October 2013). That brought us a lot of feedback, which was used in creating the specifications for the second, more mature prototype.
To evaluate the tool, we organized a practical two-hour workshop at one of our project meetings (January 2014), where we allowed the partners to test and use the tool on their own systems, as reported in section 8.4. A second evaluation, in the scope of the SBA scenario, was recently held (July 2014) and will be reported once results are available.
The following list provides an extract of the technologies and techniques used in this software
project:
● Programming language: Java 7^21
● Libraries: Swing^22, JUnit^23, reflections, JDOM^24, JCommander^25, Jackson^26, sigar^27
● External tools: exiftool^28, Windows file utility^29, Windows MediaInfo^30, OfficeDDT^31, Tika^32, NanoHTTPD^33, elasticsearch^34
● IDE: Eclipse^35 (plugins: EclEmma^36 for code coverage, FindBugs^37, Subversive^38 for svn, m2e^39 for maven)
● Development techniques: svn^40, mvn^41, JDoc^42, design patterns^43, unit tests^44

21 "Java Platform SE 7 - Oracle Documentation." <http://docs.oracle.com/javase/7/docs/api/>
22 "javax.swing (Java Platform SE 7) - Oracle Documentation." <http://docs.oracle.com/javase/7/docs/api/javax/swing/package-summary.html>
23 "JUnit." <http://junit.org/>

7.10 Scalability
The issue of scalability should have a marginal impact on the PET tool, given the assumptions on its use. The observations carried out by the tool involve the user as a key element, and the data that we are trying to collect relates to user interaction. We are not trying to observe automated workflows and processing, as such workflows are already formalised.
So, by design, we do not forecast scalability issues for the PET tool: the limiting factor on the amount of data that PET handles is the human factor, as the user will manually handle a limited amount of data.
Furthermore, the PET tool will help with the appraisal of the data, so it can mitigate scalability issues by reducing the amount of data that is necessary when considering a particular context.
24 "JDOM." <http://www.jdom.org/>
25 "JCommander." <http://jcommander.org/>
26 "Jackson JSON Processor - Home." <http://jackson.codehaus.org/>
27 "SIGAR API (System Information Gatherer and Reporter ..." <http://www.hyperic.com/products/sigar>
28 "ExifTool by Phil Harvey - SNO." <http://www.sno.phy.queensu.ca/~phil/exiftool/>
29 File utility for Windows: <http://gnuwin32.sourceforge.net/packages/file.htm>
30 "MediaInfo - Download." <http://mediaarea.net/en/MediaInfo/Download/Mac_OS>
31 "Dependency Discovery Tool - SourceForge." <http://sourceforge.net/projects/officeddt/>
32 "Apache Tika - Apache Tika." <http://tika.apache.org/>
33 "NanoHttpd/nanohttpd · GitHub." <https://github.com/NanoHttpd/nanohttpd>
34 "Elasticsearch.org Open Source Distributed Real Time ..." <http://www.elasticsearch.org/>
35 "Eclipse Luna." <http://www.eclipse.org/>
36 "EclEmma - Java Code Coverage for Eclipse." <http://www.eclemma.org/>
37 "FindBugs™ - Find Bugs in Java Programs." <http://findbugs.sourceforge.net/>
38 "Eclipse Subversive - Subversion (SVN) Team Provider." <http://www.eclipse.org/subversive/>
39 "m2eclipse." 2011. 1 Jul. 2014 <https://www.eclipse.org/m2e/>
40 "Apache Subversion." <http://subversion.apache.org/>
41 "Maven Ant Tasks - Mvn." <http://maven.apache.org/ant-tasks/examples/mvn.html>
42 "Javadoc Tool Home Page - Oracle." <http://www.oracle.com/technetwork/java/javase/documentation/javadoc-137458.html>
43 Gamma, Erich et al. Design patterns: elements of reusable object-oriented software. Pearson, 1994.
44 "Unit Tests - Extreme Programming." <http://www.extremeprogramming.org/rules/unittests.html>
8 Experimental results and evaluation
We describe here the experiments we have set up to validate the functionality and important aspects of the frameworks. These cover the PET tool, as described in sections 8.1.1, 8.1.2 and 8.2, and the SOM anomaly detection described in Chapter 6, in sections 8.1.3 and 8.3.
In all the PET experiments, some common background steps are assumed:
1. The PET tool is installed, configured and started on the machine where the DOs are used.
2. The user interacts with the system while PET collects EI in the background.
3. The environment information, DO events and changes are stored for future use and analysis.
8.1 Space science
8.1.1 PET on operations: anomaly dependency information
As described in the third example of paragraph 4.5.2, operators dealing with anomalies usually find solutions by searching through a multitude of documents: for example, solutions to previous anomalies, telemetry, console logs, meeting notes, emails, etc. Although such data is present in the storage, selecting the relevant items requires experience and specific knowledge that is usually passed from operator to operator. For this reason we address the collection of such dependencies between anomalies and mission documentation, using some of the techniques we envisaged in section 7.7, in order to preserve useful information that is otherwise not captured.
In more detail, when an anomaly occurs, the issue is recorded on the 'handover sheet'. Different
procedures are executed to solve the issue, and very frequently the operators need to access the
relevant documentation.
We have set up a simplified experiment to demonstrate what types of significant environment
information can be collected in this scenario.
In order to support this scenario, we set up a PET profile that tracks the use of relevant software on
specific files, using the PET software monitor; this enables us to have a trace of the documents that
have been used at a given moment in time, as illustrated in Figure 18.
At the same time, it is possible to observe the ‘handover sheet’ and track the reported start and end times of an anomaly (as shown in Figure 17, where a new issue is written in the document).
Figure 17. Screenshot showing changes in the 'handover sheet' tracked by the PET tool, used to determine
anomaly time.
The connection between the documentation track and the ‘handover sheet’ tracking can allow us to
infer the ‘anomaly solving time span’ (indicated in red in Figure 18) and assume there is a
dependency between the solution to the anomaly and the documentation that was used between
the start and end of the anomaly.
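As an illustration of this inference step, the following minimal Java sketch (not part of the released PET code; all names are illustrative) selects the documents whose recorded use overlaps the anomaly time span:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/** Illustrative sketch: infer document dependencies for an anomaly by
 *  intersecting the document-use trace with the anomaly time span. */
public class AnomalyDependencyInference {

    /** A document-use event (open/close times) from the software monitor. */
    public static class DocumentUse {
        final String path;
        final Date opened;
        final Date closed;
        DocumentUse(String path, Date opened, Date closed) {
            this.path = path; this.opened = opened; this.closed = closed;
        }
    }

    /** Documents whose use interval overlaps [start, end] are assumed to be
     *  dependencies of the anomaly's solution. */
    public static List<DocumentUse> dependenciesFor(Date start, Date end,
                                                    List<DocumentUse> trace) {
        List<DocumentUse> deps = new ArrayList<DocumentUse>();
        for (DocumentUse use : trace) {
            boolean overlaps = !use.closed.before(start) && !use.opened.after(end);
            if (overlaps) {
                deps.add(use);
            }
        }
        return deps;
    }
}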
Figure 18. Trace of document use (based on open and close times) collected from process monitoring (blue) with
anomaly solving time (red) collected using file change monitoring.
8.1.2 PET on extracting results of scientific calculations
The following experiment illustrates how SEI extraction can be useful for examining scientific
calculations. This experiment uses an extraction module that extracts whole lines from files. It is
configured to monitor an output directory of the open source tool GNU Octave [84] and to extract
calculation results with the aid of a regular expression. The extraction module is originally intended
for the extraction of particular log messages from a log directory.
The scientist uses PET to track the development of an Octave script's results over time, in relation to the script lines that produce them. This makes it possible to understand changes in the results in relation to changes in the script's formulas. First, the user configures the module by specifying the output directory and the regular expression that matches the result line; in this example it is simply “B”, the name of the result variable. Then PET's sheer curation mode is started to monitor the directory, which triggers an initial extraction. At the time of this first extraction the script had not yet been executed. We used the script in Figure 19 for this experiment.
Figure 19. Octave script used for the example.
Then the scientist starts the normal Octave workflow and executes the script. PET detects the file changes in the configured output directory and triggers a new extraction by the selected module. The screenshot in Figure 20 displays the results of the first and second extraction.
Figure 20. Screenshot of the PET showing a calculation result extraction.
The first extraction shows the locations of the result variable B in the not yet executed script, which also lies in the observed output directory. The second extraction shows the line of the output file containing the result variable B, together with its line number: this is the extracted result of the scientific calculation.
Since the location of the result variable in the original script was also extracted, the dependencies between results and script locations become easier to understand. Continuous extraction over long periods of time makes it possible to observe changes in results in relation to changes in the script's formulas. PET does need a highly customised configuration for this example, but such configurations are what enable it to adapt to specialised scenarios.
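The following minimal Java sketch illustrates this kind of regex-based line extraction; PET's actual module interface may differ, and the class shown here is illustrative only:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch of a line-extraction step: scan a monitored file for
 *  lines matching a configured regular expression (e.g. the result variable "B")
 *  and record each match with its line number. */
public class LineExtractor {

    private final Pattern pattern;

    public LineExtractor(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    public List<String> extract(Path file) throws IOException {
        List<String> matches = new ArrayList<String>();
        List<String> lines = Files.readAllLines(file, Charset.forName("UTF-8"));
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (pattern.matcher(line).find()) {
                // Keep the line number so results can be related to script locations.
                matches.add((i + 1) + ": " + line);
            }
        }
        return matches;
    }
}

For instance, new LineExtractor("\\bB\\b").extract(outputFile) would return every line mentioning the result variable B, together with its line number.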
8.1.3 SOM and anomaly detection on sensor data
We received approximately 20 GB of incoming raw instrument data from the mission control of the International Space Station, provided by SpaceApps and BUSOC [45]. A total of 383 features were measured.
Each row is a time-stamped measurement of these features. The time resolution is approximately
one second. There are about 16 million rows in total.
The feature space is heterogeneous. Apart from integer and real-valued features, there are Boolean
entries (on/off, used/not used, and other similar pairs), and also categorical variables. There are
many missing entries, and the rows follow a predictable pattern of which features will be missing.
The pattern alternates between two basic types of rows; let us denote them by A and B. Type A rows
have values for the first 329 features, and type B rows have values for the remaining features. Type A
rows occasionally miss all features except the first few. Extensive pre-processing of the feature space
is inevitable.
Labels identifying anomalies are not directly available. We were given one-minute resolution indications of the times when the anomalies occurred, whereas measurements are taken every second. The labels were provided by the Belgian User Support and Operations Centre. The following line is an example indicating an anomaly:
an example indicating an anomaly:
041/07:20 AIB failure without reboot anomaly AIB failure without reboot
It is unclear at which second the above anomaly occurred, or whether it occurred exactly during this minute, or shortly before or after it. Furthermore, this approximate labelling is not directly usable for automatic classification. Anomalies are detected much more efficiently by unsupervised methods, as outlined in the next section.
8.1.3.1 PRE-PROCESSING AND VECTORIZATION
The majority of learning algorithms expect numerical data; hence we must map Boolean and
categorical variables to numerical values. Boolean variables are mapped to 0 and 1, whereas
categorical variables are mapped to consecutive positive integer numbers.
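A minimal Java sketch of this encoding step (class and method names are illustrative, not taken from the actual pipeline):

import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch of the vectorization step: Booleans become 0/1 and
 *  categorical values are mapped to consecutive positive integers. */
public class FeatureEncoder {

    private final Map<String, Integer> categories = new LinkedHashMap<String, Integer>();

    /** "on"/"off" style pairs map to 1.0/0.0. */
    public double encodeBoolean(boolean value) {
        return value ? 1.0 : 0.0;
    }

    /** Each distinct category label gets the next positive integer (1, 2, 3, ...). */
    public double encodeCategory(String label) {
        Integer code = categories.get(label);
        if (code == null) {
            code = categories.size() + 1;
            categories.put(label, code);
        }
        return code;
    }
}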
We split the data set in two: rows belonging to the A pattern go into the first subset, and rows
belonging to the B pattern go into the second one. This separation solves the problem of missing
values. Separate learning pipelines will deal with the two subsets, leading to an ensemble of models.
We believe this approach will be simpler and more efficient than interpolating missing elements in
order to have uniformly dense data instances.
[45] <http://www.busoc.be/>
With this split, the time difference between consecutive entries in the A subset is one second, and between consecutive entries in the B subset two seconds.
We processed the data of a single day, 2012-02-10. This day has two labelled anomalies: at 7.20AM
and at 10.06AM, with a total of 127,098 entries. We restricted our attention to the Type A subset.
Type A entries with missing values were discarded. The remaining subset had 86,321 entries.
Figure 21. An unsupervised pipeline for labelling outliers using χ² feature filtering, X-means clustering and the local density cluster-based outlier factor algorithm.
We trained an unsupervised ensemble consisting of an X-means clustering and a local density cluster-based outlier factor algorithm. We applied a threshold to the output of the latter to label outliers. This gave 100% accuracy on the training data. We used these labels for plotting a self-organizing map (see next section). To reduce the number of dimensions, we used a χ² feature filter against the automatically extracted labels. The eventual dimension of the feature space was thirty. The selected features were normalized to the range between 0 and 1, to match the range of the random values with which the self-organizing map is initialized. This part of the pre-processing pipeline is shown in Figure 21.
8.1.3.2 SELF-ORGANIZING MAPS (SOM)
Self-organizing maps are a topology-preserving embedding of high-dimensional data onto two-dimensional surfaces, such as a plane or a torus. Artificial neurons are arranged in a grid; each neuron is assigned a weight vector with the same number of dimensions as the data to be embedded. An iterative procedure adjusts the weight vectors in each step: first the neuron closest to a data point is sought, then its weight vector and its neighbours' weight vectors are pulled closer to the data point. After a few iterations, called epochs, the topology of the dataset emerges in the grid in an unsupervised fashion.
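In the standard textbook formulation (not spelled out in this deliverable), if b denotes the best matching neuron for a data point x, each weight vector w_j is updated at step t as

    w_j(t+1) = w_j(t) + \alpha(t)\, h_{bj}(t)\, \bigl(x - w_j(t)\bigr)

where \alpha(t) is the learning rate and h_{bj}(t) the neighbourhood function centred on b (e.g., Gaussian), both decreasing over the epochs; these correspond to the learning rate and influence radius reported in section 8.1.3.3.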
If the number of nodes k in the grid is much smaller than the number of data points n, SOM reduces to a clustering algorithm: multiple data points will share a best matching neuron in the grid. Emergent self-organizing maps (ESOMs) are a variant of SOM in which k > n. In this arrangement, a data point will not only have a unique best matching neuron, but the neuron's neighbourhood will also 'belong' to that data point. Clustering structure will still show: some areas of the map will be denser than others.
The extra information comes from the neighbourhoods of best matching neurons. These interim
neurons provide a smooth chain of transitions between neurons that are assigned a data point. It is
almost as if they interpolated the space between data points. If a data point sits in isolation far from
other ones, this shows in the map because its neighbourhood will be sparse. These gaps gain
meaning depending on the nature of the data.
Self-organizing maps are notoriously slow to train, and ESOMs even more so. Using the data without labels, we trained an emergent self-organizing map with Somoclu, an open source cluster-oriented high-performance implementation of self-organizing maps that accelerates computations on multicore CPUs, GPUs, and even on multiple nodes [82] [46].
8.1.3.3 VISUALIZING ANOMALIES
We used an initial learning rate of 0.1, linearly decreasing to 0.01. The initial influence radius was 30,
which reduced linearly to 1. The map had 100 x 60 neurons on a toroid grid.
Figure 22. The U-matrix of a toroid emergent self-organizing map after ten epochs of training on the feature space of stream data. The individual numbers are neurons whose weight vector matches a data instance; multiple instances may map to the same neuron. The numbers indicate class membership: 0 refers to normal instances, and 1 to anomalous ones. The other neurons reflect the distances in the original high-dimensional space.
We plotted the U-matrix after ten training epochs using Databionic's ESOM Tools [83], a third-party package. We used the class information extracted by X-means clustering and the local density cluster-based outlier factor algorithm to label the best matching nodes in the grid (Figure 22). The lighter colours in the map indicate greater distances and separation. The anomalous instances form a clear cluster in the middle of the figure, further splitting into four subgroups. Interestingly, the regular instances also show clear groups of similar elements; some even seem to circle the anomalous group.
As the processing results from the test sample show, scalable automatic anomaly detection is
feasible and a promising candidate for the real-time filtering of streamed data from the ISS.
[46] Before PERICLES, early work on Somoclu was supported by an Amazon Web Services Education Machine Learning Grant.
8.2 Software Based Art: system information and dependencies
This experiment concerns collecting dependencies from an SBA, as described in paragraph 4.1. The PET tool is executed on the SBA system and extracts information useful for the understanding and future use of the SBA. For the SBA scenario, PET is especially useful for capturing a snapshot of the system specification and graphical configuration. Furthermore, system resource usage can be monitored and extracted in sheer curation mode, to be used for comparing behaviour patterns between different SBA installations. However, PET has its limits in this scenario, as it does not provide mechanisms for analysing such patterns. Likewise, information that lies outside the computer system environment of the SBA is not reachable by PET, such as installation details that are captured manually (for example, the model of the display and how it is installed).
8.2.1 System information snapshot
With a snapshot extraction, as described in section 4.5.1, PET can extract various pieces of information pertinent to the scenario of emulating an SBA environment. This mainly covers information that does not change continuously, such as the system's hardware specification or the installed graphics drivers.
Table 3. Extract of results for the SBA example showing SEI captured in a snapshot extraction.

Extraction Module: CPU specification snapshot
    model               Intel Core(TM) i5-3470 CPU @ 3.20GHz
    totalCores          4

Extraction Module: Graphic System properties snapshot
    font_family_names   "Bitstream", "Charter", "Cantarell", ...
    displayInformation  isDefaultDisplay=true, refreshRate=60 ...

Extraction Module: Operating System properties
    user_language       en
    os_name             Linux

Extraction Module: Java installation information
    java_home           /opt/java/jre
    java_vendor         Oracle Corporation
    java_version        1.7.0_15
Table 3 shows an extract of the result of a snapshot extraction executed by PET. The significant information includes the system hardware specification (CPU), system graphic settings (e.g. installed fonts and display information), and information about the operating system and the development toolkit used to program the SBA's software.
In addition, information that changes constantly has to be captured continuously in PET's sheer curation mode. For the SBA scenario this is mainly the use of system resources. Resource usage values measured over a long period of time can be analysed to identify behaviour patterns, which can then be used to validate the correct behaviour of a new software installation. One example is CPU usage, measured by PET's CPU usage monitoring module, which traces its changes over time.
Another example of runtime information useful for assessing the dependencies of an SBA is file-system and network usage information (all the files and network connections used during the execution of the SBA), which PET can collect with a specific extraction profile.
The extraction results enable the configuration and emulation of a new environment for an SBA, as described in the example in section 4.6.1.
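A minimal sketch of such a snapshot extraction, using only standard Java APIs (PET's real modules also draw on libraries such as sigar for hardware details; the class below is illustrative):

import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative snapshot extraction using standard Java APIs only.
 *  It gathers the kinds of properties shown in Table 3. */
public class SnapshotExtractor {

    public static Map<String, String> extract() {
        Map<String, String> snapshot = new LinkedHashMap<String, String>();

        // Operating system and Java installation properties.
        snapshot.put("os_name", System.getProperty("os.name"));
        snapshot.put("user_language", System.getProperty("user.language"));
        snapshot.put("java_home", System.getProperty("java.home"));
        snapshot.put("java_vendor", System.getProperty("java.vendor"));
        snapshot.put("java_version", System.getProperty("java.version"));

        // Graphic system properties: the installed font families.
        String[] fonts = GraphicsEnvironment.getLocalGraphicsEnvironment()
                                            .getAvailableFontFamilyNames();
        snapshot.put("font_family_names", Arrays.toString(fonts));

        // Hardware: core count (the CPU model would come from a library such as sigar).
        snapshot.put("totalCores",
                     String.valueOf(Runtime.getRuntime().availableProcessors()));
        return snapshot;
    }
}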
8.2.2 Extracting font dependencies
The PDF format allows the fonts used in a document to be embedded, to guarantee faithful reproduction of the document even when the DO is moved to an environment that does not include them. However, it is still possible to create PDF documents without the necessary fonts (for example, by user choice, or because a font is blacklisted in the application creating the PDF). To recognise such external font dependencies, which are particularly relevant for a PDF file used in an SBA, the module analyses PDF files and extracts a list of fonts that are used but not embedded. This list establishes dependencies between the DO and the listed fonts, relevant for accurate rendering.
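A minimal sketch of such a module, assuming the Apache PDFBox 2.x library (the deliverable does not name the PDF library PET actually uses):

import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

/** Illustrative sketch: list fonts that a PDF uses but does not embed. */
public class FontDependencyExtractor {

    public static Set<String> nonEmbeddedFonts(File pdf) throws IOException {
        Set<String> missing = new TreeSet<String>();
        PDDocument document = PDDocument.load(pdf);
        try {
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                if (resources == null) {
                    continue;
                }
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // A used but non-embedded font is an external dependency.
                    if (font != null && !font.isEmbedded()) {
                        missing.add(font.getName());
                    }
                }
            }
        } finally {
            document.close();
        }
        return missing;
    }
}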
8.2.3 Experiments on SBA: Brutalism
The following experiment was run on a real software based artwork, Jose Carlos Martinat’s
“Brutalismo, Ambiente de Stereo Realidad,” from 2007.
In [18] the artwork is described as “a sculpture with a software element which searches the Internet for phrases incorporating the word ‘Brutalismo’. The sentences found are then printed out by industrial printers similar to cash machine printers. The printers then cut the till roll so that scraps of paper gather on the floor around the sculpture.”
We executed PET on a virtual machine running an image of the artwork, with the aim of gaining a deeper understanding of it. The experiment was carried out in two separate sessions, using remote desktop technology installed on the SBA virtual machine image.
First, PET was downloaded and extracted into the Download folder; all the data it creates and modifies is contained within the extracted folder. Since no appropriate version of the Java VM was installed on the system, a local version of Java was also extracted into the same folder. This has only a local, reversible effect, and keeps the original system installation of Java untouched. The situation highlighted that it should be made clear to users that the local installation of PET and Java does not interfere with any Java already installed on the system, and is reversible by simply removing the respective folders.
We ran a simple extraction of the environment information with PET, by executing a “snapshot” similar to the experiment in section 8.2.1. The extracted information includes a list of running processes with details of each, and other important environment information.
The running process list includes:
● The mysql daemon process
● The inicio process: "/bin/bash", "/home/arturo/inicio.sh", run as the root user
We also looked for software written in Java on the system (.class, .jar, .java files), as we understand that the original SBA used that language, but were not able to find any. We also looked for running processes that could be implementing the SBA; we think the only process important for this SBA is the MySQL process.
The “Brutalism” artwork contains printers. A couple of short scripts in the user folder appear to implement the actual printing (there are at least two versions of them); the running process is “inicio.sh”, as shown in the process list. The script is simple (a sketch of its logic follows the list):
1. Select a random row from a table of lines to print, stored in a running MySQL table. The lines are not removed from the table, so repetition is possible, but that also means there will always be new lines to extract.
2. The line is sent to the printer; there seems to be one version of the script using line printers, and another using USB printers.
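The same logic, reconstructed in Java/JDBC purely for illustration (the artwork itself uses shell scripts; the table and column names below are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Illustrative reconstruction of the printing loop's logic in Java/JDBC.
 *  Table and column names are hypothetical, inferred from observed behaviour. */
public class BrutalismoPrinterSketch {

    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/brutalismo", "user", "password");
        try {
            Statement stmt = db.createStatement();
            // Pick one random line; rows are never deleted, so repeats are possible.
            ResultSet rs = stmt.executeQuery(
                    "SELECT text FROM lines ORDER BY RAND() LIMIT 1");
            if (rs.next()) {
                String line = rs.getString("text");
                System.out.println(line); // stand-in for sending the line to a printer
            }
        } finally {
            db.close();
        }
    }
}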
It would be relatively simple, and probably sensible, to extract the contents of the MySQL database to a plain text file, to make it easier to update the system in the future. The software logic, as far as we could see in the VM, is quite simple, and so would not be problematic to reproduce with a future technology, should the need arise. The data accessed for printing appears to be entirely local; the script does not seem to make any actual connection to an external server. Of course, it would be important to validate this with the creators to make sure all details are covered.
Initial discussions with the artist established that it is important to display current data from live searches. Currently the database acts as a valuable record of the search results and should be exported after each display. The Java software will have to be updated periodically to cope with any external service changes.
We conclude that PET was helpful in investigating the artwork for a deeper understanding. Still, given the highly customised nature of SBAs, our experiment confirmed our suspicion that this type of use case is likely to require more manual intervention than more standardised cases, such as the one we examined in section 8.1. Executing monitoring daemons over a longer period of time would be an interesting follow-on experiment, offering the potential for a deeper investigation of the artwork's behaviour.
8.3 Self-organising maps on video data
Emergent self-organizing maps point towards contextual learning models, but they are still primarily content-based. As an initial effort in content-based media mining, we worked with a video provided by Tate and pre-processed by CERTH. The data consisted of 4,632 key-frames with extracted features: colour structure and edge histogram features were included, resulting in a 112-dimensional space. Most frames were labelled according to their content. The labels were from the following list:
● Artworks
● City
● Crowds – Audiences
● General gallery spaces
● Guitars – Instruments
● Landscape
● Night
● Tate Britain
● Tate Liverpool
● Tate Modern
● Text
● TV – Monitors
We trained a toroid emergent self-organizing map of 320x200 nodes (Figure 23) using Somoclu [82].
Figure 23. SOM map for the video data labels.
The initial learning rate was 1.0, which decreased linearly over ten epochs to 0.1. The initial radius for
the neighbourhood was a hundred neurons, and it also decreased linearly to one. The
neighbourhood function was a non-compact Gaussian.
The global structure of the map shows various clusters of labels (Figure 24) – different ones tend to
separate into local regions. These local regions are thus homogeneous. Ideally, all labels belonging to
one category would occupy one part of the map, but given the training data and the low number of
examples, the quality of the map is already high.
As the next step, we intend to extend the analysis to a contextual variant, across numerous videos to
see how usage patterns would influence the clustering structure.
Figure 24. Close-up of the SOM map with labels.
8.4 First internal evaluation (London, January 2014)
In January 2014, in the context of a project meeting, we ran a workshop and hands-on evaluation of the PET tool prototype with our project partners, to collect feedback, bug reports and feature requests, and to validate the approach taken. The workshop allowed us to gather many ideas and much feedback, covering new modules and module improvements, design and implementation improvements, ideas for new features, and software project management suggestions. Furthermore, ideas for real usage scenarios came up at the meeting, which were later tested live on the systems of our use case partners during the late development stage.
9 Conclusions and future work
9.1 Conclusions
We presented our work on determining what information it is significant to collect, from the widest set of Environment Information, in order to support the long-term use of Digital Objects. Pursuing this objective, we defined Significant Environment Information, a new concept that takes into account the purpose of use of a DO and its significance, while expressing a dependency relationship between the information entities involved. In doing so, we use the conceptual framework shared by the PERICLES RTD work packages, employing dependencies and proposing an early, preliminary extension of the LRM to support our use of the model. We also observe that SEI will naturally form a graph structure that could support an enhanced appraisal process; this will be explored in later tasks. The graph allows other significant information to be deduced for extraction, and relationships between different objects to be inferred. Our approach facilitates the gathering of information that is potentially not covered by established standards, and enables better long-term preservation, in particular for complex digital objects. We also presented ways to determine significance weights and their relation to the DO lifecycle, along with some examples of SEI from our stakeholders. We believe that collecting such information at creation and use time is highly relevant to preserving the long-term usability of the data.
We also explored the similarities between, and the potential application of, concepts from the Biological domain to the Digital Ecosystem and Environment metaphors, presenting approaches from the more mature Biological domain that could be applied to Digital Ecosystems and Environments, as well as mathematical approaches for the analysis of SEI.
Finally, we presented the tool we are developing to collect such information, the PET tool, together with its methods of extraction, and showed experimental results supporting the importance of such information. We believe the importance of the contribution also lies in the way the information is collected: domain-agnostic, aimed at collection in the context of spontaneous workflows, with minimal input from the user and very limited assumptions about the system structure. We presented experimental results supporting the approach, based on the PERICLES use cases. The results from the PET tool will be further analysed in later tasks of the work package.
9.2 Future work
We plan to continue our work on exploring new methods of automated information collection, and on improving the filtering and inference of dependencies, in the remaining tasks of PERICLES. Below is a list of future work, covering both PERICLES tasks and more general ideas that could be explored but are not currently part of specific tasks.
9.2.1 Ideas for future PERICLES tasks
9.2.1.1 ENCAPSULATION OF SEI TOGETHER WITH THE RELATED DO IN T4.2
Task 4.2 investigates information encapsulation techniques that will allow DOs to be encapsulated together with their extracted SEI. This approach prevents the risk of losing SEI in case the DO leaves its original environment for any reason, as illustrated in section 3.1.
An information encapsulation tool will be developed during Task 4.2 to work together with the PERICLES Extraction Tool, encapsulating the extracted information directly in a sheer curation scenario. Furthermore, the idea of focusing on the use purpose of a DO will be extended by selecting a suitable information encapsulation technique based on the usage scenario. The techniques include packaging techniques, as used for preservation purposes, as well as digital watermarking, steganography and other embedding techniques. To avoid redundancy in the embedding of the data, this task will likely also expand on the concepts of identifying recurring instances of embedded information, and of providing ways to represent and identify it with limited redundancy across data collections.
9.2.1.2 ECOSYSTEM DEPENDENCY CAPTURE - WP5
In the context of the WP5 modelling of the ecosystem, we are aware that a DO environment and an ecosystem share entities, as mentioned in section 5.1. As part of future work in WP5, it will be interesting to explore extending the PET tool to help discover, extract and model the ecosystem entities and their dependencies.
We can imagine (and will evaluate the feasibility of the approach in later WP5 work) extending PET with modules that help discover ecosystem entities and dependencies by exploring the context of use of the system, and thereby support the modelling of the ecosystem. Another interesting possibility is extending PET to monitor the ecosystem entities and dependencies for change; this would allow such changes to be reported so that the model can be updated and its consistency verified.
9.2.1.3 USING SEI FOR APPRAISAL - T5.5
We have already mentioned how the SEI, forming a weighted graph, can be used to enhance the appraisal process. We believe there are interesting possibilities to explore in this respect, such as using values as weights both for dependencies and for DOs, spreading the DO weights through the dependencies, and selecting the best set of objects based on both the dependency and object values. Another interesting aspect to explore is the use of dependency weights for non-binary appraisal, that is, using the weights to decide on a preservation strategy for DOs: for example, the most valuable DOs and SEI could be preserved on more redundant, more expensive storage, while lower weights would direct objects to less redundant storage. It is very likely that the work on this task will build on and extend the concepts and ideas we have started to explore in this deliverable.
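As a minimal sketch of what such weight spreading could look like (the propagation rule and all names here are hypothetical; defining the actual method is left to T5.5):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: spread DO values one step along weighted dependencies. */
public class WeightSpreading {

    /** A dependency edge "from depends on to" with a significance weight in [0, 1]. */
    public static class Dependency {
        final String from;
        final String to;
        final double weight;
        public Dependency(String from, String to, double weight) {
            this.from = from; this.to = to; this.weight = weight;
        }
    }

    /** Each object keeps its own value plus the maximum value flowing in
     *  from objects that depend on it, scaled by the dependency weight. */
    public static Map<String, Double> spread(Map<String, Double> objectValues,
                                             List<Dependency> dependencies) {
        Map<String, Double> result = new HashMap<String, Double>(objectValues);
        for (Dependency d : dependencies) {
            Double fromValue = objectValues.get(d.from);
            if (fromValue == null) {
                continue; // unknown source object: nothing to propagate
            }
            double incoming = fromValue * d.weight;
            double current = result.containsKey(d.to) ? result.get(d.to) : 0.0;
            result.put(d.to, Math.max(current, incoming));
        }
        return result;
    }
}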
9.2.1.4 USER COMMUNITY WATCH AND PROFILING - T5.4.3
Motivating example: consider one or more publications that are produced by a scientific community based on a set of experimental data (e.g. the SOLAR dataset). In practice, the dataset produced by the SOLAR [43] experiments will be cross-referenced with other datasets, with the aim of providing a definitive reference dataset for the solar spectrum. The dataset is likely to be downloaded and reused by many other scientists in different domains, such as:
● Solar scientists running similar experiments to measure the solar spectrum.
● Other solar scientists.
● Scientists in other domains such as climate and environment science, as well as in domains that cannot yet be anticipated.
One approach to user community monitoring is based on usage tracking and user profiling, as described in section 6.5.1. An alternative approach is to track the scientific outputs that are based on a given dataset or other output (this could be considered part of the environment of a given paper). There is currently a trend towards treating datasets in a similar way to publications, in terms of assigning DOIs that can then be cited. Thus, it would be possible to track community and semantic evolution both through citations of the datasets and through the associated scientific publications. Academic publication metadata can be harvested from large OA repositories to provide a basis for capturing and analysing the communities of use, including their evolution and the associated semantics.
There are two aspects to this:
1. Analysis of a corpus of historic publications to model the evolution of user communities and to
test hypotheses about preservation and changing semantics with a real set of data.
2. Based on the experience in 1, to develop a preservation watch and planning tool that can
monitor current content, detect changes and adapt preservation plans accordingly.
9.2.1.5 LRM EXTENSION TO REPRESENT SEI - T4.3
As illustrated in section 4.6, we have started preliminary work on extending the LRM to support SEI-related concepts. This is still early work, based on a pre-final version of the LRM, and will need to be refined in later tasks. The aim is to create a simple and compact reusable model for representing SEI, possibly encapsulating the latter together with the DO (Task 4.2), as also outlined in section 9.2.1.1. Potential directions to be investigated from then on include, but are not limited to, the following:
● As soon as weights have been assigned a value, a set of rules residing “on top” of the ontological LRM model could determine which dependencies are important (and which could, consequently, be removed or “downgraded”).
● This can then lead to reasoning, through which it could be automatically pre-calculated (inferred) whether newly-added dependencies are significant, without having to know their actual weight values; it is enough to have prior knowledge of what types of dependencies are considered significant for each distinct purpose.
9.2.2 Other Ideas not assigned to specific PERICLES tasks
Use of SEI for updating metadata standards
Currently, a community faced with archiving its data has to address the problem of knowing just what to collect. Usually, they look at previous examples or standards to help determine what metadata they need. Once they have a schema, it is evaluated and implemented, at which point people realise that some parts are missing, so they extend the schema to include them. It would be useful to shorten this cycle by providing communities with an aid or guide to help them define what to collect.
By exploring what constitutes the environment and what constitutes context, we can start to generate questions such as: “Have you thought about how the data was collected? Are there any pieces of information from that process that a potential user would need?” Of course, standards exist in quite a few communities, but as the use of the data evolves, the standards may no longer map completely to the uses.
Use environment capture analysis
This has been illustrated in sections 4.5.2, 7.7 and 8.1.1, where the scenario is introduced and initial methods to address it are described. In this example we propose user tracking and analysis of events as they occur. This might involve use of the PET tool in some cases, and the linking of these events to enable analysis of specific workflows, such as the debugging of anomalies. Such a tool could track the symptoms of the problem, the effectiveness of mitigating actions, and the pieces of relevant documentation. This usage information could then be recorded and archived with the main parts of the documentation.
More complex issues not addressed in the current tool, such as the ‘noise’ that event tracking can report, could be addressed in the future. This ‘noise’ can arise, for example, because users often multitask, so documentation may be used that is unrelated to the user's main task; documentation that is quickly opened and closed may also indicate, in some cases, that the document was not relevant. It would also be interesting to address more fine-grained tracking, for example by recording which pages of a document have been consulted.
10 Bibliography
1. PREMIS Editorial Committee. (2008). PREMIS data dictionary for preservation metadata, version
2.0. Retrieved June, 13, 2008. Chicago.
2. Dappert, A., & Farquhar, A. (2009). Significance is in the eye of the stakeholder. Research and
Advanced Technology for Digital Libraries, 297-308. Springer Berlin Heidelberg.
3. List of Metadata Standards | Digital Curation Centre. Retrieved June 30, 2014, from http://www.dcc.ac.uk/resources/metadata-standards/list.
4. (2011). TIMBUS Project. Retrieved June 30, 2014, from http://timbusproject.net/.
5. (2012). TIMBUS Project Deliverable 4.5, “Business Process Contexts”, from
http://timbusproject.net/.
6. Press, N. I. S. O. (2004). National Information Standards Organization. Understanding Metadata.
7. CCSDS, J. (2012). Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0M-2, Magenta Book.
8. (2010). MPEG-21 | MPEG. Retrieved June 30, 2014, from http://mpeg.chiariglione.org/standards/mpeg-21.
9. Text Encoding Initiative: TEI. Retrieved June 30, 2014, from http://www.tei-c.org/.
10. Dublin Core Metadata Initiative. (2008). Dublin core metadata element set, version 1.1.
11. (2013). Disciplinary Metadata | Digital Curation Centre. Retrieved June 30, 2014, from http://www.dcc.ac.uk/resources/metadata-standards.
12. Lynch, C. (1999). Canonicalization: A fundamental tool to facilitate preservation and
management of digital information. D-Lib Magazine, 5(9).
13. Hedstrom, M., & Lee, C. A. (2002). Significant properties of digital objects: definitions,
applications, implications. Proceedings of the DLM-Forum (Vol. 200, pp. 218-27).
14. Knight, G. (2008). Deciding factors: Issues that influence decision-making on significant
properties. JISC. Retrieved January, 7, 2010.
15. Knight, G. 2010. Significant Properties Data Dictionary. InSPECT project report. Arts and
Humanities Data Service/The National Archives. At
http://www.significantproperties.org.uk/sigprop-dictionary.pdf
16. Lurk, T. (2008, January). Virtualisation as conservation measure. In Archiving Conference (Vol.
2008, No. 1, pp. 221-225). Society for Imaging Science and Technology.
17. Laurenson, P. (2014). Old media, new media? Significant difference and the conservation of software-based art. In Graham, B. (Ed.), New Collecting: Exhibiting and Audiences after New Media Art, Chapter 3. University of Sunderland, UK.
18. Falcão, P. (2010). Developing a Risk Assessment Tool for the conservation of software-based artworks. MA thesis, Hochschule der Künste Bern.
19. (2013). Context: definition of context in Oxford dictionary (British ...). Retrieved June 30, 2014, from http://www.oxforddictionaries.com/definition/english/context.
20. Chowdhury, G. (2010). From digital libraries to digital preservation research: the importance of
users and context. Journal of documentation, 66(2), 207-223.
21. Kari, J., & Savolainen, R. (2007). Relationships between information seeking and context: A
qualitative study of Internet searching and the goals of personal development. Library &
Information Science Research, 29(1), 47-69.
22. Smeaton, A. F. (2006, January 1). Content vs. context for multimedia semantics: The case of
sensecam image structuring. Semantic Multimedia (pp. 1-10). Springer Berlin Heidelberg.
23. Luo, J., Savakis, A. E., & Singhal, A. (2005). A Bayesian network-based framework for semantic
image understanding. Pattern Recognition, 38(6), 919-934.
24. Blaschke, T. (2003). Object-based contextual image classification built on image segmentation.
Advances in Techniques for Analysis of Remotely Sensed Data, 2003 IEEE Workshop on. IEEE.
25. Singhal, A., Luo, J., & Zhu, W. (2003). Probabilistic spatial context models for scene content
understanding. Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE
Computer Society Conference on. IEEE.
26. Naphide, H., & Huang, T. S. (2001). A probabilistic framework for semantic video indexing,
filtering, and retrieval. Multimedia, IEEE Transactions on, 3(1), 141-151.
27. Dappert, A., Peyrard, S., Chou, C. C., & Delve, J. (2013). Describing and Preserving Digital Object
Environments. New Review of Information Networking, 18(2), 106-173.
28. Corubolo, F., Eggers, A.G, Hasan A., Hedges M., Waddington, S., Ludwig, J. (2014). A pragmatic
approach to significant environment information collection to support object reuse. In IPRES
2014 proceedings.
29. (2011). SCIDIP-ES: SCIence Data Infrastructure for Preservation ... Retrieved June 30, 2014, from http://www.scidip-es.eu/.
30. Perspectives, K. (2010). Data dimensions: disciplinary differences in research data sharing, reuse
and long-term viability. SCARP Synthesis Study, Digital Curation Centre, UK, available at:
http://www.dcc.ac.uk/sites/default/files/documents/publications/SCARP-Synthesis.Pdf.
31. (2008). Zoological Case Studies in Digital Curation – DCC SCARP ... Retrieved June 30, 2014, from http://alimanfoo.wordpress.com/2007/06/27/zoological-case-studies-in-digital-curation-dcc-scarp-imagestore/.
32. Curry, E., Freitas, A., & O’Riáin, S. (2010). The role of community-driven data curation for
enterprises. Linking enterprise data, 25-47.
33. (2010). SCARP | Digital Curation Centre. Retrieved June 30, 2014, from http://www.dcc.ac.uk/projects/scarp.
34. Lyon, L., Rusbridge, C., Neilson C., Whyte, A. (2009) Disciplinary Approaches to Sharing,
Curation, Reuse and Preservation, Digital Curation Centre, UK, available at:
http://www.dcc.ac.uk/sites/default/files/documents/scarp/SCARP-FinalReport-Final-SENT.pdf
35. Whyte, A., Job, D., Giles, S., & Lawrie, S. (2008). Meeting curation challenges in a neuroimaging
group. International Journal of Digital Curation, 3(1), 171-181.
36. Hedges, M., & Blanke, T. (2013). Digital Libraries for Experimental Data: Capturing Process
through Sheer Curation. Research and Advanced Technology for Digital Libraries, 108-119.
Springer Berlin Heidelberg.
37. Razum, M., Einwächter, S., Fridman, R., Herrmann, M., Krüger, M., Pohl, N., et al. (2003).
Research Data Management in the Lab. Geochem. Geophys. Geosyst, 4(1), 1010.
38. (2012). eSciDoc - Research Data Management - BW-eLabs. Retrieved June 30, 2014, from http://www.bw-elabs.org/datenmanagement/1_escidoc/index.en.html.
39. Thibodeau, K. (2002). Overview of technological approaches to digital preservation and
challenges in coming years. The state of digital preservation: an international perspective, 4-31.
40. Woods, K., Lee, C. A., & Garfinkel, S. (2011). Extending digital repository architectures to support
disk image preservation and access. Proceedings of the 11th annual international ACM/IEEE
joint conference on Digital libraries. ACM.
41. Garfinkel, S. L., & Shelat, A. (2003). Remembrance of data passed: A study of disk sanitization
practices. IEEE Security & Privacy, 1(1), 17-27.
42. Higgins, S. (2008). The DCC curation lifecycle model. International Journal of Digital Curation,
3(1), 134-140.
43. http://www.esa.int/Our_Activities/Human_Spaceflight/Columbus/SOLAR
44. (2011). The Weighting Ontology 0.1 - Namespace Document. Retrieved June 30, 2014, from http://smiy.sourceforge.net/wo/spec/weightingontology.html.
45. Pocklington et al. (2014). A Biological Perspective on Digital Ecosystems and Digital
Preservation. Submitted to iPRES-14.
46. Hadzic, M., Chang, E., & Dillon, T. (2007). Methodology framework for the design of digital
ecosystems. Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on. IEEE.
47. McMeekin, D. A., Hadzic, M., & Chang, E. (2009). Sequence diagrams: an aid for digital
ecosystem developers. Digital Ecosystems and Technologies, 2009. DEST'09. 3rd IEEE
International Conference on. IEEE.
48. Kulovits, H., Kraxner, M., Plangg, M., Becker, C., & Bechhofer, S. Open Preservation Data: Controlled vocabularies and ontologies for preservation ecosystems. Proceedings of the 10th International Conference on Preservation of Digital Objects. Biblioteca Nacional de Portugal.
49. Doorn, P., and Roorda, D. (2010). The ecology of longevity – the relevance of evolutionary
theory for digital preservation, In Proceedings of DH-10
50. Dobreva, M., and Ruusalepp, R. (2012). Digital preservation: interoperability ad modum. In:
Chowdhury, G. G. (2002). Digital libraries and reference services: present and future. Journal of
documentation, 58(3), 258-283.
51. (2011). SCAPE - Scalable Preservation Environments. Retrieved June 30, 2014, from http://www.scape-project.eu/.
52. Fernando, C., Kampis, G., & Szathmáry, E. (2011). Evolvability of natural and artificial systems.
Procedia Computer Science, 7, 73-76.
53. Goldberg, D. E., & Holland, J. H. (1988). Genetic algorithms and machine learning. Machine
learning, 3(2), 95-99.
54. Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective
computational abilities. Proceedings of the national academy of sciences, 79(8), 2554-2558.
55. Dorigo, M., Caro, G., & Gambardella, L. (1999). Ant algorithms for discrete optimization.
Artificial life, 5(2), 137-172.
56. Kennedy, J. (1997). The particle swarm: social adaptation of knowledge. Evolutionary
Computation, 1997., IEEE International Conference on. IEEE.
57. Smith, R. E., Timmis, J., Stepney, S., & Neal, M. Conceptual frameworks for artificial immune
systems. International Journal of Unconventional Computing.
58. Jockers, M. L. (2013, March 15). Macroanalysis: Digital methods and literary history. University
of Illinois Press.
59. Darányi, S., Wittek, P., & Forró, L. (2012). Toward sequencing “narrative DNA”: Tale types, motif strings and memetic pathways. Proceedings of CMN-12, Istanbul, May 26-27, 2012, 2-10. At: http://narrative.csail.mit.edu/ws12/proceedings.pdf.
60. Ofek, N., Darányi, S., & Rokach, L. (2013). Linking Motif Sequences with Tale Types by Machine
Learning. OASIcs-OpenAccess Series in Informatics. Schloss Dagstuhl-Leibniz-Zentrum fuer
Informatik.
61. Nachira, F., Dini, P., & Nicolai, A. (2007). A network of digital business ecosystems for Europe:
roots, processes and perspectives. European Commission, Bruxelles, Introductory Paper.
62. John, J. L. (2012). Digital Forensics and Preservation. Digital Preservation Coalition.
63. Church, G. M., Gao, Y., & Kosuri, S. (2012). Next-generation digital information storage in DNA.
Science, 337(6102), 1628-1628.
64. Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E. M., Sipos, B., et al. (2013). Towards
practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature.
65. Causal Processes (Stanford Encyclopedia of Philosophy). Retrieved June 30, 2014, from http://plato.stanford.edu/entries/causation-process/.
66. Evolutionary Genetics (Stanford Encyclopedia of Philosophy). Retrieved June 30, 2014, from http://plato.stanford.edu/entries/evolutionary-genetics/.
67. Hartigan, J.A., 1975. Clustering Algorithms. Wiley, New York.
68. Darányi, S., & Wittek, P. (2013). Demonstrating conceptual dynamics in an evolving text
collection. Journal of the American Society for Information Science and Technology, 64(12),
2564-2572.
69. Chandola, V., Banerjee, A. & Kumar, V. 2009. Anomaly Detection: A Survey. ACM Computing
Surveys (CSUR). Volume 41 Issue 3.
70. Hofmann, M. and Klinkenberg, R. (2013). RapidMiner: Data Mining Use Cases and Business
Analytics Applications. Chapman and Hall.
71. Ma, J. and Perkins, S. (2003). Time-series novelty detection using one-class support vector
machines. In Proceedings of IJCNN-03, International Joint Conference on Neural Networks,
volume 3, pp. 1741-1745.
72. (2010). Data-provenance - Google Code. Retrieved June 30, 2014, from http://code.google.com/p/data-provenance/.
73. Gehani, A., & Tariq, D. (2012). SPADE: Support for provenance auditing in distributed
environments. Proceedings of the 13th International Middleware Conference. Springer-Verlag
New York, Inc.
74. Strodl, S., Mayer, R., Rauber, A., & Draws, A. (2013). Digital Preservation of a Process and its Application to e-Science Experiments. Proceedings of the 10th International Conference on Preservation of Digital Objects (IPRES 2013). Springer.
75. (2008). Taverna - open source and domain independent Workflow ... Retrieved June 30, 2014, from http://www.taverna.org.uk/.
76. (2009). Elasticsearch.org Open Source Distributed Real Time ... Retrieved June 30, 2014, from http://www.elasticsearch.org/.
77. (2006). MapDB. Retrieved June 30, 2014, from http://www.mapdb.org/.
78. (2012). Home - Sigar - Hyperic Support. Retrieved June 30, 2014, from https://support.hyperic.com/display/SIGAR/Home.
79. (2012). Dependency Discovery Tool - SourceForge. Retrieved June 30, 2014, from http://sourceforge.net/projects/officeddt/.
80. (2014). prasmussen/chrome-cli · GitHub. Retrieved June 30, 2014, from https://github.com/prasmussen/chrome-cli.
81. (2002). FreeMind - SourceForge. Retrieved June 30, 2014, from http://freemind.sourceforge.net/.
82. Wittek, P. (2013). Somoclu: An Efficient Distributed Library for Self-Organizing Maps. arXiv:1305.1422. http://arxiv.org/abs/1305.1422.
83. Ultsch, A., & Mörchen, F. (2005). ESOM-Maps: tools for clustering, visualization, and
classification with Emergent SOM.
84. (2005). GNU Octave. Retrieved June 30, 2014, from http://www.gnu.org/software/octave/.
85. McKemmish, S. (1997). Yesterday, today and tomorrow: a continuum of responsibility. In Proceedings of the Records Management Association of Australia 14th National Convention, pp. 15-17.
Appendix A: List of requirements
Requirements analysis from D2.3.1
This is the list of requirement identifiers, described in more detail in D2.3.1, that are addressed by the PET tool. The bold typeface identifiers are the most relevant; those in ‘[]’ are partially addressed, and those in ‘()’ could potentially be addressed by PET but currently are not.
Born-digital Archives (BDA)
UR-AM-BDA-02, UR-AM-BDA-03, [UR-AM-BDA-04], [UR-AM-BDA-05], UR-AM-BDA-10, UR-AM-BDA-11, (UR-AM-BDA-25), UR-AM-BDA-26, UR-AM-BDA-27, UR-AM-BDA-28, UR-AM-BDA-29, UR-AM-BDA-30, [UR-AM-BDA-31], (UR-AM-BDA-32), UR-AM-BDA-37.
In this use scenario, PET can help with issues related to the extraction of metadata and environment information from Born-Digital Archives, supporting the initial appraisal and documentation of the archive contents, including information that could be lost if not captured in the native context (such as system and file-system specific information).
Software Based Art (SBA)
UR-AM-SBA-01, UR-AM-SBA-14, UR-AM-SBA-15, UR-AM-SBA-21
PET can help analyse the technical environment of an SBA, and can also monitor and help analyse the runtime requirements of SBAs, as described in the experiments in chapter 8 of this deliverable.
Media Production (MPR)
UR-AM-MPR-09, [UR-AM-MPR-12], [UR-AM-MPR-26], [UR-AM-MPR-41], UR-AM-MPR-37, UR-AM-MPR-39, UR-AM-MPR-40
In the case of MPR, the PET tool can again help with metadata extraction; the subsequent, linked task on metadata embedding will then allow the relevant metadata to be embedded inside the data.
Digital Video Art (DVA)
[UR-AM-DVA-04], UR-AM-DVA-05, [UR-AM-DVA-07]
Science Requirements
[UR-SC-POL-01], [UR-SC-POL-02], [UR-SC-POL-19]
UR-SC-DAT-03, [UR-SC-DAT-06], [UR-SC-DAT-12], UR-SC-DAT-17, [UR-SC-DAT-20], (UR-SC-DAT-26), UR-SC-DAT-28, UR-SC-DAT-29, UR-SC-DAT-39, UR-SC-DAT-41, UR-SC-DAT-44
UR-SC-PRO-13, (UR-SC-PRO-14), UR-SC-PRO-16, [UR-SC-PRO-18], [UR-SC-PRO-21], UR-SC-PRO-24, (UR-SC-PRO-26), UR-SC-PRO-27, UR-SC-PRO-30, UR-SC-PRO-44
(UR-SC-TLS-03), [UR-SC-TLS-06]
Cross-Domain Requirements
UR-CO-DAT-01, UR-CO-DAT-04, [UR-CO-DAT-05]
(UR-CO-PRO-6), UR-CO-PRO-13, UR-CO-PRO-15
[UR-CO-TLS-03]
Requirements from DoW, rapid case studies and other feedback
General requirements from the DOW:
1. The aim of the tool is the extraction of significant environment information.
2. The software should implement techniques related to digital forensics, sheer curation, and
weighted graphs to help prioritize the configuration of the extraction process.
3. The software should be based on, as well as support, the related research of Task 4.1.
4. The use cases of our partners should be considered.
From the first requirement we derived the following specification points:
● [Obligatory] Extraction of environment information that sits outside of the object.
● [Optional] Support extraction of metadata from the digital object itself.
For the second requirement we decided to support a sheer curation scenario and to design the architecture in a way that digital forensic techniques and weighted graphs can be included in the framework. The following specification points were derived:
● Sheer curation: Allow continuous extraction. This requires detecting when to extract, and therewith environment monitoring.
● Sheer curation: Do not disturb the user. The tool should run in the background with minimal notification and configuration requirements.
● Sheer curation: Take into account resource usage in environment extraction in order to limit the impact on system performance.
● Sheer curation: To avoid privacy issues, the user should have full control over the extraction process and the extracted information.
● Digital forensics: Allow the integration of post-hoc extraction techniques.
● Weighted graphs: Allow high configurability of the extraction process.
● Weighted graphs: Provide an interface for the configuration.
The following requirements were derived from an analysis of the rapid case studies of our use case partners, and from user scenarios provided by the partners:
● The need to capture information from different environments. This implies that the extraction techniques have to be implemented in a way that can be used on different operating systems.
● Remote extraction is not feasible for security reasons.
● Consider that the digital object can be altered in another environment and re-enter the original environment (e-mail out/in).
● Automation of the extraction process.
Furthermore, the scenarios provided us with ideas of which information is significant to extract from the environment, based on the research about SEI. The following list is a subset of the potentially significant information:
● Software dependencies.
● Information from log files.
● Information from configuration files and XML files.
● Original file environment information (position in folder structure; author; modification and other dates; permissions).
● Algorithms (applications, workflows) used for processing data.
● Detailed configuration parameters of the software used to process data.
● Information stored in specialist software or databases.
● Information from web services.
● User information from user directory services such as LDAP.
● Usage information.
● Policy information.
● Relationships between objects.
● Execution environment features such as fonts, system properties and configuration, etc.
● Information about programming language installations.
Some scenarios require highly specialised information. The following requirements for the software follow from this observation:
● Possibility to choose and configure which techniques to use for a scenario.
● Possibility to create profiles of this configuration, to be able to support different scenarios at the same time.
● Modularity of the extraction techniques, and the possibility for the user to easily create and add customised modules to the tool.
Appendix B: PET tool architecture in detail
1. The user can pass basic configuration commands as arguments at tool start. To apply the same commands automatically at every start, they can be written into a configuration file. The tool provides a command line interface (CLI) and a graphical user interface (GUI) for interaction with the user; the use of the GUI is optional, so the tool can be executed without a graphical desktop. The user has full control over the extraction process and can decide which extraction modules to use, which files to consider and which (subsets of the) extracted information to keep.
2. The Extraction Controller Builder is responsible for building the Extraction Controller, the “heart” of the application, based on the given user commands. It follows the builder design pattern47 and is executed only once, at tool start, to build a single instance (a minimal sketch of this pattern is given after this list).
3. The Extraction Controller is the main controlling class of the application. It has access to all other controllers and is responsible for the application flow. At tool start it initializes all other controllers, and it shuts them down at the end of the tool's execution. All specialized controller components communicate with other controllers exclusively via the Extraction Controller, so this main controller is responsible for updating the states of the other components and for serving as an intermediary.
47 http://en.wikipedia.org/wiki/Builder_pattern
4. Profiles are created by the user to organize the extraction components, e.g. based on their intended use. They contain a list of configured Extraction Modules, and a list of Extraction Result Collections that collect the information extracted by the modules and keep references to the important files to which this information relates. Both components are described in the following two points.
5. Extraction Modules implement the techniques that determine how and when information is extracted. They provide different implementations of their algorithms so that they can be executed on different operating systems. There are three kinds of Extraction Modules: file-dependent, file-independent and daemons. File-dependent modules take as argument a path to a file and extract information that is valid only for this file, whereas file-independent modules extract environment information that is valid for all files within the environment. Daemon modules, on the other hand, do not extract information, but instead monitor the environment for the occurrence of designated events. It is also possible to develop customized modules for extracting specialized information or for monitoring specific events, and to easily plug them into the application. A class template supporting the developer is provided for this purpose (a minimal sketch of such a module is given after this list).
6. Extraction Result Collections are the data structures that keep the extracted information. Each collection belongs to one of two sub-classes: Environment or Part. An Environment collects all extracted file-independent information that belongs to a Profile, and each Profile has exactly one Environment. Parts keep the extracted information that is valid only for a specific file, together with a path to this file. They can be seen as the file part of a Digital Object; however, in order to increase flexibility, we intentionally did not implement a Digital Object as a data structure.
7. The Profile Controller manages all Profiles and Profile Templates, which can be used for the fast creation of preconfigured Profiles. Existing Profiles can be exported as Profile Templates in order to pass them on to other PET users.
8. The Module Controller searches (with the help of Java reflection48) for all available module classes and creates a list of generic extraction modules from which Extraction Module instances for Profiles are created (a sketch of the reflection mechanism is given after this list). After their creation, most Extraction Modules have to be configured before they can be executed.
9. The Extractor is responsible for executing the Extraction Modules and for saving the Extraction Results into the right Extraction Result Collections. It supports two extraction modes: (a) a snapshot extraction mode that executes each Extraction Module of each Profile to capture the current information state, and (b) a continuous extraction mode in which the Event Controller initiates a new extraction whenever an event is detected by the environment monitoring daemons (the File Monitor and the Daemon Modules).
10. The Event Controller receives all Events detected by the monitoring daemons and controls the event handling. It uses a queue to handle the events in the order in which they occur.
11. Monitoring daemons are the File Monitor (see 12) and the Daemon Modules (see 5).
12. The File Monitor is responsible for observing the files that have been added to Profiles and detecting changes to them. If a modification of one of these files is detected, a new extraction is initiated for all modules related to this file. In the case of a file deletion, all Profiles that include this file as a Part are informed and will remove the file from their lists. In contrast to the exchangeable Daemon Modules, the File Monitor is an inherent component of the application (a sketch of such monitoring is given after this list).
13. The Configuration Saver saves the state of the application to configuration files at the end of the tool's execution, and loads this state at the next start of the tool. The Profiles are saved with all their added files, and their modules are saved with their configurations. Furthermore, the current extraction mode and the general usage options are saved.
48 http://en.wikipedia.org/wiki/Reflection_(computer_programming)
14. The Storage Controller allows generic access to the exchangeable Storage. It provides
methods for saving and loading extracted information to and from the Storage.
15. The Storage saves and loads metadata through a modular storage layer. Three storage back-ends are currently implemented: the default, a simple flat filesystem storage with JSON mapping; one using Elasticsearch [75]; and a third using MapDB [76] (a sketch of such a storage interface is given after this list).
16. PET works together with an information encapsulation tool, also developed in the PERICLES project, in order to encapsulate the extracted information together with its related files in a sheer curation scenario.
17. The weighted graphs described in chapter 6.3 could be implemented to suggest information to be extracted, based on the use cases.
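
To illustrate point 2, the following is a minimal sketch of the builder pattern in Java. The class, option and constructor names are assumptions made for illustration only; PET's actual Extraction Controller Builder API is not reproduced in this document.

    // Minimal builder-pattern sketch; class names and options are
    // illustrative assumptions, not PET's actual API.
    public class ExtractionControllerBuilder {
        private boolean useGui = true;      // hypothetical option
        private String configFile = null;   // hypothetical option

        public ExtractionControllerBuilder gui(boolean useGui) {
            this.useGui = useGui;
            return this;                    // enables fluent chaining
        }

        public ExtractionControllerBuilder configFile(String path) {
            this.configFile = path;
            return this;
        }

        // Called once at tool start to create the single controller instance.
        public ExtractionController build() {
            return new ExtractionController(useGui, configFile);
        }
    }

A caller would then write, for example, new ExtractionControllerBuilder().gui(false).configFile("pet.conf").build().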
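To illustrate point 5, the sketch below shows what a customized file-dependent Extraction Module might look like. The base class ExtractionModule, the type ExtractionResult and the method getModuleName() are assumptions standing in for PET's class template, which this document does not reproduce; only the java.nio.file calls are standard Java.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.BasicFileAttributes;

    // Hypothetical file-dependent module extracting basic file attributes.
    // ExtractionModule and ExtractionResult stand in for PET's real types.
    public class FileAttributeModule extends ExtractionModule {

        @Override
        public ExtractionResult extract(Path file) {
            ExtractionResult result = new ExtractionResult(getModuleName());
            try {
                BasicFileAttributes attrs =
                        Files.readAttributes(file, BasicFileAttributes.class);
                result.add("creationTime", attrs.creationTime().toString());
                result.add("lastModified", attrs.lastModifiedTime().toString());
                result.add("sizeBytes", Long.toString(attrs.size()));
            } catch (IOException e) {
                result.addError("Could not read attributes: " + e.getMessage());
            }
            return result;
        }
    }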
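Point 8 relies on Java reflection to discover and instantiate module classes. The sketch below shows the core mechanism under the simplifying assumption that the fully qualified class names are already known; the actual Module Controller additionally has to scan the classpath for candidate classes.

    import java.util.ArrayList;
    import java.util.List;

    public class ModuleDiscovery {
        // Instantiates the classes named in the list via reflection,
        // skipping any class that cannot be loaded or constructed.
        public static List<Object> instantiate(List<String> classNames) {
            List<Object> modules = new ArrayList<Object>();
            for (String name : classNames) {
                try {
                    Class<?> cls = Class.forName(name);
                    modules.add(cls.getDeclaredConstructor().newInstance());
                } catch (ReflectiveOperationException e) {
                    System.err.println("Skipping " + name + ": " + e);
                }
            }
            return modules;
        }
    }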
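For point 12, the standard java.nio.file.WatchService API is one plausible basis for such file monitoring; the sketch below is illustrative and not necessarily the mechanism used inside PET. The println calls mark where PET would trigger a re-extraction or update its Profiles.

    import java.io.IOException;
    import java.nio.file.*;

    public class DirectoryWatcher {
        // Watches a directory and reports modified and deleted entries.
        public static void watch(Path dir)
                throws IOException, InterruptedException {
            WatchService watcher = FileSystems.getDefault().newWatchService();
            dir.register(watcher,
                    StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_DELETE);
            while (true) {
                WatchKey key = watcher.take();        // blocks until an event
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path changed = dir.resolve((Path) event.context());
                    if (event.kind() == StandardWatchEventKinds.ENTRY_MODIFY) {
                        System.out.println("Re-extract for: " + changed);
                    } else if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                        System.out.println("Remove from Profiles: " + changed);
                    }
                }
                if (!key.reset()) {
                    break;                            // directory inaccessible
                }
            }
        }
    }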
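For point 15, a pluggable storage layer can be expressed as a small interface with interchangeable back-ends. The interface and class names below are illustrative assumptions, not PET's actual storage API; a flat-file JSON back-end, like the default mentioned above, might look roughly like this.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustrative sketch; names are assumptions, not PET's real interface.
    interface Storage {
        void save(String key, String json) throws IOException;
        String load(String key) throws IOException;
    }

    // Hypothetical default back-end: one JSON document per key.
    class FlatFileStorage implements Storage {
        private final Path baseDir;

        FlatFileStorage(Path baseDir) throws IOException {
            this.baseDir = Files.createDirectories(baseDir);
        }

        @Override
        public void save(String key, String json) throws IOException {
            Files.write(baseDir.resolve(key + ".json"),
                        json.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public String load(String key) throws IOException {
            byte[] bytes = Files.readAllBytes(baseDir.resolve(key + ".json"));
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

An Elasticsearch or MapDB back-end would implement the same interface, which is what makes the Storage exchangeable behind the Storage Controller.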
Appendix C: Ideas for further PET developments
In this section we collect a list of issues and feature ideas for further PET development, guided by close discussion with stakeholders. Although it will likely be impossible to implement all of these in the course of PERICLES, we hope to come back to the list later in the project and to inspire contributions from the open source community. We therefore plan to maintain an updated list of ideas on the PET repository website49.
● Address the creation context of Digital Objects by adding newly created Digital Objects based on environment events detected by the Environment Monitoring Daemons. We already tested the concept with the Directory Monitoring Daemon by adding the newly created files of an observed directory after receiving a file creation event.
● An option to trigger the execution of defined Extraction Modules repeatedly at a configurable time interval, for example every 5 minutes, instead of only in response to an environment event (a sketch of this follows the list).
● Explore the possibility of supporting extraction using the PREMIS 3 vocabulary and LRM dependency definitions.
● Implementation of the investigated weighted graphs and their integration into PET. The FreeMind [81] open source tool (MIT license) could be used to visualize the weighted graph.
● Automated inference of dependencies from the monitored environment events.
● Extraction Module configuration templates, similar to the Profile Templates, could be developed to export and ship single configured modules.
● GUI refactoring: the help and event tabs are Profile-independent and should be shown outside the Profile area.
● Further development of the “General Native Command Module”, which allows the execution of customized terminal commands as an Extraction Module (a sketch of this also follows the list). Support for extracting parameters from the command output would be useful.
● The Information Tree is the main GUI display for extraction results. There are two other display methods available, both of which allow the filtering of information by the Extraction Module that was used to extract it. It would be useful to enable such filtering for the Information Tree as well. The same “Combo Box” for selecting the Extraction Module could be used for all three displays.
● Some intelligent redundancy management for the extraction results could be implemented.
● Some operating system files could be excluded from extraction and monitoring by default, for example using digital forensics databases such as the National Software Reference Library (NSRL) Hashsets and Diskprints50. This would be helpful for handling large numbers of files.
● Currently a “Part” is the data structure that represents an important file to be investigated during the extraction process. This concept could be extended to support directories as “Parts” as well, which would allow all files created in a directory in the future to be included in the investigation.
● The configuration of Extraction Modules could be supported by the CLI. At the moment the user has to modify the JSON configuration file directly to avoid the use of the GUI.
● At the moment there could be conflicts if two Extraction Modules get the same name. A UUID for Extraction Modules would be useful to avoid this.
● Find a good method to show the extracted information in the CLI. This problem is hard to solve because of the great amount of extracted information.
● The Extraction Modules could be extended by a variable that indicates their current state. A state could indicate problems such as “further configuration needed”.
● Test extensively on Windows. (The current version was developed and mainly tested on Linux and OS X, with limited testing on Windows.)
● The configuration of Extraction Modules is currently possible by using a GUI editor to manipulate a JSON file. An improvement would be the generation of a configuration GUI.
● Currently all existing Extraction Modules are loaded into the default profile at the first tool start. This is good for presentations, but less so for real usage. It would be better to load no modules and to display a message telling the user that Extraction Modules should be added to the profile.
● Develop further Extraction Modules:
○ Exif information extraction
○ Maven / Ant dependency extraction
○ Extraction of installed and used drivers
○ Extraction of information about installed programming languages (already existing for Java)
○ Extraction of provenance and comments from version control systems
○ IDE (integrated development environment) information extraction
○ IDE project information extraction
○ Dependency extraction from IDEs
○ Inclusion of the TIMBUS Extractors as modules, to enable the extraction of business activity context information

49 https://github.com/pericles-project/pet
50 http://www.nsrl.nist.gov/
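
As a sketch of the periodic trigger idea above: the standard ScheduledExecutorService can run an extraction task at a fixed interval. The Runnable passed in stands in for PET's snapshot extraction; the surrounding class is illustrative, not part of PET.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PeriodicTrigger {
        // Runs the given extraction task immediately and then at the given
        // interval; call shutdown() on the returned scheduler to stop it.
        public static ScheduledExecutorService start(Runnable extraction,
                                                     long intervalMinutes) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(extraction, 0,
                    intervalMinutes, TimeUnit.MINUTES);
            return scheduler;
        }
    }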
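Similarly, the core of a native command module can be built on Java's ProcessBuilder, which executes a customized terminal command and captures its output for later parsing. The class and method names below are illustrative and do not reproduce the actual “General Native Command Module”.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class NativeCommandRunner {
        // Runs a command, merging stderr into stdout, and returns the output.
        public static String run(String... command)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true);
            Process process = pb.start();
            StringBuilder output = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append(System.lineSeparator());
                }
            } finally {
                reader.close();
            }
            process.waitFor();
            return output.toString();
        }
    }

For example, run("uname", "-a") would capture basic system information on a Unix-like system.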