Preserving Information for all – new
challenges and new skills needed
Dr David Giaretta MBE, Giaretta Associates Ltd, www.giaretta.org
Abstract
A huge amount of information is being created and digitally encoded. There is a
demand that at least some of this information be re-used to create value, again and
again into the future, from repositories which can be trusted. Libraries may be able to
support these demands. However, fundamental skills must first be developed.
This paper discusses the fundamental concepts and challenges of OAIS and the
associated ISO 16363 for Trustworthy Digital Repositories.
1 Introduction
In this paper I would like to describe the skills which the library and other communities
must develop in order to meet the demands for looking after the tsunami of data that is
being created. As will be described in sections 4 and 5, the fundamentals of digital
preservation are well understood, but, in a real sense, this is just part of the picture.
2 Challenges
Rather than simply look at the obvious challenges of digital preservation I would like to
begin at a different point.
The Riding the Wave report1, for which I was rapporteur, provided a vision for 2030
and addressed the question, as part of the EU Digital Agenda, “How Europe can gain
from the rising tide of scientific data”. A similar question is surely of interest in all
countries, including Mexico.
As we worked on this, it became clear that the question should be extended to all
kinds of data. Moreover digital preservation is intimately bound up with this question, as
well as with the question “who pays and why?” for digital preservation. While data is
newly created and of obvious use there will be resources available, but as has been
pointed out2, the value of much data is potential – it may be useful in the future, but
this is not certain. We will return to this and its implications later.
1 Available at http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
A new profession of “data scientist” or “data librarian” is being discussed in this regard,
and it provides libraries and other memory institutions with an opportunity. However
it seems clear that, despite their initial ownership of the digital preservation domain, in
order to be able to meet the challenges a number of mantras need to be unlearned and a
number of new skills developed.
For example “emulate or migrate”, “characterisation”, “significant properties”,
“metadata” and even “format” flag a number of concepts in digital preservation which
are useful but only for a limited number of types of digital objects – specifically those
which are normally rendered i.e. displayed visually or audibly for human consumption;
the test of preservation for these is essentially that the digital object can be rendered
again in the future.
While very important these types of objects do not include the vast bulk of the
scientific, financial, engineering, social and business data with which we are deluged.
There are many challenges associated with this deluge. A fundamental question,
which, when looked at from the basis of OAIS, addresses both the challenge identified
in Riding the Wave and the broad challenge of digital preservation, is: how can
the value of digital objects be increased?
One of the ways to group the challenges, and one which links the discussion to the
topic of “big data”, is to look at the challenges of the “V”s.
3 The “V” challenges
Resources are needed to address the many V’s3 which are normally discussed in
terms of big data but are also relevant to small data since, as noted4, the real
revolution is the mass democratisation of the means of access, storage and
processing of data – small as well as big.
It is useful to divide these Vs into two groups. The first consists of Volume, Velocity,
Variety and Volatility, which are more related to data management – i.e. issues
which arise even if the data is not necessarily being preserved but is being used by
the researchers who created it and over just a few years.
2 See for example Sustainable Economics for a Digital Planet, available from http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf
3 http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
4 http://www.theguardian.com/news/datablog/2013/apr/25/forget-big-data-small-data-revolution
The other group consists of Veracity, Validity and Value, which this paper will focus
on for the following reasons.
Veracity, including Understandability and Authenticity, is vital for a researcher
using unfamiliar data from unfamiliar sources – otherwise how can that researcher use
the data and trust that it is what it is claimed to be? The challenge will be exacerbated
by the data management “Vs” noted previously, in particular scaling with Variety.
Validity (including correctness, data quality and legality) is normally of vital interest to
researchers if they wish to undertake scientifically useful work.
Value (or potential value) must be identified in order to justify keeping the data in the
long term – and even in the short term (related to Volatility) – because keeping data
requires resources. The minimum, relatively easily identified, costs are those related
to storage, which tend to scale with Volume and in large scale repositories are very
front-loaded5. Other costs, which are less obvious and more uncertain, are those
associated with maintaining Veracity and Validity.
It is worth mentioning another area which has caused and still causes difficulty,
namely terminology, even if one restricts the language to English. There are many
collections of terms (glossaries), created by, for example, libraries, organisations and
communities. The problem is that none of these show their relationships to any of the
others, even in the cases where they use the same word with a different meaning.
Thus when the groups talk together they talk at cross-purposes. There has been an
attempt6 to draw a number of these glossaries together using the Simple Knowledge
Organisation System (SKOS), which allows one to indicate whether a term
from one glossary is broader than, narrower than or related to a term in another glossary. It
remains to be seen whether this gains widespread use.
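As a rough illustration of the kind of cross-glossary linking described above, the following Python sketch records relationships between terms from two glossaries and picks out the cases where the same word is used with a different meaning. The relation names are borrowed from the SKOS mapping properties; the glossary names, the terms and the helper `cross_purpose_terms` are invented for this example and are not taken from the glossary project cited above.

```python
from dataclasses import dataclass

# Relation names borrowed from the SKOS mapping properties.
# "a broadMatch b" reads: b is a broader concept than a.
RELATIONS = {"exactMatch", "broadMatch", "narrowMatch", "relatedMatch"}


@dataclass(frozen=True)
class Term:
    glossary: str  # hypothetical glossary identifier
    label: str     # the word as it appears in that glossary


@dataclass(frozen=True)
class Mapping:
    source: Term
    relation: str
    target: Term

    def __post_init__(self):
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation}")


def cross_purpose_terms(mappings):
    """Mappings where two glossaries use the same word with different meanings."""
    return [
        m for m in mappings
        if m.source.label.lower() == m.target.label.lower()
        and m.relation != "exactMatch"
    ]


# Invented example entries, for illustration only.
mappings = [
    Mapping(Term("library", "archive"), "relatedMatch", Term("space-data", "archive")),
    Mapping(Term("library", "metadata"), "broadMatch", Term("space-data", "information")),
    Mapping(Term("library", "fixity"), "exactMatch", Term("space-data", "fixity")),
]

for m in cross_purpose_terms(mappings):
    print(f"'{m.source.label}' differs between {m.source.glossary} and {m.target.glossary}")
```

A real implementation would more likely store these links as SKOS statements in RDF; the plain data classes here are only meant to show the idea of making the relationships explicit.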
3.1 Variety: Types of digital objects
There are many ways to think about the variety of digital objects which researchers
and libraries may need to deal with. One can list things like PDFs, emails,
photographs, videos, audio, unstructured data such as text, structured data and of
course the many types of scientific data. How should these be dealt with? In particular
how should they be preserved?
5 Information gathered by CERN data management group
6 http://www.alliancepermanentaccess.org/index.php/consultancy/dpglossary/
To draw up a map of the landscape of digital objects, we suggested earlier that
whether or not an object is normally “rendered” is a useful dividing line,
because things which are not normally rendered present different
challenges from those which are.
Similarly it seems fairly obvious that software, for example the Word application,
presents different preservation challenges than does a Word document. One way to
make the distinction is between those digital objects, such as the Word application,
which are “active” i.e. they do things to other objects – and the “passive” ones like the
Word document.
Another distinction that seems reasonable to make is between objects that are
regarded as “static”, i.e. they are not normally expected to change, as opposed to
those which may be described as “dynamic”, such as a genome with associated
annotations.
Although many more divisions are possible we suggest just one more, namely between
“simple” objects, i.e. ones which are normally regarded as a single thing such as an
image or a piece of music, and those which may be referred to as
“complex” or perhaps “composite”, for example a ZIP file or a scientific dataset
containing raw data plus data quality flags.
These individual dimensions can be combined to construct a multi-dimensional
coordinate system. For example, a simple JPEG is static, simple, passive and
rendered, whereas a database with built-in procedures is dynamic, complex, active
and non-rendered. One reason that this may be (and is) useful is that, based on the
discussion on preservation techniques below, we can use it as a way to guide us
towards the preservation tool/technique to try first for a particular digital object. There
are many collections of tools but little guidance on which to use in which circumstance.
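The four dimensions can be written down as a small coordinate system. The Python sketch below is one possible encoding, not a published schema: the class and attribute names are assumptions made for this illustration. It places the two examples from the text and enumerates every combination.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Rendering(Enum):
    RENDERED = "rendered"
    NON_RENDERED = "non-rendered"


class Change(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"


class Structure(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"


class Activity(Enum):
    PASSIVE = "passive"
    ACTIVE = "active"


@dataclass(frozen=True)
class ObjectProfile:
    """Position of a digital object along the four dimensions."""
    rendering: Rendering
    change: Change
    structure: Structure
    activity: Activity

    def describe(self):
        parts = (self.change, self.structure, self.activity, self.rendering)
        return ", ".join(p.value for p in parts)


# The two examples given in the text.
jpeg = ObjectProfile(Rendering.RENDERED, Change.STATIC, Structure.SIMPLE, Activity.PASSIVE)
database = ObjectProfile(Rendering.NON_RENDERED, Change.DYNAMIC, Structure.COMPLEX, Activity.ACTIVE)

# Every combination of the four two-valued dimensions.
all_profiles = [ObjectProfile(*combo) for combo in product(Rendering, Change, Structure, Activity)]

print(jpeg.describe())
print(database.describe())
print(len(all_profiles))
```

With two values per dimension there are 2 x 2 x 2 x 2 = 16 possible profiles, which gives a manageable set of starting points when choosing a tool or technique.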
4 Fundamentals of digital preservation – OAIS
OAIS7 (ISO 14721:2012) provides key concepts, models and terminology for digital
preservation. These have been designed to be applicable to all types of repositories
and all types of digitally encoded information, and have been applied and tested across
a very wide variety of repositories.
7 Available free from http://public.ccsds.org/publications/archive/650x0m2.pdf
The Functional Model provides a way to explain some of the terminology, and many
repositories, and indeed system vendors, have mapped their functionality to it.
However it should be realised that simply being able to do this is no indication of the
quality of such repositories or systems, since it is possible to map a trivial setup
with very little preservation capability to the Functional Model.
Figure 1 OAIS Functional Model
The concepts and model key to preservation are supplied by the Information Model.
Indeed conformance to OAIS is defined within OAIS itself as use of the
Information Model and fulfilment of the OAIS Mandatory Responsibilities.
Figure 2 OAIS Information Model
It is worth mentioning, albeit briefly, the ideas behind Trustworthy Digital
Repositories (TDR) for which ISO 16363:2012 (footnote 8) provides metrics. The fundamental
concepts of OAIS are integrated into the metrics of ISO 16363. Supplementing these
are ideas about the adequacy of the financial, legal and staffing capabilities, and basic
security metrics. An important point to understand is that the repository does not have
to exist forever but, if it ceases to function, it will have had time to hand over the
digitally encoded information that it is preserving to the next in the chain of
preservation.
Within the international ISO process, audit and certification of repositories can be
performed by auditors accredited using ISO 16919:2014 (footnote 9).
8 Available free from http://public.ccsds.org/publications/archive/652x0m1.pdf. For more information
about ISO 16363 see http://www.iso16363.org
9 Available free from http://public.ccsds.org/publications/archive/652x1m2.pdf.
4.1 OAIS concepts
It is worth re-capping the fundamental definitions and concepts because these form
the bedrock for an understanding of broadly applicable digital preservation, and what
is required for trustworthiness in preservation terms.
The sequence is as follows:
OAIS takes a very general definition of its prime concern which, as the “I” in OAIS
suggests, is information:
Information: Any type of knowledge that can be exchanged. In an exchange, it is
represented by data. An example is a string of bits (the data) accompanied by a
description of how to interpret the string of bits as numbers representing temperature
observations measured in degrees Celsius.
Note that Knowledge is not defined in OAIS.
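The temperature example in the definition can be made concrete. In the Python sketch below, the byte layout (three big-endian 32-bit floats) and the `RepresentationInfo` class are assumptions made purely for illustration; OAIS does not prescribe any particular encoding. The point is only that the same bits become information once a description of how to interpret them is attached.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentationInfo:
    """Hypothetical, minimal description of how to interpret a bit string."""
    struct_format: str  # layout of the bytes, in Python struct notation
    quantity: str       # what the numbers mean
    unit: str           # unit of measurement


def interpret(data: bytes, rep_info: RepresentationInfo):
    """Turn raw bytes into labelled numbers using the accompanying description."""
    values = struct.unpack(rep_info.struct_format, data)
    return [(rep_info.quantity, v, rep_info.unit) for v in values]


# The data: three big-endian 32-bit floats. On their own they are just 12 bytes.
data = struct.pack(">3f", 21.5, 22.0, 19.25)

rep_info = RepresentationInfo(">3f", "temperature observation", "degrees Celsius")

for quantity, value, unit in interpret(data, rep_info):
    print(f"{quantity}: {value} {unit}")
```

Without `rep_info`, a future user holding only `data` could not tell whether the bytes encode temperatures, pixel values or text.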
The accompanying definition of data is equally broad:
Data: A reinterpretable representation of information in a formalized manner suitable
for communication, interpretation, or processing. Examples of data include a
sequence of bits, a table of numbers, the characters on a page, the recording of
sounds made by a person speaking, or a moon rock specimen.
And in the case of things digital:
Digital Object: An object composed of a set of bit sequences.
Note that this does not mean we are restricted to a single file. The definition includes
multiple, perhaps distributed, files, or indeed a set of network messages.
The restriction to “bits” i.e. consisting of “1” and “0”, means that if we move to trinary
(i.e. “0”, “1” and “2”) instead of binary then we would have to change this definition, but
it would not affect the concept – however it would change the tools we could use.
One might wonder why data includes physical objects such as a "moon rock
specimen". The answer should become clear later but in essence the answer is that to
provide a logically complete solution to digital preservation one needs, eventually, to
jump outside the digital, if only, for example, to read the label on the disk.
As to the question of length of time we need to be concerned about, OAIS provides
the following pair of definitions:
Long Term: A period of time long enough for there to be concern about the impacts of
changing technologies, including support for new media and data formats, and of a
changing Designated Community, on the information being held in an OAIS. This
period extends into the indefinite future.
Long Term Preservation: The act of maintaining information, Independently
Understandable by a Designated Community, and with evidence supporting its
Authenticity, over the Long Term.
In other words we are not only talking about decades into the future but, as is a
common experience, we need to be concerned with the rapid change of hardware and
software, the cycle time of which may be just a few years. Of course even if an archive
is not itself looking after the digital objects over the long term, even by that definition,
the intention may be for another archive to take over later. In this case the first archive
needs to capture all the “metadata” needed so that it can hand these on also.
Two key concepts are embedded in the above definition, namely:
Independently Understandable: A characteristic of information that is sufficiently
complete to allow it to be interpreted, understood and used by the Designated
Community without having to resort to special resources not widely available,
including named individuals.
Being able to “understand” a piece of information means being able to do
something useful with it; it would be impractical to require that one understands all of its
ramifications.
Now we approach one element of what the "preservation" part of "digital
preservation" means. To require that things are able to be "interpreted, understood
and used" is to make some very powerful demands. It not only includes playing a
digital recording so it can be heard, or rendering an image or a document so that it can
be seen; it also includes being able to understand what the columns in a
spreadsheet mean, or what the numbers in a piece of scientific
data mean; this is needed in order to actually understand and, in particular, use the
data, for example by using it in some analysis programme or combining it with other data
in order to derive new scientific insights. The "Independently" part is to exclude the
easy but unreliable option of being able to simply ask the person who created the
digital object; unreliable, not because the creator may be a liar, but rather because the
creator may be, and in the very long term certainly will be, deceased!
Finally, we have the other key concept of “Designated Community”.
Designated Community: An identified group of potential Consumers who should be
able to understand a particular set of information. The Designated Community may be
composed of multiple user communities. A Designated Community is defined by the
archive and this definition may change over time.
Why is this a key concept? To answer that question we need to ask another
fundamental question, namely "How can we tell whether a digital object has been
successfully preserved?" – a question which can be asked repeatedly as time passes.
Clearly we can do the simple things like checking whether the bit sequences are
unchanged over time, using one or more standard techniques such as digital digests
[XX]. However just having the bits is not enough. The demand for the ability for the
object to be "interpreted, understood and used" is broader than that - and of course it
can be tested.
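The bit-level check is the easy part. As a minimal sketch, assuming SHA-256 as the digest (the text does not name a particular algorithm), a repository could record a digest at deposit time and compare it later:

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest of a bit sequence, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def bits_unchanged(data: bytes, recorded_digest: str) -> bool:
    """True if the current bits still match the digest recorded earlier."""
    return digest(data) == recorded_digest


# At deposit time: record the digest alongside the object.
original = b"temperature,21.5,22.0,19.25\n"
recorded = digest(original)

# Later: re-read the stored bits and compare.
print(bits_unchanged(original, recorded))            # intact copy
print(bits_unchanged(original + b"\x00", recorded))  # copy with one extra byte
```

A matching digest says only that the bits are the same; it says nothing about whether anyone can still interpret, understand and use them.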
But surely there is another qualification, for is it sensible to demand that anyone can
"interpret, understand and use" the digital object - say a four year old child?
Clearly we need to be more specific. But how can such a group be specified, and
indeed who should choose? This seems a daunting task - who could possibly be in a
position to do that?
The answer that OAIS provides is a subtle one. The group of people who should be
able to "interpret, understand and use" the digital object and who we can use to test
the success or otherwise of the "preservation", is defined by the people who are doing
the preservation.
The advantage of this definition is that it leads to something that can be tested. So if
an archive claims "we are preserving this digital object for astronomers" we can then
call in an astronomer to test that claim.
The disadvantage is that the preserver could choose a definition which makes life
easy for him/her – what is to stop that? The answer is that there is nothing to prevent
that BUT who would rely on such an archive?
As long as the archive’s definition is made clear, then the person depositing the digital
objects can decide whether this is acceptable. The success or failure of the archive, in
terms of digital objects being deposited, will be determined by the market. Thus in
order to succeed the archive will have to define its Designated Community(ies)
appropriately. Different archives, holding the same digital object may define their
Designated Communities as being different. This will have implications for the amount
and type of “metadata” which is needed by each archive.
Making the link back to the bits, OAIS defines Representation Information: The
information that maps a Data Object into more meaningful concepts. It is important to
realise that Representation Information can be whatever is needed to understand that Data Object
– documents, dictionaries, data, software, pieces of paper with handwritten notes etc.
The other important point is that the Representation Information will be represented by
some data object – which itself may need its own Representation Information; this
means that we have a network of pieces of information. The breadth/depth of this
network is determined by the choice of Designated Community.
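The way the choice of Designated Community bounds the network can be sketched as a graph walk that stops at whatever the community already knows. The network contents, the community knowledge sets and the function name `rep_info_to_hold` below are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical Representation Information network: each item lists the
# further items needed to understand it.
REP_INFO_NETWORK = {
    "observations.fits": ["FITS standard", "instrument data dictionary"],
    "FITS standard": ["PDF specification", "English"],
    "instrument data dictionary": ["XML specification", "English"],
    "PDF specification": ["English"],
    "XML specification": ["English"],
    "English": [],
}


def rep_info_to_hold(obj, network, community_knowledge):
    """Collect the Representation Information the archive must hold for `obj`.

    The walk stops at anything the Designated Community already knows,
    so a more knowledgeable community yields a smaller result.
    """
    needed = []
    seen = set()
    stack = list(network.get(obj, []))
    while stack:
        item = stack.pop()
        if item in seen or item in community_knowledge:
            continue
        seen.add(item)
        needed.append(item)
        stack.extend(network.get(item, []))
    return sorted(needed)


# Two invented Designated Communities with different knowledge.
astronomers = {"FITS standard", "English"}
general_public = {"English"}

print(rep_info_to_hold("observations.fits", REP_INFO_NETWORK, astronomers))
print(rep_info_to_hold("observations.fits", REP_INFO_NETWORK, general_public))
```

The same object needs two items for the first community and four for the second, which is the sense in which two archives holding the same digital object may need different amounts of “metadata”.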
4.2 What “metadata”, how much “metadata”?
One fundamental question to ask is ‘What “metadata” do we need?’ The problem with
“metadata” is that it is so broad that people tend to have their own limited view. OAIS
provides a more detailed breakdown. The first three broad categories are to do with
(1) understandability, (2) origins, context and restrictions and (3) the way in which the
data and “metadata” are grouped together.
The reason for this separation is that given some digitally encoded information one
can reasonably ask whether it is usable, which is dealt with by (1). This is a separate
question to the one about where this digital object came from, dealt with by (2). Since
there are many ways of associating these things it seems reasonable to want to
consider (3) separately.
It could be argued that to understand a piece of data one needs to know its context.
However the discussion about “Independently Understandable” in the previous section
points out that OAIS does not require understanding of all the ramifications so this
separation of context from understandability is reasonable, although it does not mean
that all context is excluded from understandability since a piece of “metadata” may
have several roles.
Authenticity is a key concept in digital preservation, and some would argue that it is
the pre-eminent concept, in that unless one can show that the data object is, in some
provable sense, what was originally deposited, then one cannot prove that digital
preservation has been successful.
On the other hand OAIS defines preservation in terms of understandability and
usability as well as authenticity; it therefore provides a view in which Representation
Information and Authenticity are equal partners.
OAIS defines Authenticity as: “the degree to which a person (or system) may regard
an object as what it is purported to be. The degree of Authenticity is judged on the
basis of evidence”.
Provenance Information is the information that documents the history of the Content
Information. This information tells the origin or source of the Content Information, any
changes that may have taken place since it was originated, and who has had custody
of it since it was originated. The archive is responsible for creating and preserving
Provenance Information from the point of Ingest; however, earlier Provenance
Information should be provided by the Producer. Provenance Information adds to the
evidence to support Authenticity.
4.3 Archival Information Package
OAIS defines the Archival Information Package (AIP), which is conceptually vital for the
preservation of a digital object. According to OAIS the AIP is defined to provide a
concise way of referring to a set of information that has, in principle, all the qualities
needed for permanent, or indefinite, Long Term Preservation of a designated Information Object.
It is important to realise that the AIP is a logical construct i.e. it does not have to be a
single file.
The AIP is shown above. Note that this means that, unlike the general Information
Package, the AIP must have exactly one piece of Content Information and one piece
of PDI. Remember that a single Information Object (i.e. Content Information or PDI)
could consist of many separate digital objects. There are very many ways of
packaging information, both physically as well as logically.
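The "exactly one Content Information, exactly one PDI, but possibly many digital objects" constraint can be expressed as a small data structure. The Python sketch below is a simplified assumption for illustration and not the OAIS information model itself; the class names follow the OAIS terms, while the file locations are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DigitalObject:
    """One bit sequence, e.g. a file. The location is a hypothetical identifier."""
    location: str


@dataclass
class InformationObject:
    """A logical unit that may be spread over many digital objects."""
    digital_objects: List[DigitalObject] = field(default_factory=list)


@dataclass
class ArchivalInformationPackage:
    """Logical AIP: exactly one Content Information and exactly one PDI."""
    content_information: InformationObject
    pdi: InformationObject

    def __post_init__(self):
        # "Exactly one" of each: neither part may be missing.
        for name in ("content_information", "pdi"):
            if not isinstance(getattr(self, name), InformationObject):
                raise TypeError(f"{name} must be a single InformationObject")

    def all_digital_objects(self):
        return self.content_information.digital_objects + self.pdi.digital_objects


# One logical package whose parts live in different (invented) places.
content = InformationObject([
    DigitalObject("store-a/raw_data.dat"),
    DigitalObject("store-b/quality_flags.dat"),
    DigitalObject("registry/data_dictionary.xml"),
])
pdi = InformationObject([DigitalObject("store-a/provenance.log")])

aip = ArchivalInformationPackage(content, pdi)
print(len(aip.all_digital_objects()))
```

Nothing in the structure requires the digital objects to sit in one file or one store, which is what is meant by the AIP being a logical construct.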
5 Fundamental preservation techniques
OAIS requires that the information (represented as data) must be maintained as
Independently Understandable by a Designated Community, and with evidence
supporting its authenticity.
To be understandable requires that there is adequate Representation Information –
we might have adequate Representation Information for the Designated Community at
one time, but over time things such as hardware, software, environment or the tacit
knowledge of the Designated Community, change.
A Digital Object is made up of bit sequences. We can either keep these unchanged, in
which case we can check digital digests or hashes, which should be standard data
management practice (see the first group of V’s), or else we can decide to transform
the original object to another bit sequence, perhaps for reasons of convenience or
cost, for example if the software used as Representation Information is no longer
available.
Therefore we can see two fundamental digital preservation techniques:
1) Add Representation Information
2) Transform – OAIS uses this term for a more specific type of Migration.
We can add another, one which repositories, for obvious reasons, tend not to think
about, namely to
3) hand over to another repository in the case that the original repository can no
longer undertake the preservation activities, for example, because of lack of
resources. This will, in principle, be straightforward as long as Archival
Information Packages have been created, remembering of course that these
may be logical structures.
Comparing these to the various terms we noted at the start:
“emulate or migrate”: Emulation is essentially adding a type of Representation
Information, but there are far more types that could be added. Moreover emulation
allows one to do what had been done previously, whereas with data one will want to
do new things – combining with newly created data.
“characterisation”: Characterisation refers to “technical characteristics” with
associated “technical metadata” – this ignores, for example, the semantics associated
with the object. Moreover there is nothing useful in terms of understandability.
“significant properties”: A detailed analysis led to the definition of “Transformational
Information Properties” in the updated version of OAIS. The definition is broader than
“significant properties” and in particular is applicable to all types of data. Moreover the
analysis shows that these are important in terms of Authenticity where the digital
object is Transformed.
“metadata”: Metadata is too ill-defined; different people focus on different types. The
key question, which needs the finer granularity of terms which OAIS provides, is
which sort and how much?
“format”: While useful, the term “format” is often used to the exclusion of other types
of Representation Information. In OAIS terms format is a type of Structure
Information; the other important types are Semantic Information and “Other
Information” which includes, for example, software.
These are a few examples of terms and concepts which many in the memory
institutions need to re-learn in order to be able to deal effectively with the many types
of digital objects which must be preserved, and have value added.
6 Who pays and why?
As noted at the start of this paper, besides focussing on the techniques of
preservation we must also address how digital preservation is paid for; this is
connected with how to obtain value from what is preserved in order to justify its
preservation, and indeed to help decide whether to continue to preserve.
There are many types of value – monetary, risk reduction, avoidance of litigation,
prestige, safety, for future generations, for oneself. How can value be increased?
OAIS provides an answer using the techniques required for digital preservation.
The line of thinking is as follows: the repository must define the Designated
Community and then provide adequate Representation Information to ensure that
the digital object is independently understandable. Being able to understand
and use the digital object is likely to make it valuable to the Designated Community.
The same method used to define and then make available that Representation
Information can be extended to a wider community – wider than the Designated
Community. The repository need not commit to maintaining the Representation
Information for this broader community into the future; instead it may be viewed as
a trial of adding value – if this does not work then other trials may be attempted. The
basic idea is that Preservation is tested on Usability, and this can be enhanced as
described above; Usability gives Value; Value forms the basis of Business Cases;
Business Cases are implemented by Business Models, which produce resources which
can fund preservation and provide wider benefits to society. More details are
available10, together with an integrated view of a vast amount of digital preservation
research results.
7 New Skills Needed
It seems reasonable to expect that those responsible for preservation of, and access to,
the intellectual capital of a person, organisation, nation or humanity as a whole will be
presented with many continuing challenges and, perhaps more importantly, many new
challenges. These will range from new, more complex, types of digital objects to
demands to justify the resources needed for preservation and more broadly to add
value to what is being preserved.
My personal view is that, currently, those who come out of education systems
around the world are not provided with the intellectual tools to enable them to rise to
these challenges.
10 A more detailed exploration of this approach is provided at
http://www.alliancepermanentaccess.org/index.php/community/common-vision/
7.1 Limitation
There seems to be a narrowness of view of the types of digital objects which are
considered. The main focus is on digital objects which are rendered i.e. displayed
visually or audibly for human consumption; the test of preservation is essentially that
the digital object can be rendered again in the future.
In earlier sections I presented a 4-dimensional view providing 16 broad categories
such as static, simple, passive, rendered. A person charged with preserving digitally
encoded information needs to be able to understand where the difficulties lie, what key
questions to ask and which techniques to try first, no matter what the digital object.
7.2 Misunderstandings
In section 5 a number of the terms and concepts that need to be unlearned/corrected
were presented. We can add the following examples of common misunderstandings:
• The OAIS Functional Model is the most important part of OAIS
• We can preserve documents therefore preserving data is just a small extension
of this
• Everyone understands the terms I use in preservation
• …
7.3 Mental tools
As an example of the mental tools or rules of thumb which should be at the command
of those responsible for preservation one can consider the challenges for each of the
dimensions discussed earlier.
Type: Preservation challenge
Rendered: Be able to render sufficiently similarly in future - knowing format is often
enough. Meaning is assumed to be known to the human viewer/listener
Non-rendered: Be able to understand/use the information encoded in the digital object in
the future - need the semantics as well as the format
Static: The information to be preserved does not change over time
Dynamic: The information changes over time - need to be able to preserve state at
some past time
Simple: The object is normally thought of as a single entity with its specific
preservation challenge
Composite: The object is normally thought of as being made up of many simple
entities - each of which may present a different preservation challenge
Passive: The digital object is normally used as input to applications - these
applications or their equivalent may need to be preserved
Active: This is something that takes in other objects and produces something.
It probably relies on a support infrastructure e.g. operating systems
Looking at one of the dimensions – rendered vs non-rendered – and the three basic
techniques of preservation, one can develop the following table of issues which may
arise.
Add Rep.Info
Rendered: Usually
- Other R.I. (emulator)
- Structure (format)
- often no semantics
- No specific Designated Community
Non-Rendered:
- Need defined Designated Community
- Many kinds of R.I.: semantic, structure, other (software etc)
Transform
Rendered:
- Transformational Information Properties (T.I.P.) – only simple format related
“significant properties” e.g. colour, font similar enough. No semantics.
Non-Rendered:
- Transformational Information Properties (T.I.P.) – complex structural and semantic
considerations which may need subtle human judgement.
- Consideration of Designated Community probably needed.
Handover
Rendered:
- Relatively straightforward. Evidence of authenticity most critical.
Non-Rendered:
- Need to hand over Representation Information as well as evidence of
Authenticity.
Combining the other dimensions we have sixteen combinations, each with a
combination of the various key issues.
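As a rule-of-thumb aid, the per-dimension challenges from the table of types above can be looked up for any of the sixteen combinations. The Python sketch below condenses the challenge wording from that table; the function name `key_issues` and the tuple encoding of a profile are assumptions made for this illustration.

```python
from itertools import product

# Challenge text condensed from the table of types and preservation challenges.
CHALLENGES = {
    "rendered": "render sufficiently similarly in future; format is often enough",
    "non-rendered": "understand and use the information; need semantics as well as format",
    "static": "information does not change over time",
    "dynamic": "preserve the state at some past time",
    "simple": "single entity with its specific challenge",
    "composite": "many simple entities, each possibly with a different challenge",
    "passive": "applications that take the object as input may need preserving",
    "active": "probably relies on a support infrastructure, e.g. operating systems",
}

DIMENSIONS = [
    ("rendered", "non-rendered"),
    ("static", "dynamic"),
    ("simple", "composite"),
    ("passive", "active"),
]


def key_issues(profile):
    """Return the challenge for each of the four dimension values in `profile`."""
    for value, pair in zip(profile, DIMENSIONS):
        if value not in pair:
            raise ValueError(f"{value!r} is not one of {pair}")
    return [CHALLENGES[value] for value in profile]


combinations = list(product(*DIMENSIONS))
print(len(combinations))

# Example: a database with built-in procedures.
for issue in key_issues(("non-rendered", "dynamic", "composite", "active")):
    print("-", issue)
```

The lookup does not choose a technique by itself; it only gathers the questions a preserver should ask first for a given kind of object.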
8 Conclusion
These key concepts, questions and rules of thumb provide a solid, practical basis for
training those who will be responsible for preserving, and adding value to, our
important digital intellectual capital, whatever the type and source of that information,
so that society as a whole can reap the benefits.
They can also form the basis for training those who create the digitally encoded
information – they are usually the ones most knowledgeable about the digital objects.
These ideas are being developed in the Active Data Management Plan11 group of
RDA, and associated standards are being produced by the group which created
OAIS12.
11 See https://www.rd-alliance.org/groups/active-data-management-plans.html for details
12 The CCSDS-DAI working group - see http://cwe.ccsds.org/moims/default.aspx#_MOIMS-DAI