Preserving Information for all – new challenges and new skills needed

Dr David Giaretta MBE, Giaretta Associates Ltd, www.giaretta.org

Abstract

A huge amount of information is being created and digitally encoded. There is a demand that at least some of this information be re-used to create value, again and again into the future, from repositories which can be trusted. Libraries may be able to support these demands. However, fundamental skills must be developed. This paper discusses the fundamental concepts and challenges of OAIS and the associated ISO 16363 for Trustworthy Digital Repositories.

1 Introduction

In this paper I would like to describe the skills which the library and other communities must develop in order to meet the demands for looking after the tsunami of data that is being created. As will be described in sections 4 and 5, the fundamentals of digital preservation are well understood, but, in a real sense, this is just part of the picture.

2 Challenges

Rather than simply look at the obvious challenges of digital preservation I would like to begin at a different point. The Riding the Wave report[1], for which I was rapporteur, provided a vision for 2030 and addressed the question, as part of the EU Digital Agenda, “How Europe can gain from the rising tide of scientific data”. A similar question is surely of interest in all countries, including Mexico. As we worked on this, it became clear that the question should be extended to all kinds of data. Moreover digital preservation is intimately bound up with this question, as well as with the question “who pays and why?” for digital preservation. While data is newly created and of obvious use there will be resources available, but as has been pointed out[2], the value of much data is potential: it may be useful in the future, but this is not certain. We will return to this and its implications later.

[1] Available at http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
A new profession of “data scientist” or “data librarian” is being discussed in this regard, and it provides the library and other memory institutions with an opportunity. However it seems clear that, despite their initial ownership of the digital preservation domain, in order to be able to meet the challenges a number of mantras need to be unlearned and a number of new skills developed. For example “emulate or migrate”, “characterisation”, “significant properties”, “metadata” and even “format” flag a number of concepts in digital preservation which are useful but only for a limited number of types of digital objects, specifically those which are normally rendered, i.e. displayed visually or audibly for human consumption; the test of preservation for these is essentially that the digital object can be rendered again in the future. While very important, these types of objects do not include the vast bulk of the scientific, financial, engineering, social and business data with which we are deluged.

There are many challenges associated with this deluge. A fundamental challenge is one which, when looked at from the basis of OAIS, addresses both the challenge identified in Riding the Wave and the broad challenge of digital preservation, namely how can the value of digital objects be increased? One of the ways to group the challenges, and one which links the discussion to the topic of “big data”, is to look at the challenges of the “V”s.

3 The “V” challenges

Resources are needed to address the many V’s[3] which are normally discussed in terms of big data but are also relevant to small data since, as noted[4], the real revolution is the mass democratisation of the means of access, storage and processing of data, small as well as big. It is useful to divide these Vs into two groups. The first consists of Volume, Velocity, Variety and Volatility, which are more related to data management – i.e. 
issues which arise even if the data is not necessarily being preserved but is being used by the researchers who created it and over just a few years. The other group consists of Veracity, Validity and Value, which this paper will focus on for the following reasons.

Veracity, including Understandability and Authenticity, is vital for a researcher using unfamiliar data from unfamiliar sources – otherwise how can that researcher use the data and trust that it is what it is claimed to be? The challenge will be exacerbated by the data management “Vs” noted previously, in particular scaling with Variety.

Validity (including correctness, data quality and legality) is normally of vital interest to researchers if they wish to undertake scientifically useful work.

Value (or potential value) must be identified in order to justify keeping the data in the long term, and even in the short term (related to Volatility), because keeping data requires resources. The minimum, relatively easily identified, costs are those related to storage, which tend to scale with Volume and in large scale repositories are very front-loaded[5]. Other costs, which are less obvious and more uncertain, are those associated with maintaining Veracity and Validity.

[2] See for example Sustainable Economics for a Digital Planet, available from http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf
[3] http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
[4] http://www.theguardian.com/news/datablog/2013/apr/25/forget-big-data-small-data-revolution

It is worth mentioning another area which has caused and still causes difficulty, namely terminology, even if one restricts the language to English. There are many collections of terms (glossaries), created by, for example, libraries, organisations and communities. The problem is that none of these show their relationships to any of the others, even in the cases where they use the same word with a different meaning.
Thus when the groups talk together they talk at cross-purposes. There has been an attempt[6] to draw a number of these glossaries together using the Simple Knowledge Organisation System (SKOS), which allows one to indicate whether a term from one glossary is broader than, narrower than or related to a term in another glossary. It remains to be seen whether this gains widespread use.

[5] Information gathered by CERN data management group
[6] http://www.alliancepermanentaccess.org/index.php/consultancy/dpglossary/

3.1 Variety: Types of digital objects

There are many ways to think about the variety of digital objects which researchers and libraries may need to deal with. One can list things like PDFs, emails, photographs, videos, audio, unstructured data such as text, structured data and of course the many types of scientific data. How should these be dealt with? In particular how should they be preserved?

To draw up a map of the landscape of digital objects, we suggested earlier that whether or not an object is normally “rendered” is a useful dividing line, because objects which are normally not rendered present different challenges from those which are normally rendered. Similarly it seems fairly obvious that software, for example the Word application, presents different preservation challenges from a Word document. One way to make the distinction is between those digital objects, such as the Word application, which are “active”, i.e. they do things to other objects, and the “passive” ones like the Word document. Another distinction that seems reasonable to make is between objects that are regarded as “static”, i.e. they are not normally expected to change, as opposed to those which may be described as “dynamic”, such as a genome with associated annotations. Although many more divisions are possible we suggest just one more, namely between “simple” objects i.e. 
ones which are normally regarded as a single thing such as an image or a piece of music. The other side of the division may be referred to as “complex” or perhaps “composite”, for example a ZIP file or a scientific dataset containing raw data plus data quality flags.

These individual dimensions can be combined to construct a multi-dimensional coordinate system; for example a simple JPEG is static, simple, passive and rendered, whereas a database with built-in procedures is dynamic, complex, active and non-rendered. One reason that this may be (and is) useful is that, based on the discussion on preservation techniques below, we can use it as a way to guide us towards the preservation tool/technique to try first for a particular digital object. There are many collections of tools but little guidance on which to use in which circumstance.

4 Fundamentals of digital preservation – OAIS

OAIS[7] (ISO 14721:2012) provides key concepts, models and terminology for digital preservation. These have been designed to be applicable to all types of repositories and all types of digitally encoded information, and have been applied and tested across a very wide variety of repositories.

The Functional Model provides a way to explain some of the terminology, and many repositories, and indeed system vendors, have mapped their functionality to it. However it should be realised that simply being able to do this is no indication of the quality of such repositories or systems, since it is possible to map a trivial setup with very little preservation capability to the Functional Model.

Figure 1 OAIS Functional Model

The concepts and model key to preservation are supplied by the Information Model. Indeed conformance to OAIS is defined within OAIS itself as use of the Information Model and fulfilment of the OAIS Mandatory Responsibilities.

[7] Available free from http://public.ccsds.org/publications/archive/650x0m2.pdf
Figure 2 OAIS Information Model

It is worth mentioning, albeit briefly, the ideas behind Trustworthy Digital Repositories (TDR) for which ISO 16363:2012[8] provides metrics. The fundamental concepts of OAIS are integrated into the metrics of ISO 16363. Supplementing these are ideas about the adequacy of the financial, legal and staffing capabilities, and basic security metrics. An important point to understand is that the repository does not have to exist forever but, if it ceases to function, it will have had time to hand over the digitally encoded information that it is preserving to the next in the chain of preservation. Within the international ISO process, audit and certification of repositories can be performed by auditors accredited using ISO 16919:2014[9].

[8] Available free from http://public.ccsds.org/publications/archive/652x0m1.pdf For more information about ISO 16363 see http://www.iso16363.org
[9] Available free from http://public.ccsds.org/publications/archive/652x1m2.pdf

4.1 OAIS concepts

It is worth re-capping the fundamental definitions and concepts because these form the bedrock for an understanding of broadly applicable digital preservation, and what is required for trustworthiness in preservation terms. The sequence is as follows. OAIS takes a very general definition of its prime concern which, as the “I” in OAIS suggests, is information:

Information: Any type of knowledge that can be exchanged. In an exchange, it is represented by data. An example is a string of bits (the data) accompanied by a description of how to interpret the string of bits as numbers representing temperature observations measured in degrees Celsius.

Note that Knowledge is not defined in OAIS. The accompanying definition of data is equally broad:

Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.
Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.

And in the case of things digital:

Digital Object: An object composed of a set of bit sequences.

Note that this does not mean we are restricted to a single file. The definition includes multiple, perhaps distributed, files, or indeed a set of network messages. The restriction to “bits”, i.e. consisting of “1” and “0”, means that if we move to trinary (i.e. “0”, “1” and “2”) instead of binary then we would have to change this definition, but it would not affect the concept; however it would change the tools we could use.

One might wonder why data includes physical objects such as a "moon rock specimen". The answer should become clear later but in essence the answer is that to provide a logically complete solution to digital preservation one needs, eventually, to jump outside the digital, if only, for example, to read the label on the disk.

As to the question of the length of time we need to be concerned about, OAIS provides the following pair of definitions:

Long Term: A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.

Long Term Preservation: The act of maintaining information, Independently Understandable by a Designated Community, and with evidence supporting its Authenticity, over the Long Term.

In other words we are not only talking about decades into the future but, as is a common experience, we need to be concerned with the rapid change of hardware and software, the cycle time of which may be just a few years.
Of course even if an archive is not itself looking after the digital objects over the long term, even by that definition, the intention may be for another archive to take over later. In this case the first archive needs to capture all the “metadata” needed so that it can hand these on also.

Two key concepts are embedded in the above definition, namely:

Independently Understandable: A characteristic of information that is sufficiently complete to allow it to be interpreted, understood and used by the Designated Community without having to resort to special resources not widely available, including named individuals.

By being able to “understand” a piece of information is meant that one can do something useful with it; it would be impractical to mean that one understands all of its ramifications.

Now we approach one element of what the "preservation" part of "digital preservation" means. To require that things are able to be "interpreted, understood and used" is to make some very powerful demands. It not only includes playing a digital recording so it can be heard, or rendering an image or a document so that it can be seen; it also includes being able to understand what the columns in a spreadsheet mean, or what the numbers in a piece of scientific data mean. This is needed in order to actually understand and, in particular, use the data, for example by using it in some analysis programme or combining it with other data in order to derive new scientific insights.

The "Independently" part is to exclude the easy but unreliable option of being able to simply ask the person who created the digital object; unreliable, not because the creator may be a liar, but rather because the creator may be, and in the very long term certainly will be, deceased!

Finally, we have the other key concept of “Designated Community”.

Designated Community: An identified group of potential Consumers who should be able to understand a particular set of information.
The Designated Community may be composed of multiple user communities. A Designated Community is defined by the archive and this definition may change over time. Why is this a key concept? To answer that question we need to ask another fundamental question, namely "How can we tell whether a digital object has been successfully preserved?" – a question which can be asked repeatedly as time passes. Clearly we can do the simple things like checking whether the bit sequences are unchanged over time, using one or more standard techniques such as digital digests [XX]. However just having the bits is not enough. The demand for the ability for the object to be "interpreted, understood and used" is broader than that - and of course it can be tested. But surely there is another qualification, for is it sensible to demand that anyone can "interpret, understand and use" the digital object - say a four year old child? Clearly we need to be more specific. But how can such a group be specified, and indeed who should choose? This seems a daunting task - who could possibly be in a position to do that? The answer that OAIS provides is a subtle one. The group of people who should be able to "interpret, understand and use" the digital object and who we can use to test the success or otherwise of the "preservation", is defined by the people who are doing the preservation. The advantage of this definition is that it leads to something that can be tested. So if an archive claims "we are preserving this digital object for astronomers" we can then call in an astronomer to test that claim. The disadvantage is that the preserver could choose a definition which makes life easy for him/her – what is to stop that? The answer is that there is nothing to prevent that BUT who would rely on such an archive? As long as the archive’s definition is made clear, then the person depositing the digital objects can decide whether this is acceptable. 
The success or failure of the archive, in terms of digital objects being deposited, will be determined by the market. Thus in order to succeed the archive will have to define its Designated Community(ies) appropriately.

Different archives holding the same digital object may define their Designated Communities as being different. This will have implications for the amount and type of “metadata” which is needed by each archive.

Making the link back to the bits, OAIS defines:

Representation Information: The information that maps a Data Object into more meaningful concepts.

It is important to realise that Representation Information can be whatever is needed to understand that Data Object: documents, dictionaries, data, software, pieces of paper with handwritten notes etc. The other important point is that the Representation Information will be represented by some data object, which itself may need its own Representation Information; this means that we have a network of pieces of information. The breadth/depth of this network is determined by the choice of Designated Community.

4.2 What “metadata”, how much “metadata”?

One fundamental question to ask is ‘What “metadata” do we need?’ The problem with “metadata” is that it is so broad that people tend to have their own limited view. OAIS provides a more detailed breakdown. The first three broad categories are to do with (1) understandability, (2) origins, context and restrictions and (3) the way in which the data and “metadata” are grouped together.

The reason for this separation is that given some digitally encoded information one can reasonably ask whether it is usable, which is dealt with by (1). This is a separate question to the one about where this digital object came from, dealt with by (2). Since there are many ways of associating these things it seems reasonable to want to consider (3) separately. It could be argued that to understand a piece of data one needs to know its context.
However the discussion about “Independently Understandable” in the previous section points out that OAIS does not require understanding of all the ramifications, so this separation of context from understandability is reasonable, although it does not mean that all context is excluded from understandability since a piece of “metadata” may have several roles.

Authenticity is a key concept in digital preservation, and some would argue that it is the pre-eminent concept, in that unless one can show that the data object is, in some provable sense, what was originally deposited, then one cannot prove that digital preservation has been successful. On the other hand OAIS defines preservation in terms of understandability and usability as well as authenticity; it therefore provides a view in which Representation Information and Authenticity are equal partners. OAIS defines Authenticity as: “the degree to which a person (or system) may regard an object as what it is purported to be. The degree of Authenticity is judged on the basis of evidence”.

Provenance Information is the information that documents the history of the Content Information. This information tells the origin or source of the Content Information, any changes that may have taken place since it was originated, and who has had custody of it since it was originated. The archive is responsible for creating and preserving Provenance Information from the point of Ingest; however, earlier Provenance Information should be provided by the Producer. Provenance Information adds to the evidence to support Authenticity.

4.3 Archival Information Package

OAIS defines the Archival Information Package (AIP), which is conceptually vital for the preservation of a digital object. According to OAIS the AIP is defined to provide a concise way of referring to a set of information that has, in principle, all the qualities needed for permanent, or indefinite, Long Term Preservation of a designated Information Object.
It is important to realise that the AIP is a logical construct, i.e. it does not have to be a single file. The AIP is shown above. Note that this means that, unlike the general Information Package, the AIP must have exactly one piece of Content Information and one piece of Preservation Description Information (PDI). Remember that a single Information Object (i.e. Content Information or PDI) could consist of many separate digital objects. There are very many ways of packaging information, both physically and logically.

5 Fundamental preservation techniques

OAIS requires that the information (represented as data) must be maintained as Independently Understandable by a Designated Community, and with evidence supporting its authenticity. To be understandable requires that there is adequate Representation Information. We might have adequate Representation Information for the Designated Community at one time, but over time things such as hardware, software, environment or the tacit knowledge of the Designated Community change.

A Digital Object is made up of bit sequences. We can either keep these unchanged, in which case we can check digital digests or hashes, which should be standard data management practice (see the first group of V’s), or else we can decide to transform the original object to another bit sequence, perhaps for reasons of convenience or cost, for example if the software used as Representation Information is no longer available.

Therefore we can see two fundamental digital preservation techniques:

1) Add Representation Information
2) Transform – OAIS uses this term for a more specific type of Migration.

We can add another, one which repositories for obvious reasons tend not to think about, namely to

3) Hand over to another repository in the case that the original repository can no longer undertake the preservation activities, for example, because of lack of resources.
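As a minimal sketch of the "keep the bit sequences unchanged and check digests" option mentioned above, the following Python fragment records a digest per file and later reports any file whose digest no longer matches. The function names, the choice of SHA-256 and the in-memory manifest are illustrative assumptions only; OAIS does not prescribe any of them.

```python
import hashlib
from pathlib import Path


def sha256_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths):
    """Build a manifest mapping each file path (as a string) to its digest."""
    return {str(p): sha256_digest(p) for p in paths}


def changed_files(manifest):
    """Return the paths whose current digest differs from the recorded one.

    A file that has gone missing is also reported as changed.
    """
    changed = []
    for path, recorded in manifest.items():
        if not Path(path).is_file() or sha256_digest(path) != recorded:
            changed.append(path)
    return changed
```

A repository might call record_fixity at Ingest and changed_files at intervals afterwards. An empty result shows only that the bits are unchanged; as discussed in section 4.1, that is necessary but not sufficient for the object to remain understandable and usable.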
Such a handover will, in principle, be straightforward as long as Archival Information Packages have been created, remembering of course that these may be logical structures.

Comparing these to the various terms we noted at the start:

“emulate or migrate”: Emulation is essentially adding a type of Representation Information, but there are far more types that could be added. Moreover emulation allows one to do what had been done previously, whereas with data one will want to do new things, such as combining it with newly created data.

“characterisation”: Characterisation refers to “technical characteristics” with associated “technical metadata”; this ignores, for example, the semantics associated with the object. Moreover there is nothing useful in terms of understandability.

“significant properties”: A detailed analysis led to the definition of “Transformational Information Properties” in the updated version of OAIS. The definition is broader than “significant properties” and in particular is applicable to all types of data. Moreover the analysis shows that these are important in terms of Authenticity where the digital object is Transformed.

“metadata”: Metadata is too ill-defined; different people focus on different types. The key question, which needs the finer granularity of terms which OAIS provides, is which sort and how much?

“format”: While useful, the term “format” is often used to the exclusion of other types of Representation Information. In OAIS terms format is a type of Structure Information; the other important types are Semantic Information and “Other Information”, which includes, for example, software.

These are a few examples of terms and concepts which many in the memory institutions need to re-learn in order to be able to deal effectively with the many types of digital objects which must be preserved, and have value added.

6 Who pays and why?
As noted at the start of this paper, besides focussing on the techniques of preservation we must also address how digital preservation is paid for; this is connected with how to obtain value from what is preserved in order to justify its preservation, and indeed to help decide whether to continue to preserve. There are many types of value: monetary, risk reduction, avoidance of litigation, prestige, safety, for future generations, for oneself. How can value be increased? OAIS provides an answer using the techniques required for digital preservation.

The line of thinking is as follows: the repository must define the Designated Community and then provide adequate Representation Information to ensure that the digital object is independently understandable. Being able to understand and use the digital object is likely to make it valuable to the Designated Community. The same method used to define and then make available that Representation Information can be extended to a community wider than the Designated Community. The repository need not commit to maintaining the Representation Information for this broader community into the future; instead it may be viewed as a trial on adding value, and if this does not work then other trials may be attempted.

The basic idea is that Preservation is tested on Usability, and this can be enhanced as described above; Usability gives Value; Value forms the basis of Business Cases; Business Cases are implemented by Business Models, which produce resources which can fund preservation and provide wider benefits to society. More details are available[10], together with an integrated view of a vast amount of digital preservation research results.

7 New Skills Needed

It seems reasonable to expect that those responsible for preservation of and access to the intellectual capital of a person, organisation, nation or humanity as a whole will be presented with many continuing challenges and, perhaps more importantly, many new challenges.
These will range from new, more complex, types of digital objects to demands to justify the resources needed for preservation and more broadly to add value to what is being preserved. My personal view is that currently those who come out of the education systems around the world are not provided with the intellectual tools to enable them to rise to these challenges.

[10] A more detailed exploration of this approach is provided at http://www.alliancepermanentaccess.org/index.php/community/common-vision/

7.1 Limitation

There seems to be a narrowness of view of the types of digital objects which are considered. The main focus is on digital objects which are rendered, i.e. displayed visually or audibly for human consumption; the test of preservation is essentially that the digital object can be rendered again in the future. In earlier sections I presented a 4-dimensional view providing 16 broad categories such as static, simple, passive, rendered. A person charged with preserving digitally encoded information needs to be able to understand where the difficulties lie, what key questions to ask and which techniques to try first, no matter what the digital object.

7.2 Misunderstandings

In section 5 a number of the terms and concepts that need to be unlearned/corrected were presented. We can add the following examples of common misunderstandings:

- The OAIS Functional Model is the most important part of OAIS
- We can preserve documents therefore preserving data is just a small extension of this
- Everyone understands the terms I use in preservation
- …

7.3 Mental tools

As an example of the mental tools or rules of thumb which should be at the command of those responsible for preservation one can consider the challenges for each of the dimensions discussed earlier.

Type: Preservation challenge

Rendered: Be able to render sufficiently similarly in future - knowing format is often enough. Meaning is assumed to be known to the human viewer/listener

Non-rendered: Be able to understand/use the information encoded in the digital object in the future - need the semantics as well as the format

Static: The information to be preserved does not change over time

Dynamic: The information changes over time - need to be able to preserve state at some past time

Simple: The object is normally thought of as a single entity with its specific preservation challenge

Composite: The object is normally thought of as being made up of many simple entities - each of which may present a different preservation challenge

Passive: The digital object is normally used as input to applications - these applications or their equivalent may need to be preserved

Active: This is something that takes in other objects and produces something. It probably relies on a support infrastructure e.g. operating systems

Looking at one of the dimensions – rendered vs non-rendered – and the three basic techniques of preservation, one can develop the following table of issues which may arise.

Add Rep.Info
  Rendered:
    - Usually Other R.I. (emulator) and Structure (format)
    - Often no semantics
    - No specific Designated Community
  Non-Rendered:
    - Need defined Designated Community
    - Many kinds of R.I.: semantic, structure, other (software etc)

Transform
  Rendered:
    - Transformational Information Properties (T.I.P.) – only simple format related “significant properties” e.g. colour, font similar enough. No semantics.
  Non-Rendered:
    - Transformational Information Properties (T.I.P.) – complex structural and semantic considerations which may need subtle human judgement.
    - Consideration of Designated Community probably needed.

Handover
  Rendered:
    - Relatively straightforward. Evidence of authenticity most critical.
  Non-Rendered:
    - Need to hand over Representation Information as well as evidence of Authenticity.

Combining the other dimensions we have sixteen combinations, each with a combination of the various key issues.
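The sixteen combinations can be made concrete with a short sketch. The Python fragment below enumerates the four two-valued dimensions and looks up, for a given object, the challenge associated with each of its coordinates. The class name, field names and dimension labels ("rendering", "change", "structure", "behaviour") are hypothetical, introduced only for illustration, and the challenge texts are abbreviated from the table in this section.

```python
from dataclasses import dataclass
from itertools import product

# Four two-valued dimensions; the dimension labels are illustrative assumptions.
DIMENSIONS = {
    "rendering": ("rendered", "non-rendered"),
    "change": ("static", "dynamic"),
    "structure": ("simple", "composite"),
    "behaviour": ("passive", "active"),
}

# Abbreviated from the "Preservation challenge" table in section 7.3.
CHALLENGES = {
    "rendered": "render sufficiently similarly in future; format is often enough",
    "non-rendered": "understand/use the encoded information; need semantics and format",
    "static": "information does not change over time",
    "dynamic": "information changes; preserve state at some past time",
    "simple": "single entity with its specific preservation challenge",
    "composite": "many simple entities, each possibly a different challenge",
    "passive": "input to applications, which may themselves need preserving",
    "active": "takes in other objects; relies on a support infrastructure",
}


@dataclass(frozen=True)
class ObjectProfile:
    """Position of a digital object in the four-dimensional view."""
    rendering: str
    change: str
    structure: str
    behaviour: str

    def __post_init__(self):
        # Reject values that are not on the corresponding axis.
        for name, allowed in DIMENSIONS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


def all_profiles():
    """Enumerate every combination of the four dimensions."""
    return [ObjectProfile(*combo) for combo in product(*DIMENSIONS.values())]


def challenges_for(profile):
    """List the challenge text for each coordinate of `profile`."""
    return [CHALLENGES[getattr(profile, name)] for name in DIMENSIONS]


# The two examples given in section 3.1.
jpeg = ObjectProfile("rendered", "static", "simple", "passive")
database = ObjectProfile("non-rendered", "dynamic", "composite", "active")
```

Here len(all_profiles()) gives the sixteen categories, and challenges_for(database) gathers the four rows of the table that apply to a database with built-in procedures. This is a lookup aid for deciding which questions to ask first, not a substitute for the judgement described above.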
8 Conclusion

These key concepts, questions and rules of thumb provide a solid, practical basis for training those who will be responsible for preserving, and adding value to, our important digital intellectual capital, whatever the type and source of that information, so that society as a whole can reap the benefits. They can also form the basis for training those who create the digitally encoded information; they are usually the ones most knowledgeable about the digital objects. These ideas are being developed in the Active Data Management Plan[11] group of RDA, and associated standards are being produced by the group which created OAIS[12].

[11] See https://www.rd-alliance.org/groups/active-data-management-plans.html for details
[12] The CCSDS-DAI working group - see http://cwe.ccsds.org/moims/default.aspx#_MOIMS-DAI