This section looks at the technical aspects of long-term preservation. There are two facets to the problem:
- archiving the bit stream (which bits? and how?);
- preserving the interpretation mechanism.

Archiving the bit stream

Which bits?
- The original file, unchanged.
- The "essential bits". Data may have built-in redundancy (as on a CD). The redundant bits can be suppressed without loss of information.
- The result of a reversible transformation, i.e. the result of a lossless compression.
- The result of a non-reversible transformation. It should preserve the essential information. This means 1) that the essential information must be identified and agreed upon by archivists, and 2) that the new, transformed bits make all that information accessible in the future.
- A derived representation that may preserve some of the information. This will always be in addition to one of the options above. It may help indexing, searching, querying, etc.

How to archive the bit stream?

The lowest level of a system architecture should be a Storage Component that sees the bit streams as objects. It supports a simple object interface used to create, delete(?) and retrieve an object (a minimal sketch of such an interface appears at the end of this discussion of the bit stream). If a perfectly autonomic system could be built, that would be it. In practice, the interface must support a mechanism for controlling what happens under the covers.

What happens under the covers? The needed functions include support for very large repositories:
- storage hierarchies
- space management
- clustering
- distribution, at one site or among sites
- caching
- multiple copies
- rejuvenation (by copying)
- redundancy for disaster recovery
- others?

Initially, the interface may support interactions to control some of these operations. As the storage component becomes more autonomic, the interface will evolve from specific instructions to policy specifications.

Two remarks on implementation

1. Many operational installations exist that implement large repositories. NARA can benefit from their experience. Such installations rely mostly on off-the-shelf components. Some of the functions may not be currently supported; in that case, look at industrial research projects, which are probably trying to solve the remaining problems. This is a very general problem, not NARA-specific at all.

2. What is unique to NARA (or to any installation) is the way the system is expected to behave when submitted to a particular load. Here, a suggestion is to use simulation as a way to predict the behavior of a particular system implementation under unknown but reasonably predictable circumstances. The parameters involved in the simulation would cover design options, various load distributions, and various estimates of technological parameters and their evolution in time (performance of devices, costs, mean time to failure for media or devices, reduction in real estate, etc.). A substantial advantage of a simulation is that it forces NARA to think in terms of quantitative requirements.

Should we give examples of:
- very large operational installations?
- off-the-shelf components?
- relevant research projects?
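As a rough illustration of the kind of object interface such a Storage Component might expose, here is a minimal sketch in Python. The class and method names, and the policy fields, are hypothetical; the point is only that the layers above deal in opaque objects and policies, not in devices.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass


    @dataclass
    class StoragePolicy:
        """Hypothetical policy settings the archive declares instead of issuing low-level commands."""
        copies: int = 2                   # number of redundant copies to maintain
        sites: int = 1                    # number of geographically separate sites
        rejuvenate_after_years: int = 5   # copy to fresh media after this interval


    class StorageComponent(ABC):
        """Minimal object interface: the layers above see only opaque bit streams."""

        @abstractmethod
        def create(self, object_id: str, bits: bytes, policy: StoragePolicy) -> None:
            """Store a new object under a unique identifier."""

        @abstractmethod
        def retrieve(self, object_id: str) -> bytes:
            """Return the stored bit stream, wherever and however it is kept."""

        @abstractmethod
        def delete(self, object_id: str) -> None:
            """Remove an object (if the archive's rules ever allow deletion)."""

As the component becomes more autonomic, only the policy object would change; the create/retrieve/delete calls could stay the same across generations of hardware.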
Preserving the interpretation mechanism

Once the bits are preserved, how do we know what to do with them? This of course depends on the format. If it is a PDF document, the bit stream needs to be interpreted according to the specification of PDF; the same holds for .doc, .xsl, jpeg, etc. These are examples of well-known formats. But there may also be ad-hoc formats designed to support a particular application well.

NARA says it must deal with hundreds of formats (do we have to multiply that by a certain factor to take multiple versions into account?). Studying that many formats to decide how to preserve them would be a tremendous task. Taking examples (such as the most frequently used formats) would be one approach. A better approach may be to study classes of formats, where all formats in a class exhibit essentially the same preservation problems.

Classes:

Text (without presentation)
Save the bits; but the definition of every one-byte character must be specified, either in English or by showing the bitmap of each character in a common font. Of course, the description of how a bitmap is stored needs to be specified as well, but this can be as simple as "stored as 24 lines of 16 bits, stored line by line, and left to right in a line".

Image
There is always the possibility of agreeing that JPEG is explained in so many places that decoding algorithms will always exist. This is however dangerous (there are already several variants). The alternative is to explain the decoding algorithm in the metadata, as it would be explained in a mathematical book. If this is too complicated, then it may be better to convert the original bit stream into another one that is less compressed but easier to explain. Another alternative is to store an executable program.

Sound (need to check this with a "sound" expert)
This is a good example where saving the original bit stream is not a good idea. The useful data is just a list of values, obtained by sampling the signal at a certain frequency (or two such lists for stereo). The stored bit stream, however, is much more complex, involving redundancy for error correction that comes into effect if the CD gets slightly damaged. This complexity is not needed for the archive. Tools exist today to extract the useful data, so conversion is easy. The metadata must only contain an explanation of the sampling, the shape of the list (number of bits per value), and the specific frequency.

Data structures in XML (no presentation)
These can be stored as text (as above). That is enough to identify all individual data elements and to know their tags. The tags generally have a semantic meaning; that semantic meaning needs to be provided in the metadata.

Data from a relational database
The system itself is not important. The data becomes simply a data structure; its schema must be explained in the metadata. Semantics about relationships (if hierarchical) may be easier to explain in an XML world. There is the possibility of saving relational data in an XML-like manner.

General data structures
Not everything is relational or XML. Many files may have their own format to represent geographical data, engineering data, or statistical data. Not everything can be converted to XML; files would become huge. So there is a need to specify how to decode these formats as well.

Documents with presentation
It is always possible to store an image of the document (stored in an appropriate format; the same discussion as for images above applies). In any case, the original bit stream can be archived, together with information on how to decode it. But these formats can be so complicated that only a program can do the job. A possibility is to convert the document once, at ingest, into a (possibly XML) structure that contains both the tagged data elements of the document and explicit presentation attributes, down to the individual character images (a small sketch of such a conversion follows). Conversion could also be to PDF (see the JPEG arguments above).
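To make the conversion-at-ingest idea concrete, here is a small sketch of turning rows extracted from a relational table into a tagged, self-describing XML structure. The table, column names, and semantic descriptions are invented for illustration only.

    import xml.etree.ElementTree as ET

    # Hypothetical rows extracted from a relational table at ingest time.
    columns = ["case_id", "filed_date", "agency"]
    rows = [("A-1017", "1998-03-02", "EPA"),
            ("A-1018", "1998-03-05", "DOT")]

    # Semantic meaning of each tag, to be carried in the metadata.
    semantics = {
        "case_id": "unique identifier assigned by the originating agency",
        "filed_date": "date the record was filed, ISO 8601",
        "agency": "originating federal agency, standard abbreviation",
    }

    root = ET.Element("table", name="cases")
    for row in rows:
        record = ET.SubElement(root, "record")
        for tag, value in zip(columns, row):
            ET.SubElement(record, tag).text = value

    # The archived object: tagged data plus the explanation of the tags.
    print(ET.tostring(root, encoding="unicode"))
    print(semantics)

Each data element is now identified by its tag, and the tag meanings live beside the data; no relational engine is needed to read the result.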
Spreadsheet
A spreadsheet is a document with presentation, so the previous discussion applies. But some of the values appearing in cells are specified as formulas that compute them as a function of the values in other cells. These formulas could be stored as metadata, as a way to convey the mathematical relationships between cells. Here also, we have to assume that the description of the formula language is known or documented. If the archivist decides that a future user should be able to execute these formulas, then an "executor" program must be archived.

Dynamic applications
The last example introduces a case of dynamic application. Other examples are video games or interactive applications. They require the archiving of programs to re-enact the original behavior. Again, you generally do not need an operational version of DB2 or Lotus Notes to get access to some data or an e-mail!

Summary
- Some digital objects can be archived simply; but even then, they require metadata for correct interpretation in the future.
- Some formats may be so standard that it can be assumed that a decoder will always exist; then the metadata becomes quite simple.
- Some formats can be converted initially, so that the metadata to be archived is much reduced and simplified.
- In some cases, describing how to interpret the data is just too complex; then archiving a program is the only alternative.

NARA should look at techniques that have been proposed, see how they would or would not solve the problems for each of these classes, choose a subset of classes, etc. (see conclusions).

List of proposed techniques (a paragraph for each)
- Using standards
- Relying on metadata
- Emulation
- Conversion on demand
- UVC
- Others?

What each technique does for each format class (a paragraph per technique).

Conclusion

The challenge is serious but can be addressed. It can be partitioned into three main components, serving the following functions:

1. Preserving the bits: This is essentially a storage system. It must be able to store files (identified by unique ids) safely. It may involve disks, tapes, automatic staging, and automatic copying onto the same or new devices. Product offerings exist that would do most of this; a thin upper layer may be needed.

2. Archiving metadata: The objects handled in 1 above are basic files. A logical object at the application level may consist of several related such objects. The metadata that keeps these things together must be stored in a metadata database. That database should also contain descriptive data that allows quick access to documents by well-known attributes such as author, date, provenance, changes, authorization, etc. (a small illustration of such a record follows at the end of this section). This is more or less classical content-manager / digital-library territory.

3. Preserving the interpretation of the data: Providing the above functionalities needs engineering. Preserving the interpretation of the data for the long term needs more: we have the core technology, but we still need to build an operational system. It also requires additional investigation into what the client really needs to preserve, from which set of formats, and how new viewing/processing applications may be implemented in the future. An important assessment concerns the requirement, or non-requirement, for preserving an old system's behavior.

In summary, a first-phase implementation may comprise functionalities 1 and 2. In parallel, a serious study, with prototyping, should be established to identify precise requirements for 3. The results will fuel a full system implementation.
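As a rough illustration of the kind of descriptive record the metadata database in point 2 might hold, consider the sketch below; all field names and values are hypothetical. The point is simply that one logical object ties together several stored files and carries the attributes used for search.

    # Hypothetical descriptive record for one logical archived object.
    metadata_record = {
        "object_id": "nara:2004-000123",           # id of the logical object
        "component_files": ["f-0001", "f-0002"],    # file ids handled by the storage layer
        "title": "Regional office correspondence, 1998",
        "author": "Department of Example, Regional Office 4",
        "date": "1998-06-30",
        "provenance": "transferred from agency records center, accession 98-17",
        "format_class": "document with presentation",
        "changes": ["converted to tagged XML at ingest"],
        "authorization": "public after 2028",
    }

The only archival twist over classical content management is the link from the logical object to the format-class information that will drive its interpretation.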
A certain delay between phases (1, 2) and (3) is not a big problem, as long as it remains short, since we can make sure that programs, systems, and expertise do not become obsolete in such a short period.

We discussed elsewhere RFPs for particular pieces of the whole NARA problem. The nature of the preservation problem requires a sequence of steps, such as an initial RFP on "what can you do to solve the problem?", with a serious on-paper evaluation of how the proposed solution would solve the problem (this means that NARA should be able to write down what functionality, or what range of functionality, is expected for each class of format and/or for individual formats). The next-level RFP may ask for an operational prototype, and the next one for an actual implementation. At each level the specifications become more concrete; at the beginning the whole thing is open to suggestions.

I'd think someone who has had experience moving large relational databases forward through time and platforms might be more instructive. To a non-database person like me, that problem looks like a mess, and seems more similar to the harder NARA problems. To a database person, it looks like a quite manageable problem. Ideally we would want some organization that has to maintain access over a long period of time.

Really, is this needed? The archive is meant to provide a way to retrieve the information later on. It does not necessarily mean that the user interface needs to be exactly the same. But there is a requirement question here. A relational database contains a certain amount of information. If we can retrieve the same information later using a very different interface, is that a reasonable archive? I think so. (I essentially agree with Jerry's comment below.) But someone could argue (mainly if he is called J. Rothenberg) that we need to preserve what a user could extract from the data, and that may depend on the way the extraction is specified. So, if the functionality changes, it may give you a wrong idea of what a user was able to discover 50 years ago.

The Social Security Administration, the Veterans Administration, and the IRS are government examples. Even if their solutions are lame, they must be doing something. In the corporate world, I'm not sure who that would be. Car companies (complex parts information, responsible for recalls over the decade-long life of a product)? Chemical companies (which must maintain records for liability purposes over several decades; think Dow Chemical, Corning, asbestos)? There seems to be a discrepancy between the interest in preservation on the part of many sectors and the absence of money and resources that companies want to invest at this time. Databases and digital libraries exist, but they rely on the presumed continuity of tools (databases, PDF, XML) or on business-as-usual kinds of conversions.

This is a point where careful distinctions are likely to pay important dividends. NARA certainly has to *ingest* many different types of information. But in most cases it is probably *not* necessary to archive it in the same form in which it was ingested. One reason is that at the point when NARA takes it in, it becomes read-only (if it wasn't already), and any functional capability built into the data structure related to updating is not only not needed; I would imagine that functional capability actually needs to be disabled. So running old database systems with their full original capability is not, on the face of it, a requirement. All that is really needed is to extract the data from the old system so that it can be archived in a way that allows it to be useful in the future (the sketch that follows shows the kind of access such extracted data still supports).
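To illustrate the point that the same information can be retrieved later through a very different interface, here is a small sketch that answers a question over extracted, tagged rows using nothing but a general-purpose language; the data and field names are invented.

    # Rows as they might look after extraction from the old database at ingest.
    records = [
        {"case_id": "A-1017", "filed_date": "1998-03-02", "agency": "EPA"},
        {"case_id": "A-1018", "filed_date": "1998-03-05", "agency": "DOT"},
        {"case_id": "A-1042", "filed_date": "1998-04-11", "agency": "EPA"},
    ]

    # A future user asks: which cases did the EPA file in 1998?
    # No DBMS, no original query interface -- just the archived data.
    answer = [r["case_id"] for r in records
              if r["agency"] == "EPA" and r["filed_date"].startswith("1998")]
    print(answer)   # ['A-1017', 'A-1042']

Whether that counts as preserving the original system's functionality is exactly the requirement question raised above.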
Rothenberg archives, with the data bit stream, the original executable program that was used to create or manipulate the information in the first place. That program (with its operating system) works only on the old machine (call it the M2000). The only way to get at the data is to run that old program, and this always requires an emulator. In 1995, Rothenberg suggested that a description of the architecture of the M2000, in all its details, be archived with the data. But building an emulator from the description of the M2000 architecture is not a simple endeavor. It can be done only if the description is perfect and complete (a notoriously difficult task in itself). And even then, how do we know that the emulator works correctly, since no M2000 machine exists for comparison? Later on, he suggested that what should be saved is an emulator specification. This has never, to my knowledge, been carried out.

Note that the emulation method hinges on the fact that what is saved with the document D is the original executable bit stream of the application program that created and/or rendered the document (including the operating system). Saving the original program is justifiable for behavior archiving but is overkill for data archiving. In order to archive a collection of pictures, is it necessary to save the full system that enabled the original user to create, modify, and enhance pictures, when only the final result is of interest for posterity? If Lotus Notes is used to send an e-mail message in the year 2000, is it necessary to save the whole Lotus Notes environment and reactivate it in 2100 in order to restore the note's contents? But there is an even worse drawback: in many cases the application program may display the data in a certain way (for example, a graphical representation) without giving explicit access to the data themselves. In such a case, it is impossible to export the basic data from the old system to a new one. This is a serious drawback. In other words, repurposing is impossible.

Our UVC approach

We need to be able to restore, say in the year 2100, on a machine M2100, data generated in 2000 on a machine M2000. In 2000, an application program generates a data file, which must be archived for the future. In order for the file to be understood in the future, a program P is also archived, which can decode the data and present it to the client in an understandable form. That program P is written for a UVC machine (a Universal Virtual Computer). In 2100, a restore application reads the bit stream and passes it to a UVC emulator, which executes the UVC program. During that execution the data is decoded and returned to the client according to a logical view (or schema). The schema itself must also be archived and be easily readable; we use a similar technique for it. When the data is read (actually, decoded by the UVC program), all data items are returned to the user, tagged with a semantic label. This allows for repurposing.

The UVC approach also provides a nice way of checking today the correctness of programs that will be used in the future. If a UVC program is written in 2000, it can be tested on a UVC interpreter written in 2000 for an M2000 machine. If ten years later, in 2000+10, a new machine architecture comes up, a new UVC interpreter can easily be written. It can be checked by running the same UVC program through both the 2000 and the 2000+10 UVC interpreters. In other words, any UVC interpreter can be checked by comparison with the interpreter of the previous generation (the sketch below illustrates the idea on a toy scale).
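The following toy interpreter is not the UVC, whose instruction set is far richer; it is only a sketch of the checking idea, under invented instructions: a decoding program is archived as instructions for a simple abstract machine, and any new interpreter for that machine can be validated by running the same archived program on the old and the new interpreter and comparing the tagged output.

    def run_interpreter(program, data):
        """Toy abstract machine: executes an archived decoding program over a
        byte stream and returns (tag, value) pairs. Not the real UVC; the
        instruction set here is invented purely for illustration."""
        pos, acc, out = 0, None, []
        for op, arg in program:
            if op == "READ_INT":        # read `arg` bytes as a big-endian integer
                acc = int.from_bytes(data[pos:pos + arg], "big")
                pos += arg
            elif op == "READ_TEXT":     # read `arg` bytes as ASCII text
                acc = data[pos:pos + arg].decode("ascii")
                pos += arg
            elif op == "EMIT":          # return the current value under a semantic tag
                out.append((arg, acc))
        return out

    # The archived decoding program: instructions for the abstract machine.
    program = [("READ_INT", 2), ("EMIT", "year"),
               ("READ_TEXT", 3), ("EMIT", "agency")]

    # The archived bit stream it explains.
    data = (2000).to_bytes(2, "big") + b"EPA"

    # Cross-generation check: a newly written interpreter (a stand-in here)
    # must produce the same tagged output as the previous one.
    reference_output = run_interpreter(program, data)
    new_output = run_interpreter(program, data)   # imagine a 2010 rewrite here
    assert new_output == reference_output
    print(reference_output)   # [('year', 2000), ('agency', 'EPA')]

The essential property is that the archived program never changes; only the interpreter is rewritten for each new generation of machines.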
For archiving programs: when the object of archival is a program, the M2000 code must be archived and later executed under emulation. The UVC approach can be naturally extended to support the archiving of programs, providing a way to essentially write the emulator in the year 2000, even if the target machine is not known. Instead of archiving the data bit stream and a program to decode it, we now store the original program (in the M2000 machine language), together with an emulator of the M2000 machine written in the UVC machine language, and any data files required to run the original application program. In 2100, the UVC emulator interprets the UVC instructions that emulate the M2000 instructions; that emulation essentially produces a machine equivalent to the M2000, which then executes the original application code. That execution yields the same results as the original program on an M2000. The metadata must simply contain a user's guide on how to run the program. We believe this is the only reasonable way of developing, today, an emulator for today's machine when that emulator will be used in a possibly distant future.

Figure 2: UVC-based emulation