
This section looks at the technical aspects of long-term preservation.
Two facets of the problem:
- archiving the bit stream (which bits? And how?);
- preserving the interpretation mechanism.
Archiving the bit stream
Which bits?
- The original file, unchanged.
- The “essential bits”. Data may have built-in redundancy (as in
a CD). The redundant bits can be suppressed without loss of
information.
- The result of a reversible transformation, i.e., the result of a
lossless compression.
- The result of a non-reversible transformation. It should
preserve the essential information. This means 1) that the
essential information must be identified and agreed upon by
archivists, and 2) that the new transformed bits make all that
information accessible in the future.
- A derived representation that may preserve some of the
information. This will always be in addition to one of the
options above. It may help indexing, searching, querying, etc.
How to archive the bit stream?
The lowest level of a system architecture should be a Storage Component
that sees bit streams as objects. It supports a simple object
interface used to create, delete(?), and retrieve an object.
If a perfectly autonomic system could be built, that would be it. In
practice, the interface must support a mechanism for controlling what
happens under the covers.
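As an illustration of how small that object interface can be, here is a
minimal sketch; the class and method names are invented, and an
in-memory dictionary stands in for real storage media:

    # Minimal sketch of the Storage Component interface; all names are
    # illustrative. The dict stands in for disks, tapes, etc.
    class StorageComponent:
        def __init__(self):
            self._objects = {}                     # object id -> bit stream

        def create(self, object_id: str, bits: bytes) -> None:
            if object_id in self._objects:
                raise KeyError(object_id + " already archived")
            self._objects[object_id] = bits

        def retrieve(self, object_id: str) -> bytes:
            return self._objects[object_id]

        def delete(self, object_id: str) -> None:  # "(?)": may be forbidden
            del self._objects[object_id]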
What happens under the covers?
List of the needed functions:
- support for very large repositories
- support for storage hierarchies
- space management
- clustering
- distribution, at one site or among sites
- caching
- multiple copies
- rejuvenation (by copying; see the sketch after the next paragraph)
- redundancy for disaster recovery
- others?
Initially, the interface may support interactions to control some of
these operations. When the storage component becomes more autonomic,
the interface will evolve from specific instructions to policy
specifications.
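To make one of the listed functions concrete, here is a minimal sketch
of rejuvenation by copying with a fixity check, assuming ordinary files
and SHA-256; the function name and error handling are invented:

    # Sketch of "rejuvenation (by copying)": copy to fresh media, then
    # verify the copy bit-for-bit before trusting it.
    import hashlib
    import pathlib
    import shutil

    def rejuvenate(src: pathlib.Path, dst: pathlib.Path) -> None:
        shutil.copy2(src, dst)                     # copy to fresh media
        old = hashlib.sha256(src.read_bytes()).hexdigest()
        new = hashlib.sha256(dst.read_bytes()).hexdigest()
        if old != new:
            raise IOError("rejuvenation of %s failed verification" % src)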
Two remarks on implementation
1. Many operational installations exist that implement large
repositories. NARA can benefit from their experience. Such
installations rely mostly on off-the-shelf components. Some of the
functions may not be currently supported; in that case, look at
industrial research projects, which are probably trying to solve the
remaining problems. This is a very general problem – not NARA-specific
at all.
2. What is unique to NARA (or any installation) is the way the system
is expected to behave when submitted to a particular load. Here, a
suggestion may be to use simulation as a way to predict the behavior of
a particular system implementation under unknown but reasonably
predictable circumstances. The parameters involved in the simulation
would cover design options, various load distributions, and various
estimates of technological parameters and their evolution in time
(performance of devices, costs, mean time to failure for media or
devices, reduction in real estate, etc.). A substantial advantage of a
simulation is that it forces NARA to think in terms of quantitative
requirements.
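As an illustration only, such a simulation can start very small. The
toy sketch below sweeps repository growth, device failures, and cost
over time; every parameter value is an invented placeholder, not a NARA
estimate:

    # Toy simulation of repository growth, device failures, and cost.
    import random

    def simulate(years=50, ingest_tb_per_year=100.0, device_tb=10.0,
                 annual_failure_rate=0.02, cost_per_tb=25.0,
                 annual_cost_decline=0.85, seed=0):
        rng = random.Random(seed)
        stored_tb, total_cost, failures = 0.0, 0.0, 0
        for _ in range(years):
            stored_tb += ingest_tb_per_year
            devices = int(stored_tb / device_tb) + 1
            failures += sum(rng.random() < annual_failure_rate
                            for _ in range(devices))
            total_cost += ingest_tb_per_year * cost_per_tb
            cost_per_tb *= annual_cost_decline     # devices get cheaper
        return stored_tb, failures, round(total_cost, 2)

    print(simulate())                              # one point in the sweep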
Should we give examples of:
- very large operational installations?
- off-the-shelf components?
- relevant research projects?
Preserving the interpretation mechanism
Once the bits are preserved, how do we know what to do with them? This
of course depends on the format. If it is a PDF document, the bit
stream needs to be interpreted according to the PDF specification; the
same holds for .doc, .xls, JPEG, etc. These are examples of well-known
formats. But there may be ad-hoc formats designed to support a
particular application well. NARA says it must deal with hundreds of
formats (do we have to multiply that by a certain factor to take
multiple versions into account?).
Studying that many formats to decide how to preserve them would be a
tremendous task. Taking examples (like the most frequently used
formats) would be one approach. A better approach may be to study
classes of formats, where all formats in a class exhibit essentially
the same preservation problems.
Classes:
Text – without presentation
Save the bits; but the definition of every one-byte character must be
specified, either in English or by showing the bitmap of each character
in a common font. Of course, the description of how a bitmap is stored
needs to be specified as well, but this can be as simple as “stored as
24 lines of 16 bits, stored line by line, and left to right in a line”.
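A sketch of a decoder for exactly that example layout; the "#"/"."
rendering is arbitrary:

    # Decoder for the layout above: 24 lines of 16 bits, stored line by
    # line, left to right within a line.
    def render_glyph(bitmap: bytes) -> str:
        assert len(bitmap) == 24 * 2               # 16 bits = 2 bytes per line
        rows = []
        for r in range(24):
            word = int.from_bytes(bitmap[2 * r:2 * r + 2], "big")
            rows.append("".join("#" if word & (1 << (15 - c)) else "."
                                for c in range(16)))
        return "\n".join(rows)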
Image
There is always the possibility of agreeing that JPEG is explained in
so many places that decoding algorithms will always exist. This is,
however, dangerous (there are already several variants). The
alternative is to explain the decoding algorithm in the metadata, as it
would be explained in a mathematical book. If this is too complicated,
then it may be better to convert the original bit stream into another
one that may be less compressed but easier to explain. An alternative
is to store an executable program.
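A sketch of such a conversion, producing a raster that is trivial to
explain in the metadata (width, height, then rows of RGB triples, top
to bottom, left to right); it assumes the Pillow library is available,
and the file names are placeholders:

    # Convert a JPEG into an uncompressed, easily explained raster.
    from PIL import Image

    img = Image.open("original.jpg").convert("RGB")
    with open("archived.raster", "wb") as out:
        out.write(img.width.to_bytes(4, "big"))
        out.write(img.height.to_bytes(4, "big"))
        out.write(img.tobytes())                   # 3 bytes per pixel, row-major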
Sound (need to check this with a “sound” expert)
This is a good example where saving the original bit stream is not a
good idea. The useful data is just a list of values, obtained by
sampling the signal at a certain frequency (or two such lists for
stereo). The bit stream, however, is much more complex, involving
redundancy for error correction that comes into effect if the CD gets
slightly damaged. This complexity is not needed for the archive. Tools
exist today to extract the useful data, so conversion is easy. The
metadata need only contain an explanation of the sampling, the shape of
the list (number of bits per value), and the specific frequency.
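A sketch of that extraction using only the Python standard library; the
file name is a placeholder, and the metadata fields mirror the three
items just listed:

    # Extract the bare sample list and the minimal metadata from a WAV file.
    import wave

    with wave.open("recording.wav", "rb") as w:
        metadata = {
            "lists": w.getnchannels(),             # 1 = mono, 2 = stereo
            "bits_per_value": 8 * w.getsampwidth(),
            "sampling_frequency_hz": w.getframerate(),
        }
        samples = w.readframes(w.getnframes())     # the raw value list
    print(metadata)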
Data structures in XML – no presentation
Can be stored as text (as above). That is enough to identify all
individual data elements and to know their tags. The tags generally
have a semantic meaning. That semantic meaning needs to be provided in
the metadata.
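A sketch of how tags found in an archived XML file might be paired with
semantic glosses kept in the metadata; the tags and glosses are
invented examples:

    # Pair each tag in an archived XML file with its recorded meaning.
    import xml.etree.ElementTree as ET

    tag_semantics = {                              # part of the metadata
        "dob": "date of birth, ISO 8601",
        "unit": "military unit at time of discharge",
    }

    root = ET.fromstring("<record><dob>1923-05-01</dob><unit>3rd Army</unit></record>")
    for el in root.iter():
        if el.text and el.text.strip():
            print(el.tag, "=", el.text, "//",
                  tag_semantics.get(el.tag, "meaning not recorded"))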
Data from a relational database – the system is not important
It becomes simply a data structure; its schema must be explained in the
metadata. Semantics about relationships (if hierarchical) may be easier
to explain in an XML world. One possibility is to save relational data
in an XML-like manner, as sketched below.
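A sketch of such an XML-like saving of relational rows; the table,
columns, and data are invented examples:

    # Each row becomes an element whose children are the column values.
    import xml.etree.ElementTree as ET

    columns = ("veteran_id", "name", "discharge_date")   # the schema
    rows = [("V-001", "J. Smith", "1946-03-02")]

    table = ET.Element("table", name="veterans")
    for row in rows:
        rec = ET.SubElement(table, "row")
        for col, value in zip(columns, row):
            ET.SubElement(rec, col).text = value
    print(ET.tostring(table, encoding="unicode"))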
General data structures
Not everything is relational or XML. Many files may have their own
format to represent geographical data, engineering data, or statistical
data. Not everything can be converted to XML! Files would become huge.
So there is a need to specify how to decode these formats as well.
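A sketch of decoding one such ad-hoc format from an archived layout
description; the record layout here (two big-endian 64-bit floats for
latitude/longitude, then a 32-bit elevation in centimeters) is invented
for illustration:

    # Decode fixed-size binary records according to an archived layout.
    import struct

    RECORD = struct.Struct(">ddi")                 # the archived "format spec"

    def decode_records(blob: bytes):
        for offset in range(0, len(blob), RECORD.size):
            lat, lon, elev_cm = RECORD.unpack_from(blob, offset)
            yield {"lat": lat, "lon": lon, "elevation_m": elev_cm / 100}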
Documents with presentation
It is always possible to store an image of the document (stored in an
appropriate format; the same discussion as for images above applies).
In any case, the original bit stream can be archived, together with
information on how to decode it. But these formats can be so
complicated that only a program can do the job. A possibility is to
convert the document once, at ingest, into a (possibly XML) structure
that contains both the tagged data elements of the document and
explicit presentation attributes, down to the images of individual
characters. Conversion could also be to PDF (see the JPEG arguments
above).
Spreadsheet
It is a document with presentation, so the preceding paragraph applies.
But some of the values appearing in cells are specified as formulas
that compute them as a function of the values in other cells. These
formulas could be stored as metadata as a way to convey the
mathematical relationships between cells. Here also, we have to assume
that the description of the formula language is known or documented. If
the archivist decides that a future user should be able to execute
these formulas, then an “executor” program must be archived.
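A sketch of a tiny executor, assuming a formula language restricted to
arithmetic over cell references such as A1; real spreadsheet languages
are far richer, and the cell contents are invented:

    # Evaluate cell formulas of the form "=A1+A2" by recursive substitution.
    import re

    cells = {"A1": 10, "A2": 32, "B1": "=A1+A2", "B2": "=B1*2"}

    def value(ref):
        v = cells[ref]
        if isinstance(v, str) and v.startswith("="):
            expr = re.sub(r"[A-Z]+[0-9]+",
                          lambda m: str(value(m.group())), v[1:])
            return eval(expr)                      # expr is now digits/operators
        return v

    print(value("B2"))                             # prints 84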
Dynamic applications
The last example introduces a case of a dynamic application. Other
examples are video games or interactive applications. They require the
archiving of programs to re-enact the original behavior. Again, you
generally do not need an operational version of DB2 or Lotus Notes to
get access to some data or an e-mail!
Summary
- Some digital objects can be archived simply; but even then, they
require metadata for correct interpretation in the future.
- Some formats may be so standard that it can be assumed that a
decoder will always exist; then the metadata becomes quite simple.
- Some formats can be initially converted so that the metadata to
be archived is much reduced and simplified.
- In some cases, describing how to interpret the data is just too
complex; then archiving a program is the only alternative.
NARA should look at techniques that have been proposed, see how they
would or would not solve the problems for each of these classes, choose
a subset of classes, etc. (see conclusions).
List of proposed techniques (a paragraph for each)
- Using standards
- Relying on metadata
- Emulation
- Conversion on demand
- UVC
- Others?
What each technique does for each format class (a paragraph per
technique).
Conclusion
The challenge is serious but can be addressed. It can be partitioned
into three main components, serving the following functions:
1. Preserving the bits:
This is essentially a storage system. It must be able to store files
(identified by unique ids) safely. It may involve disks, tapes,
automatic staging, automatic copying on the same or new devices.
Product offerings exist that would do most of this; a thin upper
layer may be needed.
2. Archiving metadata:
The objects handled in 1 above are basic files. A logical object at
the application level may consist of several related such objects.
The metadata that keeps these things together must be stored in a
metadata database. That database should also contain descriptive
data that allows quick access to documents through well-known
attributes such as author, date, provenance, changes, authorization,
etc. This is more or less classical content-manager/digital-library
material.
3. Preserving the interpretation of the data:
Providing the above functionalities needs engineering. Preserving the
interpretation of the data for the long term needs more: we have the
core technology, but we still need to build an operational system. It
also requires additional investigation into what the client really
needs to preserve, from which set of formats, and how new
viewing/processing applications may be implemented in the future. An
important assessment is whether or not preserving an old system's
behavior is a requirement.
In summary, a first-phase implementation may comprise
functionalities 1 and 2. In parallel, a serious study – with
prototyping – should be established to identify precise requirements
for 3. The results will fuel a full system implementation. A certain
delay between phases (1, 2) and (3) is not a big problem if it remains
short, since we can make sure programs/systems/expertise do not become
obsolete in such a short period.
We discussed elsewhere RFPs for particular pieces of the whole NARA
problem. The nature of the preservation problem requires a sequence of
steps, such as an initial RFP on "what can you do to solve the problem?",
with a serious on-paper evaluation of how the proposed solution would
solve the problem (this means that NARA should be able to write down
what functionality, or what range of functionality, is expected for
each class of format and/or individual formats). The next-level RFP may
ask for an operational prototype, and the next one for an actual
implementation. At each level the specifications become more concrete;
at the beginning the whole thing is open to suggestions.
I'd think someone who had experience moving large relational databases
forward through time/platforms might be more instructive. To a non-DB
person like me, that problem looks like a mess, and seems more similar to
the harder NARA problems. To a database person, it looks like a quite manageable
problem. Ideally we'd want some organization that has to maintain access over a long
period of time. Really, is this needed? The archive is meant to provide a way to retrieve
the information later on. It does not necessarily mean that the user interface needs to be
exactly the same. But there is a requirement question here. A relational database contains
a certain amount of information. If we can retrieve the same info later using a very
different interface, is that a reasonable archive? I think so. (I essentially agree with
Jerry’s comment below.) But someone could argue (mainly if he is called J. Rothenberg)
that we need to preserve what a user could extract from the data – and that may be
dependent on the way the extraction is specified. So, if the functionality changes, it may
give you a wrong idea of what a user was able to discover 50 years ago.
The Social Security Administration, the Veteran's Administration, and the
IRS are government examples. Even if their solutions are lame, they must
be doing something.
In the corporate world, I'm not sure who that would be. Car companies
(complex parts information, responsible for recalls over a decade-long life
of product)? Chemical companies (must maintain records for liability
purposes over several decades, think Dow Chemical, Corning, asbestos)?
There seems to be a discrepancy between the interest in preservation on the part of many
sectors and the absence of money/resources that the companies want to invest at this
time. Databases and digital libraries exist but rely on the presumed continuity of tools
(databases, PDF, XML) or on business-as-usual kinds of conversions.
This is a point where careful distinctions are likely to pay important
dividends.
NARA certainly has to *ingest* many different types of information. But in
most cases it is probably *not* necessary to archive it in the same form it
was ingested. One reason is that at the point when NARA takes it in, it
becomes read-only (if it wasn't already), and any functional capability
built into the data structure related to updating is not only not needed;
I would imagine that functional capability actually needs to be disabled.
So running old database systems with their full original capability is not,
on the face of it, a requirement. All that is really needed is to extract
the data from the old system so that it can be archived in a way that
allows it to be useful in the future.
Rothenberg archives, together with the data bit stream, the original executable program that was used to
create/manipulate the information in the first place. That program (with operating system) works
only on the old machine (the M2000). The only way to get the data is to run that old program, and
this always requires an emulator. In 1995, Rothenberg suggests that a description of the
architecture of M2000, in all its details, be archived with the data. But building an emulator from
the description of the M2000 architecture is not a simple endeavor. It can be done only if the
description is perfect and complete (a notoriously difficult task in itself). And even then, how do
we know that the emulator works correctly since no machine M2000 exists for comparison?
Later on, he suggests that what should be saved is an emulator specification. This has never, to
my knowledge, been carried out.
Note that the emulation method hinges on the fact that what is saved in D is the original
executable bit stream of the application program that created and/or rendered the document
(including the operating system). Saving the original program is justifiable for behavior
archiving but is overkill for data archiving. In order to archive a collection of pictures, is it
necessary to save the full system that enabled the original user to create, modify, and enhance
pictures when only the final result is of interest for posterity? If Lotus Notes is used to send an
e-mail message in the year 2000, is it necessary to save the whole Lotus Notes environment and
reactivate it in 2100 in order to restore the note contents? But there is an even worse drawback: in
many cases the application program may display the data in a certain way (for example, a
graphical representation) without giving explicit access to the data themselves. In such a case,
it is impossible to export the basic data from the old system to the new one. This is a serious
drawback. In other words, repurposing is impossible.
Our UVC approach
We need to be able to restore, say in the year 2100, on a machine M2100, data generated in 2000
on a machine M2000. In 2000, an application program generates a data file, which must be archived
for the future. In order for the file to be understood in the future, a program P is also archived,
which can decode the data and present it to the client in an understandable form. That program P
is written for a UVC machine.
In 2100, a restore application program reads the bit stream and passes it to a UVC emulator,
which executes the UVC program. During that execution the data is decoded and returned to the
client according to a logical view (or schema). The schema itself must also be archived and easily
readable – we use a similar technique. When read (actually, decoded by the UVC program), all
data items are returned to the user, tagged with a semantic label. This allows for repurposing.
The UVC approach provides a nice way of checking today the correctness of programs that
will be used in the future. If a UVC program is written in 2000, it can be tested on a UVC
interpreter written in 2000 for an M2000 machine. If ten years later, in 2000+10, a new machine
architecture comes up, a new UVC interpreter can easily be written. It can be checked by running
the same UVC program through both the 2000 and 2000+10 UVC interpreters. In other words, any
UVC interpreter can be checked by comparison with the interpreter of the previous generation.
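A toy illustration of that cross-check, with a two-instruction stack
machine standing in for the UVC; both interpreters and the instruction
set are invented for this sketch:

    # Two independently written interpreters for the same "archived"
    # program must produce identical results; generations check each other.
    def interp_2000(program):
        stack = []
        for op, arg in program:
            if op == "push":
                stack.append(arg)
            elif op == "add":
                stack.append(stack.pop() + stack.pop())
        return stack[-1]

    def interp_2010(program):                      # later, differently coded
        vals = []
        for op, arg in program:
            if op == "push":
                vals.append(arg)
            elif op == "add":
                vals[-2:] = [vals[-2] + vals[-1]]
        return vals[-1]

    archived = [("push", 2), ("push", 3), ("add", None)]
    assert interp_2000(archived) == interp_2010(archived)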
For archiving programs: When the object of archival is a program, the M2000 code must be
archived and later executed under emulation. The UVC approach can be naturally extended to
support the archiving of programs, providing for a way to essentially write the emulator in
the year 2000, even if the target machine is not known. Instead of archiving the data bit stream
and a program to decode it, we now store the original program (in the M2000 machine language)
together with an emulator of the M2000 machine, written in the UVC machine language, and any
data file required to run the original application program. In 2100, the UVC emulator interprets
the UVC instructions that emulate the M2000 instructions; that emulation essentially produces an
M2000 equivalent machine, which then executes the original application code. That execution
yields the same results as the original program on an M2000. The metadata must simply contain a
user’s guide on how to run the program. We believe this is the only reasonable way of developing
(today) an emulator for today’s machine when that emulator will be used in a possibly distant
future.
Figure 2: UVC-based Emulation