Digital Preservation – the state of the game on the library lawns
Colin Webb, National Library of Australia
For NAA Digital Futures International Forum 18-19 September 2007
Thank you … Looking over my notes, I realise that I have failed to mention very
important work being done by major players such as the British Library, Library of
Congress, National Library of New Zealand and others. Still, I hope what I will cover
gives you a good sense of where things stand in libraries.
I would like to begin with a story that I hope will explain the title of my presentation:
State of the game on the library lawns.
In the mid 1980s, I was working in what was then called The Australian Archives. As
a conservator and preservation manager, I and my staff had contact with the collection
that few other AA staff had. It was an exciting place to work including the frequent
discovery of small treasures.
At that time, the Australian Archives did not hold exhibitions or have a publications
program, and apart from a small number of researchers and a procession of
undervalued agency registry managers, I suspect most Australians hardly knew we
existed – or if they did, they simply didn’t care. (At least that’s how it felt from the
inside.)
Then in 1988 (I think it was), a Queensland academic “discovered” the vast amount of
copyright registration exhibits held by the Archives, from a buxom Mae West doll of
the 1920s, to photographs of the iconic Australian television puppet Mr Squiggle in
the 1950s. The researcher described it as a “treasure trove”.
Of course we had known about them all along, and had been quietly preserving them
for just such earnest use and delight. However, just as the researcher had “discovered”
these treasures, his story was discovered by a journalist. A cover story in The Bulletin,
double page spreads in the major capital city newspapers: it was fabulous publicity.
We were all so excited. “The eagle has landed!” the headlines said. “We have struck
the mother lode!” And we opened our papers to read how, in the dark and dusty vaults
of the national archives, a researcher had uncovered these hitherto lost treasures. It
had been turned into an Indiana Jones story, with the clever researcher stumbling upon
treasures under the very noses of the archivists. “Dusty vaults” indeed!
I tell this story by way of claiming the right to use what may seem to be a similarly
mischievous metaphor, but one which I hope will illuminate what I see as the state of
play with digital preservation in libraries.
Compared with the conduct of digital preservation in the archives sector in
scrupulously undusty vaults, digital preservation in libraries is a game underway on
the library lawns, in the sunshine and the open air, amongst the dust of activity, the
heat and the flies and the barracking of passers-by. Like many open air games, it may
have the appearance of a leisurely pursuit of little consequence, but the participants
and at least some of the spectators are playing for high stakes.
I use this metaphor not only for the sake of a bit of gentle banter with my former
colleagues, but as a way of introducing what I see as some key contextual features
that characterise the state of play for digital preservation in libraries. I am going to
spend a significant part of my presentation talking about that and the way it influences
how we see digital preservation, the kind of preservation actions we are most
interested in, and where we currently are – at least as I can present it in a short talk like
this.
The first bit of context – and I suppose the most obvious bit – is that “libraries” is as
generic a term as “archives”. They are useful labels suggesting important
commonalities, but there are of course as many different kinds of libraries as there are
different kinds of archival institutions. Consider the different levels and kinds of
interest in digital preservation that might be expected from the National Library of
Australia compared with the Queanbeyan City Library, or the branch library in a
small country town.
However, I guess one thing that libraries have in common is that much of the
information they hold has been made specifically for a public readership. That’s not to
say that libraries only hold published materials – some of our most interesting stuff is
unique and unpublished, but in almost all cases material held by libraries was either
produced, or knowingly acquired, for the purpose of public access. Because of this,
libraries are meant to be places where you can find information, or at least find out
how to locate and get access to information even if the library itself doesn’t hold it.
In the digital world, these roles remain, albeit with some increasingly blurry
boundaries between kinds of information, and increasing opportunities for finding
information without needing to even know where it is physically held.
To follow through on this idea, I am going to focus my remarks about digital
preservation on what we consider to be published digital information, ignoring for the
moment the significant challenges of unpublished digital materials.
At the risk of making this seem incredibly complicated – which one should never do
when talking about digital preservation – I am going to play a game of threes. In fact,
I am going to ask you to keep three sets of threes in mind!
I want to divide the body of published digital materials into three categories which
tend to get treated in different ways, and which are generally at three different levels
of preservation progress:

• Online materials freely available on the World Wide Web.
• Electronic journals published by mainstream publishers, usually as commercial products.
• Physical format digital publications, issued on media such as CDs, DVDs and, once upon a time, floppy disks.
I am also going to talk about three categories of issues that have a huge impact on
how preservation is approached in libraries. The first two, collecting and access, I will
be referring to as context issues, while the third is what would be recognised as
preservation issues themselves.
Finally, in talking about those core preservation issues, I will ask you to keep in mind
three focuses that we know we have to address:

• Data management, including maintenance of data identity and integrity, security and managing risks of media failure.
• Management of intelligibility, to do with technological change and format obsolescence.
• Organisational issues to do with responsibility, sustainability and resources.
I will try to keep track of this complex array, because we need to understand it if we
are going to make meaningful statements about the progress of digital preservation, in
the context in which libraries operate and do their core business.
Collecting
Let me begin with collecting, and speak in particular about the collecting of online
materials. This is an area of strength but also of great challenge for libraries.
When referring to online information, I’m assuming I don’t need to convince anyone
that there is a lot of it. Many national libraries believe their traditional roles of
collecting, preserving and providing access to the national published output must
include something approaching a representative sample of information published
online.
The simple reality understood by most libraries with an interest in web archiving is
that preservation has to start with collecting. Much more than in the print world, the
window of opportunity for collecting online information is usually quite small and
unpredictable. There is no obligation on most online creators to look after what they
produce or even to create it in ways that make it easy to preserve. In the vast world of
web publishing, many creators have interests that either ignore or even run counter to
what a library would consider to be reliable ongoing access.
This means that the chances of information surviving unchanged for very long in the
wild world of the web are very hit and miss, and not necessarily correlated with the
long term value of the information. Most libraries are convinced of the need to move
information from the high risk environment of the live web to a relatively safe place
where it can be protected. Most of us have decided it is better to take this step without
waiting until all preservation problems have been solved and perfect preservation and
content management systems are in place, simply because most of the current material
that needs to be kept will already be lost if we wait.
So, from the earliest days of web archiving in libraries, collecting has been spoken of
as part of preservation. Many national libraries have set up web archives believing
they are contributing to digital preservation efforts. In 2003 a number of us formed
the International Internet Preservation Consortium, a somewhat loosely named group
that was and is committed to taking action that will help its members build and
preserve web archives. Currently the IIPC consists of 26 member institutions, mostly
national libraries but also including a few archives and private web archiving
operations, including the well known Internet Archive.
In Australia, since 1996 the library sector has been building the PANDORA national
collection of Australian online information, initiated by the NLA but now contributed
to by partners including all the mainland state libraries and a number of other
information and collecting agencies. There have been huge achievements in this first
preservation step.
On the other side of the ledger, however, it is easy to feel that context has become the
substance of preservation. The importance of collecting has preoccupied the library
sector, both in terms of allocating funding resources and just as importantly, in the
allocation of brain power and energy. The IIPC provides a good reflection of this
situation. The consortium is an action-oriented one – you can’t belong to it without
committing to involvement in R&D work. To date, the great majority of the R&D
work undertaken within IIPC has been focused on better tools for collecting.
This is a pressure that can only continue, even while we try to shift some weight to the
preservation of what has been collected. Our substantial collecting efforts are
constantly challenged by the changing nature of web publishing. Already I consider
our collecting efforts to be quite inadequate, given that we are still trying to find a
way to automatically gather information from database structured websites – which
surely must account for the majority of what we want to collect! The IIPC funded
work at the NLA on a database gathering tool called Xinq some years ago, but it has
made almost no contribution to actual gathering efforts, and we have barely even
started thinking about how to collect and represent current changes in web use.
Even in the NLA, which has a long standing digital preservation program, the
collecting context bites hard on our preservation effort. We are beginning a thoroughgoing review of our web archiving activities, aimed at getting a better balance
between tightly selected and curated research quality collections such as PANDORA,
and the kind of bulk automated harvesting that will allow future users to see a broad
cross-section of the information on the Australian web. We are doing this after more
than a decade of very active collecting, but in the context of looking for ways to
maintain this momentum in the absence of supportive legislation, and the changing
face of web publishing. At the same time, we are looking for ways we can free up resources
to have an impact on our growing core preservation challenges.
So, collecting is both influential context for, and part of, digital preservation for us.
I will say similar things about the other big context issue for libraries, which is access.
As with collecting, this is an intrinsic driver for libraries – in some ways it is what we
exist for! As an overall driver it is so important that it tends to drain resources away
from core preservation – whilst also contributing to our preservation work.
Libraries have an impressive track record in recognising the opportunities for
federating access to their collections. This kind of thinking goes back a long way in
library land, so it has been a natural step to try to extend this to seamless searching
and getting of resources across and beyond the library sector. This is such an intrinsic
part of the vision of modern libraries that it will always win attention in debates about
priorities for systems development work. As a preservation specialist, I have had to
come to accept this, and to see the positives in it for preservation. It is surely easier to
argue that our preservation efforts are critical when it is recognised that we can’t just
put data away with the hope that no-one will want to access it for decades.
I’m going to choose another type of published digital information as a case study here
– not because access is at all unimportant for online materials: the legal right to
provide access is one of the biggest issues of all – but to illustrate some differences
for non-online materials.
Libraries typically have collections of physical format digital publications, either held
in separate “electronic” collections or scattered through their print collections. There
seem to be far fewer collecting hassles with these materials than with online materials.
Rightly or wrongly, no-one ever talks about their presence in our collections as an act
of preservation in itself, whereas for online materials that language is constantly used.
On the other hand, the preservation of physical format materials is much more closely
linked with access issues. The two preservation drivers for these materials seem to be
the question of whether someone will walk in tomorrow and request this 20 year old disk, and an awareness
that many physical format digital publications come with Technical Protection
Measures (TPMs) specifically designed to control access in ways that will also defeat
preservation copying. We think about these as the critical preservation issues.
We have also come to recognise that good old media failure is a much more pressing
risk that must be addressed urgently. At the NLA we are in the middle of a project to
design efficient workflows to move the content of physical format digital materials to
a much better managed mass storage system as a priority risk management strategy.
Doing that will also challenge us to better define how we are going to guarantee
access.
I have been trying to give you a picture of digital preservation operating in a context
of two powerful drivers that play an important part in how we see digital preservation
in libraries, but also play a contributing – sometimes a complicating – role. For the final
round of this game of threes, I would like to concentrate on what I think of as core
digital preservation issues – data management, technological change and
obsolescence, and organisational issues to do with responsibility, sustainability and
resources.
Some of these issues are quite generic. Good data management practices are good
data management practices, bad ones are bad in almost any context. There may be
some subtle differences, for example with regard to documenting authenticity.
Libraries carry some of that role, but we expect to be challenged less often to prove
the point. On the other hand, because libraries are much less able to impose
obligations or to negotiate standards with creators, format obsolescence is a big issue
for us, especially for the plethora of obscure file formats connected with many of the
publications we collect.
I’m going to refer to a slightly left field case study here, because I think it is still the
truly outstanding example of successful digital preservation in libraries, even though
it uses a model that is not immediately applicable to the other kinds of digital
materials I have mentioned.
Digital publishing had a profound effect on scholarly communications. I remember
when I moved to the National Library in the early 1990s, a debate was just starting to
heat up concerning the preservation role of scholarly publishers. If libraries were
busily replacing print subscriptions with subscriptions to e-journals, who would
maintain a record of scholarly research and publication? If libraries themselves could
no longer take steps to preserve the journals they subscribed to, because they owned
only a licence for access, was there any guarantee that commercial publishers would
do so, and could their commercial instincts be relied upon to perform this over the
long term? A classic preservation responsibility question!
This conundrum was neatly addressed for a growing portion of the world’s scientific
literature by a series of digital preservation agreements between the National Library
of the Netherlands (the KB), and a number of the world’s major scientific publishers.
First negotiated in the early years of this decade, the agreement covers collecting and
access issues, obligating the publishers to deposit their e-journals and defining the
access rights granted to the KB. The model allows the KB to take necessary
preservation action to keep the journal contents securely protected and to maintain
accessibility. This model has provided the KB with significant incentives to do
ground breaking work on emulation and on development of Universal Virtual
Computer concepts in conjunction with other partners. Although they have done
almost nothing in the area of web archiving, the KB are now one of the leading
libraries in terms of digital preservation workflows and readiness. A tremendous
achievement, and their digital preservation agreements with publishers are a gift to all
of us.
Two more case studies – very briefly. With regard to the preservation of online
materials that have already been archived, the IIPC recently asked the NLA to
convene and lead a Preservation Working Group to focus attention on preservation
issues – at last! – and initially to advise on the standards, tools and practices that are
available, or are needed, to preserve web archives. Potentially, a big step forward.
With regard to digital preservation more generally, we have been working with the
university sector, through the Australian Partnership for Sustainable Repositories
(APSR), to develop tools that will enable digital preservation. In particular, we have
been working on a software tool to help repository managers determine the level of
obsolescence risk for the file formats in their care. This tool automatically accesses
information from file format registries such as the TNA PRONOM registry, and helps
repository managers assess whether they have reliable ongoing support for providing
access.
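As a rough illustration only (this is not the APSR tool itself), the kind of check described above might be sketched as follows. The registry snapshot, its field names, the format identifiers and the scoring rule are all hypothetical assumptions made for the example; a real tool would draw this information from a registry such as PRONOM rather than from a hard-coded table.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical registry entry; these fields are assumptions for illustration."""
    name: str
    openly_documented: bool  # is a public specification available?
    supporting_tools: int    # number of known tools that can render the format


# Hypothetical local snapshot of registry data, keyed by made-up identifiers.
REGISTRY = {
    "example/1": FormatRecord("Widely used document format", True, 12),
    "example/2": FormatRecord("Proprietary multimedia format", False, 1),
    "example/3": FormatRecord("Orphaned word processor format", False, 0),
}


def risk_level(format_id: str) -> str:
    """Return a coarse obsolescence risk label for one format identifier."""
    record = REGISTRY.get(format_id)
    if record is None:
        # The registry knows nothing about this format, so no judgement is made.
        return "unknown"
    if record.supporting_tools == 0:
        return "high"
    if record.openly_documented and record.supporting_tools >= 3:
        return "low"
    return "medium"


def summarise(holdings: list[str]) -> Counter:
    """Count how many files in a repository fall into each risk level."""
    return Counter(risk_level(format_id) for format_id in holdings)


if __name__ == "__main__":
    # One identifier per file held; "example/9" is absent from the snapshot.
    holdings = ["example/1", "example/1", "example/2", "example/3", "example/9"]
    for level, count in sorted(summarise(holdings).items()):
        print(f"{level}: {count}")
```

The thresholds here are arbitrary; the point is only that a repository manager gets a summary of where ongoing support looks doubtful, and that formats missing from the registry are reported separately rather than assumed safe.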
I want to bring this presentation to an end by highlighting this kind of work as the
most hopeful sign of where we are in digital preservation in libraries. We are moving
ahead with the development of tools, workflows and preservation planning based on
an end-to-end understanding of what is needed and of the risks we have to address.
We desperately need to critique the tools that are available with a view to making
them work better together. I think we have been saying something similar for some
years, but it has a new and more concrete determination about it.
But for libraries, it will always have a big overburden of context! As well as making
things work for individual libraries, there is the potential to make them work for
communities of libraries. In fact, that is the likely and intended impact of the work
being done by the IIPC, APSR, and so on.
We are also very conscious that the communities which libraries serve are themselves
increasingly active creators and collectors of important digital information, which
they may not have the capacity to store, manage and preserve themselves.
There is obviously still poorly explored potential to work across sectoral boundaries,
recognising where we have different needs and where we can plug what is appropriate
into our sometimes different, sometimes similar workflows.
Is this going to happen? Do we have the dependencies to sustain this and to sustain
the digital collections that future users will expect to have access to?
For the library sector, I would say that we have some of the dependencies, but not all.
We especially worry about resources, skills, and the legislative basis for our collecting
and preservation mandate. Where we are with some of these will depend much on
decisions that we can only ask others to make.
Thank you.