Digital Preservation – the state of the game on the library lawns
Colin Webb, National Library of Australia
For NAA Digital Futures International Forum, 18-19 September 2007

Thank you …

Looking over my notes I realise that I have failed to mention very important work being done by major players such as the British Library, Library of Congress, National Library of New Zealand and others. Still, I hope what I will cover gives you a good sense of where things stand in libraries.

I would like to begin with a story that I hope will explain the title of my presentation: State of the game on the library lawns.

In the mid 1980s, I was working in what was then called The Australian Archives. As a conservator and preservation manager, I and my staff had contact with the collection that few other AA staff had. It was an exciting place to work, with the frequent discovery of small treasures. At that time, the Australian Archives did not hold exhibitions or have a publications program, and apart from a small number of researchers and a procession of undervalued agency registry managers, I suspect most Australians hardly knew we existed – or if they did, they simply didn’t care. (At least that’s how it felt from the inside.)

Then in 1988 (I think it was), a Queensland academic “discovered” the vast amount of copyright registration exhibits held by the Archives, from a buxom Mae West doll of the 1920s, to photographs of the iconic Australian television puppet Mr Squiggle in the 1950s. The researcher described it as a “treasure trove”. Of course we had known about them all along, and had been quietly preserving them for just such earnest use and delight.

However, just as the researcher had “discovered” these treasures, his story was discovered by a journalist. A cover story in The Bulletin, double page spreads in the major capital city newspapers: it was fabulous publicity. We were all so excited. “The eagle has landed!” the headlines said.
“We have struck the mother lode!” And we opened our papers to read how, in the dark and dusty vaults of the national archives, a researcher had uncovered these hitherto lost treasures. It had been turned into an Indiana Jones story, with the clever researcher stumbling upon treasures under the very noses of the archivists. “Dusty vaults” indeed!

I tell this story by way of claiming the right to use what may seem to be a similarly mischievous metaphor, but one which I hope will illuminate what I see as the state of play with digital preservation in libraries. Compared with the conduct of digital preservation in the archives sector in scrupulously undusty vaults, digital preservation in libraries is a game underway on the library lawns, in the sunshine and the open air, amongst the dust of activity, the heat and the flies and the barracking of passers-by. Like many open air games, it may have the appearance of a leisurely pursuit of little consequence, but the participants and at least some of the spectators are playing for high stakes.

I use this metaphor not only for the sake of a bit of gentle banter with my former colleagues, but as a way of introducing what I see as some key contextual features that characterise the state of play for digital preservation in libraries. I am going to spend a significant part of my presentation talking about that and the way it influences how we see digital preservation, the kind of preservation actions we are most interested in, and where we currently are – at least as I can present it in a short talk like this.

The first bit of context – and I suppose the most obvious bit – is that “libraries” is as generic a term as “archives”. They are useful labels suggesting important commonalities, but there are of course as many different kinds of libraries as there are different kinds of archival institutions.
Consider the different levels and kinds of interest in digital preservation that might be expected from the National Library of Australia compared with the Queanbeyan City Library, or the branch library in a small country town.

However, I guess one thing that libraries have in common is that much of the information they hold has been made specifically for a public readership. That’s not to say that libraries only hold published materials – some of our most interesting stuff is unique and unpublished, but in almost all cases material held by libraries was either produced, or knowingly acquired, for the purpose of public access. Because of this, libraries are meant to be places where you can find information, or at least find out how to locate and get access to information even if the library itself doesn’t hold it. In the digital world, these roles remain, albeit with some increasingly blurry boundaries between kinds of information, and increasing opportunities for finding information without needing to even know where it is physically held.

To follow through on this idea, I am going to focus my remarks about digital preservation on what we consider to be published digital information, ignoring for the moment the significant challenges of unpublished digital materials.

At the risk of making this seem incredibly complicated – which one should never do when talking about digital preservation – I am going to play a game of threes. In fact, I am going to ask you to keep three sets of threes in mind!

I want to divide the body of published digital materials into three categories which tend to get treated in different ways, and which are generally at three different levels of preservation progress:

- Online materials freely available on the World Wide Web.
- Electronic journals published by mainstream publishers, usually as commercial products.
- Physical format digital publications, issued on media such as CDs, DVDs and, once upon a time, floppy disks.
I am also going to talk about three categories of issues that have a huge impact on how preservation is approached in libraries. The first two, collecting and access, I will be referring to as context issues, while the third is what would be recognised as preservation issues themselves.

Finally, in talking about those core preservation issues, I will ask you to keep in mind three focuses that we know we have to address:

- Data management, including maintenance of data identity and integrity, security and managing risks of media failure.
- Management of intelligibility, to do with technological change and format obsolescence.
- Organisational issues to do with responsibility, sustainability and resources.

I will try to keep track of this complex array, because we need to understand it if we are going to make meaningful statements about the progress of digital preservation, in the context in which libraries operate and do their core business.

Collecting

Let me begin with collecting, and speak in particular about the collecting of online materials. This is an area of strength but also of great challenge for libraries. When referring to online information, I’m assuming I don’t need to convince anyone that there is a lot of it. Many national libraries believe their traditional roles of collecting, preserving and providing access to the national published output must include something approaching a representative sample of information published online.

The simple reality understood by most libraries with an interest in web archiving is that preservation has to start with collecting. Much more than in the print world, the window of opportunity for collecting online information is usually quite small and unpredictable. There is no obligation on most online creators to look after what they produce or even to create it in ways that make it easy to preserve.
In the vast world of web publishing, many creators have interests that either ignore or even run counter to what a library would consider to be reliable ongoing access. This means that the chances of information surviving unchanged for very long in the wild world of the web are very hit and miss, and not necessarily correlated with the long term value of the information.

Most libraries are convinced of the need to move information from the high risk environment of the live web to a relatively safe place where it can be protected. Most of us have decided it is better to take this step without waiting until all preservation problems have been solved and perfect preservation and content management systems are in place, simply because most of the current material that needs to be kept will already be lost if we wait.

So, from the earliest days of web archiving in libraries, collecting has been spoken of as part of preservation. Many national libraries have set up web archives believing they are contributing to digital preservation efforts. In 2003 a number of us formed the International Internet Preservation Consortium, a somewhat loosely named group that was and is committed to taking action that will help its members build and preserve web archives. Currently the IIPC consists of 26 member institutions, mostly national libraries but also including a few archives and private web archiving operations, including the well known Internet Archive.

In Australia, since 1996 the library sector has been building the PANDORA national collection of Australian online information, initiated by the NLA but now contributed to by partners including all the mainland state libraries and a number of other information and collecting agencies.

There have been huge achievements in this first preservation step. On the other side of the ledger, however, it is easy to feel that context has become the substance of preservation.
The importance of collecting has preoccupied the library sector, both in terms of allocating funding resources and just as importantly, in the allocation of brain power and energy.

The IIPC provides a good reflection of this situation. The consortium is an action-oriented one – you can’t belong to it without committing to involvement in R&D work. To date, the great majority of the R&D work undertaken within IIPC has been focused on better tools for collecting. This is a pressure that can only continue, even while we try to shift some weight to the preservation of what has been collected.

Our substantial collecting efforts are constantly challenged by the changing nature of web publishing. Already I consider our collecting efforts to be quite inadequate, given that we are still trying to find a way to automatically gather information from database structured websites – which surely must account for the majority of what we want to collect! The IIPC funded work at the NLA on a database gathering tool called Xinq some years ago, but it has made almost no contribution to actual gathering efforts, and we have barely even started thinking about how to collect and represent current changes in web use.

Even in the NLA, which has a long standing digital preservation program, the collecting context bites hard on our preservation effort. We are beginning a thoroughgoing review of our web archiving activities, aimed at getting a better balance between tightly selected and curated research quality collections such as PANDORA, and the kind of bulk automated harvesting that will allow future users to see a broad cross-section of the information on the Australian web. We are doing this after more than a decade of very active collecting, but in the context of looking for ways to maintain this momentum in the absence of supportive legislation, and the changing face of web publishing.
At the same time, we are looking for ways we can free up resources to have an impact on our growing core preservation challenges.

So, collecting is both influential context for, and part of, digital preservation for us. I will say similar things about the other big context issue for libraries, which is access. As with collecting, this is an intrinsic driver for libraries – in some ways it is what we exist for! As an overall driver it is so important that it tends to drain resources away from core preservation – whilst also contributing to our preservation work.

Libraries have an impressive track record in recognising the opportunities for federating access to their collections. This kind of thinking goes back a long way in library land, so it has been a natural step to try to extend this to seamless searching and getting of resources across and beyond the library sector. This is such an intrinsic part of the vision of modern libraries that it will always win attention in debates about priorities for systems development work. As a preservation specialist, I have had to come to accept this, and to see the positives in it for preservation. It is surely easier to argue that our preservation efforts are critical when it is recognised that we can’t just put data away with the hope that no-one will want to access it for decades.

I’m going to choose another type of published digital information as a case study here – not because access is at all unimportant for online materials: the legal right to provide access is one of the biggest issues of all – but to illustrate some differences for non-online materials.

Libraries typically have collections of physical format digital publications, either held in separate “electronic” collections or scattered through their print collections. There seem to be far fewer collecting hassles with these materials than with online materials.
Rightly or wrongly, no-one ever talks about their presence in our collections as an act of preservation in itself, whereas for online materials that language is constantly used. On the other hand, the preservation of physical format materials is much more closely linked with access issues. The two preservation drivers for these materials seem to be the question of whether someone will walk in tomorrow and request this 20 year old disk, and an awareness that many physical format digital publications come with Technical Protection Measures (TPMs) specifically designed to control access in ways that will also defeat preservation copying. We think about these as the critical preservation issues.

We have also come to recognise that good old media failure is a much more pressing risk that must be addressed urgently. At the NLA we are in the middle of a project to design efficient workflows to move the content of physical format digital materials to a much better managed mass storage system as a priority risk management strategy. Doing that will also challenge us to better define how we are going to guarantee access.

I have been trying to give you a picture of digital preservation operating in a context of two powerful drivers that play an important part in how we see digital preservation in libraries, but also play a contributing – sometimes a complicating – role.

For the final round of this game of threes, I would like to concentrate on what I think of as core digital preservation issues – data management, technological change and obsolescence, and organisational issues to do with responsibility, sustainability and resources.

Some of these issues are quite generic. Good data management practices are good data management practices, bad ones are bad in almost any context. There may be some subtle differences, for example with regard to documenting authenticity. Libraries carry some of that role, but we expect to be challenged less often to prove the point.
On the other hand, because libraries are much less able to impose obligations or to negotiate standards with creators, format obsolescence is a big issue for us, especially for the plethora of obscure file formats connected with many of the publications we collect.

I’m going to refer to a slightly left field case study here, because I think it is still the truly outstanding example of successful digital preservation in libraries, even though it uses a model that is not immediately applicable to the other kinds of digital materials I have mentioned.

Digital publishing had a profound effect on scholarly communications. I remember when I moved to the National Library in the early 1990s, a debate was just starting to heat up concerning the preservation role of scholarly publishers. If libraries were busily replacing print subscriptions with subscriptions to e-journals, who would maintain a record of scholarly research and publication? If libraries themselves could no longer take steps to preserve the journals they subscribed to, because they owned only a licence for access, was there any guarantee that commercial publishers would do so, and could their commercial instincts be relied upon to perform this over the long term? A classic preservation responsibility question!

This conundrum was neatly addressed for a growing portion of the world’s scientific literature by a series of digital preservation agreements between the National Library of the Netherlands (the KB), and a number of the world’s major scientific publishers. First negotiated in the early years of this decade, the agreements cover collecting and access issues, obligating the publishers to deposit their e-journals and defining the access rights granted to the KB. The model allows the KB to take necessary preservation action to keep the journal contents securely protected and to maintain accessibility.
This model has provided the KB with significant incentives to do ground breaking work on emulation and on development of Universal Virtual Computer concepts in conjunction with other partners. Although they have done almost nothing in the area of web archiving, the KB are now one of the leading libraries in terms of digital preservation workflows and readiness. A tremendous achievement, and their digital preservation agreements with publishers are a gift to all of us.

Two more case studies – very briefly.

With regard to the preservation of online materials that have already been archived, the IIPC recently asked the NLA to convene and lead a Preservation Working Group to focus attention on preservation issues – at last! – and initially to advise on the standards, tools and practices that are available, or are needed, to preserve web archives. Potentially, a big step forward.

With regard to digital preservation more generally, we have been working with the university sector, through the Australian Partnership for Sustainable Repositories (APSR), to develop tools that will enable digital preservation. In particular, we have been working on a software tool to help repository managers determine the level of obsolescence risk for the file formats in their care. This tool automatically accesses information from file format registries such as the TNA PRONOM registry, and helps repository managers assess whether they have reliable ongoing support for providing access.

I want to bring this presentation to an end by highlighting this kind of work as the most hopeful sign of where we are in digital preservation in libraries. We are moving ahead with the development of tools, workflows and preservation planning based on an end-to-end understanding of what is needed and of the risks we have to address. We desperately need to critique the tools that are available with a view to making them work better together.
I think we have been saying something similar for some years, but it has a new and more concrete determination about it. But for libraries, it will always have a big overburden of context!

As well as making things work for individual libraries, there is the potential to make them work for communities of libraries. In fact, that is the likely and intended impact of the work being done by the IIPC, APSR, and so on.

We are also very conscious that the communities which libraries serve are themselves increasingly active creators and collectors of important digital information, which they may not have the capacity to store, manage and preserve themselves. There is obviously still poorly explored potential to work across sectoral boundaries, recognising where we have different needs and where we can plug what is appropriate into our sometimes different, sometimes similar workflows.

Is this going to happen? Do we have the dependencies to sustain this and to sustain the digital collections that future users will expect to have access to? For the library sector, I would say that we have some of the dependencies, but not all. We especially worry about resources, skills, and the legislative basis for our collecting and preservation mandate. Where we are with some of these will depend much on decisions that we can only ask others to make.

Thank you.