Digital Preservation Service Recommendations

Digital Repository Planning & Implementation Team
Tim Donohue (chair), Tom Habing, Bill Mischo, Chris Prom, Beth Sandore, Sarah Shreeves, Tom Teper, John Weible
September 2009

Initial Charge

The University Library is responsible for the long-term management and survival of an increasing amount of digital content that represents the University’s investments in collections and other materials that support the institution’s educational and research mission. While the IDEALS repository supports the deposit by faculty and other researchers of scholarship produced under the auspices of Illinois, the Library must put in place a similar set of technologies and services for the digital content that represents its collections and other materials. The development of a digital preservation repository is an integral component of the Library’s ongoing stewardship responsibility to the University.

In order to plan for a centralized digital preservation service, the University charged a team to assess program and technical needs, within the Library and across stakeholder units and programs. The Digital Repository Planning & Implementation Team was charged with assessing current Library needs, establishing general service requirements and performing an environmental scan to identify possible solutions. Although the singular term “repository” was used to define this team, it became apparent that a digital preservation service would likely consist of one or more disparate systems which work together to manage and curate the Library’s digital content.

Brief Summary of Recommendations

The Digital Repository Planning & Implementation Team recommends that the University Library begin a staged implementation of Fedora (http://www.fedora.info/) as the initial system to support a larger Digital Preservation Service program.
This staged implementation should begin with a more detailed investigation of Fedora’s strengths and weaknesses as compared to the Digital Preservation Service Requirements (See Appendix II) to ensure the system is the most appropriate initial candidate. This detailed investigation should be performed using a small subset of Library digital collections (preferably ones which represent unique test cases).

One known hurdle in implementing Fedora is the extreme flexibility of the system, which requires an institution to make detailed planning decisions regarding how it should be implemented locally (and with which user interface(s)). For this reason, the team also recommends a more detailed investigation of the RODA (http://roda.di.uminho.pt/) user interface to Fedora, built by the National Archives of Portugal. The RODA interface was designed specifically for archives, with long-term digital preservation and related standards as its primary objective. Our hope is that RODA may allow us to more quickly implement an initial Digital Preservation Service for the Library and its stakeholder units and programs.

The team also recommends that the University Library consider using external repositories (such as HathiTrust and DuraCloud) for digital content as appropriate. For example, the Library should consider using HathiTrust as the preservation repository for our book content digitized through the Open Content Alliance.

The team also recommends the formation of a dedicated group within Library IT to plan, implement, and provide ongoing support for the preservation and management of the digital content created and acquired by the Library. See the Proposal for a Repository and Scholarly Communication Services Technical Team for more details.

Software Analysis

The Digital Repository Planning & Implementation Team performed an environmental scan of potential software candidates.
We analyzed each software candidate based on publicly available documentation and/or local expertise with the software. Each software option was compared against the Digital Preservation Service Requirements (See Appendix II) to attempt to discern how closely the software met those needs.

The following software was reviewed as a part of the environmental scan. These software options have generally been grouped into Open Source and Commercially-provided options. A third category, “On Our Radar”, includes a few software options which are worth following as they mature, or looking into further in the future.

Open Source candidates

Of the four open source options analyzed, two of the systems were found to have a very small initial user/support community. DAITSS and irPlus may be worth keeping track of, but neither has a broad user/support community equivalent to that of DSpace or Fedora.

DAITSS (Dark Archive In The Sunshine State) (http://daitss.fcla.edu/)
The software and support are currently managed entirely by developers at the Florida Center for Library Automation (FCLA). Although the software itself is worth keeping an eye on, the lack of a global user community means that new feature requests or immediate bug fixes may require larger amounts of local resources. As it is maintained entirely by FCLA, it’s also possible that the software may still be very specific to that institution’s needs.

DSpace (http://www.dspace.org/)
The software has the largest global user community of any open source repository software. We have local expertise in this environment, as IDEALS runs on DSpace software. However, DSpace has a few major weaknesses. It doesn’t adequately support hierarchical metadata schemas (e.g. MODS and similar schemas) – it only supports its own version of “qualified” Dublin Core and similar structured schemas. In addition, it doesn’t support versioning of files, or tracking relationships between files.
The DuraSpace merger with Fedora potentially provides an even larger user community in the long term.

Fedora (http://www.fedora.info/)
The software has a rather large global user community, mostly based in the USA. We do have some local expertise with this software through research performed by the NDIIPP project (and similar Library projects). It is a highly flexible system, though such flexibility necessitates that an institution make its own local decisions around how to best implement Fedora. It doesn’t come with an out-of-the-box user interface, but several third-party user interfaces are available in a variety of programming languages. Fedora is able to manage multiple versions of files and track relationships between files. Its major weakness is that it can require a number of programmers to install, maintain and customize for local needs. The DuraSpace merger with DSpace potentially provides an even larger user community in the long term. It seems promising (though not definite) that the Fedora architecture may eventually be supported with DSpace as its user interface.

irPlus (University of Rochester) (http://code.google.com/p/irplus/)
The software and support are currently managed entirely by developers at the University of Rochester. The software is still a beta release and not ready for production implementations. The software itself seems to offer features very similar to DSpace, with aspects of a “NetFiles-like” system, allowing editing/sharing of unfinished research. This software seems less like a preservation system, and more like a researcher repository for creating, editing, sharing, and archiving digital research. As it is maintained entirely by the University of Rochester, it’s also possible that the software may still be very specific to that institution’s needs.
RODA (National Archives of Portugal) (http://roda.di.uminho.pt/)
This software is built on top of Fedora, and was specifically designed based on the needs of archives (it is in production at the National Archives of Portugal), with long-term preservation, OAIS compliance and TRAC compliance as primary objectives. Its interfaces concentrate more on the management and digital preservation of objects, but it also has a variety of user interface features, including a slideshow preview, video preview, page turner, timeline (to view all PREMIS events on an object), and a desktop deposit tool. At a quick glance, RODA seems to be one of the more promising open-source digital preservation systems available. Although it was built specifically for the needs of the National Archives of Portugal, it seems to parallel most of the needs we’ve laid out in our list of Digital Preservation Service requirements.

Commercially-Provided candidates

Because of the Library’s current budget situation, most commercially-provided options were immediately discounted because of relative costs, perceived lack of openness and the recommendation that the software “should have a relatively strong base of users / community support within the academic and research library community”. Because most of these systems are not as open as we’d like, our primary analysis only concerned Microsoft Zentity – Microsoft’s open source software, which requires purchase of additional Microsoft software to install and run.

Microsoft Zentity (http://research.microsoft.com/en-us/projects/zentity/)
Although the software requires Microsoft products to install, it is open source and provided free of charge. In the Grainger Engineering Library, we do have expertise in the Microsoft technologies (ASP.Net, SQL Server, etc.) that are used within Zentity. As it is newly released, Zentity does not currently have a large user / support community (other than support provided by Microsoft).
Zentity is currently only a Microsoft Research project, which means there is no guarantee of continued support or eventual release as a Microsoft product.

Other Commercially-Provided Candidates:
- EMC Documentum – http://www.documentum.com/
- HP TRIM software (document & records mgmt) – http://h18000.www1.hp.com/products/software/im/governance_ediscovery/trim/
- IBM Enterprise Content Management – http://www-01.ibm.com/software/data/content-management/
- Oracle Content Management – http://www.oracle.com/products/middleware/content-management/
- Xythos – http://www.xythos.com/

On Our Radar

DuraCloud (http://www.duraspace.org/)
This service is still in the planning stages, but is one of the initial goals of the DuraSpace merger of DSpace and Fedora. “DuraCloud is a hosted service that takes advantage of the cost efficiencies of cloud storage and cloud computing, while adding value to help ensure longevity and re-use of digital content…DuraCloud will be accessible directly as a Web service and also via plug-ins to digital repositories including Fedora and DSpace.” (DuraSpace press release - May 12, 2009)

HathiTrust (http://www.hathitrust.org/)
HathiTrust seems extremely promising, and is worth keeping a close eye on. It is a logical place where we could preserve much of our digitized content, once its API is more fully fleshed out. However, at this time, the Google-based ingest system limits what we can store within HathiTrust.

iRODS (http://www.irods.org/)
Of these three software solutions, iRODS is the only one that is currently available for our usage. However, iRODS is not repository software. Rather, it is more like a data storage/replication layer which could be utilized by a repository management system.
Recommendations

Based on the above software analysis, the Digital Repository Planning & Implementation Team recommends the University Library begin a staged implementation of Fedora as the initial software system to support a larger Digital Preservation Service program. In conjunction with this staged implementation of Fedora, the DRPI team also recommends further investigation of the RODA interface for Fedora to potentially ease the resource burden normally associated with a new Fedora implementation.

Why Fedora?

Fedora seems to be the closest to meeting the Digital Preservation Service Requirements (See Appendix II) detailed by this team. Fedora has a large US academic and research library support community, which looks to increase in size with its recent merger with DSpace into DuraSpace. In addition, many of our CIC colleagues are already implementing Fedora within their Libraries, including (but not limited to) Indiana, Wisconsin, Northwestern and Minnesota. It is the hope of this team that these potential collaborations, along with local experience from the NDIIPP projects, may allow the Library to more easily implement Fedora as the initial software for our Digital Preservation Service program.

Why RODA?

One known hurdle in implementing Fedora is the extreme flexibility of the system, which requires an institution to make detailed planning decisions regarding how it should be implemented locally (and with which user interface(s)). For this reason, the team also recommends a more detailed investigation of the RODA (http://roda.di.uminho.pt/) user interface to Fedora, built by the National Archives of Portugal. The RODA interface was designed specifically for archives, with long-term digital preservation and related standards as its primary objective. Our hope is that RODA may allow us to more quickly implement an initial Digital Preservation Service for the Library and its stakeholder units and programs.
Needed Resources:

In order to implement these recommendations and to provide trusted, stable, and consistent service, the team recommends the formation of a dedicated team within Library IT to plan, implement, and provide ongoing support for the preservation and management of the digital content created and acquired by the Library. See the Proposal for a Repository and Scholarly Communication Services Technical Team for more details.

Staged Implementation Plan for Local Repository Services:

1. Install Fedora and the RODA interface on local library servers for further research.
2. Perform a more detailed analysis of both Fedora and RODA by ingesting several unique digital collections of the Library, chosen based on the Known Content Types / Production Streams (See Appendix I). This analysis should determine (again, based on the Digital Preservation Service Requirements) whether to continue with this software in a Production-level environment. In addition, this analysis should also help determine the general resource cost of ingesting/managing new collections within this environment.
3. Report the results of the detailed analysis back to the Digital Repository Planning & Implementation Team as well as stakeholders including the Preservation Librarian and the Head of DCC. This team will provide a plan for how best to move forward, and a final report of the resources (time, staff, storage, etc.) needed to move to a Production environment and for ongoing maintenance.
4. Assuming results are positive, install the software in a Production environment within the Library.
5. Begin ingesting digital collections into the Production environment. The order of ingestion will need to be decided.

Appendix I – Known Content Types / Production Streams

The following represents a rough list of digital content creation streams and digital content types known to exist within the Library.
This list was generated by the Digital Repository Planning and Implementation team during a brainstorming session and should not be considered complete or finalized. The list is only meant as a reference of the types of digital content which may need to be handled by the Digital Preservation Service program.

Digital Content Production Streams
- Faculty-generated research (e.g. IDEALS)
- Campus records management
- In-library production
- Purchased content
- Harvested content (non-commercial content created elsewhere)

Digital Content Types (alphabetically)
- Audio
- CAD drawings
- Databases
- Datasets
- Dynamic content
- Email
- GIS
- Metadata files
- Simulation models
- Software
- Still Image
- Text-based
- Video
- Websites

Appendix II – Digital Preservation Service Requirements

The following represents a rough set of recommended “requirements” for a library-wide digital preservation service. Please note that these are not requirements for a digital access system, but for a system (or set of systems) that will allow staff to efficiently and securely manage and preserve digital objects which may be disseminated via a variety of access applications.

In addition, these “requirements” may not be specific features of a single software system, but may be enforced by multiple systems or by local policies and procedures. Therefore, all requirements need not be met by a single software system, but are worth keeping in mind during analysis of various software options.

These requirements were presented to interested stakeholders within the Library at a Brown Bag lunch on June 17, 2009. Minor revisions/additions were made as a result of that meeting.

(Note: The ordering of these requirements is entirely random. Requirements have not been ranked in terms of importance, etc.)
Object Storage
- Ability to store all types of digital files (including text, image, audio, video, and complex objects)
- Ability to name and track relationships between different versions of files (for example, a set of JPEG 2000 images and a raw OCR text file derived from those images) or several files which together make up a complex/compound object
- Ability to bulk ingest and export large sets of files using standard schemas (such as METS, or similar)
- Easy deposit system for content (preferably easy for anyone, not just developers)

Metadata
- Ability to store metadata in multiple schemas and maintain relationships between metadata and files
- Ability to specify a least common denominator metadata schema for descriptive metadata (for a common searching / browsing interface)
- Support for Unicode for metadata
- Support for XML for metadata
- Support for provenance metadata (what changes were made, who made them, when, why, how, etc.), and plenty of room for growth of this metadata over time
- Support for “group-level” metadata (across groups of objects or collections of objects)
- Support for record-keeping or administrative metadata
- Support for batch update/replace of metadata
- Support for versioning of metadata

Search / Browse
- Support for both known item and keyword searching, as well as browsing for files
- Support for searching/browsing of metadata fields common to all (or most) objects
- Support for searching/browsing of content types or MIME types

Reports
- Support for report generation based on any “saved search” (e.g. reports of all files of particular content types / MIME types, or reports based on date added, etc.)
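As an illustration of the kind of report described under Reports, the following is a minimal Python sketch that counts files per MIME type. It is not drawn from Fedora, RODA, or any other system reviewed here. The function name `mime_type_report` is hypothetical, and the sketch assumes that guessing the type from the file extension is good enough for a first-pass report; a production service would more likely rely on format identification recorded at ingest.

```python
import mimetypes
from collections import Counter


def mime_type_report(filenames):
    """Count files per guessed MIME type.

    Types are guessed from the file extension only (an assumption of
    this sketch); names whose type cannot be guessed are counted
    under "unknown".
    """
    counts = Counter()
    for name in filenames:
        guessed, _encoding = mimetypes.guess_type(name)
        counts[guessed or "unknown"] += 1
    return dict(counts)
```

Calling `mime_type_report` on the file names returned by a saved search would give a simple per-type tally that could be written out as a report.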
Preservation activities
- Support for fixity checks and virus scanning (either external or internal to system), preferably of both files and metadata
- Ability to implement simple and scalable backup and restore procedures

Programmatic activities
- Support for a simple, standard API which can be used across all types of objects
- Well documented/supported system and API (preferably with a larger, active user community)
- API which supports enumeration of files and their contents (to perform actions or format transformations on many files in bulk)

Authorization / Authentication
- Ability to manage access controls to objects as well as sets of objects at appropriate granularity

Administration / Logging
- Administrative layer to log changes to objects at an appropriately granular level
- Event logging and ability to monitor any modifications within the system

Other
- Scalable system and sets of procedures and policies
- Preferably, system should have a relatively strong base of users / community support within the academic and research library community
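
To make the fixity-check requirement listed under Preservation activities more concrete, the following is a minimal Python sketch of a check performed external to any repository system. It assumes that a SHA-256 digest was recorded for each file at ingest and kept in a simple manifest (a dictionary mapping file paths to digests). The function names and the manifest layout are hypothetical and are not taken from any of the software reviewed in this report.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Compare digests recorded at ingest against the files on disk.

    `manifest` maps a file path to its recorded digest (a hypothetical
    layout). Returns a dict of path -> "ok", "changed", or "missing".
    """
    results = {}
    for path, expected in manifest.items():
        if not Path(path).is_file():
            results[path] = "missing"
        elif sha256_of(path) == expected:
            results[path] = "ok"
        else:
            results[path] = "changed"
    return results
```

Run on a schedule, a check of this kind would flag files whose contents no longer match what was deposited. The same approach could be applied to metadata files, in line with the stated preference for checking both files and metadata.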