Digital Preservation Service Requirements

Digital Preservation Service Recommendations
Digital Repository Planning & Implementation Team
Tim Donohue (chair), Tom Habing, Bill Mischo, Chris Prom, Beth Sandore, Sarah Shreeves, Tom Teper, John Weible
September 2009
Initial Charge
The University Library is responsible for the long-term management and survival of an increasing
amount of digital content that represents the University’s investments in collections and other materials
that support the institution’s educational and research mission. While the IDEALS repository supports
the deposit by faculty and other researchers of scholarship produced under the auspices of Illinois, the
Library must put in place a similar set of technologies and services for the digital content that represents
its collections and other materials. The development of a digital preservation repository is an integral
component of the Library’s ongoing stewardship responsibility to the University. In order to plan for a
centralized digital preservation service, the University charged a team to assess program and technical
needs, within the Library and across stakeholder units and programs.
The Digital Repository Planning & Implementation Team was charged with assessing current Library
needs, establishing general service requirements and performing an environmental scan to identify
possible solutions. Although the singular term “repository” was used to define this team, it became
apparent that a digital preservation service would likely consist of one or more disparate systems which
work together to manage and curate the Library’s digital content.
Brief Summary of Recommendations
The Digital Repository Planning & Implementation Team recommends that the University Library begin a
staged implementation of Fedora (http://www.fedora.info/) as the initial system to support a larger
Digital Preservation Service program. This staged implementation should first begin with a more
detailed investigation of Fedora’s strengths and weaknesses as compared to the Digital Preservation
Service Requirements (See Appendix II) to ensure the system is the most appropriate initial candidate.
This detailed investigation should be performed using a small subset of Library digital collections
(preferably ones which represent unique test cases).
One known hurdle in implementing Fedora is the extreme flexibility of the system, which requires an
institution to make detailed planning decisions regarding how it should be implemented locally (and
with which user interface(s)). For this reason, the team also recommends a more detailed investigation
of the RODA (http://roda.di.uminho.pt/) user interface to Fedora, built by the National Archives of
Portugal. The RODA interface was designed specifically for archives, with long-term digital preservation
and related standards as its primary objective. Our hope is that RODA may allow us to more quickly
implement an initial Digital Preservation Service for the Library and its stakeholder units and programs.
The team also recommends that the University Library consider using external repositories (such as
HathiTrust and DuraCloud) for digital content as appropriate. For example, the Library should consider
using HathiTrust as the preservation repository for our book content digitized through the
Open Content Alliance.
The team also recommends the formation of a dedicated group within Library IT to plan, implement, and
provide ongoing support for the preservation and management of the digital content created and
acquired by the Library. See the Proposal for a Repository and Scholarly Communication Services
Technical Team for more details.
Software Analysis
The Digital Repository Planning & Implementation Team performed an environmental scan of potential
software candidates. We analyzed each software candidate based on publicly available documentation
and/or knowledge based on local expertise with the software. Each software option was compared
against the Digital Preservation Service Requirements (See Appendix II) to attempt to discern how closely
the software met those needs.
The following software was reviewed as part of the environmental scan. The options have
been grouped into Open Source and Commercially-Provided candidates. A third category, On Our
Radar, includes a few options which are worth following as they mature or investigating
further in the future.
Open Source candidates
Of the four open source options analyzed, two of the systems were found to have a very small
initial user/support community. DAITSS and irPlus may be worth keeping track of, but neither
has the broad user/support community equivalent to DSpace or Fedora.
DAITSS (Dark Archive In The Sunshine State) (http://daitss.fcla.edu/)
The software and its support are currently managed entirely by developers at the Florida Center for
Library Automation (FCLA). Although the software itself is worth keeping an eye on, the lack of
a global user community means that new feature requests or immediate bug fixes may require
larger amounts of local resources. As it is maintained entirely by FCLA, it’s also possible that the
software may still be very specific to that institution’s needs.
DSpace (http://www.dspace.org/)
The software has the largest global user community of any open source repository software.
We have local expertise in this environment, as IDEALS runs on DSpace software. However,
DSpace has a few major weaknesses. It doesn’t adequately support hierarchical metadata
schemas (e.g. MODS and similar schemas); it only supports its own version of “qualified”
Dublin Core and similarly structured schemas. In addition, it doesn’t support versioning of files or
tracking relationships between files. The DuraSpace merger with Fedora potentially provides an
even larger user community in the long term.
Fedora (http://www.fedora.info/)
The software has a rather large global user community, mostly based in the USA. We do have
some local expertise with this software through research performed by the NDIIPP project (and
similar Library projects). It is a highly flexible system, though such flexibility necessitates that an
institution make its own local decisions around how to best implement Fedora. It doesn’t come
with an out-of-the-box user interface, but several third-party user interfaces are available in a
variety of programming languages. Fedora is able to manage multiple versions of files and track
relationships between files. Its major weakness is that it can require a number of programmers
to install, maintain and customize for local needs. The DuraSpace merger with DSpace
potentially provides an even larger user community in the long term. It seems promising
(though not definite) that the Fedora architecture may eventually be supported with DSpace as
its user interface.
irPlus (University of Rochester) (http://code.google.com/p/irplus/)
The software and its support are currently managed entirely by developers at the University of
Rochester. The software is still a beta release and not ready for production implementations.
The software itself seems to offer features very similar to DSpace, with aspects of a “NetFiles-like”
system allowing editing/sharing of unfinished research. This software seems less like a
preservation system, and more like a researcher repository for creating, editing, sharing, and
archiving digital research. As it is maintained entirely by the University of Rochester, it’s also
possible that the software may still be very specific to that institution’s needs.
RODA (National Archives of Portugal) (http://roda.di.uminho.pt/)
This software is built on top of Fedora, and was specifically designed based on the needs of
archives (it is in production at the National Archives of Portugal), with long-term preservation,
OAIS compliance and TRAC compliance as primary objectives. Its interfaces concentrate more
on the management and digital preservation of objects. However, it also has a variety of useful user
interface features including: a slideshow preview, video preview, page turner, timeline (to view
all PREMIS events on an object), and a desktop deposit tool. At a quick glance, RODA seems to
be one of the more promising open-source digital preservation systems available. Although it
was built specifically for the needs of the National Archives of Portugal, it seems to parallel most of
the needs we’ve set out in our list of Digital Preservation Service Requirements.
Commercially-Provided candidates
Because of the current Library budget situation, most commercially-provided options were
immediately discounted because of relative costs, perceived lack of openness and the
recommendation that the software “should have a relatively strong base of users / community
support within the academic and research library community”. Because most of these systems
are not as open as we’d like, our primary analysis only concerned Microsoft Zentity – Microsoft’s
open source software which requires purchase of additional Microsoft software to install and
run.
Microsoft Zentity (http://research.microsoft.com/en-us/projects/zentity/)
Although the software requires Microsoft products to install, it is open source and provided free
of charge. In the Grainger Engineering Library, we do have expertise in the Microsoft
technologies (ASP.Net, SQL Server, etc.) that are used within Zentity. As it is newly released,
Zentity does not currently have a large user / support community (other than support provided
by Microsoft). Zentity is currently only a Microsoft Research project, which means there is no
guarantee of continued support or eventual release as a Microsoft product.
Other Commercially-Provided Candidates:
EMC Documentum – http://www.documentum.com/
HP TRIM software (document & records mgmt) – http://h18000.www1.hp.com/products/software/im/governance_ediscovery/trim/
IBM Enterprise Content Management – http://www-01.ibm.com/software/data/content-management/
Oracle Content Management – http://www.oracle.com/products/middleware/content-management/
Xythos – http://www.xythos.com/
On Our Radar
DuraCloud (http://www.duraspace.org/)
This service is still in the planning stages, but is one of the initial goals of the DuraSpace merger
of DSpace and Fedora. “DuraCloud is a hosted service that takes advantage of the cost
efficiencies of cloud storage and cloud computing, while adding value to help ensure longevity
and re-use of digital content…DuraCloud will be accessible directly as a Web service and also via
plug-ins to digital repositories including Fedora and DSpace.” (DuraSpace press release - May 12,
2009)
HathiTrust (http://www.hathitrust.org/)
HathiTrust seems extremely promising, and is worth keeping a close eye on. It is a logical place
where we could preserve much of our digitized content, once its API is more fully fleshed out.
However, at this time, the Google-based Ingest system limits what we can store within
HathiTrust.
iRODS (http://www.irods.org/)
Of these three options, iRODS is the only one that is currently available for our use.
However, iRODS is not repository software. Rather, it is more like a data storage/replication
layer which could be utilized by a repository management system.
Recommendations
Based on the above software analysis, the Digital Repository Planning & Implementation Team
recommends the University Library begin a staged implementation of Fedora as the initial software
system to support a larger Digital Preservation Service program. In conjunction with this staged
implementation of Fedora, the team also recommends further investigation of the RODA interface
for Fedora to potentially ease the resource burden normally associated with a new Fedora
implementation.
Why Fedora?
Fedora seems to be the closest to meeting the Digital Preservation Service Requirements (See Appendix
II) detailed by this team. Fedora has a large US academic and research library support community,
which looks set to grow with its recent merger with DSpace into DuraSpace. In addition, many of
our CIC colleagues are already implementing Fedora within their libraries, including (but not limited to)
Indiana, Wisconsin, Northwestern and Minnesota. It is the hope of this team that these potential
collaborations, along with local experience from the NDIIPP projects, may allow the Library to more
easily implement Fedora as the initial software for our Digital Preservation Service program.
Why RODA?
One known hurdle in implementing Fedora is the extreme flexibility of the system, which requires an
institution to make detailed planning decisions regarding how it should be implemented locally (and
with which user interface(s)). For this reason, the team also recommends a more detailed investigation
of the RODA (http://roda.di.uminho.pt/) user interface to Fedora, built by the National Archives of
Portugal. The RODA interface was designed specifically for archives, with long-term digital preservation
and related standards as its primary objective. Our hope is that RODA may allow us to more quickly
implement an initial Digital Preservation Service for the Library and its stakeholder units and programs.
Needed Resources:
In order to implement these recommendations and to provide trusted, stable, and consistent service,
the team recommends the formation of a dedicated team within Library IT to plan, implement, and
provide ongoing support for the preservation and management of the digital content created and
acquired by the Library. See the Proposal for a Repository and Scholarly Communication Services
Technical Team for more details.
Staged Implementation Plan for Local Repository Services:
1. Install Fedora and the RODA interface on local Library servers for further research.
2. Perform a more detailed analysis of both Fedora and RODA by ingesting several unique digital
collections of the Library, chosen based on the Known Content Types / Production Streams (See
Appendix I). This analysis should determine (again, based on the Digital Preservation Service
Requirements) whether to continue with this software in a Production-level environment. In
addition, this analysis should help determine the general resource cost of
ingesting/managing new collections within this environment.
3. Report the results of the detailed analysis back to the Digital Repository Planning & Implementation
Team as well as stakeholders including the Preservation Librarian and the Head of DCC. This team will
provide a plan for how best to move forward, and a final report of the resources (time, staff,
storage, etc.) needed to move to a Production environment and for ongoing maintenance.
4. Assuming results are positive, install the software in a Production environment within the Library.
5. Begin ingesting digital collections into the Production environment, in an order that will need to
be decided.
Appendix I – Known Content Types / Production Streams
The following represents a rough list of digital content creation streams and digital content types known
to exist within the Library. This list was generated by the Digital Repository Planning and
Implementation team during a brainstorming session and should not be considered complete or
finalized. The list is only meant as a reference of the types of digital content which may need to be
handled by the Digital Preservation Service program.
Digital Content Production Streams
Faculty-generated research (e.g. IDEALS)
Campus records management
In-library production
Purchased content
Harvested content (non-commercial content created elsewhere)
Digital Content Types (alphabetically)
Audio
CAD drawings
Databases
Datasets
Dynamic content
Email
GIS
Metadata files
Simulation models
Software
Still Image
Text-based
Video
Websites
Appendix II – Digital Preservation Service Requirements
The following represents a rough set of recommended “requirements” for a library-wide digital
preservation service. Please note that these are not requirements for a digital access system, but for a
system (or set of systems) that will allow staff to efficiently and securely manage and preserve digital
objects which may be disseminated via a variety of access applications. In addition, these
“requirements” may not be specific features of a single software system, but may be enforced by
multiple systems or by local policies and procedures. Therefore, all requirements need not be met by a
single software system, but are worth keeping in mind during analysis of various software options.
These requirements were presented to interested stakeholders within the Library at a Brown Bag lunch
on June 17, 2009. Minor revisions/additions were made as a result of that meeting.
(Note: The ordering of these requirements is entirely random. Requirements have not been ranked in
terms of importance, etc.)
Object Storage
Ability to store all types of digital files (including text, image, audio, video, and complex objects)
Ability to name and track relationships between different versions of files (for example, a set of JPEG 2000 images and a raw OCR text file derived from those images) or several files which together make up a complex/compound object
Ability to bulk ingest and export large sets of files using standard schemas (such as METS, or similar)
Easy deposit system for content (preferably easy for anyone, not just developers)
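As a rough illustration of the relationship-tracking requirement above, the following Python sketch records relationships as simple subject/predicate/target statements. It is only a sketch under stated assumptions: the class names, file names and predicate labels (isPartOf, isDerivativeOf) are hypothetical and are not drawn from Fedora or any other system reviewed in this report.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One statement of the form: <subject> <predicate> <target>."""
    subject: str
    predicate: str
    target: str


@dataclass
class RelationshipIndex:
    """In-memory store of relationships between files and objects (hypothetical)."""
    statements: list[Relationship] = field(default_factory=list)

    def relate(self, subject: str, predicate: str, target: str) -> None:
        self.statements.append(Relationship(subject, predicate, target))

    def subjects_of(self, predicate: str, target: str) -> list[str]:
        """Return every subject linked to `target` by `predicate`, in insertion order."""
        return [s.subject for s in self.statements
                if s.predicate == predicate and s.target == target]


def build_example() -> RelationshipIndex:
    """A compound object: two page images plus an OCR file derived from them."""
    index = RelationshipIndex()
    for page in ("page-001.jp2", "page-002.jp2"):
        index.relate(page, "isPartOf", "book-42")
    index.relate("book-42-ocr.txt", "isDerivativeOf", "page-001.jp2")
    index.relate("book-42-ocr.txt", "isDerivativeOf", "page-002.jp2")
    index.relate("book-42-ocr.txt", "isPartOf", "book-42")
    return index
```

Calling build_example().subjects_of("isPartOf", "book-42") lists every file that makes up the compound object, while the isDerivativeOf statements tie the OCR text back to the images it came from.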
Metadata
Ability to store metadata in multiple schemas and maintain relationships between metadata and files
Ability to specify a least common denominator metadata schema for descriptive metadata (for a common searching / browsing interface)
Support for Unicode for metadata
Support for XML for metadata
Support for provenance metadata (what changes were made, who made them, when, why, how, etc.), and plenty of room for growth of this metadata over time
Support for “group-level” metadata (across groups of objects or collections of objects)
Support for record-keeping or administrative metadata
Support for batch update/replace of metadata
Support for versioning of metadata
Search / Browse
Support both known item and keyword searching, as well as browsing for files
Support searching/browsing of metadata fields common to all (or most) objects
Support searching/browsing of content types or MIME types
Reports
Support report generation based on any “saved search” (e.g. reports of all files of particular content types / MIME types, or reports based on date added, etc.)
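To make the reporting requirement above more concrete, here is a minimal Python sketch that counts files by MIME type. It assumes the MIME type can be guessed from the file extension using Python's standard mimetypes module; a real service would more likely rely on format identification performed at ingest. The function names and file names are hypothetical.

```python
import mimetypes
from collections import Counter
from typing import Iterable


def mime_type_of(filename: str) -> str:
    """Guess a MIME type from the file extension; fall back to a generic type."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def report_by_mime_type(filenames: Iterable[str]) -> Counter:
    """Count how many of the given files fall under each MIME type."""
    return Counter(mime_type_of(name) for name in filenames)


def files_of_type(filenames: Iterable[str], mime_type: str) -> list[str]:
    """A simple stand-in for a saved search: all files of one MIME type."""
    return [name for name in filenames if mime_type_of(name) == mime_type]
```

For example, report_by_mime_type(["scan-01.jpg", "scan-02.jpg", "notes.txt"]) groups the two JPEG images under image/jpeg and the text file under text/plain.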
Preservation activities
Support for fixity checks and virus scanning (either external or internal to system), preferably of both files and metadata
Ability to implement simple and scalable backup and restore procedures
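A fixity check of the kind listed above can be sketched in Python as follows: a checksum is recorded for each file at ingest and later recomputed and compared. This is an illustration only. The choice of SHA-256, the manifest layout (relative path mapped to hex digest) and all function and file names are assumptions, not features of any system reviewed here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest: dict[str, str], root: Path) -> dict[str, str]:
    """Compare recorded checksums with freshly computed ones.

    Returns a mapping of relative path -> "ok", "changed" or "missing".
    """
    results = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            results[relative_path] = "missing"
        elif sha256_of(target) == expected:
            results[relative_path] = "ok"
        else:
            results[relative_path] = "changed"
    return results


def demo_fixity() -> dict[str, str]:
    """Build a tiny store, alter one file, and run the check."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "page1.txt").write_bytes(b"raw OCR text")
        (root / "page2.txt").write_bytes(b"more OCR text")
        manifest = {name: sha256_of(root / name) for name in ("page1.txt", "page2.txt")}
        manifest["page3.txt"] = "0" * 64  # recorded in the manifest but never stored
        (root / "page2.txt").write_bytes(b"silently altered")
        return check_fixity(manifest, root)
```

The same comparison could be applied to metadata files, since the requirement asks for fixity checking of both files and metadata.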
Programmatic activities
Support for a simple, standard API which can be used across all types of objects
Well documented/supported system and API (preferably with a larger, active user community)
API which supports enumeration of files and their contents (to perform actions or format transformations on many files in bulk)
Authorization / Authentication
Ability to manage access controls to objects as well as sets of objects at appropriate granularity
Administration / Logging
Administrative layer to log changes to objects at an appropriately granular level
Event logging and ability to monitor any modifications within the system
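One possible shape for the event logging described above is an append-only log of per-object events, each recording what happened, who did it, when and why. The Python sketch below is an assumption-laden illustration: the field names are hypothetical and only loosely inspired by the kind of information PREMIS events carry, and the log is held in memory rather than in durable storage.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One logged change to an object (hypothetical field names)."""
    object_id: str
    event_type: str
    agent: str
    reason: str
    timestamp: str


class EventLog:
    """Append-only log; each entry is kept as one JSON line."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def record(self, object_id: str, event_type: str, agent: str,
               reason: str, when: Optional[datetime] = None) -> Event:
        """Append an event, stamping it with the current UTC time by default."""
        moment = when or datetime.now(timezone.utc)
        event = Event(object_id, event_type, agent, reason, moment.isoformat())
        self._lines.append(json.dumps(asdict(event), sort_keys=True))
        return event

    def history(self, object_id: str) -> list[Event]:
        """Return all events for one object, oldest first."""
        events = [Event(**json.loads(line)) for line in self._lines]
        return [e for e in events if e.object_id == object_id]
```

Because entries are only ever appended, the history of an object can be replayed at any time to see every modification made to it.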
Other
Scalable system and sets of procedures and policies
Preferably, system should have a relatively strong base of users / community support within the academic and research library community