Open Source Options for Digital Curation Library 2.012 October 4, 2012

Christinger Tomer
University of Pittsburgh
Definitions of Digital Curation
According to the Digital Curation Centre in
the U.K., digital curation involves
maintaining, preserving and adding value to
digital research data throughout its lifecycle
and covering the stewardship of data from
the point of conceptualisation to its eventual
disposal. It is based on the presumption that
such data has multiple uses, including uses in
other contexts.
An Illustration of the DCC Curation
Lifecycle Model
See http://www.dcc.ac.uk/resources/curation-lifecycle-model
The Role of Libraries in Digital
Curation
Heidorn argues that "[i]ncreasingly, data are being
recognized as first-class intellectual objects that can
undergo quality checks, peer review, distribution, and
reuse. The reuse of data contributes as much to society
as the reuse of a concept in a journal article. The data
set can be cited and contribute to the reputation of the
creator of the data for good or ill." He goes on to assert
that "[l]ibraries ... have a duty to society to collect,
preserve, and disseminate the intellectual output of the
society—including this data." (From P. Bryan Heidorn
(2011): The Emerging Role of Libraries in Data Curation
and E-science, Journal of Library Administration,
51:7-8, 662-672.)
Key Factors in Digital Curation
• Identity -- "Identity is contextual: some objects are associated with
information that allows identification only within a limited context (e.g., an
object may be uniquely identified only within the context of objects residing
on the same server), while others have enough information to make them
globally identifiable (e.g., a global identifier such as a GUID or ISBN)."
See: Sally Vermaaten, et al. Identifying Threats to Successful Digital
Preservation: the SPOT Model for Risk Assessment. D-Lib Magazine 18
(September/October 2012): 8.
• Authenticity and Understandability
o Evaluation of the understandability of data requires that there be
sufficient context (documentation, metadata, or provenance)
describing the data, and that the data is usable.
• Persistence
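The distinction between a locally scoped and a globally scoped identifier can be illustrated with a minimal sketch. The class and field names below (CuratedObject, local_id, global_id) are hypothetical and are not drawn from any of the systems discussed here; the sketch assumes a random UUID is an acceptable stand-in for a GUID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CuratedObject:
    """Hypothetical record carrying both a local and a global identifier."""
    local_id: str  # unique only among objects on the same server
    title: str
    # Random (version 4) UUID, used here as a stand-in for a GUID.
    global_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# Two servers may each hold an object numbered "42"; the local identifiers
# collide as soon as the collections are merged, the global ones do not.
on_server_a = CuratedObject(local_id="42", title="Field notes")
on_server_b = CuratedObject(local_id="42", title="Survey data")

same_local = on_server_a.local_id == on_server_b.local_id
same_global = on_server_a.global_id == on_server_b.global_id
```

In practice a repository would more likely assign a registered persistent identifier than a bare UUID; the sketch only shows why context-bound identifiers are not enough once objects leave their original server.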
More Key Factors
• Renderability -- "the property that a digital object is able to be used
in a way that retains the object's significant characteristics,"
meaning that the hardware and software necessary to render the
object are available or may be reproduced through emulation.
• Integrity
o Integrity of data assumes that the data can be proven to be
identical, at the bit level, to some prior accepted or verified
state. Data integrity may be required for usability,
understandability, authenticity, trust, and thus overall quality.
• Access and Usability
o Data Generators
o Data Seekers
o Data Consumers
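The integrity factor above is commonly checked in practice by comparing cryptographic digests. The following is a minimal sketch, assuming SHA-256 is an acceptable digest and that a digest was recorded when the object was in its accepted state; the function names sha256_of and is_unchanged are illustrative only. A matching digest is strong evidence of bit-level identity rather than a formal proof.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with one recorded at a verified state."""
    return sha256_of(path) == recorded_digest
```

A curator would record the digest at ingest and rerun is_unchanged on a schedule; any mismatch signals that the stored bits no longer match the verified state.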
Archon
Archon, which has been developed by a group from the
University of Illinois, supports the creation of records
conforming to MARC and EAD, as well as their import
and export.
Archon's designers refer to it as a "simple
archiving" system, but its effective use does
require a working knowledge of standards for
archival description. Its preview feature is
especially effective in the treatment of visual
materials.
ICA AToM
ICA AToM allows creators to build what are
effectively compound documents and users
to scroll through thumbnails of the
documents on the interface.
Artefactual Systems in
British Columbia is
developing ICA AToM
2.0, which will be
available as open source,
community-supported
software and as a fee-based service.
ResourceSpace
Document View under ResourceSpace
CWIS
Islandora
The idea underlying Islandora is that a creator places
an object in the repository, Fedora Commons, and
then links that object to other materials, e.g., text
and images, that are mounted through the CMS,
which is a Drupal instance.
Islandora is a hybrid, combining the Fedora
Commons repository system as the back end
with Drupal, the LAMP-based CMS, as the
front end. This hybridized approach is gaining
in popularity among developers, who believe
that successful design must provide a place for
narrative treatments.
Omeka
Omeka is based on the
"LAMP" architecture. Perhaps
its most important feature is
its modular design.
Omeka's Modularity
Omeka supports a
wide array of plugins
that have been
designed to enhance
the functionality of the
system. In this
illustration, one of the
examples is the
Creative Commons
Chooser, which allows
the creator of an object
to select the
appropriate license
from the entry
interface.
Omeka.net
DSpace
DSpace is based on Java and Apache Tomcat and will run
with equal facility on *nix or Windows systems.
ePrints3
ePrints is another LAMP-based system,
distinguished by its reliance on Perl and by its popularity,
which owes much to its ease of use, particularly in
the generation of metadata.
DSpace, with Manakin Interface
HubZero Client Interface
HubZero, which was developed at Purdue University,
is based on Joomla, the content management
system, and uses MySQL as its back-end. The main
aim of the system is to provide a platform on which
researchers can mount and annotate datasets.
Penn State’s ScholarSphere
Released on September 24,
2012, ScholarSphere is
another hybrid system, based
on Hydra, a Ruby-on-Rails
front end, and Fedora
Commons.
The Common Sense of IR Plus
IR Plus is a system
developed by the University
of Rochester Libraries. It is
another variation on the
hybrid theme: in this case, it
uses Apache Tomcat's
WebDAV extension to
support personal file storage
and public archiving, with
depositors able to make
materials mounted in the
personal storage area
available to collaborative
groups and/or the public by
toggling a software switch.
This design is intended to
reduce the friction associated
with the use of other
archiving/repository systems.
Sharing & Publishing under IR Plus
Interoperability and Related Issues
• Are Key Archival and/or Bibliographic
Standards such as MARC and/or EAD
Supported?
• Does the System Support the Open Archives
Initiative's Metadata Harvesting Protocol?
• To What Extent is Content Exportable? To
What Extent is the System Itself Portable?
• Is the system extensible?
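One quick way to test the harvesting question above is to send a repository an OAI-PMH request. The sketch below only builds the request URLs and performs no network access. The base URL is a hypothetical placeholder, since each repository publishes its own endpoint, and the helper name oai_request_url is illustrative. Identify and ListRecords are verbs defined by the protocol, and oai_dc is the Dublin Core metadata prefix that the protocol requires every repository to support.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH GET request URL from a verb and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical endpoint; substitute the repository's documented base URL.
BASE_URL = "https://repository.example.org/oai/request"

# Ask the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")

# Ask for all records as unqualified Dublin Core.
list_records_url = oai_request_url(
    BASE_URL, "ListRecords", metadataPrefix="oai_dc"
)
```

A system that answers the Identify request with a well-formed XML response supports the protocol at least at a basic level; harvesting the full record set then shows how much of the content is actually exportable.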
Ease-of-Use
• Does the creation of objects within the System
require a professional level knowledge of
metadata generation?
• What are the characteristics of the workflow?
Does the workflow support multiple roles?
• Does the system incorporate lookup features
based on Web APIs?
• How does the system support the organization
of objects once they have been mounted?
Documentation and Support
• Is the System Supported by an Active
Documentation Project? What is the quality of
the documentation that is available?
• Are there user forums through which questions
about configuration, content creation, and/or bugs
may be addressed?
• How often is the software updated?
• In the case of extensible systems, how
productive are the developer communities
providing extensions, plugins, themes, etc.?
Factors in Evaluating Open Source
Software
• License
• Activity and Age of the Project
• Unit Tests
• Code Quality
• Base Use Test
• Modification Test