Open Source Options for Digital Curation
Library 2.012
October 4, 2012
Christinger Tomer
University of Pittsburgh

Definitions of Digital Curation

According to the Digital Curation Centre in the U.K., digital curation involves maintaining, preserving and adding value to digital research data throughout its lifecycle, covering the stewardship of data from the point of conceptualisation to its eventual disposal. It is based on the presumption that such data has multiple uses, including uses in other contexts.

An Illustration of the DCC Curation Lifecycle Model

See http://www.dcc.ac.uk/resources/curation-lifecycle-model

The Role of Libraries in Digital Curation

Heidorn argues that "[i]ncreasingly, data are being recognized as first-class intellectual objects that can undergo quality checks, peer review, distribution, and reuse. The reuse of data contributes as much to society as the reuse of a concept in a journal article. The data set can be cited and contribute to the reputation of the creator of the data for good or ill." He goes on to assert that "[l]ibraries ... have a duty to society to collect, preserve, and disseminate the intellectual output of the society—including this data."

(From P. Bryan Heidorn (2011): The Emerging Role of Libraries in Data Curation and E-science, Journal of Library Administration, 51:7-8, 662-672.)

Key Factors in Digital Curation

• Identity -- "Identity is contextual: some objects are associated with information that allows identification only within a limited context (e.g., an object may be uniquely identified only within the context of objects residing on the same server), while others have enough information to make them globally identifiable (e.g., a global identifier such as a GUID or ISBN)." See: Sally Vermaaten, et al. Identifying Threats to Successful Digital Preservation: the SPOT Model for Risk Assessment. D-Lib Magazine 18 (September/October 2012): 8.
• Authenticity and Understandability
  o Evaluation of the understandability of data requires that there be sufficient context (documentation, metadata, or provenance) describing the data, and that the data is usable.
• Persistence

More Key Factors

• Renderability -- "the property that a digital object is able to be used in a way that retains the object's significant characteristics," meaning that the hardware and software necessary to render the object are available or may be reproduced through emulation.
• Integrity
  o Integrity of data assumes that the data can be proven to be identical, at the bit level, to some prior accepted or verified state. Data integrity may be required for usability, understandability, authenticity, trust, and thus overall quality.
• Access and Usability
  o Data Generators
  o Data Seekers
  o Data Consumers

Archon

Archon, which has been developed by a group from the University of Illinois, supports the creation of records conforming to MARC and EAD, as well as their import and export. Archon's designers refer to it as a "simple archiving" system, but its effective use does require a working knowledge of standards for archival description. Its preview feature is especially effective in the treatment of visual materials.

ICA AToM allows creators to build what are effectively compound documents and users to scroll through thumbnails of the documents on the interface.

Artefactual Systems in British Columbia is developing ICA AToM 2.0, which will be available as open source, community-supported software and as a fee-based service.

ICA AToM

ResourceSpace

Document View under ResourceSpace

CWIS

Islandora

The idea underlying Islandora is that a creator places an object in the repository, Fedora Commons, and then links that object to other materials, e.g., text, images, etc., that are mounted through the CMS, which is a Drupal instance.
Islandora is a hybrid, combining the Fedora Commons repository system as the back end with Drupal, the LAMP-based CMS, as the front end. This hybridized approach is gaining in popularity among developers, who believe that successful design must provide a place for narrative treatments.

Omeka

Omeka is based on the "LAMP" architecture. Perhaps its most important feature is its modular design.

Omeka's Modularity

Omeka supports a wide array of plugins that have been designed to enhance the functionality of the system. In this illustration, one of the examples is the Creative Commons Chooser, which allows the creator of an object to select the appropriate license from the entry interface.

Omeka.net

DSpace is based on Java and Apache Tomcat and will run with equal facility on *nix or Windows systems.

DSpace

ePrints is another LAMP-based system, distinguished by its reliance on Perl and by its popularity, which owes much to its ease of use, particularly in the generation of metadata.

ePrints3

DSpace, with Manakin Interface

HubZero Client Interface

HubZero, which was developed at Purdue University, is based on Joomla, the content management system, and uses MySQL as its back end. The main aim of the system is to provide a platform on which researchers can mount and annotate datasets.

Penn State's ScholarSphere

Released on September 24, 2012, ScholarSphere is another hybrid system, based on Hydra, a Ruby on Rails front end, and Fedora Commons.

The Common Sense of IR Plus

IR Plus is a system developed by the University of Rochester Libraries. It is another variation on the hybrid theme; in this case it uses Apache Tomcat's WebDAV extension to support personal file storage and public archiving, with depositors able to make materials mounted in the personal storage area available to collaborative groups and/or the public by toggling a software switch.
This design is intended to reduce the friction associated with the use of other archiving/repository systems.

Sharing & Publishing under IR Plus

Interoperability and Related Issues

• Are key archival and/or bibliographic standards such as MARC and/or EAD supported?
• Does the system support the Open Archives Initiative's Metadata Harvesting Protocol?
• To what extent is content exportable?
• To what extent is the system itself portable?
• Is the system extensible?

Ease-of-Use

• Does the creation of objects within the system require a professional-level knowledge of metadata generation?
• What are the characteristics of the workflow?
• Does the workflow support multiple roles?
• Does the system incorporate lookup features based on Web APIs?
• How does the system support the organization of objects once they have been mounted?

Documentation and Support

• Is the system supported by an active documentation project?
• What is the quality of the documentation that is available?
• Are there user forums through which questions about configuration, content creation, and/or bugs may be addressed?
• How often is the software updated?
• In the case of extensible systems, how productive are the developer communities providing extensions, plugins, themes, etc.?

Factors in Evaluating Open Source Software

• License
• Activity and Age of the Project
• Unit Tests
• Code Quality
• Base Use Test
• Modification Test
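
A Sketch: Checking for OAI-PMH Support

One way to approach the interoperability question about the Open Archives Initiative's Metadata Harvesting Protocol (OAI-PMH) is to build the standard request URLs and try them against a candidate system. The sketch below is illustrative only and does not come from any of the systems reviewed here: the base URL is a hypothetical placeholder, the helper name oai_request_url is an assumption, and the code only constructs URLs; it does not contact a server. The verb names and the oai_dc metadata prefix are taken from version 2.0 of the protocol.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url: str, verb: str,
                    arguments: Optional[Dict[str, str]] = None) -> str:
    """Build an OAI-PMH request URL (hypothetical helper, for illustration)."""
    if verb not in OAI_VERBS:
        raise ValueError("Unknown OAI-PMH verb: " + verb)
    query = {"verb": verb}
    # Extra arguments such as metadataPrefix, set, from, or until.
    query.update(arguments or {})
    return base_url + "?" + urlencode(query)


# Hypothetical endpoint; substitute the base URL the repository documents.
BASE_URL = "https://repository.example.edu/oai/request"

# "Identify" asks the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")

# "ListRecords" with the oai_dc prefix asks for records as unqualified Dublin Core.
list_records_url = oai_request_url(
    BASE_URL, "ListRecords", {"metadataPrefix": "oai_dc"}
)

print(identify_url)
print(list_records_url)
```

If a system under evaluation answers the Identify request with a well-formed XML response, that is a reasonable first indication that harvesting is supported; the ListRecords request then shows how much descriptive metadata is actually exposed.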