a centre of expertise in data curation and preservation

The DCC Curation Lifecycle Model
Sarah Higgins, Ross Harvey, Angus Whyte
with graphics advice from Chris Blackall

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, visit http://creativecommons.org/licenses/by-ncsa/2.5/scotland/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

The Curation Lifecycle
The DCC Curation Lifecycle Model provides a graphical, high-level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt. The model can be used to plan activities within an organisation or consortium to ensure that all necessary stages are undertaken, each in the correct sequence.

www.dcc.ac.uk

Using the DCC Curation Lifecycle Model
The model enables:
• mapping of granular functionality
• definition of roles and responsibilities
• building frameworks of standards and technologies to implement
• identification of additional steps required
• identification of actions which are not required
• ensuring adequate documentation of processes and policies

Data (Digital Objects or Databases)
Data, any information in binary digital form, is at the centre of the Curation Lifecycle.
This includes:
• simple digital objects
• complex digital objects
• databases

Data (Digital Objects or Databases)
• simple digital objects
  • discrete digital items, such as textual files, images or sound files, along with their related identifiers and metadata
• complex digital objects
  • discrete digital objects, made by combining a number of other digital objects, such as websites
• databases
  • structured collections of records or data stored in a computer system

Full Lifecycle Actions: Description and Representation Information
Assign administrative, descriptive, technical, structural and preservation metadata, using appropriate standards, to ensure adequate description and control over the long term. Collect and assign the representation information required to understand and render both the digital material and the associated metadata.

Full Lifecycle Actions: Description Information (Metadata)
• persistently identifies data and maintains reliable links to them
• clearly describes what they are
• clearly identifies the technical information needed to use the data
• identifies who is responsible for their management and preservation
• describes what can be done to them
• describes what is needed to represent them at the required level of fidelity
• records their history and documents their authenticity
• allows users to understand their context and relationship to other objects

Full Lifecycle Actions: Representation Information
• Structure Information: describes the format and data structure concepts to be applied to the bit stream, which result in more meaningful values such as characters or numbers of pixels.
• Semantic Information: this is needed on top of the structure information.
If the digital object is interpreted by the structure information as a sequence of text characters, the semantic information should include details of which language is being expressed.
• Other Representation Information: includes information about relevant software, hardware and storage media, encryption or compression algorithms, and printed documentation.

Full Lifecycle Actions: Preservation Planning
Plan for preservation throughout the curation lifecycle of digital material. This would include plans for management and administration of all curation lifecycle actions.

Full Lifecycle Actions: Preservation Planning – ensure future data access
Digital preservation:
• is a set of managed activities
• aims at ensuring the bit stream is maintained
• aims at ensuring that data are accessible
• is concerned with maintaining bit streams and ensuring accessibility for a definable period of time

Full Lifecycle Actions: Preservation Planning – ensure longevity, integrity, accessibility
• longevity
  • as long as required, which may be longer than the life of the original access system
• integrity
  • copy data to a reliable digital storage system
  • ongoing management: data security, backups, error checking
  • refresh data and maintain multiple copies of the bit stream
  • ensure you have preservation action rights
• accessibility
  • assign persistent identifiers
  • add sufficient metadata and representation information
  • choose a limited set of open file formats
  • monitor technical developments
  • retain and manage the original bit stream

Full Lifecycle Actions: Community Watch and Participation
Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software.
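The integrity measures listed under Preservation Planning (error checking, multiple copies of the bit stream) are often implemented as a fixity check: a digest is recorded once and each copy is later compared against it. The following is a minimal Python sketch under stated assumptions, not a prescribed DCC procedure. The choice of SHA-256 and the names `sha256_of`, `verify_copies` and `demo` are hypothetical and introduced only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(expected: str, copies: list[Path]) -> dict[str, bool]:
    """Map each copy's file name to True if its digest matches the recorded one."""
    return {copy.name: sha256_of(copy) == expected for copy in copies}


def demo() -> dict[str, bool]:
    """Write two copies to a temporary directory, corrupt one, and verify both."""
    with tempfile.TemporaryDirectory() as tmp:
        copy_a = Path(tmp) / "copy_a.bin"
        copy_b = Path(tmp) / "copy_b.bin"
        copy_a.write_bytes(b"original bit stream")
        copy_b.write_bytes(b"original bit stream")
        recorded = sha256_of(copy_a)  # digest assumed to be stored at ingest
        copy_b.write_bytes(b"original bit strean")  # simulate silent corruption
        return verify_copies(recorded, [copy_a, copy_b])


if __name__ == "__main__":
    print(demo())
```

A copy that fails the comparison would normally be replaced from one that still matches, which is one reason to keep multiple copies of the bit stream.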
Full Lifecycle Actions: Community Watch and Participation – benefits of collaboration
• access to a wider range of expertise
• access to tools and systems that might otherwise be unavailable
• encouragement for other stakeholders to take preservation seriously
• shared influence on R&D of standards and practices
• attraction of resources and other support for well-coordinated programmes at a regional, national or sectoral level
• shared influence on agreements with producers
• increased coverage of preserved materials
• better planning to reduce wasted effort
• shared development costs
• shared learning opportunities
UNESCO, Guidelines for the Preservation of Digital Heritage, 2003

Full Lifecycle Actions: Curate and Preserve
Be aware of, and undertake, management and administrative actions planned to promote curation and preservation throughout the curation lifecycle.

Sequential Actions: Conceptualise
Conceive and plan the creation of data, including capture method and storage options.

Sequential Actions: Conceptualise – plan with digital curation in mind
• develop robust workflow, processes and documentation
• choose appropriate, existing open standards (for interoperability)
• capture and store data in curation-friendly file formats (open source)
• record sufficient information during data capture to assist with ongoing use
• scrupulously identify files
• store data on appropriate media
• identify a safe place for storage (e.g. a trusted archive) and make sure that archive will take your data
• identify access methods
• identify legal framework

Sequential Actions: Create or Receive
Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation.
Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.

Sequential Actions: Create or Receive – ensure data are curation ready
• of high quality
• well structured
• adequately documented
• interoperable
• authentic (it is what it claims to be)
• accurate (it hasn’t been tampered with)
• renderable (it can be used in the ways for which it was intended, or viewed as originally intended)
• in a form that best ensures its longevity

Sequential Actions: Appraise and Select
Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.

Sequential Actions: Appraise and Select – develop robust policies
How long do we want to keep the data?
• in terms of changes of technology
• in terms of an organisation’s business requirements
• in terms of user requirements (e.g. as evidence to verify conclusions derived from research)
How long do we need to keep the data?
• assess benefits and risks of keeping/not keeping data
• what are the consequences of not keeping the data?
• how much would it cost to recreate it in the future?
• is it even possible to recreate it in the future?

Occasional Actions: Dispose
Dispose of data which has not been selected for long-term curation and preservation, in accordance with documented policies, guidance or legal requirements. Typically data may be transferred to another archive, repository, data centre or other custodian. In some instances data is destroyed. The data’s nature may, for legal reasons, necessitate secure destruction.

Occasional Actions: Dispose – transfer or destruction?
• transfer
  • if no longer relevant for business function but useful to someone else
  • for safe keeping – institutional archive
  • for greater accessibility – more widely accessible data archive
• secure destruction – prevent re-use or reconstruction
  • sensitive data no longer relevant for business function

Sequential Actions: Ingest
Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements.

Sequential Actions: Preservation Action
Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.

Sequential Actions: Preservation Action – specific necessary actions
• keep the original data bit stream as well as any ‘preservation version’ for future proofing
• clean and validate data, to ensure they can be managed and re-used over time
• add high quality preservation metadata and representation information to increase potential for discovery, re-use and preservation
• ensure acceptable data structures or file formats (e.g. non-proprietary, well-documented) to increase the chance of future recoverability
• apply good data management practices
• implement secure storage and institutional or organisational continuity
Based on Lord, P and Macdonald, A, eScience Curation Report, 2003

Sequential Actions: Preservation Action – implement preservation methods
• Migration – transform formats as technologies change
• Emulation – keep original data and application software and create programs to emulate their behaviour on contemporary architectures
• Formal descriptions – encode behaviours of the original application, at creation, in a format understood by a Universal Virtual Computer (a platform-independent layer between hardware and software) to allow reconstitution in original form
• Digital archaeology – future recovery on an as-needed or exploratory basis
• Computer museums – archive whole systems: hardware and software
Based on Lord, P and Macdonald, A, eScience Curation Report, 2003

Sequential Actions: Preservation Action – automate with tools
• identifying data (where it is located, what formats it is in)
  • format validation, format registries, obsolescence tools
• describing data (automated metadata creation)
  • technical metadata extraction, conversion to XML schema
• manipulating data (data management, data storage, repositories)
  • normalising and encapsulation tools
• preserving data (migration)
  • web archiving tools, emulation tools, preservation metadata extraction tools
• data registration (ingest)
• documentation of commonly used terms and concepts
  • thesauri, word lists, ontologies
• rights management and access control

Occasional Actions: Reappraise
Return data which fails validation procedures for further appraisal and reselection.

Occasional Actions: Migrate
Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data’s immunity from hardware or software obsolescence.
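The "identifying data" item in the tools list above (what formats the data is in) is commonly automated by matching a file's leading bytes against known format signatures. The following is a minimal Python sketch, assuming a small illustrative signature table. It is not a format-registry API, and `identify_format` and `triage` are hypothetical helper names.

```python
from typing import Optional

# Leading-byte signatures for a few common formats (illustrative subset only).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
]


def identify_format(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def triage(files: dict[str, bytes]) -> tuple[dict[str, str], list[str]]:
    """Split files into identified formats and names needing manual review."""
    identified: dict[str, str] = {}
    unidentified: list[str] = []
    for name, content in files.items():
        fmt = identify_format(content[:16])
        if fmt is None:
            unidentified.append(name)
        else:
            identified[name] = fmt
    return identified, unidentified


if __name__ == "__main__":
    sample = {
        "report.pdf": b"%PDF-1.4 rest of file",
        "scan.tif": b"II*\x00rest of file",
        "mystery.dat": b"\x00\x01\x02\x03",
    }
    print(triage(sample))
```

Files left unidentified would need manual inspection before a migration target can be chosen. Production tools rely on much larger signature registries than the five entries assumed here.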
Occasional Actions: Migrate – for preservation storage
• File formats for long-term preservation should be: non-proprietary, open source and well documented
• This facilitates: curation, future access, reuse and future migrations
Examples
• JPEG – digital image thumbnails
• TIFF – high quality digital images
• PDF/A-1 – documents with look and feel (ISO 19005-1, Document management – electronic document file formats for long-term preservation)
• HTML – web pages
• XML – data or text

Sequential Actions: Store
Store the data in a secure manner adhering to relevant standards.

Sequential Actions: Access, Use and Reuse
Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable.

Sequential Actions: Transform
Create new data from the original, for example:
• by migration into a different format
• by creating a subset, by selection or query, to create newly derived results, perhaps for publication

DCC Scarp - Neuro-imaging case-study
Occasional Lifecycle action | Main risks identified | Steps being taken/required
Dispose | Privacy breach | Introduce assured deletion process
Reappraise | | QA process to check integrity of image & study data
Migrate | Obsolescence of hardware or software; Media degradation or obsolescence | Data policy to identify criteria for migrating datasets

DCC Scarp - Neuro-imaging case-study
Full Lifecycle action | Main risks identified | Steps being taken/required
Description and representation information | Loss of integrity of information, i.e. links between dataset elements & study docs | Ontology to describe data; Standard ‘master file’ documentation; Data documentation system to link and describe study files
Preservation planning | Preservation failure; Loss of key member(s) of staff; Funding bodies may misperceive the level of infrastructure for local curation | Seeking funding for preservation activity; Develop data policy
Community watch and participation | | Active participation in e-Science consortia, multi-centre projects and professional networks
Curate and preserve | Data integration unmanageable | Establish extent of data cleaning needs, using sampling approach

DCC Scarp - Neuro-imaging case-study
Sequential Lifecycle action | Main risks identified | Steps being taken/required
Conceptualise | | Data integration to enable retrospective & multi-centre studies
Create or receive | | Ontology to map different assessment scales; File hashing
Appraise and select | Inability to evaluate the effectiveness of preservation | Appraisal process & criteria; Data documentation
Ingest | Archived data cannot be traced to receipt; Privacy breach | Ontology to describe data; Anonymisation to strip images of identifying data

DCC Scarp - Neuro-imaging case-study
Sequential Lifecycle action | Main risks identified | Steps being taken/required
Preservation action | Context information lost or unrecorded; Provenance information lost or unrecorded | Data documentation system to provide schema & guidelines for context and provenance information; QA to check integrity of files received
Store | Extent of what is within an archival object is unclear; Destruction or non-availability of repository site | Standard master file format; Offsite storage of backups
Access, use and reuse | Finding/searching tools are not sufficiently effective or usable | Ontology to describe data; Dataset sharing in grid projects; Remote analysis services; Automated analysis; Data documentation: share metadata
Transform | Data integration unmanageable | Ontology to map terms; Normalisation to correct scanner inhomogeneities
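The deck presents the Sequential Actions as an ordered workflow and describes the model as a way to check that all necessary stages are undertaken in the correct sequence. The following Python sketch illustrates such a check for a planned workflow. It makes two simplifying assumptions: the cycle is treated as a single linear pass, and the Occasional Actions (Dispose, Reappraise, Migrate) are not modelled. The function name `check_plan` and the list representation are hypothetical.

```python
# Sequential Actions in the order they are presented in this deck.
SEQUENTIAL_ACTIONS = [
    "Conceptualise",
    "Create or Receive",
    "Appraise and Select",
    "Ingest",
    "Preservation Action",
    "Store",
    "Access, Use and Reuse",
    "Transform",
]


def check_plan(plan: list[str]) -> dict[str, list[str]]:
    """Report unknown, missing and out-of-sequence actions in a planned workflow."""
    position = {name: index for index, name in enumerate(SEQUENTIAL_ACTIONS)}
    unknown = [step for step in plan if step not in position]
    known = [step for step in plan if step in position]
    missing = [name for name in SEQUENTIAL_ACTIONS if name not in known]
    # A step is out of sequence if it comes directly after a later-stage step.
    out_of_order = [
        later
        for earlier, later in zip(known, known[1:])
        if position[later] < position[earlier]
    ]
    return {"unknown": unknown, "missing": missing, "out_of_order": out_of_order}


if __name__ == "__main__":
    plan = ["Conceptualise", "Create or Receive", "Ingest", "Appraise and Select", "Store"]
    print(check_plan(plan))
```

In this usage example the plan ingests data before appraising it, so "Appraise and Select" is reported as out of sequence and the three stages absent from the plan are reported as missing.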