The DCC Curation Lifecycle Model Sarah Higgins Ross Harvey Angus Whyte

a centre of expertise in data curation and preservation
The DCC Curation Lifecycle Model
Sarah Higgins
Ross Harvey
Angus Whyte
with graphics advice from Chris Blackall
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK:
Scotland License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San
Francisco, California, 94105, USA.
The Curation Lifecycle
The DCC Curation Lifecycle Model provides a graphical, high-level
overview of the stages required for successful curation and
preservation of data from initial conceptualisation or receipt.
The model can be used to plan activities within an organisation or
consortium to ensure all necessary stages are undertaken, each in
the correct sequence.
•www.dcc.ac.uk
Using the DCC Curation Lifecycle Model
The model enables:
• mapping of granular functionality
• definition of roles and responsibilities
• building frameworks of standards and technologies to
implement
• identification of additional steps required
• identification of actions which are not required
• ensuring adequate documentation of processes and policies
Data (Digital Objects or Databases)
Data, any information in binary
digital form, is at the centre of the
Curation Lifecycle. This includes:
• simple digital objects
• complex digital objects
• databases
Data (Digital Objects or Databases)
• simple digital objects
  • discrete digital items, such as textual files, images or sound files, along
    with their related identifiers and metadata
• complex digital objects
  • discrete digital objects, made by combining a number of other digital
    objects, such as websites
• databases
  • structured collections of records or data stored in a computer system
Full Lifecycle Actions
Description and Representation Information
Assign administrative, descriptive,
technical, structural and
preservation metadata, using
appropriate standards, to ensure
adequate description and control
over the long term.
Collect and assign representation
information required to understand
and render both the digital material
and the associated metadata.
Full Lifecycle Actions
Description Information (Metadata)
• persistently identifies data and maintains reliable links to them
• clearly describes what they are
• clearly identifies technical information needed to use data
• identifies who is responsible for their management and preservation
• describes what can be done to them
• describes what is needed to represent them at the required level of
  fidelity
• records their history and documents their authenticity
• allows users to understand their context and relationship to other
  objects.
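
For illustration only, the functions listed above can be pictured as fields of a single metadata record. The sketch below is an assumption, not a DCC schema or any published metadata standard: the class name `DescriptionRecord`, every field name and the `record_event` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DescriptionRecord:
    """Hypothetical record with one field per function of description information."""

    identifier: str                # persistent identifier for the data
    description: str               # what the data are
    technical_info: str            # technical information needed to use them
    custodian: str                 # who manages and preserves them
    permitted_actions: list[str]   # what can be done to them
    representation_needs: str      # what is needed to represent them faithfully
    history: list[str] = field(default_factory=list)          # events supporting authenticity
    related_objects: list[str] = field(default_factory=list)  # context and relationships

    def record_event(self, event: str) -> None:
        """Append an event to the record's history."""
        self.history.append(event)
```

A record built this way keeps its history as an append-only list, so each later lifecycle action can add an entry without overwriting earlier ones.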
Full Lifecycle Actions
Representation Information
• Structure Information: describes the format and data structure
  concepts to be applied to the bitstream, which result in more
  meaningful values like characters or number of pixels.
• Semantic Information: this is needed on top of the structure
  information. If the digital object is interpreted by the structure
  information as a sequence of text characters, the semantic information
  should include details of which language is being expressed.
• Other Representation Information: includes information about
  relevant software, hardware and storage media, encryption or
  compression algorithms, and printed documentation.
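
The role of structure information can be sketched with a character encoding: the same bitstream yields different characters depending on the structure information applied to it. This is a minimal illustration under stated assumptions. The dictionary layout, its keys and the `render` function are hypothetical, not part of the model.

```python
# One bitstream, two candidate pieces of structure information.
bitstream = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])

as_utf8 = bitstream.decode("utf-8")      # interpreted as UTF-8 text
as_latin1 = bitstream.decode("latin-1")  # same bits, different characters

# Hypothetical bundle of representation information for this object.
representation_info = {
    "structure": {"format": "plain text", "encoding": "utf-8"},
    "semantic": {"language": "fr"},  # which language the characters express
    "other": {"compression": None},
}


def render(bits: bytes, rep_info: dict) -> str:
    """Turn a bitstream into characters using the recorded structure information."""
    return bits.decode(rep_info["structure"]["encoding"])
```

Without the recorded encoding, a future user could not tell which of the two decodings is the intended one, which is why the representation information has to travel with the data.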
Full Lifecycle Actions
Preservation Planning
Plan for preservation throughout
the curation lifecycle of digital
material. This would include plans
for management and
administration of all curation
lifecycle actions.
Full Lifecycle Actions
Preservation Planning – ensure future data access
Digital preservation:
• is a set of managed activities
• aims at ensuring the bit stream is maintained
• aims at ensuring that data are accessible
• is concerned with maintaining bit streams and ensuring
accessibility for a definable period of time
Full Lifecycle Actions
Preservation Planning – ensure longevity, integrity, accessibility
• longevity
  • as long as required - longer than the original access system
• integrity
  • copy data to a reliable digital storage system
  • ongoing management - data security, backups, error checking
  • refresh data and maintain multiple copies of the bit stream
  • ensure you have preservation action rights.
• accessibility
  • assign persistent identifiers
  • add sufficient metadata and representation information
  • choose a limited range of open file formats
  • monitor technical developments
  • retain and manage the original bit stream
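
The error checking listed under integrity is often done by comparing checksums. The sketch below is one possible approach, not a method prescribed by the model: it assumes a SHA-256 checksum was recorded when the data were first stored, and the function names `sha256_of` and `failed_copies` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_copies(copies: list[Path], recorded_checksum: str) -> list[Path]:
    """Return the copies whose current checksum differs from the recorded one."""
    return [copy for copy in copies if sha256_of(copy) != recorded_checksum]
```

Running `failed_copies` on a schedule across all stored copies shows which copy needs to be refreshed from an intact one.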
Full Lifecycle Actions
Community Watch and Participation
Maintain a watch on appropriate
community activities, and
participate in the development of
shared standards, tools and
suitable software.
Full Lifecycle Actions
Community Watch and Participation – benefits of collaboration
• access to a wider range of expertise
• access to tools and systems that might otherwise be unavailable
• encouragement for other stakeholders to take preservation seriously
• shared influence on R&D of standards and practices
• attraction of resources and other support for well-coordinated
  programmes at a regional, national or sectoral level
• shared influence on agreements with producers
• increased coverage of preserved materials
• better planning to reduce wasted effort
• shared development costs
• shared learning opportunities
UNESCO, Guidelines for the Preservation of Digital Heritage, 2003
Full Lifecycle Actions
Curate and Preserve
Be aware of, and undertake
management and administrative
actions planned to promote
curation and preservation
throughout the curation lifecycle.
Sequential Actions
Conceptualise
Conceive and plan the creation of
data, including capture method and
storage options.
Sequential Actions
Conceptualise - plan with digital curation in mind
• develop robust workflow, processes and documentation
• choose appropriate, existing open standards for interoperability
• capture and store data in curation-friendly file formats (open
source)
• record sufficient information during data capture to assist with
ongoing use
• scrupulously identify files
• store data on appropriate media
• identify a safe place for storage (e.g. a trusted archive) and make
sure that archive will take your data
• identify access methods
• identify legal framework
Sequential Actions
Create or Receive
Create data including
administrative, descriptive,
structural and technical metadata.
Preservation metadata may also
be added at the time of creation.
Receive data, in accordance with
documented collecting policies,
from data creators, other archives,
repositories or data centres, and if
required assign appropriate
metadata.
Sequential Actions
Create or Receive – ensure data are curation ready
• of high quality
• well structured
• adequately documented
• interoperable
• authentic (it is what it claims to be)
• accurate (it hasn’t been tampered with)
• renderable (it can be used in the ways for which it was intended, or
  viewed as originally intended)
• in a form that best ensures its longevity
Sequential Actions
Appraise and Select
Evaluate data and select for long-term curation and preservation.
Adhere to documented guidance,
policies or legal requirements.
Sequential Actions
Appraise and Select – develop robust policies
How long do we want to keep the data?
• in terms of changes of technology
• in terms of an organisation’s business requirements
• in terms of user requirements (e.g. as evidence to verify
conclusions derived from research).
How long do we need to keep the data?
• assess benefits and risks of keeping/not keeping data
• what are the consequences of not keeping the data?
• how much would it cost to recreate it in the future?
• is it even possible to recreate it in the future?
Occasional Actions
Dispose
Dispose of data which has not
been selected for long-term
curation and preservation, in
accordance with documented
policies, guidance or legal
requirements.
Typically data may be transferred
to another archive, repository, data
centre or other custodian. In some
instances data is destroyed. The
data’s nature may, for legal
reasons, necessitate secure
destruction.
Occasional Actions
Dispose – transfer or destruction?
• transfer
  • if no longer relevant for business function but useful to someone else
  • for safe keeping – institutional archive
  • for greater accessibility – more widely accessible data archive
• secure destruction – prevent re-use or reconstruction
  • sensitive data no longer relevant for business function
Sequential Actions
Ingest
Transfer data to an archive,
repository, data centre or other
custodian. Adhere to documented
guidance, policies or legal
requirements.
Sequential Actions
Preservation Action
Undertake actions to ensure long-term preservation and retention of
the authoritative nature of data.
Preservation actions should ensure
that data remains authentic, reliable
and usable while maintaining its
integrity.
Actions include data cleaning,
validation, assigning preservation
metadata, assigning representation
information and ensuring acceptable
data structures or file formats.
Sequential Actions
Preservation Action – specific necessary actions
• keep the original data bit stream as well as any ‘preservation
  version’ for future proofing
• clean and validate data, to ensure they can be managed and re-used
  over time
• add high quality preservation metadata and representation
  information to increase potential for discovery, re-use and preservation
• ensure acceptable data structures or file formats (e.g. non-proprietary,
  well-documented) to increase the chance of future recoverability
• apply good data management practices
• implement secure storage and institutional or organisational continuity
Based on Lord, P and Macdonald, A, eScience Curation Report, 2003
Sequential Actions
Preservation Action – implement preservation methods
• Migration – transform formats as technologies change
• Emulation – keep original data and application software and create
  programs to emulate their behaviour on contemporary architectures
• Formal descriptions – encode behaviours of original application, at
  creation, in a format understood by a Universal Virtual Computer (a
  platform independent layer between hardware and software) to allow
  reconstitution in original form
• Digital archaeology – future recovery on an as-needed or exploratory basis
• Computer museums – archive whole systems: hardware and software
Based on Lord, P and Macdonald, A, eScience Curation Report, 2003
Sequential Actions
Preservation Action – automate with tools
• identifying data (where it is located, what formats it is in)
  • format validation, format registries, obsolescence tools
• describing data (automated metadata creation)
  • technical metadata extraction, conversion to XML schema
• manipulating data (data management, data storage, repositories)
  • normalising and encapsulation tools
• preserving data (migration)
  • web archiving tools, emulation tools, preservation metadata extraction
    tools
• data registration (ingest)
• documentation of commonly used terms and concepts
  • thesauri, word lists, ontologies
• rights management and access control
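
As a rough sketch of what automated format identification involves, the example below matches the leading bytes of a file against a few well-known file signatures. It is an illustration only: real identification tools consult full format registries, whereas the `SIGNATURES` table and the `identify_format` function here are hypothetical and cover just four formats.

```python
# A tiny, illustrative signature table (real registries hold far more entries).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify_format(leading_bytes: bytes) -> str:
    """Return the name of the first signature that the leading bytes match."""
    for signature, name in SIGNATURES:
        if leading_bytes.startswith(signature):
            return name
    return "unknown"
```

Identifying by content rather than by file extension matters for curation because extensions are easily lost or wrong by the time data reach an archive.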
Occasional Actions
Reappraise
Return data which fails validation
procedures for further appraisal and
reselection.
Occasional Actions
Migrate
Migrate data to a different format. This
may be done to accord with the
storage environment or to ensure the
data’s immunity from hardware or
software obsolescence.
Occasional Actions
Migrate – for preservation storage
• File formats for long-term preservation should be: non-proprietary,
open source and well documented
• This facilitates: curation, future access, reuse and future migrations
Examples
• JPEG – digital image thumbnails
• TIFF – high quality digital images
• PDF/A-1 – documents – with look and feel
(ISO 19005-1, Document management – electronic document file
formats for long-term preservation)
• HTML – web pages
• XML – data or text
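
A migration into one of the formats above can be sketched with a small example that converts tabular data from CSV to XML. This is a minimal sketch under assumptions, not a recommended migration tool: the function name `csv_to_xml` and the tag names `dataset` and `record` are hypothetical, and the CSV column names are assumed to be valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str, root_tag: str = "dataset", row_tag: str = "record") -> str:
    """Migrate CSV text to an XML string, one element per row and per field.

    Assumes every CSV column name is a valid XML element name.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element(root_tag)
    for row in reader:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(record, column).text = value
    return ET.tostring(root, encoding="unicode")


original = "id,label\n1,scan A\n2,scan B\n"
migrated = csv_to_xml(original)  # the original text is left untouched
```

The function returns a new string and never modifies its input, so the original bit stream can be retained alongside the migrated version.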
Sequential Actions
Store
Store the data in a secure manner
adhering to relevant standards.
Sequential Actions
Access, Use and Reuse
Ensure that data is accessible to
both designated users and re-users, on a day-to-day basis. This
may be in the form of publicly
available published information.
Robust access controls and
authentication procedures may be
applicable.
Sequential Actions
Transform
Create new data from the original,
for example
• By migration into a different
format.
• By creating a subset, by
selection or query, to create
newly derived results, perhaps
for publication.
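
Creating a subset by query can be sketched with an in-memory SQLite database. The example is hypothetical throughout: the table `scans`, its columns, the function `derive_subset` and the sample rows are assumptions made for illustration.

```python
import sqlite3


def derive_subset(rows: list[tuple[str, int]], min_age: int) -> list[tuple[str, int]]:
    """Load (subject_id, age) rows and return those at or above min_age."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE scans (subject_id TEXT, age INTEGER)")
        conn.executemany("INSERT INTO scans VALUES (?, ?)", rows)
        # The query result is new, derived data; the input rows are unchanged.
        return conn.execute(
            "SELECT subject_id, age FROM scans WHERE age >= ? ORDER BY subject_id",
            (min_age,),
        ).fetchall()
    finally:
        conn.close()


source_rows = [("s1", 34), ("s2", 71), ("s3", 65)]
older_subjects = derive_subset(source_rows, 65)
```

The derived subset is itself new data, so in lifecycle terms it re-enters the model and needs its own description and appraisal.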
DCC Scarp - Neuro-imaging case-study
Occasional Lifecycle action | Main risks identified | Steps being taken/required
Dispose | Privacy breach | Introduce assured deletion process
Reappraise | - | QA process to check integrity of image & study data
Migrate | Obsolescence of hardware or software; media degradation or obsolescence | Data policy to identify criteria for migrating datasets
DCC Scarp - Neuro-imaging case-study
Full Lifecycle actions | Main risks identified | Steps being taken/required
Description and representation information | Loss of integrity of information, i.e. links between dataset elements & study docs | Ontology to describe data; standard ‘master file’ documentation; data documentation system to link and describe study files
Preservation planning | Preservation failure; loss of key member(s) of staff; funding bodies may misperceive the level of infrastructure for local curation | Seeking funding for preservation activity; develop data policy
Community watch and participation | - | Active participation in e-Science consortia, multi-centre projects and professional networks
Curate and preserve | Data integration unmanageable | Establish extent of data cleaning needs, using sampling approach
DCC Scarp - Neuro-imaging case-study
Sequential Lifecycle action | Main risks identified | Steps being taken/required
Conceptualise | - | Data integration to enable retrospective & multi-centre studies
Create or receive | - | Ontology to map different assessment scales; file hashing
Appraise and select | Inability to evaluate the effectiveness of preservation | Appraisal process & criteria; data documentation
Ingest | Archived data cannot be traced to receipt; privacy breach | Ontology to describe data; anonymisation to strip images of identifying data
DCC Scarp - Neuro-imaging case-study
Sequential Lifecycle action | Main risks identified | Steps being taken/required
Preservation action | Context information lost or unrecorded; provenance information lost or unrecorded | Data documentation system to provide schema & guidelines for context and provenance information; QA to check integrity of files received
Store | Extent of what is within an archival object is unclear; destruction or non-availability of repository site | Standard master file format; offsite storage of backups
Access, use and reuse | Finding/searching tools are not sufficiently effective or usable | Ontology to describe data; dataset sharing in grid projects; remote analysis services; automated analysis; data documentation - share metadata
Transform | Data integration unmanageable | Ontology to map terms; normalisation to correct scanner inhomogeneities