DCC 101: Preserve Michael Day UKOLN, University of Bath

a centre of expertise in data curation and preservation
DCC 101: Preserve
Michael Day
UKOLN, University of Bath
m.day@ukoln.ac.uk
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK:
Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San
Francisco, California, 94105, USA.
Digital Curation 101, October 6th-10th, 2008, NeSC, Edinburgh
Presentation outline:
• Preservation in the curation life-cycle
• Roles and responsibilities
• Reasons for preserving research data
• Digital preservation challenges and strategies
• Major types of research data collection
• Infrastructures for preservation and curation
“Preserve,” Digital Curation 101, NeSC, Edinburgh, October 2008
Module outline
• This module will explore actions required to
ensure long-term preservation and retention
of the authoritative nature of data
• Preservation actions should ensure that data
remains authentic, reliable and usable while
maintaining its integrity.
• Actions include data cleaning, validation,
assigning preservation metadata, assigning
representation information and ensuring
acceptable data structures or file formats
Learning outcomes
• A greater awareness of the factors that need
to be taken into account when considering
how to preserve research data (and other
materials) over time
• A deeper understanding of the preservation
options currently available
Preservation in the DCC lifecycle
• In the DCC Curation Lifecycle Model, the
“Preservation Action” stage:
• Immediately follows the “Ingest” stage
• Is followed by the “Store” stage
• Is directly linked with “Transform” and “Appraise
and Select” stages
• Includes major elements from the inner circles:
• Description and Representation Information,
Preservation Planning, Community Watch
Preservation in the DCC lifecycle
• There are major dependencies on the rest of
the curation process
• The creation stage is normally the best time to
ensure that data are fit-for-purpose and
“preservable”
• Need to document both explicit and implicit
knowledge, contexts (part of the metadata issue)
• Preservation Planning informs ingest strategies as
well as preservation actions and transformations
Who undertakes preservation?
• Researchers
• Indirectly - they have the most direct contact with
the creation stage, and understand how data can be
used
• Directly - sometimes responsible for maintaining
community data collections
• Information professionals
• Sometimes, but it depends on the context
• IT professionals
• Primarily informaticians working with scientists
Roles and responsibilities (1)
• Long-lived data collections (NSB)
• Data authors
• Data managers
• Data scientists
• Data users
• Funding agencies
• Dealing with data (JISC)
• Scientist
• Institution
• Data centre
• User
• Funder
• Publisher
Roles and responsibilities (2)
• Scientists
• Initial creation and use of data
• Expectation of first use and of gaining appropriate
credit and recognition
• Responsible for:
• Managing data for the life of the project
• Using standards (where possible)
• Complying with data policies
• Making the data available in a form that can (easily?)
be used by others
Roles and responsibilities (3)
• Institutions:
• Role less clear
• Institutional policies may require short-term
management of data
• Advocacy and training
• Some institutions are developing repository
services
• These are currently rarely used for research data
• Federated approaches maintain disciplinary involvement
Roles and responsibilities (3)
• Data centres
• Undertake curation and provide access
• Responsible for:
• Selection and ingest
• Participating in the development of standards
• Protecting the rights of data creators
• Supporting ingest and metadata capture
• Supporting re-use (tools and services)
• Training
Roles and responsibilities (4)
• Users:
• Users of third-party data
• Responsible for:
• Adhering to any licenses and restrictions on use
• Acknowledging data creators and curators
• Managing any derived data
• Providing feedback to scientists and data centres
Roles and responsibilities (5)
• Funding bodies:
• Acting at policy level
• Responsible for:
• Considering wider policy perspectives
• Developing policies in co-operation with other
stakeholders
• Monitoring and enforcing data policies
• Support for long-term data management
• Support for data curation
What is research data?
• An extremely broad category of material:
• “... any information that can be stored in digital
form, including text, numbers, images, video or
movies, audio, software, algorithms, equations,
animations, models, simulations, etc.” (National
Science Board, Long-lived digital data collections,
2005)
• In practice, it can mean almost anything
Why curate research data? (1)
• Part of the normal research process:
• The need for others to validate and replicate
research
• In some disciplines, supporting data is routinely
made available to reviewers and linked from
journal papers
• Principles of sharing and openness are firmly
embedded in some disciplines
Why curate research data? (2)
• Extrinsic and intrinsic value:
• High investment in research
• Data can be very expensive to capture and
analyse
• Some data is impossible to recreate once lost
• Observational data (by definition) is irreplaceable
• Current generations of instruments can gather
more data than can be analysed
Why curate research data? (3)
• The potential for creating 'new' knowledge
from existing data:
• Re-use, re-analysis, data mining
• Annotation, e.g. in molecular biology, astronomy
• Combining datasets in innovative ways, e.g.
mapping biodiversity data onto ecological GIS
• “Science 2.0”
Why curate research data? (4)
• It is increasingly a requirement of some
research funding bodies
• Some have quite mature data retention policies
(not necessarily for permanent retention)
• Increasing expectation of access to data from
publicly-funded research
• OECD Principles and guidelines for access to
research data from public funding (2007)
Why curate research data? (5)
• Institutional asset management:
• Universities and other research organisations
invest very large sums of money into research
activities
• Research data is a key output of this activity
• It is, therefore, an institutional asset that needs
stewardship
Why curate research data? (6)
• Promoting the institution, research group or
individual:
• Re-use helps promote visibility and 'impact'
• Institutions become acknowledged 'centres of
competence'
Preservation challenges (1)
• Media (1)
• Currently magnetic or optical tape and disks, some
devices (e.g., memory sticks)
• Examples include: CD, DVD (optical), DAT, DLT, laptop
hard drives (magnetic)
• Unknown lifetimes
• Subject to differences in quality or storage conditions
• But relatively short lifetimes compared to paper or good
quality microform
• Lifetimes measured in years rather than decades
Preservation challenges (2)
• Media (2)
• Technical solutions
• Longer lasting media:
• e.g. Norsam's High Density Rosetta system - analogue
storage on nickel plates
• COM (output to good-quality microform)
• Keeping paper copies!
• Periodic copying of data bits on to new media
(refreshing) - data management solution
• Principle of active management
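Refreshing can be sketched as a copy to new media followed by a bit-level comparison. The Python below is a minimal illustration only, not a tool named in this module; the function names `sha256_of` and `refresh`, and the choice of SHA-256 as the digest, are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> str:
    """Copy source to target (e.g. on new media) and verify the copy.

    Returns the shared digest, or raises IOError if the copy is not
    bit-identical to the source.
    """
    before = sha256_of(source)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"refresh failed: {source} and {target} differ")
    return before
```

Recording the returned digest alongside the data lets the next refresh cycle confirm that nothing changed in between, which is one way of making "active management" concrete.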
Preservation challenges (3)
• Hardware and software dependence
• Most digital objects are dependent on particular
configurations of hardware and software
• Relatively short obsolescence cycles for:
• Hardware
• Scientific instrumentation, peripherals (e.g. floppy disk
drives)
• Software
• e.g., word-processing files, CAD
Conceptual problems (1)
• What is a digital object?
• Some are analogues of traditional objects, e.g.
meeting minutes, research papers
• Others are not, e.g. Web pages, GIS, 3D models
of chemical structures
• Complexity
• Dynamic nature
Conceptual problems (2)
• Three layers:
• Physical: the bits stored on a particular medium
• Logical: defines how the bits are used by a
software application, based on data types (e.g.
ASCII); in order to understand (or preserve) the
bits, we need to know how to process this
• Conceptual: things that we deal with in the real
world
From: Ken Thibodeau, “Overview of technological approaches to digital
preservation and challenges in coming years.” In: The state of digital
preservation: an international perspective. CLIR, 2002.
http://www.clir.org/
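The gap between the physical and logical layers can be shown with a few lines of Python. This is an illustrative sketch, not taken from Thibodeau's paper: the four byte values and the variable names are chosen for the example. The same stored bits yield different things depending on the data type a program assumes.

```python
import struct

# Physical layer: four bytes as they might sit on a storage medium.
physical = bytes([0x41, 0x42, 0x43, 0x44])

# Logical layer: the same bits under three different interpretations.
as_text = physical.decode("ascii")                     # read as ASCII characters
as_uint_be = struct.unpack(">I", physical)[0]          # read as a big-endian 32-bit integer
as_uint_le = struct.unpack("<I", physical)[0]          # read as a little-endian 32-bit integer

print(as_text, hex(as_uint_be), hex(as_uint_le))
```

Without a record of which interpretation was intended, the bits alone do not determine the conceptual object, which is why knowledge of the logical layer has to be preserved with them.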
Conceptual problems (3)
• On which of these layers should preservation
activities focus?
• We need to preserve the ability to reproduce the objects, not
just the bits
• In fact, we can change the bits and logical representation
and still reproduce an authentic conceptual object (e.g.
converting into PDF)
• Authenticity and integrity
• How can we trust that an object is what it claims to be?
• Digital information can easily be changed by accident or
design
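One common way to detect change is to record a digest for each object and compare it later. The sketch below is a hedged illustration; the function names `fixity` and `changed_objects` and the use of SHA-256 are assumptions for the example. A plain digest mainly guards against accidental change: detecting deliberate alteration also requires that the recorded digests themselves are kept somewhere an attacker cannot modify.

```python
import hashlib


def fixity(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def changed_objects(recorded: dict[str, str], current: dict[str, bytes]) -> list[str]:
    """Return identifiers whose bytes are missing or no longer match
    the digest recorded for them."""
    failures = []
    for identifier, digest in recorded.items():
        data = current.get(identifier)
        if data is None or fixity(data) != digest:
            failures.append(identifier)
    return sorted(failures)
```

Note that a matching digest shows only that the bits are unchanged; it says nothing about whether the object was what it claimed to be when the digest was first recorded.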
Some general principles (1)
• Most of the technical problems associated with long-term digital preservation can be solved if a life-cycle
management approach is adopted
• i.e. a continual programme of active management
• Ideally, combines both managerial and technical processes,
e.g., as in the OAIS Model
• Many current systems are attempting to support this
approach
• Preservation strategies need to be seen in this wider context
• Preservation needs to be considered at a very early
stage in an object's life-cycle
Some general principles (2)
• There is a need to identify 'significant properties'
• Recognises that preservation is context dependent
• Helps with choosing an acceptable preservation strategy
• Consider encapsulation
• Surrounding the digital object - at least conceptually - with all
of the information needed to decode and understand it
(including software)
• Produces autonomous 'self-describing' objects, reduces
external dependencies (linked to the Information Package
concept in the OAIS Reference Model)
• Keep the original byte-stream
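Encapsulation can be illustrated, very roughly, as a package that carries the original byte-stream together with the information needed to interpret it. The JSON layout and field names below are hypothetical and are not the OAIS Information Package format; they only show the idea of a self-describing object.

```python
import base64
import hashlib
import json


def encapsulate(content: bytes, format_name: str, documentation: str) -> str:
    """Wrap the original bytes with a digest and a description of how
    to interpret them. All field names are illustrative assumptions."""
    package = {
        "content_base64": base64.b64encode(content).decode("ascii"),
        "sha256": hashlib.sha256(content).hexdigest(),
        "representation_information": {
            "format": format_name,
            "documentation": documentation,
        },
    }
    return json.dumps(package, indent=2, sort_keys=True)


def unpack(package_json: str) -> bytes:
    """Recover the original byte-stream, checking it against the digest."""
    package = json.loads(package_json)
    content = base64.b64decode(package["content_base64"])
    if hashlib.sha256(content).hexdigest() != package["sha256"]:
        raise ValueError("content does not match the recorded digest")
    return content
```

For example, `unpack(encapsulate(b"1,2,3\n", "CSV", "comma-separated, one header row"))` returns the original bytes unchanged, so the byte-stream is kept even though it travels inside a wrapper.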
Some general principles (3)
• Metadata and documentation are vitally important
• Relates to OAIS concepts such as Representation
Information and Preservation Description
Information
• Functions
• Records scientific meaning
• Records the research context
• Enables the development of finding aids
• Standards are being developed that support digital
preservation activities (e.g., the PREMIS Data
Dictionary)
Digital preservation strategies
• Three main families:
• Technology preservation
• Technology emulation
• Information migration
• Also:
• Digital archaeology (rescue)
Technology preservation
• The preservation of an information object together with all
of the hardware and software needed to interpret it
• Successfully preserves the look, feel and behaviour of the
whole system (at least while the hardware and software still
function)
• May have a role for historically important hardware
• Severe problems with storage and ongoing maintenance,
missing documentation
• Would inevitably lead to 'museums' of “ageing and
incompatible computer hardware” -- Mary Feeney
• May have a shorter-term role for supporting the rescue of
digital objects (digital archaeology)
Technology emulation (1)
• Preserving the original bit-streams and application
software; running this on emulator programs that
mimic the behaviour of obsolete hardware
• Emulators change over time
• Chaining, rehosting
• Emulation Virtual Machines
• Running emulators on simplified 'virtual machines' that can
be run on a range of different platforms
• Virtual machines are migrated so the original bit-streams
do not have to be
Technology emulation (2)
• Benefits:
• Technique already widely used, e.g. for emulating
different hardware, computer games
• Preserves (and uses) the original bits
• Reduces the need for regular object transformations (but
emulators and virtual machines may themselves need to
be migrated)
• Retains ‘look-and-feel’
• May be the only approach possible where objects are
complex or dependent on executable code
• Less 'understanding' of formats is needed; little
incremental cost in keeping additional formats
Technology emulation (3)
• Challenges:
• Do organisations have the technical skills necessary to
implement the strategy?
• Preserving 'look and feel' may not be needed for all
objects
• It will be difficult to know definitively whether user
experience has been accurately preserved
• Conclusions:
• Promising family of approaches
• Needs further practical application and research, e.g.
• Dioscuri software (National Library of the Netherlands
(KB), Nationaal Archief and Planets project)
Information migration (1)
• Managed transformations:
• A set of organised tasks designed to achieve the periodic
transfer of digital information from one hardware and
software configuration to another, or from one generation
of computer technology to a subsequent one - CPA/RLG
report (1996)
• Abandons attempts to keep old technology (or
substitutes for it) working
• A 'known' solution used by data archives and software
vendors (a linear migration strategy is used by
vendors for some data types, e.g. Microsoft
Office files)
• Focuses on the content (or properties) of objects
Information migration (2)
• Main types (from OAIS Model):
• Refreshment
• Replication
• Repackaging
• Transformation
• Challenges:
• Labour intensive
• There can be problems with ensuring the 'integrity and
authenticity' of objects
• Transformations need to be documented (part of the
preservation metadata)
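Documenting a transformation might look something like the sketch below, which migrates text from one character encoding to another and records what was done. The event structure and its field names are hypothetical and do not follow any particular metadata schema; the encoding change is simply a small, well-understood stand-in for a format migration.

```python
import hashlib
from datetime import datetime, timezone


def migrate_latin1_to_utf8(original: bytes) -> tuple[bytes, dict]:
    """Re-encode ISO-8859-1 text as UTF-8 and describe the change.

    Returns the migrated bytes plus an event record (illustrative field
    names) that ties the new bytes back to the original ones.
    """
    text = original.decode("latin-1")
    migrated = text.encode("utf-8")
    event = {
        "event_type": "transformation",
        "detail": "character encoding changed from ISO-8859-1 to UTF-8",
        "date": datetime.now(timezone.utc).isoformat(),
        "source_sha256": hashlib.sha256(original).hexdigest(),
        "outcome_sha256": hashlib.sha256(migrated).hexdigest(),
    }
    return migrated, event
```

The bits change but the text a reader sees does not, and the event record lets a later curator check which original the migrated object came from.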
Information migration (3)
• Uses:
• Seems to be most suitable for dealing with large
collections of similar objects
• Migration can often be combined with some form of
standardisation process, e.g., on ingest
• ASCII
• Bit-mapped-page images
• Well-defined XML formats
• Some variations: Migration on Request (CAMiLEON
project)
• Keep original bits, migrate the rendering tools
Digital archaeology
• Not so much a preservation strategy, but the
default situation if there isn't one
• Using various techniques to recover digital content
from obsolete or damaged physical objects
(media, hardware, etc.)
• A time consuming process, needs specialised equipment
and (in most cases) adequate documentation
• Considered to be expensive (and risky)
• Remains an option for content deemed to be of value
Choosing a strategy
• Preservation strategies are not in competition (different
strategies will work together)
• It has been suggested that we should keep the original bits
(with some documentation) in any case
• But the strategy chosen has implications for:
• The technical infrastructure required (and metadata)
• Collection management priorities
• Rights management
• e.g., owning the rights to re-engineer software
• Costs
• Planets project - PLATO preservation planning tool
• Decision support tool
File formats and preservation
• Formats can be identified and validated at
ingest
• JHOVE, PRONOM-DROID
• Standardisation on ingest
• Received wisdom suggests the adoption of open or
non-proprietary standards, e.g. databases
structured in XML, uncompressed images
• However, we need more empirical data on how
robust some of these standards are to random bitrot
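Format identification at ingest often starts from byte signatures ("magic numbers") at the beginning of a file. The sketch below shows that general idea only; it is not how JHOVE or DROID are invoked, and the short signature table and the function name `identify` are assumptions for the example. Real tools use much larger signature registries and, in the case of validation, also check the internal structure of the file.

```python
# A handful of well-known leading byte signatures (illustrative, not exhaustive).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),      # little-endian TIFF
    (b"MM\x00*", "TIFF"),      # big-endian TIFF
    (b"PK\x03\x04", "ZIP"),
]


def identify(header: bytes) -> str:
    """Guess a format name from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"
```

A result of "unknown" at ingest is itself useful information, since it flags objects that may need manual inspection before a preservation strategy can be chosen.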
Other preservation challenges
• Scale (1):
• The “digital deluge”
• e-Science
• New generations of instruments
• Computer simulations
• Many terabytes generated per day, petabyte scale
computing (and growing)
• Cory Doctorow, “Welcome to the petacentre.”
Nature, 455, pp 17-21, 4 Sep 2008
Other preservation challenges
• Scale (2):
• Problems of scale are particularly acute in
traditional 'big-science' disciplines:
• Particle physics (e.g., the Large Hadron Collider)
• Astronomy (sky surveys, etc)
• But “smaller experiments will grow the fastest”
(Szalay & Gray, Nature, 440, 413-4, 23 Mar 2006)
• Bioinformatics, crystallography, engineering design, and
many others
• In some cases it may be cheaper just to generate
the data again, e.g. for computer simulations
Other preservation challenges
• Complexity (1)
• Research data is extremely diverse - not really a
single category of material
• tabular data, images, GIS, etc.
• raw machine output vs. derived data
• varying levels of structure (XML, legacy formats, etc.)
• many different standards
• Research data is not homogeneous
• No one-size-fits-all approach possible
Other preservation challenges
• Complexity (2):
• Even wider range of social contexts in which data
is used (and shared)
• DCC SCARP project has been exploring
disciplinary factors in curation practice
• Practice even within single disciplines is very fragmented
• Case studies ongoing
• Big-science archives, medical and social sciences,
architecture and engineering, biological images
Other preservation challenges
• Diverse research cultures
• Data practices vary widely, even within a single
discipline
• Gene sequence data is typically deposited in public
databases
• In proteomics sharing is not so widespread; partly driven
by lack of standards, but there is also concern about who
has exploitation rights
• Role of commercial interests
• Pharmaceuticals, architecture and engineering,
geological prospecting
Other preservation challenges
• Costs
• Recent JISC study (2008) - focusing on the
institution level
• Some findings:
• The complex service requirements for curating research
data mean that institutions are setting up federated
approaches to repository development
• Currently ingest costs are much higher than long-term
storage and preservation costs
• Start-up (and R&D) costs are high, but there can be
economies of scale
Research data collections (1)
• A typology (1):
• From National Science Board report Long-lived
digital data collections (2005)
• Research data collections – the products of one or more
focused research projects
• Resource or community data collections – collections
that emerge to serve particular subject sub-disciplines
• Reference data collections – serve a broader and more
diverse set of user communities
Research data collections (2)
• Data in “research data collections” is most at
risk
• A modern version of the “file-drawer problem”
• Data stored on personal hard-drives or on media;
largely undocumented
• Particular challenge when the data creator has
retired or moved to another institution
• Data creators not always aware of its potential
value
• The reward structure of science is not always
helpful
Curation infrastructures (1)
• Focus on the generic:
• Need for a balance between:
• The 'bottom-up' discipline-based drivers that promote the
generation of research data
• The policy level, looking to make cost effective
investment in curation
• When building infrastructures, focus on the
generic
• Storage systems and middleware
• Preservation services
• Identifying the needs of the wider community
Curation infrastructures (2)
• The need for collaboration:
• Need for 'deep-infrastructure' recognised as far
back as 1996 by the Task Force on Archiving of
Digital Information
• Digital preservation involves the "grander problem
of organizing ourselves over time and as a society
... [to manoeuvre] effectively in a digital landscape"
(p. 7)
Summing-up
• Long-term preservation of research data is a
big ongoing challenge
• Solutions are based on the active
management of data
• Decisions are needed on whether to adopt
standard formats, on identifying significant
properties, and on preservation planning
• Research disciplines and sub-disciplines are
at different stages of maturity
The Future ...
• “It is always a mistake for a historian to try and
predict the future. Life, unlike science, is
simply too full of surprises” - Richard J. Evans,
In defence of history (1997, p. 62)
Further reading
• National Science Board, Long-lived digital data collections:
enabling research and education in the 21st century (NSF,
2005) http://www.nsf.gov/pubs/2005/nsb0540/
• Liz Lyon, Dealing with data: roles, rights, responsibilities and
relationships (JISC, 2007)
http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2005/dealingwithdata.aspx
• Neil Beagrie, Julia Chruszcz, and Brian Lavoie, Keeping
research data safe: a cost model and guidance for UK
universities (JISC, 2008)
http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx
Thank you for your attention!
Image: “Pigabyte”, King Bladud’s Pigs in Bath (public art project), Summer 2008
http://www.kingbladudspigs.org/
Acknowledgements
• UKOLN is funded by the Museums, Libraries and
Archives Council (MLA), the Joint Information
Systems Committee (JISC) of the UK higher and
further education funding councils, as well as by
project funding from the JISC, the European Union,
and other sources. UKOLN also receives support
from the University of Bath, where it is based.
• More information: http://www.ukoln.ac.uk/