Tackling concrete digital preservation
challenges with SPRUCE
Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
http://openplanetsfoundation.org/blogs/paul
Summary
• Some digital preservation challenges and solutions
–Not exhaustive
–Illustrate with some real examples
–Summarise with some practical steps for digital
preservation
• Taking a community approach to digital
preservation
–SPRUCE Project
–How to get involved
–Where to get help
Keeping the bits
00101100101110101
01010101010101110
01010101010010010
01011010101010111
10101001000101111
01010101010110101
11001010100101010
01010101001010101
01001010111010111
Digital data is fragile
Courtesy of State and University Library, Denmark
Digital preservation storage: keeping the bits
• Media decay
• Media becomes partially or completely
unreadable
• Media obsolescence
• Without the hardware needed to read
(hand-held) media, the content becomes
inaccessible
• Practical issues
• Inserting lots of discs into a drive is
costly
Images courtesy of The British Library
Bit storage recommendations
• Don’t fall for media longevity claims from vendors! They are
missing the point!
• Accept that media decays, media formats will change, and
any media will become inaccessible in the medium term
• Rather than putting your data in a dark archive and trusting
it will survive for a long period...
• Manage it closely, refresh to new media frequently, choose
media that is easy to manage
– Choose media that is easy to access (server storage, cloud, external
hard drives)
– Make at least 3 copies of all data, keep copies in different
geographical locations
– Frequently check the condition of your data
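As a rough illustration of why three copies help, the sketch below compares the same collection held in several locations and reports which copy disagrees with the majority for each file. The function names and the majority-vote rule are assumptions made for this example, not part of the SPRUCE guidance. Files are read whole for brevity, and a tie is settled arbitrarily, so treat the output as a prompt for manual inspection.

```python
import hashlib
from collections import Counter
from pathlib import Path
from typing import Optional


def checksum_or_none(path: Path) -> Optional[str]:
    """Return the SHA-256 of a file, or None if the file is absent."""
    if not path.is_file():
        return None
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_copies(copies: list[Path]) -> dict[str, list[Path]]:
    """Map each relative file name to the copy roots that differ from the majority."""
    names = {
        path.relative_to(root).as_posix()
        for root in copies
        for path in root.rglob("*")
        if path.is_file()
    }
    suspect: dict[str, list[Path]] = {}
    for name in sorted(names):
        sums = [checksum_or_none(root / name) for root in copies]
        # The most common checksum among the copies that exist is taken as correct.
        majority, _ = Counter(s for s in sums if s is not None).most_common(1)[0]
        odd = [root for root, s in zip(copies, sums) if s != majority]
        if odd:
            suspect[name] = odd
    return suspect
```

With only two copies, a mismatch tells you something is wrong but not which copy to trust; a third copy gives a tie-breaker.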
Verifiable Manifests (Checksums!)
• Allow you to easily check the condition of your digital stuff
Are any of your digital
files damaged?
Are any of your
digital files missing?
• Single most useful digital preservation activity
• Generate manifests as early as possible
• Frequently re-check them over time
• Mend content when necessary
• Library of Congress BagIt specification and Bagger tool
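A minimal sketch of generating such a manifest with Python's standard library is shown here. The choice of SHA-256, the function names and the `<checksum>  <relative path>` line layout are assumptions for illustration; BagIt defines its own manifest files, and the Bagger tool produces those for you.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root, '/'-separated) to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_manifest(manifest: dict[str, str], target: Path) -> None:
    """Write one '<checksum>  <relative path>' line per file."""
    lines = [f"{checksum}  {name}" for name, checksum in sorted(manifest.items())]
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Write the manifest file outside the collection directory (or exclude it when re-checking) so that it does not end up listing itself.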
Dependence on software
10101010101111010001010101000
10010100101110100101010101001
00100101010100100000101010101
01111101000101010100010010100
10111010010101010100100100101
01010010000010101010101111101
00010101010001001010010111010
01010101010010010010101010010
00001010101010110101010111101
00010101010001001010010111010
01010101010010010010101010010
00001010101010111110100010101
01000100101001011101001010101
01001001001010101001000001010
10101011111010001010101000100
1010010111010010101010100100…
SOI
APP0 JFIF 1.2
APP13 IPTC
APP2 ICC
DQT
SOF0 200x392
DRI
DHT
SOS
ECS0
RST0
ECS1
RST1
ECS2…
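The marker sequence on this slide (SOI, APP0, DQT, SOS, RST0 and so on) is what software sees when it interprets the raw bits as a JPEG file. The sketch below walks a byte string and lists the markers it finds. It is a simplified illustration written for this example, not a validating parser: it assumes a baseline JPEG layout and only names the handful of markers shown on the slide.

```python
import struct

# Names for the fixed markers that appear on the slide, plus EOI.
MARKER_NAMES = {
    0xD8: "SOI", 0xD9: "EOI", 0xC0: "SOF0", 0xC4: "DHT",
    0xDB: "DQT", 0xDD: "DRI", 0xDA: "SOS",
}


def marker_name(code: int) -> str:
    """Translate a marker code (the byte after 0xFF) into a readable name."""
    if 0xE0 <= code <= 0xEF:
        return f"APP{code - 0xE0}"
    if 0xD0 <= code <= 0xD7:
        return f"RST{code - 0xD0}"
    return MARKER_NAMES.get(code, f"0x{code:02X}")


def list_markers(data: bytes) -> list[str]:
    """List JPEG markers in order of appearance, skipping entropy-coded data."""
    names = []
    pos = 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            pos += 1  # ordinary byte inside entropy-coded scan data
            continue
        code = data[pos + 1]
        if code == 0x00:
            pos += 2  # 0xFF00 is an escaped 0xFF data byte, not a marker
            continue
        if code == 0xFF:
            pos += 1  # fill byte before a marker
            continue
        names.append(marker_name(code))
        pos += 2
        standalone = code in (0xD8, 0xD9, 0x01) or 0xD0 <= code <= 0xD7
        if not standalone:
            if pos + 2 > len(data):
                break  # truncated segment
            # The two-byte big-endian length counts itself plus the payload.
            (length,) = struct.unpack(">H", data[pos:pos + 2])
            pos += length
    return names
```

Calling `list_markers(Path("image.jpg").read_bytes())` on a real file (the filename is hypothetical) should give a list resembling the one on the slide; a file that stops short of EOI is a hint that it has been truncated.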
When it goes wrong…
Migration, Emulation and all that...
• Migrate content from an obsolete format to a more modern
usable format
• Emulate the original computing environment and run the
obsolete software originally used
• Words of caution:
– Is software obsolescence really a critical risk for our digital data?
– The debate continues... International Council on Archives Congress
2012:
• Michael Carden, National Archives Australia: NAA migrates all content
• Oliver Morley, UK National Archives: “digital formats have standardized”
• Blogged by Inge Angevaare: http://www.ncdd.nl/blog/?p=2786
– The hard part is the quality assurance of the results. Was anything
lost or damaged in the process?
Stuff happens!
• Whenever a digital collection is moved, processed, curated
or altered in any way, things can go wrong!
• Network dropouts at critical times
• Disks fill up, and data subsequently copied to them is lost
• Software bugs lead to unexpected results
• Human error leads to all sorts of issues
• Stuff happens a lot more at scale!
Digitisation post processing corruption
Images courtesy of The British Library
TIFF to JPEG2000 migration corruption
Images courtesy of The British Library
Technology can be imperfect!
Format specification
ambiguity and
corresponding tool bugs
JPEG2000s can be missing
vital source resolution
• For more on JPEG2000 format and tool risks see:
http://wiki.opf-labs.org/display/TR/JP2
Images courtesy of The British Library
Assume nothing, validate everything
• Only process or alter digital content when it is absolutely
necessary
• Double check everything
• Make no assumptions
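One way to apply "double check everything" to a routine file copy is sketched below: checksum the source, copy, then checksum the destination and compare. The function names are hypothetical, and SHA-256 is an assumed choice. Re-reading the destination straight after writing may be served from an operating system cache, so a later independent re-check is still worthwhile.

```python
import hashlib
import shutil
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> str:
    """Copy a file, then re-read both ends and compare their checksums."""
    before = checksum(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    after = checksum(destination)
    if before != after:
        raise IOError(f"Copy of {source} to {destination} does not match the source")
    return after
```

The returned checksum can be recorded in a manifest at the same time, so the copy and its fixity record are created together.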
First steps in practical digital preservation

Prompt check in – have you got what you thought you would receive?
– Check expected files are present, open a random selection to verify expected quality
– Request replacements from supplier promptly

Create a verifiable manifest
– Create a top down manifest file that lists each digital object in your collection as a relative
filename and a checksum
– Library of Congress BagIt specification and tools will also do a good job here

Make at least 3 copies. Protect the bits
– Keep a copy on easily accessible media
– Back up to tape or additional disk. Keep copies in different geographical locations to guard
against a catastrophic disaster. Cloud storage is also an option.

Frequently inspect the condition of your data
– Revisit the collection, recalculate your manifests and verify content has not been lost
– Do a test recovery of your backups to ensure they are working effectively!

Record the existence of each of your collections in a digital items register
– Record: what it is, where it is, who owns and is responsible for it, and who can
access it.

Assume nothing, validate everything!
– Double check any processes in the lifecycle that move or alter your digital content
– Built-in checks can be flawed; a second opinion is much more trustworthy
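The "revisit the collection and recalculate your manifests" step can be sketched as follows. The manifest layout (one `<checksum>  <relative path>` line per file, SHA-256) and the function names are assumptions for this example; BagIt tools perform an equivalent check against their own manifest files.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest_file: Path) -> dict[str, str]:
    """Parse lines of the form '<checksum>  <relative path>'."""
    entries = {}
    for line in manifest_file.read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, name = line.split(maxsplit=1)
            entries[name] = checksum
    return entries


def check_collection(root: Path, expected: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root with the manifest entries."""
    present = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    missing = sorted(set(expected) - present)
    unexpected = sorted(present - set(expected))
    damaged = sorted(
        name
        for name in present & set(expected)
        if file_checksum(root / name) != expected[name]
    )
    return {"missing": missing, "damaged": damaged, "unexpected": unexpected}
```

The three lists answer the two questions a manifest is for (is anything missing, is anything damaged) and also flag files that have appeared since the manifest was made. Run the same check against a test restore of your backups, not just the primary copy.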
SPRUCE Project
Sustainable Preservation Using Community
Engagement
• JISC funded
• 2 years in length (until Nov 2013)
• £250k funding
http://wiki.opf-labs.org/display/SPR
Some observations
• Lack of focus on the real needs of digital
preservation practitioners
• Insufficient collaboration + coordination
• Duplication of effort
The SPRUCE Mashup:
Identify and Solve concrete problems
• 3 day workshop for ~30 people
• Practitioners bring along digital
collections
• We identify preservation challenges
• Pair up practitioners with technical
experts
• Apply existing open source tools to
solve the problems
• In doing so, we exchange knowledge
about digital preservation
• Develop a supportive community
Glasgow Mashup
April 2012
What questions do practitioners want answered?
–What is this digital collection?
–What risks are associated with this digital collection?
–Separate collection content from temporary/other files.
–Identify and weed duplicate or similar files.
–Is the metadata consistent with the content?
–Are all the pages present in each issue?
–Are all digitised pages in focus?
–Are any files damaged?
–Are the files compliant with a particular profile?
• See the results here: http://bit.ly/spruce-results
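As an example of how one of these questions can be approached, the sketch below finds exact duplicate files by grouping on size and then on checksum. It is an illustration written for this text, not one of the mashup solutions, and it only catches byte-identical files; spotting "similar" files (for instance the same image at different resolutions) needs other techniques. Files are read whole for brevity.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of byte-identical files found under root."""
    # Files of different sizes cannot be identical, so group by size first
    # and only hash files that share a size with at least one other file.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for path in same_size:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Review each group by hand before weeding anything: identical files in different folders are sometimes deliberate.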
Make it sustainable
• Work with practitioners to develop a business case
for their work
• Make small funding awards to further develop and
embed the work begun in the mashups
York Mashup
September 2011
Online collaboration
• Sharing requirements
• Sharing experiences: what tools worked well, what
approaches should be avoided
• Building on existing tools, rather than re-inventing
the wheel
• Libraries + Information Science question and answer
site:
–http://libraries.stackexchange.com/
• More recommended collaborative activities:
–http://bit.ly/spruce-collaborate
Thanks for listening! Any questions?
Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
Email: p.r.wheatley@leeds.ac.uk
http://openplanetsfoundation.org/blogs/paul