Tackling concrete digital preservation challenges with SPRUCE

Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
http://openplanetsfoundation.org/blogs/paul

Summary
• Some digital preservation challenges and solutions
  – Not exhaustive
  – Illustrate with some real examples
  – Summarise with some practical steps for digital preservation
• Taking a community approach to digital preservation
  – SPRUCE Project
  – How to get involved
  – Where to get help

Keeping the bits
[Slide graphic: a block of binary digits]

Digital data is fragile
[Image courtesy of State and University Library, Denmark]

Digital preservation storage: keeping the bits
• Media decay
• Media becomes partially or completely unreadable
• Media obsolescence
• Without the hardware needed to read the (hand-held) media, it becomes inaccessible
• Practical issues
• Inserting lots of discs into a drive is costly
[Images courtesy of The British Library]

Bit storage recommendations
• Don’t fall for media longevity claims from vendors! They are missing the point!
• Accept that media decays, media formats will change, and any media will become inaccessible in the medium term
• Rather than putting your data in a dark archive and trusting it will survive for a long period...
• Manage it closely, refresh to new media frequently, choose media that is easy to manage
  – Choose media that is easy to access (server storage, cloud, external hard drives)
  – Make at least 3 copies of all data, keep copies in different geographical locations
  – Frequently check the condition of your data

Verifiable Manifests (Checksums!)
• Allow you to easily check the condition of your digital stuff
  – Are any of your digital files damaged?
  – Are any of your digital files missing?
• Single most useful digital preservation activity
• Generate manifests as early as possible
• Frequently re-check them over time
• Mend content when necessary
• LoC BagIt specification and Bagger tool

Dependence on software
[Slide graphic: a block of binary digits, followed by the JPEG segment markers below]
SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 200x392 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2…

When it goes wrong…

Migration, Emulation and all that...
• Migrate content from an obsolete format to a more modern usable format
• Emulate the original computing environment and run the obsolete software originally used
• Words of caution:
  – Is software obsolescence really a critical risk for our digital data?
  – The debate continues... International Council on Archives Congress 2012:
    • Michael Carden, National Archives of Australia: NAA migrates all content
    • Oliver Morley, UK National Archives: “digital formats have standardized”
    • Blogged by Inge Angevaare: http://www.ncdd.nl/blog/?p=2786
  – The hard part is the quality assurance of the results. Was anything lost or damaged in the process?

Stuff happens!
• Whenever a digital collection is moved, processed, curated or altered in any way... things can go wrong!
• Network dropouts at critical times
• Disks get full, subsequent data copied there is lost
• Software bugs lead to unexpected results
• Human error leads to all sorts of issues
• Stuff happens a lot more at scale!
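
A verifiable manifest is one way to catch the failures listed above after a collection has been moved or copied. The following is a minimal sketch, not the BagIt format or the Bagger tool: it assumes SHA-256 as the checksum (the slides do not name an algorithm), keeps the manifest in memory as a plain dictionary, and the function names `sha256_of`, `build_manifest` and `verify_manifest` are hypothetical names chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path, relative to root, to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (missing, damaged) relative filenames for the copy at root."""
    missing, damaged = [], []
    for name, expected in sorted(manifest.items()):
        path = root / name
        if not path.is_file():
            missing.append(name)
        elif sha256_of(path) != expected:
            damaged.append(name)
    return missing, damaged


if __name__ == "__main__":
    # Hypothetical demo collection, built in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "issue1").mkdir()
        (root / "issue1" / "page1.tif").write_bytes(b"page one")
        (root / "issue1" / "page2.tif").write_bytes(b"page two")
        manifest = build_manifest(root)

        # Simulate "stuff happening": one file corrupted, one file lost.
        (root / "issue1" / "page1.tif").write_bytes(b"page 0ne")
        (root / "issue1" / "page2.tif").unlink()

        missing, damaged = verify_manifest(root, manifest)
        print("missing:", missing)
        print("damaged:", damaged)
```

The two result lists answer the two manifest questions directly: which files are missing, and which are damaged. Relative filenames are used so that the same manifest can be checked against any copy of the collection, wherever it is stored.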
Digitisation post processing corruption
[Images courtesy of The British Library]

TIFF to JPEG2000 migration corruption
[Images courtesy of The British Library]

Technology can be imperfect!
Format specification ambiguity and corresponding tool bugs
JPEG2000s can be missing vital source resolution
[Image label: JPEG 2000]
• For more on JPEG2000 format and tool risks see: http://wiki.opf-labs.org/display/TR/JP2
[Images courtesy of The British Library]

Assume nothing, validate everything
• Only process or alter digital content when it is absolutely necessary
• Double check everything
• Make no assumptions

First steps in practical digital preservation

Prompt check in: have you got what you thought you would receive?
  – Check expected files are present, open a random selection to verify expected quality
  – Request replacements from supplier promptly
Create a verifiable manifest
  – Create a top down manifest file that lists each digital object in your collection as a relative filename and a checksum
  – Library of Congress BagIt specification and tools will also do a good job here
Make at least 3 copies. Protect the bits
  – Keep a copy on easily accessible media
  – Back up to tape or more disk. Keep copies in different geographical locations to avoid catastrophic disaster. Cloud storage is also an option.
Frequently inspect the condition of your data
  – Revisit the collection, recalculate your manifests and verify content has not been lost
  – Do a test recovery of your backups to ensure they are working effectively!
Record the existence of each of your collections in a digital items register
  – Record: what it is, who is responsible for it, where it is, who owns it, and who can access it.
Assume nothing, validate everything!
  – Double check any processes in the lifecycle that move or alter your digital content
  – Built-in checks can be flawed; a second opinion is much more trustworthy

SPRUCE Project
Sustainable Preservation Using Community Engagement
• JISC funded
• 2 years in length (until Nov 2013)
• £250k funding
http://wiki.opf-labs.org/display/SPR

Some observations
• Lack of focus on the real needs of digital preservation practitioners
• Insufficient collaboration + coordination
• Duplication of effort

The SPRUCE Mashup: Identify and Solve concrete problems
• 3 day workshop for ~30 people
• Practitioners bring along digital collections
• We identify preservation challenges
• Pair up practitioners with technical experts
• Apply existing open source tools to solve the problems
• In doing so, we exchange knowledge about digital preservation
• Develop a supportive community
Glasgow Mashup April 2012

What questions do practitioners want answered?
  – What is this digital collection?
  – What risks are associated with this digital collection?
  – Separate collection content from temporary/other files.
  – Identify and weed duplicate or similar files.
  – Is the metadata consistent with the content?
  – Are all the pages present in each issue?
  – Are all digitised pages in focus?
  – Are any files damaged?
  – Are the files compliant with a particular profile?
• See the results here: http://bit.ly/spruce-results

Make it sustainable
• Work with practitioners to develop a business case for their work
• Make small funding awards to further develop and embed the work begun in the mashups
York Mashup September 2011

Online collaboration
• Sharing requirements
• Sharing experiences: what tools worked well, what approaches should be avoided
• Building on existing tools, rather than re-inventing the wheel
• Libraries + Information Science question and answer site:
  – http://libraries.stackexchange.com/
• More recommended collaborative activities:
  – http://bit.ly/spruce-collaborate

Thanks for listening! Any questions?

Paul Wheatley
SPRUCE Project Manager
University of Leeds
Twitter: @prwheatley
Email: p.r.wheatley@leeds.ac.uk
http://openplanetsfoundation.org/blogs/paul