Challenges of Digital Preservation

MA / CS 109
April 22, 2011
Andrea Goethals
Manager of Digital Preservation & Repository Services
Harvard Library
“Digital Content”?

Digitized (born-analog)

Born-digital
◦ Tweets
◦ Web sites
◦ Email
◦ Documents
  ▪ PDF
  ▪ Word, OpenOffice …
  ▪ Spreadsheets
◦ Data sets
Digital content is not new





1957: 1st digital image (Russell Kirsch’s son; source: NIST)
1969: ARPAnet
1971: 1st email sent
1972: 1st consumer-level video game
1975: 1st digital camera
But has only recently exploded

1998: 1st Google index
◦ 26 million pages

2000: Google index
◦ 1 billion pages

2008: Google link processors
◦ 1 trillion unique URIs
◦ “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog
The coming tsunami

2010: estimated at 1.2 ZB (1 ZB is 1 billion TB)
◦ DVDs stacked from Earth to the Moon and back

2020: expected to reach 35 ZB, 44 times the 2009 level
◦ DVDs stacked halfway to Mars
Source: 2010 IDC Digital Universe Study sponsored by EMC
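A quick arithmetic check of the unit conversion and growth figures, offered as a sketch only. It assumes decimal (SI) units, and the 2009 baseline of 0.8 ZB is an assumption recalled from the IDC study rather than a number shown on the slide.

```python
# Decimal (SI) byte units.
TB = 10**12  # bytes in one terabyte
ZB = 10**21  # bytes in one zettabyte

tb_per_zb = ZB // TB  # how many terabytes fit in one zettabyte

size_2010_zb = 1.2   # estimate quoted on the slide
size_2020_zb = 35.0  # projection quoted on the slide
size_2009_zb = 0.8   # ASSUMPTION: 2009 baseline, not stated on the slide

growth_from_2010 = size_2020_zb / size_2010_zb
growth_from_2009 = size_2020_zb / size_2009_zb

print(f"1 ZB = {tb_per_zb:,} TB")
print(f"2010 -> 2020 growth factor: {growth_from_2010:.1f}")
print(f"2009 -> 2020 growth factor: {growth_from_2009:.1f}")
```

Under these assumptions, the factor of 44 only works out when it is measured from the 2009 baseline, not from the 2010 estimate.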
Outpacing storage
Source: 2009 IDC Digital Universe Study sponsored by EMC
Why do we care?
May be historically significant
Captured March 19, 2011 for a Japan Earthquake collection created by Virginia Tech,
Internet Archive (http://www.archive-it.org/public/collection.html?id=2438)
May be a work of art
YouTube Play. A Biennial of Creative Video (Oct. 2010 -)
May be an important reference
Only available in digital form
May only be possible digitally
Who cares?

Cultural heritage institutions
◦ Libraries, archives
◦ Museums, historical societies
◦ Academic institutions
Governments
Entertainment, news and media industry
Scientific community
Funding bodies (NSF, NIH)
You?

Preservation historically

Archives and libraries have been preserving all kinds of analog material for centuries using:
◦ Environmental control
◦ Conservation treatments

Can store away until resources allow processing
◦ Benign neglect approach works well
Analog content is fairly durable

Even damaged, may still be identifiable, readable, usable
Anatolian Cuneiform Tablet, circa 1850 BCE
In contrast, digital content is
Easily destroyed
Transient
Hidden
Requires more active attention – benign neglect approach doesn’t work

Digital content is easily destroyed
Bad people
Hardware or software failures
Human mistakes
◦ The slip of a finger can lead to catastrophic results
◦ “Help! Accidental deletion. I accidentally deleted 62 images… can you please recover them from backups?”
Digital content is transient

Average lifespan of a Web site is between 44 and 100 days
Captured April 8, 2009
Visited October 13, 2010
Digital content is hidden

Which is corrupt?
Digital content is hidden

Both. Use helps, but it’s not enough to detect corruption.
But is it usable???

It’s not enough to preserve the digital bits
◦ AppleWorks?
◦ WordStar?
◦ Excel 1.0?

To use digital content we need software that can read the format
Reading formats
ffd8ffe000104a46494600010201
008300830000ffed0fb050686f74
6f73686f7020332e30003842494d
03e90a5072696e7420496e666f00
0000007800000000004800480000
000002f40240ffeeffee03060252
0347052803fc0002000000480048
0000000002d80228000100000064
000000010003030300000001270f
0001000100000000000000000000
0000600800190190000000000000
0000000000000000000000000000
0000000000000000000000003842
494d03ed0a5265736f6c7574696f
6e0000000010008313a3000200 ...
JPEG segments: SOI, APP0, APP13, APP2, DQT, SOF0, DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
Decoded: JFIF 1.2 (APP0), IPTC (APP13), ICC (APP2), 183x512 (SOF0)
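The step from raw hex to named segments can be sketched in a few lines of Python. This is a minimal illustration, not a full JPEG parser: it uses only the first 42 bytes of the hex dump above, knows only the marker codes named on the slide, and does not attempt to read the entropy-coded data that follows SOS. The function names `list_segments` and `jfif_version` are hypothetical.

```python
import struct

# First three lines of the hex dump on the slide (a truncated JPEG file).
HEX_PREFIX = (
    "ffd8ffe000104a46494600010201"
    "008300830000ffed0fb050686f74"
    "6f73686f7020332e30003842494d"
)

# Second byte of each two-byte marker (the first byte is always 0xFF).
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}

def list_segments(data):
    """Return (name, declared_length) for each header segment found in data."""
    segments = []
    pos = 0
    while pos + 2 <= len(data) and data[pos] == 0xFF:
        code = data[pos + 1]
        name = MARKER_NAMES.get(code, f"0x{code:02x}")
        if code == 0xD8:              # SOI carries no length field
            segments.append((name, 0))
            pos += 2
            continue
        if pos + 4 > len(data):       # truncated before the length field
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((name, length))
        pos += 2 + length             # length counts its own two bytes
    return segments

def jfif_version(data):
    """Return (major, minor) if APP0 carries a JFIF header, else None."""
    if data[2:4] == b"\xff\xe0" and data[6:11] == b"JFIF\x00":
        return data[11], data[12]
    return None

data = bytes.fromhex(HEX_PREFIX)
print(list_segments(data))
print(jfif_version(data))
```

Even on this short prefix the walk recovers SOI, APP0 and APP13 and the JFIF version bytes. Without knowledge of the format's structure, the same bytes are only an opaque string of hex digits.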
Access to information
[Figure: the layers between a reader and the information]
Analog book (unmediated access): information, content, symbols, language, HW (paper)
Digital book (technology-mediated access): information, content, bits, formats, SW, HW
Formats are key to digital preservation
[Figure: digital access stack: information, content, bits, formats, SW, HW]
If the format of our content is unsupported by technology, we can’t access the content’s information!
Dependent on fleeting technology
We are dependent on technology to interpret (render, play, etc.) digital content
No technology sticks around – it all ages and disappears
Eventually all digital content in its original format becomes unusable!

Format obsolescence

Kodak PhotoCD
◦ Used by libraries in the 1990s and into the 2000s as a preservation format
◦ Best decoders were from Kodak and are no longer supported
◦ Very few software decoders remaining – soon images in this format will be unusable
◦ Harvard’s Digital Repository Service has 7,243 of these
Two sub-problems

Keep the bits safe

Keep the information usable as technology changes
Safe bits

Infrastructure, policies, practices and professional staff to counter risks
◦ High quality storage
◦ Redundancy (multiple copies, multiple locations)
◦ Media refreshing (replacing)
◦ Security and access restrictions
◦ Content recovery
◦ Integrity monitoring (check for corruption)…
Integrity monitoring

Message digests – practically unique signatures for digital content
◦ Fixed-size bit strings
  ▪ 6326ec82b3200df4a87fc54356d2cb73
◦ Calculated by cryptographic hash functions, e.g. MD5, SHA1, …
Any change to a file results in a changed message digest
Useful for detecting corruption
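A minimal sketch of integrity monitoring with message digests, using Python's standard `hashlib` module. The sample bytes and the `digest` helper are illustrative assumptions. A real repository would read files from storage and compare against digests recorded at ingest.

```python
import hashlib

def digest(data, algorithm="md5"):
    """Return the hex message digest of data using the named hash function."""
    return hashlib.new(algorithm, data).hexdigest()

# Illustrative content standing in for a stored file.
original = b"Challenges of Digital Preservation"

# Simulate silent corruption by flipping a single bit in the first byte.
damaged = bytearray(original)
damaged[0] ^= 0x01
damaged = bytes(damaged)

stored = digest(original)  # digest recorded when the content was ingested

print("stored digest:   ", stored)
print("recomputed (ok): ", digest(original))
print("recomputed (bad):", digest(damaged))
print("corruption detected:", digest(damaged) != stored)
```

The digest has the same length whatever the input size (32 hex characters for MD5, 40 for SHA1), so only a short string needs to be stored per file to check it later.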

Usable information
People have to be able to find it
People must be able to manage it
Document what’s important (description, context, ownership, processing history)
Know what you are preserving (formats)…

A TIFF is a TIFF?
TIFF 4.0
TIFF 5.0
TIFF 6.0
TIFF 6.0 extension YCbCr (Class Y)
TIFF/IT (ISO 12639:2003)
TIFF/EP (ISO 12234-2:2001)
RichTIFF
EXIF 2.0
EXIF 2.1 (JEIDA-49-1998)
EXIF 2.2 (JEITA CP-3451)
GeoTIFF 1.0
TIFF-FX (RFC 2301)
Class F (RFC 2306)
RFC 1314
Canon RAW (.crw, .cr2, .tif)
Nikon RAW (.nef)
DNG (Adobe Digital Negative)

Identifying formats
Techniques: “magic numbers”, full parse
Few tools
◦ Support limited number of formats
◦ Accuracy varies

Some improvements
◦ File Information Tool Set (FITS)
  ▪ fits.google.code
◦ NARA-sponsored research
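The “magic numbers” technique can be sketched as a lookup of well-known leading byte signatures. This is an illustration only, not how FITS or any other tool named above works. The table covers a handful of formats, and the `identify` function is a hypothetical name.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]

def identify(header):
    """Guess a format from the first bytes of a file; 'unknown' if no match."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return "unknown"

print(identify(bytes.fromhex("ffd8ffe000104a464946")))  # start of the JPEG dump
print(identify(b"%PDF-1.4\n"))
print(identify(b"II*\x00\x08\x00\x00\x00"))
print(identify(b"plain text"))
```

A signature match only narrows things down. Many TIFF-based formats share the same leading bytes, which is one reason a full parse is listed as the second technique.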
Usable information
Make sure there’s technology to support the formats! (technology watch)
Preservation strategies
◦ Technology preservation
◦ Creation of viewing software
◦ Emulation & variations:
  ▪ Universal Virtual Machine
  ▪ Universal Virtual Computer
◦ Format normalization
◦ Format migrations…
Key format migration considerations

What can’t be lost in the transformation? “Significant properties”
◦ E.g. color, embedded metadata, resolution, ICC profiles, interaction, attachments, fonts, links
◦ How important is each of these properties? – weighted criteria
To what format? “Preservable” formats
What else must be changed? Ex: Links
How many versions to keep?
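One way to read “weighted criteria” is as a weighted sum over significant properties, sketched below. Everything in it is hypothetical: the weights, the candidate names `format_a` and `format_b`, and the preservation scores are made-up illustrations, not values from any real assessment.

```python
# HYPOTHETICAL weights: relative importance of each significant property.
WEIGHTS = {
    "color": 0.4,
    "resolution": 0.3,
    "embedded_metadata": 0.2,
    "icc_profile": 0.1,
}

# HYPOTHETICAL scores per candidate target format:
# 0.0 = property lost in migration, 1.0 = fully preserved.
CANDIDATES = {
    "format_a": {"color": 1.0, "resolution": 1.0,
                 "embedded_metadata": 0.5, "icc_profile": 1.0},
    "format_b": {"color": 0.8, "resolution": 1.0,
                 "embedded_metadata": 1.0, "icc_profile": 0.0},
}

def weighted_score(scores, weights):
    """Sum each property's score multiplied by its weight."""
    return sum(weights[prop] * scores.get(prop, 0.0) for prop in weights)

def rank(candidates, weights):
    """Return candidate names ordered from best to worst weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(candidates[name], weights),
                  reverse=True)

for name in rank(CANDIDATES, WEIGHTS):
    print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 2))
```

Changing the weights changes the ranking, so deciding how important each property is has to come before choosing the target format.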

Preservation lifecycle – a series of hand-offs
Create or acquire digital content
Ingest into a preservation repository
◦ Continuous cycle of:
  ▪ Monitoring
  ▪ Planning
  ▪ Intervention
◦ Subject to collection management decisions

Transfer to next generation of the repository or to a different repository
Ongoing commitment

Requires a continual, proactive program
◦ You can’t just start and stop
◦ Time frames are MUCH shorter than for preservation of analog material

Requires ongoing investment in infrastructure and staff
Can’t do it alone
Digital preservation activities must be shared across institutions
Even collectively we don’t have adequate resources or understanding

Preservation community
Collaborative organizations (NDSA, IIPC, OPF)
Collaborative projects
Standards and best practices
Shared infrastructure and tools
◦ Formats registry
◦ Repository software
◦ Preservation planning tools
◦ Format tools