Jantz Mar6_2002.ppt

Digital Preservation - Outline

Introduction - Definitions, Facts, Challenges

Digital Archiving – A Life Cycle View
– Metadata
– Strategies
– RUL Projects

Trusted Digital Repositories
Digital Libraries, R. Jantz - Feb. 26, 2002
Digital Dark Ages?
As we move into the electronic era of digital objects it is important to know
that there are new barbarians at the gate and that we are moving into an
era where much of what we know today, much of what is coded and
written electronically, will be lost forever. We are, to my mind, living in the
midst of digital Dark Ages; consequently, much as monks of times past, it
falls to librarians and archivists to hold to the tradition which reveres
history and the published heritage of our times. (Kuny, 1998)
Preservation
“The urge to preserve is endemic to our roles as librarians.”
“The patent office, home to nearly 6.5 million patents dating to
1790, is converting to an electronic database and discarding a
significant portion of its paper files after they have been scanned
and digitized.” -Mitchell, A. (2001). Ingenuity’s Blueprints, Into History’s Dustbin. NY
Times. December 30, 2001, p. A1.
A scenario: A truck loaded with hazardous waste is headed toward
a dump site. Will our descendants know where we have buried
the waste? (Bide, et al, 1999)
Digital Preservation – Some Numbers

$20 Trillion loss of information expected over the next 20 years (Lysakowski and Leibowitz, 2000)
– Within 10 years, the total number of electronic records could be doubling every 60 minutes.
– From an economic model, the cost of converting from MS-Office95 to Office97 is estimated at 711,110 work years.

80 Million books in the US are rapidly deteriorating
– Yale University states that 80% of their collection is endangered.

Print material
– All print material (ascii text) published in the world each year could be stored in about 5 terabytes

Images
– Over 80 billion photographs are taken each year which would take 400 petabytes to store.
Numbers – continued
(from: http://www.ccsf.caltech.edu/~roy/dataquan/)





Megabyte – one million bytes
Gigabyte – 1000 megabytes
Terabyte – 1000 gigabytes
10 terabytes: the printed collection of the US Library of
Congress
Petabyte – 1000 terabytes
2 petabytes: all the material in US academic research libraries
Exabyte – 1000 petabytes
5 exabytes: all words ever spoken by human beings
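
The decimal units above make it easy to sanity-check the storage figures quoted on these slides. The sketch below is illustrative only: the constant and function names are invented for this example, and the inputs are simply the numbers cited on the slides (80 billion photographs per year, 400 petabytes).

```python
# Decimal storage units as defined on the slide (1 gigabyte = 1000 megabytes, etc.).
MEGABYTE = 10**6
GIGABYTE = 1000 * MEGABYTE
TERABYTE = 1000 * GIGABYTE
PETABYTE = 1000 * TERABYTE
EXABYTE = 1000 * PETABYTE


def bytes_per_item(total_bytes: int, item_count: int) -> float:
    """Average size of one item, given a total volume and an item count."""
    return total_bytes / item_count


# Figures quoted on the "Some Numbers" slide.
photos_per_year = 80 * 10**9
photo_storage = 400 * PETABYTE

avg_photo_bytes = bytes_per_item(photo_storage, photos_per_year)
print(f"Implied size per photograph: {avg_photo_bytes / MEGABYTE:.0f} MB")
```

Under these figures, each photograph works out to about 5 megabytes.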
Preservation in Digital Libraries
Preservation: “The managerial, financial, and technical issues involved in
preserving library (or archive) materials in all formats - and/or their
information content - so as to maximize their useful life” (Eden, 1997)
Digital preservation is defined as the managed activities
necessary for ensuring:
1. The long-term maintenance of a byte stream and
2. Continued accessibility of the contents through time and changing technology.
Digital Libraries vs. Digital Archives: Archives make a commitment to
long-term preservation of digital information. (Joint Task Force on Digital
Archiving)
Why Would You Digitally Preserve?

Protect original print artifact

Provide access by accurately representing originals

Preserve material that exists in electronic form only

Enhance research by “improving” originals
– High resolution imagery to study details
– Searchable text
The Challenges of Digital Preservation
Lack of standards (or too many standards)
Lack of documentation on production and use
Cost and rapid obsolescence of technology
Impermanence of the medium
Mutability of the content (easily changed – legal issues)
Version control
Need to guarantee integrity of digital information
Migration of information (driven by external factors)
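
One common way to address the integrity and mutability challenges listed above is a fixity check: record a cryptographic digest when an object is archived, then recompute it later and compare. The sketch below assumes SHA-256 and uses hypothetical helper names (`sha256_of`, `verify_fixity`); the slides do not prescribe a particular algorithm or tool.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """True if the file's current digest matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest


# Demonstration on a throwaway file with invented content.
with tempfile.TemporaryDirectory() as tmp:
    item = Path(tmp) / "report.txt"
    item.write_bytes(b"archived byte stream")
    recorded = sha256_of(item)  # would be stored alongside the object at ingest

    unchanged_ok = verify_fixity(item, recorded)
    item.write_bytes(b"archived byte stream, silently edited")
    changed_ok = verify_fixity(item, recorded)

print(unchanged_ok, changed_ok)
```

A digest only detects change; it does not say which copy is correct, so it is normally paired with redundant copies.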
What to Archive – A Checklist
Historical and research value
Aesthetic and artistic merit
Uniqueness of an item
Subject content – relevant to Institution
Access – Restrictions and inventory
Condition
Frequency of use – frequency of change
Ownership
Redundancy – concern for loss or modification
Length of preservation
Is any other institution archiving the material?
Candidates for Preservation

Material created (not digitized) in digital format
– Reference databases (online catalogs, subject specific indexes, etc)
– Electronic journals
– Digital maps
– Data
– Websites (e.g. research guides, web-based databases, documents)

Government information
– Census data, international statistics (Do we rely on the government to preserve this material?)
– Consortiums such as Inter-university Consortium for Political and Social Research (ICPSR) have a role.

Print material/manuscripts that are digitized for access and/or preservation:
– Original documents not retained (e.g. as in the NJ Environmental Digital Library)
– Original document retained (as in Special Collections)

Electronic (analog) media that is digitized (audio, video tapes)
Digital Archiving – A Life Cycle View
1. Creation
2. Acquisition and Collection Development
3. Identification and Cataloging
4. Storage
5. Preservation (incl. Metadata)
6. Access
from (Hodge, 2000)
Digital Preservation Strategies

Migration: transferring digital materials from one media or format to
another because of obsolescence, failure in media, software updates,
standards, etc.

Emulation: refers to the process of mimicking, in software, a piece of
hardware or software so that other processes think the original
equipment/function is still available in its original form.
(http://www.nla.gov.au/padi/topics/17.html)

Encapsulation: A technique of grouping together a digital object and
anything else necessary to provide access to that object. This
technique aims to overcome the problems of the technological
obsolescence of file formats because the details of how to interpret the
digital bits in the object can be part of the encapsulated information.
(http://www.nla.gov.au/padi/topics/17.html)
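
Encapsulation as defined above can be pictured as a single container that holds both the object and a description of how to read it. The following is a minimal sketch under stated assumptions: the container layout (an `object/` folder plus `interpretation.json` inside a zip file), the function names, and the sample data are all invented for illustration and do not correspond to any published packaging standard.

```python
import io
import json
import zipfile


def encapsulate(name: str, payload: bytes, interpretation: dict) -> bytes:
    """Bundle an object with the information needed to interpret it."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr(f"object/{name}", payload)
        package.writestr("interpretation.json", json.dumps(interpretation, indent=2))
    return buffer.getvalue()


def open_package(package_bytes: bytes) -> tuple[dict, dict[str, bytes]]:
    """Return (interpretation, {object name: bytes}) from a package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        interpretation = json.loads(package.read("interpretation.json"))
        objects = {
            member[len("object/"):]: package.read(member)
            for member in package.namelist()
            if member.startswith("object/")
        }
    return interpretation, objects


# Invented sample data: a tiny delimited text file and notes on how to read it.
package_bytes = encapsulate(
    "poll_1971.txt",
    "Q1;yes;412\nQ1;no;388\n".encode("ascii"),
    {
        "encoding": "ASCII",
        "structure": "one record per line; fields separated by ';'",
        "fields": ["question", "answer", "count"],
    },
)
info, objects = open_package(package_bytes)
print(info["encoding"], sorted(objects))
```

The point of the design is that the interpretation travels with the bits, so a future reader does not depend on outside documentation surviving.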
Migration of Digital Information
Reasons for Migration:

Medium refreshing (e.g. rewrite a CD)

Medium conversion (diskette to CD)

Format conversion (ASCII to PDF)

Version upgrade (Office97 to Office2000)

Migration of technical environment (Windows 98 to Windows NT)
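
Of the cases above, medium refreshing and medium conversion are the simplest to automate, because the byte stream itself should not change. The sketch below shows one possible approach: copy the file to the new location, then confirm the copy is byte-identical. The function name `refresh_copy` and the directory names are hypothetical. Format conversion and version upgrades alter the bytes, so they need a different kind of check.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def refresh_copy(source: Path, target_dir: Path) -> Path:
    """Copy a file to a new location and confirm the copy is byte-identical."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copies file data and timestamps
    if not filecmp.cmp(source, target, shallow=False):  # full byte comparison
        target.unlink()
        raise IOError(f"refresh of {source.name} produced a different byte stream")
    return target


# Two temporary directories stand in for the old and new storage media.
with tempfile.TemporaryDirectory() as tmp:
    old_medium = Path(tmp) / "old_medium"
    new_medium = Path(tmp) / "new_medium"
    old_medium.mkdir()
    original = old_medium / "memo.txt"
    original.write_text("contents to carry forward", encoding="ascii")

    refreshed = refresh_copy(original, new_medium)
    refreshed_text = refreshed.read_text(encoding="ascii")

print(refreshed_text)
```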
The Migration Process

Error Prone

Labor intensive and expensive

Governed by external factors

The only approach that works for now
RUL Projects – A Sampling

Medieval Early Modern Data Bank

Eagleton Public Opinion Polls

The Augustine Collection

REALITI – A Digital Preservation Framework
Medieval Early Modern Data Bank - MEMDB
Characteristics:
At: http://www.scc.rutgers.edu/memdb
Content: commodity prices in the medieval period
Access: public domain
Compiler: Co-directors of MEMDB
Owner: RUL?
Archiver: (who should archive?)
Type: Database on the web
Format: html, Active Server Pages, MS-Access
Metadata requirements: numeric data
Questions: What is the primary document? How long should it be preserved?
Extent of “document”? Owner? Preserve look & feel?
Eagleton Public Opinion Polls
Characteristics:
At: http://www.scc.rutgers.edu/eagleton_tst
Content: New Jersey public opinion (1970 - )
Access: public domain
Compiler: Eagleton Institute
Owner: Eagleton/Star Ledger
Archiver: RUL/Scholarly Communication Center
Type: database on the Web
Format: html, MS-Access, portable spss files
Metadata requirements: Questionnaires & numeric data
Questions: Preserve “look & feel”, spss (proprietary software)
The Augustine Collection
Characteristics:
At: http://www.scc.rutgers.edu/augustine
Content: Photographs from 19th Century New Jersey Shore
Access: public domain
Compiler: William F. Augustine
Owner: RUL Special Collections
Archiver: RUL Special Collections
Type: image archive
Format: html, jpeg
Metadata requirements: original artifacts
Questions: image format, preserve digital archive, individual items/collection
REALITI – A Digital Preservation Framework
(Rutgers Electronic Access to Library Information thru Technology Integration)
Characteristics:
At: http://www.scc.rutgers.edu/realiti
Content: Civil War period in New Jersey
Access: public domain
Compiler: RUL Special Collections
Owner: RUL
Archiver: RUL Special Collections/SCC
Type: Images on the Web
Format: html, ColdFusion, MS-Access, PDF, djvu, tiff
Metadata: Preservation, multiple formats
Questions: formats, compression, metadata, original artifact
Preservation Metadata for Digital Collections*
Collection Level
1. Persistent identifier:
2. Date of creation:
3. Structural type: (e.g. ascii text, jpeg images, etc)
4. Technical infrastructure: files, databases, html, etc.
5. File description
6. System requirements:
7. Installation requirements:
8. Storage information:
9. Access inhibitors:
10. Access facilitators:
11. Preservation action permission:
12. Validation: (information about validation mechanism)
13. Relationships (to other objects):
14. Quirks: (any characteristic that may cause loss in functionality)
15. Archiving decision (work):
16. Decision reason (work):
17. Institution responsible for archiving decision:
18. Archiving decision (manifestation):
19. Decision reason (manifestation):
20. Institution responsible for archiving decision (manifestation)
21. Intention type
22. Institution with preservation responsibility
23. Process
24. Record creator
25. Other
* (from National Library of Australia: http://www.nla.gov.au/preserve/pmeta.html )
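
To show how a collection-level record of this kind might be held in software, here is a minimal sketch covering a subset of the elements above. The class name, the field names, and the sample values are this example's own inventions; they are not identifiers or data defined by the National Library of Australia.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class CollectionPreservationRecord:
    """A subset of the collection-level elements, with illustrative field names."""
    persistent_identifier: str
    date_of_creation: str
    structural_type: str  # e.g. "jpeg images"
    technical_infrastructure: list[str] = field(default_factory=list)
    system_requirements: Optional[str] = None
    access_inhibitors: list[str] = field(default_factory=list)
    quirks: list[str] = field(default_factory=list)
    institution_with_preservation_responsibility: Optional[str] = None

    def missing_elements(self) -> list[str]:
        """Names of elements that have not been filled in yet."""
        return [name for name, value in asdict(self).items() if value in (None, "", [])]


# Placeholder values, not real catalogue data.
record = CollectionPreservationRecord(
    persistent_identifier="example-id-0001",
    date_of_creation="2002-02-26",
    structural_type="jpeg images",
    technical_infrastructure=["files", "html"],
)
print(record.missing_elements())
```

Listing the empty elements gives a quick view of which preservation decisions are still undocumented for a collection.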
Trusted Digital Repositories
(http://www.rlg.org/pr/pr2001-attributes.html)
A Proposed Framework for a Trusted Archival Agent:

Administrative – adherence to agreed upon standards

Organizational – commitment to management on behalf of depositors

Financial – adherence to good business practices

Technological – infrastructure in place with upgrade policies

Security – policies for security, auditability, and backup

Procedural – Repository practices will be in place and documented.
Possible Organizational Models
(Who might be a digital repository?)
Originators (e.g. individual researchers)
Publishers (What happens when the publisher goes out of business?)
Libraries, museums, and other conservator institutions
National libraries and archives
Cooperative service agencies (e.g. OCLC, RLG, ICPSR for social science research)
Segmented market providers (e.g. Bell & Howell for preserving dissertation literature and Early English Books)
Private storage providers
Computer centers
Scholarly associations (e.g. American Institute of Physics)
Indexing and abstracting services
Certified digital archives.
Institutional Efforts

OCLC and Web Document Digital Archive (WDDA) Project
– Tools for libraries and archives to preserve and maintain access to digital content
– At: http://www.oclc.org/digitalpreservation

RLG Cultural Materials
– Cultural Materials is being developed through members to set the conditions for contributing and distributing their digital surrogates of valuable collections.
– The goal is a growing, significant, online resource and service solution.
– At: http://www.rlg.org/culturalres

LOCKSS – A permanent web publishing and access system
– Addresses problem of material no longer available from the publisher
– Modeled on distributed print libraries. Reich, et al. (2001). D-Lib Magazine, 7, (6).

OAIS – Open Archival Information System Reference Model
– Requirements for any system responsible for preserving any type of information over a long period.
– At: http://ssdoo.gsfc.nasa.gov/nost/isoas
Digital Preservation – Concluding Thoughts




Librarians and archivists are a key to the solution.
“A major academic scandal will have to happen first . . .” in order to focus attention and resources. (Graham, 2000).
A combination of solutions will be employed including migration and emulation.
Digital “archaeology” will be used to recover lost data.
Preservation Resources on the Web
Institutional Issues:
– ARL Preservation Program (http://www.arl.org/preserv/index.html)
– Digital Preservation Needs and Requirements in RLG Member Institutions (http://www.rlg.org/preserv/digpres.html)
– RLG DigiNews (http://www.rlg.org/preserv/diginews/)
Technical Information/Papers:
– Avoiding Technological Quicksand (http://www.clir.org/pubs/reports/rothenberg/contents.html)
– PADI - Preserving Access to Digital Information - from the National Library of Australia (http://www.nla.gov.au/padi/)
– Background Papers and Technical Information - from LOC American Memory site (http://memory.loc.gov/ammem/ftpfiles.html)
– Preservation of electronic information - a bibliography (http://homes.ukoln.ac.uk/~lismd/preservation.html)
– Digital Imaging Tutorial (http://www.library.cornell.edu/preservation/tutorial/)
More Information on the Web
Technical Information/Papers (continued):
– CLIR Publications (http://www.clir.org/pubs/reports/reports.html)
– Kuny, T. (1998/May). The digital dark ages? Challenges in the preservation of electronic information. International Preservation News, (17). At: http://www.ifla.org/VI/4/news/1798.htm#2
– Hodge, G. M. (2000). Best practices for digital archiving: An information life cycle approach. D-Lib Magazine, 6, (1). Available at: http://www.dlib.org/dlib/january00/01hodge.html
Handbooks:
– Hunter, G. S. (2000). Preserving Digital Information: A How-To-Do-It Manual. New York: Neal-Schuman Publishers
– Sitts, M. K. (2000). Handbook for Digital Projects: A Management Tool for Preservation and Access. Andover, Massachusetts: Northeast Document Conservation Center
References
Bide, M., Potter, E., & Watkinson, A. (1999). Digital Preservation: an introduction to the standards issues surrounding the deposit of non-print publications. At: www.bic.org.uk/digpres.doc
Graham, P. (2000). RLG and Archiving at the heart of the research library mission. RLG News, Winter 2000, (50), pp. 12–13.
Graham, P. (1998/February). Digital strategies for the Rutgers University Libraries: a white paper draft. DRAFT 4.
Hedstrom, M. & Montgomery, S. (1998). Digital Preservation Needs and Requirements in RLG Member Institutions: A Study Commissioned by the Research Libraries Group. Available at: http://www.rlg.org/preserv/digpres.html
Hodge, G. (2000). Best practices for digital archiving: An information life cycle approach. D-Lib Magazine, 6, (1). Available at: http://www.dlib.org/dlib/january00/01hodge.html
Lysakowski, R. & Leibowitz, Z. (2000). Looming information age crisis expected to cause trillion-dollar losses over the next 20 years: Titanic 2020 – a call to action. Available at: http://www.censa.org
Rothenberg, J. (1998/January). Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Available at: http://www.clir.org/pubs/reports/rothenberg/contents.html
Migration & Complexity of the Technical Environment
Formats (extensions): Pdf, doc, wpd, txt, gif, jpeg, tiff, eoo, por, ebx, mdb
Client
– Application Software: Adobe, Netscape, Word, SPSS
– Network: TCP/IP
– Operating System: W95/98/2000
– Hardware: PCs, disks, CDs
Server
– Application Software: Servers: web, file, email, etc
– Network: TCP/IP
– Operating System: Unix/NT
– Hardware: Server, storage, Network
Preservation in Digital Libraries
Preservation: “The managerial, financial, and technical issues involved in
preserving library (or archive) materials in all formats - and/or their
information content - so as to maximize their useful life” (Eden, 1997)
Digital preservation: The term refers exclusively to the preservation
(whatever exactly that entails) of material which is available [solely?] in
electronic form (Bide, 1999).
And the digital version is considered to be the primary archival item.
(Hedstrom, 1998)
Digital Libraries vs. Digital Archives: Archives make a commitment to
long-term preservation of digital information. (Joint Task Force on Digital
Archiving)
Digital Archiving – Getting Started
Form an archiving working group
Prepare a preliminary policy statement
Trial the policy statement with several small, existing projects
Examine what others are doing and bring in best ideas.
Collaborate with others who are interested in digital preservation.
Initiate forums on digital archiving; invite colleagues, students, researchers, etc.
Submit a recommendation for a digital archiving program and next steps.
Trusted Digital Repositories
(http://www.rlg.org/pr/pr2001-attributes.html)
A Proposed Definition (from RLG document):

Technology Infrastructure
– Auditability, security, and communication
– Backup policies incl. avoiding, detecting and restoring corrupted data

Organization
– Certification
– Compliance
– Reputation and performance
– Agreements between creators and providers
– Open sharing of what is being preserved and for whom
– Balanced risk, benefit, and cost