What I Learned This Summer

“What I Learned This Summer”:
A Week at SAA’s First Electronic
Records Summer Camp
Daniel Linke
University Archivist and Curator of Public Policy Papers
December 14, 2007
Geisel Library at UCSD (Photo by Sara Muth)
University of California, San Diego
August 6-10, 2007
Yes, that Geisel
(Photo by Sara Muth)
Eleanor Roosevelt Campus
(Photo by Sara Muth)
Our accommodations in the Asante dormitory
(Photo by Sara Muth)
• My suitemates: Peter Johnson, Eric Paquette, and Dylan McDonald
(Photo courtesy of Eric Paquette)
27 attendees from a variety of institutions
(government, educational, and private
repositories):
• UCSD, UC-Irvine, Harvard B. School, U.
New Mexico, UT:Arlington, Occidental
College, UWI:Madison
• AZ, CA, NC, and WA State Archives
• CIGNA, National Fire Protection
Association, Ford, History Associates
• Sacramento Archives and Museum
• Marist Brothers of Canada
Terrace of the college commons where we took our meals
(Photos by Sara Muth)
Fellow “campers” : Police Explorers Club
(Photo by Sara Muth)
Our classroom was within the SDSC
(Photo by Eric Paquette)
Our classroom
(Photo by Chien-Yi Hou)
Some instructors standing at the back
(Photo by Chien-Yi Hou)
SAA Summer School Instructors
• Mark Conrad (NARA): Preservation principles
• Mike Smorul (U Md): Preservation services
• Reagan Moore (SDSC): Data grids
• Arcot (Raja) Rajasekar (SDSC): Advanced data grids
• Richard Marciano (SDSC): Preservation applications
• Chien-Yi Hou (SDSC): Preservation applications
What the week consisted of (in format)
(Photo by Chien-Yi Hou)
What the week consisted of in subjects covered
• Monday
– Electronic Records 101 (Conrad)
– Components of an Electronic Records Program (Conrad)
– Infrastructure Independence (Moore)
– mySRB Tutorial (Moore)
• Tuesday
– Appraisal and Disposition (Conrad, Marciano, Chien-Yi)
– Accessioning (Smorul, Marciano, Conrad)
• Wednesday
– Arrangement (Marciano, Conrad, Moore)
– Description (Marciano, Rajasekar, Chien-Yi, Moore)
• Thursday
– Preservation (Moore, Smorul, Chien-Yi)
– Access (Moore, Marciano)
• Friday
– Scalability (Moore, Marciano)
– Getting started (Conrad, Moore)
What are Electronic Records?
• Easy to Define
– Any Record that Can Only be Accessed With
a Computer
• Hard to Define
– Many Records Don’t Have an Analog
Equivalent
– Often Difficult to Say Where the “Boundaries”
of a Record Are
Where Do They Come From?
• Types of applications that can create
electronic records
– Word processing
– Databases
– Spreadsheets
– Geographic Information Systems
– E-mail
– Any Computer Application Could Potentially be
used to Create Electronic Records
Unique Qualities: Faster than Rabbits
• They Multiply!
• PERMANENT Federal Electronic Records
– 1 to 5% of the Total Produced
– Next 15 Years – 350 Petabytes Produced
(Peta = 1000 TB)
– Beyond the Current State of the Art
• Archivists can Identify the Wheat and Chaff
– Resource Allocators are Taking Notice
Unique Qualities: Handle With Care
• They are Fragile!
– Easily Deleted
– Keeping the Contextual Information Linked to
the Data is Difficult
• Without this it is difficult to assert you have
authentic records
Unique Qualities: Manipulation
• The Good: Organized or Used in Multiple Ways
– Records can be more easily used.
• Records that would be difficult to use in paper
form can be used quite easily in electronic form.
• The Not So Good:
– Records can be easily changed.
Unique Qualities: Native Habitat vs. Zoo
• Original Applications
– Run Out of Room
– Go Belly Up
• Moving the Records Out of Their Native
Habitat can be Challenging
– Where is the Boundary Between the Records
and the Application?
– How do You Maintain Essential Characteristics
in a Zoo (aka Preservation Environment)?
– The Formats Become Obsolete, Too!
COMPONENTS OF AN ELECTRONIC RECORDS
PROGRAM
1. Policies and Mandates
2. Technical Infrastructure
3. Social Infrastructure
Technical Infrastructure
• Challenge: there are NO proven methods
for the long-term retention of E/R in many
formats
– Ongoing Empirical Research: but theory does
not Make it So!
Storage Resource Broker (SRB)
Infrastructure Independence
(Diagram labels: External World, Evolving Technology, Preservation Environment, Records)
Preservation environment middleware insulates
records from changes in the external world
Infrastructure Independence
• Use data grids to preserve records
independently of the choice of technology
• Management of archives properties
• Map technology components to preservation
principles
– Capabilities that support preservation requirements
• Construct preservation environment from
components
– Archival engineering perspective
• Use infrastructure independence to enable use
of new technology
– View that new technology is an opportunity instead of
a challenge
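One way to picture infrastructure independence is as a thin middleware layer that records pass through, so the storage technology underneath can be swapped without callers noticing. The sketch below is only an illustration of that idea, not how SRB is built; the class and method names (StorageBackend, InMemoryBackend, PreservationEnvironment, migrate) are hypothetical.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical interface every storage technology must satisfy."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryBackend(StorageBackend):
    """Stand-in for one concrete storage technology."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class PreservationEnvironment:
    """Middleware: callers use logical names, never the backend itself."""

    def __init__(self, backend: StorageBackend) -> None:
        self._backend = backend
        self._names = set()

    def store(self, logical_name: str, data: bytes) -> None:
        self._backend.put(logical_name, data)
        self._names.add(logical_name)

    def retrieve(self, logical_name: str) -> bytes:
        return self._backend.get(logical_name)

    def migrate(self, new_backend: StorageBackend) -> None:
        """Copy every record to a new technology, then switch over."""
        for name in self._names:
            new_backend.put(name, self._backend.get(name))
        self._backend = new_backend


env = PreservationEnvironment(InMemoryBackend())
env.store("minutes-1998.txt", b"board minutes")
env.migrate(InMemoryBackend())  # the "new technology" arrives
print(env.retrieve("minutes-1998.txt"))
```

The caller's code is the same before and after migrate, which is the sense in which new technology becomes an opportunity rather than a challenge.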
Preservation Standards
• Architectural Model
– OAIS, Reference Model for an Open Archival Information System
• Representation information for each record
• Submission / Archival / Dissemination Information Package (SIP / AIP / DIP)
– Data grid - Storage Resource Broker (SRB), integrated Rule Oriented Data System
(iRODS)
– Digital Library - DSpace, Fedora
• Metadata
– Dublin core
– LCDRG, NARA Life Cycle Data Requirements Guide
– PREMIS, Preservation Metadata Implementation Strategies
• Metadata organization
– MPEG-21, ISO/IEC TR 21000-1: MPEG-21 Multimedia Framework
– METS, Metadata Encoding and Transmission Standard
– OAIS, Reference Model for an Open Archival Information System
• Submission / Harvesting
– Producer Archive Interface (NASA)
– OAI-PMH, Open Archives Initiative - Protocol for Metadata Harvesting
• Data format
– pdf, xml, (330 formats retrievable on web crawls)
• Assessment criteria
– RLG/NARA TRAC - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf
Using a Data Grid – in Abstract
Data Grid
•User asks for data from the data grid
•The data is found and returned
•Where & how details are hidden
Using a Data Grid - Details
(Diagram: two Storage Resource Broker Servers, ux-brk14 and ux-brk12, and a Metadata Catalog held in an Oracle DB)
• User asks for data
• Data request goes to SRB Server
• Server looks up information in catalog
• Catalog tells which SRB server has data
• 1st server asks 2nd for data
• The data is found and returned
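The request flow on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not SRB's actual API: MetadataCatalog, BrokerServer, ingest, and fetch are hypothetical names, and the two server names are borrowed from the slide's diagram.

```python
class MetadataCatalog:
    """Maps a logical record name to the server holding it (hypothetical)."""

    def __init__(self):
        self._locations = {}

    def register(self, logical_name, server_name):
        self._locations[logical_name] = server_name

    def locate(self, logical_name):
        return self._locations[logical_name]


class BrokerServer:
    """Toy broker: answers locally or forwards to the server that has the data."""

    def __init__(self, name, catalog, peers):
        self.name = name
        self.catalog = catalog
        self.peers = peers          # shared dict: server name -> BrokerServer
        self._local = {}
        peers[name] = self

    def ingest(self, logical_name, data):
        self._local[logical_name] = data
        self.catalog.register(logical_name, self.name)

    def fetch(self, logical_name):
        # Server looks up the record in the catalog.
        holder = self.catalog.locate(logical_name)
        if holder == self.name:
            return self._local[logical_name]
        # 1st server asks 2nd for the data on the user's behalf.
        return self.peers[holder].fetch(logical_name)


catalog = MetadataCatalog()
peers = {}
first = BrokerServer("ux-brk14", catalog, peers)
second = BrokerServer("ux-brk12", catalog, peers)
second.ingest("reports/annual-2006.pdf", b"%PDF...")

# The user only ever talks to one server and never learns where the data lives.
print(first.fetch("reports/annual-2006.pdf"))
```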
For more details, see:
Moore, Reagan, “Building Preservation
Environments with Data Grid
Technology”, American Archivist, vol. 69,
no. 1, pp. 139-158, July 2006
Appraisal of ER: Get There Early
• Records Need to be Appraised:
– Early in Their Lifecycle
• Fragile
• Ephemeral
– In Their Native Habitat
• Functionality
Technical Appraisal
• For Permanent Records Have to Conduct
Technical Appraisal
– Feasibility of Preserving the Records
– Identify all of the Digital Objects
– Essential Characteristics
• At Scale!
Bootcamp continued…
Appraise this !@#$
Disposition
Arrangement
In Action…
In Action…
Tapping into Archival Knowledge
Electronic Records "Summer Camp"
The Website
The Website, cont’d
Formulating Appraisal Rules
• Retrieve root webpage ‘http://water.usgs.gov/lookup/getgislist’
• For each entry:
– Create a “matching entry” collection on the SRB
– Add ‘entry description’ metadata to that collection
– Create “Description” subcollection
• Load web page
• Load all “.gif” | “.jpg” | “.jpeg” files
• Load all “.doc”
• Load metadata file
– Create “ArcINFO” subcollection
• Load all “.e00” | “.clr” | “.asc” | “.nit” | “.dlg” | “.txt” files
– Create “Shape” subcollection
• Load all “.shp” files
– Create “SDTS” subcollection
• Load all “.sdts” files
– Create “Others” subcollection
• Load “.tfw” | “.rdb” | “.clr” | “.asc” | “.prj” files
– DECOMPRESS & LOAD “.zip” | “.gz” | “.tgz” | “.tar” | “.tar.gz” files
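The extension-based part of these rules can be sketched as a small lookup. This is an illustration only: the function name classify and the table layout are hypothetical, the web page and metadata file loaded into “Description” are not chosen by extension and so are left out, and because “.clr” and “.asc” appear under both “ArcINFO” and “Others” the sketch assumes the first matching rule wins. The slide does not say what happens to unmatched files, so the sketch returns None for them.

```python
# Subcollection rules in the order they appear on the slide.
RULES = [
    ("Description", (".gif", ".jpg", ".jpeg", ".doc")),
    ("ArcINFO", (".e00", ".clr", ".asc", ".nit", ".dlg", ".txt")),
    ("Shape", (".shp",)),
    ("SDTS", (".sdts",)),
    ("Others", (".tfw", ".rdb", ".clr", ".asc", ".prj")),
]

# Compressed or bundled files are unpacked before loading.
ARCHIVE_SUFFIXES = (".tar.gz", ".zip", ".gz", ".tgz", ".tar")


def classify(filename):
    """Return the target subcollection, "DECOMPRESS", or None if no rule fits."""
    lower = filename.lower()
    if lower.endswith(ARCHIVE_SUFFIXES):
        return "DECOMPRESS"
    for subcollection, suffixes in RULES:
        # Assumption: first matching rule wins for overlapping extensions.
        if lower.endswith(suffixes):
            return subcollection
    return None


for name in ["basin.shp", "soils.e00", "legend.gif", "layers.tar.gz", "grid.prj"]:
    print(name, "->", classify(name))
```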
E-FOIA Document Collections: Dept. of State
National Archives and Records Administration
Transcontinental Persistent Archive Prototype
Federation of Five Independent Data Grids, each with its own MCAT:
• NARA I
• NARA II
• Georgia Tech
• U Md
• SDSC
Extensible Environment, can federate with additional research and
education sites. Each data grid uses different vendor products.
ACE – Basic Methodology
• Three-tiered Cryptographic Information:
– Integrity Token (IT)
• 1 IT/object
• ~1KB
– Cryptographic Summary Information (CSI); IT to CSI is k:1
• 1 CSI/time window
• 1 CSI / (n) objects
• ~100MB/year
– Witness; CSI to Witness is l:1
• 1 Witness/week
• ~2-3KB/year
• Each tier is periodically audited separately
according to policies set by managers.
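The tiering idea, where many per-object values roll up into one summary and many summaries roll up into one witness, can be sketched with ordinary hashing. This is a simplified illustration and not ACE's actual construction, which the slide does not spell out; SHA-256 and the function names (integrity_token, summarize, audit_object) are assumptions made for the example.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def integrity_token(obj: bytes) -> str:
    """Tier 1: one token per object."""
    return digest(obj)


def summarize(values) -> str:
    """Collapse many lower-tier values into one higher-tier value."""
    return digest("".join(sorted(values)).encode("ascii"))


def audit_object(obj: bytes, token: str) -> bool:
    """Re-derive the token and compare it with the stored one."""
    return integrity_token(obj) == token


objects = [b"record one", b"record two", b"record three"]
tokens = [integrity_token(o) for o in objects]   # one per object
csi = summarize(tokens)                          # one per time window
witness = summarize([csi])                       # one per week (a single window here)

print(audit_object(objects[0], tokens[0]))       # unchanged object passes
print(audit_object(b"record 0ne", tokens[0]))    # altered object fails
```

Because each tier is derived from the one below it, each can be re-checked on its own schedule.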
End of the day
(Photos by Sara Muth)
Club Asante
Photos by Sara Muth (top) and Eric Paquette (right)
Commemorative Corkscrew
(Photo by Gary Spurr)
Acknowledgments
Slides with text are from the course
instructors’ PowerPoint presentations:
Conrad et al.
Photos as credited.
(Photo by Eric Paquette)