13.1 Preservation Metadata

It is recommended that all projects dealing with digital content attempt to follow the guidelines
established in the most recent version of the PREMIS (PREservation Metadata Implementation
Strategies) Data Dictionary for Preservation Metadata (http://www.loc.gov/standards/premis/).
The PREMIS guidelines describe in great detail the preservation metadata that would ideally be
captured. At this point, projects are only asked to meet the few minimum requirements detailed
below.
Table of contents
13.1 PREMIS Data Model Overview
13.2 Minimum Requirements
13.1 PREMIS Data Model Overview
The PREMIS Data Model consists of five primary entities:
• Intellectual Entity – content that is considered a single intellectual unit (may or may not
  be digital)
• Object – digital form(s) of an intellectual entity
• Event – an event that impacts or involves at least one Object. Events are associated
  with or performed by an Agent
• Agent – a person, organization, or software/system associated with Events that occur on
  an Object or Rights on an Object
• Rights – assertions of very basic rights/permissions pertaining to an Object
The PREMIS Data Dictionary details the preservation metadata recommended to be captured
for the last four entities (Objects, Events, Agents, and Rights). The Intellectual
Entity is not covered by PREMIS, as it would be described using the best practices for
descriptive metadata.
It’s also worth noting that PREMIS deals with three types of Objects:
• File – an actual file on an operating system (e.g. a PDF file)
• Bitstream – a series of bytes (1s and 0s) within a File which have meaningful properties
  unto themselves (e.g. the header information within a JPEG2000 image file)
• Representation – a set of files (including structural metadata) which are required to
  render a single intellectual entity (e.g. a webpage consisting of HTML, CSS, and images,
  all necessary to render something useable)
The PREMIS Data Model is described more completely in the Introduction of the PREMIS Data
Dictionary for Preservation Metadata: http://www.loc.gov/standards/premis/
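As an illustration only, the relationships between Objects, Events, and Agents described above can be sketched in Python. The class and field names below (PremisObject, Event, Agent, OBJECT_CATEGORIES) are assumptions made for this sketch and are not defined by PREMIS; Intellectual Entities and Rights are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List

# The three object types PREMIS distinguishes.
OBJECT_CATEGORIES = {"representation", "file", "bitstream"}


@dataclass
class Agent:
    identifier: str
    agent_type: str  # e.g. person, organization, software
    name: str


@dataclass
class PremisObject:
    identifier: str
    category: str  # one of OBJECT_CATEGORIES

    def __post_init__(self):
        if self.category not in OBJECT_CATEGORIES:
            raise ValueError(f"unknown object category: {self.category}")


@dataclass
class Event:
    identifier: str
    event_type: str
    objects: List[PremisObject]
    agents: List[Agent] = field(default_factory=list)

    def __post_init__(self):
        # An Event impacts or involves at least one Object.
        if not self.objects:
            raise ValueError("an Event must involve at least one Object")
```

Keeping the links on the Event (rather than on the Object) mirrors the model's wording: an Event involves Objects and is performed by Agents.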
13.2 Minimum Requirements
The PREMIS list of recommended preservation metadata is extensive. As of July 2008, there are
still no known full implementations of PREMIS. What follows is a list of the minimal metadata
which should be captured for each entity (Object, Event, Agent, Rights).
Please note that although these best practices recommend the minimal preservation metadata
that should be gathered, PREMIS does not specify a metadata schema for implementation. We
recommend storing this metadata in a schema appropriate to the packaging format in use.
For example, if METS is used for packaging, there is an existing PREMIS metadata schema
that can be used as administrative metadata within METS:
http://www.loc.gov/standards/premis/schemas.html
Objects
Minimally, the following preservation metadata should be captured about an Object:
• Object Type (majority of the time will be “file”)
• Identifiers
• Fixity (checksum, etc.)
• Size
• Format of file
• Relationships to other objects (especially in establishing digital provenance)
• Level of Preservation Support requested??
Within the PREMIS data dictionary, this information is expressed as follows. Please note that
object type abbreviations refer to: File (F), Representation (R), and Bitstream (B).
Semantic Unit / Component | Object Type | Note | Examples
objectIdentifier | | |
-objectIdentifierType | R, F, B | The type of identifier used to locate the object within the preservation system in which it is stored. | hdl (Handle)
-objectIdentifierValue | R, F, B | The value of the object’s identifier | 2142/8796
objectCategory | R, F, B | The type of object being described. Controlled Vocab: representation, file, or bitstream | file; representation; bitstream
preservationLevel | R, F | |
-preservationLevelValue | R, F | Level of preservation support attempted for this object. (We need to establish our own controlled vocabulary for these values) | Categories? 1, 2, 3, or 4? full or bit-level?
-preservationLevelDateAssigned | R, F | The date this preservation level was assigned | 2008-03-29
objectCharacteristics | | |
-fixity | F, B | The information necessary to perform occasional fixity checks |
--messageDigestAlgorithm | F, B | Algorithm used to generate the message digest | MD5
--messageDigest | F, B | Value of the message digest | (a checksum value)
-size | F, B | The size (in bytes) of file | 1024
-format | F, B | |
--formatDesignation | F, B | |
---formatName | F, B | The mime type of the file format | application/pdf; image/jp2; text/xml
originalName | R, F | The original filename | 123456.pdf
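The minimal Object metadata listed above can be gathered for a File with the Python standard library alone. The following is a sketch under stated assumptions, not a prescribed implementation: the function name describe_file, the nested dictionary layout, and the fallback MIME type are illustrative choices, and the format is guessed from the file extension rather than verified against the file's contents.

```python
import hashlib
import mimetypes
import os


def describe_file(path, identifier_type, identifier_value, original_name=None):
    """Collect minimal object metadata for one file (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)

    # Extension-based guess only; a format identification tool would be more reliable.
    mime_type, _encoding = mimetypes.guess_type(path)

    return {
        "objectIdentifier": {
            "objectIdentifierType": identifier_type,
            "objectIdentifierValue": identifier_value,
        },
        "objectCategory": "file",
        "objectCharacteristics": {
            "fixity": {
                "messageDigestAlgorithm": "MD5",
                "messageDigest": digest.hexdigest(),
            },
            "size": os.path.getsize(path),  # size in bytes
            "format": {
                "formatDesignation": {
                    # Fallback value is an assumption of this sketch.
                    "formatName": mime_type or "application/octet-stream",
                },
            },
        },
        "originalName": original_name or os.path.basename(path),
    }
```

A call such as describe_file("123456.pdf", "hdl", "2142/8796") would return one record per file; recomputing the digest later and comparing it with the stored messageDigest is the basis of a fixity check.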
Events
Although all events on objects can oftentimes be difficult to track and record, it is
recommended that we attempt to record the following types of events (whenever possible):
• File format changes – this includes both migration to alternative formats (e.g. for
  access), as well as normalization to common formats. This helps us to keep track of the
  provenance of files.
• Ingest into a new digital system/repository – this helps us keep track of the various
  locations of files
• Modifications to files which significantly change the file itself – It is unimportant to
  track fixes to spelling or minor font changes. However, larger changes such as OCR of an
  image-based PDF, removal/addition of pages, or other major structural/content changes
  are important events within the historical provenance of a file.
• Any activities resulting in a new file – Generally, we should attempt to track any
  activities which create a new file. When the creation of new files is outsourced (e.g. for
  large scale digitization), this may be more difficult to track. However, it’s still worth
  tracking the source of the files (even if the source is generically identified as the
  company which created the new files).
Minimally, the following preservation metadata should be captured about an Event on an
Object:
• Event Type
• Event Date / Time (to best of your knowledge)
• Event Detail (human readable notes on the event that occurred)
• References to the Object(s) affected and the Agent(s) that performed the event
Within the PREMIS data dictionary, this information is expressed as follows.
Semantic Unit / Component | Note | Examples
eventIdentifier | |
-eventIdentifierType | A controlled vocabulary representing the Institution or Company that performed the event. This would likely usually be something like “UIUC Library”. | UIUC Library; OCA; etc.
-eventIdentifierValue | An identifier which can be used to reference this event. This should likely be based on the date/time the event occurred, to ensure its uniqueness. | scan-2008-03-23; migrate-2008-04-21
eventType | The type of event described. We need to establish our own Controlled Vocabulary of event types. PREMIS documents some suggested terms. | ingestion; creation; deletion; migration; normalization; validation; (etc.)
eventDateTime | The date/time when the event occurred. Recommended in ISO 8601 format. | 2006-07-16T19:20:30
eventDetail | Detailed notes (human readable / understandable) of the event that occurred | (Description of the event: who, what, why, what software was used, etc.)
linkingAgentIdentifier | Provides information about which agent performed the event |
-linkingAgentType | References the agentIdentifierType of the Agent(s) performing the Event (see the Agent section below!) |
-linkingAgentValue | References the agentIdentifierValue of the Agent(s) performing the Event (see the Agent section below!) | UIUC Library; OCA; etc.
linkingObjectIdentifier | Provides information about which object(s) were affected by the event |
-linkingObjectType | References the objectIdentifierType of the Object(s) affected by the Event (see the Object section above!) | hdl (Handle)
-linkingObjectValue | References the objectIdentifierValue of the Object(s) affected by the Event (see the Object section above!) | 2142/8796
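A minimal Event record along these lines could be assembled as follows. This is a sketch under assumptions: the helper name make_event, its parameters, and the dictionary layout are hypothetical; the key names follow the table above; and the EVENT_TYPES set simply reuses the example terms, since a local controlled vocabulary has not yet been established.

```python
from datetime import datetime

# Placeholder vocabulary taken from the example terms; not an agreed list.
EVENT_TYPES = {"ingestion", "creation", "deletion",
               "migration", "normalization", "validation"}


def make_event(event_type, when, detail, performed_by, agent_ref, object_refs):
    """Build a minimal event record (hypothetical helper).

    agent_ref is an (identifier type, identifier value) pair for the Agent;
    object_refs is a list of such pairs for the affected Objects.
    """
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    if not object_refs:
        raise ValueError("an event must reference at least one object")

    return {
        "eventIdentifier": {
            "eventIdentifierType": performed_by,
            # Basing the value on the date/time helps keep it unique.
            "eventIdentifierValue": f"{event_type}-{when.strftime('%Y-%m-%dT%H%M%S')}",
        },
        "eventType": event_type,
        "eventDateTime": when.isoformat(timespec="seconds"),  # ISO 8601
        "eventDetail": detail,
        "linkingAgentIdentifier": {
            "linkingAgentType": agent_ref[0],
            "linkingAgentValue": agent_ref[1],
        },
        "linkingObjectIdentifier": [
            {"linkingObjectType": obj_type, "linkingObjectValue": obj_value}
            for obj_type, obj_value in object_refs
        ],
    }
```

For example, make_event("migration", datetime(2006, 7, 16, 19, 20, 30), "Converted TIFF to JPEG2000", "UIUC Library", ("UIUC NetID", "tdonohue"), [("hdl", "2142/8796")]) yields an eventDateTime of 2006-07-16T19:20:30.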
Agents
Only Agents which perform actual Events on Objects need to be tracked. Agents may be
organizations, software programs, systems or individual people.
Minimally, the following preservation metadata should be captured about an Agent which
performs an Event:
• Agent Type (person, software, etc.)
• Agent Name
Within the PREMIS data dictionary, this information is expressed as follows.
Semantic Unit / Component | Note | Examples
agentIdentifier | |
-agentIdentifierType | A controlled vocabulary representing the type of an agent identifier. For a person, this may be represented as “UIUC NetID”. | UIUC Library; UIUC NetID; Software Program
-agentIdentifierValue | An identifier which can be used to reference this agent. | tdonohue; LSDWG; Acrobat-Pro-9.0
agentType | The type of agent described. We need to establish our own Controlled Vocabulary of agent types. PREMIS documents some suggested terms. | person; organization; software
agentName | A human readable name for the agent | Tim Donohue; Large Scale Digitization Working Group; Adobe Acrobat 9.0 Pro
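Because an Event refers to its Agent only by identifier type and value, Agent records need to be retrievable by that pair. One possible way to do this is sketched below; the helper names (make_agent, index_agents, resolve_linking_agent) and the AGENT_TYPES set are assumptions for illustration, and the linking key names follow the Events table in this document.

```python
# Placeholder vocabulary taken from the example terms; not an agreed list.
AGENT_TYPES = {"person", "organization", "software"}


def make_agent(identifier_type, identifier_value, agent_type, name):
    """Build a minimal agent record (hypothetical helper)."""
    if agent_type not in AGENT_TYPES:
        raise ValueError(f"unknown agent type: {agent_type}")
    return {
        "agentIdentifier": {
            "agentIdentifierType": identifier_type,
            "agentIdentifierValue": identifier_value,
        },
        "agentType": agent_type,
        "agentName": name,
    }


def index_agents(agents):
    """Map (identifier type, identifier value) pairs to agent records."""
    return {
        (a["agentIdentifier"]["agentIdentifierType"],
         a["agentIdentifier"]["agentIdentifierValue"]): a
        for a in agents
    }


def resolve_linking_agent(linking, index):
    """Return the agent an event's linkingAgentIdentifier points at, or None."""
    key = (linking["linkingAgentType"], linking["linkingAgentValue"])
    return index.get(key)
```

Returning None for an unknown reference, rather than raising, lets a caller report every dangling Agent reference in a batch instead of stopping at the first one.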
Rights
For the purpose of tracking simplistic provenance of digital files, Rights Statements are
unnecessary. In PREMIS, Rights Statements tend to document the permissions of a repository
on objects within it.
No preservation metadata is minimally required for Rights statements. However, if it is easily
captured or available, it is recommended to record known copyright information about
individual objects in the following PREMIS data dictionary units:
• copyrightInformation
  o copyrightStatus – status of the copyright (e.g. copyrighted, publicdomain, unknown)
  o copyrightJurisdiction – jurisdiction of copyright (e.g. us, de)
  o copyrightStatusDeterminationDate – date when this status was determined
  o copyrightNote – any additional notes about copyright information
Again, copyright information is not necessary to record, unless it is already known.