13.1 Preservation Metadata

It is recommended that all projects dealing with digital content attempt to follow the guidelines established in the most recent version of the PREMIS (PREservation Metadata: Implementation Strategies) Data Dictionary for Preservation Metadata (http://www.loc.gov/standards/premis/). The PREMIS guidelines describe in great detail the preservation metadata that is ideal to capture. At this point, it is only recommended that all projects attempt to meet the few minimum requirements detailed below.

Table of contents
13.1 PREMIS Data Model Overview
13.2 Minimum Requirements

13.1 PREMIS Data Model Overview

The PREMIS Data Model consists of five primary entities:

• Intellectual Entity – content that is considered a single intellectual unit (may or may not be digital)
• Objects – digital form(s) of an intellectual entity
• Event – an event that impacts or involves at least one Object. Events are associated with or performed by an Agent
• Agent – a person, organization, or software/system associated with Events that occur on an Object or Rights on an Object
• Rights – assertions of very basic rights/permissions pertaining to an Object

The PREMIS Data Dictionary details the preservation metadata recommended for capture for the last four entities (Objects, Events, Agents, and Rights). The Intellectual Entity is not covered by PREMIS, as it would be described using the best practices for descriptive metadata.

It’s also worth noting that PREMIS deals with three types of Objects:

• File – an actual file on an operating system (e.g. a PDF file)
• Bitstream – a series of bytes (1s and 0s) within a File which have meaningful properties unto themselves (e.g. the header information within a JPEG2000 image file)
• Representation – a set of files (including structural metadata) which are required to render a single intellectual entity (e.g.
a webpage consisting of HTML, CSS, and images, all necessary to render something usable)

The PREMIS Data Model is described more completely in the Introduction of the PREMIS Data Dictionary for Preservation Metadata: http://www.loc.gov/standards/premis/

13.2 Minimum Requirements

The PREMIS list of recommended preservation metadata is extensive. As of July 2008, there are still no known full implementations of PREMIS. What follows is a list of the minimal metadata which should be captured for each entity (Object, Event, Agent, Rights).

Please note that although these best practices recommend the minimal preservation metadata that should be gathered, PREMIS does not specify a metadata schema for implementation. We recommend storing this metadata in an appropriate metadata schema, based on the packaging format in use. For example, if METS is used for packaging, there is an existing PREMIS metadata schema for use as administrative metadata within METS: http://www.loc.gov/standards/premis/schemas.html

Objects

Minimally, the following preservation metadata should be captured about an Object:

• Object Type (the majority of the time this will be “file”)
• Identifiers
• Fixity (checksum, etc.)
• Size
• Format of file
• Relationships to other objects (especially in establishing digital provenance)
• Level of Preservation Support requested??

Within the PREMIS data dictionary, this information is expressed as follows. Please note that object type abbreviations refer to: File (F), Representation (R), and Bitstream (B). Leading hyphens indicate the nesting level of a component within its parent semantic unit.

| Semantic Unit / Component | Object Type | Note | Examples |
| objectIdentifier | | | |
| -objectIdentifierType | R, F, B | The type of identifier used to locate the object within the preservation system in which it is stored. | hdl (Handle) |
| -objectIdentifierValue | R, F, B | The value of the object’s identifier | 2142/8796 |
| objectCategory | R, F, B | The type of object being described. Controlled Vocab: representation, file, or bitstream | file; representation; bitstream |
| preservationLevel | R, F | | |
| -preservationLevelValue | R, F | Level of preservation support attempted for this object. (We need to establish our own controlled vocabulary for these values) | Categories? 1, 2, 3, or 4? full or bit-level? |
| -preservationLevelDateAssigned | R, F | The date this preservation level was assigned | 2008-03-29 |
| objectCharacteristics | | | |
| -fixity | F, B | The information necessary to perform occasional fixity checks | |
| --messageDigestAlgorithm | F, B | Algorithm used to generate the message digest | MD5 |
| --messageDigest | F, B | Value of the message digest | (a checksum value) |
| -size | F, B | The size (in bytes) of file | 1024 |
| -format | F, B | The mime type of the file format | application/pdf; image/jp2; text/xml |
| --formatDesignation | F, B | | |
| ---formatName | F, B | | |
| originalName | R, F | The original filename | 123456.pdf |

Events

Although it can often be difficult to track and record every event on an object, it is recommended that we attempt to record the following types of events (whenever possible):

• File format changes – this includes both migration to alternative formats (e.g. for access), as well as normalization to common formats. This helps us to keep track of the provenance of files.
• Ingest into a new digital system/repository – this helps us keep track of the various locations of files.
• Modifications to files which significantly change the file itself – it is unimportant to track fixes to spelling or minor font changes. However, larger changes such as OCR of an image-based PDF, removal/addition of pages, or other major structural/content changes are important events within the historical provenance of a file.
• Any activities resulting in a new file – generally, we should attempt to track any activities which create a new file. When the creation of new files is outsourced (e.g. for large scale digitization), this may be more difficult to track. However, it’s still worth tracking the source of the files (even if the source is generically identified as the company which created the new files).

Minimally, the following preservation metadata should be captured about an Event on an Object:

• Event Type
• Event Date / Time (to the best of your knowledge)
• Event Detail (human readable notes on the event that occurred)
• References to the Object(s) affected and the Agent(s) that performed the event

Within the PREMIS data dictionary, this information is expressed as follows.

| Semantic Unit / Component | Note | Examples |
| eventIdentifier | | |
| -eventIdentifierType | A controlled vocabulary representing the Institution or Company that performed the event. This would likely usually be something like “UIUC Library”. | UIUC Library; OCA; etc. |
| -eventIdentifierValue | An identifier which can be used to reference this event. This should likely be based on the date/time the event occurred, to ensure its uniqueness. | scan-2008-03-23; migrate-2008-04-21 |
| eventType | The type of event described. We need to establish our own Controlled Vocabulary of event types. PREMIS documents some suggested terms. | ingestion; creation; deletion; migration; normalization; validation (etc.) |
| eventDateTime | The date/time when the event occurred. Recommended in ISO 8601 | 2006-07-16T19:20:30 |
| eventDetail | Detailed notes (human readable / understandable) of the event that occurred | (Description of the event: who, what, why, what software was used, etc.) |
| linkingAgentIdentifier | Provides information about which agent performed the event | |
| -linkingAgentType | References the agentIdentifierType of the Agent(s) performing the Event (see the Agent section below!) | |
| -linkingAgentValue | References the agentIdentifierValue of the Agent(s) performing the Event (see the Agent section below!) | UIUC Library; OCA; etc. |
| linkingObjectIdentifier | Provides information about which object(s) were affected by the event | |
| -linkingObjectType | References the objectIdentifierType of the Object(s) affected by the Event (see the Object section above!) | (a checksum value) |
| -linkingObjectValue | References the objectIdentifierValue of the Object(s) affected by the Event (see the Object section above!) | 1024 |

Agents

Only Agents which perform actual Events on Objects need to be tracked. Agents may be organizations, software programs, systems or individual people.

Minimally, the following preservation metadata should be captured about an Agent which performs an Event:

• Agent Type (person, software, etc.)
• Agent Name

Within the PREMIS data dictionary, this information is expressed as follows.

| Semantic Unit / Component | Note | Examples |
| agentIdentifier | | |
| -agentIdentifierType | A controlled vocabulary representing the type of an agent identifier. For a person, this may be represented as “UIUC NetID”. | UIUC Library; UIUC NetID; Software Program |
| -agentIdentifierValue | An identifier which can be used to reference this agent. | tdonohue; LSDWG; Acrobat-Pro-9.0 |
| agentType | The type of agent described. We need to establish our own Controlled Vocabulary of agent types. PREMIS documents some suggested terms. | person; organization; software |
| agentName | A human readable name for the agent | Tim Donohue; Large Scale Digitization Working Group; Adobe Acrobat 9.0 Pro |

Rights

For the purpose of tracking simple provenance of digital files, Rights Statements are unnecessary. In PREMIS, Rights Statements tend to document the permissions of a repository on objects within it.

No preservation metadata is minimally required for Rights Statements. However, if it is easily captured or available, it is recommended to record known Copyright Information about individual objects in the following PREMIS data dictionary units:

• copyrightInformation
  o copyrightStatus – status of the copyright (e.g.
copyrighted, publicdomain, unknown)
  o copyrightJurisdiction – jurisdiction of copyright (e.g. us, de)
  o copyrightStatusDeterminationDate – date when this status was determined
  o copyrightNote – any additional notes about copyright information

Again, copyright information is not necessary to record, unless it is already known.
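
To make the minimum Object, Event, and Agent requirements from section 13.2 more concrete, the following Python sketch gathers them into plain dictionaries keyed by the semantic unit names used in the tables of this section. It is only an illustration under stated assumptions: the function names (describe_file_object, describe_agent, describe_event) are hypothetical, the dictionary layout is not the official PREMIS XML schema, and the fallback MIME type for unrecognized files is our own choice. MD5 is used only because it is the example algorithm given in the Object table.

```python
import hashlib
import mimetypes
import os
from datetime import datetime


def describe_file_object(path, identifier_type, identifier_value):
    """Collect the minimal Object metadata for a File (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    mime_type, _ = mimetypes.guess_type(path)
    return {
        "objectIdentifier": {
            "objectIdentifierType": identifier_type,
            "objectIdentifierValue": identifier_value,
        },
        "objectCategory": "file",
        "objectCharacteristics": {
            "fixity": {
                "messageDigestAlgorithm": "MD5",
                "messageDigest": digest.hexdigest(),
            },
            "size": os.path.getsize(path),  # size in bytes
            # Assumption: fall back to a generic type when the extension is unknown.
            "format": mime_type or "application/octet-stream",
        },
        "originalName": os.path.basename(path),
    }


def describe_agent(identifier_type, identifier_value, agent_type, name):
    """Collect the minimal Agent metadata (hypothetical helper)."""
    return {
        "agentIdentifier": {
            "agentIdentifierType": identifier_type,
            "agentIdentifierValue": identifier_value,
        },
        "agentType": agent_type,
        "agentName": name,
    }


def describe_event(event_type, institution, detail, agent, obj, when=None):
    """Collect the minimal Event metadata, linked to one Agent and one Object."""
    when = when or datetime.now()
    stamp = when.strftime("%Y-%m-%dT%H:%M:%S")  # ISO 8601, as in the Event table
    return {
        "eventIdentifier": {
            "eventIdentifierType": institution,
            # Based on the event date/time, as the Event table suggests.
            "eventIdentifierValue": f"{event_type}-{stamp}",
        },
        "eventType": event_type,
        "eventDateTime": stamp,
        "eventDetail": detail,
        "linkingAgentIdentifier": {
            "linkingAgentType": agent["agentIdentifier"]["agentIdentifierType"],
            "linkingAgentValue": agent["agentIdentifier"]["agentIdentifierValue"],
        },
        "linkingObjectIdentifier": {
            "linkingObjectType": obj["objectIdentifier"]["objectIdentifierType"],
            "linkingObjectValue": obj["objectIdentifier"]["objectIdentifierValue"],
        },
    }
```

For instance, calling describe_file_object on a scanned PDF with the identifier type "hdl" and value "2142/8796", and then passing the result to describe_event together with an agent built by describe_agent, would yield three linked records that could later be serialized into whichever packaging schema (e.g. PREMIS within METS) a project adopts.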