Using PREMIS in monograph digitisation at the National Library of Finland

PREMIS Implementation Fair
7th Oct 2009
Karo Salminen
Jukka Kervinen
Contents
• Background
– METS at NLF
– Digitisation at NLF
• Requirements for new METS profile
• METS package contents
• PREMIS in METS sections
– Representations
– Events
– Agents
– Rights
• Open issues
METS at NLF
• First used in 2005 for digitised newspapers and serials
• Currently 2 million pages in METS containers
• Original METS profile from METAe project, now adopted by CCS
• Need for improved preservation metadata – PREMIS for recording
digitisation events
• National Digital Library Initiative
– Funded by the Ministry of Education
– Archives, Museums and Libraries
– Aims at a national long-term preservation solution (Kansallinen
digitaalisen aineiston pitkäaikaissäilytysjärjestelmä, "national
long-term preservation system for digital material")
Digitisation at NLF
• Revising the digitisation workflow in 2009
– Integrated tools to cover the whole workflow and
record events
– Moving towards mass digitisation
• Not only scanning
– OCR
– Structural analysis and markup
• Articles, chapters, illustrations
• Level depends on available funding
Requirements for new METS profile
• Generic profile to capture
– events and agents related to digitisation (provenance
information), and
– technical metadata
• Will serve as SIP for digital preservation
system
– No selected system yet – shooting at a moving target
• Intention to utilise existing schemas and best
practices as much as possible
METS package contents
• METS with MARC, MODS, PREMIS and MIX
• For each page:
– master image in JPEG2000 (lossless)
– access image in JPEG
– thumbnail
– OCRd text in ALTO XML
• Single PDF with hidden text layer
• Whole SIP compressed to a zip file
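The package contents listed above can be sketched as a zip layout. This is only an illustration: the directory names, file names, the `mets.xml` and `monograph.pdf` names, and the thumbnail format are all assumptions, since the slides do not state how the SIP is organised internally.

```python
import io
import zipfile


def sip_member_names(page_count):
    """Return archive member names for one monograph (hypothetical layout)."""
    # One METS document and one PDF with hidden text layer per SIP.
    names = ["mets.xml", "monograph.pdf"]
    for n in range(1, page_count + 1):
        stem = f"{n:04d}"
        names += [
            f"master/{stem}.jp2",  # lossless JPEG2000 master image
            f"access/{stem}.jpg",  # JPEG access image
            f"thumb/{stem}.jpg",   # thumbnail (format assumed, not stated)
            f"alto/{stem}.xml",    # OCR text in ALTO XML
        ]
    return names


def write_sip(target, page_count):
    """Write a zip with empty placeholder members to a path or file object."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as sip:
        for name in sip_member_names(page_count):
            sip.writestr(name, b"")  # real content would be added here


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_sip(buffer, page_count=3)
    print(len(zipfile.ZipFile(buffer).namelist()), "members")
```

Under this assumed layout a SIP holds four files per page plus the METS document and the PDF.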
PREMIS in METS sections
• PREMIS distributed within different METS sections
• Common amdSec for the intellectual entity
– Events and Agents in digiprovMDs
– Rights in rightsMD
• amdSec for each file
– PREMIS Object in techMD
• objectCharacteristicsExtension containing MIX 2.0
– PREMIS Events in digiprovMDs
– Agents referenced from the common amdSec
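The distribution described above can be sketched as a METS skeleton. This is a minimal sketch, not the NLF profile itself: the ID values and the helper names `add_md` and `build_skeleton` are hypothetical, and the `xmlData` wrappers are left empty where the PREMIS (and MIX) content would go.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def add_md(amd_sec, section, md_id, mdtype):
    """Add a techMD/rightsMD/digiprovMD with an empty mdWrap/xmlData."""
    sec = ET.SubElement(amd_sec, m(section), {"ID": md_id})
    wrap = ET.SubElement(sec, m("mdWrap"), {"MDTYPE": mdtype})
    return ET.SubElement(wrap, m("xmlData"))


def build_skeleton(file_ids):
    mets = ET.Element(m("mets"))

    # Common amdSec for the intellectual entity: rights, events, agents.
    common = ET.SubElement(mets, m("amdSec"), {"ID": "AMD-COMMON"})
    add_md(common, "rightsMD", "RIGHTS-1", "PREMIS:RIGHTS")
    add_md(common, "digiprovMD", "EVENT-COMMON-1", "PREMIS:EVENT")
    add_md(common, "digiprovMD", "AGENT-1", "PREMIS:AGENT")

    # One amdSec per file: PREMIS Object in techMD, file events in digiprovMD.
    for fid in file_ids:
        amd = ET.SubElement(mets, m("amdSec"), {"ID": f"AMD-{fid}"})
        add_md(amd, "techMD", f"TECH-{fid}", "PREMIS:OBJECT")
        add_md(amd, "digiprovMD", f"EVENT-{fid}-1", "PREMIS:EVENT")
    return mets


if __name__ == "__main__":
    skeleton = build_skeleton(["p0001-master", "p0001-access"])
    print(ET.tostring(skeleton, encoding="unicode"))
```

Agents appear only in the common amdSec in this sketch; file-level events would point to them by identifier rather than repeat them.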
Representations
• METS structMap for listing files comprising a
representation
– Repr 1: Preservation images + ALTO XMLs
– Repr 2: Access images + ALTO XMLs
– Repr 3: PDF
• Derivation relationships expressed in PREMIS
– e.g. access copy is derived from master copy
• Other representation related information in amdSec
(e.g. linked events)
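As an illustration of the two mechanisms above, the following sketch builds one structMap per representation and a PREMIS relationship for a derived file. The file identifiers, the TYPE and LABEL values, the identifier type "local", and the relationship sub-type "has source" are assumptions chosen for the example, not values taken from the NLF profile.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "info:lc/xmlns/premis-v2"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def m(tag):
    return f"{{{METS_NS}}}{tag}"


def p(tag):
    return f"{{{PREMIS_NS}}}{tag}"


def representation_struct_map(label, file_ids):
    """List the files that make up one representation."""
    smap = ET.Element(m("structMap"), {"TYPE": "representation", "LABEL": label})
    div = ET.SubElement(smap, m("div"), {"TYPE": "representation"})
    for fid in file_ids:
        ET.SubElement(div, m("fptr"), {"FILEID": fid})
    return smap


def derivation(source_id):
    """PREMIS relationship saying the described object derives from source_id."""
    rel = ET.Element(p("relationship"))
    ET.SubElement(rel, p("relationshipType")).text = "derivation"
    ET.SubElement(rel, p("relationshipSubType")).text = "has source"
    related = ET.SubElement(rel, p("relatedObjectIdentification"))
    ET.SubElement(related, p("relatedObjectIdentifierType")).text = "local"
    ET.SubElement(related, p("relatedObjectIdentifierValue")).text = source_id
    return rel


if __name__ == "__main__":
    access = representation_struct_map(
        "Access images + ALTO XMLs", ["p0001-access", "p0001-alto"]
    )
    print(ET.tostring(access, encoding="unicode"))
    # Would sit inside the PREMIS Object of the access image.
    print(ET.tostring(derivation("p0001-master"), encoding="unicode"))
```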
Events 1/2
• Events common to intellectual entity
– Automatic processing
• Scanning
• Layout analysis
• OCR
• Fetching of MARC record from Voyager catalogue
– Manual processing
• Quality assurance
• Verification of image crop and deskew
Events 2/2
• File specific events
– Scanning of individual page
– Image processing (crop, deskew)
– File conversion (e.g. TIFF to JPEG2000)
– Message digest calculation
– File validation (JHOVE)
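One of the file-specific events above, message digest calculation, might be recorded roughly as follows. The slides do not name a digest algorithm, so SHA-256 is an assumption here, as are the identifier type "local", the event and agent identifiers, and the helper names.

```python
import hashlib
import io
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xmlns/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def p(tag):
    return f"{{{PREMIS_NS}}}{tag}"


def sha256_of(stream, chunk_size=1 << 20):
    """Digest a binary stream in chunks so large masters need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def premis_event(event_id, event_type, when, detail, agent_id):
    """Build a PREMIS event that links to the agent that performed it."""
    event = ET.Element(p("event"))
    ident = ET.SubElement(event, p("eventIdentifier"))
    ET.SubElement(ident, p("eventIdentifierType")).text = "local"
    ET.SubElement(ident, p("eventIdentifierValue")).text = event_id
    ET.SubElement(event, p("eventType")).text = event_type
    ET.SubElement(event, p("eventDateTime")).text = when
    ET.SubElement(event, p("eventDetail")).text = detail
    agent = ET.SubElement(event, p("linkingAgentIdentifier"))
    ET.SubElement(agent, p("linkingAgentIdentifierType")).text = "local"
    ET.SubElement(agent, p("linkingAgentIdentifierValue")).text = agent_id
    return event


if __name__ == "__main__":
    # A real workflow would open the image file in binary mode instead.
    checksum = sha256_of(io.BytesIO(b"placeholder image bytes"))
    event = premis_event(
        "event-0001-digest",            # hypothetical identifier
        "message digest calculation",
        "2009-10-07T12:00:00",
        f"SHA-256: {checksum}",
        "agent-docworks",               # hypothetical agent identifier
    )
    print(ET.tostring(event, encoding="unicode"))
```

The same digest would normally also appear in the fixity element of the PREMIS Object, which is one source of the redundancy raised under open issues.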
Agents
• Each software component
– Copinet (scanner software)
– DocWorks (post-processing software)
• FineReader (OCR component)
• ImageGear (image processing component)
• etc.
• Operators responsible for manual processing
Rights
• Currently, the copyright status of articles, illustrations or
monographs is not known
– For works with single author, copyright status is fairly easy to
determine.
– Huge problem with newspapers and serials
• Copyright status is expressed in MARC and PREMIS.
Possible values:
– undetermined
– unknown
– copyrighted
• New semantic unit for expiration date in PREMIS? More suitable
than copyrightNote for automatic processing
– copyright expired
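The case for a machine-readable expiration date can be illustrated with a small sketch. The expiry date field does not exist in PREMIS as described here; it is the hypothetical new semantic unit, and the function name and the rule that protection ends after the given date are assumptions.

```python
from datetime import date

# The four status values listed above.
STATUSES = {"undetermined", "unknown", "copyrighted", "copyright expired"}


def effective_status(recorded_status, expiry_date, today):
    """Derive the current status from the recorded one and an optional expiry date.

    expiry_date stands for the proposed (hypothetical) semantic unit: the last
    day on which the work is protected, or None when it is not recorded.
    """
    if recorded_status not in STATUSES:
        raise ValueError(f"unexpected copyright status: {recorded_status!r}")
    if (
        recorded_status == "copyrighted"
        and expiry_date is not None
        and today > expiry_date
    ):
        return "copyright expired"
    return recorded_status


if __name__ == "__main__":
    print(effective_status("copyrighted", date(2008, 12, 31), date(2009, 10, 7)))
    print(effective_status("copyrighted", None, date(2009, 10, 7)))
```

With the date held only as free text in copyrightNote, a system would have to parse the note before it could make the same decision.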
Open issues
• Redundant information (message digests, file sizes
etc.)
• For a monograph with 300 pages, the METS file will have
400,000 (!) lines of XML. Is the profile feasible at all?
– Not for human consumption!
– What about systems? Can they utilise all the captured
information? If yes, how do they perform?
– How much work will be needed to ingest SIPs into systems?
Thanks for your attention!
Questions?