Using PREMIS in monograph digitisation at the National Library of Finland
PREMIS Implementation Fair, 7th Oct 2009
Karo Salminen, Jukka Kervinen

Contents
• Background
  – METS at NLF
  – Digitisation at NLF
• Requirements for new METS profile
• METS package contents
• PREMIS in METS sections
  – Representations
  – Events
  – Agents
  – Rights
• Open issues

METS at NLF
• First used in 2005 for digitised newspapers and serials
• Currently 2 million pages in METS containers
• Original METS profile from METAe project, now adopted by CCS
• Need for improved preservation metadata
  – PREMIS for recording digitisation events
• National Digital Library Initiative
  – Funded by the Ministry of Education
  – Archives, Museums and Libraries
  – Strives for national long-term preservation solution (Kansallinen digitaalisen aineiston pitkäaikaissäilytysjärjestelmä)

Digitisation at NLF
• Revising the digitisation workflow in 2009
  – Integrated tools to cover the whole workflow and record events
  – Moving towards mass digitisation
• Not only scanning
  – OCR
  – Structural analysis and markup
    • Articles, chapters, illustrations
    • Level depends on available funding

Requirements for new METS profile
• Generic profile to capture
  – events and agents related to digitisation (provenance information), and
  – technical metadata
• Will serve as SIP for digital preservation system
  – No selected system yet: shooting at a moving target
• Intention to utilise existing schemas and best practices as much as possible

METS package contents
• METS with MARC, MODS, PREMIS and MIX
• For each page:
  – master image in JPEG2000 (lossless)
  – access image in JPEG
  – thumbnail
  – OCR'd text in ALTO XML
• Single PDF with hidden text layer
• Whole SIP compressed to a zip file

PREMIS in METS sections
• PREMIS distributed within different METS sections
• Common amdSec for the intellectual entity
  – Events and Agents in digiprovMDs
  – Rights in rightsMD
• amdSec for each file
  – PREMIS Object in techMD
    • objectCharacteristicsExtension containing MIX 2.0
  – PREMIS Events in digiprovMDs
  – Agents referenced from the common amdSec

Representations
• METS structMap for listing files comprising a representation
  – Repr 1: Preservation images + ALTO XMLs
  – Repr 2: Access images + ALTO XMLs
  – Repr 3: PDF
• Derivation relationships expressed in PREMIS
  – e.g. access copy is derived from master copy
• Other representation-related information in amdSec (e.g. linked events)

Events 1/2
• Events common to intellectual entity
  – Automatic processing
    • Scanning
    • Layout analysis
    • OCR
    • Fetching of MARC record from Voyager catalogue
  – Manual processing
    • Quality assurance
    • Verification of image crop and deskew

Events 2/2
• File-specific events
  – Scanning of individual page
  – Image processing (crop, deskew)
  – File conversion (e.g. TIFF to JPEG2000)
  – Message digest calculation
  – File validation (JHOVE)

Agents
• Each software component
  – Copinet (scanner software)
  – DocWorks (post-processing software)
    • FineReader (OCR component)
    • ImageGear (image processing component)
    • etc.
• Operators responsible for manual processing

Rights
• Currently, copyright status of articles, illustrations or monographs not known
  – For works with single author, copyright status is fairly easy to determine.
  – Huge problem with newspapers and serials
• Copyright status is expressed in MARC and PREMIS. Possible values:
  – undetermined
  – unknown
  – copyrighted
    • new semantic unit for expiration date in PREMIS? More suitable than copyrightNote for automatic processing
  – copyright expired

Open issues
• Redundant information (message digests, file sizes etc.)
• For monograph with 300 pages, METS will have 400,000 (!) lines of XML: is the profile feasible at all?
  – Not for human consumption!
  – What about systems? Can they utilise all the captured information? If yes, how do they perform?
  – How much work will be needed to ingest SIPs into systems?

Thanks for your attention! Questions?
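
Illustration: file-level PREMIS Events in digiprovMDs

The per-file layout described under "PREMIS in METS sections" (one amdSec per file, with each PREMIS Event wrapped in its own digiprovMD and Agents referenced by identifier) could be sketched roughly as follows. This is a minimal sketch and not the NLF profile itself: it assumes the PREMIS 2.x namespace, and the identifiers, event type strings, timestamps and helper names (premis_event, file_amd_sec) are hypothetical values chosen for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "info:lc/xmlns/premis-v2"  # assumption: PREMIS 2.x schema

ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def p(tag):
    """Qualify a tag name with the PREMIS namespace."""
    return f"{{{PREMIS_NS}}}{tag}"


def premis_event(event_id, event_type, date_time, agent_id):
    """Build one PREMIS event that links to an agent by identifier."""
    event = ET.Element(p("event"))
    ident = ET.SubElement(event, p("eventIdentifier"))
    ET.SubElement(ident, p("eventIdentifierType")).text = "local"
    ET.SubElement(ident, p("eventIdentifierValue")).text = event_id
    ET.SubElement(event, p("eventType")).text = event_type
    ET.SubElement(event, p("eventDateTime")).text = date_time
    # The agent itself is described once, in the common amdSec;
    # here it is only referenced.
    link = ET.SubElement(event, p("linkingAgentIdentifier"))
    ET.SubElement(link, p("linkingAgentIdentifierType")).text = "local"
    ET.SubElement(link, p("linkingAgentIdentifierValue")).text = agent_id
    return event


def file_amd_sec(amd_id, events):
    """Wrap each event in its own digiprovMD inside a per-file amdSec."""
    amd = ET.Element(m("amdSec"), {"ID": amd_id})
    for n, event in enumerate(events, start=1):
        dp = ET.SubElement(amd, m("digiprovMD"), {"ID": f"{amd_id}-EV{n}"})
        wrap = ET.SubElement(dp, m("mdWrap"), {"MDTYPE": "PREMIS:EVENT"})
        ET.SubElement(wrap, m("xmlData")).append(event)
    return amd


# Hypothetical events for a single page image.
amd = file_amd_sec(
    "AMD-PAGE-0001",
    [
        premis_event("EV-0001", "message digest calculation",
                     "2009-10-07T12:00:00", "AGENT-DOCWORKS"),
        premis_event("EV-0002", "file validation",
                     "2009-10-07T12:00:05", "AGENT-JHOVE"),
    ],
)
xml_text = ET.tostring(amd, encoding="unicode")
print(xml_text)
```

Even this reduced form needs roughly a dozen XML elements per event, which gives a feel for how per-file events across several hundred pages add up to the METS sizes raised under "Open issues".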