Using PREMIS in monograph digitisation at the National Library of Finland

PREMIS Implementation Fair, 7th Oct 2009
Karo Salminen, Jukka Kervinen

Contents
• Background
  – METS at NLF
  – Digitisation at NLF
• Requirements for new METS profile
• METS package contents
• PREMIS in METS sections
  – Representations
  – Events
  – Agents
  – Rights
• Open issues

METS at NLF
• First used in 2005 for digitised newspapers and serials
• Currently 2 million pages in METS containers
• Original METS profile from METAe project, now adopted by CCS
• Need for improved preservation metadata
  – PREMIS for recording digitisation events
• National Digital Library Initiative
  – Funded by the Ministry of Education
  – Archives, Museums and Libraries
  – Strives for national long-term preservation solution (Kansallinen digitaalisen aineiston pitkäaikaissäilytysjärjestelmä)

Digitisation at NLF
• Revising the digitisation workflow in 2009
  – Integrated tools to cover the whole workflow and record events
  – Moving towards mass digitisation
• Not only scanning
  – OCR
  – Structural analysis and markup
    • Articles, chapters, illustrations
    • Level depends on available funding

Requirements for new METS profile
• Generic profile to capture
  – events and agents related to digitisation (provenance information), and
  – technical metadata
• Will serve as SIP for digital preservation system
  – No selected system yet, so we are shooting at a moving target
• Intention to utilise existing schemas and best practices as much as possible

METS package contents
• METS with MARC, MODS, PREMIS and MIX
• For each page:
  – master image in JPEG2000 (lossless)
  – access image in JPEG
  – thumbnail
  – OCRd text in ALTO XML
• Single PDF with hidden text layer
• Whole SIP compressed to a zip file

PREMIS in METS sections
• PREMIS distributed within different METS sections
• Common amdSec for the intellectual entity
  – Events and Agents in digiprovMDs
  – Rights in rightsMD
• amdSec for each file
  – PREMIS Object in techMD
    • objectCharacteristicsExtension containing MIX 2.0
  – PREMIS Events in digiprovMDs
  – Agents referenced from the common amdSec

Representations
• METS structMap for listing files comprising a representation
  – Repr 1: Preservation images + ALTO XMLs
  – Repr 2: Access images + ALTO XMLs
  – Repr 3: PDF
• Derivation relationships expressed in PREMIS
  – e.g. access copy is derived from master copy
• Other representation related information in amdSec (e.g. linked events)

Events 1/2
• Events common to intellectual entity
  – Automatic processing
    • Scanning
    • Layout analysis
    • OCR
    • Fetching of MARC record from Voyager catalogue
  – Manual processing
    • Quality assurance
    • Verification of image crop and deskew

Events 2/2
• File specific events
  – Scanning of individual page
  – Image processing (crop, deskew)
  – File conversion (e.g. TIFF to JPEG2000)
  – Message digest calculation
  – File validation (JHOVE)

Agents
• Each software component
  – Copinet (scanner software)
  – DocWorks (post-processing software)
    • FineReader (OCR component)
    • ImageGear (image processing component)
    • etc.
• Operators responsible for manual processing

Rights
• Currently, copyright status of articles, illustrations or monographs not known
  – For works with single author, copyright status is fairly easy to determine.
  – Huge problem with newspapers and serials
• Copyright status is expressed in MARC and PREMIS. Possible values:
  – undetermined
  – unknown
  – copyrighted
    • New semantic unit for expiration date in PREMIS? More suitable than copyrightNote for automatic processing
  – copyright expired

Open issues
• Redundant information (message digests, file sizes etc.)
• For a monograph with 300 pages, METS will have 400,000 (!) lines of XML: is the profile feasible at all?
  – Not for human consumption!
  – What about systems? Can they utilise all the captured information? If yes, how do they perform?
  – How much work will be needed to ingest SIPs into systems?

Thanks for your attention! Questions?
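
Illustration: PREMIS placement in METS

The layout described under "PREMIS in METS sections" can be sketched in a few lines of Python. This is only an illustrative skeleton, not the NLF profile itself: the ID values, the function names (add_md, build_skeleton) and the choice of one OCR event and one FineReader agent are assumptions made for the example, and the mandatory PREMIS semantic units (identifiers, dates and so on) are left out, so the output is not schema-valid. The namespaces are the public METS and PREMIS 2.0 ones.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
PREMIS_NS = "info:lc/xmlns/premis-v2"  # PREMIS 2.0 namespace

ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def mets(tag):
    """Qualified name for a METS element."""
    return f"{{{METS_NS}}}{tag}"


def premis(tag):
    """Qualified name for a PREMIS element."""
    return f"{{{PREMIS_NS}}}{tag}"


def add_md(amd_sec, section, md_id, md_type, premis_root):
    """Add a techMD/rightsMD/digiprovMD wrapping one (mostly empty) PREMIS element.

    Hypothetical helper: returns the PREMIS element so callers can fill it in.
    """
    md = ET.SubElement(amd_sec, mets(section), ID=md_id)
    wrap = ET.SubElement(md, mets("mdWrap"), MDTYPE=md_type)
    xml_data = ET.SubElement(wrap, mets("xmlData"))
    return ET.SubElement(xml_data, premis(premis_root))


def build_skeleton(file_ids):
    """Build a METS skeleton with a common amdSec plus one amdSec per file."""
    root = ET.Element(mets("mets"))

    # Common amdSec for the intellectual entity: Rights, Events and Agents.
    common = ET.SubElement(root, mets("amdSec"), ID="AMD-COMMON")
    add_md(common, "rightsMD", "RIGHTS-1", "PREMIS:RIGHTS", "rights")
    event = add_md(common, "digiprovMD", "EVT-OCR", "PREMIS:EVENT", "event")
    ET.SubElement(event, premis("eventType")).text = "OCR"
    agent = add_md(common, "digiprovMD", "AGENT-FINEREADER", "PREMIS:AGENT", "agent")
    ET.SubElement(agent, premis("agentName")).text = "FineReader"

    # One amdSec per file: PREMIS Object in techMD, file specific Events in digiprovMD.
    for fid in file_ids:
        amd = ET.SubElement(root, mets("amdSec"), ID=f"AMD-{fid}")
        add_md(amd, "techMD", f"TECH-{fid}", "PREMIS:OBJECT", "object")
        event = add_md(amd, "digiprovMD", f"EVT-DIGEST-{fid}", "PREMIS:EVENT", "event")
        ET.SubElement(event, premis("eventType")).text = "message digest calculation"

    return root


if __name__ == "__main__":
    skeleton = build_skeleton(["IMG0001", "IMG0002"])
    print(ET.tostring(skeleton, encoding="unicode"))
```

Even this stripped-down skeleton adds one amdSec with two metadata sections per file, which gives a feel for how a 300-page monograph with several files per page grows to hundreds of thousands of lines once full PREMIS and MIX content is filled in.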
