Chronicling America and the National Digital Newspaper Program: Technical Aspects

Part 1: Newspapers and Microfilm
• Challenges
• USNP
Part 2: Technical Details
• Image views
• Text searching
• Indexing
Part 3: Managing a newspaper digitization project

PIALA 2010 UH Manoa Hamilton Library

Challenges
• Newspapers are a difficult medium
  – Never meant to last, made for daily use and disposal
  – Pages crumble and acid corrodes the materials
• Tracking serial publications over time
• Patron demand increased, storage space grew scarce, binding costs rose

Microfilm
• Adopted in the 1920s as a standard
• Turns newspaper from a storage nightmare to a relatively easy medium to handle
• Libraries had to decide what to do with the hardcopy
  – Keep in holdings?
  – Deaccession?

United States Newspaper Program (USNP)
• Began in 1982
• Funded by National Endowment for the Humanities, managed by the Library of Congress
• University of Hawai’i with Hawaiian Historical Society, Hawai’i State Archives and State Library contributed for Hawai’i
• In mid-2000s: the USNP had received over $54 million in NEH support & non-federal contributions of approx $19.6 million
• Bibliographic records for over 140,000 newspaper titles; access to 70 million pages of newsprint in microfilm

USNP
• Goal: Locate, catalog, and microfilm newspapers
• Hawai’i microfilmed 260,000 pages and cataloged 476 titles
• Program ended in 2007

USNP Preservation Microfilming Guidelines
• Optimum legibility
  – Image orientation & reduction ratios to fill frame & obtain greatest degree of legibility in public use copies
• Quality
  – Each roll of first generation film shall be inspected frame-by-frame by both the filming agency and the project for density and resolution and to determine that the film is free of emulsion scratches, abrasions, fingerprints, spots, fog, and other defects
http://www.loc.gov/preserv/usnpguidelines.html
USNP Preservation Microfilming Guidelines
• Density
  – No less than five readings at start, middle & end of each reel with a transmission densitometer calibrated daily
  – Maximum (Dmax) density measurements taken on exposed image with no words or graphics
  – Background densities no lower than 0.80 & no higher than 1.20; lower densities preferred for older pages & to facilitate production of reader-printer & enlargement prints
  – Base-plus-fog density (Dmin) on the master negative shall not exceed 0.10

National Endowment for the Humanities and Library of Congress created NDNP
• No single US collection of newspapers
• Every institution focusing on particular themes relating to their collecting plans
• Thousands of volumes of newspapers spread across the country
• Enhance access to newspapers, building on the foundation of the United States Newspaper Program

NDNP Overview
• 2-year awards to state projects, renewable
• Digitize 100,000 pages of microfilmed newspaper
• Newspapers selected must date from 1836 to 1922
• Historical essays on each newspaper
• Collation and quality control on all papers

NDNP Goals
• 20-year span with phased, sustainable development of 30 million page database
• Establish technical conversion specs & practices for efficient basic discovery & access
• Develop production tools to ensure good digital objects that can be managed & preserved long-term
• Provide public access to and take preservation responsibility for the digitized newspapers
• Create a national resource of historically significant newspapers from all the states and U.S. territories

NDNP Microfilm-related Challenges
• Where are the master reels?
• Copyright issues (who filmed the newspapers and owns the master microfilm)
• Technical specifications (poorly filmed, low density readings, etc.)
• Microfilm standards applied vary widely

• No universally accepted metadata standard for historical newspapers
• Online historical newspapers produced by public or private sector existed as discrete systems, with metadata structures not designed for interoperability
• Titles, issues, pages and reels all need to be represented as different yet related classes of objects

NDNP Digital Deliverables
• Images scanned at 300-400 dpi
• Three formats:
  – Grayscale, uncompressed TIFF 6.0 images
  – Compressed JPEG2000 images
  – PDF image with hidden text
• Accompanying structural and technical metadata
• OCR text for all pages

NDNP Scanning specifications
• De-skew images with a skew of greater than 3 degrees
• Crop to visible edge of page
• Capture grayscale preservation microfilm targets

NDNP OCR specifications
• Conform to ALTO XML schema
  – ALTO (Analyzed Layout and Text Object) is an XML (Extensible Markup Language) Schema that details technical metadata for describing the layout and content of physical text resources
• Bounding box coordinate data
  – Each column is sectioned and coordinates are used to place words

NDNP Metadata requirements
(Metadata is information about information)
• METS (Metadata Encoding and Transmission Standard) format records preservation metadata
• Structural metadata to relate pages to title, date, and edition; sequence pages within issue or section; and to identify image and OCR files
• Technical metadata to support the functions of the Library of Congress repository

XML Rules
• Single, unique root element
• Matching open/close tags
• Consistent capitalization
• Correctly nested elements (no overlapping elements)
• Attribute values
  enclosed in quotes
• No repeating attributes in an element
• Provides international, vendor independent standard for describing information

Family of XML data standards includes:
• METS – Metadata Encoding and Transmission Standard
• MODS – Metadata Object Description Schema
• PREMIS – PREservation Metadata Implementation Strategies
• EAD – Encoded Archival Description

METS (Metadata Encoding and Transmission Standard)
XML Schema for the purpose of creating XML files that define:
• the hierarchical structure of digital library objects (images, text files, etc.)
• the names and locations of the files
• the associated metadata (e.g., MODS)

Metadata Object Description Schema (MODS)
• An XML Schema designed for expressing bibliographic data
• (Think of it as an alternative to the MARC format)

Sections of a METS file
<mets>
  <metsHdr/> - METS header (document talks about itself)
  <dmdSec/> - Descriptive metadata (MODS, etc.)
  <amdSec/> - Administrative metadata (copyright info., etc.)
  <fileSec/> - File section (names and locations of files)
  <structMap/> - Structural map (relationships of the parts)
  <structLink/> - Linking information
  <behaviorSec/> - Binding executables/actions to object
</mets>

Title METS
• Combines bibliographic and holdings data in a single title record, converted from MARC to MARC XML format
• Titles digitized will have additional data
  – descriptive essays, more precise geographic coverage data
  – which is put in a Metadata Object Description Schema (MODS) object within the larger METS document

Issue and Reel METS
• Issue METS
  – Issue Data
  – Page Data
• Reel METS
  – Reel Data
  – Target Data

WHY?
• XML structure used by software for creation of multiple outputs:
  – HTML/XHTML for Web display; PDF for printing
• Ease of editing (single records or batches of records)
• Ability to validate data
• Ease of data management and publishing
• Interoperability
  – Repository submission and OAI harvesting

All that coding pays off for the user when SEARCHING
• Geographic metadata
• Title metadata
• Date metadata

Keyword searching
• OCR/OWR does not yield article “transcriptions”; text OCR’d from images of newspapers is used for searching purposes
• Several options
  – ANY of the words, ALL of the words
  – EXACT PHRASE
  – Proximity search: look for words within 5, 10, 50 or 100 words of one another

Page thumbnail view
• Click on thumbnail or description of page to view larger version

Page view
• Different format can be selected with one click

Browse Issues
• A calendar view indicating which issues have been digitized
• Can change which year you’re viewing
• Browse First Pages

Project Management
From Microfilm to Digital Images: Managing a Newspaper Conversion Project

NDNP & University of Hawai’i
• UH first grant began in July 2008, running until June 2010
• Grant renewed: July 2010-June 2012
• Utilizing the microfilm created under the USNP
  – Excellent quality microfilm (in theory)
  – Fewer problems with cataloging/description, acquiring 2N duplicates (in theory)

Project Management
• Request for Proposals (RFP)
  – Include all LC technical specifications
• Position Description(s)
  – Coordinator, students
• Hiring and Training

Project components
• Microfilm identification and duplication
• Digitization
• Metadata creation & validation

Microfilm selection
• Choose what is
  important to your institution(s) if possible
• Copyright
  – Reels created by or for your institution
  – Reels by ProQuest, etc.: you may have to ask for permission and pay much higher duplication fees
• Decide
  – Complete runs of few titles, or many short/incomplete runs of a lot of titles

Vendors
• iArchives
  – Leaders in the field
  – Lots of experience
• OCLC/BSLW (Backstage Library Works)
• Apex CoVantage
• Northern Micrographics (NMT)
• Local or national microfilm duplication companies

Equipment
• 10 external hard drives, 500 GB each (Western Digital MyBooks), and Pelican cases
• 1 PC with dual monitors
• Software: Library of Congress’ Digital Viewer and Validator (DVV)
• Densitometer
• Microfilm reader/scanner

Our Stuff
• Densitometer
• Pelican cases
• Microfilm scanner
• PC with 2 monitors & portable HDs (red)

Staffing
• Project Coordinator
• Quality Control Technician
• Graduate students
• Advisory Board
• Subject/history/newspaper specialists

Metadata Collection
• Density readings
• Recorded onto a spreadsheet

Preparing the Microfilm: Metadata
• Data from OCLC MARC record & local holdings

Preparing the Microfilm: Collation
• Review use copy of reel
  – Missing issues or pages
  – Duplicate issues or pages
  – Mutilated pages
  – Other abnormalities (e.g.
    pages out of order, incorrect dates)

Preparing the Microfilm: Collation
• Review use copy, record data on spreadsheet

iArchives Digitization Workflow
[Diagram. Labels: Film Scanning; Split, De-Skew, Crop; Shared Storage (NAS); Image Processing; Image Metadata; Page/Reel Metadata; Workflow Manager DB; OCR Framework; Post Process; Customer Deliverables; QC checkpoints between stages. Key: automatic process (image processing, OCR, …); manual process (image + page metadata); quality control.]

Automated Processing Cloud
[Diagram. Labels: Scan; QC; Split, Crop & DeSkew; iArchives OWR Framework; 3 leading OCR software programs; 2,000,000-word dictionary; 2,000,000-name dictionary.]

Post-vendor validation
• Once the hard drive is returned, we verify/validate the batch using the DVV program
• Verification compares the metadata listed in the master XML file to the metadata found in the issue XML files for correctness
• Validation is done if a new master XML file needs to be created. It creates checksums for each file and records them in the subsequent metadata
• Copy contents of hard drive onto our server

Quality Control
• Image quality
  – Too dark? Too light? Skewed? Correct image?
  – Compare digitized image to microfilmed image
• No Missing Issue/Page tags
• Review metadata
  – Dates
  – LCCN #
  – Locations

Thumbnail View
• Can use DVV or any graphics program

Quality Control
• LC Digital Viewer and Validator (DVV)

Metadata Viewer

OCR

Headers

Title Essays - 500 words
• Describes newspaper’s history
  – Date of establishment
  – Editors
  – Type of news reported
  – Political viewpoint
  – Where is the paper today?
• Published to Chronicling America

Links
• Chronicling America: http://chroniclingamerica.loc.gov/
• Library of Congress: http://www.loc.gov/ndnp/
• National Endowment for the Humanities: http://www.neh.gov/projects/ndnp.html
• Hawai’i Newspapers: a union list: http://evols.library.manoa.hawaii.edu/handle/10524/2089
• Using <METS> and <MODS> to Create XML Standards-based Digital Library Applications: http://www.loc.gov/standards/mods/presentations/mets-mods-morgan-ala07/

Thank You! Mahalo! Kinisou Chapur!
Questions? Comments?
Email us at:
• chantiny@hawaii.edu
• erenst@hawaii.edu
https://sites.google.com/a/hawaii.edu/ndnp-hawaii/
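
Supplementary sketch: checking the XML Rules
The XML Rules slide (single root, matching and consistently capitalized tags, no overlapping elements, quoted and non-repeating attributes) can be illustrated with a short script. This is a minimal sketch, assuming Python's standard-library parser as a stand-in; it is not the Library of Congress DVV and does not validate against the METS schema. The skeleton reuses the section names from the "Sections of a METS file" slide without namespaces or content, so it is not a real METS record. The names is_well_formed, section_names, METS_SKELETON and BROKEN_SAMPLES are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative skeleton only: section names come from the slide; a real
# METS file declares the METS namespace and fills each section.
METS_SKELETON = """<mets>
  <metsHdr/>
  <dmdSec/>
  <amdSec/>
  <fileSec/>
  <structMap/>
  <structLink/>
  <behaviorSec/>
</mets>"""

# One deliberately broken sample per rule on the XML Rules slide.
BROKEN_SAMPLES = {
    "two root elements": "<mets/><mets/>",
    "inconsistent capitalization": "<mets></METS>",
    "overlapping elements": "<mets><fileSec><file></fileSec></file></mets>",
    "unquoted attribute value": "<mets ID=batch1/>",
    "repeated attribute": '<mets ID="a" ID="b"/>',
}


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def section_names(xml_text):
    """List the tag names of the root element's direct children, in order."""
    return [child.tag for child in ET.fromstring(xml_text)]


if __name__ == "__main__":
    print("skeleton well-formed:", is_well_formed(METS_SKELETON))
    print("sections:", section_names(METS_SKELETON))
    for rule, sample in BROKEN_SAMPLES.items():
        print(f"{rule}: well-formed = {is_well_formed(sample)}")
```

Well-formedness is only the first gate: a file can pass every rule above and still fail schema validation or the metadata cross-checks described under Post-vendor validation.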