4.0 Newspaper Digitization

This section includes a summary of the current prevailing best practices for newspaper digitization.

Table of contents

4.1 Background
4.2 Microfilm
• Summary
• Microfilm Evaluation
4.3 Best Practices for Olive Active Paper Projects
• Scanning Specifications
• File Naming Conventions
• Distillation (Segmentation, OCR, Output to XML)
4.4 Best Practices for NDNP Projects
• Summary
• Overview of Technical Approach for 2009-11 NDNP Awards
• Deliverables

4.1 Background

The single most critical factor in the success of newspaper digitization is the availability of good-quality microfilm. Although it is possible to digitize newspapers from an original print copy, that process is very labor-intensive and considerably more expensive than digitizing from film. Another key consideration is the platform to be used to deliver the digital content; the choice of platform will drive some decisions regarding technical specifications.

Although the goal of the National Digital Newspaper Program (NDNP) is to generate a set of best practices and national standards for newspaper digitization, there is currently considerable variation in practice and no consensus on several major issues. For example, NDNP does not offer subpage-level segmentation, also called article zoning, yet most members of the newspaper digitization community advocate article segmentation. With regard to scanning requirements, there are as many proponents of 8-bit grayscale as of bitonal scanning. The particular content management/delivery system may determine some technical specifications.

In considering source material, newspapers published before 1923 are in the public domain and may be freely digitized; orphaned post-1923 titles may also be freely digitized. If a newspaper that began publication before 1923 is still being published, there may be compelling reasons not to digitize pre-1923 content without the permission of the publisher.
Post-1923 titles still in publication can be digitized only with the permission of the publisher.

4.2 Microfilm

Summary

Microfilm may be unsuitable for digitization for many reasons:
• Poor condition of the original (poorly printed, stained, faded, damaged)
• Original newsprint poorly prepared for filming (e.g., page curvature of bound newsprint may produce gutter shadows)
• Original filmed at an unsuitably high reduction ratio, which can degrade image quality and OCR results
• Original filmed with variations in density within images or between exposures, which would necessitate adjusting scanning parameters within a reel to obtain proper contrast and focus
• Original filmed with uneven lighting
• Poor condition of the film (deteriorating, dirty, or scratched)

Generally, film produced following United States Newspaper Program (USNP) guidelines (established in the mid-1980s) and RLG preservation microfilming guidelines (established in the early 1990s) yields the best results. The USNP guidelines stipulate:
• Originals in good condition
• Use of a high-resolution camera
• Use of polyester film stock
• Reduction ratio between 16x and 20x
• Quality index of 8.0 or above (using a resolution test pattern)
• Background densities between 0.8 and 1.2 (the ideal is 1.0 for newspapers of average text quality, and 0.9-1.0 for originals with faint or broken text)
• Variation in density within an image or between exposures of no more than 0.2

Microfilm Evaluation

Film used for newspaper digitization should be a clean second-generation duplicate silver negative. (Negative film has less noise and better contrast, and scratches are easier to correct. Positive film is third generation, with lower resolution, and produces poor OCR results.) The polarity will be reversed during scanning. Scanning from service copies should be avoided.

In addition, film used for newspaper digitization should be polyester rather than acetate. Polyester film is stable and durable.
Acetate film should be duplicated to polyester stock before scanning. [Film produced before 1970 is probably on acetate stock; film produced between 1970 and the late 1980s may be. Hold a wound roll of film up to the light and examine the side of the roll: if no light shows through, it is probably acetate. Likewise, if the film is curled, warped, buckled, brittle, blistered, or smells of vinegar, it is probably acetate.]

Microfilm, especially microfilm of newspapers, is not perfect. Even if the resolution, reduction ratio, and densities are less than optimal, you can do sample scans and test for usability of OCR.

4.3 Best Practices for Olive Active Paper Projects

Scanning Specifications
• Create the master image to one of the following specifications:
  o TIFF (CCITT Group 4 compression), 300 dpi, bitonal. Recommended for newspapers containing few or no photographs or graphic elements.
  o TIFF (JPEG compression), 300 dpi, 8-bit grayscale. Recommended for newspapers containing many photographs or graphic elements.
• Clean the TIFF image, removing dirt and extraneous noise to improve compression and OCR, taking extra care not to break letters. Some dirt is preferable to broken letters.
• Crop page images to the page edge.
• De-skew pages exhibiting more than 3 degrees of skew.
• Split double-page frames (newspapers filmed two sheets per frame) into single-page images (one image per page).

File Naming Conventions

Post-2008 Projects

The Post-2008 naming convention applies to projects starting after March 2008. Continuing titles that began prior to 2008 will continue to use the Pre-2008 convention. The Post-2008 convention is based on a combination of the calendar date, section, edition, and page number of a given newspaper page.
Each TIFF file name should include the following elements, in this precise order:

CCC-nnn-YYYY-MM-DD-VV-NN-XXX.TIFF

o CCC: Project code (3 numbers), provided by Olive
o nnn: Publication name (3-letter code), provided by Olive
o YYYY: Year of issue date
o MM: Month of issue date
o DD: Day of issue date
o VV: Page version/edition (default is 01)
o NN: Section (default is 01)
o XXX: Page number in section (in the order in the paper)

Examples:
• 069-PMV-1985-12-12-01-01-001.tiff (Dec 12, 1985, ed. 1, sec. 1, pg. 1)
• 069-PMV-1985-12-12-01-01-002.tiff (Dec 12, 1985, ed. 1, sec. 1, pg. 2)
• 069-PMV-1985-12-12-02-02-001.tiff (Dec 12, 1985, ed. 2, sec. 2, pg. 1)

Pre-2008 Projects

The Pre-2008 convention is based on a combination of the calendar date and page number of a given newspaper page. Each TIFF file name should include the following elements, in this precise order:

(Project,3)-(Name,3)-(Year,4)-(Month,2)-(Day,2)-(PageNo,3)-(Status,*).tif

o CCC: Project code (3 numbers), provided by Olive
o nnn: Publication name (3-letter code), provided by Olive
o YYYY: Year of issue date
o MM: Month of issue date
o DD: Day of issue date
o XXX: Page number in section (in the order in the paper)
o Status: single

Example:
• 039-TUC-1905-01-31-001-single.tif (Urbana Courier, Jan 31, 1905, page 1)

Olive will not accept files that use a different naming convention. When unresolvable duplicates or discrepancies in page/issue/date numbering exist, the vendor will save the scan in a separate directory named appropriately, such as "Error". Files in this folder must be examined and resolved by UIUC before being supplied to Olive. (Note: the Post-2008 convention remedies this file naming problem.) Files in the "Error" folder will follow the above convention with the addition of a CopyNo to indicate the duplicate version.
(Project,3)-(Name,3)-(Year,4)-(Month,2)-(Day,2)-(PageNo,3)-(CopyNo,3)-(Status).tif

Example:
• 039-TUC-1905-01-30-001-single.tif (Courier, Jan 30, 1905)
• 039-TUC-1905-01-30-001-002-single.tif (duplicate saved in "Error" folder)

Distillation (Segmentation, OCR, Output to XML)

Distillation processing is performed by Olive and includes image analysis, article segmentation, OCR processing, and output to XML. The information on the distillation process in this section was provided by Olive Software in 2007.

The automatic segmentation process is tasked with recognizing newspaper information objects, or entities: these can be articles, pictures, or ads. It also recognizes each entity's internal components (in an article, for example, these include title, subtitle, byline, and body text). All this is done through analysis of page layout geometry and the fonts used on each page.

Once segmentation has been performed, the print edition is converted to a "Digital Newspaper." A Digital Newspaper consists of images and XML files: the images are rectangular snapshots from which every information object in the newspaper can be built up; the XML files record the text, structure, and layout of the document. The distillation process was designed to overcome the problems inherent in converting scanned images and microfilm, as well as the inability of OCR programs to properly read page layout geometry.

Distillation is a five-step process: image analysis, layout analysis, OCR, entity building, and output to XML.

• Step One: Image Analysis

This stage is crucial to the distilling process: the page image is analyzed to find horizontal and vertical lines, text strings, and picture regions. Nonlinear distortion, combined with the complex layout of the newspaper page, makes life difficult for OCR software. If entire page regions are ignored by the OCR, mistakenly treated as dead areas or pictures, segmentation is compromised.
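Returning briefly to the file naming conventions above: because Olive rejects nonconforming names, it can be worth screening a batch of files before delivery. A minimal sketch for the Post-2008 convention (the regular expression is my own reading of the spec, not an Olive-supplied tool):

```python
import re

# Post-2008 Olive convention: CCC-nnn-YYYY-MM-DD-VV-NN-XXX.tiff
# (project code, publication code, issue date, edition, section, page number)
POST_2008 = re.compile(
    r"^(?P<project>\d{3})-(?P<pub>[A-Z]{3})-"
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-"
    r"(?P<edition>\d{2})-(?P<section>\d{2})-(?P<page>\d{3})\.tiff?$",
    re.IGNORECASE,
)

def check_name(filename):
    """Return the parsed parts of a conforming name, or None if nonconforming."""
    m = POST_2008.match(filename)
    return m.groupdict() if m else None

print(check_name("069-PMV-1985-12-12-01-01-001.tiff"))  # parsed parts
print(check_name("069-PMV-1985-12-12-1-1-1.tiff"))      # None: field widths wrong
```

Running such a check over a vendor delivery makes it easy to shunt nonconforming scans into the "Error" directory for review.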
Scanned images suffer from nonlinear distortion, distortion that cannot be predicted and compensated for. A few examples of nonlinear distortion:
o Text and line warping, resulting from bad adjustment of the scanned paper.
o Washed-out letters, generally in titles, resulting from loss of dye.
o Poor-quality text, including speckles and general distortion, resulting from low-quality microfilming (different lighting on different parts of the page) or scanning of late-generation film (a copy of a copy of a copy).

Olive's image analyzer overcomes nonlinear distortion and poor image quality using image processing algorithms developed especially for this purpose.

• Step Two: Layout Analysis

The segmentation engine used for digital materials was adapted for this stage of the process. Working like a human eye, the segmentation engine views a newspaper page from a distance and analyzes the geometry of the page using the lines and shapes recognized during image analysis. It builds a net of image objects, examining alignment, size, brightness, and other characteristics of groups of elements on the grid. The result is a rough page structure definition, which includes text regions classified as body text or titles.

• Step Three: Optical Character Recognition

After image analysis and layout analysis have been completed, the OCR process is performed on each of the text regions detected by the layout analyzer. This way, the OCR engine can work on relatively small rectangles, all of which contain text. The precision with which these regions are detected has a huge impact on overall OCR accuracy: the number of unrecognized or badly recognized areas decreases by a factor of two or three. The results of OCR are written into a PDF containing the full issue as page images. All information about word coordinates, font, size, and OCR errors is stored for analysis.
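As an illustration of the kind of per-word record such a process might retain for analysis, consider the sketch below. The structure and field names are hypothetical, not Olive's actual storage format:

```python
from dataclasses import dataclass

@dataclass
class WordResult:
    """One recognized word: its text, bounding box, and the styling and
    confidence metadata kept for later error analysis."""
    text: str
    bbox: tuple          # (x, y, width, height) in page-image pixels
    font: str
    size: float
    confidence: float    # 0.0-1.0, the engine's certainty in the reading

def suspect_words(words, threshold=0.85):
    """Flag low-confidence words so likely OCR errors can be reviewed."""
    return [w for w in words if w.confidence < threshold]

page = [
    WordResult("DAILY", (120, 40, 210, 60), "serif-bold", 48.0, 0.98),
    WordResult("C0URIER", (350, 40, 280, 60), "serif-bold", 48.0, 0.61),
]
print([w.text for w in suspect_words(page)])  # ['C0URIER']
```

Retaining coordinates alongside the text is what later allows entities to be reassembled from snapshots and search hits to be highlighted on the page image.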
• Step Four: Entity Building

In this stage, all the information gathered in image analysis, layout analysis, and OCR is collated. The segmentation engine analyzes textual objects and their optically recognized text to find entities and entity components. This structural information is also written into the PDF.

• Step Five: Output to XML

In the final stage, the structural and layout definitions gathered during the distillation process are written to non-proprietary XML files, together with the OCR-generated text. In addition, many rectangular snapshots of each newspaper page are taken and saved together with the text. These snapshots can be used to assemble any entity in the newspaper, using coordinates found in the XML. The data is stored within a flat-file XML repository, organized by an index tree by publication, date, section, page, and then by page components.

Olive's XML architecture is based on its Preservation Markup Language schema (PrXML). This schema maps the original document's content, style, and hidden intelligence into an open XML format. PrXML is a "hyper schema," not limited to a specific standard; Olive Software supports conversion of PrXML into other schemas such as OAI and METS.

4.4 Best Practices for NDNP Projects

[This section is extracted from the National Digital Newspaper Program (NDNP) Technical Guidelines for Applicants 2009 document (66-page PDF), available at http://www.loc.gov/ndnp/pdf/NDNP_200911TechNotes.pdf]

Summary

LC specifications for NDNP include the following:
• Film selection: In evaluating microfilm, use film produced at a reduction ratio below 20x if possible. The master negative film duplicated for scanning should have resolution test patterns readable at 5.0 or higher. Densities within images and between exposures should fall within the range 0.9-1.2, with variation of no more than 0.2 within an image or between exposures.
• Scanning: Scan from second-generation duplicate silver negative microfilm. Capture specifications are 8-bit grayscale at 300 to 400 dpi relative to the physical dimensions of the original newspaper. Master images are to be provided to LC as uncompressed TIFF 6.0. Two-up film images (two pages per frame) should be split so that there is one page image per file. De-skew images with a skew greater than 3 degrees. Crop to include the visible edge of the page.
• OCR: One OCR text file per page image. Text in the UTF-8 character set. No graphic elements saved with the OCR text. OCR text ordered column by column. OCR text files include bounding-box coordinate data at the word level. OCR will conform to the ALTO XML schema; all page images must be accompanied by an ALTO XML file containing the recognized text.
• Derivatives: In addition to the master TIFF image file and OCR text, participants will provide a searchable PDF image with hidden text for each page image and a JPEG2000-compressed image file. (The PDF image with hidden text can be created at the time of processing by the OCR application.) The PDF page image will be grayscale, downsampled to 150 dpi, and encoded using a medium JPEG quality setting. JPEG2000 compression will be 8:1.
• Metadata: Issue and page metadata, reel metadata, and general information (see Appendix A: Digital Asset Metadata Elements Dictionary).

Overview of Technical Approach for 2009-11 NDNP Awards

The National Digital Newspaper Program is a long-term effort, and the technical environment will change as the program continues. The National Endowment for the Humanities (NEH) and the Library of Congress (LC) have selected a technical approach that balances long-term objectives against shorter-term constraints.
These include:
• convenient accessibility over the World Wide Web for the general public to the entire collection as it grows, through a consistent interface and using proven technology;
• page images of sufficient spatial and tonal resolution to support effective performance of OCR (optical character recognition) software and representation of printed half-tones, given the limitations of microfilm, with the expectation that future improvements in OCR and image processing will be applied to the same images;
• the use of digital formats with a high probability of sustainability, in particular using standard formats where possible and proprietary formats only where widely adopted; and
• attention to the cost of digital conversion and maintenance of the resulting assets.

The goal of the initial program phase is to build a Web-accessible NDNP delivery application with sufficient geographic coverage and digital assets to validate the technical approach, to serve as a test bed for future research and development in techniques to enhance the content and access interface, and to support effective use by scholars and the general public. This award cycle is a continuation of the initial program development phase. In succeeding phases of the project, the approach and associated guidelines will be evaluated and revised based on feedback from awardees, experience in providing access to historic newspapers online, and technological advances.
In summary, the current technical approach is based on:
• grayscale images from microfilm, scanned at the maximum resolution possible between 300 and 400 dpi relative to the original material;
• OCR with word-bounding boxes, uncorrected, with recognition of columns but without segmentation of pages into articles;
• structural metadata for pages, issues, editions, and titles to support a chronologically based browsing interface;
• copies of all page images and associated metadata at LC;
• an interface designed specifically for access to historic newspapers in the public domain, mounted at LC (the initial interface will permit full-text searches with retrieval of individual page images and highlighting of search words on the images); and
• the ability of awardees to re-use any digital assets created for NDNP in other systems or for other purposes.

NEH and LC recognize that other institutions may choose other approaches or formats for their own digital repository and delivery systems, whether because they weigh costs and benefits differently or because they want compatibility with existing systems. Applicants may pursue local approaches in parallel with participation in NDNP, with the overall goal of providing effective widespread access to newspapers through scanning and text conversion and of evaluating alternative interfaces for navigating and exploring large collections of newspapers. Applicants who use other formats locally must be capable of providing digital assets to NDNP according to the specifications described below.

The National Digital Newspaper Program supports a consistent technical specification for digital newspaper reproductions and associated metadata in order to maintain parity of services for materials from a variety of institutions and collections and to support the "best practices" of today's understanding of digital preservation needs.
Deliverables

Awardees are expected to deliver the following to the Library of Congress, to allow construction of a permanent archive and a unified interface for searching and browsing the entire NDNP collection. After the cooperative agreements are announced, LC will convene a meeting of awardees to review these technical guidelines and establish work-plan milestones and specifications for 2009-11 deliverables.

For each title:
• An up-to-date MARC record from the CONSER database, fully conformant to current standards for cataloging U.S. print newspapers;
• Additional title-level metadata related to the title run(s) digitized and delivered (see Appendix A: Digital Asset Metadata Elements); and
• A Newspaper History Essay on the scope and content of each title and its history and significance (500 words).

For each issue/edition:
• Structural metadata for issues/editions digitized, organized by date (see Appendix A: Digital Asset Metadata Elements).

For each newspaper page:
• The page image in two raster formats:
  o Grayscale, scanned at the maximum resolution possible between 300 and 400 dpi relative to the original material, as uncompressed TIFF 6.0 (see Appendix B: File Format Profiles); and
  o The same image, compressed as JPEG2000 (see Appendix B: File Format Profiles);
• OCR text and associated bounding boxes for words (see Appendix B: File Format Profiles), one file per page image;
• A PDF Image with Hidden Text, i.e., with text and image correlated (see Appendix B: File Format Profiles);
• Structural metadata to relate pages to title, date, and edition, to sequence pages within an issue or section, and to identify image and OCR files (see Appendix A: Digital Asset Metadata Elements and Appendix C: XML Metadata Templates); and
• Technical metadata to support the functions of a trusted repository (see Appendix A: Digital Asset Metadata Elements, Appendix B: File Format Profiles, and Appendix C: XML Metadata Templates).
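The per-page OCR deliverable pairs each recognized word with its bounding box in ALTO XML. A minimal sketch of reading word coordinates from such a file (the fragment below is illustrative, and real ALTO files declare an XML namespace that this sketch omits for brevity):

```python
import xml.etree.ElementTree as ET

# An illustrative, namespace-free ALTO-style fragment: each <String>
# element carries the recognized word and its bounding box on the page.
alto = """
<alto>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="WEEKLY" HPOS="120" VPOS="80" WIDTH="340" HEIGHT="60"/>
    <String CONTENT="GAZETTE" HPOS="480" VPOS="80" WIDTH="390" HEIGHT="60"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>
"""

def words_with_boxes(alto_xml):
    """Return (word, (x, y, width, height)) pairs from an ALTO document."""
    root = ET.fromstring(alto_xml)
    return [
        (s.get("CONTENT"),
         (int(s.get("HPOS")), int(s.get("VPOS")),
          int(s.get("WIDTH")), int(s.get("HEIGHT"))))
        for s in root.iter("String")
    ]

print(words_with_boxes(alto))
```

It is word coordinates of this kind that allow the delivery interface to highlight search terms directly on the page image.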
Awardees will deliver all digital assets in a METS (Metadata Encoding and Transmission Standard) object structure, according to an XML batch template structure. (See Appendix C: XML Metadata Templates.) For delivery, the awardee shall organize the page images and related files for each newspaper title in a hierarchical directory structure sufficient for identification of the individual digital assets from the metadata provided.

For each microfilm reel digitized:
• A second-generation (2N) duplicate silver negative microfilm, made from the camera master, will be barcoded and deposited with the Library of Congress on completion of the award (LC will supply barcodes for all reels); and
• Technical metadata concerning the quality characteristics of the film used for digitization (see Appendix A: Digital Asset Metadata Elements/Reel Information) will be encoded in a METS object with the other digital assets (see Appendix C: XML Metadata Templates).
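The guidelines leave the exact directory layout to the awardee, requiring only that it identify each asset unambiguously. One hypothetical arrangement groups the four per-page deliverables under title/date/edition directories; the layout, helper name, and sample identifier below are illustrative, not an NDNP requirement:

```python
from pathlib import Path
import tempfile

def page_paths(root, title_id, date, edition, page):
    """Build paths for the four per-page deliverables (master TIFF, JPEG2000,
    OCR XML, PDF) under a hypothetical title/date/edition hierarchy."""
    base = Path(root) / title_id / date / f"ed-{edition}"
    stem = f"{page:04d}"
    return {ext: base / f"{stem}.{ext}" for ext in ("tif", "jp2", "xml", "pdf")}

# Stage an empty delivery tree for one page to show the resulting structure.
with tempfile.TemporaryDirectory() as root:
    paths = page_paths(root, "sn84031490", "1905-01-31", 1, 1)
    for p in paths.values():
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()
    print(sorted(p.name for p in paths.values()))
```

Because title, date, edition, and page sequence are all encoded in the path, each file can be matched back to the structural metadata in the METS batch without consulting anything else.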