4.0 Newspaper Digitization

This section includes a summary of the current prevailing best practices for newspaper
digitization.
Table of contents
4.1 Background
4.2 Microfilm
• Summary
• Microfilm Evaluation
4.3 Best Practices for Olive Active Paper Projects
• Scanning Specifications
• File Naming Conventions
• Distillation (Segmentation, OCR, Output to XML)
4.4 Best Practices for NDNP Projects
• Summary
• Overview of Technical Approach for 2009-11 NDNP Awards
• Deliverables
4.1 Background
The single most critical factor in the success of newspaper digitization is the availability of good
quality microfilm. Although it is possible to digitize newspapers from an original print copy, this
process is very labor-intensive and considerably more expensive than digitizing from film.
Another key consideration is the platform to be used to deliver the digital content. The choice
of platform will drive some decisions regarding technical specifications. Although the goal of
the National Digital Newspaper Program (NDNP) is to generate a set of best practices and
national standards for newspaper digitization, there is currently considerable variation in
practice and no consensus regarding several major issues.
For example, NDNP does not offer subpage-level segmentation, also called article zoning. Most
members of the newspaper digitization community, however, do advocate article
segmentation. With regard to scanning requirements, there are as many proponents of 8-bit
grayscale as there are for bitonal scanning. The particular content management/delivery
system may determine some technical specifications.
In considering source material, newspapers published before 1923 are in the public domain and
may be freely digitized. Orphaned post-1923 titles may also be freely digitized. If a newspaper
that began publication before 1923 is still being published, there may be compelling reasons
not to digitize pre-1923 content without the permission of the publisher. Post-1923 titles still in
publication can be digitized only with the permission of the publisher.
4.2 Microfilm
Summary
Microfilm may be unsuitable for digitization due to many factors:
• Poor condition of original (poorly printed, stained, faded, damaged)
• Original newsprint poorly prepared for filming (e.g., page curvature of bound newsprint may produce gutter shadows)
• Original filmed at an unsuitable (too high) reduction ratio, which can affect image quality and OCR results
• Original filmed with variations in density within images or between exposures, which would necessitate adjustment of scanning parameters within a reel to obtain proper contrast and focus
• Original filmed using uneven lighting
• Poor condition of film (deteriorating, dirty, or scratched)
Generally, film produced following United States Newspaper Program Guidelines (established in
the mid-1980s) and RLG preservation microfilm guidelines (established in the early 1990s)
yields the best results. The USNP guidelines stipulate:
• Originals in good condition
• Use of a high-resolution camera
• Use of polyester film stock
• Reduction ratio between 16x and 20x
• Quality index of 8.0 or above (using a resolution test pattern)
• Background densities between 0.8 and 1.2 (ideal is 1.0 for newspapers of average text quality, and 0.9-1.0 for originals with faint or broken text)
• Variation of densities within an image or between exposures of no more than 0.2
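The density criteria above lend themselves to a quick automated check when densitometer readings for a reel are available. A minimal sketch, assuming readings are supplied as a simple list of floats (the function name and input format are illustrative, not part of any USNP tooling; the variation check is simplified to the overall spread of the readings):

```python
# Sketch: check measured background densities against the USNP guidelines
# quoted above (0.8-1.2 range, variation of no more than 0.2).
# Illustrative helper, not a standard tool.

def check_densities(densities, low=0.8, high=1.2, max_variation=0.2):
    """Return a list of human-readable problems; empty if the readings pass."""
    problems = []
    for i, d in enumerate(densities):
        if not (low <= d <= high):
            problems.append(f"exposure {i}: density {d:.2f} outside {low}-{high}")
    # Simplification: treat the overall spread as the between-exposure variation.
    if densities and (max(densities) - min(densities)) > max_variation:
        problems.append(
            f"density spread {max(densities) - min(densities):.2f} exceeds {max_variation}"
        )
    return problems

print(check_densities([0.95, 1.00, 1.05]))  # within range, spread 0.1: passes
print(check_densities([0.85, 1.10]))        # spread 0.25: flagged
```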
Microfilm Evaluation
Film used for newspaper digitization should be a clean second-generation duplicate silver negative. (Negative film offers less noise and better contrast, and scratches are easier to correct. Positive film is third generation, with lower resolution, which produces poor OCR results.) The polarity will be reversed during scanning. Scanning from service copies should be avoided. In addition, film used for newspaper digitization should be polyester rather than acetate. Polyester film is stable and durable. Acetate film should be duplicated to polyester stock before scanning.
[Film produced before 1970 is probably on acetate stock. Film produced between 1970 and the
late 1980s may be on acetate stock. Hold up a wound roll of film to the light and examine the
side of the roll. If no light shows through, it is probably acetate. Also, if it is curled, warped, buckled, brittle, blistered, or has a vinegar odor, it is probably acetate.]
Microfilm, especially microfilm of newspapers, is not perfect. Even if the resolution, reduction
ratio, and densities are less than optimal, you can do sample scans and test for usability of OCR.
4.3 Best Practices for Olive Active Paper Projects
Scanning Specifications
• Create master images to one of the following specifications:
  o TIFF (CCITT Group 4 compression), 300 DPI, bitonal. Recommended for newspapers containing few or no photographs or graphic elements.
  o TIFF (JPEG compression), 300 DPI, 8-bit grayscale. Recommended for newspapers containing many photographs or graphic elements.
• Clean TIFF images, removing dirt and extraneous noise to improve compression and OCR. Take extra care not to break letters; some dirt is preferable to broken letters.
• Crop page images to the page edge
• De-skew pages exhibiting more than 3 degrees of skew
• Split double-page frames (newspapers filmed two sheets per frame) into single-page images (one image per page)
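The choice between bitonal and 8-bit grayscale masters carries a large difference in storage cost, which simple arithmetic makes concrete. A back-of-the-envelope sketch at the 300 dpi specified above (the 17 x 22 in. broadsheet dimensions are an illustrative assumption, and the figures are for raw pixel data before TIFF compression):

```python
# Rough uncompressed size of a master image: pixels x bits-per-pixel / 8.
# Bits per pixel: 1 for bitonal, 8 for grayscale.

def raw_size_mb(width_in, height_in, dpi, bits_per_pixel):
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel / 8 / 1_000_000  # decimal megabytes

# Hypothetical 17 x 22 in. broadsheet page at 300 dpi:
print(f"bitonal:   {raw_size_mb(17, 22, 300, 1):.1f} MB")
print(f"grayscale: {raw_size_mb(17, 22, 300, 8):.1f} MB")
```

Grayscale masters come out roughly eight times larger than bitonal ones before compression, which is part of why the grayscale-versus-bitonal debate noted in 4.1 is also a cost question.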
File Naming Conventions
• Post-2008 Projects
The Post-2008 naming convention applies to projects starting after March 2008. Continuing titles that began prior to 2008 will continue to use the Pre-2008 convention.
The Post-2008 convention is based on a combination of the calendar date, section, edition, and page number of a given newspaper page. Each TIFF file name should include the following elements, in this precise order:
CCC-nnn-YYYY-MM-DD-VV-NN-XXX.TIFF
o CCC: Project code (3 digits), provided by Olive
o nnn: Publication name (3-letter code), provided by Olive
o YYYY: Year of issue date
o MM: Month of issue date
o DD: Day of issue date
o VV: Page version/edition (default is 01)
o NN: Section (default is 01)
o XXX: Page number in section (in the order in the paper)
Examples:
• 069-PMV-1985-12-12-01-01-001.tiff (Dec 12, 1985, ed. 1, sec. 1, pg. 1)
• 069-PMV-1985-12-12-01-01-002.tiff (Dec 12, 1985, ed. 1, sec. 1, pg. 2)
• 069-PMV-1985-12-12-02-02-001.tiff (Dec 12, 1985, ed. 2, sec. 2, pg. 1)
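The Post-2008 pattern is rigid enough to generate and validate programmatically, which is useful for vendor quality control. A sketch in Python (the helper and pattern names are illustrative; the 069/PMV codes come from the examples above):

```python
import re

# Build a Post-2008 name from its components, zero-padded to the widths above.
def post2008_name(project, pub, year, month, day, edition=1, section=1, page=1):
    return (f"{project:03d}-{pub}-{year:04d}-{month:02d}-{day:02d}"
            f"-{edition:02d}-{section:02d}-{page:03d}.tiff")

# Validate an existing name against the CCC-nnn-YYYY-MM-DD-VV-NN-XXX pattern.
POST2008 = re.compile(
    r"^(?P<project>\d{3})-(?P<pub>[A-Za-z]{3})-(?P<year>\d{4})-(?P<month>\d{2})"
    r"-(?P<day>\d{2})-(?P<edition>\d{2})-(?P<section>\d{2})-(?P<page>\d{3})\.tiff?$",
    re.IGNORECASE,
)

name = post2008_name(69, "PMV", 1985, 12, 12, page=1)
print(name)                        # 069-PMV-1985-12-12-01-01-001.tiff
print(bool(POST2008.match(name)))  # True
```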
• Pre-2008 Projects
The Pre-2008 convention is based on a combination of the calendar date and page number of a given newspaper page. Each TIFF file name should include the following elements, in this precise order:
(Project,3)-(Name,3)-(Year,4)-(Month,2)-(Day,2)-(PageNo,3)-(Status,*).tif
o CCC: Project code (3 digits), provided by Olive
o nnn: Publication name (3-letter code), provided by Olive
o YYYY: Year of issue date
o MM: Month of issue date
o DD: Day of issue date
o XXX: Page number (in the order in the paper)
o Status: single
Example:
• 039-TUC-1905-01-31-001-single.tif (Urbana Courier, Jan 31, 1905, page 1)
Olive will not accept files utilizing a different naming convention. When unresolvable duplicates or discrepancies in page/issue/date numbering exist, the vendor will save the scan within a separate directory named appropriately as “Error” or something similar.
Files in this folder will need to be examined and resolved by UIUC before being supplied
to Olive. (Note: the Post-2008 convention remedies this file naming problem.)
Files in the “Error” folder will follow the above convention with the addition of a
CopyNo to indicate the duplicate version.
(Project,3)-(Name,3)-(Year,4)-(Month,2)-(Day,2)-(PageNo,3)-(CopyNo,3)-(Status).TIF
Example:
• 039-TUC-1905-01-30-001-single.tif (Courier Jan 30, 1905)
• 039-TUC-1905-01-30-001-002-single.tif (duplicate saved in “Error” folder)
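The Pre-2008 names, including the Error-folder variant with its extra CopyNo field, can likewise be checked with a single pattern. A sketch (the pattern name is illustrative; the status token is fixed to "single" per the convention above):

```python
import re

# Pre-2008 pattern, with an optional 3-digit CopyNo before the status token
# for duplicates saved in the "Error" folder.
PRE2008 = re.compile(
    r"^\d{3}-[A-Za-z]{3}-\d{4}-\d{2}-\d{2}-\d{3}"
    r"(?:-(?P<copy>\d{3}))?-single\.tiff?$",
    re.IGNORECASE,
)

for name in ("039-TUC-1905-01-30-001-single.tif",      # normal page
             "039-TUC-1905-01-30-001-002-single.tif",  # duplicate copy 002
             "039-TUC-1905-01-30-001.tif"):            # missing status: reject
    m = PRE2008.match(name)
    print(name, "->", "ok" if m else "rejected")
```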
Distillation (Segmentation, OCR, Output to XML)
Distillation processing is performed by Olive and includes image analysis, article segmentation,
OCR processing, and output to XML. Information on the distillation process, as provided by
Olive Software in 2007, is detailed in this section.
The automatic segmentation process is tasked with recognizing newspaper information
objects or entities – these can be articles, pictures, or ads. It also recognizes each entity’s
internal components (in an article, for example, these include title, subtitle, byline, and body
text). All this is done through analysis of page layout geometry and the fonts used on each
page.
Once segmentation has been performed, the print edition is converted to a “Digital
Newspaper.” A Digital Newspaper consists of images and XML files. The images are
rectangular snapshots which can be used to build up every information object in the
newspaper; the XML files record the text, structure and layout of the document.
The distillation process was designed to overcome the inherent problems associated with the
conversion of scanned images and microfilm as well as the inability of OCR programs to
properly read page layout geometry. Distillation is a five-step process: Image analysis, Layout
analysis, OCR, Entity building and Output to XML.
• Step One: Image analysis
This stage is crucial to the distilling process since the page image is analyzed to find
horizontal and vertical lines, text strings, and picture regions. Nonlinear distortion,
combined with the complex layout of the newspaper page, makes life difficult for OCR
software. If entire page regions are ignored by the OCR, mistakenly treated as dead
areas or pictures, segmentation is compromised.
Scanned images suffer from nonlinear distortion – distortion that cannot be predicted
and compensated for. A few examples of nonlinear distortion:
o Text and line warping, resulting from bad adjustment of scanned paper.
o Washed-out letters, generally in titles, resulting from loss of dye.
o Poor-quality text including speckles and general distortion, resulting from low-quality microfilming (different lighting on different parts of the page) or scanning of late-generation film (a copy of a copy of a copy).
Olive’s image analyzer overcomes nonlinear distortion and poor image quality using
image processing algorithms that were developed especially for this purpose.
• Step Two: Layout analysis
The segmentation engine used in digital materials was adapted for this stage of the
process. Working like a human eye, the segmentation engine views a newspaper page
from a distance and analyzes the geometry of the page using lines and shapes
recognized in image analysis. It builds a net of image objects, examining alignment,
size, brightness, and other characteristics of groups of elements on the grid. The result
is a rough page structure definition, which includes text regions, classified as body text
or titles.
• Step Three: Optical Character Recognition
After separate image analysis and layout analysis have been completed, the OCR
process is performed on each of the text regions detected by the layout analyzer. This
way, the OCR engine can work on relatively small rectangles, all of which contain text.
The precision with which these regions are detected has a huge impact on overall OCR
accuracy. The number of un-recognized or badly-recognized areas decreases by a
factor of two or three.
The results of OCR are written into a PDF containing a full issue in page images. All
information about word coordinates, font, size and OCR errors is stored for analysis.
• Step Four: Entity building
In this stage, all the information gathered in image analysis, layout analysis, and OCR is collated. The segmentation engine analyzes textual objects and their optically-recognized text to find entities and entity components.
This structural information is also written into the PDF.
• Step Five: Output to XML
In the final stage, the structural and layout definitions gathered during the distillation
process are written to non-proprietary XML files, together with the OCR-generated
text. In addition, many rectangular snapshots of each newspaper page are taken, and
saved together with the text. These snapshots can be used to assemble any entity in
the newspaper, using coordinates found in the XML.
The data is stored within a flat-file XML repository, organized in an index tree by publication, date, section, and page, and then by page components.
Olive’s XML architecture is based on its Preservation Markup Language Schema
(PrXML). This schema maps the original document’s content, style, and hidden
intelligence in an open source XML format. PrXML is a “Hyper Schema”, not limited to
a specific standard. Olive Software enables conversions of the PrXML schema into
other schemas such as OAI and METS.
4.4 Best Practices for NDNP Projects
[This section (2.5.4) is extracted from the National Digital Newspaper Program (NDNP)
Technical Guidelines for Applicants 2009 document (66 page PDF) available at
http://www.loc.gov/ndnp/pdf/NDNP_200911TechNotes.pdf]
Summary
LC specifications for NDNP include the following:
• Film selection: In evaluating microfilm, use film produced at a reduction ratio below 20x if possible. The master negative film duplicated for scanning should have resolution test patterns readable at 5.0 or higher. Density within images and between exposures should fall within the range 0.9-1.2, with variation of no more than 0.2 within an image and between exposures.
• Scanning: Scan from second-generation duplicate silver negative microfilm. Capture specifications are 8-bit grayscale between 300 and 400 dpi relative to the physical dimensions of the original newspaper. Master images are to be provided to LC as uncompressed TIFF 6.0 files. Two-up film images should be split so that there is one page image per file. De-skew images with a skew greater than 3 degrees. Crop to include the visible edge of the page.
• OCR: One OCR text file per page image. Text in the UTF-8 character set. No graphic elements saved with OCR text. OCR text ordered column by column. OCR text files include bounding-box coordinate data at the word level. OCR will conform to the ALTO XML schema. All page images must be accompanied by an ALTO XML file containing recognized text.
• Derivatives: In addition to the master TIFF image file and OCR text, participants will provide a searchable PDF image with hidden text for each page image and a JPEG2000 compressed image file. (The PDF image with hidden text can be created at the time of processing by the OCR application.) The PDF page image will be grayscale, downsampled to 150 dpi, and encoded using a medium JPEG quality setting. JPEG2000 compression will be 8:1.
• Metadata: issue and page metadata, reel metadata, general information (see Appendix A: Digital Asset Metadata Elements Dictionary)
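The word-level bounding boxes called for in the OCR specification are what ALTO encodes. The fragment below illustrates the general shape using ALTO's element and attribute names (TextLine, String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT); it is a simplified sketch, not a complete, schema-valid ALTO document, and the words and coordinates are invented:

```python
import xml.etree.ElementTree as ET

# Invented word data: (text, horizontal pos, vertical pos, width, height),
# in pixel coordinates of the page image.
words = [("THE", 120, 340, 90, 28), ("DAILY", 220, 340, 130, 28)]

# One ALTO-style TextLine holding one String element per recognized word.
line = ET.Element("TextLine")
for text, hpos, vpos, width, height in words:
    ET.SubElement(line, "String", CONTENT=text, HPOS=str(hpos),
                  VPOS=str(vpos), WIDTH=str(width), HEIGHT=str(height))

print(ET.tostring(line, encoding="unicode"))
```

Storing coordinates alongside the text is what allows a delivery interface to highlight search terms directly on the page image, as described in 4.4.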
Overview of Technical Approach for 2009-11 NDNP Awards
The National Digital Newspaper Program is a long-term effort and the technical environment
will change as the program continues. The National Endowment for the Humanities (NEH) and
the Library of Congress (LC) have selected a technical approach to balance long-term objectives
and shorter-term constraints. These include:
• convenient accessibility over the World Wide Web for the general public to the entire collection as it grows, through a consistent interface and using proven technology;
• page images of sufficient spatial and tonal resolution to support effective performance of OCR (optical character recognition) software and representation of printed half-tones, given the limitations of microfilm, expecting that future improvements in OCR and image processing will be applied to the same images;
• the use of digital formats with a high probability of sustainability, in particular, using standard formats where possible and proprietary formats only where widely adopted; and
• attention to the cost of digital conversion and maintenance of the resulting assets.
The goal of the initial program phase is to build a Web-accessible NDNP delivery application
with sufficient geographic coverage and digital assets to validate the technical approach and to
serve as a test bed for future research and development in techniques to enhance the content
and access interface, and to support effective use by scholars and the general public. This
award cycle is a continuation of the initial program development phase.
In succeeding phases of the project, the approach and associated guidelines will be evaluated
and revised based on feedback from awardees, experience in providing access to historic
newspapers online, and technological advances.
In summary, the current technical approach is based on:
• grayscale images (scanned for maximum resolution possible between 300-400 dpi, relative to the original material) from microfilm,
• OCR with word-bounding boxes, uncorrected, with recognition of columns, but without segmentation of pages into articles,
• structural metadata for pages, issues, editions, and titles to support a chronologically-based browsing interface,
• copies of all page images and associated metadata at LC,
• an interface designed specifically for access to historic newspapers in the public domain, mounted at LC (the initial interface will permit full-text searches with retrieval of individual page images, and highlighting of search words on the images), and
• the ability of awardees to re-use any digital assets created for NDNP in other systems or for other purposes.
NEH and LC recognize that other institutions may choose other approaches or formats for their own digital repository and delivery systems, and thus either weigh costs and benefits differently or wish for compatibility with existing systems. Applicants may pursue local approaches in parallel
with participation in NDNP, with the overall goal of providing effective widespread access to
newspapers through scanning and text conversion and evaluating alternative interfaces for
navigating and exploring large collections of newspapers. Applicants who use other formats
locally must be capable of providing digital assets to the NDNP according to the specifications
described below.
The National Digital Newspaper Program supports a consistent technical specification for digital
newspaper reproductions and associated metadata in order to maintain parity of services for
materials from a variety of institutions and collections and to support the “best practices” of
today’s understanding of digital preservation needs.
Deliverables
Awardees are expected to deliver the following to the Library of Congress, to allow construction
of a permanent archive and a unified interface for searching and browsing the entire NDNP
collection. After the cooperative agreements are announced, LC will convene a meeting of
awardees to review these technical guidelines and establish work-plan milestones and specifications for 2009-11 deliverables.
For each title
• Up-to-date MARC record from the CONSER database, fully conformant to current standards for cataloging U.S. print newspapers,
• Additional title-level metadata related to the title run(s) digitized and delivered (see Appendix A: Digital Asset Metadata Elements), and
• Newspaper History Essay – scope and content of each title, history and significance – 500 words.
For each issue/edition
• Structural metadata for issues/editions digitized and organized by date (see Appendix A: Digital Asset Metadata Elements)
For each newspaper page
• Page image in two raster formats:
  o Grayscale, scanned for maximum resolution possible between 300-400 dpi, relative to the original material, uncompressed TIFF 6.0 (Appendix B – File Format Profiles), and
  o The same image, compressed as JPEG2000 (Appendix B – File Format Profiles),
• OCR text and associated bounding boxes for words (see Appendix B – File Format Profiles), one file per page image,
• PDF Image with Hidden Text, i.e., with text and image correlated (see Appendix B – File Format Profiles),
• Structural metadata to relate pages to title, date, and edition; to sequence pages within an issue or section; and to identify image and OCR files (see Appendix A: Digital Asset Metadata Elements and Appendix C – XML Metadata Templates), and
• Technical metadata to support the functions of a trusted repository (see Appendix A: Digital Asset Metadata Elements, Appendix B – File Format Profiles, and Appendix C – XML Metadata Templates).
Awardees will deliver all digital assets in a METS (Metadata Encoding and Transmission Standard) object structure, according to an XML Batch template structure. (See Appendix C – XML Metadata Templates.)
For delivery, the awardee shall organize the page images and related files for each newspaper
title in a hierarchical directory structure sufficient for identification of the individual digital
assets from the metadata provided.
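The guidelines leave the exact hierarchy to the awardee, provided each asset can be identified from the metadata. One hypothetical layout, sketched below, uses title/reel/issue-date directories holding the four per-page files (TIFF, JPEG2000, PDF, OCR XML); all directory and identifier names here are invented for illustration, not prescribed by NDNP:

```python
from pathlib import Path

# Hypothetical layout: <batch>/<title id>/<reel>/<issue date>/<page>.<ext>
# for the four per-page deliverables listed above.
def page_paths(root, title_id, reel, issue_date, page):
    base = Path(root) / title_id / reel / issue_date
    stem = f"{page:04d}"
    return [base / f"{stem}.tif",   # uncompressed TIFF 6.0 master
            base / f"{stem}.jp2",   # JPEG2000 derivative
            base / f"{stem}.pdf",   # PDF image with hidden text
            base / f"{stem}.xml"]   # OCR text with bounding boxes

for p in page_paths("batch_001", "sn84031490", "00000000001", "1905-01-31", 1):
    print(p)
```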
For each microfilm reel digitized:
• A second-generation (2N) duplicate silver negative microfilm, made from the camera master, will be barcoded and deposited with the Library of Congress on completion of the award (LC to supply barcodes for all reels), and
• Technical metadata concerning the quality characteristics of the film used for digitization (see Appendix A – Digital Asset Metadata Elements/Reel Information) will be encoded in a METS object with other digital assets (see Appendix C – XML Metadata Templates).