Chronicling America and the
National Digital Newspaper
Program:
Technical Aspects
 Part 1: Newspapers and Microfilm
 Challenges
 USNP
 Part 2: Technical Details
 Image views
 Text searching
 Indexing
 Part 3: Managing a newspaper digitization
project
PIALA 2010
UH Manoa Hamilton Library
Challenges
 Newspapers are a difficult medium
 Never meant to last, made for daily use
and disposal
 Pages crumble and acid corrodes the
materials
 Tracking serial publications over time
 Patron demand increased, storage space
grew scarce, binding costs rose
Microfilm
 Adopted in the 1920s as a standard
 Turns newspaper from a storage
nightmare to a relatively easy medium
to handle
 Libraries had to decide what to do
with the hardcopy
 Keep in holdings?
 Deaccession?
United States Newspaper
Program (USNP) Began in
1982
 Funded by National Endowment for the
Humanities, managed by the Library of
Congress
 University of Hawai’i with Hawaiian Historical
Society, Hawai’i State Archives and State
Library contributed for Hawai’i
 In mid-2000s: the USNP had received over $54
million in NEH support & non-federal
contributions of approx $19.6 million
 Bibliographic records for over 140,000
newspaper titles; access to 70 million pages of
newsprint in microfilm
USNP
 Goal: Locate, catalog, and microfilm
newspapers
 Hawai’i microfilmed 260,000 pages
and cataloged 476 titles
 Program ended in 2007
USNP Preservation
Microfilming
Guidelines
 Optimum legibility
Image orientation & reduction ratios to fill frame
& obtain greatest degree of legibility in public
use copies
 Quality
Each roll of first generation film shall be inspected
frame-by-frame by both the filming agency and
the project for density and resolution and to
determine that the film is free of emulsion
scratches, abrasions, fingerprints, spots, fog,
and other defects
http://www.loc.gov/preserv/usnpguidelines.html
USNP Preservation
Microfilming Guidelines
 Density
• No less than five readings at start, middle & end of
each reel with a transmission densitometer
calibrated daily
• Maximum (Dmax) density measurements taken on
exposed image with no words or graphics
• Background densities no lower than 0.80 & no
higher than 1.20; lower densities preferred for
older pages & to facilitate production of reader-printer & enlargement prints
• Base-plus-fog density (Dmin) on the master
negative shall not exceed 0.10
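The density limits above lend themselves to a simple automated check. The sketch below is illustrative only and is not part of the USNP guidelines: the function name is hypothetical, and it assumes "no less than five readings" means at least five background readings recorded per reel in total.

```python
# Thresholds quoted from the guidelines on this slide.
BACKGROUND_MIN = 0.80
BACKGROUND_MAX = 1.20
DMIN_MAX = 0.10
MIN_READINGS = 5  # assumption: counted per reel, not per position


def check_reel_density(background_readings, dmin):
    """Return a list of problems found; an empty list means the reel passes."""
    problems = []
    if len(background_readings) < MIN_READINGS:
        problems.append(
            f"only {len(background_readings)} readings; need at least {MIN_READINGS}"
        )
    for number, value in enumerate(background_readings, start=1):
        if not BACKGROUND_MIN <= value <= BACKGROUND_MAX:
            problems.append(
                f"reading {number} ({value:.2f}) is outside "
                f"{BACKGROUND_MIN:.2f}-{BACKGROUND_MAX:.2f}"
            )
    if dmin > DMIN_MAX:
        problems.append(f"Dmin {dmin:.2f} exceeds {DMIN_MAX:.2f}")
    return problems


if __name__ == "__main__":
    # Hypothetical readings for one reel.
    for problem in check_reel_density([0.70, 1.00], dmin=0.20):
        print(problem)
```

A reel with five in-range readings and a Dmin of 0.08 returns an empty list; the hypothetical reel in the demo reports three problems.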
National Endowment for
the Humanities and Library
of Congress created NDNP
 No single US collection of newspapers
 Every institution focusing on particular
themes relating to their collecting plans
 Thousands of volumes of newspapers
spread across the country
 Enhance access to newspapers, building
on the foundation of the United States
Newspaper Program
NDNP Overview
 2-Year awards to state projects,
renewable
 Digitize 100,000 pages of microfilmed
newspaper
 Newspapers picked must be from
between 1836 and 1922
 Historical essays on each newspaper
 Collation and Quality Control on all
papers
NDNP Goals
 20-year span with phased, sustainable development
of 30 million page database
 Establish technical conversion specs & practices for
efficient basic discovery & access
 Develop production tools to ensure good digital
objects that can be managed & preserved long-term
 Provide public access to and take preservation
responsibility for the digitized newspapers
 Create a national resource of historically significant
newspapers from all the states and U.S. territories
NDNP Microfilm-related
Challenges
 Where are the master reels?
 Copyright issues (Who filmed the
newspapers and owns the master
microfilm?)
 Technical specifications (Poorly filmed,
low density readings, etc)
 Microfilm standards applied vary widely
No universally accepted
metadata standard for
historical newspapers
 Online historical newspapers
produced by public or private sector
existed as discrete systems,
metadata structures not designed for
interoperability
Titles, issues, pages and reels all
need to be represented as different
yet related classes of objects
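One way to picture "different yet related classes of objects" is a small data model. The sketch below is an illustration only, not the NDNP schema: all class names, field names, and the placeholder LCCN are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Reel:
    number: str                      # identifier of the microfilm reel


@dataclass
class Page:
    sequence: int                    # position of the page within its issue
    reel_number: str                 # reel the page image was scanned from


@dataclass
class Issue:
    issued: date
    pages: list[Page] = field(default_factory=list)


@dataclass
class Title:
    lccn: str                        # placeholder value used below
    name: str
    issues: list[Issue] = field(default_factory=list)


def pages_on_reel(title, reel):
    """List (issue date, page sequence) pairs for pages scanned from one reel."""
    return [
        (issue.issued, page.sequence)
        for issue in title.issues
        for page in issue.pages
        if page.reel_number == reel.number
    ]


if __name__ == "__main__":
    gazette = Title(lccn="sn00000000", name="Example Gazette")
    gazette.issues.append(Issue(date(1900, 1, 1), [Page(1, "R1"), Page(2, "R1")]))
    gazette.issues.append(Issue(date(1900, 1, 2), [Page(1, "R2")]))
    print(pages_on_reel(gazette, Reel("R1")))
```

Pages belong to an issue and an issue to a title, while the reel is a separate object that pages point to, so the same page can be found by title and date or by reel.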
NDNP
Digital Deliverables
 Images scanned at 300-400 dpi
• Three formats:
 Grayscale, uncompressed TIFF 6.0
images
 Compressed JPEG2000 images
 PDF Image with hidden text
 Accompanying structural and
technical metadata
 OCR text for all pages
NDNP Scanning
specifications
 De-skew images with a skew of greater
than 3 degrees
 Crop to visible edge of page
 Capture grayscale preservation microfilm
targets
NDNP OCR
specifications
 Conform to ALTO XML schema
• ALTO (Analyzed Layout and Text Object)
is an XML (Extensible Markup Language)
Schema that details technical metadata
for describing the layout and content of
physical text resources
 Bounding box coordinate data
• Each column is sectioned and
coordinates are used to place words
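To make the bounding-box idea concrete, the sketch below reads word coordinates from a tiny ALTO-style fragment. The sample is simplified and hypothetical: it omits the namespace and header a real NDNP ALTO file carries, though the element and attribute names (String, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) follow the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; real files are much larger.
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="5000" HEIGHT="7000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Honolulu" HPOS="120" VPOS="340" WIDTH="410" HEIGHT="60"/>
            <String CONTENT="harbor" HPOS="560" VPOS="342" WIDTH="300" HEIGHT="58"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def word_boxes(alto_xml):
    """Return (word, left, top, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        # Compare local names so the sketch works with or without a namespace.
        if element.tag.rsplit("}", 1)[-1] == "String":
            boxes.append((
                element.get("CONTENT"),
                float(element.get("HPOS")),
                float(element.get("VPOS")),
                float(element.get("WIDTH")),
                float(element.get("HEIGHT")),
            ))
    return boxes


if __name__ == "__main__":
    for box in word_boxes(SAMPLE_ALTO):
        print(box)
```

These coordinates are what allow a viewer to highlight a search hit at the right place on the page image.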
NDNP
Metadata requirements
(Metadata is Information about Information)
 METS (Metadata Encoding and Transmission
Standard) format records preservation
metadata
 Structural metadata to relate pages to title,
date, and edition; sequence pages within issue
or section; and to identify image and OCR files
 Technical metadata to support the functions
of the Library of Congress repository
XML Rules

 Single, unique root element
 Matching open/close tags
 Consistent capitalization
 Correctly nested elements (no overlapping elements)
 Attribute values enclosed in quotes
 No repeating attributes in an element
 Provides an international, vendor-independent standard
for describing information
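Each of these rules can be demonstrated with a standard XML parser, which rejects any document that breaks one. The sketch below uses Python's built-in parser; the sample fragments and the helper name are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One hypothetical fragment per rule on this slide.
EXAMPLES = {
    "well formed": '<issue date="1900-01-01"><page/></issue>',
    "two root elements": "<issue/><issue/>",
    "inconsistent capitalization": "<Issue></issue>",
    "overlapping elements": "<a><b></a></b>",
    "unquoted attribute value": "<page seq=1/>",
    "repeated attribute": '<page seq="1" seq="2"/>',
}

if __name__ == "__main__":
    for label, fragment in EXAMPLES.items():
        verdict = "accepted" if is_well_formed(fragment) else "rejected"
        print(f"{label}: {verdict}")
```

Well-formedness is only the first check; validating against a schema such as METS or ALTO is a separate, stricter step.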
Family of XML data
standards includes:
 METS – Metadata Encoding and
Transmission Standard
 MODS – Metadata Object
Description Schema
 PREMIS – PREservation Metadata
Implementation Strategies
 EAD – Encoded Archival
Description
METS
(Metadata Encoding and
Transmission Standard)
 XML Schema for the purpose of
creating XML files that define:
• the hierarchical structure of digital
library objects (images, text files,
etc.)
• the names and locations of the files
• the associated metadata (e.g., MODS)
Metadata Object
Description Schema
(MODS)
An XML Schema designed for expressing
bibliographic data
(Think of it as an alternative to the MARC
format)
Sections of a METS file
<mets>
<metsHdr/> - METS header (document talks about itself)
<dmdSec/> - Descriptive metadata (MODS, etc.)
<amdSec/> - Administrative metadata (copyright info, etc.)
<fileSec/> - File section (names and locations of files)
<structMap/> - Structural map (relationships of the parts)
<structLink/> - Linking information
<behaviorSec/> - Binding executables/actions to object
</mets>
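The section order above can be reproduced programmatically. The sketch below builds an empty skeleton with the seven sections in that order. It is only a structural illustration: the function name is hypothetical, and the output is not schema-valid METS, since real records need identifiers and content inside each section.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# Section names in the order listed on this slide.
SECTIONS = (
    "metsHdr", "dmdSec", "amdSec", "fileSec",
    "structMap", "structLink", "behaviorSec",
)


def build_mets_skeleton():
    """Create a <mets> root holding one empty element per section."""
    root = ET.Element(f"{{{METS_NS}}}mets")
    for name in SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


if __name__ == "__main__":
    print(ET.tostring(build_mets_skeleton(), encoding="unicode"))
```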
Title METS
 Combines bibliographic and holdings data
in a single title record, converted from
MARC to MARC XML format
 Titles digitized will have additional data
• descriptive essays, more precise geographic
coverage data
• which is put in a Metadata Object Description
Schema (MODS) object within the larger METS
document
Issue and Reel METS
 Issue METS
• Issue Data
• Page Data
 Reel METS
• Reel Data
• Target Data
WHY?
 XML structure used by software for creation of
multiple outputs:
• HTML/XHTML for Web display; PDF for printing
 Ease of editing (single records or batches of
records)
 Ability to validate data
 Ease of data management and publishing
 Interoperability
• Repository submission and OAI harvesting
All that coding pays off
for the user when
SEARCHING
 Geographic
metadata
 Title metadata
 Date metadata
Keyword searching
 OCR/OWR does not yield article
“transcriptions”; text OCR’d from images of
newspapers is used for searching purposes
 Several options
• ANY of the words, ALL of the words
• EXACT PHRASE
• Proximity search
– Look for words within 5, 10, 50 or 100
words of one another
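A proximity search can be sketched in a few lines. The example below is a toy illustration, not how Chronicling America implements its search: it assumes "within N words" means the two words' positions in the OCR text differ by at most N, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_proximity(text, word_a, word_b, max_distance):
    """Return True if word_a and word_b occur within max_distance words."""
    tokens = tokenize(text)
    positions_a = [i for i, token in enumerate(tokens) if token == word_a.lower()]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b.lower()]
    return any(
        i != j and abs(i - j) <= max_distance
        for i in positions_a
        for j in positions_b
    )


if __name__ == "__main__":
    ocr_text = "The steamer arrived in Honolulu harbor on Tuesday morning"
    print(within_proximity(ocr_text, "steamer", "harbor", 5))
    print(within_proximity(ocr_text, "steamer", "morning", 5))
```

Because the match runs on raw OCR text, a misread word simply fails to match, which is one reason OCR output serves searching better than it serves as a transcription.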
Page thumbnail view
 Click on thumbnail or description of page to view larger version
Page view
 A different format can be selected with one click
Browse Issues
 A calendar view indicating which issues have been digitized
 Can change which year you’re viewing
 Browse First Pages
Project Management
From Microfilm to Digital Images
Managing a Newspaper Conversion Project
NDNP
&
University of Hawai’i
 UH first grant began in July 2008,
running until June 2010
 Grant renewed: July 2010-June 2012
 Utilizing the microfilm created under the
USNP
 Excellent quality microfilm (in theory)
 Fewer problems with cataloging/description,
acquiring 2N duplicates (in theory)
Project Management
 Request for Proposals (RFP)
• Include all LC technical specifications
 Position Description(s)
• Coordinator, students
 Hiring and Training
Project components
 Microfilm identification and duplication
 Digitization
 Metadata creation & Validation
Microfilm selection
 Choose what is important to your institution(s) if
possible
 Copyright
• Reels created by or for your institution
• Reels by ProQuest, etc.: you may have to ask for permission
and pay much higher duplication fees
 Decide
• Complete runs of few titles, or many short/incomplete runs
of a lot of titles
Vendors
 iArchives
• Leaders in the field
• Lots of experience
 OCLC/BSLW (Backstage Library Works)
 Apex CoVantage
 Northern Micrographics (NMT)
 Local or national microfilm duplication
companies
Equipment
 10 500 GB External Hard Drives (Western
Digital MyBooks) and Pelican cases
 1 PC with double monitor
 Software: Library of Congress’ Digital
Validator and Viewer (DVV)
 Densitometer
 Microfilm reader/scanner
Our Stuff
[Photos: densitometer; Pelican cases; microfilm scanner; PC with 2 monitors & portable HDs (red)]
Staffing
 Project Coordinator
• Quality Control Technician
 Graduate students
 Advisory Board
 Subject/history/newspaper specialists
Metadata Collection
 Density readings
 Recorded onto a spreadsheet
Preparing the Microfilm:
Metadata
Data from OCLC MARC record & local
holdings
Preparing the Microfilm:
Collation
 Review use copy of reel
• Missing issues or pages
• Duplicate issues or pages
• Mutilated pages
• Other abnormalities (e.g., pages out of
order, incorrect dates)
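Some of these collation checks can be run against the issue dates once they are recorded. The sketch below is a rough illustration under stated assumptions: the function name is hypothetical, the dates are invented, and it assumes a paper published at a fixed interval (daily by default), which real titles often did not follow.

```python
from collections import Counter
from datetime import date, timedelta


def collate(issue_dates, start, end, step_days=1):
    """Report missing, duplicate, and out-of-order issue dates on a reel.

    issue_dates is the list of dates in the order they appear on the film.
    """
    counts = Counter(issue_dates)
    duplicates = sorted(day for day, seen in counts.items() if seen > 1)

    missing = []
    day = start
    while day <= end:
        if day not in counts:
            missing.append(day)
        day += timedelta(days=step_days)

    # An issue is out of order if it is dated earlier than the one before it.
    out_of_order = [
        later for earlier, later in zip(issue_dates, issue_dates[1:])
        if later < earlier
    ]
    return {"missing": missing, "duplicates": duplicates, "out_of_order": out_of_order}


if __name__ == "__main__":
    filmed = [date(1900, 1, d) for d in (1, 2, 2, 4, 3)]
    report = collate(filmed, date(1900, 1, 1), date(1900, 1, 5))
    for label, days in report.items():
        print(label, [d.isoformat() for d in days])
```

Mutilated pages and incorrect printed dates still need a person at the reader; a check like this only flags where to look.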
Preparing the Microfilm:
Collation
Review use copy, record data on spreadsheet
iArchives Digitization Workflow
[Workflow diagram. Stages shown: Film Scanning; Split, De-Skew, Crop; Image Processing; Image Metadata; Page/Reel Metadata; OCR Framework; Post Process; Customer Deliverables. QC checkpoints appear throughout. Supporting components: Shared Storage (NAS), Workflow Manager DB, Automated Processing Cloud. Key: automatic process (image processing, OCR, ...); manual process (image + page metadata); quality control.]
iArchives OWR Framework
[Diagram labels: 3 leading OCR software programs; 2,000,000-word dictionary; 2,000,000-name dictionary; OWR]
Post-vendor validation
 Once the hard drive is returned, we
verify/validate the batch using the DVV
program
 Verification compares the metadata listed in the
master XML file to the metadata found in the issue
XML files for correctness
 Validation is done if a new master XML file needs to
be created. It creates checksums for each file and
records them in the subsequent metadata
 Copy contents of hard drive onto our
server
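The checksum step can be illustrated independently of the DVV. The sketch below is not the DVV's implementation: the function names are hypothetical and SHA-1 is an assumed algorithm. It records a checksum per file in a batch folder, so that the copy on the server can be compared with the copy on the hard drive.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Hash a file in chunks so large image files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def batch_manifest(batch_dir, algorithm="sha1"):
    """Map each file's path (relative to batch_dir) to its checksum."""
    root = Path(batch_dir)
    return {
        path.relative_to(root).as_posix(): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(expected, actual):
    """List files whose checksum differs or that exist in only one manifest."""
    names = set(expected) | set(actual)
    return sorted(name for name in names if expected.get(name) != actual.get(name))
```

Running `batch_manifest` on the hard drive and again on the server copy, then passing both results to `changed_files`, gives an empty list when the copy is intact.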
Quality Control
 Image quality
 Too dark? Too light? Skewed?
 Correct image?
 Compare digitized image to microfilmed
image
 No Missing Issue/Page tags
 Review metadata
 Dates
 LCCN #
 Locations
Thumbnail View
can use DVV or any
graphics program
Quality Control
LC Digital Viewer
and Validator (DVV)
Metadata Viewer
OCR
Headers
Title Essays - 500 words
Describes newspaper’s history
• Date of establishment
• Editors
• Type of news reported
• Political viewpoint
• Where is the paper today?
Published to Chronicling America
Links
 Chronicling America:
http://chroniclingamerica.loc.gov/
 Library of Congress: http://www.loc.gov/ndnp/
 National Endowment for the Humanities:
http://www.neh.gov/projects/ndnp.html
 Hawai’i Newspapers: a union list
http://evols.library.manoa.hawaii.edu/handle/10524/2089
 Using <METS> and <MODS> to Create XML
Standards-based Digital Library Applications
http://www.loc.gov/standards/mods/presentations/mets-mods-morgan-ala07/
Thank You!
Mahalo!
Kinisou Chapur!
 Questions? Comments?
 Email us at:
♦ chantiny@hawaii.edu
♦ erenst@hawaii.edu
https://sites.google.com/a/hawaii.edu/ndnp-hawaii/