Mining Newspaper Archives Kathleen Murray & Tara Carlisle

Introduction
For family historians, historical newspapers are a wonderful source of information. Two freely
available newspaper archives, Chronicling America at the Library of Congress and The Portal to
Texas History at the University of North Texas, will be used to illustrate techniques for
improving research effectiveness and sharing results.
Chronicling America [http://chroniclingamerica.loc.gov/]
Chronicling America is a Website providing access to information about historic newspapers
and select digitized newspaper pages, and is produced by the National Digital Newspaper
Program (NDNP). NDNP, a partnership between the National Endowment for the Humanities (NEH)
and the Library of Congress (LC), is a long-term effort to develop an Internet-based, searchable
database of U.S.
newspapers with descriptive information and select digitization of historic pages. Supported by
NEH, this rich digital resource will be developed and permanently maintained at the Library of
Congress.
The Portal to Texas History℠ [http://texashistory.unt.edu/explore/collections/TDNP/]
The Portal is a gateway to Texas history materials. From prehistory to the present day, its
unique collections from Texas libraries, museums, archives, historical societies, genealogical
societies, and private family collections provide researchers with primary source materials. The
University of North Texas (UNT) Libraries maintains the Portal, which includes the Texas Digital
Newspaper Program (TDNP) collection. The UNT Libraries is the lead institution in Texas selected
for the National Digital Newspaper Program, Chronicling America. TDNP is a partnership providing
broad geographic access to digitized Texas
newspapers as far back as 1829. It currently includes 80,486 newspaper editions (618,619 files).
Types of Information
Births and deaths
Marriage announcements
Military service
Land purchases
Promotions
Advertisements: family businesses
Travel announcements
System Components
Before demonstrating search techniques, it helps to understand a few key components of a
newspaper archive’s system architecture. Put another way, here is an under-the-hood view of
Chronicling America and The Portal to Texas History℠. Both systems use metadata, linked data,
OCR software, and extensive application programming to optimize viewing, searching and
browsing items.
1. Metadata
Descriptive information about a newspaper is contained in structured metadata, which often
includes elements such as collection, title, publisher, date, language, description, subject
heading, and location. A system uses structured metadata to parse items and create relationships
among them.
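To make the idea concrete, here is a minimal sketch of how structured metadata lets a system relate items. The NewspaperRecord class, the group_by helper, and the sample records are all hypothetical and invented for illustration; they do not describe how either archive actually stores its metadata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NewspaperRecord:
    """Hypothetical metadata record using elements named in the text."""
    collection: str
    title: str
    publisher: str
    date: str       # ISO date, e.g. "1905-03-14"
    language: str
    location: str


def group_by(records, element):
    """Relate records that share the same value for one metadata element."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, element)].append(record)
    return dict(groups)


# Invented sample records, for illustration only.
records = [
    NewspaperRecord("TDNP", "Example Gazette", "A. Smith", "1905-03-14", "eng", "Denton, Texas"),
    NewspaperRecord("TDNP", "Example Gazette", "A. Smith", "1905-03-21", "eng", "Denton, Texas"),
    NewspaperRecord("TDNP", "Sample Herald", "B. Jones", "1905-03-14", "eng", "Austin, Texas"),
]

issues_by_title = group_by(records, "title")   # all issues of one newspaper
issues_by_date = group_by(records, "date")     # everything published on one day
```

Because every record carries the same named elements, the same helper can group by title, date, or location without any change to the data.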
2. Linked Data
Representing data from different collections in a common structure enables a system to
cross-reference, or link, the data. This allows connections to be made among disparate collections
and enhances browsing capability.
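As a rough illustration, the sketch below links two collections that describe their items with the same "location" element. The collections, identifiers, and the link_by helper are hypothetical; real linked-data systems typically rely on shared vocabularies and identifiers rather than plain string matching.

```python
# Two hypothetical collections described with the same "location" element.
newspapers = [
    {"id": "news-1", "title": "Example Gazette", "location": "Denton, Texas"},
    {"id": "news-2", "title": "Sample Herald", "location": "Austin, Texas"},
]
photographs = [
    {"id": "photo-1", "title": "Courthouse square", "location": "Denton, Texas"},
    {"id": "photo-2", "title": "Capitol grounds", "location": "Austin, Texas"},
    {"id": "photo-3", "title": "Railway depot", "location": "Denton, Texas"},
]


def link_by(element, left, right):
    """Pair each left item with the ids of right items sharing `element`."""
    index = {}
    for item in right:
        index.setdefault(item[element], []).append(item["id"])
    return {item["id"]: index.get(item[element], []) for item in left}


links = link_by("location", newspapers, photographs)
print(links)
```

A researcher browsing a Denton newspaper could then be offered the Denton photographs, even though the two collections were created separately.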
3. OCR Software
Digitizing newspapers involves scanning the source documents from either paper or microfilm. The
result is a digital master image of the source document, typically in TIFF format, along with
other formats generally used for viewing and downloading, such as JPEG or PDF. Optical Character
Recognition (OCR) software converts the images into text, thus enabling the full text of
newspapers to be searched.
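Once OCR has produced text, full-text search can be supported by a word index. The sketch below is a simplified illustration only: the page identifiers and OCR text are invented, and the simple inverted index stands in for whatever search engine an archive really uses.

```python
import re
from collections import defaultdict

# Invented OCR output keyed by a hypothetical page identifier.
ocr_text = {
    "page-1": "Married, on Tuesday last, Mr. John Carter to Miss Ann Lee.",
    "page-2": "Land for sale near the river. Apply to John Carter, agent.",
    "page-3": "Miss Ann Lee has returned from a visit to Galveston.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each word to the set of pages whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages containing every word in the query, sorted by id."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


index = build_index(ocr_text)
print(search(index, "john carter"))   # ['page-1', 'page-2']
```

The index only knows the words the OCR software recognized, so a misread word on a worn or blurred page will not be found by a search for the correct spelling.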
4. Application Programming Interface (API)
The system includes several different views of the data about the newspapers. The API allows
programmatic access to these views and enables the data to be manipulated, viewed, and
searched in multiple ways:
Thumbnail view of newspapers
All issues of one newspaper
All pages of one issue
Different sizes of one page
Collections, titles, dates
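Programmatic access usually means requesting a URL and receiving structured data instead of a web page. The sketch below only builds such a URL and does not contact any server. The base URL, the path, and the parameter names (andtext, phrasetext, format) are assumptions modeled on Chronicling America's public search form and may differ or have changed; check the archive's current API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL and path; verify against the archive's current API documentation.
BASE_URL = "https://chroniclingamerica.loc.gov"
SEARCH_PATH = "/search/pages/results/"


def build_search_url(all_words="", phrase="", response_format="json"):
    """Build a page-search URL; the parameter names are assumptions."""
    params = {}
    if all_words:
        params["andtext"] = all_words      # pages containing all of these words
    if phrase:
        params["phrasetext"] = phrase      # pages containing this exact phrase
    params["format"] = response_format     # ask for machine-readable results
    return f"{BASE_URL}{SEARCH_PATH}?{urlencode(params)}"


url = build_search_url(all_words="Carter marriage", phrase="John Carter")
print(url)
```

Keeping URL construction in one small function makes it easy to adjust the parameter names if the archive's interface differs from these assumptions.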
Searching Historic Newspapers
The presentation will demonstrate search techniques for specific types of information of interest
to family history researchers.
1. No hits? Don’t give up yet!
Boolean searching
Phrase searching
Searching names
Language of an era (runaway; ranaway)
2. Lots of hits? Narrow the field!
Faceted searching
Linked data
Advanced search
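The difference between Boolean searching, phrase searching, and searching for period spellings can be sketched in a few lines. This is an illustration on invented text, not a description of how either archive's search engine works; the function names are hypothetical.

```python
import re


def words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())


def matches_all(text, terms):
    """Boolean AND: every term appears somewhere in the text."""
    found = set(words(text))
    return all(term.lower() in found for term in terms)


def matches_any(text, terms):
    """Boolean OR: at least one term appears (useful for spelling variants)."""
    found = set(words(text))
    return any(term.lower() in found for term in terms)


def matches_phrase(text, phrase):
    """Phrase search: the words appear next to each other, in order."""
    haystack, needle = words(text), words(phrase)
    if not needle:
        return False
    return any(haystack[i:i + len(needle)] == needle
               for i in range(len(haystack) - len(needle) + 1))


# Invented notice text, for illustration only.
notice = "Ranaway from the subscriber on the 3d instant, a bay horse."

print(matches_all(notice, ["runaway"]))              # False
print(matches_any(notice, ["runaway", "ranaway"]))   # True
print(matches_phrase(notice, "bay horse"))           # True
print(matches_phrase(notice, "horse bay"))           # False
```

The modern spelling alone finds nothing, while an OR search that includes the period spelling does, which is why the language of an era matters when a search returns no hits.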
Using Search Results
1. Printing
Benefits and pitfalls
PDFs – zoom feature
2. Saving
Saving pages
Saving searches through persistent links
3. Sharing
Email
Social media
Persistent links
Bibliography
Bekaert, J., Van De Ville, D., Rogge, B., Strauven, I., De Kooning, E., & Van de Walle, R. (2002).
Metadata-based access to multimedia architectural and historical archive collections: A review.
Aslib Proceedings, 54(6), 362-371.
Holley, R. (2009). How good can it get? Analysing and improving OCR accuracy in large scale
historic newspaper digitisation programs. D-Lib Magazine, 15(3/4). Retrieved from
http://www.dlib.org/dlib/march09/holley/03holley.html
Klijn, E. (2008). The current state-of-art in newspaper digitization: A market perspective. D-Lib
Magazine, 14(1/2). Retrieved from http://www.dlib.org/dlib/january08/klijn/01klijn.html
Maxwell, A. (2010). Digital archives and history research: feedback from an end-user. Library
Review, 59(1), 24-39. doi:10.1108/00242531011014664
Robinson, L. (2010). The evolution of newspaper digitization at the Washington State Library.
Microfilm and Imaging Review, 39, 24-27.
Thurlow, I., & Warren, P. (2008). Deploying and Evaluating Semantic Technologies in a
Digital Library. In J. Davies, M. Grobelnik, & D. Mladenić (Eds.), Semantic Knowledge
Management (pp. 181-198). Berlin: Springer.
Warwick, C., Galina, I., Rimmer, J., Terras, M., Blandford, A., Gow, J., & Buchanan, G. (2009).
Documentation and the users of digital resources in the humanities. Journal of Documentation,
65(1), 33-57.
National Digital Newspaper Program
Program guidelines, participation details, and technical information can be found
at http://www.neh.gov/projects/ndnp.html or http://www.loc.gov/ndnp/.