Mining Newspaper Archives
Kathleen Murray & Tara Carlisle

Introduction

For family historians, historical newspapers are a wonderful source of information. Two freely available newspaper archives, Chronicling America at the Library of Congress and The Portal to Texas History at the University of North Texas, will be used to illustrate techniques for improving research effectiveness and sharing results.

Chronicling America [http://chroniclingamerica.loc.gov/]

Chronicling America is a website that provides access to information about historic newspapers and to select digitized newspaper pages. It is produced by the National Digital Newspaper Program (NDNP). NDNP, a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress (LC), is a long-term effort to develop an Internet-based, searchable database of U.S. newspapers with descriptive information and select digitization of historic pages. Supported by NEH, this rich digital resource will be developed and permanently maintained at the Library of Congress.

The Portal to Texas History℠ [http://texashistory.unt.edu/explore/collections/TDNP/]

The Portal is a gateway to Texas history materials. From prehistory to the present day, its unique collections from Texas libraries, museums, archives, historical societies, genealogical societies, and private family collections provide researchers with primary source materials. The University of North Texas (UNT) Libraries maintains the Portal, which includes the Texas Digital Newspaper Program (TDNP) collection. The UNT Libraries is the lead institution in Texas selected for the National Digital Newspaper Program, Chronicling America. TDNP is a partnership providing broad geographic access to digitized Texas newspapers as far back as 1829. It currently includes 80,486 newspaper editions (618,619 files).

Types of Information

- Births and deaths
- Marriage announcements
- Military Service
- Land purchases
- Promotions
- Advertisements: Family businesses
- Travel announcements

System Components

Before demonstrating search techniques, it helps to understand a few key components of a newspaper archive’s system architecture. Put another way, here is an under-the-hood view of Chronicling America and The Portal to Texas History℠. Both systems use metadata, linked data, OCR software, and extensive application programming to optimize viewing, searching, and browsing items.

1. Metadata
Descriptive information about a newspaper is contained in structured metadata, which often includes elements such as collection, title, publisher, date, language, description, subject heading, and location. A system uses structured metadata to parse items and create relationships among them.

2. Linked Data
Representing data from different collections in a common structure enables a system to cross-reference, or link, data. This allows connections to be made among disparate collections and enhances browsing capability.

3. OCR Software
The digitization of newspapers involves scanning the source documents from either paper or microfilm. The result is a digital master image of the source document, typically in TIFF format, along with other formats generally used for viewing and downloading, such as JPEG or PDF. Optical Character Recognition (OCR) software converts the images into text, thus enabling the full text of newspapers to be searched.

4. Application Programming Interface (API)
The system includes several different views of the data about the newspapers.
The API allows programmatic access to these views and enables the data to be manipulated, viewed, and searched in multiple ways:

- Thumbnail view of newspapers
- All issues of one newspaper
- All pages of one issue
- Different sizes of one page
- Collections, titles, dates

Searching Historic Newspapers

The presentation will demonstrate search techniques for specific types of information of interest to family history researchers.

1. No hits? Don’t give up yet!
- Boolean searching
- Phrase searching
- Searching names
- Language of an era (runaway; ranaway)

2. Lots of hits? Narrow the field!
- Faceted searching
- Linked data
- Advanced Search

Using Search Results

1. Printing
- Benefits and pitfalls
- PDFs – zoom feature

2. Saving
- Saving pages
- Saving searches through persistent links

3. Sharing
- Email
- Social media
- Persistent links

Bibliography

Bekaert, J., Van De Ville, D., Rogge, B., Strauven, I., De Kooning, E., & Van de Walle, R. (2002). Metadata-based access to multimedia architectural and historical archive collections: A review. Aslib Proceedings, 54(6), 362-371.

Holley, R. (2009). How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine, 15(3/4). Retrieved from http://www.dlib.org/dlib/march09/holley/03holley.html

Klijn, E. (2008). The current state-of-art in newspaper digitization: A market perspective. D-Lib Magazine, 14(1/2). Retrieved from http://www.dlib.org/dlib/january08/klijn/01klijn.html

Maxwell, A. (2010). Digital archives and history research: Feedback from an end-user. Library Review, 59(1), 24-39. doi:10.1108/00242531011014664

Robinson, L. (2010). The evolution of newspaper digitization at the Washington State Library. Microfilm and Imaging Review, 39, 24-27.

Thurlow, I., & Warren, P. (2008). Deploying and evaluating semantic technologies in a digital library. In J. Davies, M. Grobelnik, & D. Mladenić (Eds.), Semantic Knowledge Management (pp. 181-198). Berlin: Springer.

Warwick, C., Galina, I., Rimmer, J., Terras, M., Blandford, A., Gow, J., & Buchanan, G. (2009). Documentation and the users of digital resources in the humanities. Journal of Documentation, 65(1), 33-57.

National Digital Newspaper Program

Program guidelines, participation details, and technical information can be found at http://www.neh.gov/projects/ndnp.html or http://www.loc.gov/ndnp/.
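
Appendix: A Sketch of the Search Techniques

The Boolean, phrase, and era-spelling techniques listed under Searching Historic Newspapers can be illustrated with a small Python sketch. It is not how Chronicling America or the Portal implements searching; it only shows, on a few made-up lines of page text, why each technique returns different results. The function names and the sample pages are hypothetical and exist for illustration only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches_all(text, words):
    """Boolean AND: every word must appear somewhere on the page."""
    tokens = set(tokenize(text))
    return all(word.lower() in tokens for word in words)


def matches_any(text, words):
    """Boolean OR: at least one word must appear (handy for era spellings)."""
    tokens = set(tokenize(text))
    return any(word.lower() in tokens for word in words)


def matches_phrase(text, phrase):
    """Phrase search: the words must appear side by side, in order."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return False
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


def search(pages, predicate):
    """Return the sorted ids of pages whose text satisfies the predicate."""
    return sorted(page_id for page_id, text in pages.items() if predicate(text))


# Made-up sample page text (hypothetical, not taken from either archive).
PAGES = {
    "page-1": "Ranaway from the subscriber on Monday last. Reward offered.",
    "page-2": "Married, on Tuesday last, Mr. John Smith to Miss Mary Jones.",
    "page-3": "Mr. Smith, John, has purchased land near the river.",
}

# OR across spelling variants finds the period spelling "ranaway".
era_hits = search(PAGES, lambda t: matches_any(t, ["runaway", "ranaway"]))

# AND finds both pages that mention "john" and "smith" in any order.
and_hits = search(PAGES, lambda t: matches_all(t, ["john", "smith"]))

# The phrase search keeps only the page where the names are adjacent.
phrase_hits = search(PAGES, lambda t: matches_phrase(t, "John Smith"))

print(era_hits, and_hits, phrase_hits)
```

In this sketch the AND search returns both pages that mention John and Smith, while the phrase search returns only the marriage notice. That is the trade-off when searching names: a phrase search cuts down a long result list, but it misses listings printed surname first, so an AND search is the better fallback when a phrase search returns no hits.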