Contributor/Presenter: Title: Abstract:

Arturs Zogla (Head of Digital Library), National Library of Latvia
Next steps in newspaper digitization: making use of digitized texts at NLL
The National Library of Latvia (NLL) has been involved in newspaper digitization since 2000. So far about 3
million newspaper pages have been scanned, segmented and OCRed. This represents roughly half
of all the newspapers held at NLL.
An obvious outcome of these efforts has been a portal that lets users run full-text
search queries and receive results at the article level. However, NLL has also studied how
to make further use of the large volume of OCR text produced by newspaper digitization projects. NLL has
developed several experimental solutions in the field of computational linguistics that use newspaper OCR
text as input data. These solutions include (a) a text corpus, (b) a database of Named Entities and (c)
time-sensitive dictionaries.
In developing these solutions, NLL has faced various technical challenges: errors in OCR texts,
inconsistent orthography, multiple spellings of the same Named Entity, etc. New text analysis tools and
methods were developed to address these challenges.
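To illustrate the kind of problem involved, the sketch below groups spelling variants of the same Named Entity by string similarity. It is only a minimal illustration using Python's standard difflib module, not the method NLL developed; the function names, the 0.8 threshold and the misspelled sample names are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_variants(names, threshold=0.8):
    """Greedily group names whose similarity to a group's first member
    meets the threshold (hypothetical value, chosen for illustration)."""
    groups = []
    for name in names:
        for group in groups:
            # Compare against the first name seen for this group.
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([name])
    return groups


# Illustrative input: two real author names plus made-up OCR or
# orthographic variants of each.
names = ["Rainis", "Aspazija", "Raiņis", "Aspasija", "Rainls"]
groups = group_variants(names)
for group in groups:
    print(group)
```

A greedy pass like this depends on input order and on the chosen threshold, so a production system would likely need language-specific normalization and manual review on top of it.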
NLL plans to use the new methods and solutions to further enhance the functionality of its periodicals portal.
The database of Named Entities can be used to improve the relevance of search results. Time-sensitive
dictionaries, on the other hand, can be very helpful in making historical texts more comprehensible to a
modern reader.
There are further plans to publish at least some of the digitized texts as Linked Open Data. To do this, some
copyright-related issues must first be addressed.
Arturs Zogla has been an IT project manager at the National Library of Latvia since 2006 and became
Head of Digital Library in 2012. The author has been involved as project manager in most Digital Library
projects at NLL, including newspaper digitization.
In the past, the author has also been a lecturer at the University of Latvia, where he taught courses on the Semantic
Web and Architecture of Operating Systems.