Ulrike Junger
Deutsche Nationalbibliothek
Frankfurt/Main – Leipzig

Can indexing be automated? - the example of the Deutsche Nationalbibliothek

Abstract: The German subject headings authority file (Schlagwortnormdatei, SWD) provides a broad controlled vocabulary for indexing documents on all subjects. It has traditionally been used for intellectual subject cataloguing, primarily of books. The Deutsche Nationalbibliothek (DNB, German National Library) has been working on developing and implementing procedures for the automated assignment of subject headings to online publications. This paper sketches the project, its results and its problems.

I. Introducing the Schlagwortnormdatei, the German subject headings authority file

The Deutsche Nationalbibliothek (German National Library, DNB for short) is celebrating its 100th anniversary this year. Working with controlled subject headings has almost as long a tradition at the DNB, going back to the times of the card catalogue. When computers were introduced into library work and cataloguing moved into databases, the concept of authority files emerged. In Germany an authority file for subject headings was created starting in the mid-1980s. It is called Schlagwortnormdatei, SWD for short (this abbreviation will be used throughout this paper); in English this simply means “Authority file for subject headings”. The first part of this paper gives an idea of its character and organization.

The SWD contains records for various groups of headings: topical terms, geographic and ethnographic names, corporate bodies and work titles. Headings for persons are part of another authority file, the Personennamendatei (PND, File for Names of Persons) [1]. Currently there are around 610,000 records in the SWD. Over 170,000 are topical terms covering all sciences and subjects. The larger share of the headings are individual names, e.g. for companies, geographical entities etc.
This predominance of individual names is a result of the set of rules underlying the creation and use of subject headings. These Regeln für den Schlagwortkatalog (RSWK, in translation simply “Rules for subject indexing”) stipulate that the contents of a publication should be represented by the narrowest applicable terms.

Although DNB hosts all authority files and assumes major editorial responsibility for them, they are in fact the result of longstanding cooperation between DNB and partner libraries and library networks in Germany, Austria and Switzerland. All partners contribute new headings and fulfil editorial tasks.

The SWD has a thesaurus-like structure. Besides the preferred term (the actual heading), an authority record contains synonyms, superordinate and related terms and, depending on the type of heading, codes for languages, countries etc. A simple homegrown classification is applied to most of the headings except geographic names. About 40,000 terms have been enhanced with DDC notations. Notes and information about the sources for a term complete the record.

[1] The PND records are one of the data sets constituting the VIAF (Virtual International Authority File), http://viaf.org/.

Fig. 1: Example of an SWD record

New headings are created when they are needed for indexing a publication. The continuous growth and the contents of the SWD therefore depend to a great degree on how much subject cataloguing libraries do. DNB and other libraries have been reducing this effort for a number of years now due to lack of capacity. One effect is that the topical terms in the natural sciences are insufficiently maintained, because about 5 years ago DNB stopped indexing doctoral theses, a major source of new terms.

How can SWD records be obtained? The DNB provides all of its authority files both in the German exchange format MAB and in MARC 21.
In 2010 a linked data service was established; the data can be obtained free of charge. [2]

2012 is not only the year of DNB's 100th anniversary but also the year in which the SWD will vanish as a separate file: its contents will be integrated into one large authority file, the Gemeinsame Normdatei (GND, Consolidated Authority File). Besides the SWD, the authority files for persons, musical works and corporate bodies (used in descriptive cataloguing) will also be incorporated. The format is close to MARC 21, and record creation and use will follow the stipulations of RDA.

II. The use of SWD subject headings for automated indexing

As the national library for Germany, DNB has the right to legal deposit. In 2006 a revised law on the DNB came into effect. Along with a change of name (formerly Die Deutsche Bibliothek, now Deutsche Nationalbibliothek, German National Library), it brought a major change in the mandate of DNB: the library now also has to collect, catalogue, index and archive works in immaterial form, i.e. online publications.

The DNB had already been collecting online publications for a number of years, predominantly doctoral theses, on average 20,000 works annually. But it soon became clear that DNB had to step up its efforts to increase the number of online publications collected. The development and implementation of new methods and channels for submitting online publications, together with campaigns addressed to publishers, led to a very significant increase in the collection of online publications. In 2011 DNB collected more than 187,000 online documents. [3] This number can be expected to grow at a fast pace in the coming years.

[2] See http://www.dnb.de/EN/Service/DigitaleDienste/LinkedData/linkeddata_node.html
[3] For comparison: DNB collected over 385,000 documents in physical form in 2011.

It also soon became clear that DNB does not have the capacity to catalogue and index all this material in the traditional way, i.e. intellectually/manually.
Like many libraries, DNB has to do more (data and services) with less (staff). Still, DNB aims and needs to provide catalogue data for every document in its collection, and this includes subject data. In 2009 it was therefore decided to stop manually cataloguing online publications from the beginning of 2010 and to start a large project to develop methods for the automated processing of monographic online publications. [4] A major goal of the project was the automated assignment of SWD subject headings to online documents. This project, called Petrus [5], was conducted from 2009 to 2011, with follow-up projects in 2012-2013.

One could argue that in the era of full text indexing the assignment of controlled terms such as SWD subject headings is obsolete. But we are convinced that the use of controlled vocabulary serves an important purpose even in the Google age. A shared system of terms allows precise and comprehensive searches across all the collections a library holds and beyond. Indexing with terms taken from a controlled vocabulary has the advantage that each term is embedded in a semantic net or context, which can additionally be used for retrieval. We also wanted continuity in subject cataloguing and in our data.

How did DNB approach this enterprise? One of the conditions of the project was that a system or software available on the market should be used and that homegrown software should be avoided. The first two years of Petrus were therefore dedicated to a market scan and thorough tests of several systems, which were chosen following a public invitation to tender. In the end DNB decided to acquire and license the Averbis Extraction Platform, a system developed by the Averbis company located in Freiburg, Germany. This company is a spin-off of the University Hospitals in Freiburg and had so far specialized in the automated indexing of medical publications.

The process the Averbis Extraction Platform performs on documents is basically this: First it executes a textual analysis of the online publications, based on various linguistic methods, in order to extract content-bearing terms from the text and the titles. It then ranks the extracted terms according to their meaning and importance. Finally, the extracted terms are matched to the controlled vocabulary of the SWD.

There are two main components in the system. The first is the Averbis Concept Mapper, a configurable annotating tool based on a dictionary. It combines methods of machine learning with morphological and syntactical analysis. The dictionary is flexible and allows the integration of synonyms and of various attributes for terms, e.g. classificatory information. The second is the Dictionary Configurator, a user interface for creating and modifying user-specific concepts in the dictionary.

[4] Serials and other types of online documents were excluded because of specific difficulties associated with cataloguing and indexing these materials.
[5] Petrus is an acronym for “Prozessunterstützende Software für die digitale Deutsche Nationalbibliothek”, i.e. software supporting processes for the digital German National Library.

Various sections of the SWD/PND have been integrated into the dictionary as the basis for the assignment of subject headings. Currently these are 170,000 topical terms, 153,000 records for geographic and ethnographic names and 311,000 records for persons. As mentioned, the dictionary established in this way can be configured with the Averbis Dictionary Configurator. This allows, for example, tailored indexing concepts to be realized for certain groups of objects or publications. Since the SWD is continually growing and being enhanced, the dictionary has to be updated regularly. A requirement for the Averbis system is therefore that concepts, once configured, are retained in the dictionary.

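The three steps described above (extract terms, rank them, match them to the controlled vocabulary) can be illustrated with a deliberately simplified sketch. It is not the Averbis implementation: the tiny dictionary, the function names, the exact-match lookup and the title weight are all assumptions made for illustration, and the real system uses morphological and syntactical analysis and machine learning instead of a plain lookup.

```python
import re
from collections import Counter

# Hypothetical miniature dictionary: surface form (lower case) -> preferred heading.
# Real SWD records carry far more synonyms and attributes.
DICTIONARY = {
    "bibliothek": "Bibliothek",
    "bücherei": "Bibliothek",
    "katalogisierung": "Katalogisierung",
    "schlagwort": "Schlagwort",
    "deskriptor": "Schlagwort",
}

TITLE_WEIGHT = 3.0  # assumed boost for terms found in the title
TEXT_WEIGHT = 1.0   # assumed weight for terms found in the body text


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def assign_headings(title, body, max_headings=10):
    """Return (heading, score) pairs, best first, scaled so the top score is 1.0."""
    scores = Counter()
    for weight, text in ((TITLE_WEIGHT, title), (TEXT_WEIGHT, body)):
        for token in tokenize(text):
            heading = DICTIONARY.get(token)  # exact match only (a simplification)
            if heading is not None:
                scores[heading] += weight
    if not scores:
        return []
    top = max(scores.values())
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(heading, score / top) for heading, score in ranked[:max_headings]]
```

Calling `assign_headings("Die Bibliothek", "Die Bücherei nutzt ein Schlagwort und einen Deskriptor zur Katalogisierung.")` maps the synonyms “Bücherei” and “Deskriptor” to their preferred headings and ranks “Bibliothek” first because it also occurs in the title. The scaled score is only a ranking aid within this sketch and says nothing about how the platform computes its own values.
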
After the acquisition of the Averbis system a number of tests were conducted to explore which configurations would bring the best results. In the first tests only topical terms were integrated into the dictionary; geographic names and persons' names followed later. Since the major part of the publications collected by DNB is of course in German, it was decided to concentrate first on monographic publications in this language.

A difficulty with the tests was the question of how to evaluate and measure the quality of the automatically generated subject headings. The discussions resulted in the decision to check the results intellectually, using a sample of titles with automatically assigned subject headings. This was done by the subject specialists in the department of subject cataloguing, i.e. the same persons who usually index and classify books and other publications. Fully aware that intellectual indexing has a subjective component and that inter-indexer consistency averages around 50%, we decided to consider their judgment the “gold standard”.

An evaluation database was created which contained the author, the title, a link to the full text of the document and the list of automatically assigned subject headings. Each heading had to be judged on a 4-point scale: very useful, useful, less useful, harmful. [6] The raters were also asked to note missing aspects or headings. This implied that the raters partially had “to forget” the rules they are supposed to follow when indexing intellectually. For example, the Regeln für den Schlagwortkatalog/RSWK contain a number of stipulations on how to arrange several subject headings to form a proper sequence.

Fig. 2: User interface of the evaluation database

[6] This is meant with respect to retrieval.

The ratings were then statistically evaluated: generalized precision and generalized recall were calculated as measures of the usefulness and completeness of the assigned headings.

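To make the two measures concrete, the following sketch shows one common way of computing generalized precision and recall from graded judgments. The paper does not state the formula or the weights actually used, so the numeric values per rating level and the treatment of missing headings are assumptions chosen only for illustration.

```python
# Assumed numeric weights for the four rating levels; the values used in the
# actual evaluation are not given in the paper.
RATING_WEIGHTS = {
    "very useful": 1.0,
    "useful": 0.75,
    "less useful": 0.25,
    "harmful": 0.0,
}
MISSING_WEIGHT = 1.0  # assumption: each missing heading counts as fully useful


def generalized_precision(ratings):
    """Mean graded usefulness of the headings that were assigned."""
    if not ratings:
        return 0.0
    return sum(RATING_WEIGHTS[r] for r in ratings) / len(ratings)


def generalized_recall(ratings, missing_count):
    """Share of the achievable usefulness that the assigned headings cover."""
    found = sum(RATING_WEIGHTS[r] for r in ratings)
    total = found + MISSING_WEIGHT * missing_count
    if total == 0:
        return 0.0
    return found / total
```

For a document with the four ratings very useful, useful, less useful and harmful, and two headings noted as missing, this sketch yields a generalized precision of 0.5 and a generalized recall of 0.5 under the assumed weights.
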
As a showcase, the results of a test conducted in October 2011 are described here. The objects tested were German electronic full text documents, predominantly doctoral theses. All documents belonged to one of 12 subject areas, e.g. medicine. [7] Ten subject headings were assigned to each document. The configuration of the Averbis system used in this test comprised a multi-level procedure for the disambiguation of terms, the use of ignore and exception lists, the use of classificatory information attached to the headings, and a higher weight on title terms compared to terms taken from the text corpus. A sample of 30 of the processed documents was taken and evaluated by the librarians. The results are shown in the following figures:

[7] All publications collected by DNB are sorted into one of about 100 so-called subject groups based on the DDC. Since the distribution of online publications is very uneven across subjects, it proved necessary to concentrate on subject groups with a sufficient number of full text documents.

Although the values for recall all lie between 0.5 and 0.9, the precision is deficient, and the overall results achieved for automated indexing are not considered satisfactory. There are still too many incorrect and too few correct or useful headings assigned. This means that the productive stage has not yet been reached and it is still too early to establish a routine workflow.

A major reason for this is that ambiguous terms are still not discriminated well enough. Also, the SWD as a universal vocabulary contains many general terms (e.g. Methode/method) which occur frequently in documents but have no specific meaning unless combined with other headings. A solution to both problems could be the analysis of co-occurrences and the use of topical filters. Another general problem is the discrimination between names of persons, geographic names and topical terms. Here DNB is considering methods of Named Entity Recognition.
An additional factor in reaching better results could be to take a document's formal structure into account. A doctoral thesis and a novel differ not only in their language but also in the typical arrangement of their text.

Another topic in need of further work is confidence values. The subject headings the Averbis software assigns have a confidence value, i.e. a value between 0 and 1 marking the “sureness” the system has that the term is valid and correct with regard to the document. The intention is to use the confidence value to control which and how many headings are assigned to a document. Up to now the confidence value is not informative enough for this. For this reason a fixed number of headings is produced in the tests, which is probably one of the reasons why so far many incorrect headings are output alongside some correct ones. Improving the methods for calculating the ranking and the confidence value therefore has a high priority.

Early in 2012 DNB decided to continue the project for two more years. It has become very clear that it is not easy to develop and implement a process that uses a universal controlled vocabulary for the automated indexing of a universal collection. But we are convinced it can be done with reasonable results, as long as the claim is not that automated indexing produces the same results as rule-guided intellectual indexing. The benchmark should be usefulness for retrieval.

Author information:
Ulrike Junger
Head, Department of Subject Cataloguing
Deutsche Nationalbibliothek
Adickesallee 1
60322 Frankfurt/Main, Germany
Phone: +49-69-1525 1500
Mail: u.junger@dnb.de

Biography: Psychologist and theologian, academic librarian since 1994, various positions as subject librarian and authority file editor, head of the German Union Catalogue of Serials, since 2009 head of the Department of Subject Cataloguing at the Deutsche Nationalbibliothek