Can indexing be automated?

Ulrike Junger
Deutsche Nationalbibliothek
Frankfurt/Main – Leipzig
Can indexing be automated? - the example of the Deutsche
Nationalbibliothek
Abstract: The German subject headings authority file (Schlagwortnormdatei/SWD) provides a
broad controlled vocabulary for indexing documents on all subjects. The SWD has traditionally been
used for intellectual subject cataloguing, primarily of books. The Deutsche Nationalbibliothek (DNB,
German National Library) has been working on developing and implementing procedures for the
automated assignment of subject headings to online publications. This paper sketches the project,
its results and its problems.
I. Introducing the Schlagwortnormdatei, the German subject headings authority file
The Deutsche Nationalbibliothek (German National Library, DNB for short) is celebrating its 100th
anniversary this year. Working with controlled subject headings has almost as long a tradition at
the DNB, going back to the days of the card catalogue.
When computers were introduced into library work and cataloguing was henceforth done in
databases, the concept of authority files emerged. In Germany an authority file for subject headings
was created starting in the mid-1980s. It is called the Schlagwortnormdatei, SWD for short (this
abbreviation will be used throughout this paper). Translated into English, the name simply means
“Authority file for subject headings”. The first part of this paper gives an idea of its
character and organization.
The SWD contains records for various groups of headings: topical terms, geographic and
ethnographic names, corporate bodies and work titles. Headings for persons are part of another
authority file, the Personennamendatei (PND, File for Names of Persons1). Currently there are
around 610,000 records in the SWD. Over 170,000 are topical terms covering all sciences and
subjects. The larger share of the headings are individual names, e.g. for companies or geographical
entities. This predominance of individual names results from the set of rules governing the
creation and use of subject headings. These Regeln für den Schlagwortkatalog
(RSWK, roughly “Rules for the subject catalogue”) stipulate that the contents
of a publication should be represented by the narrowest applicable terms.
Although DNB hosts all authority files and assumes a major editorial responsibility for them, they
are in fact the result of a longstanding cooperation between DNB and partner libraries and library
networks in Germany, Austria and Switzerland. All partners contribute new headings and fulfill
editorial tasks.
The SWD has a thesaurus-like structure. Besides the preferred term (the actual heading), an
authority record contains synonyms, superordinate and related terms, and – depending on the
type of heading – codes for languages, countries etc. A simple homegrown classification is applied
to most of the headings except geographic names. About 40,000 terms have been enhanced with
DDC notations. Notes and information about the sources for a term complete the record.
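The record structure described above can be pictured as a small data type. The following is only a minimal sketch: the class name, the field names and the example values are invented for illustration and do not reproduce the actual SWD record format or any real SWD record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class AuthorityRecord:
    """Hypothetical, simplified model of one subject heading record."""

    preferred_term: str                                        # the actual heading
    synonyms: List[str] = field(default_factory=list)
    broader_terms: List[str] = field(default_factory=list)     # superordinate terms
    related_terms: List[str] = field(default_factory=list)
    codes: Dict[str, str] = field(default_factory=dict)        # e.g. language or country codes
    subject_category: Optional[str] = None                     # homegrown classification
    ddc_notation: Optional[str] = None                         # present only for some terms
    sources: List[str] = field(default_factory=list)
    notes: str = ""

    def entry_terms(self) -> Set[str]:
        """All lowercased strings that should lead to this heading."""
        return {self.preferred_term.lower(), *(s.lower() for s in self.synonyms)}


# Invented example values, not taken from the SWD.
record = AuthorityRecord(
    preferred_term="Bibliothek",
    synonyms=["Bücherei"],
    broader_terms=["Informationseinrichtung"],
    ddc_notation="027",
)
print(sorted(record.entry_terms()))
```

Keeping synonyms on the record of the preferred term is what allows a text term such as "Bücherei" to be resolved to a single heading later on.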
1 The PND records are one of the data sets constituting the VIAF (Virtual International Authority
File), http://viaf.org/.
Fig. 1: Example of an SWD record
New headings are created when they are needed for indexing a publication. The continuous growth
and the contents of the SWD therefore depend to a great degree on how much subject cataloguing
libraries are doing. DNB and other libraries have been reducing this effort for a number of
years now due to lack of capacity. One effect is that the maintenance of topical terms in the
natural sciences is insufficient, because about five years ago DNB stopped indexing doctoral
theses, a major source of new terms.
How can SWD records be obtained? The DNB provides all of its authority files both in the German
exchange format MAB and in MARC 21. In 2010 a linked data service was established; the
data can be obtained free of charge.2
2012 is not only the year of the 100th anniversary of DNB but also the year in which the SWD will
vanish and its contents will be integrated into one large authority file, the Gemeinsame Normdatei
(GND, Consolidated Authority File). Besides the SWD, the authority files for persons, musical
works and corporate bodies (used in descriptive cataloguing) are also being incorporated. The
format is close to MARC 21, and record creation and use will follow the stipulations of RDA.
II. The use of SWD subject headings for automated indexing
As the national library of Germany, DNB has the right of legal deposit. In 2006 a revised
law on the DNB became effective. Along with a change of name (formerly Die Deutsche
Bibliothek, now Deutsche Nationalbibliothek, German National Library), a major change in the
mandate of DNB came into effect: it now also has to collect, catalogue, index and archive works in
immaterial form, i.e. online publications.
The DNB had already been collecting online publications for a number of years, predominantly
doctoral theses, on average 20,000 works annually. But it soon became clear that DNB had to step up
its efforts to increase the number of online publications collected. The development and
implementation of new methods and channels for submitting online publications, together with
campaigns addressing publishers, led to a very significant increase in the collection of online
publications. In 2011 DNB collected more than 187,000 online documents.3 This number can be
expected to grow at a fast pace in the coming years.
2 See http://www.dnb.de/EN/Service/DigitaleDienste/LinkedData/linkeddata_node.html
3 For comparison: DNB collected over 385,000 documents in physical form in 2011.
It also soon became clear that DNB does not have the capacity to catalogue and index all this
material in the traditional way, i.e. intellectually/manually. Like many libraries, DNB has to do
more (data and services) with less (staff). Still, DNB has the aspiration and the need to provide
catalogue data, including subject data, for every document in its collection. In 2009 it was
therefore decided to stop manually cataloguing online publications at the beginning of
2010 and to start a large project to develop methods for the automated processing of monographic
online publications.4 A major goal of the project was the automated assignment of SWD subject
headings to online documents. This project, called Petrus5, was conducted from 2009 to 2011, with
follow-up projects in 2012-2013.
One could argue that in the era of full text indexing the assignment of controlled terms such as
SWD subject headings is obsolete. But we are convinced that the use of controlled vocabulary
serves an important purpose even in the Google age. A shared system of terms allows precise and
comprehensive searches across all the collections a library holds and beyond. Indexing with terms
taken from a controlled vocabulary has the advantage that each term is embedded in a semantic net
or context, which can additionally be used for retrieval. Also, we wanted to maintain continuity in
subject cataloguing and in our data.
How did DNB approach this enterprise? One of the conditions of the project was that a system or
software available on the market should be used and homegrown software avoided.
The first two years of Petrus were therefore dedicated to a market survey and thorough tests of
several systems. These were chosen following a public invitation to tender.
In the end DNB decided to acquire and license the Averbis Extraction Platform, a system
developed by Averbis, a company located in Freiburg, Germany. The company is a spin-off of
the University Hospitals in Freiburg and had until then specialized in the automated indexing of
medical publications.
The process the Averbis Extraction Platform performs on documents is basically this: First, it
carries out a textual analysis of the online publications, based on various linguistic methods, in
order to extract content-bearing terms from the titles and textual parts of the publications. It
then ranks the extracted terms according to their meaning and importance. Finally, the extracted
terms are matched to the controlled vocabulary of the SWD.
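The three steps (extraction, ranking, matching) can be illustrated with a deliberately naive sketch. It is not the Averbis implementation: plain tokenization and term frequency stand in for the linguistic analysis and the ranking, and the vocabulary, the stopword list and all function names are invented for this example.

```python
import re
from collections import Counter

# Hypothetical mini vocabulary: entry term (preferred term or synonym) -> heading.
VOCABULARY = {
    "bibliothek": "Bibliothek",
    "bücherei": "Bibliothek",   # synonym pointing to the preferred term
    "katalog": "Katalog",
    "indexierung": "Indexierung",
}

# Tiny illustrative stopword list.
STOPWORDS = {"der", "die", "das", "und", "in", "eine", "ein", "ist", "für"}


def extract_terms(text):
    """Step 1: crude extraction (lowercased word tokens minus stopwords)."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def rank_terms(terms):
    """Step 2: rank by frequency, a simple stand-in for importance."""
    return Counter(terms).most_common()


def match_to_vocabulary(ranked_terms, vocabulary):
    """Step 3: map ranked terms to controlled headings, merging synonyms."""
    scores = Counter()
    for term, count in ranked_terms:
        heading = vocabulary.get(term)
        if heading is not None:
            scores[heading] += count
    return scores.most_common()


text = "Die Bücherei führt einen Katalog. Die Bibliothek ist eine Bücherei für alle."
headings = match_to_vocabulary(rank_terms(extract_terms(text)), VOCABULARY)
print(headings)
```

In this toy input, "Bücherei" and "Bibliothek" are counted towards the same heading, which is the effect of matching free-text terms against a vocabulary that records synonyms.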
There are two main components in the system:

the Averbis Concept Mapper: a configurable annotation tool based on a dictionary. It
combines methods of machine learning with morphological and syntactic analysis. The
dictionary is flexible and allows the integration of synonyms and various attributes for
terms, e.g. classificatory information.

the Dictionary Configurator: a user interface for creating and modifying user-specific
concepts for the dictionary.
Various sections of the SWD/PND have been integrated into the dictionary as the basis for the
assignment of subject headings. Currently these are 170,000 topical terms, 153,000 records for
geographic and ethnographic names and 311,000 records for persons.
As mentioned, the dictionary established in this way can be configured with the Averbis Dictionary
Configurator. It allows, for example, tailored indexing concepts to be realized for certain groups
of objects/publications. Since the SWD is continually growing and being enhanced, the dictionary
has to be updated regularly. It is a requirement that the Averbis system retains concepts once
they have been configured for the dictionary.
4 Serials and other types of online documents were excluded because of specific difficulties
associated with cataloguing and indexing these materials.
5 Petrus is an acronym for „Prozessunterstützende Software für die digitale Deutsche
Nationalbibliothek“, i.e. Software supporting processes for the digital German National Library.
After the acquisition of the Averbis system, a number of tests were conducted to explore which
configurations would yield the best results. In the first tests only topical terms were
integrated into the dictionary; geographic and personal names followed later.
Since the major part of the publications collected by DNB is of course in German, it was decided
to concentrate first on monographic publications in this language.
One difficulty with the tests was the question of how to evaluate and measure the quality
of the automatically generated subject headings. The discussions resulted in the decision to
check the results intellectually, using a sample of titles with automatically assigned subject
headings. This was done by the subject specialists in the department of subject cataloguing, i.e.
the same persons who usually index and classify books and other publications. Fully aware
that intellectual indexing has a subjective component and that inter-indexer consistency lies
around 50% on average, we decided to consider their judgment the “gold standard”.
An evaluation database was created which contained the author, the title, a link to the full text of
the document and the list of automatically assigned subject headings.
Each heading had to be judged on a 4-point scale: very useful, useful, less useful, harmful.6
The raters were also asked to note missing aspects/headings.
This implied that the raters had to partially “forget” the rules they are supposed to follow when
indexing intellectually. For example, the Regeln für den Schlagwortkatalog (RSWK) contain a number
of stipulations on how to arrange several subject headings to form a proper sequence.
Fig. 2: User interface of the evaluation database
6 This is meant with respect to retrieval.
The ratings were then evaluated statistically: generalized precision and generalized recall were
calculated as measures of the usefulness and completeness of the assigned headings.
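How such measures can be derived from graded ratings is sketched below. The paper does not state the formulas or weights that were used, so everything here is an assumption: the numeric scores attached to the four rating levels, the treatment of each missing heading noted by a rater as one fully useful heading, and the function names.

```python
# Assumed numeric scores for the 4-point scale (not taken from the project).
RATING_SCORES = {
    "very useful": 1.0,
    "useful": 0.5,
    "less useful": 0.25,
    "harmful": 0.0,
}


def generalized_precision(ratings):
    """Average graded usefulness of the headings that were assigned."""
    if not ratings:
        return 0.0
    return sum(RATING_SCORES[r] for r in ratings) / len(ratings)


def generalized_recall(ratings, missing_count):
    """Graded usefulness obtained, relative to what could have been obtained.

    Assumption: every heading a rater noted as missing would have been
    rated "very useful" had it been assigned.
    """
    gained = sum(RATING_SCORES[r] for r in ratings)
    possible = gained + missing_count * RATING_SCORES["very useful"]
    return gained / possible if possible else 0.0


# One invented document with four rated headings and one missing heading.
ratings = ["very useful", "useful", "harmful", "less useful"]
gp = generalized_precision(ratings)
gr = generalized_recall(ratings, missing_count=1)
print(f"generalized precision = {gp:.3f}, generalized recall = {gr:.3f}")
```

With graded scores, a heading rated "useful" contributes partially instead of being forced into a binary right/wrong decision, which suits the 4-point scale used by the raters.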
As a showcase, the results of a test conducted in October 2011 are described here.
The objects tested were German electronic full-text documents, predominantly doctoral theses. All
documents belonged to one of 12 subject areas, e.g. medicine7. Ten subject headings were
assigned to each document. The configuration of the Averbis system used in this test comprised a
multi-level procedure for the disambiguation of terms, the use of ignore/exception lists, the use of
classificatory information attached to the headings and a higher weight on title terms
compared to terms taken from the text corpus.
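Two of these configuration elements, the ignore list and the higher weight on title terms, can be illustrated with a small scoring sketch. The weight value, the contents of the ignore list and the function name are assumptions made for this example; the actual Averbis configuration is not documented here.

```python
from collections import Counter

# Hypothetical ignore list of overly general terms.
IGNORE_LIST = {"methode", "untersuchung"}

# Assumed weight: one occurrence in the title counts like three in the text.
TITLE_WEIGHT = 3.0


def score_terms(title_terms, body_terms, title_weight=TITLE_WEIGHT, ignore=IGNORE_LIST):
    """Score candidate terms, skipping ignored terms and boosting title terms."""
    scores = Counter()
    for term in body_terms:
        if term not in ignore:
            scores[term] += 1.0
    for term in title_terms:
        if term not in ignore:
            scores[term] += title_weight
    return scores.most_common()


# Invented, already normalized terms of one document.
title_terms = ["diabetes", "therapie"]
body_terms = ["methode", "insulin", "insulin", "diabetes", "methode", "methode"]
ranked = score_terms(title_terms, body_terms)
print(ranked)
```

Although "methode" is the most frequent term in the body, it never reaches the ranking, and the two title terms end up ahead of "insulin".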
A sample of 30 processed documents was taken and evaluated by the librarians.
The results are shown in the following figures:
7 All publications collected by DNB are sorted into one of about 100 so-called subject groups based
on the DDC. Since the distribution of online publications across subjects is very uneven, it proved
necessary to concentrate on subject groups with a sufficient number of full-text documents.
Although the values for recall all lie between 0.5 and 0.9, the precision is deficient, and the overall
results achieved for automated indexing are not yet considered satisfactory. Still too many
incorrect and too few correct/useful headings are assigned. This means that the productive stage
has not yet been reached and it is still too early to establish a routine workflow.
A major reason is that ambiguous terms are still not disambiguated well enough. Also, the
SWD as a universal vocabulary contains many general terms (e.g. Methode/method) that occur
frequently in documents but carry no specific meaning unless combined with other headings. A
solution to both problems could be the analysis of co-occurrences and the use of topical filters.
Another general problem is distinguishing between names of persons, geographic names and
topical terms. Here DNB is considering the use of Named Entity Recognition methods.
An additional factor that could lead to better results is taking a document’s formal structure
into account. A doctoral thesis and a novel differ not only in their language but also in the
typical arrangement of their text.
Another topic in need of further work is confidence values. Each subject heading the Averbis
software assigns has a confidence value, i.e. a value between 0 and 1 indicating how “sure” the
system is that the term is valid and correct for the document. The intention is to use the
confidence value to control which and how many headings are assigned to a document. Up to now,
however, the confidence value is not informative enough. For this reason a fixed number of
headings is produced in the tests, which is probably one of the reasons why so far many incorrect
headings are output alongside some correct ones. Improving the methods for calculating the
ranking and the confidence value therefore has high priority.
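The difference between the two selection strategies can be shown with a short sketch. The candidate headings, their confidence values, the threshold of 0.6 and the function names are all invented for illustration; they are not results from the tests.

```python
def select_fixed(candidates, n):
    """Current test setup: always output the n highest-ranked headings."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [heading for heading, _ in ranked[:n]]


def select_by_confidence(candidates, threshold=0.6, max_headings=10):
    """Intended setup: output only headings at or above a confidence threshold.

    The threshold value is an assumption and would have to be calibrated
    against rated samples.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [heading for heading, conf in ranked[:max_headings] if conf >= threshold]


# Invented candidates for one document: (heading, confidence value).
candidates = [
    ("Diabetes mellitus", 0.92),
    ("Insulin", 0.81),
    ("Methode", 0.35),
    ("Therapie", 0.64),
    ("Haus", 0.12),
]

fixed = select_fixed(candidates, n=4)
filtered = select_by_confidence(candidates, threshold=0.6)
print(fixed)
print(filtered)
```

A fixed cut-off of four headings lets the weak candidate "Methode" through, while the threshold stops after the three headings the system is reasonably sure about. This only works, of course, if the confidence values are informative, which is the open problem described above.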
Early in 2012 DNB decided to continue the project for two more years. It has become very
clear that it is not easy to develop and implement a process that uses a universal controlled
vocabulary for the automated indexing of a universal collection. But we are convinced it can be
done with reasonable results, as long as automated indexing is not expected to produce the same
results as rule-guided intellectual indexing. The benchmark should be usefulness for retrieval.
Author information:
Ulrike Junger
Head, Department of Subject Cataloguing
Deutsche Nationalbibliothek
Adickesallee 1
60322 Frankfurt/Main, Germany
Phone: +49-69-1525 1500
Mail: u.junger@dnb.de
Biography: Psychologist and theologian, academic librarian since 1994, various positions as subject
librarian and authority file editor, head of the German Union Catalogue of Serials, since 2009 head
of the Department of Subject Cataloguing at the Deutsche Nationalbibliothek