Using Estonian Subject Thesaurus in digital environment

Sirje Nilbe
National Library of Estonia
Consortium of Estonian Libraries Network
E-mail: Sirje.Nilbe@nlib.ee
Tiiu Tarkpea
University of Tartu Library
E-mail: Tiiu.Tarkpea@ut.ee
Abstract
This paper examines how the Estonian Subject Thesaurus (EMS) is used as the major subject
indexing tool in the digital databases of Estonian libraries. It covers traditional resources
such as online catalogues and bibliographic databases as well as more recent ones: digital
archives and institutional repositories. The availability of a universal thesaurus facilitates
wide reuse of records and saves intellectual work. A further direction of development for the
EMS should be conformity with contemporary Semantic Web standards such as SKOS and
Linked Open Data.
What is the Estonian Subject Thesaurus?
The Estonian Subject Thesaurus (in Estonian Eesti märksõnastik or EMS)1 is a controlled
vocabulary with universal coverage for information search and indexing of diverse library
material. The thesaurus was taken into use under this name in May 2009. Its predecessors –
the thesaurus of the University of Tartu Library and the Estonian Universal Thesaurus – date back
to the 1990s. Under a project carried out in 2007-2009 those two universal thesauri were
merged and named the Estonian Subject Thesaurus (Nilbe 2011). It is co-managed by the
ELNET Consortium (Consortium of Estonian Libraries Network), the National Library of
Estonia and the University of Tartu Library, and can be freely used by all Estonian libraries
and other interested institutions.
The EMS is maintained in a web-based database built on MySQL and PHP. The thesaurus is
developed by a manager, a programmer and nine authorised editors, all of whom contribute
part-time.
At the end of April 2012 the thesaurus contained over 53 000 terms: over 36 000 preferred
terms and 17 000 nonpreferred terms. This is an unusually large number, even considering
that about 8600 of them are place names.
The database of the thesaurus enables users
• to browse the subject terms by subject field;
• to search for terms by the beginning or a part of the word, or by exact match;
• to search for terms by their English equivalent;
• to view search results as word lists or as full records;
• to search with any term in the online catalogue ESTER, in the database of Estonian
articles ISE, in the database of the Estonian National Bibliography (ERB) or in Google;
• to print, e-mail or save to a file the selected word lists or full records;
• to subscribe to a current awareness service for new, changed and deleted subject terms.
1 http://ems.elnet.ee
To allow the data of the EMS to be used in other systems, subject term records can be
exported in the MARC 21 format for authority data. A term retrieval Web service for other
systems (machine-to-machine interaction) has also been developed. The possible output
formats are machine-readable MARC 21, human-readable MARC 21 and MARCXML.
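As an illustration of how another system might consume such MARCXML output, the following minimal sketch extracts the preferred and nonpreferred terms from one authority record. It assumes that the preferred term is carried in field 150 and the nonpreferred terms in field 450, as is usual for topical terms in MARC 21 authority data; the sample record is hypothetical and the actual output of the EMS service may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML schema published by the Library of Congress.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample record, written for this sketch; not actual EMS output.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">ökokogukonnad</subfield>
  </datafield>
  <datafield tag="450" ind1=" " ind2=" ">
    <subfield code="a">ökokülad</subfield>
  </datafield>
</record>"""


def subfield_values(record, tag, code="a"):
    """Return the values of one subfield code from all fields with the given tag."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [element.text for element in record.findall(path, MARC_NS)]


def parse_authority(xml_text):
    """Extract the preferred term (150) and nonpreferred terms (450) of one record."""
    record = ET.fromstring(xml_text)
    preferred = subfield_values(record, "150")
    return {
        "preferred": preferred[0] if preferred else None,
        "nonpreferred": subfield_values(record, "450"),
    }


entry = parse_authority(SAMPLE_RECORD)
print(entry)
```

Geographic names would normally sit in other fields (151 and 451), so a real client would need to handle more tags than this sketch does.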
The user interface is available in Estonian and English; the English version was introduced
only recently, at the end of April. However, the availability of an English-language user
interface does not mean that the EMS is a genuinely bilingual thesaurus or that it could be
used to generate an English-language controlled vocabulary. The English-language terms can
only be treated as translations of the Estonian-language terms; they do not constitute a
separate whole which is semantically and structurally organised. The compiling of multilingual
thesauri raises a number of problems (Guidelines for Multilingual Thesauri, 2009) which the
compilation of the EMS has not aimed to address. Some of the English-language terms
correspond to terms in the Library of Congress Subject Headings, others have been collected
from the most recent professional literature. There are also concepts which lack satisfactory
equivalents in English due to cultural and historical differences.
<Figure 1. Display of a record of the EMS using the English interface.>
The EMS mostly serves as a tool for post-coordinated indexing. Although library catalogues
and national bibliographies worldwide prefer pre-coordinated indexing, it has very seldom
been used in Estonia during the electronic era. Pre-coordination versus post-coordination is
a much-discussed topic; one of the most recent overviews of the arguments for and against
each is presented in the IFLA document Guidelines for Subject Access in National
Bibliographies (Jahns 2012, 21-22). The present article does not aim to reopen the discussion
but rather to point out that in Estonian databases subject terms are given as lists of
individual descriptors. At the same time many descriptors are themselves complex concepts,
meaning that pre-coordination is realised through the expressive means of natural language
rather than through the syntax rules of the indexing language. Examples are
arhitektuuriajalugu ‘history of architecture’, avaliku sektori ökonoomika ‘public economics’
and põllumajanduskaardid ‘agricultural maps’.
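The practical consequence of post-coordination can be shown with a small sketch: each record carries a plain list of descriptors, and the combination of concepts happens only at search time, as an intersection of the record sets of the individual descriptors. The records, identifiers and function names below are invented for illustration and do not come from any of the databases discussed.

```python
from collections import defaultdict

# Hypothetical records: identifier -> list of individual descriptors.
RECORDS = {
    "rec1": ["arhitektuuriajalugu", "Tallinn"],
    "rec2": ["arhitektuuriajalugu", "Tartu"],
    "rec3": ["põllumajanduskaardid", "Tartu"],
}


def build_index(records):
    """Build an inverted index: descriptor -> set of record identifiers."""
    index = defaultdict(set)
    for record_id, descriptors in records.items():
        for descriptor in descriptors:
            index[descriptor].add(record_id)
    return index


def search_all(index, *descriptors):
    """Post-coordinate at search time: intersect the record sets of all descriptors."""
    if not descriptors:
        return set()
    record_sets = [index.get(descriptor, set()) for descriptor in descriptors]
    return set.intersection(*record_sets)


index = build_index(RECORDS)
hits = search_all(index, "arhitektuuriajalugu", "Tartu")
print(hits)
```

A complex descriptor such as arhitektuuriajalugu behaves here like any other single term, which is how pre-coordination by natural language coexists with a post-coordinated search.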
The following overview looks at different digital environments of libraries which use the
EMS as the source of subject description metadata.
The thesaurus in the online catalogues of libraries
Although catalogues have long since ceased to be the only databases of libraries, they are still
(at least in Estonia) the most important and best known sources enabling users to access the
collections and information.
The major Estonian library catalogue ESTER2 contains nearly 3 million bibliographic records
and is jointly maintained by the 13 member libraries of the ELNET Consortium. All member
libraries use the EMS for indexing, and 9 of them also use the UDC system. The catalogue
runs on the integrated library system Millennium from the U.S. company Innovative
Interfaces Inc. Its catalogue module is a traditional MARC-based database with browsable
indexes. Bibliographic and authority records share common indexes. In the subject index,
cross references from authority records are interfiled with the data indexed from
bibliographic records. In this system authority files do not constitute separate units which
could be searched or browsed on their own.
2 http://ester.tallinn.ee; http://ester.tartu.ee
<Figure 2. Subject index of ESTER, enriched with references from authority records.>
Since the implementation of the EMS software in May 2009, the subject authority records are
no longer compiled manually but the data is updated regularly about twice a month on the
basis of extensions and corrections made in the EMS database. A MARC 21 file is exported
from the thesaurus system and loaded into the library system.
Other libraries (there are about 1000 libraries in Estonia) compile their catalogues mostly by
copy cataloguing from ESTER. This ensures that the subject terms from the EMS reach
practically all libraries. Special, public and school libraries often enrich the records with
additional subject terms which help their users in information search. As their library systems
do not support the integration of authority data with bibliographic data, the records of smaller
libraries sometimes contain synonymous descriptors and nonpreferred terms which increase
the number of access points.
Databases of articles
Estonia has a long tradition of compiling analytical databases of articles published in
local periodicals. Such databases have been created by different institutions, each for its
own purpose, so a lot of duplication has occurred. It has not been easy for users to
understand where to find the information they need.
Since 2009 the separately maintained bibliographic databases have been assembled into a
unified environment – the additional module Reference Database of the integrated library
system Millennium. For that reason the functionality and appearance of this database
resemble the online catalogue ESTER, and they both constitute a significant resource for
Estonian information consumers. The database of articles Index Scriptorum Estoniae (ISE)3 is
also managed by the ELNET Consortium.
The old separate databases were indexed with different local vocabularies. The current
standard is the Estonian Subject Thesaurus. We have set the aim of harmonising all older
records with this thesaurus, which will take several years.
County libraries compile bibliographic databases on local history. They mostly use subject
terms from the EMS for indexing, but their own library systems do not support authority
control. Synonym control is particularly missed: the libraries cannot offer USE references
to their users. For that reason bibliographers would like to supplement the EMS with many
specific terms on local life and are not satisfied when the EMS gives USE references from
specific terms to more general terms. For example, the authorised term in the EMS is
ökokogukonnad ‘eco-communities’, and the term ökokülad ‘eco-villages’ gives the following
reference: ökokülad USE ökokogukonnad. Rural librarians would like ökokülad to be an
authorised term as well.
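The synonym control that these systems lack can be sketched in a few lines: a table of USE references maps each nonpreferred term to its authorised term, and the descriptors of a record (or the terms of a query) are normalised against it. The term lists below are a hypothetical extract; only the reference ökokülad USE ökokogukonnad is taken from the thesaurus as described above, and the function names are invented for this sketch.

```python
# Hypothetical extract of thesaurus data, for illustration only.
PREFERRED = {"ökokogukonnad", "rahvatants", "külapäevad"}
USE_REFERENCES = {"ökokülad": "ökokogukonnad"}


def resolve(term):
    """Return the authorised term for a term, or None if the thesaurus does not know it."""
    if term in PREFERRED:
        return term
    return USE_REFERENCES.get(term)


def normalise(descriptors):
    """Replace nonpreferred terms by authorised ones; drop duplicates and unknown terms."""
    result = []
    for term in descriptors:
        authorised = resolve(term)
        if authorised is not None and authorised not in result:
            result.append(authorised)
    return result


cleaned = normalise(["ökokülad", "ökokogukonnad", "külapäevad"])
print(cleaned)
```

In a system with authority control this mapping is applied by the index itself; without it, the same effect has to be reached by the user looking up the authorised term in the EMS first.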
3 http://ise.elnet.ee
Still, the EMS editors have accepted many suggestions from both bibliographers of local
history and compilers of the ISE database, and the necessary terms have been included in the
thesaurus, for example juubelid ‘anniversaries’, rahvatants ‘folk dance’, külapäevad ‘village
days’ and vallavanemad ‘parish heads’.
The Estonian National Bibliography Database
The Estonian National Bibliography Database (ERB)4 is compiled and managed by the
National Library of Estonia; its software has been developed locally on the basis of the
database management system MySQL.
The items subject to registration in the national bibliography are catalogued, classified and
indexed in the shared library system Millennium, introduced earlier in this paper. The
bibliographic records are exported from Millennium and loaded into the Estonian National
Bibliography Database. Records cannot be changed in this database: all necessary
corrections and changes are first made in the Millennium record, which is then loaded into
the National Bibliography database again.
Subject search in the ERB can be carried out by subject terms, UDC numbers and keywords
focused on subject fields or title fields. Subject terms and UDC numbers included in
bibliographic records are browsable via indexes.
The ERB does not contain authority data and thus does not allow users to take advantage of
all the possibilities offered by the thesaurus for finding the right search term. However,
there is a clickable EMS logo on each page which takes the user to the EMS environment,
where a suitable search term can be found and a direct search in the ERB launched from the
same location.
<Figure 3. Directing search from the EMS to the ERB.>
Digital archive DIGAR
DIGAR5 is the digital archive of the National Library of Estonia maintained in conformity
with the library’s tasks to collect, preserve and make available digital information, including
the tasks proceeding from the Legal Deposit Copy Act.
The archive preserves
• Estonian online publications issued on the Internet (books, newspapers,
journals, serials, maps and sheet music);
• digital copies of Estonian electronic publications issued on physical carriers
(floppy disk, CD-ROM, etc.);
• digital copies of publications on analogue carriers;
• print files of Estonian publications.
The information system of the archive is based upon the free software FEDORA, which has
been supplemented by interfaces for data input, management and providing services for the
users. The latter is still under development but the archive is already operating and has a
fairly large user base. The amount of born-digital and digitised information is growing
rapidly, which makes the improvement of the archive’s functionality a major strategic task
of the National Library of Estonia.
4 http://erb.nlib.ee
5 http://digar.nlib.ee
Most of the digital objects stored in DIGAR have been described in the online catalogue
ESTER and analytically in the database ISE. Records in those databases and full texts in
DIGAR are linked. The objects in DIGAR have descriptive metadata, and searches can also be
carried out in the archive’s own user interface. The metadata of DIGAR is mostly imported
from ESTER, including subject descriptors. In the archive’s environment the metadata is
supplemented by the English-language equivalents of the descriptors, which are retrieved
from the EMS database by a programmed query. The digital archive naturally also supports
full-text search.
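The enrichment step can be sketched as follows. The real archive retrieves the equivalents from the EMS database by a programmed query whose interface is not described here; in this sketch a local dictionary stands in for that lookup, and the field names subject_et, subject_en and untranslated are invented for illustration.

```python
# Hypothetical stand-in for the EMS lookup: Estonian descriptor -> English equivalent.
ENGLISH_EQUIVALENTS = {
    "arhitektuuriajalugu": "history of architecture",
    "põllumajanduskaardid": "agricultural maps",
}


def add_english_subjects(metadata, equivalents):
    """Return a copy of the metadata with English equivalents added where they exist."""
    subjects_et = metadata.get("subject_et", [])
    enriched = dict(metadata)
    enriched["subject_en"] = [equivalents[t] for t in subjects_et if t in equivalents]
    # Descriptors without an equivalent are reported rather than silently dropped.
    enriched["untranslated"] = [t for t in subjects_et if t not in equivalents]
    return enriched


record = {"title": "Näidis", "subject_et": ["arhitektuuriajalugu", "külapäevad"]}
enriched = add_english_subjects(record, ENGLISH_EQUIVALENTS)
print(enriched["subject_en"], enriched["untranslated"])
```

Keeping track of untranslated descriptors matters because, as noted earlier, some concepts lack a satisfactory English equivalent.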
<Figure 4. DIGAR subject descriptors displayed in English and Estonian.>
Institutional repositories
Several university libraries belonging to the ELNET Consortium manage their university’s
institutional repository or digital archive. As an example this paper considers the
University of Tartu Digital Archive on DSpace6. The aim of the repository is to collect,
preserve and make available the digital information of the structural units of the
University of Tartu. The repository is publicly available; its materials can be found with
Internet search engines and accessed through OAI data exchange protocols. DSpace is free,
open-source software.
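To give an idea of what such data exchange looks like, the sketch below builds a harvesting request, assuming that the protocol in question is OAI-PMH. The verb ListRecords and the parameters metadataPrefix, set and from belong to that protocol, and oai_dc is its mandatory Dublin Core format; the base URL is a placeholder and not the actual address of the repository.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for the given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; the real repository address would be used instead.
url = list_records_url("https://repository.example.org/oai/request", from_date="2012-01-01")
print(url)
```

A harvester would fetch this URL and read the subject terms from the dc:subject elements of the returned records, which is where the Estonian and English descriptors become visible to external portals.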
The repository preserves:
• print files of doctoral theses published by the University of Tartu;
• original files created in the structural units of the University of Tartu, e.g.
conference papers, articles, reports, textbooks, personal archives, etc.;
• digital copies of analogue carriers;
• online publications issued on the Internet;
• digital copies of electronic publications issued on physical carriers.
All archived materials are supplied with metadata and linked with the online catalogue
ESTER, i.e. the catalogue record contains a link to the full text stored in the repository.
The metadata, including the subject terms, are based on catalogue records. Unlike in the
catalogue, the metadata of the material archived in the repository is supplemented with
English-language subject terms, for which the English equivalents from the EMS are used;
these are added manually. If desired, subject terms or keywords in any language can be
added, preferably in the language of the original. It is possible to create subject term
patterns (subject terms provided by default) for a whole collection.
We have considered whether it would be sensible to create a thesaurus within the repository
on the basis of the EMS, containing about 2000 subject terms. However, the debates have led
to the opinion that such a thesaurus would be too general and would not be sufficient for
describing the contents of research works. In order to index more specifically, that
thesaurus would gradually be supplemented by uncontrolled terms or keywords, while the EMS
already contains controlled terms for designating the necessary concepts.
6 http://dspace.utlib.ee
Other potential users
During the three years in which the EMS has been publicly available, several memory and
knowledge institutions outside the library community have considered using it in their
information systems. Cooperation has been closest with archives, which are designing and
implementing a new information system. The information system of museums is also under
development, with a need for controlled vocabularies to describe museum artefacts. Yet
archives and museums lack enthusiasm about integrating the EMS into their information
systems. It is too extensive for their collections and contains a lot of concepts which are
never needed for describing archival records or museum artefacts. At the same time, many
necessary concepts are probably missing. Archives and museums could try to make a selection,
but it is difficult to retain the integrity of semantic relationships when making a concise
extract.
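One part of that difficulty can be made concrete. If an extract keeps a term but drops its broader term, the hierarchy link dangles, so a consistent extract has to include all ancestors of every selected term. The sketch below computes such a closure; the hierarchy is invented purely for illustration and does not reproduce actual EMS relationships.

```python
# Hypothetical broader-term links (term -> list of broader terms); not taken from the EMS.
BROADER = {
    "rahvatants": ["tants"],
    "tants": ["kunstid"],
    "külapäevad": ["üritused"],
}


def close_under_broader(selected, broader):
    """Add every ancestor of the selected terms so that no broader-term link dangles."""
    result = set(selected)
    stack = list(selected)
    while stack:
        term = stack.pop()
        for parent in broader.get(term, []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


extract = close_under_broader({"rahvatants"}, BROADER)
print(sorted(extract))
```

The closure shows why a concise extract tends to grow: every specific term pulls in its whole chain of broader terms, and related-term links would need similar treatment.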
Although this was never intended, the EMS is assuming the role of a language resource,
especially for translators of specialised texts. In this respect we have received a lot of
positive feedback. Even the EuroTermBank is interested in including the EMS among its free
terminology resources. ELNET as the owner of the EMS is ready to allow this, but technical
problems still need to be solved.
Conclusion
The Estonian Subject Thesaurus has achieved a firm position in Estonian libraries because its
development has tried to consider the needs of all interest groups. Estonian librarians like
specific and exhaustive indexing in order to provide high-level information services for
users, and thus need a large number of terms. Other memory and knowledge institutions,
whose needs are less universal, seem to have difficulties in using the EMS.
The EMS is available as a web database and as authority data in systems which support
authority control via traditional authority records. In other systems, which lack the
relationships between subject terms and synonym control, the EMS loses much of its advantage
as a tool for information search. To some extent this disadvantage can be compensated for by
beginning the search in the original database of the thesaurus.
The developers of digital collections have considered it necessary to enhance the metadata
with English-language equivalents, bearing in mind the international users who get the
information via the Internet search engines and OAI portals rather than by using one or
another digital collection directly.
A further direction of development for the EMS should be conformity with contemporary
Semantic Web standards such as SKOS and Linked Open Data.
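What conformity with SKOS could mean for a single record can be sketched as follows. The function below serialises one concept in Turtle syntax using the SKOS properties prefLabel and altLabel. The concept URI is a placeholder, and treating the English equivalent as an English-language prefLabel is an assumption of this sketch, not a decision made for the EMS.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def quote(label, lang):
    """Format a label as a Turtle string literal with a language tag."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def to_skos_turtle(concept_uri, pref_et, pref_en=None, alt_et=()):
    """Serialise one thesaurus record as a skos:Concept in Turtle syntax."""
    statements = ["a skos:Concept", f"skos:prefLabel {quote(pref_et, 'et')}"]
    if pref_en:
        statements.append(f"skos:prefLabel {quote(pref_en, 'en')}")
    for label in alt_et:
        statements.append(f"skos:altLabel {quote(label, 'et')}")
    body = " ;\n    ".join(statements)
    return f"{SKOS_PREFIX}\n\n<{concept_uri}>\n    {body} .\n"


# Placeholder URI; a real publication would need stable identifiers for the concepts.
turtle = to_skos_turtle(
    "http://example.org/ems/concept/1",
    pref_et="ökokogukonnad",
    pref_en="eco-communities",
    alt_et=["ökokülad"],
)
print(turtle)
```

Here the USE reference of the thesaurus becomes a skos:altLabel of the concept; hierarchical and associative relationships would map to skos:broader, skos:narrower and skos:related in the same way.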
References
Jahns, Y. (ed.) (2012). Guidelines for Subject Access in National Bibliographies. Berlin: De
Gruyter Saur.
Nilbe, S. (2011). Semiautomatic merging of two universal thesauri: the case of Estonia. In:
Subject access: Preparing for the future. Edited by P. Landry, L. Bultrini, E. T. O’Neill, & S.
K. Roe, 51-57. Berlin: De Gruyter Saur.
Working Group on Guidelines for Multilingual Thesauri, IFLA Classification and Indexing
Section (2009). Guidelines for Multilingual Thesauri. IFLA Professional Reports, No. 115.
Available at: http://archive.ifla.org/VII/s29/pubs/Profrep115.pdf.
Authors
Sirje Nilbe - academic degrees in Estonian language and information science. Worked from
1986 to 1997 at the University of Tartu Library, from 1998 at the National Library of Estonia.
Professional fields are authority control, classification and indexing, development of thesauri.
Head of the Authority Control Department of the National Library of Estonia (1998-),
Manager of the EMS Thesaurus (2009-).
Tiiu Tarkpea - physicist, academic librarian at the University of Tartu Library (1983-), has
worked as subject librarian, subject indexer and editor of the thesaurus. Head of the
Department of Subject Analysis (2004-), Chief of the Classification and Indexing Working
Group of the ELNET Consortium.