Using Estonian Subject Thesaurus in digital environment Sirje Nilbe National Library of Estonia Consortium of Estonian Libraries Network E-mail: Sirje.Nilbe@nlib.ee Tiiu Tarkpea University of Tartu Library E-mail: Tiiu.Tarkpea@ut.ee Abstract This paper views the usage possibilities of the Estonian Subject Thesaurus (EMS) as the major subject indexing tool in the digital databases of Estonian libraries. The article examines traditional resources like online catalogues and bibliographic databases but also more recent resources – digital archives and institutional repositories. The availability of a universal thesaurus facilitates a wide reuse of records and is economising in terms of intellectual work. One further development trend of the EMS should be the conformity to the contemporary Semantic Web standards like SKOS and Open Linked Data. What is the Estonian Subject Thesaurus? The Estonian Subject Thesaurus (in Estonian Eesti märksõnastik or EMS)1 is a controlled vocabulary with universal coverage for information search and indexing of diverse library material. The thesaurus was taken into use under this name in May 2009. Its predecessors – the thesaurus of the University of Tartu Library and the Estonian Universal Thesaurus – date back to the 1990s. Under a project carried out in 2007-2009 those two universal thesauri were merged and named the Estonian Subject Thesaurus (Nilbe 2011). It is co-managed by the ELNET Consortium (Consortium of Estonian Libraries Network), the National Library of Estonia and the University of Tartu Library, and can be freely used by all Estonian libraries and other interested institutions. The EMS is maintained by a web-based database operating with the software MySQL and php. The thesaurus is developed by a manager, programmer and 9 authorised editors, all with part-time contribution. At the end of April 2012 the thesaurus contained over 53 000 terms, among these over 36 000 preferred terms and 17 000 nonpreferred terms. This is an unusually large amount, even when considering that about 8600 of them are place names. The database of the thesaurus enables the users to browse the subject terms by subject fields; to search terms by the beginning or part of the word, or by exact match; to search terms by the English equivalent; to view search results as word lists or as full records; to search by every term in the online catalogue ESTER, in the database of Estonian articles ISE, in the database of Estonian National Bibliography (ERB) or in Google; 1 http://ems.elnet.ee 1 to print, e-mail or save into file the selected word lists or full records; to subscribe current awareness service for new, changed and deleted subject terms. For using the data of the EMS in other systems, the subject term records can be exported in MARC 21 format for authority data. Also, a word retrieval Web service for other systems has been worked out (machine-to-machine interaction). The possible output includes machinereadable MARC 21, eye-readable MARC 21 or MARCXML. The user interface is in Estonian and English, the latter was offered for use just recently, at the end of April. However, the availability of the English-language user interface does not mean that the EMS is a genuine bilingual thesaurus or that it could be used to generate an Englishlanguage controlled vocabulary. English-language terms can only be treated as translations of Estonian-language terms and they do not constitute a separate whole which is semantically and structurally organised. The compiling of multilingual thesauri rises a number of different problems (Guidelines for Multilingual Thesauri, 2009) which the compilation of the EMS has not actually aimed to address. Part of the English-language terms correspond to the terms in the Library of Congress Subject Headings, another part of them have been collected from the most recent professional literature. There are also concepts which lack satisfying equivalents in English due to cultural and historical differencies. <Figure 1. Display of a record of the EMS using English interface.> The EMS mostly serves as a tool for post-coordinated indexing. Although library catalogues and national bibliographies worldwide prefer pre-coordinated indexing, it has been used very seldom in Estonia during the electronic era. Pre-coordination versus post-coordination is a topic much discussed, one of the most recent overviews of the arguments for and against both is presented in the IFLA document Guidelines for Subject Access in National Bibliographies (Jahns 2012, 21-22). The present article does not aim at re-launching the discussion but rather to point out that in Estonian databases subject terms have been given as lists of individual descriptors. At the same time many descriptors themselves are complex concepts, meaning that pre-coordination is realised rather via the means of expression of the natural language than the syntax rules of the indexing language. For example, arhitektuuriajalugu ‘history of architecture’, avaliku sektori ökonoomika ‘public economics’, põllumajanduskaardid ‘agricultural maps’. The following overview looks at different digital environments of libraries which use the EMS as the source of subject description metadata. The thesaurus in the online catalogues of libraries Although catalogues have long since ceased to be the only databases of libraries, they are still (at least in Estonia) the most important and best known sources enabling users to access the collections and information. The major Estonian library catalogue ESTER2 contains nearly 3 million bibliographic records and is jointly maintained by the 13 member libraries of the ELNET Consortium. All member libraries use the EMS for indexing, 9 of them also the UDC system. The software of the catalogue is the integrated library system Millennium from the U.S. Company Innovative 2 http://ester.tallinn.ee; http://ester.tartu.ee 2 Interfaces Inc. Its catalogue module is a traditional MARC-based database with browsable indexes. Bibliographic and authority records have common indexes. Where the subject index is concerned, cross references from authority records are arranged between data indexed from bibliographic records. In this system authority files do not constitute separate units which could be separately searched or browsed. <Figure 2. Subject index of ESTER, enriched with references from authority records.> Since the implementation of the EMS software in May 2009, the subject authority records are no longer compiled manually but the data is updated regularly about twice a month on the basis of extensions and corrections made in the EMS database. A MARC 21 file is exported from the thesaurus system and loaded into the library system. Other libraries (there are about 1000 libraries in Estonia) compile their catalogues mostly by copy cataloguing from ESTER. This ensures that the subject terms from the EMS reach practically all libraries. Special, public and school libraries often enrich the records with additional subject terms which help their users in information search. As their library systems do not support the integration of authority data with bibliographic data, the records of smaller libraries sometimes contain synonymous descriptors and nonpreferred terms which increase the number of access points. Databases of articles Estonia has a long-term tradition of compiling analytical databases of articles published in local periodicals. Such databases have been created by different institutions and have followed their own purpose, thus a lot of duplication has occurred. It has not been easy for users to understand where to find the information they need. Since 2009 the separately maintained bibliographic databases have been assembled into a unified environment – the additional module Reference Database of the integrated library system Millennium. For that reason the functionality and appearance of this database resemble the online catalogue ESTER, and they both constitute a significant resource for Estonian information consumers. The database of articles Index Scriptorum Estoniae (ISE)3 is also managed by the ELNET Consortium. For indexing old separate databases, different local vocabularies were used. The present corresponding standard is the Estonian Subject Thesaurus. We have set an aim of harmonising all older records with this thesauri which will take several years. County libraries compile bibliographic databases on local history. They mostly use subject terms from the EMS for indexing but their own library systems do not enable authority control. Synonym control is particularly missed – libraries cannot offer USE references for their users. For that reason bibliographers would like to supplement the EMS with many specific terms on local life and are not satisfied with the situation where the EMS gives USE references from specific terms to more general terms. For example, the authorised term in the EMS is ökokogukonnad ’eco-communities’ and the term ökokülad ’eco villages’ gives the following reference: ökokülad USE ökokogukonnad. Rural librarians would like to have also the ökokülad as an authorised term. 3 http://ise.elnet.ee 3 Still the EMS editors have accepted a lot of suggestions by both bibliographers of local history and compilers of the ISE database, and the necessary terms have been included in the thesaurus – for example juubelid ’anniversaries’, rahvatants ’folk dance’, külapäevad ’village days’, vallavanemad ’parish heads’. The Estonian National Bibliography Database The Estonian National Bibliography Database (ERB)4 is compiled and managed by the National Library of Estonia, the corresponding software has been locally developed on the basis of the database management system MySQL. The items subject to registration in the national bibliography are catalogued, classified and indexed in the shared library system Millennium which has been introduced previously in this paper. From Millennium the bibliographic records are exported and loaded into the Estonian National Bibliography Database. It is not possible to change the records in this database, all the necessary corrections and changes are first made in the Millennium record and then loaded into the National Bibliography database. Subject search in the ERB can be carried out by subject terms, UDC numbers and keywords focused on subject fields or title fields. Subject terms and UDC numbers included in bibliographic records are browsable via indexes. The ERB does not contain authority data and thus does not enable to take advantage of all the possibilities offered by the thesaurus for finding the right search term. However, there is a clickable EMS logo on each page allowing the user to move on to the EMS environment, to find there a suitable search term and from the same location perform a direct search in the ERB. <Figure 3. Directing search from the EMS to the ERB.> Digital archive DIGAR DIGAR5 is the digital archive of the National Library of Estonia maintained in conformity with the library’s tasks to collect, preserve and make available digital information, including the tasks proceeding from the Legal Deposit Copy Act. The archive preserves Estonian online publications issued on the Internet (books, newspapers, journals, serials, maps and sheet music); digital copies of Estonian electronic publications issued on physical carriers (floppy disk, CD-ROM, etc.); digital copies of publications on analogue carriers; print files of Estonian publications. The information system of the archive is based upon the free software FEDORA which has been supplemented by interfaces for data input, management and providing services for the 4 5 http://erb.nlib.ee http://digar.nlib.ee 4 users. The latter is still under development but the archive is already operating and enjoys a fairly large usership. The amount of born digital and digitised information is growing rapidly which makes the improvement of the archive’s functionality a major stategic task of the National Library of Estonia. Most of the digital objects stored in DIGAR have been described in the online catalogue ESTER and analytically in the database ISE. Records in those databases and full texts in DIGAR are linked. The objects in DIGAR have descriptive metadata and search can be carried out also with the archive’s user interface. The metadata of DIGAR is mostly imported from ESTER, including subject descriptors. In the archive’s environment the metadata is supplemented by the English-language equivalents of the descriptors which are retrieved from the EMS database by a programmed query. The digital archive naturally enables to use full text search. <Figure 4. DIGAR subject descriptors displayd in English and Estonian.> Institutional repositories Several university libraries belonging to the ELNET Consortium manage their university’s institutional repository or digital archive. As an example this paper views the University of Tartu Digital Archive on DSpace6. The aim of the repository is to collect, preserve and make available the digital information of the University of Tartu structural units. The respository is publicly available, its materials can be found with Internet search engines and can be interfaced by OAI data exchange protocols. The software of the DSpace is a free software with open source code. The repository preserves: print files of doctoral theses published by the University of Tartu; original files created in the University of Tartu structural units, e.g conference papers, articles, reports, textbooks, personal archives, etc. digital copies of analogue carriers; online publications issued on the Internet; digital copies of eletronic publications issued on physical carriers. All archived materials are supplied with metadata and linked with the online catalogue ESTER, i.e the catalogue record contains a link to the full text stored in the repository. The metadata, including the subject terms, are based on catalogue records. Unlike the catalogue, the metadata of the material archived in the repository is supplemented with English-language subject terms where the English equivalents of the EMS are used – these are added manually. If so desired, subject terms or keywords in any languages can be added, preferably in the language of the original. It is possible to create subject term patterns (subject terms provided by default) for the whole collection. We have considered whether it would be sensible to create a thesaurus within the repository on the basis of the EMS, which would contain about 2000 subject terms. However, the debates have resulted in the opinion that such a thesaurus would be too general and it would not be sufficient for describing the contents of research works. In order to index more specifically, that thesaurus would gradually be supplemented by uncontrolled terms or 6 http://dspace.utlib.ee 5 keywords, while the EMS already contains controlled terms for designating the necessary concepts. Other potential users During the three years when the EMS has been publicly available, several memory and knowledge institutions outside the library community have considered to start using it in their information systems. Amongst them, cooperation with archives has been the closest – they design and implement a new information system for archives. The information system of museums is also under development, with a need for controlled vocabularies to describe museum artefacts. Yet do archives and museums lack enthusiasm about integrating the EMS into their information systems. It is too extensive for their collections and contains a lot of concepts which are never needed for describing archival records or museum artefacts. At the same time, many necessary concepts are probably missing. Archives and museums could try to make a certain selection but it is difficult to retain the integrity of semantic relationships when making a concise extract. Without the corresponding intention, the EMS is assuming the role of a language resource, especially for translators of specialised texts. In this respect we have received a lot of positive feedback. Even the EuroTermBank is interested in including the EMS among their free terminology resources. ELNET as the owner of the EMS is ready to allow it but technical problems still need to be solved. Conclusion The Estonian Subject Thesaurus has achieved a firm position in Estonian libraries because its development has tried to consider the needs of all interest groups. Estonian librarians like specific and exhaustive indexing in order to provide high-level information services for the users, thus they need a large amount of terms. Other memory and knowledge institutions with not so universal needs seem to have difficulties in using the EMS. The EMS is available as a web database and authority data in systems which support authority control via traditional authority records. In other systems which lack the relationships between subject terms and synonym control, the EMS loses a lot of its advantage as a tool for information search. To some extent this disadvantage can be compensated by beginning the search in the original database of the thesaurus. The developers of digital collections have considered it necessary to enhance the metadata with English-language equivalents, bearing in mind the international users who get the information via the Internet search engines and OAI portals rather than by using one or another digital collection directly. One further development trend of the EMS should be the conformity to the contemporary Semantic Web standards like SKOS and Open Linked Data. References Jahns, Y. (ed.) (2012). Guidelines for Subject Access in National Bibliographies. Berlin: De Gryter Saur. 6 Nilbe, S. (2011). Semiautomatic merging of two universal thesauri: the case of Estonia. In: Subject access: Preparing for the future. Edited by P. Landry, L. Bultrini, E. T. O’Neill, & S. K. Roe, 51-57. Berlin: De Gruyter Saur. Working Group on Guidelines for Multilingual Thesauri, IFLA Classification and Indexing Section (2009). Guidelines for Multilingual Thesauri. IFLA Professional Reports, No. 115. Available at: http://archive.ifla.org/VII/s29/pubs/Profrep115.pdf. 7 Figure 1. Display of a record of the EMS using the English interface. 8 Figure 2. Subject index of ESTER, enriched with references from authority records. 9 Figure 3. Directing search from the EMS to the ERB. 10 Figure 4. DIGAR subject descriptors displayd in English and Estonian. 11 Authors Sirje Nilbe - academic degrees in Estonian language and information science. Worked from 1986 to 1997 at the University of Tartu Library, from 1998 at the National Library of Estonia. Professional fields are authority control, classification and indexing, development of thesauri. Head of the Authority Control Department of the National Library of Estonia (1998-), Manager of the EMS Thesaurus (2009-). Tiiu Tarkpea - physicist, academic librarian in University of Tartu Library (1983-), has practiced as subject librarian, subject indexer and editor of the thesaurus. Head of the Department of Subject Analysis (2004-), Chief of the Classification and Indexing Working Group of the ELNET Consortium. 12