Local multi-database access

Cross searching in local databases: the development of the meta-catalogue of the KB
By Marco de Niet (Koninklijke Bibliotheek, Library Research Department)
Introduction
In the early ’90s the Koninklijke Bibliotheek, the National Library of the Netherlands, made the first sketches for a
digital library. The KB decided to initiate three strands of activities: the creation of the Deposit of Dutch electronic
publications, digitisation of various collections and the development of a multimedia workstation. In recent years these
plans have resulted in various local, national and international projects, such as NEDLIB, CERBERUS, Digitisation of
medieval illuminated manuscripts, and the Advanced Information Workstation. This paper will focus on the development
of this workstation, and the way the workstation served as a platform for bibliographical integration.
The project Advanced Information Workstation
The KB started the pilot project Advanced Information Workstation in 1993. The first step was to define the
technical and functional specifications, after which a pilot version was created. The aim of the workstation was to
provide a single point of access to all the electronic information services of the KB, with special attention to the
needs of researchers in the humanities. The project should result in:
 the integration of the protocols of all the networks the KB has access to;
 the ability to retrieve information from all those networks in a user friendly way, with facilities for finding
relevant information irrespective of its format (text, image, sound) or channel of dissemination (cd-rom,
Internet, local, national or international catalogues and databases);
 the facility to process the information found from a single PC: downloading, copying, ordering by email,
displaying with word or image processors, etc.
The official project Advanced Information Workstation started in October 1994, with financial support from the
Netherlands Organisation for Scientific Research (NWO). While the final functionality and the technical
architecture of the workstation were being defined, HTML and HTTP, the building blocks of the World Wide Web, gave a strong
impetus to the standardisation of electronic information retrieval. This led to the decision to develop the
workstation with WWW technology.
On 1 April 1996 the first version of the AIW was implemented in the KB. Although this version was partly meant
for evaluation only, it already offered an important part of the proposed functionality. The core of the first version
was a database, with which many information services and secondary information resources had been made
accessible, be they CD-ROMs, online databases or services on the Internet. This database not only offered
information about the content of the selected resources (such as: title, author, classification code etc.), but also
practical or technical information (e.g. about login procedures).
In December 1996 the first phase of the project was concluded with the release of version 2 of the workstation. In
the development of this version, special attention was paid to finding information by discipline or subject.
Other aids for finding specific document types in the KB, such as newspapers and cd-roms, were added as well.
After these first two versions of the Advanced Information Workstation had been implemented, it became necessary
to examine the relation between the AIW and Alexicon, the general internet service of the KB. After all, it was not
very convenient, either for the KB or for its users, to have two WWW-based systems. The best solution was to
fully integrate both services, and create a WWW system that could function as a general service for all users on the
Internet as well as a research facility for the visitors of the KB.
On 5 November 1997 the new network service was released, replacing both AIW and Alexicon. With the release of
this service, the functionality of the AIW had become available to the entire Internet community, although some
facilities can only be accessed from within the library. Visitors of the library need to have a valid pass and an
additional password to get access to the workstations.
At the end of 1998 phase 3 of the workstation project was initiated. The main goals of this project were to expand
and solidify the infrastructure, to update the software and maintenance tools, and to enhance the functionality of the
workstation by creating customised tools and a meta-catalogue that would give access to all the catalogues of the
KB. A new version of the workstation is expected to be released in December 1999.
Integration
The keyword in the project Advanced Information Workstation, and perhaps the magic word in many information
retrieval projects, is integration. In the view of the KB, integrating library services for end users implies three sorts of
integration: technical, functional and bibliographical integration.
Technical integration
By technical integration we mean mainly the connecting of local and wide area networks and their protocols.
Network protocols are of no interest to library users, yet they often determine which
information can be accessed directly. When we started the workstation project in 1993, the KB was
dealing with five network protocols, namely X.25 for the Pica library system, X.400 for e-mail, TCP/IP for the
Internet (still Gopher then!), Ultranet for the cd-roms and Novell for various back office software such as data
editors. The advent of the World Wide Web was an enormous help in getting rid of this fragmentation, as the KB chose to
standardise as much as possible on the TCP/IP and HTTP protocols. The PCs in the library on which all the local
facilities can be used run Windows NT as their operating system, but all services can be launched from a Web-browser.
The KB-website is the interface to the workstation, and the users have no access to the Windows desktop. Although
we have not solved all the problems concerning technical integration, such as implementing a single digital payment
device for all our information systems, we are confident that developments on the World Wide Web will help us
find standardised solutions.
Functional integration
Functional integration is as important as technical integration but, in our experience, somewhat underestimated by
libraries. Many libraries tend to focus on creating technical gadgets with their workstations, instead of creating a
tool with which their visitors can make optimal use of the information they find while using the library's
services. The definition of the functionality of an advanced workstation should be based on a definition of the user
groups and of the actions they wish to perform on a workstation.
As the KB is a national library, it has many user groups which cannot easily be defined in terms of social
background or profession. In essence, the entire adult population of the country can be considered to be potential
customers of the KB. However, when looking at the users of a library from the perspective of information retrieval
on a multimedia workstation, three distinctions can be made.
 local users / remote users: with the decision to base the workstation on Internet technology, the KB also
decided that the information services should be available in the library as well as on remote PCs. However,
because of licences, some services, such as cd-roms and software/applications, can only be used on the premises
of the library. A few DOS-based catalogues installed on the local Windows NT
network cannot be accessed remotely either. These databases will, however, be converted to a
Windows/Web environment during the AIW3 project.
 registered / anonymous users: as the workstation is integrated with the KB-website, every visitor can see which
facilities we offer. Some of these facilities, such as the General Catalogue, are open to all visitors; others, such as
specialised databases, are restricted to registered users (irrespective of whether they are working in the KB
or on a remote PC).
 individual / group profiles: it is considered very user friendly if the users of the workstations can create
their own personal environment. After the completion of the AIW3 project, each individual user will be able to
create a personal desktop, which supports the creation of bookmarks, e-mail profiles, authentication and
accounting facilities etc. However, it can also be useful to consider an individual user as part of a specific user
group, especially from an educational point of view. First-year students of a specific discipline, for instance, could
benefit from online guides which have been developed especially for them. The Advanced Information
Workstation currently contains some 10 guides on various disciplines and subjects, which support the users
while looking for or searching in the most important information systems for that specific discipline.
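The first two distinctions above can be read as a small access policy. The Python sketch below is illustrative only: the service names are taken from the text, but the `Service` and `User` types, their fields and the rule in `may_use` are assumptions made for this example, not a description of how the KB implemented access control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    local_only: bool       # e.g. licensed cd-roms, usable only inside the library
    registered_only: bool  # e.g. specialised databases


@dataclass(frozen=True)
class User:
    in_library: bool  # local user (True) or remote user (False)
    registered: bool  # registered user (True) or anonymous visitor (False)


def may_use(user: User, service: Service) -> bool:
    """Return True if the user may open the service under the assumed rules."""
    if service.local_only and not user.in_library:
        return False
    if service.registered_only and not user.registered:
        return False
    return True


# Hypothetical classification of three services mentioned in the text.
general_catalogue = Service("General Catalogue", local_only=False, registered_only=False)
cd_rom = Service("cd-rom", local_only=True, registered_only=False)
specialised_db = Service("specialised database", local_only=False, registered_only=True)

remote_anonymous = User(in_library=False, registered=False)
remote_registered = User(in_library=False, registered=True)

print(may_use(remote_anonymous, general_catalogue))  # True
print(may_use(remote_anonymous, cd_rom))             # False
print(may_use(remote_registered, specialised_db))    # True
```

Keeping the two distinctions as independent flags means that a service restricted in both ways needs no special case.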
Regarding the actions that users in general, and scholars in particular, would like to perform on a workstation: a
properly designed workstation should support the users during every phase of their research. We have distinguished
five of these phases:
 searching - the workstation serves as a quality gateway
 acquiring - the workstation offers various retrieval facilities
 processing - the workstation serves as a personal desktop
 communicating - the workstation serves as a virtual address
 publishing - the workstation serves as a media provider for authors
Note that only the first two phases contain the traditional library functions, such as providing bibliographical
information through printed or online catalogues. The concept of a workstation shows that libraries are
transforming from information providers to service providers. In that respect the workstation can be called the
natural successor to the OPAC. It is an all-encompassing front-end service that gives access to all electronic services of a
library: information sources, be they primary (full text documents, images), secondary (descriptions of
primary information resources) or tertiary (descriptions of secondary information resources), but also applications,
e-mail, digital payment facilities, SDI-like services etc.
When it comes to providing access to information sources, one should be aware that this means not only
physical access, but also support for search strategies. A user may be looking for a known item, but may also be looking
for a specific type of publication (cd-roms, newspapers etc.) or for items within a specific discipline or related to a
specific subject. To enable these various searches, it is essential that libraries work towards bibliographical integration as
well.
Bibliographical integration
The creation of multimedia workstations offers a tremendous opportunity for libraries to finally make optimal use of
all the bibliographical data they have been creating for centuries. After decades of library automation most
libraries are still very much in a transition phase. Many important parts of the collections are not yet available
through the OPACs, and many libraries have different databases for various types of publications, such as books,
journal articles, manuscripts, maps and online resources. Besides maintaining catalogues, many libraries are involved in
the creation of bibliographies and indexes as well. Users who are not aware of the existence of all these systems
cannot be aware of the breadth and richness of the collections. Setting up a workstation forces
one to think about how to connect all these systems, and how to work towards bibliographical integration.
Catalogues of the KB
The current bibliographical situation of the KB is as follows: there are no fewer than 20 online catalogues and
documentation systems with bibliographical data that refer to our collections:
 General Catalogue (books and journals, over 1.5 million records)
 Online Contents (journal articles, over 1 million records)
 Electronic journals (over 200,000 records)
 Catalogus Epistularum Neerlandicarum (letters, ca. 125,000 records)
 Grey literature in the Netherlands (over 110,000 records)
 Dutch Union Map Catalogue (60,000 records)
 Short-Title Catalogue of Netherlands (pre-1800 imprints, over 50,000 records)
 Auction and antiquarian catalogues (50,000 records)
 Children’s books (ca. 30,000 records)
 Documentation system on the history of the book (11,000 records)
 Chess catalogue (6,500 records)
 Short-Title Catalogue of Manuscripts (6,000 records)
 Documentation system on preservation (5,500 records)
 Documentation system on manuscripts (3,500 records)
 Documentation system on paper and papermaking (3,000 records)
 DutchESS (selected Internet resources, 3,000 records)
 Documentation system on book bindings (2,500 records)
 Incunabula Short-Title Catalogue (2,000 records)
 Electronic reference works (600 records)
 Alba amicorum (500 records)
Some of these catalogues overlap (e.g. the content of the grey literature catalogue is also available in the General
Catalogue), but most of them are completely isolated from other systems. The descriptions of the electronic journal
articles, for instance, have not yet been connected with Online Contents.
Besides all these catalogues, the KB also creates several bibliographies on various subjects, such as the history of the
book, philosophy and legal history.
All these databases have been created over the past 20 years or so. Some have been created in collaboration with
other libraries or institutions, others have been created locally, by one department, or even by one staff member of
the library. The following formats are being used in these databases:
 PicaPlus (MARC-related): General Catalogue, Online Contents, Catalogus Epistularum Neerlandicarum, Grey
literature, Short-Title Catalogue of Netherlands, Children’s books, Incunabula Short-Title Catalogue
 Unimarc: Dutch Union Map Catalogue (will be converted to PicaPlus)
 SGML: Electronic journals
 Customised formats:
 Oracle: DutchESS, Electronic reference works
 Inmagic: Auction and antiquarian catalogues, Short-Title Catalogue of Manuscripts, Alba Amicorum, and
the documentation systems on history of the book, preservation, manuscripts, paper and book bindings
 MS-Access: Chess catalogue
It is obvious that the KB needs a way to search all these databases simultaneously. It is, to put it
mildly, not very user friendly to ask a customer to search 20 catalogues to see if we have any relevant
information to offer. Cross searching would be a major step forward for both the KB and its users. The Advanced
Information Workstation is the perfect environment in which to offer multi-database access and cross searching to our users.
There are already several ways to search various databases simultaneously. The most widespread and best
documented solution is the Z39.50 protocol. Various OPACs across the world have already been connected with each
other using this protocol. The KB uses the Pica library system, which supports Z39.50 on the client side, but the KB
does not have a Z39.50 server. Another complication in implementing this protocol for all the local databases is
that so many customised formats and applications are in use. Many of these databases are not Z39.50
compliant. All databases would have to be converted to a different platform and/or bibliographic format to fit this
protocol. A way out could be to use custom-made retrieval software that would search all the KB databases, but this
would more or less mean reinventing the wheel. The easiest solution for the KB was to copy and convert all
catalogues to a new database with a common bibliographical format. Searching can then be done within one
single database, a meta-catalogue. The major drawback of this solution is that the meta-catalogue has to be
updated continuously. Discrepancies between the source databases and the copied databases can easily arise if
updating is neglected.
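The update problem can be made concrete with a small consistency check. The Python sketch below assumes that each source record has a last-modified date and that each copied record stores the date on which it was converted. The function name `find_stale`, the record identifiers and the dates are all invented for illustration.

```python
from datetime import date


def find_stale(source_modified: dict[str, date],
               converted_on: dict[str, date]) -> dict[str, list[str]]:
    """Compare a source database with its copy in the meta-catalogue.

    Returns the ids of records that are missing from the copy, outdated in
    the copy, or orphaned (gone from the source but still in the copy).
    """
    missing = sorted(rid for rid in source_modified if rid not in converted_on)
    outdated = sorted(
        rid for rid, modified in source_modified.items()
        if rid in converted_on and modified > converted_on[rid]
    )
    orphaned = sorted(rid for rid in converted_on if rid not in source_modified)
    return {"missing": missing, "outdated": outdated, "orphaned": orphaned}


# Invented sample data: record id -> last modification date in the source.
source = {
    "chess-001": date(1997, 3, 1),
    "chess-002": date(1997, 6, 15),
    "chess-003": date(1997, 7, 2),
}
# Invented sample data: record id -> conversion date in the meta-catalogue.
copy = {
    "chess-001": date(1997, 5, 1),
    "chess-002": date(1997, 5, 1),
    "chess-004": date(1997, 5, 1),
}

print(find_stale(source, copy))
# {'missing': ['chess-003'], 'outdated': ['chess-002'], 'orphaned': ['chess-004']}
```

A check of this kind only reports discrepancies; whether to reconvert single records or a whole database is a separate decision.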
Creation of a meta-catalogue
In September 1996, the plan to create this meta-catalogue was launched. After some research had been done, the KB
decided on the following technical set up:
 Oracle dbms
 Oracle Context as a free text engine to index the records
 Download from the original databases in ASCII
 Conversion of the character sets to Latin 1
 Conversion of the original fields to SGML-like tags (31 fields)
 Addition of 16 fields with metadata to each record, e.g. to determine the source database and the date of
conversion.
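The conversion steps listed above can be sketched for a single record as follows. This is a minimal Python illustration under stated assumptions. The paper does not list the 31 fields or the 16 metadata fields, so the source labels in `FIELD_MAP`, the tag names, the `LABEL=value` input layout and the cp437 source encoding are all hypothetical, and the sample record is invented.

```python
from datetime import date
from xml.sax.saxutils import escape

# Hypothetical mapping from one source database's field labels to common tags.
FIELD_MAP = {"AU": "author", "TI": "title", "YR": "year"}


def convert_record(raw: bytes, source_db: str, source_encoding: str,
                   converted: date) -> bytes:
    """Turn one 'LABEL=value' record into SGML-like tagged text in Latin-1."""
    text = raw.decode(source_encoding)
    lines = ["<record>"]
    for line in text.splitlines():
        label, sep, value = line.partition("=")
        tag = FIELD_MAP.get(label.strip())
        if not sep or tag is None:
            continue  # skip fields with no counterpart in the common format
        lines.append(f"<{tag}>{escape(value.strip())}</{tag}>")
    # Metadata added to every record (two illustrative fields only).
    lines.append(f"<source>{escape(source_db)}</source>")
    lines.append(f"<converted>{converted.isoformat()}</converted>")
    lines.append("</record>")
    # Characters outside Latin-1 become '?' instead of aborting the run.
    return "\n".join(lines).encode("latin-1", errors="replace")


# Invented sample record, encoded in an assumed DOS code page.
raw = "AU=Müller, J.\nTI=Schaken & denken\nYR=1952\nXX=ignored".encode("cp437")
print(convert_record(raw, "Chess catalogue", "cp437", date(1997, 5, 1)).decode("latin-1"))
```

One such function per source database, each with its own field mapping, would correspond to the idea of a matching conversion script for every catalogue.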
Each individual database would be converted to the Oracle database with a matching script. Perl scripts would
convert queries entered from the web interface (on the workstations or any other Web-browser) into SQL statements to search the meta-catalogue (see fig. 1).
Fig. 2 is a flow chart which shows the way the original databases are converted to the new meta-catalogue.
Fig. 1: conversion of the databases to Oracle
Fig. 2: flowchart of data from source database to the meta-catalogue
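The translation of a web query into an SQL statement can be sketched as follows. The KB used Perl scripts against Oracle. This example uses Python with the standard `sqlite3` module purely as a self-contained stand-in, and the table `records`, its columns and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical searchable columns; any other form field is ignored.
# Column names come only from this fixed list, never from user input.
SEARCHABLE = ("author", "title", "source_db")


def build_query(form: dict[str, str]) -> tuple[str, list[str]]:
    """Translate web-form fields into a parameterised SQL statement."""
    clauses, params = [], []
    for column in SEARCHABLE:
        value = form.get(column, "").strip()
        if value:
            clauses.append(f"{column} LIKE ?")
            params.append(f"%{value}%")
    where = " AND ".join(clauses) if clauses else "1=1"
    sql = f"SELECT author, title, source_db FROM records WHERE {where} ORDER BY title"
    return sql, params


# In-memory stand-in for the meta-catalogue, with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (author TEXT, title TEXT, source_db TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("Jansen, A.", "Example chess title", "Chess catalogue"),
        ("Anonymous", "Example map title", "Dutch Union Map Catalogue"),
    ],
)

sql, params = build_query({"author": "jansen", "ignored": "x"})
print(sql)
print(conn.execute(sql, params).fetchall())
```

Passing the user's words as bound parameters rather than pasting them into the statement keeps the generated SQL valid whatever the user types.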
The meta-catalogue is not meant to replace the source databases; it is an add-on. Some catalogues have additional
functionality, which will not be supported by the meta-catalogue. Its main goal is to provide one single access point
to the entire collection of the KB by combining all the bibliographic records that we have, and, if possible, providing
one way of linking to the full text documents that relate to the records found. In each title presentation the name of
the source database is presented, including a link to the database itself. This way, a user can decide to continue his
search in this database.
The meta-catalogue will have two search interfaces:
 a simple search, which is similar to the way search engines on the Internet operate. This search uses an index of all
fields in all records in the meta-catalogue. This way of searching would fit the needs of modern
OPAC users who type in just any word, without worrying about which index is being searched.
 an advanced search, which is more like the traditional OPAC search screens. From this search screen users
could decide to use specific indexes, such as author, title, keyword or abstract. The index of all fields in all
records would also be available from this advanced search option.
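The difference between the two interfaces can be illustrated with a toy inverted index. This Python sketch is not the Oracle Context set-up. The records, the field names and the helper functions `build_indexes` and `search` are invented to show one index over all fields next to one index per field.

```python
from collections import defaultdict

# Invented sample records.
RECORDS = {
    1: {"author": "Jansen, A.", "title": "History of paper", "keyword": "papermaking"},
    2: {"author": "Paper, B.", "title": "Chess openings", "keyword": "chess"},
}


def build_indexes(records):
    """Build one index per field plus an 'any' index over all fields."""
    indexes = defaultdict(lambda: defaultdict(set))
    for rec_id, fields in records.items():
        for field, value in fields.items():
            for word in value.lower().replace(",", " ").replace(".", " ").split():
                indexes[field][word].add(rec_id)
                indexes["any"][word].add(rec_id)
    return indexes


def search(indexes, word, field="any"):
    """Simple search uses the default 'any' index; advanced search names a field."""
    return sorted(indexes[field].get(word.lower(), set()))


INDEXES = build_indexes(RECORDS)
print(search(INDEXES, "paper"))            # simple search: [1, 2]
print(search(INDEXES, "paper", "title"))   # advanced search on title: [1]
print(search(INDEXES, "paper", "author"))  # advanced search on author: [2]
```

The simple search trades precision for convenience: the word "paper" finds both a title and an author, which the advanced search can tell apart.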
Test results
In May 1997 the development of the meta-catalogue was far enough advanced to set up a test among the library staff.
We asked ten colleagues to try it out for about 30 minutes. These were the results:
 Number of searches
 with good results: 60
 results > 250 hits: 7
 database errors: 24
 System down: 4 times
 Shortest search: 3 sec.
 Longest search: 195 sec.
 Average time < 250 hits: 43 sec.
 Average time >250 hits: 75 sec.
While creating the meta-catalogue we had already discovered that Oracle Context, the full text search engine, did not
perform very well. Even so, these results with 10 simultaneous users were rather disappointing. During the
following months, we upgraded the entire system: we reorganised the database, upgraded Oracle, upgraded Oracle
Context, added a second processor to the server, and did a lot of tuning, mostly in cooperation with consultants from
Oracle. We also looked at alternatives for Oracle Context, such as AltaVista (DEC), Excalibur (Excalibur Techn.),
Total Recall (Dataware Techn.), NetAnswer (Dataware Techn.), SearchServer (Fulcrum), ZyINDEX (ZyLAB), Search
’97 (Verity Inc.), and Texis (Thunderstone Software).
As the focus of the workstation project temporarily shifted from the development of this catalogue to the
integration with the website of the KB, it was not until February 1998, almost a year later, that we were able to
finish the updating and tuning of the meta-catalogue and organise another test with the same user group. These were
the results (in brackets the figures from the previous test):
 Number of searches
 with good results: 142 (60)
 results > 250 hits: 31 (7)
 database errors: 20 (24)
 System down: 0 (4)
 Shortest search: 3 sec. (3)
 Longest search: 174 sec. (195)
 Average time < 250 hits: 33 sec. (43)
 Average time > 250 hits: 78 sec. (75)
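Because the second test produced many more searches than the first, the counts are easier to compare as shares. The short Python calculation below assumes that the three categories listed under "Number of searches" are mutually exclusive and together make up all searches. The paper does not state this, so the totals and percentages hold only under that reading.

```python
# Counts as reported for the two staff tests.
TESTS = {
    "May 1997": {"good": 60, "over_250_hits": 7, "database_errors": 24},
    "February 1998": {"good": 142, "over_250_hits": 31, "database_errors": 20},
}


def error_share(counts: dict[str, int]) -> float:
    """Share of searches ending in a database error (categories assumed disjoint)."""
    return counts["database_errors"] / sum(counts.values())


for label, counts in TESTS.items():
    total = sum(counts.values())
    print(f"{label}: {total} searches, {error_share(counts):.1%} database errors")
# May 1997: 91 searches, 26.4% database errors
# February 1998: 193 searches, 10.4% database errors
```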
Although there were some good improvements, especially in the stability of the system, the performance was still far
from adequate. The testing also showed that the Oracle dbms was functioning properly, and that the bottleneck was
Oracle Context. We then decided to drop this software altogether, and to try AltaVista as an indexing tool.
Post scriptum, July 1999
The decision to switch to AltaVista was taken around the time the ELAG conference was held in The Hague.
The meta-catalogue, nowadays simply called the KB-catalogue, is well on its way to becoming an operational service. Its
further development has been an essential part of the AIW3 project. Its bibliographic format has been converted to
an XML-DTD, and AltaVista has proven to be a good and fast tool for indexing and searching these records.
The KB-catalogue will be released in three phases. The first version will contain a subset only: descriptions of
electronic documents, such as electronic journals, internet resources, databases and cd-roms. This version will be
released in August 1999. The second version will contain another subset: special collections. This subset will be
based on the many Inmagic databases and some of the Pica databases, such as the STCN, the ISTC and the
catalogues of letters and manuscripts. This version will be released in the autumn of 1999. At the end of the AIW3
project, in December 1999, we hope to release the meta-catalogue in its entirety, with about 3 million records.
The Hague, March 1998 / July 1999
References
KB-website (incl. the workstation): http://www.kb.nl
Library research projects of the KB: http://www.kb.nl/kb/sbo/proj-en.html
M. de Niet: ‘A Single Access Point to Information Resources: The Advanced Information Workstation of the
National Library of The Netherlands.’ In: Resource sharing and information networks 13 (1998) 2, p. 29-38.
Z39.50 in Europe: http://www.ukoln.ac.uk/dlis/z3950/