Draft
Prototypes for searching multiple
databases
Presentation to CC:DA Taskforce on
Metadata 1/16/2000 San Antonio, Texas
Steve Miller & Mary Woodley
CHARGE 4
Recommend ways in which libraries may best
incorporate the use of metadata schemes into the
current library methods of resource description
and resource discovery
1. Patron uses ONE search interface to access all
information, whether it resides in a number of different
metadata types and standards, databases,
or OPAC(s)
CHARGE 4
2. Provide a seamless transition for the user to all
information available, moving from the ILS as a
front-end search mechanism that accesses numerous
resources to a search interface that can access all
information available in any standard, format, location,
or subject. (Example: the interface can search the local OPAC,
World Wide Web, metadata standards (EAD, TEI, GILS,
Dublin Core), special collections, museum holdings,
etc., and present results in a usable format to the
patron through one search mechanism.)
DEFINITIONS
Prototype
A virtually seamless access to information and relevant
retrieval of information from the user's point of view.
Seamless
Multiple sources of information can be searched,
resulting in a single list of search results. A prototype
system may provide the user with access to multiple
authority control lists, whether thesauri, LCSH, or
others, in order to assist in vocabulary usage
and search definition, BEFORE the search is enabled.
DEFINITIONS
INTEROPERABILITY
The ability of two or more systems or components to
exchange information and to use the exchanged
information without special effort on either system.
Different Kinds of Prototypes:





Searching multiple databases
vs.
Collecting multiple metadata records into single
database or repository
Searching different metadata semantics
vs.
Converting / mapping diverse elements into single
semantics standard searched by the interface
Subject-specific vs. Universal search interfaces
Questions for Each Possible
Prototype:



Can the user select which databases to search: one
or all or a combination of them?
Can the user select specific thesauri, subject
headings, name authorities, etc. to use as part of
the search?
Can the search results be sorted into lists by type
of metadata or by thesaurus/authority file, or are
all search results merged together without
differentiation?
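The questions above can be made concrete with a minimal sketch. It assumes each database is only an in-memory list of titles (the database names and records are invented for illustration, not taken from any of the prototypes) and shows a user selecting a combination of databases, with results either kept grouped by source or merged without differentiation.

```python
from typing import Dict, List

# Hypothetical in-memory "databases"; a real prototype would query remote systems.
DATABASES: Dict[str, List[str]] = {
    "opac": ["Egyptian funerary art", "Cataloging rules"],
    "finding_aids": ["Papers on Egyptian excavations"],
    "web_index": ["Egyptian museum home page", "Metadata primer"],
}

def search(term: str, selected: List[str]) -> Dict[str, List[str]]:
    """Search only the databases the user selected; keep results grouped by source."""
    term = term.lower()
    return {
        name: [title for title in DATABASES[name] if term in title.lower()]
        for name in selected
    }

def merge(grouped: Dict[str, List[str]]) -> List[str]:
    """Flatten grouped results into a single list without differentiation."""
    return [title for titles in grouped.values() for title in titles]

grouped = search("egyptian", ["opac", "web_index"])
merged = merge(grouped)
```

Keeping the grouped form until display time leaves both presentation choices open; once results are merged, the source of each hit is lost unless it is stored with the record.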
Three Partial Prototype Interfaces:

NESSTAR: Networked Social Science Tools and Resources
– Funded by DGXIII of the European Commission under
the 4th Framework Telematics Applications Programme
AHDS: Arts and Humanities Data Service
– Funded by the Joint Information Systems Committee of
the UK’s Higher Education Funding Council
CORC: Cooperative Online Resource Catalog
– OCLC project
NESSTAR
= Networked Social Science Tools and Resources
• A joint development project between:
– the Norwegian Social Science Data Services (NSD)
– the UK Data Archive
– the Danish Data Archive (DDA)
• Provides a common gateway to online social
science data resources
Social Science Metadata







No single established standard.
Many local "dialects" of the most common standard.
Different data archives have adapted their metadata to fit
different storage and retrieval systems.
= low level of standardization across archives.
DDI: Data Documentation Initiative established in 1995 to
create a universally supported metadata standard for the
social science community.
NESSTAR is using the XML-version of the DDI-standard
as the fundamental structure of its metadata system.
Developed a set of metadata converters to ease translation
of existing metadata.
Discovering Resources Across Archives




The resource discovery system of NESSTAR is metadata-driven.
The detailed structure of the DDI-DTD allows users to
search for data with a very high precision.
Researchers interested in particular subjects can move
beyond keywords and abstracts (normally included in
OPACs) and search directly on variable descriptions,
question texts, etc.
Searches can also be conducted on concepts such as
method of data collection (e.g. telephone interviews, face-to-face interviews, or self-completion questionnaires) or
sampling strategy (e.g. random, stratified, etc.).
NESSTAR allows users to:
• Locate multiple data sources across
national boundaries
• Browse detailed metadata about these data
• Analyse and visualise data online
• Download the appropriate subsets of data
in one of a number of formats for local use
Three Search Screens

• Simple free text search
• Structured search on a selection of fields
– (like title, abstract, year, etc.)
• Advanced Boolean search on all relevant
fields of the DDI-DTD
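The difference between the three search screens can be sketched as three functions over the same records. This is an illustration only: the sample records and field names are invented and are not the element names of the DDI-DTD, and the Boolean search is reduced to AND and NOT over plain substrings.

```python
from typing import Dict, List

# Invented sample records; field names are illustrative, not DDI-DTD elements.
RECORDS: List[Dict[str, str]] = [
    {"title": "Election Survey", "abstract": "Telephone interviews with voters", "year": "1997"},
    {"title": "Household Panel", "abstract": "Face-to-face interviews on income", "year": "1998"},
]

def simple_search(term: str) -> List[str]:
    """Free-text search: match the term in any field."""
    t = term.lower()
    return [r["title"] for r in RECORDS if any(t in v.lower() for v in r.values())]

def structured_search(field: str, term: str) -> List[str]:
    """Structured search: match the term in one named field only."""
    t = term.lower()
    return [r["title"] for r in RECORDS if t in r.get(field, "").lower()]

def boolean_search(must: List[str], must_not: List[str]) -> List[str]:
    """Boolean search: every 'must' term present (AND), no 'must_not' term present (NOT)."""
    hits = []
    for r in RECORDS:
        text = " ".join(r.values()).lower()
        if all(m.lower() in text for m in must) and not any(n.lower() in text for n in must_not):
            hits.append(r["title"])
    return hits
```

For example, `simple_search("interviews")` matches both records, while `boolean_search(["interviews"], ["telephone"])` narrows the result to the household panel.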
NESSTAR Project “Dream Machine”
(Social scientists’ ideal data search & retrieval scenario)
• All existing empirical data available on-line.
• An integrated resource discovery gateway and search-system in order
to identify and locate these resources.
• Extensive amounts of metadata available (multimedia, hyperlinked and
totally integrated with the data as such).
• Ability to browse and visualise data on-line.
• Ability to convert the data in one of a number of formats and copy,
with the metadata, to a local machine.
• “Active research agents” (knowbots) mining the net and informing the
user when new data within their special field of interest are made
available.
• Efficient hyperlinks from the data sources to every scientific
publication ever produced on the basis of a dataset.
• Ditto e-mail/web addresses to all relevant researchers, departments etc.
• Efficient feedback system to the body of metadata allowing the user to
add to the collective memory of a dataset.
AHDS: Arts and Humanities Data Service

Funded by the Joint Information Systems Committee of the
UK’s Higher Education Funding Council
Five AHDS Databases:





Archaeology Data Service
History Data Service
Oxford Text Archive
Performing Arts Data Service
Visual Arts Data Service
The AHDS Resource Discovery
Challenge





Integrate users’ online access to distributed and
heterogeneous information resources.
Each collection presents information about its holdings
differently.
Several service providers have data exchange and
interoperability agreements with third parties.
The collections comprise a wide variety of resource types,
including electronic texts, databases, digital images,
geospatial information systems, and time-based film data.
Services have adopted very different resource description
and cataloguing practices.
Searching the Databases
Each collection can be accessed by one of two methods:
1) Through each service provider’s native catalog
– Different capabilities tailored to the information needs of different
scholarly communities and to the resource description
requirements of very different digital collections.
2) Through the common AHDS Gateway
– Presents different collections’ catalogs as a virtual uniform catalog
and bases search and retrieval capabilities on an unqualified
Dublin Core record.
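The idea of presenting different catalogs as one virtual catalog based on unqualified Dublin Core can be sketched as a field crosswalk. The source names, native field names, and sample records below are hypothetical and are not the AHDS mappings; only the element names `creator`, `title`, and `subject` come from Dublin Core.

```python
from typing import Dict, List

# Hypothetical mappings from two native catalogs to Dublin Core element names.
CROSSWALKS: Dict[str, Dict[str, str]] = {
    "text_archive": {"author": "creator", "work_title": "title", "genre": "subject"},
    "image_service": {"artist": "creator", "caption": "title", "keyword": "subject"},
}

def to_dublin_core(source: str, native: Dict[str, str]) -> Dict[str, str]:
    """Map a native record to Dublin Core; unmapped native fields are dropped."""
    mapping = CROSSWALKS[source]
    return {mapping[k]: v for k, v in native.items() if k in mapping}

def gateway_search(records: List[Dict[str, str]], element: str, term: str) -> List[Dict[str, str]]:
    """Search one Dublin Core element across all mapped records."""
    return [r for r in records if term.lower() in r.get(element, "").lower()]

catalog = [
    to_dublin_core("text_archive", {"author": "Example, Anne", "work_title": "Collected Letters", "edition": "2nd"}),
    to_dublin_core("image_service", {"artist": "Sample, Ben", "caption": "Harbour at Dawn"}),
]
hits = gateway_search(catalog, "creator", "example")
```

The dropped `edition` field shows the trade-off: a uniform record makes cross-catalog searching possible, but detail that only the native catalog understands is not searchable through the gateway.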
The AHDS Gateway:

Provides a point of access to the electronic resources held by the five
AHDS service providers
– such as electronic texts, databases, images, and video and audio clips, online where possible

Allows users to search across a variety of independent and very different
online catalogues as if they were a single catalogue.

Provides access to high-quality information providers of data from the
disciplines of archaeology, history, literature and language, the performing
and visual arts

Resources selected for their quality and for their relevance to those
interested in the Arts & Humanities.

Users also benefit from a more refined approach to searching, being able
to search against specified fields (creator, subject, title, etc.), which is not
possible through most search engines available on the Internet.
AHDS: “Challenges to be Confronted”



Digital Preservation and Archiving
– Requires substantial infrastructural investment.
Controlled Vocabularies
– No agreement likely among service providers about their use
– OTA uses AACR; VADS uses AAT
– Even greater variation in use of date and coverage elements.
– How to assist users searching across catalogs with domain-specific
controlled vocabularies? Major challenge for the future.
Z39.50 Interoperability
– Relatively immature standard; few guidelines for its use yet developed.
– Different Z39.50-aware applications may conform to standards yet
remain incompatible with one another or interact in ways not meaningful
or helpful to the user.
– In wider and more impersonal networked environments, means will need to
be developed to ensure interoperability.
AHDS: “Challenges to be Confronted”

User Registration, Authentication, and Resource Ordering
– AHDS benefits from its circumscribed service environment.
– Problems will arise as it integrates third-party systems into its Gateway where
such services use independent registration, authentication, and resource ordering
services.
– A more automated approach will be required to support scholarly and heritage
users who wish to locate, scrutinise, and acquire access to information objects of
interest irrespective of their location, format, and management.
Users’ Resource Discovery Preferences
– AHDS has so far operated with numerous assumptions about users’ resource
discovery preferences in a distributed network environment. Those assumptions
have shaped the development of the AHDS Gateway and associated systems.
– How users actually exploit the Gateway, particularly in relation to their use of
underlying Service Provider catalogues, will provide useful feedback for the
systems’ further development, but also for applied research into resource
discovery systems more generally.
CORC: Cooperative Online
Resource Catalog




OCLC research project
Web-based prototype system
Offers both full USMARC cataloging and an
enhanced cataloging mode for Dublin Core
Records can be imported into or exported from
CORC using:
– (1) MARC
– (2) HTML
– (3) RDF-compliant XML
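As an illustration of the HTML option, a Dublin Core record can be written out as HTML `<meta>` tags. The sketch below follows one common convention of naming each tag `DC.` plus the element name; it is not CORC's actual export routine, and the sample record is invented.

```python
from html import escape
from typing import Dict, List

def dc_to_html_meta(record: Dict[str, str]) -> List[str]:
    """Render each Dublin Core element as an HTML <meta> tag (DC.element naming)."""
    return [
        f'<meta name="DC.{element}" content="{escape(value, quote=True)}">'
        for element, value in record.items()
    ]

# Invented sample record; the ampersand shows why values must be escaped.
tags = dc_to_html_meta({"title": "Maps & Charts", "creator": "Example Library"})
```

Escaping the values matters because titles often contain characters such as `&` or quotation marks that would otherwise break the HTML attribute.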
Authority Control in CORC




OCLC is defining how the authority component will work.
CORC users will have access to a copy of the OCLC
Authority File and the ability to create provisional authority
records for use by other CORC participants.
The first version of the CORC authority search interface will
not offer all cross-references available in the Authority File.
CORC currently supports automation-assisted authority
control during resource record creation and editing for
selected fields.
The Future of the CORC Project



OCLC will introduce CORC as a production
service in July 2000.
CORC 1.0 is the first phase of OCLC's next
generation of cataloging services.
Version 1.0 will focus on an optimized metadata
creation service for electronic resources and on
providing an integrated view of those resources
with other bibliographic records in WorldCat (the
OCLC Online Union Catalog).
Projects by the Getty Trust:
aka
http://www.ahip.getty.edu/aka/ (retired)
Faces of LA:
http://facesla.org (retired)
Arthur
http://www.ahip.getty.edu/arthur/ (retired)
Getty Research Institute auction catalog records:
http://opac.pub.getty.edu
aka
•Developed by the Getty Information Institute whose
mission was to create and support standards of description,
tools, and guidelines for sharing cultural information
•Designed to demonstrate the value of controlled
vocabulary in searching electronic resources, including the
Web
•Public mode searched 4 databases with the option of using the
Art and Architecture Thesaurus and the Union List of Artist
Names; Staff mode searched over 26 databases: relational
databases as well as texts in Web resources
aka System
System components:
•Web harvester (public domain)
•WAIS text indexing system (public domain)
•Vocabulary Searching Interface (Getty developed
using Sybase)
•Supported Boolean searching
•Ability to expand, limit query & modify results
THE aka INTERFACE
Multiple Collections
Vocabulary
Search Option
Text Searching Field
Slide courtesy of Marty Harris
aka IMPLEMENTATIONS

aka provides internal search access to 26 GII research and
partner databases and two general search resource databases.

The Art and Archaeology Technical Abstracts
Avery Index to Architectural Periodicals
Bibliography of the History of Art (BHA)
Getty Education Institute ArtsEdNet Web Site
Getty Museum Collections Management System
Getty Research Institute Integrated Catalog, IRIS
GII Index of Cultural Heritage Web Sites
GRI Photo Study Collection - Antiquities
GRI Photo Study Collection - Illuminated Manuscripts
GRI Photo Study Collection - Max Hutzel Collection
International Repertory of the Literature of Art (RILA)
Provenance Index Sale Catalogues
Provenance Index Sale Contents
The American Film Institute OnLine
The Autry Museum of Western Heritage
CSU Northridge: Special Collections and Archives
Fine Arts Museum of San Francisco Art Imagebase
Istituto Centrale per il Catalogo e la Documentazione I
Istituto Centrale per il Catalogo e la Documentazione II
L.A. Library: People, Places and Events in Southern California
L.A. Library: Historic Photographs of Southern California
Lycos Image Index
Museum of Tolerance: Photos of Children
UCLA Fowler Museum of Cultural History
USC Ethnic Studies Collection
Alta Vista
Slide courtesy of Marty Harris
©J. Paul Getty Trust
Slide courtesy of Murtha Baca
Retrieval using “a.k.a.”
•User has concept in mind
•User chooses databases to query
•User may use Getty vocabulary to broaden
or refine search
•Implementation in search engine
•E.g., search Scope Note for “Egyptian funerary”
From AAT: ushabti, shabti, shawabti, ushabtis, ushabtiu
"ushabti" OR "shabti" OR "shawabti" OR "ushabtis" OR "ushabtiu"
Slide courtesy of Patricia Harpring
© J. Paul Getty Trust
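The vocabulary-assisted retrieval on this slide amounts to expanding one term into an OR query over its variants. A minimal sketch follows, assuming the thesaurus is a plain dictionary; the variant terms are the ones listed on the slide, while the dictionary structure and function name are illustrative and not the Getty implementation.

```python
from typing import Dict, List

# Tiny stand-in for a thesaurus: preferred term -> variant terms (from the slide).
VARIANTS: Dict[str, List[str]] = {
    "ushabti": ["shabti", "shawabti", "ushabtis", "ushabtiu"],
}

def expand_query(term: str) -> str:
    """Build an OR query from a term and any variants listed in the vocabulary."""
    terms = [term] + VARIANTS.get(term, [])
    return " OR ".join(f'"{t}"' for t in terms)

query = expand_query("ushabti")
```

A term with no entry in the vocabulary is passed through unchanged as a single quoted term, so the expansion can only broaden a search, never block it.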
aat Search results
©J. Paul Getty Trust
Slide courtesy of Murtha Baca
aka Limitations
•Currency
•Labor intensive
•Although vocabularies increased precision,
effectiveness was reduced because not all databases
used the same vocabularies, or they used the
vocabularies only in limited ways
•Could not generate a search directly from the vocabulary lists;
terms had to be typed in
THE aka “FACES of LA” INTERFACE
Text Searching
Field
Vocabulary
Search Option
Multiple Collections
Slide courtesy of Marty Harris
Books & Auction Catalogs
Separate Authorities
Auction Catalog Record
Different cataloging standard,
same machine-readable format (MARC)
Z39.50 Projects
FirstSearch
Pharos:
http://pharos.calstate.edu
Searchlight:
http://cdl.ucop.edu/libstaff/jan2000/
http://searchlight.cdlib.org/cgi-bin/searchlight/
FirstSearch
Provides the option of searching a single database or
multiple databases (up to 3) within a category
FirstSearch
•Database contains
catalog records for Web
sites
•Each record contains
abstracts, LCSH, DDC &
URLs
Pharos Project - CSU
•Goal is to provide to all 23 CSU campuses a gateway to
electronic resources with a Z39.50 Web interface
•Portal for searching, Interlibrary loan and information
competency tutorials
•Site is still under development
Pharos: capabilities & future
enhancements
Quick search
Search a specific database
Select databases in categories
Select a combination of databases
Results displayed by database
In development:
Boolean searching
Limits
Search history
Inclusion of non-Z39.50 databases
Public Version
Option to Select a
specific database
Or search
categories of
databases
Select a combination
of databases
Pharos Result Screen
High Recall -- Low Precision
SearchLight - UC project
Select specific database
or category
Results sorted by
categories of materials
Result differences due to how
Server interprets request from
Client
Strengths of Z39.50 cross-database searching
Circumvents 2 of the primary problems facing
patrons:
1. Deciding which database(s) are most appropriate for the
query
2. Finding which specific databases are available:
a. Long A-Z lists
b. Buried menu lists
Provides a single portal to the information universe
May link to Full-text resources
No need to duplicate access in OPAC & “lists”
Weaknesses of Z39.50
metasearching
Does not currently include non-Z39.50 databases
Relies on keyword searching
Results affected by:
Which fields are indexed in the various databases
How each Server interprets the query from the Client
Keyword searching can result in high recall with low precision:
Search the individual databases to generate precise results
Diminishes the strengths of the catalog record: its access through
controlled vocabulary
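The phrase "high recall with low precision" can be made concrete with a small worked example. The numbers below are invented for illustration and are not measurements from Pharos or SearchLight: ten relevant records exist, and a keyword search returns one hundred records, nine of them relevant.

```python
from typing import Set, Tuple

def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Return (precision, recall) for one search."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented figures: record ids 0-9 are relevant; the search returns ids 1-100.
relevant = set(range(10))
retrieved = set(range(1, 101))
precision, recall = precision_recall(retrieved, relevant)
```

Here the search finds nine of the ten relevant records (recall 0.9), but only nine of the hundred records returned are relevant (precision 0.09), which is the pattern the slide describes for keyword metasearching.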
Weaknesses of MetaSearching
• Different databases under different vocabulary
control
•Older relevant material not always available on
the Web
•Not all resources on the Web have the same
“value”; if a general Web search is included,
Web results are given the same “status” as other resources
on SearchLight
Comments & Feedback
Please send comments & feedback to:
Mary Woodley mary.woodley@csun.edu
& Steve Miller MLL@gml.lib.uwm.edu