Using Authorities to Improve Subject Searches

Edward T. O'Neill, Rick Bennett, and Kerre Kammerer; OCLC Research
Abstract
Authority files have played an important role in improving the quality of
indexing and subject cataloging. Although authorities can significantly
improve search by increasing the number of access points, they are rarely an
integral part of the information retrieval process, particularly in end-user
searches. A retrieval prototype, searchFAST, was developed to test the
feasibility of using an authority file as an index to bibliographic records.
searchFAST uses FAST (Faceted Application of Subject Terminology) as an
index to OCLC's WorldCat.org bibliographic database. The searchFAST
methodology complements, rather than replaces, existing WorldCat.org
access. The bibliographic file is searched indirectly; first the authority file is
searched to identify appropriate subject headings, then the headings are used
to retrieve the matching bibliographic records. The prototype demonstrates
the effectiveness and practicality of using an authority file as an index.
Searching the authority file leverages authority control work by increasing the
number of access points while supporting a simple interface designed for end-users.
Introduction
Authority files have traditionally been an integral part of cataloging. While there is broad
agreement that controlled vocabularies and authority control significantly improve catalogs,
authority control is costly and labor intensive. Although professional searchers often consult
authorities, few end-users have either the knowledge or experience to effectively use authority
files. Enabling end-users to effectively use authorities as search aids will require new
approaches that better utilize the information in authority records without adding complexity.
Unlocking the potential of authorities will necessitate reexamining the search process and
exploring new approaches.
Authority Control
In its simplest form, authority control consists of (1) determining the form of terms used as
access points and (2) verifying that the terms assigned are valid. This is usually accomplished by
creating a separate authority record for each access point that identifies the preferred term.
These records include cross references identifying common synonyms and information
explaining the rationale for establishing the particular form for the access point.
The authority record will identify the preferred term and list common synonyms as cross
references. For example, the preferred term for "authority control" in the Library of Congress
Subject Headings (LCSH) is Authority files (Information retrieval) with cross references
identifying Authority control (Information retrieval), Authority files (Cataloging), Authority
records (Information retrieval), Authority work (Information retrieval), and Library authority
files as synonyms. The terms found in the cross references are not valid access points—the
preferred term should be used instead. A resource describing authority work therefore should be
assigned the term Authority files (Information retrieval) rather than Authority work or
Authority records (Information retrieval).
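The relationship between a preferred term and its cross references can be sketched in a few lines of Python. This is a minimal illustration, not a description of how any cataloging system stores authority data; the class name `AuthorityRecord` and the helpers `normalize`, `build_lookup`, and `resolve` are hypothetical names chosen for this example. The terms themselves come from the LCSH example above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AuthorityRecord:
    """One authority record: a preferred term plus its cross references."""
    preferred: str
    see_from: list = field(default_factory=list)  # non-preferred synonyms


def normalize(term: str) -> str:
    """Case-fold and collapse whitespace so lookups ignore trivial variation."""
    return " ".join(term.lower().split())


def build_lookup(records) -> dict:
    """Map every preferred term and every cross reference to the preferred term."""
    lookup = {}
    for record in records:
        lookup[normalize(record.preferred)] = record.preferred
        for variant in record.see_from:
            lookup[normalize(variant)] = record.preferred
    return lookup


def resolve(term: str, lookup: dict) -> Optional[str]:
    """Return the valid access point for a candidate term, or None if unknown."""
    return lookup.get(normalize(term))


records = [
    AuthorityRecord(
        preferred="Authority files (Information retrieval)",
        see_from=[
            "Authority control (Information retrieval)",
            "Authority files (Cataloging)",
            "Authority records (Information retrieval)",
            "Authority work (Information retrieval)",
            "Library authority files",
        ],
    )
]
lookup = build_lookup(records)
print(resolve("Authority work (Information retrieval)", lookup))
```

A cataloger or a search interface could call `resolve` on any entered term: a cross reference is redirected to the preferred term, and an unknown term returns `None` so it can be flagged for review.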
Taylor defined authority control as "the result of the process of maintaining consistency in the
verbal form (heading) used to represent an access point and the further process of showing the
relationships among names, works, and subjects. It's accomplished through use of rules (in the
case of names and titles), use of a controlled vocabulary, and reference to an authority file to
create an authorized character string called a heading." 1 Chan and O'Neill emphasized that the
"purpose of authority control is similar to that of the now ubiquitous spell checkers; authority
control should detect incorrect headings and, when possible, correct or suggest corrections. Just
as spell checkers use dictionaries to determine the correct spelling of a word, authority control
relies on the authority file to determine the correct form of a heading." 2
End-User Searching
Subject heading schemas are often complex and difficult to use. The last printed manual for
applying the Library of Congress Subject Headings (LCSH) required four volumes. 3 Librarians
and other experienced users have the requisite skills to effectively search controlled
vocabularies. One of the more significant trends in information retrieval has been the shift away
from mediated to end-user searching. Since end users are often unfamiliar with controlled
vocabularies, schemas such as LCSH have not necessarily improved the search
experience for end users. Markey concluded that "End-user searches do not resemble the
systematic approach of expert intermediary searchers" since end users make very limited use of
Boolean logic.4
The difficulties end users encounter with LCSH are well documented; many of these
difficulties are attributable to its highly structured vocabulary and extensive application rules.
Markey argues that "making online IR systems more complicated with additional functionality,
frequent and unanticipated interruptions in the form of direct system intervention, and detailed
instructions and tutorials in system use, is not the right way to proceed." 5 From the perspective
of the typical end-user, intuitiveness and ease of use are more important than functionality.
Advanced features, while often demanded by experienced searchers, should not be an
impediment to end-users.
Authority File as an Index
Are controlled vocabularies synonymous with complexity? Do controlled vocabularies have a
role in the new search environment or are they a relic of the card catalog era? While the jury is
still out, many believe that controlled vocabularies remain relevant and should play a key role in
future retrieval systems. However, for that to happen, (1) the cost of using controlled
vocabularies, i.e. authority control, will have to decrease significantly, and (2) new approaches to
searching must be developed to better utilize the information from authority records without
adding complexity.
Although authority control is likely to remain a costly process, considerable progress has been
made in automating the process to reduce its cost. The second condition, although equally
important, is more problematic. If authority control and controlled vocabularies continue to be
used primarily by catalogers as tools to improve indexing quality, it will be increasingly difficult
to justify the costs. Unlocking the potential of authorities as a means of improving searching will
require reexamining the search process, exploring new methodologies, and developing prototype
environments in which to test the methodologies.
Historically, there was a close link between an authority record and its associated bibliographic
records. In card catalogs, it was common practice to interfile cards for the authority records with
the cards for the bibliographic records. Cards for the cross references would also be interfiled to
redirect users to other parts of the catalog. With the advent of online catalogs, the direct linkage
between the authorities and bibliographic records was lost. Authority files and bibliographic
files are maintained by different agencies; subject authority files, such as LCSH, generally are
maintained centrally while bibliographic files are maintained locally.
There are significant benefits to linking authority and bibliographic files; one is that linking
permits the authority file to function as an index to the bibliographic file. This indirect approach
offers two distinct advantages over searching the bibliographic file directly: (1) authorities
contain a wealth of information not available in the bibliographic record and (2) the additional
information can be used to generate search trees to aid navigation. Authority records for schemas
such as LCSH or FAST (Faceted Application of Subject Terminology) are rich sources of
information about the subject and include many access points and navigation aids unavailable in
the bibliographic record. Figure 1 shows the LCSH authority record for Cataloging:
<Figure 1. LCSH Authority Record for Cataloging.>
Even this relatively simple authority record contains information including the Library of
Congress class numbers associated with the heading (Z693 and Z695.83), a cross reference from
the British spelling (Cataloguing), links to broader terms (Information organization, and
Technical services (Libraries)), and links to related terms (Books), none of which are found in the
bibliographic record. This additional information can be used to enhance searching by providing
additional access points. The cross reference from Cataloguing alone may make the difference
between a successful and an unsuccessful search.
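As a rough sketch of how the extra information in such a record could be turned into access points, the following Python reads the simplified tagged layout used in Figure 1. It assumes one field per line with a three-digit tag, and it treats `$w g` on a 550 field as marking a broader term, following the MARC authority convention; it is not a real MARC parser, and the function name `access_points` is hypothetical.

```python
# Simplified text of the record in Figure 1 (the 010 and 040 fields are omitted).
RECORD = """\
053 0 Z693 $b Z695.83
150 Cataloging
450 Cataloguing
550 Information organization $w g
550 Technical services (Libraries) $w g
550 Books
"""


def access_points(record_text):
    """Group the fields of a simplified authority record by their role."""
    points = {"preferred": [], "see_from": [], "broader": [],
              "related": [], "class_numbers": []}
    for line in record_text.splitlines():
        tag, _, rest = line.strip().partition(" ")
        if not rest:
            continue
        # Text before the first "$" is the main value; the rest are subfields.
        first, *subfields = [part.strip() for part in rest.split("$")]
        if tag == "150":
            points["preferred"].append(first)
        elif tag == "450":
            points["see_from"].append(first)
        elif tag == "550":
            broader = any(sub.split() == ["w", "g"] for sub in subfields)
            points["broader" if broader else "related"].append(first)
        elif tag == "053":
            # Drop the leading indicator digit, then collect $b values.
            points["class_numbers"].append(first.split()[-1])
            points["class_numbers"].extend(
                sub[2:].strip() for sub in subfields if sub.startswith("b "))
    return points


points = access_points(RECORD)
search_keys = points["preferred"] + points["see_from"]
print(points)
print(search_keys)
```

Indexing `search_keys` rather than only the preferred heading is what lets a query for Cataloguing reach the heading Cataloging.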
Markey uses the definition of search trees originally given by Mitev, Venner, and Walker 6,7:
"a set of paths with branches or choices, which enables the system to carry out the most sensible
search function at each stage of the search." As an example of the utility of search trees, Markey
uses a search for "women in history" which generates a list of potentially relevant subject
headings. 8 The beginning of the resulting list is shown in figure 2:
<Figure 2. Search Tree from Markey's "women in history" Example*.>
* Since the Markey paper was published, "African American" has replaced "Afro-American" as the
preferred term in LCSH; systems that use authorities intelligently could revise the terminology
automatically, relieving the searcher of that burden.
Only after the user selects a subject heading from the list would the bibliographic record(s) associated
with the subject heading be displayed. In its simplest form, the search tree would create a list of
potentially relevant subject headings as an intermediate step:
Query --> Subject headings --> Bibliographic records
While a variety of methodologies can be used to create search trees, they share the common
objective of presenting the search results to the user in a number of small manageable steps. At
each step, the user can navigate by choosing among the options offered. None of the approaches
for creating search trees in the OPAC environment described in the literature use authority files
directly but instead rely on indexing the subject headings found in the bibliographic records.
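The two-step flow (query, then subject headings, then bibliographic records) can be sketched as follows. This is a toy illustration under stated assumptions: the authority data reuses headings quoted earlier in this paper, the bibliographic titles are invented placeholders, and the function names are hypothetical.

```python
import re

# Toy authority file: preferred heading -> cross references.
AUTHORITY_FILE = {
    "Cataloging": ["Cataloguing"],
    "Authority files (Information retrieval)": ["Library authority files"],
}

# Invented placeholder records; only the subject links matter here.
BIB_RECORDS = [
    {"title": "Hypothetical title A", "subjects": ["Cataloging"]},
    {"title": "Hypothetical title B",
     "subjects": ["Authority files (Information retrieval)"]},
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_headings(query, authority_file):
    """Step 1: search the authority file, including cross references."""
    wanted = set(tokenize(query))
    matches = []
    for heading, variants in authority_file.items():
        for text in [heading, *variants]:
            if wanted <= set(tokenize(text)):
                matches.append(heading)
                break
    return matches


def find_records(heading, bib_records):
    """Step 2: retrieve records that carry the selected heading."""
    return [rec["title"] for rec in bib_records if heading in rec["subjects"]]


headings = find_headings("cataloguing", AUTHORITY_FILE)
print(headings)
print(find_records(headings[0], BIB_RECORDS))
```

The query word appears only in a cross reference, yet step 1 still returns the preferred heading, which step 2 then uses to reach the bibliographic record.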
searchFAST
searchFAST is a prototype that uses an authority file as the source of subject headings rather than
the headings from bibliographic records. OCLC, in cooperation with the Library of Congress,
developed FAST, a faceted subject heading schema derived from LCSH. FAST is well suited
for this application since (1) FAST is faceted and faceted schemas generate simpler search trees,
and (2) all FAST headings are established and have authority records. Through searchFAST,
available at http://fast.oclc.org/searchfast/, OCLC provides free public access to the FAST
authority file. In addition to identifying and validating subject headings for catalogers,
searchFAST can be used to search WorldCat.org. searchFAST complements other methods of
accessing WorldCat.org.
Upon starting searchFAST, the user is greeted with the option to enter a simple query. Any
search term or phrase can be entered to begin the search. The initial search is a keyword search
across all facets. The autosuggest feature displays potential matches as the query is entered.
After the initial search, options to refine the search are available.
A wide variety of access points in the FAST authority record are indexed, and three
different types of indexes are provided: keyword, phrase, and miscellaneous indexes. The
keyword indexes store individual words extracted from the specified fields. The phrase indexes
provide access to the complete field, and the miscellaneous indexes support access to particular
data elements. A detailed guide to using searchFAST is available on the FAST website.9
Searching
A search using the default Keywords in All Headings option will retrieve all FAST headings that
contain the words from the query. This option assumes a Boolean AND relationship between all
the words in the query and retrieves all authority records with headings or cross-references
containing those words. This default is often a good choice since it is simple and generally
works well. For example, a search for “sailboats” will result in the display shown in figure 3 †.
<Figure 3. Search Results for "sailboats".>
† Search results will vary depending on when the search is performed. Both the number of headings
matching a query and the number of uses generally will increase over time. It is therefore likely that attempts to
replicate the examples used in this paper will produce results different from those shown here.
Stemming is used in query processing. The results here are actually for the root word "sailboat"
so the queries "sailboat", "sailboats", and "sailboating" are equivalent.
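A minimal sketch of this behavior, assuming a toy suffix-stripping stemmer in place of whatever stemmer searchFAST actually uses (the paper does not name one), is shown below. The short heading list is an illustrative sample, and `stem` and `keyword_search` are hypothetical names.

```python
import re

# Illustrative sample of headings; "--" stands in for the subdivision dash.
HEADINGS = [
    "Sailboats",
    "Sailboats--Hydrodynamics",
    "Sailboat racing",
    "Motorboats",
]


def stem(word):
    """Toy stemmer: strip a trailing "ing" or "s" (an assumption, not the real one)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    """Return the set of stemmed words in a heading or query."""
    return {stem(word) for word in re.findall(r"[a-z0-9]+", text.lower())}


def keyword_search(query, headings):
    """Boolean AND: keep headings that contain every stemmed query word."""
    wanted = stems(query)
    return [heading for heading in headings if wanted <= stems(heading)]


print(keyword_search("sailboats", HEADINGS))
print(keyword_search("sailboating", HEADINGS))
print(keyword_search("sailboat hydrodynamics", HEADINGS))
```

Because both the query and the headings are reduced to stems before comparison, the three query forms mentioned above match the same headings in this sketch.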
FAST authority records contain information on the frequency of assignment in OCLC's
WorldCat. This frequency is used for the default ranking so the most popular headings appear
first. However, other options for ordering the display are available; clicking on Heading will list
the headings alphabetically and clicking on Facet will group the headings by facet. Individual
authority records can be displayed by clicking on the heading. Selecting the heading
Sailboats—Hydrodynamics produces the display shown in figure 4.
<Figure 4. FAST Authority Record for Sailboats—Hydrodynamics.>
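The three ordering options described above (by usage, by heading, and by facet) amount to different sort keys over the same result list. The sketch below assumes each result carries a heading, a facet name, and a usage count; the usage counts and facet assignments are invented for illustration and do not reflect WorldCat data.

```python
# Invented sample results; the usage counts are placeholders.
results = [
    {"heading": "Sailboats--Hydrodynamics", "facet": "Topical", "usage": 40},
    {"heading": "Sailboats", "facet": "Topical", "usage": 900},
    {"heading": "Sailboat racing", "facet": "Topical", "usage": 300},
    {"heading": "Sailboat (Example title)", "facet": "Uniform Title", "usage": 5},
]


def by_usage(items):
    """Default ranking: most frequently assigned headings first."""
    return sorted(items, key=lambda item: item["usage"], reverse=True)


def by_heading(items):
    """Alphabetical listing, ignoring case."""
    return sorted(items, key=lambda item: item["heading"].lower())


def by_facet(items):
    """Group headings by facet, keeping usage order inside each group."""
    groups = {}
    for item in by_usage(items):
        groups.setdefault(item["facet"], []).append(item["heading"])
    return groups


print([item["heading"] for item in by_usage(results)])
print([item["heading"] for item in by_heading(results)])
print(by_facet(results))
```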
searchFAST supports two display levels; the brief record shown above is the default but a full
MARC display is available by clicking on View MARC.
FAST authorities are available as linked data and the full record of searchFAST uses the linked
data service. These links appear in the bibliographic record display under LINKS TO FULL
RECORD. The default link is shown first, with alternate formats following. For machine usage,
the desired response format can be requested, retrieving just the raw data. These links are
permanent, and can be embedded in other systems.
In addition to the keyword index, searchFAST supports four phrase indexes: (1) a heading index,
(2) a subfield index, (3) a source heading index, and (4) a see also reference index. Searches for
a known heading are most effective using the heading index since this type of search almost
always results in a single matching heading. Searches can also be limited to a single subfield.
For example, searching “education, higher” against the subfield index will retrieve all headings
with either Education, Higher as a main heading or Education (Higher) as a subdivision. The
initial results for this search are shown in figure 5.
<Figure 5. Search Results for "education, higher".>
The source heading index is used to retrieve FAST headings by searching for the LCSH heading
from which the FAST heading was derived. Among other uses, this index is very helpful to
identify the FAST headings equivalent to an LCSH heading. The FAST heading corresponding
to the LCSH heading Little Traverse Bay (Mich.) can be found by searching the Full LC
Source Heading index for “little traverse bay (mich.)”. The search retrieves two FAST headings
since both FAST headings were derived from the same LCSH heading:
Lake Michigan—Little Traverse Bay
Michigan—Little Traverse Bay Region
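Conceptually, the source heading index is a phrase lookup from a normalized LCSH heading to the FAST headings derived from it. The sketch below assumes simple case and whitespace normalization and uses only the example above; the pair list and function names are hypothetical.

```python
from collections import defaultdict

# (LCSH source heading, derived FAST heading) pairs from the example above.
DERIVATIONS = [
    ("Little Traverse Bay (Mich.)", "Lake Michigan--Little Traverse Bay"),
    ("Little Traverse Bay (Mich.)", "Michigan--Little Traverse Bay Region"),
]


def normalize(phrase):
    """Phrase indexes match the whole field, so only case and spacing are folded."""
    return " ".join(phrase.lower().split())


def build_source_index(pairs):
    """Map each normalized LCSH heading to every FAST heading derived from it."""
    index = defaultdict(list)
    for lcsh, fast in pairs:
        index[normalize(lcsh)].append(fast)
    return index


def fast_headings_for(lcsh, index):
    """Look up the FAST headings equivalent to one LCSH heading."""
    return index.get(normalize(lcsh), [])


source_index = build_source_index(DERIVATIONS)
print(fast_headings_for("little traverse bay (mich.)", source_index))
```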
The “See Also” reference index can be used to identify all authority records containing a
specific term or phrase. Searching the index will retrieve records with references to either
related or broader terms. A search on "philosophy" in this index retrieves 270 headings as
shown in figure 6.
<Figure 6. Search Results for "philosophy".>
In addition to the keyword and phrase indexes, searchFAST also supports specialized indexes
such as FAST Authority Record Number and LCCN. The LCCN index is used to identify a
FAST heading by searching on the LCCN of the LCSH heading from which it was derived.
Managing the Resulting Lists
When a long list of headings is obtained, several options are available to narrow or manage the
results. For example, the keyword search for "history" produces over 3,500 results as shown in
figure 7.
<Figure 7. Search Results for "history".>
Using the Limit Results by drop-down list to select the “Uniform Title” facet reduces this list to
about 500, since only the headings for uniform titles will be shown.
Both the keyword and phrase indexes have an autosuggest feature. As the query is entered, a list
of suggested headings is displayed. This list is refined with each keystroke, based on the string
similarity between the indexed words or phrases and the string being entered, and on their frequency.
Both the keyword and phrase indexes use the same list, but the list is limited to the facet or index being searched.
For example, typing “pott” while Limit Results by is set to Uniform Title displays several
Harry Potter and Beatrix Potter titles, any one of which can be selected.
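One plausible way to produce such suggestions is sketched below. It approximates string similarity by word-prefix matching and breaks ties by frequency; the paper does not specify the actual similarity measure, so both choices are assumptions. The entries, their facets, and their usage counts are invented placeholders rather than real FAST headings.

```python
import re

# Invented entries; heading strings and counts are placeholders.
ENTRIES = [
    {"heading": "Harry Potter (Example series)", "facet": "Uniform Title", "usage": 800},
    {"heading": "Tale of Peter Rabbit (Potter, Beatrix)", "facet": "Uniform Title", "usage": 350},
    {"heading": "Pottery", "facet": "Topical", "usage": 5000},
    {"heading": "Odyssey (Example title)", "facet": "Uniform Title", "usage": 900},
]


def suggest(typed, entries, facet=None, limit=5):
    """Suggest headings with a word starting with the typed text, most used first."""
    prefix = typed.lower()
    hits = []
    for entry in entries:
        if facet is not None and entry["facet"] != facet:
            continue  # the list is limited to the facet being searched
        words = re.findall(r"[a-z0-9]+", entry["heading"].lower())
        if any(word.startswith(prefix) for word in words):
            hits.append(entry)
    hits.sort(key=lambda entry: entry["usage"], reverse=True)
    return [entry["heading"] for entry in hits[:limit]]


print(suggest("pott", ENTRIES, facet="Uniform Title"))
print(suggest("pott", ENTRIES))
```

Calling `suggest` again after each keystroke with the longer prefix narrows the list, which mirrors the refinement described above.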
searchFAST supports inline Boolean searches, although only on a single keyword index at a
time. The Booleans AND, OR, and NOT are supported and must be entered as capitals.
Parentheses can be used to combine terms. The query "dog OR cat" retrieves 952 records as
shown in figure 8.
<Figure 8. Search Results for "dog OR cat".>
The results can be modified to eliminate headings containing either “diseases” or “training”
using the query "dog OR cat NOT (disease OR training)", which eliminates more than 50
headings.
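The inline Boolean syntax can be illustrated with a small evaluator over a keyword index. This sketch makes several assumptions that the paper does not confirm: operators are applied left to right without precedence (which reproduces the reading of the example above, where NOT removes headings from the whole "dog OR cat" result), adjacent terms without an operator are ANDed, and a toy plural stripper stands in for real stemming. The heading list is an illustrative sample.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}  # recognized only in capitals

HEADINGS = ["Dogs", "Cats", "Dogs--Diseases", "Dogs--Training",
            "Cats--Diseases", "Horses--Training"]


def stem(word):
    """Toy plural stripper standing in for real stemming (an assumption)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(headings):
    """Map each stemmed keyword to the set of headings containing it."""
    index = {}
    for heading in headings:
        for word in re.findall(r"[a-z0-9]+", heading.lower()):
            index.setdefault(stem(word), set()).add(heading)
    return index


def boolean_search(query, index):
    """Evaluate AND, OR, NOT, and parentheses from left to right."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] == ")" or tokens[pos] in OPERATORS:
            raise ValueError("expected a search term or '('")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        return set(index.get(stem(token.lower()), set()))

    def expression():
        nonlocal pos
        result = operand()
        while pos < len(tokens) and tokens[pos] != ")":
            operator = "AND"  # adjacent terms are ANDed by default
            if tokens[pos] in OPERATORS:
                operator = tokens[pos]
                pos += 1
            right = operand()
            if operator == "AND":
                result = result & right
            elif operator == "OR":
                result = result | right
            else:  # NOT removes the right-hand matches
                result = result - right
        return result

    result = expression()
    if pos != len(tokens):
        raise ValueError("unmatched ')'")
    return result


index = build_index(HEADINGS)
print(sorted(boolean_search("dog OR cat", index)))
print(sorted(boolean_search("dog OR cat NOT (disease OR training)", index)))
```

A lowercase "or" is treated as an ordinary keyword in this sketch, which is consistent with the requirement that operators be entered in capitals.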
Two special terms are defined to aid searching for geographic headings. Most FAST geographic
records, and many FAST event records, include a geographic area code (GAC), such as n-us-mi for the state of Michigan. The normalization process for building keyword indexes would
obfuscate GACs, rendering them useless as a search aid. Therefore, a special pattern was defined
to allow the use of GACs in searches; GACs must be prefixed with "gac:". Entering "gac:n-us-mi" will retrieve all authority records containing the GAC for Michigan. Keywords can be
added to GAC queries to refine the search. For example, the query “gac:n-us-mi lake” retrieves
49 records, all of which are headings that include the word 'lake' and also have the Michigan GAC. It is
important to note that not all the headings retrieved by that search represent lakes; headings
such as Michigan—Lake County will also be retrieved.
To improve precision when searching for particular geographic features, a second special term is
provided to support searching by geographic feature type. Many FAST geographic authority
records include the geographic feature type such as stream (includes rivers, creeks), lake,
populated place, etc. For searching, the feature type must be prefixed with "feature:". The query
"gac:n-us-mi feature:lake" retrieves 17 headings for lakes in Michigan, as shown in figure 9.
None of the Great Lakes were retrieved since each of the Great Lakes has its own GAC.
<Figure 9. Search Results for "gac:n-us-mi feature:lake".>
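The two special terms can be handled by separating prefixed tokens from ordinary keywords before matching, so that the GAC never passes through keyword normalization. The sketch below is an illustration under assumptions: the sample records, their GAC and feature values, and the function names are hypothetical, and single-word feature types are assumed (a type such as "populated place" would need extra handling).

```python
import re

# Illustrative records; feature is None where no type is assumed.
RECORDS = [
    {"heading": "Michigan--Houghton Lake", "gac": "n-us-mi", "feature": "lake"},
    {"heading": "Michigan--Lake County", "gac": "n-us-mi", "feature": None},
    {"heading": "Michigan--Grand River", "gac": "n-us-mi", "feature": "stream"},
    {"heading": "Ohio--Buckeye Lake", "gac": "n-us-oh", "feature": "lake"},
]


def parse_query(query):
    """Split a query into gac: terms, feature: terms, and plain keywords."""
    parsed = {"gac": [], "feature": [], "keywords": []}
    for token in query.lower().split():
        prefix, separator, value = token.partition(":")
        if separator and prefix in ("gac", "feature") and value:
            parsed[prefix].append(value)  # kept verbatim, hyphens included
        else:
            parsed["keywords"].append(token)
    return parsed


def search(query, records):
    """Return headings that satisfy every GAC, feature, and keyword term."""
    parsed = parse_query(query)
    hits = []
    for record in records:
        words = set(re.findall(r"[a-z0-9]+", record["heading"].lower()))
        if (all(code == record["gac"] for code in parsed["gac"])
                and all(kind == record["feature"] for kind in parsed["feature"])
                and all(word in words for word in parsed["keywords"])):
            hits.append(record["heading"])
    return hits


print(search("gac:n-us-mi lake", RECORDS))
print(search("gac:n-us-mi feature:lake", RECORDS))
```

The first query also returns the county heading because it matches on the word "lake" alone; the second relies on the feature type and so keeps only the lake.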
Retrieving WorldCat Records
While the procedure described above can be used to identify and assign FAST headings, it can
also be used to search WorldCat for bibliographic records. This can be an effective way to
perform a subject search in WorldCat. For example, if the simple query "fast" is used to initiate
a keyword search for resources on FAST subject headings, 178 headings are retrieved. Sorting
the list alphabetically (by clicking on Heading) and then moving to the second screen of the search
results locates the heading FAST subject headings. Selecting that heading from the list generates the display
shown in figure 10.
<Figure 10. Search Results for "fast".>
Selecting Find in WorldCat causes searchFAST to open a new browser window for WorldCat.org
and to search the subject index for bibliographic records with the heading FAST subject
headings. A partial list of the records retrieved is shown in figure 11.
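The hand-off from a selected heading to a subject search can be pictured as building a search URL. The sketch below is hypothetical: the base URL, the `q` parameter, and the `su:` subject-index prefix are assumptions made for illustration, not details given in this paper, and the actual link generated by searchFAST may differ.

```python
from urllib.parse import urlencode

# Assumed base URL and query syntax; not taken from the paper.
ASSUMED_BASE_URL = "https://www.worldcat.org/search"


def subject_search_url(heading, base_url=ASSUMED_BASE_URL):
    """Build a subject-search URL for one heading (hypothetical URL pattern)."""
    return f"{base_url}?{urlencode({'q': f'su:{heading}'})}"


print(subject_search_url("FAST subject headings"))
```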
<Figure 11. Bibliographic Records with: FAST subject headings.>
searchFAST is a prototype and as such lacks the features, consistency, and performance expected
in production systems. One of those limitations is that FAST subject headings have not yet been
systematically added to WorldCat bibliographic records. Therefore, the search in WorldCat for
FAST subject headings must be emulated and the results are not as precise as if FAST subject
headings were in the bibliographic records. The emulation process results in high recall but less
than ideal precision. In many cases, the number of hits in WorldCat will be higher than is
reflected in the WorldCat Subject usage field in the FAST authority records. In spite of this
limitation, the results are generally very acceptable and provide access not otherwise available.
The window opened by searchFAST to WorldCat.org provides the searcher with the full
functionality of WorldCat.org. Selecting any of the three links shown in figure 11 will display
the bibliographic data for the selected resource, a list of the institutions holding the resource, and
other relevant information. WorldCat.org also provides a number of options for refining the
search that are particularly helpful for queries generating a large number of hits. For resources
available online, WorldCat.org offers the option to view the resource. For example, clicking on
the View Now option for the third resource will link to the IFLA website to display the full text of
the paper on FAST presented at the 69th IFLA General Conference and Council in Berlin.
Conclusions
The searchFAST prototype demonstrates the effectiveness and practicality of using an authority
file as an index to a bibliographic file. This use leverages authority control work so that it
benefits searchers as well as catalogers. The default keyword searches generally produce very
satisfactory results by generating search trees that can be easily navigated. More sophisticated
searches are also supported, providing experienced searchers with options not found in other
systems.
010 sh 85020816
040 DLC $b eng $c DLC $d DLC
053 0 Z693 $b Z695.83
150 Cataloging
450 Cataloguing
550 Information organization $w g
550 Technical services (Libraries) $w g
550 Books
Figure 1. LCSH Authority Record for Cataloging.
Afro-American women—History
Afro-American women—History—19th century
Afro-American women—History—20th century
Afro-American women—Public opinion—History
Afro-American women—Southern States—History
Afro-American women—Southern States—History—19th century
American literature—Women authors—History and criticism
Indians of North America—Women—Biography—History and criticism
Italian American women—New York (N.Y.)—History
Figure 2. Search Tree from Markey's "women in history" Example.
Figure 3. Search Results for "sailboats".
Figure 4. FAST Authority Record for Sailboats—Hydrodynamics.
Figure 5. Search Results for "education, higher".
Figure 6. Search Results for "philosophy".
Figure 7. Search Results for "history".
Figure 8. Search Results for "dog OR cat".
Figure 9. Search Results for "gac:n-us-mi feature:lake".
Figure 10. Search Results for "fast".
Figure 11. Bibliographic Records with: FAST subject headings.
Biographical Notes
Edward T. O’Neill (oneill@oclc.org) is a Senior Research Scientist at OCLC Research and
project manager for the FAST project. Ed did his undergraduate work at Albion College and his
doctoral work at Purdue University in Operations Research. His research interests include
authority control, subject analysis, database quality, collection management, and bibliographic
relationships. He is active in IFLA and is a member of IFLA's Standing Committee of the
Classification and Indexing.
Rick Bennett (Rick_Bennett@oclc.org) is a Consulting Software Engineer in OCLC Research,
where he works on processing and manipulating bibliographic and authority data. He is currently
focusing on developing, maintaining, and displaying authority data for the FAST
project. Rick was an undergraduate at Penn State and a graduate student at Georgia Tech, where he
completed a program in Computer Engineering.
Kerre Kammerer (kammerer@oclc.org) is a Consulting Software Engineer in OCLC Research.
Kerre’s research interests include database quality control and authority data. Her current
research activities involve the creation and maintenance of FAST authority records and the
conversion of LC subject headings to FAST headings in bibliographic records. Kerre holds a BA
in Economics from DePauw University.
References
1. Taylor, Arlene G., and David P. Miller. Introduction to Cataloging and Classification. Westport, Conn: Libraries
Unlimited, 2006: 19.
2. Chan, Lois Mai, and Edward T. O'Neill. FAST: Faceted Application of Subject Terminology: Principles and
Application. Santa Barbara, CA: Libraries Unlimited, 2010: 261.
3. Subject Cataloging Manual: Subject Headings. Washington, DC: Cataloging Distribution Service, Library of
Congress, 1990. Continually updated resource.
4. Markey, Karen. "Twenty-five Years of End-User Searching, Part 1: Research Findings", Journal of the American
Society for Information Science and Technology, Vol. 58(8) 2007: 1079.
5. Markey, Karen. "Twenty-five Years of End-User Searching, Part 2: Future Research Directions", Journal of the
American Society for Information Science and Technology, Vol. 58(8) 2007: 1123.
6. Markey, Karen. "Failure Analysis of Subject Searches in a Test of a New Design for Subject Access to Online
Catalogs", Journal of the American Society for Information Science and Technology, Vol. 47(7) 1996: 520.
7. Mitev, Nathalie Nadia, Gillian M. Venner, and Stephen Walker. Designing an online public access catalogue:
Okapi, a catalogue on a local area network. London: British Library, 1985.
8. Markey, Karen. "Failure Analysis of Subject Searches in a Test of a New Design for Subject Access to Online
Catalogs", Journal of the American Society for Information Science and Technology, Vol. 47(7) 1996: 525-6.
9. searchFAST, OCLC Online Computer Library Center, http://fast.oclc.org/searchfast/searchFastHowto.pdf
(accessed February 9, 2012).