Using Authorities to Improve Subject Searches

Edward T. O'Neill, Rick Bennett, and Kerre Kammerer; OCLC Research

Abstract

Authority files have played an important role in improving the quality of indexing and subject cataloging. Although authorities can significantly improve search by increasing the number of access points, they are rarely an integral part of the information retrieval process, particularly for end-user searches. A retrieval prototype, searchFAST, was developed to test the feasibility of using an authority file as an index to bibliographic records. searchFAST uses FAST (Faceted Application of Subject Terminology) as an index to OCLC's WorldCat.org bibliographic database. The searchFAST methodology complements, rather than replaces, existing WorldCat.org access. The bibliographic file is searched indirectly: first the authority file is searched to identify appropriate subject headings, then the headings are used to retrieve the matching bibliographic records. The prototype demonstrates the effectiveness and practicality of using an authority file as an index. Searching the authority file leverages authority control work by increasing the number of access points while supporting a simple interface designed for end users.

Introduction

Authority files have traditionally been an integral part of cataloging. While there is broad agreement that controlled vocabularies and authority control significantly improve catalogs, authority control is costly and labor intensive. Although professional searchers often consult authorities, few end users have either the knowledge or the experience to use authority files effectively. Enabling end users to use authorities as search aids will require new approaches that better utilize the information in authority records without adding complexity. Unlocking the potential of authorities will necessitate reexamining the search process and exploring new approaches.

Authority Control

In its simplest form, authority control consists of (1) determining the form of terms used as access points and (2) verifying that the terms assigned are valid. This is usually accomplished by creating a separate authority record for each access point that identifies the preferred term. These records include cross references identifying common synonyms and information explaining the rationale for establishing the particular form for the access point. For example, the preferred term for "authority control" in the Library of Congress Subject Headings (LCSH) is Authority files (Information retrieval) with cross references identifying Authority control (Information retrieval), Authority files (Cataloging), Authority records (Information retrieval), Authority work (Information retrieval), and Library authority files as synonyms. The terms found in the cross references are not valid access points; the preferred term should be used instead. A resource describing authority work therefore should be assigned the term Authority files (Information retrieval) rather than Authority work or Authority records (Information retrieval).

Taylor defined authority control as "the result of the process of maintaining consistency in the verbal form (heading) used to represent an access point and the further process of showing the relationships among names, works, and subjects. It's accomplished through use of rules (in the case of names and titles), use of a controlled vocabulary, and reference to an authority file to create an authorized character string called a heading." 1 Chan and O'Neill emphasized that the "purpose of authority control is similar to that of the now ubiquitous spell checkers; authority control should detect incorrect headings and, when possible, correct or suggest corrections. Just as spell checkers use dictionaries to determine the correct spelling of a word, authority control relies on the authority file to determine the correct form of a heading." 2

End-User Searching

Subject heading schemas are often complex and difficult to use. The last printed manual for applying the Library of Congress Subject Headings (LCSH) required four volumes. 3 Librarians and other experienced users have the requisite skills to search controlled vocabularies effectively. One of the more significant trends in information retrieval has been the shift away from mediated searching to end-user searching. Since end users are often unfamiliar with controlled vocabularies, vocabularies such as LCSH have not necessarily improved the search experience for end users. Markey concluded that "End-user searches do not resemble the systematic approach of expert intermediary searchers" since end users make very limited use of Boolean logic. 4

The difficulties end users encounter with LCSH are well documented; many of these difficulties are attributable to its highly structured vocabulary and extensive application rules. Markey argues that "making online IR systems more complicated with additional functionality, frequent and unanticipated interruptions in the form of direct system intervention, and detailed instructions and tutorials in system use, is not the right way to proceed." 5 From the perspective of the typical end user, intuitiveness and ease of use are more important than functionality. Advanced features, while often demanded by experienced searchers, should not be an impediment to end users.

Authority File as an Index

Are controlled vocabularies synonymous with complexity? Do controlled vocabularies have a role in the new search environment or are they a relic of the card catalog era? While the jury is still out, many believe that controlled vocabularies remain relevant and should play a key role in future retrieval systems.
However, for that to happen, (1) the cost of using controlled vocabularies, i.e. authority control, will have to decrease significantly, and (2) new approaches to searching must be developed to better utilize the information from authority records without adding complexity. Although authority control is likely to remain a costly process, considerable progress has been made in automating the process to reduce its cost. The second condition, although equally important, is more problematic. If authority control and controlled vocabularies continue to be used primarily by catalogers as tools to improve indexing quality, it will be increasingly difficult to justify the costs. Unlocking the potential of authorities as a means to improve searching will require reexamining the search process, exploring new methodologies, and developing prototype environments in which to test the methodologies.

Historically, there was a close link between an authority record and its associated bibliographic records. In card catalogs, it was common practice to interfile cards for the authority records with the cards for the bibliographic records. Cards for the cross references would also be interfiled to redirect users to other parts of the catalog. With the advent of online catalogs, the direct linkage between the authorities and bibliographic records was lost. Authority files and bibliographic files are maintained by different agencies; subject authority files, such as LCSH, generally are maintained centrally while bibliographic files are maintained locally. There are significant benefits to linking authority and bibliographic files; one is that linking permits the authority file to function as an index to the bibliographic file.
This indirect approach offers two distinct advantages over searching the bibliographic file directly: (1) authorities contain a wealth of information not available in the bibliographic record and (2) the additional information can be used to generate search trees to aid navigation.

Authority records for schemas such as LCSH or FAST (Faceted Application of Subject Terminology) are rich sources of information about the subject and include many access points and navigation aids unavailable in the bibliographic record. Figure 1 shows the LCSH authority record for Cataloging:

<Figure 1. LCSH Authority Record for Cataloging.>

Even this relatively simple authority record contains information including the Library of Congress class numbers associated with the heading (Z693 and Z695.83), a cross reference from the British spelling (Cataloguing), links to broader terms (Information organization, and Technical services (Libraries)), and links to related terms (Books), none of which is found in the bibliographic record. This additional information can be used to enhance searching by providing additional access points. The cross reference from Cataloguing alone may make the difference between a successful and an unsuccessful search.

Markey uses the definition of search trees originally given by Mitev, Venner, and Walker 6,7: "a set of paths with branches or choices, which enables the system to carry out the most sensible search function at each stage of the search." As an example of the utility of search trees, Markey uses a search for "women in history" which generates a list of potentially relevant subject headings. 8 The beginning of the resulting list is shown in figure 2:

<Figure 2. Search Tree from Markey's "women in history" Example*.>

* Since the Markey paper was published, "African American" has replaced "Afro-American" as the preferred term for LCSH; systems intelligently using authorities could automatically revise the terminology, relieving the searcher of that burden.
Only after the user selects a subject heading from the list would the bibliographic record(s) associated with the subject heading be displayed. In its simplest form, the search tree would create a list of potentially relevant subject headings as an intermediate step such as:

Query --> Subject headings --> Bibliographic records

While a variety of methodologies can be used to create search trees, they share the common objective of presenting the search results to the user in a number of small manageable steps. At each step, the user can navigate by choosing among the options offered. None of the approaches for creating search trees in the OPAC environment described in the literature uses authority files directly; they instead rely on indexing the subject headings found in the bibliographic records.

searchFAST

searchFAST is a prototype that uses an authority file as the source of subject headings rather than the headings from bibliographic records. OCLC, in cooperation with the Library of Congress, developed FAST, a faceted subject heading schema derived from LCSH. FAST is well suited for this application since (1) FAST is faceted and faceted schemas generate simpler search trees, and (2) all FAST headings are established and have authority records.

Through searchFAST, available at http://fast.oclc.org/searchfast/, OCLC provides free public access to the FAST authority file. In addition to identifying and validating subject headings for catalogers, searchFAST can be used to search WorldCat.org. searchFAST complements other methods of accessing WorldCat.org.

Upon starting searchFAST, the user is greeted with the option to enter a simple query. Any search term or phrase can be entered to begin the search. The initial search is a keyword search across all facets. The autosuggest feature displays potential matches as the query is entered. After the initial search, options to refine the search are available.
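
The indirect search path described above (query, then subject headings, then bibliographic records) can be illustrated with a minimal sketch. It is not the searchFAST implementation: the record structure, the function names find_headings and find_records, and the bibliographic titles are all invented for illustration. Only the two headings and their cross references come from the examples earlier in this paper. The sketch assumes a keyword match that requires every query word to appear in a heading or one of its cross references, and it ranks headings by how many bibliographic records use them.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    """Hypothetical, heavily simplified authority record."""
    heading: str                                     # preferred term
    cross_refs: list = field(default_factory=list)   # non-preferred synonyms


# Headings and cross references taken from the examples in the text.
AUTHORITIES = [
    AuthorityRecord("Cataloging", ["Cataloguing"]),
    AuthorityRecord(
        "Authority files (Information retrieval)",
        ["Authority control (Information retrieval)", "Library authority files"],
    ),
]

# Invented bibliographic records; only the preferred headings are assigned.
BIB_RECORDS = [
    {"title": "Invented title A", "subjects": ["Cataloging"]},
    {"title": "Invented title B",
     "subjects": ["Cataloging", "Authority files (Information retrieval)"]},
]


def words(text):
    """Lowercased set of alphabetic words in a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def usage(heading):
    """Number of bibliographic records that carry the heading."""
    return sum(heading in rec["subjects"] for rec in BIB_RECORDS)


def find_headings(query):
    """Step 1: search the authority file, including cross references."""
    wanted = words(query)
    if not wanted:
        return []
    hits = [
        rec for rec in AUTHORITIES
        if wanted <= words(" ".join([rec.heading, *rec.cross_refs]))
    ]
    # Most frequently used headings first.
    return sorted(hits, key=lambda rec: usage(rec.heading), reverse=True)


def find_records(heading):
    """Step 2: use the chosen heading to retrieve bibliographic records."""
    return [rec["title"] for rec in BIB_RECORDS if heading in rec["subjects"]]


if __name__ == "__main__":
    for rec in find_headings("cataloguing"):
        print(rec.heading, find_records(rec.heading))
```

In this sketch a query for the British spelling still reaches records indexed under Cataloging, because the match is made against the authority record rather than against the bibliographic record.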

A wide variety of possible access points in the FAST authority record is indexed and three different types of indexes are provided: keyword, phrase, and miscellaneous indexes. The keyword indexes store individual words extracted from the specified fields. The phrase indexes provide access to the complete field, and the miscellaneous indexes support access to particular data elements. A detailed guide to using searchFAST is available on the FAST website. 9

Searching

A search using the default Keywords in All Headings option will retrieve all FAST headings that contain the words from the query. This option assumes a Boolean AND relationship between all the words in the query and retrieves all authority records with headings or cross references containing those words. This default is often a good choice since it is simple and generally works well. For example, a search for "sailboats" will result in the display shown in figure 3 †.

<Figure 3. Search Results for "sailboats".>

† Search results will vary depending on when the search is performed. Both the number of headings matching a query and the number of uses generally will increase over time. It's therefore likely that attempts to replicate the examples used in this paper will produce different results than those shown here.

Stemming is used in query processing. The results here are actually for the root word "sailboat", so the queries "sailboat", "sailboats", and "sailboating" are equivalent. FAST authority records contain information on the frequency of assignment in OCLC's WorldCat. This frequency is used for the default ranking so the most popular headings appear first. However, other options for ordering the display are available; clicking on Heading will list the headings alphabetically and clicking on Facet will group the headings by facet.

Individual authority records can be displayed by clicking on the heading. Selecting the heading Sailboats—Hydrodynamics produces the display shown in figure 4.

<Figure 4. FAST Authority Record for Sailboats—Hydrodynamics.>

searchFAST supports two display levels; the brief record shown above is the default but a full MARC display is available by clicking on View MARC. FAST authorities are available as linked data and the full record of searchFAST uses the linked data service. These links appear in the bibliographic record display under LINKS TO FULL RECORD. The default link is shown first, with alternate formats following. For machine usage, the desired response format can be requested, retrieving just the raw data. These links are permanent, and can be embedded in other systems.

In addition to the keyword index, searchFAST supports four phrase indexes: (1) a heading index, (2) a subfield index, (3) a source heading index, and (4) a see also reference index. Searches for a known heading are most effective using the heading index since this type of search almost always results in a single matching heading.

Searches can also be limited to a single subfield. For example, searching "education, higher" against the subfield index will retrieve all headings with either Education, Higher as a main heading or Education (Higher) as a subdivision. The initial results for this search are shown in figure 5.

<Figure 5. Search Results for "education, higher".>

The source heading index is used to retrieve FAST headings by searching for the LCSH heading from which the FAST heading was derived. Among other uses, this index is very helpful for identifying the FAST headings equivalent to an LCSH heading. The FAST heading corresponding to the LCSH heading Little Traverse Bay (Mich.) can be found by searching the Full LC Source Heading index for "little traverse bay (mich.)".
The search retrieves two FAST headings since both FAST headings were derived from the same LCSH heading:

Lake Michigan—Little Traverse Bay
Michigan—Little Traverse Bay Region

The "See Also" reference index can be used to identify all authority records containing a specific term or phrase. Searching the index will retrieve records with references to either related or broader terms. A search on "philosophy" in this index retrieves 270 headings as shown in figure 6.

<Figure 6. Search Results for "philosophy".>

In addition to the keyword and phrase indexes, searchFAST also supports specialized indexes such as FAST Authority Record Number and LCCN. The LCCN index is used to identify a FAST heading by searching on the LCCN of the LCSH heading from which it was derived.

Managing the Resulting Lists

When a long list of headings is obtained, several options are available to narrow or manage the results. For example, the keyword search for "history" produces over 3,500 results as shown in figure 7.

<Figure 7. Search Results for "history".>

Using the Limit Results by drop-down to select the "Uniform Title" facet reduces this list to about 500 since only the headings for uniform titles will be shown.

Both the keyword and phrase indexes have an autosuggest feature. As the query is entered, a list of suggested headings is displayed. This list is refined with each keystroke, based on the similarity of the indexed words or phrases to the string being entered and on their frequency. Both the keyword and phrase indexes use the same list, but the list is limited to the facet or index being searched. For example, typing "pott" while Limit Results by is set to Uniform Title displays several Harry Potter and Beatrix Potter titles, any one of which can be selected.

searchFAST supports inline Boolean searches, although only on a single keyword index at a time. The Boolean operators AND, OR, and NOT are supported and must be entered in capitals. Parentheses can be used to combine terms.
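
As a rough illustration of how such inline Boolean queries might be evaluated against a keyword index, the following sketch parses uppercase AND, OR, and NOT with parentheses and combines sets of matching heading identifiers. Several points are assumptions rather than documented searchFAST behavior: operators are applied strictly left to right with equal precedence, adjacent words without an operator are joined with AND, and a crude plural-stripping function stands in for real stemming. The headings in the toy index are invented.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

# Invented toy headings, keyed by a made-up identifier.
HEADINGS = {
    1: "Dogs",
    2: "Cats",
    3: "Dogs--Diseases",
    4: "Cats--Training",
    5: "Horses",
}


def stem(word):
    """Crude stand-in for stemming: strip a trailing plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_index(headings):
    """Map each stemmed keyword to the set of heading ids containing it."""
    index = {}
    for hid, text in headings.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(stem(word), set()).add(hid)
    return index


INDEX = build_index(HEADINGS)


def evaluate(query, index=INDEX):
    """Evaluate a Boolean keyword query and return matching heading ids."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            inner = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return inner
        if tok == ")" or tok in OPERATORS:
            raise ValueError(f"unexpected token: {tok}")
        # Only capitalized operators are special; "or" is an ordinary word.
        return set(index.get(stem(tok.lower()), set()))

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] != ")":
            op = "AND"  # assumed default between adjacent words
            if tokens[pos] in OPERATORS:
                op = tokens[pos]
                pos += 1
            right = parse_term()
            if op == "AND":
                result &= right
            elif op == "OR":
                result |= right
            else:  # NOT removes the right-hand matches
                result -= right
        return result

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses")
    return result


if __name__ == "__main__":
    for hid in sorted(evaluate("dog OR cat NOT (disease OR training)")):
        print(HEADINGS[hid])
```

Under the left-to-right assumption, a trailing NOT clause removes matches from everything accumulated before it, which is one plausible reading of how a NOT group narrows an OR query.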
The query "dog OR cat" retrieves 952 records as shown in figure 8.

<Figure 8. Search Results for "dog OR cat".>

The results can be modified to eliminate headings containing either "diseases" or "training" using the query "dog OR cat NOT (disease OR training)", which eliminates more than 50 headings.

Two special terms are defined to aid searching for geographic headings. Most FAST geographic records, and many FAST event records, include a geographic area code (GAC), such as n-us-mi for the state of Michigan. The normalization process for building keyword indexes would obfuscate GACs, rendering them useless as a search aid. Therefore, a special pattern was defined to allow the use of GACs in searches; GACs must be prefixed with "gac:". Entering "gac:n-us-mi" will retrieve all authority records containing the GAC for Michigan.

Keywords can be added to GAC queries to refine the search. For example, the query "gac:n-us-mi lake" retrieves 49 records, all of which are headings that include the word "lake" and also have the Michigan GAC. It is important to note that not all the headings retrieved by that search represent lakes; headings such as Michigan—Lake County will also be retrieved.

To improve precision when searching for particular geographic features, a second special term is provided to support searching by geographic feature type. Many FAST geographic authority records include the geographic feature type such as stream (includes rivers, creeks), lake, populated place, etc. For searching, the feature type must be prefixed with "feature:". The query "gac:n-us-mi feature:lake" retrieves 17 headings for lakes in Michigan, as shown in figure 9. None of the Great Lakes were retrieved since each of the Great Lakes has its own GAC.

<Figure 9. Search Results for "gac:n-us-mi feature:lake".>

Retrieving WorldCat Records

While the procedure described above can be used to identify and assign FAST headings, it can also be used to search WorldCat for bibliographic records. This can be an effective way to perform a subject search in WorldCat. For example, if the simple query "fast" is used to initiate a keyword search for resources on FAST subject headings, 178 headings are retrieved. Sorting the list alphabetically (by clicking on Heading), then clicking to the second screen of the search results finds FAST subject headings. Selecting that heading from the list generates the display shown in figure 10.

<Figure 10. Search Results for "fast".>

Selecting Find in WorldCat causes searchFAST to open a new browser window for WorldCat.org and to search the subject index for bibliographic records with the heading FAST subject headings. A partial list of the records retrieved is shown in figure 11.

<Figure 11. Bibliographic Records with: FAST subject headings.>

searchFAST is a prototype and as such lacks the features, consistency, and performance expected in production systems. One of those limitations is that FAST subject headings have not yet been systematically added to WorldCat bibliographic records. Therefore, the search in WorldCat for FAST subject headings must be emulated and the results are not as precise as they would be if FAST subject headings were in the bibliographic records. The emulation process results in high recall but less than ideal precision. In many cases, the number of hits in WorldCat will be higher than is reflected in the WorldCat Subject usage field in the FAST authority records. In spite of this limitation, the results are generally very acceptable and provide access not otherwise available.

The window opened by searchFAST to WorldCat.org provides the searcher with the full functionality of WorldCat.org. Selecting any of the three links shown in figure 11 will display the bibliographic data for the selected resource, a list of the institutions holding the resource, and other relevant information.
WorldCat.org also provides a number of options for refining the search that are particularly helpful for queries generating a large number of hits. For resources available online, WorldCat.org offers the option to view the resource. For example, clicking on the View Now option for the third resource will link to the IFLA website to display the full text of the paper on FAST presented at the 69th IFLA General Conference and Council in Berlin.

Conclusions

The searchFAST prototype demonstrates the effectiveness and practicality of using an authority file as an index to a bibliographic file. This use leverages authority control work so that it benefits searchers as well as catalogers. The default keyword searches generally produce very satisfactory results by generating search trees that can be easily navigated. More sophisticated searches are also supported, providing experienced searchers with options not found in other systems.

Figure 1. LCSH Authority Record for Cataloging.

    010    sh 85020816
    040    DLC $b eng $c DLC $d DLC
    053 0  Z693 $b Z695.83
    150    Cataloging
    450    Cataloguing
    550    Information organization $w g
    550    Technical services (Libraries) $w g
    550    Books

Figure 2. Search Tree from Markey's "women in history" Example.

    Afro-American women—History
    Afro-American women—History—19th century
    Afro-American women—History—20th century
    Afro-American women—Public opinion—History
    Afro-American women—Southern States—History
    Afro-American women—Southern States—History—19th century
    American literature—Women authors—History and criticism
    Indians of North America—Women—Biography—History and criticism
    Italian American women—New York (N.Y.)—History

Figure 3. Search Results for "sailboats".
Figure 4. FAST Authority Record for Sailboats—Hydrodynamics.
Figure 5. Search Results for "education, higher".
Figure 6. Search Results for "philosophy".
Figure 7. Search Results for "history".
Figure 8. Search Results for "dog OR cat".
Figure 9. Search Results for "gac:n-us-mi feature:lake".
Figure 10. Search Results for "fast".
Figure 11. Bibliographic Records with: FAST subject headings.

Biographical Notes

Edward T. O'Neill (oneill@oclc.org) is a Senior Research Scientist at OCLC Research and project manager for the FAST project. Ed did his undergraduate work at Albion College and his doctoral work at Purdue University in Operations Research. His research interests include authority control, subject analysis, database quality, collection management, and bibliographic relationships. He is active in IFLA and is a member of IFLA's Standing Committee of the Classification and Indexing.

Rick Bennett (Rick_Bennett@oclc.org) is a Consulting Software Engineer in OCLC Research, where he works on processing and manipulating bibliographic and authority data. Currently he is focusing on developing, maintaining, and displaying authority data for the FAST project. Rick was an undergraduate at Penn State and a graduate student at Georgia Tech, where he completed a program in Computer Engineering.

Kerre Kammerer (kammerer@oclc.org) is a Consulting Software Engineer in OCLC Research. Kerre's research interests include database quality control and authority data. Her current research activities involve the creation and maintenance of FAST authority records and the conversion of LC subject headings to FAST headings in bibliographic records. Kerre holds a BA in Economics from DePauw University.

References

1 Taylor, Arlene G., and David P. Miller. Introduction to Cataloging and Classification. Westport, Conn: Libraries Unlimited, 2006: 19.
2 Chan, Lois Mai, and Edward T. O'Neill. FAST: Faceted Application of Subject Terminology: Principles and Application. Santa Barbara, CA: Libraries Unlimited, 2010: 261.
3 Subject Cataloging Manual: Subject Headings. Washington, DC: Cataloging Distribution Service, Library of Congress, 1990. Continually updated resource.
4 Markey, Karen. "Twenty-five Years of End-User Searching, Part 1: Research Findings", Journal of the American Society for Information Science and Technology, Vol. 58(8) 2007: 1079.
5 Markey, Karen. "Twenty-five Years of End-User Searching, Part 2: Future Research Directions", Journal of the American Society for Information Science and Technology, Vol. 58(8) 2007: 1123.
6 Markey, Karen. "Failure Analysis of Subject Searches in a Test of a New Design for Subject Access to Online Catalogs", Journal of the American Society for Information Science and Technology, Vol. 47(7) 1996: 520.
7 Mitev, Nathalie Nadia, Gillian M. Venner, and Stephen Walker. Designing an online public access catalogue: Okapi, a catalogue on a local area network. London: British Library, 1985.
8 Markey, Karen. "Failure Analysis of Subject Searches in a Test of a New Design for Subject Access to Online Catalogs", Journal of the American Society for Information Science and Technology, Vol. 47(7) 1996: 525-6.
9 searchFAST, OCLC Online Computer Library Center, http://fast.oclc.org/searchfast/searchFastHowto.pdf (accessed February 9, 2012)