How Natural Language Processing Can Benefit Libraries
Arianna L. Schlegel
Southern Connecticut State University
Abstract
This paper will explore the current lack of natural language processing (NLP) features in online
public access catalogs (OPACs). To do so, it will first examine advances recently made in online
search engine development which have incorporated NLP, and then will look at the less
significant developments made in OPAC research which also involve NLP. There are currently
large discrepancies between the capabilities of online search engines and OPACs, with the
former being several strides ahead of the latter. This paper recognizes the differences, and
concludes with suggestions on how to improve future research into OPAC development. These
include breaking away from the card catalog model, looking further into how humans approach a
search, and incorporating NLP approaches, which might include dialogue between computer and
human, more effective visual browsing, and word association.
Keywords: Natural language processing, OPACs, libraries.
Why is Natural Language Processing Important for Libraries?
Researchers have been studying artificial intelligence (AI) and its application to libraries
for several decades, ever since computers began to play a large part in the organization of library
resources. Artificial intelligence is the study of how to make computers imitate human beings as
closely as possible. Natural language processing (NLP) is a branch of AI, wherein researchers
aim to gather knowledge on how human beings understand and use language so that computer
systems can be made to understand and manipulate natural languages (Chowdhury, 2003). This
paper will attempt to emphasize how natural language processing has very important potential in
enhancing the searching capabilities of online library catalogs.
What is NLP and How Does it Relate to Human Language?
Computers and humans simply do not speak the same language. Due to their radically
different makeup, such a disconnect is inevitable. At their most basic, computers process in
binary, which means their language building blocks consist of only two things: electronic signals
in either an “on” or “off” state. Human, or natural, language at its most biological level is many,
many degrees more complex; science still does not wholly understand how the brain processes
the spoken and written word. However, for years researchers have been attempting to find a way
to reconcile the two: those studying artificial intelligence search for ways in which to program
computers so that machines can process – understand and respond to – human language.
It is true that computers in this day and age work with higher-level programming
languages such as C++ and Java, languages which employ English words and thus are much
closer to human language than is binary machine language. Yet even the most evolved
programming language is not fluently readable by the human eye, particularly one not trained in
computer science.
Computer languages are similar to human languages in that they are used to instruct and
inform the machine (insomuch as a computer can be considered “informed”). Therefore, the
study of NLP is both relevant and full of potential. In this, the twenty-first century, with the vast
speed and processing power of computers, and the potential for seemingly unstoppable gains in
computing capabilities, it is feasible to believe that computers can become so advanced that they
can largely emulate the incredibly complex language processes of a human being.
What Can NLP Do for Libraries?
One branch of NLP research attempts to develop computer interfaces that can take human
search queries in natural languages and process them, without requiring the user to be
constrained by “search terms” or concerned with word ambiguities, for instance. Ideally, a user
should be able to use a search engine as he would a human resource, such as a reference
librarian. Contrary to their nature, people in the twenty-first century have been forced, in their
use of computerized search engines, to conform to certain search standards which require only
snippets of natural language and very unnatural grammars. In a library catalog search, for
example, one might enter only a one-word subject, even if trying to answer a much more
complex and specific question. On the other hand, if one were to approach a reference librarian,
an entire sentence, revealing the true intent of the inquiry, would most likely be used. Therefore,
significant research has been done in the field of NLP, with the intent of making computerized
card catalog searches much more amenable to human language queries.
Currently, most Online Public Access Catalog (OPAC) software found in libraries is
considered “second generation,” offering subject searches which look solely for the occurrence
of a single search word as it appears in the title of a work or in the subject headings associated
with that work (Antelman, Lynema, & Pace, 2006; Borgman, 1996). This means that OPACs
remain on almost the same level as physical card catalogs, in terms of how limited a user’s
options are when trying various approaches in researching a topic. According to Loarer (1993),
most OPAC searches are for subjects, and the results come only from a simple word search
performed on the catalog. However, many believe that there is still a large, untapped potential for
OPACs to offer much more user-friendly and intuitive searches.
A Very Brief History of Online Catalogs
As would be expected, most online card catalogs began as just that: online replications of
the capabilities of the physical card catalog. Others allowed users to utilize Boolean search
terms to combine two or more words in one query. Second-generation designs, which appeared
in the 1980s and are still the standard today, simply offered the combined functionality of the
two previous kinds of OPACs. Natural language processing did not factor into these systems.
Therefore, it is clear that most libraries continue to offer very primitive online catalogs, despite
the advances occurring all around them in other areas of search engine research, most notably
that of internet searching. Library patrons have become used to search engines like Google, and
therefore approach an OPAC with the expectations of a similar user experience. Unfortunately,
Antelman et al. (2006) point out that online catalogs have remained largely stagnant for close to
twenty years, failing to keep up with advances in search technology due largely to the profession
overlooking the problem, for various reasons. Loarer (1993) even pointed out that “OPAC” is,
ironically, a homophone of the French word “opaque,” and remarked on the unfortunate
yet largely apt choice of acronyms. Clearly there is much that can be done to enhance the online
catalog experience. We will first take a look at how online search engines are beginning to take
advantage of natural language processing, and then examine how these developments might be
applied to better the OPAC user experience.
NLP and Online Search Engines
It is becoming very clear that online searching is replacing library searching for many
people, even those who are regular patrons of the library. This is largely due to the belief that
Internet search engines produce more numerous and more relevant query results than do OPACs.
While Google is currently the most popular search engine, there are several new NLP search
engine developments on the horizon that have great potential, and should be considered when
redesigning OPAC search engines.
While Google is not an NLP-based engine at its heart, it presents an interesting case study
to begin with, as it consists of a very interesting blend of more robust searching powers than
OPACs, while still constraining the searcher to very limited approaches. As Fox (2008) pointed
out, Google has its users typing in, on average, 2.8 words that represent the most vital points of a
search query; therefore, if one were interested in learning the winner of the Westminster
Kennel Club dog show in 1932, a typical Google search string might be “Westminster dog show
winner 1932.” A relevant and useful result is returned at the top of Google’s results page, but the
user is required to form the query in a very unnatural way. One certainly would not walk into a
library and use the same phrase, sans semantic fillers of any kind, when requesting the above
information of a librarian; it simply would not make sense in a face-to-face human interaction.
However, when using a traditional OPAC, the search would most likely start even more
generally, with a search perhaps under “dog shows” or “Westminster.” Therefore, Google does
allow for more specific searches, but the engine cannot process true requests, and therefore
remains limited in its capability to produce the most relevant results for a user. Its results are
also listed in order of relevance based on what Google’s algorithm deems most important; the
user is largely shuttled into the pages which happen to have the best results according to the
search engine’s “more links equals more importance” approach. Therefore, it is feasible that a
searcher might not ever find the information he is looking for; even if Google returns tens of
thousands of results, most users do not tend to look beyond the first several pages of results
before trying a new search or simply giving up the query as unanswerable (Ibid). So while
Google is a good starting point for NLP development, it certainly has a long way to go before it
can be considered to be processing true natural language.
WAG, deemed an online “answer extraction system” by Neumann and Xu (2004), was
developed by a team of German researchers, with the goal of answering questions, rather than
offering results related to a search term or phrase. This search engine, then, much more closely
resembles the work of a reference librarian, and offers a peek at what an OPAC might look like
were it to blend reference with general card catalog capabilities. The engine allows for the use of
more structured queries, which would typically include specific question words such as “who,”
“when,” “where,” etc. The search engine uses NLP to determine what type of response is being
looked for; if the query contains “when,” the user is most likely looking for a string containing a
date format, and phrases formatted only in that particular pattern are examined.
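The answer-typing step described in this paragraph can be sketched in a few lines of Python. The cue words, type labels, and date pattern below are illustrative assumptions made for the sake of the example, not WAG's actual rules or data:

```python
import re

# Hypothetical cue-word table; the labels and entries are illustrative,
# not WAG's actual typology.
QUESTION_TYPES = {
    "who": "PERSON",
    "when": "DATE",
    "where": "LOCATION",
}

# Very rough stand-in for a date format: a four-digit year.
DATE_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def expected_answer_type(question: str) -> str:
    """Guess what kind of answer a question seeks from its opening cue word."""
    q = question.lower()
    for cue, answer_type in QUESTION_TYPES.items():
        if q.startswith(cue):
            return answer_type
    return "UNKNOWN"

def candidate_answers(question: str, sentences: list[str]) -> list[str]:
    """Keep only sentences whose format matches the expected answer type."""
    if expected_answer_type(question) == "DATE":
        return [s for s in sentences if DATE_PATTERN.search(s)]
    return sentences  # no filtering for types we cannot pattern-match here

print(expected_answer_type("When was the library founded?"))  # DATE
```

A real system would add patterns for person and location answers as well; the point is only that the question word narrows which text patterns are worth examining.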
WAG also recognizes an NLP concept called Named Entities (NEs), which are generally
proper nouns, and dynamically creates a lexicon for the search engine to reference while it is
generating its results. This means, for instance, that a user could enter a search query which
included a full name, and WAG would recognize that it should scan web pages for an answer
related not only to that full name, but also any information located near either the first or last
names, which might appear separately from one another. Such semantic recognition is a feature
which online search engines as well as OPACs currently do not handle well; NE lexicons would
be very beneficial to a library patron’s search for references related to a proper noun.
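A minimal sketch of the named-entity lexicon idea might look like the following; the functions, the whitespace split of the name, and the matching rules are hypothetical simplifications, not Neumann and Xu's implementation:

```python
# Build a tiny lexicon from a proper name in the query, then flag passages
# that mention the full name or either of its parts on their own.
def build_name_lexicon(full_name: str) -> set[str]:
    parts = full_name.split()
    return {full_name.lower()} | {p.lower() for p in parts}

def mentions_entity(passage: str, lexicon: set[str]) -> bool:
    text = passage.lower()
    # strip basic punctuation so word matching is not thrown off
    words = set(text.replace(".", " ").replace(",", " ").split())
    for term in lexicon:
        if " " in term:          # multi-word term: substring match
            if term in text:
                return True
        elif term in words:      # single name part: whole-word match
            return True
    return False

lexicon = build_name_lexicon("Ada Lovelace")
print(mentions_entity("Lovelace wrote the first program.", lexicon))  # True
```

Even this toy version shows the benefit: a passage mentioning only the surname still counts as relevant to the full-name query.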
Wolfram|Alpha is the brainchild of mathematician Stephen Wolfram, and his goal with this
project is to create a search engine that returns real answers, as opposed to links that are simply
related to the more general search topic (Levy, 2009). Wolfram’s approach to creating such a relevant
database is to utilize customized databases which can be scanned in order to answer specific,
English-language questions quantitatively, returning numbers and other information that is
pulled from the databases to create “mini dossiers” on the subject being queried (Ibid).
Therefore, one would receive a numeric answer to a quantitative search question; i.e., the real-time, current distance between Earth and the sun, calculated from one of the numerous databases, as opposed to results which simply list pages that may or may not contain the desired information.
Of course, this approach clearly requires a large amount of background work to
implement; Wolfram himself is creating many of the databases from which his search engine’s
answers would be pulled. However, the concept is an important one to consider when
determining best practices for NLP searches. It is possible to organize information so that it is
truly more accessible, and an OPAC developer might consider building more customized
databases in order to enhance a search engine’s performance.
Another very pertinent development in the field of online search engines can be seen in
the development of Hakia, which claims to be a true semantic search engine. Hakia’s goal is to
return results that are credible and usable, but not necessarily popular by the same standards as
Google or Yahoo!; today’s more common search engines have decreed “popular” to mean a web
page which has many links pointing to it (Fox, 2008). On its website, Hakia (2007) explained its
various NLP approaches, which offer some key insights into how to perhaps improve upon the
Google searches of today. For instance, Hakia bases its search results on concept matching,
rather than on a simple keyword matching or popularity ranking (Ibid). The search engine
claims to use a more advanced version of a web crawler for indexing web pages, which can
identify concepts through semantic, NLP analysis. Once the concepts are identified, the engine
can produce results for a human searcher that are subcategorized from a more general topic
search, and thus allow the user to choose to focus the search on more specialized results.
Other special features of Hakia include the NLP ability to identify equivalent terms for a
more responsive search (i.e., recognizing that another word for “treatment” could be “cure”), as
well as the ability to group brand names under broader umbrella categories, such as recognizing
“Toyota” as a type of “car.” And, like Google, Hakia does offer suggestions and corrects
spelling errors in searches – a feature which OPACs generally do not offer (Ibid).
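The equivalent-term and umbrella-category features can be approximated as simple query expansion. The synonym and category tables below are invented for illustration; Hakia's actual semantic analysis is far more sophisticated than a lookup table:

```python
# Toy lexicons standing in for real semantic resources (assumed data).
SYNONYMS = {"treatment": {"cure", "therapy"}}
CATEGORIES = {"toyota": "car"}

def expand_query(terms: list[str]) -> set[str]:
    """Expand each query term with its synonyms and its umbrella category."""
    expanded = set()
    for term in terms:
        t = term.lower()
        expanded.add(t)
        expanded |= SYNONYMS.get(t, set())
        if t in CATEGORIES:
            expanded.add(CATEGORIES[t])  # broader umbrella term
    return expanded

print(sorted(expand_query(["Toyota", "treatment"])))
# ['car', 'cure', 'therapy', 'toyota', 'treatment']
```

An OPAC using even this crude expansion could match a catalog record on “therapy” when the patron typed “treatment.”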
A final human touch in this search engine is that it offers an option to consider only
“credible” sources, which are in fact recommended to Hakia by actual librarians, thus
underscoring the important role that an information scientist still must play in recognizing and
promoting appropriate sources. This is an important consideration in the development of any
OPAC software, and it is gratifying to see such a point being made by an online search engine.
At the moment, Hakia is still a very limited search engine, and therefore has not gained
wide usage, but it holds a lot of promise, and certainly can offer many pointers to developers of
an NLP-based OPAC system.
Advances in OPAC Development Using NLP
With all of the developments taking place in the field of online search engines with
respect to NLP, one might understandably think that online catalogs are following suit, and
becoming much more flexible and relevant. However, this is largely not the case. OPACs tend
to remain based solidly in the concept of physical card catalogs, offering very limited searching
on traditional categories such as author, title, or subject. OPAC designers are overwhelmingly
not taking advantage of many of the very useful search engine tools already in place, nor do they
seem to be exploring the vast amount of research into how natural language processing could
better serve the library user. Borgman (1996) pointed out that online catalogs do not seem to be
even attempting to understand search behavior, despite library catalogs being pioneers in the
field of computerized research. Fortunately, however, there are several small enterprises – most
not yet being used by the majority of libraries – which attempt to reconcile NLP and OPAC
searching, with what appears to be notable success. While remembering that much is still left to
be developed, we will now take a look at some of the strides taken in OPAC development.
The Center for Interactive Systems Research (CISR) was formed in 1987, with the intent
of studying computerized information retrieval (IR). Much of its early research focused on
OPACs; in the late 1980s, a program named Okapi was introduced and the group began working
with the product via Text Retrieval Conference (TREC) competitions. The competitions
encouraged teams to further advance the NLP capabilities of the software (Department of
Information Science, City University of London, 2008).
In 2000, Microsoft Research Cambridge took part in the competitions, and made some
significant inroads in the development of more user-friendly OPACs, utilizing a combination of
the Okapi IR engine and NLPWin, its proprietary NLP system (Elworthy, 2000). The group
engineered a question-answering system that was able to take questions as input, parse them to
gather the meaning, and then find answers by locating sentences that contained similar phrasing.
The system took advantage of the common knowledge that questions are often formed in very
similar ways; many start with the same structure, such as “who is” or “when was,” and are then
followed by the important search cues, which the software could then exploit. Additionally,
question words can be associated with what response is expected; for instance, “who is” requires
a person in its answer, “where” requires a location, and “when” demands some form of a time
response. Often, these keywords are also followed by even more focused specifics, such as
“what country” or “what year,” which considerably narrows down the scope of potential results.
Thus, the software could easily parse the majority of questions, and then grab only the specific
information it required in order to locate pertinent answers. While this software did not perform
optimally during the competition tests, the concepts that it examined present some truly notable
considerations for future OPAC development.
Visualization software
Luther, Kelly, and Beagle (2005) recognized that search engines often return such large
data sets that the majority of the results are unusable – largely because they are never even
viewed. One of the biggest problems with search engines in general, and with OPACs
specifically, is that there are so many ways to look at a subject. A library patron’s needs vary
drastically from person to person, and even between several searches performed all by the same
patron. Where one is an expert another might be a novice, but when one begins to research a
subject, OPACs require every user to start with very broad, sweeping categorizations of the
information of which they are in search. Additionally, the broad subject searches usually return
a mixture of results that vary drastically in depth, coverage, and approach. While this is
sometimes useful – as when a patron does not know specifically for what they are looking –
often a user has a predefined query or search term in mind, but must use a drill-down searching
method in order to locate a useful answer. Luther et al. (2005) recognized the vast differences
between the approach to every search engine query, and examined visualization software as a
possible means of ameliorating the discrepancies. Visualization programs attempt to work more
closely with how the human brain works and how it processes linguistics, allowing users to
follow certain paths, associate concepts, and backtrack, using interfaces which present data
clouds and subheadings for the user to click through. The software also uses NLP to produce
topical clusters which associate certain words or concepts more along the lines of how the human
mind does so.
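The topical-clustering idea can be illustrated with a deliberately naive sketch that groups result titles under the content words they share; real visualization products use far richer NLP than this, and the stopword list is an assumption:

```python
from collections import defaultdict

# Minimal stopword list, assumed for illustration.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to"}

def cluster_by_topic(titles: list[str]) -> dict[str, list[str]]:
    """Group titles under each content word that at least two titles share."""
    clusters = defaultdict(list)
    for title in titles:
        for word in title.lower().split():
            if word not in STOPWORDS:
                clusters[word].append(title)
    # keep only words that actually tie two or more results together
    return {w: ts for w, ts in clusters.items() if len(ts) > 1}

titles = ["History of Dog Shows", "Training a Show Dog", "Cat Breeds"]
print(cluster_by_topic(titles))
```

Presented visually, each surviving cluster word becomes a clickable node that groups related results, letting the user follow an association rather than scan a flat list.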
Certain OPAC-specific software is being developed which attempts to incorporate these
novel approaches to searching. Notably, OCLC is working in conjunction with Antarctica
Systems, Inc. to create a visual interface to OCLC’s Electronic Books database.
AquaBrowser, which has been adopted by institutions including Harvard and Miami-Dade Public
Library System, is another well-known product which utilizes visual language searching. In
recent years, several other prominent library database software companies have been exploring
the applicability of visual search to their own products (Ibid). Clearly this is an idea that is
catching on.
Antelman et al. (2006) evaluated North Carolina State University’s recent adoption of
next-generation OPAC software called Endeca. While not strictly an NLP search engine, this
program allowed the university to upgrade from their older OPAC to offer a much better tool for
finding relevant resources. Specifically, the university was eager to replace its keyword search
engine with software that had been developed for use with large commercial websites, as they
found that their students were more familiar with those types of searches. The software offers
many features that the university did not find in traditional OPAC software, such as the ability to
assign search indexes different relevance rankings, and to correct user typos or misspellings.
The auto-correct in Endeca is also much more relevant, as it pulls its suggestions from a self-compiled list of frequently used terms, rather than simply referring to a dictionary for options.
These types of responses from search engines, which take into account the way a human is more
likely to approach a search, and which compile some of their searching mechanisms from the
user herself, are much more desirable than the basic keyword searches most used in OPACs.
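The usage-driven suggestion idea can be sketched with Python's standard library: pick the closest match among frequently searched terms, preferring the more popular term on a tie. The term list is invented, and this is not Endeca's actual algorithm:

```python
from difflib import get_close_matches

# Hypothetical self-compiled log of frequently used search terms.
TERM_FREQUENCY = {"shakespeare": 120, "chaucer": 40, "shaker": 15}

def suggest(misspelled):
    """Suggest a correction drawn from past searches, weighted by usage."""
    matches = get_close_matches(misspelled.lower(), TERM_FREQUENCY, n=3, cutoff=0.6)
    if not matches:
        return None
    return max(matches, key=TERM_FREQUENCY.get)  # favor the most-used term

print(suggest("shakespere"))  # shakespeare
```

Because the candidate list comes from what patrons actually search for, the suggestions stay relevant to the collection in a way a general-purpose dictionary cannot.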
How Advances in NLP Can Benefit OPACs
NLP is still in its infancy in many ways, and it may be a long time before computers can
understand natural language in any sense near the depth that humans are able. However, it is
clear from the above cases that many significant advances are being made in the field which
might be very promising for the future of online catalog searching. In an ideal world, an OPAC
would be able to answer a short, specific question while pointing the user towards further
resources – much like the role of a reference librarian. There is currently a vast difference
between an interaction with a human and an interaction with a computer, and unfortunately, most
scenarios are leaning towards “computer-friendly” rather than “user-friendly” interfaces
(Chowdhury, 2003; Borgman, 1996). This means that humans are adapting themselves to what a
computer expects, rather than the other way around. It does not have to remain this way.
Research continues to take place in the field of NLP, and significant strides continue to be made.
OPAC designers must begin to explore and apply these advances to their products, or they risk
losing their already crumbling hold on patron use. Loarer (1993) agreed that more user-friendly
developments in OPACs, particularly in language processing, will have users in educational,
professional, and domestic environments taking significantly more advantage of the software.
It is true, however, that much remains to be developed in the field of NLP. The nuances
and complexities of human language are certainly difficult to map to a binary system of ones and
zeros. Most NLP research has been done solely in closed, controlled settings, and has not been
applied to real-world scenarios. It is understandable, then, that only tried and true inroads into
better searching software have been adopted by OPAC developers. However, applying NLP
advances in search engine design to OPAC design is a necessary next step towards more
technologically-friendly libraries. Following are some suggestions for future development in
NLP and OPAC design which were gathered throughout my research:
- Study search behavior: one cannot design a system that is more accommodating to searching if human search approaches are not understood. For instance, users tend to search in stages as they refine their queries, and sometimes use searching as a way to formulate an actual question. Consider this when designing systems. Offer a way for a user to incorporate earlier, related searches into the current search (Borgman, 1996).
- Separate altogether from the card catalog model; it was designed for a specific physical space which no longer constrains the computerized OPAC, and therefore software designers must literally think outside of the physical card catalog (Ibid).
- Move on from query-matching systems. They were designed for use by skilled searchers, such as librarians, not for use by untrained library patrons. Other approaches to information-gathering are more natural for the layman user (Ibid).
- Consider question-parsing mechanisms that refine or enlarge the user’s query either automatically or in a dialogue with the user, using algorithms that take the query’s semantics into consideration (Loarer, 1993; Cenek, 2001).
- Make sure help is readily available to the user, offering aid on how to use the software as well as how to perform an effective search (Loarer, 1993). Users will be especially unfamiliar with how to best utilize NLP searches when those become more widely adopted; they will need instruction that is easy to access, understand, and remember (Kreymer, 2002). Additionally, offering the option to format the organization and display of results to suit the searcher’s needs would further accommodate the objectives of a search.
- Allow for better random browsing, much like the opportunities which physical card catalogs used to offer a patron: the chance to stumble serendipitously across an interesting subject, or to refine a search with unfamiliar subcategories.
- Don’t make users guess which subject headings are being used in the OPAC, nor how subheadings are associated with one another. Allow for more visual conceptualization, which is more in line with how humans think and learn.
These are just some of the points that ought to be considered when developing a more NLP-oriented
OPAC system. It is impossible to know how many more facets might be discovered in the
process of development, a process which should ideally include usability surveys and testing, to
better understand what a library patron wishes to get from a search engine. Natural language
processing is just one small part of what needs to go into the improvement of current OPACs.
Yet it remains so very important in the further development of libraries which are based, at their
heart, on the ideal of freely sharing human knowledge – through language.
References
Antelman, K., Lynema, E., & Pace, A.K. (2006, September). Toward a twenty-first century
library catalog. Information Technology & Libraries, 25(3), 128-139.
Borgman, C.L. (1996, July). Why are online catalogs still hard to use? Journal of the American
Society for Information Science, 47(7), 493-503.
Cenek, P. (2001, June). Dialogue interfaces for library systems. In Proceedings of FIMU-RS-2001-04.
Chowdhury, G. G. (2003). Natural language processing. In B. Cronin (Ed.), Annual Review of
Information Science and Technology, 37, 51-89. Medford, NJ: Information Today.
Department of Information Science, City University of London. (2008). CISR – Mission
statement. Retrieved from http://www.soi.city.ac.uk/organisation/is/research/cisr/
Elworthy, D. (2000). Question answering using a large NLP system. Proceedings of the Ninth Text
Retrieval Conference (TREC 2000), 355-360.
Fox, V. (2008, January). The promise of natural language search. Information Today, 25(1), 50-50.
Hakia, Inc. (2007). Technology. Retrieved from http://company.hakia.com/technology.html
Kreymer, O. (2002). An evaluation of help mechanisms in natural language processing. Online
Information Review, 26(1), 30-39.
Levy, S. (2009, May 22). Steven Levy on the answer engine, a radical new formula for web
search. Wired Magazine, 17(6). Retrieved from
Loarer, P.L. (1993). OPAC: Opaque or open, public, accessible, and co-operative? Some
developments in natural language processing. Program: Electronic Library and
Information Systems, 27(3), 251-268.
Luther, J., Kelly M., & Beagle, D. (2005, March 1). Visualize this. Library Journal. Retrieved
from http://www.libraryjournal.com/article/CA504640.html
Neumann, G., & Xu, F. (2004, June). Mining natural language answers from the web. Web
Intelligence & Agent Systems, 2(2), 123-135.