How Natural Language Processing Can Benefit Libraries
Arianna L. Schlegel
Southern Connecticut State University
Abstract
This paper will explore the current lack of natural language processing (NLP) features in online
public access catalogs (OPACs). To do so, it will first examine advances recently made in online
search engine development which have incorporated NLP, and then will look at the less
significant developments made in OPAC research which also involve NLP. There are currently
large discrepancies between the capabilities of online search engines and OPACs, with the
former being several strides ahead of the latter. This paper recognizes the differences, and
concludes with suggestions on how to improve future research into OPAC development. These
include breaking away from the card catalog model, looking further into how humans approach a
search, and incorporating NLP approaches, which might include dialogue between computer and
human, more effective visual browsing, and word association.
Keywords: Natural language processing, OPACs, libraries.
Why is Natural Language Processing Important for Libraries?
Researchers have been studying artificial intelligence (AI) and its application to libraries
for several decades, ever since computers began to play a large part in the organization of library
resources. Artificial intelligence is the study of how to make computers imitate human beings as
closely as possible. Natural language processing (NLP) is a branch of AI, wherein researchers
aim to gather knowledge on how human beings understand and use language so that computer
systems can be made to understand and manipulate natural languages (Chowdhury, 2003). This
paper will attempt to emphasize how natural language processing has very important potential in
enhancing the searching capabilities of online library catalogs.
What is NLP and How Does it Relate to Human Language?
Computers and humans simply do not speak the same language. Due to their radically
different makeup, such a disconnect is inevitable. At their most basic, computers process in
binary, which means their language building blocks consist of only two things: electronic signals
in either an “on” or “off” state. Human, or natural, language at its most biological level is many,
many degrees more complex; science still does not wholly understand how the brain processes
the spoken and written word. However, for years researchers have been attempting to find a way
to reconcile the two: those studying artificial intelligence search for ways in which to program
computers so that machines can process – understand and respond to – human language.
It is true that computers in this day and age work with higher-level programming
languages such as C++ and Java, languages which employ English words and thus are much
closer to human language than is binary machine language. Yet even the most evolved
programming language is not fluently readable by the human eye, particularly one not trained in
computer science.
Computer languages are similar to human languages in that they are used to instruct and
inform the machine (insomuch as a computer can be considered “informed”). Therefore, the
study of NLP is both relevant and full of potential. In this, the twenty-first century, with the vast
speed and processing power of computers, and the potential for seemingly unstoppable gains in
computing capabilities, it is feasible to believe that computers can become so advanced that they
can largely emulate the incredibly complex language processes of a human being.
What Can NLP Do for Libraries?
One branch of NLP research attempts to develop computer interfaces that can take human
search queries in natural languages and process them, without requiring the user to be
constrained by “search terms” or concerned with word ambiguities, for instance. Ideally, a user
should be able to use a search engine as he would a human resource, such as a reference
librarian. Contrary to their nature, people in the twenty-first century have been forced, in their
use of computerized search engines, to conform to certain search standards which require only
snippets of natural language and very unnatural grammars. In a library catalog search, for
example, one might enter only a one-word subject, even if trying to answer a much more
complex and specific question. On the other hand, if one were to approach a reference librarian,
an entire sentence, revealing the true intent of the inquiry, would most likely be used. Therefore,
significant research has been done in the field of NLP, with the intent of making computerized
card catalog searches much more amenable to human language queries.
Currently, most Online Public Access Catalog (OPAC) software found in libraries is
considered “second generation,” offering subject searches which look solely for the occurrence
of a single search word as it appears in the title of a work or in the subject headings associated
with that work (Antelman, Lynema, & Pace, 2006; Borgman, 1996). This means that OPACs
remain on almost the same level as physical card catalogs, in terms of how limited a user’s
options are when trying various approaches in researching a topic. According to Loarer (1993),
most OPAC searches are for subjects, and the results come only from a simple word search
performed on the catalog. However, many believe that there is still a large, untapped potential for
OPACs to offer much more user-friendly and intuitive searches.
A Very Brief History of Online Catalogs
As would be expected, most online card catalogs began as just that: online replications of
the capabilities of the physical card catalog. Others allowed users to utilize Boolean search
terms to combine two or more words in one query. Second-generation designs, which appeared
in the 1980s and are still the standard today, simply offered the combined functionality of the
two previous kinds of OPACs. Natural language processing did not factor into these systems.
Therefore, it is clear that most libraries continue to offer very primitive online catalogs, despite
the advances occurring all around them in other areas of search engine research, most notably
that of internet searching. Library patrons have become used to search engines like Google, and
therefore approach an OPAC with the expectations of a similar user experience. Unfortunately,
Antelman et al. (2006) point out that online catalogs have remained largely stagnant for close to
twenty years, failing to keep up with advances in search technology due largely to the profession
overlooking the problem, for various reasons. Loarer (1993) even pointed out that “OPAC” is,
ironically, a homophone of the French word “opaque,” and remarked on the unfortunate
yet largely apt choice of acronyms. Clearly there is much that can be done to enhance the online
catalog experience. We will first take a look at how online search engines are beginning to take
advantage of natural language processing, and then examine how these developments might be
applied to better the OPAC user experience.
NLP and Online Search Engines
It is becoming very clear that online searching is replacing library searching for many
people, even those who are regular patrons of the library. This is largely due to the belief that
Internet search engines produce more numerous and more relevant query results than do OPACs.
While Google is currently the most popular search engine, there are several new NLP search
engine developments on the horizon that have great potential, and should be considered when
redesigning OPAC search engines.
While Google is not an NLP-based engine at its heart, it presents an interesting case study
to begin with, as it consists of a very interesting blend of more robust searching powers than
OPACs, while still constraining the searcher to very limited approaches. As Fox (2008) pointed
out, Google has its users typing in, on average, 2.8 words that represent the most vital points of a
search query; therefore, if one were interested in learning the winner of the Westminster
Kennel Club dog show in 1932, a typical Google search string might be “Westminster dog show
winner 1932.” A relevant and useful result is returned at the top of Google’s results page, but the
user is required to form the query in a very unnatural way. One certainly would not walk into a
library and use the same phrase, sans semantic fillers of any kind, when requesting the above
information of a librarian; it simply would not make sense in a face-to-face human interaction.
However, when using a traditional OPAC, the search would most likely start even more
generally, with a search perhaps under “dog shows” or “Westminster.” Therefore, Google does
allow for more specific searches, but the engine cannot process true requests, and therefore
remains limited in its capability to produce the most relevant results for a user. Its results are
also listed in order of relevance based on what Google’s algorithm deems most important; the
user is largely shuttled into the pages which happen to have the best results according to the
search engine’s “more links equals more importance” approach. Therefore, it is feasible that a
searcher might not ever find the information he is looking for; even if Google returns tens of
thousands of results, most users do not tend to look beyond the first several pages of results
before trying a new search or simply giving up the query as unanswerable (Ibid). So while
Google is a good starting point for NLP development, it certainly has a long way to go before it
can be considered to be processing true natural language.
WAG, deemed an online “answer extraction system” by Neumann and Xu (2004), was
developed by a team of German researchers, with the goal of answering questions, rather than
offering results related to a search term or phrase. This search engine, then, much more closely
resembles the work of a reference librarian, and offers a peek at what an OPAC might look like
were it to blend reference with general card catalog capabilities. The engine allows for the use of
more structured queries, which would typically include specific question words such as “who,”
“when,” “where,” etc. The search engine uses NLP to determine what type of response is being
looked for; if the query contains “when,” the user is most likely looking for a string containing a
date format, and phrases formatted only in that particular pattern are examined.
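The answer-typing step described in this paragraph can be sketched in a few lines of Python. The cue words, type labels, and date pattern below are illustrative assumptions made for the sake of the example, not WAG's actual rules or data:

```python
import re

# Hypothetical cue-word table; the labels and entries are illustrative,
# not WAG's actual typology.
QUESTION_TYPES = {
    "who": "PERSON",
    "when": "DATE",
    "where": "LOCATION",
}

# Very rough stand-in for a date format: a four-digit year.
DATE_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def expected_answer_type(question: str) -> str:
    """Guess what kind of answer a question seeks from its opening cue word."""
    q = question.lower()
    for cue, answer_type in QUESTION_TYPES.items():
        if q.startswith(cue):
            return answer_type
    return "UNKNOWN"

def candidate_answers(question: str, sentences: list[str]) -> list[str]:
    """Keep only sentences whose format matches the expected answer type."""
    if expected_answer_type(question) == "DATE":
        return [s for s in sentences if DATE_PATTERN.search(s)]
    return sentences  # no filtering for types we cannot pattern-match here

print(expected_answer_type("When was the library founded?"))  # DATE
```

A real system would add patterns for person and location answers as well; the point is only that the question word narrows which text patterns are worth examining.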
WAG also recognizes an NLP concept called Named Entities (NEs), which are generally
proper nouns, and dynamically creates a lexicon for the search engine to reference while it is
generating its results. This means, for instance, that a user could enter a search query which
included a full name, and WAG would recognize that it should scan web pages for an answer
related not only to that full name, but also any information located near either the first or last
names, which might appear separately from one another. Such semantic recognition is a feature
which online search engines as well as OPACs currently do not handle well; NE lexicons would
be very beneficial to a library patron’s search for references related to a proper noun.
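A minimal sketch of the named-entity lexicon idea might look like the following; the functions, the whitespace split of the name, and the matching rules are hypothetical simplifications, not Neumann and Xu's implementation:

```python
# Build a tiny lexicon from a proper name in the query, then flag passages
# that mention the full name or either of its parts on their own.
def build_name_lexicon(full_name: str) -> set[str]:
    parts = full_name.split()
    return {full_name.lower()} | {p.lower() for p in parts}

def mentions_entity(passage: str, lexicon: set[str]) -> bool:
    text = passage.lower()
    # strip basic punctuation so word matching is not thrown off
    words = set(text.replace(".", " ").replace(",", " ").split())
    for term in lexicon:
        if " " in term:          # multi-word term: substring match
            if term in text:
                return True
        elif term in words:      # single name part: whole-word match
            return True
    return False

lexicon = build_name_lexicon("Ada Lovelace")
print(mentions_entity("Lovelace wrote the first program.", lexicon))  # True
```

Even this toy version shows the benefit: a passage mentioning only the surname still counts as relevant to the full-name query.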
Wolfram|Alpha is the brainchild of mathematician Stephen Wolfram, and his goal with this
project is to create a search engine that returns real answers, as opposed to links that are simply
related to the more general search topic (Levy, 2009). Wolfram’s approach to creating such a relevant
database is to utilize customized databases which can be scanned in order to answer specific,
English-language questions quantitatively, returning numbers and other information that is
pulled from the databases to create “mini dossiers” on the subject being queried (Ibid).
Therefore, one would receive a numeric answer to a quantitative search question; i.e., the real-time, current distance between Earth and the sun, calculated from one of the numerous databases, as opposed to results which simply list pages that may or may not contain the desired information.
Of course, this approach clearly requires a large amount of background work to
implement; Wolfram himself is creating many of the databases from which his search engine’s
answers would be pulled. However, the concept is an important one to consider when
determining best practices for NLP searches. It is possible to organize information so that it is
truly more accessible, and an OPAC developer might consider building more customized
databases in order to enhance a search engine’s performance.
Another very pertinent development in the field of online search engines can be seen in
the development of Hakia, which claims to be a true semantic search engine. Hakia’s goal is to
return results that are credible and usable, but not necessarily popular by the same standards as
Google or Yahoo!; today’s more common search engines have decreed “popular” to mean a web
page which has many links pointing to it (Fox, 2008). On its website, Hakia (2007) explained its
various NLP approaches, which offer some key insights into how to perhaps improve upon the
Google searches of today. For instance, Hakia bases its search results on concept matching,
rather than on a simple keyword matching or popularity ranking (Ibid). The search engine
claims to use a more advanced version of a web crawler for indexing web pages, which can
identify concepts through semantic, NLP analysis. Once the concepts are identified, the engine
can produce results for a human searcher that are subcategorized from a more general topic
search, and thus allow the user to choose to focus the search on more specialized results.
Other special features of Hakia include the NLP ability to identify equivalent terms for a
more responsive search (i.e., recognizing that another word for “treatment” could be “cure”), as
well as the ability to group brand names under broader umbrella categories, such as recognizing
“Toyota” as a type of “car.” And, like Google, Hakia does offer suggestions and corrects
spelling errors in searches – a feature which OPACs generally do not offer (Ibid).
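The equivalent-term and umbrella-category features can be approximated as simple query expansion. The synonym and category tables below are invented for illustration; Hakia's actual semantic analysis is far more sophisticated than a lookup table:

```python
# Toy lexicons standing in for real semantic resources (assumed data).
SYNONYMS = {"treatment": {"cure", "therapy"}}
CATEGORIES = {"toyota": "car"}

def expand_query(terms: list[str]) -> set[str]:
    """Expand each query term with its synonyms and its umbrella category."""
    expanded = set()
    for term in terms:
        t = term.lower()
        expanded.add(t)
        expanded |= SYNONYMS.get(t, set())
        if t in CATEGORIES:
            expanded.add(CATEGORIES[t])  # broader umbrella term
    return expanded

print(sorted(expand_query(["Toyota", "treatment"])))
# ['car', 'cure', 'therapy', 'toyota', 'treatment']
```

An OPAC using even this crude expansion could match a catalog record on “therapy” when the patron typed “treatment.”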
A final human touch in this search engine is that it offers an option to consider only
“credible” sources, which are in fact recommended to Hakia by actual librarians, thus
underscoring the important role that an information scientist still must play in recognizing and
promoting appropriate sources. This is an important consideration in the development of any
OPAC software, and it is gratifying to see such a point being made by an online search engine.
At the moment, Hakia is still a very limited search engine, and therefore has not gained
wide usage, but it holds a lot of promise, and certainly can offer many pointers to developers of
an NLP-based OPAC system.
Advances in OPAC Development Using NLP
With all of the developments taking place in the field of online search engines with
respect to NLP, one might understandably think that online catalogs are following suit, and
becoming much more flexible and relevant. However, this is largely not the case. OPACs tend
to remain based solidly in the concept of physical card catalogs, offering very limited searching
on traditional categories such as author, title, or subject. OPAC designers are overwhelmingly
not taking advantage of many of the very useful search engine tools already in place, nor do they
seem to be exploring the vast amount of research into how natural language processing could
better serve the library user. Borgman (1996) pointed out that online catalogs do not seem to be
even attempting to understand search behavior, despite library catalogs being pioneers in the
field of computerized research. Fortunately, however, there are several small enterprises – most
not yet being used by the majority of libraries – which attempt to reconcile NLP and OPAC
searching, with what appears to be notable success. While remembering that much is still left to
be developed, we will now take a look at some of the strides taken in OPAC development.
The Center for Interactive Systems Research (CISR) was formed in 1987, with the intent
of studying computerized information retrieval (IR). Much of its early research focused on
OPACs; in the late 1980s, a program named Okapi was introduced and the group began working
with the product via Text Retrieval Conference (TREC) competitions. The competitions
encouraged teams to further advance the NLP capabilities of the software (Department of
Information Science, City University of London, 2008).
In 2000, Microsoft Research Cambridge took part in the competitions, and made some
significant inroads in the development of more user-friendly OPACs, utilizing a combination of
the Okapi IR engine and NLPWin, its proprietary NLP system (Elworthy, 2000). The group
engineered a question-answering system that was able to take questions as input, parse them to
gather the meaning, and then find answers by locating sentences that contained similar phrasing.
The system took advantage of the common knowledge that questions are often formed in very
similar ways; many start with the same structure, such as “who is” or “when was,” and are then
followed by the important search cues, which the software could then exploit. Additionally,
question words can be associated with what response is expected; for instance, “who is” requires
a person in its answer, “where” requires a location, and “when” demands some form of a time
response. Often, these keywords are also followed by even more focused specifics, such as
“what country” or “what year,” which considerably narrows down the scope of potential results.
Thus, the software could easily parse the majority of questions, and then grab only the specific
information it required in order to locate pertinent answers. While this software did not perform
optimally during the competition tests, the concepts that it examined present some truly notable
considerations for future OPAC development.
Visualization software
Luther, Kelly, and Beagle (2005) recognized that search engines often return such large
data sets that the majority of the results are unusable – largely because they are never even
viewed. One of the biggest problems with search engines in general, and with OPACs
specifically, is that there are so many ways to look at a subject. A library patron’s needs vary
drastically from person to person, and even between several searches performed all by the same
patron. Where one is an expert another might be a novice, but when one begins to research a
subject, OPACs require every user to start with very broad, sweeping categorizations of the
information of which they are in search. Additionally, the broad subject searches usually return
a mixture of results that vary drastically in depth, coverage, and approach. While this is
sometimes useful – as when a patron does not know specifically for what they are looking –
often a user has a predefined query or search term in mind, but must use a drill-down searching
method in order to locate a useful answer. Luther et al. (2005) recognized the vast differences
between the approach to every search engine query, and examined visualization software as a
possible means of ameliorating the discrepancies. Visualization programs attempt to work more
closely with how the human brain works and how it processes linguistics, allowing users to
follow certain paths, associate concepts, and backtrack, using interfaces which present data
clouds and subheadings for the user to click through. The software also uses NLP to produce
topical clusters which associate certain words or concepts more along the lines of how the human
mind does so.
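The topical-clustering idea can be illustrated with a deliberately naive sketch that groups result titles under the content words they share; real visualization products use far richer NLP than this, and the stopword list is an assumption:

```python
from collections import defaultdict

# Minimal stopword list, assumed for illustration.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to"}

def cluster_by_topic(titles: list[str]) -> dict[str, list[str]]:
    """Group titles under each content word that at least two titles share."""
    clusters = defaultdict(list)
    for title in titles:
        for word in title.lower().split():
            if word not in STOPWORDS:
                clusters[word].append(title)
    # keep only words that actually tie two or more results together
    return {w: ts for w, ts in clusters.items() if len(ts) > 1}

titles = ["History of Dog Shows", "Training a Show Dog", "Cat Breeds"]
print(cluster_by_topic(titles))
```

Presented visually, each surviving cluster word becomes a clickable node that groups related results, letting the user follow an association rather than scan a flat list.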
Certain OPAC-specific software is being developed which attempts to incorporate these
novel approaches to searching. Notably, OCLC is working in conjunction with Antarctica
Systems, Inc. to create a visual interface to OCLC’s Electronic Books database.
AquaBrowser, which has been adopted by institutions including Harvard and Miami-Dade Public
Library System, is another well-known product which utilizes visual language searching. In
recent years, several other prominent library database software companies have been exploring
the applicability of visual search to their own products (Ibid). Clearly this is an idea that is
catching on.
Antelman et al. (2006) evaluated North Carolina State University’s recent adoption of
next-generation OPAC software called Endeca. While not strictly an NLP search engine, this
program allowed the university to upgrade from their older OPAC to offer a much better tool for
finding relevant resources. Specifically, the university was eager to replace its keyword search
engine with software that had been developed for use with large commercial websites, as they
found that their students were more familiar with those types of searches. The software offers
many features that the university did not find in traditional OPAC software, such as the ability to
assign search indexes different relevance rankings, and to correct user typos or misspellings.
The auto-correct in Endeca is also much more relevant, as it pulls its suggestions from a self-compiled list of frequently used terms, rather than simply referring to a dictionary for options.
These types of responses from search engines, which take into account the way a human is more
likely to approach a search, and which compile some of their searching mechanisms from the
user herself, are much more desirable than the basic keyword searches most used in OPACs.
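The usage-driven suggestion idea can be sketched with Python's standard library: pick the closest match among frequently searched terms, preferring the more popular term on a tie. The term list is invented, and this is not Endeca's actual algorithm:

```python
from difflib import get_close_matches

# Hypothetical self-compiled log of frequently used search terms.
TERM_FREQUENCY = {"shakespeare": 120, "chaucer": 40, "shaker": 15}

def suggest(misspelled):
    """Suggest a correction drawn from past searches, weighted by usage."""
    matches = get_close_matches(misspelled.lower(), TERM_FREQUENCY, n=3, cutoff=0.6)
    if not matches:
        return None
    return max(matches, key=TERM_FREQUENCY.get)  # favor the most-used term

print(suggest("shakespere"))  # shakespeare
```

Because the candidate list comes from what patrons actually search for, the suggestions stay relevant to the collection in a way a general-purpose dictionary cannot.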
How Advances in NLP Can Benefit OPACs
NLP is still in its infancy in many ways, and it may be a long time before computers can
understand natural language in any sense near the depth that humans are able. However, it is
clear from the above cases that many significant advances are being made in the field which
might be very promising for the future of online catalog searching. In an ideal world, an OPAC
would be able to answer a short, specific question while pointing the user towards further
resources – much like the role of a reference librarian. There is currently a vast difference
between an interaction with a human and an interaction with a computer, and unfortunately, most
scenarios are leaning towards “computer-friendly” rather than “user-friendly” interfaces
(Chowdhury, 2003; Borgman, 1996). This means that humans are adapting themselves to what a
computer expects, rather than the other way around. It does not have to remain this way.
Research continues to take place in the field of NLP, and significant strides continue to be made.
OPAC designers must begin to explore and apply these advances to their products, or they risk
losing their already crumbling hold on patron use. Loarer (1993) agreed that more user-friendly
developments in OPACs, particularly in language processing, will have users in educational,
professional, and domestic environments taking significantly more advantage of the software.
It is true, however, that much remains to be developed in the field of NLP. The nuances
and complexities of human language are certainly difficult to map to a binary system of ones and
zeros. Most NLP research has been done solely in closed, controlled settings, and has not been
applied to real-world scenarios. It is understandable, then, that only tried and true inroads into
better searching software have been adopted by OPAC developers. However, applying NLP
advances in search engine design to OPAC design is a necessary next step towards more
technologically-friendly libraries. Following are some suggestions for future development in
NLP and OPAC design which were gathered throughout my research:
- Study search behavior: one cannot design a system that is more accommodating to searching if human search approaches are not understood. For instance, users tend to search in stages as they refine their queries, and sometimes use searching as a way to formulate an actual question. Consider this when designing systems. Offer a way for a user to incorporate earlier, related searches into the current search (Borgman, 1996).
- Separate altogether from the card catalog model; it was designed for a specific physical space which no longer constrains the computerized OPAC, and therefore software designers must literally think outside of the physical card catalog (Ibid).
- Move on from query-matching systems. They were designed for use by skilled searchers, such as librarians, not for use by untrained library patrons. Other approaches to information-gathering are more natural for the layman user (Ibid).
- Consider question-parsing mechanisms that refine or enlarge the user’s query either automatically or in a dialogue with the user, using algorithms that take the query’s semantics into consideration (Loarer, 1993; Cenek, 2001).
- Make sure help is readily available to the user, offering aid on how to use the software as well as how to perform an effective search (Loarer, 1993). Users will be especially unfamiliar with how to best utilize NLP searches when those become more widely adopted; they will need instruction that is easy to access, understand, and remember (Kreymer, 2002). Additionally, offering the option to format the organization and display of results to suit the searcher’s needs would further accommodate the objectives of a search.
- Allow for better random browsing, much like the opportunities which physical card catalogs used to offer a patron: the chance to stumble serendipitously across an interesting subject, or to refine a search with unfamiliar subcategories.
- Don’t make users guess which subject headings are being used in the OPAC, nor how subheadings are associated with one another. Allow for more visual conceptualization, which is more in line with how humans think and learn.
These are just some of the points that ought to be considered when developing a more NLP-oriented
OPAC system. It is impossible to know how many more facets might be discovered in the
process of development, a process which should ideally include usability surveys and testing, to
better understand what a library patron wishes to get from a search engine. Natural language
processing is just one small part of what needs to go into the improvement of current OPACs.
Yet it remains so very important in the further development of libraries which are based, at their
heart, on the ideal of freely sharing human knowledge – through language.
References
Antelman, K., Lynema, E., & Pace, A.K. (2006, September). Toward a twenty-first century
library catalog. Information Technology & Libraries, 25(3), 128-139.
Borgman, C.L. (1996, July). Why are online catalogs still hard to use? Journal of the American
Society for Information Science, 47(7), 493-503.
Cenek, P. (2001, June). Dialogue interfaces for library systems. In Proceedings of FIMU-RS-2001-04.
Chowdhury, G. G. (2003). Natural language processing. In B. Cronin (Ed.), Annual Review of
Information Science and Technology, 37, 51-89. Medford, NJ: Information Today.
Department of Information Science, City University of London. (2008). CISR – Mission
statement. Retrieved from http://www.soi.city.ac.uk/organisation/is/research/cisr/
Elworthy, D. (2000). Question answering using a large NLP system. Proceedings of the Ninth Text
Retrieval Conference (TREC 2000), 355-360.
Fox, V. (2008, January). The promise of natural language search. Information Today, 25(1), 50-50.
Hakia, Inc. (2007). Technology. Retrieved from http://company.hakia.com/technology.html
Kreymer, O. (2002). An evaluation of help mechanisms in natural language processing. Online
Information Review, 26(1), 30-39.
Levy, S. (2009, May 22). Steven Levy on the answer engine, a radical new formula for web
search. Wired Magazine, 17(6). Retrieved from
Loarer, P.L. (1993). OPAC: Opaque or open, public, accessible, and co-operative? Some
developments in natural language processing. Program: Electronic Library and
Information Systems, 27(3), 251-268.
Luther, J., Kelly M., & Beagle, D. (2005, March 1). Visualize this. Library Journal. Retrieved
from http://www.libraryjournal.com/article/CA504640.html
Neumann, G., & Xu, F. (2004, June). Mining natural language answers from the web. Web
Intelligence & Agent Systems, 2(2), 123-135.