CISE CRI:
AnswerBank: A Reusable Test Collection for Question Answering
University of Maryland and Massachusetts Institute of Technology
Question answering is a young, exciting research field that lies at the intersection of
computational linguistics and information retrieval. The technology is a promising
solution to the textual information overload problem prevalent in today’s informationrich environment. In contrast to the long “hit lists” returned by current information
retrieval systems, question answering systems leverage natural language processing
technology to provide much more precise and succinct responses to user questions.
Current question answering technology focuses on so-called factoid questions such as
“Who killed Abraham Lincoln?” that are usually answered by named entities such as
people, places, and dates. As with many other research areas, progress is measured by
quantitative evaluations against a common set of community-accepted benchmarks. In
the field of question answering, the QA Tracks at the Text Retrieval Conferences
(TRECs) serve this purpose. Unfortunately, these evaluations are a yearly event, and
their results are not reproducible by individual systems outside of the formal evaluation.
In short, no reusable test collection exists for question answering research. To remedy
this situation and to spur future developments, we propose to build AnswerBank, a shared
community resource consisting of a large set of manually selected questions, their
answers, documents supporting those answers, and additional annotations.
In the past few months, we have built a small pilot test collection that demonstrates the
feasibility of our ideas. Using it, we have evaluated strategies for handling
morphological variation in document retrieval for question answering. Although
previously existing evaluation resources were unable to detect a performance difference
between two competing strategies (stemming and morphological expansion), our
experiments with the pilot test collection clearly demonstrated that one strategy is
superior to the other. This experience demonstrates that the creation of AnswerBank will allow researchers to rapidly, and more importantly accurately, assess the impact of
their algorithms. Faster experimental turn-around time will translate into faster
exploration of the solution space, and will lead to accelerated performance gains. We
hope that the creation of AnswerBank will not only lead to better systems, but also enable
the development of new techniques. For example, many statistical machine learning
approaches, which require the existence of large amounts of clean training data, would
greatly benefit from our project.
Through our recent experiences, we have learned that there is no such thing as an
“obvious” answer to a natural language question. Legitimate differences in opinion are
an inevitable part of any activity that involves humans engaged in real-world tasks.
Instead of viewing this as a downside, we believe these variations to be instructive of the
underlying cognitive processes involved in answering questions. By properly managing
these differences, we can create a high-quality test collection that reflects real-world user
needs, and perhaps additionally shed some light on the process through which certain
information requests are satisfied.
1. Introduction
The functioning of the modern information society is limited not by our ability to electronically store information, but by our ability to retrieve it effectively and in a timely manner. One pressing issue concerns the increasingly difficult task of accessing
the enormous quantities of textual data ubiquitous in our daily lives. The ability to
specify information requests using natural language is seen as a potential solution,
especially when coupled with natural language processing technology capable of
delivering succinct responses instead of long “hit lists” which users must subsequently
browse through. This technology, known as question answering, is a young and exciting
research field that lies primarily at the intersection of information retrieval and
computational linguistics.
Question answering research presently focuses on fact-based questions (so-called
“factoid” questions) such as “What was the name of FDR’s dog?”. Although simple in
appearance, answering these questions may be deceptively difficult; for example,
consider the following passage: “Fala was the president’s faithful companion throughout
the war years. After his death, the shaggy Scottie lived with Eleanor…” Often, properly
computing an appropriate response requires bringing to bear many natural language
processing technologies from co-reference resolution to lexical semantics to logical
inferencing. As such, question answering not only serves as a practical, high-impact
application, but also as an experimental playground for exercising many core language
technologies. In addition, the field has ambitions to expand beyond factoid questions to
those involving, for example, extended user interactions requiring human-computer
dialogues, complex reasoning, and knowledge fusion from multiple knowledge sources.
To ensure steady progress in the field, researchers must be able to quantitatively and
automatically evaluate question answering systems. Unfortunately, the community
currently lacks a reusable test collection and the infrastructure necessary to support
repeatable experiments that produce comparable results “at the push of a button”.
The situation is quite different in many other areas of computational linguistics such as
statistical parsing (Marcus et al., 1994; Collins, 1997; Charniak, 2001), machine translation (Papineni et al., 2002), and automatic summarization (Lin, C.-Y. and Hovy, 2003), where the existence of well-designed evaluation resources supports experiments with rapid
turn-around time, resulting in faster exploration of the solution space and leading to
accelerated improvements in performance. We propose to develop AnswerBank, a
shared community resource consisting of a repository of manually selected factoid
questions and their answers, as determined by human assessors. This resource will
enable researchers in our field to conduct repeatable “push-button” evaluations,
potentially leading to rapid advances in the state of the art.
1.1 Project Goals
The central deliverable of this proposed project is AnswerBank, a reusable test collection
for question answering experiments on a corpus of newspaper articles (the AQUAINT
corpus).[1] For each question in the testset, we will attempt to exhaustively find and record
all answer instances within the corpus. Human annotators will manually assess the
relevance of documents that contain keywords from both the question and the answer and
record the appropriate judgments. AnswerBank was conceived with four major
purposes in mind:
1. To serve as a reusable test collection for question answering evaluation.
2. To serve as training data for developing question answering algorithms.
3. To serve as a roadmap for future question answering research.
4. To shed some light on the cognitive process of answering questions.
The primary use of AnswerBank as a reusable test collection has already been discussed.
Beyond evaluations, the existence of human judgments for factoid questions will serve as
a valuable resource for the development of question answering algorithms themselves.
Much success in computational linguistics can be traced to recent advances in statistical
machine learning techniques, and indeed, statistical methods have been applied to the
question answering task (Ittycheriah et al., 2000; Radev et al., 2002; Echihabi and Marcu,
2003). Naturally, performance is to a large extent dependent on both the quantity and
quality of training data. The availability of complete, exhaustive answer sets to a large
number of natural language questions will provide the foundation for advances in
statistical language processing techniques.
Typically, the ability to correctly answer a question involves a combination of natural
language processing technologies (for example, reasoning over lexical-semantic relations,
or anaphora resolution). For each question, we will attempt to document the issues
involved in computing an appropriate answer. We hypothesize that a relatively small
number of language processing challenges are involved in properly analyzing a text
fragment to arrive at a correct answer. If this hypothesis is correct, then it may be
possible to catalog the variety of research issues that make up this question answering
task. The result may serve as a roadmap for future research and will allow the
community to efficiently allocate resources in tackling various enabling technologies.
Finally, we believe that this project will shed some light on the human cognitive
processes involved in answering questions. In the process of building AnswerBank, we
hope to gain a better understanding of what it means to “answer a question”, and the
processes by which information needs are satisfied. As we have discovered in initial
experiments, human judgments of answers and supporting documents exhibit a large
range of variation that reflects legitimate differences in opinion; users bring different
backgrounds and biases to bear in interpreting an answer. An appropriate
characterization of these issues will ensure that AnswerBank reflects true user
information needs. This project will not only be valuable as a community resource, but
1
Available from the Linguistic Data Consortium at http://www.ldc.upenn.edu/
will be a worthwhile exploration into theoretical underpinnings of information-seeking
behavior.
As time and resources allow, we will conduct exploratory studies in natural language
processing technologies for question answering, using AnswerBank as the yardstick for
evaluating performance. Although the core focus of this project is resource-building,
and not the development of advanced technology per se, our experiments using
AnswerBank will ensure its usefulness to the research community.
Upon completion, AnswerBank will be released to the research community; we hope that
its availability will spur new developments, not only in question answering, but also in
other areas of computational linguistics and information retrieval.
2. The Current State of Question Answering Evaluation
Over the past few years, the Question Answering Tracks at the Text Retrieval
Conferences (TRECs), sponsored by the National Institute of Standards and Technology
(NIST), have brought formal and rigorous evaluation methodologies to bear on the
question answering task, such as blind testsets, shared corpora, comparable metrics,
adjudicated human evaluation, and post-hoc stability analyses of performance (Voorhees
and Tice, 1999, 2000; Voorhees, 2001, 2002, 2003). The result is a performance
benchmark that has gained community-wide acceptance. The TREC QA Tracks, in fact,
have become the locus of question answering research, serving not only as an annual
forum for meaningful comparison of natural language processing and information
retrieval techniques, but also as an efficient vehicle for the dissemination of research
results. Successful techniques are frequently adopted by other teams in subsequent years,
leading to rapid advances in the state of the art.
In the TREC instantiation of the question answering task, a system’s response to a natural
language question is a pair consisting of an answer string and a supporting document. All
responses are manually judged by at least one human, who assigns one of three labels:
“correct”, “unsupported”, or “incorrect”. In order for a response to be judged as
“correct”, the answer string must provide the relevant information and the supporting
document must provide an appropriate justification of the answer string. Consider the
question “What Spanish explorer discovered the Mississippi River?” A response of
“Hernando de Soto” paired with a document that contains the fragment “the 16th-century
Spanish explorer Hernando de Soto, who discovered the Mississippi River…” would be
judged as correct. However, the same answer string, paired with a document that
contains the sentence “In 1542, Spanish explorer Hernando de Soto died while searching
for gold along the Mississippi River” would be judged as unsupported. While the answer
string is correct, a human cannot conclude by reading the text that de Soto did indeed
discover the Mississippi River. Thus, evaluating the response of a question answering
system involves not only the answer itself, but careful consideration of the document
from which the string was extracted. Because of this reality, a rigorous evaluation of
question answering systems is not possible outside the annual TREC cycle with currently
available resources.
Before discussing deficiencies associated with present evaluation tools, it is necessary to
first understand the nature of existing resources. Each year, NIST compiles correct
answers in the form of answer patterns (regular expressions) and a list of known relevant
documents by pooling the responses of all participants. Since the average performance of
current systems is still somewhat poor (see Voorhees, 2003 for a summary of last year's
results), the number of known relevant documents for each question is small, averaging
1.95 relevant documents per question on the TREC 2002 testset. In the same testset, no
single question had more than four known relevant documents. Even a casual
examination of the corpus reveals the existence of many more relevant documents,
demonstrating that the judgments are far from exhaustive. There are two major reasons
for this: first, since most teams utilize relatively simple keyword-based techniques to
extract answers, there is limited diversity in the types of documents that are retrieved.
Second, since each system is only allowed to return one answer per question, the
maximum number of relevant documents is bounded by the number of participant teams.
For questions that have many more relevant answers in the corpus, there is no hope of
collecting a more complete set of judgments within the TREC evaluation framework.
Moreover, careful inspection of the document list reveals duplicates and errors. To be
fair, NIST merely provides the answer patterns and relevant documents list every year for
convenience only; they were never meant to serve as a complete test collection for future
experiments. For lack of any better resources, however, these partial judgments have
been employed by the research community in many question answering experiments.
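To make the consequences of this pooling procedure concrete, the following Python sketch (with invented system responses) assembles a depth-one pool: each participating system contributes at most one supporting document per question, so the judged set is bounded by the number of participants and shrinks further when systems retrieve the same documents.

```python
def pool_judgments(system_responses, depth=1):
    """Pooling: collect each system's top-`depth` supporting documents per
    question for manual judgment. With depth 1, as in the QA track setup
    described above, the judged pool can never exceed the number of
    participating systems."""
    pool = {}
    for system in system_responses:
        for qid, docids in system.items():
            pool.setdefault(qid, set()).update(docids[:depth])
    return pool

# Three hypothetical systems answering the same question from similar documents.
systems = [{"1396": ["APW19990823.0165"]},
           {"1396": ["APW19990823.0165"]},
           {"1396": ["NYT20000405.0216"]}]
print(pool_judgments(systems))  # only two documents ever get judged
```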
To understand the danger of evaluating question answering systems with currently
available resources, one must first understand how they are being used. In the standard
automatic evaluation setup, the answer string is matched against the answer patterns and
the supporting document is matched against the list of known relevant documents, both
provided by NIST through the pooling technique described above. Because this list is far
from exhaustive, new successful retrieval techniques may not be properly rewarded: a
perfectly acceptable answer may be judged as incorrect simply because its supporting
document does not appear in the list of known relevant documents. Current question
answering systems, for the most part, are based on matching documents in the corpus that
have keywords in common with the question. As more sophisticated linguistic
processing technologies are developed, it is entirely possible that a system can extract
answers from documents that share few or no keywords with the question. Such systems
are likely to return documents that have never been assessed before, but which are
assumed to be irrelevant by default. The current set of judgments serves as a poor basis
for evaluating new systems; without an exhaustive set of relevant documents, one would
never know if a system's performance is actually improving.
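As an illustration of the automatic evaluation setup just described, the sketch below scores a response against the answer patterns and the list of known relevant documents. The question identifier, pattern, and document identifiers are merely illustrative; the point is that a supporting document absent from the judged list counts against the system even if a human would accept it.

```python
import re
from dataclasses import dataclass

@dataclass
class Response:
    qid: str      # question identifier
    answer: str   # system's answer string
    docid: str    # identifier of the supporting document

def judge(response, answer_patterns, relevant_docs):
    """Approximate judgment using answer patterns (regular expressions) and
    the pooled list of known relevant documents, neither of which is exhaustive."""
    patterns = answer_patterns.get(response.qid, [])
    if not any(re.search(p, response.answer, re.IGNORECASE) for p in patterns):
        return "incorrect"
    if response.docid in relevant_docs.get(response.qid, set()):
        return "correct"
    # The document may simply never have been judged -- a possible false negative.
    return "unsupported"

# Illustrative data; the pattern and document lists are not the real NIST files.
answer_patterns = {"1396": [r"(Mount|Mt\.?\s*)?Vesuvius"]}
relevant_docs = {"1396": {"APW19990823.0165"}}
print(judge(Response("1396", "Vesuvius", "NYT20000405.0216"),
            answer_patterns, relevant_docs))  # -> "unsupported"
```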
Although the practice of pooling relevant documents is well-established in ad hoc
retrieval, and has proven to produce fair and stable evaluations (Zobel, 1998), we believe
that the same methodology cannot be directly applied to question answering evaluations.
A major difference between the tasks is the pool depth: in ad hoc retrieval, the top one
hundred documents from every system are pooled and judged. Since question answering
systems return only a single answer, the pool depth is effectively one; Zobel’s own
results show that the reliability of the pooling strategy is dependent on the pool depth, and
metrics become unstable as pool sizes decrease. Thus, there are serious doubts as to
whether previous TREC results can serve as test collections for future experiments.
Formal, repeatable evaluations with clearly defined performance metrics and stable
results represent a major driver of innovative research. In addition, the ability to rapidly
conduct experiments with minimal turn-around time allows researchers to immediately
assess the effectiveness of different techniques, leading to accelerated progress.
However, this desirable research environment is dependent on the existence of test
collections that can accurately capture true performance. Such a resource currently does
not exist, and it is the goal of our proposed project to address this issue.
3. Building a Reusable QA Test Collection
The creation of a reusable test collection for factoid question answering remains an open
research problem, and the lack of suitable resources becomes more problematic as the
research community moves towards more difficult question types, e.g., those that involve
inferencing. We propose to tackle this problem by confronting it head on. In a pilot
study over the past few months, we have manually built a small reusable question
answering test collection (Bilotti, 2004; Bilotti et al., 2004) consisting of 120 questions
selected from the TREC 2002 testset. This testset is based on the AQUAINT corpus, a
three gigabyte collection of approximately one million news articles from the New York
Times, the Associated Press, and the Xinhua News Service.
Working from known answers to the TREC 2002 questions, we constructed queries with
certain terms selected from each question and its answer, terms which we believed a
relevant document would have to contain. Although it is entirely possible that a relevant
document may contain none of the words from the question and the answer, we assumed
that this happens very rarely. Using these queries, we retrieved tens of thousands of
documents that were manually examined and judged to be either relevant, irrelevant, or
unsupported for a particular question (an example is summarized below in Table 1).
Question: What is the name of the volcano that destroyed the ancient city of Pompeii?

Relevant: “… In A.D. 79, long-dormant Mount Vesuvius erupted, burying the Roman cities of Pompeii and Herculaneum in volcanic ash…” [APW19990823.0165]

Unsupported: “… Pompeii was pagan in A.D. 79, when Vesuvius erupted…” [NYT20000405.0216]

Irrelevant: “… the project of replanting ancient vineyards amid the ruins of Pompeii… Coda di Volpe, a white grape from Roman times that thrives in the volcanic soils on the lower slopes of Mt. Vesuvius…” [NYT20000704.0049]

Table 1: Examples of relevant, unsupported, and irrelevant documents.
To give a concrete example, consider TREC question 1396, “What is the name of the
volcano that destroyed the ancient city of Pompeii?”, whose answer is “Vesuvius”. Any
relevant document containing the answer to this question must necessarily contain the
keywords “Pompeii” and “Vesuvius”. Therefore, we manually examined all documents
with those two keywords. For this question, we noted fifteen relevant, three unsupported,
and ten irrelevant documents. All other documents not explicitly marked are presumed to
be irrelevant. An example of a clearly relevant document is APW19990823.0165[2], which
says that “In A.D. 79, long-dormant Mount Vesuvius erupted, burying the Roman cities
of Pompeii and Herculaneum in volcanic ash.” An unsupported document is one that
contains the answer and discusses it in the correct sense and context, but does not
completely and clearly answer the question. An example is NYT20000405.0216, which
states that, “Pompeii was pagan in A.D. 79, when Vesuvius erupted.” The document also
addresses speculations that “the people of Pompeii were justly punished by the volcano
eruption,” but does not explicitly mention or imply that the city was destroyed by
Vesuvius. An irrelevant document containing the terms “Pompeii” and “Vesuvius” is
NYT20000704.0049, which discusses winemaking in Campania, the region of Italy
containing both Pompeii and Vesuvius. The document talks about vineyards near the
ruins of Pompeii, and about a species of grape that grows in the volcanic soil at the foot
of Mt. Vesuvius.
In our proposed AnswerBank, we plan on applying the same basic methodology
employed in the creation of our pilot test collection, except at a much larger scale and
with a supplementary annotation set. In addition, we hope to incorporate many lessons
that have been learned in our exploratory studies. Before describing our proposed
workflow process, we will first discuss the difficulty of this endeavor with respect to
judgment variations.
3.1 The Difficulty in Building a QA Test Collection
The task of building a reusable test collection for factoid question answering is much
more difficult than it appears. The most important challenge we must tackle in this
project is the question “What makes a good answer?” In this section, we will attempt to
lay out some of the intricate issues involved.
Consider TREC question 1398, “What year was Alaska purchased?”, whose answer is
1867. A document that contains the keywords “1867” and “Alaska” is
APW19990329.0045, which says that “In 1867, U.S. Secretary of State William H.
Seward reached agreement with Russia to purchase the territory of Alaska.” Is this a
relevant document? At first glance, probably so, but upon closer examination, one might
have doubts. Bringing to bear common sense knowledge about the world, an assessor
might realize that the international transaction of purchasing territories is a long and
complicated process that might span years. Thus, “reaching an agreement to purchase” in
1867 might not mean that Alaska was “purchased” in 1867. Consider another document,
APW19991017.0082, which states that, “In 1867, the United States took formal
possession of Alaska from Russia.” Would this document be considered relevant? For
one, it does not mention explicitly that a purchase was involved (it could have been ceded
as a result of some other treaty, for example); furthermore, the date of “taking formal
possession” might be different from the purchase date. In our pilot study, assessors
disagreed on the classification of these documents: a purchase may be a complex
commercial transaction that involves multiple steps that may span a long time frame; at
what point can the purchase be considered complete? This and similar examples demonstrate that there is no such thing as an “obvious answer”. Humans often have differences of opinion in interpreting both questions and responses, which translate into different relevancy judgments.

[2] Incidentally, this document is not present in the NIST-supplied list of relevant documents.
Even when the interpretation of a question is relatively uniform across different
assessors, many disagreements still arise concerning the acceptability of different answer
forms. For a question asking for a date, how exact must the date be? Is the year in which an event happened sufficient, or are the exact month and day necessary? From our experience, judgments vary not only from assessor to assessor, but also from question to question. For a question about a relatively recent event such as “When did World War I start?” or “When was Hurricane Hugo?”, assessors are more likely to require more exact dates (containing both the month and the year, for example). For a question like “When did French revolutionaries storm the Bastille?”, “July 14”, “1789”, and “July 14, 1789” were all considered acceptable by different assessors.
Similar variations in judgments also occurred with entities such as people and place
names. Is the last name of a person sufficient, or should an answer have both a first and
last name? Consider the question “Who won the Nobel Peace Prize in 1991?”, whose
answer is “Aung San Suu Kyi”: some assessors accepted “San Suu Kyi”, “Suu Kyi”, or
even just “Kyi”. How exact must place names be? For the question “Where was Harry
Truman born?”, some assessors accepted only “Lamar, Missouri”, while others were
content with “Missouri” (no judge accepted “USA”). For questions that involve locations
outside the United States, however, assessors often felt that the name of the country was
sufficient. Once again, we observed legitimate differences of opinion, both among
different assessors and across different questions.
Finally, there are many cases where an answer is not explicitly stated in a document, but
requires the assessor to bring external knowledge to bear in interpreting the text. For
example, consider the question “What is Pennsylvania’s nickname?”, whose answer is
“keystone state”. The vast majority of supporting documents do not explicitly relate the
state with its nickname, but the connection is clear from the rhetorical structure of the
articles; most assessors did find such documents to be perfectly acceptable. Consider a
more complicated situation concerning the question “Where did Allen Iverson go to
college?” Some articles mentioned that he was a Hoya, which sports aficionados might
automatically associate with Georgetown University. However, this knowledge cannot
be considered “common sense”, and hence relevancy judgments of such documents
would vary from assessor to assessor.
Our preliminary efforts in developing a reusable test collection for question answering have given us valuable insights into the challenges associated with a large-scale resource-building endeavor. There is no such thing as “an obvious answer” or “universal ground truth”. It is fruitless to try to create strict rules governing what constitutes a correct answer, because this notion varies both from person to person and from question to question (cf. Voorhees, 1999). Legitimate differences in opinion are an inescapable fact of question answering evaluation, one that should be addressed rather than hidden. By properly managing these variations (see next section), we can build a test collection that is both high in quality and accurately reflective of real-world user needs.
4. Building AnswerBank
Our proposed plan for building AnswerBank can be summarized by the following
workflow diagram:
Select Questions → Determine Answers → Formulate Queries → Assess Documents → Add Additional Annotations

Figure 1: Workflow of the annotation process.
Selecting Questions. We plan to select and manually annotate one thousand questions from the TREC 8, TREC 9, TREC 10, TREC 11, and TREC 12 Question
Answering Tracks. Such an effort will be approximately an order of magnitude larger
than our previous pilot study, allowing researchers to accurately evaluate their systems in
a statistically significant manner. Using existing TREC questions provides some
continuity from presently available resources. These questions are based on search
engine logs, and represent a realistic sampling of real-world natural language questions
submitted by users on the Web (Voorhees, 2001).
Some TREC questions, however, are less appropriate for inclusion in AnswerBank.
These include questions that have too many relevant documents within the corpus,
e.g., “What is the capital of the United States?” Since so many documents contain the
answer, it would not be productive to manually find an exhaustive set of judgments. Also
excluded from AnswerBank will be so-called “definition questions” such as “Who was
Galileo?”, which generally cannot be answered by simple named entities (Voorhees,
2003; Hildebrandt et al., 2004). In general, we will aim to annotate questions that are
medium to above average in difficulty, as determined by the number of systems that
correctly answered those questions. Such a choice of test questions will serve to drive
forward the state of the art in natural language processing technology.
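One possible realization of this difficulty-based filter is sketched below; the accuracy thresholds are our own illustrative assumptions rather than values fixed in this proposal.

```python
def select_questions(per_system_correct, lower=0.05, upper=0.50):
    """Keep questions of medium to above-average difficulty, measured by the
    fraction of TREC systems that answered each question correctly. The
    5%-50% band is an illustrative assumption."""
    selected = []
    for qid, correct_flags in per_system_correct.items():
        accuracy = sum(correct_flags) / len(correct_flags)
        if lower <= accuracy <= upper:
            selected.append(qid)
    return selected

# A question answered correctly by 12 of 40 systems is kept; one answered by
# 38 of 40 systems is considered too easy (hypothetical data).
print(select_questions({"1396": [1] * 12 + [0] * 28,
                        "1001": [1] * 38 + [0] * 2}))  # -> ['1396']
```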
Determining Answers. Answer strings from the previous TREC evaluations will serve
as the starting point for building our test collection. The purpose of this step is not to
make a final determination of answer correctness, because such judgments can only be
made with respect to a particular document. Rather, these answers will be employed in
the subsequent step to gather a candidate pool of documents for further evaluation.
Formulating Queries. After determining the range of acceptable answers to a question,
actual judgments of answer string and supporting document pairs must be made
manually. This will be accomplished with the assistance of a Boolean keyword retrieval
system. The query issued to this system will contain keywords that must be present in a
document in order for it to be considered relevant. As an example, consider question
1436, “What was the name of Stonewall Jackson's horse?”, whose answer is “Little
Sorrel”. The query used to generate a candidate document pool would be “Jackson AND
Little AND Sorrel”. Note that a relevant document might use a synonym for “horse”
such as “steed”, and hence we would not include that keyword in the query. We will take
care to formulate queries as broadly as possible to ensure that all relevant documents will
be considered.
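The sketch below illustrates this candidate-pool generation step with a toy in-memory index; a real implementation would use a full Boolean retrieval engine over the AQUAINT corpus, and the two documents shown are invented.

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: dict mapping document id -> text. Returns term -> set of doc ids."""
    index = defaultdict(set)
    for docid, text in corpus.items():
        for term in text.lower().split():
            index[term.strip(".,'\"?")].add(docid)
    return index

def conjunctive_query(index, terms):
    """Return the documents containing ALL query terms (a Boolean AND query)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Invented two-document corpus for question 1436.
corpus = {
    "doc1": "Stonewall Jackson rode his horse Little Sorrel at Chancellorsville.",
    "doc2": "Jackson delivered a speech in Little Rock.",
}
index = build_index(corpus)
print(conjunctive_query(index, ["Jackson", "Little", "Sorrel"]))  # -> {'doc1'}
```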
Assessing Documents. Once a candidate set of documents has been gathered, it will be
manually reviewed by multiple assessors and judged to be relevant, unsupported, or
irrelevant. A relevant document not only contains the correct answer, but must also
mention the answer in the correct context of the natural language question. An
unsupported document may mention the correct answer, but its connection to the natural
language question may be unclear from the text. An irrelevant document may contain
keywords from the correct answer, but does not address the natural language question.
This judgment process will require us to directly confront differences in opinion
regarding what constitutes an acceptable answer. Although we recognize that assessors
will inevitably disagree, we hope to gain a deeper understanding of the underlying
cognitive model that drives this answer evaluation process. Such knowledge may lead to heuristics that can be operationalized by computer systems.
In an attempt to reduce variations in judgment, we will produce a set of general
guidelines similar to those given to TREC assessors. For example, we will assume a
college-educated, North American adult newspaper reader as the idealized target user.
The purpose of these guidelines is to provide a common reference point, not to force
humans to agree on what a good answer is.
To cope with variations in judgment, each question will be multiply annotated by
different assessors. We plan to adjudicate differences in opinion by adopting a simple
majority voting strategy, although we will certainly experiment with more sophisticated
strategies as they become apparent. The purpose of adjudication is not to impose
consistency in judgments, but rather to capture obvious errors and misinterpretation of
the assessment guidelines. Inter-annotator agreement and other relevant statistical
measures will be collected to quantitatively assess the annotation progress.
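The sketch below shows what simple majority-vote adjudication and a raw agreement measure might look like; the judgments are invented, and in practice a chance-corrected statistic such as Fleiss' kappa would complement the raw agreement figure.

```python
from collections import Counter
from itertools import combinations

def adjudicate(labels):
    """Majority vote over one document's labels from multiple assessors.
    Ties are flagged for manual review rather than silently resolved."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if list(counts.values()).count(top_count) > 1:
        return "needs-review"
    return top_label

def observed_agreement(annotations):
    """Fraction of assessor pairs that agree, averaged over all judged items."""
    agree = total = 0
    for labels in annotations:
        for a, b in combinations(labels, 2):
            agree += (a == b)
            total += 1
    return agree / total if total else 0.0

# Hypothetical judgments from three assessors on two documents.
annotations = [["relevant", "relevant", "unsupported"],
               ["irrelevant", "irrelevant", "irrelevant"]]
print([adjudicate(labels) for labels in annotations])  # ['relevant', 'irrelevant']
print(round(observed_agreement(annotations), 2))       # 0.67
```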
Adding Additional Annotations. For each relevant document belonging to a question in
the testset, we propose to add meta-knowledge noting the natural language
processing technologies that are exercised by the particular instance of an answer. For
example, in order to properly extract the answer to the question “Who killed Abraham
Lincoln?” from the text fragment “John Wilkes Booth altered history with a bullet. He
will forever be known as the man who ended Abraham Lincoln’s life.”, a system must
have knowledge of anaphoric references (e.g., relating “he” to “John Wilkes Booth”) and
paraphrases (i.e., that ending someone’s life is synonymous with killing someone). We
believe that such additional annotations will help researchers categorize the difficulty of a
question (Lange et al., 2004); for example, an answer passage that shares many keywords
in common with the question can probably be considered “easy”.
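One possible record layout for these meta-annotations is sketched below; the field names and the phenomenon inventory are placeholders, since the actual taxonomy of language-processing challenges is to be induced during annotation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical inventory of language-processing challenges.
PHENOMENA = {"anaphora", "paraphrase", "lexical-semantics", "inference", "keyword-overlap"}

@dataclass
class AnswerAnnotation:
    qid: str                  # question identifier
    docid: str                # supporting document identifier
    answer: str               # answer string as it appears in context
    judgment: str             # "relevant", "unsupported", or "irrelevant"
    phenomena: List[str] = field(default_factory=list)  # challenges exercised

# Invented identifiers; the phenomena follow the Lincoln example above.
record = AnswerAnnotation(
    qid="Q-Lincoln", docid="DOC-XYZ",
    answer="John Wilkes Booth",
    judgment="relevant",
    phenomena=["anaphora", "paraphrase"],  # "he" -> Booth; "ended ... life" = killed
)
print(record)
```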
Although we currently do not have a classification of natural language tasks relevant to
question answering, we believe that such an inventory could be an additional benefit of
building AnswerBank. By examining the corpus and a large collection of questions and
documents, we hope to induce an ontology of challenges that must be overcome in order
to solve the question answering problem. Furthermore, the results of our work may
provide an empirical distribution of the pervasiveness of each research challenge; such
knowledge will enable us to develop a roadmap for question answering research. We
could compile recommendations regarding the most pertinent research questions that
would have the greatest impact on question answering performance.
4.1 Infrastructure Support
The construction of AnswerBank will necessitate the development of specialized
annotation tools to streamline the workflow. We believe that these tools may themselves
contribute to advancing our knowledge in human-computer interfaces, especially in areas
pertaining to information presentation and information visualization.
To support rapid assessment of document relevance, we plan to experiment with novel
interfaces for presenting search results in the context of question answering systems. In
fact, we have previously conducted a user study to determine the effectiveness of
different answer presentation strategies for question answering systems (Lin et al., 2003a;
Lin et al., 2003b). Working under a keyword-in-context (Leung and Apperley, 1994)
paradigm, we explored user preferences in the amount of context accompanying an
answer. The study tested four different interface conditions: exact answer only (no
context), the answer highlighted in the sentence in which it occurred, the answer
highlighted in the paragraph in which it occurred, and the answer highlighted in the
document in which it occurred. We discovered that the majority of users preferred the
answer-in-paragraph interface, because it provided them a natural discourse unit in which
to situate their answer. In addition, a paragraph-sized result allows users to find answers
to related questions.
We plan to build on these previous results in developing the necessary infrastructure to
support the creation of AnswerBank. The development of effective interfaces for
presenting search results has implications beyond the annotation effort described in this
proposal. Lessons learned from interface design can be applied to a wide range of
information systems.
4.2 Information Retrieval Experiments
We believe that AnswerBank will enable a wide variety of question answering research
that has until now been difficult without a reusable test collection. In fact, we have
already performed preliminary studies of what can be accomplished with our pilot test
collection.
Although factoid question answering is distinct from the task of retrieving relevant
documents in response to a user query (so-called ad hoc retrieval), document retrieval
systems nevertheless play a central role in the question answering process. Because
natural language processing techniques are relatively slow, a question answering system
typically relies on traditional document retrieval techniques to first gather a set of
candidate documents, thereby reducing the amount of text that must be analyzed.
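Schematically, this two-stage architecture can be sketched as follows; the overlap-based scorer and the two-document corpus are toy stand-ins for a real retrieval engine and answer extractor.

```python
def retrieve(question, corpus, k=50):
    """Stage 1: cheap keyword-overlap scoring selects k candidate documents."""
    q_terms = set(question.lower().split())
    ranked = sorted(corpus.items(),
                    key=lambda item: len(q_terms & set(item[1].lower().split())),
                    reverse=True)
    return [docid for docid, _ in ranked[:k]]

def extract_answer(question, docids, corpus):
    """Stage 2 placeholder: the expensive linguistic analysis (parsing, anaphora
    resolution, etc.) runs only over the reduced candidate set."""
    return [(docid, corpus[docid]) for docid in docids]

# Toy corpus; in reality stage 1 runs over roughly one million AQUAINT articles.
corpus = {"d1": "Fala was the president's faithful companion.",
          "d2": "Stock markets fell sharply today."}
question = "Who was the president's companion?"
print(extract_answer(question, retrieve(question, corpus, k=1), corpus))
```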
In a recent study, we explored the role of document retrieval in the context of question
answering. In particular, we examined different strategies for coping with morphological
variation, which poses a challenge to all types of information retrieval systems. Ideally, a
system should be able to retrieve documents containing closely-related variants of
keywords found in the query, e.g., the query term love should not only match the term
love present in documents, but also the terms loving, loved, and loves.
There are, in principle, two different ways for coping with morphological variation. The
most popular strategy is to apply a stemming algorithm at indexing time and store only
the resulting word stems (e.g., love → lov); this naturally requires user queries to be
similarly analyzed. The effects of this morphological normalization process have been
well studied in the context of document retrieval, but it is still unclear whether or not
stemming is effective (Harman, 1991; Krovetz, 1993; Hull, 1996; Monz, 2003). An
alternative strategy is indexing the original word forms, as they are, and expanding query
terms with their morphological variants at retrieval time (e.g., love → love OR loving OR loved OR loves). Performing query expansion at retrieval time, however, requires the
ability for structured querying, a capability that may not be present in all document
retrieval systems.
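The contrast between the two strategies can be sketched as follows. NLTK's Porter stemmer stands in for the index-time normalization, while the morphological variant table and the three-document corpus are illustrative assumptions in place of a real morphological lexicon.

```python
# requires: pip install nltk
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def index_stemmed(corpus):
    """Strategy 1: normalize terms at indexing time; queries are stemmed the same way."""
    index = defaultdict(set)
    for docid, text in corpus.items():
        for term in text.lower().split():
            index[stemmer.stem(term)].add(docid)
    return index

def index_surface(corpus):
    """Strategy 2: index the original word forms untouched."""
    index = defaultdict(set)
    for docid, text in corpus.items():
        for term in text.lower().split():
            index[term].add(docid)
    return index

# Toy variant table; a real system would derive variants from a lexicon.
VARIANTS = {"love": ["love", "loving", "loved", "loves"]}

def query_stemmed(index, term):
    return index.get(stemmer.stem(term.lower()), set())

def query_expanded(index, term):
    """Expand the query term with its variants at retrieval time (a disjunction)."""
    hits = set()
    for variant in VARIANTS.get(term.lower(), [term.lower()]):
        hits |= index.get(variant, set())
    return hits

corpus = {"d1": "a loving tribute", "d2": "she loves opera", "d3": "love conquers all"}
print(query_stemmed(index_stemmed(corpus), "love"))    # stemming at index time
print(query_expanded(index_surface(corpus), "love"))   # expansion at query time
```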
Our study reveals that the difference in recall between the stemming and morphological query expansion approaches cannot be detected when evaluating results with
the NIST-supplied set of relevant documents. That is, both stemming and morphological
query expansion were found to perform equally well, slightly above the baseline of no
stemming. However, evaluating the exact same techniques with our pilot test collection
produced a different picture. We found that morphological query expansion gives rise to
higher recall than stemming (which in turn has higher recall than the baseline). We
believe that these results highlight the danger of using the NIST-supplied relevant
documents for research experiments. Since these documents were gathered by pooling
the results of participants, they reflect the type of technologies employed by those
systems. Since most TREC entries rely on keyword-based techniques with little linguistic sophistication, the resulting pool of documents does not properly assess the performance
of more advanced query generation techniques such as those involved in our
morphological query expansion experiments.
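The following toy calculation illustrates the effect; the document identifiers and numbers are invented for illustration and are not our experimental results. Against a sparse pooled judgment set, two runs of genuinely different quality can appear identical, while a more exhaustive judgment set separates them.

```python
def recall(retrieved, relevant):
    """Fraction of judged-relevant documents that a retrieval run found."""
    return len(set(retrieved) & set(relevant)) / len(relevant) if relevant else 0.0

pooled_relevant     = {"d1", "d2"}                      # sparse pooled judgments
exhaustive_relevant = {"d1", "d2", "d3", "d4", "d5"}    # pilot-style judgments

stemming_run  = ["d1", "d2", "d3"]
expansion_run = ["d1", "d2", "d3", "d4"]

for name, run in [("stemming", stemming_run), ("expansion", expansion_run)]:
    print(name, recall(run, pooled_relevant), recall(run, exhaustive_relevant))
# Against the pooled judgments both runs score recall 1.0; only the fuller
# judgment set separates them (0.6 vs. 0.8).
```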
Not only have we outlined a project to fill a void in question answering resources, but we
have also demonstrated concretely how such a test collection can be used to further
research in document retrieval technology. In parallel to the annotation effort in building
AnswerBank, we will also engage in a number of exploratory studies similar to the one
described above to test our emerging test collection and ensure its usefulness.
4.3 Resource Maintenance
AnswerBank is a static resource that, upon completion, will be publicly released to the
research community. No further maintenance of the project will be required, although
additional funding may be necessary to further expand the scope and coverage of the
collection beyond what is already outlined in this proposal. If the infrastructure tools and software developed during the annotation effort are found to be useful beyond this project, they will be released to the community under an open-source license.
5. Impact
The availability of language resources and reusable test collections has been a major
driving force in the development of human language technologies and natural language
applications. From a resource point of view, they serve as a reference point of
performance that concrete applications can strive for. From an evaluation point of view,
the existence of reusable test collections allows researchers to conduct experiments with
rapid turn-around, allowing faster exploration of the solution space and leading to
accelerated improvements in performance. The value of resource-building projects is
well-documented in many computational linguistics areas such as statistical parsing,
machine translation, and automatic summarization. In these disciplines, the development
of resources, meaningful evaluation metrics, and standardized test collections is directly
responsible for dramatic performance improvements in short amounts of time. It is clear
that appropriately-designed, community-wide resources can serve as high-impact
enablers that allow entire fields to accomplish what previously was believed to be
impossible. We have similarly high hopes for AnswerBank, which will fill a gap in
available question answering resources and pave the way for rapid advances in the state
of the art.
In the process of building AnswerBank, we will have to tackle the thorny question of
“What makes a piece of text a good answer to a question?” As we have shown, relevance
assessment is a complex process that is both person- and question-specific. We believe
that AnswerBank may help reveal important insights about information-seeking behavior
and its underlying cognitive processes. A better understanding of these mental models will lead to more useful information systems that help users cope with today’s often overwhelming information environment.
References
Bilotti, Matthew W., “Query Expansion Techniques for Question Answering”, Master's
thesis, Massachusetts Institute of Technology, 2004.
Bilotti, Matthew W., Boris Katz and Jimmy Lin, “What Works Better for Question
Answering: Stemming or Morphological Query Expansion?”, Proceedings of the
Information Retrieval for Question Answering (IR4QA) Workshop at SIGIR 2004, 2004.
Charniak, Eugene, “Immediate Head Parsing for Language Models”, Proceedings of the
39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), 2001.
Collins, Michael, “Three Generative Lexicalized Models for Statistical Parsing”,
Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics
(ACL 1997), 1997.
Echihabi, Abdessamad and Daniel Marcu, “A Noisy-Channel Approach to Question
Answering”, Proceedings of the 41st Annual Meeting of the Association for
Computational Linguistics (ACL 2003), 2003.
Harman, Donna, “How Effective is Suffixing?”, Journal of the American Society for
Information Science, 42(1), 7–15, 1991.
Hildebrandt, Wesley, Boris Katz, and Jimmy Lin, “Answering Definition Questions with
Multiple Knowledge Sources”, Proceedings of the 2004 Human Language Technology
Conference and the North American Chapter of the Association for Computational
Linguistics Annual Meeting (HLT/NAACL 2004), 2004.
Hull, David A., “Stemming Algorithms: A Case Study for Detailed Evaluation”, Journal
of the American Society for Information Science, 47(1), 70–84, 1996.
Ittycheriah, Abraham, Martin Franz, Wei-Jing Zhu, and Adwait Ratnaparkhi, “IBM's
Statistical Question Answering System”, Proceedings of the Ninth Text REtrieval
Conference (TREC-9), 2000.
Krovetz, Robert, “Viewing Morphology as an Inference Process”, Proceedings of the
16th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR 1993), 1993.
Lange, Rense, Juan Moran, Warren R. Greiff, and Lisa Ferro, “A Probabilistic Rasch
Analysis of Question Answering Evaluations”, Proceedings of the 2004 Human
Language Technology Conference and the North American Chapter of the Association
for Computational Linguistics Annual Meeting (HLT/NAACL 2004), 2004.
Leung, Ying K. and Mark D. Apperley, “A Review and Taxonomy of Distortion-Oriented Presentation Techniques”, ACM Transactions on Computer-Human Interaction, 1(2):126–160, 1994.
Lin, Chin-Yew, and Eduard Hovy, “Automatic Evaluation of Summaries Using N-gram
Co-occurrence Statistics”, Proceedings of the 2003 Human Language Technology
Conference and the North American Chapter of the Association for Computational
Linguistics Annual Meeting (HLT/NAACL 2003), 2003.
Lin, Jimmy, Dennis Quan, Vineet Sinha, Karun Bakshi, David Huynh, Boris Katz, and
David R. Karger, “The Role of Context in Question Answering Systems”, Proceedings of
the 2003 Conference on Human Factors in Computing Systems (CHI 2003), 2003a.
Lin, Jimmy, Dennis Quan, Vineet Sinha, Karun Bakshi, David Huynh, Boris Katz, and
David R. Karger, “What Makes a Good Answer? The Role of Context in Question
Answering”, Proceedings of the Ninth IFIP TC13 International Conference on Human-Computer Interaction (INTERACT 2003), 2003b.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz, “Building a Large
Annotated Corpus of English: The Penn Treebank”, Computational Linguistics, 19(2),
313–330, 1994.
Monz, Christof, “From Document Retrieval to Question Answering”, Ph.D. Dissertation,
Institute for Logic, Language, and Computation, University of Amsterdam, 2003.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu, “BLEU: a Method for
Automatic Evaluation of Machine Translation”, Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics (ACL 2002), 2002.
Radev, Dragomir, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal,
“Probabilistic Question Answering on the Web”, Proceedings of the Eleventh
International World Wide Web Conference (WWW2002), 2002.
Voorhees, Ellen M. and Dawn M. Tice, “The TREC-8 Question Answering Track
Evaluation”, Proceedings of the Eighth Text REtrieval Conference (TREC-8), 1999.
Voorhees, Ellen M. and Dawn M. Tice, “Overview of the TREC-9 Question Answering
Track”, Proceedings of the Ninth Text REtrieval Conference (TREC-9), 2000.
Voorhees, Ellen M., “Overview of the TREC 2001 Question Answering Track”,
Proceedings of the Tenth Text REtrieval Conference (TREC 2001), 2001.
Voorhees, Ellen M., “Overview of the TREC 2002 Question Answering Track”,
Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), 2002.
Voorhees, Ellen M., “Overview of the TREC 2003 Question Answering Track”,
Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), 2003.
Zobel, Justin, “How Reliable Are the Results of Large-Scale Information Retrieval
Experiments?”, Proceedings of the 21st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR 1998), 1998.