CISE CRI: AnswerBank: A Reusable Test Collection for Question Answering
University of Maryland and Massachusetts Institute of Technology

Question answering is a young, exciting research field that lies at the intersection of computational linguistics and information retrieval. The technology is a promising solution to the textual information overload problem prevalent in today’s information-rich environment. In contrast to the long “hit lists” returned by current information retrieval systems, question answering systems leverage natural language processing technology to provide much more precise and succinct responses to user questions. Current question answering technology focuses on so-called factoid questions such as “Who killed Abraham Lincoln?” that are usually answered by named entities such as people, places, and dates.

As with many other research areas, progress is measured by quantitative evaluations against a common set of community-accepted benchmarks. In the field of question answering, the QA Tracks at the Text Retrieval Conferences (TRECs) serve this purpose. Unfortunately, these evaluations are a yearly event, and their results are not reproducible by individual groups outside of the formal evaluation. In short, no reusable test collection exists for question answering research.

To remedy this situation and to spur future developments, we propose to build AnswerBank, a shared community resource consisting of a large set of manually selected questions, their answers, documents supporting those answers, and additional annotations. In the past few months, we have built a small pilot test collection that demonstrates the feasibility of our ideas. Using it, we have evaluated strategies for handling morphological variation in document retrieval for question answering. Although previously existing evaluation resources were unable to detect a performance difference between two competing strategies (stemming and morphological expansion), our experiments with the pilot test collection clearly demonstrated that one strategy is superior to the other. These results demonstrate that the creation of AnswerBank will allow researchers to rapidly, and more importantly, accurately assess the impact of their algorithms. Faster experimental turn-around time will translate into faster exploration of the solution space and lead to accelerated performance gains.

We hope that the creation of AnswerBank will not only lead to better systems, but also enable the development of new techniques. For example, many statistical machine learning approaches, which require large amounts of clean training data, would greatly benefit from our project. Through our recent experiences, we have learned that there is no such thing as an “obvious” answer to a natural language question. Legitimate differences in opinion are an inevitable part of any activity that involves humans engaged in real-world tasks. Instead of viewing this as a downside, we believe these variations to be instructive about the underlying cognitive processes involved in answering questions. By properly managing these differences, we can create a high-quality test collection that reflects real-world user needs, and perhaps additionally shed some light on the process through which certain information requests are satisfied.
1. Introduction

The workings of modern information society are not limited by our ability to electronically store information, but rather by our ability to retrieve it effectively and in a timely manner. One pressing issue concerns the increasingly difficult task of accessing the enormous quantities of textual data ubiquitous in our daily lives. The ability to specify information requests using natural language is seen as a potential solution, especially when coupled with natural language processing technology capable of delivering succinct responses instead of long “hit lists” that users must subsequently browse through. This technology, known as question answering, is a young and exciting research field that lies primarily at the intersection of information retrieval and computational linguistics.

Question answering research presently focuses on fact-based questions (so-called “factoid” questions) such as “What was the name of FDR’s dog?”. Although simple in appearance, such questions can be surprisingly difficult to answer; for example, consider the following passage: “Fala was the president’s faithful companion throughout the war years. After his death, the shaggy Scottie lived with Eleanor…” Properly computing an appropriate response often requires bringing to bear many natural language processing technologies, from co-reference resolution to lexical semantics to logical inferencing. As such, question answering not only serves as a practical, high-impact application, but also as an experimental playground for exercising many core language technologies. In addition, the field has ambitions to expand beyond factoid questions to those involving, for example, extended user interactions requiring human-computer dialogues, complex reasoning, and knowledge fusion from multiple knowledge sources.

To ensure steady progress in the field, researchers must be able to quantitatively and automatically evaluate question answering systems. Unfortunately, the community currently lacks a reusable test collection and the infrastructure necessary to support repeatable experiments that produce comparable results “at the push of a button”. The situation is quite different in many other areas of computational linguistics such as statistical parsing (Marcus et al., 1994; Collins, 1997; Charniak, 2001), machine translation (Papineni et al., 2002), and automatic summarization (Lin and Hovy, 2003), where the existence of well-designed evaluation resources supports experiments with rapid turn-around time, resulting in faster exploration of the solution space and leading to accelerated improvements in performance. We propose to develop AnswerBank, a shared community resource consisting of a repository of manually selected factoid questions and their answers, as determined by human assessors. This resource will enable researchers in our field to conduct repeatable “push-button” evaluations, potentially leading to rapid advances in the state of the art.

1.1 Project Goals

The central deliverable of this proposed project is AnswerBank, a reusable test collection for question answering experiments on a corpus of newspaper articles (the AQUAINT corpus, available from the Linguistic Data Consortium at http://www.ldc.upenn.edu/). For each question in the testset, we will attempt to exhaustively find and record all answer instances within the corpus. Human annotators will manually assess the relevance of documents that contain keywords from both the question and the answer, and record the appropriate judgments.
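To fix ideas, a single AnswerBank entry might be recorded along the lines of the following sketch. The Python representation, field names, and the assessor identifier are illustrative assumptions rather than a committed design; the question and document values are drawn from the Pompeii example discussed in Section 3.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocumentJudgment:
    """One assessor's judgment of a (question, document) pair."""
    docid: str     # AQUAINT document identifier, e.g. "APW19990823.0165"
    label: str     # "relevant", "unsupported", or "irrelevant"
    assessor: str  # anonymized assessor identifier (invented here)

@dataclass
class AnswerBankEntry:
    """Hypothetical record for one question in the test collection."""
    question_id: str    # e.g. the TREC question number
    question: str       # the natural language question
    answers: List[str]  # acceptable answer strings
    judgments: List[DocumentJudgment] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)  # NLP phenomena, e.g. ["anaphora"]

entry = AnswerBankEntry(
    question_id="1396",
    question="What is the name of the volcano that destroyed the ancient city of Pompeii?",
    answers=["Vesuvius"],
    judgments=[DocumentJudgment("APW19990823.0165", "relevant", "assessor-1")],
)
```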
AnswerBank was conceived with four major purposes in mind:

1. To serve as a reusable test collection for question answering evaluation.
2. To serve as training data for developing question answering algorithms.
3. To serve as a roadmap for future question answering research.
4. To shed some light on the cognitive process of answering questions.

The primary use of AnswerBank as a reusable test collection has already been discussed. Beyond evaluations, the existence of human judgments for factoid questions will serve as a valuable resource for the development of question answering algorithms themselves. Much success in computational linguistics can be traced to recent advances in statistical machine learning techniques, and indeed, statistical methods have been applied to the question answering task (Ittycheriah et al., 2000; Radev et al., 2002; Echihabi and Marcu, 2003). Naturally, performance is to a large extent dependent on both the quantity and quality of training data. The availability of complete, exhaustive answer sets for a large number of natural language questions will provide the foundation for advances in statistical language processing techniques.

Typically, the ability to correctly answer a question involves a combination of natural language processing technologies (for example, reasoning over lexical-semantic relations, or anaphora resolution). For each question, we will attempt to document the issues involved in computing an appropriate answer. We hypothesize that a relatively small number of language processing challenges are involved in properly analyzing a text fragment to arrive at a correct answer. If this hypothesis is correct, then it may be possible to catalog the variety of research issues that make up the question answering task. The result may serve as a roadmap for future research and will allow the community to efficiently allocate resources in tackling various enabling technologies.

Finally, we believe that this project will shed some light on the human cognitive processes involved in answering questions. In the process of building AnswerBank, we hope to gain a better understanding of what it means to “answer a question”, and the processes by which information needs are satisfied. As we have discovered in initial experiments, human judgments of answers and supporting documents exhibit a large range of variation that reflects legitimate differences in opinion; users bring different backgrounds and biases to bear in interpreting an answer. An appropriate characterization of these issues will ensure that AnswerBank reflects true user information needs. This project will not only be valuable as a community resource, but will also be a worthwhile exploration into the theoretical underpinnings of information-seeking behavior.

As time and resources allow, we will conduct exploratory studies in natural language processing technologies for question answering, using AnswerBank as the yardstick for evaluating performance. Although the core focus of this project is resource building, and not the development of advanced technology per se, our experiments using AnswerBank will ensure its usefulness to the research community.
Upon completion, AnswerBank will be released to the research community; we hope that its availability will spur new developments, not only in question answering, but also in other areas of computational linguistics and information retrieval.

2. The Current State of Question Answering Evaluation

Over the past few years, the Question Answering Tracks at the Text Retrieval Conferences (TRECs), sponsored by the National Institute of Standards and Technology (NIST), have brought formal and rigorous evaluation methodologies to bear on the question answering task: blind testsets, shared corpora, comparable metrics, adjudicated human evaluation, and post-hoc stability analyses of performance (Voorhees and Tice, 1999, 2000; Voorhees, 2001, 2002, 2003). The result is a performance benchmark that has gained community-wide acceptance. The TREC QA Tracks, in fact, have become the locus of question answering research, serving not only as an annual forum for meaningful comparison of natural language processing and information retrieval techniques, but also as an efficient vehicle for the dissemination of research results. Successful techniques are frequently adopted by other teams in subsequent years, leading to rapid advances in the state of the art.

In the TREC instantiation of the question answering task, a system’s response to a natural language question is a pair consisting of an answer string and a supporting document. All responses are manually judged by at least one human, who assigns one of three labels: “correct”, “unsupported”, or “incorrect”. In order for a response to be judged “correct”, the answer string must provide the relevant information and the supporting document must provide an appropriate justification of the answer string. Consider the question “What Spanish explorer discovered the Mississippi River?” A response of “Hernando de Soto” paired with a document that contains the fragment “the 16th-century Spanish explorer Hernando de Soto, who discovered the Mississippi River…” would be judged as correct. However, the same answer string, paired with a document that contains the sentence “In 1542, Spanish explorer Hernando de Soto died while searching for gold along the Mississippi River”, would be judged as unsupported. While the answer string is correct, a human cannot conclude by reading the text that de Soto did indeed discover the Mississippi River. Thus, evaluating the response of a question answering system involves not only the answer itself, but careful consideration of the document from which the string was extracted. Because of this reality, a rigorous evaluation of question answering systems is not possible outside the annual TREC cycle with currently available resources.

Before discussing the deficiencies associated with present evaluation tools, it is necessary to first understand the nature of existing resources. Each year, NIST compiles correct answers in the form of answer patterns (regular expressions) and a list of known relevant documents by pooling the responses of all participants. Since the average performance of current systems is still somewhat poor (see Voorhees, 2003, for a summary of last year’s results), the number of known relevant documents for each question is small, averaging 1.95 relevant documents per question on the TREC 2002 testset. In the same testset, no single question had more than four known relevant documents.
Even a casual examination of the corpus reveals the existence of many more relevant documents, demonstrating that the judgments are far from exhaustive. There are two major reasons for this: first, since most teams utilize relatively simple keyword-based techniques to extract answers, there is limited diversity in the types of documents that are retrieved. Second, since each system is only allowed to return one answer per question, the maximum number of relevant documents is bounded by the number of participating teams. For questions that have many more relevant answers in the corpus, there is no hope of collecting a more complete set of judgments within the TREC evaluation framework. Moreover, careful inspection of the document list reveals duplicates and errors. To be fair, NIST provides the answer patterns and relevant document lists each year as a convenience only; they were never meant to serve as a complete test collection for future experiments. For lack of any better resources, however, these partial judgments have been employed by the research community in many question answering experiments.

To understand the danger of evaluating question answering systems with currently available resources, one must first understand how they are being used. In the standard automatic evaluation setup, the answer string is matched against the answer patterns and the supporting document is matched against the list of known relevant documents, both provided by NIST through the pooling technique described above. Because this list is far from exhaustive, new, successful retrieval techniques may not be properly rewarded: a perfectly acceptable answer may be judged as incorrect simply because its supporting document does not appear in the list of known relevant documents. Current question answering systems, for the most part, are based on matching documents in the corpus that have keywords in common with the question. As more sophisticated linguistic processing technologies are developed, it is entirely possible that a system can extract answers from documents that share few or no keywords with the question. Such systems are likely to return documents that have never been assessed before, but which are assumed to be irrelevant by default. The current set of judgments therefore serves as a poor basis for evaluating new systems; without an exhaustive set of relevant documents, one would never know if a system’s performance is actually improving.

Although the practice of pooling relevant documents is well established in ad hoc retrieval, and has been shown to produce fair and stable evaluations (Zobel, 1998), we believe that the same methodology cannot be directly applied to question answering evaluations. A major difference between the tasks is the pool depth: in ad hoc retrieval, the top one hundred documents from every system are pooled and judged. Since question answering systems return only a single answer, the pool depth is effectively one; Zobel’s own results show that the reliability of the pooling strategy depends on the pool depth, and that metrics become unstable as pool sizes decrease. Thus, there are serious doubts as to whether previous TREC results can serve as test collections for future experiments.

Formal, repeatable evaluations with clearly defined performance metrics and stable results represent a major driver of innovative research.
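For reference, the pattern-based automatic evaluation described above reduces to roughly the following sketch. The function and the example document identifiers are assumptions made for illustration; only the general scheme, regular-expression answer patterns plus a list of known relevant documents, reflects the actual NIST-supplied resources.

```python
import re

def judge_response(answer_string, supporting_docid, answer_patterns, known_relevant_docs):
    """Approximate the standard automatic evaluation: check the answer string
    against the answer patterns (regular expressions), and the supporting
    document against the pooled list of known relevant documents."""
    if not any(re.search(pattern, answer_string) for pattern in answer_patterns):
        return "incorrect"
    if supporting_docid not in known_relevant_docs:
        # The document may well be relevant but was simply never judged;
        # automatic scoring has no choice but to count the response as unsupported.
        return "unsupported"
    return "correct"

# Illustrative use with the de Soto example (patterns and document identifiers are made up):
patterns = [r"[Hh]ernando\s+de\s+Soto", r"[Dd]e\s+Soto"]
known_relevant = {"XIE19990101.0001"}
print(judge_response("Hernando de Soto", "NYT19980601.0123", patterns, known_relevant))
# -> "unsupported", even if NYT19980601.0123 actually supports the answer
```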
In addition, the ability to rapidly conduct experiments with minimal turn-around time allows researchers to immediately assess the effectiveness of different techniques, leading to accelerated progress. However, this desirable research environment depends on the existence of test collections that can accurately capture true performance. Such a resource currently does not exist, and it is the goal of our proposed project to address this issue.

3. Building a Reusable QA Test Collection

The creation of a reusable test collection for factoid question answering remains an open research problem, and the lack of suitable resources becomes more problematic as the research community moves towards more difficult question types, e.g., those that involve inferencing. We propose to confront this problem head on. In a pilot study over the past few months, we have manually built a small reusable question answering test collection (Bilotti, 2004; Bilotti et al., 2004) consisting of 120 questions selected from the TREC 2002 testset. This testset is based on the AQUAINT corpus, a three-gigabyte collection of approximately one million news articles from the New York Times, the Associated Press, and the Xinhua News Service. Working from known answers to the TREC 2002 questions, we constructed queries from terms selected from each question and its answer, terms that we believed a relevant document would have to contain. Although it is entirely possible that a relevant document may contain none of the words from the question and the answer, we assumed that this happens very rarely. Using these queries, we retrieved tens of thousands of documents that were manually examined and judged to be relevant, unsupported, or irrelevant for a particular question (an example is summarized in Table 1).

Question: What is the name of the volcano that destroyed the ancient city of Pompeii?

Relevant: “… In A.D. 79, long-dormant Mount Vesuvius erupted, burying the Roman cities of Pompeii and Herculaneum in volcanic ash…” [APW19990823.0165]

Unsupported: “… Pompeii was pagan in A.D. 79, when Vesuvius erupted…” [NYT20000405.0216]

Irrelevant: “… the project of replanting ancient vineyards amid the ruins of Pompeii… Coda di Volpe, a white grape from Roman times that thrives in the volcanic soils on the lower slopes of Mt. Vesuvius…” [NYT20000704.0049]

Table 1: Examples of relevant, unsupported, and irrelevant documents.

To give a concrete example, consider TREC question 1396, “What is the name of the volcano that destroyed the ancient city of Pompeii?”, whose answer is “Vesuvius”. Any relevant document containing the answer to this question must necessarily contain the keywords “Pompeii” and “Vesuvius”. Therefore, we manually examined all documents with those two keywords. For this question, we noted fifteen relevant, three unsupported, and ten irrelevant documents. All documents not explicitly marked are presumed to be irrelevant. An example of a clearly relevant document is APW19990823.0165 (incidentally, this document is not present in the NIST-supplied list of relevant documents), which says that “In A.D. 79, long-dormant Mount Vesuvius erupted, burying the Roman cities of Pompeii and Herculaneum in volcanic ash.” An unsupported document is one that contains the answer and discusses it in the correct sense and context, but does not completely and clearly answer the question.
An example is NYT20000405.0216, which states that “Pompeii was pagan in A.D. 79, when Vesuvius erupted.” The document also addresses speculations that “the people of Pompeii were justly punished by the volcano eruption,” but does not explicitly mention or imply that the city was destroyed by Vesuvius. An irrelevant document containing the terms “Pompeii” and “Vesuvius” is NYT20000704.0049, which discusses winemaking in Campania, the region of Italy containing both Pompeii and Vesuvius. The document talks about vineyards near the ruins of Pompeii, and about a species of grape that grows in the volcanic soil at the foot of Mt. Vesuvius.

For AnswerBank, we plan to apply the same basic methodology employed in the creation of our pilot test collection, but at a much larger scale and with a supplementary set of annotations. In addition, we hope to incorporate the many lessons learned in our exploratory studies. Before describing our proposed workflow, we first discuss the difficulty of this endeavor with respect to variations in judgment.

3.1 The Difficulty in Building a QA Test Collection

The task of building a reusable test collection for factoid question answering is much more difficult than it appears. The most important challenge we must tackle in this project is the question “What makes a good answer?” In this section, we attempt to lay out some of the intricate issues involved.

Consider TREC question 1398, “What year was Alaska purchased?”, whose answer is 1867. A document that contains the keywords “1867” and “Alaska” is APW19990329.0045, which says that “In 1867, U.S. Secretary of State William H. Seward reached agreement with Russia to purchase the territory of Alaska.” Is this a relevant document? At first glance, probably so, but upon closer examination, one might have doubts. Bringing to bear common-sense knowledge about the world, an assessor might realize that the international transaction of purchasing territories is a long and complicated process that might span years. Thus, “reaching an agreement to purchase” in 1867 might not mean that Alaska was “purchased” in 1867. Consider another document, APW19991017.0082, which states that “In 1867, the United States took formal possession of Alaska from Russia.” Would this document be considered relevant? For one, it does not mention explicitly that a purchase was involved (the territory could have been ceded as a result of some other treaty, for example); furthermore, the date of “taking formal possession” might be different from the purchase date. In our pilot study, assessors disagreed on the classification of these documents: a purchase may be a complex commercial transaction that involves multiple steps spanning a long time frame; at what point can the purchase be considered complete? This and similar examples demonstrate that there is no such thing as an “obvious answer”. Humans often have differences of opinion in interpreting both questions and responses, which translate into different relevancy judgments.

Even when the interpretation of a question is relatively uniform across different assessors, many disagreements still arise concerning the acceptability of different answer forms. In a question asking for a date, how exact must the date be? Is the year an event happened sufficient, or is an exact month and day necessary? From our experience, judgments vary not only from assessor to assessor, but also from question to question.
When asking about a relatively recent event, such as “When did World War I start?” or “When was Hurricane Hugo?”, assessors are more likely to require more exact dates (containing both the month and the year, for example). For a question like “When did French revolutionaries storm the Bastille?”, “July 14”, “1789”, and “July 14, 1789” were all considered acceptable by different assessors. Similar variations in judgment also occurred with entities such as people and place names. Is the last name of a person sufficient, or should an answer include both a first and last name? Consider the question “Who won the Nobel Peace Prize in 1991?”, whose answer is “Aung San Suu Kyi”: some assessors accepted “San Suu Kyi”, “Suu Kyi”, or even just “Kyi”. How exact must place names be? For the question “Where was Harry Truman born?”, some assessors accepted only “Lamar, Missouri”, while others were content with “Missouri” (no judge accepted “USA”). For questions that involve locations outside the United States, however, assessors often felt that the name of the country was sufficient. Once again, we observed legitimate differences in opinion, both among different assessors and across different questions.

Finally, there are many cases where an answer is not explicitly stated in a document, but requires the assessor to bring external knowledge to bear in interpreting the text. For example, consider the question “What is Pennsylvania’s nickname?”, whose answer is “Keystone State”. The vast majority of supporting documents do not explicitly relate the state with its nickname, but the connection is clear from the rhetorical structure of the articles; most assessors did find such documents to be perfectly acceptable. Consider a more complicated situation concerning the question “Where did Allen Iverson go to college?” Some articles mentioned that he was a Hoya, which sports aficionados might automatically associate with Georgetown University. However, this knowledge cannot be considered “common sense”, and hence relevancy judgments of such documents would vary from assessor to assessor.

Our preliminary efforts in developing a reusable test collection for question answering have given us valuable insights into the challenges associated with a large-scale resource-building endeavor. There is no such thing as “an obvious answer” or “universal ground truth”. It is fruitless to try to create strict rules that govern what constitutes a correct answer, because this notion varies both from person to person and from question to question (cf. Voorhees, 1999). Legitimate differences in opinion are an inescapable fact of question answering evaluation, a fact that should be addressed, not hidden. By properly managing these variations (see next section), we can build a test collection that is both high in quality and reflective of real-world user needs.

4. Building AnswerBank

Our proposed plan for building AnswerBank can be summarized by the following workflow:

Select Questions → Determine Answers → Formulate Queries → Assess Documents → Add Additional Annotations

Figure 1: Workflow of the annotation process.

Selecting Questions. We plan to select and manually annotate one thousand questions from the TREC 8, TREC 9, TREC 10, TREC 11, and TREC 12 Question Answering Tracks. Such an effort will be approximately an order of magnitude larger than our previous pilot study, allowing researchers to accurately evaluate their systems in a statistically significant manner.
Using existing TREC questions provides some continuity with presently available resources. These questions are based on search engine logs, and represent a realistic sampling of real-world natural language questions submitted by users on the Web (Voorhees, 2001). Some TREC questions, however, are less appropriate for inclusion in AnswerBank. These include questions that have too many relevant documents within the corpus, e.g., “What is the capital of the United States?” Since so many documents contain the answer, it would not be productive to manually find an exhaustive set of judgments. Also excluded from AnswerBank will be so-called “definition questions” such as “Who was Galileo?”, which generally cannot be answered by simple named entities (Voorhees, 2003; Hildebrandt et al., 2004). In general, we will aim to annotate questions that are medium to above average in difficulty, as determined by the number of systems that correctly answered those questions. Such a choice of test questions will serve to drive forward the state of the art in natural language processing technology.

Determining Answers. Answer strings from the previous TREC evaluations will serve as the starting point for building our test collection. The purpose of this step is not to make a final determination of answer correctness, because such judgments can only be made with respect to a particular document. Rather, these answers will be employed in the subsequent step to gather a candidate pool of documents for further evaluation.

Formulating Queries. After determining the range of acceptable answers to a question, actual judgments of answer string and supporting document pairs must be made manually. This will be accomplished with the assistance of a boolean keyword retrieval system. The query issued to this system will contain keywords that must be present in a document in order for it to be considered relevant. As an example, consider question 1436, “What was the name of Stonewall Jackson’s horse?”, whose answer is “Little Sorrel”. The query used to generate a candidate document pool would be “Jackson AND Little AND Sorrel”. Note that a relevant document might use a synonym for “horse” such as “steed”, and hence we would not include that keyword in the query. We will take care to formulate queries as broadly as possible to ensure that all relevant documents will be considered.
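A minimal sketch of this step appears below. The brute-force scan stands in for whatever boolean retrieval engine is actually used, and the toy corpus, document identifiers, and function names are hypothetical.

```python
def build_boolean_query(required_terms):
    """Conjoin the manually selected terms into a boolean query string,
    e.g. ["Jackson", "Little", "Sorrel"] for TREC question 1436."""
    return " AND ".join(required_terms)

def gather_candidate_pool(corpus, required_terms):
    """Return the docids of all documents containing every required term.
    `corpus` maps docid to document text; a real implementation would issue
    build_boolean_query(...) to a boolean retrieval engine instead of
    scanning the collection directly."""
    terms = [t.lower() for t in required_terms]
    return [docid for docid, text in corpus.items()
            if all(t in text.lower() for t in terms)]

# Illustrative use with a toy two-document "corpus":
corpus = {
    "DOC1": "Little Sorrel was the horse ridden by Stonewall Jackson.",
    "DOC2": "Jackson, Mississippi is the state capital.",
}
print(build_boolean_query(["Jackson", "Little", "Sorrel"]))   # -> Jackson AND Little AND Sorrel
print(gather_candidate_pool(corpus, ["Jackson", "Little", "Sorrel"]))  # -> ['DOC1']
```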
Assessing Documents. Once a candidate set of documents has been gathered, it will be manually reviewed by multiple assessors and judged to be relevant, unsupported, or irrelevant. A relevant document not only contains the correct answer, but must also mention the answer in the correct context of the natural language question. An unsupported document may mention the correct answer, but its connection to the natural language question may be unclear from the text. An irrelevant document may contain keywords from the correct answer, but does not address the natural language question. This judgment process will require us to directly confront differences in opinion regarding what constitutes an acceptable answer. Although we recognize that assessors will inevitably disagree, we hope to gain a deeper understanding of the underlying cognitive model that drives this answer evaluation process. Such an understanding might lead to heuristics that can be operationalized by computer systems. In an attempt to reduce variations in judgment, we will produce a set of general guidelines similar to those given to TREC assessors. For example, we will assume a college-educated, North American adult newspaper reader as the idealized target user. The purpose of these guidelines is to provide a common reference point, not to force humans to agree on what a good answer is.

To cope with variations in judgment, each question will be multiply annotated by different assessors. We plan to adjudicate differences in opinion by adopting a simple majority voting strategy, although we will certainly experiment with more sophisticated strategies as they become apparent. The purpose of adjudication is not to impose consistency in judgments, but rather to catch obvious errors and misinterpretations of the assessment guidelines. Inter-annotator agreement and other relevant statistical measures will be collected to quantitatively assess the annotation progress.
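The adjudication and agreement bookkeeping could be as simple as the following sketch. The majority-vote rule is the one proposed above; the pairwise agreement measure is just one plausible statistic (chance-corrected measures such as kappa are an obvious alternative), and all names are illustrative.

```python
from collections import Counter
from itertools import combinations

def adjudicate(labels):
    """Resolve one (question, document) pair by simple majority vote over the
    assessors' labels ("relevant", "unsupported", or "irrelevant");
    ties are flagged for manual adjudication."""
    counts = Counter(labels)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return "needs-adjudication"
    return top

def pairwise_agreement(labels):
    """Fraction of assessor pairs that assigned the same label."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

labels = ["relevant", "relevant", "unsupported"]
print(adjudicate(labels))                    # -> "relevant"
print(round(pairwise_agreement(labels), 2))  # -> 0.33
```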
Adding Additional Annotations. For each relevant document belonging to a question in the testset, we propose to add meta-knowledge noting the natural language processing technologies that are exercised by the particular instance of an answer. For example, in order to properly extract the answer to the question “Who killed Abraham Lincoln?” from the text fragment “John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.”, a system must have knowledge of anaphoric references (e.g., relating “he” to “John Wilkes Booth”) and paraphrases (i.e., that ending someone’s life is synonymous with killing someone). We believe that such additional annotations will help researchers categorize the difficulty of a question (Lange et al., 2004); for example, an answer passage that shares many keywords with the question can probably be considered “easy”. Although we currently do not have a classification of natural language tasks relevant to question answering, we believe that such an inventory could be an additional benefit of building AnswerBank. By examining the corpus and a large collection of questions and documents, we hope to induce an ontology of challenges that must be overcome in order to solve the question answering problem. Furthermore, the results of our work may provide an empirical distribution of the pervasiveness of each research challenge; such knowledge will enable us to develop a roadmap for question answering research. We could then compile recommendations regarding the research questions that would have the greatest impact on question answering performance.

4.1 Infrastructure Support

The construction of AnswerBank will necessitate the development of specialized annotation tools to streamline the workflow. We believe that these tools may themselves contribute to advancing our knowledge of human-computer interfaces, especially in areas pertaining to information presentation and information visualization. To support rapid assessment of document relevance, we plan to experiment with novel interfaces for presenting search results in the context of question answering systems. In fact, we have previously conducted a user study to determine the effectiveness of different answer presentation strategies for question answering systems (Lin et al., 2003a; Lin et al., 2003b). Working under a keyword-in-context paradigm (Leung and Apperley, 1994), we explored user preferences in the amount of context accompanying an answer. The study tested four different interface conditions: exact answer only (no context), the answer highlighted in the sentence in which it occurred, the answer highlighted in the paragraph in which it occurred, and the answer highlighted in the document in which it occurred. We discovered that the majority of users preferred the answer-in-paragraph interface, because it provided them with a natural discourse unit in which to situate the answer. In addition, a paragraph-sized result allows users to find answers to related questions. We plan to build on these previous results in developing the necessary infrastructure to support the creation of AnswerBank. The development of effective interfaces for presenting search results has implications beyond the annotation effort described in this proposal. Lessons learned from interface design can be applied to a wide range of information systems.

4.2 Information Retrieval Experiments

We believe that AnswerBank will enable a wide variety of question answering research that has until now been difficult to conduct without a reusable test collection. In fact, we have already performed preliminary studies of what can be accomplished with our pilot test collection. Although factoid question answering is distinct from the task of retrieving relevant documents in response to a user query (so-called ad hoc retrieval), document retrieval systems nevertheless play a central role in the question answering process. Because natural language processing techniques are relatively slow, a question answering system typically relies on traditional document retrieval techniques to first gather a set of candidate documents, thereby reducing the amount of text that must be analyzed.

In a recent study, we explored the role of document retrieval in the context of question answering. In particular, we examined different strategies for coping with morphological variation, which poses a challenge to all types of information retrieval systems. Ideally, a system should be able to retrieve documents containing closely related variants of keywords found in the query; e.g., the query term “love” should not only match the term “love” in documents, but also the terms “loving”, “loved”, and “loves”. There are, in principle, two different ways of coping with morphological variation. The most popular strategy is to apply a stemming algorithm at indexing time and store only the resulting word stems (e.g., “love” → “lov”); this naturally requires user queries to be analyzed in the same way. The effects of this morphological normalization process have been well studied in the context of document retrieval, but it is still unclear whether or not stemming is effective (Harman, 1991; Krovetz, 1993; Hull, 1996; Monz, 2003). An alternative strategy is to index the original word forms as they are and to expand query terms with their morphological variants at retrieval time (e.g., “love” → “love”, “loving”, “loved”, “loves”). Performing query expansion at retrieval time, however, requires support for structured querying, a capability that may not be present in all document retrieval systems.
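To make the two strategies concrete, the following sketch contrasts them on the “love” example above. The toy suffix-stripping stemmer and the hand-listed variants merely stand in for whatever stemming algorithm and morphological lexicon a real system would use.

```python
def crude_stem(word):
    """Toy suffix-stripping stemmer standing in for, e.g., a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Strategy 1: stem at indexing time; the query term is stemmed the same way,
# so "love", "loving", "loved", and "loves" all collapse to the stem "lov".
documents = ["loving", "loved", "loves", "love", "lover"]
index_terms = {crude_stem(w) for w in documents}
query_term = crude_stem("love")
print(query_term in index_terms)  # True: match via the shared stem

# Strategy 2: index the surface forms and expand the query at retrieval time
# with morphological variants (listed by hand here; a real system would draw
# them from a morphological lexicon and issue a structured OR query).
surface_index = set(documents)
expanded_query = {"love", "loving", "loved", "loves"}
print(bool(expanded_query & surface_index))  # True: match via any listed variant
```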
Our study reveals that the difference in recall between the stemming and morphological query expansion approaches cannot be detected when evaluating results with the NIST-supplied set of relevant documents. That is, both stemming and morphological query expansion were found to perform equally well, slightly above the baseline of no stemming. However, evaluating the exact same techniques with our pilot test collection produced a different picture. We found that morphological query expansion gives rise to higher recall than stemming (which in turn has higher recall than the baseline). We believe that these results highlight the danger of using the NIST-supplied relevant documents for research experiments. Since these documents were gathered by pooling the results of participants, they reflect the types of technologies employed by those systems. Since most TREC entries rely on keyword-based techniques with little linguistic sophistication, the resulting pool of documents cannot properly assess the performance of more advanced query generation techniques such as those involved in our morphological query expansion experiments. Not only have we outlined a project to fill a void in question answering resources, but we have also demonstrated concretely how such a test collection can be used to further research in document retrieval technology. In parallel with the annotation effort in building AnswerBank, we will also engage in a number of exploratory studies similar to the one described above, to test our emerging test collection and ensure its usefulness.

4.3 Resource Maintenance

AnswerBank is a static resource that, upon completion, will be publicly released to the research community. No further maintenance of the project will be required, although additional funding may be necessary to further expand the scope and coverage of the collection beyond what is outlined in this proposal. If the infrastructure tools and software developed during the annotation effort are found to be useful beyond this project, they will be released to the community under an open source license.

5. Impact

The availability of language resources and reusable test collections has been a major driving force in the development of human language technologies and natural language applications. From a resource point of view, they serve as a reference point of performance that concrete applications can strive for. From an evaluation point of view, the existence of reusable test collections allows researchers to conduct experiments with rapid turn-around, allowing faster exploration of the solution space and leading to accelerated improvements in performance. The value of resource-building projects is well documented in many areas of computational linguistics such as statistical parsing, machine translation, and automatic summarization. In these disciplines, the development of resources, meaningful evaluation metrics, and standardized test collections is directly responsible for dramatic performance improvements in short amounts of time. It is clear that appropriately designed, community-wide resources can serve as high-impact enablers that allow entire fields to accomplish what was previously believed to be impossible. We have similarly high hopes for AnswerBank, which will fill a gap in available question answering resources and pave the way for rapid advances in the state of the art.

In the process of building AnswerBank, we will have to tackle the thorny question of “What makes a piece of text a good answer to a question?” As we have shown, relevance assessment is a complex process that is both person- and question-specific. We believe that AnswerBank may help reveal important insights about information-seeking behavior and its underlying cognitive processes.
A better understanding of these mental models will surely lead to more useful information systems that help users cope with today’s often overwhelming information environment.

References

Bilotti, Matthew W., “Query Expansion Techniques for Question Answering”, Master’s thesis, Massachusetts Institute of Technology, 2004.

Bilotti, Matthew W., Boris Katz, and Jimmy Lin, “What Works Better for Question Answering: Stemming or Morphological Query Expansion?”, Proceedings of the Information Retrieval for Question Answering (IR4QA) Workshop at SIGIR 2004, 2004.

Charniak, Eugene, “Immediate Head Parsing for Language Models”, Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), 2001.

Collins, Michael, “Three Generative Lexicalized Models for Statistical Parsing”, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL 1997), 1997.

Echihabi, Abdessamad and Daniel Marcu, “A Noisy-Channel Approach to Question Answering”, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), 2003.

Harman, Donna, “How Effective is Suffixing?”, Journal of the American Society for Information Science, 42(1), 7–15, 1991.

Hildebrandt, Wesley, Boris Katz, and Jimmy Lin, “Answering Definition Questions with Multiple Knowledge Sources”, Proceedings of the 2004 Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL 2004), 2004.

Hull, David A., “Stemming Algorithms: A Case Study for Detailed Evaluation”, Journal of the American Society for Information Science, 47(1), 70–84, 1996.

Ittycheriah, Abraham, Martin Franz, Wei-Jing Zhu, and Adwait Ratnaparkhi, “IBM’s Statistical Question Answering System”, Proceedings of the Ninth Text REtrieval Conference (TREC-9), 2000.

Krovetz, Robert, “Viewing Morphology as an Inference Process”, Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), 1993.

Lange, Rense, Juan Moran, Warren R. Greiff, and Lisa Ferro, “A Probabilistic Rasch Analysis of Question Answering Evaluations”, Proceedings of the 2004 Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL 2004), 2004.

Leung, Ying K. and Mark D. Apperley, “A Review and Taxonomy of Distortion-Oriented Presentation Techniques”, ACM Transactions on Computer-Human Interaction, 1(2):126–160, 1994.

Lin, Chin-Yew and Eduard Hovy, “Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics”, Proceedings of the 2003 Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL 2003), 2003.

Lin, Jimmy, Dennis Quan, Vineet Sinha, Karun Bakshi, David Huynh, Boris Katz, and David R. Karger, “The Role of Context in Question Answering Systems”, Proceedings of the 2003 Conference on Human Factors in Computing Systems (CHI 2003), 2003a.

Lin, Jimmy, Dennis Quan, Vineet Sinha, Karun Bakshi, David Huynh, Boris Katz, and David R. Karger, “What Makes a Good Answer? The Role of Context in Question Answering”, Proceedings of the Ninth IFIP TC13 International Conference on Human-Computer Interaction (INTERACT 2003), 2003b.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz, “Building a Large Annotated Corpus of English: The Penn Treebank”, Computational Linguistics, 19(2), 313–330, 1994.

Monz, Christof, “From Document Retrieval to Question Answering”, Ph.D. dissertation, Institute for Logic, Language, and Computation, University of Amsterdam, 2003.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu, “BLEU: a Method for Automatic Evaluation of Machine Translation”, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 2002.

Radev, Dragomir, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal, “Probabilistic Question Answering on the Web”, Proceedings of the Eleventh International World Wide Web Conference (WWW2002), 2002.

Voorhees, Ellen M. and Dawn M. Tice, “The TREC-8 Question Answering Track Evaluation”, Proceedings of the Eighth Text REtrieval Conference (TREC-8), 1999.

Voorhees, Ellen M. and Dawn M. Tice, “Overview of the TREC-9 Question Answering Track”, Proceedings of the Ninth Text REtrieval Conference (TREC-9), 2000.

Voorhees, Ellen M., “Overview of the TREC 2001 Question Answering Track”, Proceedings of the Tenth Text REtrieval Conference (TREC 2001), 2001.

Voorhees, Ellen M., “Overview of the TREC 2002 Question Answering Track”, Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), 2002.

Voorhees, Ellen M., “Overview of the TREC 2003 Question Answering Track”, Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), 2003.

Zobel, Justin, “How Reliable Are the Results of Large-Scale Information Retrieval Experiments?”, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998), 1998.