Retrieving Answers from Frequently Asked Questions Pages on the Web

Valentin Jijkoun and Maarten de Rijke
jijkoun@science.uva.nl, mdr@science.uva.nl
Informatics Institute, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

ABSTRACT

We address the task of answering natural language questions by using the large number of Frequently Asked Questions (FAQ) pages available on the web. The task involves three steps: (1) fetching FAQ pages from the web; (2) automatic extraction of question/answer (Q/A) pairs from the collected pages; and (3) answering users’ questions by retrieving appropriate Q/A pairs. We discuss our solutions for each of the three tasks, and give detailed evaluation results on a collected corpus of about 3.6Gb of text data (293K pages, 2.8M Q/A pairs), with real users’ questions sampled from a web search engine log. Specifically, we propose simple but effective methods for Q/A extraction and investigate task-specific retrieval models for answering questions. Our best model finds answers for 36% of the test questions in the top 20 results. Our overall conclusion is that FAQ pages on the web provide an excellent resource for addressing real users’ information needs in a highly focused manner.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic processing, Indexing methods; H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Measurement, Performance, Experimentation

Keywords
Question answering, FAQ retrieval, Questions beyond factoids

1. INTRODUCTION

Question Answering (QA) is one of several recent attempts to provide highly focused access to textual information.
The task has received a great deal of attention in recent years, especially since the launch of the QA track at TREC in 1999. While significant progress has been made in technology for answering general factoids (e.g., How fast does a cheetah run? or What is the German population?), there is a real need to go beyond such factoids. Our long term goal is to develop QA technology for the web, catering for ad hoc questions posed by web users. Factoids are frequent in the log files of commercial search engines, but other types are at least as frequent, if not more so: procedural questions, and requests for explanations or reasons. Such questions we refer to as “questions beyond factoids” [29]. What is the appropriate format for answering questions beyond factoids? This is not an easy matter. As we argue in Section 2, even for factoids it may be unclear what the appropriate answer format is in an ad hoc QA setting, i.e., where a system is to respond to general open domain questions without having access to information about the user asking the question. The problem becomes even more difficult when we try to find answers to questions beyond factoids. In this paper we “solve” the answer format issue by leaving it to humans. Our system responds to questions (factoid or beyond) by retrieving answers from frequently asked questions (FAQ) pages.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’05, October 31–November 5, 2005, Bremen, Germany. Copyright 2005 ACM 1-59593-140-6/05/0010 ...$5.00.
FAQ pages constitute a rich body of knowledge that combines focused questions (mostly beyond factoids) with ready-made, human-generated answers. Our task, then, is to return, in response to an incoming question Q′, a ranked list of pairs (Q, A) extracted from FAQs, where (ideally) the top-ranking pair (Q, A) provides an adequate answer A to the user’s question Q′, and other result pairs (Q′′, A′′) provide additional material information pertaining to Q′. To us, answering questions beyond factoids is a retrieval task, where systems should return a ranked list of (multiple) focused responses in context. This view has clear methodological advantages over the single-shot isolated answer format required from systems taking part in the TREC and CLEF QA tasks. In order to retrieve answers from FAQ pages on the web, we face three main tasks: (1) fetching FAQ pages from the web; (2) automatically extracting question/answer (Q/A) pairs from the fetched pages; and (3) addressing users’ questions by retrieving appropriate Q/A pairs. In this paper we detail our solutions for each of those three steps, with a focus on the second and, especially, the third step. We provide evaluation results for all steps. (All the data used for evaluation purposes in this paper is available at http://ilps.science.uva.nl/Resources.) To the best of our knowledge, this is the first description of an open-domain QA system that collects and uses FAQ pages on the web to find answers to user questions. The evaluation of the system on real user questions, randomly sampled from search engine query logs, shows that the system is capable of answering 36% of the questions (in the top 20 results) and provides material information for 56% of the questions. This justifies our claim that FAQ pages on the web constitute an excellent resource for addressing real users’ information needs in a highly focused manner. The remainder of the paper is organized as follows.
In Section 2 we provide extensive motivation for our proposal to address ad hoc QA as retrieval from FAQs. Sections 3, 4, and 5 are devoted to locating FAQs on the web, extracting Q/A pairs from the found pages, and retrieval against the resulting collection, respectively. In Section 6 we discuss related work, and we conclude in Section 7.

2. BACKGROUND

Our long term aim is to develop question answering technology for web users. By inspecting log files of commercial search engines we can find out what types of questions web users ask, even today. In a sample of 10M queries collected from the MetaCrawler engine (http://www.metasearch.com) in the period September–December 2004 we found that at least 2% of the queries were actual questions. While about a quarter of the questions were factoids, another quarter consisted of questions asking for definitions or explanations, and a third of the questions were procedural in nature. At the TREC QA track the importance of questions beyond factoids has been recognized through the introduction of definition questions in 2003 and of so-called “other” questions (which ask for important information about a topic at hand that the user does not know enough about to ask) in 2004 and again in 2005. There are several aspects that make retrieval against Q/A pairs mined from FAQ pages on the web a viable approach to QA. First, factoid questions typically require simple entities or phrases as answers; moving beyond factoids, the TREC QA track required a list of snippets containing important bits of information as the answer to definition and “other” questions; the decisions on answerhood, however, turned out to be problematic [30]. For even more complex questions beyond factoids, we believe that requiring systems to precisely delineate an appropriate answer is beyond the current state of the art in language technology.
This suggests that returning human generated answers in response to questions beyond factoids is a sensible and realistic solution. Specifically, in our FAQ retrieval scenario, we attempt to locate questions Q that somehow pertain to the user’s input question Q′ and return the full answer A that comes with Q. Note, by the way, that instead of being isolated text snippets, answers on FAQ pages often come with explanations and/or justifications, while additional Q/A pairs on the page where an answer was found provide users with further context. This is an important feature, especially given that users tend to prefer answers in context [21]. There are further issues with QA that can be addressed by implementing QA as retrieval from FAQs. Recall that the “interaction model” adopted by much current research in QA technology is the following: in response to a question, return a single exact answer. In an ad hoc web setting, with no information about the user and her background or intentions (other than what can be picked up through the user’s browsing agent), this single shot format is inappropriate. An example will help explain this. Consider the question Where is the Rijksmuseum located?. Without explicit knowledge about the questioner’s context or background, the system has to select any of the following as the single best answer:

A1: In a building designed by the architect Cuijpers.
A2: Across Museumplein from the Concertgebouw.
A3: You can take Tram 5 to get there.
A4: Jan Luijkenstraat in Amsterdam.
A5: The Rijksmuseum van Oudheden (RMO) is the national museum of antiquities at Leiden.

Each of these is a correct answer, and each contains some information that none of the others contains. This suggests that the best we can do is present all of these answers; following standard practice in IR, this is most naturally done in the form of a ranked list.
By returning ranked lists of Q/A pairs in response to an incoming question we are able to implement an important lesson from the “other” questions studied at the TREC QA track. Even if a text snippet does not provide a full answer to a question, it might still supply material information, pertinent to the question at hand, that might help the questioner achieve her information seeking goals [10]. In this manner, our ranked list output format helps facilitate serendipity.2

3. FETCHING FAQ PAGES FROM THE WEB

Our first step is to collect web pages containing lists of Q/A pairs. Since only a relatively small portion of the pages on the web are actually FAQ pages, it would be quite resource-inefficient to collect such pages using general-purpose crawling. Instead, we use a web search engine (Google) to locate pages that are likely to contain FAQs, specifically, pages that have the string “faq” in their URL or title. While we obviously miss many potentially useful question/answer pages, the number of found pages is still very large: Google reports 8,900,000 hits for the query “intitle:faq” and 16,700,000 hits for “inurl:faq.” Ideally, we would simply use these two queries to retrieve a large collection of FAQ pages. Unfortunately, most web search engines, including Google, provide at most 1000 hits for any given query. To obtain more (ideally, all) FAQ pages, various query expansion techniques can be used. We took a simple but sufficiently effective approach. We used the ontology of the Open Directory Project (http://dmoz.org/) to extract a list of reasonably general concepts: we simply extracted all concepts having between 5 and 20 children in the ontology. In the experiments below we used 3,847 out of a total of 9,747 extracted concepts to generate expanded retrieval queries (e.g., “inurl:faq Party Supplies”), and for each of the resulting queries the top 900 documents were retrieved using Google’s interface.
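The query-expansion step above can be sketched as follows. This is an illustrative reconstruction, not the authors’ code: the concept list shown here is made up (the paper draws its roughly 3,847 concepts from Open Directory nodes with 5–20 children).

```python
# Illustrative sketch of the query-expansion step used to work around
# the 1000-hit-per-query limit: each general concept yields a separate
# restricted query, so different queries retrieve different FAQ pages.

def expand_queries(concepts, base="inurl:faq"):
    """Combine the FAQ-restricting base query with each concept."""
    return ["%s %s" % (base, concept) for concept in concepts]

# Hypothetical concept list; the paper extracts these from dmoz.org.
concepts = ["Party Supplies", "Digital Cameras", "Home Brewing"]
for query in expand_queries(concepts):
    print(query)  # e.g., inurl:faq Party Supplies
```

Each expanded query is then submitted to the search engine, and the union of the result lists (after URL deduplication) forms the fetch list.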
In total, we used 118,742 calls to the Google API, which gave us a list of 404,870 unique URLs (on average 3.4 unique URLs per call to the Google API, or 105.2 unique URLs per expanded query). For the experiments reported below we downloaded pages using 293,031 of these URLs.

[Footnote 2] In our retrieval experiments in Section 5 we use a three point scale to assess the appropriateness of a Q/A pair in response to a user question: “adequate” (as an answer), “material” (information), and “unsatisfactory.”

[Figure 1: Performance of the FAQ fetching module (unique URLs collected versus number of Google interface calls; actual usage compared with ideal search engine usage).]

Since we do not attempt to partition the search space into non-overlapping segments, result lists for later queries increasingly contain URLs that have already been listed in the results for earlier queries: in 1,201,170 results from Google we found 404,870 (33%) unique URLs. Figure 1 shows how the number of unique URLs grows with the number of calls to the Google API (one call returns at most 10 results). Whereas for the experiments described below our simple fetching method is appropriate, it is not scalable for the creation of a much bigger corpus. In that case, more sophisticated query expansion methods [9] can be used to optimize the utilization of the web search engine. We did not have the resources to perform a large-scale evaluation of the accuracy of the fetching method. However, in a random sample of 109 pages from the 293,031 pages collected using the method described above, 83 pages (76%) turned out to be true FAQ pages. (See Section 4.2 for more details on this evaluation.)

4. EXTRACTING Q/A PAIRS

Once web pages potentially containing lists of Q/A pairs have been collected, the next task is to automatically extract Q/A pairs.
While FAQ files abound on the web,3 the main challenge is that the details of the Q/A mark-up vary wildly between FAQ files. In HTML, questions and answers can be simply marked as separate paragraphs, or questions can be made to “stand out” using italics, boldface, headers, or fonts of different size and color. Various types of indentation (tables, lists, text alignment) and specific line prefixes (e.g., Question:, Subject:, Q:, A:) are used. Most FAQ pages share two important properties: the annotation on a single page is usually consistent (all questions look “similar”), and pages typically contain several Q/A pairs.

[Footnote 3] There is even a markup language, QAML, to represent FAQs: http://xml.ascc.net/en/utf-8/qaml-index.html.

4.1 Extraction Methods

We propose two methods for extracting Q/A pairs: one is heuristics-based, and the second bootstraps the heuristic method using memory-based learning. The second method is supervised in that it builds on the output of our handcrafted heuristic method; it is unsupervised in that it uses no further manual annotation. Both methods detect questions on potential FAQ pages. Once the questions have been identified, we use rules to locate corresponding answers.

Heuristic Extraction. We first use the HTML mark-up of a page to split it into a sequence of text blocks that would be visually separated if the page were viewed in a web browser. We parse the HTML code and consider tags such as table, br, p, li, div, and h1–h6 as always starting or ending a text block. Since many “real world” web pages contain non-well-formed HTML (non-nested tags, missing closing tags, etc.), we use simple heuristics emulating the behavior of a browser. Each text block is described by a set of features such as the first and second words of the block, templates of these words (e.g., “digits followed by a semicolon”), the first and last characters of the block, its length, and all HTML tags and tag attributes affecting the block. Most features are obvious, and some have been used for detecting questions [24]. The actual number of features for a page is typically between 25 and 45.
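The block-splitting step can be illustrated with a small sketch (our own reconstruction under simplified assumptions, not the authors’ implementation): Python’s tolerant html.parser segments a page at block-level tags, which also copes with much of the malformed HTML found on real FAQ pages.

```python
from html.parser import HTMLParser

# Tags treated as always starting or ending a text block (subset of the
# tags mentioned in the paper).
BLOCK_TAGS = {"p", "br", "li", "div", "table", "tr", "td",
              "h1", "h2", "h3", "h4", "h5", "h6"}

class BlockSplitter(HTMLParser):
    """Split an HTML page into visually separated text blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks, self._buf = [], []

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser._flush()
    return parser.blocks
```

A full system would also record, per block, the mark-up features described above (first/last words, affecting tags, etc.); this sketch keeps only the text.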
Once the content of a page is split into a sequence of blocks, we use simple heuristics to select questions: a block should contain at most 200 characters, not be a hyperlink, and contain a question mark or a capitalized question word. This extraction method is called Heuristics below.

Machine Learning Extraction. Naturally, the Heuristics method is not perfect, in terms of either precision or recall. We extend it by exploiting another property common to many FAQ pages: there are typically several true questions on a page, many of which are consistently annotated. For each FAQ file we assume a consistent mark-up (e.g., all questions are section headers). We make no assumptions about the details of this common annotation, but do assume that there are mark-up features that are consistent among questions on the page and that distinguish questions from other text blocks (e.g., answers). We use TiMBL, a memory-based learner [8], to bootstrap from the questions selected with the Heuristics method. The learner detects which block features are consistent and characteristic for the questions on a given web page. We use the learner as a re-classifier: its training base consists of all text blocks from a page, classified into {question, not-question} using our heuristics. The learner calculates feature weights and then re-classifies the same sequence of blocks of the page in leave-one-out mode (i.e., ignoring the original decision of our heuristic method). We use k-nearest-neighbors classification with inverse distance voting, with the number of neighbors equal to the number of questions found with the heuristics. This is the ML method.

Identifying Answers. Once the blocks corresponding to questions have been determined, we use a simple but reliable rule to extract the corresponding answers: up to three consecutive non-question text blocks immediately following a question in the HTML file constitute the answer.
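The question-selection heuristics and the answer rule can be sketched as follows. This is an illustrative reconstruction: the exact question-word list and the treatment of capitalization are our assumptions, not specified in the paper, and blocks are simplified to a text string plus a hyperlink flag.

```python
# Capitalized question words; the exact list is our assumption.
CAP_QUESTION_WORDS = {"What", "How", "Why", "Where", "When", "Who", "Which"}

def is_question(block):
    """Heuristics from the paper: at most 200 characters, not a
    hyperlink, and containing a question mark or a capitalized
    question word."""
    if len(block["text"]) > 200 or block.get("hyperlink", False):
        return False
    return ("?" in block["text"]
            or any(w in CAP_QUESTION_WORDS for w in block["text"].split()))

def extract_qa_pairs(blocks, max_answer_blocks=3):
    """Pair every detected question with up to three consecutive
    non-question blocks that immediately follow it."""
    pairs, i = [], 0
    while i < len(blocks):
        if is_question(blocks[i]):
            answer, j = [], i + 1
            while (j < len(blocks) and len(answer) < max_answer_blocks
                   and not is_question(blocks[j])):
                answer.append(blocks[j]["text"])
                j += 1
            if answer:
                pairs.append((blocks[i]["text"], " ".join(answer)))
            i = j
        else:
            i += 1
    return pairs
```

For example, a question block followed by two non-question blocks yields one pair whose answer is the concatenation of those two blocks.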
4.2 Evaluation of the Extraction Methods

To assess our extraction methods, we created our own evaluation corpus, which consists of 109 pages taken randomly from the results of the fetching step (Section 3). The descriptive statistics of the corpus can therefore be extrapolated to large collections retrieved from the web in this way. From the 109 pages, annotators manually extracted all Q/A pairs. Annotators were asked to extract as questions bits of text that are easily interpreted as questions in the context of the page and that have explicit and clearly identifiable answers. The annotators were instructed not to include question numbers or identifiers and various prefixes (e.g., Q:, A:, Question:, Subject:). Moreover, to make the annotation task easier and faster, they were not requested to annotate complete answers, but only initial portions (the first one or two sentences or lines) that were sufficient to locate any given answer on a page. Questions (without prefixes or identifiers) had to be extracted in full. The table below gives some statistics for our evaluation corpus.

Q/A extraction evaluation corpus
  Total number of pages               109
  True FAQ pages                      83 (76%)
  Total number of Q/A pairs           1418
  Avg. number of Q/A pairs per page   13

We evaluated our Q/A extraction methods against the evaluation corpus using conventional precision and recall measures. The table below gives precision and recall scores for the two methods.

  Extraction method   Precision   Recall
  Heuristics          0.94        0.93
  ML                  0.92        0.94

The performance of the heuristic Q/A pair extraction method is striking, given its simplicity. With precision and recall around 0.93, the system can be used to create a large, clean collection of Q/A pairs. Bootstrapping shows the expected behaviour: the ML method finds other questions, not identified with the simple heuristics, but having similar mark-up features to the questions already found.
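The precision and recall figures above can be computed as in this minimal sketch (our illustration; here a Q/A pair simply matches the gold standard when the extracted strings are identical):

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted Q/A pairs against the
    manually annotated gold set, using exact-match pair identity."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # correctly extracted pairs
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```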
Although the differences are fairly small, using machine learning to bootstrap the heuristic method allows us to somewhat improve recall, at the cost of a minor drop in precision. For our retrieval experiments (Section 5), we took the output of the fetching module (Section 3) and extracted Q/A pairs using the Heuristics method, as it shows slightly better precision. In total, 2,824,179 Q/A pairs were extracted.

5. RETRIEVING Q/A PAIRS

We now describe our approach to retrieving Q/A pairs from the collection built in Section 4. We discuss various retrieval models, and then describe experiments aimed at assessing their effectiveness.

5.1 Retrieval Models

Our task is to return a ranked list of Q/A pairs in response to a user’s question. The answer parts of one or more of the Q/A pairs returned should provide an adequate answer to the user’s question; in addition, we want to return Q/A pairs that provide additional important information about the topic at hand. We treat the task as a fielded search task, where the user’s question is treated as a query, and the items to be returned (Q/A pairs) have multiple fields: a question part, an answer part, and the text of the document from which the pair was extracted. There are several aspects of any FAQ collection that make the retrieval task different from ad hoc retrieval:

• Q/A pairs are often “incomplete,” in that they implicitly assume the topic of the FAQ or anaphorically refer to entities mentioned elsewhere on the FAQ page (e.g., “Who produced the program?”).
• Following our user model (discussed in Section 2), we focus on initial precision: the user prefers to find an adequate answer in the top 10–20 ranked Q/A pairs.
• Question words and phrases (such as “Who”, “How do I”, etc.), typically consisting of stopwords, might be valuable indicators of the adequacy of a Q/A pair.

The retrieval models we investigate exploit these aspects.

Indexing.
We use the implementation of the vector space model in Lucene [1] as the core of our retrieval system. For each Q/A pair we index the following fields:

• question text
• question text without stopwords
• answer text without stopwords
• title of its FAQ page without stopwords
• full text of its FAQ page without stopwords

We build both stemmed and non-stemmed indices. Question texts are indexed both with and without stopwords, whereas stopwords are removed from all other fields. Given a user question, the retrieval status value of a Q/A pair is calculated as a linear combination of similarity scores between the user question and the fields of the Q/A pair. The models we detail below correspond to different weights in the linear combination.

Models. As a baseline (B), we consider a simple model that calculates a linear combination of vector space similarities between the user question and fields of the Q/A pair: question text (with weight 0.5), answer text (weight 0.2), and FAQ page title (weight 0.3); no stemming is applied, but stopwords are removed. The model B+D adds to the Q/A pair score the vector space similarity between the user question and the text of the entire FAQ page. In essence, we smooth the Q/A pair model with a document model. The model B+D/2 is similar to B+D, but the document similarity is added with weight 0.5 rather than 1, so as to give the background model less weight. The model B+D/2+S is similar to B+D/2, but all similarities are calculated after stemming (we use the English Porter stemmer [26]). The model B+D/2+S+Q adds a score for the similarity between the user question and the question from the Q/A pair, calculated without removing stopwords and without stemming. The idea is to boost exact matches of questions, taking into account question words (such as “what”, “how”, etc.) that are typically treated as stopwords.
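A toy version of this fielded scoring scheme can be sketched as follows. This is our illustration only: the actual system relies on Lucene’s tf-idf weighted vector space model, whereas the sketch uses raw term counts; the field weights are those of the baseline model B.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words (raw term counts;
    a real system would use tf-idf weighting as in Lucene)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Field weights of the baseline model B: question 0.5, answer 0.2,
# FAQ page title 0.3.
WEIGHTS = {"question": 0.5, "answer": 0.2, "title": 0.3}

def score(query_terms, qa_pair, weights=WEIGHTS):
    """Retrieval status value of a Q/A pair: a weighted sum of
    per-field similarities with the user question."""
    return sum(w * cosine(query_terms, qa_pair[field])
               for field, w in weights.items())
```

The smoothed models (B+D, B+D/2, ...) simply extend the weight dictionary with a document-text field at weight 1 or 0.5.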
The model B+D/2+S+Q+PH is similar to B+D/2+S+Q, but the similarities are calculated using 1-, 2- and 3-word n-grams (without stopwords) from the user question, rather than just single words. The intuition here is to take possible multi-word expressions (e.g., “sign language” or “super model”) in the user’s question into account. Recent work on topic distillation suggests that this may have a positive impact on early precision [25].

5.2 Experimental Setting

We experimented with a collection of 2,824,179 Q/A pairs, created as described in Sections 3 and 4; the collection adds up to 3.6Gb of raw text from the FAQ pages. One of our aims was to see how well questions asked by real users on the web can be answered using FAQ files. To this end, we obtained samples from query logs of the MetaCrawler search engine (http://www.metacrawler.com) for September–December 2004, and extracted 44,783 queries likely to be questions: we simply selected queries that contained at least one question word (“what”, “how”, “where”, etc.). For our retrieval experiments, we randomly selected 100 user questions from this sample. To facilitate our error analysis of the retrieval system, we manually classified the questions into 8 categories; Table 1 gives statistics and examples. The categories are the same as the question types used by the FAQ Finder system [22], except that we merged its time, location, entity, etc. categories into a single factoid category. (Note that the classification of user questions was not used during retrieval.)

Table 1: Manual classification of 100 user questions.
  Type          #Q  Example
  procedural    38  how to cook a ham
  factoid       17  what did the caribs eat
  description   13  who is victoria gott
  explanation   10  info on why people do good deeds
  non-question  10  what you did was rude e-cards
  definition     8  what is cpi trotting
  direction      2  where can i find physician employment
  other          2  what present to get for my boyfriend
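Returning to the B+D/2+S+Q+PH model described above, the query n-grams can be built roughly as follows. This is one possible reading of “n-grams without stopwords” (we skip n-grams made up entirely of stopwords), and the stopword list shown is illustrative, not the one used in the paper.

```python
def ngrams(tokens, max_n=3,
           stopwords=frozenset({"the", "a", "is", "what", "how"})):
    """Build the 1-, 2- and 3-word n-grams of a tokenized question,
    skipping n-grams that consist entirely of stopwords."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not all(t in stopwords for t in gram):
                grams.append(" ".join(gram))
    return grams
```

Multi-word units such as "sign language" then participate in the similarity computation as single query terms.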
For each retrieval model and each user question we retrieved a ranked list of the top 20 Q/A pairs from the collection. In line with our user modeling concerns in Section 2, each combination of user question and retrieved Q/A pair was manually assessed on a three point scale:

• adequate (2): the Q/A pair contains an answer to the user question and provides sufficient context to establish the “answerhood”;
• material (1): the Q/A pair does not exactly answer the user question, but provides important information pertaining to the user’s information need; and
• unsatisfactory (0): the Q/A pair does not address the user’s information need.

Assessors were asked to “back-generate” an actual information need from a user question before starting to make assessments. Adequacy assessments had to be based only on the text of the Q/A pair and the title and URL of the FAQ page. Table 2 gives examples of the assessors’ decisions. In total, 5645 assessment decisions were made, of which 131 were adequate and 173 material. Our main evaluation measure is success rate: the number of questions with at least one appropriate response in the top n results (where n equals 10 or 20). More precisely, we calculated S1,2@n (success rate with adequate or material Q/A pairs in the top n) and S2@n (an adequate Q/A pair in the top n). These evaluation measures closely correspond to the user model we are interested in.

5.3 Results

We report on three groups of experiments, all aimed at understanding the effectiveness of our fielded search approach to FAQ retrieval. Table 3 presents the scores.

Table 3: Evaluation of FAQ retrieval approaches on 100 user questions. Three groups of models (explained in Section 5.3); the highest score in each column is marked with *.
  Model           S1,2@20  S2@20  S1,2@10  S2@10
  B               48       28     44       25
  B+D             49       31     45       26
  B+D/2           49       28     45       27
  B+D/2+S         47       30     41       26
  B+D/2+Q         56*      36*    50       29*
  B+D/2+S+Q       54       34     45       28
  B+D/2+S+Q+PH    50       34     44       27
  B+Q             56*      35     50       29*
  B+D+Q           55       35     51*      29*

In the first group in the table (rows 2–4) we compare the baseline model against smoothing with the document model at different weights for the document model. In the second group (rows 5–8) we take one of the smoothed models from the first group, and investigate the impact of further smoothing, normalization, and phrasal techniques. Finally, in the third group (rows 9–10) we compare the contribution of smoothing with the question model (B+Q) vs. smoothing with a document model (B+D and B+D+Q). The best performing models provide adequate answers in the top 10 for 29% of the user questions, and material information for 50% of the questions (35% and 56% at top 20, respectively). This is a surprisingly high performance, given the limited size of our FAQ collection, the real-world and open-domain nature of the evaluation questions, and the simplicity of the retrieval models. Clearly, the success rates at 10 and at 20 are highly correlated, especially where the differences between the scores are substantial. In particular, three of the models (smoothing with the question model and, optionally, the document model, i.e., B+Q, B+D+Q, and B+D/2+Q) behave very similarly and outperform the others on all four measures.

5.4 Discussion and Error Analysis

Interestingly, the best performing models use matches on questions without removing stopwords. As mentioned above, question stopwords (“how”, “why”, “what”, etc.) are often good indicators of the question type, which has been shown to be useful in FAQ retrieval [22]. Even without explicit identification of the types of user questions, we are able to exploit this information by simply keeping question stopwords. The evaluation results indicate that stemming does not help in retrieving Q/A pairs. Stemming is generally used to improve recall, but it does not seem to be helpful for the success rate, the main measure we are interested in.
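The success rate reported in Table 3, together with the reciprocal rank used in the stemming analysis (the inverse of the rank of the first adequate or material Q/A pair), can be sketched as follows; judgments are coded 0/1/2 as in our assessment scale, and this is an illustrative reconstruction rather than the evaluation scripts themselves.

```python
def success_rate(assessed_runs, n=20, min_level=1):
    """Number of questions with at least one result judged at or
    above min_level among the top n returned Q/A pairs.
    min_level=1 gives S1,2@n (adequate or material);
    min_level=2 gives S2@n (adequate only)."""
    return sum(1 for judgments in assessed_runs
               if any(j >= min_level for j in judgments[:n]))

def reciprocal_rank(judgments, n=20):
    """Inverse of the rank of the first adequate or material Q/A
    pair in the top n; 0 if no appropriate pair is returned."""
    for rank, j in enumerate(judgments[:n], start=1):
        if j >= 1:
            return 1.0 / rank
    return 0.0
```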
Interestingly, the FAQ Finder system [5] used stemming for retrieving Q/A pairs, which did make sense given the small size of its FAQ collection. The size and redundancy of our corpus may play an important role: user questions often have several adequate and material Q/A pairs on different FAQ pages, which increases the chances of finding these Q/A pairs even without stemming. In Figure 2 we show the effect of stemming on the reciprocal rank (the inverse of the rank of the first adequate or material Q/A pair) for the evaluation questions. We compare one of the best runs (B+D/2+Q) and its stemmed version (B+D/2+S+Q). Although stemming does somewhat improve the ranking of Q/A pairs for some questions (right-hand side of Figure 2), for many questions it moves appropriate Q/A pairs down the ranking. Due to the restricted size of our question test set, we have not yet been able to determine whether the usefulness of stemming correlates with types of user questions.

Table 2: Examples of adequacy assessments.

question: how to beat traffic radar (type: procedural)
  adequate   Q: What is Radar Detection?
             A: Modern traffic radar guns can clock vehicles in front of or behind the patrol car. For this reason, it’s important that your detector is able to capture radar behind your vehicle. . .
  material   Q: Are radar detectors legal?
             A: They are legal to use in New Zealand and have been for over 26 years. . .
  unsatisf.  Q: How does the radar work?
             A: NEXRAD (Next Generation Radar) obtains weather information (precipitation and wind) based upon returned energy. . .

question: what are internet browsers (type: definition)
  adequate   Q: What browsers are there available?
             A: There are numerous browsers available to surf the WWW. . .
  adequate   Q: What is a browser?
             A: To take a look at web pages you will need what is called a browser. . .
  material   Q: What Internet browsers are supported?
             A: This site is best viewed with any w3c.org compliant browser like Mozilla, Firefox, Camino (Mac) and Safari (Mac). However, we have yet to test our pages with Opera, Konqueror, OmniWeb. . .
  unsatisf.  Q: Why would someone use a browser other than Netscape or Internet Explorer?
             A: See the reasons listed above under “Why can’t we just ask people to upgrade their browsers or switch browsers?”. . .

Surprisingly, phrasal search has a clear negative effect on the success rates, although it proved useful for improving early precision in other settings such as topic distillation in web retrieval [25]. We see two reasons for these findings. One is that the questions are much longer (6.9 words on average) than the queries on which phrasal search was found to be most effective in the topic distillation setting (2.5 words on average at TREC 2003). Another reason is the presence of duplicate or near-duplicate FAQ pages and Q/A pairs in our collection; phrasal search seems to give highly similar Q/A pairs very similar retrieval scores, which often forced other potentially useful Q/A pairs to drop out of the top 10 or even top 20 results. Offline (at indexing time) and/or online (at query time) clustering and (near-)duplicate detection are essential to eliminate this problem.

A further source of errors is frequent typos in user questions: “who made the got milk?” with goat misspelled, “how do yousay youre wlecome in sign language”, or “how glidders take off” with gliders misspelled. Since we do not try to correct these mistakes, the system does not find appropriate responses for such questions. For a real web application, detection and correction of typos is essential.

A different type of problem we faced was the difficulty of assessing adequacy for some of the questions. First, none of the non-questions (10% in Table 1) sampled from the MetaCrawler query log received adequate or material responses from the system.
However, even for some well-formed questions (e.g., "how to move in combat" or "how to teach the sac"), assessors were unable to reconstruct the original information need and, hence, identify any appropriate Q/A pairs in the results of the system. Nevertheless, we decided not to exclude these non-questions and unclear questions from the evaluation, so as not to change our sample.

Figure 2: The impact of stemming on reciprocal rank, B+D/2+Q vs. B+D/2+S+Q; only test questions with a non-zero reciprocal rank for at least one of the two runs are plotted.

6. RELATED WORK

6.1 Question Answering

FAQ files are a natural source of knowledge. AutoFAQ [32] was an early system that provided access to a small number of FAQs. The FAQ Finder system [4, 5] retrieves existing answers from a small and homogeneous set of FAQ files (mostly associated with USENET newsgroups, with a standardized Q/A pair markup). FAQ Finder was evaluated using questions drawn from the system's log files. It uses a weighted sum of the term vector similarity (between user question and Q/A pair), word overlap between user question and indexed question, and a WordNet-based lexical similarity between user question and indexed question. We use fielded search, based on a mixture model to which indexed questions, answers, and full FAQ pages contribute, optionally together with phrasal search and stemming. Later work on FAQ Finder [22, 23] studied the impact of sense tagging and question typing on computing the similarity between user questions and indexed questions. As discussed previously, the retrieval model that represents indexed questions with a separate model implicitly performs question typing. Operating in a helpdesk environment, eResponder stores Q/A pairs that have previously been asked and answered; these pairs can be used to respond to user questions, or to assist customer service representatives in drafting new responses [6].
See [16] for a similar system and [15] for a variation that uses cluster-based retrieval.

QA beyond factoids is largely unexplored, with a few notable exceptions [2, 3, 12]. Much of today's QA research is centered around the QA tracks at TREC and CLEF, where systems respond to factoids by returning a single exact answer extracted from newspaper corpora. While TREC is slowly moving beyond factoids (by also considering definition questions in 2003, "other" questions in 2004 and 2005, and "relationship finding" in 2005), it focuses on newspaper corpora, instead of the wealth of information on the web.4

As to systems that perform QA against the web, AnswerBus [34] is an open-domain QA system based on sentence-level web IR. Focused on factoids, AnswerBus returns isolated sentences that are determined to contain answers. In its later incarnations, the Start QA system [13] uses an abstraction layer over diverse, semi-structured online content, centered around "object-property-value" queries [14]. Through this layer, Start has access to online sources such as biography.com and the Internet Movie Database to answer factoids. Mulder [18] was a TREC-style QA system used to investigate whether the QA techniques studied in IR can be scaled to the web; its focus was on factoids. AskJeeves (http://www.ask.com) is a commercial service that provides a natural language question interface to the web, relying on human editors to map between question templates and authoritative sites.

Recently, several researchers used large numbers of Q/A pairs mined from the web for QA system development. E.g., Ramakrishnan et al. [28] use them to train a strongly data-driven TREC-style factoid QA system. Soricut and Brill [29] describe a statistical QA system that is trained using roughly 1 million Q/A pairs to deliver answers to questions beyond factoids, where the answers are extracted from documents returned by commercial web search engines.
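Our own retrieval approach, mentioned in the FAQ Finder comparison above, mixes evidence from the indexed question, the answer, and the full FAQ page. The following is a minimal sketch of that kind of fielded mixture scoring; the field weights, the unigram language-model formulation, the Jelinek-Mercer smoothing, and the per-pair background estimate are illustrative assumptions for the sketch, not the tuned configuration from our experiments.

```python
import math
from collections import Counter

def lm_prob(term, field_tokens, background, lam=0.5):
    """P(term | field), Jelinek-Mercer smoothed with a background model."""
    if field_tokens:
        p_field = Counter(field_tokens)[term] / len(field_tokens)
    else:
        p_field = 0.0
    return (1 - lam) * p_field + lam * background.get(term, 1e-6)

def score_qa_pair(query, question, answer, page, weights=(0.5, 0.3, 0.2)):
    """Query log-likelihood under a weighted mixture of three field models:
    the indexed question, the answer text, and the full FAQ page.
    Weights are illustrative, not values from the paper's experiments."""
    fields = [question, answer, page]
    # Background model estimated from this pair's own tokens -- a stand-in
    # for a proper collection model, to keep the sketch self-contained.
    all_tokens = [t for f in fields for t in f]
    background = {t: c / len(all_tokens) for t, c in Counter(all_tokens).items()}
    return sum(
        math.log(sum(w * lm_prob(term, f, background)
                     for w, f in zip(weights, fields)))
        for term in query
    )
```

Under such a scorer, a Q/A pair whose question field shares terms with the user question outranks one that does not; stemming or phrasal matching would plug in at the tokenization stage.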
6.2 Retrieval

Mixing global and local evidence in retrieval has been the focus of much of the research in semistructured retrieval, dating back to the work of Wilkinson [33]. Unlike in more recent applications (e.g., XML retrieval [11]), the retrieval unit (the Q/A pair) is fixed in our task, but smoothing the Q/A pair model with the model of the whole FAQ page does yield some improvements.

6.3 Extraction

Web content mining, i.e., the mining, extraction and integration of useful information from web page contents, is a large area. Wrapper induction and automatic data extraction have been actively pursued since the mid-1990s [17].

Footnote 4: Many TREC and CLEF participants use the web; they use it not as their primary answer source but, for instance, to mine extraction patterns from or for smoothing purposes.

While there has not been much work on extracting Q/A pairs from FAQ pages on the web, there is work on detecting lists and/or repeated items on web pages. E.g., building on [31], Limanto et al. [20] build an extractor for web discussion forums which tries to find repeated patterns using a suffix tree. Closer to our work is the work by Lai et al. [19], who propose a heuristic method for classifying FAQ pages on the web and a list detection method for extracting Q/A pairs; their rule-based extraction method is not evaluated, while their classifier achieves a precision of 80%. McCallum et al. [24] evaluate a supervised method (a maximum entropy Markov model) on a segmentation task that is similar to, but different from, our Q/A pair extraction task: tagging lines of text files as questions and answers. They report scores of 0.86 (precision) and 0.68 (recall) for this task.

6.4 Crawling

As stated previously, general-purpose crawling is not a suitable way to obtain FAQ pages. Focused crawling [7] is not appropriate either, since we are not restricting ourselves to any particular domain. The only robust criterion for "wanted" pages, for us, is that they contain question/answer pairs.
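The family of simple heuristics this criterion leads to can be sketched as follows; the regular expression, the question/answer pairing rule, and the min_pairs threshold are illustrative assumptions, not the actual extraction and classification rules of our system.

```python
import re

# Hypothetical question detector: a line counts as question-like if it
# starts with a question word (optionally after a "Q:" or list-number
# prefix) and ends with a question mark.
QUESTION_RE = re.compile(
    r"^\s*(?:Q[:.)]|\d+[.)])?\s*"
    r"(?:how|what|why|where|when|who|can|do|does|is|are)\b.*\?\s*$",
    re.IGNORECASE,
)

def extract_qa_pairs(lines):
    """Pair each question-like line with the text up to the next question."""
    pairs, question, answer = [], None, []
    for line in lines:
        if QUESTION_RE.match(line):
            if question and answer:
                pairs.append((question, " ".join(answer)))
            question, answer = line.strip(), []
        elif question and line.strip():
            answer.append(line.strip())
    if question and answer:
        pairs.append((question, " ".join(answer)))
    return pairs

def looks_like_faq_page(lines, min_pairs=3):
    """Accept a page as a candidate FAQ page only if it yields several Q/A pairs."""
    return len(extract_qa_pairs(lines)) >= min_pairs
```

A page that yields no such pairs is simply discarded, regardless of how it was located, which is what makes the criterion robust against the noise of keyword-based fetching.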
Our ontology-based method of expanding queries aimed at retrieving FAQ pages from a web search engine (such as Google) differs from techniques in the literature. E.g., Etzioni et al. [9] used recursive query expansion for a similar task, partitioning the set of a search engine's results until the full list becomes accessible. The method works by recursively adding query terms +word or -word to the original query for reasonably frequent words. Our experiments with this method for locating FAQ pages, however, showed that the quality of the retrieved pages (i.e., the proportion of true FAQ files) deteriorates quickly with the length of the expanded query. This may be due to subtle interactions between terms in Google's fielded search ("faq" was requested in the title or URL, while expanded terms should occur in the body of a page). Unlike the technique used in [9], our query expansion method does not try to partition the search space into non-overlapping segments. As a consequence, results for later queries increasingly contain URLs already encountered.

7. CONCLUSIONS

We have presented an open-domain Question Answering system that retrieves answers to users' questions from a database of Frequently Asked Questions automatically collected on the web. The system performs offline data mining (locating FAQ pages on the web and extracting Q/A pairs) to create the database; appropriate Q/A pairs are then retrieved from the database as responses to users' questions. By making the semantic structure of FAQ pages (i.e., the relations between questions and answers) explicit, we are able to avoid an ad hoc definition of the expected answer format, ultimately leaving it to the authors of the FAQ pages to define what a good answer to a question should look like. This is especially important for non-factoid questions, which are the most frequently asked questions on the web.
The performance of the system is measured on a 3-point scale, corresponding to the user modeling we detailed in Section 2. Our system shows impressive performance, even with fairly simple fetching, extraction and retrieval methods. This suggests that FAQ pages on the web are indeed an excellent resource for focused information access providers.

Looking forward, some obvious techniques, such as WordNet relations [23] and question classes [22], are waiting to be integrated into our system. To make the system usable in real-world scenarios, many technical and scientific questions need to be addressed. First, different, more robust and scalable fetching methods have to be designed to obtain substantially larger FAQ corpora. The system has to operate in the dynamic environment of the web, where new data emerges or becomes outdated every hour. More effective Q/A pair extraction techniques should deal with different types of FAQ pages [19]. Duplicate and near-duplicate detection (both offline and online) should be used to eliminate redundancy in presenting results to the user. We are currently experimenting with result clustering and topic detection, both of which we regard as essential, given the often ambiguous nature of user queries.

Another direction for future research concerns our evaluation. We are extending our retrieval evaluation. Additionally, to what extent do our evaluation measures reflect the experience of real users? What happens if we restrict the questions and FAQ pages to be in the same subject area?

In sum, we believe that "QA as FAQ retrieval" is an important step towards providing highly focused information access to web users.

Acknowledgements. We thank our referees for their feedback. This research was supported by the Netherlands Organization for Scientific Research (NWO) under project numbers 017.001.190, 220-80-001, 264-70-050, 354-20-005, 612-13-001, 612.000.106, 612.000.207, 612.066.302, 612.069.006.

8.
REFERENCES
[1] Apache Lucene: A high-performance, full-featured text search engine library. http://lucene.apache.org.
[2] E. Agichtein, S. Lawrence, and L. Gravano. Learning to find answers to questions on the web. ACM Trans. Inter. Tech., 4(2):129–162, 2004.
[3] A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal. Bridging the lexical chasm: statistical approaches to answer-finding. In Proceedings SIGIR 2000, pages 192–199, 2000.
[4] R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg. Natural language processing in the FAQFinder system: Results and prospects. In Proceedings 1997 AAAI Spring Symposium on Natural Language Processing for the World Wide Web, pages 17–26, 1997.
[5] R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg. Question answering from frequently asked question files: Experiences with the FAQFinder system. AI Magazine, 18(2):57–66, 1997.
[6] D. Carmel, M. Shtalhaim, and A. Soffer. eResponder: Electronic question responder. In Proceedings CoopIS 2000, pages 150–161, 2000.
[7] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31:1623–1640, 1999.
[8] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch. TiMBL: Tilburg Memory Based Learner, version 5.0. Tech. Report 03-10, 2003. URL: http://ilk.kub.nl/downloads/pub/papers/ilk0310.ps.gz.
[9] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Web-scale information extraction in KnowItAll (preliminary results). In Proceedings WWW 2004, pages 100–110, 2004.
[10] A. Foster and N. Ford. Serendipity and information seeking: an empirical study. J. Documentation, 59(3):321–340, 2003.
[11] N. Fuhr, M. Lalmas, S. Malik, and Z. Szlavik, editors. Advances in XML Information Retrieval: Third International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2004). LNCS 3493, Springer, 2005.
[12] R. Girju. Automatic detection of causal relations for question answering. In Proceedings ACL 2003 Workshop on Multilingual Summarization and Question Answering, 2003.
[13] B. Katz. Annotating the World Wide Web using natural language. In Proceedings RIAO'97, 1997.
[14] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A. McFarland, and B. Temelkuran. Omnibase: Uniform access to heterogeneous data for question answering. In Proceedings NLDB 2002, 2002.
[15] H. Kim and J. Seo. High-performance FAQ retrieval using an automatic clustering method of query logs. Information Processing & Management, in press.
[16] L. Kosseim, S. Beauregard, and G. Lapalme. Using information extraction and natural language generation to answer e-mail. Data & Knowledge Engineering, 38(1):85–100, 2001.
[17] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2):15–68, 2000.
[18] C. Kwok, O. Etzioni, and D. Weld. Scaling question answering to the web. In Proceedings WWW 2001, pages 150–161, 2001.
[19] Y.-S. Lai, K.-A. Fung, and C.-H. Wu. FAQ mining via list detection. In Proceedings Coling Workshop on Multilingual Summarization and Question Answering, 2002.
[20] H. Limanto, N. Giang, V. Trung, N. Huy, J. Zhang, and Q. He. An information extraction engine for web discussion forums. In Proceedings WWW 2005, pages 978–979, 2005.
[21] C.-Y. Lin, D. Quan, V. Sinha, K. Bakshi, D. Huynh, B. Katz, and D. Karger. What makes a good answer? The role of context in question answering systems. In Proceedings INTERACT 2003, 2003.
[22] S. Lytinen and N. Tomuro. The use of question types to match questions in FAQFinder. In Proceedings AAAI-2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, pages 46–53, 2002.
[23] S. Lytinen, N. Tomuro, and T. Repede. The use of WordNet sense tagging in FAQFinder. In Proceedings AAAI-2000 Workshop on AI and Web Search, Austin, TX, 2000.
[24] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings ICML 2000, pages 591–598, 2000.
[25] G. Mishne and M. de Rijke. Boosting web retrieval through query operations. In Proceedings ECIR 2005, pages 502–516, 2005.
[26] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[27] D. Radev, W. Fan, H. Qi, H. Wu, and A. Grewal. Probabilistic question answering on the web. In Proceedings WWW 2002, pages 408–419, 2002.
[28] G. Ramakrishnan, S. Chakrabarti, D. Paranjpe, and P. Bhattacharya. Is question answering an acquired skill? In Proceedings WWW 2004, pages 111–120, 2004.
[29] R. Soricut and E. Brill. Automatic question answering: Beyond the factoid. In Proceedings HLT/NAACL 2004, 2004.
[30] E. Voorhees. Evaluating answers to definition questions. In Proceedings HLT 2003, 2003.
[31] J. Wang and F. Lochovsky. Data extraction and label assignment for web databases. In Proceedings WWW 2003, pages 197–196, 2003.
[32] S. Whitehead. Auto-FAQ: An experiment in cyberspace leveraging. Computer Networks and ISDN Systems, 28(1–2):137–146, 1995.
[33] R. Wilkinson. Effective retrieval of structured documents. In Proceedings SIGIR 1994, pages 311–317, 1994.
[34] Z. Zheng. AnswerBus question answering system. In Proceedings HLT 2002, 2002.