Paper Identification number 329

Implementation of the ArabiQA Question Answering System's Components

Yassine Benajiba, Paolo Rosso, Abdelouahid Lyhyaoui

Abstract—Most of the components of a Question Answering (QA) system are language-dependent. For this reason, such a system must be built component by component, taking into consideration the peculiarities of the target language. In our case, we are interested in building an Arabic QA system (ArabiQA). In previous works, we presented how we implemented an Arabic-oriented Passage Retrieval (PR) system and a Named Entity Recognition (NER) system for Arabic texts, two of the components necessary to build ArabiQA. In this work we present the different components that ArabiQA will contain, with a special focus on the Answer Extraction (AE) module. This module is required to achieve high precision, because it is responsible for extracting the possible answers from the relevant passages retrieved by the PR module, and it represents the last step before validating the extracted answers and presenting them to the user. The obtained results show that this approach can help significantly to tackle the AE task.

Index Terms—Answer Extraction, Arabic Natural Language Processing, Question Answering.

I. INTRODUCTION: BACKGROUND AND MOTIVATIONS

Information Retrieval (IR) systems were designed to return documents relevant to the user's query. These systems have been very successful because they allow Internet users to find the documents they need quickly and efficiently. However, IR systems are unable to satisfy a special kind of user who is interested in obtaining a simple answer to a specific question (with both the question and the answer formulated in natural language).

The research work of the first author was supported partially by MAEC-AECI.
We would like to thank the PCI-AECI A/7067/06 and MCyT TIN2006-15265-C06-04 research projects for partially funding this work. Y. Benajiba is a Ph.D. student (AECI grant holder) at the Dept. of Informatic Systems and Computation of the Polytechnic University of Valencia, Spain. E-mail: ybenajiba@dsic.upv.es. P. Rosso, Ph.D., is a member of the Natural Language Engineering Laboratory and a permanent lecturer in the Dept. of Informatic Systems and Computation of the Polytechnic University of Valencia, Spain. E-mail: prosso@dsic.upv.es. A. Lyhyaoui, Ph.D., is a member of the Engineering Systems Laboratory and a permanent lecturer at the Abdelmalek Essaadi University, Tangier, Morocco. E-mail: lyhyaoui@gmail.com.

The research line aiming at satisfying this kind of user is the QA task. The QA research guidelines [1] report that there are generally four types of questioner, where each type represents questions with a certain level of complexity. The easiest questions are those concerning specific facts, for instance: "Who is the king of Morocco?", "In which city will ICTIS'07 be held?", etc.; the hardest ones need a system able to deduce and decide the answer by itself, because the question might be something like "Is there any evidence of the existence of weapons of mass destruction in Iraq?". We have also conducted a general analysis of the CLEF (Cross-Language Evaluation Forum, http://www.clef-campaign.org) and TREC (Text REtrieval Conference, http://trec.nist.gov) results. This study shows that, up to now, only the first two question levels have been covered by the QA research community. The most successful system in the French monolingual task of CLEF 2006 used intensive Natural Language Processing (NLP) techniques [2] and obtained an accuracy of 68.95%. Other systems [3][4] obtained an accuracy of 52.63% in the Spanish and French monolingual tasks without using any NLP techniques, in order to make their systems as language-independent as possible.
In TREC 2005 (the TREC 2006 proceedings had not been published yet) the questions were harder to analyze because of their division into groups, each group being related to a target; therefore, anaphora resolution was also needed. The top accuracy of 53.4% [5] was obtained by means of a sophisticated NER system and a statistical method for answer selection. In [6] an accuracy of 46.4% was obtained using a dependency-relation matching technique in the answer extraction module.

The QA task is considered to be difficult both to design for and to evaluate. The CLEF organizers offer an evaluation platform for many languages, but unfortunately the Arabic language is not included among them. The non-existence of a test-bed for the Arabic language makes Arabic QA even more challenging. However, there have been two attempts to build Arabic QA systems, oriented towards: (i) a structured knowledge-based source of information [7]; and (ii) unstructured data [8]. The test-bed of the second QA system was composed of a set of articles of the Raya newspaper (http://www.raya.com). In the evaluation process, four Arabic native speakers were asked to give the system 113 questions and to judge the correctness of its answers manually. Both the reported precision and recall were 97.3%. These (possibly biased) results seem very high if compared with those obtained for other languages in the CLEF 2006 and TREC 2005 competitions. Unfortunately, the test-bed used by the authors is not publicly available, so other QA systems cannot be compared against it.

In this paper we describe the components of the new ArabiQA system we have been developing. The rest of this paper is structured as follows: in Section II we describe the generic architecture of the system; Section III puts a special focus on named entity recognition in Arabic.
The implementation of the Passage Retrieval module is presented in Section IV. Section V describes our first attempt at answer extraction, together with the results of the preliminary experiments we carried out. Finally, we draw some conclusions and discuss the future work to be undertaken in order to fully implement the ArabiQA system.

II. ARABIQA GENERIC ARCHITECTURE

Fig. 1. ArabiQA generic architecture

The generic architecture illustrated in Figure 1 has been adopted to design a QA system oriented to unstructured data. From a general viewpoint, the system is composed of the following components: (i) Question Analysis module: it determines the type of the given question (in order to inform the AE module about the expected type of answer), the question keywords (used by the Passage Retrieval module as a query) and the named entities appearing in the question (which are essential for validating the candidate answers); (ii) Passage Retrieval module: the core module of the system, it retrieves the passages estimated most likely to contain the answer (see Section IV for more details); (iii) Answer Extraction module: it extracts a list of candidate answers from the relevant passages (see Section V for more details); (iv) Answer Validation module: it estimates, for each of the candidate answers, the probability of correctness and ranks them from the most to the least probably correct. The first, third and fourth modules need a reliable Named Entity Recognition (NER) system. In our case, we have used a NER system that we designed ourselves [14] (see Section III for more details).

III. NAMED ENTITY RECOGNISER

A NER system identifies proper names and temporal and numeric expressions in open-domain text. Such a system is needed in many NLP tasks (IR, QA, Information Extraction, clustering, etc.) because, for many languages, named entities represent about 10% of the content of articles [15].
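Before detailing the NER component, the dataflow between the four modules described above can be sketched as follows. Every function name and interface here is an illustrative assumption, not the system's actual API; note how the shared NER plug-in feeds the analysis, extraction and validation stages:

```python
# Hypothetical sketch of the four-module ArabiQA dataflow; all names and
# signatures are illustrative assumptions, not the authors' implementation.

def classify_question(question):
    """(i) Question Analysis: expected answer type (toy rules)."""
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("where") or "which city" in q:
        return "LOCATION"
    return "OTHER"

def extract_keywords(question):
    """(i) Question Analysis: content words, used as the retrieval query."""
    stop = {"who", "is", "the", "of", "in", "which", "will", "be", "held"}
    return [w.strip("?") for w in question.lower().split()
            if w.strip("?") not in stop]

def answer_question(question, ner, retrieve, extract, validate):
    """Chain the four modules; ner/retrieve/extract/validate are plug-ins."""
    q_type = classify_question(question)
    keywords = extract_keywords(question)
    q_entities = ner(question)                    # NEs reused for validation
    passages = retrieve(keywords)                 # (ii) Passage Retrieval
    candidates = extract(passages, q_type, ner)   # (iii) Answer Extraction
    return validate(candidates, q_entities)       # (iv) Answer Validation
```

For instance, "Who is the king of Morocco?" is classified as expecting a PERSON answer and yields the query terms "king" and "morocco".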
As we mentioned before (see Section II), a reliable Arabic NER system is needed for most of the components of ArabiQA. Unfortunately, all of the Arabic NER systems known in the literature were built for commercial ends: Siraj by Sakhr (http://siraj.sakhr.com/), ClearTags by ClearForest (http://www.clearforest.com/index.asp), NetOwl Extractor by NetOwl (http://www.netowl.com/products/extractor.html) and Inxight SmartDiscovery Entity Extractor by Inxight (http://www.inxight.com/products/smartdiscovery/ee/index.php). This motivated us to build our own Arabic NER system. The absence of capital letters in the Arabic language makes the NER task even more challenging. To tackle this problem we made a general study of the approaches which have proved successful for the NER task. In the language-independent NER task of the CoNLL 2002 conference (http://www.cnts.ua.ac.be/conll2002/ner/) the best participants used a Maximum Entropy (ME) approach [16][17][18][19]. A comparison between HMM and ME approaches [20] shows that the ME approach gives better results for the NER task. Moreover, [21], presented at NAACL/HLT 2004 (http://www1.cs.columbia.edu/~pablo/hlt-naacl04/), reports a language-independent NER system performing on English, Chinese and Arabic texts using a ME approach; it reached an F-measure of 38.5 for the Arabic language. The corpora used for training and test are held by the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu). Our Arabic NER system is also based on the ME approach; this approach tackles the problem better than others because of its feature-based model. For training and testing we manually built our own corpora, which are composed of more than 150,000 tokens annotated with the IOB2 scheme [22]. These corpora are freely available on our website (http://www.dsic.upv.es/~ybenajiba).
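A minimal sketch of how such a feature-based exponential (ME) model assigns class probabilities is given below. The tag set follows the IOB2 scheme, but the feature functions and weights are hand-picked toy values, not trained ones:

```python
import math

# Toy sketch of a maximum-entropy (exponential) classifier for NER; the IOB2
# tag set is real, but the features and lambdas are illustrative assumptions.

CLASSES = ["B-PERS", "I-PERS", "B-LOC", "I-LOC", "O"]

def features(context, c):
    """Binary features f_i(x, c) tying context information to a class."""
    return [
        1.0 if context["prev_tag"] == "B-PERS" and c == "I-PERS" else 0.0,
        1.0 if context["in_location_gazetteer"] and c == "B-LOC" else 0.0,
        1.0 if not context["in_location_gazetteer"] and c == "O" else 0.0,
    ]

def prob(c, context, lambdas):
    """p(c|x): exponential of the weighted feature sum, normalized over classes."""
    def score(cls):
        return math.exp(sum(l * f
                            for l, f in zip(lambdas, features(context, cls))))
    return score(c) / sum(score(cls) for cls in CLASSES)
```

With these toy weights, a word found in a location gazetteer receives most of its probability mass on the B-LOC class, and the per-word probabilities always sum to one by construction.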
TABLE I
RESULTS OF OUR ARABIC NER SYSTEM

               Precision   Recall   F-measure
Location        82.17%    78.42%      80.25
Misc            61.54%    32.65%      42.67
Organisation    45.16%    31.04%      36.79
Person          54.21%    41.01%      46.69
Overall         63.21%    49.04%      55.23

For proper name recognition we have chosen the exponential model, in which the probability of a word x belonging to a class c can be expressed as:

p(c|x) = exp( Σ_i λ_i f_i(x,c) ) / Z(x)    (1)

Z(x) is a normalization factor and may be expressed as:

Z(x) = Σ_{c'} exp( Σ_i λ_i f_i(x,c') )    (2)

where c is the class, x is the context information and f_i(x,c) is the i-th feature. The features are binary functions indicating how the different classes are related to one or more pieces of information about the context in which the word x appears. The recognition of temporal and numeric expressions is entirely based on patterns and on a small dictionary containing the names of the days and months in Arabic, as well as numbers written in letters. A more detailed description of our Arabic NER system goes beyond the scope of this paper; for further information, please see [14].

IV. PASSAGE RETRIEVAL

The Passage Retrieval (PR) module is a core component of a QA system. In [9] the authors report that the quality of a QA system depends mainly on the quality of the PR module it uses. In ArabiQA we have tuned the Java Information Retrieval System (JIRS, http://jirs.dsic.upv.es) to be able to process Arabic text [10]. JIRS is a multi-platform, language-independent system. It is based on a density approach, because this proved to be the most successful technique for the PR task [11][12][13] (see Figure 2).
JIRS uses a Distance Density model which compares the n-grams extracted from the question with those of each passage in order to determine the relevant passages. The idea of this model is to give more weight to passages in which the question terms appear closer to each other. To achieve this, JIRS performs two steps. In the first step, it searches for the relevant passages and assigns a weight to each of them; the weight of a passage depends mainly on the question terms appearing in it. The weight associated with a question term can be expressed as:

w_k = 1 − log(n_k) / (1 + log(N))    (3)

where n_k is the number of passages in which the term associated with the weight w_k appears and N is the total number of passages in the system. The second step uses a model which gives more importance to passages where the question n-grams appear close to each other. This model can be expressed as:

sim(p, q) = ( Σ_{x ∈ p} h(x) / d(x, x_max) ) / Σ_{i=1}^{n} w_i    (4)

where x is an n-gram of p formed by q terms, the w_i are the weights defined by (3), and h(x) can be defined as:

h(x) = Σ_{k=1}^{q} w_k if the n-gram x appears in the passage, and 0 otherwise    (5)

and d(x, x_max) is the factor which expresses the distance between the n-gram x and the n-gram with the maximum weight, x_max:

d(x, x_max) = 1 + k · ln(1 + L)    (6)

where L is the number of terms between x and x_max and k is a decay constant.

Fig. 2. JIRS Architecture

Figure 3 shows an illustrative example in which the first passage is more relevant for JIRS than the second one, because the keywords "capital" and "Morocco" appear closer to each other (D=0 in the first passage, whereas D=4 in the second one).

Fig. 3. Illustrating example of the JIRS Distance Density model

We have tested JIRS using the same question-type proportions as in CLEF 2006, preparing a test-set to make the test fully automatic. The test-set consists of a list of 200 questions, a snapshot of the Arabic Wikipedia (http://ar.wikipedia.org; 11,000 documents) and a list of the correct answers (all the elements of the test-set are available on our website, http://www.dsic.upv.es/~ybenajiba).
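A minimal sketch of this distance-density scoring, as we read it, is given below. It is simplified to unigram matches rather than full n-grams, and the decay constant k and the toy corpus statistics in the usage note are assumptions:

```python
import math

# Sketch of a distance-density passage score, simplified to unigram matches;
# the decay constant k is an assumed parameter, not a value from the paper.

def term_weight(n_k, N):
    """Weight of a question term appearing in n_k of the system's N passages."""
    return 1.0 - math.log(n_k) / (1.0 + math.log(N))

def distance_factor(pos, xmax_pos, k=0.1):
    """Distance penalty between a matched term and the heaviest matched term."""
    L = abs(pos - xmax_pos)                 # terms separating the two matches
    return 1.0 + k * math.log(1.0 + L)

def passage_similarity(passage_terms, question_terms, doc_freq, N, k=0.1):
    """Score a passage: matched terms count more when clustered together."""
    weights = {t: term_weight(doc_freq[t], N) for t in question_terms}
    hits = [(i, t) for i, t in enumerate(passage_terms) if t in weights]
    if not hits:
        return 0.0
    # the position of the heaviest matched term plays the role of x_max
    xmax_pos = max(hits, key=lambda hit: weights[hit[1]])[0]
    score = sum(weights[t] / distance_factor(i, xmax_pos, k) for i, t in hits)
    return score / sum(weights.values())
```

Mirroring the example of Figure 3, a passage where "capital" and "morocco" are adjacent scores higher under this sketch than one where four other terms separate them.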
After adapting JIRS to the Arabic language we obtained a coverage (the ratio of the number of correct passages retrieved to the total number of correct passages) of 69% and a redundancy (the average number of passages returned per question) of 3.28. However, we only reached this coverage and redundancy after performing light stemming on all the elements of the test bed. This preprocessing step was necessary because Arabic is a highly inflectional language, which generally causes significant data sparseness. In the near future we plan to carry out experiments with JIRS on a larger corpus, where information tends to be more redundant, and with various light stemmers, in order to investigate the best technique for tackling the data sparseness mentioned above.

V. ANSWER EXTRACTION

The AE task is defined as the search for candidate answers within the relevant passages. The task has to take into consideration the type of answer expected by the user [23], which means that the AE module should perform differently for each type of question. Using a NER system together with patterns seems to be a successful approach to extract answers for factoid questions [8][24][25]. However, for difficult questions semantic parsing is needed to extract the correct answer [26][27][6]. Other approaches suggest using a statistical method [28].

In this section, we describe an AE module oriented to Arabic text for factoid questions only. Our system performs three main steps: (i) the NER system tags all the named entities (NEs) within the relevant passage; (ii) the system makes a preselection of the candidate answers, eliminating the NEs which do not correspond to the expected type of answer; (iii) the system decides the final list of candidate answers by means of a set of patterns. Figure 4 shows an illustrating example.

The test of the AE module has been carried out automatically with a test-set that we prepared specifically for this task. This test-set consists of: (i) a list of questions of different types; (ii) a list of question types, containing the type of each of the test questions; (iii) a list of relevant passages (we manually built a file containing, for each question, a passage which contains the correct answer); (iv) and finally a list of correct answers, containing the correct answer for each question. We used manually selected relevant passages in order to estimate the exact error rate of the AE module. The measure we used to estimate the quality of our AE module is precision (number of correct answers / number of questions). Using the method described above we reached a precision of 83.3%.

Fig. 4. Illustrating example of the Answer Extraction module's performance steps

VI. CONCLUSION AND FUTURE WORK

In this paper, we have described the generic architecture chosen for ArabiQA (the Arabic QA system that we have been developing). We particularly focused on a first attempt to implement a simple AE module dedicated to factoid questions. In order to implement this module, we developed an Arabic NER system and a set of hand-elaborated patterns for each type of question. To make the test automatic, we manually built a special test-set for this task. We reached a precision of 83.3%, which shows that the technique we used allows for the efficient extraction of a list of possible answers to a factoid question. In the near future we plan to improve this module in order to extract answers to questions that are more difficult to analyse than factoid ones. We are not able to give the accuracy of the complete QA system in this paper because some of its components are still in the implementation phase.
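The three extraction steps described in Section V can be sketched as follows; the toy NER function and the single answer pattern in the usage note are hypothetical stand-ins for the actual modules and hand-built pattern sets:

```python
import re

# Sketch of the three answer-extraction steps; the NER plug-in and the
# pattern set are supplied by the caller and are illustrative assumptions.

def extract_candidates(passage, expected_type, ner, patterns):
    # (i) tag every named entity in the relevant passage
    entities = ner(passage)                              # [(text, type), ...]
    # (ii) pre-select: drop NEs whose type differs from the expected answer type
    preselected = [text for text, ne_type in entities
                   if ne_type == expected_type]
    # (iii) final decision: keep candidates licensed by at least one pattern
    return [text for text in preselected
            if any(re.search(p.format(ans=re.escape(text)), passage)
                   for p in patterns)]
```

For example, with a toy NER returning [("Mohammed VI", "PERSON"), ("Morocco", "LOCATION")] and the pattern r"king of \w+ is {ans}", a PERSON-type question over the passage "The king of Morocco is Mohammed VI." yields the single candidate "Mohammed VI".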
The final goal is to complete the implementation of the Question Analysis and Answer Validation components of the ArabiQA system.

REFERENCES

[1] J. Carbonell, D. Harman, E. Hovy, S. Maiorano, J. Prange, and K. Sparck-Jones, Vision Statement to Guide Research in Question & Answering (Q&A) and Text Summarization. Technical report, NIST, 2000.
[2] D. Laurent, P. Séguéla, S. Négre, Cross Lingual Question Answering using QRISTAL for CLEF 2006. In Working Notes for the CLEF 2006 Workshop.
[3] M. Pérez-Coutiño, M. Montes-y-Gómez, A. López-López, L. Villaseñor-Pineda, A. Pancardo-Rodríguez, A Shallow Approach for Answer Selection based on Dependency Trees and Term Density. In Working Notes for the CLEF 2006 Workshop.
[4] L. Gillard, L. Sitbon, E. Blaudez, P. Bellot, M. El-Béze, The LIA at QA@CLEF-2006. In Working Notes for the CLEF 2006 Workshop.
[5] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, P. Wang, Employing Two Question Answering Systems in TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference, 2005.
[6] R. Sun, J. Jiang, Y. Fan Tan, H. Cui, T. Chua, M. Kan, Using Syntactic and Semantic Relation Analysis in Question Answering. In Proceedings of the Fourteenth Text REtrieval Conference, 2005.
[7] F. A. Mohammed, K. Nasser, H. M. Harb, A knowledge based Arabic question answering system (AQAS). ACM SIGART Bulletin, pp. 21-33, 1993.
[8] B. Hammou, H. Abu-salem, S. Lytinen, M. Evens, QARAB: A question answering system to support the Arabic language. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, ACL, pp. 55-65, Philadelphia, 2002.
[9] F. Llopis, J. L. Vicedo, A. Ferrandez, Passage Selection to Improve Question Answering. In Proceedings of the COLING 2002 Workshop on Multilingual Summarization and Question Answering, 2002.
[10] Y. Benajiba, P. Rosso, Adapting the JIRS Passage Retrieval System to the Arabic Language. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, CICLing-2007, Springer-Verlag, LNCS (4394), Mexico D.F., Mexico.
[11] C. Amaral, H. Figueira, A. Martins, P. Mendes, C. Pinto, Priberam's Question Answering System for Portuguese. In Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Revised Selected Papers, Vol. 4022 of Lecture Notes in Computer Science, pp. 410-419, Springer, 2006.
[12] A. Ittycheriah, M. Franz, W.-J. Zhu, A. Ratnaparkhi, IBM's Statistical Question Answering System. In Proceedings of the Ninth Text REtrieval Conference (TREC-9), pp. 229-234.
[13] G. G. Lee, J. Seo, S. Lee, H. Jung, B.-H. Cho, C. Lee, B.-K. Kwak, J. Cha, D. Kim, J. An, H. Kim, K. Kim, SiteQ: Engineering a high-performance QA system using lexico-semantic pattern matching and shallow NLP. In Proceedings of the Tenth Text REtrieval Conference (TREC-10), pp. 422-451.
[14] Y. Benajiba, P. Rosso, J. M. Benedí, ANERsys: An Arabic Named Entity Recognition System based on Maximum Entropy. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, CICLing-2007, Springer-Verlag, LNCS (4394), Mexico D.F., Mexico, pp. 143-153.
[15] N. Friburger, D. Maurel, Textual Similarity Based on Proper Names (MFIR'2002), at the 25th ACM SIGIR Conference, Tampere, Finland, 2002, pp. 155-167.
[16] O. Bender, F. J. Och, H. Ney, Maximum Entropy Models for Named Entity Recognition. In Proceedings of CoNLL-2003, Edmonton, Canada, 2003.
[17] H. L. Chieu, H. T. Ng, Named Entity Recognition with a Maximum Entropy Approach. In Proceedings of CoNLL-2003, Edmonton, Canada, 2003.
[18] J. R. Curran and S. Clark, Language Independent NER using a Maximum Entropy Tagger. In Proceedings of CoNLL-2003, Edmonton, Canada, 2003.
[19] S. Cucerzan and D. Yarowsky, Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, pp. 90-99.
[20] R. Malouf, Markov Models for Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan, 2002.
[21] R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla, X. Luo, N. Nicolov and S. Roukos, A Statistical Model for Multilingual Entity Detection and Tracking. In Proceedings of NAACL/HLT, 2004.
[22] A. Ratnaparkhi, Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998.
[23] D. Moldovan, M. Pasca, S. Harabagiu and M. Surdeanu, Performance issues and error analysis in an open-domain question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), 2002.
[24] R. J. Cooper and S. M. Ruger, A Simple Question Answering System. In Proceedings of TREC-9, 2000.
[25] S. Abney, M. Collins and A. Singhal, Answer Extraction. In Proceedings of ANLP 2000.
[26] D. Mollá and B. Hutchinson, Dependency-based semantic interpretation for answer extraction. In Proceedings of the Australian NLP Workshop, 2002.
[27] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, and P. Wang, Employing Two Question Answering Systems in TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference, 2005.
[28] G. S. Mann, A Statistical Method for Short Answer Extraction. In Proceedings of the ACL-2001 Workshop on Open-Domain Question Answering, Toulouse, France, July 2001.