Administrative Information

- Project Title
Relational Queries for Arabic Text Mining
(تحري العلاقات عبر التنقيب في النصوص العربية الرقمية)

- Principal Investigator
Name: Fadi Zaraket
Post: Assistant Professor
Institution: American University of Beirut
Telephone: 70139440
e-mail: fz11@aub.edu.lb
Address: Annex 108, FEA

- Co-Workers
Name: Jad Makhlouta, Institution: American University of Beirut, e-mail: jem04@aub.edu.lb
Name: Mohammad Noureddine, Institution: American University of Beirut, e-mail: man17@aub.edu.lb

- Duration
From Dec 2010 till April 2012

Scientific Information

- Objectives
Automated analysis of Arabic data sets, including texts, publications, records, and digital media, has become essential given the huge volume of digital Arabic content now available. Current solutions adopt data and text mining techniques that are suitable for Latin languages. The Arabic language has unique features that differentiate it from other languages and that text mining and information retrieval techniques can exploit. In this project we aim to develop Arabic text mining tools and techniques that handle complex queries on separate but related Arabic documents. Given a set of documents and a relation expressed in query form, we will use structural analysis techniques to extract graphs representing the relation and the entities involved in the given documents. For example, given a set of traditions related to Prophet Mohammad in textual form, and a set of biographies of narrators also in textual form, a user of our tools can query for a graph relating the traditions to the authenticity of the narrators.
In another example, given several text accounts (محاضر) of security incidents across the country, an officer can request a graph that links the males between 18 and 35 involved in the incidents based on their towns of birth.

- Achievements
We built the following tools.
1. Sarf: a novel Arabic morphological analyzer that solves several problems, such as:
   - run-on words: enables automatic tokenization without assuming that words are delimited by white space
   - multi-word expressions
   - lexicon consistency issues: reduces the size of the lexicon and removes redundant entries
   - lexicon extension issues: allows the lexicon to be extended with learnt morphemes without a blow-up in size
2. Arabic temporal entity extractor based on Sarf and finite state machines (FSMs)
3. Arabic named entity and relation extraction
   - Application to hadith and biographies with cross-document NLP
   - Application to Bible genealogy and family relation extraction

- Perspectives
The morphological analyzer enables machine translation applications and Arabic text mining techniques. We are looking to use partial diacritization for disambiguation.
The research on named entity and relation extraction can be applied to several domains, such as security and clinical medical reports. We are looking to apply the cross-document techniques to corpora in those fields. The current application to hadith and literature enables scholars in the humanities to conduct simulations and queries on the authenticity of narrators of hadith.
The work on temporal entity extraction can be extended to temporal entity normalization, which would enable chronological relation extraction and timeline navigation of documents.

- Publications & Communications
J. Makhlouta and F. Zaraket, “Detection of Arabic Entity Graphs using Morphology, Finite State Machines, and Graph Transformations”, Applied Natural Language Processing (ANLP): Proceedings of the Conference, 2012.
J. Makhlouta, H. Harkous and F. Zaraket, “Detection of Arabic Entity Graphs using Morphology, Finite State Machines, and Graph Transformations”, Computational Linguistics and Intelligent Text Processing (CICLING), 2012.
F. Zaraket and J. Makhlouta, “Arabic Temporal Entity Extracting using Morphological Analysis”, International Journal of Computational Linguistics and Applications, 2012. (accepted to appear)

- Abstract
Text mining concerns automatic information retrieval from textual data, which lacks structure and is consequently difficult to analyze. The process involves extracting features and attributes from text documents that constitute important information, such as dates, names, locations, sums, and relations between these entities. A relational query is a query that involves several entities in a relation, such as two documents where one is earlier than the other and both occur at the same location. In the last decade, not only have new documents been produced directly in digital form, but many old documents have also been ported to digital form. This allows automatic techniques to speed up information retrieval from textual documents.
We built tools to extract morphological, part-of-speech, gloss, and positional features from Arabic text documents, and used these features with manually built finite state transducers to extract named entities and the relations amongst them. We applied the tools to biblical text and family relations, hadith text and narrator relations, biography text and ownership relations, and news text and temporal entities. We also used cross-document natural language processing and novel graph correspondence algorithms to increase accuracy, and achieved above 90% accuracy on most of our tasks.
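
The extraction idea described in the abstract, where per-word features drive a hand-built finite state machine that emits entities and the relations between them, can be illustrated with a minimal sketch. This is not the Sarf implementation. The toy lexicon below stands in for a real morphological analyzer, and the tag names (TELL, FROM, SAID, NAME), the function names, and the sample sentence are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy lexicon standing in for a morphological analyzer.
# It maps a surface word to a coarse feature tag.
LEXICON = {
    "حدثنا": "TELL",  # "narrated to us": opens a chain of narrators
    "عن": "FROM",     # "from": links one narrator to the next
    "قال": "SAID",    # "said": closes the chain
}
# Hypothetical list of known person names (invented sample data).
NAMES = {"مالك", "نافع", "سالم"}


def tag(word):
    """Return the feature tag of a word, or OTHER if it is unknown."""
    if word in LEXICON:
        return LEXICON[word]
    return "NAME" if word in NAMES else "OTHER"


def extract_chains(text):
    """Run a three-state machine over the tags and collect narrator chains."""
    chains, current, state = [], [], "START"
    for word in text.split():
        t = tag(word)
        if state == "START":
            if t == "TELL":
                state = "EXPECT_NAME"
        elif state == "EXPECT_NAME" and t == "NAME":
            current.append(word)
            state = "AFTER_NAME"
        elif state == "AFTER_NAME" and t == "FROM":
            state = "EXPECT_NAME"
        else:
            # The chain ended or broke: emit what was collected so far.
            if current:
                chains.append(current)
            current = []
            state = "EXPECT_NAME" if t == "TELL" else "START"
    if current:  # input ended while a chain was still open
        chains.append(current)
    return chains


def chain_graph(chains):
    """Build a graph with an edge from each narrator to the one they cite."""
    graph = defaultdict(set)
    for chain in chains:
        for narrator, source in zip(chain, chain[1:]):
            graph[narrator].add(source)
    return dict(graph)


sample = "حدثنا مالك عن نافع عن سالم قال"
chains = extract_chains(sample)
graph = chain_graph(chains)
```

In this sketch, `chains` holds one list of names per detected chain and `graph` maps each narrator to the set of narrators they cite. A cross-document query of the kind described in the report would then join such a graph with entities extracted from a second document set, for example narrator biographies.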