Relational Queries for Arabic Text Mining

Administrative Information

- Project Title
English: Relational Queries for Arabic Text Mining
Arabic: تحري العلاقات عبر التنقيب في النصوص العربية الرقمية

- Principal Investigator
Name: Fadi Zaraket
Post: Assistant Professor
Institution: American University of Beirut
Address: Annex 108, FEA
Telephone: 70139440
e-mail: fz11@aub.edu.lb

- Co-Workers
Name: Jad Makhlouta
Institution: American University of Beirut
e-mail: jem04@aub.edu.lb

Name: Mohammad Noureddine
Institution: American University of Beirut
e-mail: man17@aub.edu.lb

- Duration
From Dec 2010 till April 2012

Scientific Information
- Objectives
Automated analysis of Arabic data sets, including texts, publications, records, and digital media,
has become essential given the large volume of digital Arabic content now available. Current
solutions adopt data and text mining techniques that were designed for Latin languages. The Arabic
language has unique features that differentiate it from other languages and that text mining and
information retrieval techniques can exploit.
In this project we aim to develop Arabic text mining tools and techniques that handle complex
queries on separate but related Arabic documents. Given a set of documents and a relation
expressed in query form, we will use structural analysis techniques to extract graphs
representing the relation and the entities involved in the given documents. For
example, given a set of traditions related to Prophet Mohammad in textual form and a set of
biographies of narrators, also in textual form, the user of our tools can query for a graph
relating the traditions to the authenticity of the narrators. In another example, given several text
accounts (محاضر) of security incidents across the country, an officer can request a graph linking
the males between 18 and 35 who are involved in the incidents based on their towns of birth.
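The second example can be sketched as a small graph query. This is only an illustration under stated assumptions: the record fields, the placeholder names, and the function `link_by_birth_town` are hypothetical and stand in for entities that an extraction step would already have pulled out of the incident reports; it is not the project's query engine.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records, standing in for entities already extracted from incident reports.
records = [
    {"name": "person_a", "sex": "male", "age": 22, "birth_town": "Tyre", "incident": "I1"},
    {"name": "person_b", "sex": "male", "age": 30, "birth_town": "Tyre", "incident": "I2"},
    {"name": "person_c", "sex": "male", "age": 40, "birth_town": "Tyre", "incident": "I2"},
    {"name": "person_d", "sex": "female", "age": 25, "birth_town": "Tyre", "incident": "I3"},
    {"name": "person_e", "sex": "male", "age": 19, "birth_town": "Tripoli", "incident": "I3"},
]


def link_by_birth_town(records, min_age=18, max_age=35):
    """Return edges (name1, name2, town) between selected males born in the same town."""
    by_town = defaultdict(set)
    for record in records:
        if record["sex"] == "male" and min_age <= record["age"] <= max_age:
            by_town[record["birth_town"]].add(record["name"])
    edges = set()
    for town, names in by_town.items():
        # Every pair of selected people sharing a town of birth becomes one edge.
        for first, second in combinations(sorted(names), 2):
            edges.add((first, second, town))
    return edges


edges = link_by_birth_town(records)
```

With these sample records only person_a and person_b satisfy the age and sex filter while sharing a town, so the resulting graph has a single edge labelled with that town.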
- Achievements
We built the following tools.
1. Sarf: a novel Arabic morphological analyzer that addresses several problems:
   * run-on words: enables automatic tokenization without assuming
     white-space-delimited words
   * multi-word expressions
   * lexicon consistency issues: reduces the size of the lexicon and removes
     redundant entries
   * lexicon extension issues: allows the lexicon to be extended with learnt
     morphemes without a blow-up in size
2. An Arabic temporal entity extractor based on Sarf and finite state machines (FSMs)
3. Arabic named entity and relation extraction
   * application to hadith and biographies with cross-document NLP
   * application to biblical genealogy and family relation extraction
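To make the run-on word problem concrete, the following minimal sketch splits a string written without white space into known words. It assumes a tiny toy lexicon and a plain exhaustive search with memoization; the function `segment` and the lexicon are hypothetical and do not reproduce how Sarf itself tokenizes.

```python
from functools import lru_cache

# Toy lexicon (an assumption for illustration; a real analyzer uses a full morphological lexicon).
LEXICON = frozenset({"يا", "أيها", "الناس"})

# Three words written with no white space between them, as in a run-on word.
run_on = "يا" + "أيها" + "الناس"


def segment(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon entries."""

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to split the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in lexicon:
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return split_from(0)


segmentations = segment(run_on, LEXICON)
```

Returning all segmentations rather than the first one keeps ambiguity visible, so a later stage can choose among the candidates; a string that cannot be covered by the lexicon yields an empty list.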
- Perspectives
The morphological analyzer enables machine translation applications and Arabic text mining
techniques. We are looking to use partial diacritization for disambiguation.
The research on named entity and relation extraction can be applied to several domains, such as
security and clinical medical reports. We are looking to apply the cross-document techniques to
corpora in those fields.
The current application to hadith and literature enables scholars in the humanities to conduct
simulations and queries on the authenticity of hadith narrators.
The work on temporal entity extraction can be extended to temporal entity normalization, which
would enable chronological relation extraction and timeline navigation of documents.
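As a rough illustration of what temporal entity normalization could look like, the sketch below maps an extracted date expression to a calendar date and orders documents on a timeline. The "day month year" input format, the function `normalize`, and the document identifiers are assumptions made for this example; the month table lists the Levantine Arabic month names.

```python
from datetime import date

# Levantine Arabic month names mapped to month numbers.
MONTHS = {
    "كانون الثاني": 1, "شباط": 2, "آذار": 3, "نيسان": 4,
    "أيار": 5, "حزيران": 6, "تموز": 7, "آب": 8,
    "أيلول": 9, "تشرين الأول": 10, "تشرين الثاني": 11, "كانون الأول": 12,
}


def normalize(expression):
    """Map a 'day month year' expression to a date, or None if it does not fit."""
    parts = expression.split()
    if len(parts) < 3 or not parts[0].isdigit() or not parts[-1].isdigit():
        return None
    # The month name may span two tokens, so join everything between day and year.
    month = MONTHS.get(" ".join(parts[1:-1]))
    if month is None:
        return None
    try:
        return date(int(parts[-1]), month, int(parts[0]))
    except ValueError:
        return None  # for example, day 31 in a 30-day month


# Hypothetical documents, each with one already extracted temporal expression.
docs = {
    "doc_a": "15 آذار 2011",
    "doc_b": "3 كانون الثاني 2011",
    "doc_c": "20 تشرين الأول 2010",
}

# Once normalized, documents can be ordered chronologically.
timeline = sorted(docs, key=lambda name: normalize(docs[name]))
```

Normalized dates are directly comparable, which is what makes relations such as "earlier than" and timeline navigation straightforward to compute.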
- Publications & Communications
J. Makhlouta and F. Zaraket, "Detection of Arabic Entity Graphs using Morphology, Finite
State Machines, and Graph Transformations", Applied Natural Language
Processing (ANLP): Proceedings of the Conference, 2012.
J. Makhlouta, H. Harkous and F. Zaraket, "Detection of Arabic Entity Graphs
using Morphology, Finite State Machines, and Graph Transformations", Computational
Linguistics and Intelligent Text Processing (CICLing), 2012.
F. Zaraket and J. Makhlouta, "Arabic Temporal Entity Extraction using Morphological
Analysis", International Journal of Computational Linguistics and Applications, 2012
(accepted, to appear).
- Abstract
Text mining concerns automatic information retrieval from textual data, which lacks structure
and is consequently difficult to analyze. The process involves extracting features and attributes
that carry important information from text documents, such as dates, names, locations, sums,
and the relations between these entities. A relational query is a query that involves several
entities in a relation, for example two documents where one is earlier than the other and both
occur at the same location. In the last decade, not only have new documents been produced directly
in digital form, but many old documents have also been converted to digital form. This allows
automatic techniques to speed up information retrieval from textual documents. We built tools to
extract morphological, part-of-speech, gloss, and positional features from Arabic text documents
and used them with manually built finite state transducers to extract named entities and the
relations among them. We applied the tools to biblical text and family relations, hadith text and
narrator relations, biography text and ownership relations, and news text and temporal entities.
We also used cross-document natural language processing and novel graph correspondence algorithms
to increase accuracy, and achieved above 90% accuracy on most of our tasks.
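The idea of running a finite state machine over token features to extract entities and relations can be sketched as follows. This is an illustrative assumption, not the transducers built in the project: the tag names `CONNECTOR`, `NAME`, and `OTHER`, the state names, and the function `extract_chain` are hypothetical, and the tags are supplied by hand here where a morphological analyzer would normally provide them.

```python
# Hand-tagged tokens of a hadith narrator chain (tags are hypothetical stand-ins
# for features that a morphological analyzer would provide).
tagged_tokens = [
    ("حدثنا", "CONNECTOR"),
    ("مالك", "NAME"),
    ("عن", "CONNECTOR"),
    ("نافع", "NAME"),
    ("عن", "CONNECTOR"),
    ("ابن", "NAME"),
    ("عمر", "NAME"),
    ("قال", "OTHER"),
]


def extract_chain(tokens):
    """Run a small finite state machine over (word, tag) pairs; return narrator names in order."""
    state = "SEEK_CONNECTOR"
    narrators, current = [], []
    for word, tag in tokens:
        if state == "SEEK_CONNECTOR":
            if tag == "CONNECTOR":
                state = "EXPECT_NAME"
        elif state == "EXPECT_NAME":
            if tag == "NAME":
                current = [word]
                state = "IN_NAME"
            elif tag != "CONNECTOR":
                state = "SEEK_CONNECTOR"
        elif state == "IN_NAME":
            if tag == "NAME":
                current.append(word)  # multi-token name, e.g. "ابن عمر"
            else:
                narrators.append(" ".join(current))
                current = []
                if tag == "CONNECTOR":
                    state = "EXPECT_NAME"
                else:
                    break  # the chain ends where the narrated text begins
    if current:
        narrators.append(" ".join(current))
    return narrators


narrators = extract_chain(tagged_tokens)
# Each edge reads: the first narrator reports from the second.
relations = list(zip(narrators, narrators[1:]))
```

The machine emits entities (narrator names) and the pairing of consecutive names gives the relations, which together form the kind of entity graph that later cross-document steps could refine.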