Exploring and exploiting a historical corpus for Arabic
Bassam Hammo,
Sane Yagi,
Omaima Ismail,
Mohammad AbuShariah
This paper presents a historical Arabic corpus named HAC. At this early stage of the
project, we report on the design, the architecture, and some of the experiments that we have
conducted on HAC. The corpus, and accordingly the search results, will be represented using a
primary XML exchange format. This will serve as an intermediate exchange tool within the
project and will allow users to process the results offline with external tools. HAC is
made up of Classical Arabic texts that cover 1600 years of language use, the Quranic text,
Modern Standard Arabic texts, and a variety of monolingual Arabic dictionaries. The
development of this historical corpus helps linguists and Arabic language learners
explore, understand, and discover interesting knowledge hidden in millions of instances of
language use. We used techniques from the field of natural language processing to process the
data and a graph-based representation to model the corpus. We provided researchers with an export
facility that makes further linguistic analysis possible.
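
As a rough illustration of the two ideas named above, a graph-based representation and an XML export of search results, the following minimal sketch builds a word-adjacency graph over tokenized text and serializes search hits to XML. It is not HAC's actual implementation: the function names, the whitespace-style tokens, the choice of adjacency as the graph relation, and the `results`/`hit` element names are all assumptions made for illustration, and the abstract does not specify HAC's real schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny sample of pre-tokenized text (assumed tokenization, for illustration only).
SAMPLE = [
    ["بسم", "الله", "الرحمن", "الرحيم"],
    ["الحمد", "لله", "رب", "العالمين"],
]

def build_graph(sentences):
    """Map each token to the tokens that follow it, with occurrence counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            graph[left][right] += 1
    return graph

def search(sentences, word):
    """Return (sentence_index, token_index) pairs where word occurs."""
    return [
        (i, j)
        for i, tokens in enumerate(sentences)
        for j, tok in enumerate(tokens)
        if tok == word
    ]

def hits_to_xml(sentences, hits, query):
    """Serialize search hits to an XML string (hypothetical element names)."""
    root = ET.Element("results", {"query": query})
    for i, j in hits:
        hit = ET.SubElement(root, "hit", {"sentence": str(i), "position": str(j)})
        hit.text = " ".join(sentences[i])
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    graph = build_graph(SAMPLE)
    hits = search(SAMPLE, "الله")
    print(dict(graph["بسم"]))
    print(hits_to_xml(SAMPLE, hits, "الله"))
```

An exported string of this kind can be written to a file and loaded by any XML-aware tool, which is the sort of offline processing the exchange format is meant to enable.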