ARABIC MORPHOLOGY AND POS TAGGING
A short intro with a couple of demonstrations
UW CLMA 11/19/2008

OUTLINE
Arabic morphology: overview of the problem
Prior art, with a demonstration of Buckwalter's AraMorph
Sketch of enhancements to AraMorph
Demonstration
Future directions

ARABIC MORPHOLOGY: OVERVIEW OF THE PROBLEM
Short vowels are not represented
The contrast between diphthongs and long vowels is not represented
Most closed-class morphemes are written as affixes on words of the content categories: nouns, adjectives, verbs, and prepositions

ARABIC MORPHOLOGY: OVERVIEW (CONT.)
Some examples (glossing over a lot of detail):
  Arabic:          شاهد الرجل الفيلم فرجع إلى البيت
  Transliteration: $Ahd Alrjl Alfylm frjE {lA Albyt
  Vocalized:       $aahada r-rajul-u l-fiylm-a fa-rajaEa {ilaA l-bayt-i
  Gloss:           saw-3sg.m the-man-nom the-film-acc and.so-returned-3sg.m to the-house-gen
  Translation:     The man watched the film and then went home.
This example is not so bad

REGULAR EXPRESSIONS FOR ORTHOGRAPHIC WORDS
(conj)? (proclitic_preposition)? noun_stem (plural)? (possessive_pronoun)?
(conj)? (definiteness_marker)? noun_stem (plural)?
(conj)? full_word_preposition (genitive_pronoun)?
(conj)? complementizer (object_pronoun)?
(conj)? (modal)? (ImpVerbSubjAgr) verb_stem (plural_subject_marker)? (object_pronoun)?
(conj)? (modal)? verb_stem (perfVerbSubjAgr)? (object_pronoun)?

INHERENT AMBIGUITY
Some strings have multiple analyses
فقد = fqd is either
  the verb fqd = he lost, OR
  f qd = and so (verbal modal)
فقد سمعته = fqd smEth can be analyzed as
  a) f qd smE t h (and so I had heard him)
  b) fqd smEp h (he lost his reputation)

OTHER ISSUES BEYOND THE SCOPE OF THIS TALK
Arabic spans 14 centuries and 22 countries
It is the liturgical language of over 1 billion Muslims
The standard language has never been a spoken variety
The vernaculars have never been standardized
The LDC corpus is the only annotated corpus that is readily available
The last time I looked, the treebank part was less than a million tokens

PRIOR ART
Buckwalter's AraMorph from LDC (a port from work done @ Xerox)
Ported to Java on top of Lucene (!) by Pierrick Brihaye circa 2003
  http://cvs.savannah.gnu.org/viewvc/aramorph
Tagset and segmentation description
  http://www.ldc.upenn.edu/Catalog/docs/LDC2003T06/POS-info.txt
Buckwalter's transliteration scheme
  http://www.qamus.org/transliteration.htm

AND NOW A DEMONSTRATION OF ARAMORPH
The point here is that most word strings have more than one legal analysis
The other point is that the number of types is quite high unless you do something to reveal the content word behind all the function-morpheme affixes
  kitaab (book)
  al-kitaab (the book)
These two queries, typed in Arabic, return different sets of results on Google

A FEW WORDS WRT ARAMORPH
AraMorph will generate all the legal analyses for which it has an entry in its lexicon
Pierrick Brihaye ported AraMorph to Java
AraMorph is the first stage in a lot of Arabic text processing done by researchers in the US

ENHANCEMENTS TO ARAMORPH
I built this POS tagger in stages on top of Pierrick Brihaye's port of AraMorph
The first thing I did was to port in a bigram model of segmented text from the LDC
This was used to choose the most likely segmentation sequence out of all of the analyses returned by Buckwalter's analyzer

ARCHITECTURE (AS IT EVOLVED)
With a 5-word sliding window, generate all sequences of segmentations for that 5-word window, based on all the analyses returned by AraMorph
This scheme produced acceptable results
Some time later a trigram model of the tags was added and given 50% weight alongside the segmentation scores to decide which tags to keep with the segments

THIS BEARS SOME SIMILARITY TO OTHER WORK DONE IN 2005
Habash, Nizar and Owen Rambow.
Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).
  http://www.mt-archive.info/ACL-2005-Habash1.pdf
His team used Ripper (Cohen, 1996) to learn a rule-based classifier (Rip)
They also used AraMorph as their starting point to produce all legal morphological sequences

HOW WELL DOES THE POS TAGGER PERFORM?
Good question, still TBD
I meant to hold out some of the training data and test against a piece of the LDC corpus, but I ran out of time
Hand analysis puts accuracy at better than 90%
At some point I turned on the option to keep the vowels provided by AraMorph rather than toss them; this is observably less accurate

FIRST: A WORD FROM MY SPONSOR
I'm allowed to talk about this system
I was told that I could expose its functionality on a website
I am not allowed to distribute it or use it for commercial purposes
There is an earlier tagger that does not incorporate Lucene or AraMorph. It is based on Brill's transformation-based (TB) learning @ http://innerbrat.org/segmentTagDownload

THE DEMOS
Tag to Buckwalter transliteration output
Tag to ENAMEX-style tags
Tag to UTF-8 Arabic
Re-attaching the segments
Reduced tagset
Reloading the dictionary every time is annoying
Tag with a server and thin client

FUTURE DIRECTIONS
Any further work will require me to rebuild everything from scratch
Uncouple it from Lucene
Port it to C++ or C#
Bring in a statistical language model or two for recovering the short vowels
Use some state-of-the-art machine learning toolkits to improve performance
Start annotating some of my corpora

FUTURE DIRECTIONS (CONT.)
See if I can embed it in some practical applications such as
  language teaching
  document production
  preprocessing for machine translation systems
  preprocessing ASR
  text to speech
Bootstrap annotation tools for other Afro-Asiatic languages
  Tigrinya, Somali, Hausa, Hebrew, Arabic vernaculars, Amharic, Amazigh, Coptic, Egyptian hieroglyphs, Babylonian, Punic
Help with ODIN??

THE END
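
BACKUP: CHOOSING A SEGMENTATION SEQUENCE (SKETCH)
A supplementary sketch for the architecture slide, not the actual system. One way to pick the most likely segmentation sequence for a window of words is to enumerate every combination of per-word analyses and score each combination with a bigram model over segments. Everything here is an assumption made for illustration: the names CANDIDATES, BIGRAM_LOGPROB, UNSEEN, score, and best_segmentation are hypothetical, the log-probabilities are invented, and the candidate analyses are the two readings of fqd smEth from the inherent-ambiguity slide rather than real AraMorph output.

```python
import itertools

# Hypothetical per-word candidate segmentations (Buckwalter transliteration),
# standing in for the analyses a morphological analyzer would return.
CANDIDATES = {
    "fqd": [("fqd",), ("f", "qd")],
    "smEth": [("smE", "t", "h"), ("smEp", "h")],
}

# Invented bigram log-probabilities over segments; a real model would be
# estimated from segmented, annotated text.
BIGRAM_LOGPROB = {
    ("<s>", "f"): -1.0, ("f", "qd"): -0.5, ("qd", "smE"): -1.0,
    ("smE", "t"): -0.5, ("t", "h"): -1.0, ("h", "</s>"): -0.5,
    ("<s>", "fqd"): -2.5, ("fqd", "smEp"): -2.0, ("smEp", "h"): -1.5,
}
UNSEEN = -10.0  # crude penalty for segment pairs missing from the table


def score(segments):
    """Sum bigram log-probabilities over a flat sequence of segments."""
    padded = ("<s>",) + tuple(segments) + ("</s>",)
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN)
               for pair in zip(padded, padded[1:]))


def best_segmentation(words, candidates):
    """Try every combination of per-word analyses in one window and
    return (score, choice) for the highest-scoring combination."""
    best = None
    for choice in itertools.product(*(candidates[w] for w in words)):
        # Flatten the per-word segment tuples into one segment sequence.
        segments = tuple(seg for word_segs in choice for seg in word_segs)
        s = score(segments)
        if best is None or s > best[0]:
            best = (s, choice)
    return best


if __name__ == "__main__":
    print(best_segmentation(["fqd", "smEth"], CANDIDATES))
```

With these toy scores the sketch prefers reading (a), f qd smE t h. Exhaustive enumeration is only workable because the window is short; the number of combinations grows as the product of the per-word analysis counts.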