Finding Entries in an On-line Arabic Dictionary
27 May 2010
27th Annual HCIL Symposium
Sarah C. Wayland, C. Anton Rytting, David Zajic, Timothy Buckwalter, Jason White, Corey Miller, Jeffrey Carnes, Nathanael Lynn, Paul Rodrigues, Michael Maxwell, Evelyn Browne

Arabic is not English
• Different sounds (e.g., voiceless uvular /q/, retroflex /l/, voiced velar fricative /gh/, glottal stop /‘/)
• Different letters (مباريات)
• Different morphology (templatic vs. affixative)
• Written form doesn’t reflect spoken dialect
• Keyboard has different layout/letters

LANGUAGE RESEARCH IN SERVICE TO THE NATION

Many informal texts diverge from Modern Standard Arabic
Texts differ from classroom Arabic in orthography, morphology, and lexical content.
Orthographic differences are based on dialect pronunciations, typographical errors, and ... “style.”

Orthographic Differences
Some dialects use non-standard characters

Dialect | Native (no vowels)
MSA (Modern Standard Arabic) | كلب
Iraqi (with Persian character) | چلب
Iraqi (with MSA character) | جلب

SATTS (no vowels): KLB, #CLB, J-LB, JLB

Phonetic Differences
Consonants sometimes vary across dialects

Dialect | Consonant | Native | Transliteration | Pronunciation
Educated Urban (MSA) | ق | قلب | qlb | /qalb/
Iraq | گ | گلب | glb | /gaLub/
Sudan | غ | غلب | ghlb | /ghaLib/
Cairo | أ | ألب | ’lb | /’alb/

Morphologically Complex

Native | Transliteration | Gloss
قلب | *qalb | “heart”
القلب | Al-qalb | “the-heart”
قلوب | *quluwb | “hearts”
القلوب | Al-quluwb | “the-hearts”
قلبي | qalb-iy | “my-heart”
قلوبنا | quluwb-naA | “our-hearts”
قلبك | qalb-ak | “your-heart (to a man)”
قلبك | qalb-ik | “your-heart (to a woman)”
قليب | qulayb | “little heart”

* (the only forms listed in the dictionary)

The Arabic keyboard makes difficult-to-detect typos likely
• Adjacent letters are often visually similar
• Adjacent letters also often sound similar
  – with contrasts not found in English
  – with contrasts subject to place assimilation
  – particularly so in some dialect pronunciations

Putting DYM…? together
• A query is checked by composing a single-string finite state automaton (FSA) with:
  – weighted keyboard, visual, and sound-based FSTs
  – a dictionary FSA (with weights for dialect variants)
• The n-best paths yielding unique strings are calculated
• The corresponding strings are displayed to the user

[Figure: query letters H ح, A ا, R ر, B ب; FST labels: keyboard, visual, sound-based; candidate strings: HARB, ?ARB, OARB, ....]

[Screenshots of the dictionary interface; visible controls: Show verbs, Show non-verbs, Download Results]

Arabic is not English!
• One user interface for all languages will not work
• We must customize the user interface to take into account the unique structure of each language

Sarah C. Wayland
swayland@casl.umd.edu
301-226-8938
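
The candidate search outlined under "Putting DYM…? together" can be illustrated with a small sketch. This is not the authors' implementation: it replaces general FST composition with a substitution-only best-first search over a dictionary trie, which is enough to show how a query string, weighted confusion tables, and a weighted dictionary combine into an n-best list of unique strings. The confusion pairs, all costs, the dialect weights, and the names `CONFUSIONS`, `LEXICON`, and `n_best` are hypothetical and chosen only for illustration.

```python
import heapq
import itertools

# Hypothetical confusion costs (lower = more plausible). Toy values only;
# a real system would keep separate keyboard, visual, and sound-based tables.
CONFUSIONS = {
    ("K", "Q"): 1.0,  # assumed sound-based confusion
    ("K", "J"): 1.5,  # assumed keyboard confusion
    ("Q", "G"): 1.0,  # assumed dialect-pronunciation confusion
}

# Toy dictionary: entry -> weight (0.0 = standard form, >0 = dialect variant).
LEXICON = {"KLB": 0.0, "QLB": 0.0, "JLB": 0.5, "GLB": 0.5}


def build_trie(lexicon):
    """Store the dictionary as a trie; the key None marks a final state."""
    root = {}
    for word, weight in lexicon.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = weight
    return root


def substitutions(ch):
    """Yield (output symbol, cost) pairs for one input symbol."""
    yield ch, 0.0  # identity: the symbol was typed correctly
    for (a, b), cost in CONFUSIONS.items():
        if a == ch:
            yield b, cost
        elif b == ch:
            yield a, cost


def n_best(query, lexicon, n=3):
    """Return up to n unique dictionary strings, cheapest first."""
    trie = build_trie(lexicon)
    tie = itertools.count()  # unique tie-breaker so dicts are never compared
    heap = [(0.0, next(tie), 0, "", trie)]
    results, seen = [], set()
    while heap and len(results) < n:
        cost, _, pos, out, node = heapq.heappop(heap)
        if node is None:  # an accepted path, popped in cost order
            if out not in seen:
                seen.add(out)
                results.append((out, cost))
            continue
        if pos == len(query):
            if None in node:  # add the dictionary weight at the final state
                heapq.heappush(heap, (cost + node[None], next(tie), pos, out, None))
            continue
        for ch, sub_cost in substitutions(query[pos]):
            if ch in node:  # follow only paths the dictionary allows
                heapq.heappush(
                    heap, (cost + sub_cost, next(tie), pos + 1, out + ch, node[ch])
                )
    return results
```

With these toy tables, `n_best("KLB", LEXICON)` ranks the exact match first, then the candidates reached through one assumed confusion. Insertions and deletions are left out to keep the sketch short; a full transducer-based system would handle them as additional weighted arcs.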