Finding Entries in an On-line Arabic Dictionary

27 May 2010
27th Annual HCIL Symposium
Sarah C. Wayland, C. Anton Rytting, David
Zajic, Timothy Buckwalter, Jason White, Corey
Miller, Jeffrey Carnes, Nathanael Lynn, Paul
Rodrigues, Michael Maxwell, Evelyn Browne
Arabic is not English
• Different sounds (e.g., voiceless uvular /q/, retroflex /l/, voiced velar fricative /gh/, glottal stop /‘/)
• Different letters (مباريات)
• Different morphology (templatic vs. affixative)
• Written form doesn’t reflect spoken dialect
• Keyboard has different layout/letters
LANGUAGE RESEARCH IN SERVICE TO THE NATION
Many informal texts diverge from Modern Standard Arabic
Texts differ from classroom Arabic in orthography, morphology, and lexical content.
Orthographic differences are based on dialect pronunciations, typographical errors, and ... “style.”
Orthographic Differences
Some dialects use non-standard characters
Dialect                           SATTS (no vowels)   Native (no vowels)
MSA (Modern Standard Arabic)      KLB                 كلب
Iraqi (with Persian character)    #CLB, J-LB          چلب
Iraqi (with MSA character)        JLB                 جلب
Phonetic Differences
Consonants sometimes vary across dialects
Dialect                 Letter   Transliteration   Native   Pronunciation
Educated Urban (MSA)    ق        qlb               قلب      /qalb/
Iraq                    گ        glb               گلب      /gaLub/
Sudan                   غ        ghlb              غلب      /ghaLib/
Cairo                   أ        ’lb               ألب      /’alb/
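The table suggests one simple lookup aid: when a dialect spelling is not found, try mapping the dialect letter back to ق. The following is a minimal sketch under stated assumptions, not the authors' implementation. The variant map covers only the four rows above, and the one-word lexicon and the function names are hypothetical.

```python
from itertools import product

QAF = "\u0642"  # ق

# Dialect letters from the table that can stand in for ق.
# Illustrative only; a real system would need a much larger map.
DIALECT_VARIANTS = {
    "\u06AF": QAF,  # گ (Iraq)
    "\u063A": QAF,  # غ (Sudan)
    "\u0623": QAF,  # أ (Cairo)
}

# Hypothetical toy lexicon containing only قلب ("heart").
TOY_LEXICON = {"\u0642\u0644\u0628"}


def msa_candidates(word):
    """Return every spelling obtained by optionally mapping each
    dialect letter back to ق. The original spelling is kept too,
    because the letter may be exactly what the writer intended."""
    options = []
    for ch in word:
        if ch in DIALECT_VARIANTS:
            options.append((ch, DIALECT_VARIANTS[ch]))
        else:
            options.append((ch,))
    return {"".join(combo) for combo in product(*options)}


def lookup(word, lexicon=TOY_LEXICON):
    """Return the candidate spellings that appear in the lexicon."""
    return sorted(c for c in msa_candidates(word) if c in lexicon)
```

With this toy lexicon, the Iraqi, Sudanese, and Cairene spellings in the table all resolve to the MSA entry, while an unrelated word such as كلب returns nothing.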
Morphologically Complex
Transliteration   Gloss                       Native
*qalb             “heart”                     قلب
Al-qalb           “the-heart”                 القلب
*quluwb           “hearts”                    قلوب
Al-quluwb         “the-hearts”                القلوب
qalb-iy           “my-heart”                  قلبي
quluwb-naA        “our-hearts”                قلوبنا
qalb-ak           “your-heart (to a man)”     قلبك
qalb-ik           “your-heart (to a woman)”   قلبك
qulayb            “little heart”              قليب
* (the only forms listed in the dictionary)
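Because only the headwords for “heart” and “hearts” are listed, a lookup tool has to relate the other forms back to them. The following is a minimal sketch, working on the slide's transliteration and assuming a hypothetical affix list drawn only from the examples above. It is not the authors' analyzer, and real Arabic morphology is far richer: the diminutive qulayb, for example, cannot be reached by stripping affixes.

```python
# The two listed headwords, in transliteration.
LISTED = {"qalb", "quluwb"}

# Affixes seen in the examples above; an illustrative, incomplete list.
PREFIXES = ["Al-"]
SUFFIXES = ["-iy", "-naA", "-ak", "-ik"]


def headword_candidates(form):
    """Strip at most one prefix and one suffix, then return the
    resulting stems that are listed headwords."""
    stems = {form}
    for prefix in PREFIXES:
        if form.startswith(prefix):
            stems.add(form[len(prefix):])
    # Iterate over a copy so stems can grow while we loop.
    for stem in list(stems):
        for suffix in SUFFIXES:
            if stem.endswith(suffix):
                stems.add(stem[: -len(suffix)])
    return sorted(stems & LISTED)
```

For instance, `headword_candidates("Al-quluwb")` and `headword_candidates("qalb-ak")` each return a single listed headword, while `headword_candidates("qulayb")` returns an empty list, which is the kind of case templatic morphology makes hard.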
The Arabic keyboard makes difficult-to-detect typos likely
• Adjacent letters are often visually similar
• Adjacent letters also often sound similar
– with contrasts not found in English
– with contrasts subject to place assimilation
– particularly so in some dialect pronunciations
Putting DYM…? together
• A query is checked by composing a single-string finite state automaton (FSA) with:
– weighted keyboard, visual, and sound-based FSTs
– a dictionary FSA (with weights for dialect variants)
• The n-best paths yielding unique strings are calculated
• The corresponding strings are displayed to the user (HARB, ?ARB, OARB, ....)
[Figure: the query H A R B (ح ا ر ب) as a single-string FSA, with arcs labeled keyboard, visual, and sound-based]
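The steps above can be approximated in a few lines. The following is a simplified sketch, not the authors' system: it replaces true FST composition with per-letter substitution costs (no insertions or deletions), and the confusion weights, dictionary entries, and function names are all hypothetical.

```python
import heapq

# Hypothetical confusion costs, standing in for the merged keyboard,
# visual, and sound-based FSTs. Lower cost = more likely confusion.
# All values are made up for illustration.
CONFUSION = {
    ("H", "?"): 1.0,
    ("H", "O"): 1.5,
}

# Toy dictionary: entry -> extra cost (e.g. a penalty for a dialect variant).
DICTIONARY = {"HARB": 0.0, "?ARB": 0.0, "OARB": 0.5, "KLB": 0.0}


def letter_cost(typed, intended):
    """Cost of reading `typed` as `intended`, or None if not confusable."""
    if typed == intended:
        return 0.0
    return CONFUSION.get((typed, intended), CONFUSION.get((intended, typed)))


def did_you_mean(query, n=3):
    """Return the n cheapest dictionary entries reachable from the query
    by letter substitutions. Dictionary keys are unique, so the returned
    strings are unique as well."""
    scored = []
    for entry, entry_cost in DICTIONARY.items():
        if len(entry) != len(query):
            continue  # substitution-only model: lengths must match
        total = entry_cost
        for typed, intended in zip(query, entry):
            cost = letter_cost(typed, intended)
            if cost is None:
                break  # no path through the confusion model
            total += cost
        else:
            scored.append((total, entry))
    return [entry for _, entry in heapq.nsmallest(n, scored)]
```

With these made-up weights, `did_you_mean("HARB")` ranks the exact match first and the two confusable entries after it. A production system would more likely use a weighted finite-state toolkit, so that insertions, deletions, and the dictionary itself are all handled by composition and shortest-path search.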
[Screenshots of the dictionary interface; visible controls include “Show verbs”, “Show non-verbs”, and “Download Results”]
Arabic is not English!
• One user interface for all languages will not work
• We must customize the user interface to take into account the unique structure of each language
Sarah C. Wayland
swayland@casl.umd.edu
301-226-8938