
Arabic Language Computing applied to the Quran
- YouTube and PowerPoint presentation by Kais Dukes, University of Leeds
TRANSCRIPT
=========================
[0:00] Hello. This is a talk on Arabic language computing applied to the Quran.
A PhD research project by Kais Dukes at the Institute for Artificial
Intelligence and Biological Systems, at the School of Computing,
University of Leeds.
[0:15] If you Google Kais Dukes you will discover his website
which shows he's a software engineer in the financial industry
in the City of London. He's also doing a part-time PhD. Unfortunately he is
very busy at the moment and unable to present in person.
So I, his supervisor Eric Atwell, am presenting this for him.
[0:30] The challenge is to try and find an interdisciplinary approach to
understanding the Quran using ideas from Quranic studies, traditional Arabic
linguistics and computing research, and hopefully feeding back to all three
areas.
[0:45] The Quran is the last in a series of five major religious texts.
Believers hold that God gave the message to the angel Gabriel to
pass it on to Muhammad to learn by heart and pass on to all mankind.
[1:00] It's written in the language of 1300 years ago.
All believers are supposed to try and understand the original text
rather than translations or interpretations. It has guided philosophy,
science and other aspects of knowledge...
[1:15] ...particularly Arabic linguistics, which was developed to try and help
understand the Quran, and it's been the guiding light in theories of syntax,
semantics and discourse analysis, used today on modern English too.
[1:30] As far as computers are concerned, there are many websites where you can
access the Quran, but you can only search verse by verse: you can search for
individual verses which contain given words, so basically Google-style searching.
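As a rough illustration of the verse-by-verse keyword search described here, a minimal Python sketch follows. The function name `search_verses` and the toy placeholder strings are assumptions made for illustration; they are not real verse text and not the code of any actual website.

```python
def search_verses(verses, query):
    """Return references of verses that contain every word of the query.

    `verses` maps a (chapter, verse) tuple to a text string.
    Matching is on whole words and ignores case; meaning is not considered.
    """
    terms = query.lower().split()
    hits = []
    for ref, text in verses.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            hits.append(ref)
    return sorted(hits)


# Hypothetical placeholder data, NOT real verse text.
toy_verses = {
    (1, 1): "mercy and guidance",
    (1, 2): "guidance for travellers",
    (2, 1): "patience and mercy",
}

print(search_verses(toy_verses, "mercy"))
```

A search of this kind only matches surface words, which is why a plain-English question cannot be answered by it directly.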
[1:45] It would be nice in theory to be able to ask questions in plain English,
like "How long should I breastfeed my child for?" and have an AI system which
computes the meaning, and finds the verse which has relevant meaning to answer
the question.
[2:00] Machine learning works by taking data, and then learning patterns and
classifications in the data. If we augment the data with linguistic and
semantic concepts, then the AI system can learn conceptual patterns.
[2:15] So, we need to augment the Quran text with linguistic annotations.
However, this is challenging, as the Quran is written in a complex script
with very difficult word structure, grammar and semantics.
[2:30] But Computational Linguistics research methods offer a solution.
We can get hold of the text of Traditional Arabic grammar textbooks,
extract the meanings of the grammatical descriptions, use this for
machine learning, and then put results online for volunteers to correct.
[2:45] The first task is to get hold of an authentic version of the text. If you
just use modern Windows encoding or Unicode, this doesn't display the original
text correctly.
[3:00] Luckily, there was a project, called the Tanzil project, which started
around the time Kais started this research effort, which came up with a Unicode
XML encoding which allows the text to be displayed authentically in its original
form.
[3:15] So, Kais had to start by developing a Java API or large set of code which
allowed you to read this XML and display the original text authentically on a
web page.
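To make the idea of reading the XML concrete, here is a minimal sketch. It is written in Python rather than Java, and it is not Kais's API. The element and attribute names (`sura`, `aya`, `index`, `text`) and the sample content are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the layout and the text are placeholders.
SAMPLE_XML = """<quran>
  <sura index="1" name="placeholder-a">
    <aya index="1" text="placeholder verse text one"/>
    <aya index="2" text="placeholder verse text two"/>
  </sura>
</quran>"""


def load_verses(xml_text):
    """Parse the XML and return a dict mapping (chapter, verse) to text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.iter("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses


print(load_verses(SAMPLE_XML)[(1, 2)])
```

Once the verses are in a structure like this, a web page can render each one from its Unicode text.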
[3:30] This then allows us to do morphological analysis as a next stage. There
are tools for morphological analysis of Modern Standard Arabic, there has been
some progress on analysing the Quran at the University of Haifa, and there
are also formal lexical representations developed at Columbia University.
[3:45] The trouble with the Haifa corpus is that they didn't really complete it,
so each word has many possible analyses and they were not verified by experts
who know what was correct, and it's a non-standard annotation scheme.
[4:00] So Kais' answer was to develop the Quranic Arabic Corpus website, do a
lot of analysis, and put it all online, for people to see and use, and correct
if necessary, including word structure, word-for-word translations, grammatical
and semantic representations.
[4:15] So here we have the base - the verified Uthmani script for a word. You
have to read the Arabic from right-to-left. Of course, if you don’t speak
Arabic, this doesn’t mean much to you. But if you do speak Arabic, you can see
this is the correct original format.
[4:30] For non-Arabic speakers, there is also a phonetic transcription – not
using the true International Phonetic Alphabet, but something like the standard
Roman alphabet, so if you are an English speaker, even if you learnt English as
a second language, you can probably work this out.
[4:45] Also the assumption is that an awful lot of learners of the Quran do
speak English. So we’ve added an interlinear word-by-word exact translation of
what the Arabic morphemes mean.
[5:00] And there is a referencing system which allows you to locate any
particular chapter, verse, word, and even segment so you can find others which
have the same ones - a complex referencing system.
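A referencing system of this kind can be pictured as a small record with up to four numbers. The sketch below is only an illustration: the colon-separated notation such as "(2:255:3:1)", the class name `Location` and the function `parse_location` are assumptions, not the corpus's documented format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """A position in the text: chapter and verse, optionally word and segment."""
    chapter: int
    verse: int
    word: Optional[int] = None
    segment: Optional[int] = None


def parse_location(ref: str) -> Location:
    """Parse a reference like "(2:255:3:1)" or "1:5" into a Location."""
    parts = [int(p) for p in ref.strip("() ").split(":")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 numbers, got {len(parts)}")
    return Location(*parts)


print(parse_location("(2:255:3:1)"))
```

Because the record is frozen, it can be used as a dictionary key, which makes it easy to look up everything annotated at a given position.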
[5:15] Now on top of that, each Arabic word is quite complex, so a typical word
may have a root, for example a verb, and then a conjunction at the start of it,
and then a subject and object pronoun after it. So, you have to segment the word
into individual parts.
[5:30] And there is quite a lot of detailed information as to what the grammatical
categories of the individual parts are. So this is – reading from right-to-left
– a conjunction, followed by a main verb, followed by subject pronoun, followed
by an object pronoun.
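The segmentation just described can be sketched as a list of labelled parts. This is a minimal illustration under stated assumptions: the class `Segment`, the function `describe` and the placeholder forms are hypothetical, and the category names are plain English stand-ins rather than the corpus's actual tag set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    form: str       # placeholder for the written form of this part of the word
    category: str   # grammatical category of this part


# One word split into four parts, in reading order (placeholder forms).
word = [
    Segment("seg1", "conjunction"),
    Segment("seg2", "verb"),
    Segment("seg3", "subject pronoun"),
    Segment("seg4", "object pronoun"),
]


def describe(segments):
    """Join the categories of a word's segments into one description."""
    return " + ".join(s.category for s in segments)


print(describe(word))
```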
[5:45] And for use by Arabic grammarians, as there are an awful lot of Arabic
grammarians in the Arab world who prefer to speak Arabic, there is also an
automatically generated Arabic translation of the grammatical description.
[6:00] Somewhat more complicatedly, there is a parse structure tree, or diagram,
showing the grammatical structure for each sentence, based on the traditional
Arabic grammar of I’rāb rather than modern linguistics.
[6:15] There is also a quite complex ontology, which is a set of all the
entities or ‘things’. Every noun or pronoun refers to some ‘thing’ and this is
linked to from the text, and you can find all the instances of that from the
ontology.
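One simple way to picture the link between the ontology and the text is an index from each concept to the places where it is mentioned. The sketch below assumes hypothetical concept names and locations; `build_concept_index` is an illustrative helper, not part of the corpus software.

```python
from collections import defaultdict


def build_concept_index(annotations):
    """Map each concept to the list of locations that refer to it.

    `annotations` is an iterable of (location, concept) pairs, where a
    location is a (chapter, verse, word) tuple.
    """
    index = defaultdict(list)
    for location, concept in annotations:
        index[concept].append(location)
    return dict(index)


# Hypothetical annotations: which word refers to which 'thing'.
annotations = [
    ((1, 1, 2), "concept-a"),
    ((1, 3, 1), "concept-b"),
    ((2, 4, 5), "concept-a"),
]

index = build_concept_index(annotations)
print(index["concept-a"])
```

Looking up a concept in the index returns every annotated instance, which is the "find all the instances" step in miniature.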
[6:30] On top of this there is quite a complex framework for collaboration. A
message board, so that anybody finding anything wrong can point it out and a
large set of downloadable resources including the software and the data.
[6:45] This is used by researchers and members of the public worldwide.
This map shows where the users are. Many in America and Britain, but
also around the whole world. And these are not just lay people trying to
read the Quran, but many researchers worldwide.
[7:00] So, as far as AI and computational linguistics are concerned, what’s new? Well, it’s
the first treebank of parsed trees for Classical Arabic, and it’s the only one
that’s freely available. And it’s also a formalism for traditional Arabic
grammar, used in machine learning parsers.
[7:15] This is a novel part-of-speech tagging system. So for each word there is
quite a detailed grammatical category, gleaned from the traditional Arabic
grammar textbooks – but formalized in a computational sense.
[7:30] Kais has also developed a parser, which takes examples of these trees,
and can use machine learning to work out the patterns for parsing and then
apply the parser to new sentences, such as other Classical Arabic sentences.
[7:45] So how does he meet the criteria for Postgraduate Researcher of the Year?
Able to communicate research to a lay and non-specialist audience, impact
on the rest of the world, and engagement with the public.
[8:00] Well, there is a feedback page on the website that includes lots of
feedback from members of the public, but also from some academic researchers
who are non-specialists, such as Professor Michael Arthur, Vice-Chancellor of
Leeds University.
[8:15] And in terms of impact, over a million users have used it in the past
year alone, and obviously it’s just starting, so there will be many more. And
there are lots of interesting users such as a chaplain in the correctional
center of the state of Missouri, as an interesting example.
[8:30] Scientific impact in terms of the subject area - he has already
published many papers including a significant journal entry, and quite a few
citations even though he is only half way through his PhD, and has lots of
positive feedback from other researchers.
[8:45] And there have been news articles, for example in the Muslim Post, and the
website itself has got lots of public users and feedback; that
is definitely public engagement on a worldwide scale, shall we say.
[9:00] So, to conclude, well this isn’t the conclusion because he’s only half
way through his PhD project, given it’s a part-time PhD project. So, I hope you
are going to give him the award of Postgraduate Researcher of the Year; if not,
he can come back and try again next year.