School of Computing
FACULTY OF ENGINEERING
Automatic Part-of-Speech Tagging of Arabic Text
Majdi Sawalha
sawalha@comp.leeds.ac.uk
Supervisor
Dr. Eric Atwell
eric@comp.leeds.ac.uk
Outline:
• Introduction
• Research focus and questions
• A word about Arabic Language
• Arabic Language Corpora
• Gold standard for evaluation
• Arabic Morphological Analysers and Stemmers
• Prior-Knowledge broad-lexical resource
• Hybrid Part-of-Speech tagger of Arabic language
Introduction
• What is Part of Speech Tagging?
• What is a tag?
• What is a tagset?
Our Aim
How can we widen the scope of Arabic Part-of-Speech
tagging to develop a system that can process
Arabic text in a wide range of formats, domains, and
genres, in both vowelized and non-vowelized text?
Research focus and questions
How can we widen the scope of Arabic Part-of-Speech tagging to
develop a system that can process Arabic text in a wide range of
formats, domains, and genres, in both vowelized and non-vowelized text?
Research sub-questions:
• Can richer lexical resources derived from dictionaries and
grammar textbooks improve the coverage of morphological analysis
for a wider range of Arabic text formats, domains, and genres?
• How do we evaluate existing Part-of-Speech taggers and a new
Part-of-Speech tagger on a wider range of text formats, domains,
and genres, in both vowelized and non-vowelized text?
• How do I make the best reuse of existing tagger components
and methods?
Introduction
Tagging Applications
• A good tagger can serve as a preprocessor.
• Large tagged text corpora are used as data for linguistic
studies.
• Information technology applications:
• Text indexing and retrieval.
• Speech processing.
A word about Arabic Language
Arabic linguists classify Arabic words into three
main categories:
• Verbs: words that denote an action and have tense.
• Nouns: names of a person, place, or object; they do not
have tense.
• Particles: words whose meaning cannot be understood without
joining them to a noun, a verb, or both.
A word about Arabic Language
Verb classifications
Verb (الفعل)
• Complete Verb (فعل تام) / Incomplete Verb (فعل ناقص)
• Transitive Verb (فعل متعد) / Intransitive Verb (فعل لازم)
• Active Verb (فعل معلوم) / Passive Verb (فعل مجهول)
Verb (الفعل)
• Perfect / Past Verb (الفعل الماضي)
• Progress Verb (الفعل المضارع)
• Imperative Verb (فعل أمر)
A word about Arabic Language
Nouns
• Arabic language linguists distinguish between 21 types of nouns
• Verbal noun
• Original noun
• Pronoun
• Personal noun
• Demonstrative noun
• Joining nouns
• Interrogative noun
• Conditional noun
• Generalization nouns
• Adverb
• Present participle
• Past participle
• Adjective
• Increased present participle
• Comparative and superlative (comparing and contrasting entities)
• Adverb of place
• Adverb of time
• Noun of instrument
• Proper noun
• Noun of genus
• Ordinal number nouns
• Verb noun
• The five nouns
A word about Arabic Language
Particles
[Diagram: classification of particles. Node labels: Meaning Particles, Building Particles; Active Particles, Inactive Particles; Effects on: Verb, Noun, Both; Jussive, Subjunctive, Partial subjunctive, Genitive case, Vocative, Exception, Conjunction]
Arabic Language Tagset
Evaluating existing Arabic tagsets
• Each researcher has developed their own tagset, either detailed or
minimal.
• A comparison of different tagsets will show:
• The number of tags used.
• The purpose of the tagset.
• The source of information used when designing the tagset.
• The errors in classifying tags into their categories.
• Designing a more reliable, multi-level tagset that
ranges from a minimal tagset to a more detailed one.
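One way to picture a multi-level tagset is as a mapping from each detailed tag to its path of coarser labels. The sketch below is only illustrative: the tag names (`V_PERF_ACT`, `N_PROP`, and so on) and the number of levels are hypothetical and are not the tagset proposed in this work.

```python
# Hypothetical multi-level tagset: each detailed tag lists its labels from
# the most general level (main category) down to the most specific one.
TAGSET = {
    "V_PERF_ACT": ("Verb", "Perfect verb", "Active perfect verb"),
    "V_IMPV": ("Verb", "Imperative verb"),
    "N_PROP": ("Noun", "Proper noun"),
    "P_CONJ": ("Particle", "Conjunction"),
}


def project(tag, level):
    """Return the label of `tag` at `level` (0 = minimal tagset)."""
    path = TAGSET[tag]
    # Clamp: asking for a deeper level than the tag has returns its last label.
    return path[min(level, len(path) - 1)]


def project_sequence(tags, level):
    """Project a whole sequence of detailed tags onto one level."""
    return [project(t, level) for t in tags]
```

With this layout, the same annotated text can be read with the minimal tagset (`level=0`) or with progressively more detailed ones, without re-annotating.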
A word about Arabic Language
Arabic Language challenges
• Writing constraints lead to ambiguities.
• Tokenization.
• Agglutination.
• Complex Morphology.
• Vowel Marks.
• Grammatical ambiguity: 2.8 in vowelized text and 5.6 in non-vowelized text
Tokenization
• What is a token?
• Main tokens are delimited by white space or a punctuation mark
( ، ! ؛ ؟ . etc.).
• Arabic morphology allows words to be prefixed or suffixed with clitics.
• Clitics can be concatenated one after the other.
• Arabic clitics are not easily recognizable.
• A single word can comprise up to four independent morphemes.
• The tokenizer is responsible for:
• Defining word boundaries.
• Demarcating clitics, multiword expressions, abbreviations, and numbers.
• Affixes carry morpho-syntactic features
(Tense, Person, Gender, Number).
• Clitics serve syntactic functions
(Negation, Definition, Conjunction, Preposition).
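The first step, splitting text into main tokens at white space and punctuation, can be sketched as follows. This is a minimal illustration, not the tokenizer of this project: the punctuation set contains only the marks listed on the slide, and the function name `main_tokens` is an assumption. Clitics, multiword expressions, abbreviations, and numbers are not handled here.

```python
import re

# Punctuation marks listed above: Arabic comma, exclamation mark,
# Arabic semicolon, Arabic question mark, full stop. Extend as needed.
PUNCTUATION = "\u060C!\u061B\u061F."
_CLASS = re.escape(PUNCTUATION)

# A token is either one punctuation mark, or a run of characters that are
# neither white space nor punctuation.
TOKEN_RE = re.compile(rf"[{_CLASS}]|[^\s{_CLASS}]+")


def main_tokens(text):
    """Split text into main tokens; punctuation marks become separate tokens."""
    return TOKEN_RE.findall(text)
```

For example, `main_tokens("ktb AlktAb!")` yields the two words followed by the exclamation mark as its own token.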
Tokenization
• Most Arabic words consist of a stem or root and a combination of prefixes
and suffixes.
1- Root: كتب (ktb) "Wrote"
2- Prefix(es) + Root: يكتب (yktb) "Write"
3- Root + Suffix(es): كتبه (ktbh) "Wrote it"
4- Prefix(es) + Root + Suffix(es): يكتبه (yktbh) "Writing it"
5- Stem: كتاب (ktAb) "Book"
6- Prefix(es) + Stem: الكتاب (AlktAb) "The book"
7- Stem + Suffix(es): كتابهم (ktAbhm) "Their book"
8- Prefix(es) + Stem + Suffix(es): وكتابهم (wktAbhm) "And their book"
Example: وليكتبونها [ wlyktbwnhA ] (And they write it)
و * ل * ي * كتب * ون * ها (w*l*y*ktb*wn*hA)
• و (w): Conjunction
• ل (l): Preposition
• ي (y): Progressive letter
• كتب (ktb): Root
• ون (wn): Relative Pronoun (Plural/Subject)
• ها (hA): Relative Pronoun (Object)
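The segmentation of wlyktbwnhA into w*l*y*ktb*wn*hA can be imitated with a naive affix stripper working on the transliterated form. This is a toy sketch under strong assumptions: the prefix and suffix lists contain only the clitics and affixes that appear in the examples above, and the minimum stem length of three letters is an arbitrary guard. A real analyser needs a lexicon, since this approach wrongly splits any word whose stem happens to begin or end with one of these letters.

```python
# Hypothetical affix lists, taken only from the examples on this slide
# (transliterated forms). Longer suffixes are listed before shorter ones.
PREFIXES = ("Al", "w", "l", "y")
SUFFIXES = ("hm", "hA", "wn", "h")
MIN_STEM = 3  # assumed guard: never strip below three letters


def segment(word):
    """Greedily strip known prefixes, then known suffixes, from `word`."""
    prefixes, suffixes = [], []

    changed = True
    while changed:
        changed = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
                prefixes.append(p)
                word = word[len(p):]
                changed = True
                break

    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
                suffixes.insert(0, s)  # keep suffixes in surface order
                word = word[:-len(s)]
                changed = True
                break

    return prefixes + [word] + suffixes


print("*".join(segment("wlyktbwnhA")))  # prints w*l*y*ktb*wn*hA
```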
Vowels & Diacritical marks
• Arabic has 2 types of vowels:
1- Long vowels: Alif ا, waw و, yaa ي (part of the Arabic letters).
2- Short vowels: small vowel marks that are not part of the Arabic letters.
These marks are placed above or below the Arabic letters.
• Arabic has 5 other diacritical marks:
• Nunation: the doubling of the short vowels, used at the end of indefinite
nouns.
• Sukun (absence of a vowel): the consonant is not followed by a vowel.
• Gemination (Shadda): duplication of the consonant.
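In Unicode, the three short vowels and the five other marks described above occupy the code points U+064B to U+0652. A minimal sketch for detecting vowelized text and producing its non-vowelized form could look like this; the function names are assumptions, and other Arabic marks outside this range are deliberately ignored.

```python
# U+064B..U+0652: three nunation marks, three short vowels (fatha, damma,
# kasra), shadda, and sukun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Return the non-vowelized form of `text`."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def is_vowelized(text):
    """True if `text` contains at least one of the eight marks."""
    return any(ch in DIACRITICS for ch in text)
```

Stripping the marks from a vowelized gold-standard text is one simple way to obtain a matching non-vowelized version of the same text.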
Vowelization & Part-of-Speech Tagging
Importance of diacritics in the Arabic language
• Adding semantic information to words.
• Determining the correct tag of a word in the sentence.
• Indicating the grammatical functions of a word
(mood, aspect, and voice endings for verbs; case endings for nouns).
• Indicating the correct pronunciation of a word, enabling correct
syntactic analysis, and removing semantic confusion
for Arabic readers.
Vowelization & Part-of-Speech Tagging
• Diacritical marks affect the Part-of-Speech tag of the word
and its meaning
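A standard illustration is the non-vowelized form ktb, which can be read as kataba ("he wrote"), kutiba ("it was written"), or kutub ("books"). The sketch below groups vowelized readings by their bare form to show how removing the marks merges different tags. The three readings are the only data used, the coarse tag labels are hypothetical, and the forms are written in Buckwalter-style transliteration, where a, i, u, o, ~, F, N, K stand for the diacritical marks.

```python
# Buckwalter-style symbols for the short vowels, sukun, shadda, and nunation.
DIACRITIC_SYMBOLS = set("aiuo~FNK")

# (vowelized form, gloss, hypothetical coarse tag)
READINGS = [
    ("kataba", "he wrote", "Active verb"),
    ("kutiba", "it was written", "Passive verb"),
    ("kutub", "books", "Noun"),
]


def unvowelize(form):
    """Drop the diacritic symbols from a transliterated form."""
    return "".join(ch for ch in form if ch not in DIACRITIC_SYMBOLS)


def candidate_tags(bare_form):
    """All tags whose vowelized reading collapses to `bare_form`."""
    return {tag for form, _gloss, tag in READINGS
            if unvowelize(form) == bare_form}
```

Here `candidate_tags("ktb")` returns all three tags, whereas each vowelized form on its own has exactly one.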
Corpora (or Corpuses)
Corpus
A collection of samples of texts that are selected and ordered according
to explicit linguistic criteria in order to be used as a sample of the
language.
Applications of Corpora
• Prepare and format text to be used by search tools.
• Useful for linguists, teachers, and learners (advanced level).
• The study of syntactic structure.
• In lexicography, corpora are used to develop good dictionaries.
• Used to train Machine Learning software for grammar analysis, word
clustering, machine translation, …
Arabic Language Corpora
Corpus of Contemporary Arabic (CCA) [University of Leeds Corpus] (2004)
• Engineered by Latifa Al-Sulaiti & Eric Atwell; Written and some spoken;
Around 1M words; TAFL; Websites and online magazines
• FREE to download: http://www.comp.leeds.ac.uk/arabic
Buckwalter Arabic Corpus (1986-2003)
• Written; 2.5 to 3 billion words; Lexicography; Public resources on the Web
An-Nahar Corpus (2001)
• Written; 140M words; General research; An-Nahar newspaper (Lebanon)
Al-Hayat Corpus (2002)
• Written; 18.6M words; Language Engineering and Information Retrieval; Al-Hayat newspaper (Lebanon)
Arabic Gigaword (2002)
• Written; Around 400M words; Natural language processing, information retrieval,
language modelling; Agence France Presse, Al-Hayat news agency, An-Nahar
news agency, Xinhua news agency
Gold Standard Evaluation Corpus
Building a Gold Standard Evaluation Corpus
- Different text domains, formats, and genres of both vowelized
and non-vowelized text:
- The Qur’an.
- Newspaper text.
- Magazines.
- School books.
- Children’s books.
- Blogs (text in blogs can be in Arabic script or in Roman-letter
transcription).
- The Gold Standard will be checked by Arabic language scholars.
Gold Standard Evaluation Corpus
Sample of Qur’an Gold Standard (vowelized)
Alif. Lam. Mim. Do men imagine that they will be left (at ease)
because they say, We believe, and will not be tested with
affliction? Lo! We tested those who were before them. Thus
Allah knoweth those who are sincere, and knoweth those who
feign. Or do those who do ill-deeds imagine that they can
outstrip Us? Evil (for them) is that which they decide. Whoso
looketh forward to the meeting with Allah (let him know that)
Allah's reckoning is surely nigh, and He is the Hearer, the
Knower. And whosoever striveth, striveth only for himself, for
lo! Allah is altogether Independent of (His) creatures. And as for
those who believe and do good works, We shall remit from them
their evil deeds and shall repay them the best that they did. We
have enjoined on man kindness to parents; but if they strive to
make thee join with Me that of which thou hast no knowledge,
then obey them not. Unto Me is your return and I shall tell you
what ye used to do. And as for those who believe and do good
works, We verily shall make them enter in among the righteous.
Sample of Newspaper Gold Standard (non-vowelized)
Globalization will stay a hot topic of discussion for a long
time. In this article, we consider in depth some of the
questions raised by new writers who consider globalization
as a new lifestyle for the modern man. Taking the lead from
America, many writers describe the multi-ethnic and
multicultural American life style as the ideal in the new global
village where telecommunication, transportation, information
systems and the media shorten the distances between
disparate groups. Advocates of this point of view look
forward to a new modern man, the Cosmopolitan man.
Arabic Morphological Analysers and Stemmers
Evaluating stemmers and morphological analyzers
• Three stemming algorithms were compared:
the Shereen Khoja stemmer, the Tim Buckwalter morphological
analyzer, and a tri-literal root extraction algorithm.
• Four different fair evaluation measurements were applied.
• Voting is used to combine the results of the
different algorithms.
• The paper shows that more work in this field is required,
as the stemming algorithms failed to achieve accuracy rates
of more than 75% (Sawalha & Atwell, 2008).
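Combining by voting can be sketched as a plain majority vote over the roots proposed by the individual algorithms. The exact voting scheme used in the paper is not described on this slide, so the version below is an assumption: each algorithm has one vote, and ties go to the algorithm listed first.

```python
from collections import Counter


def vote(candidates):
    """Return the most frequent candidate; ties go to the earliest one.

    `candidates` holds one proposed root per algorithm, in a fixed order.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    counts = Counter(candidates)
    best = max(counts.values())
    for candidate in candidates:
        if counts[candidate] == best:
            return candidate


# Outputs of three hypothetical algorithms for the same word.
print(vote(["ktb", "ktb", "ktAb"]))  # prints ktb
```

Ordering the candidates by the expected reliability of each algorithm makes the tie-break fall back on the most trusted one.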
Prior-Knowledge broad-lexical resource of Arabic Language
• 15 Arabic language dictionaries* are used
• The lexicon contains:
• roots and single words.
• Multi-word expressions.
• Idioms.
• Collocations requiring special part of speech assignment.
• Words with special part of speech tags.
• Meanings.
* Freely available from www.almeshkat.com in MS-Word format
Prior-Knowledge broad-lexical resource of Arabic Language
• Lisan Al-Arab “لسان العرب” (The Tongue of the Arabs)
• Taj Al-Arous min Jawaher Al-Qamus “تاج العروس من جواهر القاموس” (The Bride's Crown from the Jewels of the Dictionary)
Existing Arabic language Part-of-Speech taggers and reuse
• Evaluating existing Part-of-Speech tagger components.
• Gold Standard
• Fair measurements
• Multi-level tagset
• Analyzing and re-implementing the algorithms of Part-of-Speech
taggers.
• The best tagger components need to be re-implemented
in Python.
• Python will simplify the integration of the Part-of-Speech
tagger into NLTK (Natural Language Toolkit).
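Evaluating a tagger against the Gold Standard can be sketched as a token-level accuracy computation. This is only one simple measure and is not claimed to be one of the fair measurements used in this work. Tagged text is represented as a list of (token, tag) pairs, the same shape that NLTK taggers return from their `tag` method; the sample data and tag labels are hypothetical.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("gold standard is empty")
    correct = sum(p_tag == g_tag
                  for (_, p_tag), (_, g_tag) in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical gold standard and tagger output (transliterated tokens).
gold = [("ktb", "Verb"), ("AlktAb", "Noun"), ("w", "Particle"), ("ktAbhm", "Noun")]
pred = [("ktb", "Noun"), ("AlktAb", "Noun"), ("w", "Particle"), ("ktAbhm", "Noun")]
print(accuracy(pred, gold))  # prints 0.75
```

Because the function only compares tag strings, the tags can first be projected onto any level of a multi-level tagset to compare taggers that use different granularities.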
Hybrid Part-of-Speech tagger
• A novel algorithm leading to a hybrid Part-of-Speech tagger
for Arabic text that combines the best components of
existing taggers with novel resources and components.
• Integrating the best tagger components.
• Integrating the prior-knowledge lexical resource.
• Integrating a morphological analyser.
• Using unsupervised learning algorithms to solve the
problem of unknown words.
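One possible way to integrate such components is a back-off chain: each component is tried in turn, and the first one that proposes a tag wins. The sketch below is an assumption about the architecture, not the novel algorithm itself. The lexicon entries and the morphology rule are toy stand-ins, and the unsupervised handling of unknown words is replaced here by a fixed default tag.

```python
# Toy stand-in for the prior-knowledge lexical resource (hypothetical entries).
LEXICON = {"ktAb": "Noun", "w": "Particle"}


def lexicon_lookup(token):
    """Return the lexicon tag for `token`, or None if it is not listed."""
    return LEXICON.get(token)


def toy_morphology(token):
    """Toy rule: a word starting with the definite article Al- is a noun."""
    if token.startswith("Al") and len(token) > 2:
        return "Noun"
    return None


def hybrid_tag(tokens, components, default="UNKNOWN"):
    """Tag each token with the first component that returns a tag."""
    tagged = []
    for token in tokens:
        tag = None
        for component in components:
            tag = component(token)
            if tag is not None:
                break
        # Placeholder for unknown words; no learning happens here.
        tagged.append((token, tag if tag is not None else default))
    return tagged
```

The order of `components` sets the priority: placing the lexicon first means its entries override the guesses of later components.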