TPOST: A Template-Based, n-gram Part-Of-Speech Tagger for Tagalog
Charibeth K. Cheng
Vlamir S. Rabo
College of Computer Studies
De La Salle University - Manila
Keywords: part-of-speech tagger, linguistic resources, natural language processing
TPOST is a template-based n-gram Part-Of-Speech (POS) tagger for Tagalog, designed to use few lexical resources. The key to the algorithm is its use of word features, which consist of (1) predefined words, (2) affixes, and (3) other word characteristics and symbols such as capitalization and hyphens. The predefined word list comprises only 225 basic words that are commonly used in constructing sentences. These words are in their base form and are closed-class words, whose meanings do not change as the language evolves. The affixes include pre-determined prefixes, infixes, and suffixes. These word features are used in stemming, tagging, and disambiguation. TPOST was trained on 1,983 words with 450 distinct features, from the first three chapters of the Book of Philippians. The tagset includes 59 tags classified under 10 major POS tags. On another text from the same domain, with 539 words and 221 distinct word features, TPOST achieved error rates below 8% and 11% for general and specific POS tags, respectively. It was also tested on a corpus from the domain of children's literature consisting of 1,093 words with 397 distinct word features, yielding errors below 17% and 23% for general and specific tags, respectively. Different variations of the algorithm were evaluated to reduce these errors and to make the TPOST algorithm a good foundation for further research in POS tagging.
1 Introduction
One reason why most successful POS-tagging research has been trained and tested on English is the availability of lexical resources that are useful in determining the correct POS of a word. Such resources include lexical dictionaries, tagged corpora, stemmers, morphological analyzers, and compiled tagsets. Several languages lack these resources, which makes it difficult to test the accuracy of existing algorithms. In the case of Tagalog, there is no comprehensive lexicon, nor does a tagged corpus exist. These resources are necessary to train and/or test whether existing algorithms apply to Tagalog. The Baum-Welch algorithm can tag text without lexical dictionaries or a tagged corpus, but a study showed that even a small amount of tagged corpus results in great improvements in its accuracy (Merialdo, 1994).
In languages such as English, features such as suffixes and capitalization are useful in determining the POS of a word. Tagalog has the same features, as well as prefixes, infixes, reduplication, and other features that could be used as markers to determine the POS tag of a word. However, there are differences between the language features of English and Tagalog. For instance, Subject-Verb agreement in English is useful in predicting the POS tag of a word by looking at the word window to the left of the target word. But in Tagalog, the subject and predicate can be interchanged, as shown in Figure 1.1, making it difficult to predict the POS tag of the word.
English: She is a beautiful girl.
Tagalog: Siya ay isang magandang babae.
         Isa siyang magandang babae.
         Isa siyang babaeng maganda.
         Siya ay isang babaeng maganda.
Figure 1.1 No Subject-Verb Agreement in Tagalog
Another issue is the tagset. The Penn TreeBank tagset is not completely applicable to all languages, including Tagalog. Figure 1.2 shows that the word "Si" in Tagalog is used as a determiner for a proper noun, specifically the name of a person. The English language, however, does not have a determiner that signals proper nouns. Furthermore, English uses "-'s" to denote possession, while Tagalog uses the word "ni" and interchanges the owner and the subject. These signal words or determiners must be included in the Tagalog tagset.
English: John is the friend of Jack's sister.
Tagalog: Si John ay ang kaibigan ng babaeng kapatid ni Jack.
Figure 1.2 Use of Determiners in Tagalog
This research explores a way of adapting existing approaches in POS tagging to Tagalog using a small set of lexical resources.
2 Algorithm Parameters

This research focuses on using the features of Tagalog words, a small list of words, the position of a word in the sentence, and a list of known affixes as the basis for determining the tag of a word.

2.1 Word features

The following word features are extracted from a word and are later used in determining its POS tag:

• Capitalization;
• Affixes – including prefixes, infixes, suffixes, duplicates, and their combinations. The list of affixes is based on (Bonus, 2003):
  o Prefix – Natulog (Slept): "Na-" and "Tulog";
  o Infix – Kumain (Ate): "-Um-" and "Kain";
  o Suffix – Tulugan (A place to sleep): "-An" and "Tulog";
  o Duplication – Natutulog (Sleeping): "Tu" (repeated) and "Tulog" (base form);
  o Combination – Nagtutulug-tulugan: Prefix "Nag-", repeated "Tu", "Tulog" (base form), and Suffix "-An".

2.2 The lexicon

TPOST uses 225 words that are in their simplest form and are robust in use. These predefined words are often used to link sentences or thoughts together. Entries in the lexicon are in their base or root forms. The lexicon also includes stopwords that are used to determine the POS tags of succeeding or previous words. The following are examples of stopwords:

• Determiners – used as signal words to point to the word being referenced. Example: Mga, Si, Sina, Ang;
• Conjunctions – used to combine sentences or words of the same class or group. Example: At, O, Saka;
• Linking Verbs – usually follow a noun and are followed by an adjective or verb. Example: Ay;
• And other words such as pronouns, prepositions, and numbers.
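The feature types above can be sketched as a small extractor. This is an illustrative sketch only: the feature codes, the tiny stand-in for the 225-word lexicon, and the prefix subset are our own assumptions, not the paper's actual implementation.

```python
# Illustrative TPOST-style word-feature extraction (all names are ours).

# A tiny stand-in for the 225-word predefined lexicon.
PREDEFINED = {"ang", "si", "ay", "at", "mga", "sa", "o"}

# A few of the prefixes listed in Section 2.4, tried longest first so
# that e.g. "manga" is matched before "mang" or "man".
PREFIXES = sorted(["na", "ma", "mag", "nag", "um", "in", "man",
                   "mang", "manga", "taga", "tag", "pag", "ka"],
                  key=len, reverse=True)

def extract_features(word):
    """Return the word features a TPOST-style tagger could key rules on."""
    features = []
    if word[0].isupper():
        features.append("CAP")           # capitalization feature
    lower = word.lower()
    if lower in PREDEFINED:
        features.append("PRE:" + lower)  # predefined (closed-class) word
    else:
        for p in PREFIXES:               # simple string matching, so
            if lower.startswith(p):      # over-stemming is possible
                features.append("PFX:" + p)
                break
    if "-" in word:
        features.append("HYPHEN")        # hyphenation feature
    return features
```

Note that because matching is purely string-based, Natulog yields the prefix feature "na" even though a morphological analyzer might segment it differently.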
2.3 Position of the word

Aside from the language features of the target word, its position in the sentence is also important in determining its POS tag. In this research, an n-gram tagger was implemented. The tagger checks the surrounding tags and features of the words in a two-word window; that is, the two words before and after the target word are used as additional information to determine its tag.
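The two-word window can be sketched as follows (a minimal illustration; the function name and return shape are our own):

```python
# Collect the two words before and the two words after a target position,
# truncating at sentence boundaries.

def context_window(words, i, size=2):
    """Return (left context, target word, right context) for position i."""
    left = words[max(0, i - size):i]
    right = words[i + 1:i + 1 + size]
    return left, words[i], right

sentence = ["Kumakanta", "si", "John", "."]
left, target, right = context_window(sentence, 1)
# left  -> ["Kumakanta"]   (only one word precedes "si")
# right -> ["John", "."]
```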
2.4 The stemmer

Unlike most POS taggers, TPOST does not use a morphological analyzer. Instead, it determines the affixes attached to a word using simple string matching, possibly resulting in over-stemming or under-stemming. For instance, the rootword Tagalog may be processed as having the prefix Taga-. The following are the affix substrings searched for in a word:
Prefix: di, in, manga, mang, na, ni, taga, tag, i, man, pag, tig, kanda, may, pala, um, kay, ka, ma, nag, pam, pang, sing, sin, ki, nam, pan, mas, mag, mala, nangag, nanga, pa, sa, gang, ga, mam, nang, sang, nan, sim, mangag

Infix: -um-, -in-

Suffix: -g, -uhan, -uhin, -han, -hin, -ran, -rin, -uan, -uin, -an, -in, -n

3 The TPOST System
The TPOST system is divided into two parts, namely, training and tagging. Both processes start with feature extraction, which identifies the features of each word; these are stored in a rules table during training and then used during the tagging phase. The feature extractor consists of simple string manipulations and stopword look-ups. Note that the stopword list contains only 225 words in their base form, so over-stemming and under-stemming may occur and affect both the training and tagging processes. The main output of the training phase is the generated rules table, which consists of three tables used as templates during the tagging phase: the Word Feature table, the Tag table, and the Sentence Template table.
During training, the following steps are performed:
1. The TPOST Processor accepts two files,
namely, (1) the document to be tagged,
and (2) the corresponding tags of the
words in the document.
2. The first input file goes through the Feature Extractor module, wherein each word is checked for whether (1) it is a predefined word, (2) it has affixes, and (3) it starts with an uppercase letter. The module then gets the POS tag of the word from the second input file and stores the POS tag together with the extracted word features in a structure as shown below:
Word1 -> Word
      -> TagCode
      -> Features
Ex. mabait -> mabait
           -> JJD
           -> ~ma
3. The Rule-Checker and Generator then
populates the list of Generated Rules. It
loads one sentence at a time and stores
the structure of the sentence into the list
if it has not been encountered before,
otherwise, it increments the heuristic
value of that sentence structure.
There are three kinds of rules that may be generated and checked:

1) Word-Feature Rule
   <Feature_ID>,<Word_Feature>
   Ex. kumakanta -> W30,@um$1

2) Tag-Name Rule
   <Tag_ID>,<Tag_Name>,(<Feature_ID><Cnt>)+,(<Sentence_ID>)+
   Ex. kumakanta -> T15,VBTR,W30,c1,S1

3) Sentence-Template Rule
   <Sentence_ID>,(<Tag_ID>)+,<Cnt>
   Ex. "Kumakanta si John." -> S5,TST15T20T14T10TE,1
4. The output contains the database of template rules generated by the system. It is used by the tagging process in determining the tag of a word from the information of the words surrounding it. The word structure of a rule is shown below:
   a. The target word's feature list and tag
   b. The previous words' feature lists and tags
   c. The succeeding words' feature lists and tags
   d. The heuristic value of this structure
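One possible in-memory shape for the three rule tables, following the record layouts shown in step 3, could look like the sketch below. The field names, ID scheme, and dictionary layout are our own assumptions, not the paper's actual data structures.

```python
# Three rule tables populated from the running example
# ("kumakanta" -> W30/@um$1/T15/VBTR, sentence "Kumakanta si John." -> S5).

word_features = {"W30": "@um$1"}             # Feature_ID -> word feature

tags = {                                     # Tag_ID -> name, features, sentences
    "T15": {"name": "VBTR",
            "features": {"W30": 1},          # Feature_ID -> count
            "sentences": ["S5"]},
}

sentence_templates = {                       # Sentence_ID -> tag sequence, count
    "S5": {"tags": ["TS", "T15", "T20", "T14", "T10", "TE"],
           "count": 1},
}

def add_or_bump(templates, tag_seq):
    """Store a new sentence structure, or increment the heuristic
    value of a structure already encountered (step 3 above)."""
    for tpl in templates.values():
        if tpl["tags"] == tag_seq:
            tpl["count"] += 1
            return
    sid = "S%d" % (len(templates) + 1)       # next Sentence_ID (our scheme)
    templates[sid] = {"tags": tag_seq, "count": 1}
```

Re-encountering the template for "Kumakanta si John." would bump its heuristic value rather than add a new row, which is how the table stays small.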
The key to the training process is its concept of storing only word features instead of a huge database of rootwords. Word features group words with similar structures, making the tagger more robust to novel words. Also, instead of storing words with their possible tags, word features are categorized under the 59 tags, so the growth in table elements is not very large. Lastly, instead of storing the original training sentences, only sentence templates are stored. These sentence templates contain sequences of POS tags, thereby generalizing over sentences with the same POS structure.
During the actual tagging process, a Rule-based Driven Tagger (RDT) checks whether the features of the word being tagged exist in the Word Feature table by retrieving all the tags associated with each word feature. If more than one tag is paired to a feature, the word is considered ambiguous, while words with no tags are called unknowns. The main responsibility of the RDT is to resolve ambiguous words and unknowns. For ambiguous words, the RDT retrieves all sentence templates having the same contextual predicates and tags as the word being tagged, and computes which among the candidate tags should be selected. For unknowns, the RDT searches all tags having the same contextual predicates while comparing word features to see which tag is closest to the unknown word.
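One way the RDT's template-based disambiguation might work is sketched below: each candidate tag is scored by the sentence templates that place it in the observed context, weighted by their heuristic counts. This scoring is our own simplification of the paper's description, with invented tag and template IDs.

```python
# Score candidate tags for an ambiguous word by votes from matching
# sentence templates (higher heuristic count = stronger vote).

def score_candidates(candidates, context_tags, templates):
    """candidates: list of Tag_IDs; context_tags: (prev_tag, next_tag)."""
    scores = {}
    for tag in candidates:
        scores[tag] = 0
        for tpl in templates.values():
            seq, count = tpl["tags"], tpl["count"]
            for i, t in enumerate(seq):
                if t != tag:
                    continue
                prev_t = seq[i - 1] if i > 0 else None
                next_t = seq[i + 1] if i < len(seq) - 1 else None
                if (prev_t, next_t) == context_tags:
                    scores[tag] += count     # this template votes for tag
    return max(scores, key=scores.get)

templates = {
    "S1": {"tags": ["TS", "T15", "T20", "TE"], "count": 3},
    "S2": {"tags": ["TS", "T7",  "T20", "TE"], "count": 1},
}
best = score_candidates(["T15", "T7"], ("TS", "T20"), templates)
# "T15" wins: its template matches the context with the higher count.
```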
4 The Training Set
The training set is taken from the Bible, from the Book of Philippians (BIBINT, 1998). The reason for this choice is that the Bible was translated using a consistent structure, so patterns could be learned by the system. Philippians has four chapters; the first three were used as the training set, while the last chapter was used as the testing set. This split is patterned after the 1:3 ratio used by (Brill, 1995) and (Ma, 2000). The training set contains 107 sentences with 1,983 words, of which 533 are distinct, and 450 distinct word features. It was tagged using 59 POS tags, classified into 10 general POS tags, namely, verb, noun, adjective, adverb, preposition, determiner, conjunction, pronoun, number, and punctuation marks.
5 The Testing Set
The testing set was taken from the last chapter of the Book of Philippians: 34 sentences with 605 words and punctuation marks, comprising 539 words, of which 223 are distinct, with 221 distinct word features.
6 Summary of Results
The system was first tested using the last chapter of the Book of Philippians. Parameter values and errors were analyzed. It was found that 25% of the errors were due to incorrect stemming. TPOST used an existing stemmer (Bonus, 2003), since stemming is not part of this research; due to the high number of stemming errors, a more sophisticated stemmer is necessary. Running different tests on window sizes, the best word window found was two-previous and two-next. Another test was to add predefined words, which further decreased the percentage of errors. The best result was 3.802% general tag errors and 6.446% specific tag errors. TPOST was able to determine the specific tags of 68 out of 76 ambiguous words (89.47% correct) and 30 out of 50 unknown words (60% correct). Furthermore, 71 of the 76 ambiguous words were given correct general tags (93.42% correct), while 38 of the 50 unknowns were likewise given correct general POS tags (76% correct).
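The reported percentages follow directly from the raw counts; as a quick arithmetic check:

```python
# Disambiguation accuracy from the counts reported above.
specific_ambiguous = 68 / 76   # specific tags, ambiguous words -> 89.47%
specific_unknown   = 30 / 50   # specific tags, unknown words   -> 60%
general_ambiguous  = 71 / 76   # general tags, ambiguous words  -> 93.42%
general_unknown    = 38 / 50   # general tags, unknown words    -> 76%
```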
TPOST was also tested on literary works for children. Using only the original list of predefined words and correct word features, the tagger achieved 16.5% general and 22.5% specific POS tagging errors. The text was then divided into two parts: one was used as an additional training set of 68 sentences, and the other as a testing set of 34 sentences. Adding the former to the Philippians training set resulted in 13.75% general and 18.5% specific POS tagging errors. Several variations of the algorithm were made, but the best one achieved only 13.1% general and 17.7% specific POS tagging errors. The errors were attributed to the lack of training data and sentence templates. Further tests were done on different domains, including a book from the Old Testament and news articles, to verify the results on a different domain. The results turned out to be similar.
7 Strengths of the Algorithm
This algorithm uses a small set of lexical resources to tag text written in Tagalog. When determining specific POS tags, the initial algorithm correctly tags 89% of ambiguous words and 60% of unknown words. TPOST knows only 225 rootwords and uses a very simple stemmer. In spite of these limited resources, TPOST is still able to tag a significant number of words correctly.
Another feature of this algorithm is that it can always give the POS tag of a word even if the word or its sentence structure was never encountered during training. Its concept of collecting the features of words, instead of the words themselves, makes it independent of the words fed during training. Furthermore, the generated rules template used in the tagging process contains minimal data because only the word features are collected. Feature extraction is more robust because these features are the basic foundations of words and hence seldom change, compared to the words themselves. In addition, after the training process, the generated rules table contains word features and their tags, which can itself serve as a simple tagger using simple string matching.
8 Limitations of the Algorithm
The implemented algorithm also has several limitations, including:
1. language ambiguity,
2. stemming, and
3. scoring.
In language ambiguity, some words cannot be tagged accurately because of the ambiguous nature of Tagalog itself. Some words have the same contextual predicates, or surrounding words, but different tags. For instance, consider the two sentences below:

Sentence A: Mabait ka sa akin.
Sentence B: Magalit ka sa akin.

The words mabait and magalit have two different tags. The first is an ADJECTIVE, while the other is a VERB, despite both having the same surrounding words "ka sa akin". The algorithm cannot resolve this: the feature of both words is the same, namely the prefix ma-. The only resolution is to add both words to the list of predefined words in order to distinguish one from the other.
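This ambiguity can be reproduced with any extractor that sees only affixes and context: both words reduce to exactly the same evidence, so no template can separate them. A tiny demonstration (the helper and its feature encoding are ours, not the paper's):

```python
# Both words yield the same prefix feature and the same right context,
# leaving the tagger no basis for assigning different tags.

def features_and_context(word, following):
    prefix = "ma" if word.startswith("ma") else None
    return (prefix, tuple(following))

a = features_and_context("mabait",  ["ka", "sa", "akin"])   # ADJECTIVE
b = features_and_context("magalit", ["ka", "sa", "akin"])   # VERB
# a == b: identical evidence for two different true tags.
```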
When it comes to stemming, there are many words that cannot be stemmed using simple string matching. A comprehensive lexicon and a morphological analyzer are really needed to get the correct features. For instance, if the words tinapay and tinapon are not in the list of predefined words, TPOST recognizes both words as having the infix -in-. In fact, tinapay is a rootword, while tapon is the root of tinapon with the infix -in-.
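The over-stemming can be shown with a naive infix matcher of the kind described: it cannot tell a true -in- infix from a root that merely contains the letters "in" (the helper below is our own sketch, not TPOST's stemmer).

```python
# Naive -in- infix stripping by pure string matching.

def strip_in_infix(word):
    """If 'in' follows the first letter, treat it as an infix and drop it."""
    if len(word) > 3 and word[1:3] == "in":
        return word[0] + word[3:]       # drop the presumed infix
    return word

strip_in_infix("tinapon")   # -> "tapon"  (correct: the root is "tapon")
strip_in_infix("tinapay")   # -> "tapay"  (over-stemmed: "tinapay" is a root)
```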
Another limitation is the scoring. Several errors found during testing are due to the lack of parameters, and this occurs whether we resolve ambiguities before unknowns or vice versa. In our initial algorithm, ambiguities are resolved first so that the unknowns have more resources to help find a better tag. If all the surrounding words are unknown or ambiguous, TPOST relies on the features of the word itself and no longer checks the context. Sometimes the ambiguous words are not resolved correctly, producing cascading errors. Resolving the unknowns first would not lead to a better solution either, since it assumes that each of the candidate tags for the ambiguous word is correct; with too many candidate tags, the chance of the system selecting a wrong tag is greater. The only solution is to reduce the number of unknowns by increasing the predefined word list or by adding more sentences to the training set.
9 Conclusion

TPOST was able to show that even with limited resources, automatic tagging of Filipino text is possible. It has shown that using word features instead of the words themselves is a promising area to look into. As (Brill, 1998) stated, "words are constantly being invented as well as falling in and out of favor; no matter how complete our word list is, we will still encounter words not on the list, as well as novel usages." Currently, comprehensive lexicons for English consist of millions of words, and yet they could never be complete as new words keep being added. We need to find a method where generalization is captured. The introduction of manually tagged corpora such as the Brown Corpus and the Penn TreeBank (Brill, 1998) gave rise to many different tagging techniques. Research focusing on Tagalog, however, has been scarce due to the lack of resources. We need to create more resources so that further research can be done. This initial tagger could help generate more tagged documents to be used in other research on machine translation and other areas of computational linguistics.

Acknowledgement

Utmost respect and recognition are given to the following professionals, for carefully and manually tagging the training and testing corpus.

Dr. Josephine Mangahis
Faculty, Departamento ng mga Wika sa Pilipinas, De La Salle University

Ms. Adalia Cruz
Department Coordinator, Filipino Department, Grace Christian High School

References

Bonus, E. 2003. A Stemming Algorithm for Tagalog Words. MS Thesis, De La Salle University-Manila.
BIBINT 1998. A Tagalog Translation of the New Testament, Bible International Version. [online]. http://bible.gospelcom.net.

Brill, E. & Wu, J. 1998. Classifier Combination for Improved Lexical Disambiguation. Proceedings of COLING-ACL'98, pp. 191-195.

Brill, E. 2000. Part of Speech Tagging. Handbook of Natural Language Processing, pp. 403-414. Microsoft Research, Redmond, Washington.

Castillo, A. 2001. Hiyas ng Lahi (Wika at Panitikan). Vibal Publishing House Inc., Manila, Philippines.

Ma, Q., Murata, M., Uchimoto, K., & Isahara, H. 2000. Hybrid Neuro and Rule-based Part of Speech Taggers. Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pp. 509-515.

Marcus, M., Santorini, B., & Marcinkiewicz, M. A. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, Volume 19, No. 2, pp. 313-330.

Merialdo, B. 1994. Tagging English Text with a Probabilistic Model. Computational Linguistics, Volume 20, No. 2.

Samuelsson, C. & Voutilainen, A. 1997. Comparing a Linguistic and a Stochastic Tagger. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, ACL, Madrid.