An Integrated Approach To Tagalog Part Of Speech Tagging

Presented at the 8th Science and Technology Congress
De La Salle University-Manila, 2401 Taft Avenue, 1004 Manila, Philippines
March 8, 2006
AN INTEGRATED APPROACH TO TAGALOG PART-OF-SPEECH TAGGING
Rachel Edita O. ROXAS
Bryan Anthony S. HONG
Chris Ian LIM
Peter TAN
Software Technology Department
College of Computer Studies
De La Salle University-Manila
2401 Taft Avenue, Manila Philippines 1004
Tel: (63) (2) (536-0276/7)
Fax: (63) (2) (536-0278)
roxasr@dlsu.edu.ph, bashx5@yahoo.com, godlovesian8@yahoo.com, txscool2000@163.com
Abstract: IPOST, a part-of-speech tagger that integrates the rule-based and the example-based
approaches, is presented. Documents for automatic tagging are first stemmed using a Tagalog
stemmer (Bonus, 2003) that relies on an affix rule repository. Tagging is done through word
features which are automatically extracted from a manually tagged Tagalog corpus (Rabo,
2004), and an electronic dictionary of Tagalog words with their parts of speech. Words with
multiple tags are disambiguated by a parser that was automatically produced by a parser
generator (See & Teo, 2004) from a specified Tagalog grammar (Ang et al., 2002). The
system was tested on five Tagalog corpora with 5892 words, and test results show a tagging
accuracy of 55% to 73% per corpus. About half of the errors can be attributed to ambiguous
and unknown words. This shows that ambiguity resolution should be improved and the
dictionary further populated to improve the performance of the system. Other ambiguity
resolution algorithms could also be considered. For the rule-based approach, it is further
recommended that the other language resources, such as the Tagalog corpora and the Tagalog
grammar specification, be evaluated in order to improve the accuracy of the system. For the
example-based approach, the efficiency of the automatically extracted word features should
also be evaluated.
Key Words: Part-of-Speech Tagging
1. INTRODUCTION
Part-of-speech (POS) tagging refers to the assignment of parts of speech to the words in a
given sentence. It is used as a component for parsing, for recognition in message extraction
systems, for text-based information retrieval, for speech recognition, and for generating
intonation in speech production systems.
There are two major approaches to POS tagging: the supervised and the unsupervised
approaches. In the supervised approach, a training phase uses a pre-tagged corpus for the
acquisition of initial knowledge. The unsupervised approach, on the other hand, requires initial
knowledge from linguists to perform POS tagging. Figure 1 shows the major classification of
these systems.
Figure 1: Major Classification of POS Taggers
In supervised learning, two main approaches have been considered: rule-based and
probabilistic/stochastic. In a rule-based tagger, a set of tags is assigned to each word based on a
lexicon and morphological analysis (Abney, 1996). Pattern-action rules are then used to eliminate
candidate tags for the word. For example, “the current word is not a verb if the preceding word is
a determiner”. In the probabilistic approach, the context of the sentence is considered such that
the relation of one tag to another is determined by using computed probability values of the
possible tag sequences.
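
The rule-based elimination step described above can be illustrated with a minimal sketch. The toy lexicon, the tag names (DET, NOUN, VERB, ADJ, UNK) and the function names are assumptions made for illustration only; they are not taken from any of the taggers cited here.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: each word maps to its set of candidate tags.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "run": {"NOUN", "VERB"},
    "was": {"VERB"},
    "long": {"ADJ", "VERB"},
}


def candidate_tags(word: str) -> Set[str]:
    """Look up candidate tags; unknown words get the placeholder tag UNK."""
    return set(LEXICON.get(word.lower(), {"UNK"}))


def not_verb_after_determiner(prev_tags: Set[str], tags: Set[str]) -> Set[str]:
    """Pattern-action rule: drop VERB if the preceding word is a determiner.

    The rule only fires when the word has other candidate tags left,
    so a word is never stripped of its last remaining tag.
    """
    if prev_tags == {"DET"} and len(tags) > 1:
        return tags - {"VERB"}
    return tags


def tag_sentence(words: List[str]) -> List[Set[str]]:
    """Assign candidate tags to each word, then apply the elimination rule."""
    result: List[Set[str]] = []
    prev: Set[str] = set()
    for word in words:
        tags = not_verb_after_determiner(prev, candidate_tags(word))
        result.append(tags)
        prev = tags
    return result


if __name__ == "__main__":
    for word, tags in zip(["the", "run", "was", "long"],
                          tag_sentence(["the", "run", "was", "long"])):
        print(word, sorted(tags))
```

In this sketch "run" loses its VERB reading because it follows a determiner, while "long" stays ambiguous because no rule applies to it; a real rule-based tagger would carry many such rules.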
These approaches have been applied to the tagging of documents in different languages and have
been shown to produce acceptable results. For Tagalog, the Tagalog POS Tagger or TPOST
(Rabo, 2004) is a POS tagger that was developed by training the system on a manually
tagged corpus of Tagalog documents. It uses a template-based n-gram approach to POS tagging,
and is designed for languages with few or no comprehensive lexical resources. The key to
the algorithm is to use carefully chosen basic words and fundamental features of word
construction, both in tagging a word itself and in disambiguating and resolving the unknown
words surrounding it. About 50% of the tagging errors are caused by ambiguous words, that is,
words with many possible tags. Different window sizes did not reduce the errors due to these
ambiguous words. About half of these errors are resolved by re-training the system using data
from the same domain. Unfortunately, a more comprehensive manually tagged Tagalog corpus in
any domain is not available at this time, so TPOST cannot be trained and tested on
more data from the same domain. A method is therefore proposed here to aid in ambiguity
resolution by using existing language resources and tools to improve the accuracy of the tagger.
2. METHODOLOGY
IPOST has three main phases: pre-processing, initial tagging and ambiguity resolution.
The pre-processing module performs stemming on a text document using a Tagalog stemmer
(Bonus, 2003) that relies on an affix rule repository. Initial tagging is done through word features
which are automatically extracted from a manually tagged Tagalog corpus (Rabo, 2004), and an
electronic dictionary of Tagalog words with their parts of speech (Lat et al., 2005). If two or
more specific tags assigned to a word fall under the same general tag, these specific tags are
removed and the general tag is assigned to the word instead. Other words with multiple tags are
disambiguated by a parser that was automatically produced by a parser generator (See & Teo,
2004). Two grammars were used for the parser: the Tagalog grammar of Twirl (Ang et al., 2005)
and a more comprehensive grammar by Ang et al. (2002).
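
The collapsing of specific tags into a general tag can be sketched as follows. This is an illustration under a stated assumption, not the IPOST implementation: the general tag is taken to be the first two characters of a specific tag (so PRSP falls under PR), which matches the PRSP/PR pair used in the scoring example of this paper but is not confirmed for the full tagset. The tag PRYY is a made-up placeholder.

```python
from collections import defaultdict
from typing import Dict, List, Set


def general_tag(tag: str) -> str:
    """ASSUMPTION: the general tag is the first two characters (PRSP -> PR)."""
    return tag[:2]


def collapse_tags(tags: Set[str]) -> Set[str]:
    """Replace two or more tags under the same general tag by that general tag.

    Tags that are alone under their general tag are kept unchanged.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for tag in tags:
        groups[general_tag(tag)].append(tag)

    collapsed: Set[str] = set()
    for general, members in groups.items():
        if len(members) >= 2:
            collapsed.add(general)      # several specific tags: keep only the general tag
        else:
            collapsed.add(members[0])   # single tag in this group: keep it as is
    return collapsed


if __name__ == "__main__":
    # PRYY is a hypothetical second specific tag under PR.
    print(sorted(collapse_tags({"PRSP", "PRYY", "VBW"})))
    print(sorted(collapse_tags({"NNPA", "PRSP", "VBW"})))
```

A word that still carries more than one tag after this step would then be passed on to the parser-based disambiguation.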
The system was tested on five Tagalog corpora with 5892 words. Tagging scores are computed
and expressed as a percentage as follows: (correct specific tag weight + correct general tag
weight + ambiguous word weight)/number of words in the document. Each correct specific tag
is assigned a weight of 1, while each correct general tag is given a weight of 0.5. For an
ambiguous word, that is, a word for which the system generates more than one tag, the score
depends on the manual tag. If the correct manual tag is one of the tags generated
by the system, the score given is 1/number of tags generated by the system for the word.
If the correct manual tag is instead a specific tag under a general tag
generated by the system, the score given is 0.5/number of tags generated by the system for the
word. For example, if the system tag is NNPA/PRSP/VBW and the manual tag is PRSP, then
1/3 is the ambiguous word weight. But if the system tag is NNPA/PR/VBW and the manual tag
is PRSP, 0.5/3 is the ambiguous word weight.
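
The scoring scheme above can be written out as a short sketch. The function names are illustrative, and the mapping from a specific tag to its general tag (first two characters, so PRSP falls under PR) is an assumption inferred from the example rather than a documented property of the tagset. Treating a single-tag word as the special case of one generated tag reproduces the weights of 1 and 0.5.

```python
from typing import List, Sequence, Tuple


def general_tag(tag: str) -> str:
    """ASSUMPTION: the general tag is the first two characters (PRSP -> PR)."""
    return tag[:2]


def word_score(system_tags: Sequence[str], manual_tag: str) -> float:
    """Score one word given the tags generated by the system and the manual tag.

    With one system tag this yields 1 (correct specific tag) or 0.5 (correct
    general tag); with several tags the weight is divided by their number.
    """
    n = len(system_tags)
    if n == 0:
        return 0.0                      # unknown word: no tag generated
    if manual_tag in system_tags:
        return 1.0 / n                  # exact tag is among the system tags
    if general_tag(manual_tag) in system_tags:
        return 0.5 / n                  # only the general tag was generated
    return 0.0                          # mistagged word


def document_score(words: List[Tuple[Sequence[str], str]]) -> float:
    """Return the tagging score of a document as a percentage."""
    if not words:
        return 0.0
    total = sum(word_score(tags, manual) for tags, manual in words)
    return 100.0 * total / len(words)


if __name__ == "__main__":
    print(word_score("NNPA/PRSP/VBW".split("/"), "PRSP"))  # 1/3 of a point
    print(word_score("NNPA/PR/VBW".split("/"), "PRSP"))    # 0.5/3 of a point
```

The two calls at the end correspond to the two worked examples in the text.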
3. RESULTS AND DISCUSSION
The system was tested on five Tagalog corpora with 5892 words, which were also used by Rabo
(2004) in the testing of TPOST. Test results (see Table 1) show either the same accuracy (for
corpora 2, 3 and 4) or an improvement (for corpora 1 and 5) in the tagging scores when the
Tagalog grammar of Twirl (Ang et al., 2005) was replaced by the more comprehensive grammar
of Ang et al. (2002).
Table 1. IPOST Results on Test Corpora Using the Grammars of Twirl (Ang et al., 2005) and Ang et al. (2002).

Corpus                     Number    Grammar    Correctly  General tag  Ambiguous  Unknown  Mistagged  Score
                           of words  of parser  tagged     instead of   words      words    words
                                                words      specific
1. Children’s Story Book       1269  Twirl            587          100        384      139         59  60.42%
                                     Ang              608          114        339      117         91  61.50%
2. Business Text                133  Twirl             53           12         43       21          4  54.82%
                                     Ang               53           12         43       21          4  54.82%
3. Essays                       261  Twirl            101           16         92       43          9  54.89%
                                     Ang              101           16         92       43          9  54.89%
4. Entertainment                205  Twirl             96           16         61       25          7  60.87%
                                     Ang               96           16         61       25          7  60.87%
5. Philippians 1-3             2171  Twirl           1239           44        888        0          0  73.25%
                                     Ang             1244           45        878        0          4  73.31%
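
The figures in Table 1 can be cross-checked against the scoring formula of Section 2. The sketch below is a reader-side sanity check, not part of IPOST: it verifies that the five word categories of each row add up to the corpus size, and it derives the total ambiguous word weight implied by each reported score. Each word total is assumed to belong to the corpus whose category counts sum to it, and the derived weights are approximate because the scores are rounded to two decimals.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    corpus: str
    grammar: str
    words: int
    correct: int      # correctly tagged with a specific tag (weight 1)
    general: int      # general tag instead of specific (weight 0.5)
    ambiguous: int
    unknown: int
    mistagged: int
    score_pct: float


# Values as read from Table 1.
ROWS: List[Row] = [
    Row("Children's Story Book", "Twirl", 1269, 587, 100, 384, 139, 59, 60.42),
    Row("Children's Story Book", "Ang", 1269, 608, 114, 339, 117, 91, 61.50),
    Row("Business Text", "Twirl", 133, 53, 12, 43, 21, 4, 54.82),
    Row("Business Text", "Ang", 133, 53, 12, 43, 21, 4, 54.82),
    Row("Essays", "Twirl", 261, 101, 16, 92, 43, 9, 54.89),
    Row("Essays", "Ang", 261, 101, 16, 92, 43, 9, 54.89),
    Row("Entertainment", "Twirl", 205, 96, 16, 61, 25, 7, 60.87),
    Row("Entertainment", "Ang", 205, 96, 16, 61, 25, 7, 60.87),
    Row("Philippians 1-3", "Twirl", 2171, 1239, 44, 888, 0, 0, 73.25),
    Row("Philippians 1-3", "Ang", 2171, 1244, 45, 878, 0, 4, 73.31),
]


def counts_consistent(row: Row) -> bool:
    """Check that the five word categories add up to the corpus size."""
    parts = row.correct + row.general + row.ambiguous + row.unknown + row.mistagged
    return parts == row.words


def implied_ambiguous_weight(row: Row) -> float:
    """Total ambiguous word weight implied by the reported score.

    Rearranges: score = (correct + 0.5 * general + ambiguous weight) / words.
    """
    return row.score_pct / 100.0 * row.words - row.correct - 0.5 * row.general


if __name__ == "__main__":
    for row in ROWS:
        print(row.corpus, row.grammar, counts_consistent(row),
              round(implied_ambiguous_weight(row), 1))
```

Dividing the implied weight by the number of ambiguous words in a row gives a rough average credit per ambiguous word, which indicates how much of the score comes from partially resolved ambiguity.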
The improvements in tagging scores are due to an increase in correctly tagged words (using
specific tags) and a reduction in ambiguous and unknown words. Overall scores showed slight
improvements only (for corpora 1 and 5), since attempts to tag more ambiguous and unknown
words resulted in more mistagged words and in more general tags being assigned in place of
specific tags. It is also noteworthy that the larger corpora (corpora 1 and 5) showed better results,
while the smaller corpora (corpora 2, 3 and 4) showed no improvement.
4. CONCLUSIONS
This study has shown that the use of a more comprehensive grammar for ambiguity resolution
resulted in an improvement in the overall performance of the tagger. Thus, a study on the
development of the Tagalog grammar is recommended to further improve the accuracy of the
tagger.
Despite the attempt to address ambiguous words and unknown words, about half of the errors
can still be attributed to these words. This shows that ambiguity resolution has to be addressed
further to greatly improve the accuracy of this tagging approach, possibly by considering other
ambiguity resolution algorithms.
REFERENCES
Abney, S. (1996). Part-of-Speech Tagging and Partial Parsing. [online]. Available:
http://www.vinartus.net/spa/95a.pdf. (February 2006).
Ang, M., Cagalingan, S., Chan, P., & Tan, R. (2002). FiSSAn: Filipino Sentence Syntax and
Semantic Analyzer. Philippines: De La Salle University.
Ang, R. J., Bautista, N. G., Cai, Y. R., & Tanlo, B. G. (2005). Translation With Rule-Learning.
Philippines: Undergraduate Thesis, De La Salle University Manila.
Bonus, E. (2003). A Stemming Algorithm for Tagalog Words. Philippines: De La Salle
University.
Lat, J., Ng, S., Sze, K., & Yu, G. (2005). AEFLEX. Philippines: De La Salle University.
Rabo, V. (2004). TPOST: A Template-based, n-gram Part-of-Speech Tagger for Tagalog.
Philippines: De La Salle University.
See, S., & Teo, M. (2004). Parser Generator. Philippines: De La Salle University.