Presented at the 8th Science and Technology Congress
De La Salle University-Manila, 2401 Taft Avenue, 1004 Manila, Philippines
March 8, 2006

AN INTEGRATED APPROACH TO TAGALOG PART-OF-SPEECH TAGGING

Rachel Edita O. ROXAS, Bryan Anthony S. HONG, Chris Ian LIM, Peter TAN
Software Technology Department
College of Computer Studies
De La Salle University-Manila
2401 Taft Avenue, Manila, Philippines 1004
Tel: (63) (2) (536-0276/7)
Fax: (63) (2) (536-0278)
roxasr@dlsu.edu.ph, bashx5@yahoo.com, godlovesian8@yahoo.com, txscool2000@163.com

Abstract: IPOST, a part-of-speech tagger that integrates the rule-based and example-based approaches, is presented. Documents to be tagged are first stemmed using a Tagalog stemmer (Bonus, 2003) that relies on an affix rule repository. Tagging uses word features that are automatically extracted from a manually tagged Tagalog corpus (Rabo, 2004), together with an electronic dictionary of Tagalog words and their parts of speech. Words with multiple tags are disambiguated by a parser that was automatically produced by a parser generator (See & Teo, 2004) from a specified Tagalog grammar (Ang et al., 2002). The system was tested on five Tagalog corpora with 5892 words, and the results show a tagging accuracy of 55% to 73% across the corpora. About half of the errors can be attributed to ambiguous and unknown words. This shows that ambiguity resolution should be improved and that the dictionary should be populated further to improve the performance of the system. Other ambiguity resolution algorithms could also be considered. For the rule-based approach, an evaluation of the other language resources, such as the Tagalog corpora and the Tagalog grammar specification, is further recommended to improve the accuracy of the system. For the example-based approach, the efficiency of the automatically extracted word features should also be evaluated.
Key Words: Part-of-Speech Tagging

1. INTRODUCTION

Part-of-speech (POS) tagging refers to the assignment of parts of speech to the words in a given sentence. It is used as a component for parsing, for recognition in message extraction systems, for text-based information retrieval, for speech recognition, and for generating intonation in speech production systems.

There are two major approaches to POS tagging: supervised and unsupervised. In the supervised approach, a training phase uses a pre-tagged corpus to acquire initial knowledge. The unsupervised approach, on the other hand, requires initial knowledge from linguists to perform POS tagging. Figure 1 shows the major classification of these systems.

Figure 1: Major Classification of POS Taggers

In supervised learning, two main approaches have been considered: rule-based and probabilistic (stochastic). In a rule-based tagger, a set of tags is assigned to each word based on a lexicon and morphological analysis (Abney, 1996). Pattern-action rules are then used to eliminate candidate tags for the word, for example, "the current word is not a verb if the preceding word is a determiner". In the probabilistic approach, the context of the sentence is considered: the relation of one tag to another is determined from computed probability values of the possible tag sequences. These approaches have been applied to the tagging of documents in different languages and have been shown to produce acceptable results.

For Tagalog, the Tagalog POS Tagger or TPOST (Rabo, 2004) is a POS tagger that was developed by training the system on a manually tagged corpus of Tagalog documents. It uses a template-based n-gram approach to POS tagging, and is designed for languages with few or no comprehensive lexical resources.
The key to the algorithm is to use carefully chosen basic words and fundamental features of word construction, both in tagging itself and in disambiguating and resolving the unknown words surrounding them. About 50% of the tagging errors are caused by ambiguous words, that is, words with many possible tags. Different window sizes did not reduce the errors due to these ambiguous words. About half of these errors are resolved by re-training the system using data from the same domain. Unfortunately, a more comprehensive manually tagged Tagalog corpus is not available in any domain at this time, so TPOST cannot be trained and tested on more data from the same domain. A method is therefore proposed here to aid ambiguity resolution by using existing language resources and tools to improve the accuracy of the tagger.

2. METHODOLOGY

IPOST has three main phases: pre-processing, initial tagging, and ambiguity resolution. The pre-processing module stems a text document using a Tagalog stemmer (Bonus, 2003) that relies on an affix rule repository. Initial tagging uses word features that are automatically extracted from a manually tagged Tagalog corpus (Rabo, 2004), together with an electronic dictionary of Tagalog words and their parts of speech (Lat et al., 2005). If two or more specific tags assigned to a word fall under the same general tag, these specific tags are removed and the general tag is assigned to the word instead. Other words with multiple tags are disambiguated by a parser that was automatically produced by a parser generator (See & Teo, 2004) from a specified Tagalog grammar, either that of Twirl (Ang et al., 2005) or the more comprehensive grammar of Ang et al. (2002). The system was tested on five Tagalog corpora with 5892 words.
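The general-tag step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the IPOST implementation: the function name `collapse_to_general` and the `GENERAL_OF` mapping are hypothetical, and only the relation of PRSP to PR is taken from the paper's own example.

```python
from typing import Dict, List


def collapse_to_general(tags: List[str], general_of: Dict[str, str]) -> List[str]:
    """Replace two or more specific tags that share a general tag with that general tag."""
    # Count how many candidate tags fall under each general tag.
    counts: Dict[str, int] = {}
    for tag in tags:
        general = general_of.get(tag)
        if general is not None:
            counts[general] = counts.get(general, 0) + 1

    result: List[str] = []
    for tag in tags:
        general = general_of.get(tag)
        if general is not None and counts[general] >= 2:
            # Sibling specific tags are dropped; the general tag is kept once.
            if general not in result:
                result.append(general)
        elif tag not in result:
            # A specific tag with no sibling among the candidates is kept as is.
            result.append(tag)
    return result


# Hypothetical tag hierarchy for illustration; only PRSP -> PR comes from the paper.
GENERAL_OF = {"PRSP": "PR", "PRS": "PR", "NNPA": "NN", "NNC": "NN", "VBW": "VB"}

print(collapse_to_general(["PRS", "PRSP"], GENERAL_OF))
print(collapse_to_general(["NNPA", "PRS", "PRSP", "VBW"], GENERAL_OF))
```

In the second call the two PR-family tags collapse into PR while NNPA and VBW remain, so the word is still ambiguous and would be passed on to the parser-based disambiguation.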
Tagging scores are computed and expressed as a percentage as follows:

score = (correct specific tag weight + correct general tag weight + ambiguous word weight) / number of words in the document

Each correct specific tag is assigned a weight of 1, while each correct general tag is given a weight of 0.5. In the scoring of ambiguous words, if the correct manual tag for an ambiguous word is one of the tags generated by the system, the score given is 1/(number of tags generated by the system for the word). If instead the correct manual tag is a specific tag of a general tag generated by the system, the score given is 0.5/(number of tags generated by the system for the word). For example, if the system tag is NNPA/PRSP/VBW and the manual tag is PRSP, the ambiguous word weight is 1/3. If the system tag is NNPA/PR/VBW and the manual tag is PRSP, the ambiguous word weight is 0.5/3.

3. RESULTS AND DISCUSSION

The system was tested on five Tagalog corpora with 5892 words, which were also used by Rabo (2004) in the testing of TPOST. Test results (see Table 1) show either the same accuracy (for corpora 2, 3 and 4) or an improvement (for corpora 1 and 5) in the tagging scores when the Tagalog grammar of Twirl (Ang et al., 2005) was replaced by the more comprehensive grammar of Ang et al. (2002).

Table 1. IPOST Results on Test Corpora Using the Grammar of Twirl (Ang et al., 2005) and the Grammar of Ang et al. (2002).

Corpus                    Number of  Grammar    Correctly     General Tag  Ambiguous  Unknown  Mistagged  Score
                          words      of Parser  Tagged words  Instead of   words      words    words
                                                              Specific
1. Children’s Story Book  1269       Twirl      587           100          384        139      59         60.42%
                                     Ang        608           114          339        117      91         61.50%
2. Business Text          133        Twirl      53            12           43         21       4          54.82%
                                     Ang        53            12           43         21       4          54.82%
3. Essays                 261        Twirl      101           16           92         43       9          54.89%
                                     Ang        101           16           92         43       9          54.89%
4. Entertainment          205        Twirl      96            16           61         25       7          60.87%
                                     Ang        96            16           61         25       7          60.87%
5. Philippians 1-3        2171       Twirl      1239          44           888        0        0          73.25%
                                     Ang        1244          45           878        0        4          73.31%

The improvements in tagging scores are due to an increase in correctly tagged words (using specific tags) and a reduction in ambiguous and unknown words. Overall scores improved only slightly (for corpora 1 and 5) because attempts to tag more ambiguous and unknown words also increased mistagging and the assignment of general tags rather than specific tags. It is also noteworthy that the larger corpora (corpora 1 and 5) showed better results, while the smaller corpora (corpora 2, 3 and 4) showed no improvement.

4. CONCLUSIONS

This study has shown that the use of a more comprehensive grammar for ambiguity resolution improved the overall performance of the tagger. Thus, a study on the development of the Tagalog grammar is recommended to further improve the accuracy of the tagger. Despite the attempt to address ambiguous and unknown words, about half of the errors can still be attributed to these words. This shows that ambiguity resolution must be addressed further to greatly improve the accuracy of this tagging approach, possibly by considering other ambiguity resolution algorithms.

REFERENCES

Abney, S. (1996). Part-of-Speech Tagging and Partial Parsing. [online]. Available: http://www.vinartus.net/spa/95a.pdf. (February 2006).
Ang, M., Cagalingan, S., Chan, P., & Tan, R. (2002). FiSSAn: Filipino Sentence Syntax and Semantic Analyzer. Philippines: De La Salle University.
Ang, R. J., Bautista, N. G., Cai, Y.
R., & Tanlo, B. G. (2005). Translation With Rule-Learning. Philippines: Undergraduate Thesis, De La Salle University Manila.
Bonus, E. (2003). A Stemming Algorithm for Tagalog Words. Philippines: De La Salle University.
Lat, J., Ng, S., Sze, K., & Yu, G. (2005). AEFLEX. Philippines: De La Salle University.
Rabo, V. (2004). TPOST: A Template-based, n-gram Part-of-Speech Tagger for Tagalog. Philippines: De La Salle University.
See, S., & Teo, M. (2004). Parser Generator. Philippines: De La Salle University.
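As a supplementary illustration, the tagging score defined in Section 2 can be sketched in Python. This is a hedged sketch, not the authors' scoring code: the names `word_score`, `tagging_score`, `GENERAL_OF` and `WORDS` are hypothetical, an unknown word is assumed to be represented by an empty list of system tags, and a word with a single system tag is scored by the same 1/n and 0.5/n rule, which reduces to the weights 1 and 0.5 stated in the text.

```python
from typing import Dict, List, Tuple


def word_score(system_tags: List[str], manual_tag: str, general_of: Dict[str, str]) -> float:
    """Weight contributed by one word, given the system's tags and the manual tag."""
    n = len(system_tags)
    if n == 0:
        return 0.0  # assumption: an unknown word has no system tag and scores 0
    if manual_tag in system_tags:
        return 1.0 / n  # correct specific tag, shared among n candidate tags
    if general_of.get(manual_tag) in system_tags:
        return 0.5 / n  # only the general tag of the manual tag was produced
    return 0.0  # mistagged word


def tagging_score(words: List[Tuple[List[str], str]], general_of: Dict[str, str]) -> float:
    """Tagging score of a document, expressed as a percentage."""
    if not words:
        return 0.0
    total = sum(word_score(tags, manual, general_of) for tags, manual in words)
    return 100.0 * total / len(words)


# Only the PRSP -> PR relation is taken from the paper's example.
GENERAL_OF = {"PRSP": "PR"}

# Four illustrative words: the two ambiguous cases from Section 2, one word with a
# correct specific tag, and one word with a correct general tag.
WORDS = [
    (["NNPA", "PRSP", "VBW"], "PRSP"),  # weight 1/3
    (["NNPA", "PR", "VBW"], "PRSP"),    # weight 0.5/3
    (["PRSP"], "PRSP"),                 # weight 1
    (["PR"], "PRSP"),                   # weight 0.5
]

print(f"{tagging_score(WORDS, GENERAL_OF):.2f}")  # 50.00
```

The four weights sum to 2, so the score over four words is 50%.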