TPOST: A Template-Based, n-gram Part-Of-Speech Tagger for Tagalog

Charibeth K. Cheng
Vlamir S. Rabo
College of Computer Studies
De La Salle University - Manila

Keywords: part-of-speech tagger, linguistic resources, natural language processing

TPOST is a template-based n-gram Part-Of-Speech (POS) tagger for Tagalog, designed to use few lexical resources. The key to the algorithm is its use of word features, which consist of (1) predefined words, (2) affixes, and (3) other word characteristics and symbols such as capitalization and hyphens. The predefined word list contains only 225 basic words that are commonly used in constructing sentences. These words are in their base form and are closed-class words, whose meanings do not change as the language evolves. The affixes include pre-determined prefixes, infixes, and suffixes. These word features are used in stemming, tagging, and disambiguation. TPOST was trained on 1,983 words with 450 distinct features, taken from the first three chapters of the Book of Philippians. The tagset includes 59 tags classified under 10 major POS tags. On another text from the same domain, with 539 words and 221 distinct word features, TPOST achieved errors below 8% and 11% for general and specific POS tags, respectively. It was also tested on a corpus from the domain of children's literature, consisting of 1,093 words with 397 distinct word features; this test resulted in errors below 17% and 23% for general and specific tags, respectively. Different variations of the algorithm were tried to reduce these errors and to make TPOST a good foundation for further research in POS tagging.

1 Introduction

One of the reasons most successful research in POS tagging has been trained and tested on English is the availability of lexical resources that are useful in determining the correct POS of a word.
Such resources include lexical dictionaries, tagged corpora, stemmers, morphological analyzers, and compilations of tags. Several languages lack these resources, which makes it difficult to test the accuracy of existing algorithms. In the case of Tagalog, there is no comprehensive lexicon; neither does a tagged corpus exist. These resources are necessary to train and/or test whether existing algorithms apply to Tagalog. The Baum-Welch algorithm tags text without lexical dictionaries or a tagged corpus, but a study showed that even a small amount of tagged corpus results in great improvements in its accuracy (Merialdo, 1994). In languages such as English, features such as suffixes and capitalization are useful information in determining the POS of a word. Tagalog has the same features, as well as prefixes, infixes, reduplication, and other features that could be used as markers of a word's POS tag. However, there are differences between the language features of English and Tagalog. For instance, subject-verb agreement in English is useful in predicting the POS tag of a word by looking at the word window to its left. But in Tagalog, the subject and predicate can be interchanged, as shown in Figure 1.1, making it difficult to predict the POS tag of a word.

  English: She is a beautiful girl.
  Tagalog: Siya ay isang magandang babae.
           Isa siyang magandang babae.
           Isa siyang babaeng maganda.
           Siya ay isang babaeng maganda.

Figure 1.1 No Subject-Verb Agreement in Tagalog

Another issue is the tagset. The Penn TreeBank tagset is not completely applicable to all languages, including Tagalog. Figure 1.2 shows that the word "Si" in Tagalog is used as a determiner for a proper noun, specifically the name of a person. English, however, does not have a determiner to signal proper nouns.
Furthermore, English uses "-'s" to denote possession, while Tagalog uses the word "ni" and interchanges the owner and the subject. These signal words or determiners must be included in the Tagalog tagset.

  English: John is the friend of Jack's sister.
  Tagalog: Si John ay ang kaibigan ng babaeng kapatid ni Jack.

Figure 1.2 Use of Determiners in Tagalog

This research explores a way of adapting existing approaches in POS tagging to Tagalog using a small set of lexical resources.

2 Algorithm Parameters

This research focuses on using the features of Tagalog words, a small list of words, the position of a word in the sentence, and a list of known affixes as the basis for determining the tag of a word.

2.1 Word features

The following word features are extracted from a word and are later used in determining its POS tag:

• Capitalization
• Affixes, including prefixes, infixes, suffixes, duplicates, and their combinations. The list of affixes is based on (Bonus, 2003):
  o Prefix: Natulog (slept) = "Na-" + "Tulog";
  o Infix: Kumain (ate) = "-Um-" + "Kain";
  o Suffix: Tulugan (a place to sleep) = "Tulog" + "-An";
  o Duplication: Natutulog (sleeping) = "Tu" (repeated) + "Tulog" (base form);
  o Combination: Nagtutulug-tulugan = prefix "Nag" + repeated "Tu" + "Tulog" (base form) + suffix "An".

2.2 The lexicon

TPOST uses 225 words that are in their simplest form and are robust in use. These pre-defined words are often used to link sentences or thoughts together. Entries in the lexicon are in base or root form. The lexicon also includes stopwords that are used to determine the POS tags of succeeding or previous words. The following are examples of stopwords:

• Determiners, used as signal words to point to the word being referenced. Example: Mga, Si, Sina, Ang;
• Conjunctions, used to combine sentences or words of the same class or group.
Example: At, O, Saka;
• Linking verbs, which usually follow a noun and are followed by an adjective or verb. Example: Ay;
• And other words such as pronouns, prepositions, and numbers.

2.3 Position of the word

Aside from the language features of the target word, its position in the sentence is also important in determining its POS tag. In this research, an n-gram tagger was implemented. The tagger checks the surrounding tags and features of the words in a two-word window: the two words before and after the target word are used as added information to determine its tag.

2.4 The stemmer

Unlike most POS taggers, TPOST does not use a morphological analyzer. Instead, it determines the affixes attached to a word using simple string matching, possibly resulting in over-stemming or under-stemming. For instance, the rootword Tagalog may be processed as having the prefix Taga-. The following are the affix substrings searched for in a word:

Prefix: di, in, manga, mang, na, ni, taga, tag, i, man, pag, tig, kanda, may, pala, um, kay, ka, ma, nag, pam, pang, sing, sin, ki, nam, pan, mas, mag, mala, nangag, nanga, pa, sa, gang, ga, mam, nang, sang, nan, sim, mangag

Infix: -um-, -in-

Suffix: -g, -uhan, -uhin, -han, -hin, -ran, -rin, -uan, -uin, -an, -in, -n

3 The TPOST System

The TPOST system is divided into two parts, namely, training and tagging. Both processes start with feature extraction, which identifies the features of each word; these features are stored in a rules table during the training process and are then used during the tagging phase. The feature extractor consists of simple string manipulations and stopword look-ups. Note that the stopword list contains only 225 words in their base form, so over-stemming and under-stemming may occur and affect both the training and tagging processes. The main module in the training phase is the generated rules table, which consists of three tables used as templates during the tagging phase.
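The feature extraction just described can be sketched as follows. This is a minimal illustration, not TPOST's actual implementation: the function name extract_features, the feature labels, and the shortened predefined-word and affix lists are assumptions, and the full lists in the paper are much longer.

```python
# Illustrative sketch of feature extraction by simple string matching:
# predefined-word lookup, capitalization, and greedy affix detection.
# The word lists below are small subsets of the lists given in the text.

PREDEFINED = {"ang", "mga", "si", "sina", "ay", "at", "o", "saka"}
PREFIXES = ["nangag", "mangag", "manga", "mang", "taga", "nag", "na", "um", "ma", "pag", "ka", "i"]
INFIXES = ["um", "in"]
SUFFIXES = ["uhan", "uhin", "han", "hin", "ran", "rin", "an", "in", "g", "n"]

def extract_features(word):
    """Return the word features found by simple string matching."""
    features = []
    lower = word.lower()
    if word[:1].isupper():
        features.append("capitalized")
    if lower in PREDEFINED:
        features.append("predefined:" + lower)
        return features          # predefined words need no affix analysis
    for p in PREFIXES:           # longer prefixes are tried first
        if lower.startswith(p) and len(lower) > len(p) + 1:
            features.append("prefix:" + p)
            lower = lower[len(p):]
            break
    for i in INFIXES:            # an infix appears after the first consonant
        if len(lower) > 3 and lower[1:].startswith(i):
            features.append("infix:" + i)
            break
    for s in SUFFIXES:
        if lower.endswith(s) and len(lower) > len(s) + 1:
            features.append("suffix:" + s)
            break
    return features
```

Because matching is purely string-based, the sketch reproduces the over-stemming noted above: extract_features("Tagalog") reports the prefix taga-, even though Tagalog is a rootword.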
The three tables are the Word Feature table, the Tag table, and the Sentence Template table. During training, the following steps are performed:

1. The TPOST Processor accepts two files, namely, (1) the document to be tagged, and (2) the corresponding tags of the words in the document.

2. The first input file goes through the Feature Extractor module, where each word is checked to see (1) whether it is a predefined word, (2) whether it has affixes, and (3) whether it starts with an uppercase letter. The extractor then gets the POS tag of the word from the second input file and stores the POS tag together with the extracted word features in a structure as shown below:

   Word1 -> Word -> TagCode -> Features
   Ex. mabait -> mabait -> JJD -> ~ma

3. The Rule Checker and Generator then populates the list of generated rules. It loads one sentence at a time and stores the structure of the sentence in the list if it has not been encountered before; otherwise, it increments the heuristic value of that sentence structure. Three kinds of rules may be generated and checked:

   1) Word_Feature Rule: <Feature_ID>,<Word_Feature>
      kumakanta: W30,@um$1
   2) Tag-Name Rule: <Tag_ID>,<Tag_Name>,(<Feature_ID><Cnt>)+,(<Sentence_ID>)+
      kumakanta: T15,VBTR,W30,c1,S1
   3) Sentence-Template Rule: <Sentence_ID>,(<Tag_ID>)+,<Cnt>
      Kumakanta si John.: S5,TST15T20T14T10TE,1

4. The output contains the database of template rules generated by the system. It is used by the tagging process in determining the tag of a word, together with information about the surrounding words. The word structure of a rule is shown below:
   a. the target word's feature list and tag;
   b. the previous words' feature lists and tags;
   c. the succeeding words' feature lists and tags;
   d. the heuristic value of this structure.

The key to the training process is its concept of storing only word features, instead of a huge database of rootwords. Word features group words with similar structures, making the tagger more robust to novel words.
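The sentence-template mechanism in step 3, where a repeated POS structure increments a heuristic count, can be sketched as a simple in-memory table. The dictionary layout and helper name are illustrative assumptions, not TPOST's actual storage format; the S-style identifiers mimic the examples in the rules above.

```python
# Illustrative sketch of the Sentence Template table: a repeated tag
# sequence gets its heuristic count incremented instead of a new entry.
sentence_templates = {}  # tag-ID sequence -> [sentence ID, heuristic count]

def add_sentence_template(tag_sequence):
    """Store a sentence template, or bump its heuristic count if this
    POS sentence structure was already encountered (training step 3)."""
    key = "".join(tag_sequence)
    if key in sentence_templates:
        sentence_templates[key][1] += 1
    else:
        sentence_templates[key] = ["S%d" % (len(sentence_templates) + 1), 1]
    return sentence_templates[key]

# "Kumakanta si John." -> tag sequence TS T15 T20 T14 T10 TE
add_sentence_template(["TS", "T15", "T20", "T14", "T10", "TE"])
# Same sentence structure seen again: heuristic count becomes 2.
print(add_sentence_template(["TS", "T15", "T20", "T14", "T10", "TE"]))  # ['S1', 2]
```

Because only the tag sequence is stored as the key, any training sentence with the same POS structure maps to the same template, which is what lets the table stay small.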
Also, instead of storing words with their possible tags, word features are categorized under the 59 tags; hence the number of table elements does not grow very large. Lastly, instead of storing the original training sentences, only sentence templates are stored. These sentence templates contain sequences of POS tags, generalizing over sentences that have the same POS structure. During the actual tagging process, a Rule-based Driven Tagger (RDT) checks whether the features of the word being tagged exist in the Word Feature table. It does so by retrieving all the tags associated with each word feature. If more than one tag is paired with the word, the word is considered ambiguous; words with no tags are called unknowns. The main responsibility of the RDT is to resolve ambiguous words and unknowns. For ambiguous words, the RDT retrieves all sentence templates having the same contextual predicates and tags as the word being tagged, and computes which among the candidate tags should be selected. For unknowns, the RDT searches all tags having the same contextual predicates, while comparing word features to see which tag is closest to the unknown word.

4 The Training Set

The training set is taken from the Bible, from the Book of Philippians (BIBINT, 1998). This text was chosen because the Bible was translated using a consistent structure, so patterns could be learned by the system. Philippians has four chapters; the first three were used as the training set, while the last was used as the testing set. This 1:3 testing-to-training ratio follows (Brill, 1995) and (Ma, 2000). The training set contains 107 sentences with 1,983 words, of which 533 are distinct, and 450 distinct word features. The training set was tagged using 59 POS tags.
These tags are classified into 10 general POS tags, namely, verb, noun, adjective, adverb, preposition, determiner, conjunction, pronoun, number, and punctuation mark.

5 The Testing Set

The testing set was taken from the last chapter of the Book of Philippians. It contains 605 tokens (words and punctuation marks) in 34 sentences, composed of 539 words, of which 223 are distinct, with 221 distinct word features.

6 Summary of Results

The system was first tested using the last chapter of the Book of Philippians. Parameter values and errors were analyzed. It was found that 25% of the errors were due to incorrect stemming. TPOST used an existing stemmer (Bonus, 2003), since stemming is not part of this research; given the high number of stemming errors, a more sophisticated stemmer is necessary. Tests on different window sizes showed that the best word window was two-previous and two-next. Another test added predefined words, which further decreased the percentage of errors. The best result was 3.802% for general tag errors and 6.446% for specific tag errors. TPOST was able to determine the specific tags of 68 out of 76 ambiguous words (89.47% correct) and 30 out of 50 unknown words (60% correct). Furthermore, 71 of the 76 ambiguous words were given correct general tags (93.42% correct), while 38 of the 50 unknowns were likewise given correct general POS tags (76% correct). TPOST was also tested on literary works for children. Using only the original list of predefined words and correct word features, the tagger achieved 16.5% general and 22.5% specific POS tagging errors. The text was then divided into two parts: one was used as an additional training set of 68 sentences, and the other as a testing set of 34 sentences. Adding the new training set to the Philippians training set resulted in 13.75% general and 18.5% specific POS tagging errors.
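As a quick consistency check, the accuracy percentages reported above follow directly from the stated counts:

```python
# Sanity check of the reported accuracies; counts are taken from the text.
ambiguous_specific = 68 / 76   # specific tags, ambiguous words
ambiguous_general  = 71 / 76   # general tags, ambiguous words
unknown_specific   = 30 / 50   # specific tags, unknown words
unknown_general    = 38 / 50   # general tags, unknown words

print(round(100 * ambiguous_specific, 2))  # 89.47
print(round(100 * ambiguous_general, 2))   # 93.42
print(round(100 * unknown_specific, 2))    # 60.0
print(round(100 * unknown_general, 2))     # 76.0
```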
Several further variations of the algorithm were made, but the best achieved only 13.1% general and 17.7% specific POS tagging errors. The errors were attributed to the lack of training data and sentence templates. Further tests were done on other domains, including a book from the Old Testament and news articles, to verify the results on a different domain. The results turned out to be similar.

7 Strengths of the Algorithm

This algorithm uses a small set of lexical resources to tag text written in Tagalog. When determining general POS tags, the initial algorithm is able to correctly tag 89% of ambiguous words and 60% of unknown words. TPOST knows only 225 rootwords and uses a very simple stemmer; in spite of these limited resources, it is still able to tag a significant number of words correctly. Another feature of this algorithm is that it can always give a POS tag for a word, even if the word or its sentence structure was never encountered during training. Its concept of collecting the features of words instead of the words themselves makes it independent of the words fed during training. Furthermore, the generated rules template used in the tagging process contains minimal data, because only word features are collected. Feature extraction is more robust because these features are the basic building blocks of words and thus seldom change, compared to the words themselves. In addition, after the training process, the generated rules table contains word features and their tags, which can serve as a simple tagger based on simple string matching.

8 Limitations of the Algorithm

The implemented algorithm also has several limitations, involving (1) language ambiguity, (2) stemming, and (3) scoring. In language ambiguity, some words cannot be tagged accurately because of the ambiguous nature of Tagalog itself. Some words have the same contextual predicates, or surrounding words, but different tags.
For instance, consider the two sentences below:

Sentence A: Mabait ka sa akin.
Sentence B: Magalit ka sa akin.

The words mabait and magalit have two different tags: the first is an ADJECTIVE, while the other is a VERB, despite having the same surrounding words "ka sa akin". The algorithm cannot resolve this, because both words have the same feature, the prefix ma-. The only resolution is to add both words to the list of predefined words in order to distinguish one from the other.

When it comes to stemming, there are many words that cannot be stemmed using simple string matching. A comprehensive lexicon and a morphological analyzer are really needed to obtain the correct features. For instance, if the words tinapay and tinapon are not in the list of predefined words, TPOST recognizes both as having the infix -in-. In fact, tinapay is a rootword, while tapon is the root of tinapon with the infix -in-.

Another limitation is the scoring. Several errors found during testing are due to the lack of parameters. This occurs when we solve ambiguities before unknowns, or vice versa. In our initial algorithm, ambiguities are solved first so that the unknowns have more resources to help find a better tag. If all the surrounding words are unknown or ambiguous, TPOST relies on the features of the word itself and no longer checks the context. Sometimes the ambiguous words are not resolved correctly, producing cascading errors. Solving the unknowns first does not lead to a better solution either, since it assumes that each candidate tag for an ambiguous word is correct; with too many candidate tags, the chance of the system selecting a wrong tag is greater. The only solution is to reduce the number of unknowns by increasing the predefined word list, or by adding more sentences to the training set.

9 Conclusion

TPOST has shown that even with limited resources, automatic tagging of Filipino text is possible. It has shown that using word features instead of the words themselves is a promising direction. As (Brill, 1998) stated, "words are constantly being invented as well as falling in and out of favor; no matter how complete our word list is, we will still encounter words not on the list, as well as novel usages." Comprehensive lexicons for English currently contain millions of words, yet they can never be complete, as new words keep being added. We need methods that capture generalization. The introduction of manually tagged corpora such as the Brown Corpus and the Penn TreeBank (Brill, 1998) gave rise to many different tagging techniques. Research focusing on Tagalog, in contrast, has been scarce, due to the lack of resources. We need to create more resources to enable further research. This initial tagger can help generate more tagged documents for use in research on machine translation and other areas of computational linguistics.

Acknowledgement

Utmost respect and recognition are given to the following professionals for carefully and manually tagging the training and testing corpus:

Dr. Josephine Mangahis, Faculty, Departamento ng mga Wika sa Pilipinas, De La Salle University
Ms. Adalia Cruz, Department Coordinator, Filipino Department, Grace Christian High School

References

BIBINT. 1998. A Tagalog Translation of the New Testament, Bible International Version. [online]. http://bible.gospelcom.net.

Bonus, E. 2003. A Stemming Algorithm for Tagalog Words. MS Thesis, De La Salle University-Manila.

Brill, E. & Wu, J. 1998. Classifier Combination for Improved Lexical Disambiguation. Proceedings of COLING-ACL'98, pp. 191-195.

Brill, E. 2000. Part of Speech Tagging. Handbook of Natural Language Processing, pp. 403-414. Microsoft Research, Redmond, Washington.

Castillo, A. 2001. Hiyas ng Lahi (Wika at Panitikan). Vibal Publishing House Inc., Manila, Philippines.

Ma, Q., Morata, M., Uchimoto, K., & Hitoshi, I. 2000. Hybrid Neuro and Rule-Based Part of Speech Taggers. Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pp. 509-515.

Marcus, M., Santorini, B., & Marcinkiewicz, M. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, Volume 19, No. 2, pp. 313-330.

Merialdo, B. 1994. Tagging English Text with a Probabilistic Model. Computational Linguistics, Volume 20, No. 2.

Samuelsson, C. & Voutilainen, A. 1997. Comparing a Linguistic and a Stochastic Tagger. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid.