UNIVERSITI MALAYA

ORIGINAL LITERARY WORK DECLARATION

Name of Candidate: NOOR HAFHIZAH BINTI ABD RAHIM (I.C/Passport No: 830127-08-5412)
Registration/Matric No: WGA070052
Name of Degree: Master of Computer Science
Title of Project Paper/Research Report/Dissertation/Thesis ("this Work"): A Statistical Parser To Reduce Structural Ambiguity In Malay Grammar Rules
Field of Study: Artificial Intelligence (Natural Language Processing)

I do solemnly and sincerely declare that:

(1) I am the sole author/writer of this Work;
(2) This Work is original;
(3) Any use of any work in which copyright exists was done by way of fair dealing and for permitted purposes and any excerpt or extract from, or reference to or reproduction of any copyright work has been disclosed expressly and sufficiently and the title of the Work and its authorship have been acknowledged in this Work;
(4) I do not have any actual knowledge nor do I ought reasonably to know that the making of this Work constitutes an infringement of any copyright work;
(5) I hereby assign all and every rights in the copyright to this Work to the University of Malaya ("UM"), who henceforth shall be owner of the copyright in this Work and that any reproduction or use in any form or by any means whatsoever is prohibited without the written consent of UM having been first had and obtained;
(6) I am fully aware that if in the course of making this Work I have infringed any copyright whether intentionally or otherwise, I may be subject to legal action or any other action as may be determined by UM.

Candidate's Signature                                Date: 16 February 2011

Subscribed and solemnly declared before,

Witness's Signature
Name:
Designation:
Date:

ABSTRACT

The goal of this research is to develop a statistical parser that helps reduce structural ambiguity in the Malay language. Parsing is an important phase in understanding natural language. However, parsing a sentence is a difficult task because of the various ambiguity problems that arise in natural language.
The parsing technique is one of the most important components to consider in developing any parser. The technique used in this research is top-down parsing, and the grammar chosen is a context-free grammar (CFG) for the Malay language. The CFG contains the rules for forming basic Malay sentences. The proposed Malay Statistical Parser uses probability values, computed for one hundred and forty-seven (147) grammar rules, as its guideline in selecting the best parse tree. Since no probabilities exist for Malay CFG rules, one thousand (1000) training sentences were collected from primary-school textbooks and various Malay grammar books. The grammar together with the calculated probability values is known as a Probabilistic Context-Free Grammar (PCFG). The parser is then evaluated on one hundred (100) test sentences, approved by two Malay linguists known as Munsyi Dewan. The Malay Statistical Parser then computes the highest probability value for each parsed sentence. The results show that the parser achieved 100% recall, 93.25% precision and a 96.75% f-score, and that it is able to reduce ambiguity for basic Malay sentences.

ABSTRAK

Tujuan penyelidikan ini ialah membangunkan sebuah pengurai berstatistik yang dapat membantu mengurangkan ketaksaan berstruktur dalam Bahasa Melayu. Penguraian merupakan satu fasa penting dalam memahami bahasa tabii. Walau bagaimanapun, untuk mengurai sesuatu ayat, ia merupakan satu tugas yang sukar memandangkan terdapat banyak masalah dalam ketaksaan bahasa tabii. Teknik penguraian merupakan komponen yang paling penting yang perlu dipertimbangkan dalam membangunkan sebarang pengurai. Teknik yang digunakan dalam penyelidikan ini ialah teknik penguraian atas-bawah dan tatabahasa yang dipilih ialah nahu bebas-konteks untuk Bahasa Melayu. Nahu bebas-konteks tersebut mengandungi petua-petua bagi membentuk ayat mudah Bahasa Melayu.
Pengurai Berstatistik Bahasa Melayu menggunakan nilai-nilai kebarangkalian yang dikira untuk seratus empat puluh tujuh (147) petua nahu sebagai panduan dalam memperoleh rajah pepohon yang terbaik. Memandangkan belum ada nilai kebarangkalian bagi petua nahu bebas-konteks untuk Bahasa Melayu, seribu (1000) data latihan diperoleh daripada buku-buku teks sekolah rendah dan buku-buku tatabahasa Bahasa Melayu. Nilai-nilai kebarangkalian yang dikira itu dikenali sebagai Nahu Bebas-konteks Berkebarangkalian. Pengurai itu dinilai menggunakan seratus (100) data ujian yang dipersetujui oleh dua orang pakar dalam Bahasa Melayu yang dikenali sebagai Munsyi Dewan. Seterusnya, Pengurai Berstatistik Bahasa Melayu tersebut dapat mengira nilai kebarangkalian yang tertinggi bagi setiap ayat yang diurai. Hasil keputusan menunjukkan pengurai itu mencapai 100% recall, 93.25% precision dan 96.75% f-score, yang menunjukkan pengurai tersebut berjaya mengurangkan ketaksaan berstruktur bagi ayat mudah Bahasa Melayu.

ACKNOWLEDGEMENT

To Allah, who gives me strength and good health to finish my MSc degree. To my parents and siblings, who give me full support. To Dr Rohana Mahmud, my supervisor, who encourages and guides me in this research. To my friends. Thanks a lot.

TABLE OF CONTENTS

Abstract ...... ii - iii
Acknowledgement ...... iv
Table of Contents ...... v - viii
List of Figures ...... ix - x
List of Tables ......
xi

1 Introduction ...... 1
  1.1 Introduction ...... 1
  1.2 Problem Statement ...... 5
  1.3 Research Objectives ...... 6
  1.4 Research Approach ...... 6
  1.5 Expected Results ...... 7
  1.6 Research Scopes and Limitations ...... 7
  1.7 Dissertation Organization ...... 8

2 Literature Review ...... 10
  2.1 Natural Language Processing Phases ...... 10
  2.2 Syntax Analysis ...... 11
    2.2.1 Category of Syntax Analysis ...... 13
    2.2.2 Syntactic Analysis Technique ...... 14
  2.3 Ambiguity ...... 16
    2.3.1 Part-of-Speech (POS) Ambiguity ...... 16
    2.3.2 Semantic Ambiguity ...... 17
    2.3.3 Syntactic or Structural Ambiguity ...... 17
    2.3.4 Verbal Ambiguity ...... 17
  2.4 Statistical Parsing ...... 18
    2.4.1 Context-free Grammar in English Language ...... 19
    2.4.2 Probabilistic Context-free Grammar (PCFG) ...... 21
  2.5 Charniak's Parser ...... 23
  2.6 Collins's Parser ...... 27
  2.7 Statistical Parser for Malay Language ...... 28
  2.8 Ahmad's Malay Parser ...... 28
  2.9 Juzaiddin's Malay Parser ...... 31
  2.10 Summary of the Chapter ......
34

3 Probabilistic Malay Grammar ...... 36
  3.1 Introduction of Grammar ...... 36
  3.2 Types of Malay Language Grammar ...... 36
    3.2.1 Sentence Grammar ...... 37
      3.2.1.1 CFG in Malay Language ...... 37
    3.2.2 Partial Discourse Grammar ...... 40
    3.2.3 'Pola' (Pattern) Grammar ...... 40
  3.3 Rules for Basic Malay Sentence ...... 42
    3.3.1 Rules for FN (Noun Phrase, NP) ...... 42
    3.3.2 Rules for FK (Verb Phrase, VP) ...... 44
    3.3.3 Rules for FA (Adjective Phrase, AP) ...... 45
    3.3.4 Rules for FS (Prepositional Phrase, PP) ...... 45
  3.4 Probabilistic Context-free Grammar (PCFG) for Malay Language ...... 46
    3.4.1 Training Data ...... 46
    3.4.2 Analysis of Training Data ...... 48
  3.5 Summary of the Chapter ...... 61

4 Development of Malay Statistical Parser ...... 62
  4.1 Requirement Specification ...... 62
    4.1.1 Functional Requirement ...... 62
    4.1.2 Non-functional Requirement ...... 63
  4.2 System Design ...... 64
    4.2.1 System Architecture ...... 64
      4.2.1.1 Input Component ...... 65
      4.2.1.2 Part-of-Speech (POS) Tagger ...... 68
      4.2.1.3 Malay Lexicon (MaLEX) ...... 70
      4.2.1.4 Parsing Engine ...... 72
      4.2.1.5 Output Component ...... 75
    4.2.2 User Interface Design ...... 75
  4.3 Summary of the Chapter ...... 78

5 Experiments and Results ...... 79
  5.1 Test Datasets ......
79

  5.2 Results ...... 90
  5.3 Summary of the Chapter ...... 95

6 Conclusion ...... 96
  6.1 Fulfillment of Research Objectives ...... 96
  6.2 Malay Statistical Parser ...... 97
  6.3 Limitations ...... 98
  6.4 Future Enhancement ...... 98
  6.5 Summary of the Chapter ...... 99

References ...... 100
Appendix A ...... 103
Appendix B ...... 120
Appendix C ...... 122
Appendix D ...... 191
Appendix E ...... 193

LIST OF FIGURES

Figure 1.1: First tree of "He saw the boy with a telescope" (Meyer et al., 2002) ...... 2
Figure 1.2: Second tree of "He saw the boy with a telescope" (Meyer et al., 2002) ...... 3
Figure 1.3: First tree of "Kami adang air itu" ...... 4
Figure 1.4: Second tree of "Kami adang air itu" ...... 4
Figure 2.1: Structure of Syntax Analysis (Parsing) ...... 12
Figure 2.2: Some Examples of CFG for English (Jurafsky et al.,
2000) ...... 20
Figure 2.3: Tree for Sentence "John loves Mary" ...... 20
Figure 2.4: The Grammar with PCFG Form ...... 25
Figure 2.5: The First Possible Parse Tree for Sentence "salespeople sold the dog biscuits" ...... 25
Figure 2.6: The Second Possible Parse Tree for Sentence "salespeople sold the dog biscuits" ...... 26
Figure 2.7: The Third Possible Parse Tree for Sentence "salespeople sold the dog biscuits" ...... 26
Figure 2.8: System Architecture for Ahmad's Malay Parser ...... 30
Figure 2.9: The System Architecture Based on 'Pola' (Pattern) Sentence ...... 33
Figure 3.1: Context-Free Grammar for Malay Language, after Nik Safiah Karim (1995) ...... 38
Figure 3.2: Basic Rule for FN ...... 43
Figure 3.3: Basic Rule for FN in Detail ...... 44
Figure 3.4: Basic Rules for FK ...... 44
Figure 3.5: Basic Rules for FA ...... 45
Figure 3.6: Basic Rules for FS ...... 45
Figure 3.7: Analysis Pattern Sentence of Training Data ...... 49
Figure 4.1: System Architecture of a Statistical Parser for Malay Language ...... 64
Figure 4.2: A Parse Tree of Sentence "Dia gemar melancong" (Mohd Juzaiddin Ab Aziz et al., 2006) ...... 69
Figure 4.3: A Parse Tree of Sentence "Dia gemar menulis aturcara" (Mohd Juzaiddin Ab Aziz et al., 2006) ...... 69
Figure 4.4: The Pseudocode for POS Tagger ...... 70
Figure 4.5: The Process in Parsing Engine ...... 73
Figure 4.6: Another Process in Parsing Engine ...... 74
Figure 4.7: Examples of Parse Trees and Their Probability Values ...... 75
Figure 4.8: Main Interface for Malay Statistical Parser ...... 76
Figure 4.9: Output for the Parser ...... 76
Figure 4.10: Output Component for Sentence "bapa saya pemandu teksi" ...... 77
Figure 4.11: One of the Probability Values for the Parsed Sentence ...... 77
Figure 4.12: The Main Interface with the Error Message "Tiada dalam database" ...... 78
Figure 4.13: Error Message for Unsuccessful Parsed Sentence ...... 78
Figure 5.1: Recall, Precision and F-score for Statistical Parser for Malay Language ...... 94

LIST OF TABLES

Table 2.1: Examples of Terminal Symbols, Non-terminals and Rewrite Rules ...... 19
Table 2.2: Comparison between Charniak's Parser, Collins's Parser, Ahmad's Malay Parser and Juzaiddin's
Malay Parser ...... 34
Table 3.1: Description of Elements Used in Malay Grammar Rules, after Ahmad I. Z. Abidin et al. (2007) ...... 39
Table 3.2: Analysis of the Sentences According to Their Pattern ...... 48
Table 3.3: Rules for A (sentence), S (subject), and P (predicate) ...... 50
Table 3.4: Rules for the FN (noun phrase) ...... 50
Table 3.5: Rules for the FK (verb phrase) ...... 55
Table 3.6: Rules for the FA (adjective phrase) ...... 59
Table 3.7: Rules for the FS (prepositional phrase) ...... 61
Table 4.1: A Few Examples of Words, Lexical Classes and Their Synonyms from MaLEX ...... 71
Table 4.2: A Few Examples of Rules and Probability Values from MaLEX ...... 72
Table 5.1: The Number of Test Sentences ...... 92
Table 5.2: Results of the Experiments ...... 93

CHAPTER 1: INTRODUCTION

1.1 Introduction

The purpose of Natural Language Processing (NLP) is to enable machines to understand human language. Mohanty and Balabantaray (2003) define parsing as the process of assigning a structural description to a sequence of words in natural language. The parsing process produces parse trees, which are useful in grammar-checking applications such as those found in word-processing systems; in those applications, parsing plays an important role throughout the inspection process. For example, a sentence like "likes she reading" cannot be parsed because it is ungrammatical. Parse trees are also very important in semantic analysis and in understanding the deeper meaning of a sentence. Question answering and information extraction are examples of such analytical applications. For example, the information needed to answer the question "What books were written by male Malay authors after 2000?" can be found in the subject of the sentence, "what books", and the by-adjunct, "male Malay authors". The question can be answered by knowing the list of books rather than the list of authors (Manning & Schutze, 1999).
Parsing a sentence, though a difficult task, is an initial step in understanding natural language, and ambiguity is a serious problem that linguists face in natural language parsing. An ambiguity problem occurs when more than one parse tree can be constructed for a sentence. For example, the sentence "He saw the boy with a telescope" allows two interpretations. First reading: "He used the telescope to see the boy". Second reading: "He saw the boy who had a telescope". Both interpretations can be represented by tree structures. The first tree is shown in Figure 1.1.

Figure 1.1: First tree of "He saw the boy with a telescope" (Meyer et al., 2002)

The second tree, which has a dissimilar structure, is shown in Figure 1.2.

Figure 1.2: Second tree of "He saw the boy with a telescope" (Meyer et al., 2002)

The two figures have dissimilar structures because they use different grammar rules. Thus, the sentence "He saw the boy with a telescope" is an ambiguous sentence. The Malay language also has ambiguous sentences. As mentioned by Mohd Juzaiddin et al. (2006), a Malay sentence is ambiguous when more than one parse tree can be constructed for it. This can happen when the sentence also contains word ambiguity, where word ambiguity means a word that holds more than one part-of-speech (POS). For example, the word "adang" (deter) has two POS, namely KN (noun) and KKTr (transitive verb). The sentence "kami adang air itu" (we deter the water) can therefore have two different parse trees. The first tree is shown in Figure 1.3.

Figure 1.3: First tree of "Kami adang air itu"

The second tree, which has a dissimilar structure, is shown in Figure 1.4.

Figure 1.4: Second tree of "Kami adang air itu"

Other than structural ambiguity, there are three (3) other types of ambiguity that usually occur in natural languages such as English and Malay: part-of-speech ambiguity, semantic ambiguity and verbal ambiguity (Jurafsky et al., 2000).
A statistical parser is one solution for minimizing structural ambiguity. A statistical parser is a parser that assigns a probability value to each of the grammar rules. Many researchers have succeeded in developing this type of parser, and such parsers have been built for many languages, such as English, Chinese, French and German. Charniak's and Collins's parsers are statistical parsers developed for the English language: Charniak's parser was developed by Eugene Charniak and his research group (Charniak, 1997), while Collins's parser was developed by Michael Collins for his PhD study (Collins, 2003). In this research, the focus is to implement a statistical parser for the Malay language, as other researchers have already developed Malay parsers without statistical elements. Ahmad's parser (Ahmad I. Z. Abidin et al., 2007) and Juzaiddin's parser (Mohd Juzaiddin Ab Aziz et al., 2006) are examples of such Malay parsers.

1.2 Problem Statement

Parsing a sentence helps reveal the way in which the sentence is structured, using linguistic knowledge. A sentence is successfully parsed when a parse tree can represent it. However, if the sentence is ambiguous, parsing may result in more than one parse tree. To handle this situation, a statistical parser that minimizes ambiguity should be developed. So far, however, no statistical parser has been developed for the Malay language; the goal of this project is therefore to build a statistical parser that can minimize structural ambiguity for Malay language sentences.

1.3 Research Objectives

There are two main objectives to this project:
(1) To provide the probability values for 147 of the Malay grammar rules.
(2) To develop a prototype of a statistical parser for the Malay language. The statistical parser calculates the probability of each parsed sentence and chooses the parse tree with the highest probability.
1.4 Research Approach

There are two main things that need to be considered when developing a statistical parser for the Malay language. The first is the calculation of the probabilistic grammar. Probability values are assigned as a statistical element to each parsed sentence. Because no Malay grammar rules with probability values exist yet, the values will be calculated from one thousand (1000) Malay sentences that follow the language rules introduced by Prof Emeritus Datuk Dr Nik Safiah Karim (Nik Safiah Karim, 1995). The training data are derived from various sources containing basic Malay sentences. Once the probability values are computed, they are assigned to the 147 Malay grammar rules. Following that, the statistical parser calculates the probability of each parsed sentence and chooses the parse tree with the highest probability.

The second thing to consider when developing a statistical parser is the Malay Lexicon (MaLEX), which contains Malay words and their lexical classes. There are ninety words that have two lexical classes. These words are very important because they lead to ambiguous sentences, which produce two parse trees and two probability values. The MaLEX is provided by undergraduate students from Universiti Kebangsaan Malaysia (UKM). In this research project, an enhancement has been made: a table containing the grammar rules and their probability values is added.

1.5 Expected Results

There are two expected results in this study:
(1) Probability values for 147 Malay grammar rules. A list of Malay grammar rules, each assigned a probability value, known as a Probabilistic Context-Free Grammar (PCFG). So far, these values are derived from the one thousand training sentences only.
(2) A prototype of a Malay statistical parser. A prototype which can minimize the structural ambiguity of basic Malay sentences.
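The probability calculation described above is, in the standard PCFG formulation, a relative-frequency estimate: each rule's count in the training data is divided by the total count of all rules sharing the same left-hand side. A minimal sketch in Python; the symbol names follow the thesis's conventions (A = sentence, FN = noun phrase), but the counts are invented for illustration only:

```python
from collections import Counter

# Hypothetical rule counts, as if tallied from hand-parsed training
# sentences; the real thesis uses 1000 sentences and 147 rules.
rule_counts = Counter({
    ("A", ("S", "P")): 1000,
    ("FN", ("KN",)): 420,
    ("FN", ("KN", "KN")): 180,
    ("FN", ("KN", "Pent")): 250,
})

def pcfg_probabilities(counts):
    """P(lhs -> rhs) = count(lhs -> rhs) / total count of rules with that lhs."""
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

probs = pcfg_probabilities(rule_counts)
print(probs[("A", ("S", "P"))])           # 1.0 (the only rule for A here)
print(round(probs[("FN", ("KN",))], 4))   # 0.4941 (420 / 850)
```

By construction, the probabilities of all rules with the same left-hand side sum to one, which is exactly the property a PCFG requires.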
1.6 Research Scopes and Limitations

The scope of this study is limited to the development of a statistical parser for the Malay language. Malay sentences are categorized as either basic sentences or complex sentences. The limitation of this study is that it covers only basic sentences, as these follow the context-free grammar rules developed by the Malay linguist Nik Safiah Karim (1995).

1.7 Dissertation Organization

This dissertation is organized as follows:

Chapter One
This chapter provides an overview of the study. An introduction is provided first, followed by the problem statement, the objectives, a synopsis of the study's methodology, the scope and limitations, and finally the dissertation's structure.

Chapter Two
Chapter Two provides a background study of Natural Language Processing (NLP) phases, followed by syntax analysis. It also discusses types of ambiguity and reviews previous related work reported by other researchers.

Chapter Three
The third chapter describes Malay grammar in detail. It explains the different types of Malay language grammar, namely the sentence grammar, the partial discourse grammar, and the pattern ('pola') grammar. Next, the chapter explains CFG in the Malay language. It also discusses the training data and how the probability values are computed for each of the CFG rules to form the PCFG.

Chapter Four
Chapter Four describes the system architecture of a statistical parser for the Malay language. The system comprises five components: the parsing engine, the part-of-speech (POS) tagger, the Malay lexicon, and the input and output components.

Chapter Five
This chapter discusses in detail the experiments carried out for each basic Malay sentence pattern. The chapter also presents the results and analyses the findings.
Chapter Six
The last chapter assesses the fulfillment of the objectives of this study, the strengths of the Malay Statistical Parser, its limitations, and future enhancements.

CHAPTER 2: LITERATURE REVIEW

This chapter briefly describes the phases in Natural Language Processing. It then explains syntactic analysis, also known as parsing, followed by further explanation of the categories and techniques of syntactic analysis. Next, it discusses ambiguity, statistical parsing in English, and the grammar involved. The chapter also gives examples of available parsers for English, namely Charniak's and Collins's parsers, and describes two parsers for the Malay language, namely Ahmad's and Juzaiddin's Malay parsers.

2.1 Natural Language Processing (NLP) Phases

Natural language processing is the task of enabling computer systems to understand natural language. Humans use natural languages, such as English, French, Turkish, Chinese and Malay, in daily conversation. There are six different phases of knowledge involved in understanding a natural language: phonology, morphology, syntax, semantics, pragmatics and discourse.

Phonology concerns how words are related to sound. For example, the word 'book' is pronounced as 'buk'. The next phase is morphology, which is concerned with the structure of words. Words are constructed from basic meaning units called morphemes. For example, the word 'recently' comes from the root word 'recent' (adjective) coupled with the suffix -ly; when the two are joined, 'recently' becomes an adverb.

The third phase is syntax. This phase concerns how words can be put together to form a correct sentence, and deals with the structure of sentences. For example, the structure of the sentence 'she reads a book' is (S (NP (N she)) (VP (V reads) (NP (ART a) (N book)))). The fourth phase is semantic knowledge.
Semantic knowledge concerns the meanings of words and how these meanings are combined in sentences to form sentence meanings. For instance, the sentence "Colourless green ideas sleep furiously" (Chomsky, 1957) would be rejected as semantically anomalous. Next is the pragmatic phase, which deals with the use of a sentence in different situations. For instance, the sentence 'you have a green light' can be given two interpretations: the first is 'you are holding a green light bulb' and the second is 'you have a green light to drive your car'. The last phase is the discourse phase, which concerns how a previous sentence affects the next one. For example, in "Ahmad wanted it", the word "it" refers to something in a previous sentence.

In this research project, the major emphasis is on the syntax phase. As mentioned, ambiguity is a main problem in natural language parsing; since we want to disambiguate the structure of an ambiguous sentence, syntax is the relevant field.

2.2 Syntax Analysis

Syntax analysis, also known as syntactic parsing, is the task of recognising a sentence and assigning a syntactic structure to it (Jurafsky et al., 2000). The word syntax comes from the Greek sŷntaxis, meaning "setting out together or arrangement", and refers to the way words are arranged together (Jurafsky et al., 2000). Syntax provides rules for putting words together in a sentence in the correct order. The objective of syntax analysis is to produce a tree, or some equivalent representation, as output. A syntactic structure for a sentence means that each word in the sentence is associated with a lexical value (word category). Another definition of the analysis is: a process of analysing a text, made of a sequence of tokens (i.e. words), to determine its grammatical structure with respect to a given formal grammar. Figure 2.1 shows the structure of syntax analysis (parsing).
Figure 2.1: Structure of Syntax Analysis (Parsing). A test sentence and the grammar rules are fed into the parser, which produces parse trees.

2.2.1 Category of Syntax Analysis

Syntax analysis can be categorized into two parts, namely recognising a phrase and generating a series of strings.

The first part is recognising phrases in a sentence. Most syntactic representations of language are based on the notion of context-free grammars (CFG). A CFG consists of a set of rules or productions; it is used to group and order symbols together, and it also defines a lexicon of words and symbols (Allen, 1995). An example of a CFG rule for the Malay language, representing the Noun Phrase (NP) or Frasa Nama (FN), as suggested by Nik Safiah Karim (1975), is shown below:

FN → (Bil) (Penj Bil) (Gelaran) Kata Nama Int <Kata Nama Int> (Penentu) (Pent)

Instance of FN: Dua (Bil) buah (Penj Bil) buku (Kata Nama)

There are two classes of symbols used in a CFG. Symbols that cannot be further decomposed by the grammar are called terminal symbols, like Verb (V) and Noun (N). Symbols that express clusters of terminal symbols are called non-terminals, such as NP, VP (verb phrase) and S (sentence) (Jurafsky et al., 2000).

The second part is generating a series of strings. A sequence of rule expansions is called a derivation of words. There are two important processes based on derivations. The first is sentence generation, which uses derivations to construct legal sentences; a simple generator could be implemented by randomly choosing rewrite rules, starting from the S symbol, until a sequence of words is reached. The second process is parsing, which identifies the structures of sentences given a grammar (Allen, 1995).
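The simple random generator described by Allen (1995) can be sketched in a few lines of Python. The grammar fragment below is a hypothetical toy, not the thesis's actual 147-rule grammar; the Malay symbol names (A, S, P, FN, FK, KN, KK) follow the thesis's conventions:

```python
import random

# Toy Malay CFG fragment (invented for illustration).
# Keys are non-terminals; values are lists of possible rewrites.
grammar = {
    "A":  [["S", "P"]],
    "S":  [["FN"]],
    "P":  [["FK"]],
    "FN": [["KN"]],
    "FK": [["KK"]],
    "KN": [["Ali"], ["buku"]],       # nouns
    "KK": [["belajar"], ["membaca"]],  # verbs
}

def generate(symbol="A"):
    """Expand a symbol by randomly chosen rewrite rules until only words remain."""
    if symbol not in grammar:            # terminal: an actual word
        return [symbol]
    expansion = random.choice(grammar[symbol])
    words = []
    for sym in expansion:
        words.extend(generate(sym))
    return words

print(" ".join(generate()))  # e.g. "Ali belajar"
```

Every string this generator emits is, by construction, a legal sentence of the toy grammar, which is exactly the sense in which derivations "construct legal sentences".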
2.2.2 Syntactic Analysis Technique

The syntax analysis technique, commonly known as the syntactic parsing technique, is a method of analyzing a sentence to determine its structure according to the grammar (Allen, 1995). There are two basic approaches to parsing, namely top-down parsing and bottom-up parsing (Grune & Jacobs, 1990).

Top-down parsing starts with the S symbol (S is equivalent to A in the Malay language context, where A stands for Ayat) and then searches through different ways to rewrite the symbols until the input sentence is generated. For example, for the sentence "Ali belajar" (Ali studies):

A → Subjek Prediket          (Sentence → Subject Predicate)
  → (Frasa Nama) (Frasa Kerja)   ((Noun Phrase) (Verb Phrase))
  → (Kata Nama) (Frasa Kerja)    ((Noun) (Verb Phrase))
  → Ali (Frasa Kerja)            (Ali (Verb Phrase))
  → Ali (Kata Kerja)             (Ali (Verb))
  → Ali belajar                  (Ali studies)

Bottom-up parsing starts with the words in the sentence and applies the rewrite rules backward to reduce the sequence of symbols until only A remains. Example:

Ali belajar                      (Ali studies)
  → (Kata Nama) belajar          ((Noun) studies)
  → (Frasa Nama) belajar         ((Noun Phrase) studies)
  → (Frasa Nama) (Kata Kerja)    ((Noun Phrase) (Verb))
  → (Frasa Nama) (Frasa Kerja)   ((Noun Phrase) (Verb Phrase))
  → Subjek Prediket              (Subject Predicate)
  → A                            (Sentence)

2.3 Ambiguity

Ambiguity is perhaps the most serious problem faced by parsers; the ambiguity problem cannot be solved by the top-down and bottom-up approaches alone. Parsers that use those techniques may produce more than one parse tree when they analyse an ambiguous sentence. Statistical parsing, on the other hand, is a better approach for tackling the ambiguity problem (Charniak, 2000; Jurafsky et al., 2000). There are four different types of ambiguity, namely part-of-speech (POS) ambiguity, semantic ambiguity, syntactic or structural ambiguity, and verbal ambiguity (Jurafsky et al., 2000).
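The top-down derivation above can be sketched as a small recursive-descent parser in Python. The grammar and lexicon below are a hypothetical two-word fragment invented for illustration; the sketch tries alternative rules for a symbol but does not backtrack across earlier sibling choices, which is enough for this toy grammar but not for the full 147-rule grammar:

```python
# Toy grammar: non-terminal -> list of right-hand sides (tuples of symbols).
GRAMMAR = {
    "A": [("Subjek", "Prediket")],
    "Subjek": [("FN",)],
    "Prediket": [("FK",)],
    "FN": [("KN",)],
    "FK": [("KK",)],
}
# Toy lexicon: word -> pre-terminal category.
LEXICON = {"Ali": "KN", "belajar": "KK"}

def parse(symbol, words, i):
    """Top-down: expand `symbol` over words[i:]; return next index or None."""
    if symbol in LEXICON.values():               # pre-terminal: match one word
        if i < len(words) and LEXICON.get(words[i]) == symbol:
            return i + 1
        return None
    for rhs in GRAMMAR.get(symbol, []):          # try each rewrite rule in turn
        j = i
        for sym in rhs:
            j = parse(sym, words, j)
            if j is None:
                break
        else:
            return j
    return None

words = "Ali belajar".split()
print(parse("A", words, 0) == len(words))  # True: the sentence is accepted
```

The parse succeeds exactly when the expansion of A consumes every word, mirroring the step-by-step rewriting shown in the derivation.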
2.3.1 Part-of-Speech (POS) Ambiguity

POS tagging is the activity of selecting appropriate syntactic categories for the words in a sentence. For example:

Sentence: Ali belajar (Ali studies)
POS: Ali → kata nama (noun), belajar → kata kerja (verb)

However, ambiguity can also be a problem in POS tagging. For instance, the word "mereka" has two different categories: the first is a noun (mereka meaning they) and the second is a verb (mereka meaning design).

2.3.2 Semantic Ambiguity

Ambiguity is a serious problem during semantic interpretation. A word is semantically ambiguous if it maps to more than one sense. For example: "Palestinian head seeks arms." The word "head" can be interpreted as a noun meaning either a chief or the anatomical head of a body, while the word "arms" can be interpreted as a plural noun meaning either weapons or body parts (Hoenisch, 2004).

2.3.3 Syntactic or Structural Ambiguity

Syntactic or structural ambiguity occurs when a sentence can be interpreted in more than one way. This ambiguity arises from the relationships between the words and clauses of a sentence and from the structure of the sentence. Example: "The children ate the cake with the spoon." The first interpretation is that the children ate the cake by using the spoon; the sentence can also be interpreted as the children eating the cake and the spoon together (Manning & Schutze, 1999).

2.3.4 Verbal Ambiguity

Verbal ambiguity is a deeper kind of ambiguity that arises in speech. Example: "John loves his mother and so does Bill." This can be used to say either that John loves John's mother and Bill loves Bill's mother, or that John loves John's mother and Bill loves John's mother (Bach, 1998).

In developing a statistical parser for the Malay language, only structural ambiguity will be considered. The reason is that structural ambiguity arises when we parse a sentence using a syntactic parser (Jurafsky et al., 2000).
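POS ambiguity of the "mereka" kind can be made concrete: each word contributes its set of lexical classes, and the candidate taggings of a sentence are their cross-product. A small sketch with a hypothetical two-entry lexicon (the thesis's MaLEX holds ninety such ambiguous words; the mini-lexicon and example phrase here are invented):

```python
from itertools import product

# Hypothetical mini-lexicon; "mereka" carries two lexical classes, as in the text.
LEXICON = {
    "mereka": ["KN", "KKTr"],  # they (noun) / to design (transitive verb)
    "baju":   ["KN"],          # clothes (noun)
}

def tag_sequences(sentence):
    """Enumerate every POS assignment licensed by the lexicon."""
    options = [LEXICON[w] for w in sentence.split()]
    return [list(seq) for seq in product(*options)]

for seq in tag_sequences("mereka baju"):
    print(seq)
# Two candidate taggings, one per sense of "mereka"
```

Each ambiguous word multiplies the number of candidate taggings, which is why ninety two-class words in MaLEX are enough to make many sentences structurally ambiguous.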
This ambiguity occurs when the grammar assigns more than one possible parse tree to a sentence. The statistical parser offers a solution by using probability.

2.4 Statistical Parsing

The aim of a statistical parser is to solve the problem that arises in syntactic parsing, where the parser cannot handle an ambiguous sentence. A statistical parser applies a probabilistic approach to the grammar that is used in parsing. Statistical parsers have been successfully developed for a variety of languages, such as English, German, French and Chinese. For English, a few statistical parsers have been developed, such as Charniak's parser and Collins's parser. To build a statistical parser, two main things must be considered: first, the type of grammar, and second, the method of parsing. The most basic type of grammar used in English statistical parsers is the context-free grammar (CFG).

2.4.1 Context-free Grammar in English Language

The most common type of English grammar is the CFG. A CFG consists of four (4) constituents, namely a set of terminal symbols, a set of non-terminal symbols, a specific non-terminal start symbol and a set of rewrite rules. The terminal symbols are the symbols that appear in the final strings, while the non-terminals are the symbols that are expanded into other symbols. One specific non-terminal symbol is designated as the starting symbol; the symbol S, which means sentence, is one example of a starting symbol. Each rewrite rule contains a single non-terminal on the left-hand side and one or more terminal or non-terminal symbols on the right (Charniak, 1993).
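The four constituents above can be written down directly as a data structure. The following sketch is illustrative only; the toy English grammar in it is an assumption made for this example, not taken from any of the cited parsers.

```python
# The four components of a CFG as described above: terminals, non-terminals,
# a designated start symbol, and rewrite rules whose left-hand side is a
# single non-terminal. The toy grammar itself is an assumed example.
from collections import namedtuple

CFG = namedtuple("CFG", ["terminals", "nonterminals", "start", "rules"])

toy_cfg = CFG(
    terminals={"the", "dog", "barks"},
    nonterminals={"S", "NP", "VP", "Det", "N", "V"},
    start="S",
    rules=[
        ("S", ("NP", "VP")),
        ("NP", ("Det", "N")),
        ("VP", ("V",)),
        ("Det", ("the",)), ("N", ("dog",)), ("V", ("barks",)),
    ],
)

# Every rewrite rule has a single non-terminal on its left-hand side:
assert all(lhs in toy_cfg.nonterminals for lhs, rhs in toy_cfg.rules)
```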
Table 2.1: Examples of terminal symbols, non-terminals and rewrite rules

Non-terminal Symbols          Terminal Symbols      Rewrite Rules
Sentence (S)                  Verb (V)              S → NP VP
Verb Phrase (VP)              Noun (N)              NP → Det N N
Noun Phrase (NP)              Determiner (Det)      NP → N
Prepositional Phrase (PP)     Adjective (Adj)       NP → Det N

Figure 2.2 shows some examples of a CFG for English, taken from Jurafsky et al. (2000):

Sentence → Auxiliary NounPhrase VerbPhrase
Sentence → NounPhrase VerbPhrase
NounPhrase → ProperName
NounPhrase → NounPhrase + PrepositionalPhrase
NounPhrase → Nominal
NounPhrase → Pronoun
Nominal → Noun
VerbPhrase → Verb NounPhrase

Figure 2.2: Some Examples of CFG for English (Jurafsky et al., 2000)

In developing a statistical parser, the second thing that needs to be considered is the methodology. For statistical work, we need a corpus of hand-parsed text. Fortunately, for the English language such a corpus has been developed, called the Penn Treebank (Marcus et al., 1993). The Penn Treebank project annotates naturally occurring text for linguistic structure, showing syntactic and semantic information. The treebank contains a huge number of linguistic trees and also annotates the text with part-of-speech tags. One example from the Penn Treebank is shown below:

Figure 2.3: Tree for the Sentence "John loves Mary"

Statistical parsers work by assigning probabilities to the possible parses of a sentence, locating the most probable parse tree and presenting that tree as the answer. For that purpose, we need a probabilistic grammar. The simplest mechanism for this is the probabilistic context-free grammar (PCFG).

2.4.2 Probabilistic Context-free Grammar (PCFG)

The idea of a PCFG is to combine a probability with each grammatical rule; Lakeland & Knott (2004) describe the statistical parser in the same terms. PCFGs are very useful in disambiguation (Jurafsky & Martin, 2000).
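Treebank trees such as the one in Figure 2.3 can be read programmatically and decomposed into the rewrite rules they use, which is the first step toward estimating rule probabilities. The bracketed string and category names below are assumptions made for illustration, not the actual Penn Treebank annotation of this sentence.

```python
# Sketch: parse a Treebank-style bracketed tree and list the rewrite rules
# it contains. The tree string is an assumed example in bracketed notation.
def parse_tree(s):
    """Turn '(S (NP ...) ...)' into nested [label, child, ...] lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    stack = []
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)
            else:
                return node
        else:
            stack[-1].append(tok)

def rules(node, out=None):
    """Collect an (LHS, RHS) pair for every internal node of the tree."""
    if out is None:
        out = []
    children = node[1:]
    out.append((node[0], tuple(c[0] if isinstance(c, list) else c
                               for c in children)))
    for c in children:
        if isinstance(c, list):
            rules(c, out)
    return out

tree = parse_tree("(S (NP (N John)) (VP (V loves) (NP (N Mary))))")
rules(tree)[0]   # ('S', ('NP', 'VP'))
```

Counting the pairs produced by `rules` over a whole treebank yields exactly the rule frequencies needed for the probability estimates described next.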
PCFG is also known as Stochastic Context-free Grammar (SCFG) and was first proposed by Booth (1969). A PCFG is defined by five parameters (N, Σ, P, S, D):

(1) a set of non-terminal symbols N
(2) a set of terminal symbols Σ
(3) a set of productions P, each of the form A → β, where A is a non-terminal and β is a string of symbols from the infinite set of strings (Σ ∪ N)*
(4) a designated start symbol S
(5) a function D that assigns a probability to each rule

We can assign probability values to the CFG rules. Allen (1995) defined the PCFG estimation as counting the number of times each rule is used in a corpus containing parsed sentences; the counts are then used to estimate the probability of each rule being used (Allen, 1995).

To calculate the PCFG probabilities, Allen (1995) gives a formula. Consider a category C for which the grammar contains m rules, R1, ..., Rm, with left-hand side C. The probability of using rule Rj to derive C is

PROB(Rj | C) = Count(# times Rj used) / Σ(i = 1..m) Count(# times Ri used)

For an instance taken from Allen (1995), suppose we calculate the PCFG probabilities for the sentence "A flower wilted", with the rules and probabilities given below:

S → NP VP (1.000)
VP → V (0.386)
VP → V NP (0.393)
NP → N N (0.090)
NP → N (0.140)
NP → ART N (0.550)

There are three possible ways to generate "A flower wilted". Given the probabilities of the individual rules, we can calculate the probability of an entire parse by taking the product of the probabilities of the rules it uses.

(i) (S (NP (ART a) (N flower)) (VP (V wilted)))
Probability for the parse: 1 × 0.550 × 0.386 = 0.212

(ii) (S (NP (N a) (N flower)) (VP (V wilted)))
Probability for the parse: 1 × 0.090 × 0.386 = 0.035

(iii) (S (NP (N a)) (VP (V flower) (NP (N wilted))))
Probability for the parse: 1 × 0.140 × 0.393 × 0.140 ≈ 7.7 × 10⁻³

As 0.212 is the highest probability value among them, the first parse is the most likely representation of the sentence.
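Allen's counting formula and the product-of-rules parse score can be sketched in a few lines. The rule counts below are assumptions chosen so that the estimated probabilities reproduce the values quoted above (the remaining VP and NP probability mass is lumped into filler entries).

```python
# Sketch of PROB(Rj | C) = count(Rj) / sum over i of count(Ri), and of
# scoring a parse as the product of its rule probabilities. The counts are
# illustrative assumptions scaled to match the probabilities in the text.
from collections import defaultdict

def rule_probs(rule_counts):
    totals = defaultdict(int)
    for (lhs, _), n in rule_counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in rule_counts.items()}

counts = {
    ("S", ("NP", "VP")): 1000,
    ("VP", ("V",)): 386, ("VP", ("V", "NP")): 393, ("VP", ("other",)): 221,
    ("NP", ("N", "N")): 90, ("NP", ("N",)): 140,
    ("NP", ("ART", "N")): 550, ("NP", ("other",)): 220,
}
probs = rule_probs(counts)

def parse_prob(rules_used):
    p = 1.0
    for r in rules_used:
        p *= probs[r]
    return p

# Parse (i) of "A flower wilted": S -> NP VP, NP -> ART N, VP -> V
parse_prob([("S", ("NP", "VP")), ("NP", ("ART", "N")), ("VP", ("V",))])
# 1.000 x 0.550 x 0.386, i.e. about 0.212
```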
Charniak's parser (Charniak, 1997) and Collins's parser (Collins, 2003) are two examples that use a probabilistic grammar as the technique for developing a statistical parser.

2.5 Charniak's Parser

Charniak's parser was developed by Eugene Charniak (Charniak, 1997) and is trained on the Penn Treebank, which was produced by a group of researchers at the University of Pennsylvania (Marcus et al., 1993). The parsing technique applied in this parser is top-down parsing. There are two basic approaches in parsing techniques, namely top-down parsing and bottom-up parsing. Top-down parsing starts with an S symbol (S stands for sentence) and then searches through different ways to rewrite the symbols until the input sentence is generated (Grune & Jacobs, 1990). For example, for the sentence She eats:

S → (Noun Phrase) (Verb Phrase)
→ (Noun) (Verb Phrase)
→ She (Verb Phrase)
→ She (Verb)
→ She eats.

Bottom-up parsing starts with the words in the sentence and uses the rewrite rules backward to reduce the sequence of symbols until only S remains. Example:

She eats
→ (Noun) eats
→ (Noun Phrase) eats
→ (Noun Phrase) (Verb)
→ (Noun Phrase) (Verb Phrase)
→ S

The corpus used in this parser is based on the Penn Treebank. Charniak's parser is a statistical parser: it uses probabilities to choose among parses. To assign a probability to each constituent, Charniak used a method called the Markov grammar, which assigns probabilities to any possible expansion using statistics gathered from a training corpus (Charniak, 1997; Collins, 2003; Magerman, 1995).

Charniak suggested three steps to parse a new sentence with this parser. First, we need an actual grammar in PCFG form. Second, we parse the sentence using a parser that applies the PCFG to the sentence and finds all of its possible parses. Lastly, we find the parse tree with the highest probability. For example, consider the test sentence: "Salespeople sold the dog biscuits."

Step 1: We need an actual grammar in PCFG form.
See Figure 2.4 below.

s → np vp (1.00)
vp → verb np (0.8)
vp → verb np np (0.2)
np → det noun (0.5)
np → noun (0.3)
np → det noun noun (0.15)
np → np np (0.05)

Figure 2.4: The Grammar in PCFG Form

Step 2: We feed the test sentence to the parser and find all the possible parses. There are three, shown in Figures 2.5, 2.6 and 2.7 and written here in bracketed form:

(s (np (noun salespeople)) (vp (verb sold) (np (det the) (noun dog) (noun biscuits))))
Figure 2.5: The First Possible Parse Tree for the Sentence "salespeople sold the dog biscuits"

(s (np (noun salespeople)) (vp (verb sold) (np (det the) (noun dog)) (np (noun biscuits))))
Figure 2.6: The Second Possible Parse Tree for the Sentence "salespeople sold the dog biscuits"

(s (np (noun salespeople)) (vp (verb sold) (np (np (det the) (noun dog)) (np (noun biscuits)))))
Figure 2.7: The Third Possible Parse Tree for the Sentence "salespeople sold the dog biscuits"

Step 3: We find the highest probability among the parse trees. Before identifying which parse tree has the highest value, we first calculate the probability of each parse tree.

For Figure 2.5, the probability is 1.0 × 0.3 × 0.8 × 0.15 = 0.036
For Figure 2.6, the probability is 1.0 × 0.3 × 0.2 × 0.5 × 0.3 = 0.009
For Figure 2.7, the probability is 1.0 × 0.3 × 0.8 × 0.05 × 0.5 × 0.3 = 0.0018

From the results above, the parse tree in Figure 2.5 has the highest probability value. In conclusion, the parse tree for the sentence "salespeople sold the dog biscuits" is the one in Figure 2.5. This parser achieved 83% for the recall and precision measurements.

2.6 Collins's Parser

Collins's parser is another example of a statistical parser. It was developed by Michael Collins (Collins, 2003), who applied a statistical approach in building it. A key idea in the statistical approach is to associate a probability with each grammar rule; such a grammar is called a Probabilistic Context-free Grammar (PCFG).
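The three-step calculation above can be sketched in a few lines. Each candidate parse is written as the list of rules it uses; this encoding is an assumption made for illustration, since a real parser would enumerate the parses itself from a chart.

```python
# Score the three candidate parses of "salespeople sold the dog biscuits"
# with the PCFG of Figure 2.4 and pick the most probable one (Step 3).
PCFG = {
    ("s", ("np", "vp")): 1.00,
    ("vp", ("verb", "np")): 0.8,
    ("vp", ("verb", "np", "np")): 0.2,
    ("np", ("det", "noun")): 0.5,
    ("np", ("noun",)): 0.3,
    ("np", ("det", "noun", "noun")): 0.15,
    ("np", ("np", "np")): 0.05,
}

def parse_prob(rules_used):
    p = 1.0
    for rule in rules_used:
        p *= PCFG[rule]
    return p

candidates = {                       # rules used by Figures 2.5, 2.6, 2.7
    "2.5": [("s", ("np", "vp")), ("np", ("noun",)),
            ("vp", ("verb", "np")), ("np", ("det", "noun", "noun"))],
    "2.6": [("s", ("np", "vp")), ("np", ("noun",)),
            ("vp", ("verb", "np", "np")), ("np", ("det", "noun")),
            ("np", ("noun",))],
    "2.7": [("s", ("np", "vp")), ("np", ("noun",)),
            ("vp", ("verb", "np")), ("np", ("np", "np")),
            ("np", ("det", "noun")), ("np", ("noun",))],
}
scores = {fig: parse_prob(r) for fig, r in candidates.items()}
best = max(scores, key=scores.get)   # '2.5', with probability 0.036
```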
A PCFG is a simple modification of a context-free grammar in which each rule has an associated probability P(RHS | LHS): the conditional probability that a non-terminal (the left-hand side, LHS) is expanded using that rule, as opposed to one of the other rules for the same non-terminal listed in the grammar. The probability of a tree-sentence pair (T, S) derived by n applications of context-free rules LHSi → RHSi, 1 ≤ i ≤ n, under a PCFG is

P(T, S) = ∏(i = 1..n) P(RHSi | LHSi)

However, Collins discovered that the probability of a test sentence computed this way is often almost zero. He suggested two solutions to overcome the problem: each rule should be broken into smaller steps, and the number of non-terminals should be increased. This parser is also considered a head-driven statistical model, because the parse tree is represented as the sequence of decisions corresponding to a head-centered, top-down derivation of the tree.

2.7 Statistical Parser for Malay Language

In developing a statistical parser for the Malay language, a few things need to be considered. First, there is no lexicon like WordNet. WordNet is a huge lexical resource for the English language that is used in most natural language applications (Fellbaum, 1999). Second, there is no probabilistic grammar, so we need to compute a probability value for each context-free grammar rule of the Malay language. Previous studies show that no statistical parser has yet been developed for the Malay language. However, to date two Malay syntactic parsers have already been built. The first is Ahmad's Malay Parser.

2.8 Ahmad's Malay Parser

The parser was developed by a group of researchers from Universiti Teknologi Petronas led by Ahmad Izuddin Zainal Abidin (Ahmad I. Z. Abidin et al., 2007). This parser performs syntactic parsing using a top-down parsing approach.
The target of the parser is to complement an existing word processing system by checking the grammar of a test sentence. Another function of this parser is to illustrate the parse tree if the sentence is grammatically correct. The research domain of the parser is the Malay language, with a focus on basic Malay sentences. This parser also performs a basic level of semantic parsing on top of the syntactic parsing. In the semantic parsing, Malay words are divided into two categories: humans and animals. Some examples of words for humans are 'mengandung' (pregnant), 'memasak' (cooking) and 'berfikir' (thinking), while examples for animals are 'meragut' (grazing), 'mengawan' (mating) and 'bunting' (pregnant, for animals). The advantage of this parser is that it can handle semantic ambiguity. For example, the sentence 'bapa meragut rumput' (a father is grazing grasses) fails to parse because the word 'meragut' is categorised under animals, not humans. This parser was evaluated by experts in the Malay language, specifically school teachers.

Figure 2.8: System Architecture for Ahmad's Malay Parser

Figure 2.8 illustrates the system architecture for Ahmad's Malay Parser. When the user inputs a sentence, the checking engine parses the test sentence using the text parser component. Within the text parser component there are two important parts, which form the technical structure of the parser: the grammar rules and the Malay lexicon. The grammar rules are derived from Nik Safiah Karim (1995), while the Malay lexicon contains three thousand (3000) words arranged according to word categories. The words were collected from Kamus Dwibahasa Oxford Fajar, 2nd Edition (Hawkins, 2001).
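The human/animal check described above can be sketched as follows. This is an assumed re-implementation for illustration, not Ahmad's actual code, and the small word lists are examples only (entries beyond those quoted in the text are invented for the sketch).

```python
# Reject a sentence when the subject's semantic class (human/animal) does
# not match the class required by the verb, as in 'bapa meragut rumput'.
NOUN_CLASS = {"bapa": "human", "ibu": "human", "lembu": "animal"}
VERB_CLASS = {"memasak": "human", "berfikir": "human",
              "meragut": "animal", "mengawan": "animal"}

def semantically_valid(subject, verb):
    """True only when the subject and verb belong to the same class."""
    return NOUN_CLASS.get(subject) == VERB_CLASS.get(verb)

semantically_valid("bapa", "meragut")    # False: the parse is rejected
semantically_valid("lembu", "meragut")   # True
```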
This parser achieved 81.3% in the recall measurement.

2.9 Juzaiddin's Malay Parser

This parser was developed by Mohd Juzaiddin Ab Aziz for his PhD study (Mohd Juzaiddin Ab Aziz et al., 2006). The parser introduces a pola-grammar technique that does not require a lexical process of retrieving the part of speech for each word. The technique uses finite-state automata. This parser also analyses basic Malay sentences, which are combinations of NP+NP, NP+VP, NP+PP or NP+AP. A basic sentence is grouped into five categories, namely adjunct, subject, post-subject, conjunction and predicate. An adjunct is a type of adverbial illustrating the circumstances of the action, for example dua (two), pada (to), di (at), orang (people), beberapa (a few), lampau (past), silam (past), kerana (because), agar (so that) and sekiranya (if). A subject tells whom or what the sentence is about, for example dia (he/she), mereka (they) and pengaturcara (programmer). A post-subject, which is 'yang', is commonly used in Malay as the language is a terse language. A conjunction is a word used as a discourse marker (kata penghubung), for example tetapi (but), kerana (because) and dan (and). A predicate tells something about the subject, such as 'makan nasi' (eat rice), 'menterjemah aturcara' (translate a code) and 'bermain permainan komputer' (play computer games).

There are three steps to identify the pattern (pola) of each sentence:

Step 1: Identify the basic sentence.
Step 2: Identify the category of the sentence.
Step 3: Produce the sentence pattern.

For example: Pengkompil menukar bahasa paras tinggi kepada bahasa mesin.
(The compiler changes the high-level language to the machine language)

Step 1: Identify the basic sentence.
It is a basic sentence; refer to rule (NP+VP).

Step 2: Identify the category of the sentence.
Subject (Pengkompil)
Predicate [menukar bahasa paras tinggi kepada bahasa mesin]
Predicate: Verb (menukar), Object (bahasa paras tinggi), Conjunction (kepada), Adverb (bahasa mesin)
Adverb: Object (bahasa mesin)

Step 3: Produce the sentence pattern (pola).
menukar(bahasa paras tinggi, bahasa mesin)

The architecture of the system is shown in Figure 2.9. The diagram decomposes the sentence into adjunct, subject (singular/plural), post-subject (yang/ini), conjunction and predicate (verb, object, adverb); identifies (1) passive or active sentence, (2) basic or compound sentence, (3) the order of the subjects and the objects, and (4) negative sentence; and outputs the pattern Verb [[(subject, object)], Verb (subject, object)].

Figure 2.9: The System Architecture Based on the 'Pola' (Pattern) Sentence

This parser was tested on 19 thesis abstracts consisting of 3,604 words and 173 sentences. As the parser introduces the pola-grammar technique, it does not involve the ambiguity problems of other syntactic parsers. This parser was also compared to Ahmad's Malay Parser; based on the comparison, Ahmad's Malay Parser does not solve the ambiguity problem because it has no probabilistic model for tree structures. Juzaiddin's Malay Parser achieves an F-score in the range of 73% to 93%. The F-score is a parser evaluation measure that combines precision and recall as their harmonic mean (Jurafsky et al., 2000). Based on the reviews, we can make a comparison among the parsers, which is presented in Table 2.2.
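Step 3 above, producing the 'pola' predicate-argument pattern from the Step 2 analysis, can be sketched as follows; the dictionary encoding of the analysis is an assumption made for illustration.

```python
# Turn a Step 2 analysis into the pattern verb(object, adverb-object),
# reproducing menukar(bahasa paras tinggi, bahasa mesin).
analysis = {
    "subject": "pengkompil",
    "verb": "menukar",
    "object": "bahasa paras tinggi",
    "conjunction": "kepada",
    "adverb_object": "bahasa mesin",
}

def pola(a):
    """Format the predicate-argument pattern from an analysed sentence."""
    return "{}({}, {})".format(a["verb"], a["object"], a["adverb_object"])

pola(analysis)   # 'menukar(bahasa paras tinggi, bahasa mesin)'
```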
Table 2.2: Comparison between Charniak's parser, Collins's parser, Ahmad's Malay parser and Juzaiddin's Malay parser

Parser                      Approach                              Type of grammar                             Representation of output
Charniak's parser           Top-down and statistical parser       Markov grammar                              Parse tree
Collins's parser            Statistical parser                    Probabilistic Context-free Grammar (PCFG)   Parse tree
Ahmad's Malay parser        Top-down parser                       Context-free grammar                        Parse tree
Juzaiddin's Malay Parser    Not applicable                        Pola-grammar technique                      Verb [[(subject, object)], Verb (subject, object)]

2.10 Summary of the Chapter

In this chapter, we defined syntactic analysis, then defined and described the details of ambiguity and the statistical parsers available for English. The chapter also discussed the available parsers for the Malay language. Based on the review, and as suggested by Charniak (1997), the simplest method of developing a statistical parser is to use a Probabilistic Context-free Grammar (PCFG). To our knowledge, no attempt has been made to provide probability values for a Malay context-free grammar. Mohd Juzaiddin Ab Aziz et al. (2006) introduced the pola-grammar technique, which does not involve ambiguity problems; however, that parser is only suitable for annotating thematic roles in semantic analysis. To overcome the ambiguity problem in syntactic analysis, probability values should be provided for the Malay context-free grammar rules. In conclusion, this chapter forms a strong base for understanding the concepts of parsing and the ambiguity problem.

CHAPTER 3: PROBABILISTIC MALAY GRAMMAR

Chapter 3 describes an introduction to grammar, the types of grammar in the Malay language, the details of the context-free grammar for the Malay language and the Probabilistic Malay Grammar. The Probabilistic Malay Grammar section explains in detail how the training data were gathered and the steps used to calculate the probabilities.
3.1 Introduction of Grammar

Grammar is an inner regularity and a simple knowledge representation of language (Chomsky 1966, 1971, 1975, 1980). It emerges from language and plays the most important role in achieving the fundamental aims of linguistic analysis. Grammar plays two roles: (i) to separate the grammatical sentences from the ungrammatical sequences, and (ii) to study the structure of grammatical sentences. The grammar of a language is also a device that generates all grammatical sentences of the language and none of the ungrammatical ones (Yuan, 1997).

3.2 Types of Malay Language Grammar

There are three types of grammar in the Malay language: the first is sentence grammar, the second is partial discourse grammar and the third is 'pola' (pattern) sentence grammar.

3.2.1 Sentence Grammar

This type of grammar uses personal, idiolectal (the total amount of a language that any one person knows and uses), artificial-sounding and independent sentences as a guide in forming syntactic Malay sentences. 'Ayat' (sentence) grammar has two models, namely the transformational-generative grammar (Nik Safiah Karim, 1975) and the relational grammar (Yeoh, 1979). The transformational-generative grammar consists of a series of phrase-structure rewrite rules: for example, a series of rules that generates the underlying phrase structure of a sentence, and a series of rules that act upon the phrase structure to form more complex sentences. The relational grammar is a theory of descriptive grammar that states syntactic relations such as the relationship between subject and object. These two models are derived from the context-free grammar (CFG).

3.2.1.1 CFG in Malay Language

The CFG for the Malay language was formulated by Nik Safiah Karim (1995). It became the basis for developing a probabilistic grammar for the Malay language. The CFG that forms a basic sentence in the Malay language is pictured in Figure 3.1.
A → S + P
S → FN
P → FN
P → FK
P → FA
P → FS
FN → (Bil) + (PenjBil) + (Gel) + KN + [KN] + (Pen) + (Pent)
FK → (KBantu) + KKtr + Obj + (Ket)
FK → (KBantu) + KKtr + AKomp + (Ket)
FK → (KBantu) + KKttr + Pel + (Ket)
FK → (KBantu) + KKttr + AKomp + (Ket)
FA → (KBantu) + (KPeng) + Adj + [Adj] + (Ket) + (AKomp)
FS → (KBantu) + KSN + (KNA) + FN + (AKomp)
FS → (KBantu) + KSN + (KNA) + FN + (Ket)

Figure 3.1: Context-Free Grammar for the Malay Language, after Nik Safiah Karim (1995)

Table 3.1: Description of Elements Used in Malay Grammar Rules, after Ahmad I. Z. Abidin et al. (2007)

Element    Description in Malay language (English)
A          Ayat (Sentence)
S          Subjek (Subject)
P          Prediket (Predicate)
Adj        Adjektif (Adjective)
AKomp      Ayat Komplemen (Complementary Sentence)
Bil        Bilangan (Numeric)
FA         Frasa Adjektif (Adjective Phrase)
FK         Frasa Kerja (Verb Phrase)
FN         Frasa Nama (Noun Phrase)
FS         Frasa Sendi (Prepositional Phrase)
Gel        Gelaran (Title)
KBantu     Kata Bantu (Auxiliary)
Ket        Keterangan (Explanation)
KKtr       Kata Kerja Transitif (Transitive Verb)
KKttr      Kata Kerja Tak Transitif (Intransitive Verb)
KNA        Kata Nama Arah (Direction)
KN         Kata Nama (Noun)
KPeng      Kata Penguat (Intensifier)
Obj        Objek (Object)
Pel        Pelengkap (Complement)
Pen        Penerang (Description)
PenjBil    Penjodoh Bilangan (Numerical Coefficient or Classifier)
Pent       Penentu (Determiner)
KSN        Kata Sendi Nama (Preposition)

3.2.2 Partial Discourse Grammar

A discourse grammar is the grammar of sentences as they are used in discourse. Discourse concerns how the immediately preceding sentences affect the interpretation of the following sentences. A partial discourse grammar picks out sentences from discourse in order to make linguistic statements about them. This type of grammar differs from 'ayat' grammar because it uses a "language-first" approach in the writing of syntax, while 'ayat' grammar uses a "theory-first" approach.
According to Azhar Simin (1988), the "language-first" approach represents a chance for a Malay reader to read the latest ideas in his own language about the genius of his language, while the "theory-first" approach is more likely to need the sentence in order to make the chosen theory appear workable.

Example of partial discourse grammar: Aminah membaca buku. Dia juga mendengar radio. (Aminah is reading a book. She is also listening to the radio.) Here dia (she) refers to Aminah.

3.2.3 'Pola' (Pattern) Grammar

'Pola' grammar concerns the patterns of grammar in sentences. This type of grammar is used by Azhar Simin (1988). Each 'pola' is linked to a class-name that forms or helps to form a basic sentence; in other words, each 'pola' is a formula for making a basic sentence.

Example:
'Pola': Pelaku + Perbuatan
Pattern: Actor + Verb
Sentence: Saya makan. (I eat.)

Asmah Hj Omar (1968) represents the most "theoretical" work on 'pola' grammar and provides a methodology for 'pola' grammar writing. Below is the 'pola' grammar for the Malay language:

(i) Pelaku + Perbuatan (Actor + Verb)
(ii) Pelaku + Perbuatan + Pelengkap (Actor + Verb + Complement)
(iii) Perbuatan + Pelengkap (Verb + Complement)
(iv) Diterangkan + Menerangkan (Signified + Signify)
(v) Digolong + Penggolong (Classified + Classifier)
(vi) Pelengkap + Perbuatan + Pelaku (Complement + Verb + Actor)
(vii) Pelengkap + Perbuatan (Complement + Verb)

In forming a basic sentence in the Malay language, the suitable type of grammar is the sentence grammar, which provides rules. The rules are derived from the CFG mentioned in Section 3.2.1.1.

3.3 Rules for Basic Malay Sentence

Basically, to create rules for a sentence in the Malay language, we should follow the CFG for the Malay language by Nik Safiah Karim (1995), as shown in Figure 3.1.
A basic sentence in the Malay language can be derived from these four (4) basic patterns of rules:

A → FN + FN
A → FN + FK
A → FN + FA
A → FN + FS

where A = Ayat (sentence), FN = Frasa Nama (Noun Phrase), FK = Frasa Kerja (Verb Phrase), FA = Frasa Adjektif (Adjective Phrase) and FS = Frasa Sendi (Prepositional Phrase). Each of FN, FK, FA and FS is described in detail in the following sections.

3.3.1 Rules for FN (Noun Phrase, NP)

There is one basic rule for FN, shown in Figure 3.2, where the braces group alternative optional elements:

FN → {Bil + PenjBil | Bil | Gel} + KN + {Pen} + {Pent}

Figure 3.2: Basic Rule for FN

The elements used in the FN basic rule are described below:

Bil = Bilangan (Numerical)
PenjBil = Penjodoh Bilangan (Numerical Coefficient or Classifier)
Gel = Gelaran (Title)
KN = Kata Nama (Noun)
Pen = Penerang (Description)
Pent = Penentu (Determiner)

For example: dua orang pelajar itu (the two students), mereka (they), and murid-murid (pupils).

All of these elements can be considered POS except for Pen or Penerang (Description). Penerang can be categorised into two groups, namely 'Penerang 1' and 'Penerang 2'. 'Penerang 1' contains KN or Kata Nama (Noun), while 'Penerang 2' contains KK or Kata Kerja (Verb), KSN or Kata Sendi Nama (Preposition) and KA or Kata Adjektif (Adjective). The basic rule of FN in Figure 3.2 can be detailed as in Figure 3.3:

FN → {Bil + PenjBil | Bil | Gel} + KN + KN + {KKtr + KN / KKttr | KA | KSN + KN} + {Pent}

Figure 3.3: Basic Rule for FN in Detail

Examples: tiga orang pelajar sekolah itu (the three school students), Datuk Ahmad di dewan parlimen (Datuk Ahmad at Parliament House).

From the above, we can develop many rules for FN, as the elements inside the braces are optional.

3.3.2 Rules for FK (Verb Phrase, VP)

There are two basic rules for FK. Kata kerja (verb) has two distinct types, namely kata kerja transitif, KKtr (transitive verb) and kata kerja tak transitif, KKttr (intransitive verb).
Basic rules:
(i) for KKtr: FK → (KBantu) + KKtr + KN + (Ket)
(ii) for KKttr: FK → KKttr + (Pel)

Figure 3.4: Basic Rules for FK

For example:
KKtr: melamar (propose), menggunakan (use), menerima (receive)
KKttr: tidur (sleep), makan (eat), mandi (bathe)

3.3.3 Rules for FA (Adjective Phrase, AP)

A frasa adjektif is a phrase (a group of words) that relates to a noun. There is one basic rule for FA:

FA → (KBantu) + (KPeng) + KA + (KA) + (Ket) + (AKomp)

Figure 3.5: Basic Rule for FA

For example: amat berani (very brave), cantik (beautiful), renik (very small)

3.3.4 Rules for FS

A frasa sendi (prepositional phrase) is a phrase used to show the relationship of a noun or a pronoun to some other word. There is one basic rule for FS:

FS → (KBantu) + KSN + (KArah) + FN + (Ket)

Figure 3.6: Basic Rule for FS

For example: di dalam pinggan (in the plate), di tepi tangga (beside the ladder), dalam perahu (in the boat), di Kuala Lumpur (in Kuala Lumpur)

As mentioned, the context-free grammar is the most basic grammar for the Malay language, and many syntactic parsers use a CFG as their grammar. Yet syntactic parsing has a main difficulty: it cannot handle syntactic ambiguity. To resolve the problem, we place a statistical element, the probability itself, on each of the Malay grammar rules. We compute probabilities for the CFG of the Malay language and label the result a Probabilistic Context-free Grammar (PCFG).

3.4 Probabilistic Context-Free Grammar (PCFG) for Malay Language

For the English language, there are a few corpora from which rules with probabilities can be obtained, for example the Penn Wall Street Journal corpus (Marcus et al., 1993) and the Brown Corpus (Kucera & Francis, 1979). For the Malay language, however, there is no such corpus containing rules and probabilities; we need to calculate the probabilities for the Malay grammar rules ourselves. To calculate the probability of each rule in the Malay CFG, there are two steps to follow.
First, we need training data containing a collection of basic Malay sentences; in this research project, we collected one thousand (1000) sentences from various sources. Second, we compute the probabilities based on the training data.

3.4.1 Training Data

In gathering the one thousand (1000) basic sentences, various sources were used. The sources are listed below:

(1) Malay text books for primary schools (Zainal Arifin Yusof, Kamarudin Jeon & Mohd Nasar Sukor, 2008; Zainal Arifin Yusof, Kamarudin Jeon & Mohd Nasar Sukor, 2005).
(2) Malay grammar books (Abdullah Hassan, 1993; Abdullah Hassan & Ainon Mohd., 1994a; Abdullah Hassan & Ainon Mohd., 1994b; Abu Naim Kassan, 2001; Nik Safiah Karim et al., 2009).

The training data analysis was evaluated by Malay language experts known as Munsyi Dewan. The Munsyi Dewan are a group of human language experts in the Malay language certified by Dewan Bahasa dan Pustaka (DBP). Their main responsibility is to give consultation and lectures about the Malay language to the public and private sectors (Dewan Bahasa dan Pustaka Malaysia, 2010).

There are two steps in analysing the training data. First, the sentences are tagged according to the Malay lexicon. Second, the Malay rules are derived by matching the tagged sentences against the Malay grammar rules. For example:

Sentence: Umur saya tujuh tahun (My age is seven years old)

Step 1: Tagging process
Umur saya tujuh tahun
KN KN Bil KN

Step 2: Deriving rules process
A → S + P
S → FN
P → FN
FN → KN + KN
FN → Bil + KN

These processes are repeated for all the basic sentences in the training data, and the results are shown in Appendix A. The training data were manually tagged by the Munsyi Dewan.

3.4.2 Analysis Pattern of Training Data

Based on the sentences, they can be categorised into four patterns: FN + FN, FN + FK, FN + FA and FN + FS. The numbers of sentences per pattern are displayed in Table 3.2.
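The two-step analysis above can be sketched as follows. The function is an assumed simplification in which the subject and predicate tag spans are already given; in the real procedure the Munsyi Dewan produced these by hand.

```python
# Derive the grammar rules used by a tagged basic sentence, as in the
# 'Umur saya tujuh tahun' example (subject tags KN KN, predicate tags Bil KN).
def derive_rules(subject_tags, predicate_tags, predicate_phrase="FN"):
    """Return the (LHS, RHS) rules instantiated by one tagged sentence."""
    return [
        ("A", ("S", "P")),                           # A -> S + P
        ("S", ("FN",)),                              # S -> FN
        ("P", (predicate_phrase,)),                  # P -> FN / FK / FA / FS
        ("FN", tuple(subject_tags)),                 # subject phrase expansion
        (predicate_phrase, tuple(predicate_tags)),   # predicate expansion
    ]

derive_rules(["KN", "KN"], ["Bil", "KN"])
# last two rules: FN -> KN + KN and FN -> Bil + KN
```

Accumulating the rules produced this way over all 1000 sentences gives the counts from which the probabilities in the tables below are computed.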
Table 3.2: Analysis of the sentences according to their pattern

PATTERN | NUMBER OF SENTENCES
FN + FN | 80
FN + FK | 739
FN + FA | 141
FN + FS | 40
TOTAL | 1000

Table 3.2 can be represented using a pie chart.

[Pie chart: FN + FK 73.9% (739 sentences); FN + FA 14.1% (141 sentences); FN + FN 8.0% (80 sentences); FN + FS 4.0% (40 sentences)]
Figure 3.7: Pattern Analysis of the Training Data Sentences

The pie chart shows the sentence patterns of the data collection. There are clear differences among the patterns. FN + FK has the highest percentage, 73.9%, which shows that many of the sentences contain a verb. The next pattern is FN + FA, whose percentage is higher than that of FN + FN by 6.1%. The lowest percentage, 4%, belongs to the FN + FS pattern.

The analysis of the basic sentences is shown in several tables. Each table displays the left-hand-side (LHS) rule count, the right-hand-side (RHS) rule count and the probability values. Table 3.3 displays the A (sentence), S (subject) and P (predicate) rules. Table 3.4 presents the FN (noun phrase) rules, while Table 3.5 shows the FK (verb phrase) rules. Tables 3.6 and 3.7 show the FA (adjective phrase) and FS (prepositional phrase) rules respectively. The totals of the rules for each segment are presented in detail in Appendix B.

Table 3.3: Rules for A (sentence), S (subject), and P (predicate)

Num | Rule | LHS Rule Count | RHS Rule Count | RHS/LHS | Probability
1 | A → S + P | 1000 | 1000 | 1000/1000 | 1.0000
2 | S → FN | 1000 | 1000 | 1000/1000 | 1.0000
3 | P → FN | 1000 | 80 | 80/1000 | 0.0800
4 | P → FK | 1000 | 739 | 739/1000 | 0.7390
5 | P → FA | 1000 | 141 | 141/1000 | 0.1410
6 | P → FS | 1000 | 40 | 40/1000 | 0.0400

Table 3.4: Rules for the FN (noun phrase)

Num | Rule | LHS Rule Count | RHS Rule Count | RHS/LHS | Probability
1 | FN → KN | 1304 | 517 | 517/1304 | 0.3965
2 | FN → KN + Pent | 1304 | 162 | 162/1304 | 0.1242
3 | FN → KN + Pent + KA + KN | 1304 | 1 | 1/1304 | 0.0008
4 | FN → KN + Pent + KN + Bil + KN | 1304 | 1 | 1/1304 | 0.0008
5 | FN → KN + Pent + KN + KN | 1304 | 1 | 1/1304 | 0.0008
6 | FN → KN + Pent + KSN + KN + KN | 1304 | 2 | 2/1304 | 0.0015
7 | FN → KN + KKTr + KN + KN | 1304 | 1 | 1/1304 | 0.0008
8 | FN → KN + KKTTr | 1304 | 3 | 3/1304 | 0.0023
9 | FN → KN + KSN + KN + KN + KN | 1304 | 2 | 2/1304 | 0.0015
10 | FN → KN + KSN + KN + Pent | 1304 | 3 | 3/1304 | 0.0023
11 | FN → KN + KN | 1304 | 263 | 263/1304 | 0.2017
12 | FN → KN + KN + KSN + KN + KN | 1304 | 5 | 5/1304 | 0.0038
13 | FN → KN + KN + KSN + KN + KN + Pent | 1304 | 1 | 1/1304 | 0.0008
14 | FN → KN + KN + Bil + PenjBil + KN + KN + KN | 1304 | 1 | 1/1304 | 0.0008
15 | FN → KN + KN + Pent | 1304 | 97 | 97/1304 | 0.0744
16 | FN → KN + KN + KBantu + KA | 1304 | 3 | 3/1304 | 0.0023
17 | FN → KN + KN + KA + KN + KN | 1304 | 1 | 1/1304 | 0.0008
18 | FN → KN + KN + KN | 1304 | 63 | 63/1304 | 0.0483
19 | FN → KN + KN + KN + Pent | 1304 | 12 | 12/1304 | 0.0092
20 | FN → KN + KN + KN + Bil + KN + Pent | 1304 | 1 | 1/1304 | 0.0008
21 | FN → KN + KN + KN + Pent + KN | 1304 | 1 | 1/1304 | 0.0008
22 | FN → KN + KN + KN + KSN + KN | 1304 | 4 | 4/1304 | 0.0031
23 | FN → KN + KN + KN + KN | 1304 | 15 | 15/1304 | 0.0115
24 | FN → KN + KN + KN + KN + KN | 1304 | 3 | 3/1304 | 0.0023
25 | FN → KPeng + KA + KN + KN | 1304 | 4 | 4/1304 | 0.0031
26 | FN → Bil + KN + KN | 1304 | 3 | 3/1304 | 0.0023
27 | FN → KN + KN + KSN + KN | 1304 | 2 | 2/1304 | 0.0015
28 | FN → KN + KBantu + KN + KBantu | 1304 | 1 | 1/1304 | 0.0008
29 | FN → KN + KKTTr + KA | 1304 | 1 | 1/1304 | 0.0008
30 | FN → KN + KSN + KN | 1304 | 19 | 19/1304 | 0.0146
31 | FN → KN + PenjBil + KN + Pent | 1304 | 1 | 1/1304 | 0.0008
32 | FN → Bil + KN | 1304 | 22 | 22/1304 | 0.0169
33 | FN → PenjBil + KN + KN | 1304 | 1 | 1/1304 | 0.0008
34 | FN → Bil + PenjBil + KN | 1304 | 23 | 23/1304 | 0.0176
35 | FN → KN + KBantu + KA + Pent | 1304 | 1 | 1/1304 | 0.0008
36 | FN → Gel + KN | 1304 | 2 | 2/1304 | 0.0015
37 | FN → KBantu + KSN + KN + KN | 1304 | 1 | 1/1304 | 0.0008
38 | FN → Bil + KN + KN + Pent | 1304 | 1 | 1/1304 | 0.0008
39 | FN → KN + KPeng + KA | 1304 | 1 | 1/1304 | 0.0008
40 | FN → KBantu + KKTr + KN + KSN + KN + Pent | 1304 | 1 | 1/1304 | 0.0008
41 | FN → KN + KN + KBantu + KKTTr | 1304 | 1 | 1/1304 | 0.0008
42 | FN → KN + KSN + KN + KN | 1304 | 5 | 5/1304 | 0.0038
43 | FN → KN + KSN + KKTTr | 1304 | 1 | 1/1304 | 0.0008
44 | FN → Bil + PenjBil + KSN + KN | 1304 | 1 | 1/1304 | 0.0008
45 | FN → Bil + PenjBil + KN + KN | 1304 | 6 | 6/1304 | 0.0046
46 | FN → KN + KN + KBantu | 1304 | 1 | 1/1304 | 0.0008
47 | FN → KN + KSN + KA | 1304 | 2 | 2/1304 | 0.0015
48 | FN → Bil + KN + KSN + KN | 1304 | 1 | 1/1304 | 0.0008
49 | FN → KN + KSN + KN + KBantu | 1304 | 1 | 1/1304 | 0.0008
50 | FN → Bil + PenjBil | 1304 | 1 | 1/1304 | 0.0008
51 | FN → KN + KN + KSN + KN + Pent | 1304 | 1 | 1/1304 | 0.0008
52 | FN → KN + KSN + KN + KA | 1304 | 1 | 1/1304 | 0.0008
53 | FN → KN + KSN + KN + KSN + KN | 1304 | 2 | 2/1304 | 0.0015
54 | FN → KN + KN + KSN + KN + KN + KN + KN | 1304 | 1 | 1/1304 | 0.0008
55 | FN → KN + KN + Pent + KSN + KN | 1304 | 1 | 1/1304 | 0.0008
56 | FN → KN + KKTTr + KKTTr | 1304 | 1 | 1/1304 | 0.0008
57 | FN → Bil + PenjBil + KN + KA | 1304 | 1 | 1/1304 | 0.0008
58 | FN → KN + Pent + KSN + KN | 1304 | 2 | 2/1304 | 0.0015
59 | FN → KN + Bil | 1304 | 3 | 3/1304 | 0.0023
60 | FN → KN + PenjBil + KN | 1304 | 1 | 1/1304 | 0.0008
61 | FN → KN + Pent + KN | 1304 | 2 | 2/1304 | 0.0015
62 | FN → KN + Pent + KN + KN | 1304 | 2 | 2/1304 | 0.0015
63 | FN → KN + KBantu + KBantu + KKTTr + Pent | 1304 | 1 | 1/1304 | 0.0008
64 | FN → KN + KN + KSN + KN | 1304 | 1 | 1/1304 | 0.0008
65 | FN → KN + KSN + KN + Bil + KN | 1304 | 2 | 2/1304 | 0.0015
66 | FN → Bil + KN + KBantu + KA | 1304 | 1 | 1/1304 | 0.0008
67 | FN → KN + Pent + KA | 1304 | 1 | 1/1304 | 0.0008
68 | FN → KN + KBantu + KPeng + KA | 1304 | 1 | 1/1304 | 0.0008
69 | FN → KN + KN + Pent + KN | 1304 | 1 | 1/1304 | 0.0008
70 | FN → KN + Bil + KN | 1304 | 1 | 1/1304 | 0.0008
71 | FN → KN + KN + Bil + KN | 1304 | 1 | 1/1304 | 0.0008
72 | FN → KN + Bil + PenjBil + KN | 1304 | 6 | 6/1304 | 0.0046
73 | FN → KN + KN + Bil + PenjBil + KN | 1304 | 1 | 1/1304 | 0.0008
74 | FN → KN + Pent + KSN + KN + KBantu + KSN + KN | 1304 | 1 | 1/1304 | 0.0008
75 | FN → KN + KBantu + KN + Pent | 1304 | 1 | 1/1304 | 0.0008

Table 3.5: Rules for the FK (verb phrase)

Num | Rule | LHS Rule Count | RHS Rule Count | RHS/LHS | Probability
1 | FK → KKTr + FN | 704 | 339 | 339/704 | 0.4815
2 | FK → KBantu + KKTr + FN | 704 | 61 | 61/704 | 0.0866
3 | FK → KKTTr | 704 | 67 | 67/704 | 0.0952
4 | FK → KKTTr + KA | 704 | 26 | 26/704 | 0.0369
5 | FK → KKTTr + KSN + KN | 704 | 50 | 50/704 | 0.0710
6 | FK → KKTTr + KSN + KN + KN | 704 | 36 | 36/704 | 0.0511
7 | FK → KKTTr + KSN + KN + KN + KN | 704 | 5 | 5/704 | 0.0071
8 | FK → KBantu + KKTTr | 704 | 55 | 55/704 | 0.0781
9 | FK → KBantu + KKTTr + KSN + KN | 704 | 12 | 12/704 | 0.0170
10 | FK → KBantu + KKTTr + KSN + KN + KN | 704 | 3 | 3/704 | 0.0043
11 | FK → KNafi + KKTTr + KSN + KN + KN | 704 | 1 | 1/704 | 0.0014
12 | FK → KKTTr + KSN + Bil + KN + KN | 704 | 1 | 1/704 | 0.0014
13 | FK → KKTTr + KN + KN | 704 | 2 | 2/704 | 0.0028
14 | FK → KBantu + KKTTr + KA | 704 | 2 | 2/704 | 0.0028
15 | FK → KKTTr + KN + Bil + KN | 704 | 1 | 1/704 | 0.0014
16 | FK → KNafi + KBantu + KKTTr + KN + KN | 704 | 1 | 1/704 | 0.0014
17 | FK → KBantu + KKTTr + KBantu + KN | 704 | 1 | 1/704 | 0.0014
18 | FK → KKTTr + KSN + Pent | 704 | 1 | 1/704 | 0.0014
19 | FK → KKTTr + KN + KN + KN | 704 | 2 | 2/704 | 0.0028
20 | FK → KKTTr + KBantu + Bil + PenjBil | 704 | 1 | 1/704 | 0.0014
21 | FK → KKTTr + KSN + KN + KBantu + Bil + KN | 704 | 1 | 1/704 | 0.0014
22 | FK → KKTTr + KSN + KN + KN + KSN + KN + KN | 704 | 1 | 1/704 | 0.0014
23 | FK → KKTTr + KSN + KN + Pent | 704 | 1 | 1/704 | 0.0014
24 | FK → KKTTr + KSN + KN + KN + Pent | 704 | 2 | 2/704 | 0.0028
25 | FK → KKTTr + KPeng + KA | 704 | 2 | 2/704 | 0.0028
26 | FK → KKTTr + KSN + KBantu + KKTr + KN | 704 | 1 | 1/704 | 0.0014
27 | FK → KKTTr + KSN + KN + KSN + KN + KN | 704 | 2 | 2/704 | 0.0028
28 | FK → KKTTr + KSN + KN + KSN + KKTr + KN | 704 | 1 | 1/704 | 0.0014
29 | FK → KKTTr + KN + KSN + KN + KN + KKTTr | 704 | 1 | 1/704 | 0.0014
30 | FK → KKTTr + KSN + KN + KN + Pent | 704 | 2 | 2/704 | 0.0028
31 | FK → KBantu + KKTTr + KN + KSN + Bil + KN + KN | 704 | 1 | 1/704 | 0.0014
32 | FK → KKTTr + KBantu + KKTTr + KN | 704 | 1 | 1/704 | 0.0014
33 | FK → KNafi + KKTTr + KBantu | 704 | 1 | 1/704 | 0.0014
34 | FK → KKTTr + KA + KSN + KN + KN | 704 | 1 | 1/704 | 0.0014
35 | FK → KKTTr + KN | 704 | 2 | 2/704 | 0.0028
36 | FK → KKTTr + Bil | 704 | 1 | 1/704 | 0.0014
37 | FK → KKTTr + KKTr + KN + KN | 704 | 1 | 1/704 | 0.0014
38 | FK → KKTTr + KN + KN + KN + KN | 704 | 1 | 1/704 | 0.0014
39 | FK → KBantu + KKTTr + KSN + KN + KN | 704 | 1 | 1/704 | 0.0014
40 | FK → KBantu + KKTr + KKTTr | 704 | 1 | 1/704 | 0.0014
41 | FK → KBantu + KKTTr + KA | 704 | 4 | 4/704 | 0.0057
42 | FK → KPeng + KKTTr | 704 | 2 | 2/704 | 0.0028
43 | FK → KBantu + KKTTr + KPeng | 704 | 1 | 1/704 | 0.0014
44 | FK → KBantu + KKTTr + KBantu | 704 | 2 | 2/704 | 0.0028
45 | FK → KKTTr + KSN + KN + KBantu + KA | 704 | 1 | 1/704 | 0.0014
46 | FK → KKTTr + KSN + KN + Bil + KN | 704 | 1 | 1/704 | 0.0014
47 | FK → KKTTr + KKTTr | 704 | 1 | 1/704 | 0.0014

Table 3.6: Rules for the FA (adjective phrase)

Num | Rule | LHS Rule Count | RHS Rule Count | RHS/LHS | Probability
1 | FA → KA + KSN + KN + KN + KBantu + KA | 141 | 1 | 1/141 | 0.0071
2 | FA → KA + KPeng | 141 | 15 | 15/141 | 0.1064
3 | FA → KA + KN | 141 | 3 | 3/141 | 0.0213
4 | FA → KA + KN + KN + KN + KN | 141 | 1 | 1/141 | 0.0071
5 | FA → KPeng + KA | 141 | 39 | 39/141 | 0.2766
6 | FA → KPeng + KA + KKTTr | 141 | 3 | 3/141 | 0.0213
7 | FA → KBantu + KN + KA | 141 | 1 | 1/141 | 0.0071
8 | FA → KBantu + KPeng + KA | 141 | 1 | 1/141 | 0.0071
9 | FA → KBantu + KA + KPeng | 141 | 1 | 1/141 | 0.0071
10 | FA → KPeng + KA + KBantu + KN + KN + Pent | 141 | 1 | 1/141 | 0.0071
11 | FA → KBantu + KA | 141 | 11 | 11/141 | 0.0780
12 | FA → KBantu + KA + KSN + KN + Pent | 141 | 2 | 2/141 | 0.0142
13 | FA → KBantu + KA + KPeng + KSN + KN + KN + KN + KN | 141 | 1 | 1/141 | 0.0071
14 | FA → KA + KSN + KN | 141 | 1 | 1/141 | 0.0071
15 | FA → KPeng + KA + KKTTr | 141 | 2 | 2/141 | 0.0142
16 | FA → KPeng + KA + KSN + KN + KN | 141 | 1 | 1/141 | 0.0071
17 | FA → KA | 141 | 44 | 44/141 | 0.3121
18 | FA → KA + KSN + KN + KN | 141 | 1 | 1/141 | 0.0071
19 | FA → KA + KA | 141 | 8 | 8/141 | 0.0567
20 | FA → KA + KA + KN | 141 | 1 | 1/141 | 0.0071
21 | FA → KA + KKTr + KN | 141 | 1 | 1/141 | 0.0071
22 | FA → KA + KN + KKTTr | 141 | 1 | 1/141 | 0.0071
23 | FA → KBantu + KA + KBantu + KKTTr + KSN + KN | 141 | 1 | 1/141 | 0.0071

Table 3.7: Rules for the FS (prepositional phrase)

Num | Rule | LHS Rule Count | RHS Rule Count | RHS/LHS | Probability
1 | FS → KSN + FN | 40 | 39 | 39/40 | 0.975
2 | FS → KBantu + KSN + FN | 40 | 1 | 1/40 | 0.025

3.5 Summary of the Chapter

In this chapter, Malay grammar was reviewed in detail, and the grammar rules for forming a basic sentence in the Malay language were described. The chapter also showed how probability values are calculated from the training data. The one thousand sentences of the training data were gathered from various sources. Each probability value is computed and stored in the database together with its rule. The following chapter shows how the probability values are used in the parsing engine.

CHAPTER 4: DEVELOPMENT OF MALAY STATISTICAL PARSER

The system development method used for the Malay Statistical Parser is the prototyping model. This method is used to test or illustrate an idea and to build a system in an explorative way (Hawryszkiewycz, 1998). The development process using this method covers four main stages: requirement specification, system design, implementation and evaluation.

4.1 Requirement Specification

The requirement for the Malay Statistical Parser is to develop a parser that can parse an ambiguous sentence and present the parse tree with the highest probability value.

4.1.1 Functional Requirement

There is one functional requirement for this prototype: to parse a sentence. The sentence input by the user must be a basic, declarative sentence. The prototype cannot accept a text file as input. Furthermore, the length of the sentence must not exceed ten (10) words.
The reason is that the sentences in the training data are limited to the same length. Once the user inputs a sentence and clicks the 'Parse' button, the system parses the sentence and, if parsing succeeds, automatically calculates the probability. If the sentence is ambiguous, the parser shows the possible parse trees with their probability values. Finally, it selects the tree with the highest value as the best parse tree for the ambiguous sentence.

4.1.2 Non-functional Requirement

A non-functional requirement, or constraint, describes a restriction on the system that limits our choices for constructing a solution to a problem (Pfleeger, 1998). The non-functional requirements for the Malay Statistical Parser are stated below:

(i) Usability
The parser is useful for those who want to check the best parse-tree representation of an ambiguous sentence. It is also helpful for those who want to check the grammar of a basic sentence, limited to ten words.

(ii) Platform constraint
The prototype shall run under the Windows platform only.

(iii) Response time
On average, a result shall be returned within 3 seconds.

(iv) Reliability
The input to the parser shall be a sequence of words. The parser cannot accept numbers unless they are spelled out.

4.2 System Design

The system is described in terms of its architecture and user interface design.

4.2.1 System Architecture

The system architecture of the Malay Statistical Parser has five components, namely the Parsing Engine, the Part-of-Speech (POS) Tagger, the Malay Lexicon (MaLEX), the Input Component and the Output Component. The architecture is shown in Figure 4.1.

[Block diagram: User → Input Component → POS Tagger (backed by MaLEX) → Parsing Engine (CFG with probability, i.e. PCFG) → Output Component (parse tree)]
Figure 4.1: System Architecture of a Statistical Parser for Malay Language

The statistical parser aims to reduce structural ambiguity in Malay sentences.
The parser produces several parse trees for an ambiguous sentence and computes the probability of each tree to determine the most probable one, that is, the tree with the highest probability value.

The process of parsing a sentence starts when a user inputs a sentence. The sentence is tagged by the POS tagger, which chooses a single tag for each word in the sentence. A tag is the part of speech (POS) of a word. The Malay lexicon (MaLEX) provides the words and tags to the parser. After tagging, the next step is parsing, which involves the parsing engine. The engine parses the test sentence and checks which grammar rules apply to it. If the test sentence is ambiguous, the engine calculates the probability of each possible parse tree. The output of the parsing process is the parse tree with the highest probability value. Each of the components in the system is described in detail below.

4.2.1.1 Input Component

The input component accepts a Malay sentence. In the Malay language, there are four types of sentences based on their purpose, namely:

(i) Declarative sentence (Ayat penyata)
A declarative sentence is a sentence whose predicate explains something about the subject. For example: Ini rumah saya. (This is my house.)

(ii) Interrogative sentence (Ayat tanya)
An interrogative sentence is used to pose a question or to seek information. For example: Berapa umur anda? (How old are you?)

(iii) Imperative sentence (Ayat perintah)
An imperative sentence gives an order or makes a request. For example: Menangislah sepuas-puas hati. (Cry your heart out.)

(iv) Exclamatory sentence (Ayat seru)
An exclamatory sentence uses intonation to express a response to an emotion such as joy, surprise, disbelief, fear, anger, sorrow or pain. For example: Amboi, cantiknya kereta ni! (What a beautiful car!)
In this system, the user needs to enter a declarative sentence, as it follows the context-free grammar (CFG) rules that were derived from Nik Safiah Karim (1995). The user also needs to enter a basic sentence rather than a complex sentence. A basic sentence is a sentence that becomes the base or source for forming other sentences. It has only a subject and a predicate, and is also known as a kernel sentence or simple sentence (Nik Safiah Karim et al., 2009). For example:

Rumah itu (subjek/subject) terbakar (prediket/predicate). (The house is burnt.)

A complex sentence is a sentence that comes from a combination of two or more subjects or predicates. For example:

Ahmad seorang pelajar yang pintar tetapi dia berasal dari keluarga yang susah. (Ahmad is a clever boy but he is from a poor family.)

The sentence consists of two basic sentences:
(i) Ahmad seorang pelajar yang pintar (Ahmad is a clever boy)
(ii) Dia berasal dari keluarga yang susah (He is from a poor family)

The two sentences are joined using the conjunction tetapi (but). However, only simple sentences are allowed to be entered into the system.

4.2.1.2 Part-of-Speech (POS) Tagger

POS tagging is the process of assigning a POS or other syntactic class marker to each word in a corpus. The POS tagger involves two main components: Tokenization and Ambiguity Look-Up. Tokenization is the process of breaking a stream of text up into meaningful elements called tokens. For example, the sentence "Rumah itu terbakar" (That house burnt) is divided into three tokens: rumah (house), itu (that) and terbakar (burnt). Ambiguity Look-Up involves the use of the Malay Lexicon. In the lexicon, there are ninety (90) words that have more than one part of speech, and these words can lead to ambiguous sentences. For example, the word "gemar" (favour) has two parts of speech: KKTr (transitive verb) and KBantu (auxiliary).
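The ambiguity look-up can be sketched as below. The lexicon fragment is hypothetical (only the entry for "gemar" reflects the example above), and the function simply enumerates every candidate tag sequence that the lexicon licenses:

```python
# Hypothetical MaLEX fragment: each word maps to its possible POS tags.
# Only "gemar" (KKTr or KBantu) is taken from the example above.
LEXICON = {
    "dia": ["KN"],
    "gemar": ["KKTr", "KBantu"],     # ambiguous word
    "melancong": ["KKTTr"],
}

def candidate_tag_sequences(tokens):
    """Enumerate every tag sequence licensed by the lexicon."""
    sequences = [[]]
    for token in tokens:
        tags = LEXICON.get(token.lower(), ["?"])   # "?" = unrecognized word
        sequences = [seq + [tag] for seq in sequences for tag in tags]
    return sequences

print(candidate_tag_sequences(["Dia", "gemar", "melancong"]))
# → [['KN', 'KKTr', 'KKTTr'], ['KN', 'KBantu', 'KKTTr']]
```

An ambiguous word such as "gemar" thus doubles the number of candidate tag sequences, which is exactly what forces the parser to compare several parse trees later.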
[Parse tree with structure A → S (FN → KN) + P (FK → KKTr + Objek) over "Dia gemar melancong"]
Figure 4.2: A parse tree of the sentence "Dia gemar melancong" (Mohd Juzaiddin Ab Aziz et al., 2006)

[Parse tree with structure A → S (FN → KN) + P (FK → KBantu + KKTr + Objek) over "Dia gemar menulis aturcara"]
Figure 4.3: A parse tree of the sentence "Dia gemar menulis aturcara" (Mohd Juzaiddin Ab Aziz et al., 2006)

The tagging process can be summarised by the tagging algorithm below. The input to the POS tagger is a string of words (a sentence); the output is a single tag for each word. The pseudocode of the algorithm is shown below:

Start
Read a string of words
while (the string does not reach the end)
{
    if (the word is in the lexicon)
        return the POS
    else
        mark the POS as unrecognized
}
End

Figure 4.4: The Pseudocode for the POS Tagger

For example, if the sentence "Rumah itu terbakar" (The house is burnt) is entered as input, the output from the POS tagger is:

Before tagging: Rumah itu terbakar
After tagging:  KN (Noun)  Pent (Determiner)  KKttr (Intransitive Verb)

4.2.1.3 Malay Lexicon (MaLEX)

MaLEX is a computerized lexical database of the Malay language designed as a tool for natural language parsing in this research project. MaLEX contains two tables. The first table, Lexicon, provides the lexical class and meaning of each word; the second table holds the grammar rules and their probability values. MaLEX contains 39,190 words based on the Kamus Oxford Fajar, 2nd Edition (Hawkins, 1997). The lexical classes provided in the lexicon include KN (noun), KA (adjective), KSN (preposition), KKTr (transitive verb), KKttr (intransitive verb), Bil (count) and Gel (title). The words are arranged in alphabetical order. MaLEX also holds various types of words, such as root words and derivative words like lari (run) and berlari (running).
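The two-table organisation of MaLEX can be sketched as two in-memory mappings. This is an illustrative sketch only; the entries shown are the handful of examples given in this chapter, not the full 39,190-word database:

```python
# Sketch of the two MaLEX tables (illustrative entries only).
LEXICON_TABLE = {
    # word: (lexical class, meaning)
    "itu": ("Pent", "kata petunjuk kepada satu benda"),
    "rumah": ("KN", "binaan untuk tempat tinggal"),
    "terbakar": ("KKttr", "sedang atau sudah menyala kerana terbakar"),
}

RULES_TABLE = {
    # (LHS, RHS): probability value derived from the training data
    ("A", "S + P"): 1.0000,
    ("S", "FN"): 1.0000,
    ("P", "FN"): 0.0800,
    ("FN", "KN + Pent"): 0.1242,
    ("FN", "KN + KN"): 0.2017,
    ("FN", "Bil + KN"): 0.0169,
}

def tag(word):
    """Look a word up in the Lexicon table; None if it is not in MaLEX."""
    entry = LEXICON_TABLE.get(word.lower())
    return entry[0] if entry else None

print([tag(w) for w in ["Rumah", "itu", "terbakar"]])   # → ['KN', 'Pent', 'KKttr']
```

A `None` result corresponds to the "word not found in MaLEX" failure case handled by the error messages described later in this chapter.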
The current version of the lexicon is MaLEX 1.0, which is in Access format (Microsoft Corporation, 2002). It was built by a group of undergraduate students of the Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia. Some examples of words from MaLEX are shown in Table 4.1:

Table 4.1: A Few Examples of Words, Lexical Classes and their Synonyms from MaLEX

Words | Lexical Class | Synonyms
itu | Pent | kata petunjuk kepada satu benda
rumah | KN | binaan untuk tempat tinggal
terbakar | KKttr | sedang atau sudah menyala kerana terbakar

In this research project, however, an enhancement has been made to MaLEX. A second table has been added to the lexicon, containing the Malay grammar rules and their probability values. The grammar rules are derived from the training data, and each rule is associated with a probability value. Some examples of rules and their probability values are shown in Table 4.2.

Table 4.2: A Few Examples of Rules and Probability Values from MaLEX

Rules | Probability Values
A → S + P | 1.0000
S → FN | 1.0000
P → FN | 0.0800
FN → KN + Pent | 0.1242
FN → KN + KN | 0.2017
FN → Bil + KN | 0.0169

4.2.1.4 Parsing Engine

The most important part of the Malay Statistical Parser is the parsing engine. The parsing engine parses a tagged sentence and assigns a probability to each of the rules involved in the parsed sentence. The engine matches the tagged sentence against the rules embedded inside the engine. The process is illustrated in Figure 4.5.

[Figure 4.5 walks the sentence "Bapa saya pemandu teksi" through the pipeline: the input sentence is tokenized and tagged by the POS tagger using MaLEX (every word is tagged KN); the parsing engine then applies the PCFG rules A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN (0.2017) and FN → KN + KN + KN (0.0483), producing two parse trees: Parse Tree (1) with probability 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3, and Parse Tree (2) with probability 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10-3.]
Figure 4.5: The Process in the Parsing Engine

Based on Figure 4.5, the parser chooses the second parse tree, which has the highest probability value, as the structure for the sentence "Bapa saya pemandu teksi". Another example, parsing the sentence "Pemandu itu sangat letih", is shown in Figure 4.6.

[Figure 4.6 walks "Pemandu itu sangat letih" through the same pipeline: the POS tagger produces KN Pent KPeng KA; the parsing engine applies A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + Pent (0.1242) and FA → KPeng + KA (0.2017), yielding a single parse tree with probability 1.0000 x 1.0000 x 0.1410 x 0.1242 x 0.2017 = 3.520 x 10-3.]
Figure 4.6: Another Process in the Parsing Engine

As the sentence "Pemandu itu sangat letih" is unambiguous, only one parse tree is presented with a probability value.

4.2.1.5 Output Component

The output of the statistical parser is a parse tree together with the probability value of the parsed sentence. For an ambiguous sentence, the output component shows several parse trees and their probability values. For example, the output for the previous example is shown in Figure 4.7.
[Figure 4.7 shows the two parse trees for "Bapa saya pemandu teksi": Parse Tree (1), with FN → KN for the subject and FN → KN + KN + KN for the predicate, probability 1.532 x 10-3; and Parse Tree (2), with FN → KN + KN for both subject and predicate, probability 3.255 x 10-3.]
Figure 4.7: Example of Parse Trees and their Probability Values

4.2.2 User Interface Design

There is one main interface for the Malay Statistical Parser, shown in Figure 4.8.

[Figure 4.8 shows the main interface: the input component on the left-hand side and the result of the tokenization process on the right-hand side.]
Figure 4.8: Main Interface for the Malay Statistical Parser

Referring to Figure 4.8, the left-hand side is the input component, where the user enters "Bapa saya pemandu teksi" as the input to the parser. The right-hand side shows the output of the tokenization process. This process only happens after the user clicks the 'Proses' button. When the sentence is successfully parsed, the message "Ayat ini sah" (the sentence is syntactically correct) is shown, as in Figure 4.9.

Figure 4.9: Output of the Parser

The parser then shows the parse tree with the highest probability value. For example, the output for the sentence is shown in Figure 4.10.

Figure 4.10: Output Component for the sentence "Bapa saya pemandu teksi"

The probability is calculated and the result is shown in a message box, as in Figure 4.11.

Figure 4.11: One of the Probability Values for the Parsed Sentence

However, if the system is unable to parse the sentence, an error message is presented. There are two reasons a sentence may fail to parse. The first is that some words in the sentence cannot be found in the Malay Lexicon (MaLEX); this is indicated on the right-hand side of the main interface window, as demonstrated in Figure 4.12.

Figure 4.12: The main interface with the error message "Tiada dalam database"

The second reason is that the parsed sentence does not match the rules embedded in the parser, in which case the error message shown in Figure 4.13 is displayed.
Figure 4.13: Error Message for an Unsuccessfully Parsed Sentence

4.3 Summary of the Chapter

This chapter discussed the design and development process of the Malay Statistical Parser prototype. It started with the requirements of the parser, both functional and non-functional, followed by the system architecture and user interface design. The evaluation of the parser is discussed in Chapter 5.

CHAPTER 5: EXPERIMENTS AND RESULTS

This chapter explains the test dataset used in the experiments, which contains four (4) different patterns of Malay basic sentences. It then proceeds to the results of the prototype using three measurements, namely precision, recall and f-score.

5.1 Test Dataset

The test dataset is a set of data used to test the parser. For the Malay Statistical Parser, the test data covers the four patterns of basic sentences and contains syntactic ambiguity. All the sentences were acquired from two experts in Malay grammar, Tuan Hj Sufian b Hj Afandi and Tuan Hj Nawi b Ismail. Both are munsyi dewan appointed by Dewan Bahasa dan Pustaka (Dewan Bahasa dan Pustaka Malaysia, 2010). All of the sentences are correct according to Malay grammar. There are one hundred (100) test sentences in total, and each was tested with the prototype. The length of each test sentence is not more than ten words. Below is a list of the sentences used in the experiments.

Sentences for FN + FN (Noun Phrase + Noun Phrase)
T2: Baju kakak batik Terengganu. (My sister's attire is Terengganu batik)
T5: Pegawai itu pengurus syarikat (The officer is a company manager)
T15: Coklat koko makanan kegemaran adik (Chocolate is my sister's favourite)
T23: Kakak saya guru sekolah (My sister is a teacher)
T24: Sepupu saya jurutera binaan (My cousin is a civil engineer)
T30: Adik saya murid sekolah rendah.
(My brother is a pupil) T33: Makcik saya kerani sekolah (My auntie is a school clerk) T34: Kawan abang tentera laut (My brother’s friend is a navy) T39: Gadis itu model sambilan. (The girl is a part-time model) T48: Pulau peranginan milik kerajaan negeri (The island resort is owned by state government) T49: Kakak jurusolek butik pengantin (My sister is a bridal boutique beautician) T50: Beliau atlet negara Malaysia. (He is a Malaysian athlete) T52: rumah saya rumah kayu (My house is a wooden house). T55: Pelajar itu pelajar cemerlang (The student is an excellent student). T65: kereta idaman saya kereta mewah (My dream car is a luxury car) T73: sekolah kami sekolah harapan (our schools are school expectations) T74: guru saya guru besar (Our teacher is a headmaster) T80: pakcik saya rakan kongsi ayah (My uncle is my father’s share partner) T83: datuk saya pesara polis (My grandfather was a policeman) T84: rakan saya pegawai bomba (My friend is a fireman officer) T89: lelaki itu pengacara televisyen (The man is a television host) T96: beliau pemimpin besar negara (he is a national leader) T97: dia pelajar kolej swasta (She is a private college student) T98: rakan kami pengusaha kedai perabot (Our friend is a furniture businessman). 80 T99: ibu pengusaha butik pengantin (My mother is a bridal boutique businesswoman) T100: beliau pemain bola sepak (He is a footballer) Sentence for FN + FK (Noun Phrase + verb Phrase) T1: Kami adang air itu. (We deter the water) T4: Saya bagi fakir itu wang. (I give money to the poor) T7: Beg adik berwarna coklat (The colour of my younger brother’s bag is brown) T10: Saya eskot pengetua ke pentas (I escort the principal to the stage) T11: Kami garuk tanah. (We scratch the soil) T12: Saya gali sendiri telaga itu. 
(I dig the well by myself) T20: Baju kakak berwarna putih (The colour of my sister’s cloth is white) T21: Kapal milik keluarga saya labuh di pelabuhan klang (My family’s ship wagons at Port Klang) T22: Aku lambung duit syiling (I lob the coins) T25: saya kopek buah kelapa itu. (I peel the coconut fruit) T29: Buah gajus rasa kelat. (The cashew fruit taste bitter) T32: Aku cas bateri itu. (I charge the battery) T37: Ibu saya bagi kucing itu makanan (My mother gives the food to the cat) T38: guru kami bagi markah sangat rendah (Our teacher gives us too low marks) T31: saya pagar reban itu (I enclose the hens’ shed) T46: Kami bentuk adunan biskut itu (We shape the dough of biscuit) T47: Pelajar malas itu benak dalam semua subjek (The lazy student slows in learning all the subjects) 81 T51: Saya kepit suratkhabar itu. (I clamp the newspaper) T54: orang kaya itu bagi sedekah (the rich man gives donation). T57: Gigi adik berwarna putih (My brother’s teeth is white) T60: Saya daftar subjek baru semester hadapan (I register the new subject for coming semester) T61: Kami garuk sungai yang cetek (We dig the shallow river) T62: Saya gali lubang sampai dalam (I dig the hole deeply) T70: kasut ayah berwarna coklat (My father’s shoes color is brown) T72: kami lambung pengantin lelaki itu (We lobbed the bridegroom.) T75: aku kopek buah limau itu. (I peel the orange fruit) T79: buah strawberi rasa masam (The strawberry fruits taste sour) T81: kami pagar kandang itu. (We gate the cage) T82: kami cas generator ini (We charge this hgenerator) T87: abang saya bagi budak itu duit (My brother gives the boy money) T88: pengadil bagi mata sangat tinggi (the referee gives the high points) Sentence for FN + FA T3: Seterika antik sangat mahal. (The antique iron is very expensive) T6: Atap rumah saya bocor. (My house’s roof is leaking) T8: Pipi mukanya bengkak (Her face is swollen) T9: Kata-kata wanita itu bisa. (Her words are really hurt) T13: Nasi godak kenduri sangat sedap. 
(The special feast rice is very delicious) T14: Puteri ketujuh paling gombang (The seventh princess is very pretty) T16: Suara penyanyi wanita agak garuk (The singer’s voice is quite husky) 82 T17: Sakit hati saya makin buku (My heart feels so hurt) T18: Buah beri itu manis. (The berry fruit is sweet) T19: Ucapan guru kaunseling itu cacat. (The counseling teacher’s speech is flawed) T26: Jawapan murid itu konkrit (The student’s answer is concrete) T27: bunyi derai kaca amat ngilu (The sound of broken glass is very unpleasant) T28: pelajar cemerlang sungguh dinamik (The excellent student is a very dynamic person) T35: kawan kakak sangat cantik (My sister’s friend is very beautiful) T36: kulit bayi sangat halus (The baby skin is so gentle) T40: kereta kepunyaan ayah baru (My father’s car is new) T41: Pembetulan tesis saya minor (My thesis’s corrections are minor) T44: baju adik baru (my sister’s cloth is new) T45: pembedahan ibu minor (my mother’s operation is minor) T53: baju ibu sangat murah (My mom’s attire is so cheap). T56: belon kepunyaan adik bocor (My younger sister balloons’ leak) T58: Perut adik buntal (My brother’s stomach distended) T59: sengat tebuan itu bisa (the sting hornet is hurt) T63: sayur hijau sangat segar (the green vegetables is very fresh). T64: kereta kepunyaan rakan amat besar (my friend’s car is very big). T66: badan pesakit diabetis makin kurus (The diabetes patients become slimmer) T67: kereta buatan tempatan makin mahal (The local cars become more expensive) T68: badan pelakon itu langsing (The actress body is slim) T69: bangunan milik kerajaan itu runtuh (the government’s building is collapsed.) T71: kain sekolah pelajar itu labuh (the student’s skirt is trailing) T76: binaan bangunan itu konkrit (The structure of the building is concrete) 83 T77: jiran rumah kami amat baik (our neighbor is so kind) T78: adik saya sangat nakal (my brother is very naughty) T85: bunga mawar sangat wangi (the rose is so sweet-smelling). 
T86: anak kakak sangat comel (my sister's daughter is so cute)
T90: kereta kepunyaan ayah baru (My father's car is new)
T91: Pembetulan tesis saya minor (My thesis corrections are minor)
T94: perkakasan sekolah adik baru (My brother's school stationery is new)
T95: murid sekolah itu rajin (the school pupil is hardworking)

Sentences for FN + FS (Noun Phrase + Prepositional Phrase)
T42: Bapa saya ke pejabat (My father goes to the office)
T43: Asal benang daripada kapas (Thread originates from cotton)
T92: rumah kami di bandar (Our house is in the city)
T93: penduduk kampung ke sawah padi (The villagers go to the paddy field)

There are three steps that each of the sentences goes through:
(1) Find the possible parses.
(2) Assign probabilities to each rule involved in the parsed sentence.
(3) Determine the most probable parse (the parse tree with the highest probability value).

Experiments were carried out on every sentence, but only a few have been selected for the examples below; the others can be found in Appendix C. Each of the examples follows the three steps.

Example 1:
T2: Baju kakak batik Terengganu.
(My sister’s attire is Terengganu batik) STEP (1): Find the possible parses Parse tree 1: Parse tree 2: STEP (2): Assigning probabilities to each rules that are involved in the parsed sentence Parse tree 1: Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN (0.2017) A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10-3 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 STEP (3): Determine the most probable one (the parse tree which has the highest probability value) The parser chooses the first parse tree (Parse tree 1) as the structure for sentence “baju kakak batik Terengganu” as the parse tree has the higher value of probability compares to the parse tree (3.255 x 10-3 > 1.532 x 10-3) 85 Example 2: T1: Kami adang air itu. (We deter the water) STEP (1): Find the possible parses Parse tree 1: Parse tree 2: STEP (2): Assigning probabilities to each rules that are involved in the parsed sentence Parse tree 1: Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 STEP (3): Determine the most probable one (the parse tree which has the highest probability value) The parser chooses the first parse tree (Parse tree 1) as the structure for sentence “kami adang air itu” as the parse tree has the higher value of probability compares to the parse tree (1.752 x 10-2 > 2.004 x 10-3) 86 Example 3: T13: Nasi godak kenduri sangat sedap. 
(The special feast rice is very delicious)

STEP (1): Find the possible parses

Parse tree 1: (parse tree diagram)
Parse tree 2: (parse tree diagram)

STEP (2): Assign probabilities to the rules involved in each parse

Parse tree 1:
A → S + P (1.0000)
S → FN (1.0000)
P → FA (0.1410)
FN → KN + KN + KN (0.0483)
FA → KPeng + KA (0.2766)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.2766 = 1.884 x 10^-3

Parse tree 2:
A → S + P (1.0000)
S → FN (1.0000)
P → FN (0.0800)
FN → KN + KN (0.2017)
FN → KN + KPeng + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0008 = 1.291 x 10^-5

STEP (3): Determine the most probable parse (the parse tree with the highest probability value)

The parser chooses Parse tree 1 as the structure for the sentence "nasi godak kenduri sangat sedap" because it has the higher probability value (1.884 x 10^-3 > 1.291 x 10^-5).

Example 4: T92: Rumah kami di bandar (Our house is in the city)

STEP (1): Find the possible parses

Parse tree 1: (parse tree diagram)
Parse tree 2: (parse tree diagram)

STEP (2): Assign probabilities to the rules involved in each parse

Parse tree 1:
A → S + P (1.0000)
S → FN (1.0000)
P → FS (0.0400)
FN → KN + KN (0.2017)
FS → KSN + FN (0.9750)
FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0400 x 0.2017 x 0.9750 x 0.3965 = 3.119 x 10^-3

Parse tree 2:
A → S + P (1.0000)
S → FN (1.0000)
P → FN (0.0800)
FN → KN (0.3965)
FN → KN + KSN + KN (0.0146)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0146 = 4.631 x 10^-4

STEP (3): Determine the most probable parse (the parse tree with the highest probability value)

The parser chooses Parse tree 1 as the structure for the sentence "rumah kami di bandar" because it has the higher probability value (3.119 x 10^-3 > 4.631 x 10^-4).

From these examples, we can conclude that the Malay Statistical Parser is quite successful in minimizing structural ambiguity in Malay grammar rules.
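The three-step selection procedure used in these examples can be sketched in Python: the probability of a parse tree is the product of the probabilities of the rules it uses, and the parser keeps the tree with the highest product. The rule probabilities below are taken from Example 1 (T2); the function names are illustrative, not part of the prototype.

```python
from math import prod

def tree_probability(rules):
    # The probability of a parse tree is the product of the
    # probabilities of all grammar rules used in that tree.
    return prod(p for _, p in rules)

def best_parse(candidates):
    # Step (3): choose the candidate parse with the highest probability.
    return max(candidates, key=lambda c: tree_probability(c[1]))

# Rule probabilities from Example 1 (T2: "Baju kakak batik Terengganu").
parse_tree_1 = [("A -> S + P", 1.0000), ("S -> FN", 1.0000),
                ("P -> FN", 0.0800), ("FN -> KN + KN", 0.2017),
                ("FN -> KN + KN", 0.2017)]
parse_tree_2 = [("A -> S + P", 1.0000), ("S -> FN", 1.0000),
                ("P -> FN", 0.0800), ("FN -> KN", 0.3965),
                ("FN -> KN + KN + KN", 0.0483)]

chosen, _ = best_parse([("Parse tree 1", parse_tree_1),
                        ("Parse tree 2", parse_tree_2)])
print(chosen)  # Parse tree 1 (3.255 x 10^-3 > 1.532 x 10^-3)
```

The same comparison applies to every example above: only the rule lists change per sentence.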
However, in the test data there are seven (7) sentences for which the parser selects the wrong parse tree, i.e. its choice is not the best possible parse. The sentences are listed below:
(1) T49: Kakak jurusolek butik pengantin
(2) T50: Beliau atlet negara Malaysia
(3) T94: Perkakasan sekolah adik baru
(4) T96: Beliau pemimpin besar negara
(5) T97: Dia pelajar kolej swasta
(6) T98: Rakan kami pengusaha kedai perabot
(7) T99: Ibu pengusaha butik pengantin

One of these sentences is shown in Example 5.

Example 5: T49: Kakak jurusolek butik pengantin (My sister is a bridal boutique beautician)

STEP (1): Find the possible parses

Parse tree 1: (parse tree diagram)
Parse tree 2: (parse tree diagram)

STEP (2): Assign probabilities to the rules involved in each parse

Parse tree 1:
A → S + P (1.0000)
S → FN (1.0000)
P → FN (0.0800)
FN → KN + KN (0.2017)
FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10^-3

Parse tree 2:
A → S + P (1.0000)
S → FN (1.0000)
P → FN (0.0800)
FN → KN (0.3965)
FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3

STEP (3): Determine the most probable parse (the parse tree with the highest probability value)

The parser chooses Parse tree 1 as the structure for the sentence "kakak jurusolek butik pengantin" because it has the higher probability value (3.255 x 10^-3 > 1.532 x 10^-3). However, the parse tree chosen by the parser is not the best possible parse. According to the Munsyi Dewan, the best parse is Parse tree 2: in the chosen parse the subject of the sentence is "kakak jurusolek", whereas the subject should be "kakak" only.

5.2 Results

To evaluate the performance and accuracy of the parser, three metrics were used: recall, precision and f-score (Carroll et al., 1998).
Recall is the ratio of the number of grammatical relations (GRs) returned by the parser that match GRs in the corresponding annotated sentence, divided by the total number of GRs in the annotated sentences. A GR is produced when the parser outputs a structure that abstracts away the details of the actual sentence but retains the structure important for semantics (Carroll & Charniak, 1992). In this work, recall is computed by dividing the number of correctly parsed sentences by the number of intended correct parses:

Recall = correctly parsed sentences / intended correct parses

The second metric, precision, is the ratio of the number of GRs returned by the parser that match, divided by the total number of GRs returned by the parser for that sentence. Precision measures the accuracy of the parser. Here, precision is calculated by dividing the number of correctly parsed sentences by the total number of parsed sentences:

Precision = correctly parsed sentences / all parsed sentences

The last metric is the f-score, computed here by adding precision and recall and dividing the result by two:

f-score = (precision + recall) / 2

The overall numbers of test sentences are presented in Table 5.1:

Table 5.1: The number of test sentences

Pattern   Number of Test Sentences
FN + FN   26
FN + FK   31
FN + FA   39
FN + FS   4
TOTAL     100

One hundred (100) test sentences were used to evaluate the prototype. The sentences were obtained from the Munsyi Dewan, who is an expert in the Malay language (Dewan Bahasa dan Pustaka, 2010). The largest group of test sentences, 39, belongs to the FN + FA pattern. The results of the parser evaluation are categorized according to sentence pattern type. The steps of the calculation are shown in the following example.
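As a minimal sketch, the three metrics as defined above can be computed in Python. Note that the f-score here is the arithmetic mean of precision and recall, exactly as this dissertation defines it (not the usual harmonic-mean F-score); the FN + FN figures (19 correct parses out of 26 test sentences) are taken from Section 5.2.

```python
def recall(correct, intended):
    # Ratio of correctly parsed sentences to intended correct parses.
    return correct / intended

def precision(correct, parsed):
    # Ratio of correctly parsed sentences to all parsed sentences.
    return correct / parsed

def f_score(p, r):
    # Arithmetic mean of precision and recall, per this chapter's
    # definition (not the standard harmonic-mean F-score).
    return (p + r) / 2

# Pattern FN + FN: 19 sentences parsed correctly out of 26 test sentences.
r = recall(19, 19)
p = precision(19, 26)
f = f_score(p, r)
print(f"recall={r:.0%} precision={p:.0%} f-score={f:.0%}")
# recall=100% precision=73% f-score=87%
```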
For pattern FN + FN:
Recall = correctly parsed sentences / intended correct parses = 19/19 = 100%
Precision = correctly parsed sentences / all parsed sentences = 19/26 = 73%
F-score = (precision + recall) / 2 = (100% + 73%) / 2 = 87%

For pattern FN + FK:
Recall = 31/31 = 100%
Precision = 31/31 = 100%
F-score = (100% + 100%) / 2 = 100%

For pattern FN + FA:
Recall = 39/39 = 100%
Precision = 39/39 = 100%
F-score = (100% + 100%) / 2 = 100%

For pattern FN + FS:
Recall = 4/4 = 100%
Precision = 4/4 = 100%
F-score = (100% + 100%) / 2 = 100%

All the results are summarized in Table 5.2.

Table 5.2: Result of the experiments

Pattern   Recall   Precision   F-score
FN + FN   100%     73%         87%
FN + FK   100%     100%        100%
FN + FA   100%     100%        100%
FN + FS   100%     100%        100%
Average   100%     93.25%      96.75%

Table 5.2 is also represented as a bar chart in Figure 5.1.

[Figure 5.1: Recall, Precision and F-score for the Statistical Parser for the Malay Language — bar chart of the three metrics (percentage) per basic sentence pattern; chart not reproduced here]

Figure 5.1 shows the results of the experiments for the statistical parser for the Malay language, evaluated using three metrics: recall, precision and f-score. Every basic sentence pattern scores 100% on all metrics, except the precision of the FN + FN pattern, which reaches only 73%. This means that most test sentences are parsed correctly. All the test sentences were verified by the Munsyi Dewan. The average scores show that the parser performs well.

5.3 Summary of the Chapter

In this chapter, the experiments and results have been discussed. The parser's efficiency, limitations and future enhancements are described in the final chapter.
CHAPTER 6: CONCLUSION

This chapter sums up this dissertation and draws conclusions from the results of the research. It also highlights the strengths of the Malay Statistical Parser, the prototype developed for this study. The limitations of the prototype are listed, and suggestions for addressing them are given in the Future Enhancements section.

6.1 Fulfillment of Research Objectives

The objectives of this research were defined in Section 1.3 of Chapter One. Here, we review the objectives and assess whether they were fulfilled as expected.

(1) To calculate the probabilities of Malay grammar rules

This objective was achieved through the Probabilistic Malay Grammar (Chapter Three), in which probability values were computed for Malay grammar rules. One thousand basic sentences were collected from various sources such as primary school textbooks and Malay grammar books. The probability values were then calculated from how many times each rule occurs in the training data. These processes are demonstrated in detail in Appendix A and Appendix B. One hundred and forty-seven (147) rules were derived from the training data.

(2) To develop a prototype of a statistical parser for the Malay language

This objective was achieved in Chapter Four, which describes the process of designing and implementing the Malay Statistical Parser, the prototype for this dissertation.

6.2 Malay Statistical Parser

The Malay Statistical Parser is an initial step towards the development of statistical parsing for the Malay language. Since no probabilities for Malay grammar rules had previously been computed, a thousand sentences were gathered to calculate the values. One hundred and forty-seven (147) Malay grammar rules are now associated with probability values. No standard Malay corpus has been developed either; in this research project, MaLEX, a Malay lexicon, is used.
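The probability values described above follow the standard relative-frequency estimate for a PCFG: each rule's probability is its count in the training data divided by the total count of rules sharing the same left-hand side. A minimal Python sketch, using illustrative counts only (not the actual figures of Appendix B):

```python
from collections import Counter, defaultdict

def estimate_rule_probabilities(observed_rules):
    # Relative-frequency estimate: P(A -> beta) = count(A -> beta) / count(A),
    # where count(A) sums over all rules with left-hand side A.
    rule_counts = Counter(observed_rules)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

# Illustrative rule occurrences for the predicate (P) non-terminal;
# these counts are hypothetical, not taken from the training data.
observed = ([("P", "FK")] * 74 + [("P", "FA")] * 14 +
            [("P", "FN")] * 8 + [("P", "FS")] * 4)
probs = estimate_rule_probabilities(observed)
print(probs[("P", "FK")])  # 0.74
```

By construction, the probabilities of all rules sharing a left-hand side sum to one, which is what makes the parse-tree products in Chapter Five comparable.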
Initially, MaLEX contained only one table: the Lexical Table, which has two fields, words and lexical classes. An enhancement has been made to MaLEX by adding a new table, the Rule Table, which holds the Malay grammar rules and their probability values.

The parser is able to reduce structural ambiguity in the Malay language. This reduction of ambiguity can be seen across all sentence patterns, as shown in Appendix C. All the test sentences were evaluated by the Munsyi Dewan. The parser is evaluated using three metrics, namely precision, recall and f-score. The results are good: the parser achieved 100% recall, 93.25% precision and 96.75% f-score.

6.3 Limitations

As the Malay Statistical Parser is still at the prototype level, some features or aspects are not yet completely developed; these are the limitations of the parser.

(1) Only short sentences are used in the datasets

The training data listed in Appendix A comprises one thousand (1000) sentences, each no more than ten (10) words long. The parser should therefore also be tested on sentences of more than ten words.

(2) The test dataset contains only basic, declarative sentences

The grammar used in the Malay Statistical Parser is a Context-Free Grammar (CFG) whose rules form basic sentences only: each sentence contains exactly one subject and one predicate, and must be declarative.

6.4 Future Enhancement

Further enhancement is necessary, as the current parser involves only one hundred and forty-seven (147) rules with associated probability values. It is recommended that further research be undertaken in the following areas:

(1) After analysing the results of the experiments, there is a need to develop a more detailed online corpus of Malay words.
As no standard online Malay corpus is available (Ahmad Izuddin et al., 2007), we suggest that a large, tagged online dataset be developed so that it can provide better results to the parser.

(2) The POS tagger should tag each word accurately according to its usage.

Probabilistic POS tagging should be implemented to solve this problem. For example, the word 'semak' should have two probability values because it belongs to two lexical classes, KN and KKTr.

(3) The set of grammar rules should be expanded.

In this study, only one hundred and forty-seven (147) rules have probability values. More rules could be derived if the training data were extended.

6.5 Summary of the Chapter

This chapter marks the end of this dissertation report. It has summarized the essential elements that form the basis of this research and shows that the objectives of the dissertation have been achieved. The research has also produced a corpus of Malay sentences with probabilistic scores, namely the PCFG of Malay sentences. Like good parsers developed for the English language, such as the Stanford Parser (Klein & Manning, 2003), the proposed PCFG for the Malay language can be applied in many applications. For instance, the parser can be used as the engine of a Malay grammar checker.

REFERENCES

Abdullah Hassan, 1993, Tatabahasa Pedagogi Bahasa Melayu, Utusan Publications & Distributors Sdn Bhd.
Abdullah Hassan & Ainon Mohd, 1994a, Bahasa Melayu untuk Maktab Perguruan, Penerbit Fajar Bakti Sdn Bhd, Kuala Lumpur.
Abdullah Hassan & Ainon Mohd, 1994b, Tatabahasa Dinamika Berdasarkan Tatabahasa Dewan, Penerbit Fajar Bakti Sdn Bhd, Kuala Lumpur.
Abu Naim Kassan, 2001, Wawasan PMR Tatabahasa, Pustaka Delta Pelajaran Sdn Bhd, Petaling Jaya.
Ahmad I. Z. Abidin, Yong, S. P., Rozana Kasbon & Hazreen Azman, 2007, Utilizing Top-Down Parsing Technique in the Development of a Malay Language Sentence Parser, Proceedings of the 2nd International Conference of Informatics, Universiti Malaya, Kuala Lumpur.
Allen, J., 1995, Natural Language Understanding, The Benjamin/Cummings Publishing Company, Inc., Redwood City, California.
Azhar Simin, 1988, Discourse-Syntax of "YANG" in Malay (Bahasa Malaysia), Dewan Bahasa dan Pustaka, Kuala Lumpur.
Bach, K., 1998, Ambiguity, Routledge Encyclopedia of Philosophy, Routledge, London.
Booth, T. L., 1969, Probabilistic Representation of Formal Languages, Tenth Annual IEEE Symposium on Switching and Automata Theory.
Carroll, J., Briscoe, T., & Sanfilippo, A., 1998, Parser Evaluation: A Survey and a New Proposal, Proceedings of the First International Conference on Language Resources and Evaluation, pp. 447-454.
Carroll, G., & Charniak, E., 1992, Two Experiments on Learning Probabilistic Dependency Grammars from Corpora, Workshop Notes, Statistically-Based NLP Techniques, pp. 1-13.
Charniak, E., 1993, Statistical Language Learning, MIT Press, Cambridge, Massachusetts.
Charniak, E., 1997, Statistical Parsing with a Context-free Grammar and Word Statistics, Proceedings of the National Conference on Artificial Intelligence, John Wiley & Sons Ltd, USA.
Charniak, E., 2000, A Maximum-Entropy-Inspired Parser, ACM International Conference Proceeding Series, The MIT Press.
Chomsky, N., 1957, Syntactic Structures, Mouton, The Hague, The Netherlands.
Chomsky, N., 1966, Syntactic Structures, Sixth Printing, Mouton & Co.
Chomsky, N., 1971, Problems of Knowledge and Freedom: The Russell Lectures, Pantheon Books, A Division of Random House, New York.
Chomsky, N., 1975, Reflections on Language, Pantheon, New York.
Chomsky, N., 1980, Rules and Representations, Columbia University Press, New York.
Collins, M. J., 2003, Head-Driven Statistical Models for Natural Language Parsing, Computational Linguistics, Vol. 29, No. 4, pp. 589-637, MIT Press.
Dewan Bahasa dan Pustaka Malaysia, 2010, Risalah Munsyi Dewan, online, retrieved 23 December 2010 from http://appw05.dbp.gov.my/dokumen/risalah_munsyi_dewan.pdf
Fellbaum, C., 1999, WordNet: An Electronic Lexical Database, MIT Press.
Francis, W. N., & Kucera, H., 1979, Brown Corpus Manual, http://www.hit.uib.no/icame/brown/bcm.html
Grune, D., & Jacobs, C., 1990, Parsing Techniques: A Practical Guide, Ellis Horwood Limited, West Sussex, England.
Hawkins, J. M., 2001, Kamus Dwibahasa Oxford Fajar, Oxford Fajar Sdn. Bhd., Selangor.
Hawryszkiewycz, I. T., 1998, Introduction to System Analysis and Design, Fourth Edition, Prentice Hall Australia Pty. Ltd., Australia.
Hoenisch, S., 2004, Identifying and Resolving Ambiguity, online, retrieved 21 January 2009 from http://www.criticism.com/linguistics/types-of-ambiguity.php
Jurafsky, D., & Martin, J. H., 2000, Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, Prentice-Hall, Upper Saddle River, New Jersey.
Klein, D., & Manning, C., 2003, Accurate Unlexicalized Parsing, Proceedings of the Association for Computational Linguistics (ACL).
Lakeland, C., & Knott, A., 2004, Implementing a Lexicalised Parser, University of Otago, New Zealand.
Magerman, D. M., 1995, Statistical Decision-tree Models for Parsing, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 276-283.
Manning, C. D., & Schutze, H., 1999, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Massachusetts.
Marcus, M., Kim, G., & Marcinkiewicz, M. A., 1993, Building a Large Annotated Corpus of English: the Penn Treebank, Computational Linguistics, Volume 19, Issue 2, The MIT Press, pp. 313-330.
Meyer, P. G., et al., 2002, Synchronic English Linguistics: An Introduction, Gunter Narr Verlag Tübingen, Germany.
Microsoft Corporation, 2002, Copyright, online, retrieved 1 December 2008 from http://www.microsoft.com/about/legal/en/us/Copyright/Default.aspx
Mohanty, S., & Balabantaray, R. C., 2003, Intelligent Parsing in Natural Language Processing, 8th International Workshop on Parsing Technologies.
Mohd Juzaiddin Ab Aziz, et al., 2006, Pola Grammar Technique for Grammatical Relation Extraction in Malay Language, Malaysian Journal of Computer Science, Vol. 19, No. 1, pp. 59-72, University of Malaya.
Nik Safiah Karim, 1975, The Major Syntactic Structures of Bahasa Malaysia and their Implications for the Standardization of the Language, PhD Dissertation, Ohio University.
Nik Safiah Karim, 1995, Malay Grammar for Academics and Professionals, Dewan Bahasa dan Pustaka, Kuala Lumpur.
Nik Safiah Karim, Farid M. Onn, Hashim Haji Musa & Abdul Hamid Mahmood, 2009, Tatabahasa Dewan Edisi Ketiga, Dewan Bahasa dan Pustaka, Kuala Lumpur.
Pfleeger, S. L., 1998, Software Engineering: Theory and Practice, International Edition, Prentice Hall, USA.
Yeoh, C. K., 1979, Interaction of Rules in Bahasa Malaysia, PhD Dissertation, University of Illinois at Urbana-Champaign.
Yuan, Y., 1997, Statistics Based Approaches Towards Chinese Language Processing, PhD Dissertation, National University of Singapore.
Zainal Arifin Yusof, Kamarudin Jeon & Mohd Nasar Sukor, 2008, Buku Teks Bahasa Melayu Sekolah Kebangsaan Tahun 1 Kurikulum Bersepadu Sekolah Rendah (KBSR), Dewan Bahasa dan Pustaka, Kuala Lumpur.
Zainal Arifin Yusof, Kamarudin Jeon & Mohd Nasar Sukor, 2005, Buku Latihan dan Aktiviti Bahasa Melayu Sekolah Kebangsaan Tahun 1 KBSR, Dewan Bahasa dan Pustaka, Kuala Lumpur.

APPENDIX A: TRAINING DATA

Some examples of the training data, tagged manually by the Munsyi Dewan. Each entry lists the index/sentence number, the sentence, the tagging, and the rules that are matched.

(1) Air banjir itu kami adang. Kami adang air banjir itu.
KN KKTr KN KN Pent A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + Pent (2) Barang permainan murah buatan negeri China tentu ada aibnya. Barang permainan murah buatan negeri China tentu ada aibnya. KN KN KA KN KN KN KBantu KKTr KN A→S+P S → FN P → FK FN → KN + KN + KA + KN + KN FK → KBantu + KKTr + FN FN → KN (3) Ali berasa sungguh aib atas pelakuannya . Ali berasa sungguh aib atas pelakuannya . KN KKTr KPeng KA KSN KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KPeng + KA + KSN + KN (4) Bau minyak wangi alkoholik tahan lama. Bau minyak wangi alkoholik tahan lama. KN KN KN KN KKTTr KA A→S+P S → FN P → FK FN → KN + KN + KN + KN FK → KKTTr + KA 103 (5) Pemain-pemain bola bawah 18 tahun itu masih amatur. Pemain-pemain bola bawah 18 tahun itu masih amatur. KN KN KN Bil KN Pent KBantu KA (6) Amatur sandiwara menjadi kegemarannya pada masa cuti semester. Amatur sandiwara menjadi kegemarannya pada masa cuti semester. KN KN KKTr KN KSN KN KN KN (7) Kami adang air yang mengalir itu. Kami adang air yang mengalir itu. KN KKTr KN KBantu KKTr Pent (8) Kebanyakan hotel di tepi pantai membina limbung berduri. Kebanyakan hotel di tepi pantai membina limbung berduri. KN KN KSN KN KN KKTr KN KKTTr (9) Sikap Hasim yang limbung menjelikkan orang. Sikap Hasim yang limbung menjelikkan orang. KN KN KBantu KA KKTr KN (10) Orang membenci sikap Hasim yang limbung. Orang membenci sikap Hasim yang limbung. 
KN KKTr KN KN KBantu KA A→S+P S → FN P → FK FN → KN + FK → KKTTr + KA A→S+P S → FN P → FK FN → KN + KN FK → KKTr + FN FN → KN + KSN + KN + KN + KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KBantu + KKTr + Pent A→S+P S → FN P → FK FN → KN + KN + KSN + KN + KN FK → KKTr + FN FN → KN + KKTTr A→S+P S → FN P → FK FN → KN + KN + KSN + KN + KN FK → KKTr + FN FN → KN + KKTTr A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + KBantu + KA 104 A→S+P S → FN P → FK FN → KN + KN + KN + Pent FK → KKTr + FN FN → KN + KN + KN A→S+P S → FN P → FA FN → KN + KN FA → KA + KSN + KN + KN + KBantu + KA (11) Lonjong bentuk songkok itu mendapat permintaan orang ramai. Lonjong bentuk songkok itu mendapat permintaan orang ramai. KN KN KN Pent KKTr KN KN KN (12) Kakak saya lonjong daripada kakak yang lain. Kakak saya lonjong daripada kakak yang lain. KN KN KA KSN KN KN KBantu KA (13) Hantu bersifat mistik mengikut kepercayaan masyarakat. Hantu bersifat mistik mengikut kepercayaan masyarakat. KN KKTr KN KKTr KN KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KKTr + KN + KN (14) Pak Li mengamalkan ilmu mistik. Pak Li mengamalkan ilmu mistik. KN KN KKTr KN KN A→S+P S → FN P → FK FN → KN + KN FK → KKTr + FN (15) Ubat batuk itu amat mustajab. Ubat batuk itu amat mustajab. KN KN Pent KPeng KA A→S+P S → FN P → FA FN → KN + KN + Pent FA → KPeng + KA 105 (16) Mustajabnya doa seorang ibu bapa adalah mengikut amalan mereka. Mustajabnya doa seorang ibu bapa adalah mengikut amalan mereka. KN KN Bil PenjBil KN KN KN KKTr KN KN A→S+P S → FN P → FK FN → KN + KN + Bil + PenjBil + KN + KN + KN FK → KKTr + FN FN → KN + KN (17) Nenek suka makan pinang Nenek suka makan pinang KN KBantu KKTr KN A→S+P S → FN P → FK FN → KN FK → KBantu + KKTr + FN (18) Saya pinang gadis kampung itu. Saya pinang gadis kampung itu. KN KKTr KN KN Pent A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + Pent (19) Kereta itu kereta prebet. Kereta itu kereta prebet. 
KN Pent KN KN A→S+P S → FN P → FN FN → KN + Pent FN → KN + KN (20) Raga itu besar sangat. Raga itu besar sangat. KN Pent KA KPeng A→S+P S → FN P → FA FN → KN + Pent FA → KA + KPeng 106 (21) Pakaian seragamnya agak rambu pada hari ini. Pakaian seragamnya agak rambu pada hari ini. KN KN KBantu KA KSN KN Pent A→S+P S → FN P → FA FN → KN + KN FA → KBantu + KA + KSN + KN + Pent (22) Tali khemah itu diikat pada rambu. Tali khemah itu diikat pada rambu. KN KN Pent KKTTr KSN KN A→S+P S → FN P → FK FN → KN + KN + Pent FK → KKTTr + KSN + KN (23) Para menteri membincangkan rang undang-undang jalan raya. Para menteri membincangkan rang undangundang jalan raya. KN KN KKTr KN KN KN KN (24) Padi itu ditanam rapat. Padi itu ditanam rapat. KN Pent KKTTr KA (25) Rapat umum itu sangat aman. Rapat umum itu sangat aman. KN KN Pent KPeng KA A→S+P S → FN P → FK FN → KN + KN + Pent FK → KKTTr + KSN + KN A→S+P S → FN P → FK FN → KN + Pent FK → KKTTr + KA A→S+P S → FN P → FA FN → KN + KN + Pent FA → KPeng + KA (26) Cuti raya tahun ini dua hari sahaja. Cuti raya tahun ini dua hari sahaja. KN KN KN Pent Bil KN KN A→S+P S → FN P → FN FN → KN + KN + KN + Pent FN → Bil + KN + KN 107 (27) Ali sedang releks di kerusi malasnya. Ali sedang releks di kerusi malasnya. KN KBantu KKTTr KSN KN KN A→S+P S → FN P → FK FN → KN FK → KBantu + KKTTr + KSN + KN + KN (28) Dia menjawab soalan guru dengan releks sahaja. Dia menjawab soalan guru dengan releks sahaja. KN KKTr KN KN KSN KN KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + KSN + KN + KN (29) Resmi padi sangat bagus dicontohi. Resmi padi sangat bagus dicontohi. KN KN KPeng KA KKTTr A→S+P S → FN P → FA FN → KN + KN FA → KPeng + KA + KKTTr (30) Ali begitu ribut sekali di tempat kenduri kahwin adiknya. Ali begitu ribut sekali di tempat kenduri kahwin adiknya. KN KBantu KA KPeng KSN KN KN KN KN` A→S+P S → FN P → FA FN → KN FA → KBantu+KA +KPeng+KSN+KN+KN+KN+KN` (31) Hamidah saing baik saya di sekolah. Hamidah saing baik saya di sekolah. 
KN KN KN KN KSN KN A→S+P S → FN P → FN FN → KN FN → KN + KN + KN + KSN + KN 108 A→S+P S → FN P → FK FN → KN FK → KNafi + KKTTr + KSN + KN + KN A→S+P S → FN P → FN FN → KN FN → KN + KN + KN + Pent + KN (32) Kecantikannya tiada saing di kelas kami. Kecantikannya tiada saing di kelas kami. KN KNafi KKTTr KSN KN KN (33) Saya saksi pertandingan bola sepak itu semalam. Saya saksi pertandingan bola sepak itu semalam. KN KN KN KN KN Pent KN (34) Saksi kejadian itu telah meninggal dunia. Saksi kejadian itu telah meninggal dunia. KN KN Pent KBantu KKTr KN A→S+P S → FN P → FK FN → KN + KN + Pent FK → KBantu + KKTr + FN FN → KN (35) Kubu pertahanan musuh sapih akibat serangan tentera bersekutu. Kubu pertahanan musuh sapih akibat serangan tentera bersekutu. KN KN KN KA KN KN KN KN (36) Kami telah sapih anak lelaki kami minggu lepas Kami telah sapih anak lelaki kami minggu lepas. KN KBantu KKTtr KN KN KN KN KN A→S+P S → FN P → FA FN → KN + KN + KN FA → KA + KN + KN + KN + KN A→S+P S → FN P → FK FN → KN FK → KBantu + KKTr + FN FN → KN + KN + KN + KN + KN 109 (37) Saya sebak rambut saya berkali-kali. Saya sebak rambut saya berkali-kali. KN KKTr KN KN KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + KN (38) Dia sedang membasuh. Dia sedang membasuh. KN KBantu KKTTr A→S+P S → FN P → FK FN → KN FK → KBantu + KKTTr (39) Ali berkeluarga sedang. Ali berkeluarga sedang. KN KKTr KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN (40) Seksi penerbitan Syarikat Tunas sangat aktif. Seksi penerbitan Syarikat Tunas sangat aktif. KN KN KN KN KPeng KA (41) Pakaian itu sungguh seksi Pakaian itu sungguh seksi. KN Pent KPeng KA (42) Pintu itu mempunyai selak aluminium. Pintu itu mempunyai selak aluminium. KN Pent KKTr KN KN A→S+P S → FN P → FA FN → KN + KN + KN + KN FA → KPeng + KA A→S+P S → FN P → FA FN → KN + Pent FA → KPeng + KA A→S+P S → FN P → FK FN → KN + Pent FK → KKTr + FN FN → KN + KN 110 (43) Siling rumah itu kami seling dengan kayu jati. 
Siling rumah itu kami seling dengan kayu jati. KN KN Pent KN KKTTr KSN KN KN A→S+P S → FN P → FK FN → KN + KN + Pent FK → KKTTr + KSN + KN + KN (44) Seling besar itu telah pecah. Seling besar itu telah pecah. KN KN Pent KBantu KA (45) Kami selusur papan gelongsor itu. Kami selusur papan gelongsor itu. KN KKTr KN KN Pent A→S+P S → FN P → FA FN → KN + KN + Pent FA → KBantu + KA A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + Pent (46) Selusur jambatan kayu itu telah patah. Selusur jambatan kayu itu telah patah. KN KN KN Pent KBantu KKTTr A→S+P S → FN P → FK FN → KN + KN + KN + Pent FK → KBantu + KKTTr (47) Kawasan rumah Ali semak. Kawasan rumah Ali semak. KN KN KN KN A→S+P S → FN P → FN FN → KN FN → KN + KN + KN (48) Saya semak semula kertas soalan itu. Guru itu terpaksa semak semula kertas ujian Hassan. KN Pent KBantu KKTTr A→S+P S → FN P → FK FN → KN + Pent FK → KBantu + KKTTr 111 (49) Pintu rumah saya sempal. Pintu rumah saya sempal. KN KN KN KN A→S+P S → FN P → FN FN → KN FN → KN + KN + KN (50) Kami sempal lubang itu dengan kain basahan. Kami sempal lubang itu dengan kain basahan. KN KKTr KN Pent KSN KN KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + Pent + KSN + KN + KN (51) Sengkek dari Bangladesh itu baru tiba ke Malaysia. Sengkek dari Bangladesh itu baru tiba ke Malaysia. KN KSN KN Pent KBantu KKTTr KSN KN A→S+P S → FN P → FK FN → KN + KSN + KN + Pent FK → KBantu + KKTTr + KSN + KN (52) Dia sudah jatuh sengkek. Dia sudah jatuh sengkek. KN KBantu KKTTr KA A→S+P S → FN P → FK FN → KN + KSN + KN + Pent FK → KBantu + KKTTr + KSN + KN (53) Ali menjala ikan sepat di sungai itu. Ali menjala ikan sepat di sungai itu. KN KKTr KN KN KSN KN Pent A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + KSN + KN + Pent 112 (54) Serang hendap tentera itu berjaya. Serang hendap tentera itu berjaya. KN KN KN Pent KKTTr A→S+P S → FN P → FK FN → KN + KN + KN + Pent FK → KKTTr (55) Budak itu berani saya serang pada bilabila masa sahaja. 
Budak itu berani saya serang pada bila-bila masa sahaja. KN Pent KA KN KKTTr KSN KN KN KN A→S+P S → FN P → FK FN → KN + Pent + KA + KN FK → KKTTr + KSN + KN + KN + KN (56) Bulu surai di tengkuk kuda itu berwarna perang. Bulu surai di tengkuk kuda itu berwarna perang. KN KN KSN KN KN Pent KKTr KN A→S+P S → FN P → FK FN → KN + KN +KSN + KN + KN + Pent FK → KKTr + FN FN → KN (57) Pihak kami surai majlis itu jam 4.30 petang. Pihak kami surai majlis itu jam 4.30 petang. KN KN KKTr KN Pent KN Bil KN A→S+P S → FN P → FK FN → KN + KN FK → KKTr + FN FN → KN + Pent + KN + Bil + KN (58) Orang ramai bersorak seperti bunyi tagar. Orang ramai bersorak seperti bunyi tagar. KN KN KKTTr KSN KN KN A→S+P S → FN P → FK FN → KN + KN FK → KKTTr + KSN + KN + KN 113 (59) Besi murah itu mudah tagar . Besi murah itu mudah tagar . KN KN Pent KA KN A→S+P S → FN P → FA FN → KN + KN + Pent FA → KA + KN (60) Dia telah makan nasi. Dia telah makan nasi. KN KBantu KKTr KN A→S+P S → FN P → FK FN → KN FK → KBantu + KKTr + FN (61) Aku telah Halim akan memperoleh 8A dalam peperiksaannya. Aku telah Halim akan memperoleh 8A dalam peperiksaannya. KN KBantu KN KBantu KKTr KN KSN KN A→S+P S → FN P → FK FN → KN + KBantu + KN + KBantu FK → KKTr + FN FN → KN + KSN + KN (62) Terawang songket itu amat kemas. Terawang songket itu amat kemas. KN KN Pent KPeng KA A→S+P S → FN P → FA FN → KN + KN + Pent FA → KPeng + KA (63) Dia terawang jauh mengenangkan masa mudanya. Dia terawang jauh mengenangkan masa mudanya. KN KKTr KA KKTr KN KN A→S+P S → FN P → FK FN → KN + KKTr + KA FK → KKTr + FN FN → KN + KN 114 (64) Dua ulas durian itu busuk. Dua ulas durian itu busuk. KN PenjBil KN Pent KA A→S+P S → FN P → FA FN → KN + PenjBil + KN + Pent FA → KA (65) Kain ulas itu diperbuat daripada kapas Kain ulas itu diperbuat daripada kapas KN KN Pent KKTTr KSN KN A→S+P S → FN P → FK FN → KN + KN + Pent FK → KKTTr + KSN + KN (66) Adik Fatimah bermain di padang. Adik Fatimah bermain di padang. 
KN KN KKTTr KSN KN A→S+P S → FN P → FK FN → KN + KN FK → KKTTr + KSN + KN (67) Pemandu itu membelok ke kanan. Pemandu itu membelok ke kanan. KN Pent KKTTr KSN KN KN A→S+P S → FN P → FK FN → KN + Pent FK → KKTTr + KSN + KN + KN (68) Peristiwa itu disaksikan oleh dua ratus orang. Peristiwa itu disaksikan oleh dua ratus orang. KN Pent KKTTr KSN Bil KN KN A→S+P S → FN P → FK FN → KN + Pent FK → KKTTr + KSN + Bil + KN + KN (69) Padi sedang menguning. Padi sedang menguning. KN KBantu KKTTr A→S+P S → FN P → FK FN → KN FK → KBantu + KKTTr 115 (70) Pemuda itu tersenyum. Pemuda itu tersenyum. KN Pent KKTTR A→S+P S → FN P → FK FN → KN + Pent FK → KKTTr (71) Padi sedang menguning di sawah. Padi sedang menguning di sawah. KN KBantu KKTTr KSN KN A→S+P S → FN P → FK FN → KN FK → KBantu + KKTTr + KSN + KN (72) Pemuda itu tersenyum seorang diri. Pemuda itu tersenyum seorang diri. KN Pent KKTTr KN KN A→S+P S → FN P → FK FN → KN + Pent FK → KKTTr + KN + KN (73) Mereka asyik berbual di kedai kopi. Mereka asyik berbual di kedai kopi. KN KBantu KKTTr KSN KN KN A→S+P S → FN P → FK FN → KN FK → KBantu + KKTTr + KSN + KN + KN (74) Dia menjadi guru. Dia menjadi guru. KN KKTr KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN (75) Lukanya beransur baik. Lukanya beransur baik. KN KKTTr KA A→S+P S → FN P → FK FN → KN FK → KKTTr + KA 116 (220) Saya berumur lapan tahun. Saya berumur lapan tahun. KN KKTr Bil KN A→S+P S → FN P → FK FN → KN + KN FK → KKTr + FN FN → Bil + KN (221) Tiga orang budak lemas. Tiga orang budak lemas. Bil PenjBil KN KKTTr A→S+P S → FN P → FK FK → KKTTr FN → Bil +PenjBil+ KN (222) Saya hendak tahu tentang bumi. Saya hendak tahu tentang bumi. KN KBantu KKTTr KBantu KN A→S+P S → FN P → FK FN → KN FK → KBantu + KKTTr + KBantu + KN A→S+P S → FN P → FK FN → KN FK → KKTTr +KSN + KN (394) Ali pergi ke London. Ali pergi ke London. KN KKTTr KSN KN (397) Dua ekor belalang di atas daun. Dua ekor belalang di atas daun. 
Bil PenjBil KN KSN KN KN A→S+P S → FN P → FS FN → Bil + PenjBil+KN FS → KSN + FN FN → KN + KN 117 (411) Pelajar itu belajar dengan bersungguhsungguh. Pelajar itu belajar dengan bersungguh-sungguh. KN Pent KKTTr KSN KKTTr A→S+P S → FN P → FK FN → KN + KN + Pent FK → KKTTr + KSN + KKTTr (412) Mereka itu berjalan pada waktu pagi. Mereka itu berjalan pada waktu pagi. KN Pent KKTTr KSN KN KN A→S+P S → FN P → FK FN → KN + Pent FK → KKTTr + KSN + KN + KN (696) Peniaga itu memperdaya pelanggannya dengan janji-janji manis. Peniaga itu memperdaya pelanggannya dengan janji-janji manis. KN Pent KKTr KN KSN KN KA A→S+P S → FN P → FK FN → KN + Pent FK → KKTr + FN FN → KN + KSN + KN + KA (697) Dia memperkacang harta orang tuanya sehingga habis. Dia memperkacang harta orang tuanya sehingga habis. KN KKTr KN KN KN KSN KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KN + KN + KSN + KN (701) Saya menduduki kerusi yang kosong itu. Saya menduduki kerusi yang kosong itu. KN KKTr KN KBantu KN Pent A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + KBantu + KN + Pent 118 (702) Saya mendudukkan anak itu pada kerusi yang di hadapan. Saya mendudukkan anak itu pada kerusi yang di hadapan. KN KKTr KN Pent KSN KN KBantu KSN KN A→S+P S → FN P → FK FN → KN FK → KKTr + FN FN → KN + Pent + KSN + KN + KBantu + KSN + KN (904) Kumpulan pelajar Malaysia itu bertolak esok pagi ke Jepun. Kumpulan pelajar Malaysia itu bertolak esok pagi ke Jepun. KN KN KN Pent KKTr KN KN KSN KN A→S+P S → FN P → FK FN → KN + KN + KN + Pent FK → KKTr + FN FN → KN + KN + KSN + KN (933) Sahabat penanya menulis surat yang sangat panjang. Sahabat penanya menulis surat yang sangat panjang. KN KN KKTr KN KBantu KPeng KA A→S+P S → FN P → FK FN → KN + KN FK → KKTr + FN FN → KN + KBantu + KPeng + KA (999) Rasa buah itu enak sekali. Rasa buah itu enak sekali. KN KN Pent KA KPeng A→S+P S → FN P → FA FN → KN + KN + Pent FA → KA + KPeng (1000) Nasibnya sungguh malang. Nasibnya sungguh malang. 
KN KPeng KA A→S+P S → FN P → FA FN → KN FA → KPeng + KA 119 APPENDIX B RULES AND RESPECTIVE INDEX SENTENCES Num RULE 1. A→S+P LHS Rule Count 1000 RHS Rule Count 1000 RHS/ LHS Probability The index sentence from training data according to respective rule Comment 1000/1000 1.0000 (1) – (1000) 1000/1000 1.0000 (1) – (1000) All sentences have this rule. All sentences have this rule. There are 80 sentences have FN as predicate. 2. S → FN 1000 1000 3. P → FN 1000 80 80/1000 0.0800 (21), (30), (38), (40), (56), (58), (106), (107), (119), (128), (149), (165), (167), (169), (170), (171), (172), (186), (194), (201), (202), (203), (204), (205), (206), (207), (209), (210), (211), (212), (213), (214), (215), (216), (217), (243), (297), (300), (303), (304), (307), (308), (316), (317), (401), (441), (446), (468), (470), (471), (478), (479), (480), (501), (502), (503), (505), (506), (507), (508), (509), (510), (512), (519), (546), (571), (572), (573), (576), (578), (788), (789), (794), (803), (896), (962), (971), (972), (981), (982) 4. P → FS 1000 40 40/1000 0.0400 (129), (163), (173), (195), (289), (299), (327), (389), (395), (397), (400), (434), (445), (498), (499), (500), (531), (532), (533), (534), (564), (807), (906), (909), (954), (955), (956), (957), (958), (959), (960), (961), (964), (990), (991), (992), (993), (994), (995), (996) There are 40 sentences have FS as predicate. 5. 
FN → KN + Pent 1304 162 162/ 1304 (20), (21), (22), (49), (50), (57), (79), (80), (84), (89), (94), (103), (107), (109), (110), (111), (113), (139), (156), (174), (175), (183), (184), (188), (192), (195), (235), (292), (295), (296), (297), (304), (308), (318), (376), (410), (412), (430), (431), (427), (433), (434), (441), (444), (462), (475), (480), (485), (493), (494), (497), (499), (501), (502), (520), (521), (523), (524), (534), (536), (545), (546), (548), (549), (550), There are 162 rules involve for rule FN → KN + Pent 0.1242 120 APPENDIX B: Some Rules and Their Index Sentences from Training Data According to Their Respective Rules Num RULE LHS Rule Count RHS Rule Count RHS/ LHS Probability The index sentence from training data according to respective rule Comment (553), (554), (559), (560), (568), (571), (572), (573), (574), (575), (576), (577), (579), (581), (602), (604), (610), (611), (615), (624), (626), (627), (628), (642), (644), (645), (648), (650), (651), (656), (663), (677), (680), (686), (690), (691), (692), (694), (695), (696), (705), (712), (719), (727), (725), (732), (733), (739), (735), (741), (751), (758), (761), (763), (779), (782), (799), (824), (834), (850), (854), (855), (858), (859), (860), (861), (862), (863), (864), (867), (874), (879), (883), (888), (891), (894), (896), (910), (911), (912), (916), (917), (929), (930), (938), (945), (954), (955), (956), (957), (959), (963), (964), (977), (993), (994), (996) 6. FA → Kpeng + KA 141 39 39/141 0.2766 (16), (29), (48), (49), (72), (103), (137), (145), (161), (175), (176), (183), (281), (298), (310), (314), (376), (378), (409), (433), (524), (542), (544), (545), (548), (549), (581), (770), (775), (780), (784), (785), (923), (936), (941), (946), (980), (998), (1000) There are 39 out of 141 rules are related to rule FA → Kpeng + KA 7. 
FK → KKTTr + KN + KN + KN: LHS rule count 704, RHS rule count 2, RHS/LHS = 2/704, probability 0.0028. Index sentences: (226), (227). Comment: there are 2 sentences in the training data that involve the rule FK → KKTTr + KN + KN + KN.
8. FS → KBantu + KSN + FN: LHS rule count 40, RHS rule count 1, RHS/LHS = 1/40, probability 0.0250. Index sentence: (129). Comment: there is only one occurrence of the rule FS → KBantu + KSN + FN in the training data, at index (129).

APPENDIX C
RESULTS OF THE TEST DATA

T1
Sentence: Kami adang air itu.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN (0.3965), FK → KKTr + FN (0.4815), FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.1242 = 1.752 x 10^-2
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10^-3
The best parse tree that represents “Kami adang air itu” is Parse Tree 1.

T2
Sentence: Baju kakak batik Terengganu.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “Baju kakak batik Terengganu” is Parse Tree 1.

T3
Sentence: Seterika antik sangat mahal.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN (0.2017), FA → KPeng + KA (0.2766)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.866 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KPeng + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538 x 10^-5
The best parse tree that represents “Seterika antik sangat mahal” is Parse Tree 1.
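The probability computations in these test cases follow the standard PCFG scheme: each rule's probability in Appendix B is its RHS count divided by its LHS count, each candidate tree is scored as the product of the probabilities of the rules it uses, and the highest-scoring tree is selected. A minimal sketch of that computation (Python; the function and variable names are illustrative, not from the thesis, with probabilities transcribed from the appendices):

```python
from math import prod

# Rule probabilities transcribed from Appendix B (RHS count / LHS count).
# Only the rules needed for test sentence T1 are included here.
RULE_PROB = {
    ("A", "S + P"): 1.0000,
    ("S", "FN"): 1.0000,
    ("P", "FK"): 0.7390,
    ("P", "FN"): 0.0800,
    ("FN", "KN"): 0.3965,
    ("FN", "KN + KN"): 0.2017,
    ("FN", "KN + Pent"): 0.1242,
    ("FK", "KKTr + FN"): 0.4815,
}

def rule_prob(lhs_count: int, rhs_count: int) -> float:
    """Relative-frequency estimate used in Appendix B: RHS count / LHS count."""
    return rhs_count / lhs_count

def tree_prob(rules) -> float:
    """Score of a parse tree: the product of the probabilities of its rules."""
    return prod(RULE_PROB[r] for r in rules)

# Rule 5 of Appendix B: FN -> KN + Pent occurs 162 times among 1304 FN expansions.
assert round(rule_prob(1304, 162), 4) == 0.1242

# Test sentence T1, "Kami adang air itu": its two candidate parse trees.
tree1 = [("A", "S + P"), ("S", "FN"), ("P", "FK"),
         ("FN", "KN"), ("FK", "KKTr + FN"), ("FN", "KN + Pent")]
tree2 = [("A", "S + P"), ("S", "FN"), ("P", "FN"),
         ("FN", "KN + KN"), ("FN", "KN + Pent")]

# The parser keeps the candidate with the highest probability.
best = max([tree1, tree2], key=tree_prob)
```

For T1 this reproduces the values shown above, about 1.752 x 10^-2 for Parse Tree 1 against 2.004 x 10^-3 for Parse Tree 2, so Parse Tree 1 is selected.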
T4
Sentence: Saya bagi fakir itu wang.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN (0.3965), FK → KKTr + FN (0.4815), FN → KN + Pent + KN (0.0015)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0015 = 2.116 x 10^-4
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FS (0.0400), FN → KN (0.3965), FS → KSN + FN (0.9750), FN → KN + Pent + KN (0.0015)
Probability value = 1.0000 x 1.0000 x 0.0400 x 0.3965 x 0.9750 x 0.0015 = 2.320 x 10^-5
The best parse tree that represents “Saya bagi fakir itu wang” is Parse Tree 1.

T5
Sentence: Pegawai itu pengurus syarikat.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + Pent (0.1242), FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.1242 x 0.2017 = 2.004 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + Pent + KN (0.0015), FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0015 x 0.3965 = 4.758 x 10^-5
The best parse tree that represents “Pegawai itu pengurus syarikat” is Parse Tree 1.

T6
Sentence: Atap rumah saya bocor.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + KN (0.0483), FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.3121 = 2.125 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “Atap rumah saya bocor” is Parse Tree 1.
T7
Sentence: Beg adik berwarna coklat.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN + KN (0.2017), FK → KKTTr + KA (0.0483)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0483 = 7.199 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN + KN (0.2017), FK → KKTTr + KN (0.0028)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0028 = 4.174 x 10^-4
The best parse tree that represents “Beg adik berwarna coklat” is Parse Tree 1.

T8
Sentence: Pipi mukanya bengkak.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN (0.2017), FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.3121 = 8.876 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.3965 = 6.398 x 10^-3
The best parse tree that represents “Pipi mukanya bengkak” is Parse Tree 1.

T9
Sentence: Kata-kata wanita itu bisa.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + Pent (0.0744), FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744 x 0.3121 = 3.274 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN + Pent (0.0744), FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0744 x 0.3965 = 2.360 x 10^-3
The best parse tree that represents “Kata-kata wanita itu bisa” is Parse Tree 1.
T10
Sentence: Saya eskot pengetua ke pentas.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN (0.3965), FK → KKTr + FN (0.4815), FN → KN + KSN + KN (0.0146)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0146 = 2.060 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.0483), FN → KN + KSN + KN (0.0146)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0483 x 0.0146 = 5.641 x 10^-5
The best parse tree that represents “Saya eskot pengetua ke pentas” is Parse Tree 1.

T11
Sentence: Kami garuk tanah.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN (0.3965), FK → KKTr + FN (0.4815), FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.3965 = 5.594 x 10^-2
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN (0.3965), FA → KA + KN (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.3121 = 2.125 x 10^-3
The best parse tree that represents “Kami garuk tanah” is Parse Tree 1.

T12
Sentence: Saya gali sendiri telaga itu.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN (0.3965), FK → KKTr + FN (0.4815), FN → KN + KN + Pent (0.0744)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0744 = 1.050 x 10^-2
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KN + Pent (0.0092)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0092 = 2.918 x 10^-4
The best parse tree that represents “Saya gali sendiri telaga itu” is Parse Tree 1.
T13
Sentence: Nasi godak kenduri sangat sedap.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + KN (0.0483), FA → KPeng + KA (0.2766)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.2766 = 1.884 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + KPeng + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0008 = 1.291 x 10^-5
The best parse tree that represents “Nasi godak kenduri sangat sedap” is Parse Tree 1.

T14
Sentence: Puteri ketujuh paling gombang.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN (0.2017), FA → KPeng + KA (0.2766)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.866 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KPeng + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538 x 10^-5
The best parse tree that represents “Puteri ketujuh paling gombang” is Parse Tree 1.

T15
Sentence: Coklat koko makanan kegemaran adik.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0483 = 7.794 x 10^-4
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KN + KN (0.0115)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0115 = 3.648 x 10^-4
The best parse tree that represents “Coklat koko makanan kegemaran adik” is Parse Tree 1.
T16
Sentence: Suara penyanyi wanita agak garuk.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + KN (0.0483), FA → KBantu + KA (0.0780)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.0780 = 5.312 x 10^-4
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KBantu + KA (0.0023)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0023 = 7.296 x 10^-5
The best parse tree that represents “Suara penyanyi wanita agak garuk” is Parse Tree 1.

T17
Sentence: Sakit hati saya makin buku.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + KN (0.0483), FA → KBantu + KA (0.0780)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.0780 = 5.312 x 10^-4
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KBantu + KA (0.0023)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0023 = 7.296 x 10^-5
The best parse tree that represents “Sakit hati saya makin buku” is Parse Tree 1.

T18
Sentence: Buah beri itu manis.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + Pent (0.0744), FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744 x 0.3121 = 3.274 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + Pent + KA (0.0015)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0015 = 4.758 x 10^-5
The best parse tree that represents “Buah beri itu manis” is Parse Tree 1.

T19
Sentence: Ucapan guru kaunseling itu cacat.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + KN + Pent (0.0092), FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0092 x 0.3121 = 4.049 x 10^-4
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + Pent + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0008 = 1.291 x 10^-5
The best parse tree that represents “Ucapan guru kaunseling itu cacat” is Parse Tree 1.

T20
Sentence: Baju kakak berwarna putih.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN + KN (0.2017), FK → KKTTr + KA (0.0483)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0483 = 7.199 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN + KN (0.2017), FK → KKTTr + KN (0.0028)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0028 = 4.174 x 10^-4
The best parse tree that represents “Baju kakak berwarna putih” is Parse Tree 1.

T21
Sentence: Kapal milik keluarga saya labuh di pelabuhan Klang.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN + KN + KN + KN (0.0115), FK → KKTTr + KSN + KN + KN (0.0511)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.0115 x 0.0511 = 4.343 x 10^-4
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FA (0.1410), FN → KN + KN + KN + KN (0.0115), FA → KA + KSN + KN + KN (0.0071)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0115 x 0.0071 = 1.151 x 10^-5
The best parse tree that represents “Kapal milik keluarga saya labuh di pelabuhan Klang” is Parse Tree 1.

T22
Sentence: Aku lambung duit syiling.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FK (0.7390), FN → KN (0.3965), FK → KKTr + FN (0.4815), FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.2017 = 2.846 x 10^-2
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10^-3
The best parse tree that represents “Aku lambung duit syiling” is Parse Tree 1.

T23
Sentence: Kakak saya guru sekolah.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “Kakak saya guru sekolah” is Parse Tree 1.

T24
Sentence: Sepupu saya jurutera binaan.
Parse tree 1: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN + KN (0.2017), FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10^-3
Parse tree 2: A → S + P (1.0000), S → FN (1.0000), P → FN (0.0800), FN → KN (0.3965), FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “Sepupu saya jurutera binaan” is Parse Tree 1.

T25
Sentence: Saya kopek buah kelapa itu.
Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + KN + Pent (0.0744) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0744 = 1.050 x 10-2 Parse tree 2: The best parse tree that represents “saya kopek buah kelapa itu” is Parse Tree 1. T26 Sentence: Jawapan murid itu konkrit Parse tree1: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.2017) FN → KN + KN + Pent (0.0744) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0744 = 1.2001 x 10-3 A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + Pent (0.0744) FA → KA (0.3121) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744x0.3121 = 3.274x10-3 Parse tree 2: A → S + P (1.0000) 139 Index Parse Trees Sentence Probability values S → FN (1.0000) P→ FN (0.0800) FN → KN + KN + Pent (0.0744) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0744x0.3965 = 2.360x10-3 The best parse tree that represents “jawapam murid itu itu” is Parse Tree 1. T27 Sentence: bunyi derai kaca amat ngilu Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + KN(0.0483) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.2766 = 1.884x10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KPeng + KA (0.0008) The best parse tree that represents “bunyi derai kaca amat ngilu” is Parse Tree 1. 
T28 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0008 = 1.291x10-5 Sentence: pelajar cemerlang sungguh dinamik Parse tree 1: 140 Index Parse Trees Sentence Probability values A→ S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886x10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KPeng + KA (0.0008) The best parse tree that represents “pelajar cemerlang sungguh dinamik” is Parse Tree 1. T29 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538x10-5 Sentence: Buah gajus rasa kelat. Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN + KN (0.2017) FK → KKTr + FN (0.4815) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.7390x 0.2017x 0.4815 x 0.3965 = 2.846x10-2 Parse tree 2: A → S + P (1.0000) 141 Index Parse Trees Sentence Probability values S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN (0.2017) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10-3 Parse tree 3: The best parse tree that represents “buah gajus rasa kelat” is Parse Tree 1. T30 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 Sentence: Adik saya murid sekolah rendah. Parse tree1: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN + KN (0.0483) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0483 = 7.794 x 10-4 Parse tree 2: 142 Index Parse Trees Sentence Probability values A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN + KN (0.0115) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0115 = 3.645x 10-4 The best parse tree that represents “adik saya murid sekolah rendah” is Parse Tree 1. 
T31 Sentence: saya pagar reban itu Parse tree 1 A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 The best parse tree that represents “saya pagar reban itu” is Parse Tree 1. T32 Sentence: aku cas bateri itu. Parse tree 1: 143 Index Parse Trees Sentence Probability values A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 The best parse tree that represents “aku cas bateri itu” is Parse Tree 1. T33 Sentence: makcik saya kerani sekolah Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 Parse tree 2: 144 Index Parse Trees Sentence Probability values A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 The best parse tree that represents “makcik saya kerani sekolah” is Parse Tree 1. 
T34 Sentence: kawan abang tentera laut Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) The best parse tree that represents “kawan abang tentera laut” is Parse Tree 1. T35 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 Sentence: kawan kakak sangat cantik 145 Index Parse Trees Sentence Parse tree 1: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886x10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KPeng + KA (0.0008) The best parse tree that represents “kawan kakak sangat cantik” is Parse Tree 1. T36 Sentence: kulit bayi sangat halus Parse tree 1: Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538x10-5 A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886x10-3 Parse tree 2: A → S + P (1.0000) 146 Index Parse Trees Sentence Probability values S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KPeng + KA (0.0008) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538x10-5 The best parse tree that represents “kulit bayi sangat halus” is Parse Tree 1. 
T37 Sentence: ibu saya bagi kucing itu makanan Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN + KN (0.2017) FK → KKTr + FN (0.4815) FN → KN + Pent + KN (0.0015) Probability value = 1.0000 x 1.0000 x 0.7390 X 0.2017 x 0.4815 x 0.0015 = 1.077x10-4 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FS (0.0400) FN → KN+KN (0.2017) FK → KSN+ FN (0.975) FN → KN + Pent + KN (0.0015) T38 The best parse tree that represents “ibu saya bagi kucing itu makanan” is Parse Tree 1. Sentence: guru kami bagi markah sangat rendah Probability value = 1.0000 x 1.0000 x 0.0400 X 0.2017 x 0.975 x 0.0015 = 1.180x10-5 147 Index Parse Trees Sentence Probability values Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN + KN (0.2017) FK → KKTr + FN (0.4815) FN → KN + KPeng + KA (0.0008) Parse tree 2: Probability value = 1.0000 x 1.0000 x 0.7390 X 0.2017 x 0.4815 x 0.0008 = 5.742x10-5 A → S + P (1.0000) S → FN (1.0000) P→ FS (0.0400) FN → KN+KN (0.2017) FK → KSN+ FN (0.975) FN → KN + KPeng + KA (0.0008) The best parse tree that represents “guru kami bagi markah sangat rendah” is Parse Tree 1. T39 Sentence: gadis itu model sambilan. Parse tree 1: Probability value = 1.0000 x 1.0000 x 0.0400 X 0.2017 x 0.975 x 0.0008 = 6.293x10-6 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + Pent (0.1242) FN → KN + KN (0.2017) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.1242 x 0.2017 = 2.004 x 10-3 Parse tree 2: A → S + P (1.0000) 148 Index Parse Trees Sentence Probability values S → FN (1.0000) P→ FN (0.0800) FN → KN + Pent + KN (0.0015) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0015x 0.3965 = 4.758x 10-5 The best parse tree that represents “gadis itu model sambilan” is Parse Tree 1. 
T40 Sentence: kereta kepunyaan ayah baru Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + KN (0.0483) FA → KA (0.3121) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0483 x 0.3121 = 2.125 x 10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) The best parse tree that represents “kereta kepunyaan ayah baru” is Parse Tree T41 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 Sentence: Pembetulan tesis saya minor 149 Index Parse Trees Sentence Parse tree 1: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + KN (0.0483) FA → KA (0.3121) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0483 x 0.3121 = 2.125 x 10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) The best parse tree that represents “pembetulan tesis saya minor” is Parse Tree 1. T42 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 Sentence: bapa saya ke pejabat Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FS (0.0400) FN → KN + KN (0.2017) FS → KSN+ FN (0.975) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.0400 x 0.2017 x 0.975 x 0.3965 = 3.119 x 10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) 150 Index Parse Trees Sentence Probability values P→ FN (0.0800) FN → KN (0.3965) FN → KN + KSN + KN (0.0146) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0146 = 4.631 x 10-4 The best parse tree that represents “bapa saya ke pejabat” is Parse Tree 1. 
T43 Sentence: asal benang daripada kapas Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FS (0.0400) FN → KN + KN (0.2017) FS → KSN+ FN (0.975) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.0400 x 0.2017 x 0.975 x 0.3965 = 3.119 x 10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KSN + KN (0.0146) The best parse tree that represents “asal benang daripada kapas” is Parse Tree 1. T44 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0146 = 4.631 x 10-4 Sentence: baju adik baru 151 Index Parse Trees Sentence Parse tree 1: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KA (0.3121) Probability value = 1.0000 x 1.0000 x 0.1410 X 0. 2017X 0.3121 = 8.876 X 10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FA → KN (0.3965) The best parse tree that represents “baju adik baru” is Parse Tree 1. T45 Probability value = 1.0000 x 1.0000 x 0.0800 X 0. 2017 X 0.3965 = 6.398 X 10-3 Sentence: pembedahan ibu minor Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KA (0.3121) Probability value = 1.0000 x 1.0000 x 0.1410 X 0. 2017X 0.3121 = 8.876 X 10-3 Parse tree 2: 152 Index Parse Trees Sentence Probability values A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FA → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.0800 X 0. 2017 X 0.3965 = 6.398 X 10-3 The best parse tree that represents “pembedahan ibu minor” is Parse Tree 1. 
T46 Sentence: Kami bentuk adunan biskut itu Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + KN + Pent (0.0744) Probability value = 1.0000 x 1.0000 x 0.7390x 0.3965 x 0.4815 x 0.0744 = 1.050x10-2 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN + Pent (0.0092) The best parse tree that represents “kami bentuk adunan biskut itu” is Parse Tree 1. T47 Probability value = 1.0000 x 1.0000 x 0.0800 X 0.3965 X 0.0092 = 2.918 X 10-4 Sentence: pelajar malas itu benak dalam semua 153 Index Parse Trees Sentence subjek Parse tree 1: Parse tree 2: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + Pent (0.0744) FA → KA + KSN + KN + KN (0.0071) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744 x 0.0071 = 7.448 X 10-5 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN + Pent (0.0744) FN → KN + KSN + KN + KN (0.0038) The best parse tree that represents “pelajar malas itu benak dalam semua subjek” is Parse Tree 1 T48 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0744 x 0.0038 = 2.262 X 10-5 Sentence: pulau peranginan milik kerajaan negeri Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN + KN (0.0483) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0483 = 7.794 x 10-4 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) 154 Index Parse Trees Sentence Probability values FN → KN (0.3965) FN → KN + KN + KN + KN (0.0115) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0115 = 3.645x 10-4 T49 The best parse tree that represents “pulau peranginan milik kerajaan negeri” is Parse Tree 1 Sentence: kakak jurusolek butik pengantin Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN (0.2017) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10-3 Parse tree 2: A → S + P (1.0000) S → 
FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) The best parse tree that represents “kakak jurusolek butik pengantin” is Parse Tree 1. T50 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 Sentence: beliau atlet negara Malaysia 155 Index Parse Trees Sentence Parse tree 1: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Parse tree 2: Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) The best parse tree that represents “beliau atlet negara malaysia” is Parse Tree 1. T51 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 Sentence: saya kepit suratkhabar itu. Parse tree 1 A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 Parse tree 2 156 Index Parse Trees Sentence Probability values A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 T52 The best parse tree that represents “saya kepit suratkhabar itu.” is Parse Tree 1. Sentence: rumah saya rumah kayu Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN (0.2017) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) The best parse tree that represents “rumah saya rumah kayu” is Parse Tree 1. 
T53 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 Sentence: baju ibu sangat murah 157 Index Parse Trees Sentence Parse tree 1: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886x10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KPeng + KA (0.0008) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538x10-5 The best parse tree that represents “baju ibu sangat murah” is Parse Tree 1. T54 Sentence: orang kaya itu bagi sedekah Parse tree 1 A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN + KN + Pent (0.0744) FK → KKTr + FN (0.4815) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.0744 x 0.4815 x 0.3965 = 1.050x10-2 158 Index Parse Trees Sentence Probability values A → S + P (1.0000) S → FN (1.0000) P→ FS (0.0400) FN → KN + KN + Pent (0.0744) FK → KSN+ FN (0.975) FN → KN (0.3965) Parse tree 2 The best parse tree that represents “orang kaya itu bagi sedekah” is Parse Tree 1. T55 Probability value = 1.0000 x 1.0000 x 0.0400 X 0.0744 x 0.975 x 0.3965 = 1.150x10-3 Sentence: Pelajar itu pelajar cemerlang Parse tree 1 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + Pent (0.1242) FN → KN + KN (0.2017) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.1242 x 0.2017 = 2.004 x 10-3 Parse tree 2 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + Pent + KN (0.0015) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0015x 0.3965 = 4.758x 10-5 The best parse tree that represents “Pelajar itu pelajar cemerlang” is Parse Tree 1. 159 Index Parse Trees Sentence T56 Sentence: belon kepunyaan adik bocor. 
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN (0.0483); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.3121 = 2.125 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “belon kepunyaan adik bocor” is Parse Tree 1.

T57
Sentence: Gigi adik berwarna putih
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN + KN (0.2017); FK → KKTTr + KA (0.0483)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0483 = 7.199 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN + KN (0.2017); FK → KKTTr + KN (0.0028)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0028 = 4.174 x 10^-4
The best parse tree that represents “gigi adik berwarna putih” is Parse Tree 1.

T58
Sentence: Perut adik buntal
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN (0.2017); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.3121 = 8.876 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.3965 = 6.398 x 10^-3
The best parse tree that represents “perut adik buntal” is Parse Tree 1.
T59
Sentence: sengat tebuan itu bisa
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + Pent (0.0744); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744 x 0.3121 = 3.274 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN + Pent (0.0744); FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0744 x 0.3965 = 2.360 x 10^-3
The best parse tree that represents “sengat tebuan itu bisa” is Parse Tree 1.

T60
Sentence: Saya daftar subjek baru semester hadapan
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN (0.3965); FK → KKTr + FN (0.4815); FN → KN + KN + KN + KN (0.0115)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0115 = 2.206 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.0483); FN → KN + KN + KN + KN + KN (0.0023)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0483 x 0.0023 = 5.641 x 10^-5
The best parse tree that represents “Saya daftar subjek baru semester hadapan” is Parse Tree 1.

T61
Sentence: Kami garuk sungai yang cetek
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN (0.3965); FK → KKTr + FN (0.4815); FN → KN + KBantu + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0008 = 1.129 x 10^-4
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN (0.2017); FA → KN + KBantu + KA (0.0008)
The best parse tree that represents “kami garuk sungai yang cetek” is Parse Tree 1.
Probability value = 1.0000 x 1.0000 x 0.1410 X 0.2017 x 0.0008 = 2.275 X 10-5 163 Index Parse Trees Sentence T62 Sentence: saya gali lubang sampai dalam Parse tree 1 Probability values A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + KKTr + KA (0.0008) Probability value = 1.0000 x 1.0000 x 0.7390x 0.3965 x 0.4815 x 0.0008 = 1.129x10-4 Parse tree 2 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN+ KN (0.2017) FN → KN + KKTr + KA (0.0008) The best parse tree that represents “saya gali lubang sampai dalam” is Parse Tree 1. T63 Probability value = 1.0000 x 1.0000 x 0.0800 X 0.2017 X 0.0008 = 1.291 X 10-5 Sentence: sayur hijau sangat segar Parse tree 1 A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886x10-3 164 Index Parse Trees Sentence Parse tree 2 Probability values A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KPeng + KA (0.0008) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538x10-5 The best parse tree that represents “sayur hijau sangat segar” is Parse Tree 1. T64 Sentence: kereta kepunyaan rakan amat besar Parse tree 1 A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + KN (0.0483) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 X 0.0483 X 0.2766 = 1.884 X 10-3 Parse tree 2 A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN(0.2017) FN → KN + KPeng + KA (0.0008) The best parse tree that represents “kereta kepunyaan rakan amat besar” is Parse Tree 1. 
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0008 = 1.291 x 10^-5

T65
Sentence: kereta idaman saya kereta mewah
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0483 = 7.794 x 10^-4
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN + KN (0.0115)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0115 = 3.645 x 10^-4
The best parse tree that represents “kereta idaman saya kereta mewah” is Parse Tree 1.

T66
Sentence: badan pesakit diabetis makin kurus
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN (0.0483); FA → KBantu + KA (0.0780)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.0780 = 5.279 x 10^-4
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KBantu + KA (0.0023)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0023 = 7.296 x 10^-5
The best parse tree that represents “badan pesakit diabetis makin kurus” is Parse Tree 1.

T67
Sentence: kereta buatan tempatan makin mahal
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN (0.0483); FA → KBantu + KA (0.0780)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.0780 = 5.279 x 10^-4
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KBantu + KA (0.0023)
The best parse tree that represents “kereta buatan tempatan makin mahal” is Parse Tree 1.
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0023 = 7.296 x 10^-5

T68
Sentence: badan pelakon itu langsing
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + Pent (0.0744); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744 x 0.3121 = 3.274 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + Pent + KA (0.0015)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0015 = 4.758 x 10^-5
The best parse tree that represents “badan pelakon itu langsing” is Parse Tree 1.

T69
Sentence: bangunan milik kerajaan itu runtuh
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN + Pent (0.0092); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0092 x 0.3121 = 4.049 x 10^-4
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + Pent + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0008 = 1.291 x 10^-5
The best parse tree that represents “bangunan milik kerajaan itu runtuh” is Parse Tree 1.

T70
Sentence: kasut ayah berwarna coklat
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN + KN (0.2017); FK → KKTTr + KA (0.0483)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0483 = 7.199 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN + KN (0.2017); FK → KKTTr + KN (0.0028)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.2017 x 0.0028 = 4.174 x 10^-4
The best parse tree that represents “kasut ayah berwarna coklat” is Parse Tree 1.
T71
Sentence: kain sekolah pelajar itu labuh
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN + Pent (0.0092); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0092 x 0.3121 = 4.049 x 10^-4
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN + KN + KN + KN (0.0115); FN → KA + KSN + KN + KN (0.0071)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.0115 x 0.0071 = 8.133 x 10^-5
The best parse tree that represents “kain sekolah pelajar itu labuh” is Parse Tree 1.

T72
Sentence: kami lambung pengantin lelaki itu
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN (0.3965); FK → KKTr + FN (0.4815); FN → KN + KN + Pent (0.0744)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0744 = 1.050 x 10^-2
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN + Pent (0.0092)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0092 = 2.918 x 10^-4
The best parse tree that represents “kami lambung pengantin lelaki itu” is Parse Tree 1.

T73
Sentence: sekolah kami sekolah harapan
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN (0.0483)
The best parse tree that represents “sekolah kami sekolah harapan” is Parse Tree 1.
T74 Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 Sentence: guru saya guru besar Parse tree 1: A → S + P (1.0000) S → FN (1.0000) 171 Index Parse Trees Sentence Probability values P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN (0.2017) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 The best parse tree that represents “guru saya guru besar” is Parse Tree 1. T75 Sentence: aku kopek buah limau itu. Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + KN + Pent (0.0744) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.0744 = 1.050 x 10-2 Parse tree 2: A → S + P (1.0000) 172 Index Parse Trees Sentence Probability values S → FN (1.0000) P→ FN (0.0800) FN → KN (0.2017) FN → KN + KN + Pent (0.0744) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0744 = 1.2001 x 10-3 The best parse tree that represents “aku kopek buah limau itu” is Parse Tree 1. T76 Sentence: binaan bangunan itu konkrit Parse tree1: A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + Pent (0.0744) FA → KA (0.3121) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744x0.3121 = 3.274x10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN + Pent (0.0744) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0744x0.3965 = 2.360x10-3 The best parse tree that represents “binaan bangunan itu konkrit” is Parse Tree 1. 
T77 Sentence: jiran rumah kami amat baik 173 Index Parse Trees Sentence Parse tree 1: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN + KN(0.0483) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.2766 = 1.884x10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KPeng + KA (0.0008) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0008 = 1.291x10-5 The best parse tree that represents “jiran rumah kami amat baik” is Parse Tree 1. T78 Sentence: adik saya sangat nakal Parse tree 1: A→ S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886x10-3 Parse tree 2: A → S + P (1.0000) 174 Index Parse Trees Sentence Probability values S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KPeng + KA (0.0008) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538x10-5 The best parse tree that represents “adik saya sangat nakal” is Parse Tree 1. T79 Sentence: buah strawberi rasa masam Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN + KN (0.2017) FK → KKTr + FN (0.4815) FN → KN (0.3965) Probability value = 1.0000 x 1.0000 x 0.7390x 0.2017x 0.4815 x 0.3965 = 2.846x10-2 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN (0.2017) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10-3 Parse tree 3: A → S + P (1.0000) 175 Index Parse Trees Sentence Probability values S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN (0.0483) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10-3 The best parse tree that represents “buah strawberi rasa masam” is Parse Tree 1. 
T80 Sentence: pakcik saya rakan kongsi ayah Parse tree1: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + KN + KN (0.0483) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0483 = 7.794 x 10-4 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KN + KN + KN (0.0115) The best parse tree that represents “pakcik saya rakan kongsi ayah” is Parse Tree 1. Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0115 = 3.645x 10-4 176 Index Parse Trees Sentence T81 Sentence: kami pagar kandang itu. Parse tree 1 Probability values A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 The best parse tree that represents “kami pagar kandang itu” is Parse Tree 1. T82 Sentence: kami cas generator ini Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN (0.3965) FK → KKTr + FN (0.4815) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3964 x 0.4815 x 0.1242 = 1.752 x 10-2 177 Index Parse Trees Sentence Parse tree 2: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN + KN (0.2017) FN → KN + Pent (0.1242) Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10-3 The best parse tree that represents “kami cas generator ini” is Parse Tree 1. 
T83
Sentence: datuk saya pesara polis
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN (0.3965); FK → KKTr + FN (0.4815); FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.1242 = 1.752 x 10^-2
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10^-3
The best parse tree that represents “datuk saya pesara polis” is Parse Tree 1.

T84
Sentence: rakan saya pegawai bomba
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN (0.3965); FK → KKTr + FN (0.4815); FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.1242 = 1.752 x 10^-2
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10^-3
The best parse tree that represents “rakan saya pegawai bomba” is Parse Tree 1.

T85
Sentence: bunga mawar sangat wangi
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN (0.2017); FA → KPeng + KA (0.2766)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KPeng + KA (0.0008)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538 x 10^-5
The best parse tree that represents “bunga mawar sangat wangi” is Parse Tree 1.
T86 Sentence: anak kakak sangat comel Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FA (0.1410) FN → KN + KN (0.2017) FA → KPeng + KA (0.2766) Probability value = 1.0000 x 1.0000 x 0.1410 x 0.2017 x 0.2766 = 7.886x10-3 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FN (0.0800) FN → KN (0.3965) FN → KN + KPeng + KA (0.0008) The best parse tree that represents “anak kakak sangat comel” is Parse Tree 1. Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538x10-5 180 Index Parse Trees Sentence T87 Sentence: abang saya bagi budak itu duit Parse tree 1: Probability values A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN + KN (0.2017) FK → KKTr + FN (0.4815) FN → KN + Pent + KN (0.0015) Probability value = 1.0000 x 1.0000 x 0.7390 X 0.2017 x 0.4815 x 0.0015 = 1.077x10-4 Parse tree 2: A → S + P (1.0000) S → FN (1.0000) P→ FS (0.0400) FN → KN+KN (0.2017) FK → KSN+ FN (0.975) FN → KN + Pent + KN (0.0015) The best parse tree that represents “abang saya bagi budak itu duit” is Parse Tree 1. T88 Probability value = 1.0000 x 1.0000 x 0.0400 X 0.2017 x 0.975 x 0.0015 = 1.180x10-5 Sentence: pengadil bagi mata sangat tinggi Parse tree 1: A → S + P (1.0000) S → FN (1.0000) P→ FK (0.7390) FN → KN + KN (0.2017) FK → KKTr + FN (0.4815) FN → KN + KPeng + KA (0.0008) Probability value = 1.0000 x 1.0000 x 0.7390 X 0.2017 x 0.4815 x 0.0008 181 Index Parse Trees Sentence Parse tree 2: Probability values = 5.742x10-5 A → S + P (1.0000) S → FN (1.0000) P→ FS (0.0400) FN → KN+KN (0.2017) FK → KSN+ FN (0.975) FN → KN + KPeng + KA (0.0008) The best parse tree that represents “pengadil bagi mata sangat tinggi” is Parse Tree 1. 
Probability value = 1.0000 x 1.0000 x 0.0400 x 0.2017 x 0.975 x 0.0008 = 6.293 x 10^-6

T89
Sentence: lelaki itu pengacara televisyen
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + Pent (0.1242); FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.1242 x 0.2017 = 2.004 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + Pent + KN (0.0015); FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.0015 x 0.3965 = 4.758 x 10^-5
The best parse tree that represents “lelaki itu pengacara televisyen” is Parse Tree 1.

T90
Sentence: kereta kepunyaan ayah baru
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN (0.0483); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.3121 = 2.125 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “kereta kepunyaan ayah baru” is Parse Tree 1.

T91
Sentence: Pembetulan tesis saya minor
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + Pent (0.0744); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744 x 0.3121 = 3.274 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + Pent + KA (0.0008)
The best parse tree that represents “pembetulan tesis saya minor” is Parse Tree 1.
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0008 = 2.538 x 10^-5

T92
Sentence: rumah kami di bandar
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FS (0.0400); FN → KN + KN (0.2017); FS → KSN + FN (0.975); FN → KN (0.3965)
Probability value = 1.0000 x 1.0000 x 0.0400 x 0.2017 x 0.975 x 0.3965 = 3.119 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KSN + KN (0.0146)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0146 = 4.631 x 10^-4
The best parse tree that represents “rumah kami di bandar” is Parse Tree 1.

T93
Sentence: penduduk kampung ke sawah padi
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FS (0.0400); FN → KN + KN (0.2017); FS → KSN + FN (0.975); FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0400 x 0.2017 x 0.975 x 0.2017 = 1.587 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KSN + KN + KN (0.0038)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0038 = 1.205 x 10^-4
The best parse tree that represents “penduduk kampung ke sawah padi” is Parse Tree 1.

T94
Sentence: perkakasan sekolah adik baru
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN (0.0483); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.3121 = 2.125 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN (0.3965); FK → KKTr + FN (0.4815); FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.1242 = 1.752 x 10^-2
The best parse tree that represents “perkakasan sekolah adik baru” is Parse Tree 2.
T95
Sentence: murid sekolah itu rajin
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + Pent (0.0744); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0744 x 0.3121 = 3.274 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + Pent + KA (0.0015)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0015 = 4.758 x 10^-5
The best parse tree that represents “murid sekolah itu rajin” is Parse Tree 1.

T96
Sentence: beliau pemimpin besar negara
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN (0.0483); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.3121 = 2.125 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “beliau pemimpin besar negara” is Parse Tree 1.

T97
Sentence: dia pelajar kolej swasta
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FA (0.1410); FN → KN + KN + KN (0.0483); FA → KA (0.3121)
Probability value = 1.0000 x 1.0000 x 0.1410 x 0.0483 x 0.3121 = 2.125 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN (0.0483)
The best parse tree that represents “dia pelajar kolej swasta” is Parse Tree 1.
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3

T98
Sentence: rakan kami pengusaha kedai perabot
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.0483 = 7.794 x 10^-4
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN + KN (0.0115)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0115 = 3.645 x 10^-4
The best parse tree that represents “rakan kami pengusaha kedai perabot” is Parse Tree 1.

T99
Sentence: ibu pengusaha butik pengantin
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + KN (0.2017)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.2017 = 3.255 x 10^-3
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN (0.3965); FN → KN + KN + KN (0.0483)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.3965 x 0.0483 = 1.532 x 10^-3
The best parse tree that represents “ibu pengusaha butik pengantin” is Parse Tree 1.

T100
Sentence: beliau atlet negara Malaysia
Parse tree 1: A → S + P (1.0000); S → FN (1.0000); P → FK (0.7390); FN → KN (0.3965); FK → KKTr + FN (0.4815); FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.7390 x 0.3965 x 0.4815 x 0.1242 = 1.752 x 10^-2
Parse tree 2: A → S + P (1.0000); S → FN (1.0000); P → FN (0.0800); FN → KN + KN (0.2017); FN → KN + Pent (0.1242)
Probability value = 1.0000 x 1.0000 x 0.0800 x 0.2017 x 0.1242 = 2.004 x 10^-3
The best parse tree that represents “beliau atlet negara Malaysia” is Parse Tree 1.

APPENDIX D: USER MANUAL

The main interface of the Statistical Parser for Malay Language prototype is shown below. The interface is divided into two panes: the left-hand pane is where the user enters a sentence.
The right-hand pane displays the tagged sentence.

The main interface with the example test sentence “bapa pemandu teksi”

The tagged sentence is shown after the “Proses” button is clicked. The message below is shown if the sentence is grammatically correct. Otherwise, the message illustrated below is shown. If the sentence is grammatically correct, the parse tree with the highest probability is displayed, together with all the possible parse trees.

APPENDIX E: LETTERS OF APPROVAL