MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING INFERENCE RULE

MAISARAH BTE YAMAN

A thesis submitted in partial fulfillment of the requirement for the award of the Degree of Master of Science (Software Engineering)

Faculty of Computer Science and Information Technology
Universiti Tun Hussein Onn Malaysia

DECEMBER, 2013

ABSTRACT

Natural Language Processing (NLP) is a technique by which a machine can understand humans better, thereby reducing the distance between human beings and machines. NLP was implemented in MyParser, a tool for parsing sentences. This research aimed to develop a sentence parser application for teachers and learners of the Malay language who face difficulties in comprehending its grammar and phrase structure. MyParser has three major processing components: the Input Component, the Parsing Engine and the Output Component. The Parsing Engine includes a pre-processing phase that uses a 'Tokenizer' and 'Part Of Speech' (POS) tagging. The input is a simple sentence provided by an expert on the Malay language (Munsyi Dewan), and it is categorized according to the inference rules implemented in MyParser. This study is significant in designing inference rules for Malay sentences, focusing on the relationship between the programming language and the implementation of the inference rules. The output is labelled phrase by phrase following the Malay Context Free Grammar (CFG) rules. The parsing technique is therefore an essential component to consider when developing a parsing application. MyParser was developed to process Malay simple sentences by categorizing them into different grammar and phrase structures. The target groups for this application are students and teachers of the Malay language subject in primary school. MyParser was evaluated using 100 training sentences agreed on by a qualified expert on the Malay language of Malaysia (Munsyi Dewan). More than a thousand words were stored in the database.
The application was found to be able to visualize correct sentences with labelled graphical tags. MyParser was tested by teachers at different levels from primary and secondary schools and by a Munsyi Dewan. The results showed that MyParser achieved more than 90% accuracy in constructing sentences according to the grammatical rules.

ABSTRAK

Natural language processing (NLP) is a technique by which a machine can understand what humans want. NLP is an important component of the language parsing process developed here, namely MyParser. This research aims to develop a parser application that processes Malay simple sentences, for use by teachers and students of Malay language classes who have problems understanding the grammar of the language. MyParser has three main parts: the 'Input Component', the 'Parsing Engine' and the 'Output Component'. The Parsing Engine phase involves the 'Tokenization' and 'Part of Speech' processes. The input component is a simple sentence provided by a language expert (Munsyi Dewan), which is categorized based on the inference rules implemented in MyParser. This study contributes to the formation of rules for Malay sentences by focusing on the relationship with the programming language. The parsing technique is therefore an important component in the development of this application. In addition, the MyParser application was developed to process Malay simple sentences by categorizing them into different grammar and phrase structures. The target groups of this application are primary school teachers and students who take the Malay language subject. The grammar chosen for this application is context-free grammar (CFG), and it is limited to Malay simple sentences containing simple sentence phrases.
MyParser was evaluated using 100 test sentences agreed on by a Malay language expert (Munsyi Dewan), and more than 1000 Malay words are stored in the database records. The application was found to be able to depict correct sentences with labelled graphical tags. The results of the study conducted with primary and secondary school Malay language teachers and a Munsyi Dewan show that MyParser achieved more than 90% accuracy in forming sentences according to the grammar of the language.

CONTENTS

TITLE
DECLARATION
DEDICATION
ABSTRACT
CONTENTS
LIST OF TABLE
LIST OF FIGURE

CHAPTER 1 INTRODUCTION
1.1 An Overview
1.2 Problem Statement
1.3 Research Objectives
1.4 Aim of Study
1.5 Research Scope
1.6 Report Outline

CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Sentence Grammar Overview
2.3 The Standard Malay Language
2.4 The Types of Malay Language Grammar
2.4.1 Sentence Grammar
2.4.2 CFG in Malay Language
2.4.3 Partial discourse grammar
2.4.4 ‘Pola’ (Pattern) Grammar
2.5 Rules for Basic Malay Sentence
2.6 Malay Text Categorization
2.7 Pre-processing
2.7.1 Parsing technique
2.7.2 Error Analysis of Parsing Technique
2.8 Related Work
2.8.1 Ahmad’s Malay Parser
2.8.2 Juzaiddin’s Malay Parser
2.8.3 Comparison Among Research Works
2.9 Summary

CHAPTER 3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 The Proposed Framework for MyParser
3.2.1 Input Component
3.2.2 MyParser (Pre-processing process)
3.2.3 Output Sentence
3.4 Summary

CHAPTER 4 ANALYSIS: MYPARSER
4.1 Introduction
4.2 Specific Requirements
4.2.1 Functional Requirements
4.2.2 Non-functional Requirements
4.2.3 UML Specification
4.2.3.1 Use case diagram
4.2.3.2 Class Diagram
4.2.3.3 Sequence Diagram
4.2.3.4 Activity Diagram
4.2.3.5 Test Plan
4.3 Summary

CHAPTER 5 IMPLEMENTATION OF MYPARSER
5.1 Introduction
5.2 The implementation of MyParser
5.3 Summary

CHAPTER 6 RESULTS AND DISCUSSION
6.1 Introduction
6.2 The Experimental Result of MyParser
6.2.1 Error Analysis of MyParser
6.3 Comparison with other Malay NLP
6.3 Summary

CHAPTER 7 CONCLUSION
7.1 Introduction
7.2 Achievement of Research Objectives
7.3 Recommendations for Future Work
7.4 Conclusion

REFERENCES
APPENDIX A
APPENDIX B

LIST OF TABLE

4.1 The Description of Insert Sentence
4.2 The Description of Edit Sentence
4.3 The Description View Malay CFG Rule
4.4 The Description Update Sentence
4.5 The Description Add Rule
4.6 The Description Update Rule
4.7 The Description Update New Word
4.8 The Description Record Data
4.9 The Description Class User
4.10 The Description Class Student
4.11 The Description Class Teacher
4.12 The Description Class Programmer
4.13 The Description Class Programmer
4.14 The Test Plan of MyParser
6.1 Results Of Sentence Parse
6.2 The Analysis Error of MyParser
6.3 The Analysis Of Malay NLP

LIST OF FIGURE

2.1 First tree of “He saw the boy with a telescope” (Meyer et al., 2002)
2.2 Second tree of “He saw the boy with a telescope” (Meyer et al., 2002)
2.3 First tree of “Kami adang air itu”
2.4 Second tree of “Kami adang air itu”
2.5 Context-Free Grammar for Malay Language (Karim, 1995)
2.6 Pre-processing for conceptual clustering (Zakree et al., 2008)
2.7 The Format For Errors Analysis by Hendrickson (1979)
2.8 The Architecture of The Parsing For Ahmad's Malay Parser
2.9 The Architecture of The Juzaiddin's System
3.1 The Proposed Framework for MyParser
3.2 Flowchart of MyParser processing sentence
4.1 Recommended SRS Structure (IEEE Std. 830-1998)
4.2 Use Case Diagram of MyParser
4.3 Class Diagram of MyParser
4.4 Sequence Diagram of User
4.5 Sequence Diagram of Programmer
4.6 Activity Diagram of The Users
4.7 Activity Diagram of The Programmer
5.1 The Main Page of Application
5.2 The Sentence Tokenization Process
5.3 The Algorithm Sentence Tokenization
5.4 The Word By Word Tokenization Process
5.5 The Algorithm Word for Tokenization Process
5.6 The Algorithm Word for POS Tagging Process
5.7 The Error Message Window Of MyParser Application
5.8 The Inference Rule for MyParser Based on Malay CFG
5.9 The Message Window Of Malay CFG Structure Correct
5.10 The Sentence Visualization Structure Of Malay CFG Is Correct
5.11 The Sentence Visualization Structure Of Malay CFG Is Incorrect
6.1 The Score Of Build Correct And Incorrect Sentence
6.2 The Pie Chart Of Total Sentence Parsed
6.3 The Errors Analysis of MyParser
6.4 The Lexicon Analysis Error of MyParser
6.5 The Syntax Analysis Error of MyParser
6.6 The Orthography Analysis Error of MyParser

CHAPTER 1 INTRODUCTION

1.1 Overview

Processing techniques for English texts have been developed over a long period of time and follow their own general approach to the structure of English. The same cannot be done for the Malay language, partly because the structure of Malay is still largely not properly described, and partly because it is strikingly different from English (Don, 2010). The development of the database incorporates two important design principles, namely the logical organization of data, and the separation of text properties, lexicon, and grammar. Grammar is a formal specification of the rules of a language, while parsing is a method of performing syntactic analysis, where syntax means the rules of the grammar. According to Abidin et al.
(2007), grammar is the set of rules and patterns that combine words, clauses, and phrases to give meaning in daily use, and the study of grammar is closely related to language parsing. Muhamad & Jamaludin (2012) concluded that "In Malaysia, research in sentence parse tree visualization for Malay language (BM) still not gaining enough attention from researcher to construct a prototype as what been done to English language". Progress on Malay text parsers for categorizing simple sentences has not spread widely, despite extensive development for Indo-European languages and for other Asian language families. Text processing applications for English are becoming widespread in many fields, such as information systems and natural language processing. Text categorization is combined with a natural language processing approach because each plays an important role for the other. Considering time and cost, software requirements play an important role in most of the process of building a Natural Language Processing (NLP) toolkit (Talita & Yeo, 2010). An effective NLP tool as the pre-processing component is considered necessary in developing a tool for automatic parsing of Malay text (Zakree et al., 2008).

The purpose of NLP is to ensure that machines understand human language. According to Surabhi (2013), NLP is a technique by which a machine can become more human, thereby reducing the distance between human beings and the machine. NLP therefore makes it easier for humans to communicate with a machine. Many NLP applications have been developed within the past few decades, and most of them are very useful in everyday life. Many research groups are also working on this topic to develop more practical and useful systems.
Furthermore, the entities, properties, interrelationships, and functions derived through a graphical design notation approach are representations of knowledge intended to capture the conceptualized information (Talita & Yeo, 2010). For English sentence grammar, most language researchers have introduced sentence parse tree visualizations to help readers understand sentence structure. Among the applications introduced, phpSyntaxTree and RSyntaxTree let users visualize an English sentence online (Muhamad & Jamaludin, 2012).

Currently, word-processing applications for the Malay language exist only in the form of word translators or spellcheckers. This research therefore aims to complement the existing word-processing software by presenting a Malay sentence parser application that shows the structure of an expression by writing square brackets ('[' and ']') to the left and right of its component parts. The prototype is able to illustrate the structure of a grammatically correct sentence, determine whether a sentence is grammatically correct, and semantically parse a sentence. In evaluating the application, the input sentences were randomly provided by experts in the structure of Malay grammar. If a sentence follows the Context Free Grammar (CFG) rules for the Malay language but is incorrect in terms of semantics, it is still considered a correct sentence. This enables people and software to share a common understanding of the nature and structure of the information. The sentence structure is then categorized according to the inference rules developed in this study.

The pre-processing component specifications play a key role in enhancing the potential for data sharing and reuse across applications. Prior to any deeper linguistic treatment of a text, the units of the text must be marked and possibly classified.
To categorize Malay sentences with the MyParser application, NLP tools are needed, together with program code that lets those tools accept Malay text. An important element of this research is how MyParser is structured to parse the text and build the labelled Malay sentence. This study contributes a design of an Inference Rule for Malay text in the parser, focusing on the relationship between the programming language and the implementation of the inference rule, which was created using labelled bracket notation with a left-to-right, top-down parse approach (Charles, 1994). Subsequently, a parser was developed for Malay simple sentences based on the Inference Rule, which follows several Malay CFG rules. Moreover, MyParser's ability to handle the required sentences is reported as a percentage over the three test cases on which the application was tested.

The main goal of this study is to outline the Inference Rules for Malay sentences, which are the key for MyParser to process a sentence and label it with Malay CFG grammatical labels using a graphical labelled-bracket approach. The sample simple sentences were provided by a Munsyi Dewan, an expert on the Malay language. A Munsyi Dewan is a speaker officially selected and appointed by the Dewan Bahasa dan Pustaka (DBP) to conduct courses, talks, and seminars on the Malay language in the public and private sectors. The role of the Munsyi Dewan is to provide expert services for reference and to help DBP uphold the use of the Malay language in the public and private sectors. In this study, the Munsyi Dewan plays an important role in reviewing the inference rules used in the parser.

1.2 Problem Statement

Reading and writing are basic skills in the Malay language for primary school children. Mastering these skills is important for achieving more in the coming years. Muzaliha et al. (2012) conducted a study to investigate children with learning disabilities.
The targeted sample consisted of 1010 children aged between 8 and 12 years from 40 primary schools in the Kota Bharu district. The children's performance was measured with the Early Intervention Class for Reading and Writing Screening Test conducted by the Ministry of Education, Malaysia. The results indicated that 4.8% of the students had learning and writing skill problems. The authors suggested that educators should identify and treat these students at an earlier stage.

To present a tool for automatic Malay CFG application, it is important to build the Inference Rule by following the CFG rules for Malay text, so that more precise results on Malay grammar can be produced. However, there is still no consensus standard for producing inference rules for Malay text using NLP toolkits (Zakree & Nazri, 2008). Sources for information retrieval on Malay text categorization using NLP are also limited. Even though researchers have recently surveyed Malay NLP, information on implementing Malay text categorization with NLP is still limited (Zakree & Nazri, 2008). To date, no fully developed system has been designed to process the Malay language for sentence correction; only prototypes have been produced (Noor & Jamaludin, 2012). Furthermore, a parser that processes sentences in order to correct wrong sentences has not yet been established for any language, and especially not for the Malay language (Noor & Jamaludin, 2012). An effective parser is nevertheless important in any NLP toolkit, because the toolkit's performance depends on the quality of the input fed into it. The contribution of this study is therefore the development of MyParser, in order to produce a better algorithm for the NLP toolkit programming language based on the inference rule used for Malay Text Categorization.

1.3 Research Objectives

The objectives of this research are as follows:

i. To design Inference Rule for Malay Text Categorization in the parser.
ii. To develop a parser for Malay Text Categorization based on Inference Rule found in (i) for Malay simple sentence.
iii. To evaluate MyParser using several case studies.

1.4 Aim of the Study

The aim of this research is to develop an application for Malay Text Categorization based on several sentence structures according to the Malay Context Free Grammar (CFG).

1.5 Research Scope

This study focuses only on Malay text processing in which a sentence is divided into two parts, the subject and the predicate. Phrase categorization is limited to noun phrases, verb phrases, adjective phrases, and prepositional phrases. For data sampling, the input sentences were taken from primary school books and tagged manually by a Munsyi Dewan following the Malay language CFG. The word library was taken from one source only, the Kamus Dewan Bahasa dan Pustaka Bahasa Melayu, 4th Edition. In terms of technique, this research focuses only on the NLP toolkit and its programming language.

1.6 Report Outline

This thesis is organized from general related knowledge to deeper discussion, with each chapter forming the foundation of the next. The rest of the thesis is organized as follows. Chapter 1 is the introduction of the report. Chapter 2 provides the relevant background information on Malay sentence categorization using MyParser. The discussion continues with the complementary approaches for studying Malay text processing tools that lead to the categorization of simple sentences. An overview of the Malay text grammar rules is then presented. The chapter also provides a general introduction to Malay Text Categorization using MyParser. Chapter 3 explains and illustrates the research methodology used to carry out the categorization of Malay text. It also discusses the method and framework used in the pre-processing part that contribute to this research.
Chapter 4 discusses the software requirements, based on IEEE Std 830-1998 for SRS. These specific requirements help considerably in the development and implementation phases. They help describe the important documents of an application and facilitate a two-way process between the developer documentation and the user. Chapter 5 covers the implementation of MyParser based on the proposed framework, showing how the implementation works and its user interface. Chapter 6 presents the results and discussion. Lastly, Chapter 7 concludes this dissertation with the conclusions, the achievement of the research objectives, recommendations for future work, and a summary. All related documents are presented in the APPENDIX section.

CHAPTER 2 LITERATURE REVIEW

2.1 Introduction

This chapter briefly describes the dominant topics of the research, such as the Sentence Grammar Overview and the Standard Malay Language. The types of Malay language grammar, including sentence grammar, are then discussed. Next, the rules for basic Malay sentences, Malay Text Categorization, and pre-processing with the parsing technique are presented to give a clear grasp of the research field. Finally, the chapter gives an overview of related studies relevant to the research field, in order to bring out the main issues that have to be addressed. The issues are document representation and the categorization of Malay sentences based on the Malay Context Free Grammar (CFG) rules.

2.2 Sentence Grammar Overview

Grammar is an inner regularity and a simple knowledge representation of language (Chomsky, 1966). It arises from language and plays the most important role in achieving the fundamental aims of linguistic analysis. Grammar plays two roles: separating grammatical sentences from ungrammatical sequences, and studying the structure of grammatical sentences.
The grammar of a language is also a device that generates all the grammatical sentences of the language and none of the ungrammatical ones.

Parsing a sentence is a difficult task. It is an initial step in understanding natural language, but ambiguity is a serious problem that linguists face when building sentence-parsing algorithms for natural language processing. An ambiguity problem occurs when more than one parse tree can be constructed for a sentence. For example, the sentence “He saw the boy with a telescope” has two interpretations. First reading: “He used the telescope to see the boy”. Second reading: “He saw the boy who had a telescope”. Both readings can be represented with tree structures. The first tree is shown in Figure 2.1 and the second in Figure 2.2.

Figure 2.1: First Tree of “He saw the boy with a telescope” (Meyer et al., 2002)
Abbreviations: Sentence (S), Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase (PP), Noun (N), Verb (V), Determiner (D), Preposition (P)

Figure 2.2: Second Tree of “He saw the boy with a telescope” (Meyer et al., 2002)
Abbreviations: Sentence (S), Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase (PP), Noun (N), Verb (V), Determiner (D), Preposition (P)

Figure 2.3: First Tree of “Kami adang air itu” (Karim, 2004)
Abbreviations: Ayat (A), Subjek (S), Predikat (P), Frasa Nama (FN), Frasa Kerja (FK), Kata Nama (KN), Kata Kerja Transitif (KKTr), Penentu (Pent)

Figure 2.4: Second Tree of “Kami adang air itu” (Karim, 2004)
Abbreviations: Ayat (A), Subjek (S), Predikat (P), Frasa Nama (FN), Frasa Kerja (FK), Kata Nama (KN), Penentu (Pent)

Structural ambiguity means that a phrase or sentence can have more than one structure, as shown in Figure 2.1 and Figure 2.2. Besides structural ambiguity, three other types of ambiguity commonly occur in natural languages such as English and Malay: part-of-speech ambiguity, semantic ambiguity, and verbal ambiguity (Jurasky et al., 2000).
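The structural ambiguity described above can be reproduced with a short sketch. The toy grammar and lexicon below are assumptions made for illustration (they are not taken from Meyer et al. or from MyParser); the code simply enumerates every bracketing that the grammar allows for the example sentence.

```python
from functools import lru_cache

# Hypothetical toy lexicon: one part-of-speech label per word.
LEXICON = {
    "he": "NP", "saw": "V", "the": "D", "boy": "N",
    "with": "P", "a": "D", "telescope": "N",
}

# Hypothetical binary rules: (left child, right child) -> possible parents.
RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],  # the PP attaches to the verb phrase
    ("NP", "PP"): ["NP"],  # the PP attaches to the noun phrase
    ("D", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}


def parses(words):
    """Return every bracketed parse of `words` that is labelled S."""
    words = tuple(w.lower() for w in words)

    @lru_cache(maxsize=None)
    def span(i, j):
        # Every (label, bracketed string) that covers words[i:j].
        if j - i == 1:
            tag = LEXICON[words[i]]
            return ((tag, f"[{tag} {words[i]}]"),)
        found = []
        for k in range(i + 1, j):  # try every split point
            for l_label, l_tree in span(i, k):
                for r_label, r_tree in span(k, j):
                    for parent in RULES.get((l_label, r_label), []):
                        found.append((parent, f"[{parent} {l_tree} {r_tree}]"))
        return tuple(found)

    return [tree for label, tree in span(0, len(words)) if label == "S"]


trees = parses("He saw the boy with a telescope".split())
for tree in trees:
    print(tree)
```

Under this toy grammar the sentence receives two parses: one in which the prepositional phrase sits inside the noun phrase ("the boy who had a telescope") and one in which it modifies the verb phrase ("used the telescope to see").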
In minimizing structural ambiguity, MyParser is one solution that can be used to counter the problem, using inference rules created from the Malay Context Free Grammar (CFG) rules.

2.3 The Standard Malay Language

The Standard Malay Language, or Bahasa Baku (the word Baku comes from a Javanese word meaning true and correct), agreed upon by Malaysia, Indonesia, and Brunei, is Bahasa Riau. This implies that the spelling, words, phrasing, grammar, pronunciation, punctuation, sentences, abbreviations, acronyms, capital letters, numbering, and style of the language are already standardized (Khalifa et al., 2007).

The Malay language has its own context-free grammar, in which a sentence combines a subject and a predicate (Juzaiddin et al., 2006). It requires a set of grammar rules known as context-free grammar (CFG) in English or phrase structure rules (RSF) in the Malay language. Every sentence used in a language is constructed according to the CFG, especially in the Malay language. For this reason, much research has been conducted in language studies on producing good sentence structure, especially in BM (Noor & Jamaludin, 2012).

A sentence parser is one of the technological tools that can be used to validate a sentence in order to produce good sentence structure. It is also known as a syntactic parser by other researchers. It parses the sentence according to the CFG provided, and its function is to validate the construction of the words used in a sentence. If a sentence is structured according to the CFG rules, the parser classifies the sentence as true. Many studies on sentence parsers have been conducted by Malay language researchers, including Latif (1995), Ramli (2002), and Ahmad et al. (2007). The parsers in these studies could validate a sentence according to the CFG rules.
This study therefore takes up the challenge of producing an algorithm for a Malay text parser that categorizes sentences into the correct grammatical structure. Researchers have also studied Malay sentence similarity based on searching for the appropriate context within a Malay sentence, such as patterns of words. The context is determined by looking up rules in a rule-based phrase database. In implementing this approach, Rahman et al. (2011) describe a working prototype application that can be used as a tool for improving written text in the Malay language, personalized toward the requirements of teaching and learning the language in primary and secondary schools. The ongoing challenges of Malay text as a research domain make this language a strong candidate to use as a sample when developing a parser. It differs from English text, where the phrase grammar rules are simpler to understand and the NLP tools are based on the English language.

2.4 The Types of Malay Language Grammar

There are three types of grammar in the Malay language: sentence grammar, partial discourse grammar, and the pola (pattern) sentence grammar.

2.4.1 Sentence Grammar

This type of grammar uses personal, dialectal (the total amount of a language that any person knows and uses), artificial-sounding, and independent sentences as a guide in making syntactic Malay sentences. Ayat (sentence) grammar has two models, namely the transformational-generative grammar and the relational grammar (Nik Safiah Karim, 1975). The transformational-generative grammar consists of a series of phrase-structure rewrite rules: for example, a series of rules that generates the underlying phrase structure of a sentence, and a series of rules that act upon the phrase structure to form a more complex sentence.
The relational grammar is a theory of descriptive grammar that states syntactic operations in terms of relationships such as that between subject and object. Both models inherit from CFG.

2.4.2 CFG in Malay Language

The CFG for the Malay language was formed by Nik Safiah Karim (1995). It became the basis for developing a probabilistic grammar for the Malay language. The CFG for forming a basic sentence of the Malay language is pictured in Figure 2.5.

Table 2.1: Description of Elements Used in Malay Grammar Rules (Ahmad et al., 2007)

Figure 2.5: CFG for Malay Language (Karim, 1995)

2.4.3 Partial discourse grammar

A partial discourse grammar is a grammar that picks out sentences from discourse in order to make linguistic statements about them. This type of grammar differs from sentence grammar because it uses a “language-first” approach to the writing of syntax, while sentence grammar uses a “theory-first” approach. According to Simin (1988), the “language-first” approach gives the Malay reader a chance to read the latest ideas about the genius of his language in his own language, while the “theory-first” approach is more likely to be used to make the chosen theory appear workable.

Example of partial discourse grammar:
Aminah membaca buku. Dia juga mendengar radio.
(Aminah is reading a book. She is also listening to the radio.)
Here Dia (She) refers to Aminah.

2.4.4 ‘Pola’ (Pattern) Grammar

“Pola” grammar is the pattern of grammar in sentences. This type of grammar was used by Azhar Simin (1988). Each “pola” is linked to a class-name that forms or helps to make a basic sentence, and each “pola” is a formula for making a basic sentence.

Example:
“Pola”: Pelaku + perbuatan
Pattern: Actor + verb
Sentence: Saya makan. (I eat.)

Karim (2004) represents the most theoretical work on pola grammar. It provides a methodology for pola grammar writing.
Below are the pola of grammar for the Malay language:

(i) Pelaku + Perbuatan (Actor + Verb)
(ii) Pelaku + Perbuatan + Pelengkap (Actor + Verb + Complement)
(iii) Perbuatan + Pelengkap (Verb + Complement)
(iv) Diterangkan + Menerangkan (Signified + Signify)
(v) Digolong + Penggolong (Classified + Classifier)
(vi) Pelengkap + Perbuatan + Pelaku (Complement + Verb + Actor)
(vii) Pelengkap + Perbuatan (Complement + Verb)

For forming a basic sentence in the Malay language, the suitable type of grammar is sentence grammar, which provides rules. The rules are derived from the CFG shown in Figure 2.5.

2.5 Rules for Basic Malay Sentence

To create rules for a sentence in the Malay language, we should follow the CFG for the Malay language by Nik Safiah Karim (1995), as shown in Figure 2.5. A basic sentence in the Malay language can be derived from these four basic patterns of rules:

1) A → FN + FN (S → NP + NP)
2) A → FN + FK (S → NP + VP)
3) A → FN + FA (S → NP + AP)
4) A → FN + FS (S → NP + PP)

where A = Ayat (Sentence), FN = Frasa Nama (Noun Phrase), FK = Frasa Kerja (Verb Phrase), FA = Frasa Adjektif (Adjective Phrase), and FS = Frasa Sendi (Prepositional Phrase).

These basic rules are used to implement the inference rule inside MyParser, which categorizes the structure of a sentence into its phrases.

2.6 Malay Text Categorization

Malay Text Categorization means categorizing, or classifying, Malay text under a suitable label that can be applied to a Malay sentence. In this study, the goal of text categorization is to classify given new sentences into a fixed set of predefined categories or structures. Other researchers, especially in Malaysia, tend to present new prototypes of Malay text processing in various ways without focusing on commercializing them.
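The four basic patterns of Section 2.5 can be sketched as a lookup from the pair (subject phrase, predicate phrase) to a rule. This is only an illustrative sketch: the function name `categorize` and the exact bracket layout are assumptions, not MyParser's actual implementation or output format. The example sentence "Saya makan" is the one used in Section 2.4.4.

```python
# The four basic sentence patterns of Section 2.5, keyed by
# (subject phrase label, predicate phrase label).
BASIC_PATTERNS = {
    ("FN", "FN"): "S -> NP + NP",
    ("FN", "FK"): "S -> NP + VP",
    ("FN", "FA"): "S -> NP + AP",
    ("FN", "FS"): "S -> NP + PP",
}


def categorize(subject, predicate):
    """Match a (label, text) subject and predicate against the basic patterns.

    Returns (rule, bracketed sentence), or None when the pair of phrase
    labels is not one of the four basic patterns.
    """
    key = (subject[0], predicate[0])
    if key not in BASIC_PATTERNS:
        return None
    # Hypothetical labelled-bracket rendering of the sentence.
    bracketed = f"[A [{subject[0]} {subject[1]}] [{predicate[0]} {predicate[1]}]]"
    return BASIC_PATTERNS[key], bracketed


result = categorize(("FN", "Saya"), ("FK", "makan"))
print(result)
```

A table-driven design keeps the rules as data, so a new pattern can be added without changing the matching code.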
As mentioned in the previous chapter, to date no fully developed system has been designed to process the Malay language for sentence correction; only prototypes exist (Noor & Jamaludin, 2012). In this study, Malay Text Categorization is based on the Inference Rule built from the CFG and implemented in MyParser. MyParser categorizes a Malay simple sentence into its structure based on the CFG, and the output is labelled using bracket notation. An example is explained in Chapter 5.

Previous researchers have described automated text categorization (TC) as a supervised learning task, defined as assigning pre-defined category labels to new documents based on the likelihood suggested by a training set of labelled documents (Yang & Liu, 1999). The categories are just symbolic labels, and no additional knowledge (of a procedural or declarative nature) of their meaning is available. No exogenous knowledge (data provided for classification purposes by an external source) is available; classification must therefore be accomplished on the basis of endogenous knowledge only (i.e., knowledge extracted from the documents). A parser built for text categorization is therefore a smart tool that can be used to differentiate, or categorize, every word in the sentence by its phrase.

2.7 Pre-processing

Prior to any deeper linguistic treatment of a text, the units of the text must be demarcated and possibly classified. A text can initially be viewed as a mere sequence of characters within which these units must be defined (Grefenstette & Tapanainen, 1994). Pre-processing plays an important role in developing MyParser; the pre-processing part involves a Natural Language Processing toolkit (NLP toolkit). The top-down approach (Bill, 2002) is applied to the Malay grammatical rules. In addition, Malay words applicable to humans and animals are identified to achieve a basic level of semantic parsing on top of syntactic parsing.
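The demarcation and classification of text units described above can be sketched in two steps: split the sentence into word tokens, then look each token up in a lexicon to obtain its part of speech. The tiny `LEXICON` below is a hypothetical stand-in for MyParser's word database, and the function names are assumptions for illustration; the tag abbreviations (KN, KKTr) follow those used with Figure 2.3.

```python
import re

# Hypothetical stand-in for the word database (word -> POS tag).
# KN = Kata Nama (noun), KKTr = Kata Kerja Transitif (transitive verb).
LEXICON = {"aminah": "KN", "membaca": "KKTr", "buku": "KN"}

# A token is a run of letters, optionally joined by hyphens
# (hyphenated reduplications such as "guru-guru" stay as one token).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(sentence)


def tag(tokens):
    """Pair each token with its POS tag, or "UNKNOWN" if it is not in the lexicon."""
    return [(token, LEXICON.get(token.lower(), "UNKNOWN")) for token in tokens]


tokens = tokenize("Aminah membaca buku.")
tagged = tag(tokens)
print(tagged)
```

Marking out-of-lexicon words as "UNKNOWN" rather than raising an error lets a later stage decide how to report them to the user.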
The developed prototype expects a user to input a sentence, which it then parses to check for grammatical errors. Next, the prototype should display the graphical bracket notation for the grammatical structure of the sentence, so that the user can see the structural arrangement of a correct sentence. The programming for Malay text categorization is likewise developed so that the application suits Bahasa Melayu grammar. In the pre-processing part, NLP plays an important role because the method of producing text categorization is implemented in it (Zakree et al., 2008). To understand the state of the art in Malay language technologies, they conducted three tests on tools available for NLP applications; two of the tools are the Malay Sense Tagger Prototype1 and a syntactic parser based on maximum entropy. Figure 2.6 shows the pre-processing for conceptual clustering, which can similarly be implemented in text categorization.

Figure 2.6: Pre-processing for Conceptual Clustering (Zakree et al., 2008)

Many of the knowledge-representation and inference techniques that have been applied fruitfully in knowledge-based systems were originally developed for processing natural language. However, the language-processing applications themselves have always seemed far from being realized. This special series on NLP is an attempt to bring language processing and its applications into focus - to demonstrate techniques that have recently been applied to real-world problems, to identify research ripe for practical exploitation, and to illustrate some promising combinations of NLP with other emerging technologies.

In NLP, there is a difference between language-dependent and language-independent systems. A language-dependent system is geared to a specific language or set of languages. It might utilize manually built lexical resources such as ontologies, thesauri or other language-specific or domain-specific knowledge bases (Hassel & Dalianis, 2011).
Other dependencies constraining a system to a specific language may be the employment of advanced tools, for example full parsers, semantic role assigners or named entity taggers, or the use of techniques such as template filling. Language independent usually denotes an NLP system that is easily moved between different languages or domains. The system is thereby independent of the target language.

2.7.1 Parsing technique

Parsing a sentence is a process of breaking down a sentence into its component words. It involves the use of linguistic knowledge of a language to discover the way in which a sentence is structured. According to Charles (1994), there are two ways of parsing a sentence: top-down and bottom-up. A top-down parse is a predictive process in which the verification of a rule occurs as the last step in the application of the rule. It begins with the start symbol at the top of the parse tree and works downward. For example, S is replaced with NP VP: the left side of a rule is matched to an appropriate symbol, and that symbol is then replaced with the symbols on the right side of the rule.

In contrast, a bottom-up parse is a postdictive process, in which a rule is not applied until the input required for its application is present. It is referred to as a bottom-up parse because the construction of the tree begins from the terminal nodes of the tree. For example, NP VP is replaced with S: the right side of a rule is matched to the appropriate symbols, and the matched symbols are then replaced with the symbol on the left side of the rule.

2.7.2 Error Analysis of Parsing Technique

Hendrickson (1979) introduced a grid for evaluating language performance (Figure 2.7). The grid format allows errors to be categorized along two scales. The horizontal scale categorizes errors by lexicon (vocabulary), syntax (grammatical structure: word order, verb phrase, etc.), morphology (grammatical agreement), and orthography (spelling errors).
On the vertical scale, "global" refers to errors that affect the organization of the entire sentence (for example, missing subjects or main verbs). "Local" errors affect only the constituent in which they appear (such as a noun phrase or prepositional phrase). The problem areas are meant to be filled in with short descriptors of the errors. In spite of these differences, Hendrickson’s grid provides a good framework for predicting the categories of errors that the syntactic parser will find and for describing how that parser might be designed to provide meaningful error messages to the user.

Figure 2.7: The Format for Errors Analysis (Hendrickson, 1979)

Then, the average recall Avg (2.1) and the weighted average recall WAvg (2.2) of correctly parsed sentences were reported using the following equations:

(2.1)

where k = 3 represents the three person levels, A the number of test cases, and B the total number of sentences correctly parsed. The formula was introduced by Abidin (2007).

(2.2)

The weighted average recall was calculated mainly because the total number of correctly parsed sentences differs from one level to another.

2.8 Related Work

Among previous studies, the parsers produced by Latif (1995) and Ahmad et al. (2007) accept a sentence as input and produce output in the form of a syntax tree, and can be seen as more in-depth studies than Suzaimah’s parser. Suzaimah’s parser is limited in that its input is accepted only as Parlog code (Ramli, 2002). In addition, the resulting output (a Parlog clause) is hard for some users to understand.

2.8.1 Ahmad’s Malay Parser

The parser was developed by a group of researchers from Universiti Teknologi Petronas led by Ahmad Izuddin Zainal Abidin (Ahmad et al., 2007). This parser performs syntactic parsing using a top-down parsing approach.
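The top-down approach can be illustrated with a small recursive-descent sketch in Python. It is not the parser of Ahmad et al. (2007): the toy grammar (S → NP VP, NP → N, VP → V NP | V), the word tags N and V, the three-word lexicon and the function names are assumptions chosen only to show expansion starting from the start symbol S and working downward to the words.

```python
# Toy grammar, assumed for illustration: each left side maps to its alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"], ["V"]],
}
# Hypothetical lexicon: 'bapa' (father), 'nasi' (rice), 'memasak' (cooking).
LEXICON = {"bapa": "N", "nasi": "N", "memasak": "V"}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` derives tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Word category: verified last, against the single word at `pos`.
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Replace the left-side symbol with the symbols on the right side.
        partial = [([], pos)]
        for child in rhs:
            partial = [
                (kids + [tree], nxt)
                for kids, p in partial
                for tree, nxt in expand(child, tokens, p)
            ]
        for kids, end in partial:
            yield (symbol, *kids), end


def parse(sentence):
    """Return the first parse tree covering the whole sentence, or None."""
    tokens = sentence.lower().split()
    for tree, end in expand("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse("bapa memasak nasi"))
# ('S', ('NP', ('N', 'bapa')), ('VP', ('V', 'memasak'), ('NP', ('N', 'nasi'))))
print(parse("memasak bapa"))  # None
```

Returning None for a word order the grammar cannot derive mirrors, in a very reduced form, the idea of rejecting a sentence that is not grammatically correct.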
The aim of the parser is to complement existing word-processing systems by checking the grammar of a test sentence. Another function of this parser is to illustrate a parse tree if the sentence is grammatically correct. The research domain of the parser is the Malay language, with a focus on basic Malay sentences. This parser also addresses semantics, providing a basic level of semantic parsing on top of syntactical parsing. In the semantic parsing, Malay words are divided into two categories: humans and animals. Some examples of words for humans are ‘mengandung’ (pregnant), ‘memasak’ (cooking), and ‘berfikir’ (thinking), while examples for animals are ‘meragut’ (grazing), ‘mengawan’ (mating) and ‘bunting’ (pregnant, for animals). The advantage of this parser is that it can handle semantic ambiguity. For instance, the sentence ‘bapa meragut rumput’ (a father is grazing on grass) fails to parse because the word ‘meragut’ is categorised under animals and not humans. This parser was evaluated by experts in the Malay language, specifically school teachers.

Figure 2.8: The Architecture of Ahmad's Malay Parser

Figure 2.8 illustrates the system architecture of Ahmad’s Malay Parser. When a user inputs a sentence, the checking engine parses the test sentence using the text parser component. The text parser component has two important parts which form the technical structure of the parser: the grammar rules and the Malay Lexicon. The grammar rules are derived from Karim (1995), while the Malay Lexicon contains three thousand (3000) words arranged according to word categories. The words were collected from Kamus Dwibahasa Oxford Fajar 2nd Edition (Hawkins, 2001).

REFERENCES

Abidin, A. I., Yong, S. P., Kasbon, R., & Azman, H. (2007). Utilizing top-down parsing technique in the development of a Malay language sentence parser. Proceeding of 2nd International Conference on Informatics, 125–131.
Bill, W. (2002).
Grammars and Parsing. Retrieved September 12, 2013, from http://www.cse.unsw.edu.au/~billw/cs9414/notes/nlp/grampars.html
Blossom, A., Gebhard, D., Emelander, S. & Meyer, R. (2007). Software Requirements Specification (SRS) Book E-Commerce System. Retrieved June 23, 2013, from http://www.cse.msu.edu/~chengb/RE-491/Papers/SRS-BECS2007.pdf
Charles F. S. (1994). Parsing Techniques: A practical Guide. State University of New Jersey. Retrieved September 30, 2013, from http://wwwrci.rutgers.edu/~cfs/305_html/Understanding/Parsing.html
Cheng, B. H. C. (2007). “Intro to Specifications”. CSE 435, East Lansing, MI, Computer Science and Engineering, Michigan State University.
Chin, B. A. (2000). The Role of Grammar in Improving Student’s Writing. Retrieved September 12, 2013, from http://www.sadlier-oxford.com/papers/chinpaper.html
Chomsky, N. (1966). Topics in the theory of generative grammar (Vol. 56). Walter de Gruyter.
Dennis A., Wixom B., H., & Tegard D. (2005). System Analysis and Design with UML Version 2.0. John Wiley & Sons Inc. ISBN 0-471-34806-6
Domeiji R., Knutsson O., & Eklundh K.S. (2002). Different Ways of Evaluating a Swedish Grammar Checker. Proceedings of The Third International Conference on Language Resources and Evaluation (LREC 2002). Las Palmas, Spain.
Don, Z. M. (2010). Processing Natural Malay Texts: a Data-Driven Approach. Trames. Journal of the Humanities and Social Sciences, 14(1), 90. doi:10.3176/tr.2010.1.06
Grefenstette, G., & Tapanainen, P. (1994). What is a Word, what is a Sentence?: Problems of Tokenisation (pp. 79-87). Rank Xerox Research Centre.
Hassel, M., & Dalianis, H. (2011). Portable Text Summarization. Applied Natural Language Processing and Content Analysis: Advances in Identification, Investigation and Resolution, 17.
Hawkins J. M. (1997). Kamus Dwibahasa Oxford Fajar Edisi Kedua. Penerbit Fajar Bakti Sdn. Bhd.
Hendrickson, J. M. (1979). Evaluating Spontaneous Communication Through Systematic Error Analysis. Foreign Language Annals, 12(5), 357-364.
IEEE Recommended Practice for Software Requirements Specifications. (1998). Retrieved June 23, 2013, from http://www.cse.msu.edu/~cse870/IEEEXploreSRS-template.pdf
Jurafsky, D., & James, H. (2000). Speech and Language Processing. Retrieved June 23, 2013, from http://www.deepsky.com
Juzaiddin, M., Aziz, A., Ahmad, F., Azim, A., Ghani, A., & Mahmod, R. (2006). Pola grammar technique for grammatical relation extraction in Malay language. Malaysian Journal of Computer Science, 19(1), 59.
Karim N. S. & Arbak O. (2004). Kamus Komprehensif Bahasa Melayu. Cetakan Kedua. Penerbit Fajar Bakti Sdn. Bhd.
Kassim M. K. Kaedah Pengajaran Murid Sederhana Dan Lemah. Retrieved April 30, 2013, from http://khirkassim.blogspot.com/2009/02/kaedah-pengajaran-pelajarsederhana-dan.html
Khalifa, O. O., Ahmad, Z. H. & Gunawan, T. S. (2007). SMaTTS: Standard Malay Text to Speech System, 285–293.
Latif, R. A. (1995). Penyemak Sintaksis Ayat Bahasa Malaysia. Universiti Kebangsaan Malaysia, Bangi, Selangor.
Muhamad N. Y. & Jamaludin, Z. (2012). Malay declarative sentence: Visualization and sentence correction. Open Systems (ICOS), 2012 IEEE Conference on, pp. 1-5. doi:10.1109/ICOS.2012.6417659
Noor, M. N. & Jamaludin, Z. (2012). Parser with Sentence Correction for Malay Language (BM), 45(Icikm), 138–142.
Noor, M.Y. & Jamaludin, Z. (2012). Malay declarative sentence: Visualization and sentence correction. Open Systems (ICOS), 2012 IEEE Conference on, pp. 1-5, 21-24 Oct. 2012. doi:10.1109/ICOS.2012.6417659
Perkins J. (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, Birmingham, UK. ISBN 978-1-849513-60-9
Pfleeger, S. L. (1998). Software Engineering Theory and Practice, International Edition. USA: Prentice Hall.
Rahman, S. A., Omar, N. B., Mohamed, H., Juzaidin, M., & Aziz, A. (2011). A Synonym Contextual-based Process for Handling Word Similarity in Malay Sentence.
Ramli, S. (2002). Reka bentuk dan implementasi suatu penghurai bahasa Melayu menggunakan sistem logik selari. Universiti Putra Malaysia, Selangor.
Simin, A. (1988). Discourse-Syntax of "YANG" in Malay (Bahasa Malaysia). Dewan Bahasa dan Pustaka, Kuala Lumpur.
Sukor, M. N., Jeon, K., Yusof, Z. A. & Yusuf, K. (2006). Kurikulum Bersepadu Sekolah Rendah Bahasa Melayu Tahun 5. Dewan Bahasa dan Pustaka.
Surabhi, M.C. (2013). "Natural language processing future," Optical Imaging Sensor and Security (ICOSS), International Conference on, pp. 1-3, 2-3 July 2013. doi:10.1109/ICOISS.2013.6678407
Talita, P. & Yeo, A. W. (2010). Challenges in Building Domain Ontology For Minority Languages, (ICCAIE), 574–578.
Yang, Y. & Liu, X. (1999). A re-examination of text categorization methods. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’99, 42–49. doi:10.1145/312624.312647
Zakree, M. & Nazri, A. (2008). An Exploratory Study on Malay Processing Tool for Acquisition of Taxonomy Using FCA. doi:10.1109/ISDA.2008.254
Zakree, M., Nazri, A., Shamsudin, S. M. & Bakar, A. A. (2008). An Exploratory Study of the Malay Text Processing Tools in Ontology Learning.