MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

advertisement
MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING
INFERENCE RULE
MAISARAH BTE YAMAN
A thesis submitted in
partial fulfillment of the requirement for the award of the
Degree of Master of Science (Software Engineering)
Faculty of Computer Science and Information Technology
Universiti Tun Hussein Onn Malaysia
DECEMBER, 2013
iv
ABSTRACT
Natural Language Processing (NLP) is a technique where a machine can understand
human better and thereby reduce the distance between human being. NLP was
implemented in MyParser, acting as a tool of parsing sentence. This research aimed
to develop a sentence parser application for teachers and learners of the Malay
language who faced difficulties in comprehending the grammar and phrase structure.
There are three major processing steps that have been drawn which are Input
Component, Parsing Engine and Output Component. Parsing engine phase involved
pre-processing phase; the use of 'Tokenizer' and 'Part Of Speech' (POS). The input
component is a simple sentence provided by an expert of Malay Language (Munsyi
Dewan). It is categorized based on the inference rule implemented in MyParser. This
study is significant in designing inference rules for Malay Language sentences,
focusing on the relationship between computing language. The output was labelled
according to each phrase following the Malay Context Free Grammar (CFG) rule.
Thus, parsing technique is an essential component to be considered in parsing
application development. Besides that, MyParser application was developed to
process Malay simple sentence by categorizing them into different grammar and
phrase structure. The target groups for this application are students and teachers of
Malay Language subject in primary school. MyParser was evaluated using 100
training data which were agreed by a qualified expert on Malay Language of
Malaysia (Munsyi Dewan). More than thousand words were stored in the database.
This application was found to be able to visualize correct sentence with its labelled
graphical tags. MyParser was tested by different level of teachers from primary and
secondary school and Munsyi Dewan. The results proved that MyParser achieved
more than 90% accuracy in constructing sentences based on its grammatical rule.
v
ABSTRAK
Pemprosesan bahasa semulajadi (NLP) adalah teknik yang mana sesuatu mesin boleh
memahami kehendak manusia. NLP adalah komponen yang penting dalam proses
penguraian bahasa yang dibangunkan iaitu MyParser. Penyelidikan ini bertujuan
untuk membangunkan aplikasi penghurai untuk memproses ayat tunggal Bahasa
Melayu bagi kegunaan guru dan pelajar kelas Bahasa Melayu yang mempunyai
masalah pemahaman nahu konteks Bahasa. Terdapat tiga bahagian utama yang
terdapat dalam MyParser iaitu ‘Komponen Input’, ‘Enjin Penghurai’, dan
‘Komponen Output’. Fasa Enjin Penghurai melibatkan proses 'Penandaan' dan
‘Golongan Kata’. Komponen input adalah ayat ringkas yang diberikan oleh pakar
bahasa (Munsyi Dewan) yang dikategorikan berdasarkan peraturan inferen yang
dilaksanakan di MyParser. Kajian ini menyumbang kepada pembentukan peraturanperaturan untuk ayat dalam Bahasa Melayu dengan memberi tumpuan kepada
hubungan antara bahasa pengaturcaraan. Oleh itu, teknik penghuraian ini merupakan
komponen penting dalam pembangunan aplikasi ini. Selain itu, aplikasi MyParser
dibangunkan
untuk
memproses
ayat
ringkas
Bahasa
Melayu
dengan
mengkategorikan ayat-ayat tersebut dalam struktur tatabahasa dan frasa yang
berbeza. Kumpulan sasaran aplikasi ini adalah guru-guru dan pelajar dari sekolah
rendah yang mengambil matapelajaran bahasa Melayu. Tatabahasa yang dipilih
untuk aplikasi ini ialah nahu bebas konteks (CFG) dan terhad kepada ayat tunggal
Bahasa Melayu yang mengandungi frasa-frasa ayat mudah untuk Bahasa Melayu.
'MyParser' ini dinilai menggunakan 100 data ujian yang dipersetujui oleh pakar
Bahasa Melayu (Munsyi Dewan) dan terdapat lebih daripada 1000 patah perkataan
Bahasa Melayu yang disimpan dalam rekod pangkalan data. Aplikasi ini didapati
dapat menggambarkan ayat yang betul dengan tag grafik yang dilabel.Hasil
keputusan daripada kajian yang dilakukan terhadap guru Bahasa Melayu sekolah
rendah dan menengah serta Munsyi Dewan menunjukkan 'MyParser' mencapai lebih
daripada 90% ketepatan dalam pembentukan ayat mengikut nahu bahasanya.
vi
CONTENTS
TITLE
i
DECLARATION
ii
DEDICATION
iii
ABSTRACT
iv
CONTENTS
vi
LIST OF TABLE
viii
LIST OF FIGURE
ix
CHAPTER 1 INTRODUCTION
1.1
An Overview
1
1.2
Problem Statement
3
1.3
Research Objectives
5
1.4
Aim of Study
5
1.5
Research Scope
5
1.6
Report Outline
6
vi
CHAPTER 2 LITERATURE REVIEW
2.1
Introduction
7
2.2
Sentence Grammar Overview
7
2.3
The Standard Malay Language
10
2.4
The Types of Malay Language Grammar
12
2.4.1 Sentence Grammar
12
2.4.2 CFG in Malay Language
12
2.4.3 Partial discourse grammar
15
2.4.4 ‘Pola’ (Pattern) Grammar
15
2.5
Rules for Basic Malay Sentence
17
2.6
Malay Text Categorization
18
2.7
Pre-processing
19
2.7.1 Parsing technique
21
2.7.2 Error Analysis of Parsing Technique
21
Related Work
23
2.8.1 Ahmad’s Malay Parser
23
2.8
2.8.2
2.9
Juzaiddin’s Malay Parser
25
2.8.3 Comparison Among Research Works
27
Summary
28
CHAPTER 3 RESEARCH METHODOLOGY
3.1
Introduction
29
3.2
The Proposed Framework for MyParser
30
3.2.1 Input Component
32
3.2.2 MyParser (Pre-processing process)
35
3.2.3 Output Sentence
36
Summary
36
3.4
CHAPTER 4 ANALYSIS: MYPARSER
4.1
Introduction
36
4.2
Specific Requirements
39
4.2.1 Functional Requirements
39
4.2.2 Non-functional Requirements
39
4.2.3 UML Specification
40
vii
4.3
4.2.3.1 Use case diagram
40
4.2.3.2 Class Diagram
49
4.2.3.3 Sequence Diagram
53
4.2.3.4 Activity Diagram
55
4.2.3.5 Test Plan
56
Summary
58
CHAPTER 5 IMPLEMENTATION OF MYPARSER
5.1
Introduction
59
5.2
The implementation of MyParser
60
5.3
Summary
67
CHAPTER 6 RESULTS AND DISCUSSION
6.1
Introduction
68
6.2
The Experimental Result of MyParser
68
6.2.1 Error Analysis of MyParser
71
6.3
Comparison with other Malay NLP
76
6.3
Summary
76
CHAPTER 7 CONCLUSION
7.1
Introduction
77
7.2
Achievement of Research Objectives
77
7.3
Recommendations for Future Work
78
7.4
Conclusion
80
REFERENCES
81
APPENDIX A
84
APPENDIX B
102
viii
LIST OF TABLE
4.1
The Description of Insert Sentence
42
4.2
The Description of Edit Sentence
43
4.3
The Description View Malay CFG Rule
44
4.4
The Description Update Sentence
45
4.5
The Description Add Rule
46
4.6
The Description Update Rule
46
4.7
The Description Update New Word
47
4.8
The Description Record Data
48
4.9
The Description Class User
50
4.10
The Description Class Student
50
4.11
The Description Class Teacher
51
4.12
The Description Class Programmer
51
4.13
The Description Class Programmer
52
4.14
The Test Plan of MyParser
57
6.1
Results Of Sentence Parse
67
6.2
The Analysis Error of MyParser
69
6.3
The Analysis Of Malay NLP
69
x
LIST OF FIGURE
2.1
First tree of “He saw the boy with a telescope” (Meyer et al., 2002)
8
2.2
Second tree of “He saw the boy with a telescope” (Meyer et al., 2002)
9
2.3
First tree of “Kami adang air itu
9
2.4
Second tree of “Kami adang air itu”
10
2.5
Context-Free Grammar for Malay Language (Karim,1995)
14
2.6
Pre-processing for conceptual clustering (Zakree et al., 2008)
20
2.7
The Format For Errors Analysis by Hendrickson (1979)
22
2.8
The Architecture of The Parsing For Ahmad's Malay Parser
23
2.9
The Architecture of The Juzaidin's System
25
3.1
The Proposed Framework for MyParser
30
3.2
Flowchart of MyParser processing sentence
31
4.1
Recommended SRS Structure (IEEE Std. 830-1998)
38
4.2
Use Case Diagram of MyParser
41
4.3
Class Diagram of MyParser
48
4.4
Sequence Diagram of User
53
4.5
Sequence Diagram of Programmer
55
4.6
Activity Diagram of The Users
55
4.7
Activity Diagram of The Programmer
56
5.1
The Main Page of Application
60
5.2
The Sentence Tokenization Process
61
5.3
The Algorithm Sentence Tokenization
61
5.4
The Word By Word Tokenization Process
62
5.5
The Algorithm Word for Tokenization Process
62
5.6
The Algorithm Word for POS Tagging Process
62
5.7
The Error Massage Window Of Myparser Application
63
5.8
The Inference Rule for MyParser Based on Malay CFG
64
5.9
The Message Window Of Malay CFG Structure Correct
65
xi
5.10
The Sentence Visualization Structure Of Malay CFG Is Correct
66
5.11
The Sentence Visualization Structure Of Malay CFG Is Incorrect
64
6.1
The Score Of Build Correct And Incorrect Sentence
68
6.2
The Pie Chart Of Total Sentence Parsed
69
6.3
The Errors Analysis of MyParser
70
6.4
The Lexicon Analysis Error of MyParser
71
6.5
The Syntax Analysis Error of MyParser
72
6.6
The Orthography Analysis Error of MyParser
73
CHAPTER 1
INTRODUCTION
1.1
Overview
Processing techniques for English texts have been developed over long periods of
time and have their own general approach to the structure of English. This cannot be
done for Malay language, partly because the structures of Malay language remains
largely improper, and partly because it is strikingly different from English (Don,
2010). The development of the database incorporates two important design
principles, namely the logical organization of data, and the separation of text
properties, lexicon, and grammar. Grammar is a formal specification of rules of
language, while parsing is a method to perform syntactic analysis where the syntactic
means the rule of the grammar. According to Abidin et al. (2007), grammar is the
rule and pattern which combined words, clause, and phrases to provide meaning to
our daily use and the study of grammar that relates to language parsing. Muhamad &
Jamaludin (2012) concluded that "In Malaysia, research in sentence parse tree
visualization for Malay language (BM) still not gaining enough attention from
researcher to construct a prototype as what been done to English language".
Recently, the progress on developing Malay Text parser for development of
simple sentences categorization has not spread widely despite the massive
development of other language in Indo-European and other Asian family language.
The growth of text processing application in English is becoming more widespread
in many different fields such as information system, natural language processing.
The use of text categorization with natural language processing approach is derived
from a combination of the important role of their relation between each other. By
considering the time and cost factor, the software requirement has played an
2
important role in most of the building process of Natural Language Processing (NLP)
toolkit (Talita & Yeo, 2010). An effective NLP tool as the pre-processing component
is considered necessary in developing a tool for automatic parsing of Malay text
based.(Zakree et al., 2008). The purpose of NLP is to ensure that machines
understand the human language. In a study by Surabhi (2013), NLP is a technique
where a machine can become more human, and in that way, reducing the distance
between human being and the machine. Therefore, NLP makes human to easily
communicate with a machine. Many applications have been developed within past
few decades in NLP. Most of these are very useful for everyday life. There are also
lots of research groups working on this topic to develop more practical useful
systems.
Furthermore, the entities, properties, interrelationship, and functions derived
by graphical design notation approach are representation of knowledges intended to
capture the conceptualized information (Talita & Yeo, 2010). For English sentence
grammar, most of the language researchers introduced sentences parse tree
visualizations to help understanding sentence structure. Among the applications
introduced, phpSyntaxTree and RSyntax tree give users the chance to visualize an
English sentence throughout online dealings (Muhamad & Jamaludin, 2012).
Currently, word-processing application for Malay language only exists in the form of
words translator or spellchecker. Therefore, this research aims to complement the
existing word-processing software by presenting an application of Malay sentence
parser using structure of an expression by writing square brackets ('[' and ']') to the
left and right hand side of its component parts technique. The prototype is able to
illustrate the structure of a grammatically correct sentence, determine if a sentence is
grammatically correct, and semantically parse a sentence. In evaluating the
application, sentences which were inputs to the application, were randomly provided
by experts in the structure of Malay language grammar. If a sentence follows the
Context Free Grammar (CFG) rules for Malay language but is incorrect in terms of
semantics, it is considered to be a correct sentence. So, it enables people and
software to share a common understanding of the information’s nature and structure.
Then, the sentence structure will be categorized between the inferences rules that will
be develop in this studies.
3
The pre-processing component specifications play a key role in enhancing the
potential for data sharing and reuse across the applications. Prior to any deeper
linguistic treatment of a text, the units of the text must be marked and possibly
classified. So, to categorize Malay language sentence using MyParser application, it
involved some tool in NLP where we need to develop the programming language for
the acceptance of Malay Text via the NLP tools. An important element of this
research is the structure of building this MyParser to parsing the text to build the
labelled Malay sentence.
This study contributes a design of an Inference Rule for Malay Text in the
parser which focuses on the relationship between the programming language and the
implementation of the inference rule that was created using labelled bracket notation
from left to right, top bottom parse approach (Charles, 1994). Subsequently, a parser
was developed for the simple sentence of Malay Text based on Inference Rule that
followed several Malay CFG rules. Moreover, MyParser provides the ability to full
fill the sentence required by percentage of application tested on the three test cases.
The main goal of this study is to outline the Inference Rules for Malay sentence
which is the key of MyParser to process and create a sentence with Malay CFG
grammatical label using graphical bracket labeled approach. The sample of simple
sentence was provided by Munsyi Dewan, who is the expert of Malay language.
Munshi Dewan is a speaker officially selected and established by the Dewan Bahasa
dan Pustaka (DBP) to conduct courses, talks, and seminars of the Malay language in
public and private sectors. The role of Munsyi Dewan is to provide expert services
for reference and to help DBP in upholding the use of Malay language in public and
private sectors. In this study, Munsyi Dewan plays an important role for overviewing
the inference rule used in the parser.
1.2
Problem Statement
Reading and writing are basic skills in Malay Language for primary school children.
Mastering these skills are important in order to achieve greater achievements in the
coming years. A study was conducted by Muzaliha et al. (2012) to investigate
children with learning disabilities. The targeted sample consisted of 1010 children
4
with age range between 8 and 12 years old from 40 primary schools in the Kota
Bharu district. The children performance was measured via the Early Intervention
Class for Reading and Writing Screening Test conducted by the Ministry of
Education, Malaysia. The results indicated that a total of 4.8% of students had
learning and writing skills problem. They suggested that educators should identify
and treat these students at an earlier stage.
In an effort to present a tool for automatic Malay CFG application, it is
important to build the Inference Rule by following the CFG rule in Malay Text so
more precise results on the Malay grammar can be produced. However, there is still
no consensus standard in producing inference rule in Malay Text using NLP toolkits
(Zakree & Nazri, 2008). Source for information retrieval for the Malay Text
categorization using NLP is also limited.
Even though there are recent researcher overviews on Malay NLP,
information are still limited in the implementation of categorization the Malay text
using the NLP (Zakree & Nazri, 2008). To date, there is no developed system has
been designed for the processing of Malay language for sentence correction, but only
prototype is produced (Noor & Jamaludin, 2012). Furthermore, until now, parser
which processing Malay sentence for correction of wrong sentence has not yet been
established for any language, especially in Malaysia language (Noor & Jamaludin,
2012).
However, an effective parser is important in any NLP toolkits because their
performance depends on the quality of input fed into it. The contribution in this
study suggested to develop MyParser. In order to producing a better algorithm for
NLP toolkit programming language based on the inference rule used for Malay Text
Categorization.
5
1.3
Research Objectives
The objectives of this research are as follows:
i.
To design Inference Rule for Malay Text Categorization in the parser.
ii.
To develop a parser for Malay Text Categorization based on Inference
Rule found in (i) for Malay simple sentence.
iii.
1.4
To evaluate MyParser using several case studies.
Aim of the Study
The aim of this research is to concentrate on developing an application for Malay
Text categorization based on several sentence structures according to the Malay
Context Free Grammar (CFG).
1.5
Research Scope
This study focuses only on Malay Text Processing which is categorized into two
clauses which are subject and predicate. The phrase will categorize limit on noun
phrase, verb phrase, adjective phrase, and prepositional phrase. For data sampling,
the input sentences have been taken from primary school books and tagged manually
by Munsyi Dewan following the Malay language CFG. For data library limit, it has
been taken only from one source which is from Kamus Dewan Bahasa dan Pustaka
Bahasa Melayu 4th Edition. For the techniques limit, this research focus only on NLP
toolkit and its programming language.
6
1.6
Report Outline
This thesis is organized from general related knowledge to a deeper discussion,
where each chapter is the foundation of the next chapter. The rest of this thesis is
organized as follows; Chapter 1 is the introduction of the report, followed by Chapter
2 that provides the relevant background information on Malay sentences
categorization using MyParser. The discussion is then continued with the
complementary approaches for studying the Malay Text Processing Tools in
Categorization that leads to categorize the simple sentence. Afterwards, overview of
Malay Text Grammar Rule is presented. In addition, this chapter also provides a
general introduction to Malay Text Categorization using MyParser. Chapter 3
explains and illustrates research methodology used to carry out the categorization of
the Malay Text. This chapter also discusses on the method and framework that are
used in the pre-processing part that contribute to this research. Chapter 4 discusses
several software requirements that based on IEEE Std 830-1998 for SRS. This
specific requirement helps a lot in phase of development and implementation. It
helps in the description of the important document in an application and facilitating
two-way process between the developer documentation and the user. Chapter 5 is
about the implementation of MyParser based on the proposed framework. This
chapter shows how the implementation occurs and its user interface. Then, Chapter 6
is concerned with the results and discussion. Lastly, Chapter 7 is the conclusion of
this dissertation where it provides some conclusions, implementation of research
objectives, recommendations for future work and summary. All related documents
are presented in APPENDIX section.
CHAPTER 2
LITERATURE REVIEW
2.1
Introduction
This chapter will briefly describe the dominant topics of the research such as
Sentence Grammar Overview and Standard Malay Language. Then, the types of
Malay language grammar that consist of sentence grammar are discussed. Next, Rule
for basic Malay Sentence followed by Malay Text categorization, Pre-Processing
with the parsing technique is presented to give a clear grasp on the research field
area. Finally, this chapter will come out with an overview of related studies, relevant
to the research field included to foster the main issues that have to be addressed. The
issues are: document representation and categorization of the Malay language
sentence based on Malay Context Free Grammar (CFG) rule.
2.2
Sentence Grammar Overview
Grammar is an inner regularity and a simple knowledge representation of language
(Chomsky, 1966). It occurs from language and plays the most important role in the
implementation of the fundamental aims of linguistic analysis. Grammar plays two
roles which are to separate the grammatical sentence from the ungrammatical
sequences and study the structure of grammatical sentences.
The grammar of a language is also a device that generates all grammatical
sentences of the language and none of the ungrammatical ones. Parsing a sentence is
8
a difficult task. It is an initial step in understanding natural language, although
ambiguity is a serious problem that linguists face in building the algorithm for
natural language processing for sentence parser. An ambiguity problem occurs when
more than one parse tree are constructed. For example, the sentence “He saw the boy
with a telescope” can give two interpretations for readers. First reading: “He used the
telescope to see the boy”. Second reading: “He saw the boy who had a telescope”.
Both versions can also be interpreted by using tree structures. The first tree is shown
in Figure 2.1.
Figure 2.1: First Tree of “He saw the boy with a telescope” (Meyer et al., 2002)
Abbreviations: Sentence (S), Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase
(PP), Noun (N), Verb (V), Determiner (D), Preposition (P)
9
Figure 2.2: Second Tree of “He saw the boy with a telescope” (Meyer et al., 2002)
Abbreviations: Sentence (S), Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase
(PP), Noun (N), Verb (V), Determiner (D), Preposition (P)
Figure 2.3: First Tree of “Kami adang air itu” (Karim, 2004)
Abbreviations: Ayat (A), Subjek (S), Predikat (P), Frasa Nama (FN), Frasa Kerja (FK),
Kata Nama (KN), Kata Kerja Transitif (KKTr), Penentu (Pent)
10
Figure 2.4: Second Tree of “Kami adang air itu” (Karim, 2004)
Abbreviations: Ayat (A), Subjek (S), Predikat (P), Frasa Nama (FN), Frasa Kerja (FK),
Kata Nama (KN), Penentu (Pent)
Structural ambiguity means a phrase or sentence can have more than one
structure as shown in Figure 2.1 and Figure 2.2. Other than structural ambiguity,
there are also three other types of ambiguity that usually occur in natural language
such as English and Malay language, which are part of- speech ambiguity, semantic
ambiguity, and verbal ambiguity (Jurasky et al., 2000). In minimizing the structural
ambiguity, Myparser is one of the solutions that can be used to encounter the
problem using inference rules created by Malay Context Free Grammar (CFG) rule.
2.3
The Standard Malay Language
The Standard Malay Language or Bahasa Baku (the word Baku comes from a
Javanese word which means true and correct) that was made upon agreement
between Malaysia, Indonesia, and Brunei is Bahasa Riau. This implies that the
spelling, words, phrasing, grammar, pronunciation, punctuation, sentences,
abbreviations, acronyms, capital letters, numbering, and style of the language are
already standardized (Khalifa et al., 2007).
The Malay language has their own context free grammar where there is a
combination of subject and a predicate in a sentence (Juzaiddin et al., 2006). It
requires a set of grammar rules which is also known as context-free grammar (CFG)
11
in English or phrase structure rules (RSF) in Malay Language. Every sentence used
in a language is constructed according to the CFG, especially in Malay Language.
For this reason, there are lots of research that have been conducted in language
studies in producing a good sentence structure, especially in BM (Noor & Jamaludin,
2012). Sentence parser is one of the tools of technology that can be used in validating
a sentence to produce a good sentence structure. It is also known as a syntactic parser
by others researcher. It parses the sentence according to the CFG provided. Its
function is to validate the construction of words used in a sentence. If a sentence is
structured according to the rules of CFG, the parser will classify the sentence as true.
There are many studies conducted by Malay language researchers on
sentence parser as cited in Latif (1995), Ramli (2002), and Ahmad et al. (2007). The
studies could validate a sentence according to the CFG rules. Thus, this study is to
take up this challenge in producing an algorithm in the development of Malay Text
parser with categorization of sentence into correct structure of grammar rule. The
researcher studied on the Malay sentence similarity which is based on searching the
appropriate context within Malay sentence like pattern of words. They use the
context which was determined by seeking rules from a rule-based phrase database. In
implementing this approach (Rahman et al., 2011), working on prototype application
is described which can be used as a tool for improving writing text in Malay
language, especially well personalized toward the requirements of teaching and
learning this language in primary and secondary schools.
The challenges ongoing Malay Text based as the domain of research makes
this language are dominant to be used as sample when developing a parser. It is
different from English text where the rules of grammar phrase are simpler to
understand and NLP tool are based on English language.
12
2.4
The Types of Malay Language Grammar
There are three types of grammar in Malay language; sentence grammar, partial
discourse grammar, and the pola (pattern) sentence grammar.
2.4.1
Sentence Grammar
This type of grammar uses personal, dialectal (the total amount of a language that
any person knows and uses), artificial sounding, and independent sentences as a
guide in making syntactic Malay sentences. Ayat (sentence) grammar has two
models, namely the transformational-generative grammar and the relational grammar
(Nik Safiah Karim, 1975). The transformational-generative grammar is a grammar
that consist a series of phrase-structure rewrite rules. For example, a series rules that
generates the underlying phrase structure of a sentence; and a series of rules that act
upon the phrase structure to form a more complex sentence. The relational grammar
is a theory of descriptive grammar which stated the syntactic operations such as the
relationship between subject and object. These two models are inherited of CFG.
2.4.2
CFG in Malay Language
CFG for Malay language was formed by Nik Safiah Karim (1995). It became the
basis in developing probabilistic for Malay language grammar. The CFG in forming
a basic sentence of the Malay language is pictured in Figure 2.5.
13
Table 2.1: Description of Elements Used in Malay Grammar Rules
(Ahmad et al., 2007)
14
Figure 2.5: CFG for Malay Language (Karim, 1995)
15
2.4.3
Partial discourse grammar
A partial discourse is the grammar that picks out the sentences from discourse to
make linguistic statements about them. This type of grammar is different from
sentence grammar because it uses “language-first” approach in the writing of syntax
while sentence grammar uses “theory-first” approach. According to Simin (1988),
“language-first” approach represents a chance for Malay readers to read the latest
ideas in his own language about the genius of his language, while “theory-first”
approach is more likely to be used in order to make the chosen theory appear
workable.
Example of partial discourse grammar:
Aminah membaca buku. Dia juga mendengar radio.
Dia (She) is referring to Aminah
(Aminah is reading a book. She is also listening to the radio.)
2.4.4
‘Pola’ (Pattern) Grammar
“Pola” grammar is the pattern of grammar in the sentences. This type of grammar
was used by Azhar Simin (1988). Each “pola” is linked to class-name that forms or
helps to make a basic sentence. Each “pola” is a formula to make a basic sentence.
16
Example:
”Pola”: Pelaku + perbuatan
Pattern: Actor + verb
Sentence: Saya makan.
(I eat).
Karim (2004) represents the most theoretical work on pola grammar. It provides a
methodology for pola grammar writing. Below are the pola of grammar for Malay
language:
(i) Pelaku + Perbuatan (Actor + Verb)
(ii) Pelaku + Perbuatan + Pelengkap (Actor + Verb + Complement)
(iii) Perbuatan + Pelengkap (Verb + Complement)
(iv) Diterangkan + Menerangkan (Signified + Signify)
(v) Digolong + Penggolong (Classified + Classifier)
(vi) Pelengkap + Perbuatan + Pelaku (Complement + Verb + Actor)
(vii) Pelengkap + Perbuatan (Complement + Verb)
In forming a basic sentence in Malay language, the type of grammar that is
suitable to use is sentence grammar which provides rules. The rules are derived from
CFG that was mentioned in Figure 2.5.
17
2.5
Rules for Basic Malay Sentence
Mainly, to create rules for a sentence in Malay language, we should follow CFG for
Malay language by Nik Safiah Karim (1995) as shown in Figure 2.5. A basic
sentence in Malay language can be derived from these four basic patterns of rules:
1) A → FN + FN
(S → NP + NP)
2) A → FN + FK
(S → NP + VP)
3) A → FN + FA
(S → NP + AP)
4) A → FN + FS
(S → NP + PP)
Where,
A = Ayat (sentence), FN = Frasa Nama (Noun Phrase), FK = Frasa Kerja
(Verb Phrase), FA = Frasa Adjektif (Adjective Phrase), FS = Frasa Sendi
(Prepositional Phrase)
These basic rules will be used in the implementation of inference rule inside
MyParser which will categorize the sentence structured with their own phrase.
18
2.6
Malay Text Categorization
Malay Text Categorization means to categorize or classified the Malay Text into a
suitable label which can be applied in the Malay sentence. In this study, the goal of
text categorization is to classify some given new sentences into a fixed rule of
predefined categories or structures. Other researchers nowadays, especially in
Malaysia likes to explain the new prototypes of Malay Text processing on various
way and not focusing on commercializing it. As mention in previous chapter, to date,
there are no developed system that have been designed in processing Malay language
for sentence correction, just the prototype (Noor & Jamaludin, 2012).
In this study, the Malay Text Categorization is done based on the Inferences
Rule built from the CFG and implemented in MyParser. MyParser will categorize the
Malay simple sentence into its structure based on CFG and the output is labelled
using the bracket notation. The example will be explained in Chapter 5. Furthermore,
previous researcher explained Automated text categorization (TC) as a supervised
learning task, defined as assigning category labels (pre-defined) to new documents
based on the likelihood suggested by a training set of labelled documents (Yang &
Liu, 1999). The categories are just symbolic labels, and no additional knowledge (of
a procedural or declarative nature) of their meaning is available. No data provided
for classification purposes by an external source (exogenous data) knowledge is
available; therefore, classification must be accomplished on the basis of endogenous
knowledge only (i.e., knowledge extracted from the documents). Therefore, building
the parser for text categorization is a smart tool that can be used to differentiate or
categorized every word in the sentence with the phrase.
19
2.7
Pre-processing
Prior to any deeper linguistic treatment of a text, the unit of text must be demarcated
and possibly classified. It can initially be viewed as mere sequences of character
within which we must define these unit (Grefenstette & Tapanainen, 1994).
Pre-processing plays an important role in developing MyParser, which in preprocessing part, involves Natural Language Processing Toolkit (NLP toolkit). The
top down approach (Bill, 2002) is applied to the Malay language grammatical rules.
In addition, Malay words applicable to humans and animals are identified to achieve
a basic level of semantic parsing on top of syntactical parsing. The developed
prototype expects a user to input a sentence before parsing it for grammatical errors
check.
Next, the prototype should be able to display the graphical bracket notation
for a grammatical structure of the sentence so that the user would know the structure
arrangement of a correct sentence and develop the programming language for Malay
Text Categorization that make the application suit with Bahasa Melayu grammar. In
pre-processing part, NLP plays an important role because the method of producing
text categorization is implemented in it (Zakree et al., 2008). In understanding the
state of the art in Malay language technologies, they conducted three tests that are
available in NLP applications, in which two of the tools are Malay Sense Tagger
Prototype1 and a syntactic parser based on maximum entropy. Figure 2.6 shows the
pre-processing for conceptual clustering that similarly used to be implemented in text
categorization.
20
Figure 2.6 : Pre-processing for Conceptual Clustering (Zakree et al., 2008)
Many of the knowledge-representation and inference techniques that have
been applied fruitfully in knowledge-based systems were originally developed for
processing natural language. However, the language-processing applications
themselves have always seemed far from being realized. This special series on NLP
is an attempt to bring language processing and its applications into focus - to
demonstrate techniques that have recently been applied to real-world problems, to
identify research ripe for practical exploitation, and to illustrate some promising
combinations of NLP with other emerging technologies.
In NLP, there is a different between language dependent and independent. A
language dependent system would be a system geared at a specific language, or a set
of languages. It might perhaps utilize manually built lexical resources such as
ontologies, thesauri or other language or domain specific knowledge bases (Hassel &
Dalianis, 2011). Other dependencies constraining a system to a specific language
may be the employments of advanced tools, for example, full parsers, semantic role
assigners or named entity tagging, or the use of techniques such as template filling.
Language independent usually denotes a NLP system that is easily dealt in between
different languages or domains. The system is thereby independent of the target
language.
21
2.7.1
Parsing technique
Parsing a sentence is a process of breaking down a sentence into its component
words. It involves the use of linguistic knowledge of a language to discover the way
in which a sentence is structured. According to Charles (1994), there are two types of
parsing a sentence; top-down and bottom-up parse.
Top-down parse is a predictive process where the verification of a rule occurs
as the last step in the application of the rule. It begins with the start symbol at the top
of the parse tree and works downward. For example, replacing S with NP VP;
matching the left side of a rule to an appropriate symbol and then replacing that
symbol with the symbols on the right side of the rule.
In contrast, the bottom-up parse is a postdictive process, where a rule not
applied to the input required for its application is satisfied. It is referred to as bottomup parse because the construction of the tree begins from the terminal nodes of the
tree. For example, replacing NP VP with S; matching the right side of a rule to the
appropriate symbols and then replacing the matched symbols with the symbol on the
left side of the rule.
2.7.2
Error Analysis of Parsing Technique
According to Hendrickson (1979) who introduced the grid for the use of evaluation
of language performance (Figure 2.7), the grid format allows errors to be categorized
along two scales. The horizontal scale are categories by lexicon (vocabulary), syntax
(grammatical structure: word order, verb phrase and etc.), morphology (grammatical
agreement), and orthography (spelling errors). On the vertical scale, "global" refers
to errors that affect the organization of the entire sentence (for example missing
subjects or main verbs). "Local" errors affect only the constituent in which they
appear (such as a noun phrase or prepositional phrase). Problems area means to be
filled in with short descriptors of the errors. In spite of these differences,
Hendrickson’s grid provides a good framework on which to contrast an attempt to
predict the categories of errors which will be found by the syntactic parser and to
22
describe how that parser might be designed to provide meaningful error messages to
the user.
Figure 2.7: The Format for Errors Analysis (Hendrickson, 1979)
Then, the average recall Avg (2.1) and the weighted average recall WAvg
(2.2) of correctly parsed sentences were reported using the following equations:
(2.1)
Where k = 3 represents three person levels, A the number of test cases, and B the
total number of sentences correctly parsed. The formula was introduced by Abidin
(2007).
(2.2)
The weighted of average recall was calculated mainly because the number of total
sentence correctly parsed are different for each other.
23
2.8
Related Work
From the previous studies, the parser produced by Latif (1995) and Ahmad et.al.
(2007) was used in the study to produce output in the form of a syntax tree and
receiving input of sentence that can be seen as to have more in-depth studies
compared to Suzaimah’s parser. The Suzaimah’s parser has the limiting factor where
the input was only in Parlog code (Ramli, 2002). Besides, the resulting output
(Parlog clause) is hard to understand by some users.
2.8.1 Ahmad’s Malay Parser
The parser was developed by a group of researchers from Universiti Teknologi
Petronas led by Ahmad Izuddin Zainal Abidin (Ahmad et. al., 2007). This parser is a
type of syntactic parsing using a top-down parsing approach. The target of the parser
is to complete the existing word processing system by checking the grammar of a test
sentence. Another function of this parser is that it is able to illustrate a parse tree if
the sentence is grammatically correct. The research domain of the parser is Malay
language and focuses on basic Malay sentences. This parser also focused on the
semantic part, which is a basic level of semantic parsing on top of syntactical
parsing. In the semantic parsing, Malay words are divided into two categories:
humans and animals. Some examples of words for humans are ‘mengandung’
(pregnant), ‘memasak’ (cooking), and ‘berfikir’ (thinking) while examples of
animals are ‘meragut’(grazing), ‘mengawan’ (mating) and ‘bunting’(pregnant for
animals). The advantage of this parser is that it can handle the semantic ambiguity.
Case in point, the sentence ‘bapa meragut rumput' (a father is grazing grasses) failed
in parsing as the word ‘meragut’ is categorised under animals and not for humans.
This parser was evaluated by experts in Malay language, specifically school teachers.
24
Figure 2.8: The Architecture of Ahmad's Malay Parser
Figure 2.8 illustrates the architecture system for Ahmad’s Malay Parser.
When a user inputs a sentence, the checking engine will parse the test sentence using
the text parser component. In the text parser component, there are two important
parts which form the technical structure of the parser. These parts are grammar rules
and Malay Lexicon. The grammar rules are derived from Karim (1995) while the
Malay Lexicon contains three thousand words (3000) and arranged according to
word categories. The words were collected from Kamus Dwibahasa Oxford Fajar 2nd
Edition (Hawkins, 2001).
81
REFERENCES
Abidin, A. I., Yong, S. P., Kasbon, R., & Azman, H. (2007). Utilizing top-down
parsing technique in the development of a Malay language sentence
parser.Proceeding of 2nd International Conference on Informatics, 125–131.
Bill, W. (2002). Grammars and Parsing, Retrieved September 12, 2013, from
http://www.cse.unsw.edu.au/~billw/cs9414/notes/nlp/grampars.html
Blossom, A., Gebhard, D., Emelander, S. & Meyer, R. (2007). Software
Requirements Specification (SRS) Book E-Commerce System. Retrieved June
23, 2013, from http://www.cse.msu.edu/~chengb/RE-491/Papers/SRS-BECS2007.pdf
Charles F. S. (1994). Parsing Techniques: A practical Guide. State University of
New Jersey. Retrieved September 30, 2013, from http://wwwrci.rutgers.edu/~cfs/305_html/Understanding/Parsing.html
Cheng, B. H. C., (2007). “Intro to Specifications”. CSE 435, East Lansing, MI,
Computer Science and Engineering, Michigan State University
Chin, B. A., (2000). The Role of Grammar in Improving Student’s Writing. Retrieved
September 12, 2013, from http://www.sadlier-oxford.com/papers/chinpaper.html
Chomsky, N. (1966). Topics in the theory of generative grammar (Vol. 56). Walter de
Gruyter.
Dennis A., Wixom B., H., & Tegard D., (2005). System Analysis and Design with
UML Version 2.0. John Wiley & Sons Inc. ISBN 0-471-34806-6
Domeiji R., Knutsson O., & Eklundh K.S., (2002). Different Ways of Evaluating a
Swedish Grammar Checker, Proceedings of The Third International Conference
on Language Resources and Evaluation (LREC 2002). Las Palmas, Spain
Don, Z. M. (2010). Processing Natural Malay Texts: a Data-Driven Approach.
Trames. Journal of the Humanities and Social Sciences, 14(1), 90.
doi:10.3176/tr.2010.1.06
Grefenstette, G., & Tapanainen, P. (1994). What is a Word, what is a Sentence?:
Problems of Tokenisation (pp. 79-87). Rank Xerox Research Centre.
Hassel, M., & Dalianis, H. (2011). Portable Text Summarization. Applied Natural
Language Processing and Content Analysis: Advances in Identification,
Investigation and Resolution, 17.
82
Hawkins J. M. (1997). Kamus Dwibahasa Oxford Fajar Edisi Kedua. Penerbit Fajar
Bakti Sdn. Bhd.
Hendrickson, J. M. (1979). Evaluating Spontaneous Communication Through
Systematic Error Analysis. Foreign Language Annals. 12(5), 357-364.
IEEE Recommended Practice for Software Requirements Specifications. (1998).
Retrieved June 23, 2013, from http://www.cse.msu.edu/~cse870/IEEEXploreSRS-template.pdf
Jurafsky, D., & James, H. (2000). Speech and Language Processing. Retrieved June
23, 2013, from http://www.deepsky.com
Juzaiddin, M., Aziz, A., Ahmad, F., Azim, A., Ghani, A., & Mahmod, R. (2006).
Pola grammar technique for grammatical relation extraction in Malay
language.Malaysian Journal of Computer Science, 19(1), 59.
Karim N. S. & Arbak O. (2004). Kamus Komprehensif Bahasa Melayu. Cetakan
Kedua. Penerbit Fajar Bakti Sdn. Bhd.
Kassim M. K., Kaedah Pengajaran Murid Sederhana Dan Lemah Retrieved April 30,
2013, from http://khirkassim.blogspot.com/2009/02/kaedah-pengajaran-pelajarsederhana-dan.html
Khalifa, O. O., Ahmad, Z. H. & Gunawan, T. S. (2007). SMaTTS : Standard Malay
Text to Speech System, 285–293.
Latif, R. A. (1995). Penyemak Sintaksis Ayat Bahasa Malaysia. Universiti
Kebangsaan Malaysia, Bangi, Selangor.
Muhamad N. Y. & Jamaludin, Z. (2012). Malay declarative sentence: Visualization
and sentence correction, Open Systems (ICOS), 2012 IEEE Conference on ,
vol., no., pp.1,5, doi: 10.1109/ICOS.2012.6417659
Noor, M. N. & Jamaludin, Z. (2012). Parser with Sentence Correction for Malay
Language ( BM ), 45(Icikm), 138–142.
Noor, M.Y. & Jamaludin, Z., (2012) Malay declarative sentence: Visualization and
sentence correction, Open Systems (ICOS), 2012 IEEE Conference on , vol.,
no., pp.1,5, 21-24 Oct. 2012 doi: 10.1109/ICOS.2012.6417659
Perkins J. (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt
Publishing, Birmingham, UK. ISBN 978-1-849513-60-9
Pfleeger, S. L. (1998). Software Engineering Theory and Practice, International
Edition, USA: Prentice Hall.
83
Rahman, S. A., Omar, N. B., Mohamed, H., Juzaidin, M., & Aziz, A. (2011). A
Synonym Contextual-based Process for Handling Word Similarity in Malay
Sentence.
Ramli, S. (2002). Reka bentuk dan implementasi suatu penghurai bahasa Melayu
menggunakan sistem logik selari. Universiti Putra Malaysia, Selangor.
Simin, A. (1988). Discourse-Syntax of "YANG" in Malay (Bahasa Malaysia),
Dewan Bahasa dan Pustaka, Kuala Lumpur
Sukor, M. N., Jeon, K., Yusof, Z. A. & Yusuf, K. (2006). Kurikulum Bersepadu
Sekolah Rendah Bahasa Melayu Tahun 5. Dewan Bahasa dan Pustaka.
Surabhi, M.C. (2013). "Natural language processing future," Optical Imaging Sensor
and Security (ICOSS). International Conference on , vol., no., pp.1,3, 2-3 July
2013 doi: 10.1109/ICOISS.2013.6678407
Talita, P. & Yeo, A. W. (2010). Challenges in Building Domain Ontology For
Minority Languages, (ICCAIE), 574–578.
Yang, Y. & Liu, X. (1999). A re-examination of text categorization methods.
Proceedings of the 22nd annual international ACM SIGIR conference on
Research and development in information retrieval - SIGIR ’99, 42–49.
doi:10.1145/312624.312647
Zakree, M. & Nazri, A. (2008). An Exploratory Study on Malay Processing Tool for
Acquisition of Taxonomy Using FCA. doi:10.1109/ISDA.2008.254
Zakree, M., Nazri, A., Shamsudin, S. M. & Bakar, A. A. (2008). An Exploratory
Study of the Malay Text Processing Tools in Ontology Learning.
Download