
Volume 1
Number 2
Fall 2009
Internetworking
Indonesia
Journal
The Indonesian Journal of ICT and Internet Development
ISSN: 1942-9703
IIJ © 2009
www.InternetworkingIndonesia.org
Guest Editors’ Introduction: Special Issue on Data Mining
by Anto Satriyo Nugroho & Moch Arif Bijaksana
1
Enhanced SMART-TV: A Classifier with Vertical Data Structure
and Dimensional Projections Pruning
by Taufik Fuadi Abidin & William Perrizo
3
Using Business Intelligence Solutions for Achieving Organization’s
Strategy: Arab International University Case Study
by Mouhib Alnoukari
11
Feature Selection for Large Scale Data by Combining Class Association
Rule Mining and Information Gain: a Hybrid Approach
by Appavu alias Balamurugan, Pramala, Rajalakshmi & Rajaram
17
Detecting the Originality of Extended Research Articles Using Similarity
Techniques - A Proposal
by Shanmugasundaram Hariharan
25
Prediksi Masa Studi Sarjana dengan Artificial Neural Network
by Muhamad Hanief Meinanda, Metri Annisa, Narendi Muhandri &
Kadarsyah Suryadi
31
Adaptive Content-based Navigation Generating System:
Data Mining on Unorganized Multi Web Resources
by Diana Purwitasari, Yasuhisa Okazaki & Kenzi Watanabe
37
Fuzzy Decision Tree dengan Algoritme ID3 pada Data Diabetes
by F. Romansyah, I. S. Sitanggang & S. Nurdiati
45
Internetworking Indonesia Journal
The Indonesian Journal of ICT and Internet Development
ISSN: 1942-9703
Internetworking Indonesia is a semi-annual electronic journal devoted to the timely study of Information and Communication Technology (ICT) and Internet development in Indonesia. The journal seeks high-quality manuscripts on the challenges and opportunities presented by information technology and the Internet in Indonesia.
Journal mailing address: Internetworking Indonesia Journal, PO Box 397110 MIT Station, Cambridge, MA 02139, USA.
Co-Editors
Thomas Hardjono, PhD
(MIT Kerberos Consortium, MIT, USA)
Budi Rahardjo, PhD
(ITB, Indonesia)
Kuncoro Wastuwibowo, MSc
(PT. Telkom, Indonesia)
Editorial Advisory Board
Prof. Edy Tri Baskoro, PhD (ITB, Indonesia)
Mark Baugher, MA (Cisco Systems, USA)
Lakshminath Dondeti, PhD (Qualcomm, USA)
Prof. Svein Knapskog, PhD (NTNU, Norway)
Prof. Merlyna Lim, PhD (Arizona State University, USA)
Prof. Bambang Parmanto, PhD (University of Pittsburgh, USA)
Prof. Wishnu Prasetya, PhD (Utrecht University, The Netherlands)
Prof. Jennifer Seberry, PhD (University of Wollongong, Australia)
Prof. Willy Susilo, PhD (University of Wollongong, Australia)
Prof. David Taniar, PhD (Monash University, Australia)
Focus & Scope: The Internetworking Indonesia Journal aims to become the foremost publication for practitioners, teachers, researchers and policy makers to share their knowledge and experience in the design, development, implementation, and the management of ICT and the Internet in Indonesia.
Topics of Interest: The journal welcomes and strongly encourages submissions based on interdisciplinary approaches focusing on
ICT & Internet development and its related aspects in the Indonesian context. These include (but are not limited to) information
technology, communications technology, computer sciences, electrical engineering, and the broader social studies regarding ICT
and Internet development in Indonesia. A list of topics can be found on the journal website at www.InternetworkingIndonesia.org.
Open Access Publication Policy: The Internetworking Indonesia Journal provides open access to all of its content on the principle
that making research freely available to the public supports a greater global exchange of knowledge. This follows the philosophy of
the Open Journal Systems (see the Public Knowledge Project at pkp.sfu.ca). The journal will be published electronically and there
are no subscription fees. Such access is associated with increased readership and increased citation of an author’s work.
Manuscript Language: The Internetworking Indonesia Journal accepts and publishes papers in Bahasa Indonesia and English.
Manuscript Submission: Manuscripts should be submitted according to the IEEE Guide for authors, and will be refereed in the
standard way. Manuscripts should not exceed 7 pages in the IEEE 2-column format, in a Microsoft Word file format. Links to
the IEEE style files can be found at www.InternetworkingIndonesia.org. Manuscripts submitted to Internetworking Indonesia must
not have been previously published or committed to another publisher under a copyright transfer agreement, and must not be
under consideration by another journal. Authors of accepted papers are responsible for the Camera Ready Copy (CRC) in the IEEE
2-column format (delivered to the Editor in a Microsoft-Word file). Authors are advised that no revisions of the manuscript can be
made after acceptance by the Editor for publication. The benefits of this procedure are many, with speed and accuracy being the
most obvious. We look forward to working with your electronic submission, which will allow us to serve you more efficiently. Please
email manuscripts or inquiries to the editors at the following address: editor@InternetworkingIndonesia.org.
Submission Guidelines: Internetworking Indonesia accepts a variety of manuscripts in either English or Bahasa Indonesia. Please
review the descriptions below and identify the submission type best suited to your intended submission:
• Research Papers: Research papers report on results emanating from research projects, both theoretical and practical in nature.
• Short papers: Short research papers provide an introduction to new developments or advances regarding on-going work.
• Policy Viewpoints: Policy Viewpoints explore competing perspectives in the Indonesian policy debate that are informed by academic research.
• Teaching Innovation: Teaching Innovation papers explore creative uses of information technology tools and the Internet to improve learning and education in Indonesia.
• Book Reviews: A review of a book, or other book-length document, such as a government report or foundation report.
Guest Editors’ Introduction:
Special Issue on Data Mining
This IIJ special issue takes data mining as its main topic. Data mining is one of the interesting topics in computing that has lately received increasing attention, both from academics and from practitioners in various sectors. This is closely tied to the availability of large-scale data in many fields, which demands the development of computational techniques to extract significant information from that ocean of data. Large-scale data are found in many areas, including biotechnology, meteorology, academia, and finance. Moreover, cross-disciplinary research has nowadays become almost a necessity, because such complex problems cannot be solved without collaboration among multiple scientific disciplines. As a result, the data obtained are not only large in scale, but also increasingly complex and heterogeneous. It is these challenges that have made data mining one of the latest technology trends in computing. Data mining is expected to uncover the information hidden in that ocean of data so that it can be exploited in a wide range of applications, in medicine, marketing, and other fields.
In Indonesia, interest in data mining appears to be growing, as reflected in discussions in various online communities. Many universities have included data mining in their curricula, alongside courses whose content is closely related, such as statistics, pattern recognition, and machine learning. Data mining training has also been conducted at various institutions, for example in banking. The need for IT practitioners who master data mining technology is also increasingly evident in recruitment processes, for example the recruitment of banking staff in Indonesia. Another encouraging development is the appearance of data mining textbooks written in Indonesian, which makes it easier for students and researchers to study and do research in this field. These are the considerations behind the choice of data mining as the topic of this IIJ special issue.
From the papers received, a review process was carried out by experts, and in the end seven papers were selected for this edition. The first paper introduces the Enhanced SMART-TV (SMall Absolute diffeRence of ToTal Variation) algorithm, a variant of the nearest neighbor approach that offers an advantage in speed over the conventional nearest neighbor approach. The second paper takes up the topic of Business Intelligence for strategy design, with a case study at the Arab International University. The third paper examines feature selection techniques for application to large-scale data. The fourth paper is an application of data mining to textual information, namely detecting the originality of a paper with respect to previously published papers. In this way it can be determined whether a researcher has genuinely developed a research idea, or has merely repeated a previously presented idea, which is nothing other than a form of plagiarism. The fifth paper is the winner of the national data mining competition for undergraduate (S1) students held by Dikti at GeMasTIK II in 2009, on the theme of predicting the duration of undergraduate study from academic data. The sixth paper proposes a new technique for automatically building a content-based navigation generating system. The seventh paper presents an application of data mining in healthcare, namely the analysis of diabetes data. The data mining techniques discussed in these seven papers are quite diverse, with target applications in the academic and healthcare fields.
This special issue was published with the support of reviewers who are Indonesian data mining experts. We therefore express our gratitude and high appreciation to the colleagues who performed the reviews: Dr. Budi Santosa, Dr. Yudho Giri Sucahyo, Dr. G.A. Putri Saptawati, Prof. The Houw Liong, Dr. Yudhi Agusta, and Dr. Iko Pramudiono, and especially to Prof. David Taniar, who also encouraged the creation of this special issue. Our thanks go, of course, to the IIJ editor Dr. Thomas Hardjono, who has guided and supported this edition from beginning to end.
We hope that the papers published in this special issue will be of benefit to research and to the development of data mining applications in various sectors in Indonesia. May more and more papers related to data mining appear in the IIJ in the future.
The Guest Editors can be reached at the following email addresses. Dr. Anto Satriyo Nugroho is at asnugroho@inn.bppt.go.id, while Moch Arif Bijaksana is at mab@ittelkom.ac.id.
____________
This Special Issue of the IIJ focuses on data mining. Data
mining is an interesting topic within the broad field of
computation that has received increased attention recently, not
only from academics but also from practitioners in various
sectors. This is not unrelated to the availability of large scale
data within different areas, and which therefore has
necessitated the development of new computational
techniques to extract significant information from this ocean
of data. The availability of large scale data can be found in
various areas, including biotechnology, meteorology, the
academic fields and the financial sector. Furthermore, in
recent times multidisciplinary research has in reality become
unavoidable (and even mandatory) due to the fact that solving
many of these complex problems requires collaboration across
various disciplines of study. This has resulted in data not only
being large scale, but also more complex and heterogeneous. It
is these challenges that have made data mining a new
technological trend of interest within the field of computation.
The expectation is that data mining can be deployed to find
useful information buried within the vast ocean of data, which
in turn can be beneficial to diverse areas of application such as
medicine, marketing and others.
In Indonesia recently there has been an increase in interest
in data mining, as reflected by the discussion traffic on the
Internet. A number of universities and higher education
institutions have introduced the subject of data mining into
their curriculum as an addition to subjects whose contents are
already closely related to data mining, such as statistics,
pattern recognition and machine learning. Training in data
mining has also been provided by a number of institutions,
such as those in the banking sector. The increasing demand for
IT practitioners who are familiar with data mining
technologies is also evident from increasing recruitments
seeking these practitioners, such as in the area of banking.
Most pleasing also is the increasing availability of textbooks
on data mining written in Bahasa Indonesia, which allows a
broader range of students and researchers alike to focus on the
study of data mining and conduct research in this area. It is all
these factors that have provided the backdrop for this IIJ
Special Issue on Data Mining.
This Special Issue carries seven (7) papers, which have
been reviewed by experts in this field. The first paper
introduces the Enhanced SMART-TV algorithm (SMall
Absolute diffeRence of ToTal Variation), which is a variation
of the nearest neighbor approach. However, this algorithm
variation provides an increase in speed compared to the
conventional nearest neighbor approach. The second paper
discusses the role of business intelligence in achieving the
business strategy of an organization, using the Arab
International University as its case-study. The third paper
focuses on a feature selection technique for large scale data.
The fourth paper uses data mining on textual information
for the purpose of detecting the originality of the contents of a
new paper as compared to existing published papers. The aim
here is to determine whether a researcher has truly expanded a
given research idea, or if he or she has simply repeated the
same idea using a different presentation, something considered
by many to be equivalent to plagiarism. The fifth paper is the
winner of the Indonesian national data mining competition for
S1 (undergraduate) students held by the DIKTI organization
(Directorate General for Higher Education) at the GeMasTIK
II event in 2009, which is a student event focusing on
information and communications technologies. The theme of
this fifth paper is the predictive techniques that can be used to
predict a student’s performance based on their previous
academic data. The sixth paper is on a navigation-generating
system based on adaptive content, while the seventh paper
discusses the application of data mining within the field of
healthcare, for the purpose of analyzing diabetes related data.
All in all the data mining techniques discussed within these
seven papers cover a broad range, with the target area of
application primarily being those of the academic field and
healthcare.
This Special Issue on Data Mining received tremendous
support from reviewers who are experts in the field of data
mining in Indonesia. We would like to thank and indicate our
deepest appreciation for these colleagues who did the reviews:
Dr. Budi Santosa, Dr. Yudho Giri Sucahyo, Dr. G.A. Putri
Saptawati, Prof. The Houw Liong, Dr. Yudhi Agusta, and Dr.
Iko Pramudiono. Special appreciation goes to Prof. David
Taniar, who came up with the idea of an IIJ Special Issue and
strongly supported its creation. We also thank the IIJ Editor
Dr. Thomas Hardjono who has provided editorial guidance
and support from the beginning until the completion of this
Special Issue.
We hope the papers appearing in this IIJ Special Issue on
Data Mining will be of benefit to research and development on
data mining applications within the various sectors in
Indonesia. We hope other papers related to data mining can
appear in the IIJ in the future.
Anto Satriyo Nugroho
Moch Arif Bijaksana
Dr. Anto Satriyo Nugroho is a researcher working
for the Agency for the Assessment & Application of
Technology (BPPT), Indonesia. He received his
B.Eng degree in 1995, M.Eng in 2000 and Dr.Eng
degree in 2003, all in Electrical & Computer
Engineering from Nagoya Institute of Technology,
Japan. From 2003 to 2007 he worked for Chukyo
University as a Visiting Professor in the School of
Cognitive and Computer Science (2003-2004) and
in the School of Life System Science & Technology
(2004-2007). He received the First Prize award of
the ‘99 Fog Forecasting Contest conducted by the
Neurocomputing Technical Group of The Institute
of Electronics, Information and Communication
Engineers (IEICE), Japan, and also the best paper
award in the Konferensi Nasional Sistem &
Informatika 2008 at Inna Sindhu Beach Hotel, Sanur
Bali. His research interests are in the areas of pattern
recognition, computational biology and data mining.
Since 2007 he has been serving as the vice president
of the Indonesian Society for Softcomputing. Dr.
Nugroho is a member of the Institute of Electrical &
Electronics Engineers (IEEE).
Moch Arif Bijaksana is a lecturer working for
Institut Teknologi Telkom (ITTelkom), Indonesia.
He received his B.Eng degree in 1991 from Gadjah
Mada University and M.Tech in 1998 from RMIT
University. He is an active member of the Indonesian
Data Mining Community. His main research
interest is text mining.
Enhanced SMART-TV: A Classifier with Vertical
Data Structure and Dimensional Projections Pruning
Taufik Fuadi Abidin
Data Mining and IR Research Group
Department of Mathematics, Syiah Kuala University
Banda Aceh, NAD 23111 Indonesia
taufik.abidin@unsyiah.ac.id

William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105 USA
william.perrizo@ndsu.edu

Manuscript received September 30, 2009. T. Abidin is a faculty member of Syiah Kuala University and the founder of the Data Mining and Information Retrieval Research Group at the Department of Mathematics, College of Science (e-mail: taufik.abidin@unsyiah.ac.id). W. Perrizo is a Professor of Computer Science at North Dakota State University, Fargo, ND 58105, USA (e-mail: william.perrizo@ndsu.edu).
Abstract—Classification is the process of assigning class membership to newly encountered, unlabeled objects based on some notion of closeness between the objects in the training set and the new objects. In this work, we introduce an enhancement of the SMART-TV (SMall Absolute diffeRence of ToTal Variation) classifier: a dimensional projections process that prunes the neighbors in the candidate set that are far away from the unclassified object in terms of distance but close in terms of total variation. SMART-TV is a scalable nearest-neighbor classifier. It uses a vertical data structure and approximates a set of potential candidate neighbors by means of vertical total variation. The total variation of a set of objects about a given object is computed using an efficient and scalable Vertical Set Squared Distance algorithm. In our previous work on SMART-TV, the dimensional projections were not introduced; thus, candidates that are far away from the unclassified object but close in terms of total variation also voted, which in some cases can reduce the accuracy of the algorithm. Our empirical results using both real and synthetic
datasets show that the proposed dimensional projections
effectively prune the superfluous neighbors that are far away
from the unclassified object. In terms of speed and scalability, the
enhanced SMART-TV is very fast and comparable to the other
vertical nearest neighbor algorithms, and in terms of accuracy,
the enhanced SMART-TV with dimensional projections pruning
is very comparable to that of the KNN classifier.
Index Terms—Data Mining, Classification, Vertical Data
Structure, Vertical Set Squared Distance.
I. INTRODUCTION
Classification on massive datasets has become one of the
most important research challenges in data mining.
Classification involves predicting the class label of newly
encountered objects using the feature attributes of a set of pre-classified objects. The main goal of classification is to determine the data membership of the unclassified objects and to
discover the group of the new objects.
K-nearest neighbor (KNN) classifier is one of the methods
used for classification [5]. The KNN classifier does not build a model in advance like decision tree induction, Neural Networks, and Support Vector Machines; instead, it invests all the effort when new objects arrive. The class assignment is made locally based on the features of the new objects. The KNN classifier searches for the most similar objects in the training set and assigns class labels to the new instances based on the plurality class of the k nearest neighbors. The notion of closeness between the training objects and the new object is determined using a distance measure, e.g. the Euclidean distance.
KNN has shown good performance on various datasets.
However, when the training set is very large, i.e. millions of
objects, it will not scale. The brute force searches to find the
k-nearest neighbors will increase the classification time
significantly.
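For background, a minimal brute-force KNN classifier along these lines might look as follows in Python (an illustrative sketch with hypothetical names, not the implementation evaluated in this paper); note the O(n) distance scan per query that motivates this work:

import numpy as np

def knn_classify(X_train, y_train, query, k=3):
    # Euclidean distance from the query to every training object: an O(n)
    # scan per query, which is what fails to scale on millions of objects.
    dist = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dist)[:k]             # indices of the k nearest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # plurality vote

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [7.5, 8.5]])
y = np.array([0, 0, 1, 1])
print(knn_classify(X, y, np.array([1.1, 0.9])))  # -> 0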
Several new approaches have been developed to accelerate the classification process by improving the KNN search, such as [9]-[16], to mention just a few. Grother [9] accelerated the nearest neighbor search using a kd-tree method, while Vries [16] improved the efficiency of KNN search in high-dimensional spaces by treating it as a physical database design problem.
Most of the state-of-the-art classification algorithms represent the data in the traditional horizontal, row-based structure. This structure, combined with a sequential scan-based approach, is known to scale poorly to very large datasets. Gray [8], in his talk at the SIGMOD 2004 conference, emphasized that vertical, column-based structuring, as opposed to horizontal, row-based structuring, is a good alternative to speed up query processing and ensure scalability.
In this work, we introduce an enhancement of the SMART-TV classifier with dimensional projections to prune the neighbors in the candidate set that are far away in distance from the unclassified object, even though their absolute difference of total variation can be small. SMART-TV is an efficient and scalable nearest-neighbor classifier that approximates a set of candidate neighbors by means of vertical total variation [1]. The total variation of a set of objects about a given object is computed using an efficient and scalable Vertical Set Squared Distance algorithm [2].
Fig. 1. The construction of a P-tree: an attribute A1 with values 4, 2, 2, 7, 5, 1, 6, 3 is converted to 3-bit binary (100, 010, 010, 111, 101, 001, 110, 011) and vertically partitioned into the bit slices P11, P12, and P13; each slice is then compressed into a 1-dimensional P-tree with levels L0 to L3 by recursively halving and recording the "purely 1-bits" predicate.
We use a vertical data structure that has been experimentally proven to address the curse of scalability and to facilitate efficient data mining over large datasets. We demonstrate the scalability of the algorithm empirically through several experiments.
II. P-TREE VERTICAL DATA STRUCTURE
Vertical data representation arranges and processes the data differently from horizontal data representation. In vertical data representation, the data is structured column-by-column and processed horizontally through logical operations like AND or OR, while in horizontal data representation, the data is structured row-by-row and processed vertically through scanning or by using indexes.
The P-tree vertical data structure is one choice of vertical data representation [13]. A P-tree is a lossless, compressed, and data-mining-ready data structure. A P-tree is lossless because the vertical bit-wise partitioning guarantees that the original data values can be retained completely. A P-tree is compressed because segments of bit sequences that are either pure-1 or pure-0 can be represented in a single bit. A P-tree is data-mining ready because it addresses the curse of cardinality, or the curse of scalability, one of the major issues in data mining. The P-tree vertical data structure has been intensively exploited in various domains and data mining algorithms, ranging from classification [1], [2], [11] and clustering [12] to association rule mining [14] and outlier analysis [15]. In September 2005, the P-tree technology was patented in the United States by North Dakota State University (US patent number: 6,941,303).
The construction of the P-tree vertical data structure begins by converting the dataset, normally arranged in a relation of horizontal records, into binary. Each attribute in the relation is vertically partitioned, and for each bit position in the attribute, vertical bit sequences (containing 0s and 1s) are subsequently created. During partitioning, the relative order of the data is retained to ensure convertibility. In 0-dimensional P-trees, the vertical bit sequences are left uncompressed and are not constructed into predicate trees. The size of the 0-dimensional P-trees is equal to the cardinality of the dataset. In 1-dimensional compressed P-trees, the vertical bit sequences are constructed into predicate trees. In this compressed form, AND operations can be accelerated. The 1-dimensional P-trees are constructed by recursively halving the vertical bit sequences and recording the truth of the "purely 1-bits" predicate in each half. A predicate 1 indicates that the bits in that half are all 1s, and a predicate 0 indicates otherwise.
The complete set of (uncompressed) vertical bit slices (for numeric attributes) or bitmaps (for categorical attributes) will be called the 0-dimensional P-trees for the given relation of horizontal records. A 1-dimensional P-tree set involves a binary tree compression of those 0-dimensional P-trees, as shown in Fig. 1. A 2-dimensional, 3-dimensional, or multi-dimensional P-tree set can be created as well, depending upon the nature of the dataset. In this work, we used the uncompressed P-tree set.
In the P-tree data structure, range queries, values, or any other patterns can be obtained using a combination of the Boolean algebra operators AND, OR, and NOT. Besides these operators, the COUNT operation is also very important in this structure; it counts the number of 1-bits in a basic or complement P-tree. Ding [7] gives a complete description of the P-tree algebra and its operations.
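As a minimal illustration of these ideas, the following Python sketch builds the uncompressed bit slices (0-dimensional P-trees) for the Fig. 1 example, compresses one slice with the "purely 1-bits" predicate, and evaluates a COUNT over an AND of slices. All function names are illustrative, not the P-tree API:

def bit_slices(values, b=3):
    # 0-dimensional P-trees: one uncompressed bit column per bit position,
    # most significant bit first (P11, P12, P13 in Fig. 1).
    return [[(v >> j) & 1 for v in values] for j in range(b - 1, -1, -1)]

def pure1_tree(bits):
    # 1-dimensional P-tree: recursively halve the bit sequence and record
    # the "purely 1-bits" predicate (1 if every bit in the half is 1,
    # otherwise 0 together with the two child halves).
    if all(bits):
        return 1
    if len(bits) == 1:
        return 0
    mid = len(bits) // 2
    return (0, pure1_tree(bits[:mid]), pure1_tree(bits[mid:]))

def count_and(slice_a, slice_b):
    # The COUNT aggregate over the AND of two bit slices.
    return sum(x & y for x, y in zip(slice_a, slice_b))

A1 = [4, 2, 2, 7, 5, 1, 6, 3]
P11, P12, P13 = bit_slices(A1)
print(P11)                  # [1, 0, 0, 1, 1, 0, 1, 0]
print(pure1_tree(P11))      # compressed predicate tree for P11
print(count_and(P11, P12))  # COUNT(P11 AND P12) = 2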
III. VERTICAL SET SQUARED DISTANCE
Let $R(A_1, A_2, \ldots, A_d)$ be a relation in $d$-dimensional space. A numerical value $v$ of attribute $A_i$ can be written in a $b$-bit binary representation as

$v = v_{i(b-1)} \cdots v_{i0} = \sum_{j=b-1}^{0} 2^j \cdot v_{ij}$   (1)

where each $v_{ij}$ is either 0 or 1. Now, let $x$ be a vector in $d$-dimensional space. The binary representation of $x$ in $b$ bits can be written as

$x = (x_{1(b-1)} \cdots x_{10},\ x_{2(b-1)} \cdots x_{20},\ \ldots,\ x_{d(b-1)} \cdots x_{d0})$   (2)

Let $X$ be a set of vectors in $R$, or let $X$ be the relation $R$ itself, let $x \in X$, and let $a$ be the input vector. The total variation of $X$ about $a$, denoted $TV(X, a)$, measures the cumulative squared separation of the vectors in $X$ about $a$, which can be computed quickly and scalably using the Vertical Set Squared Distance (VSSD) formula, derived as follows:

$TV(X, a) = (X - a) \circ (X - a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$
$= \sum_{x \in X} \sum_{i=1}^{d} x_i^2 - 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i + \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = T_1 - 2 T_2 + T_3$   (3)

For the first term,

$T_1 = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 = \sum_{x \in X} \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} 2^j \cdot x_{ij} \right)^2 = \sum_{i=1}^{d} \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \sum_{x \in X} x_{ij} \cdot x_{ik}$   (4)

Let $P_X$ be a P-tree mask of the set $X$ that can quickly identify the data points in $X$. $P_X$ is a bit pattern containing 1s and 0s, where bit 1 indicates that the point at that bit position belongs to $X$, while 0 indicates otherwise. Using this mask, (4) can be simplified by substituting

$\sum_{x \in X} x_{ij} \cdot x_{ik}$ with $COUNT(P_X \wedge P_{ij} \wedge P_{ik})$   (5)

where COUNT is the aggregate operation that counts the 1-bits in a given pattern, and $P_{ij}$ is the P-tree at the $i$-th dimension and $j$-th bit position. Thus, $T_1$ can be simplified to

$T_1 = \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(P_X \wedge P_{ij} \wedge P_{ik}) \right)$   (6)

Similarly, for the terms $T_2$ and $T_3$ we get

$T_2 = \sum_{x \in X} \sum_{i=1}^{d} x_i a_i = \sum_{i=1}^{d} a_i \left( \sum_{j=b-1}^{0} 2^j \sum_{x \in X} x_{ij} \right) = \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \cdot COUNT(P_X \wedge P_{ij})$   (7)

$T_3 = \sum_{x \in X} \sum_{i=1}^{d} a_i^2 = COUNT(P_X) \cdot \sum_{i=1}^{d} a_i^2$   (8)

Therefore,

$TV(X, a) = \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(P_X \wedge P_{ij} \wedge P_{ik}) \right) - 2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \cdot COUNT(P_X \wedge P_{ij}) + COUNT(P_X) \cdot \sum_{i=1}^{d} a_i^2$   (9)

Let us consider $X$ to be the relation $R$ itself. In this case, the mask $P_X$ can be removed, since all objects now belong to $R$. The formula can then be rewritten as

$TV(X, a) = \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(P_{ij} \wedge P_{ik}) \right) - 2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \cdot COUNT(P_{ij}) + |X| \sum_{i=1}^{d} a_i^2$   (10)

where $|X|$ is the cardinality of $R$.

Note that the COUNT operations are independent of $a$ (the input value). For a classification problem where $X$ is the training set (or the relation $R$ itself), this independence allows us to pre-compute the count values in advance, retain them, and use them repeatedly when computing total variations. The reusability of the count values expedites the computation of total variation significantly.
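To make the count-based formulation concrete, here is a minimal Python sketch (NumPy assumed) that computes TV(X, a) from pre-computed counts, following eq. (10). Uncompressed bit columns stand in for P-trees, and the names bit_slices, precompute_counts, and tv are illustrative rather than the authors' API:

import numpy as np

def bit_slices(data, b):
    # Vertically decompose an (n x d) non-negative integer array into bit
    # columns; slices[i][j] is bit j of attribute i, MSB first (an
    # uncompressed 0-dimensional P-tree).
    n, d = data.shape
    return [[(data[:, i] >> j) & 1 for j in range(b - 1, -1, -1)]
            for i in range(d)]

def precompute_counts(slices, b):
    # Input-independent counts of eqs. (6)-(7): pair[i][j][k] plays the role
    # of COUNT(Pij AND Pik) and ones[i][j] of COUNT(Pij).
    d = len(slices)
    pair = [[[int((slices[i][j] & slices[i][k]).sum()) for k in range(b)]
             for j in range(b)] for i in range(d)]
    ones = [[int(slices[i][j].sum()) for j in range(b)] for i in range(d)]
    return pair, ones

def tv(a, pair, ones, n, b):
    # Total variation TV(X, a) of eq. (10), from pre-computed counts only.
    d = len(pair)
    t1 = sum(2 ** ((b - 1 - j) + (b - 1 - k)) * pair[i][j][k]
             for i in range(d) for j in range(b) for k in range(b))
    t2 = sum(a[i] * 2 ** (b - 1 - j) * ones[i][j]
             for i in range(d) for j in range(b))
    t3 = n * sum(ai * ai for ai in a)
    return t1 - 2 * t2 + t3

# Sanity check against the horizontal definition sum_x ||x - a||^2.
X = np.array([[4, 1], [2, 5], [7, 3], [6, 2]])
pair, ones = precompute_counts(bit_slices(X, 3), 3)
a = [3, 2]
assert tv(a, pair, ones, len(X), 3) == int(((X - np.array(a)) ** 2).sum())

As the derivation notes, the counts are independent of a, so they are computed once per training set and reused for every query.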
IV. ENHANCED SMART-TV WITH DIMENSIONAL PROJECTIONS PRUNING
The SMART-TV classifier finds the candidate neighbors by forming a total variation contour around the total variation of the unclassified object [1]. The objects within the contour are then considered the superset of nearest neighbors. These objects are identified efficiently using the P-tree range query algorithm, without the need to scan the total variation values. The proposed algorithm further prunes the neighbor set by means of dimensional projections. After pruning, the k-nearest neighbors are searched for within the set, and they then vote to determine the class label of the unclassified object.
The algorithm consists of two phases: preprocessing and classifying. In the preprocessing phase, all processes are conducted only once, while in the classifying phase, the processes are repeated for every new unclassified object.
Let $R(A_1, \ldots, A_n, C)$ be a training space and let $X(A_1, \ldots, A_n) = R[A_1, \ldots, A_n]$ be the feature subspace. $TV(X, a)$ is the total variation of $X$ about $a$, and let $f(a) = TV(X, a) - TV(X, \mu)$. Recall that

$TV(X, a) = \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(P_{ij} \wedge P_{ik}) \right) - 2 \sum_{i=1}^{d} a_i \sum_{j=b-1}^{0} 2^j \cdot COUNT(P_{ij}) + |X| \sum_{i=1}^{d} a_i^2$
$= \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(P_{ij} \wedge P_{ik}) \right) - 2 |X| \sum_{i=1}^{d} a_i \mu_i + |X| \sum_{i=1}^{d} a_i^2$
$= \sum_{i=1}^{d} \left( \sum_{j=b-1}^{0} \sum_{k=b-1}^{0} 2^{j+k} \cdot COUNT(P_{ij} \wedge P_{ik}) \right) + |X| \left( -2 \sum_{i=1}^{d} a_i \mu_i + \sum_{i=1}^{d} a_i^2 \right)$

Since $f(a) = TV(X, a) - TV(X, \mu)$ and the first (COUNT) term does not depend on the argument, we get

$f(a) = |X| \left( -2 \sum_{i=1}^{d} a_i \mu_i + \sum_{i=1}^{d} a_i^2 \right) - |X| \left( -2 \sum_{i=1}^{d} \mu_i \mu_i + \sum_{i=1}^{d} \mu_i^2 \right)$
$= |X| \left( -2 \sum_{i=1}^{d} (a_i \mu_i - \mu_i \mu_i) + \sum_{i=1}^{d} (a_i^2 - \mu_i^2) \right)$
$= |X| \left( \sum_{i=1}^{d} a_i^2 - 2 \sum_{i=1}^{d} \mu_i a_i + \sum_{i=1}^{d} \mu_i^2 \right) = |X| \, \|a - \mu\|^2$   (11)

In the preprocessing phase, the function $f(a)$ is applied to each training object, and derived P-trees of those functional values are created to provide a fast and efficient way to determine the candidate neighbors. Since in large training sets the values of $f(a)$ can be very large, and representing these large values in binary would require an unnecessarily large number of bits, we let $g(a) = \ln(f(a)) = \ln|X| + \ln\|a - \mu\|^2$ to reduce the bit width.

By taking the first derivative of $g(a)$ fixed at the $d$-th dimension, we find that the gradient of $g$ at $a$ is $\frac{\partial g}{\partial a_d}(a) = \frac{2}{\|a - \mu\|^2} (a - \mu)_d$. The gradient is zero if and only if $a = \mu$, and the gradient length depends only on the length of the vector $(a - \mu)$. This indicates that the isobars are hyper-circles. Note also that, to avoid a singularity when $a = \mu$, we add a constant 1 to the function $f(a)$, such that $g(a) = \ln(f(a) + 1)$. The shape of the graph is a funnel, as shown in Fig. 2.

Fig. 2. The shape of the graph g(a).

A. Preprocessing Phase
In the preprocessing phase, we compute $g(x)$, $\forall x \in X$, and create derived P-trees of the functional values $g(x)$. The derived P-trees are stored together with the P-trees of the dataset. The complexity of this computation is $O(n)$, since the computation is applied to all objects in $X$. Furthermore, because the vector mean $\mu = (\mu_1, \mu_2, \ldots, \mu_d)$ is used in the function, this vector must be computed first.
We compute the vector mean efficiently using the P-tree vertical structure. The aggregate function COUNT is used to count the number of 1s in each vertical bit pattern, so that the sum of 1s in each bit pattern is acquired first. The following formula shows how to compute the element of the vector mean at the $i$-th dimension vertically:

$\mu_i = \frac{1}{|X|} \sum_{x \in X} x_i = \frac{1}{|X|} \sum_{x \in X} \sum_{j=b-1}^{0} 2^j \cdot x_{ij} = \frac{1}{|X|} \sum_{j=b-1}^{0} 2^j \cdot COUNT(P_{ij})$   (12)

The complexity of computing the vector mean is $O(d \cdot b)$, where $d$ is the number of dimensions and $b$ is the bit width.
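As an illustration of the preprocessing phase, the following sketch (same assumed NumPy representation as above; all names are hypothetical) computes the vector mean from bit counts per eq. (12) and the derived values g(x) = ln(f(x) + 1) per eq. (11):

import numpy as np

def vector_mean(X, b):
    # Eq. (12): mu_i = (1/|X|) * sum_j 2^j * COUNT(Pij), computed from the
    # per-bit 1-counts of each attribute column.
    n, d = X.shape
    return np.array([sum(2 ** j * int(((X[:, i] >> j) & 1).sum())
                         for j in range(b)) / n for i in range(d)])

def g_values(X, mu):
    # Derived values g(x) = ln(f(x) + 1), with f(x) = |X| * ||x - mu||^2
    # from eq. (11); in SMART-TV these are bit-sliced into derived P-trees.
    f = len(X) * ((X - mu) ** 2).sum(axis=1)
    return np.log(f + 1.0)

X = np.array([[4, 1], [2, 5], [7, 3], [6, 2]])
mu = vector_mean(X, b=3)
assert np.allclose(mu, X.mean(axis=0))   # vertical mean equals row mean
g = g_values(X, mu)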
B. Classifying Phase
The following processes are performed for each unclassified object $a$. For simplicity, consider the objects in 2-dimensional space.
1. Determine the vector $(a - \mu)$, where $a$ is the new object and $\mu$ is the vector mean of the feature space $X$.
2. Given an epsilon of the contour ($\varepsilon > 0$), determine two vectors located on the lower and upper sides of $a$ by moving $\varepsilon$ units inward toward $\mu$ along the vector $(a - \mu)$ and moving $\varepsilon$ units outward away from $a$. Let $b$ and $c$ be the two vectors on the lower and upper sides of $a$, respectively; then $b$ and $c$ can be determined using the following equations:

$b = \mu + \left( 1 - \frac{\varepsilon}{\|a - \mu\|} \right) \cdot (a - \mu)$   (13)

$c = \mu + \left( 1 + \frac{\varepsilon}{\|a - \mu\|} \right) \cdot (a - \mu)$   (14)
3. Calculate g(b) and g(c) such that g(b)≤ g(a)≤ g(c) and
determine the interval [g(b), g(c)] that creates a contour
over the functional line. The contour mask of the interval
is created efficiently using the P-tree range query
algorithm without having to scan the functional values one
by one. The mask is a bit pattern containing 1s and 0s,
where bit 1 indicates that the object is in the contour while
0 indicates otherwise. The objects within the pre-image of
the contour in the original feature space are considered as
the superset of neighbors (ε neighborhood of a or
Nbrhd(a, ε)).
Fig. 3. The pre-image of the contour of the interval [g(b), g(c)] creates Nbrhd(a, ε).
4. Since the total variation contour is annular around the
mean, the candidate set may contain neighbors that are
actually far from the unclassified object a, e.g. located
within the contour but in the opposite side of a. Therefore,
an efficient pruning technique is required to eliminate the
superfluous neighbors in the candidate set. We prune those
neighbors using dimensional projections. For each
dimension, a dimensional projection contour is created
around the vector element of the unclassified object in that
dimension. The size of the contour is specified by moving
the ε unit away from the element of vector a in that
dimension on both sides. It is important to note that, because the training set was converted into the P-tree vertical structure, no additional derived P-trees are needed to perform the projection. The P-trees of each dimension can be used directly at no extra cost. Furthermore, a
parameter ms (manageable size of candidate set) is needed
for the pruning. This parameter specifies the upper bound
of neighbors in the candidate set such that if the
manageable size of neighbors is reached, the pruning
process will be terminated, and the number of neighbors in
the candidate set is considered small enough to be scanned
to search for the k-nearest neighbors. The pruning
7
technique consists of two major steps. First, it obtains the
count of neighbors in the pre-image of the total variation
contour (candidate set) relative to a particular dimension.
The rationale is to maintain the neighbors that are
predominant (close to the unclassified object) in most
dimensions so that, when the Euclidean distance is
calculated (step 5 of the classifying phase), they are the
true closest neighbors. The process of obtaining the count
starts from the first dimension. The dimensional projection
contour around the unclassified object a is formed, and the
contour mask is created. Again, the mask is simply a bit
pattern containing 1s and 0s, where bit 1 indicates that the
object belongs to the candidate set when projected on that
dimension while 0 indicates otherwise. The contour mask
is then AND-ed with the mask of the pre-image of the total
variation contour, and the total number of 1s is counted.
Note that no neighbors are pruned at this point, only the
count of 1s is obtained. The process continues for all
dimensions, and at the end of the process, the counts are
sorted in descending order. The second step of the pruning
is to intersect each dimensional projection contour with
the candidate set. The intersection starts based on the order
of the count. The dimension with the highest count is
intersected first, followed by the dimension with the
second-highest count, and so forth. In each intersection, the
number of neighbors in the candidate set is updated. From
the implementation perspective, this intersection is simply
a logical AND operation between the mask of the total
variation contour and the mask of the dimensional
projection contour. The second step of the pruning
technique continues until a manageable size of neighbors
is reached or all dimensional projection contours have
been intersected with the candidate set. Fig. 4 illustrates
the intersection between the pre-image of the total
variation contour and the dimensional projection contours.
Fig. 4. Pruning the contour using dimensional projections.
5. Find the k-nearest neighbors from the remaining neighbors by calculating the Euclidean distance, $\forall x \in$ Nbrhd$(a, \varepsilon)$:

$d_2(x, a) = \sqrt{\sum_{i=1}^{d} (x_i - a_i)^2}$
6. Let the k nearest neighbors vote using a weighted vote $w = e^{-d \cdot d}$ to influence the vote of the k nearest neighbors. The weighting function is a Gaussian function that gives a smooth drop-off based on the distance of the neighbors to the unclassified object, as shown in Fig. 5: the closer the neighbor, the higher the weight, and vice versa. (A sketch of the whole classifying phase follows Fig. 5.)

Fig. 5. Weighting function w = e^(−d·d).
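To tie steps 1-6 together, here is a minimal Python sketch of the classifying phase for one unclassified object, assuming dense NumPy arrays in place of P-trees and reusing the outputs of the preprocessing sketch above (the g values and the mean µ). Boolean masks stand in for the P-tree range-query and AND operations, and all names and parameter defaults (classify, eps, ms) are illustrative, not the authors' implementation:

import numpy as np

def classify(a, X, y, g_train, mu, eps=0.1, ms=200, k=3):
    n, d = X.shape
    # Steps 1-2: contour endpoints b and c along (a - mu), eqs. (13)-(14).
    r = np.linalg.norm(a - mu)
    b = mu + (1 - eps / r) * (a - mu)
    c = mu + (1 + eps / r) * (a - mu)
    # Step 3: candidate mask = training objects whose derived value g(x)
    # falls inside the contour interval [g(b), g(c)].
    gb = np.log(n * ((b - mu) ** 2).sum() + 1.0)
    gc = np.log(n * ((c - mu) ** 2).sum() + 1.0)
    cand = (g_train >= gb) & (g_train <= gc)
    # Step 4: dimensional projection pruning. Count candidates surviving
    # each one-dimensional contour |x_i - a_i| <= eps, then intersect the
    # projections in descending order of count until at most ms remain.
    proj = [cand & (np.abs(X[:, i] - a[i]) <= eps) for i in range(d)]
    for i in np.argsort([-p.sum() for p in proj]):
        if cand.sum() <= ms:
            break
        cand = cand & proj[i]
    idx = np.flatnonzero(cand)
    if idx.size == 0:              # fall back if the contour is empty
        idx = np.arange(n)
    # Steps 5-6: k nearest by Euclidean distance, Gaussian-weighted vote.
    dist = np.linalg.norm(X[idx] - a, axis=1)
    votes = {}
    for pos in np.argsort(dist)[:k]:
        label = y[idx[pos]]
        votes[label] = votes.get(label, 0.0) + np.exp(-dist[pos] ** 2)
    return max(votes, key=votes.get)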
V. EXPERIMENTAL RESULTS
The experiments were conducted on an Intel Pentium 4 CPU 2.6 GHz machine with 3.8 GB RAM, running Red Hat Linux version 2.4.20-8smp. We tested the proposed classifier using several datasets that have previously been used for classification problems, including datasets from the Machine Learning Databases Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. We also incorporated real-life datasets, such as the Remote Sense Imaginary (RSI), OPTICS, and Iris datasets, a few benchmark datasets used to evaluate classification algorithms.
The RSI dataset is a set of aerial photographs from the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, taken in 1998. The images contain three bands: red, green, and blue. Each band has values in the range 0 to 255, which in binary can be represented using 8 bits. The corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used, and the crop yield was selected as the class attribute. Combining all the bands and the synchronized data, a dataset with 6 dimensions (five feature attributes and one class attribute) was obtained. To simulate different classes, the crop yield was divided into four categories: low yield (intensity between 0 and 63), medium low yield (intensity between 64 and 127), medium high yield (intensity between 128 and 191), and high yield (intensity between 192 and 255). Three synthetic datasets were created to study the scalability and running time of the proposed classifier. The cardinality of these datasets varies from 32 to 96 million.
The KDDCUP 1999 dataset is the network intrusion dataset used in KDDCUP 1999 [10]. The dataset contains more than 4.8 million samples from the TCP dump. Each sample identifies a type of network intrusion such as Normal, IP Sweep, Neptune, Port Sweep, Satan, and Smurf.
The Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains 569 diagnosed breast cancers with 30 real-valued features. The task is to predict two types of diagnoses, either Benign or Malignant.
The OPTICS dataset has eight different classes, containing 8,000 points in 2-dimensional space. The Iris dataset is the Iris plants dataset, created by R.A. Fisher [6]. The task is to classify Iris plants into one of three varieties: Iris Setosa, Iris Versicolor, and Iris Virginica. The Iris dataset contains 150 samples in a 4-dimensional space.
A. Speed and Scalability Comparison
We compared the performance in terms of speed using only the RSI datasets. As mentioned previously, the cardinality of
these datasets varies from 32 to 96 million. In this experiment,
we compared the classification time of enhanced SMART-TV
(SMART-TV with P-tree) against the SMART-TV with scan,
P-KNN with HOBBIT metric, and a brute-force KNN. Note
that SMART-TV with scan is the original SMART-TV [1]
and differs slightly from the enhanced SMART-TV.
SMART-TV with scan does not have derived P-trees of the
total variation values of the training set and it determines the
superset of neighbors in the total variation contour by
scanning the total variation values of the training objects one
by one. While scanning, the relative indexes of the objects in
the contour are stored in an array. In this experiment, the same
pruning technique was also applied to the original SMART-TV after the scanning process was completed. The candidate
mask of the neighbors is created using P-Tree vertical
structure API, and the parameter passed to the API is the array
that holds the indexes of objects within the total variation
contour. The results show that even though the scan was
conducted on a single dimension containing the total variation
values of the training objects, the original SMART-TV takes
more time when compared to the enhanced SMART-TV that
uses P-tree range query to determine the objects in the
contour. Fig. 6 and 7 depict the classification time and
scalability comparison of the algorithms. The speed was
measured by observing the time required by CPU to complete
the classification process. We learned that the enhanced
SMART-TV and P-KNN are very comparable in terms of
speed. Both algorithms are faster than the other two
algorithms. For example, P-KNN takes 12.37 seconds on an
average to classify using a dataset of size 96 million, while the
enhanced SMART-TV takes 17.42 seconds. For the same
dataset, SMART-TV with scan takes about 106.76 seconds to
classify and KNN takes 891.58 seconds. KNN with sequential
scan is almost two orders of magnitude slower than the
enhanced SMART-TV and P-KNN algorithms. From this
observation, it is clear that when the cardinality increases, the
classification time of KNN increases linearly. The algorithms
that use P-tree vertical data structure are much faster.
SMART-TV with scan is faster than KNN because it only
scans a single dimension and compares the functional values
of the unclassified object with the functional values of the
training objects. In addition, the pruning technique is also
incorporated in SMART-TV with scan, whereas KNN has to scan and compute the Euclidean distance at the same time.
However, the SMART-TV with scan is still slower than the
enhanced SMART-TV and P-KNN.
The enhanced SMART-TV devotes much of its time to preprocessing, as discussed previously in the Preprocessing Phase section. The complexity of creating P-trees and derived P-trees
of the functional values is O(n) since the computation is
applied to all objects in the training set. The enhanced
SMART-TV also needs to compute a vector mean. The
complexity to compute the mean is O(d⋅b), where d is the
number of dimensions and b is the bit width of the P-trees. In
the classification phase, the time is used to prune the superfluous
neighbors in the candidate set. However, since the pruning is
done using P-tree vertical masks and AND bit operation, the
process is very fast. This is also true for the other vertical algorithms, such as P-KNN. The KNN classifier, on the other hand,
does not build a model or structure in advance. It invests all
the efforts in the classification phase when new objects arrive.
Much of its time is used to calculate the distance between the
training and the unclassified objects. Therefore, the
computational complexity for KNN to classify each new
unclassified sample is O(n).
Fig. 6. Trend of classification time on RSI dataset.

Fig. 7. Classification time on KDDCUP dataset.
B. Classification Accuracy Comparison
We examined the classification accuracy (F-score) on the experimental datasets using a 5-fold cross-validation model. We used ε = 0.1, ms = 200, and k = 3, 5, and 7, and compared the enhanced SMART-TV with P-KNN using HOBBIT, P-KNN using Equal Interval Neighborhood (EIN-ring), and KNN with linear search. The results are
summarized in Table 1. From the results, it is clear that the
classification accuracy of enhanced SMART-TV is very
comparable to that of the KNN algorithm for all datasets. The
performance of P-KNN using HOBBIT degrades significantly
when classifying datasets with high dimensions, as shown by the WDBC dataset.
TABLE 1
CLASSIFICATION ACCURACY COMPARISON ON VARIOUS DATASETS

Classification Accuracy (F-Score)
Dataset | Enhanced SMART-TV | P-KNN with HOBBIT | P-KNN with EIN | KNN
KDDCUP  | 0.93              | 0.91              | n/a            | 0.93
OPTICS  | 0.99              | 0.99              | 0.99           | 0.99
IRIS    | 0.97              | 0.89              | 0.96           | 0.96
WDBC    | 0.94              | 0.39              | 0.96           | 0.97
VI. CONCLUSION
In this paper, we have presented an enhancement of the SMART-TV classifier by introducing dimensional projections pruning. The evaluations show that the dimensional projections can effectively prune the superfluous neighbors that are far away in distance from the unclassified object, even though their absolute difference of total variation can be small. In terms of speed and scalability, the enhanced SMART-TV algorithm is very fast and comparable to the other vertical nearest neighbor algorithms, and in terms of accuracy, the enhanced SMART-TV is very comparable to the KNN classifier.
REFERENCES
[1] Abidin, T., & Perrizo, W. (2006). SMART-TV: A fast and scalable nearest neighbor based classifier for data mining. Proceedings of the 21st Annual ACM Symposium on Applied Computing (pp. 536-540). Dijon, France: Printing House, Inc.
[2] Abidin, T., Perera, A., Serazi, M., & Perrizo, W. (2006). A vertical
approach to computing set squared distance. International Journal of
Computers and their Applications (IJCA), 13 (2), 94-102.
[3] Abidin, T., Dong, A., Li, H., & Perrizo, W. (2006). Efficient image
classification on vertically decomposed data. Proceedings of the 1st
IEEE International Workshop on Multimedia Databases and Data
Management. Atlanta, Georgia: IEEE Computer Society.
[4] Ankerst, M., Breunig, M, Kriegel, H-P., & Sander, J. (1999). OPTICS:
ordering points to identify the clustering structure. Proceeding of the
ACM SIGMOD (pp. 49-60). Philadelphia, P.A.
[5] Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern
classification. IEEE Transaction on Information Theory, 13, 21-27.
[6] Demsar, J., Zupan, B., & Leban, G., (2004). Orange: from experimental
machine learning to interactive data mining. White paper
(www.ailab.si/orange). Faculty of Computer and Information Science,
University of Ljubljana.
[7] Ding, Q., Khan, M., Roy, A., & Perrizo, W. (2002). The p-tree algebra.
Proceedings of the 17th Annual ACM Symposium on Applied Computing
(pp. 426-431). Madrid, Spain.
[8] Gray, J. (2004). The next database revolution. Proceedings of the 10th
ACM SIGMOD (pp. 1-4). Paris.
[9] Grother, P.J., Candela, G.T., & Blue, J.L. (1997). Fast implementations
of nearest neighbor classifiers. Pattern Recognition, 30, 459-465.
[10] Hettich, S., & Bay, S. D. (1999). The UCI KDD archive
http://kdd.ics.uci.edu. Irvine, University of California, Department of
Information and Computer Science.
[11] Khan, M., Ding, Q., & Perrizo, W. (2002). K-nearest neighbor
classification on spatial data stream using p-trees. Proceedings of the
Pacific-Asia Conference on Knowledge Discovery and Data Mining
(pp. 517-528). Taipei, Taiwan.
[12] Perera, A., Abidin, T., Serazi, M., Hamer, G., & Perrizo, W. (2005).
Vertical set squared distance based clustering without prior knowledge
of k. Proceedings of the 14th International Conference on Intelligent
and Adaptive Systems and Software Engineering (pp. 72-77). Toronto,
Canada.
[13] Perrizo, W. (2001). Peano count tree technology lab notes (Tech. Rep.
NDSU-CS-TR-01-1). Computer Science Department, North Dakota
State University.
[14] Rahal, I., Ren, D., & Perrizo, W. (2004). A scalable vertical model for
mining association rules. Journal of Information & Knowledge
Management (JIKM), 3(4), 317-329.
[15] Ren, D., Wang, B., & Perrizo, W. (2004). RDF: a density-based outlier
detection method using vertical data representation. Proceedings of the
4th IEEE International Conference on Data Mining (pp. 503-506).
[16] Vries, A.P., Mamoulis, N., Nes, N., & Kersten, M. (2002). Efficient kNN search on vertically decomposed data. Proceedings of the ACM
SIGMOD (pp. 322-333). Madison, Wisconsin, USA.
Taufik F. Abidin received his B.Sc from Sepuluh
Nopember Institute of Technology, Indonesia in
1993. Upon completing his B.Sc degree, he joined
Syiah Kuala University in Banda Aceh, Indonesia.
He received his M.Tech degree in Computer
Science from RMIT University, Australia in 2000,
and completed his Ph.D. in Computer Science at
North Dakota State University (NDSU), USA in
2006. He received the ND EPSCoR Doctoral
Dissertation Award from NDSU in 2005. He has
been a Senior Software Engineer at Ask.com in
New Jersey, developing algorithms and implementing
efficient production-level programs to improve web
search results. His research interests include Data
Mining, Database Systems, and Information
Retrieval.
William Perrizo is a Professor of Computer
Science at North Dakota State University. He
holds a Ph.D. degree from the University of
Minnesota, a Master’s degree from the University
of Wisconsin and a Bachelor’s degree from St.
John's University. He has been a Research Scientist
at the IBM Advanced Business Systems Division
and the U.S. Air Force Electronic Systems
Division. His areas of expertise are Data Mining,
Knowledge Discovery, Database Systems,
Distributed Database Systems, High Speed
Computer and Communications Networks,
Precision Agriculture and Bioinformatics. He is a
member of ISCA, ACM, IEEE, IAAA, and AAAS.
Using Business Intelligence Solutions for
Achieving Organization’s Strategy: Arab
International University Case Study
Mouhib Alnoukari
Arab International University, Damascus, Syria
Abstract—Business Intelligence (BI) is becoming an important IT framework that can help organizations manage, develop, and communicate their intangible assets, such as information and knowledge. Thus, it can be considered an imperative framework in the current knowledge-based economy.
In this paper, we explain the role BI plays in providing organizations with a way to plan and achieve their business strategy. We examine this role using a case study in the field of higher education, specifically how BI helped one of the new private universities in Syria (the Arab International University) plan and achieve its business strategy.
Index Terms—Business Intelligence, Data Mining, Strategic
Management.
I. INTRODUCTION
Business Intelligence is becoming vital for many organizations, especially those that have extremely large amounts of data. Decision makers depend on detailed and accurate information when they have to make decisions. BI can provide decision makers with such accurate information, and with the appropriate tools for data analysis.
BI is an umbrella term that combines architectures, tools, databases, applications, practices, and methodologies [20, 6]. The Gartner Group (1996), the first company to use the term BI in the market in the mid-1990s, defined BI as "information and applications available broadly to employees, consultants, customers, suppliers, and the public. The key to thriving in a competitive marketplace is staying ahead of the competition. Making sound business decisions based on accurate and current information takes more than intuition. Data analysis, reporting, and query tools can help business users dig in the mine of data to extract and/or synthesize valuable information from it – today these tools collectively fall into category called Business Intelligence" [9]. Many organizations that developed successful BI solutions, such as Continental Airlines, have seen investment in BI generate increases in revenue and cost savings equivalent to a 1000% return on investment (ROI) [22].

Manuscript received September 30, 2009. Mouhib Alnoukari is with the Arab International University, Damascus, Syria (phone: +963-015-2050; fax: +963-015-860385; e-mail: mnoukari@aiu.edu.sy).
An important question raised by many researchers [16, 18] is what the main reason was that pushed companies to search for BI solutions, and what differentiates BI from Decision Support Systems (DSS). In fact, over the last decades, organizations developed a lot of Operational Information Systems (OIS), resulting in a huge amount of disparate data located in different geographic locations, on different storage platforms, and in different forms. This situation prevents organizations from building common, integrated, correlated, and immediate access to information at a global level. DSS evolved during the 1970s, with the objective of providing an organization's decision makers with the required data to support the decision-making process. In the 1980s, Executive Information Systems (EIS) evolved to provide executive officers with the information needed to support the strategic decision-making process. BI evolved during the 1990s as data-driven DSS, sharing some of the objectives and tools of DSS and EIS systems.
BI architectures include data warehousing, business analytics, business performance management, and data mining. Most BI solutions deal with structured data [1]. However, many application domains require the use of unstructured data (or at least semi-structured data), e.g. customer e-mails, web pages, competitor information, sales reports, research paper repositories, and so on [4, 21].
Any BI solution can be divided into the following three layers [1]: the data layer, which is responsible for storing structured and unstructured data for decision support purposes. Structured data is usually stored in Operational Data Stores (ODS), Data Warehouses (DW), and Data Marts (DM). Unstructured data are handled using Content and Document Management Systems. Data are extracted from operational data sources, e.g. SCM, ERP, CRM, or from external data sources, e.g. market research data, then transformed and loaded into the DW by ETL tools. The logic layer provides functionality to analyze data and provide knowledge; this includes OLAP and data mining. Finally, the access layer is realized by some sort of software portal (the BI portal).
Our main focus in this paper is to explain the role of a BI
solution in helping organizations formulate, implement, and
achieve their strategy. Many researchers [5, 17, 10, 12] have
highlighted IT alignment with business in general, without
specifying which technologies and tools can help
organizations achieve their strategy.
Fig. 1. IT alignment with Business Strategy [5].
The rest of this paper is organized as follows. The next
section explains the role BI plays as an IT enabler in
achieving an organization's strategy; this role is highlighted
using the strategic alignment model proposed by Henderson and
Venkatraman (1993), explaining how the alignment can help
organizations become flexible organizations and how a BI
solution can provide sustainable competitive advantages. BI's
role as a strategic solution is then examined through a case
study in the higher education field (Arab International
University), explaining how the solution implemented at this
university helped achieve its main strategic goal: improving
the quality of its higher education system.
II. BUSINESS INTELLIGENCE AS AN IT ENABLER TO ACHIEVE
ORGANIZATION’S STRATEGY
In recent years, IT in general, and BI as a strategic
framework in particular, have become increasingly important in
strategic management, supporting business strategies.
IT-enabled strategic management addresses the role of IT in
strategy formulation and implementation processes [19].
pioneer of management by objectives, was one of the first
who recognized the dramatic changes IT brought to
management. Strategic management theories were largely
geared towards gaining competitive advantages. Porter (1979)
proposed a number of very influential strategic analysis
models, such as the five-forces model of competition, the
value chain and generic competitive strategies. Porter (1979)
said “The essence of strategy formulation is coping with
competition” [14].
Many researchers have indicated the importance of IT
alignment with business strategy in order to enhance corporate
strategy [5, 17] (Figure 1).
The Strategic Alignment Model developed by Henderson and
Venkatraman (1993) was one of the first models to describe
explicitly the interrelationships between business strategies
and IT strategies [10]. This model is based on two main
concepts (Figure 2): strategic fit, which recognizes the need
to position the firm in an external marketplace where growth
can take place, and functional integration, which addresses
how best to structure internal systems to execute the business
strategy of the firm [12].

Fig. 2. Strategic Alignment Model
IT alignment is not simply formulating an IT strategy to fit
the business strategy; it also has to consider external forces
and environmental uncertainty. Such alignment helps
organizations become flexible.
As a result of accelerations in the rates of innovation and
technological changes, markets evolve rapidly, products’ life
cycles get shorter and innovation becomes the main source of
competitive advantage. Therefore, organizations seek
flexibility to meet market demands.
Drnevich et al. (2006) explained that flexibility-based
perspectives evolved from Schumpeter's concept of creative
destruction [8]. These perspectives are operationalized in
strategic management through the dynamic capabilities and real
options views. The dynamic capabilities view refers to the
firm's ability to maintain and adapt its internal resources to
environmental changes so as to sustain competitive advantage;
it involves continuous search, innovation, and adaptation of
firm resources and capabilities to uncover and tap new sources
of competitive advantage. The real options view is effective
in dealing with issues of uncertainty: it allows the firm to
defer investment decisions until uncertainties are resolved.
Organizational adoption of new IT facilitates the transition
into a flexible organization, and Business Intelligence is one
of the new IT frameworks that can help such a transition. BI
technologies have become a source of competitive advantage and
differentiation [13, 11]. Tang and Walters (2006) mention that
competitive advantage has become a hot strategic management
topic [19]; they also argue that continuously generating new
knowledge is the only way to sustain competitive advantage.
There are many reasons for organizations to adopt business
intelligence systems in order to achieve their strategy:
• Business Intelligence is considered an extension of
corporate strategy activities. Herring (1988)
considered that “Strategy can be no better than the
information from which it is derived” [11].
• Data analytics can be used effectively to build future
business strategy.
• Data analytics and data mining could reveal hidden
reasons for some deficiencies as well as possible
high-yielding new investments.
• Corporations need to be sure that they are receiving
the right information related to their long-term
strategy.
Herring (1988) considered that business intelligence can
help organizations in [11]:
• Supporting the strategic decision-making process of
the corporation.
• Supporting the corporation's SWOT analysis.
• Supporting strategic planning and processes.
All the mentioned benefits should provide organizations
with sustainable competitive advantages.
III. ARAB INTERNATIONAL UNIVERSITY BI SOLUTION
Arab International University (AIU) is a new private
university in Syria. Being a three-year-old university, it
began to find difficulties in managing the huge volumes of
data produced by its different information systems. The
academic system, financial system, HR system, and Quality
Assurance Automated System (QAAS) are at the core of the
university's daily operations [3].
As most of the university's information systems were
provided by different sources, there was an urgent need to
integrate data from all these sources into one data warehouse,
in a manner that could help the university make use of all its
data to assure quality.
The data gathered to produce the university BI solution and
help enhance education quality are:
• Academic data (registration, examination, enrollment, etc.).
• Financial data (student fees, staff salaries, orders, sales, etc.).
• Human Resources data (staff personal information).
• QAAS data (student feedback, GPA differences, drop ratio, plan completion, industry feedback, etc.).
An enterprise data warehouse was built to hold all of the
above data. The data came from different sources (Oracle
databases, SQL Server databases, and Excel files). The data
warehouse was built on an Oracle 11g database, and the ETL
process was built using the ETL package in Oracle Warehouse
Builder.
The solution, developed by AIU developers, followed the
ASD-DM paradigm for BI solutions. The ASD-DM paradigm applies
agile modeling (specifically Adaptive Software Development,
ASD) in three steps to develop a successful BI model [2].
The BI solution provides the university's top management
with reports that uncover trends in students' and instructors'
performance in a manner that would be impossible, or at least
extremely tedious, without data warehouses and data mining.
Fig. 3. AIU students clustered according to their accumulative
GPA & their credit hours.
One of the main strategic goals of AIU is to enhance
students' GPA. Figure 3 shows the correlation between
students' accumulative GPA and their credit hours: students
with more accumulated hours obtain higher accumulative GPAs.
It was also very clear that there is a strong correlation
between students' level of English and accumulative GPA, as
shown in Figure 4.
Clustering each faculty's students according to their
cumulative GPAs and their completed hours helps the
university's academic advisors focus on special groups,
especially the group of students that are likely to drop out
(Figure 5). The correlation between credit hours and AGPA
changes gives a clear picture of the optimal number of credit
hours students should take to increase their AGPA; Figure 5
shows that this ranges between 2-12 and 20-22 hours. This
provides AIU decision makers with grounds to find out how to
help students enhance their AGPAs while completing the
required credit hours each semester (which varies between 16
and 19 hours). These results helped AIU update its education
system, pushing students to enhance their English level by
adding more English teaching hours in the early phases.

Fig. 4. Relationship between AIU students' accumulative GPA
& their English level.
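The paper does not state which clustering algorithm was used; purely as an illustration, the sketch below clusters hypothetical (credit hours, cumulative GPA) pairs with k-means via scikit-learn, which is one common choice for this kind of grouping.

    # Hypothetical sketch: k-means is assumed here for illustration only.
    from sklearn.cluster import KMeans
    import numpy as np

    # Each student is a (completed credit hours, cumulative GPA) point.
    students = np.array([
        [12, 1.9], [18, 2.1], [30, 2.4],   # early-stage students
        [60, 2.8], [75, 3.1], [90, 3.3],   # later-stage students
        [10, 1.2], [24, 1.5],              # a potential at-risk group
    ])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(students)
    for cluster in range(3):
        members = students[labels == cluster]
        print(f"cluster {cluster}: mean hours={members[:, 0].mean():.0f}, "
              f"mean GPA={members[:, 1].mean():.2f}")

Advisors could then focus on the cluster whose mean GPA is lowest, mirroring the drop-out group highlighted in Figure 5.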
AIU's BI solution provides AIU managers with a set of
reports that help them evaluate the university's current
strategy. The solution's reporting includes:
• Predicting students' GPAs. Different algorithms were used
for prediction, and evaluating the results of each algorithm
permits choosing the method with the highest predictive
confidence; the SVA algorithm was chosen, with more than 70%
predictive confidence (Figure 6 shows the GPA prediction
deviation errors). This also helps predict the average GPA per
faculty, which helps AIU prepare plans to enhance overall
performance.
• Analysis of the total number of students present in
different time ranges per day, which helps AIU achieve one of
its strategic goals by enhancing the services provided to its
students and optimizing costs (Figure 8). Such analysis helps
AIU prepare optimal plans for transportation, restaurants,
library services, and others.
• Identifying students likely to drop out, using the students'
predicted GPA results (Figure 6).
• Classification of students' results according to different
subjects (Figure 4).
• Market basket analysis reports that help in preparing the
timetable for each semester. The resulting timetable contains
sets of highly interrelated courses that students require: two
courses with a high association correlation do not overlap in
the timetable, and should be enrolled in the same semester
rather than one or two semesters apart. This analysis helps
achieve one of the main AIU strategic goals by enhancing the
total number of enrolled courses (Figure 7), which brings an
immediate increase in financial revenue.

Fig. 5. Credit hours' relation with the AGPA changes.
Fig. 6. GPA prediction deviation errors chart.
Fig. 7. Currently enrolled university courses.
Fig. 8. AIU students' attendance per different time ranges per day.
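As an illustration of the market-basket idea above (with hypothetical course data and thresholds), the sketch below flags course pairs whose co-enrollment support and confidence are high enough that they should be scheduled in the same semester without overlapping:

    # Minimal sketch, hypothetical data: flag strongly associated course pairs.
    from itertools import combinations

    enrollments = [  # one set of course codes per student
        {"MATH1", "PROG1", "ENG1"},
        {"MATH1", "PROG1"},
        {"MATH1", "PROG1", "STAT1"},
        {"ENG1", "STAT1"},
    ]
    n = len(enrollments)
    courses = set().union(*enrollments)

    for a, b in combinations(sorted(courses), 2):
        both = sum(1 for s in enrollments if a in s and b in s)
        has_a = sum(1 for s in enrollments if a in s)
        support = both / n
        confidence = both / has_a if has_a else 0.0
        if support >= 0.5 and confidence >= 0.8:
            print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")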
IV. CONCLUSION
In this paper, we explained the use of a BI solution in
formulating, implementing, and achieving an organization's
strategy. We also demonstrated how a BI solution can provide
organizations with sustainable competitive advantages.
This work can be extended by integrating knowledge
management with BI solutions, as this can help derive more
value (knowledge) from the explosion of textual information,
which can add more inputs to strategic decisions.
Another important factor is the use of agile methodologies
in order to manage the current high-speed, high-change
environment. Such a complex and dynamic environment strongly
affects an organization's strategy.
REFERENCES
[1] M. Alnoukari, and W. Alhussan., “Using Data Mining Techniques for
Predicting Future Car market Demand”, International Conference on
Information & Communication Technologies: From Theory to Applications,
IEEE Conference, Syria, 2008.
[2] M. Alnoukari, Z. Alzoabi, and S. Hanna, “Applying Adaptive
Software Development (ASD) Agile Modeling on Predictive Data Mining
Applications: ASD-DM Methodology”, International Symposium on
Information Technology, ITSIM 08, Malaysia, 2008.
[3] Z. Alzoabi, F. Diko, and M. Alnoukari, “Enhancing Education Quality
Assurance Using information Systems- QAAS System”, International
Symposium on Information Technology, ITSIM 08, Malaysia, 2008.
[4] H. Baars, and H. G. Kemper, “Management Support with Structured and
Unstructured Data- An Integrated Business Intelligence Framework”,
Information Systems Management, 25: pp. 132–148, 2007.
[5] D. Boddy, A. Boonstra, and G. Kennedy, Managing Information Systems:
An Organisational Perspective, (2nd edition). Harlow: Pearson. 2005
[6] W. F. Cody, J. T. Kreulen, V. Krishna, and W. S. Spangler, “The Integration
of Business Intelligence and Knowledge Management”, IBM Systems Journal, Vol.
41, No. 4, pp. 697–713, 2002.
[7] T. Corbitt, “Business Intelligence and Data Mining”, Management
Services Magazine, November 2003.
[8] P. Drnevich, J. Hahn, and M. Shanley, “Toward a Strategic Perspective of
Information Technology”, IT-Enabled Strategic Management, Idea Group
Publishing, pp. 16-37, 2006.
[9] Gartner Group (September, 1996). Retrieved November 12, 2005, from
http://www.innerworx.co.za/products.htm.
[10] W. V. Grembergen, S. D. Haes, and E. Guldentops, “Structures,
Processes and Relational Mechanisms for IT Governance”, Strategies for
Information Technology Governance, by Wim Van Grembergen (ed), Idea
Group Publishing, 2004.
[11] J. P. Herring, “Building a Business Intelligence Systems”, Journal of
Business Strategy, May/June 1988.
[12] J. D. Katz, “The Integral Role of Information Technology in Achieving
Business Strategy Success: Managing the Information Resources of Global
Competitors”, Advanced Topics in Global Information Management, by Tan
F. B., Idea Group Publishing, pp. 42-62, 2002.
[13] M. Pérez-Valls, J. M. Ortega-Egea, and J. A. Plaza-Úbeda, “Relationship
between New Information Technologies and Flexible Organizational Forms”,
IT-Enabled Strategic Management, Idea Group Publishing, pp. 1-16, 2006.
[14] M. E. Porter, “How Competitive Forces Shape Strategy”, Harvard
Business Review 57, March/April 1979, pp. 137-145.
[15] M. E. Porter, “Competitive Strategy—Techniques for Analyzing
Industries and Competitors”, New York: Free Press. 1980.
[16] D. J. Power, “A Brief History of Decision Support Systems”,
DSSResources.COM, http://DSSResources.COM/history/dsshistory.html,
version 4.0, March 10, 2007.
[17] R. Sabherwal, and Y. E. Chan, “Alignment Between Business and IS
Strategies: A Study of Prospectors, Analyzers, and Defenders”. Information
Systems Research, 12(1), 11-33. 2001.
[18] M. Shariat, R. Jr. Hightower, “Conceptualizing Business Intelligence
Architecture”, Marketing Management Journal, Volume 17, Issue 2, pp. 40 –
46, 2007.
[19] Z. Tang, and B. Walters, “The Interplay of Strategic Management and
Information Technology”, IT-Enabled Strategic Management, Idea Group
Publishing, pp. 1-16, 2006.
[20] E. Turban, J. E. Aronson, T. P. Liang, R. Sharda, Decision Support and
Business Intelligence Systems, 8th edition, Pearson Prentice Hall, 2007.
[21] C. H. Wee, and M. L. Leow, “Competitive Business Intelligence in
Singapore”, Journal of Strategic Marketing 1, pp. 112-139, 1994.
[22] J. Zack, R. K. Rainer, and T. E. Marshall, “Business Intelligence: An
Analysis of the Literature”, Information Systems Management, 25: pp. 121–131, 2007.
Feature Selection for Large Scale Data by
Combining Class Association Rule Mining and
Information Gain: a Hybrid Approach
Appavu alias Balamurugan, Pramala, Rajalakshmi and Rajaram
Department of Information Technology,
Thiagarajar College of Engineering, Madurai, India
Abstract— Feature selection is a fundamental problem in data
mining to select relevant features and cast away irrelevant and
redundant features from an original feature set based on some
evaluation criterion. In this paper, we propose a filter method that
finds attributes associated with the class and ranks them using
information gain. Highly associated features are found using
class association rule mining; information gain is used for
removing redundancy and for further pruning. The efficiency and
effectiveness of our method are demonstrated through extensive
comparisons with other methods using real-world data of high
dimensionality.
Index Terms— Association mining, Data mining, Feature
selection, Information gain
I. INTRODUCTION
DATA MINING deals with the discovery of hidden
knowledge, unexpected patterns and new rules from
large databases. It often involves datasets with a large
number of attributes. Many of the attributes in most real world
data are redundant and/or simply irrelevant to the purposes of
discovering interesting patterns.
Attribute reduction selects relevant attributes in the dataset
prior to performing data mining. Attribute reduction is also
known as feature selection. For example, high-dimensional
data (i.e., data sets with hundreds or thousands of features)
can contain a high degree of irrelevant and redundant
information, which may greatly degrade the performance of
learning algorithms. Therefore, feature selection has become
very necessary for machine learning tasks facing
high-dimensional data, and the dimensionality of such datasets
has to be reduced. Selecting features is one of the most
important tasks in designing a good classifier. Before
entering the learning phase, it is often said that as many
features as possible should be collected to improve
performance; yet irrelevant and correlated features may
degrade classifier performance, and a large number of features
may yield complex models that make interpretation difficult.
This work aims to enhance feature selection methodologies and,
correspondingly, the accuracy and performance of the
classification algorithm.
II. RELATED WORK
Several feature selection techniques have been proposed in
the literature, including some important surveys of feature
selection algorithms, such as Molina et al. [2] and Guyon and
Elisseeff [3]. Different feature selection methods can be
categorized into the wrapper [4], filter [5-9] and the embedded
model [1, 5]. The wrapper model uses the predictive accuracy
of a predetermined learning algorithm to determine the
goodness of the selected subset; the most notable drawback of
this method is its higher computational cost [4].
filter model separates feature selection from classifier learning
and selects feature subsets that are independent of any
learning algorithm. It relies on various measures of the general
characteristics of training data such as distance, information
dependency and consistency [15]. Filter methods are found to
perform faster than wrapper methods and are therefore widely
used to deal with high-dimensional datasets. Several feature
selection techniques have been proposed in the literature,
such as Correlation-based Feature Selection (CFS), Principal
Component Analysis (PCA), and Gain Ratio Attribute Evaluation
(GR) [17]. Most of these methods are usually combined with
some other method to find the appropriate number of attributes
[18, 19]. There are also some proposals on how to find optimal
feature selection algorithms, which comprehensively discuss
attribute selection based on entropy metrics [10, 11, 12].
Several new feature selection methods have been derived from
entropy metrics, such as symmetrical uncertainty, information
gain and gain ratio [13], and mutual information [14]. The
concept of two-dimensional discriminant rules initially grew
out of association rule generation, and also supports
classification and regression [7]; the core of the operation
lies in the concept of x-monotone region optimization. The
role of correlation for numeric and discrete classes has been
discussed in detail, where the accuracy level is considered
for single attributes; the need to accommodate pairwise
attributes as a two-dimensional feature is applied as a
pattern [18]. Other recent researchers also explore pairwise
attribute selection from the perspective of noise detection
[19].
III. PROBLEM DEFINITION
The enormous number of instances causes serious problems
for many machine learning algorithms. High-dimensional data
(i.e., data sets with hundreds or thousands of features) can
contain a high degree of irrelevant and redundant information,
which may greatly degrade the performance of learning
algorithms. In general, any learning algorithm faces the
problem of selecting the subset of features upon which
attention must be focused; therefore, feature selection
becomes very necessary. In this paper, feature selection is
done using class association rules combined with information
gain: our proposed method filters out attributes irrelevant to
the target class by generating rules according to class-based
association rule mining [20], then selects features with
sufficient support and confidence values. Redundancy among the
selected features is removed using the information gain
algorithm.
IV. PROPOSED METHODOLOGY
In this work, we propose a filter approach to select
relevant features based on the associations among them; such
associations can be discovered using class-based association
rule mining. The proposed associative feature selection
approach is based on the following heuristics to separate
relevant and irrelevant terms: the occurrence of a term in
many association rules means that it is associated with many
other terms, so such terms are assigned a high score and
considered relevant.
Conversely, terms that occur infrequently in association
rules are assigned a low score, as irrelevant terms. The
assignment of scores to features comprises the following three
steps:
(1) we first determine the constraints of the association rules;
(2) we search for association rules satisfying the constraints;
(3) we further reduce the features by setting a threshold and
removing redundant features using information gain ranking.
In the first step, we determine the constraint F for the
association rules: F: ARs → Boolean. Here, ARs is the set of
all association rules on the set of terms. Conventionally, the
constraint for association rules is that the values of support
and confidence should be greater than the minimal values
(thresholds) of min_supp and min_conf. The constraint F can
also include other measures for mining association rules.
The second step searches for association rules that satisfy
the constraint F. The typical approach for searching
association rules is the Apriori algorithm [16]: the support
measure is first used to filter out most of the rules that do
not satisfy min_supp, and the confidence measure (or other
measures) is then applied to the remaining rules, keeping
those that satisfy min_conf.
Finally, the threshold is set for the selected features by
taking a percentage of the selected features from the set of
original features; redundancy is also removed from selected
features that have similar rankings.
Pruning is thus done at two stages: one prunes low-score
features based on the confidence and support values, and the
other prunes redundant attributes within the threshold.
PROPOSED ALGORITHM

Input:  S(a1, a2, a3, ..., aN, C)  // a training dataset with M instances and S classes
        min_supp, min_conf         // minimum support and confidence
Output: Sbest                      // selected features

Find relevant features:
1   begin
2     ck: candidate itemset of size k;
3     Lk: frequent itemset of size k;
4     L1: frequent itemset of size 1;
5     C1: class label;
6     L2: combine L1 and C1 and check against min_supp and min_conf;
7     for k = 2 to Lk != null do begin
8       Lk+1: combine Lk and Lk;
9       ck+1: candidates generated from Lk+1;
10      for each transaction t in database d do begin
11        count++;   // count all candidates in ck+1 also present in t
12        Lk+1: candidates in ck+1 satisfying min_supp and min_conf;
13      end;
14    end;
15    Slist: features selected from the rules;
16    Selected percentage = (selected features / total features) * 100;

Remove redundant features:
17    find the gain value for the Slist features using the information gain algorithm;
18      Info(D) = −∑(i=1..m) Pi log2(Pi)                  // m = number of distinct classes
19    for each feature value:
20      InfoA(D) = ∑(j=1..v) (|Dj| / |D|) * Info(Dj)      // v = number of distinct values
21      Gain(A) = Info(D) − InfoA(D)
22    end
23    if two or more features have the same gain value
24      remove all redundant features, leaving one feature
25    end
26    No. of features to select = (selected percentage / 100) * selected features
27    Sbest: final selected features,
      where the final selected features are the top "No. of features to select"
      features, taken in order of information gain value.
Let the dataset have N features, M instances, and S target
classes. In the proposed algorithm, association rules are
generated, from which we select the corresponding features. To
remove redundancy from the relevant features we use
information gain: the gain value of each attribute with
respect to the class label is calculated, and if two or more
features have the same information gain value this is treated
as redundancy and the redundant features are removed. Sbest is
the final selection, based on the percentage calculation.
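As a concrete illustration, the Python sketch below is a deliberately compressed rendering of the two pruning stages: it evaluates only single attribute=value rules against the class, whereas the full algorithm grows multi-attribute rules Apriori-style, and the toy dataset and thresholds are hypothetical.

    import math
    from collections import Counter

    # Toy training set (hypothetical): (outlook, windy) -> class
    data = [
        ("sunny", "no", "yes"), ("sunny", "yes", "no"),
        ("rainy", "no", "yes"), ("rainy", "yes", "no"),
        ("sunny", "no", "yes"), ("rainy", "no", "yes"),
    ]
    attrs = ["outlook", "windy"]
    min_supp, min_conf = 0.2, 0.7

    def entropy(labels):
        # Info(D) = -sum p_i log2 p_i over the class distribution.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    # Stage 1: keep attributes that appear in a class association rule
    # (attribute=value -> class) satisfying min_supp and min_conf.
    relevant = set()
    for i, attr in enumerate(attrs):
        for value in {row[i] for row in data}:
            matching = [row for row in data if row[i] == value]
            for cls in {row[-1] for row in matching}:
                both = sum(1 for row in matching if row[-1] == cls)
                if both / len(data) >= min_supp and both / len(matching) >= min_conf:
                    relevant.add(attr)

    # Stage 2: rank surviving attributes by information gain and drop
    # attributes whose gain duplicates one already kept (redundancy removal).
    base = entropy([row[-1] for row in data])
    gains = {}
    for i, attr in enumerate(attrs):
        if attr not in relevant:
            continue
        cond = 0.0
        for value in {row[i] for row in data}:
            part = [row for row in data if row[i] == value]
            cond += len(part) / len(data) * entropy([row[-1] for row in part])
        gains[attr] = base - cond   # Gain(A) = Info(D) - InfoA(D)

    selected, seen = [], set()
    for attr, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
        if round(gain, 6) not in seen:
            seen.add(round(gain, 6))
            selected.append(attr)
    print(selected)   # -> ['windy'] on this toy data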
The above proposed algorithm is based on the following two
concepts, namely class-based and correlation-based association
rule mining.
A. Class Based Association
Association rule mining has gained a great deal of
attention. Formally, an association rule R is an implication
X → Y, where X and Y are sets of items in a given dataset. The
confidence of the rule confidence(R) is the percentage of
records that contain Y among the total number of records
containing X. The support of the rule ‘support(R)’ is the
percentage of records containing X and Y with respect to the
number of all records.
Let P[S] be the probability that an itemset S is present in
a given transaction of the database; P[S] can be considered
the support of the itemset S as defined above (also denoted
support(S)). The antecedent support of rule R is denoted P[X]
and the consequent support P[Y]. Assume the database contains
N records, and that the numbers of records containing X, Y,
and both X and Y are a, b, and c respectively. It follows from
the definitions of support and confidence of association rules
that support(R) = c/N and confidence(R) = c/a, and from the
definition of itemset support that P[X] = a/N, P[Y] = b/N and
P[X ∧ Y] = c/N. Thus, the values of supp(R) and conf(R) can be
computed using P[X ∧ Y] and P[X] as follows:
support(R) = P[X ∧ Y]                (1)

confidence(R) = P[X ∧ Y] / P[X]      (2)
The confidence of a rule R measures the implication
relation of the antecedence (X) to the consequence (Y), which
is the actual interestingness to the rule. It shows the prediction
of Y when X occurs.
Association rules were proposed for resolving market basket
problems on transactional data. However, when all rule targets
are constrained by the class labels, association rules become
class (or constraint) association rules, and they can be used
for classification purposes.
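As a small worked example of these definitions, with hypothetical counts:

    # Hypothetical counts: N records in total, a contain X, b contain Y,
    # c contain both X and Y. Then, per the definitions above:
    N, a, b, c = 1000, 200, 300, 150

    support_R = c / N       # P[X ∧ Y]          = 0.15
    confidence_R = c / a    # P[X ∧ Y] / P[X]   = 0.75
    print(support_R, confidence_R)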
B. Information Gain
In our solution we adopt an approach based on the
information-theoretic concept of entropy, a measure of the
uncertainty of a random variable. The entropy of a variable X
is defined as

H(X) = −∑i P(xi) log2 P(xi)                          (3)

and the entropy of X after observing values of another
variable Y is defined as

H(X|Y) = −∑j P(yj) ∑i P(xi|yj) log2 P(xi|yj)         (4)

where P(xi) is the prior probability of each value of X, and
P(xi|yj) is the posterior probability of X given a value of Y.
The amount by which the entropy of X decreases reflects the
additional information about X provided by Y and is called the
information gain, given by

IG(X|Y) = H(X) − H(X|Y)                              (5)
According to this measure, a feature Y is regarded as more
correlated to feature X than to feature Z if IG(X|Y) >
IG(Z|Y). Symmetry is a desired property for a measure of
correlation between features; however, information gain is
biased in favor of features with more values, and the values
have to be normalized to ensure that they are comparable. If
two or more features have the same information gain value, all
but one are removed from the dataset to reduce redundancy.
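The short sketch below transcribes equations (3)-(5) directly, using a small hypothetical joint distribution of two binary variables:

    import math

    # Hypothetical joint distribution P(x, y) over binary X and Y.
    joint = {("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
             ("x1", "y0"): 0.1, ("x1", "y1"): 0.4}
    px = {"x0": 0.5, "x1": 0.5}   # marginal P(x)
    py = {"y0": 0.5, "y1": 0.5}   # marginal P(y)

    # Eq. (3): H(X) = -sum_i P(x_i) log2 P(x_i)
    h_x = -sum(p * math.log2(p) for p in px.values())

    # Eq. (4): H(X|Y) = -sum_j P(y_j) sum_i P(x_i|y_j) log2 P(x_i|y_j)
    h_x_given_y = 0.0
    for y, p_y in py.items():
        for x in px:
            p_x_given_y = joint[(x, y)] / p_y
            h_x_given_y -= p_y * p_x_given_y * math.log2(p_x_given_y)

    # Eq. (5): IG(X|Y) = H(X) - H(X|Y)
    print(h_x - h_x_given_y)   # ≈ 0.278 bits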
V. IMPLEMENTATION OF PROPOSED ALGORITHM
The proposed algorithm is implemented using Java NetBeans.
The Oracle ODBC driver is first created, and then 'Oracle ODBC
driver' is selected in the 'Create new Data Source' window;
the 'Data Source name', 'Description', and 'User ID'
properties are given in the 'ODBC Setup' window. The user
interface (GUI) is designed to be simple and user friendly,
and consists of five buttons. Button 1 is 'Read Input from
Arff Format': on clicking it, a file chooser opens to select
an ARFF file, and the selected input ARFF file is converted
into Oracle tables. Button 2 is 'Features based on Class
Association': on clicking it, a new window opens to obtain
support and confidence values, and the dataset undergoes
attribute reduction using the class association rule mining
method, which selects the relevant features given the minimum
support and confidence values. Button 3 is 'Features based on
Info Gain': on clicking it, the features selected by Button 2
undergo further attribute reduction using information gain,
which removes redundant attributes by calculating gain values.
On clicking Button 4, 'Display Features', the selected
features are displayed; Button 5 drops the tables. Links are
provided appropriately for each button, and each option opens
in a separate window, so that users can understand the tool
and feel free to use it.
VI. PERFORMANCE EVALUATION
In order to evaluate the efficiency and performance of our
proposed methodology, we conducted tests on various datasets.
Table 1 lists the datasets chosen for evaluation along with
the number of features and instances in each domain. These
domains were then subjected to various existing feature
selection methods; the results obtained from the existing
methods and from our proposed methodology are tabulated in
Table 2. The domains with reduced numbers of features were
then evaluated with classifiers such as J48 and Naïve Bayes;
their performance measures are shown in Table 3 and Table 4
respectively. It is seen from the tables that our proposed
method selects fewer features than other methods, so the time
taken for classification is reduced, consequently increasing
the efficiency of the classification algorithms. Table 5 gives
the performance on the entire set of features. Figure 1
depicts the improvement in accuracy when NB is applied to the
set of features selected by a feature selection algorithm. The
improvement in accuracy when information gain, association
rule mining, and NB are applied sequentially to a set of
selected features is shown in Figure 2, and a comparative
analysis of the accuracies of applying C4.5 and NB after
information gain and ARM over a set of selected features is
shown in Figure 3. From these observations, we show the
following results.
1. Reduced (or equal) number of selected features when
compared to other methods in datasets such as vote,
spectheart, shuttle, monk, Hayes Roth, car evaluation, contact
lenses, postoperative, weather, and parity (refer to Table 2).
2. Improved (or equal) performance when compared to the
performance of other feature selection methods:
A. with the J48 classifier in datasets such as vote,
spectheart, shuttle, monk, contact lenses, and postoperative
(refer to Table 3);
B. with the Naïve Bayes classifier in datasets such as vote,
spectheart, shuttle, parity, weather, nursery, monk, contact
lenses, and postoperative (refer to Table 4).
3. Improved (or equal) performance when compared to the
performance with the total set of features (refer to Tables 5 & 7):
A. with the Naïve Bayes classifier in datasets such as vote,
shuttle, parity, weather, monk, contact lenses, tic-tac, and
car evaluation;
B. with the J48 classifier in datasets such as shuttle,
contact lenses, and postoperative;
C. with the ID3 classifier in datasets such as spectheart,
shuttle, parity, monk, and contact lenses.
VII. CONCLUSION AND FUTURE WORK
The objective of this paper is to describe a novel approach
to attribute reduction. In data mining, most classification
algorithms are less accurate due to the presence of irrelevant
and redundant attributes in the dataset. A new attribute
reduction algorithm using class-based association rule mining,
incorporated along with information gain, has been presented
and evaluated through extensive experiments: we used the
Apriori algorithm to identify relevant attributes and remove
irrelevant attributes from the dataset, then removed redundant
features using information gain. Our approach demonstrates its
efficiency and effectiveness in dealing with high-dimensional
data for classification. Our future work involves studying how
to combine this approach with feature discretization
algorithms to smoothly handle data of different feature types.
REFERENCES
[1] E. P. M. De Sousa, C. Traina, A. J. M. Traina, L. J. Wu, and C. Faloutsos, "A fast and effective method to find correlations among attributes in databases", Data Mining and Knowledge Discovery 14 (2007) 367-407.
[2] L. C. Molina, L. Belanche, and A. Nebot, "Attribute Selection Algorithms: A survey and experimental evaluation", Proceedings of 2nd IEEE KDD 2002 (2002) 306-313.
[3] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research 3 (2003) 1157-1182.
[4] R. Kohavi and G. H. John, "Wrappers for feature subset selection", Artificial Intelligence 97 (1997) 273-324.
[5] S. D. Bhavani, T. S. Rani, and R. S. Bapi, "Feature selection using correlation fractal dimension: Issues and applications in binary classification problems", Applied Soft Computing 8 (2008) 555-563.
[6] M. Haindl, P. Somol, D. Ververidis, and C. Kotropoulos, "Feature selection based on mutual correlation", Progress in Pattern Recognition, Image Analysis and Applications, Proceedings 4225 (2006) 569-577.
[7] H. Liu, H. Motoda, and L. Yu, "A selective sampling approach to active feature selection", Artificial Intelligence 159 (2004) 49-74.
[8] H. Liu, L. Yu, M. Dash, and H. Motoda, "Active feature selection using classes", Advances in Knowledge Discovery and Data Mining 2637 (2003) 474-485.
[9] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-based Filter Solution", Proceedings of ICML 2003 (2003) 856-863.
[10] M. Dash and H. Liu, "Feature selection for classification", Intelligent Data Analysis: An International Journal 1 (1997) 131-156.
[11] D. Koller and M. Sahami, "Toward Optimal Feature Selection", Proceedings of ICML'96 (1996) 170-178.
[12] J. Bakus and M. S. Kamel, "Higher order feature selection for text classification", Knowledge and Information Systems 9 (2006) 468-491.
[13] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering", IEEE Transactions on Knowledge and Data Engineering 17 (2005) 491-502.
[14] H. C. Peng, F. H. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1226-1238.
[15] H. A. Liu and R. Setiono, "Incremental feature selection", Applied Intelligence 9 (1998) 217-230.
[16] W. Z. Liu and A. P. White, "The Importance of Attribute Selection Measures in Decision Tree Induction", Machine Learning 15 (1994) 25-41.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[18] A. Harol, C. Lai, E. Pezkalska, and R. P. W. Duin, "Pairwise feature evaluation for constructing reduced representations", Pattern Analysis and Applications 10 (2007) 55-68.
[19] J. D. Van Hulse, T. M. Khoshgoftaar, and H. Y. Huang, "The pairwise attribute noise detection algorithm", Knowledge and Information Systems 11 (2007) 171-190.
[20] Tin Dung Do, Siu Cheung Hui, and Alvis C. M. Fong, "Associative Feature Selection for Text Mining", International Journal of Information Technology, Vol. 12, No. 4 (2006).
APPENDIX

Table I. Dataset description

Dataset          | Instances | Features | Classes
Vote             |   435     |   17     |   2
Spectheart       |   267     |   23     |   2
Shuttle landing  |    15     |    7     |   2
Parity           |   100     |   11     |   2
Monk             |   124     |    6     |   2
Nursery          | 12960     |    9     |   3
Postoperative    |    57     |    9     |   3
Hayes Roth       |   132     |    6     |   3
Tic-tac          |   958     |   10     |   2
Car evaluation   |  1728     |    7     |   4
Contact lenses   |    24     |    5     |   3
Weather          |    14     |    5     |   2
Soybean (small)  |    35     |   47     |   4
Table II. Number of selected features for each feature selection algorithm

Dataset          | Cfs | Chi-square | Gain | Info gain | Oneattr | Symmetric | Relief | Association | Class Assoc. & Info gain
Vote             |  4  |  6  |  7  |  7  |  6  |  7  |  8  |  3  |  2
Spectheart       | 12  |  8  | 10  |  9  | 16  |  9  |  8  | 11  |  6
Shuttle          |  2  |  6  |  3  |  5  |  4  |  4  |  4  |  6  |  1
Parity           |  3  |  6  |  6  |  6  |  6  |  6  |  6  |  5  |  3
Monk             |  2  |  2  |  2  |  2  |  2  |  2  |  2  |  4  |  2
Nursery          |  1  |  1  |  1  |  1  |  5  |  5  |  3  |  5  |  4
Postoperative    |  5  |  5  |  5  |  5  |  7  |  7  |  5  |  6  |  4
Hayes Roth       |  1  |  3  |  3  |  3  |  3  |  3  |  3  |  3  |  1
Tic-tac          |  5  |  1  |  1  |  1  |  1  |  1  |  5  |  5  |  4
Car evaluation   |  1  |  6  |  6  |  6  |  6  |  6  |  5  |  3  |  2
Contact lenses   |  1  |  2  |  2  |  2  |  3  |  1  |  3  |  3  |  2
Weather          |  2  |  2  |  2  |  2  |  3  |  2  |  2  |  3  |  2
Soybean (small)  | 22  | 35  | 35  | 35  | 35  | 35  | 35  | 40  | 28
Table III. Accuracy of C4.5 on selected features for each feature selection algorithm

Dataset          | Cfs     | Chi-square | Gain    | Info gain | Oneattr | Symmetric | Relief  | Association | Class Assoc. & Info gain
Vote             | 96.092  | 95.6322 | 91.7241 | 91.7241 | 95.6322 | 91.7241 | 96.092  | 95.6322 | 95.6322
Spectheart       | 81.6479 | 79.4007 | 75.6554 | 75.6554 | 79.0262 | 75.655  | 79.4007 | 79.4007 | 79.4007
Shuttle          | 53.3333 | 53.3333 | 60      | 53.3333 | 60      | 53.333  | 53.3333 | 53.333  | 53.3333
Parity           | 44      | 40      | 40      | 40      | 50      | 40      | 48      | 50      | 46
Monk             | 91.9355 | 91.9355 | 91.9355 | 91.9355 | 91.9355 | 91.935  | 91.9355 | 90.322  | 91.9355
Nursery          | 70.9722 | 70.9722 | 70.9722 | 70.9722 | 90.7407 | 70.972  | 89.213  | 78.603  | 78.248
Postoperative    | 71.1864 | 71.1864 | 71.1864 | 71.1864 | 71.1864 | 71.186  | 71.1864 | 71.929  | 71.9298
Hayes Roth       | 37.8788 | 50      | 50      | 50      | 70.4545 | 70.454  | 70.4545 | 70.454  | 54.5455
Tic-tac          | 79.4363 | 69.9374 | 69.9374 | 69.9374 | 69.9374 | 69.937  | 79.4363 | 79.436  | 76.3048
Car evaluation   | 70.0231 | 92.3611 | 92.3611 | 92.3611 | 92.3611 | 92.3611 | 93.2292 | 80.324  | 76.5625
Contact lenses   | 70.8333 | 87.5    | 87.5    | 87.5    | 58.3333 | 70.833  | 83.3333 | 83.333  | 87.5
Weather          | 42.8571 | 42.8571 | 42.8571 | 42.8571 | 50      | 42.857  | 42.8571 | 64.285  | 42.8571
Soybean (small)  | 96.31   | 98.15   | 98.15   | 98.15   | 98.15   | 98.15   | 98.15   | 91.33   | 98.23
Table IV. Accuracy of NB on selected features for each feature selection algorithm

Dataset          | Cfs     | Chi-square | Gain    | Info gain | Oneattr | Symmetric | Relief  | Association | Class Assoc. & Info gain
Vote             | 96.092  | 91.0345 | 91.7241 | 91.7241 | 91.0345 | 91.7241 | 93.5633 | 92.8736 | 95.6322
Spectheart       | 82.0225 | 76.779  | 80.1498 | 79.0262 | 79.0262 | 79.0262 | 76.779  | 77.1536 | 77.9026
Shuttle          | 80      | 80      | 80      | 73.3333 | 73.3333 | 73.3333 | 73.3333 | 80      | 80
Parity           | 50      | 46      | 46      | 46      | 47      | 46      | 43      | 47      | 51
Monk             | 100     | 100     | 100     | 100     | 100     | 100     | 100     | 99.1935 | 100
Nursery          | 70.9722 | 70.9722 | 70.9722 | 70.9722 | 88.8426 | 70.9722 | 87.6543 | 77.2531 | 77.3848
Postoperative    | 72.8814 | 72.8814 | 72.8814 | 72.8814 | 67.966  | 72.8814 | 76.2712 | 71.9298 | 75.4386
Hayes Roth       | 37.8788 | 59.8485 | 59.8485 | 59.8485 | 81.8182 | 59.8485 | 81.8182 | 81.8182 | 54.5455
Tic-tac          | 72.4426 | 69.9374 | 69.9374 | 69.9374 | 69.9374 | 69.9374 | 72.4426 | 72.4426 | 70.5637
Car evaluation   | 70.0231 | 85.5324 | 85.5324 | 85.5324 | 85.5324 | 85.5324 | 85.3588 | 79.5718 | 76.8519
Contact lenses   | 70.8333 | 87.5    | 87.5    | 87.5    | 54.1667 | 70.8333 | 83.3333 | 83.3333 | 87.5
Weather          | 78.5714 | 78.5714 | 78.5714 | 78.5714 | 71.4286 | 78.5714 | 78.5714 | 50      | 78.5714
Soybean (small)  | 94.47   | 92.94   | 92.94   | 92.94   | 92.94   | 92.94   | 92.94   | 87.34   | 92.94

Figure 1: Accuracy of NB on selected features for each feature selection algorithm
Table V. Accuracy of ID3, C4.5 and NB on the full set of features
(two cells are missing in the original and are marked with a dash)

Dataset          | Total features | ID3 (%)  | C4.5 (%) | Naive Bayes (%)
Vote             |      16        |    —     | 96.3218  | 90.1149
Spectheart       |      22        | 70.0375  | 80.8989  | 79.0262
Shuttle landing  |       6        | 60       | 53.3333  | 80
Parity           |      10        | 45       | 48       | 40
Monk             |       5        | 95.9677  | 90.3226  | 99.1935
Nursery          |       8        | 98.1867  | 97.0525  | 90.3241
Postoperative    |       8        | 69.4915  | 71.1864  | 80.303
Hayes Roth       |       5        |  0       | 72.7273  |    —
Tic-tac          |       9        | 83.4029  | 85.0731  | 69.6242
Car evaluation   |       6        | 89.3519  | 92.3611  | 85.5324
Contact lenses   |       4        | 70.8333  | 83.3333  | 70.8333
Weather          |       4        | 85.7143  | 50       | 57.1429
Soybean (small)  |      35        | 91.50    | 92.09    | 92.97
Figure 2: Improvement in performance of Naive Bayes Classifier
Figure 3: Improvement in performance of C4.5 and Naive Bayes Classifier
Detecting the Originality of Extended Research
Articles Using Similarity Techniques –
A Proposal
Shanmugasundaram Hariharan
B. S. Abdur Rahman University, Chennai, Tamilnadu, India
Abstract— The increased growth of research around the world,
driven by the personal interest of academics, industrialists,
and other professionals, has paved the way to huge knowledge
repositories. The outcome of the knowledge base provided by
such a group is both advantageous and disadvantageous: if the
knowledge is used properly, it leads to the innovation of new
ideas or further enhancements, while when such knowledge is
used repetitively in a reformulated fashion it results in
plagiarism of research articles, pulling down the quality of
research. This article proposes a method to detect the
originality of research articles published as extensions of
previous works, using similarity techniques. We have also
proposed a method to summarize scientific papers, so as to
reduce the time involved in reading a document, thereby
presenting the originality of the research articles. From the
preliminary study carried out using the corpus collected, the
approach seems promising. We plan to implement the proposed
system and provide a comparison of the methods discussed in
this paper, analyzing the necessary parameters.
Index Terms— Similarity measures, Cosine metric, scientific
articles, summarization, extended research papers
I. INTRODUCTION
Text summarization is a technique for automatically creating
a summary from one or more texts. Initial interest in
automatic summarization started in the 1960s in American
research libraries, where a large number of scientific papers
and books were to be digitally stored and made searchable.
Research on automatic summarizing, taken as including
extracting, abstracting, etc., has a long history, with an
early burst of effort in the sixties following Luhn's
pioneering work [4], followed by marked growth of activity
since the mid-eighties and especially very recently [5].
Research on summarization thus has roots going back to the
1950s-1960s and has become a steady subject of interest in the
research community [4, 7].
Text summarization has several branches or dimensions [13].
The history of summarization can be traced through several
approaches: surface-level approaches in the 1950s,
entity-level approaches in the 1960s, extensive entity-level
approaches in the 1970s, and the 1990s with the renaissance of
all three approaches [14]. In addition, we are seeing the
emergence of new areas such as multi-document summarization,
multilingual summarization and multimedia summarization [13].

Shanmugasundaram Hariharan is with the Department of Information Technology, B.S. Abdur Rahman University, Chennai, Tamilnadu, India (Phone: 044-22751347, Mobile: +91-9884204036, E-mail: mailtos.hariharan@gmail.com). He is working as Assistant Professor and is currently pursuing his doctoral programme in the area of Information Retrieval.
Due to strenuous efforts and improvements over the years
[15], summarization has been a major challenge and has drawn
attention from the research community. With the rapid
advancement of Internet technologies, documents have been
digitized and made available online. Researchers regularly
browse through the literature to update their existing
knowledge or for use in their research work; the challenge
lies in summarizing the research articles. This paper mainly
focuses on methods to detect the similarity among research
articles. The proposed idea is well suited to identifying the
originality of papers submitted to conferences and journals,
and more specifically extended research articles that are
submitted for publication in more than one venue. The proposed
system leads to the following:
- preventing the malpractice of repeating the same article in
repetitive forms;
- helping researchers to think in a broader sense;
- improving the quality of the paper;
- helping academicians to enhance the work in a better way.
The paper is organized as follows. Section II discusses the
related research carried out. In Section III, the proposed
system is explained, with a sample corpus and the modules
involved. Finally, Section IV gives the conclusions.
II. RELATED WORK
Simone Teufel et al. [1] proposed a system for scientific
articles that concentrates on statements with rhetorical
status for summarization and highlighted summaries. Several
experiments were conducted on a substantial corpus of
conference articles with human judgments of rhetorical status
and sentence relevance. The content of unseen articles is
extracted and classified into a fixed set of seven rhetorical
categories, which is then viewed as a single-document summary.
Dain Kaplan et al. [2] designed an automatic method based on
coreference chains for extracting citations from research
papers. The authors considered spans of text, called
"c-sites", which describe the work being cited. The system is
designed to extract all c-sites that refer to the target
paper, which are later aggregated to form a summary. It
differs from traditional summarization techniques in that the
generated summary contains multiple points of view. The
authors conducted several surveys relating to parsing and
extraction on several pre-existing components.
Vahed Qazvinian et al. [3] developed a model that summarizes
a single article, which can be summarized further based on the
topic. The authors found that this model reduces the
difficulty researchers face in exploring the vast amount of
scientific literature in each field of study. The authors used
effective clustering approaches to study the citation summary
network, based on the views given to the articles by others.
Balage Filho et al. [5] presented a paper with experiments
on scientific text summarization. The authors adopted simple
extractive summarizers that generate a small version of the
main part of a given complete source text. They also found
from their investigation that by considering the specificity
of the text genre the results can be improved. The authors
validated the summarization process by considering only text
structures.
Maher Jaoua et al. [6] proposed a system that creates
indicative summaries for scientific papers, differing from
conventional methods. The authors extract summaries in two
steps: the first step produces a population of extracts, and
the second step classifies them and selects the best one based
on global criteria defined for the whole extract rather than
for individual sentences. The authors developed a
summarization system for the French language called "ExtraGen"
as a prototype that performs the generation and classification
mechanism by implementing a genetic algorithm.
Stephen Wan et al. [8] developed a new research tool called
the Citation-Sensitive In-Browser Summarizer (CSIBS), to speed
up the task of browsing research articles based on user
requirements. Building such a comprehensive summary helps
readers explore the cited article and determine whether time
should be spent on it, thereby alleviating overload. The
authors retrieve sentences from the cited document by
exploiting the citation context, bringing together the
meta-data and a citation-sensitive summary. The authors found
that the relevancy judgment task is facilitated by CSIBS,
thereby increasing users' self-reported confidence in making
such judgments.
III. EXPERIMENTAL SETUP
This section outlines the proposed system to detect the
originality of extended research articles. Fig. 1 gives the
proposed system architecture, with each module discussed in
the subsections below. The input source article is mostly in
PDF or HTML format; we convert documents in either form to
text for further processing. Table I presents the details of
the samples investigated for the proposed system.
The corpora were collected from papers submitted by
different authors to international conferences and journals,
which we surveyed personally for project and research
activities (the proposed system is currently under
implementation). A sample set S1 is shown in Figures 2A, 2B,
and 2C for illustration. The proposed system would provide a
solution to point out which of the articles is better or worth
reading, which document has been plagiarized, and so on.
TABLE I: DATA CORPUS AND STATISTICS

Set ID | Title | Year of publication | Authors | Publisher
S1 | Centroid Based summarization | 2000 | Dragomir Radev | ACL
S1 | Centroid Based summarization of multiple documents | 2004 | Dragomir Radev | Elsevier
S1 | Centroid Based summarization of multiple documents using timestamps | 2007 | Nedunchelian | IEEE
S2 | A Bottom up approach for sentence ordering in multi document summarization | 2007 | Danuksha Bollegola | IEEE
S2 | A Bottom up approach for sentence ordering in multi document summarization | 2009 | Danuksha Bollegola | Elsevier

A. Pre-processing of Documents:
Preprocessing is a most significant task in data mining,
information retrieval, and several other tasks. In this phase
the documents were pre-processed with the following steps:
a. converting all uppercase letters to lowercase;
b. removing unwanted words (stop words);
c. stemming the terms occurring in the document.
To measure content similarity (discussed in Section B), we
remove the stop words from the text document and then stem the
samples; we adopt the same process whenever similarity is
measured.
Stop words are ordinary words which occur in the document
but do not carry significant meaning (e.g.
connector words, conjunctions, single-letter words); we
eliminate such unwanted words using a stop-word list [10]. We
also eliminate special symbols that play no significant part
in text processing (e.g. ',', '/', '-'; in general, symbols
other than characters and numbers).
Truncation, also called stemming, is a technique that allows
us to search for various word endings and spellings
simultaneously. Stemming algorithms [11] are used in many
types of language processing and text analysis systems, and
are also widely used in summarization, information retrieval
and database search systems.
A stemmer is a program that determines the stem form of a
given word. Terms with a common stem will usually have similar
meanings. An example is shown below, where several variants
reduce to a common stem:
ACCUSES, ACCUSER, ACCUSED, ACCUSERS → ACCUS
The suffix stripping process reduces the total number of
terms in the IR system, and hence reduces the size and
complexity of the data in the system, which is advantageous.
We have adopted suffix stripping, which helps in term
frequency calculations during sentence scoring. Each stem term
and its equivalents are stored in the database and retrieved
whenever a comparison is required [11].
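A minimal sketch of the three pre-processing steps is shown below, assuming Python with NLTK's implementation of the Porter stemmer [11]; the inline stop-word list is a tiny hypothetical stand-in for the list of [10].

    from nltk.stem import PorterStemmer   # assumes NLTK is installed

    stop_words = {"a", "an", "the", "of", "at", "and", "is"}  # stand-in list
    stemmer = PorterStemmer()

    def preprocess(text):
        tokens = text.lower().split()                         # step a: lowercase
        tokens = [t.strip(".,/-()") for t in tokens]          # drop special symbols
        tokens = [t for t in tokens if t and t not in stop_words]  # step b: stop words
        return [stemmer.stem(t) for t in tokens]              # step c: stemming

    print(preprocess("Heavy earth quake affects Indonesia."))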
Fig 1: Proposed system architecture. (Source articles pass
through a pre-processor — HTML/PDF-to-text conversion, special
character elimination, stop-word removal, and stemming — and
then through the similarity analyzer, summarizer (scoring,
weight calculation, ranking), and statisticer, after which
articles are detected based on originality.)
B. Similarity analysis:
The system designed to detect similarity among text
documents calculates content similarity among the specified
documents. The similarity between articles is estimated using
the cosine metric: though there exist several choices such as
Dice, Jaccard, and Hellinger, we adopt the cosine measure,
which is quite popular and yields good results [9, 12], given
by expression (1).

cosine(ti, tj) = ∑(h=1..k) tih·tjh / √( ∑(h=1..k) tih² · ∑(h=1..k) tjh² )    (1)
A document is treated as a vector, with tih representing the
first vector (each term being normalized by the total number
of words in the document) and tjh the second vector; k is the
number of terms in the document. The statisticer acts as an
agent for finding the number of sentences, the word count, and
the frequency of words. Based on the term frequencies of both
documents, the cosine metric yields a value in the interval
0-1: two identical documents score 1, and documents sharing no
terms score 0.
Consider an example to illustrate the calculation of
similarity. If S1 and S2 denote two different sentences, the
cosine measure is calculated as shown below.
S1: Heavy earth quake affects Indonesia.
S2: Indonesia rocked at 6.5 Richter scale.
All the terms in each sentence have a term frequency of 1,
and only "Indonesia" is shared. Hence the cosine value =
(1×1)/√(5×6) = 1/(2.24×2.45) ≈ 0.18.
A suitable threshold can be fixed to identify the similarity
among extended and original articles.
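For illustration, the computation above can be reproduced with a few lines of Python (assuming the sentences have already been pre-processed into tokens):

    import math
    from collections import Counter

    s1 = Counter("heavy earth quake affects indonesia".split())
    s2 = Counter("indonesia rocked at 6.5 richter scale".split())

    dot = sum(s1[t] * s2[t] for t in set(s1) | set(s2))   # shared term: "indonesia"
    norm = math.sqrt(sum(v * v for v in s1.values()) *
                     sum(v * v for v in s2.values()))     # sqrt(5 * 6)
    print(round(dot / norm, 2))                           # -> 0.18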
This section also discusses the issue of measuring the
similarity among documents. The number of papers available
online, and the lack of commercially available tools to detect
plagiarism effectively, has made us focus on this issue. Our
focus lies on measuring the similarity of documents under the
following categories:
• Similarity measured as a whole (Category I): measures the
similarity by taking the entire content.
• Similarity measured under each subject/criterion (Category
II): each technical paper is structured in different ways;
however, papers have sections like "Abstract",
"Conclusion/Future Work", "Introduction", "Results", and
"Experimental Section".
• Similarity measured from the summaries of each article
(Category III): generates a summary for each criterion and
measures the ratio of similarity under each category.
Through these categories, we set penalties for sentence
scoring. Consider a scenario to illustrate the situations of
each of the categories discussed above. Category I, which
measures the relevancy among the entire contents, may not
reflect the originality of the articles; the reason is that
authors who write research articles abstract the contents from
the original source. Hence we may go for Category II and
measure the importance at each level, which leads to better
conclusions: this allows researchers or academicians who are
interested in reading the literature to skip a paper
immediately, or to read one that is worthwhile, without
wasting time. Category III will be more useful in the coming
years, as the number of papers in the literature increases
year by year; hence determining the similarity among research
articles based on a summary of each individual article may be
deemed useful, provided the quality of the generated summary
is good.
C. Summarizer:
The algorithm designed for summarizing scientific articles
consists of the following steps:
a. scoring each sentence in the document;
b. weight calculation for each scored sentence;
c. ranking the results based on the user requirements.
Sentences within each document are ranked depending on the
sentence weights, using a term frequency approach [8]. To
generate the weight of each sentence we adopt an extraction
method, measuring the frequency of terms in each document,
with special weights for special categories of terms (bold,
italic, words matching the title or a subtitle, etc.).
Finally, the summary is generated depending on the score each
sentence obtains.
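A minimal sketch of this scoring scheme is shown below; the title-term bonus is a hypothetical stand-in for the special weights (bold, italic, title matches) mentioned above.

    from collections import Counter

    def rank_sentences(sentences, title, title_bonus=2.0):
        # Term frequencies over the whole document.
        tf = Counter(w for s in sentences for w in s.lower().split())
        title_terms = set(title.lower().split())

        def score(sentence):
            # Sum of term frequencies, boosted for terms appearing in the title.
            return sum(tf[w] * (title_bonus if w in title_terms else 1.0)
                       for w in sentence.lower().split())

        return sorted(sentences, key=score, reverse=True)

    sentences = ["Text summarization creates a short summary.",
                 "The weather is nice today.",
                 "Summarization of text is widely studied."]
    print(rank_sentences(sentences, "Text Summarization")[0])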
IV. CONCLUSIONS
This paper has focused on the issue of research articles
published in different publications. We have identified a
solution to detect the similarity of research articles, so
that the proposed system might provide assistance in finding
out the originality of the content. The system also serves as
a platform to detect plagiarism, especially for papers
submitted to conferences, and can be adapted to measure the
relatedness among articles of any nature, depending on the
user's choice, by modifying the threshold. The paper has not
focused on the implementation part, as the system is currently
under implementation. We also aim to identify the main theme
of each article based on its summary.
Fig. 2A: Paper published in the year 2000, in Proceedings of the ANLP/NAACL Workshop.
Fig. 2B: Paper published in the year 2004 by Elsevier.
Fig. 2C: Paper published in the year 2008 by IEEE.
REFERENCES
[1] S. Teufel and M. Moens, "Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status," Computational Linguistics, vol. 28, no. 4, pp. 409–445, 2002.
[2] D. Kaplan, R. Iida, and T. Tokunaga, "Automatic Extraction of Citation Contexts for Research Paper Summarization: A Coreference-chain based Approach," in Proc. 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, ACL-IJCNLP 2009, Suntec, Singapore, 2009, pp. 88–95.
[3] V. Qazvinian and D. R. Radev, "Scientific Paper Summarization Using Citation Summary Networks," in Proc. 22nd Intl. Conf. on Computational Linguistics (Coling 2008), Manchester, 2008, pp. 689–696.
[4] H. P. Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, vol. 2, no. 2, pp. 159–165, 1958.
[5] P. P. Balage Filho, T. A. Salgueiro Pardo, and M. das Gracas Volpe Nunes, "Summarizing Scientific Texts: Experiments with Extractive Summarizers," in Proc. Seventh Intl. Conf. on Intelligent Systems Design and Applications (ISDA 2007), IEEE, 2007, pp. 520–524.
[6] M. Jaoua and A. Ben Hamadou, "Automatic Text Summarization of Scientific Articles Based on Classification of Extract's Population," in Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol. 2588, Springer Berlin / Heidelberg, pp. 363–377.
[7] H. P. Edmundson, "New Methods in Automatic Extracting," Journal of the ACM, vol. 16, no. 2, pp. 264–285, 1969.
[8] S. Hariharan and R. Srinivasan, "Investigations in Single Document Summarization by Extraction Method," in Proc. IEEE Intl. Conf. on Computing, Communication and Networking (ICCCN'08), 2008, pp. 1–5.
[9] S. Hariharan and R. Srinivasan, "A Comparison of Similarity Measures for Text Documents," Journal of Information & Knowledge Management, vol. 7, no. 1, pp. 1–8, 2008.
[10] Stop word list, http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words (last accessed 2 December 2009).
[11] M. F. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.
[12] M. W. Berry and M. Browne (eds.), Lecture Notes in Data Mining, World Scientific Publishing, 2006.
[13] I. Mani and M. T. Maybury, Advances in Automatic Text Summarization, MIT Press, Cambridge, MA, 1999.
[14] K. Sparck-Jones, "Automatic Summarising: The State of the Art," Information Processing and Management, vol. 43, pp. 1449–1481, 2007.
[15] K. Jezek and J. Steinberger, "Automatic Text Summarization (The State of the Art 2007 and New Challenges)," in V. Snasel (ed.), Znalosti 2008, p. 112, 2008.
Predicting Undergraduate Study Duration with an Artificial Neural Network
Muhamad Hanief Meinanda, Metri Annisa, Narendi Muhandri, and Kadarsyah Suryadi
Fakultas Teknologi Industri
Institut Teknologi Bandung (ITB), Indonesia
Abstract—University management needs predictions of study duration in order to set preventive policies for the early prevention of drop-out (DO) cases. This research aims to determine the academic factors that influence study duration and to build the best prediction model using data mining techniques. The model selection criterion is minimization of the Sum of Squared Errors (SSE). The best model for predicting study duration is the one built with an Artificial Neural Network (ANN) with a Multilayer Perceptron (MLP) architecture. This research finds that study duration is influenced by the cumulative grade point average (Indeks Prestasi Kumulatif, IPK), the number of courses taken, the number of repeated courses, and the number of times particular courses are taken.
Keywords—Artificial Neural Network, Multilayer Perceptron, study duration prediction.
I. INTRODUCTION
Increasingly tight competition for jobs demands that universities produce high-quality, competitive graduates. To that end, every university continually evaluates student performance, and the evaluation results are stored in an academic database. These data can be used for decision support by university management. One indicator of the efficiency of the educational process is information about students' study duration.
Muhamad Hanief Meinanda is an undergraduate student in Industrial Engineering, Fakultas Teknologi Industri, Institut Teknologi Bandung, where he currently serves as Coordinator of Assistants at the Industrial System Planning and Optimization Laboratory (LPOSI) ITB. He can be reached by e-mail at meinanda@hanief.com; views and information about the author are available at http://www.hanief.com.
Metri Annisa Arrum is an undergraduate student in Industrial Engineering, Fakultas Teknologi Industri, Institut Teknologi Bandung, and is active as an assistant at the Industrial System Planning and Optimization Laboratory (LPOSI) ITB. She can be reached by e-mail at m.tirtosiswoyo@yahoo.com.
Narendi Muhandri is an undergraduate student in Industrial Engineering, Fakultas Teknologi Industri, Institut Teknologi Bandung, and has served as an assistant for the Programming Logic and Computing course and laboratory in Industrial Engineering ITB. He can be reached by e-mail at muhandrii@yahoo.com.
Kadarsah Suryadi is a lecturer in the Industrial Engineering study program at ITB, currently Head of the Master and Doctoral Program in Industrial Engineering & Management. He is a lecturer in the Information and Decision Systems Laboratory within the Industrial Management expertise group. He can be reached by e-mail at kadarsah@bdg.centrin.net.id.
Since its introduction around 1940, the Artificial Neural Network (ANN) has been applied in many fields of science, and is widely used for prediction and forecasting [1]. Williams and Li (2008) studied the use of an ANN with back-propagation training to predict horse races in Jamaica. The feed-forward, back-propagation type of ANN used in the present research has been shown to give good results for prediction purposes [2].
Al Cripps (1996) studied the use of ANNs to predict academic performance in the form of graduation rates, study duration, and GPA. That research did not use academic data obtained while the students were enrolled; the predictor variables were age, gender, American College Testing (ACT) score, race, and reading ability [3]. Bijayananda Naik and Srinivasan Ragothaman (1998) studied the use of neural networks to predict the success of MBA students, with undergraduate GPA among the predictors [4].
Based on the research opportunities left open by these earlier studies, this research investigates which predictor variables from academic records influence study duration and builds an ANN model for predicting it. The prediction model is intended to support policies toward students who are predicted to exceed the study-duration limit. A multiple regression model is also tested as a benchmark for predicting study duration.
II. LITERATURE REVIEW
A. Neural Network Structure
The Artificial Neural Network (ANN) is inspired by the complex learning system of the brain, which consists of sets of closely interconnected neurons. Networks of neurons can perform highly complex tasks such as classification and pattern understanding. An ANN can approximate a fairly wide range of statistical models and is flexible in representing both linear and nonlinear models [5]. ANNs can be applied to the same problems as multivariate statistics, such as multiple regression, discriminant analysis, and cluster analysis; in many cases the results obtained with an ANN are comparable to those of multivariate statistical models [6].
There are three main types of ANN: the Multilayer Perceptron, the Radial Basis Function network, and the Kohonen Network. The Multilayer Perceptron is the model most widely used for prediction. The Radial Basis Function network can do what the Multilayer Perceptron does, while the Kohonen Network is well suited to clustering problems [7]. This research uses the Multilayer Perceptron because it is the model commonly used for prediction problems.
The Multilayer Perceptron maps a set of input data to a set of outputs using nonlinear activation functions. In a Multilayer Perceptron both the independent and the dependent variables may be measured on metric or nonmetric scales. The Multilayer Perceptron is a feedforward neural network in which information moves in only one direction, from the input nodes through the hidden nodes to the output nodes [8].
B. Learning Algorithm
A neural network obtains its weight values from a particular learning algorithm. The weights are used to transform values from the input nodes to the output nodes. The learning algorithm is the stage that adjusts the randomly initialized weights. The weight update is generally formulated as:

w_ij(n + 1) = w_ij(n) + Δw_ij(n)    (1)

where Δw_ij(n) is computed by the learning algorithm and w_ij(n) is the initial weight set randomly at the initialization stage [9].
C. Back-Propagation Algorithm
Inputs from the input nodes are passed to the hidden layer and then on to the output nodes. Every connection from unit i to unit j has a weight w_ij indicating the strength of the connection. The weighted sum a_j for inputs x_i and weights w_ij is defined as:

a_j = Σ_{i=1}^{n} w_ij x_i    (2)

where n is the number of inputs to the neuron. The activation function used is the logistic sigmoid:

g(a) = 1 / (1 + e^(−a))    (3)

The error E_j(n) between the actual output y_j(n) and the target output of the neuron d_j(n) is computed as:

E_j(n) = d_j(n) − y_j(n)    (4)

The Back-Propagation learning rule is:

Δw_ij = −η ∂E_j/∂w_ij + α Δw_ij^(prev)    (5)

where η is the learning rate and α is the momentum factor; these parameters determine how strongly the previous weight change influences the direction of the new one.
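A minimal sketch of one training step for a single sigmoid unit, following Eqs. (1)-(5); the default learning rate and momentum mirror the values used later in the experiments (0.4 and 0.9), but the code is illustrative rather than the authors' implementation:

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_step(w, x, d, dw_prev, eta=0.4, alpha=0.9):
    # Forward pass: weighted sum (Eq. 2) and activation (Eq. 3).
    a = sum(wi * xi for wi, xi in zip(w, x))
    y = sigmoid(a)
    e = d - y                      # error signal (Eq. 4)
    delta = e * y * (1.0 - y)      # local gradient of a sigmoid unit
    # Weight change with momentum (Eq. 5), then the update (Eq. 1).
    dw = [eta * delta * xi + alpha * dwp for xi, dwp in zip(x, dw_prev)]
    return [wi + dwi for wi, dwi in zip(w, dw)], dw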
III. RESEARCH METHODOLOGY
The research methodology is adopted from the Cross-Industry Standard Process for Data Mining (CRISP-DM), developed in 1996 by analysts from Daimler Chrysler, SPSS, and NCR. CRISP-DM has six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment [10].
The first stage of the research is understanding the problem to be solved, namely estimating study duration from the similarity of patterns between historical data and current data. At this stage, the researchers studied the data and looked for patterns and relationships between the data variables and each research objective.
The researchers then preprocessed the data, among other things by building a cross-tabulation, correcting misclassified entries, and removing missing values and outliers. After that, the models were built. Since the study duration to be estimated is measured on a metric scale, as are its predictors, the researchers used artificial neural network and multiple regression models.
After identifying suitable models, the researchers built them using the training data. Model selection considered the fit of the model assumptions and the resulting error. The chosen model was then applied to the testing data set, and its use was analyzed against the results obtained.
IV. DATA PROCESSING
The input data are hypothetical data from the Data Mining contest of Pagelaran Mahasiswa Nasional Bidang Teknologi Informasi dan Komunikasi (Gemastik) 2009, derived from actual academic data at a university. The data consist of the academic records of 1289 students, with the variables shown in Table I.
TABLE I
DESCRIPTION OF THE VARIABLES IN THE INPUT DATA

Field — Data Description — Data Type
ID — Student identifier (primary key) — String
Masa Studi — Length of time the student was enrolled — Integer
Kode Mata Kuliah — Unique code for each course — String
Nama Mata Kuliah — Course name — String
Ambil Ke — Number of times the student has taken the course — Integer
Nilai — Grade for each course taken by the student — String
TABLE II
DESCRIPTION OF THE VARIABLES IN THE CROSS-TABULATION

Data Name — Data Description — Data Processing — Measurement Level
Bobot — Student grades codified as A=4, B=3, C=2, D=1, E=0 — Transform the grades into numbers — Ordinal
SKS — Number of semester credit units taken by the student — Sum the credit units of every course taken by the student — Metric
IPK — Cumulative grade point average — Divide the Bobot column by the SKS column — Metric
Jumlah Mata Kuliah — Number of courses taken by the student — Total count of courses taken by the student — Metric
Jumlah Mengulang — Number of course repeats — Total count of course repeats by the student — Metric
From these input data, a preprocessing stage consisting of data transformation and cleaning was performed. The data were transformed by building a cross-tabulation so that each record has a unique key and columns matching the hypothesized predictor variables (Table II). Entries suffering from misclassification were then corrected; misclassification occurred because curriculum changes caused the same course name to be written differently. Finally, missing values and records with unreasonable values (outliers) were removed.
The experiments were run on an Intel Atom N280 computer using the ANN Multilayer Perceptron, linear regression, and Spearman correlation procedures of SPSS 16 (Windows XP x32). Data preprocessing and the cross-tabulation were done on the same platform using Microsoft Excel 2007 SP0.
V. EXPERIMENT AND ANALYSIS
A. Selection of Predictor Variables
The study-duration variable in the training data takes 47 distinct values between 41 and 88 months, so it is a variable with a metric measurement scale. The predictor variables for study duration are also metric. To predict study duration, the researchers therefore use a dependence model with a metric dependent variable.
The next step is selecting predictors that may influence study duration. Since the data provided are each student's academic records, the predictors entered into the model are data related to the student's academic performance. The researchers propose an a priori hypothesis for the independent variables: that the cumulative grade point average, the number of courses taken, and the number of repeated courses are related to study duration.
The IPK (GPA) predictor is computed by dividing the grade points by the total credits taken. The researchers did not enter IPK into the model, because it is derived from the Bobot and SKS variables; IPK is therefore replaced in the model by Bobot and SKS.
B. Preprocessing
Data preprocessing is performed so that the modeling stage can produce an effective model. It includes transforming the data into a more informative form using a cross-tabulation, followed by removal of missing values and outliers.
C. Exploratory Analysis
The a priori predictors were tested with the Spearman correlation coefficient to confirm a relationship between each predictor and study duration. Spearman correlation was chosen because it requires no distributional or normality assumptions [11]. The correlation results are shown in Table III. In the table, the significance of each correlation is below the critical value (α = 0.05), so by this test all predictors are significantly correlated with study duration and are entered into the prediction model.
TABLE III
SPEARMAN CORRELATIONS WITH STUDY DURATION

Spearman's rho (Masa Studi) — Bobot — Jumlah MataKul — Jumlah Mengulang — SKS
Correlation Coefficient — −0.60 — 0.76 — 0.78 — 0.43
Sig. (2-tailed) — 0 — 0 — 0 — 0
N — 1221 — 1221 — 1221 — 1221
For validation purposes, the data were divided into two groups (split-sample): training data and testing data. The training data, a random 80% of all records, were used to build the model; the testing data were used for validation.
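The correlation screening and the 80/20 split could be reproduced along the following lines (Python with scipy/scikit-learn rather than SPSS; the file and column names are hypothetical):

import pandas as pd
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split

df = pd.read_csv("records.csv")   # hypothetical cross-tabulated data file
for col in ["Bobot", "SKS", "JumlahMataKul", "JumlahMengulang"]:
    rho, p = spearmanr(df[col], df["MasaStudi"])
    print(col, round(rho, 2), p)   # keep predictors with p < 0.05

# Random 80/20 split-sample for training and validation.
train, test = train_test_split(df, train_size=0.8, random_state=0)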
D. Building the Study Duration Prediction Model
Models that can predict a metric variable from metric predictors include multiple regression and neural networks. The first model built and tested was the regression model. The regression assumptions can only be tested once the model has been built; they are normality of the errors, constant error variance (homoscedasticity), and independence of the errors [7]. All of these assumptions must hold for the regression model to be unbiased. After fitting the regression model by least squares, the error variance turned out not to be constant (Figure 1): in the scatterplot of standardized residuals against predicted values, the spread of the residuals grows as the predicted value grows. The regression model is therefore inefficient to use, because the homoscedasticity assumption is violated.
Fig. 2. Plot of variable additions vs. Sum-Squared-Error (SSE).
In Figure 2, the SSE drops significantly once 22 variables have been added. Therefore, 22 variables were added to the model, containing the number of times a given student took each of the courses most frequently taken by students overall. The performance of the improved model is shown in Table VI.
Fig. 1. Homoscedasticity test plot of the errors.
The other model used to predict study duration is the Artificial Neural Network (ANN), of the Multilayer Perceptron (MLP) type. The Multilayer Perceptron can predict real-valued data using supervised training. The best architecture was selected automatically by SPSS 16. The training type chosen was batch training, which produced the smallest error compared with the other training methods. The initial learning rate was set to 0.4 and the momentum to 0.9; the number of training epochs was determined automatically. The model's performance is shown in Table IV.
TABLE IV
MLP MODEL SUMMARY FOR STUDY DURATION

Training — Sum of Squares Error: 184.346 — Relative Error: 0.357
Dependent variable: MASASTUDI
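The paper builds this MLP in SPSS 16. As a rough analogue only (not the authors' tool), scikit-learn can be configured with the same batch training, initial learning rate, and momentum; the hidden-layer size below is an assumption, since SPSS selected the architecture automatically, and `train` and `predictors` continue the hypothetical setup of the earlier sketch:

from sklearn.neural_network import MLPRegressor

predictors = ["Bobot", "SKS", "JumlahMataKul", "JumlahMengulang"]
# Approximation of the SPSS setup: batch gradient descent with momentum.
mlp = MLPRegressor(hidden_layer_sizes=(10,), solver="sgd",
                   batch_size=len(train),        # batch training
                   learning_rate_init=0.4, momentum=0.9,
                   max_iter=1000, random_state=0)
mlp.fit(train[predictors], train["MasaStudi"])
sse = ((mlp.predict(train[predictors]) - train["MasaStudi"]) ** 2).sum()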
E. Improving the Study Duration Model
To reduce the Sum of Squares Error (SSE), the researchers re-evaluated the predictors and added a new kind of predictor: the number of times a student took a particular course. The courses chosen were those taken by the most students, since a course taken many times indicates one that many students repeat. These variables were entered into the model one by one, and the resulting SSE was evaluated (Table V).
TABLE V
MODEL PERFORMANCE (SSE) AFTER VARIABLE ADDITIONS

Variables Added — Sum of Squared Errors
15 — 150.5
20 — 93.3
25 — 89.1
30 — 86.4
35 — 81.0
TABLE VI
IMPROVED MLP MODEL SUMMARY FOR STUDY DURATION

Training — Sum of Squares Error: 97.004 — Relative Error: 0.159
Dependent variable: MASASTUDI
The model was then validated on the testing data set by checking whether there is a significant difference between the actual and the predicted study durations. A Kolmogorov-Smirnov normality test on the predictions at the 95% confidence level showed that the predicted values are not normally distributed, so the difference test was performed with the Wilcoxon Signed-Ranks test (Table VII).
TABLE VII
DIFFERENCE TEST BETWEEN ACTUAL AND PREDICTED STUDY DURATION

Test statistics (Pred − MasaStudi): Z = −1.842 (based on negative ranks)
Asymp. Sig. (2-tailed) = 0.065
(Wilcoxon Signed Ranks Test)
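A sketch of the same difference test with scipy (column names are illustrative; `test` continues the hypothetical split above):

from scipy.stats import wilcoxon

# Paired test between predicted and actual study durations on the testing set.
stat, p = wilcoxon(mlp.predict(test[predictors]), test["MasaStudi"])
print(stat, p)   # p > 0.05 -> no significant difference at 95% confidence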
The test yields a p-value of 0.065, which is greater than the critical value of 0.05. Therefore, at the 95% confidence level, there is no significant difference between the actual study durations and those produced by the prediction model.
VI. CONCLUSION
Based on the experiments, evaluation, and analysis, it can be concluded that: (a) the cumulative grade point average, the number of courses taken, the number of repeated courses, and the number of times particular courses are taken all influence study duration; (b) for predicting study duration, the regression model produces biased predictions because the homoscedasticity assumption on the errors cannot be satisfied; and (c) in this research, the Artificial Neural Network with a Multilayer Perceptron architecture is the best model for predicting study duration.
Since the data used are hypothetical data from the Gemastik 2009 Data Mining contest, the conclusions may differ when the approach is applied to actual academic data. Further research on predicting students' academic performance could take the form of: (a) research using actual academic data from a particular university; (b) building a prediction model for drop-out status for use as an early warning; and (c) designing software that implements the model so it can be used practically in a real system.
VII. REFERENCES
[1] J. Williams and L. Yan, "A Case Study Using Neural Network Algorithms: Horse Racing Prediction in Jamaica," in Intl. Conf. on Artificial Intelligence (ICAI'08), Las Vegas, 2008.
[2] A. Lapedes and R. Farber, "How Neural Nets Work," in Evolution, Learning, and Cognition, pp. 331–345, 1988.
[3] A. Cripps, "Using Artificial Neural Nets to Predict Academic Performance," in Proc. ACM Symposium on Applied Computing, 1996.
[4] B. Naik and S. Ragothaman, "Predicting M.B.A. Student Performance: An Empirical Comparison of Neural Networks vis-à-vis Statistical Models," in Midwest Decision Sciences, Lincoln Institute, 1998.
[5] Y. Bar-Yam, Dynamics of Complex Systems, 2008.
[6] P. V. Balakrishnan, M. C. Cooper, V. S. Jacob, and P. A. Lewis, "A Study of the Classification Capabilities of Neural Networks Using Unsupervised Learning: A Comparison with k-Means Clustering," Psychometrika, vol. 59, 1994.
[7] J. Hair and R. Anderson, Multivariate Data Analysis, New York: Prentice Hall, 1998.
[8] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
[9] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford: Oxford University Press, 1995.
[10] D. T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, New Jersey: John Wiley & Sons, 2005.
[11] E. L. Lehmann and H. J. M. D'Abrera, Nonparametrics: Statistical Methods Based on Ranks, rev. ed., NJ: Prentice-Hall, 1998.
Adaptive Content-based Navigation Generating
System: Data Mining on Unorganized Multi Web
Resources
Diana Purwitasari (1), Yasuhisa Okazaki (2), and Kenzi Watanabe (2)
(1) Department of Informatics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
(2) Department of Information Science, Saga University, Saga, Japan
Abstract—The rapid growth of the Internet has made Web-based applications a new means of learning. A learning application must treat navigation creation as an important task, helping users grasp the structured ideas of the learning topics within its materials collection. Let multiple resources form the collection of a learning application; the arising problem is then how to structure such an abundance of resources as navigation in the application so as to satisfy the different interests of users.
Our adaptive content-based navigation generating system provides a domain of subjects, called topics, extracted from the collection. The system automatically produces an organized structure of topics offering users guidance for learning. Since the collection consists of Web pages, the system exploits both content and hyperlink information in order to generate navigation. When the context of a user's interests changes, for instance when the user clicks a topic to learn more about it, the content-based navigation adapts; the system exploits a user model for that adaptation.
Index Terms—navigation generating system, text mining, unorganized web resources
I. INTRODUCTION
WITH such an abundance of resources available on the Internet, there are many learning sites confined to certain knowledge areas, and some subjects in one knowledge area may overlap with others. One may want to put together existing learning materials from multiple resources in order to develop a learning system that accommodates users' different subject interests. The scopes of those subjects are hidden topics in the whole collection of learning materials gathered from the multiple resources.

D. Purwitasari is with the Department of Informatics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia, e-mail: diana@if.its.ac.id.
Y. Okazaki and K. Watanabe are with the Department of Information Science, Saga University, Saga, Japan.

Some users might not yet have a structured idea of the
learning topics. Navigation, an organized structure over a collection of learning materials, offers guidance that helps users learn. When dealing with such an abundance of resources, a manually constructed navigation map of learning topics would quickly become out of date; it is easier to maintain the structure of multiple resources by handing the navigation-creation task over to a system.
Content-based navigation is defined as a sequential list of topics and subtopics hierarchically structured from a collection of documents. Our system uses data mining techniques followed by hypergraph partitioning to extract topics. The mapping of topics is treated as hypergraph construction, with vertices representing words and edges representing the strength of relations between words. Extracted topics are then restructured into a hierarchical topic list using modified agglomerative clustering. The system uses both content and hyperlink information in the collection of Web pages to build the list.
In educational settings that make particular use of Web services, the term adaptive refers to information filtering, that is, finding items relevant to a user's interests in a large collection of documents. Content-based navigation therefore adapts to users when the context of their interests changes, for instance when a user clicks a topic to learn more about it.
II. OUR PREVIOUS WORK AND RELATED WORK
The variety of educational resources available to users on the Web is changing rapidly in almost every domain. However, this abundance creates the problem of finding and organizing resources that match the individual goals, interests, and current knowledge of users. Web-based learning systems with the benefits of adaptivity and intelligence provide an alternative to the traditional "just-put-it-on-the-Web" approach, making personalized access for users the central issue. Web-based learning systems should behave differently for different users to provide the adaptation effect. So far, the only techniques demonstrating a good ability to provide personalized access to information in a learning context are
Fig. 1. Framework of adaptive content-based navigation generating system.
adaptive navigation support [1], such as curriculum sequencing. The goal is to provide the user with an "optimal path" through the materials, i.e. the individually planned sequence of topics most suitable for learning. This is very important because of its ability to guide users through the volume of available resources. Sequencing can be implemented as a list of recommended links forming a suggested learning path.
We design a generating system that produces a learning path appropriate to the current interest of users. A Web page document is the representation of at least one identified topic in the learning path. When a user wants to browse more of a topic, the system translates the user's click into a query, using the feature terms within the Web page as search keywords for the selected topic. Our scoring scheme exploits information from both the contents and the hyperlinks of Web pages [2].
Many research efforts that influence our framework for generating content-based navigation have sought to bring structured representation to large collections of documents on the Web. There are two main approaches: (i) organize documents in a domain of subjects [3] [4] [5] [6], and (ii) provide scenarios of topics for learning particular subjects [7] [8].
We conducted a thorough literature study to propose a framework that makes the best use of both models [9]. A prototype implementation of the framework shows the system's uniqueness compared with the previously mentioned works. Building on data mining techniques for topic identification [3], we adapt those techniques and then apply the similarity function for merging topics to other tasks, such as structuring and representing documents as topics within content-based navigation. Documents may belong to several domains of subjects; although fuzzy clustering might be more suitable for partitioning collections of subject-overlapping documents [4], we show that even agglomerative clustering can be applied. We also exploit hyperlink information [5] [6] to generate the learning-path navigation, because the documents in the materials collection are Web pages.
Evaluations were done to verify the validity of the methods selected in the framework [10]. Some evaluations concern preparing Web contents sufficient for a collection so that the system can suitably generate navigation. Another evaluation compares the results of several distance measures as partitioning criteria for agglomerative clustering. A further evaluation examines generating navigation from the contents of documents close to the user's current topic of interest; those documents are retrieved with our scoring model [2].
After these evaluations, the proposed framework was implemented [11]. However, that was still a model system that generates a single content-based navigation. The adaptive effect of generating and re-generating navigation, which supports the claim of producing a learning path appropriate to the users' current interest, is the emphasis of the current work.
III. CONTENT-BASED NAVIGATION GENERATING SYSTEM
Given a collection of documents, the system outlines learning topics and then provides learning scenarios to offer guidance for users. Initially it extracts topics frequently discussed in the collection using data mining techniques (Topics Detecting Module, Fig. 1), and then clusters the extracted topics to hierarchically produce a sequential list of topics (Navigation Creating Module, Fig. 1).
A. Overview
The basic cognitive insight behind topic identification is that each topic has its own set of common words (frequent itemsets). Frequent itemsets are mapped to a hypergraph whose vertices represent words and whose edges represent the strength of relations between words (Topics Detecting Module, Fig. 1); extracting topics then means partitioning the hypergraph into sub-graphs. The extracted topics are restructured into a hierarchical list of topics and subtopics using agglomerative clustering with some adjustments. This hierarchical list is meant to give an implicit path, a kind of learning direction, for users (Navigation Creating Module, Fig. 1).
When a user clicks on a topic, the system builds a query using the feature terms within the Web page as search keywords for the selected topic. The results retrieved with our scoring scheme [2] become the new resources from which the next sequence of topics is produced (Query Searching Module, Fig. 1). This module supplies the adaptive effect in the adaptive content-based navigation generating system.
B. Topics Detecting Module
If the important terms (frequent itemsets) in the collection are mapped into a graph of word vertices, the sub-graphs indicate implicit topics. A hypergraph is a graph in which an edge can connect more than two vertices. This module partitions the hypergraph to single out hyperedges identifying the most discussed topics. The shmetis program of the hMetis tool is executed to perform the hypergraph partitioning [12]. Following the input-output file format defined by shmetis, we encode the frequent 2-itemsets of the hypergraph into an input file and decode the output file into a list of unnamed topics, each consisting of the common words of one topic.
The Topics Detecting Module begins by preprocessing the collection [13], which includes indexing, removing stop words, stemming, and TF-IDF (term frequency - inverse document frequency) weighting, in order to retrieve candidate frequent itemsets. Data mining then lists the frequent itemsets that satisfy minimum values of the support and confidence filters. We introduce two further filters to set apart the significant frequent 2-itemsets: first, an indispensable-feature filter on the TF-IDF term weight (Eq. 1), and second, our own weighting scheme of term entropy (Eq. 2).
ω_{d,t} = (tf_{d,t} / max_i tf_{d,i}) · log(N / n_t) ≥ δ_w    (1)

Let D be a collection of N Web pages, |D| = N, and let n_t be the number of documents containing term t. A weight ω_{d,t} is assigned to each term t in a document d; it depends on the number of term occurrences in the document, tf_{d,t}, normalized by the frequency of the most frequent term in that document. The term weight ω_{d,t} is attenuated by the factor idf_t = log(N/n_t) if the term occurs too often across the collection. The TF-IDF filter discards term weights ω_{d,t} below δ_w.
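A minimal sketch of the Eq. (1) filter, assuming the collection is given as lists of tokens and that δ_w is a tunable threshold:

import math
from collections import Counter

def tfidf_filter(docs, delta_w=0.1):
    # docs: list of token lists; returns, per document, the terms
    # whose normalized TF-IDF weight passes the Eq. (1) threshold.
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))   # n_t per term
    kept = []
    for d in docs:
        tf = Counter(d)
        max_tf = max(tf.values())
        kept.append({t for t, f in tf.items()
                     if (f / max_tf) * math.log(N / df[t]) >= delta_w})
    return kept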
TABLE I
A CONTINGENCY TABLE TO HELP DETERMINE THE IMPORTANCE STATE OF TERMS IN A DOCUMENT

                 belongs to C_D    belongs to C_NOT_D
from set S_lt         a                  b             |S_lt| = a + b
from set S_gt         c                  d             |S_gt| = c + d
An entropy value measures a state of uncertainty. We take the entropy value of a term to reflect the term's effectiveness in distinguishing a particular document from the others, a reasoning derived from the selection of the most useful attribute to test at each node in decision tree algorithms [14]. For the term entropy calculation we assume there are only two target classes:
(i) the term is one of the important terms of the currently observed document, or
(ii) on the contrary, the term is important for any other document except the currently observed one.
The attribute used to classify a term as (i) or (ii) is the test of whether the term's frequency in the currently observed document is less or greater than the average term frequency over all documents in the collection.
In a collection D, for a document d, a term t is thus classified into two classes: (i) the important terms of document d, C_d, or (ii) the important terms of the other documents, C_not_d. The state of whether tf_{d,t} is less or greater than the average number of occurrences of t per document, avg(tf_{d,t}), becomes the classifying attribute.
The term entropy value ent_{d,t} of term t in document d is defined as:
ent_{d,t} = Σ_{S_x} (|S_x| / N) · ent_p_{d,t}(S_x) ≥ δ_e    (2)

where S_x is either the set of documents in which the occurrences of t are less than the average avg(tf_{d,t}), S_lt, or the set in which they are greater, S_gt; S_x ∈ {S_lt, S_gt} and D = S_lt ∪ S_gt (Table I depicts the contingency table that helps determine the importance state of terms in a document). The entropy filter discards term entropy values below δ_e.
Given a set S_x containing only the positive and negative documents of the two-class problem C_d and C_not_d, the entropy of S_x relative to this simple binary classification is defined as:

ent_p_{d,t}(S_x) = −(p_p log₂ p_p + p_n log₂ p_n)    (3)

where p_p is the proportion of documents in S_x in which term t belongs to C_d, and p_n is the proportion belonging to C_not_d.
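The following sketch implements one plausible reading of Eqs. (2)-(3) and Table I, in which a document's class is C_d exactly when it is the observed document; the construction is our interpretation of the text, not code from the paper:

import math
from collections import Counter

def binary_entropy(p):
    # Entropy of a two-class split with proportions p and 1-p (Eq. 3).
    q = 1.0 - p
    return -sum(v * math.log2(v) for v in (p, q) if v > 0)

def term_entropy(docs, d_index, term):
    # Partition the collection by whether the term frequency is below
    # (S_lt) or at/above (S_gt) the collection average, then weight each
    # set's entropy by |S_x|/N (Eq. 2). A document's class is C_d if it
    # is the observed document d, else C_not_d (our reading of Table I).
    tfs = [Counter(doc)[term] for doc in docs]
    avg = sum(tfs) / len(tfs)
    ent = 0.0
    for s in ([i for i, f in enumerate(tfs) if f < avg],
              [i for i, f in enumerate(tfs) if f >= avg]):
        if s:
            p_p = sum(1 for i in s if i == d_index) / len(s)
            ent += len(s) / len(docs) * binary_entropy(p_p)
    return ent   # keep the term for document d only if ent >= delta_e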
We list the frequent itemsets that satisfy the data mining filters of minimum support (Eq. 4) and minimum confidence (Eq. 5) [15]. The support value s gives the fraction of documents that contain both terms t_i and t_j, a pair of common words occurring together in multiple documents. The support filter ignores 2-itemsets of words that occur in only a few documents, ensuring that those occurring often are the ones deemed worthy of attention.

s_{i,j} = p(t_i, t_j) ≥ δ_s    (4)

Confidence c, also called mutual information, measures the dependence or correlation between the occurrences of terms t_i and t_j. The confidence filter analyzes further whether the presence of a 2-itemset that occurs often (i.e. passes the support filter) is significant enough to become a potential frequent itemset.

c_{i,j} = log₂ [ p(t_i, t_j) / (p(t_i) × p(t_j)) ] ≥ δ_c    (5)

Note that p(x) is the probability function of x, i.e. p(t) = n_t/N, while p(x, y) is the joint probability function of x and y.
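A compact sketch of the support and confidence filters of Eqs. (4)-(5) over per-document term sets (threshold values are illustrative):

import math
from collections import Counter
from itertools import combinations

def frequent_pairs(docs, delta_s=0.1, delta_c=1.0):
    # docs: list of sets of important terms, one set per document.
    N = len(docs)
    single, pair = Counter(), Counter()
    for terms in docs:
        single.update(terms)
        pair.update(combinations(sorted(terms), 2))
    result = []
    for (a, b), n_ab in pair.items():
        s = n_ab / N                    # support (Eq. 4)
        if s < delta_s:
            continue
        c = math.log2(s / ((single[a] / N) * (single[b] / N)))  # confidence (Eq. 5)
        if c >= delta_c:
            result.append((a, b, s, c))
    return result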
C. Navigation Creating Module
To find the relations between topics and subtopics, agglomerative clustering is used to organize the topics into nested groups of merged topics [3]. Clustering is initialized with the topics extracted by hypergraph partitioning as cluster seeds: in the first iteration a cluster contains a single topic, but afterwards a cluster may contain several merged topics. Whether merging some topics into one cluster yields a comprehensible topic is the question to appraise; we therefore require the within-cluster distances among members of the same cluster to be minimized while the between-cluster distances are maximized. The smaller the variance ratio of between to within distances, the better the clustering result. If a certain iteration shows a large variance ratio after the ratio had been declining in previous iterations, clustering cuts off the tree of topics and subtopics there.
Our assumptions during clustering are that each document can belong to exactly one cluster of topics, and that any cluster without document members is removed. Since we use the extracted topics as cluster seeds at initialization, these assumptions imply that some initial clusters may end up with no document members and be left out during tree construction.
Our proximity matrix P is a matrix whose ij-th entry denotes the similarity between the topics in the i-th and j-th clusters. Before merging topic a, tpc_a, and topic b, tpc_b, we recalculate the similarity between every document d and the soon-to-be-merged topics based on their common words (sim(d, tpc_a) and sim(d, tpc_b)); this captures words that are not only frequent in documents close to tpc_a but also present in documents of tpc_b. The document-to-topic similarity is derived from the TF-IDF formula of the term weighting process (Eq. 6) [3]:

sim(d_i, tpc_t) = [ Σ_{k∈t} tf_{ik} · (log(N/n_k))² ] / [ √( Σ_{j∈t} (log(N/n_j))² ) × √( Σ_{j∈t} (tf_{ij} · log(N/n_j))² ) ]    (6)

The accumulated document-similarity values for every pair of topics are combined in a mutual-information style to update the proximity matrix of topic clusters at each clustering iteration. Eq. 7 defines sim_ab, the similarity of topic a, tpc_a, and topic b, tpc_b [3]:

sim_ab = [ Σ_{i∈docs} sim(d_i, tpc_a) × sim(d_i, tpc_b) / N ] / [ ( Σ_{j∈docs} sim(d_j, tpc_a) / N ) × ( Σ_{k∈docs} sim(d_k, tpc_b) / N ) ]    (7)

In the first iteration, tpc_a and tpc_b each contain a single topic, the cluster seed taken from the topics extracted by the Topics Detecting Module. In later iterations tpc_a or tpc_b may be a topic cluster resulting from merged topics. This modification handles the issue that articles very often have blended topics.
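Under the reconstruction of Eqs. (6)-(7) above, the two similarity computations could be sketched as follows (Python; `tf_doc` and `df` are assumed term-frequency and document-frequency maps):

import math

def doc_topic_similarity(tf_doc, topic_terms, df, N):
    # Cosine-style similarity between a document and a topic's common
    # words using TF-IDF weights (our reading of Eq. 6).
    idf2 = {t: math.log(N / df[t]) ** 2 for t in topic_terms}
    num = sum(tf_doc.get(t, 0) * idf2[t] for t in topic_terms)
    den = math.sqrt(sum(idf2.values())) * \
          math.sqrt(sum((tf_doc.get(t, 0) ** 2) * idf2[t] for t in topic_terms))
    return num / den if den else 0.0

def topic_pair_similarity(sim_a, sim_b):
    # sim_a[i], sim_b[i]: similarity of document i to topics a and b.
    N = len(sim_a)
    num = sum(sa * sb for sa, sb in zip(sim_a, sim_b)) / N
    den = (sum(sim_a) / N) * (sum(sim_b) / N)
    return num / den if den else 0.0   # mutual-information style (Eq. 7)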
The distance between two clusters of topics signifies an overlapping domain of subjects. The similarity function used to define the likeness of a document and a cluster also serves to calculate distances between clusters [3]. Our implementation demonstrates that this document-cluster similarity function is useful beyond merging topics [9] [11]: the aforesaid design for cutting off the topics-subtopics tree uses it to work out the variance ratio. In the end, each cluster of a single topic or of merged topics receives a document representation. For that purpose, the module searches for the representation in a list of documents similar to the clusters of topics, ordered descendingly by depth-first traversal. The similarity function measures the likeness of a document and a cluster of topics based on the common words they contain. We compute the similarity of each traversed document, ignoring any document that has already become the representation of another cluster. The similarity between a document and the sub-clusters of the currently visited cluster is also checked, because the document may be more suitable as the representation of one of those sub-clusters. To order the traversal of topics in the list, we consider the implicit link information of the Web pages: PageRank [16] is selected as the link analysis for acquiring the importance weight of each document representation.
IV. ADAPTIVE CONTENT-BASED NAVIGATION GENERATING SYSTEM
The content-based navigation generating system changes the focus of the learning path when the current knowledge state of the user shifts. The Query Searching Module pin-points documents from the collection whose contents are relevant to the information in the user model (see Fig. 1), and the system then adaptively re-generates a new navigation for learning those documents.
Fig. 2. Browser displaying the first page from a generated navigation.
Fig. 3. Map of topics from a generated navigation in a Web-based application.
A. Query Searching Module
We use PageRank [16] to pick out the most linked-to Web pages whose contents best suit the selected topic. The generated navigation derives from a mapping of clustered topics in which each single topic may be a term phrase. Since the position of combined terms may change the meaning, finding suitable contents requires a term-weighting scheme that considers term positions [17]. Our searching process therefore combines two ranking factors: link analysis, via PageRank scoring, and content analysis that retains the spatial information (positions) of the search keywords, via Fourier Domain Scoring [2]. The search results act as a dummy collection replacing the original one, from which the system produces the next navigation over a more focused collection for further learning.
The adaptive content-based navigation generating system combines the Vector Space Model (VSM, i.e. TF-IDF weighting), Fourier Domain Scoring (FDS), and the PageRank Score (PRS) to obtain the similarity score between a document and the selected material. The selected material is the Web page on which the user clicks the [Generate] button (see Fig. 2). The system uses the term weights of the important terms occurring in both the document and the query Web pages to compute the VSM score. The FDS score is calculated if and only if the title of the selected material consists of more than a single term. Finally, the multiplication of these scores, including the FDS score when it exists, defines the final similarity between the document and query Web pages.
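The combination rule reduces to a guarded multiplication, sketched below (function and argument names are ours, not the paper's):

def final_score(vsm, prs, fds=None):
    # Combine VSM and PageRank scores; include Fourier Domain Scoring
    # only when the selected title has more than one term (fds given).
    return vsm * prs * (fds if fds is not None else 1.0)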
B. System Implementation on a Web-based Application
To test the functionality of each module, the adaptive content-based navigation generating system was implemented. A small collection of English Wikipedia articles without images (crawled from the main page of the category Statistic, http://en.wikipedia.org/wiki/Statistic) was prepared. The Web pages are prepared beforehand, e.g. by HTML scraping based on XML patterns, to extract only the learning contents from the raw pages. For the time being it is easier to observe the adaptive effect with a small experimental collection (±30 documents). Adaptation here concerns topic mapping, to recognize the domain of subjects, and then topic spotting of user interests, to provide scenarios for learning subjects. We use the Oracle Text libraries (http://www.oracle.com/technology/products/text) for indexing terms and the Porter Stemmer algorithm for stemming. The stop word list contains the words provided by Oracle Text, HTML tags, and some common words in Wikipedia articles.
The browser displays the Web page of Fig. 2 to users. Highlighted links indicate that the referenced Web pages are documents in the collection. The system processes the contents of those pages through the routine steps of the aforementioned modules to derive a map of topics for generating navigation.
The first generated navigation is illustrated in Fig. 3, which also shows the graph of words (the hypergraph is written in the DOT language and viewed with ZGRViewer, http://zvtm.sourceforge.net/zgrviewer.html), although creating the illustration graph is not part of the navigation generating system. The graph describes the map of topics in the experimental collection of Wikipedia articles. The navigation has two sections, shown as a two-part division of the map of topics by a dashed line. Each divided area in the graph contains several groups of gradationally colored vertices; nodes with the same gradation color are common words of a single topic, i.e. a subsection in the generated navigation.
Suppose the user clicks one topic of the first generated navigation in Fig. 3 (see the rectangular area of a subsection in the navigation). The selected topic has words in dotted ellipses within its coverage concept of subjects. The Query Searching Module recognizes that the retrieved documents have subjects related to the selected topic, illustrated by the ellipse with a solid line. For the second generated navigation in this case, the system produces a structure from the pin-pointed topics within the solid-line ellipse, not from the whole map of topics. In this sense the system adaptively changes the focus of the learning path for users based on their current interest.
Fig. 2 illustrates a Statistic course in a sample Web-based learning application. The user is advised to learn topics about [Statistic Population] followed by [Order Statistic]. If the user clicks the link [Probability Distribution], the system analyzes the topic mapping of Fig. 3 and generates a new learning path focused on the topics in the solid-line ellipse; this second generated path does not concern all topics in the mapping.
V. CONCLUSIONS AND FUTURE WORK
In previous works we studied the literature and proposed a framework for a navigation generating system, evaluated the selected methods in the framework to confirm their validity, and then implemented the framework, with one aspect still missing for producing an appropriate learning path for users: the adaptive effect. This work completes the proposed framework for generating navigation based solely on the contents of the document collection. With the adaptive effect added, the system can generate navigation as guidance for users without neglecting the context of their interests in certain learning subjects.
The adaptive content-based navigation generating system produces a structured list of topics: not only a hierarchical list across levels but also a reading sequence within each level. Instead of following the links inside the Web pages of the learning materials and reading them in a hypertextual manner, users can follow the path in our generated navigation and stay clear of hypertext disorientation.
The system implementation in this paper takes the term adaptive to mean a user model holding information about the selected topic. Note that the collection of learning materials is assumed to consist of topics, and relations between topics, determined by the frequency of words in the documents of the collection. The topics a user selects should be recorded to reflect the user's learning preferences; the preference of a user is not yet adequately reflected in this paper. More elaborate studies are needed to represent the user model: information about previously visited links, the time spent on each link, and other signals could become references for modeling users, and we would like the system to consider those aspects next.
The framework discussed so far involves analyzing hidden topics, which we believe affect the optimal path of a navigation support system for the user, as well as the user's reaction to the system during the learning process. Although the framework's functionality has been evaluated, more quantitative evaluations of the system's effectiveness are also expected. Such an evaluation must face the question of how to evaluate user reactions to the navigation generating system directly; our next work will also consider this open question.
REFERENCES
[1] P. Brusilovsky and E. Millan, "User Models for Adaptive Hypermedia and Adaptive Educational Systems," in The Adaptive Web: Methods and Strategies of Web Personalization, ser. Lecture Notes in Computer Science, vol. 4321, Springer Berlin / Heidelberg, 2007, pp. 3–53.
[2] D. Purwitasari, Y. Okazaki, and K. Watanabe, "A Study on Web Resources' Navigation for e-Learning: Usage of Fourier Domain Scoring on Web Pages Ranking Method," in ICICIC '07: Proc. of the Second Intl. Conf. on Innovative Computing, Information and Control, 2007.
[3] C. Clifton, R. Cooley, and J. Rennie, "TopCat: Data Mining for Topic Identification in a Text Corpus," IEEE Trans. on Knowledge and Data Engineering, vol. 16, no. 8, pp. 949–964, 2004.
[4] M. E. S. Mendes, W. Jarrett, O. Prnjat, and L. Sacks, "Flexible Searching and Browsing for Telecoms Learning Material," in IST '2003: Proc. of the 2003 Intl. Symp. on Telecomm., 2003.
[5] M. Halkidi, B. Nguyen, I. Varlamis, and M. Vazirgiannis, "THESUS: Organizing Web Document Collections based on Link Semantics," The VLDB Journal, vol. 12, no. 4, pp. 320–332, 2003.
[6] J. Zhu, J. Hong, and J. G. Hughes, "PageCluster: Mining Conceptual Link Hierarchies from Web Log Files for Adaptive Web Site Navigation," ACM Trans. Internet Technol., vol. 4, no. 2, pp. 185–208, 2004.
[7] A. D. Wissner-Gross, "Preparation of Topical Reading Lists from the Link Structure of Wikipedia," in ICALT '06: Proc. of the Sixth IEEE Intl. Conf. on Advanced Learning Technologies, 2006, pp. 825–829.
[8] S. Reinhold, "WikiTrails: Augmenting Wiki Structure for Collaborative, Interdisciplinary Learning," in WikiSym '06: Proc. of the 2006 Intl. Symp. on Wikis, 2006, pp. 47–58.
[9] D. Purwitasari, Y. Okazaki, and K. Watanabe, "Content-based Navigation in Web-based Learning Applications," in ICCE '08: Proc. of the 16th Intl. Conf. on Computers in Education, 2008, pp. 557–564.
[10] D. Purwitasari, Y. Okazaki, and K. Watanabe, "A Study on Adaptive Content-based Navigation Generating System for Web-based Learning," in SIG-ALST '53: Proc. of the 53rd Conf. of the Special Interest Group on Advanced Learning Science and Technology, The Japanese Society for Artificial Intelligence / IEEE Education Japan Chap., 2008, pp. 31–36.
[11] D. Purwitasari, Y. Okazaki, and K. Watanabe, "Data Mining for Navigation Generating System with Unorganized Web Resources," in KES '08: Proc. of the 12th Intl. Conf. on Knowledge-Based Intelligent Information and Engineering Systems, 2008, pp. 598–605.
[12] G. Karypis, "Multilevel Hypergraph Partitioning," Comput. Sci. and Eng. Dept., Univ. of Minnesota, Minneapolis, Tech. Rep. 02-25, 2002.
[13] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, 1999.
[14] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81–106, 1986.
[15] R. Agrawal, T. Imieliński, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," ACM SIGMOD, vol. 22, no. 2, pp. 207–216, 1993.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Stanford Digital Library Technologies Project, Tech. Rep., 1998.
[17] L. A. Park, K. Ramamohanarao, and M. Palaniswami, "Fourier Domain Scoring: A Novel Document Ranking Method," IEEE Trans. on Knowledge and Data Engineering, vol. 16, no. 5, pp. 529–539, May 2004.
Fuzzy Decision Tree with the ID3 Algorithm on Diabetes Data
F. Romansyah, I. S. Sitanggang, and S. Nurdiati
Abstract—The decision tree is one of the most widely used methods for developing classification models; to handle uncertainty, a fuzzy approach is used. This work applies a classification technique using the fuzzy decision tree method to a diabetes dataset in order to obtain classification rules for predicting new data. Fuzzy ID3 (fuzzy Iterative Dichotomiser 3) was used to develop a fuzzy decision tree with high accuracy. The result is a fuzzy decision tree with a highest accuracy of 94.15% for a fuzziness control threshold (θr) of 75% and a leaf decision threshold (θn) of 8% or 10%.
Index Terms—Fuzzy decision tree, Fuzzy ID3
This research aims to: 1) apply one classification technique, the Fuzzy ID3 (Iterative Dichotomiser 3) decision tree, to patients' laboratory examination data; and 2) find classification rules in the diabetes data that explain and distinguish the classes or concepts, so that they can be used to predict diabetes based on the values of the other known attributes. The model produced in this research is expected to be usable by interested parties to predict the potential of a person or patient to develop diabetes, so that the onset of the disease can be predicted as early as possible and anticipatory measures can be taken.
I. INTRODUCTION
Data mining is the process of extracting important information or patterns from large databases [3]. This research applies one data mining technique, classification, to diabetes data. The diabetes data used are patients' laboratory examination results from a hospital, covering GLUN (fasting blood glucose), GPOST (2-hour postprandial blood glucose), TG (triglycerides), HDL (HDL cholesterol), and the patient's diagnosis based on the GLUN, GPOST, HDL, and TG values.
Classification is a data mining method for predicting the class label of a record in the data. The method used in this research is classification with a fuzzy decision tree. Using fuzzy techniques makes it possible to predict an object that belongs to more than one class. By applying data mining techniques to the diabetes data, we expect to find classification rules that can be used to predict a person's potential to develop diabetes.
II. FUZZY DECISION TREE
A. Linguistic Variables and Linguistic Terms

A linguistic variable is a verbal variable that can be used to model human reasoning expressed in the form of fuzzy sets. A linguistic variable is characterized by a quintuple (x, T(x), X, G, M), where x is the name of the variable, T(x) is the set of linguistic terms, X is the interval of values of x, G is a syntactic rule, and M is a semantic rule associated with each value of the linguistic variable. For example, if age (umur) is interpreted as a linguistic variable, then the set of linguistic terms T(umur) becomes:

T(umur) = {sangat muda, muda, tua}   (very young, young, old)

Each term in T(umur) is characterized by a fuzzy set over X, the interval of values of x. The semantic rule gives the membership function of each value in the set of linguistic terms [1].
Linguistic terms are defined as a collection of fuzzy sets based on the membership functions that correspond to the linguistic variable. Let D be a set of records over a set of attributes I = {I_1, ..., I_n}, with I_v, v = 1, ..., n. An attribute I_v can be numerical or categorical. For each record d ∈ D, d[I_v] denotes the value i_v in record d for attribute I_v. A set of linguistic terms can be defined over the whole domain of a quantitative attribute. L_{vr}, r = 1, ..., s_v, denotes the linguistic terms associated with attribute I_v, so that a fuzzy set can be defined for each L_{vr}.
The fuzzy set L_{vr}, r = 1, ..., s_v, is defined as:

L_{vr} = \begin{cases} \sum_{i_v \in dom(I_v)} \mu_{L_{vr}}(i_v) / i_v & \text{if } I_v \text{ is discrete} \\ \int_{dom(I_v)} \mu_{L_{vr}}(i_v) / i_v & \text{if } I_v \text{ is continuous} \end{cases}

for all i_v ∈ dom(I_v), with dom(I_v) = {i_{v1}, ..., i_{vn}}. The membership degree of a value i_v ∈ dom(I_v) in a linguistic term L_{vr} is denoted by \mu_{L_{vr}}.
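For concreteness, the sketch below (Python; the age breakpoints 20, 30, 40, and 50 are invented for illustration, since the paper does not define T(umur) numerically) shows one way to realize each linguistic term as a membership function over X:

    # Hypothetical membership functions for T(umur) = {sangat muda, muda, tua};
    # the breakpoints are illustrative, not from the paper.
    def mu_sangat_muda(x):            # "very young"
        return 1.0 if x < 20 else max(0.0, (30 - x) / 10)

    def mu_muda(x):                   # "young"
        if x < 20 or x >= 50:
            return 0.0
        if x < 30:
            return (x - 20) / 10
        return 1.0 if x < 40 else (50 - x) / 10

    def mu_tua(x):                    # "old"
        return 0.0 if x < 40 else min(1.0, (x - 40) / 10)

    print(mu_muda(28))                # 0.8: age 28 is "muda" to degree 0.8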
B. Fuzzy Decision Tree (FDT)

The decision tree is a very popular and practical approach in machine learning for solving classification problems. The method is used to approximate discrete-valued target functions, in which the learned function is represented by a decision tree [5]. A decision tree is a set of IF…THEN rules: each path in the tree is associated with a rule whose premise consists of the set of nodes encountered along it and whose conclusion is the class attached to the leaf of the path [6].

In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, consisting of the root and the other internal nodes, contain attribute test conditions used to separate records with different characteristics. Edges can be labeled with numeric-symbolic values. A numeric-symbolic attribute is an attribute whose values can be numeric or symbolic, both associated with a quantitative variable. For example, a person's height can be expressed as a numeric-symbolic attribute: by a quantitative value, written as "1.72 meter", or by a numeric-symbolic value such as "tinggi" (tall) that refers to a size. Values of this kind motivate the extension of the decision tree into the fuzzy decision tree [8]. The use of fuzzy techniques makes it possible to predict an object that belongs to more than one class.

A fuzzy decision tree makes it possible to use numeric-symbolic values during construction or when classifying new cases. The benefit of fuzzy set theory in decision trees is the improved comprehensibility of the tree when quantitative attributes are used. Moreover, using fuzzy techniques can improve robustness when classifying new cases [6].
C. Fuzzy ID3 Decision Tree

ID3 (Iterative Dichotomiser 3) is one of the most widely used algorithms for building a decision tree. The algorithm, first introduced by Quinlan, uses information theory to determine which attribute is the most informative; however, ID3 is very unstable, as its classifications are sensitive to small perturbations of the training data. Fuzzy logic can provide an improvement in classification during training [5].

The fuzzy ID3 algorithm is an efficient algorithm for building a fuzzy decision tree. The algorithm is as follows [5]:
1. Create a root node that has a set of fuzzy data with membership value 1.
2. If a node t with a fuzzy set of data D satisfies the following conditions, then it is a leaf node and is assigned the class name:
   • the proportion of a class C_i is greater than or equal to θ_r, i.e. |D^{C_i}| / |D| ≥ θ_r;
   • the number of elements of the data set is less than θ_n; or
   • there are no attributes left for further classification.
3. If a node D does not satisfy the above conditions, then it is not a leaf node, and new sub-nodes are generated as follows:
   • For each A_i (i = 1, …, L), calculate the information gain, and select the test attribute A_max that maximizes it.
   • Divide D into fuzzy subsets D_1, …, D_m according to A_max, where the membership value of the data in D_j is the product of the membership value in D and the value of F_{max,j} of the value of A_max in D.
   • Generate new nodes t_1, …, t_m for the fuzzy subsets D_1, …, D_m, and label the edges that connect nodes t_j and t with the fuzzy sets F_{max,j}.
   • Replace D by D_j (j = 1, 2, …, m) and repeat from step 2 recursively.
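The following Python sketch condenses these three steps. It is our illustration rather than the authors' MATLAB implementation; the data layout (X as a list of fuzzified records mapping attribute -> {term: degree}, y as class labels, mu as the fuzzy data set of the current node) and all names are assumptions:

    import math

    def sizeof(mu):                              # |D|: sum of memberships
        return sum(mu.values())

    def class_sizes(mu, y):                      # |D^{C_i}| for each class
        sizes = {}
        for j, m in mu.items():
            sizes[y[j]] = sizes.get(y[j], 0.0) + m
        return sizes

    def fuzzy_entropy(mu, y):                    # entropy in the style of (4)-(5)
        total = sizeof(mu)
        h = 0.0
        for s in class_sizes(mu, y).values():
            p = s / total
            if p > 0:
                h -= p * math.log2(p)
        return h

    def fid3(X, y, mu, attrs, theta_r, theta_n, n_total):
        sizes = class_sizes(mu, y)
        total = sizeof(mu)
        majority = max(sizes, key=sizes.get)
        # Step 2: FCT, LDT, or no attribute left -> leaf labelled with a class
        if sizes[majority] / total >= theta_r or total < theta_n * n_total or not attrs:
            return majority

        def subset(a, t):                        # D_j = mu(D) * degree of term t
            sub = {j: m * X[j][a][t] for j, m in mu.items()}
            return {j: m for j, m in sub.items() if m > 0}

        def gain(a):                             # fuzzy information gain of a
            terms = X[next(iter(mu))][a]
            rem = sum(sizeof(s) / total * fuzzy_entropy(s, y)
                      for t in terms for s in [subset(a, t)] if s)
            return fuzzy_entropy(mu, y) - rem

        best = max(attrs, key=gain)              # Step 3: test attribute A_max
        children = {t: fid3(X, y, s, [a for a in attrs if a != best],
                            theta_r, theta_n, n_total)
                    for t in X[next(iter(mu))][best] for s in [subset(best, t)] if s}
        return (best, children)

    # Usage sketch: tree = fid3(X, y, {j: 1.0 for j in range(len(X))},
    #                           ["GLUN", "GPOST", "HDL", "TG"], 0.80, 0.20, len(X))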
D. Fuzzy Entropy and Information Gain

Information gain is a statistic used to select the attribute that will expand the tree and produce a new node in the ID3 algorithm. Entropy is used to define the information gain. Entropy is formulated as follows [5]:

H_s(S) = \sum_{i=1}^{N} -P_i \log_2(P_i)    (1)

where P_i is the proportion of class C_i in the sample set S = {x_1, x_2, …, x_k}:

P_i = \frac{\sum_{j=1}^{k} [x_j \in C_i]}{|S|}    (2)
Information gain is used as the attribute selection measure; it is the reduction in entropy of the sample set obtained by partitioning the sample set on an attribute. Information gain is defined as follows [5]:

G(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)    (3)

with weight W_v = \frac{|S_v|}{|S|}, the proportion of data taking value v for the attribute in the sample set.
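For illustration (the numbers are ours, not from the paper): a sample set of 15 records with 9 records in class C_1 and 6 in class C_2 has entropy

H_s(S) = -(9/15) \log_2(9/15) - (6/15) \log_2(6/15) \approx 0.971.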
On a fuzzy data set, the formulas for computing the entropy of an attribute and the information gain are adjusted to account for the fuzzy representation of the data. The fuzzy entropy of the whole data set is given by [5]:

H_f(S) = H_s(S) = \sum_{i=1}^{N} -P_i \log_2(P_i)    (4)

To determine the fuzzy entropy and the information gain of an attribute in the fuzzy ID3 (FID3) algorithm, the following equations are used [5]:

H_f(S, A) = -\sum_{i=1}^{N} \frac{\sum_j \mu_{ij}}{|S|} \log_2 \frac{\sum_j \mu_{ij}}{|S|}    (5)

G_f(S, A) = H_f(S) - \sum_{v \in A} \frac{|S_v|}{|S|} H_f(S_v, A)    (6)

where \mu_{ij} is the membership value of the j-th pattern for the i-th class. H_f(S) denotes the entropy of the set S of training data at a node. |S_v| is the size of the subset S_v ⊆ S of training data x_j with value v of the attribute, and |S| denotes the size of the set S.
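For illustration (our arithmetic, anticipating the worked example in Section III-B): a node with fuzzy class sizes |S^{C_1}| = 2.1 and |S^{C_2}| = 1.0, hence |S| = 3.1, has

H_f(S) = -(2.1/3.1) \log_2(2.1/3.1) - (1.0/3.1) \log_2(1.0/3.1) \approx 0.907.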
E. Thresholds in the Fuzzy Decision Tree

If the learning process of the FDT is continued until every sample at each leaf node belongs to a single class, low accuracy results. Therefore, to improve accuracy, learning must be stopped earlier, or the tree must be pruned. To this end, two thresholds are given that must be satisfied if the tree is to be expanded [5]:

• Fuzziness control threshold (FCT) / θr
If the proportion of the data set belonging to a class Ck is greater than or equal to the threshold θr, expansion of the tree is stopped. For example: if in a sub-dataset the proportion of class 1 is 90% and of class 2 is 10%, and θr is 85%, then tree expansion is stopped.

• Leaf decision threshold (LDT) / θn
If the number of members of the data set at a node is smaller than the threshold θn, expansion of the tree is stopped. For example: a data set has 600 samples and θn is 2%. If the number of samples at a node is smaller than 12 (2% of 600), then tree expansion is stopped.
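Both tests reduce to two comparisons; a minimal Python sketch (names are ours) is:

    def should_stop(class_sizes, node_size, theta_r, theta_n, total_records):
        """True if expansion stops: FCT (dominant class >= theta_r) or
        LDT (node smaller than theta_n of the whole data set)."""
        dominant = max(class_sizes.values()) / sum(class_sizes.values())
        return dominant >= theta_r or node_size < theta_n * total_records

    # The 90%/10% example above, with theta_r = 85% and theta_n = 2% of 600:
    print(should_stop({"kelas 1": 0.9, "kelas 2": 0.1}, 100, 0.85, 0.02, 600))  # True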
III. RESEARCH STAGES

The basic process of the system follows the Knowledge Discovery in Databases (KDD) process, which can be described as follows:
a. Data cleaning: removing records with missing values and duplicate records.
b. Transformation of the data into fuzzy form.
c. Splitting the data into a training set and a test set using the k-fold cross-validation method [2].
d. Application of the data mining technique, an important stage in which the data mining technique is applied to the data. The fuzzy decision tree method is used to find the classification rules. The steps of this method are:
   1. Determine the attributes to be used.
   2. Determine the number of fuzzy sets for each attribute.
   3. Determine the size of the training set to be used.
   4. Compute the membership values.
   5. Choose the thresholds to be used.
   6. Build the fuzzy decision tree with the Fuzzy ID3 algorithm.
The hardware and software used in this research are as follows. Hardware: a personal computer with an AMD Athlon 64 2800+ processor, 512 MB of DDR memory, and an 80 GB hard disk. Software: the Windows XP Professional operating system, MATLAB 7.0 as the programming language, and Microsoft Excel 2003.
A. Data Transformation

Of the 5 attributes used, 4 are continuous, namely GLUN, GPOST, HDL, and TG, while the Diagnosa attribute is categorical. Based on the laboratory reference values, the normal ranges of the GLUN, GPOST, HDL, and TG attributes are shown in Table 1.

TABLE 1
LABORATORY REFERENCE VALUES

  Examination code   Unit     Normal range
  GLUN               mg/dL    70 ~ 100
  GPOST              mg/dL    100 ~ 140
  HDL                mg/dL    40 ~ 60
  TG                 mg/dL    50 ~ 150
The attributes transformed into fuzzy form are the following:

• GLUN attribute
The GLUN attribute is divided into 4 groups or linguistic terms: rendah (low: GLUN < 70 mg/dL), sedang (medium: 70 mg/dL <= GLUN < 110 mg/dL), tinggi (high: 110 mg/dL <= GLUN < 140 mg/dL), and sangat tinggi (very high: GLUN >= 140 mg/dL) [4]. From this division, the membership functions of the fuzzy sets rendah, sedang, tinggi, and sangat tinggi for the GLUN attribute can be determined separately as:

\mu_{rendah}(x) = \begin{cases} 1 & x < 65 \\ (75 - x)/10 & 65 \le x < 75 \\ 0 & x \ge 75 \end{cases}
\qquad
\mu_{sedang}(x) = \begin{cases} 0 & x < 65 \\ (x - 65)/10 & 65 \le x < 75 \\ 1 & 75 \le x < 105 \\ (115 - x)/10 & 105 \le x < 115 \\ 0 & x \ge 115 \end{cases}

\mu_{tinggi}(x) = \begin{cases} 0 & x < 105 \\ (x - 105)/10 & 105 \le x < 115 \\ 1 & 115 \le x < 135 \\ (145 - x)/10 & 135 \le x < 145 \\ 0 & x \ge 145 \end{cases}
\qquad
\mu_{sangatTinggi}(x) = \begin{cases} 0 & x < 135 \\ (x - 135)/10 & 135 \le x < 145 \\ 1 & x \ge 145 \end{cases}

The fuzzy set for each linguistic term uses a trapezoidal curve, as shown in Figure 1.
Figure 1. Fuzzy sets of the GLUN attribute (trapezoidal membership curves for rendah, sedang, tinggi, and sangat tinggi).
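Since all four sets are trapezoids, they can be generated from one helper. The Python sketch below is our code (the breakpoints are read off the formulas above) and also reproduces the overlap between neighbouring terms:

    import math

    def trapmf(x, a, b, c, d):
        """Trapezoid: 0 up to a, rising on [a, b), 1 on [b, c), falling on [c, d)."""
        if x < a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        if x < c:
            return 1.0
        return (d - x) / (d - c)

    INF = math.inf
    glun_mf = {                       # breakpoints taken from the formulas above
        "rendah":        lambda x: trapmf(x, -INF, -INF, 65, 75),
        "sedang":        lambda x: trapmf(x, 65, 75, 105, 115),
        "tinggi":        lambda x: trapmf(x, 105, 115, 135, 145),
        "sangat tinggi": lambda x: trapmf(x, 135, 145, INF, INF),
    }

    print(glun_mf["sedang"](112), glun_mf["tinggi"](112))   # 0.3 0.7: overlapping terms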
• GPOST attribute
The GPOST attribute is divided into 4 groups or linguistic terms: rendah (GPOST < 100 mg/dL), sedang (100 mg/dL <= GPOST < 140 mg/dL), tinggi (140 mg/dL <= GPOST < 200 mg/dL), and sangat tinggi (GPOST >= 200 mg/dL) [4]. From this division, the membership functions of the fuzzy sets rendah, sedang, tinggi, and sangat tinggi for the GPOST attribute can be determined separately as:

\mu_{rendah}(x) = \begin{cases} 1 & x < 95 \\ (105 - x)/10 & 95 \le x < 105 \\ 0 & x \ge 105 \end{cases}
\qquad
\mu_{sedang}(x) = \begin{cases} 0 & x < 95 \\ (x - 95)/10 & 95 \le x < 105 \\ 1 & 105 \le x < 135 \\ (145 - x)/10 & 135 \le x < 145 \\ 0 & x \ge 145 \end{cases}

\mu_{tinggi}(x) = \begin{cases} 0 & x < 135 \\ (x - 135)/10 & 135 \le x < 145 \\ 1 & 145 \le x < 195 \\ (205 - x)/10 & 195 \le x < 205 \\ 0 & x \ge 205 \end{cases}
\qquad
\mu_{sangatTinggi}(x) = \begin{cases} 0 & x < 195 \\ (x - 195)/10 & 195 \le x < 205 \\ 1 & x \ge 205 \end{cases}

The fuzzy set for each linguistic term uses a trapezoidal curve, as shown in Figure 2.

Figure 2. Fuzzy sets of the GPOST attribute.
• HDL attribute
The HDL attribute is divided into 3 groups or linguistic terms: rendah (HDL < 40 mg/dL), sedang (40 mg/dL <= HDL < 60 mg/dL), and tinggi (HDL >= 60 mg/dL) [4]. From this division, the membership functions of the fuzzy sets rendah, sedang, and tinggi for the HDL attribute can be determined separately as:

\mu_{rendah}(x) = \begin{cases} 1 & x < 35 \\ (45 - x)/10 & 35 \le x < 45 \\ 0 & x \ge 45 \end{cases}
\qquad
\mu_{sedang}(x) = \begin{cases} 0 & x < 35 \\ (x - 35)/10 & 35 \le x < 45 \\ 1 & 45 \le x < 55 \\ (65 - x)/10 & 55 \le x < 65 \\ 0 & x \ge 65 \end{cases}

\mu_{tinggi}(x) = \begin{cases} 0 & x < 55 \\ (x - 55)/10 & 55 \le x < 65 \\ 1 & x \ge 65 \end{cases}

The fuzzy set for each linguistic term uses a trapezoidal curve, as shown in Figure 3.

Figure 3. Fuzzy sets of the HDL attribute.
• TG attribute
The TG attribute is divided into 3 groups or linguistic terms: rendah (TG < 50 mg/dL), sedang (50 mg/dL <= TG < 150 mg/dL), and tinggi (TG >= 150 mg/dL) [4]. From this division, the membership functions of the fuzzy sets rendah, sedang, and tinggi for the TG attribute can be determined separately as:

\mu_{rendah}(x) = \begin{cases} 1 & x < 45 \\ (55 - x)/10 & 45 \le x < 55 \\ 0 & x \ge 55 \end{cases}
\qquad
\mu_{sedang}(x) = \begin{cases} 0 & x < 45 \\ (x - 45)/10 & 45 \le x < 55 \\ 1 & 55 \le x < 145 \\ (155 - x)/10 & 145 \le x < 155 \\ 0 & x \ge 155 \end{cases}

\mu_{tinggi}(x) = \begin{cases} 0 & x < 145 \\ (x - 145)/10 & 145 \le x < 155 \\ 1 & x \ge 155 \end{cases}

The fuzzy set for each linguistic term uses a trapezoidal curve, as shown in Figure 4.

Figure 4. Fuzzy sets of the TG attribute.
The data of the GLUN, GPOST, HDL, and TG attributes are then transformed into fuzzy form by computing the fuzzy membership degree in each set of the domain of each linguistic attribute (a code sketch follows at the end of this subsection).
• Diagnosa attribute
The Diagnosa attribute, hereafter called CLASS, is represented by two linguistic values, "negatif diabetes" and "positif diabetes". The two linguistic terms are defined as follows:
"negatif diabetes" = 1
"positif diabetes" = 2
The CLASS values categorized as positif diabetes are diagnoses labeled E10 (insulin-dependent diabetes mellitus), E11 (non-insulin-dependent diabetes mellitus), E12 (malnutrition-related diabetes mellitus), and E13 (unspecified diabetes mellitus); all others are categorized as negatif diabetes.
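The transformation step sketched above can be written as follows (our code; glun_mf is the GLUN dictionary sketched after Figure 1, and the example record is made up):

    POSITIVE_CODES = {"E10", "E11", "E12", "E13"}   # diagnosis codes from the paper

    def fuzzify(record, mfs):
        """Turn a crisp lab record into membership degrees per linguistic term."""
        return {attr: {term: mf(record[attr]) for term, mf in mfs[attr].items()}
                for attr in mfs}

    def diagnosis_class(code):
        return "positif diabetes" if code in POSITIVE_CODES else "negatif diabetes"

    # Hypothetical record: fuzzify({"GLUN": 112}, {"GLUN": glun_mf})
    # -> {"GLUN": {"rendah": 0.0, "sedang": 0.3, "tinggi": 0.7, "sangat tinggi": 0.0}}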
B. Data Mining

At this stage, data mining is performed using the FID3 algorithm to build the fuzzy decision tree (FDT). The transformed data are split into a training set and a test set using the 10-fold cross-validation method. The data are divided into 10 disjoint subsets (S1, ..., S10) of equal size. Each time one subset is used as the test set, the other 9 partitions serve as the training set.
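A minimal sketch of the 10-fold split (our code; record indices stand in for the actual data) is:

    def ten_fold(indices):
        """Yield (train, test) index lists for 10-fold cross-validation."""
        folds = [indices[k::10] for k in range(10)]     # 10 disjoint subsets
        for k in range(10):
            test = folds[k]
            train = [i for j in range(10) if j != k for i in folds[j]]
            yield train, test

    # Usage: for train, test in ten_fold(list(range(n_records))): ...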
The training phase builds the FDT using the FID3 algorithm. The training process goes through several stages; as an example, the construction of an FDT is illustrated with a sample training set of 15 records.
The fuzzy ID3 algorithm is applied to this sample training set to find the model or classification rules. The steps for building the classification rules with the FID3 algorithm are:
1. Create the root node from all available training data, setting the membership degree of every record equal to 1.
2. Compute the fuzzy entropy of the training set. The computation gives a fuzzy entropy of 0.9968. This value is used to compute the information gain of each attribute.
3. Compute the information gain of the attributes GLUN, GPOST, HDL, and TG, which yields 0.2064, 0.3330, 0.0304, and 0.0050, respectively. The attribute with the largest information gain, GPOST, is selected. GPOST is then used to expand the tree, i.e. it becomes the root node, but in subsequent sub-nodes GPOST can no longer be used to expand the tree.
4. The training data are expanded on the GPOST attribute, giving the result shown in Figure 5.
5. Compute the proportion of each class at each node. For example, for the sub-node with linguistic value rendah, the class sizes are:
   C_1 = 0.1 + 1 + 1 = 2.1
   C_2 = 1
   Proportion of class 1 (negatif diabetes) = C_1 / (C_1 + C_2) × 100% = 67.74%
   Proportion of class 2 (positif diabetes) = C_2 / (C_1 + C_2) × 100% = 32.26%
6. In this example, a fuzziness control threshold (θr) of 80% and a leaf decision threshold (θn) of 20% × 15 = 3 are used. The two thresholds determine whether a sub-node keeps being expanded. Consider the sub-node with attribute value rendah, which holds 4 records: since the proportions of class 1 (67.74%) and class 2 (32.26%) are both smaller than θr (80%), and the number of records at the sub-node is greater than θn, the sub-node keeps being expanded. If instead θr were 65%, the sub-node would not be expanded.
7. Keep expanding the existing sub-nodes until no data can be expanded further or no attribute remains for expanding the tree, i.e. until the tree has reached its maximum depth or a sub-node no longer satisfies the given thresholds. Once a sub-node can no longer be expanded, the largest class proportion becomes the conclusion of the rule obtained by connecting every node traversed from the root node to the leaf node. Figure 6 shows the fuzzy decision tree obtained from the training set.
Figure 5. Result of expanding the training set on the GPOST attribute.

Figure 6. Fuzzy decision tree obtained from the training set.
The new membership degree of each record at a sub-node is the product of its membership degree at the root node and its membership degree in the attribute used to expand the tree. For example, for the sub-node with attribute value rendah, record no. 38 has membership degree µ_l = 0.1 in the rendah fuzzy set of GPOST and membership degree µ_old = 1 at the root node, so its membership degree at the sub-node, µ_new, is:

µ_new = µ_old ⊗ µ_l = 1 ⊗ 0.1 = 0.1
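The sketch below ties steps 4 and 5 together on made-up data shaped like this example (only record no. 38 and its degree 0.1 come from the paper; the other record numbers and degrees are invented):

    # Three class-1 records (memberships 0.1, 1, 1) and one class-2 record (1.0).
    mu_root = {7: 1.0, 21: 1.0, 38: 1.0, 40: 1.0}           # step 1: all 1 at root
    gpost_rendah = {7: 1.0, 21: 1.0, 38: 0.1, 40: 1.0}      # degrees in GPOST-rendah
    y = {7: "negatif", 21: "negatif", 38: "negatif", 40: "positif"}

    sub = {j: mu_root[j] * gpost_rendah[j] for j in mu_root}  # step 4: mu_new per record
    c1 = sum(m for j, m in sub.items() if y[j] == "negatif")  # 2.1
    c2 = sum(m for j, m in sub.items() if y[j] == "positif")  # 1.0
    print(round(c1 / (c1 + c2) * 100, 2))                     # 67.74 (step 5)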
IV. RESULTS AND DISCUSSION

Following the steps of the FID3 algorithm in Section III, a model consisting of 9 rules was obtained from the training set. The model, i.e. the classification rules obtained, is:
1. IF GPOST rendah AND HDL rendah THEN Negatif Diabetes
2. IF GPOST rendah AND HDL sedang THEN Negatif Diabetes
3. IF GPOST rendah AND HDL tinggi THEN Positif Diabetes
4. IF GPOST sedang THEN Negatif Diabetes
5. IF GPOST tinggi AND GLUN rendah THEN Negatif Diabetes
6. IF GPOST tinggi AND GLUN sedang THEN Negatif Diabetes
7. IF GPOST tinggi AND GLUN tinggi THEN Negatif Diabetes
8. IF GPOST tinggi AND GLUN sangat tinggi THEN Positif Diabetes
9. IF GPOST sangat tinggi THEN Positif Diabetes
Under the 10-fold cross-validation scheme, the training process was run 240 times. For each training set, training was run 24 times: θr was varied over 6 values, namely 75%, 80%, 85%, 90%, 95%, and 98%, and for each value of θr four different values of θn were used, namely 3%, 5%, 8%, and 10%. The average number of rules produced during training and the required execution time are shown in Table 2 and Table 3.
TABLE 2
AVERAGE NUMBER OF RULES

  θr \ θn    3%    5%    8%   10%
  75%         4     4     4     4
  80%         7     7     7     6
  85%        11    10    10     8
  90%        12    11    10     8
  95%        20    18    15    11
  98%        27    24    20    16
The values of θr and θn were chosen based on experimentation, because with these values the number of models or rules produced changes quite significantly. With θr = 70%, the rules obtained do not differ from those at θr = 75%, so 75% was chosen as the lowest value of θr.
Based on the experiments, the higher the value of θr, the more rules are produced. This happens because the tree keeps being expanded until a node is dominated by a single class whose proportion is at or above θr. Table 2 shows that the most significant increases occur when θr is raised from 90% to 95% and from 95% to 98%. The reason is that, at the first expansion of the training set, many sub-nodes already have one class whose proportion exceeds 90%, so those sub-nodes need not be expanded further. Only when θr is raised to 95% are several other sub-nodes expanded, and the class proportions at the nodes below them frequently turn out to be lower than θr, so the training set keeps being expanded until all attributes are used.

A closer look at Table 2 shows that, even when θn is increased, the number of rules produced does not decrease significantly. Our observations indicate that the characteristics of the data in the training sets do not differ much: when the tree is expanded, the data do not spread out much, so the amount of data at a sub-node does not differ greatly from the amount at the root node. In this situation, the condition for stopping tree expansion, namely that the number of records at a sub-node be smaller than θn, is hard to satisfy.

A θr that is too low and/or a θn that is too high produce a small tree, and hence very few rules, because the tree under construction is pruned while the model is still learning the structure of the training set. Conversely, a θr that is too high and/or a θn that is too low can make the fuzzy decision tree behave like an ordinary decision tree that requires no thresholds, producing a very large tree and very many rules, because the tree keeps being expanded down to the deepest leaf nodes.
TABLE 3
AVERAGE EXECUTION TIME IN SECONDS

  θr \ θn     3%      5%      8%     10%
  75%       0.125   0.128   0.120   0.127
  80%       0.239   0.227   0.214   0.183
  85%       0.344   0.334   0.308   0.244
  90%       0.404   0.361   0.328   0.264
  95%       0.717   0.594   0.492   0.356
  98%       0.955   0.842   0.712   0.544

Figure 7. Comparison of the average number of rules.
Figure 7 compares the average number of rules produced during training. It can be seen that the higher the value of θr, the larger the number of rules produced; conversely, the higher the value of θn, the smaller the number of rules tends to be.
Figure 8. Comparison of the average execution time of the training process.
Looking at Figures 7 and 8, it can be concluded that the higher the value of θr, the larger the number of rules produced, and hence the longer it takes to generate those rules, because more work must be done to build the tree.
To measure the accuracy of the models produced in the training phase, the testing process was run 240 times. Testing is done by feeding the rules obtained from the training process into a Mamdani FIS [7] to determine the class of each record in the test set. For every training run, one testing run is performed.
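As a simplified stand-in for this testing step (the paper uses a Mamdani FIS [7]; our sketch merely fires each rule with the product of its antecedent degrees and returns the class of the strongest rule):

    RULES = [  # two of the nine rules above, for illustration
        ({"GPOST": "rendah", "HDL": "rendah"}, "Negatif Diabetes"),
        ({"GPOST": "sangat tinggi"}, "Positif Diabetes"),
    ]

    def classify(fuzzy_record, rules):
        """Return the class of the rule with the strongest firing strength."""
        best_label, best_strength = None, -1.0
        for antecedent, label in rules:
            strength = 1.0
            for attr, term in antecedent.items():
                strength *= fuzzy_record.get(attr, {}).get(term, 0.0)
            if strength > best_strength:
                best_label, best_strength = label, strength
        return best_label

    # Hypothetical fuzzified test record:
    print(classify({"GPOST": {"rendah": 0.8, "sangat tinggi": 0.0},
                    "HDL": {"rendah": 0.5}}, RULES))   # Negatif Diabetes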
Looking at the testing results, changing θr and θn did not change the accuracy of the existing models, even though the number of rules produced by the training process increased. This happened because, by coincidence, all the rules in the two models involved had the same target class, namely negatif diabetes, and the test set also came from that same target class, so testing the two models on the same test set gave unchanged results. A model with uniform rules like this, however, cannot be used for prediction, because whatever attribute values are given, the outcome will always be negatif diabetes. The testing results would differ if the rules of the resulting model had different target classes. The model for the first training set with θr = 75% and θn = 3% is:

IF GPOST rendah THEN Negatif Diabetes
IF GPOST sedang THEN Negatif Diabetes
IF GPOST tinggi THEN Negatif Diabetes
IF GPOST sangat tinggi THEN Negatif Diabetes
On the seventh and eighth test sets, accuracy dropped when θr was increased to 80%. Based on our observations, this situation is called overfitting, caused by a θr that is too high for those training sets: the tree keeps being expanded until it fits the training set exactly. As a result, the tree has nodes containing misclassified data.
V. PERFORMANCE EVALUATION OF FID3

The performance of the FID3 algorithm is evaluated by computing the average accuracy of the whole testing process over the 10 different test sets. The performance of the FID3 algorithm at the different values of θr and θn is shown in Table 4.

TABLE 4
ACCURACY OF THE FUZZY DECISION TREE (%)

  θr \ θn      3%      5%      8%     10%
  75%        94.14   94.14   94.15   94.15
  80%        92.07   92.07   93.45   93.45
  85%        92.07   92.07   93.45   93.45
  90%        92.07   92.07   93.45   93.45
  95%        90.69   91.73   93.10   93.45
  98%        90.69   91.73   93.10   93.45
Table 4 shows that the performance of the FID3 algorithm decreases as θr grows and/or θn shrinks, although the decrease is not significant and can still be tolerated. As explained earlier, this condition is caused by the overfitting phenomenon. The best accuracy, 94.15%, is obtained at θr = 75% and θn = 8% or 10%.
VI. CONCLUSION

From the various experiments conducted on the diabetes data, it can be concluded that the FID3 algorithm performs well in building a fuzzy decision tree for the given diabetes data. The best accuracy of the model, 94.15%, is obtained at a fuzziness control threshold (θr) = 75% and a leaf decision threshold (θn) = 8% or 10%. The values of θr and θn strongly affect the number of rules produced; a θr that is too high lowers the accuracy, and, on the other hand, a θn that is too low can also lower the accuracy.
REFERENCES
[1] E. Cox, Fuzzy Modeling and Algorithms for Data Mining and Exploration. USA: Academic Press, 2005.
[2] L. Fu, Neural Networks in Computer Intelligence. Singapore: McGraw-Hill, 1994.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Simon Fraser University. USA: Morgan Kaufmann, 2006.
[4] Herwanto, Pengembangan Sistem Data Mining untuk Diagnosis Penyakit Diabetes Menggunakan Algoritme Classification Based Association [Thesis]. Bogor: Departemen Ilmu Komputer, Fakultas Matematika dan Ilmu Pengetahuan Alam, Institut Pertanian Bogor, 2006.
[5] G. Liang, A Comparative Study of Three Decision Tree Algorithms: ID3, Fuzzy ID3 and Probabilistic Fuzzy ID3. Informatics & Economics, Erasmus University Rotterdam, Rotterdam, the Netherlands, 2005.
[6] C. Marsala, Application of Fuzzy Rule Induction to Data Mining. University Pierre et Marie Curie, 1998.
[7] L. Ormos, Soft Computing Method on Thom's Catastrophe Theory for Controlling of Large-Scale Systems. University of Miskolc, Department of Automation, 2004.
[8] Y. Yuan and M. J. Shaw, "Induction of fuzzy decision trees," Fuzzy Sets and Systems, vol. 69, 1995.
Firat Romansyah was born in Indonesia in 1985. He received his bachelor's degree in computer science from the Department of Computer Science, Institut Pertanian Bogor, Indonesia, in 2007. He currently works as an Associate Business Consultant (IT Consultant) at PT AGIT (Astra Graphia Information Technology).

Imas Sukaesih Sitanggang was born in Indonesia in 1975. She received her bachelor's degree in mathematics from the Department of Mathematics, Institut Pertanian Bogor, Indonesia, in 1997 and her master's degree in computer science from Universitas Gadjah Mada, Indonesia, in 2002. She currently works as a lecturer in the Department of Computer Science, Institut Pertanian Bogor, Indonesia. Her main research topic is spatial data mining.

Sri Nurdiati was born in Indonesia in 1960. She received her bachelor's degree in statistics from Institut Pertanian Bogor, Indonesia, in 1984 and her master's degree in computer science from The University of Western Ontario, Canada, in 1991. In 2005 she received her doctorate in applied mathematics from Twente University, the Netherlands. She currently works as a lecturer in the Department of Computer Science, Institut Pertanian Bogor, Indonesia. Her main research topic is scientific computing.
Internetworking Indonesia Journal
The Indonesian Journal of ICT and Internet Development
ISSN: 1942-9703
About the Internetworking Indonesia Journal
The Internetworking Indonesia Journal (IIJ) was established out of the need to address the lack of an Indonesia-wide independent
academic and professional journal covering the broad area of Information and Communication Technology (ICT) and Internet
development in Indonesia.
The broad aims of the Internetworking Indonesia Journal (IIJ) are therefore as follows:
• Provide an Indonesia-wide independent journal on ICT: The IIJ seeks to be an Indonesia-wide journal independent from
any specific institutions in Indonesia (such as universities and government bodies). Currently in Indonesia there are numerous
university-issued journals that publish papers only from the respective universities. Often these university journals experience
difficulty in maintaining sustainability due to the limited number of internally authored papers. Additionally, most of these
university-issued journals do not have an independent review and advisory board, and most do not have referees and reviewers
from the international community.
• Provide a publishing venue for graduate students: The IIJ seeks also to be a publishing venue for graduate students (such as
Masters/S2 and PhD/S3 students) as well as working academics in the broad field of ICT. This includes graduate students from
Indonesian universities and those studying abroad. The IIJ provides an avenue for these students to publish their papers to a
journal which is international in its reviewer scope and in its advisory board.
• Improve the quality of research & publications on ICT in Indonesia: One of the long term goals of the IIJ is to promote
research on ICT and Internet development in Indonesia, and over time to improve the quality of academic and technical publications from the Indonesian ICT community. Additionally, the IIJ seeks to be the main publication venue for various authors
worldwide whose interest include Indonesia and its growing area of information and communication technology.
• Provide access to academics and professionals overseas: The Editorial Advisory Board (EAB) of the IIJ is intentionally composed of international academics and professionals, as well as those from Indonesia. The aim here is to provide Indonesian
authors with access to international academics and professionals who are aware of and sensitive to the issues facing a developing nation. Similarly, the IIJ seeks to provide readers worldwide with easy access to information regarding ICT and Internet
development in Indonesia.
• Promote the culture of writing and authorship: The IIJ seeks to promote the culture of writing and of excellent authorship
in Indonesia within the broad area of ICT. It is for this reason that the IIJ is bilingual in that it accepts and publishes papers in
either English or Bahasa Indonesia. Furthermore, the availability of an Indonesia-wide journal with an international advisory
board may provide an incentive for young academics, students and professionals in Indonesia to develop writing skills appropriate for a journal. It is hoped that this in-turn may encourage them to subsequently publish in other international journals in
the future.
Internetworking Indonesia Journal
ISSN: 1942-9703
www.InternetworkingIndonesia.org