CLASSIFICATION OF BREAST CANCER MICROARRAY DATA USING
RADIAL BASIS FUNCTION NETWORK
UMI HANIM BINTI MAZLAN
UNIVERSITI TEKNOLOGI MALAYSIA
CLASSIFICATION OF BREAST CANCER MICROARRAY DATA USING
RADIAL BASIS FUNCTION NETWORK
UMI HANIM BINTI MAZLAN
A project report submitted in partial fulfillment of the
requirements for the award of the degree of
Master of Science (Computer Science)
Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia
October 2009
To my beloved dad, my lovely mum and my precious siblings
ACKNOWLEDGMENT
In the name of Allah, Most Gracious, Most Merciful
Praise to The Almighty, Allah S.W.T., for giving me the strength and passion to complete this research throughout the semester. By His will and blessing, I managed to complete my work within the given period.
First of all, my thanks go to Dr. Puteh Bt Saad, who was responsible as my Master Project's supervisor, for her guidance, counsel, motivation, and most importantly for sharing her thoughts with me.
My deepest thanks go to my lovely family for their endless prayers, love and support. Their moral support, always right by my side, has been the inspiration along this journey and is highly appreciated.
I would also like to extend my heartfelt appreciation to all my friends for
their valuable assistance and friendship.
May Allah S.W.T repay all their sacrifices and deeds with His Mercy and Bless.
ABSTRACT
Breast cancer is the number one killer disease among women worldwide. Although this disease may affect both women and men, the rate of incidence and the number of deaths are higher among women than among men. Early detection of breast cancer helps to increase the chance of survival, since early treatment can be decided for the patients who suffer from this disease. Microarray technology has been applied in the medical area for the classification of cancers and other diseases. Using a microarray, the expression of thousands of genes can be determined simultaneously. However, microarray data suffer from several drawbacks, such as high dimensionality and the presence of irrelevant genes. Therefore, various feature selection techniques have been developed in order to reduce the dimensionality of the microarray data and to select only the appropriate genes. For this study, breast cancer microarray data obtained from the Centre for Computational Intelligence are used in the experiments. The Relief-F algorithm has been chosen as the feature selection method. For comparison, two other feature selection methods, Information Gain and Chi-Square, are also used in the experiments. The Radial Basis Function (RBF) network is used as the classifier to distinguish between cancerous and non-cancerous cells. The accuracy of the classification is evaluated using the chosen metric, namely the Receiver Operating Characteristic (ROC).
ABSTRAK

Breast cancer is the number one killer among women worldwide. Although this disease can strike both women and men, the rates of incidence and death are higher among women than among men. Early detection of breast cancer can help increase the chance of survival, since early treatment can be recommended for patients suffering from this disease. The emerging microarray technology has been applied in the medical field for the classification of cancer. Using a microarray, the expression of thousands of genes can be determined simultaneously. However, the microarray faces several weaknesses, such as high dimensionality and the presence of irrelevant genes. For that reason, various feature selection techniques have been developed to reduce the dimensionality of the microarray and to select only the suitable genes. For this study, the breast cancer microarray data used in the experiments were obtained from the Centre for Computational Intelligence. The ReliefF algorithm has been chosen as the feature selection technique. For comparison, two other feature selection techniques, Information Gain and Chi-Square, will also be used in the experiments. To distinguish between cancerous and non-cancerous cells, the Radial Basis Function (RBF) network will be used as the classifier. The accuracy of the classification will be evaluated with the chosen metric, the Receiver Operating Characteristic (ROC).
TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF APPENDICES

1  INTRODUCTION
   1.1  Introduction
   1.2  Problem Background
   1.3  Problem Statement
   1.4  Objectives of the Project
   1.5  Scope of the Project
   1.6  Importance of the Study
   1.7  Summary

2  LITERATURE REVIEW
   2.1  Introduction
   2.2  Breast Cancer
        2.2.1  What is Breast Cancer
        2.2.2  Types of Breast Cancer
               2.2.2.1  Non-invasive Breast Cancer
               2.2.2.2  Invasive Breast Cancer
        2.2.3  Screening for Breast Cancer
        2.2.4  Diagnosis of Breast Cancer
               2.2.4.1  Biopsy
   2.3  Microarray Technology
        2.3.1  Definition
        2.3.2  Applications of Microarray
               2.3.2.1  Gene Expression Profiling
               2.3.2.2  Data Analysis
   2.4  Feature Selection
        2.4.1  Relief
        2.4.2  Information Gain
        2.4.3  Chi-Square
   2.5  Radial Basis Function
        2.5.1  Types of Radial Basis Function, RBF
        2.5.2  Radial Basis Function, RBF, Network
   2.6  Classification
        2.6.1  Comparison with other Classifiers
   2.7  Receiver Operating Characteristic, ROC
   2.8  Summary

3  METHODOLOGY
   3.1  Introduction
   3.2  Research Framework
   3.3  Software Requirement
   3.4  Data Source
   3.5  HykGene
        3.5.1  Gene Ranking
               3.5.1.1  ReliefF Algorithm
               3.5.1.2  Information Gain
               3.5.1.3  χ²-statistic
        3.5.2  Classification
               3.5.2.1  k-nearest neighbor
               3.5.2.2  Support vector machine
               3.5.2.3  C4.5 decision tree
               3.5.2.4  Naive Bayes
   3.6  Classification using Radial Basis Function Network
   3.7  Evaluation of the Classifier
   3.8  Summary

4  DESIGN AND IMPLEMENTATION
   4.1  Introduction
   4.2  Data Acquisition
        4.2.1  Data Format
   4.3  Experiments
        4.3.1  Genes Ranked
               4.3.1.1  Experimental Settings
        4.3.2  Classifications of Top-ranked Genes
               4.3.2.1  Experimental Settings
        4.3.3  Evaluation of Classifier Performances
               4.3.3.1  Experimental Settings
   4.4  Summary

5  EXPERIMENTAL RESULTS ANALYSIS
   5.1  Introduction
   5.2  Results Discussion
        5.2.1  Analysis Results of Feature Selection Techniques
        5.2.2  Analysis Results of Classifications
        5.2.3  Analysis Results of Classifier Performances
   5.3  Summary

6  DISCUSSIONS AND CONCLUSION
   6.1  Introduction
   6.2  Overview
   6.3  Achievements and Contributions
   6.4  Research Problem
   6.5  Suggestions to Enhance the Research
   6.6  Conclusion

REFERENCES
APPENDIX A
APPENDIX B
LIST OF TABLES

TABLE NO   TITLE
2.1        Previous researches using microarray technology
2.2        A taxonomy of feature selection techniques
2.3        Previous researches on classification using RBF network
4.1        EST-contigs, GenBank accession number and oligonucleotide probe sequence
5.1        Results on breast cancer dataset using Relief-F
5.2        Results on breast cancer using Information Gain
5.3        Results on breast cancer using χ²-statistic
5.4        Results on classification using different classifiers
5.5        TPR and FPR of classifiers
5.6        ROC Area of Classifiers
LIST OF FIGURES

FIGURE NO   TITLE
1.1         Ten most frequent cancers in females, Peninsular Malaysia 2003-2005
2.1         Ductal Carcinoma In-situ, DCIS
2.2         The cancer cells spread outside the duct and invade nearby breast tissue
2.3         Gene Expression Profiling
2.4         Feature selection process
2.5         Gaussian RBF
2.6         Multiquadric RBF
2.7         Gaussian (green), Multiquadric (magenta), Inverse-multiquadric (red) and Cauchy (cyan) RBF
2.8         The traditional radial basis function network
2.9         The ROC curve
3.1         Research Framework
3.2         HykGene: a hybrid system for marker gene selection
3.3         The expanded framework of HykGene in Step 1
3.4         Original Relief Algorithm
3.5         Relief-F algorithm
3.6         χ²-statistic algorithm
3.7         The expanded framework of HykGene in Step 4
3.8         Radial Basis Function Network architecture
3.9         AUC algorithm
4.1         The data matrix
4.2         Fifteen of the instances of the breast cancer microarray dataset
4.3         Microarray Image
4.4         Header of the ARFF file
4.5         Data of the ARFF file
4.6         ROC Space
4.7         Framework for plotting multiple ROC curves using Weka
5.1         Graph of cross-validation classification accuracy using 50 top-ranked genes
5.2         Graph of cross-validation classification accuracy using 100 top-ranked genes
5.3         Percentage of Correctly Classified Instances using 50 top-ranked genes
5.4         Percentage of Correctly Classified Instances using 100 top-ranked genes
5.5         ROC Space of 50 top-ranked genes
5.6         ROC Space of 100 top-ranked genes
5.7         ROC Curve of 50 top-ranked genes
5.8         ROC Curve of 100 top-ranked genes
LIST OF ABBREVIATIONS

AUC   -  Area Under Curve
ARFF  -  Attribute-Relation File Format
cDNA  -  Complementary Deoxyribonucleic Acid
cRNA  -  Complementary Ribonucleic Acid
DNA   -  Deoxyribonucleic Acid
EST   -  Expressed Sequence Tag
FPR   -  False Positive Rate
k-NN  -  k-nearest neighbor
MLP   -  Multilayer Perceptron
mRNA  -  Messenger Ribonucleic Acid
NB    -  Naive Bayes
RBF   -  Radial Basis Function
RNA   -  Ribonucleic Acid
ROC   -  Receiver Operating Characteristic
SOM   -  Self Organising Maps
SVM   -  Support Vector Machine
TPR   -  True Positive Rate
LIST OF APPENDICES

APPENDIX   TITLE
A          Project 1 Gantt Chart
B          Project 2 Gantt Chart
CHAPTER 1
INTRODUCTION
1.1
Introduction
Cancer, known as malignant neoplasm in medical terms, is the number one killer disease in most countries worldwide. The number of deaths and of people suffering from cancer increases year by year, and it will continue to increase, with approximately 12 million deaths expected in 2030 (Farzana Kabir Ahmad, 2008). This killer disease may affect people of all ages, even fetuses, and also animals, but the risk increases with age. According to the National Cancer Institute, cancer is defined as a disease caused by the uncontrolled division of abnormal cells. Cancer cells are capable of invading other tissues and can spread through the blood and lymph systems to other parts of the body.
There are various factors that can cause cancer, namely mutation, viral or bacterial infection, hormonal imbalances and others. Mutation can be classified into two categories: chemical carcinogens and ionizing radiation. Carcinogens are mutagens, substances that cause the DNA mutations which lead to cancer. Tobacco smoking, which accounts for 90 percent of lung cancer (Biesalski HK, Bueno de Mesquita B, Chesson A, et al., 1998), and prolonged exposure to asbestos (O'Reilly KM, Mclaughlin AM, Beckett WS, Sime PJ, 2007) are examples of such substances.
Radon gas, which can cause cancer, and prolonged exposure to ultraviolet radiation from the sun, which can lead to melanoma and other skin malignancies (English DR, Armstrong BK, Kricker A, Fleming C, 1997), are sources of such radiation. Based on experimental and epidemiological data, viruses are the second most important risk factor in cancer development in humans (zur Hausen H, 1991). Hormonal imbalances can lead to cancer because some hormones are able to behave similarly to non-mutagenic carcinogens. This condition may stimulate uncontrolled cell growth.
Cancer can be classified into various types with various names, mostly named based on the part of the body where the cancer originates (American Cancer Society). Lung cancer, breast cancer, ovarian cancer and colon cancer are several examples of the common cancer types. Breast cancer is a disease feared by every woman, since it is the number one killer cancer for women. Breast cancer remains a major health burden, since it is a major cause of cancer-related morbidity and mortality among females worldwide (Bray, D., and Parkin, 2004).
It has been reported that, each year, more than one million new cases of female breast cancer are diagnosed worldwide (Ferlay J, Bray F, Pisani P, Parkin DM, 2001). In the United States of America, an estimated 192,370 new cases of invasive breast cancer and an estimated 40,610 breast cancer deaths are expected in 2009 (American Cancer Society, 2009). Figure 1.1 shows the percentage of the ten most frequent cancers in females in Malaysia from 2003 to 2005. From the figure, it is evident that breast cancer holds the highest percentage among all the cancers in females.
As discussed in this section, it is evident that breast cancer is one of the main health problems. Much research has been done to study the best ways to prevent, detect and treat breast cancer in order to reduce the number of deaths. This research also studies the detection of breast cancer using microarray data. The breast cancer microarray data will be classified to determine whether a sample is benign (non-cancerous) or malignant (cancerous) using the Radial Basis Function, RBF, network, after performing feature selection.
Figure 1.1: Ten most frequent cancers in females, Peninsular Malaysia 2003-2005
(Lim, G.C.C., Rampal, S., and Yahaya, H., 2008)
1.2
Problem Background
This project will use breast cancer microarray data. According to Tinker et al. (2006), the microarray has become a standard tool in many genomic research laboratories because it has revolutionized the approach to biology research. Scientists can now study thousands of genes at once instead of working on a single-gene basis. However, microarray data are often overwhelming, prone to overfitting and confounded by the complexity of data analysis. Using such overwhelming data during classification increases the dimensionality of the classification problem, which presents computational difficulties and introduces unnecessary noise into the process.
Other than that, due to improper scanning, the data also contain multiple missing gene expression values. In addition, data mislabeled by experts or tissues of questionable origin are further drawbacks of the microarray that decrease the accuracy of the results (Furey et al., 2000). However, feature selection can be utilized to filter the unnecessary genes before the classification stage is implemented. Based on Farzana Kabir Ahmad (2008) (citing Ben-Dor et al., 2000; Guyon et al., 2002; Li, Y. et al., 2002; Tago & Hanai, 2003), feature selection has become a prior step and a prerequisite in many large-scale gene expression data analyses. The high reliance on this gene selection technique is due to its important purpose: selecting small sets of genes that are informative enough to differentiate between cancerous and non-cancerous cells.
1.3
Problem Statement
The main issue in the classification of microarray data is the presence of noise and irrelevant genes. According to Liu and Iba (2002), many genes are not relevant to differentiating between classes and cause noise in the classification process. Hence, this leads to the drowning out of the relevant genes (Shen et al., 2007). Therefore, the step prior to the classification, namely feature selection, is important in order to reduce the huge dimensionality of microarray data.
1.4
Objectives of the Project
The aim of the project is to classify selected significant genes from breast cancer microarray data. In order to achieve this aim, the objectives are as follows:
i. To select significant features from the microarray data using a suitable feature selection technique.
ii. To classify the selected features into non-cancerous and cancerous classes using the Radial Basis Function network.
iii. To evaluate the performance of the classifier using the Receiver Operating Characteristic, ROC.
1.5
Scope of the Project
This project will use breast cancer microarray data obtained from the Centre for Computational Intelligence. The data originate from Zhu, Z., Ong, Y.S., and Dash, M. (2007). The dataset contains 97 instances with 24,481 genes and 2 classes.
1.6
Importance of the Study
Early detection of breast cancer can help to keep the cancer from getting worse and eventually leads to a decrease in the rate of death. The mammogram is one of the tools that can detect breast cancer at an early stage. However, a biopsy is needed in order to confirm the presence of malignant (cancerous) tissue, which cannot be done by the mammogram alone. From this biopsy process, the microarray data will be obtained.
As discussed in the problem background, there are several drawbacks of microarray data. However, they can be addressed by implementing feature selection before proceeding to the classification. In this study, several feature selection methods will be studied before choosing the most suitable one. Feature selection is very important, since it determines the accuracy of the result after the classification.
The accuracy of the result is also important because it decides the classification between the cancerous and the non-cancerous. Since the purpose of the biopsy is to confirm the presence of cancer cells after early detection, the confirmation must not take a long time. The best feature selection method must be used before classifying the microarray data in order to reduce the time and produce an accurate result.
Through this project, the best feature selection method will be selected and applied to the microarray data before classification. The performance of the classifiers will be evaluated based on a suitable metric. The results of this study will be discussed in order to provide information for future research in the breast cancer area.
1.7
Summary
In conclusion, this chapter discusses the overview of the project. This project will classify the breast cancer microarray data using the Radial Basis Function network into two classes, namely benign (non-cancerous) and malignant (cancerous). Feature selection methods will be studied, and the best method will be chosen and applied to the microarray data before the classification.
CHAPTER 2
LITERATURE REVIEW
2.1
Introduction
Breast cancer can affect both genders, but women have a higher risk compared to men. Early detection of breast cancer can ensure a better chance of curing the disease at an early stage. There are three types of screening tools for the early detection of breast cancer. The best tool is the screening mammogram, which produces X-ray pictures. However, cancer cells cannot be confirmed via the mammogram alone, so a diagnosis is required to detect the existence of cancer cells. The microarray data obtained from the biopsy can be used to distinguish benign (non-cancerous) from malignant (cancerous) tissue. Because the microarray data also contain unnecessary genes, feature selection needs to be implemented in order to filter them before the classification can be made using the Radial Basis Function (RBF) technique.
In this chapter, five main subtopics will be discussed: Breast Cancer, Microarray Technology, Feature Selection, the Radial Basis Function technique and the Receiver Operating Characteristic, ROC. The first subtopic explains the types, screening tools and diagnosis of breast cancer. The microarray data produced by the biopsy will be discussed in the following subtopic. Next, feature selection methods will be discussed to determine the best method to be implemented in this study. Finally, the classification technique, followed by the evaluation of the classifier using the chosen metric, will be discussed. Furthermore, several previous researches will be briefly explained for comparison.
2.2
Breast Cancer
2.2.1
What is Breast Cancer
According to the National Breast Cancer Foundation and the American Cancer Society, breast cancer is a disease in which a malignant tumor, which is a group of cancer cells, forms in the tissue or cells of the breast. The malignant tumor is able to invade surrounding tissues or spread to further areas of the body, which is also called metastasis. Women have a higher risk of developing the cancer compared to men.
2.2.2
Types of Breast Cancer
Breast cancer can be categorized by the area where the cancer cells begin to develop: the ducts or the lobules, in which milk is produced. The different types of breast cancer can be classified into the two most common groups, which are non-invasive and invasive breast cancer. Besides the invasive and non-invasive types, there are many other common types of breast cancer.
2.2.2.1 Non-invasive Breast Cancer
The term non-invasive, also known as in-situ, in breast cancer means that the cancer cells remain confined to the ducts or lobules. This category of breast cancer does not invade the surrounding tissue of the breast or spread to other parts of the body. There are two types of breast cancer that belong to this group:
i)
Ductal Carcinoma In-situ, DCIS
This type of breast cancer is the early stage of breast cancer, in which the cancer cells stay in the milk duct of the breast where they started. Although it can grow, it does not affect the surrounding breast tissue or spread to lymph nodes and other organs.
Figure 2.1: Ductal Carcinoma In-situ, DCIS
ii)
Lobular Carcinoma In-situ, LCIS
LCIS begins with abnormal cells in the milk-producing glands at the end of the breast ducts. Even though it has been classified as non-invasive, it tends to develop into invasive breast cancer.
2.2.2.2 Invasive Breast Cancer
Invasive or infiltrating breast cancer is the contrary of non-invasive or in-situ breast cancer. Like the non-invasive category, this category also has two types of breast cancer:
Figure 2.2: The cancer cells spread outside the duct and invade nearby breast
tissue
i)
Invasive Ductal Carcinoma, IDC
Invasive ductal carcinoma is the most common type of breast cancer. It begins in the duct of the breast; the cancer cells then spread through the wall of the ducts and grow into the fatty tissue of the breast. Later on, it can spread to the lymph nodes and to other parts of the body.
ii)
Invasive Lobular Carcinoma, ILC
Like the non-invasive lobular carcinoma, invasive lobular carcinoma also begins in the milk-producing glands. However, ILC can spread beyond the lobules (milk-producing glands) to other parts of the body.
2.2.3
Screening for Breast Cancer
There are three types of screening tests for breast cancer that can detect the cancer at an early stage. The three screening tests are as follows (National Cancer Institute):
i)
Screening Mammogram
The mammogram is a screening tool that produces pictures of the breast from X-rays, which can help in the early detection of breast cancer. This screening tool can often show a breast lump before it can be felt, and also clusters of tiny specks of calcium known as microcalcifications. A cancer, a precancerous condition or another condition can form from such a lump or specks.
Although mammograms can help in the early detection of breast cancer, they also have several drawbacks, which are:
a)
False negatives, because the mammogram may miss some cancers.
b)
False positives, because the mammogram may show a result that is not cancer.
ii)
Clinical Breast Exam
The clinical breast exam is done by a doctor, and during the process differences in the size or shape of the patient's breasts are inspected. The doctor examines the breast skin for signs of rash, dimpling or other abnormal signs. Other examination methods are squeezing the nipple to check for fluid and palpating with the pads of the fingers. The entire breast, including the underarm and collarbone area, is checked for the existence of any lumps. The lymph nodes are also inspected to ensure that they are of normal size.
iii)
Breast Self-Exam
Screening for breast cancer can be done by any individual once a month. This breast self-exam can be carried out, even while in the shower, with several simple steps to detect any changes in the breast. However, studies have shown that the reduction in the number of deaths from breast cancer is not solely attributable to the breast self-exam.
2.2.4
Diagnosis of Breast Cancer
There are three procedures in the early detection of breast cancer. If any differences are discovered during the screening process, a diagnosis stage is required. Mammography is the most effective tool to detect breast cancer at an early stage (American Cancer Society), although it has a few disadvantages. Because of these disadvantages, a follow-up test is required to confirm the presence of the cancer. The biopsy is the follow-up test that can help to diagnose the existence of malignant (cancerous) tissue.
2.2.4.1 Biopsy
A biopsy is a minor surgery to remove fluid or tissue from the breast to confirm the presence of the cancer. The biopsy can be done using three techniques (National Cancer Institute):
i)
Fine-needle aspiration
A thin needle is used to remove fluid from a breast lump. A lab test is then run to determine whether the fluid contains cancer cells.
ii)
Core biopsy / needle biopsy
A thick needle is used to remove breast tissue, and the cells are examined for cancer afterwards.
iii)
Surgical biopsy
A sample of tissue is checked for cancer cells after it has been removed through surgery. There are two types of surgical biopsy: the incisional biopsy, in which a sample of a lump or abnormal area is taken, and the excisional biopsy, in which the entire lump or area is taken.
2.3
Microarray Technology
2.3.1
Definition
Microarray technology is defined as a method to study the activity of a huge number of genes simultaneously. High-speed robotics are used in this method to precisely apply tiny droplets containing biologically active materials, such as DNA, proteins and chemicals, to glass slides. In functional genomics, DNA microarrays are used to monitor gene expression at the RNA level. Furthermore, they are also used to identify genetic mutations and to discover novel regulatory areas in genomic sequences. With protein or chemical materials, microarrays are used to monitor protein expression or to identify protein-protein or drug-protein interactions. Normally, tens of thousands of gene interactions can be monitored in a single experiment (Compton, R., et al., 2008).
2.3.2
Applications of Microarray
Microarrays were first applied in the mid-90s (Schena, M., et al., 1995). Since then, microarray technology has been successfully applied to almost every field of biomedical research (Alizadeh AA et al., 2000; Miki R., et al., 2001; Chu S., et al., 1998; Whitfield M.L., et al., 2002; van de Vijver MJ et al., 2002; Boldrick JC et al., 2002). Gene expression profiling is the most common application of microarray technology (Russell Compton et al., 2008). Other applications of microarray technology are genotyping of human single nucleotide polymorphisms (SNPs), microarray analysis of chromatin immunoprecipitation (ChIP-on-chip) and protein arrays.
Microarrays have opened up an entirely new world to researchers (Fodor, S.P. et al., 1991; Fodor, S.P. et al., 1993; Pease, A.C. et al., 1994). Researchers are no longer restricted to studying a single aspect of the host or pathogen relationship; instead, one can explore a genome-wide view of the complex interaction. This helps scientists to discover and understand disease pathways and ultimately to develop better methods of detection, treatment and prevention. Microarray technology has provided an understanding of tumor subtypes and their attendant clinical outcomes from the molecular phenotyping of human tumors (Abu Khalaf, M.M. et al., 2007). In breast cancer studies, it has been reported that invasive breast cancer can be divided into four main subtypes through unsupervised analysis using microarray technology (Sorlie, T. et al., 2001). In recent studies, a subset of genes identified by microarray profiling has been suggested to predict the response to a class of chemotherapy in breast cancer, the taxanes (Abu Khalaf, M.M. et al., 2007).
Table 2.1 lists previous researches using microarray technology to detect, cure, or prevent various diseases including breast cancer.
Table 2.1: Previous researches using microarray technology

Author                                           Disease
Keltouma Driouch et al., 2007                    Breast cancer
Lance D Miller and Edison T Liu, 2007            Breast cancer
Sofia K Gruvbeger-Saal et al., 2006              Breast cancer
Argryis Tzouvelekesis et al., 2004               Pulmonary
Gustavo Palacios et al., 2007                    Infectious
Christian Harwanegg and Reinhard Hiller, 2005    Allergic
2.3.2.1 Gene Expression Profiling
In gene expression profiling, also known as an mRNA experiment, the expression levels of thousands of genes are monitored simultaneously to study the effects of certain treatments, diseases and developmental stages on gene expression (Adomas A. et al., 2008). Typically, microarrays consist of several thousand single-stranded DNA sequences, which are "arrayed" at specific locations on a synthetic "chip" via covalent linkage. These DNA fragments provide a matrix of probes for fluorescently labeled complementary RNA (cRNA) derived from the sample of interest. Each gene's expression is quantified by the intensity of the fluorescence emitted from the specific location on the array matrix, which is proportional to the amount of the gene product (Sauter, G. and Simon, R., 2002). There are two main types of microarrays used to evaluate clinical samples (Ramaswamy, S. and Golub TR, 2002):
i)
cDNA arrays
This type of array, also known as spotted arrays, is built using annotated cDNA probes or by reverse transcription of mRNA from a known source. The spotted arrays use cDNA probes for specific gene products that are spotted onto a glass slide systematically (Ramaswamy, S. and Golub TR, 2002). Using human Universal RNA Standards, the reference sample is extracted and reverse transcribed in parallel with the test samples. It is labeled with the fluorescent dye Cy3, while the cRNA from the test sample is labeled with the fluorescent dye Cy5. Each Cy5-labeled test sample cRNA is hybridized simultaneously with the Cy3-labeled reference probe to the spotted microarray. From the hybridization, the fluorescence intensities of the scanned images are quantified, normalized and corrected to produce the transcript abundance of a gene as an intensity ratio with respect to the reference sample (Cheung VG et al., 1999).
ii)
Oligonucleotide microarrays
In oligonucleotide microarrays, 25 base pair DNA oligomers are printed onto the array matrix (Lockhart DJ et al., 1996). This array, designed and patented by Affymetrix, uses a combination of photolithography and combinatorial chemistry, which allows thousands of probes to be generated simultaneously. In this array, the Perfect Match/Mismatch (PM/MM) probe strategy is used to account for nonspecific hybridization. With this strategy, a second probe, identical to the first except for a mismatched base, is laid adjacent to the PM probe. The signal from the MM probe is subtracted from the PM probe signal to correct for nonspecific binding.
Figure 2.3: Gene Expression Profiling
Reprinted from Sauter, G., and Simon, R. (2002)
2.3.2.2 Data Analysis
In gene expression profiling, four levels of analysis have been applied (Compton, R., et al., 2008):
i)
Using statistical tests to identify differences in gene expression between two or more samples.
ii)
Using data reduction techniques to simplify the complexity of the dataset; clusters of genes with similar expression profiles are identified across different experimental conditions.
iii)
Using a statistical model, such as a classifier, to identify genes predictive of the sample classes (cancerous and non-cancerous) or of the expression of an interesting biomarker. Evaluating the accuracy of the model is very important in this statistical modeling. The holdout method is the classic approach, in which the correct class labels are known and the data are divided into a training set and a test set: the training set builds the model, while the test set provides a group of samples the model has not previously seen, so it can be used to test whether the model correctly predicts class membership on an independent population (see the sketch after this list). Gene selection is required in this approach; genes may be tested individually (univariate gene selection) or in association (multivariate selection strategies).
iv)
Using reverse engineering algorithms to infer the transcriptional pathway structures within cells.
2.4
Feature Selection
Feature selection is the process of choosing the most suitable features when building a process model (Kasabov, 2002). The main purpose of feature selection in the pattern recognition, statistics and data mining communities (text mining, bioinformatics, sensor networks) is to select a subset of the input variables by eliminating features that have little or no predictive information. The comprehensibility of the resulting classifier model can be significantly improved by feature selection. Feature selection often builds a model with better generalization to unseen points (YongSeong Kim, et al., 2003). The potential benefits of feature selection, as stated by Gianluca Bontempi and Benjamin Haibe-Kains (2008), are as follows:
i)
Facilitating the visualization and comprehensibility of the data.
ii)
Reducing the measurement and storage requirements.
iii)
Reducing training and utilization times.
iv)
Improving prediction performance by defying the curse of dimensionality.
Feature selection is important for various reasons in many supervised learning problems, namely generalization performance, running time requirements, and constraints and interpretational issues imposed by the problem itself (J. Weston et al., 2000). The main goal of feature selection in supervised learning is to search for a feature subset with a higher classification accuracy (YongSeong Kim, et al., 2003).
Figure 2.4: Feature selection process (selection of a starting feature set; evaluation of each feature set's goodness; generation of the next feature set; repeat until the stopping criterion is met)
There are three main approaches to feature selection (YongSeong Kim, et al., 2003):
i)
Filter methods
Filter methods are preprocessing methods that attempt to assess the merits of features from the data alone. The performance of the learning algorithm with the chosen feature subset is ignored.
ii)
Wrapper methods
In these methods, subsets of variables are evaluated based on their usefulness to a given predictor. In order to select the best subset, the predictor's own learning algorithm is used as part of the evaluation function. The problem is identified as a problem of stochastic state space search.
iii)
Embedded methods
These methods implement variable selection as part of the learning procedure and are often specific to given learning machines.
Table 2.2 summarizes this taxonomy.
22
Table 2.2: A taxonomy of feature selection techniques
Search Method
Filter
Univariate
Advantages
• Fast
• Scalable
• Independent of the classifier
Disadvantages
• Ignores features
dependencies
• Ignores interaction with
the classifier
Example
• χ2
• Euclidean distance
• Information gain, Gain ratio (BenBassat,
1982)
• t-test
Mutivariate
• Model features dependecies
• Independent of the classifier
• Better than wrapper method
in complexity computation
• Slower compares to
univariate techniques
• Less scalable compares to
univariate techniques
• Ignores interaction with
the classifier
• Correlation-based feature selection, CFS
(Hall, 1999)
• Markov blanket filter, MBF (Koller and
Sahami, 1996)
• Fast correlation-based feature selection,
FCBF (Yu and Liu, 2004)
• Relief ( Kenji Kira and Larry A. Rendell,
1992)
Wrapper
Deterministic
• Simple
• Risk of overfitting
• Interacts with the classifiers
• Prones to trap in a local
• Model feature dependencies
optimum compares to
randomized algorithm
• Sequential forward selestion, SFS
(Kittler, 1978)
• Sequential backward elimination, SBE
(Kittler, 1978)
• Plus q take -away r (Ferri et al., 1994)
23
• Less computationally
intensive compared to
• Classifier dependent
selection
• Beam search (Siedelecky and Sklansky,
1988)
randomized methods
Randomized
• Less prone to local optima
• Computationally intensive
• Simulated annealing
• Interacts with the classifier
• Classifier dependent
• Randomized hill climbing (Skalak, 1994)
• Model features dependencies
selection
• Higher risk of overfitting
compared to deterministic
Embedded
• Interacts with the classifier
• Computational complexity is
• Classifier dependent
selection
• Genetic Algorithms (Holland, 1975)
Estimation of distribution algorithm (Inza
et al.,2000 )
• Decision trees
Weighted naive Bayes (Duda et al., 2001)
better than wrapper method
• Feature selection using the weight vector
• Model features dependencies
of SVM (Guyon et al., 2002; Weston et
al., 2003)
24
Among the three main approaches, the filter approach is computationally efficient and robust against overfitting. Although it may introduce bias, it may have considerably less variance (Gianluca Bontempi and Benjamin Haibe-Kains, 2008). Hence, a larger dataset can be handled by the filter method compared to the wrapper method (Guyon, I and Elisseeff A, 2003; Kohavi, R and John G.H., 1997; Liu, H. and Motoda, H., 1997; Ahmad, A. and Dey, L., 2005; Dash, M. and Liu, H., 1997). The common drawback of filter methods is that the interaction with the classifier is ignored, which means the search in the feature subset space is separated from the search in the hypothesis space. The simplest filter techniques are univariate. However, univariate techniques may lead to worse classification performance compared to other techniques, because they ignore feature dependencies. Thus, various multivariate filter techniques have been proposed to overcome this problem, aiming at the incorporation of feature dependencies to some degree (Saeys, Y. et al., 2007).
Relief is one of the multivariate filter techniques; it leads to better performance improvements and proves that filter methods are capable of handling non-linear and multi-modal environments (Reisert, M., and Burkhardt, 2006). This method is the most well known and most deeply studied, and its estimation is better than the usual statistical attribute measures such as correlation or covariance, because it takes the attribute interrelationships into account (Arauzo-Azofra, A., et al., 2004). It has been reported that this method has shown good performance in various domains (Sikonja, M.R. and Kononenko, I., 2003).
2.4.1
Relief
Relief is a feature selection method based on attribute information that was introduced by Kira and Rendell (Kira, K., and Rendell, L.A., 1992). The original Relief only handled boolean concept problems, but extensions of it have been constructed for use in multi-class classification problems (Relief-F) (Kononenko, I., 1994) and in regression (RRelief-F) (Sikonja, M.R and Kononenko, I, 1997). In Relief-F, one nearest neighbor of an example E1 is searched for in every class. With these neighbors, the relevance of every feature f ∈ F is evaluated by Relief and accumulated into the weight W[f] based on equation (2.1):

    W[f] = W[f] - diff(f, E1, H) + Σ_{C ≠ class(E1)} P(C) × diff(f, E1, M(C))        (2.1)

H is a hit, the nearest neighbor from the same class, while M(C) is the miss, the nearest neighbor from a different class C. W[f] is divided by m to get the average evaluation in the range [-1, 1]. The diff(f, E1, E2) function computes the degree to which the value of feature f differs between examples E1 and E2, as in equation (2.2):

    For discrete f:    diff(f, E1, E2) = 0 if value(f, E1) = value(f, E2), and 1 otherwise
    For continuous f:  diff(f, E1, E2) = |value(f, E1) - value(f, E2)| / (max(f) - min(f))    (2.2)

value(f, E1) indicates the value of feature f on example E1, while max(f) and min(f) are the maximum and minimum values that f takes. The distance used to find the nearest neighbors is the sum, over all features, of the differences given by the diff function.
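To make equations (2.1) and (2.2) concrete, the following is a minimal sketch of a Relief-F style weight update for continuous features. The function name, the random sampling scheme and the use of the empirical class frequency for P(C) are illustrative assumptions, not the exact algorithm of Figure 3.5.

    import numpy as np

    def relieff_weights(X, y, n_iterations=100, seed=0):
        """Relief-F style weights (eq 2.1) with range-scaled diffs (eq 2.2)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        span = X.max(axis=0) - X.min(axis=0)         # max(f) - min(f) per feature
        span = np.where(span == 0, 1.0, span)
        w = np.zeros(d)
        for _ in range(n_iterations):
            i = rng.integers(n)                      # pick a random example E1
            diffs = np.abs(X - X[i]) / span          # diff(f, E1, .) for every example
            dist = diffs.sum(axis=1)                 # distance = sum of diffs
            dist[i] = np.inf                         # never pick E1 itself
            same = np.where(y == y[i])[0]
            hit = same[np.argmin(dist[same])]        # nearest hit H
            w -= diffs[hit] / n_iterations
            for c in np.unique(y):
                if c == y[i]:
                    continue
                other = np.where(y == c)[0]
                miss = other[np.argmin(dist[other])] # nearest miss M(C)
                w += np.mean(y == c) * diffs[miss] / n_iterations   # P(C) weighting
        return w                                     # averaged weights in [-1, 1]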
2.4.2
Information Gain
Information gain is commonly used as a surrogate for approximating a conditional distribution in the classification setting (Cover, T., and Thomas, J.). According to Mitchell (1997), information gain measures the number of bits of information obtained for class prediction by knowing the value of a feature. Information gain is defined as:

    InfoGain = H(Y) - H(Y|X)                                        (2.3)

where X is the feature, Y is the class, and

    H(Y) = - Σ_{y ∈ Y} p(y) log2 p(y)                               (2.4)

    H(Y|X) = - Σ_{x ∈ X} p(x) Σ_{y ∈ Y} p(y|x) log2 p(y|x)          (2.5)
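As a small worked sketch of equations (2.3)-(2.5) for a discrete feature, using empirical frequencies as the probabilities (the helper name and the frequency estimate are assumptions for illustration):

    from collections import Counter
    import math

    def info_gain(xs, ys):
        """InfoGain = H(Y) - H(Y|X) with empirical probabilities (eqs 2.3-2.5)."""
        n = len(ys)
        def entropy(labels):                              # eq (2.4) on a label list
            return -sum((c / len(labels)) * math.log2(c / len(labels))
                        for c in Counter(labels).values())
        h_y = entropy(ys)                                 # H(Y)
        h_y_given_x = 0.0
        for x, count in Counter(xs).items():              # p(x) * H(Y | X = x)
            subset = [y for xi, y in zip(xs, ys) if xi == x]
            h_y_given_x += (count / n) * entropy(subset)  # eq (2.5)
        return h_y - h_y_given_x                          # eq (2.3)

For example, info_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]) returns 1.0, since knowing the feature value fully determines the class.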
2.4.3
Chi-Square
The Chi-square algorithm evaluates genes individually with respect to the classes. It is based on comparing the observed frequency of a class within an interval to the expected frequency of that class. From the N examples, let N_ij be the number of samples of class C_i within the j-th interval and M_Ij be the number of samples in the j-th interval. The expected frequency of N_ij is E_ij = M_Ij |C_i| / N. The Chi-squared statistic of a gene is then defined as follows (Jin, X., et al., 2006):

    χ² = Σ_{i=1}^{C} Σ_{j=1}^{I} (N_ij - E_ij)² / E_ij              (2.6)

where C is the number of classes and I is the number of intervals. The larger the χ² value, the more informative the corresponding gene is.
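A minimal sketch of equation (2.6) for one gene that has already been discretized into intervals; the function and argument names are illustrative assumptions:

    import numpy as np

    def chi2_score(gene_bins, labels):
        """χ² statistic of eq (2.6) for a single discretized gene."""
        n = len(labels)
        chi2 = 0.0
        for c in np.unique(labels):                       # classes C_i
            class_size = np.sum(labels == c)              # |C_i|
            for j in np.unique(gene_bins):                # intervals j = 1..I
                observed = np.sum((labels == c) & (gene_bins == j))   # N_ij
                interval_size = np.sum(gene_bins == j)                # M_Ij
                expected = interval_size * class_size / n             # E_ij
                chi2 += (observed - expected) ** 2 / expected
        return chi2                                       # larger = more informative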
2.5
Radial Basis Function
The Radial Basis Function, RBF, was introduced by Powell to solve the problem of real multivariate interpolation (Powell M.J.D., 1997). In a neural network, radial basis functions are a set of functions formed by the hidden units, which compose a basis for the input vectors (Haykin, S., 1994). Such a function is real-valued, and its value depends merely on the distance from the centre. The most general formula is:

    h(x) = φ((x - c)² / r²)                                         (2.7)

where
    φ is the function used (Gaussian, Multiquadric, Inverse Multiquadric or Cauchy),
    c is the centre, and
    r is the radius, with the metric taken as Euclidean.
2.5.1
Types of Radial Basis Function, RBF
Radial functions are a class of functions whose characteristic feature is that their response decreases or increases monotonically with distance from the origin. If the model is linear in the radial functions, the centre, the distance scale and the precise shape are fixed parameters of the model (Mark J.L. Orr, 1996). There are two common types of Radial Basis Function, based on the type of radial function used, as listed below:
i) Gaussian Radial Basis Function:

    φ(z) = e^(-z)                                                   (2.8)

Figure 2.5: Gaussian RBF

ii) Multiquadric Radial Basis Function:

    φ(z) = (1 + z)^(1/2)                                            (2.9)

Figure 2.6: Multiquadric RBF
Among these types, the Gaussian is the typical radial function and the most commonly used. This is because the Gaussian RBF decreases monotonically with distance from the origin and has a local response, giving a significant response only in a neighborhood near the origin. In addition, the Gaussian RBF is biologically plausible because of its finite response. The Multiquadric RBF is the contrary of the Gaussian RBF, since it increases monotonically with distance from the origin and has a global response. This is why the Gaussian RBF is more commonly used than the Multiquadric RBF. Figure 2.7 illustrates the different RBF types with c = 0 and r = 1.
Figure 2.7: Gaussian (green), Multiquadric (magenta), Inverse-multiquadric (red) and Cauchy (cyan) RBF
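The four radial functions of Figure 2.7 can be reproduced with a small sketch under the general form of equation (2.7); the function name and keyword arguments are illustrative:

    import numpy as np

    def rbf(x, centre=0.0, radius=1.0, kind="gaussian"):
        """Evaluate h(x) = phi((x - c)^2 / r^2) for the common RBF types."""
        z = ((x - centre) ** 2) / radius ** 2
        if kind == "gaussian":
            return np.exp(-z)                # eq (2.8): local, decreases monotonically
        if kind == "multiquadric":
            return np.sqrt(1.0 + z)          # eq (2.9): global, increases monotonically
        if kind == "inverse-multiquadric":
            return 1.0 / np.sqrt(1.0 + z)
        if kind == "cauchy":
            return 1.0 / (1.0 + z)
        raise ValueError(kind)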
2.5.2
Radial Basis Function, RBF, Network
A neural network is designed as a machine to model the way the brain performs a particular task or function of interest (Haykin S., 1994). Neural networks have been successfully applied in various areas, including finance, engineering and mathematics, because of their useful properties and capabilities, such as nonlinearity, input-output mapping and neurobiological analogy (Mat Isa, N.A., et al., 2004). The application of neural networks has expanded into the medical area with two common applications, which are image processing (Sankupellay, M., and Selvanathan, N., 2002; Chang, C.Y., and Chung, P.C., 1999; Kovacevic, D., and Loncaric, S., 1997) and diagnostic systems (Kuan, M.M., et al., 2002; Ng, E.Y.K., et al., 2002; Mat Isa, N.A., et al., 2002; Mashor, M. Y., et al., 2002). Neural networks have been proven to perform very well in both applications. In medical image processing, a high capability can be reached either in increasing the medical image contrast or in detecting certain regions, while in disease diagnostic systems they perform as well as human experts, or sometimes better (Sankupellay, M., and Selvanathan, N., 2002; Chang, C.Y., and Chung, P.C., 1999; Kovacevic, D., and Loncaric, S., 1997; Kuan, M.M., et al., 2002; Ng, E.Y.K., et al., 2002; Mat Isa, N.A., et al., 2002; Mashor, M. Y., et al., 2002).
The Radial Basis Function, RBF, network is a type of artificial neural network for supervised learning problems (Orr, M.J.L., 1996). With RBF networks, training is relatively fast due to their simple structure. In addition, RBF networks are capable of universal approximation under non-restrictive assumptions (Park. J. et al., 1993). RBF networks can be implemented for any type of model, whether linear or non-linear, and in any kind of network, whether single or multilayer (Orr, M.J.L. 1996). The most basic design of an RBF network, as first proposed by Broomhead and Lowe in 1988, is traditionally associated with radial functions in a single-layer network. Figure 2.8 shows the traditional RBF network, in which each of the n components of the input vector x feeds forward to m basis functions whose outputs are linearly combined with weights {wj}, j = 1..m, into the network output f(x).
Figure 2.8: The traditional radial basis function network
The first layer is the input layer, consisting of the source nodes or sensory units, while the second layer is a hidden layer of high dimension. The last layer, the output layer, gives the network's response to the activation pattern applied to the input layer. The transformation from the input space to the hidden unit space is non-linear, but the transformation from the hidden space to the output space is linear (Haykin, S., 1994).
To implement an RBF network, we need to identify the hidden unit activation function, the number of processing units, a criterion for modeling the given task and a training algorithm for finding the parameters of the network. The network is trained to find the RBF weights: the training set, a set of input-output pairs, is used to optimize the network parameters so that the network output fits the given inputs. To evaluate the fit, a cost function, typically the mean square error, is used. After training, the network can be used with data whose underlying statistics are similar to those of the training set. On-line training algorithms allow the adaptation of the network parameters to changing data statistics (Moody, J., 1989).
Various functions have been tested as activation functions for RBF networks (Matej, S., Lewitt, R.M, 1996), but the Gaussian is preferred in pattern classification applications (Poggio, T., Girosi, F., 1990; Moody, J., 1989; Bors, A.G., Pitas, I., 1996; Sanner, R.M., Slotine, J.J.E., 1994; Bors, A.G., Gabbouj, G., 1994). The Gaussian activation function for the RBF network is:

    φj(X) = exp[-(X - μj)ᵀ Σj⁻¹ (X - μj)]                           (2.10)

for j = 1, ..., L, where
    X is the input feature vector,
    L is the number of hidden units, and
    μj and Σj are the mean and the covariance matrix of the j-th Gaussian function.

The mean vector μj represents the location, while Σj models the shape of the activation function. The output layer applies a weighted sum of the hidden unit outputs according to the following formula:

    ψk(X) = Σ_{j=1}^{L} λjk φj(X)                                   (2.11)

for k = 1, ..., M, where
    λjk are the output weights, each corresponding to the connection between a hidden unit and an output unit, and
    M represents the number of output units.

For a classification problem, if the weight λjk > 0, the activation field of hidden unit j is contained in the activation field of output unit k.
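A compact sketch of the forward pass defined by equations (2.10) and (2.11), assuming the centres, inverse covariance matrices and output weights have already been trained; the array shapes and names are illustrative:

    import numpy as np

    def rbf_forward(X, mu, sigma_inv, lam):
        """Gaussian hidden layer (eq 2.10) followed by a linear output layer (eq 2.11)."""
        # X: (n, d) inputs; mu: (L, d) centres; sigma_inv: (L, d, d); lam: (L, M) weights
        n, L = X.shape[0], mu.shape[0]
        phi = np.empty((n, L))
        for j in range(L):
            diff = X - mu[j]                                   # X - mu_j
            quad = np.einsum('nd,de,ne->n', diff, sigma_inv[j], diff)
            phi[:, j] = np.exp(-quad)                          # phi_j(X), eq (2.10)
        return phi @ lam                                       # psi_k(X), eq (2.11)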
2.6
Classification
Classification can also be termed supervised learning (Tou, J.T. and Gonzalez, R.C., 1974; Lin, C.T., and Lee, C. S. G., 1996). In classification problems, previously unseen patterns are assigned to their respective classes according to previous examples from each class. The output of the learning algorithm is therefore one of a discrete set of possible classes (Mark J.L. Orr, 1996). In a neural classifier, the input represents the feature entries, the hidden layer units correspond to subclasses, and each output corresponds to a class (Tou, J.T. and Gonzalez, R.C., 1974). There are several methods of classification (Bashirahamad S. Momin et al., 2006):
i)
Decision trees
The decision space is divided into piecewise constant regions (Pedrycz, W., 1998).
ii)
Probabilistic or generative models
These compute probabilities based on Bayes' Theorem (Tou, J.T. and Gonzalez, R.C., 1974).
iii)
Nearest-neighbor classifiers
These calculate the minimum distance from instances or prototypes (Tou, J.T. and Gonzalez, R.C., 1974).
iv)
Artificial Neural Networks (ANNs)
The division is by nonlinear boundaries. Learning is incorporated, in a data-rich environment, by encoding information in the connection weights (Haykin, S., 1994).
2.6.1
Comparison with other Classifiers
Table 2.3 shows previous researches on the classification of various types of data using the RBF network.

Table 2.3: Previous researches on classification using RBF network

Author                                          Classification
Wang, D.H., et al., 2002                        Protein sequences
Li, X., Zhou, G., and Zhou, L., 2006            Novel data
Leonard, J.A., and Kramer, M.A., 1991           Process fault
Horng, M-H., 2008                               Texture of ultrasonic images of rotator cuff diseases
Kégl, B., Krzyzak, A., and Niemann, H., 1998    Nonparametric (pattern recognition)
Based on Wang, D.H., et al. (2002), Self-Organizing Map (SOM) networks (Wang, H.C et al., 1998; Ferran EA., 1994) and Multilayer Perceptrons (Wu C.H. et al., 1995; Wu C.H., 1997) are the two types of neural models applicable to the protein sequence classification task. Using SOM networks, the relationships within a set of proteins can be discovered by clustering them into different groups. However, SOM networks have two main limitations when applied to protein sequence classification:
i) They are time consuming, and the results are difficult to interpret.
ii) The size selection of the Kohonen layer is subjective and often involves a trial-and-error procedure.
There are also several undesired features arising from the design of MLP-based classification systems, which are:
i)
The determination of the neural network architecture.
ii)
The lack of interpretability of the "black box".
iii)
The training time needed when the number of inputs exceeds 100.
In Li, X., Zhou, G., and Zhou, L. (2006), the RBF network was compared to the Support Vector Machine (SVM). Although data classification is one of the main applications to which RBF networks have been applied (Oyang, Y.J et al., 2005; Sun, J. et al., 2003), recent research has focused more on the SVM. This is because several recent studies have reported that the SVM is generally capable of delivering higher classification accuracy than other existing data classification algorithms. However, the SVM suffers from one serious disadvantage, namely the time taken to carry out model selection, which can be unacceptably long for some contemporary applications, and all the approaches proposed for expediting SVM model selection lead to lower prediction accuracy.
The backpropagation network and nearest-neighbor classifiers were compared in Leonard, J.A., and Kramer, M.A. (1991). The comparison showed that the backpropagation classifier can produce nonintuitive and nonrobust decision surfaces due to the inherent nature of the sigmoid function, the definition of the training set and the error function used for training. In addition, there is no standard training scheme for identifying unknown classes of regions in a backpropagation network. The nearest-neighbor classifier becomes inefficient when the number of training points is high, because it needs a complete search through the set of training examples to classify each test point.
2.7
Receiver Operating Characteristic, ROC
The Receiver Operating Characteristic, ROC, analysis emerged from statistical decision theory (Green and Swets, 1996). In the medical area, this theory was first applied during the late 1960s. ROC analysis has been increasingly used as an evaluation tool in computer science to assess the differences among methods (Westin, L. K., 2001). For neural classifiers, ROC analysis is easy to implement, yet rarely used to evaluate classifier performance outside of the medical and signal detection fields. The advantages of ROC analysis are a robust description of the network's predictive ability and an easy way to adapt an existing network based on the differential costs of misclassification and varying prior class probabilities.
The Receiver Operating Characteristic, ROC, curve is the ultimate metric for assessing detection algorithm performance (John Kerekes, 2008). The ROC curve is defined in terms of sensitivity and specificity, which are the number of true positive decisions over the number of actually positive cases and the number of true negative decisions over the number of actually negative cases, respectively. This method leads to a measurement of diagnostic test performance (Park, S.H., et al., 2004). Based on the ROC curve, the area under the curve, AUC, is the most commonly used and most comprehensive of the performance measures (Downey, T.J., Jr., et al., 1999). The following equation gives the formula for the AUC at a single operating point, where TPR refers to the True Positive Rate and FPR to the False Positive Rate:

    AUC = (1 + TPR - FPR) / 2                                       (2.12)
Figure 2.9: The ROC curve

2.8
Summary

In brief, this chapter discusses breast cancer, microarray technology, feature selection, the Radial Basis Function, RBF, network and the Receiver Operating Characteristic, ROC. The biopsy is the way to confirm the occurrence of cancer cells in the breast after mammography, and the DNA microarrays are obtained from the biopsy. In order to select the useful genes, feature selection needs to be applied first. There are various feature selection techniques, which are categorized into three methods. The filter method with the multivariate technique, namely ReliefF, has been chosen to be applied in this study. Based on previous researches, the Radial Basis Function network has been shown to be the best network for classification problems. Therefore, it has been chosen as the classifier to distinguish the cancerous from the non-cancerous. The Receiver Operating Characteristic, ROC, curve metric is used to evaluate the performance of the classifier.
CHAPTER 3
METHODOLOGY
3.1
Introduction
This chapter elaborates the processes or steps involved during the design and implementation phase. The process flow is illustrated in the research framework, which contains four main steps. Beginning with the data acquisition, the format of the data is explained briefly. In addition, the HykGene software that will be used in the experiments is discussed according to its framework.
3.2
Research Framework
Figure 3.1 describes the research framework for this study. There are four main steps or processes involved, which are data acquisition, feature selection, classification and evaluation. In the data acquisition step, the breast cancer microarray dataset is obtained. Then, we can proceed to the second step, feature selection. This step is very important because it determines the accuracy of the classification. The classification of the microarray data will then be implemented to classify the cancer cells and the normal cells. The final step in this framework is to evaluate the performance of the classifier using the ROC.

Figure 3.1: Research Framework (Start → Data Acquisition → Feature Selection → Classification → Evaluation → End)
3.3
Software Requirement
In this study, the software requirements include Matlab version 7.0 with the Neural Network toolbox, to develop the feature selection functions and the classifier, namely the Radial Basis Function, RBF, network. In addition, two other software packages will be used, WEKA and HykGene, to support the feature selection function.
3.4
Data Source
The breast cancer microarray data will be obtained from the Centre for
Computational Intelligence (http://www.c2i.ntu.edu.sg/) for this study. The dataset,
which contains 97 instances with 24461 genes and 2 classes, is in the
Attribute-Relation File Format, ARFF. The ARFF format will be explained in detail
in the next chapter.
3.5
HykGene
HykGene is a hybrid approach for selecting marker genes for phenotype
classification using microarray gene expression data and has been developed by
Wang, Y. et al. (2005). The following figure indicates the framework of
HykGene.
Figure 3.2: HykGene: a hybrid system for marker gene selection
(Wang, Y. et al., 2005)
In the first step, feature filtering algorithms are applied on the training data to
select a set of top-ranked informative genes. Next, to generate a dendrogram,
hierarchical clustering is applied on the set of top-ranked genes. Then, in the third
step, a sweep-line algorithm is used to extract clusters from the dendrogram, and the
marker genes are selected by collapsing clusters. Leave-one-out cross validation
(LOOCV) determines the best number of marker genes. In the last step, only the
marker genes are used as the features for classification of test data.
For the first step, which is gene ranking, feature filtering algorithms are
applied to rank the genes. Features or genes are ranked based on a statistic over the
features that indicates which features are better than others. In the HykGene
software, three popular feature filtering methods are used, which are ReliefF,
Information Gain and Chi-Square (χ²-statistic).
After the feature filtering is done on the whole set of genes using training
data, the top-ranked genes are pre-selected from the whole set. Normally, 50 to 100
genes are pre-selected. Next, hierarchical clustering (HC) is applied on the set of
pre-selected genes. This results in a dendrogram, or hierarchical tree, where heights
represent degrees of dissimilarity. Due to the high correlation of some genes in the
pre-selected set, clusters are extracted by analyzing the dendrogram output of the
HC algorithm. Then, each cluster is collapsed into one representative gene. These
extracted representative genes are used as marker genes in the classification step.
Finally, the test data will be classified using one of four classifiers, which
are k-nearest neighbor (k-NN), linear support vector machine (SVM), C4.5 decision
tree and Naive Bayes (NB).
3.5.1
Gene Ranking
The accurate estimation of feature or attribute quality is important in
the preprocessing step of classification. Therefore, to produce an ideal estimation of
attributes in feature selection, the noisy and irrelevant features need to be eliminated.
In addition, the features that carry the specific characteristics of a class need to be
clearly selected. As illustrated before in the HykGene framework, feature selection
or gene ranking is the first step, which involves three feature filtering methods.
Thus, the original framework can be expanded into the following figure in order to
show the methods involved more clearly.
[Diagram: Training data → Step 1: Ranking Genes (ReliefF, Information Gain, χ²-statistic) → Top-ranked Genes]
Figure 3.3: The expanded framework of HykGene in Step 1
3.5.1.1 ReliefF Algorithm
The main idea of the Relief algorithm, proposed by Kira and
Rendell (1992), is to estimate the quality of attributes whose weights are greater than
a threshold using the difference of an attribute value between a given instance and
its two nearest instances, namely the nearest hit and the nearest miss. The following
is the original Relief algorithm.
Input : a vector space for training instances with the values of
        attributes and class values
Output : a vector space for training instances with the weight W of
         each attribute

set all weights W[A] = 0.0
for i = 1 to m do begin
    randomly select instance Ri
    find nearest hit H and nearest miss M
    for A = 1 to all attributes do
        W[A] = W[A] - diff(A, Ri, H)/m + diff(A, Ri, M)/m
end
Figure 3.4: Original Relief Algorithm
(Park, H., and Kwon, H-C., 2007)
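To make Figure 3.4 concrete, the following is a minimal, runnable Python sketch of
the original Relief algorithm. The binary-class assumption, the Manhattan distance
used to find the nearest hit and miss, and the normalization of diff() by each
attribute's range are illustrative choices, not details taken from the figure.

import numpy as np

def relief(X, y, m=None, rng=None):
    """Original Relief (Kira and Rendell, 1992) for a binary-class
    dataset X (n instances x d attributes) with labels y."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = m or n                                        # number of sampled instances
    span = np.ptp(X, axis=0).astype(float)            # attribute ranges, to normalize diff
    span[span == 0] = 1.0
    W = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        Ri = X[i]
        dists = np.abs((X - Ri) / span).sum(axis=1)   # Manhattan distance to Ri
        dists[i] = np.inf                             # exclude Ri itself
        same = y == y[i]
        H = X[np.where(same)[0][np.argmin(dists[same])]]     # nearest hit
        M = X[np.where(~same)[0][np.argmin(dists[~same])]]   # nearest miss
        W += (np.abs(Ri - M) - np.abs(Ri - H)) / (span * m)  # -diff(hit)/m + diff(miss)/m
    return W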
The drawback of the original Relief algorithm is that it cannot deal with incomplete
data and multiclass datasets. To overcome these disadvantages, the Relief-F
extension was suggested by Kononenko (1994). This extended algorithm is more
robust, meaning it can handle incomplete and noisy data as well as multi-class
datasets. The updated Relief-F algorithm is as follows.
Input : a vector space for training instances with the values of
        attributes and class values
Output : a vector space for training instances with the weight W of
         each attribute

set all weights W[A] = 0.0
for i = 1 to m do begin
    randomly select instance Ri
    find k nearest hits Hj
    for each class C ≠ class(Ri) do
        find k nearest misses Mj(C) from class C
    for A = 1 to all attributes do
        W[A] = W[A] - Σ(j=1..k) diff(A, Ri, Hj)/(m·k)
                    + Σ(C≠class(Ri)) [ P(C)/(1 - P(class(Ri))) · Σ(j=1..k) diff(A, Ri, Mj(C)) ]/(m·k)
end
Figure 3.5: ReliefF algorithm
(Park, H., and Kwon, H-C., 2007)
According to the algorithm, the nearest hits Hj are the k nearest neighbors from the
same class, while the nearest misses Mj(C) are the k nearest neighbors from a
different class C. W[A], the quality estimation for each attribute A, is updated based
on its values for Ri, the hits Hj and the misses Mj(C). The contribution of the misses
is weighted by the prior probability of each class and divided by 1 - P(class(Ri))
because the class of the hits is missing from the sum over diff(A, Ri, Mj(C)).
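Under the same illustrative assumptions as before (Manhattan distance,
range-normalized diff(), k = 10), the multi-class ReliefF update of Figure 3.5 can be
sketched in Python as follows.

import numpy as np

def relieff(X, y, k=10, m=None, rng=None):
    """ReliefF (Kononenko, 1994): k nearest hits and, per other class,
    k nearest misses weighted by the class prior (Figure 3.5)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    m = m or n
    span = np.ptp(X, axis=0).astype(float)
    span[span == 0] = 1.0
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    W = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        Ri, ci = X[i], y[i]
        dists = np.abs((X - Ri) / span).sum(axis=1)
        dists[i] = np.inf
        hits = np.where(y == ci)[0]
        hits = hits[np.argsort(dists[hits])[:k]]          # k nearest hits Hj
        W -= np.abs(Ri - X[hits]).sum(axis=0) / (span * m * k)
        for c in classes:
            if c == ci:
                continue
            miss = np.where(y == c)[0]
            miss = miss[np.argsort(dists[miss])[:k]]      # k nearest misses Mj(C)
            w_c = prior[c] / (1.0 - prior[ci])            # P(C) / (1 - P(class(Ri)))
            W += w_c * np.abs(Ri - X[miss]).sum(axis=0) / (span * m * k)
    return W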
3.5.1.2 Information Gain
The second method of feature selection that will be used in the experiment is
Information Gain. In the HykGene software, the equation of the information gain
method is as follows.
Let {c_i}, i = 1, …, m, denote the set of classes and let V be the set of possible
values for feature f. Then the information gain of a feature f is defined as:

G(f) = -Σ_{i=1}^{m} P(c_i) log P(c_i) + Σ_{v∈V} Σ_{i=1}^{m} P(f = v) P(c_i | f = v) log P(c_i | f = v)        (3.1)
In information gain, the numeric features need to be discretized. Therefore, the
HykGene software uses the entropy-based discretization method of Fayyad and Irani
(1993) as implemented in the WEKA software (Witten and Frank, 1999).
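As an illustration, equation (3.1) can be evaluated in Python for one
already-discretized feature. The sketch uses the equivalent rearrangement
G(f) = H(C) - Σ_v P(f = v) H(C | f = v) and base-2 logarithms, which is an assumed
convention.

import numpy as np

def entropy(labels):
    """H(C) = -sum_i P(c_i) log2 P(c_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(f, y):
    """Information gain of a discretized feature f for class labels y."""
    g = entropy(y)
    for v in np.unique(f):
        mask = f == v
        g -= mask.mean() * entropy(y[mask])   # subtract P(f=v) * H(C | f=v)
    return g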
3.5.1.3 χ²-statistic
Chi-Square or χ²-statistic is one more feature selection method that will be
used. The equation of this method as used by the HykGene software is as
follows:
χ²(f) = Σ_{v∈V} Σ_{i=1}^{m} [A_i(f = v) - E_i(f = v)]² / E_i(f = v)        (3.2)

where
V is the set of possible values for feature f,
A_i(f = v) is the number of instances in class c_i with f = v, and
E_i(f = v) is the expected value of A_i(f = v), computed with:

E_i(f = v) = P(f = v) P(c_i) N        (3.3)

where N is the total number of instances.
Like information gain, this method also requires numeric features to be discretized,
so the same discretization method is used. The following is an example of the
Chi2 algorithm, which consists of two phases. For discretization, the first
phase begins with a high significance level (sigLevel) for all numeric attributes.
The process in Phase 1 is iterated with a decreased sigLevel until an
inconsistency rate δ is exceeded in the discretized data. Phase 2 begins with the
sigLevel0 determined in Phase 1; each attribute i is associated with a sigLevel[i] and
takes turns merging until no attribute's values can be merged. If an attribute is
merged to only one value at the end of Phase 2, this attribute is not
relevant in representing the original dataset. Feature selection is accomplished when
discretization ends.
Phase 1:
set sigLevel = 0.5;
do while (InConsistency(data) < δ) {
    for each numeric attribute {
        Sort(attribute, data);
        chi-sq-calculation(attribute, data);
        do {
            chi-sq-calculation(attribute, data)
        } while (Merge(data))
    }
    sigLevel0 = sigLevel;
    sigLevel = decreSigLevel(sigLevel);
}

Phase 2:
set all sigLvl[i] = sigLevel0;
do until no-attribute-can-be-merged {
    for each attribute i that can be merged {
        Sort(attribute, data);
        chi-sq-initialization(attribute, data);
        do {
            chi-sq-calculation(attribute, data)
        } while (Merge(data))
        if (InConsistency(data) < δ)
            sigLvl[i] = decreSigLevel(sigLvl[i]);
        else
            attribute i cannot be merged
    }
}
Figure 3.6: χ²-statistic algorithm
(Liu, H., and Setiono, R., 1995)
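For comparison with the discretization pseudocode above, the per-feature χ² score
itself (equations (3.2) and (3.3), not the full Chi2 procedure) can be sketched
directly in Python.

import numpy as np

def chi_square_score(f, y):
    """Chi-square statistic of one discretized feature f against labels y."""
    N = len(y)
    score = 0.0
    for v in np.unique(f):
        in_v = f == v
        for c in np.unique(y):
            in_c = y == c
            A = float(np.sum(in_v & in_c))        # observed count A_i(f = v)
            E = in_v.mean() * in_c.mean() * N     # expected E_i(f = v) = P(f=v) P(c_i) N
            if E > 0:
                score += (A - E) ** 2 / E
    return score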
3.5.2
Classification
Classification is a system's capability to learn from a set of input/output
pairs, which are composed of two sets, namely training and test. The inputs of the
training set are the feature vectors and class labels, while the output is the
classification model. For the test set, the inputs are the feature vectors and the
classification model, while the output is the predicted class. Classification is the last
step applied in the HykGene software after selecting the marker genes. There are
four classifiers that can be chosen, as indicated in the following expanded framework
of HykGene.
[Diagram: Test data → Step 4: Classification using Marker Genes Only → (k-nearest neighbor, support vector machine, C4.5 decision tree, Naïve Bayes)]
Figure 3.7: The expanded framework of HykGene in Step 4
3.5.2.1 k-nearest neighbor

The k-nearest neighbor or k-NN classifier (Dasarathy, B., 1991) is a well-known
non-parametric classifier. For the classification of a new input x, the k nearest
neighbors are retrieved from the training data. Then, the input x is labeled with the
majority class label among these k neighbors.
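A minimal Python sketch of this majority-vote rule, assuming Euclidean distance,
could be:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label a new input x with the majority class among its k nearest
    training instances (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance from x to every instance
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]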
3.5.2.2 Support vector machine
The support vector machine, SVM, is a type of the new generation of learning
systems based on recent advances in statistical learning theory (Vapnik, V.N., 1998).
In the HykGene software, a linear SVM is used to find the separating
hyperplane with the largest margin, defined as the sum of the distances from the
hyperplane (implied by a linear classifier) to the closest positive and negative
exemplars, with the expectation that the larger the margin, the better the
generalization of the classifier.
3.5.2.3 C4.5 decision tree
The C4.5 decision tree classifier is based on a tree structure where non-leaf
nodes represent tests on one or more features and leaf nodes reflect classification
results. The classification of an instance starts at the root node of the decision
tree, testing the attribute specified by this node and then moving down the tree
branch corresponding to the value of the attribute. The process is repeated at the
nodes on that branch until a leaf node is reached. The HykGene software uses the
J4.8 algorithm in Weka, which is a Java implementation of C4.5 Revision 8.
3.5.2.4 Naïve Bayes
The fourth classifier used in the HykGene software is the Naïve Bayes
classifier, which is a probabilistic algorithm based on Bayes' rule. This classifier
approximates the conditional probabilities of the classes using the joint probabilities
of the training sample observations and the classes.
3.6
Classification using Radial Basis Function Network
There are various approaches to solve the classification problem, but as
discussed in the previous chapter, the radial basis function network has been chosen
as the classifier. This is because, beyond the advantages of the RBF network
discussed previously, data classification is one of its main applications.
To construct the classifier, a neural network is built as a radial basis function
network in which each hidden unit implements a radial activation function. The
output units produce a weighted sum of the hidden unit outputs. The RBF network is
nonlinear in the input but linear in the output. Typically, the RBF network is a
three-layer, feed-forward network composed of an input layer, a hidden layer and an
output layer. As discussed in the literature review, the Gaussian has been chosen as
the activation function. The following figure shows the architecture of the radial
basis function (RBF) network.
Figure 3.8: Radial Basis Function Network architecture
The Gaussian activation function chosen for the hidden layer has the
following formula:

φ_j(X) = exp(-‖X - μ_j‖² / (2σ_j²))        (3.4)

for j = 1, …, L, where
X is the selected l-dimensional input feature vector,
L is the number of hidden units,
μ_j is the mean and σ_j is the standard deviation of the j-th Gaussian activation function, and
the norm is the Euclidean norm.
The following equation (3.5) is the formula of the output layer, which implements a
weighted sum of the hidden-unit outputs:

ψ_k(X) = Σ_{j=1}^{L} λ_{jk} φ_j(X),  k = 1, …, M        (3.5)

where
λ_{jk} are the output weights, each corresponding to the connection between a
hidden unit and an output unit, and
M is the number of output units.
The weights λ_{jk} show the contribution of a hidden unit to the respective output
unit.
The output of the radial basis function network is often limited to the range (0, 1) by
using the sigmoidal function given as follows:

Y_k(X) = 1 / (1 + exp[-ψ_k(X)]),  k = 1, …, M        (3.6)
The training mechanism minimizes the sum of squared errors between the desired
output and the actual output Y_k(X):

MSE = ½ Σ_p Σ_i (t_pi - y_pi)² = ½ Σ_p E(p)        (3.7)

where t_pi and y_pi are the i-th target and the i-th network output for the p-th input,
respectively.
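Putting equations (3.4) to (3.7) together, a minimal NumPy sketch of the network's
forward pass and training criterion might look as follows; the array shapes are
illustrative assumptions.

import numpy as np

def rbf_forward(X, mu, sigma, lam):
    """Forward pass of the three-layer RBF network described above.
    X: (n, l) inputs; mu: (L, l) Gaussian centres; sigma: (L,) widths;
    lam: (L, M) output weights lambda_jk. Returns Y in (0, 1)."""
    # Hidden layer, eq. (3.4): phi_j(X) = exp(-||X - mu_j||^2 / (2 sigma_j^2))
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, L) squared distances
    phi = np.exp(-sq / (2.0 * sigma ** 2))
    psi = phi @ lam                    # output layer, eq. (3.5): weighted sum
    return 1.0 / (1.0 + np.exp(-psi))  # sigmoid limiting, eq. (3.6)

def mse(T, Y):
    """Sum-of-squares training criterion of equation (3.7)."""
    return 0.5 * ((T - Y) ** 2).sum()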
3.7
Evaluation of the Classifier
Receiver operating characteristic, ROC, curves are widely used in the
medical area to evaluate the performance of a diagnostic test. This method carries a
lot of information for understanding and improving classifier performance.
However, it requires visual inspection, and the best classifiers are hard to
recognize when the curves are mixed. Therefore, the area under the ROC curve is
used to help decide the suitable classifier for the problem. The Area Under Curve,
AUC, summarizes the statistics of diagnostic performance and evaluates the
ranking in terms of the separation of the classes. The following figure shows the
AUC algorithm.
Algorithm: Calculating the area under an ROC curve
Inputs: L, the set of test examples; f(i), the probabilistic classifier's
        estimate that example i is positive; P and N, the number of positive
        and negative examples.
Outputs: A, the area under the ROC curve.
Require: P > 0 and N > 0
 1. Lsorted ← L sorted decreasing by f scores
 2. FP ← TP ← 0
 3. FPprev ← TPprev ← 0
 4. A ← 0
 5. fprev ← -∞
 6. i ← 1
 7. while i ≤ |Lsorted| do
 8.     if f(i) ≠ fprev then
 9.         A ← A + TRAPEZOID_AREA(FP, FPprev, TP, TPprev)
10.         fprev ← f(i)
11.         FPprev ← FP
12.         TPprev ← TP
13.     end if
14.     if i is a positive example then
15.         TP ← TP + 1
16.     else /* i is a negative example */
17.         FP ← FP + 1
18.     end if
19.     i ← i + 1
20. end while
21. A ← A + TRAPEZOID_AREA(N, FPprev, N, TPprev)
22. A ← A / (P × N) /* scale from P × N onto the unit square */
23. end

1. function TRAPEZOID_AREA(X1, X2, Y1, Y2)
2.     Base ← |X1 - X2|
3.     Heightavg ← (Y1 + Y2) / 2
4.     return Base × Heightavg
5. end function
Figure 3.9: AUC algorithm
(Fawcett, T., 2005)
Based on the algorithm, trapezoids are used rather than rectangles, and thus
successive trapezoid regions are added to A in order to average the effect between
points. The algorithm divides A by the total possible area to scale the value onto the
unit square.
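A direct Python translation of Figure 3.9 may make the trapezoid bookkeeping
easier to follow; note that when the loop ends, TP equals P, so the final trapezoid
reaches the corner (N, P) of the count space.

def trapezoid_area(x1, x2, y1, y2):
    return abs(x1 - x2) * (y1 + y2) / 2.0

def auc(scores, labels):
    """Area under the ROC curve, following Figure 3.9 (Fawcett, 2005).
    scores: classifier estimates that each example is positive;
    labels: True for positive examples, False for negative ones."""
    P = sum(labels)
    N = len(labels) - P
    assert P > 0 and N > 0
    pairs = sorted(zip(scores, labels), reverse=True)  # decreasing by score
    a = fp = tp = fp_prev = tp_prev = 0.0
    f_prev = float("-inf")
    for f, positive in pairs:
        if f != f_prev:                                # close the current trapezoid
            a += trapezoid_area(fp, fp_prev, tp, tp_prev)
            f_prev, fp_prev, tp_prev = f, fp, tp
        if positive:
            tp += 1
        else:
            fp += 1
    a += trapezoid_area(N, fp_prev, tp, tp_prev)       # final trapezoid; tp equals P here
    return a / (P * N)                                 # scale onto the unit square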
3.8
Summary
In this chapter, the processes or steps involved in the study were briefly
discussed. The flow of the process was described by the research framework,
which consists of four main steps. In the HykGene software, three filtering methods
and four classifiers can be chosen to carry out the experiment. This chapter serves as
the guideline for the design and implementation phase in the next chapter.
CHAPTER 4
DESIGN AND IMPLEMENTATION
4.1
Introduction
In this chapter, all experiments that were carried out on the breast cancer
microarray data will be discussed in detail. These experiments involve feature
selection, classification and evaluation of the classifiers. The breast cancer
microarray data is needed to proceed with the experiments. The data set will be
explained thoroughly in the first section of this chapter, followed by an elaboration
of the feature selection techniques. The following section explains the classification
of the selected genes by several classifiers. Finally, the last section will discuss the
classifier evaluation done by ROC.
4.2
Data Acquisition
Figure 4.1: The data matrix
(Wang, Y. et al., 2005)
The microarray data can be stored in an N × (M + 1) matrix as shown in
Figure 4.1. Each row represents a tissue sample, and each of the first M columns
represents a gene. g_{i,j} is a numeric value indicating the expression level of gene j
in sample i, and l_i in the last column is the class label for sample i. Each gene is
labeled by an Expressed Sequence Tag (EST). ESTs are used to identify gene
transcripts and are also tools for gene discovery and determination (Adams, M.D. et
al., 1991). Several groups, such as Phil Green's group, have assembled ESTs into
EST-contigs in order to reduce the number of ESTs for downstream gene discovery
analysis.
The EST-contigs are oriented by a procedure that involves Basic Local
Alignment Search Tool (BLAST) similarity searching against a database of
mRNAs and putatively oriented ESTs, a procedure estimated to be 90 percent
accurate (Burchard, J., unpublished). EST-contigs with significant BLAST hits were
associated with the accession number of the sequence to which they had the best hit.
The accession numbers of these sequences were altered by adding "_RC" to the
end, indicating the difference in orientation from the original sequence.
Oligonucleotides are often used as probes for detecting DNA or RNA
samples because they readily bind to their respective complementary nucleotide
sequences. The following table maps the EST-contigs to GenBank accession
numbers and oligonucleotide probe sequences.
Table 4.1: EST-contigs, GenBank accession number and oligonucleotide probe sequence

Phil Green Contig    GenBank Accession Number    Oligonucleotide Probe Sequence
Contig45645_RC       AI151500                    CACACTCTCTTCGAACCCAATTGTGGGTGTAGCAATGAAAGCAATATGATATGCTGCAGT
Contig44916_RC       AI279346                    TTGTCCCTTTTGTCTCAAAATTTTCAAAACAGCAACACCATACATGTGTCATTATAGGGG
Contig29982_RC       AI277612                    TAAATGTCCTAATGATGAGAAACTCTCTTTCTGACCAATTGCTATGTTTACATAACACGC
As mentioned in Chapter 3, the breast cancer microarray data used in this study was
obtained from the Centre for Computational Intelligence. The data contains 78
training samples and 19 testing samples. In the training data, 34 samples are from
patients who had developed distant metastases within 5 years, labeled as relapse.
The remaining 44 samples are from patients who remained disease-free for at least
5 years after their initial diagnosis, labeled as non-relapse. In the testing data, there
are 12 relapse and 7 non-relapse samples respectively. The following figure shows
fifteen of the instances of the breast cancer microarray dataset with some of their
attributes.
Figure 4.2: Fifteen instances of the breast cancer microarray dataset
The Ratio values have been extracted from the original microarray data, with
all Not a Number (NaN) values replaced by 100.0. If the ratio is greater than one, the
binding of the test sample is greater than that of the control sample, which is known
as up-regulation. If the ratio is smaller than one, the binding of the test sample is less
than that of the control sample, known as down-regulation. Up-regulation and
down-regulation are the regulation of gene expression. Down-regulation is the
process by which the quantity of a cellular component, such as RNA or protein, is
decreased by the cell in response to an external variable, whereas up-regulation is an
increase of a cellular component.
In the microarray image, a green spot represents down-regulation while a
red spot indicates up-regulation. If a spot shows incandescent yellow, the binding
between the samples is equal (constant regulation), whereas if a spot appears black,
there is no detectable signal.
Figure 4.3: Microarray Image
4.2.1
Data Format
The breast cancer microarray data obtained from the Centre for
Computational Intelligence is in the Attribute-Relation File Format, ARFF. This file
format is an ASCII text file that describes a list of instances sharing a set of
attributes. The ARFF file is used as the data format because it was developed for use
with the WEKA machine learning software. An ARFF file contains two sections,
namely the Header information and the Data information. The Header of an ARFF
file consists of the relation declaration and attribute declarations, while the Data
section contains the data declaration line and the actual instance lines. The following
figures show the Header and Data information of the ARFF file.
@relation Breast
@attribute Contig45645_RC numeric
@attribute Contig44916_RC numeric
@attribute D25272 numeric
@attribute J00129 numeric
@attribute Contig29982_RC numeric
@attribute Contig26811 numeric
Figure 4.4: Header of the ARFF file
@data
-0.299,0.093,-0.215,-0.566,-0.596,-0.195,0.039,-0.409,-0.352,0.066,0.136,
-0.336,-0.001,-0.272,-0.012,-0.141,0.037,-0.055,-0.141,0.146,-0.087,0.028,
-0.172,-0.005,-0.064,-0.147,0.039,-0.148,-0.081,0.061,0.037,-0.112,0.106,
-0.117,-0.069,0.064,0.086,0.062,-0.054,- 0.007,0.259,0.383, 0.049,0.383,0.094,
0.152,-0.161,0.121,0.021,0.246,-0.042,0.08,0.006,-1.082,- 0.085,0.213,100,
0.009,-0.024,0.049,0.145,0.071,-0.097,0.035,0.083,-0.119,-0.087,-0.087,-0.188,
-0.447,0.058,0.283,0.093,0.098,0.08,-0.12,0.039,-0.252,0.389,-0.257,-0.088,0.356,-0.43,0.148,0.063,-0.035,-0.069,0.27,0.363,-0.073,-0.018,-0.005,- 0.054,
0.006,-0.052,-0.129,0.008,-0.169,-0.063,0.029,0.054,-0.137,-0.107,-0.185,.035,0.109,0.289,-0.08,-0.15,-0.006,0.258,-0.263,0.15,0.324,-0.128,0.071,0.03,0.055,-0.238,-0.207,0.525,100,-0.149,-0.064,0.369,0.036,-0.041,-0.211,0.073,0.063,-0.229,0.146,-0.23,-0.112,-0.076,0.06,0.264,0.077,0.135,0.033,0.222,0.105,0.193,0.024,0.115,-0.235,-0.282,-0.063,- 0.133,0.074,
0.155,0.003,0.069,0.083,0.077,-0.125,0.164,0.381,0.074,0.096
Figure 4.5: Data of the ARFF file
The @relation, @attribute and @data declarations are case insensitive. The
sequence of the @attribute statements is ordered. In the dataset, each attribute has its
own @attribute statement that uniquely defines the attribute's name and its data
type. The order of the attribute declarations indicates the column position in the data
section of the file. There are four data types supported by WEKA, namely numeric
attributes, nominal attributes, string attributes and date attributes.
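As an illustration, an ARFF file of this kind can be read with SciPy's ARFF reader;
the filename breast.arff is hypothetical, and the nominal class labels are returned as
byte strings.

from scipy.io import arff
import numpy as np

data, meta = arff.loadarff("breast.arff")   # hypothetical file in ARFF format

X = np.array([list(row)[:-1] for row in data], dtype=float)  # numeric gene ratios
y = np.array([row[-1] for row in data])                      # class labels (bytes)
print(meta.names()[:3], X.shape, np.unique(y))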
4.3
Experiments
4.3.1 Gene Ranking
4.3.1.1 Experimental Settings
The process of selecting informative genes was implemented in Java and
Matlab using Weka 3.6.0 and PRTools 3.1.7. All three feature selection techniques
in the HykGene software were used in the experiments on the breast cancer
microarray data. For comparison, and to demonstrate the effectiveness of each
feature selection algorithm, classification using the four classifiers in the HykGene
software was conducted. The classifications were done on the 50 and 100 top-ranked
genes produced by each filtering method.

Based on the cross validation classification accuracy, the best feature
selection technique can be chosen for this study. This is because the performance of
the classifiers also depends on the genes selected during classification. If the
accuracy is high, it means that the classifier has classified a set of selected genes
with a high number of informative genes, and vice versa.
4.3.2 Classification of top-ranked genes
4.3.2.1 Experimental Settings
One of the binary classification tasks is medical testing to determine whether
a patient has a certain disease or not. Binary classification is the task of classifying
the members of a given set of objects into two groups on the basis of whether they
have some property or not. In medical testing, the classification property is the
disease. In order to classify the selected genes of the breast cancer microarray data
into two classes, relapse and non-relapse, the experiments were done using Weka
3.6.0.

Before classifying the selected genes, they need to be split into training and
testing sets. This is because supervised learning, or classification, is the machine
learning method for deducing a function from training data. A training set is a set of
data used to discover potentially predictive relationships, while a testing set is a set
of data used to evaluate the strength and utility of a predictive relationship.
Therefore, the 50 and 100 top-ranked genes of the breast cancer microarray data
were split with 66 percent used for training.

Three other classifiers besides RBFN were used in the experiment, namely
the Multilayer Perceptron (MLP), Naïve Bayes (NB) and C4.5. MLP is a
feedforward neural network model, while NB is a classifier that applies Bayes'
theorem. C4.5 is an algorithm used to generate a decision tree, which can be applied
in classification and is known as a statistical classifier. These three classifiers were
used in the experiments in order to assess the effectiveness of the chosen classifier
(RBFN) based on the accuracy of correctly classified instances.
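A hypothetical sketch of such an experiment with scikit-learn is shown below; the
data matrix is a random placeholder, DecisionTreeClassifier and MLPClassifier
stand in for Weka's J4.8 and MLP, and the RBF network is omitted because
scikit-learn has no built-in RBFN.

import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(97, 50)               # placeholder for the 50 top-ranked genes
y = np.random.randint(0, 2, 97)          # placeholder relapse / non-relapse labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=0)

for name, clf in [("MLP", MLPClassifier(max_iter=1000)),
                  ("Naive Bayes", GaussianNB()),
                  ("C4.5-like tree", DecisionTreeClassifier())]:
    t0 = time.time()
    clf.fit(X_tr, y_tr)                  # build the model on the training split
    print(f"{name}: {100 * clf.score(X_te, y_te):.2f}% correctly classified, "
          f"{time.time() - t0:.2f} s to build")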
4.3.3 Evaluation of Classifier Performance
4.3.3.1 Experimental Settings
Often, prognostic performance is defined by the percentage of prognostic
decisions that turn out to be correct. However, it would be wrong to draw
conclusions before looking at the ROC analysis results (Tokan, F., Turker, N., and
Tulay, Y., 2006). By performing ROC analysis, the real performance of the
networks can be found. This helps to decide which classifier has the best
performance for the breast cancer microarray data.

ROC can be represented by plotting the true positive rate against the
false positive rate. The ROC space, which depicts the relative trade-offs between
true positives and false positives, is defined with FPR and TPR as the x and y axes
respectively. TPR, also known as sensitivity, measures the performance of the
classifier in classifying instances correctly among all positive samples. FPR, also
called 1-specificity, measures the number of incorrect positive results among the
negative samples. The following figure illustrates the ROC space.
Figure 4.6: ROC Space
Based on the figure, we can see that the ROC space is divided by the diagonal
line, which is referred to as the line of no discrimination (random guess). This line
splits the ROC space into areas of good and bad classification. A good classification
is shown by a point above the diagonal line, while points below the line show wrong
results. Perfect classification is yielded when a point lies in the upper left corner of
the ROC space, at coordinate (0, 1). This result represents 100 percent sensitivity
(no false negatives) and 100 percent specificity (no false positives).
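A small sketch computing the (FPR, TPR) point of a crisp classifier, as plotted in the
ROC space, could be:

import numpy as np

def roc_point(y_true, y_pred):
    """Return (FPR, TPR): the classifier's coordinates in ROC space."""
    y_true = np.asarray(y_true, bool)
    y_pred = np.asarray(y_pred, bool)
    tpr = (y_true & y_pred).sum() / y_true.sum()       # TP / (TP + FN), sensitivity
    fpr = (~y_true & y_pred).sum() / (~y_true).sum()   # FP / (FP + TN), 1 - specificity
    return fpr, tpr                                    # (0, 1) is the perfect classifier

print(roc_point([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))     # -> (0.5, 0.666...)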
With the values of FPR and TPR, a ROC curve can be drawn. However, the
curves of the networks are hard to compare by visual inspection alone.
Therefore, the area under the ROC curve, AUC, is helpful for comparing the
performance of the networks. A perfect classifier has an AUC of 1, while a
completely random classifier has an AUC of 0.5. In order to evaluate the
performance of the classifiers in this study, the experiments were also carried out
using Weka 3.6.0. The following figure depicts the framework for plotting multiple
ROC curves using Weka.
Figure 4.7: Framework for plotting multiple ROC curves using Weka
4.5
Summary
Overall, this chapter discussed the experiments that were carried out
throughout the study. Feature selection was the first experiment performed,
followed by the classifications. Three methods were used in the feature selection
experiments, namely Relief-F, Information Gain and Chi-Square. The genes yielded
by the chosen feature selection technique were further used in the classification
experiments. Four classifiers were involved in the classification experiments, where
the performance of each is evaluated by ROC. All the experimental results will be
further analyzed in the next chapter.
CHAPTER 5
EXPERIMENTAL RESULTS ANALYSIS
5.1
Introduction
The experimental results will be analyzed and discussed in this chapter.
Comparison of the feature selection results yielded by the three techniques will
determine the best method for gene selection. The classification results of each
classifier will be analyzed based on the accuracy percentage and the time taken to
build the network. The classifiers will then be evaluated according to the area under
the ROC curve to show the performance of the best classifier.
5.2
Results Discussion
5.2.1
Analysis Results of Feature Selection Techniques
The experimental results of feature selection are tabulated in Tables 5.1
to 5.3. Figures 5.1 and 5.2 show graphs of the cross validation classification
accuracy based on SVM in order to show the comparison clearly.
Table 5.1: Results on breast cancer dataset using Relief-F

Classifier      Cross Validation Classification Accuracy (%)
                50 top-ranked genes     100 top-ranked genes
k-NN            86.60                   87.63
SVM             86.60                   83.51
C4.5            68.04                   65.98
Naive Bayes     80.41                   78.35
Table 5.2: Results on breast cancer dataset using Information Gain

Classifier      Cross Validation Classification Accuracy (%)
                50 top-ranked genes     100 top-ranked genes
k-NN            79.38                   77.32
SVM             73.20                   72.16
C4.5            76.29                   70.13
Naive Bayes     58.76                   57.73
Table 5.3: Results on breast cancer dataset using χ²-statistic

Classifier      Cross Validation Classification Accuracy (%)
                50 top-ranked genes     100 top-ranked genes
k-NN            79.38                   74.23
SVM             80.41                   74.23
C4.5            74.23                   70.10
Naive Bayes     55.67                   59.79
[Bar chart: cross validation classification accuracy (%) by method - ReliefF, Information Gain, Chi-Square]
Figure 5.1: Graph on cross validation classification accuracy using 50 top-ranked genes
[Bar chart: cross validation classification accuracy (%) by method - ReliefF, Information Gain, Chi-Square]
Figure 5.2: Graph on cross validation classification accuracy using 100 top-ranked genes
From the results, we can observe that all four classifiers yielded higher cross
validation classification accuracy using the genes selected by ReliefF compared to
Information Gain and Chi-Square. This proves that the ReliefF algorithm is a good
method for selecting informative genes on the breast cancer dataset. This result is
explained by the effectiveness of the ReliefF algorithm in estimating attribute
quality in problems with strong dependencies between attributes (Marko, R.S., and
Igor, K., 2003). Unlike ReliefF, the Information Gain and Chi-Square methods
ignore feature dependencies, which leads to their poorer performance. Therefore,
the ReliefF algorithm was chosen as the feature selection technique in this study.
5.2.2
Analysis Results of Classifications
As discussed in the previous section, the ReliefF algorithm obtained the better result
during feature selection. Thus, the classifications were carried out on the 50 and 100
top-ranked genes selected by the ReliefF algorithm. These classification processes
were done using four types of classifiers, which are the Radial Basis Function
Network (RBFN), Multilayer Perceptron (MLP), Naïve Bayes (NB) and C4.5. The
following table shows the comparison of the classification results yielded by the
four different classifiers.
Table 5.4: Results on classification using different classifiers

Classifier                              Top-ranked genes    Time taken (s)    Correctly Classified Instances (%)
Radial Basis Function Network, RBFN     50                  0.08              90.91
                                        100                 0.08              84.85
Multilayer Perceptron, MLP              50                  6.5               84.85
                                        100                 23.72             81.82
Naïve Bayes                             50                  0.02              81.81
                                        100                 0.03              78.79
C4.5                                    50                  0.03              57.58
                                        100                 0.08              57.58
The classifier performance is compared based on the percentage of correctly
classified instances and the time taken to build the model, as shown in Table 5.4.
Referring to the table, it is clear that the Radial Basis Function Network produced
the highest result compared to the other classifiers, although the time taken by
RBFN to build the model was not the smallest among all classifiers. From the table,
we can also observe that the percentage of correctly classified instances decreases
for all classifiers when the classifications were carried out on the 100 top-ranked
genes. This proves that the highest classification accuracy can be achieved with the
smallest number of genes. Figures 5.3 and 5.4 illustrate the classification results in
graph form.
[Bar chart: correctly classified instances (%) and time taken (s) for RBFN (90.91%, 0.08 s), MLP (84.85%, 6.5 s), NB (81.81%, 0.02 s) and C4.5 (57.58%, 0.03 s)]
Figure 5.3: Percentage of Correctly Classified Instances using 50 top-ranked genes
[Bar chart: correctly classified instances (%) and time taken (s) for RBFN (84.85%, 0.08 s), MLP (81.82%, 23.72 s), NB (78.79%, 0.03 s) and C4.5 (57.58%, 0.08 s)]
Figure 5.4: Percentage of Correctly Classified Instances using 100 top-ranked genes
As stated by Dian Hui Wang et al. (2002), one of the drawbacks of MLP is that the
time needed to train the data becomes much longer as the number of inputs exceeds
100. The obtained results clearly show that the time taken by MLP to build the
model, especially when the 100 top-ranked genes were used, was very long
compared to the other classifiers. The difference in the time taken by MLP is quite
large compared to RBFN, NB and C4.5; the time taken by MLP when 150
top-ranked genes were used is 52.24 seconds, which strengthens this observation.
Unlike MLP, RBFN took less time to train the data due to its simple structure.

The extracted gene features are distributed in a high dimensional space
with complex characteristics. This makes it difficult to fit a model using
parameterized approaches like Naïve Bayes, yielding a poor result. The poor
performance of C4.5 is due to the decision tree technique generating complex
rules that are difficult to understand. In brief, RBFN is the best classifier to classify
the breast cancer microarray data.
5.2.3
Analysis Results of Classifier Performance
As the first step of ROC analysis, the true positive rate (sensitivity) and false
positive rate (1-specificity) values of RBFN, MLP, NB and C4.5 are given in the
following table. Plots of the results in the ROC space are shown in Figures 5.5
and 5.6.
Table 5.5: TPR and FPR of classifiers

Classifier                              Top-ranked genes    True Positive Rate, TPR    False Positive Rate, FPR
Radial Basis Function Network, RBFN     50                  0.909                      0.067
                                        100                 0.848                      0.149
Multilayer Perceptron, MLP              50                  0.848                      0.149
                                        100                 0.818                      0.19
Naïve Bayes, NB                         50                  0.818                      0.19
                                        100                 0.788                      0.231
C4.5                                    50                  0.576                      0.463
                                        100                 0.576                      0.463
[Scatter plots of the classifiers' (FPR, TPR) points in ROC space for RBFN, MLP, NB and C4.5; axes: FPR or (1 - Specificity) versus TPR or Sensitivity]
Figure 5.5: ROC Space of 50 top-ranked genes
Figure 5.6: ROC Space of 100 top-ranked genes
Based on the graphs, the result of RBFN is clearly the best among the classifiers.
The results of MLP and NB differ only slightly from RBFN, while the result of C4.5
lies almost on the random guess line (the diagonal). The second step of ROC
analysis is the ROC curves. Therefore, the ROC curves of the networks, which
contain information for understanding their performance, are obtained as shown in
the following figures. However, the curves of the networks are hard to compare
within the same illustration.
Figure 5.7: ROC Curve of 50 top-ranked genes
Figure 5.8: ROC Curve of 100 top-ranked genes
Therefore, to compare the performances of the classifiers, the results of the area
under the ROC curves given in Table 5.6 are used.
Table 5.6: ROC Area of Classifiers

Classifier                              Top-ranked genes    ROC Area (Area Under Curve)
Radial Basis Function Network, RBFN     50                  0.94
                                        100                 0.882
Multilayer Perceptron, MLP              50                  0.94
                                        100                 0.932
Naïve Bayes                             50                  0.887
                                        100                 0.838
C4.5                                    50                  0.556
                                        100                 0.556
Based on the table, we can observe that RBFN, with an area of 0.94 for the 50
top-ranked genes, can be chosen as the best classifier for the breast cancer
microarray data. Although MLP has the same area as RBFN, its TPR-FPR point
(0.848, 0.149) is worse than that of RBFN (0.909, 0.067). Naïve Bayes, with areas of
0.887 and 0.838, comes after MLP, while C4.5 yields the lowest area for both sets of
top-ranked genes.
5.3
Summary
In this chapter, the experimental results of feature selection, classification
and evaluation of the classifiers have been analyzed. From the analysis of the
feature selection techniques, the ReliefF algorithm was chosen as the best technique
for selecting informative genes. The Radial Basis Function Network was proven to
be the best classifier for classifying the breast cancer microarray data based on its
classification accuracy. Furthermore, the evaluation of its performance according to
the area under the ROC curve strongly supports this as well.
CHAPTER 6
DISCUSSIONS AND CONCLUSION
6.1
Introduction
This final chapter discusses the research overall. The achievements that
have been obtained will be stated, and the problems that arose during the study will
be identified. Appropriate suggestions will be proposed to overcome the problems,
followed by the conclusion to wrap up the study.
6.2
Overview
As stated in Chapter 1, the goal of this study is to classify the breast cancer
microarray data. Several objectives and scopes were determined as guides to
achieve the goal. The objectives are as follows:

i)   To select significant features from microarray data using a suitable
     feature selection technique.
ii)  To classify selected features into non-cancerous and cancerous using the
     Radial Basis Function (RBF) network.
iii) To evaluate the performance of the classifiers using the Receiver
     Operating Characteristic (ROC).
Related topics have been discussed in the literature sections, which are breast
cancer, microarray technology, feature selection, classifiers and the metric to
evaluate classifier performance. These topics gave the views and ideas needed to
understand the overall study before its implementation. This literature study is vital
to the research because it is the second step of the study towards achieving all the
objectives.
After collecting the required information in the literature study, the next step
towards achieving the objectives is the methodology. The methodology provides the
guidelines for designing appropriate implementations for the research so that the
objectives can be achieved. After the methodology has been determined,
experiments can be carried out to obtain the results.

Finally, the experimental results were analyzed so that the objectives could
be achieved. Therefore, the following section discusses the achievements and
contributions of this research. Moreover, the problems encountered during the
implementation will be discussed, and the best solution for them proposed.
6.3
Achievements and Contributions
The achievements obtained from this research can also be considered as its
contributions. The following are the achievements of this research:

i)   The best feature selection technique for selecting informative genes
     from the breast cancer microarray data has been identified. The ReliefF
     algorithm was chosen as the best technique, ahead of Information Gain
     and Chi-Square, based on the accuracy of the classifiers during
     classification.

ii)  The selected features were classified using four types of classifiers, and
     their classification accuracies were compared. The Radial Basis
     Function Network has been proven to be an efficient classifier for the
     breast cancer microarray data, since it shows the highest accuracy
     (90.91% for the 50 top-ranked genes) among the classifiers.

iii) The performances of the classifiers were evaluated using the Receiver
     Operating Characteristic and compared based on the area under the
     curve. Among the four classifiers, the Radial Basis Function Network
     also shows the best performance, with an area of 0.94 and a TPR-FPR of
     0.909-0.067 for the 50 top-ranked genes.

iv)  The classification accuracy decreases when the classifiers classified the
     100 top-ranked genes compared to the 50 top-ranked genes. This proves
     that the efficiency of the classification depends on the number of genes.
Based on the stated achievements, it can be concluded that all the objectives of this
research have been achieved.
6.4
Research Problem
Although several achievements have been obtained from this study, a
problem arose during the implementation of this research. The problem is as
follows:

i)  The time taken by the chosen feature selection technique, ReliefF,
    was too long compared to the other two techniques, even though it was an
    effective method for selecting informative genes.
6.5
Suggestion to Enhance the Research
It is necessary to overcome the problem that arose in the study in order to
improve this research. Since the problem is due to the time taken by ReliefF, other
feature selection techniques are suggested for comparison with the ReliefF
algorithm. The suggested techniques are other types of novel extended Relief
methods, such as orthogonal Relief (O-Relief), extended ReliefF (E-ReliefF) or the
hybrid of ReliefF-mRMR (minimum redundancy maximum relevance).
6.6
Conclusion
Breast cancer is the number one killer disease among women compared to
other diseases. Early detection of breast cancer helps to secure a better chance of
survival. There are various ways to detect, cure and treat breast cancer nowadays.
Much research has also been done in this area in order to reduce the number of
deaths caused by breast cancer. The advent of microarray technology will increase
the number of studies on breast cancer due to its advantages.

The microarray technology allows thousands of gene expressions to be
determined simultaneously. This advantage of the microarray has led to its
application in the medical area, namely in the management and classification of
cancer and infectious diseases. However, microarray technology also suffers several
drawbacks, namely the high dimensionality of the data and the presence of irrelevant
information, which hinder accurate disease classification.

However, there are various techniques of feature selection that can be used to
solve these problems and improve the accuracy of the classification. By using
feature selection, only the appropriate subset of genes will be selected from the
microarray. There are three approaches to feature selection, which are the filter,
wrapper and embedded methods. Based on previous research, the filter method with
multivariate techniques is the best and most commonly used among these methods.

The goal of the classification is to distinguish between the cancerous
(malignant) and non-cancerous (benign) samples. There are various classifiers,
whether traditional or recently developed, that have been studied to classify
microarray data. However, not all classifiers show high accuracy, so a suitable
metric should be used to evaluate the performance of the classifier.

In a nutshell, it is hoped that this study can provide valuable information for
breast cancer research using microarray technology in order to reduce the number of
deaths caused by breast cancer.
REFERENCES
Abu-Khalaf, M.M., Harris, L.N., and Chung, G.C. (2007). Chapter 10 DNA and
Tissue Microarrays. In Patel, H.R.H., Arya, M., and Shergill, I.S. Basic
Science Techniques in Clinical Practice. London: Springer
Adomas, A., Heller, G., Olson, A., Osborne, J., et al. (2008). Comparative analysis
of transcript abundance in Pinus sylvestris after challenge with a
saprotrophic, pathogenic or mutualistic fungus. Tree Physiology. 28, 885-897
Ahmad, A., and Dey, L. (2005). A feature selection technique for classificatory
analysis. Pattern Recognition Letters. 26(1), 43-56
Alizadeh A.A., Eisen M.M., Davis, R.E., Ma, C., Lossos, I.S., et al., (2000). Diffuse
types of diffuse large B-cell lymphoma identified by gene expression
profiling. Nature. 403, 503-511
Arauzo-Azofra, A., Benitez, J.M., Castro, J.L. (2004). A feature set measure based
on Relief. The 5th International Conference on Recent Advances in Soft
Computing (RASC2004). Nottingham Trent Univ. 104-109
Ben-Bassat, M. (1982) Pattern Recognition and reduction of dimensionality. In
Krishnaiah, P. and Kanal, L., (eds) Handbook of Statistics II, Vol I. NorthHolland, Amsterdam; 773-791
Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. E., and Yakhini,
Z. (2000). Tissue classification with gene expression profiles. Journal of
Computational Biology. 7; 559-584
Biesalski HK, Bueno de Musqeito B, Chesson A, et al., (1998). European Consensus
Statement on Lung Cancer: Risk Factors and Prevention. CA : a cancer
journal for clinicians. 48(3),167-176
Boldrick, J.C., Alizadeh, A.A., Diehn, M., Dudoit, S., Liu, C.L., et al. (2002).
Stereotyped and specific gene expression programs in human innate immune
responses to bacteria. Proceedings of the National Academy of Sciences of
the USA. 99, 972-977
Bors, A. G., Gabbouj, G., (1994). Minimal topology for radial basis function neural
network for pattern classification. Digital Signal Processing: a review
journal. 3, 302-309
Bors, A.G., Pitas, I., (1996). Median radial basis functions neural network. IEEE
Trans. On Neural Networks. 7(6), 1351-1364
Bray, F., D., P. M., and Parkin, M (2004). The changing global patterns of female
breast cancer incidence and mortality. Breast Cancer Research. 6(6)
Burchard, J., unpublished.
Chang, C. Y., and Chung, P. C. (1999). A Contextual-Constraint Based Hopfield
Neural Cube for Medical Image Segmentation. Proceedings of The IEEE
Region 10 Conference on TENCON. 2, 1170-1173
Cheung, VG., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R., and Childs, G.
(1999). Making and reading microarrays. Nature Genetics. 21, 15-19
Christian Harwanegg and Reinhard Hiller. (2005). Protein microarrays for the
diagnosis of allergic diseases: State-of-the-art and future development.
Clinical Chemistry and Laboratory Medicine. 29 (4), 272-277. Walter de
Gruyter.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., et al. (1998). The transcriptional
program of sporulation in budding yeast. Science. 282, 699-705
Compton, R., Lioumi, M., and Falciani, F. (2008). Microarray Technology.
Encyclopedia of Molecular Pharmacology. Berlin Heidelberg, New York:
Springer-Verlag
Cover, T., and Thomas, J. (1991). Elements of Information Theory. New York :John
Wiley and Sons
Dasarathy, B. (1991). Nearest Neighbor Norms: NN Pattern Classification
Techniques. IEEE Computer Society Press, Los Alamitos, CA, USA
Dash, M., and Liu, H. (1997). Feature selection for classification. Intelligent Data
Analysis. 1, 131-156
Downey, T.J., Jr., Meyer, D. J., Price, R.K., and Spitznagel, E.L. (1999). Using the
receiver operating characteristic to assess the performance of neural classifiers.
International Joint Conference on Neural Networks (IJCNN'99). 10-16 July.
Washington, DC, USA. 5, 3642-3646
Driouch, K., Landemaine, T., Sin, S., Wang, SX., and Lidereau, R. (2007). Gene
arrays for diagnosis, prognosis and treatment of breast cancer metastasis.
Clinical & Experimental Metastasis. 24(8), 575-585
Duda, P., et al. (2001). Pattern Classification. Wiley, New York.
English, DR., Armstrong, BK., Kricker, A., Fleming, C. (1997). Sunlight and cancer.
Cancer Causes and Control. 8(3); 271-83, 283
Farzana Kabir Ahmad (2008). An Integrated Clinical and Gene Expression Profiles
for Breast Cancer Prognosis Model. Master, Universiti Utara Malaysia
Fawcett, T. (2005). An introduction to ROC analysis. Pattern Recognition Letters.
Palo Alto, USA. 27(2006), 861-874
Fayyad, U., and Irani, K.(1993). Multi-interval discretization of continuous-values
attributes for classification learning. Proceedings of the 13th International
Joint Conference on Artificial Intelligence. Portland, OR. 1022-1027
Ferlay J., Bray F., Pisani P., and Parkin D. (2001). GLOBOCAN 2000: Cancer
incidence, mortality and prevalence worldwide
Ferran, E.A., Pflugfedder, B., and Ferrarap. (1994). Self Organized neural maps of
human protein sequences. Protein Science. 3,507-521
Ferri, F., et al. (1994). Pattern Recognition in Practice IV, Multiple Paradigms,
Comparative Studies and Hybrid Systems. Elsevier, Amsterdam. 403-413
Fodor, S.P., et al. (1991). Light-directed, spatially addressable parallel chemical
synthesis. Science. 251, 767-773
Fodor, S.P., et al. (1993). Multiplexed biochemical assays with biological chips.
Nature. 364, 555-556
Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M. and
Haussler, D. (2000). Support vector machine classification and validation of
cancer tissue samples using microarray expression data. Bioinformatics.
16(10), 906-914
Gianluca Bontempi and Benjamin Haibe-Kains. 2008. Feature selection methods for
mining bioinformatics data. Bruxelles, Belgium: ULB Machine Learning
Group
Green, D. M., and Swets, J. A. (1996). Signal detection theory and psychophysics.
New York: John Wiley and Sons, Inc
Gruvberger-Saal, S.K., Cunliffe, H.E., and Carr, M.K. 2006. Microarrays in breast
cancer research and clinical practice- the future lies ahead. Endocrine-Related
Cancer. 13, 1017-1031
Guyon, I., Weston, J., Stephen Barnhill, M. D., and Vapnik, V. (2002). Gene
Selection for Cancer Classification using Support Vector Machines. Machine
Learning. 46; 389-422
Guyon , I., and Elisseeff, A. (2003). An introduction to variable and feature
selection. Journal of Machine Learning Research. 3, 1157-1182
Hall, M. (1999). Correlation-based feature selection for machine learning. Doctoral
Philosophy, Waikato University, New Zealand.
Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. New York:
Macmillan College Publishing Co. Inc
Holland, J. (1975). Adaptation in Natural and Artificial Systems. University of
Michigan Press, Ann Arbor.
Horng, M-H. (2008). Texture Classification of the Ultrasonic Images of Rotator Cuff
Disease based on Radial Basis Function Network. International Joint
Conference on Neural Networks (IJCNN 2008). Hong Kong. 91-97
Inza, I., et al. (2000). Feature subset selection by Bayesian networks based
optimization. Artificial Intelligence. 123, 157-184
Jin, X., Xu, A., Bie, R., and Guo, P. (2006). Machine Learning and Chi-Square
Feature Selection for Cancer Classification Using SAGE Gene Expression
Profiles. In Li, J. et al. (Eds) Data Mining for Biomedical Applications. (pp:
106-115). Berlin Heidelberg: Springer-Verlag
Kasabov, N.K. (2002). Evolving Connectionist Systems, Methods and Applications in
Bioinformatics, Brain Study and Intelligent Machines. Verlag: Springer
Kerekes, J. (2008). Receiver Operating Characteristic Curve Confidence Intervals
and Regions. IEEE Geoscience and Remote Sensing Letters. 5(2), 251-255
Kim, YS., Street W.N., Menczer, F. (2003). Feature selection in data mining. In John
Wang (Ed.) Data Mining: opportunities and challenges. (80-105) Hershey,
PA, USA : IGI Publishing
Kira, K., and Rendell, L.A. (1992). A practical approach to feature selection.
Proceedings of the ninth international workshop on Machine learning.
Morgan Kaufmann Publishers Inc. 249-256
Kittler, J. (1978). Pattern Recognition and Signal Processing, Chapter Feature Set
Search Algorithms Sijthoff and Noordhoff, Alphen aan den Rijn, Netherlands;
41-60
Kohavi, R., and John, G.H. (1997). Wrappers for feature subset selection. Artificial
Intelligence. 97, 273-324
Koller, D. and Sahami, M. (1996). Toward optimal feature selection. Proceedings of
the Thirteenth International Conference on Machine Learning, Bari, Italy,
284-292
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF.
European Conference on Machine Learning. 171-182
Konvacevic, D., and Loncaric, S. (1997). Radial Basis Function Based Image
Segmentation Using A Receptive Field. Proceedings of Tenth IEEE
Symposium on Computer-Based Medical Systems. 126-130
Kuan, M. M., Lim, C. P., Ismail, O., Yuvaraj, R. M., and Singh, I. (2002).
Application of Artificial Neural Networks to The Diagnosis of Acute
Coronary Syndrome. Proceedings of International Conference on Artificial
Intelligence in Engineering and Technology. Kota Kinabalu, Malaysia. 470475
Kégl, B., Krzyzak, A., and Niemann, H. (1998). Radial Basis Function Network in
Nonparametric Classification and Function Learning. Proceedings of the
Fourteenth International Conference on Pattern Recognition. 16-20 August.
1, 565-570
Leonard, J.A., and Kramer, M.A. (1991). Radial basis function network for
classifying process faults. Control Systems Magazine, IEEE. 11(3), 31-38
Li, Y., Campbell, C., and Tipping, M. (2002). Bayesian automatic relevance
determination algorithms for classifying gene expression data.
Bioinformatics. 18(10); 1332-1339
Li, X., Zhou, G., and Zhou, L. (2006). Novel Data Classification Method on Radial
Basis Function Networks. Proceedings of the Sixth International Conference
on Intelligent Systems Design and Applications (ISDA’06)
Lim, G.C.C., Rampal, S., and Yahaya, H. (Eds.) (2008). Cancer Incidence in
Peninsular Malaysia, 2003-2005. Kuala Lumpur: National Cancer Registry
Lin, C.T., and Lee, C. S. G. (1996). Neural Fuzzy Systems - A Neuro-Fuzzy
Synergism to Intelligent Systems. New Jersey: Prentice Hall.
Liu, H., and Setiono, R. (1995). Chi2: feature selection and discretization of numeric
attributes. Proceedings of Seventh International Conference on Tools with
Artificial Intelligence. 11 May-11 August 1995. 388-391
Liu, H., and Motoda, H. (1998). Feature Selection for Knowledge Discovery and
Data Mining. Boston: Kluwer Academic Publishers.
Liu. J. and Iba, H. (2002). Selecting Informative Genes Using a Multiobjectives
Evolutionary Algorithm. Proceedings of the 2002 congress. 297-302
Lockhart, DJ., Dong, H., Byrne, MC., et al. (1996). Expression monitoring by
hybridization to high-density oligonucleotide arrays. Nature Biotechnology.
14, 1675-1680
Markov, R.S., and Igor, K. (2003). Theoretical and empirical analysis of Relief and
RReliefF. Machine Learning Journal. 53, 23-69
Mashor, M. Y., Mat Isa, N. A., and Othman, N. H. (2002). Automatic Pap Smear
Screening Using HMLP Network. Proceedings of International Conference
on Artificial Intelligence in Engineering and Technology. Kuala Lumpur,
Malaysia. 453-457
Mat Isa, N.A., Mashor, M. Y., and Othman, N.H. (2002). Diagnosis of Cervical
Cancer Using Hierarchical Radial Basis Function (HRBF) Network.
Proceedings of International Conference on Artificial Intelligence in
Engineering and Technology. Kota Kinabalu, Malaysia. 458-463
Matej, S., Lewitt, R.M., (1996). Practical considerations for 3-D image
reconstruction using spherically symmetric volume elements. IEEE Trans. On
Medical Imaging.15(1), 68-78
Mat Isa, N.A., Mat Sakim, H.A., Zamli, K.Z., Haji A Hamid, N., and Mashor, M.Y.
(2004). Intelligent Classification System for Cancer Data Based on Artificial
Neural Network. Proceedings of the 2004 IEEE Conference on Cybernetics
and Intelligent Systems. 1-3 December. Singapore.
Miki, R., Kadota, K., Bono, H., Mizuno, Y., Tomaru, Y., et al. (2001). Delineating
developmental and metabolic pathways in vivo by expression profiling using
the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. Proc Natl
Acad Sci USA. 98(2), 199-204
Miller, L.D and Liu, E.T. 2007. Expression genomics in breast cancer research:
microarrays at the crossroads of biology and medicine. Breast Cancer
Research. 9, 206
Mitchell, T.M. (1997). Machine Learning. New York : McGraw-Hill
Momin, B. A., Mitra, S., and Gupta, R. N. (2006). Reduct Generation and
Classification of Gene Expression Data. International Conference on Hybrid
Information Technology (ICHIT’06). 6-11 November. 1, 699-708
Moody, J., (1989). Fast learning in networks of locally-tuned processing units.
Neural Computation. 1, 281-294
National Cancer Institute. (2005). What You Need To Know About Breast Cancer.
National Institutes of Health, U.S Department of Health and Human
Services
Ng, E. Y. K., Peh, Y. C., Fok, S. C., Ng, F. C., and Sim, L. S. J. (2002). Early
Diagnosis of Breast Cancer Using Artificial Neural Network with Thermal
Images. Proceedings of Kuala Lumpur International Conference on
Biomedical Engineering. Kuala Lumpur , Malaysia. 458-463
Orr, M. J. L. (1996). Radial Basis Function Networks. Edinburgh, Scotland
Oyang, Y.J., Hwang S.C., Ou, Y.Y., et al. (2005). Data Classification with radial
basis function networks based on a novel kernel density estimation algorithm.
IEEE Transaction on Neural networks. 16(1), 225-236
O'Reilly, K.M., McLaughlin, A.M., Beckett, W.S., and Sime, P.J. (2007). Asbestos-Related
Lung Disease. American Family Physician. 75(5), 683-688, 690
Palacios, G., Quan, P-L., Jabado, O.J., Conlan, S., Hirschberg, D.L., et al. (2007).
Panmicrobial Oligonucleotide Array for Diagnosis of Infectious Diseases.
Emerging Infectious Diseases. 13, 73-81
Park, H., and Kwon, H-C. (2007). Extended Relief Algorithms in Instance-Based
Feature Filtering. Sixth International Conference on Advanced Language
Processing and Web Information Technology. 123-128
Park, J., and Sandberg, I.W. (1993). Approximation and Radial-Basis-Function
Networks. Neural Computation. 5, 305-316
Park, S.H., Goo, J.M., and Jo, C.-H. (2004). Receiver Operating Characteristic
(ROC) Curve: Practical Review for Radiologists. Korean Journal of
Radiology. 5, 11-18
Pease, A.C., et al. (1994). Light-generated oligonucleotide arrays for rapid DNA
sequence analysis. Proceedings of the National Academy of Sciences of the
USA. 91, 5022-5026
Pedrycz, W. (Ed.) (1998). Computational Intelligence: An Introduction. Boca Raton,
FL: CRC Press
Poggio, T., and Girosi, F. (1990). Networks for approximation and learning.
Proceedings of the IEEE. 78(9), 1481-1497
Powell, M.J.D. (1977). Restart Procedures for the Conjugate Gradient Method.
Mathematical Programming. 12, 241-254
Ramaswamy, S., and Golub, T.R. (2002). DNA microarrays in clinical oncology.
Journal of Clinical Oncology. 20, 1932-1941
Reisert, M., and Burkhardt, H. (2006). Feature Selection for Retrieval Purposes. In
Lecture Notes in Computer Science: Image Analysis and Recognition. 4141,
661-672. Heidelberg: Springer-Verlag
Sankupellay, M., and Selvanathan, N. (2002). Fuzzy Neural and Neural Fuzzy
Segmentation of Magnetic Resonance Images. Proceedings of Kuala Lumpur
International Conference on Biomedical Engineering. Kuala Lumpur,
Malaysia. 47-51
Sauter, G., and Simon, R. (2002). Predictive molecular pathology. New England
Journal of Medicine. 347, 1995-1996
Sanner, R.M., and Slotine, J.-J.E. (1994). Gaussian networks for direct adaptive
control. IEEE Transactions on Neural Networks. 3(6), 837-863
Saeys, Y., et al. (2007). A review of feature selection techniques in bioinformatics.
Bioinformatics. 23(19), 2507-2517. Oxford University Press
Schena, M., Shalon, D., Davis, R., and Brown, P.O. (1995). Quantitative monitoring
of gene expression patterns with a complementary DNA microarray. Science.
270, 467-470
Siedlecki, W., and Sklansky, J. (1988). On automatic feature selection.
International Journal of Pattern Recognition and Artificial Intelligence. 2,
197-220
Sikonja, M.R., and Kononenko, I. (1997). An adaptation of Relief for attribute
estimation in regression. Machine Learning: Proceedings of the Fourteenth
International Conference. San Francisco: Morgan Kaufmann. 296-304
Sikonja, M.R., and Kononenko, I. (2003). Theoretical and Empirical Analysis of
ReliefF and RReliefF. Machine Learning. 53(1-2), 23-69
Skalak, D. (1994). Prototype and feature selection by sampling and random mutation
hill climbing algorithms. Proceedings of the Eleventh International
Conference on Machine Learning, 293-301
Sorlie, T., Perou, C.M., Tibshirani, R., et al. (2001). Gene expression patterns of
breast carcinomas distinguish tumor subclasses with clinical implications.
Proceedings of the National Academy of Sciences of the USA. 98, 10869-10874
Sun, J., Shen, R.M., and Han, P. (2003). An original RBF network learning algorithm.
Chinese Journal of Computers. 26(11), 1562-1567
Tago, C., and Hanai, T. (2003). Prognosis prediction by microarray gene expression
using support vector machine. Genome Informatics. 14, 324-325
Tinker, A.V., Boussioutas, A., and Bowtell, D.D.L. (2006). The challenges of gene
expression microarrays for the study of human cancer. Cancer Cell. 9(5),
333-339
Tokan, F., Turker, N., and Yildirim, T. (2006). ROC Analysis as a Useful Tool for
Performance Evaluation of Artificial Neural Networks. In Kollias, S., et al.
(Eds). ICANN 2006, Part II, LNCS 4132. 923-931. Springer-Verlag.
Tou, J.T., and Gonzalez, R.C. (1974). Pattern Recognition Principles. London:
Addison-Wesley.
Tzouvelekis, A., Patlakas, G., and Bouros, D. (2004). Application of microarray
technology in pulmonary diseases. Respiratory Research. 5(26)
van de Vijver, M.J., He, Y.D., van't Veer, L.J., Dai, H., Hart, A.A., et al. (2002). A
gene-expression signature as a predictor of survival in breast cancer. New
England Journal of Medicine. 347, 1999-2009
Vapnik, V.N. (1998). Statistical Learning Theory. New York: Wiley-Interscience
Wang, D.H., Lee, N.K., Dillon, S.T., and Hoogenraad, N.J. (2002). Protein Sequences
Classification Using Radial Basis Function (RBF) Neural Networks.
Proceedings of the 9th International Conference on Neural Information
Processing (ICONIP'02). 2
Wang, H.C., Dopazo, J., De La Fraga, L.G., Zhu, Y.P., and Carazo, J.M. (1998).
Self-organizing tree-growing network for the classification of protein sequences.
Protein Science. 2613-2622
Wang, Y., Makedon, F.S., Ford, J.C., and Pearlman, J. (2005). HykGene: a hybrid
approach for selecting marker genes for phenotype classification using
microarray gene expression data. Bioinformatics. 21(5), 1530-1537
Westin, L.K. (2001). Receiver operating characteristic (ROC) analysis: Evaluating
discriminance effects among decision support systems. Technical Report,
Department of Computing Science, Umeå University, Umeå, Sweden
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., and Vapnik, V.
(2000). Feature selection for SVMs. Advances in Neural Information
Processing Systems. 13
Whitfield, M.L., Sherlock, G., Saldanha, A.J., et al. (2002). Identification of genes
periodically expressed in the human cell cycle and their expression in tumors.
Mol Biol Cell. 13, 1977-2000
Witten, I.H., and Frank, E. (1999). Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. San Francisco, CA: Morgan
Kaufmann
Wu, C.H., Berry, M., Shivakumar, S., and McLarty, J. (1995). Neural networks for
full-scale protein sequence classification: sequence encoding with singular
value decomposition. Machine Learning. 21, 177-193
Wu, C.H. (1997). Artificial neural networks for molecular sequence analysis.
Computers & Chemistry. 21, 237-256
Yu, L., and Liu, H. (2004). Efficient feature selection via analysis of relevance and
redundancy. Journal of Machine Learning Research. 5, 1205-1224
zur Hausen, H. (1991). Viruses in human cancers. Science. 254(5035), 1167-1173
Zhu, Z., Ong, Y-S., and Dash, M. (2007). Markov blanket-embedded genetic
algorithm for gene selection. Pattern Recognition. 40, 3236-3248
APPENDIX A
PROJECT 1 GANTT CHART
APPENDIX B
PROJECT 2 GANTT CHART