COMPARATIVE STUDY OF FEATURE SELECTION METHOD OF
MICROARRAY DATA FOR GENE CLASSIFICATION
NURULHUDA BINTI GHAZALI
A project report submitted in partial fulfillment of the
requirements for the award of the degree of
Master of Science (Computer Science)
Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia
OCTOBER 2009
To my beloved Mummy and Abah…
Hazijun bt. Abdullah and Ghazali bin Sulong
My beloved sisters..
Nurhanani and Nur Hafizah
My beloved brother..
Ikmal Hakim
My brothers-in-law..
Saiful Azril and Faridun Naim
My beloved nieces..
Sarah Afrina and Sofea Alisya
My supervisor..
Assoc. Prof. Dr.Puteh Saad
and last but not least to all my supportive friends especially
Syara, Radhiah, Zalikha, Umi and Hidzir..
“Thank you for all the support and love given”
ACKNOWLEDGEMENT
In the name of Allah, Most Gracious, Most Merciful.
All praise and thanks be to Allah for His guidance, which has led me in completing this research. His blessings have given me strength and courage throughout this past year and have helped me overcome difficulties during the research period.

First and foremost, I would like to take this opportunity to express my sincere gratitude to those who have assisted me in finishing this research. To my dear supervisor, Assoc. Prof. Dr. Puteh Saad, thank you for all your support and guidance in showing me the right path towards completing this research. I truly appreciate the advice and motivation you have given me throughout this period.

My infinite thanks are dedicated to my loving and caring family, who have cherished me and given me their full support in every way. I deeply appreciate all the motivation and inspiration. Without them, it would have been impossible for me to finish my research.

Last but not least, endless appreciation goes to all my fellow friends and classmates for their support and encouragement. Their friendship never fails to amaze me.

May Allah S.W.T. bless them all and repay all of their kindness and sacrifices.
ABSTRACT
Recent advances in biotechnology, such as microarrays, offer the ability to measure the expression levels of thousands of genes in parallel. Analysis of microarray data can provide understanding and insight into gene function and regulatory mechanisms. This analysis is crucial for identifying and classifying cancer diseases. Recent technology in cancer classification is based on the gene expression profile rather than on the morphological appearance of the tumor. However, this task is made more difficult by the noisy nature of microarray data and the overwhelming number of genes. It is therefore important to select a small subset of genes, referred to as informative genes, to represent the thousands of genes in microarray data. These informative genes are then classified into their appropriate classes. To achieve the best solution to this classification problem, we propose an approach combining the minimum Redundancy-Maximum Relevance feature selection method with a Probabilistic Neural Network classifier. The minimum Redundancy-Maximum Relevance method is used to select the informative genes while the Probabilistic Neural Network acts as the classifier. This approach has been tested on a well-known cancer dataset, Leukemia. The results achieved show that the selected genes give high classification accuracy. This reduction of genes helps lift some of the burden from biologists, and the improved classification accuracy can be widely used to detect cancer at an early stage.
ABSTRAK
Recent advances in biotechnology, for example microarrays, allow the expression levels of thousands of genes to be measured in parallel. Analysis of microarray data can provide understanding and knowledge about the function of a gene and its regulatory mechanisms. This analysis is important for identifying and classifying chronic diseases, especially cancer. The technology recently used in cancer classification is based on information from gene expression rather than on the physical appearance of the tumor. However, this task becomes difficult because of the noise present in microarray data processing and the very large number of genes. It is therefore an important issue to select only a small number of genes from the thousands of genes in microarray data; these are called informative genes. These informative genes are then classified according to their appropriate classes. To achieve the best solution to this problem, we propose a gene selection approach, minimum Redundancy-Maximum Relevance, together with the Probabilistic Neural Network classifier. Minimum Redundancy-Maximum Relevance is used to select the informative genes while the Probabilistic Neural Network acts as the classifier. This method has been tested on a cancer dataset, Leukemia. The experimental results obtained are very satisfactory; they can ease the work of biologists and give society hope of detecting cancer at an early stage.
TABLE OF CONTENTS

CHAPTER   TITLE

          DECLARATION
          DEDICATION
          ACKNOWLEDGEMENT
          ABSTRACT
          ABSTRAK
          TABLE OF CONTENTS
          LIST OF TABLES
          LIST OF FIGURES
          LIST OF ABBREVIATIONS
          LIST OF APPENDICES

1         INTRODUCTION
          1.1 Introduction
          1.2 Background of the Problem
          1.3 Problem Statement
          1.4 Objectives of Research
          1.5 Scope of Research
          1.6 Importance of the Study

2         LITERATURE REVIEW
          2.1 Introduction
          2.2 Genes and Gene Expression
          2.3 Microarray Technology
          2.4 Feature Selection
              2.4.1 ReliefF Algorithm
              2.4.2 Information Gain
              2.4.3 Chi Square
              2.4.4 Minimum Redundancy-Maximum Relevance Feature Selection
          2.5 Classification
              2.5.1 Random Forest
              2.5.2 Naïve Bayes
              2.5.3 Probabilistic Neural Network
          2.6 Challenges in Genetic Expression Classification
          2.7 Summary

3         METHODOLOGY
          3.1 Introduction
          3.2 Research Framework
              3.2.1 Problem Definition
              3.2.2 Related Studies
              3.2.3 Study on Proposed Method
              3.2.4 Data Preparation
              3.2.5 Feature Selection
              3.2.6 Classification
              3.2.7 Evaluation and Validation
              3.2.8 Result Analysis
          3.3 Leukemia
          3.4 Software Requirement
          3.5 Summary

4         IMPLEMENTATION
          4.1 Introduction
          4.2 Data Format
          4.3 Data Preprocessing
          4.4 Feature Selection Method
              4.4.1 mRMR Feature Selection Method
              4.4.2 ReliefF Algorithm
              4.4.3 Information Gain
              4.4.4 Chi Square
          4.5 PNN Classifier
          4.6 Experimental Settings
              4.6.1 Feature Selection
              4.6.2 Classification
          4.7 Summary

5         EXPERIMENTAL RESULT ANALYSIS
          5.1 Overview
          5.2 Analysis of Results
          5.3 Discussion
          5.4 Summary

6         DISCUSSION AND CONCLUSION
          6.1 Overview
          6.2 Research Contribution
          6.3 Problems and Limitation of Research
          6.4 Suggestions for Better Research

          REFERENCES
          APPENDIX A
          APPENDIX B
LIST OF TABLES

TABLE NO.  TITLE

2.1        Schemes in mRMR Optimization Condition
2.2        Comparison of k-NN and PNN using 4 Datasets
4.1        Leukemia Dataset
LIST OF FIGURES

FIGURE NO.  TITLE

2.1         DNA Structure
2.2         Process of Producing Microarray
2.3         Sample of Microarray
2.4         Comparison of 3 Methods of Feature Selection
2.5         Architecture of PNN
3.1         Research Framework
3.2         Sample of Dataset
3.3         Sample of Dataset
3.4         Process of Feature Selection
3.5         Process of Classification
3.6         Overall Process of Feature Selection and Classification
3.7         Abnormal Proliferation of Cells in Bone Marrow Compared to Normal Bone Marrow
4.1         Original Dataset in ARFF Format Showing Genes Values
4.2         Original Dataset in ARFF Format Showing Class Names
4.3         Dataset in IOS GeneLinker Software before Discretization
4.4         Dataset in IOS GeneLinker Software after Discretization
4.5         Discretized Data in CSV Format
4.6         Continuous Data in CSV Format
4.7         ReliefF Algorithm
4.8         Chi Square Algorithm
5.1         Classification using PNN for Different Types of Data
5.2         Classification Accuracy using PNN for Different Schemes in Feature Selection using mRMR
5.3         Classification using PNN by Different Number of Selected Features
5.4         Comparison of Classification Accuracy by Different Feature Selection Methods using PNN
5.5         Comparison of Classification Accuracy using Different Classifiers
5.6         Classification Accuracy using 10-fold Cross Validation
LIST OF ABBREVIATIONS

ALL   - Acute Lymphoblastic Leukaemia
AML   - Acute Myeloid Leukaemia
ARFF  - Attribute-Relation File Format
CSV   - Comma-Separated Values
mRMR  - Minimum Redundancy Maximum Relevance
PNN   - Probabilistic Neural Network
DNA   - Deoxyribonucleic Acid
k-NN  - k-Nearest Neighbor
RNA   - Ribonucleic Acid
mRNA  - Messenger Ribonucleic Acid
LIST OF APPENDICES

APPENDIX  TITLE

A         Project 1 Gantt Chart
B         Project 2 Gantt Chart
CHAPTER 1
INTRODUCTION
1.1
Introduction
Every living organism has discrete hereditary units known as genes. Each gene provides some function or mechanism, either by itself or in combination with other genes, that eventually produces some property of the organism. The genome is the complete set of genes for an organism and can be regarded as the 'library' of genetic instructions that an organism inherits (Campbell and Reese, 2002). Each gene is part of a deoxyribonucleic acid (DNA) molecule, which consists of two long strands tightly wound together in a spiral structure known as a double helix (Amaratunga and Cabrera, 2004). Along each of these strands lie genes whose sequences differ from organism to organism. This makes each organism unique and different from every other. The DNA molecule of an organism is located in a cell. A cell is the fundamental unit of all living organisms and contains many substructures such as the nucleus, cytoplasm and plasma membrane. The nucleus is where DNA is embedded. Genes in DNA are expressed by transferring their coded information into proteins that dwell in the cytoplasm. This process is called gene expression (Russell, 2003). There are several experimental techniques to measure gene
expression such as expression vector, reporter gene, northern blot, fluorescent
hybridization, and DNA microarray.
DNA microarray technology allows the simultaneous measurement of the expression levels of a great number of genes in tissue samples (Paul and Iba, 2005). It yields a set of floating-point absolute values. Many researchers have explored classification methods to recognize cancerous and normal tissues by analyzing microarray data. Microarray technology typically produces large datasets with expression values for thousands of genes (2,000-20,000) in a cell mixture, but only a few samples are available (20-80) (Huerta et al.).
This study focuses on gene selection and classification of DNA microarray data in order to distinguish tumor samples from normal samples. Gene selection is a process in which a set of informative genes is selected from the gene expression data in the form of a microarray dataset. This process helps improve the performance of the classifier. Classification, on the other hand, is the process of assigning microarray data to one of several classes, each with its own characteristics. Several techniques have been used for gene selection, such as the ReliefF algorithm, Information Gain, minimum Redundancy Maximum Relevance (mRMR) and Chi Square. For the classification of microarray data, a few techniques have been applied in the bioinformatics field to classify the highly dimensional data. These techniques include Random Forest, Naïve Bayes and the Probabilistic Neural Network (PNN).
The proposed method involves two stages: the first is the gene selection stage and the second is the classification stage. For gene selection, the chosen technique is minimum Redundancy-Maximum Relevance (mRMR) feature selection, which will be compared with three other methods, namely ReliefF, Information Gain and Chi Square. mRMR is a feature selection framework introduced by Ding and Peng in 2005. They supplement the maximum relevance criterion with a minimum redundancy criterion to choose additional features that are maximally dissimilar to those already identified.
This can expand the representative power of the feature subset and helps improve its generalization properties. The classification problem will be handled by the Probabilistic Neural Network (PNN) technique. PNN has been widely used in solving classification problems because it can categorize data accurately (Nur Safawati Mahshos, 2008). Both techniques will be assessed on a benchmark cancer dataset, Leukemia (Golub et al., 1999).
1.2
Background of the Problem
Cancer is a deadly disease worldwide. At least 100 different types of cancer have been identified. Traditionally, cancer is diagnosed based on the microscopic examination of a patient's tissue. This kind of diagnosis may fail when dealing with unusual or atypical tumors. Currently, cancer diagnosis is based on clinical evaluation together with the patient's medical history and physical examination. This diagnosis takes a long time and may limit the finding of tumor cells, especially in early tumor detection (Xu and Wunsch, 2003). If a tumor is only found at a critical stage, it might be too late to cure the patient.

Thus, classification of cancer diseases has been widely studied for the past 30 years. Unfortunately, there is no general or perfect approach for identifying new classes or assigning tumors to known classes. This is because there are various causes of cancer and many types of cancer that are sometimes difficult to distinguish. By depending on the morphological appearance of tumors, it is hard to discriminate between two similar types of cancer (Golub et al., 1999).
In order to overcome the above issues, a new approach to cancer classification has been introduced. The approach employs advanced microarray technology, which simultaneously measures the expression levels of a great number of genes in tissue samples. Nevertheless, this technique introduces a new problem: there are numerous irrelevant or overlapping genes. Hence, selection and classification must be performed in order to pick out the most significant genes from a pool of irrelevant genes and noise.
Nowadays, many selection and classification techniques have been studied and developed to improve the classification of microarray data. Among these techniques, a few give promising results, such as mRMR, ReliefF, Information Gain and Chi Square for gene selection, and PNN for classification. mRMR is chosen as the primary technique for gene selection since it was originally proposed for gene selection (Ding and Peng, 2003). The advantage of this technique is that it considers the redundancy of genes together with their relevance. The other techniques, ReliefF (Kononenko, 1994), Information Gain (Cover and Thomas, 1991) and Chi Square (Zheng et al., 2003), were first proposed for general feature selection rather than specifically for genes. For comparison, all four techniques are used to select genes and their performance is measured.
As for classification, the technique chosen in this research is the Probabilistic Neural Network (PNN) classifier. PNN has been used in many studies of feature classification (Pastell and Kujala, 2007; Shan et al., 2002). These studies have shown that PNN yields better classification accuracy than other existing classifiers. Thus, this research combines several feature selection methods with the PNN classifier to classify microarray data according to its classes.
1.3
Problem Statement
The challenging issue in gene expression classification is the enormous number of genes relative to the number of training samples in a gene expression dataset. Not all genes are relevant for distinguishing between different tissue types (classes); irrelevant genes introduce noise (Liu and Iba, 2002) into the classification process and drown out the contribution of the relevant genes (Shen et al., 2007). On top of that, a major goal of diagnostic research is to develop diagnostic procedures based on inexpensive microarrays that have an adequate number of genes to detect diseases. Hence, it is crucial to determine whether a small number of genes is sufficient for gene expression classification.
1.4
Objectives of Research
The aim of this research is to select a set of meaningful genes using the minimum Redundancy-Maximum Relevance feature selection technique and to classify them using a Probabilistic Neural Network. In order to achieve this aim, the following objectives must be fulfilled:
1. To select a set of meaningful genes using Minimum Redundancy-Maximum Relevance (mRMR), Information Gain, ReliefF and Chi Square.
2. To evaluate the effectiveness of the feature selection methods using the Probabilistic Neural Network (PNN) classifier.
3. To compare the performance of mRMR as a feature selection method using the PNN, Random Forest, and Naïve Bayes classifiers.
1.5
Scope of Research
The scope of the study is stated below:

• mRMR, ReliefF, Information Gain and Chi Square are utilized for gene selection.
• The PNN technique is used for gene expression classification.
• The Leukemia microarray dataset is used for testing (data source: Weka Software Package, http://www.cs.waikato.ac.nz/ml/weka/).
• 10-fold cross validation is utilized to perform the validation.
• The tools used are Matlab, KNIME, Weka and IOS GeneLinker.

1.6

Importance of the Study
This study is carried out to aid in the classification of cancer diseases. Cancer diseases are lethal to humans. Several methods have been developed to detect this deadly disease. Unfortunately, it takes too long to confirm that someone has the disease, because the symptoms can only be seen after a very long time, by which point the cancer has reached a critical stage.

Conventional examination of patients requires weekly checkups to precisely identify the presence of the disease. During such a long period of examination, the disease might become more critical without an exact cure or treatment. The advanced technology of microarrays lessens the burden on medical staff. Microarrays of human genes can be used to detect cancer diseases earlier.
Although microarray technology is said to have the capability to solve these problems, it requires an excellent technique to select only the best subset of genes that gives enough information about a particular cancer disease. This is due to the overwhelming number of genes produced by a microarray from only a few samples.
Thus, through this research, the best approach can be found to solve the problems in gene selection and classification. The idea is to apply the minimum Redundancy-Maximum Relevance feature selection technique (compared with other feature selection techniques) together with a Probabilistic Neural Network to give good results in a short time. This research contributes knowledge to the field of bioinformatics and benefits the medical area. Apart from that, it can help save human lives by detecting cancer at an early stage.
CHAPTER 2
LITERATURE REVIEW
2.1
Introduction
Every living organism has a basic functional unit called the cell. The cell is the smallest unit in an organism that contains the hereditary information necessary to determine its own function, and it is responsible for passing that information to the next generation of cells. In humans and some other organisms, the hereditary information is contained in a genetic material called Deoxyribonucleic Acid (DNA). DNA has been widely used as an identification method in many applications and fields such as forensics, biometrics, e-commerce and security. The DNA structure is made up of two strands wound tightly around each other in a spiral known as the double helix structure. Each of these strands consists of nucleotides. Each nucleotide consists of three basic components, namely a deoxyribose sugar, a phosphate group and a base. There are four different types of bases in DNA: adenine, guanine, thymine and cytosine. These bases act as the connectors of the double helix DNA structure. A base of a nucleotide in one of the DNA strands bonds with its base pair on the other strand. This is called base pairing, and the pairs are held together by hydrogen bonds. The bases fall into two categories:
purine and pyrimidine. Adenine and guanine belong to the purine group while thymine and cytosine belong to the pyrimidine group. In base pairing, a purine can only be paired with a pyrimidine; for example, adenine can only be paired with thymine, and guanine with cytosine. Since the bases differ for each nucleotide, they create various sequences that uniquely represent a person (DNA, Wikipedia, 11th May 2009).
Figure 2.1 : DNA Structure
2.2
Genes and Gene Expression
Genes are specific sequences of nucleotides that uniquely describe a person's characteristics, mostly physical ones: for example, skin colour, the shape of the face, eye colour and other features. Besides physical characteristics or appearance, genes can also indicate whether a person has a disease such as cancer or diabetes (Russell, 2003). Genes control all aspects of the life of an organism, encoding the products responsible for development, reproduction and so forth (Nurulhuda Ghazali, 2008).
Gene expression is a process in which the genetic information of DNA is converted into a functional protein. There are three major steps in this process: transcription, messenger Ribonucleic Acid (mRNA) processing and translation. The synthesis of mRNA from a DNA template is termed transcription. Transcription occurs in the nucleus. The process begins by breaking the hydrogen bonds in DNA, which splits the double helix structure. Parts of these strands act as templates and are transcribed to mRNA. The enzyme RNA polymerase attaches itself to a DNA strand and initiates transcription. Nucleotides with bases complementary to the bases on the DNA are added one at a time to elongate the strand. The RNA polymerase moves along the DNA, and when it reaches the termination triplet code it detaches, and the mRNA strand moves away from the DNA. The two DNA strands are then joined again by hydrogen bonds.
Translation is the process in which codons in mRNA are used to assemble amino acids in the correct sequence to produce a polypeptide chain (protein). In the first stage of translation, the mRNA binds to ribosomes. Then, an amino acid is activated by an enzyme, and the activation produces specific aminoacyl-tRNA molecules. Initiation of the polypeptide chain occurs when the anticodon of the aminoacyl-tRNA molecule carrying methionine binds to the start codon on the mRNA. A second aminoacyl-tRNA with a complementary anticodon binds to the second mRNA codon. A peptide bond is catalyzed between the two adjacent amino acids to produce a dipeptide. This process of translation is repeated and finally forms a polypeptide chain (Russell, 2003).
2.3
Microarray Technology
A microarray is an analytical device that allows exploration of genomic data quickly and inexpensively. Thousands of genes contained on a glass chip are used to examine fluorescent samples prepared by labeling mRNA from biological sources (cells, tissues). Molecules in the fluorescent sample then undergo a chemical reaction, causing each spot to glow with a different intensity based on the activity of the expressed gene. Since the pattern of gene expression is strongly related to gene function, it provides significant information regarding human disease, aging, drugs, mental illness and many other clinical matters.
Figure 2.2 : Process of Producing Microarray
Microarrays have been widely used for gene expression, which accounts for 81% of scientific microarray publications, but microarrays are also used for other purposes, including genotyping, tissue analysis and protein studies. Microarray technology is commonly applied to human disease and drug discovery. In the detection of human disease, cancer accounts for the highest share, 83.5% of microarray publications, compared to other diseases. The other diseases include AIDS, stroke, Alzheimer's, diabetes, cardiovascular disease, anemia, autism, Parkinson's and cystic fibrosis (Shena, 2003).
Figure 2.3 : Sample of Microarray
2.4
Feature Selection
There are two major stages involved in the classification of microarray data: the first stage is feature selection, also called gene selection, and the final stage is the classification of the selected genes. With microarray technology, the activities of thousands of genes can be measured simultaneously, which helps in the early detection of fatal diseases (New Gene Selection Method, 14 May 2009). However, this technology yields high-dimensional data representing the genes. This high-dimensional data brings difficulties to the classification process; because human genes are numerous and not all of them contribute to determining a disease, it is crucial to select only the relevant genes to take part in the classification process. Relevant genes, sometimes also called informative genes, are genes that provide enough information about a disease. Selecting relevant genes to be used as samples in classification has been a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve high classification accuracy (Diaz and Alvarez, 2006). By selecting only the relevant genes, the dimensionality of the data to be processed by the classifier can be reduced, which reduces the execution time and improves classification performance.
The following are the benefits of gene selection (Nurulhuda Ghazali, 2008):

• Eliminates irrelevant or useless genes
• Reduces the dimension of the input space
• Reduces complexity and execution time
• Reduces cost in a clinical setting
• Improves the performance of a classifier

Many methods have been studied for performing the gene selection task. The next sections describe some of these methods, which will then be compared in order to evaluate which is best at selecting genes.
2.4.1 ReliefF Algorithm
ReliefF is an extension of the standard Relief algorithm (Kononenko, 1994). The idea of this algorithm is to estimate the quality of features based on how well their values distinguish between sample points that are near to each other, and it is said to be sensitive to feature interactions. Given a randomly selected instance Ins_m from class L, ReliefF searches for K of its nearest neighbors from the same class, called nearest hits H, and also K nearest neighbors from each of the different classes, called nearest misses M. It then updates the quality estimation W_i for gene i based on their values for Ins_m, H and M. If instance Ins_m and those in H have different values on gene i, the quality estimation W_i is decreased. On the other hand, if instance Ins_m and those in M have different values on gene i, W_i is increased. The whole process is repeated n times, where n is set by the user. Below is an example of results from an experiment carried out by Zhang et al. (2003) on a handwritten Chinese character dataset. The graph compares three methods, Genetic algorithm-wrapper (G-W), ReliefF-Genetic algorithm-wrapper (R-G-W) and ReliefF, in selecting the best subset of features to be classified. The x-axis represents the type of data and the y-axis represents the number of features selected by each method. The label 'All' is the overall number of original features.
Figure 2.4 : Comparison of 3 Methods of Feature Selection
From the graph, there is only a slight difference in the number of selected features between the (G-W) and (R-G-W) methods, whereas ReliefF alone differs markedly from the other two methods in the number of selected features. This indicates that the ReliefF technique alone introduces redundancy among the selected features, even though it selects a high number of relevant features. If most of the features are relevant to the concept, it will select most of them even though only a fraction is necessary for concept description (Kohavi and John, 1997). Despite the fact that it selects many redundant features, it does produce good classification accuracy.
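To illustrate the update rule described above, here is a simplified two-class ReliefF sketch in Python; the Manhattan distance, the assumption that features are scaled to [0, 1], and the per-iteration averaging are choices of this sketch rather than details fixed by the thesis.

```python
import numpy as np

def relieff(X, y, n_iter=100, k=5, seed=0):
    """Simplified two-class ReliefF: W_i decreases when nearest hits differ
    on gene i and increases when nearest misses differ on it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_iter):
        m = rng.integers(n)                          # random instance Ins_m
        dist = np.abs(X - X[m]).sum(axis=1)          # Manhattan distance to all samples
        dist[m] = np.inf                             # never pick Ins_m itself
        same = np.where(y == y[m])[0]
        diff = np.where(y != y[m])[0]
        hits = same[np.argsort(dist[same])][:k]      # k nearest hits H
        misses = diff[np.argsort(dist[diff])][:k]    # k nearest misses M
        W -= np.abs(X[hits] - X[m]).mean(axis=0) / n_iter
        W += np.abs(X[misses] - X[m]).mean(axis=0) / n_iter
    return W                                         # higher W_i means a better gene i
```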
2.4.2 Information Gain

Information gain is commonly used as a surrogate for approximating a conditional distribution in the classification setting (Cover and Thomas, 1991). According to Mitchell (1997), information gain measures the number of bits of information obtained for class prediction by knowing the value of a feature. Information gain is defined as:

InfoGain = H(Y) - H(Y|X)    (2.1)

where X is the feature, Y is the class variable, and

H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)    (2.2)

H(Y|X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)    (2.3)
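To make equations (2.1)-(2.3) concrete, the following is a minimal Python sketch that computes the information gain of a single discretized gene; the toy gene values and class labels at the bottom are illustrative assumptions, not data from this study.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(Y) of a label vector, in bits (equation 2.2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, labels):
    """InfoGain = H(Y) - H(Y|X) for one discretized gene (equation 2.1)."""
    h_y = entropy(labels)
    h_y_given_x = 0.0
    for value in np.unique(feature):        # weighted entropy per feature value
        mask = feature == value
        h_y_given_x += mask.mean() * entropy(labels[mask])
    return h_y - h_y_given_x

# Toy example: one gene with values in {-2, 0, 2} and two classes.
gene = np.array([-2, -2, 0, 0, 2, 2, 2, -2])
cls = np.array(["ALL", "ALL", "ALL", "AML", "AML", "AML", "AML", "ALL"])
print(info_gain(gene, cls))
```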
2.4.3 Chi Square

The Chi-square algorithm evaluates genes individually with respect to the classes. It is based on comparing the observed frequency of a class within each discretization interval to the expected frequency of that class. From the N examples, let N_ij be the number of samples of class C_i within the j-th interval and M_Ij be the number of samples in the j-th interval. The expected frequency of N_ij is

E_{ij} = M_{Ij} |C_i| / N

Therefore, the Chi-squared statistic of a gene is defined as follows (Jin et al., 2006):

\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{l} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}    (2.4)

where l is the number of intervals and C is the number of classes. The larger the \chi^2 value, the more informative the corresponding gene is.
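The statistic in equation (2.4) can be sketched in a few lines of Python; this is a minimal illustration for one discretized gene, not the Weka/KNIME implementation actually used in this study.

```python
import numpy as np

def chi_square(feature, labels):
    """Chi-squared statistic of one discretized gene (equation 2.4)."""
    n = len(labels)
    chi2 = 0.0
    for c in np.unique(labels):             # loop over classes C_i
        n_c = np.sum(labels == c)
        for j in np.unique(feature):        # loop over intervals j
            n_ij = np.sum((labels == c) & (feature == j))  # observed N_ij
            m_j = np.sum(feature == j)                     # samples in interval j
            e_ij = m_j * n_c / n                           # expected E_ij
            chi2 += (n_ij - e_ij) ** 2 / e_ij
    return chi2
```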
2.4.4 Minimum Redundancy-Maximum Relevance Feature Selection

The Minimum Redundancy-Maximum Relevance (mRMR) feature selection method was introduced by Ding and Peng (2003). It was first proposed to reduce the number of genes selected for the classification of microarray data. This method pays particular attention to the redundancy among selected genes that other gene selection methods tend to produce. It expands the representative power of the feature set by requiring that features be maximally dissimilar to each other, and this is supplemented by a maximal relevance criterion such as maximal mutual information with the target phenotypes. In mRMR, different schemes are used to search for the next feature under the mRMR optimization condition. The following table lists the schemes:
Table 2.1 : Schemes in mRMR Optimization Condition

| Type       | Acronym | Full Name                     | Formula |
|------------|---------|-------------------------------|---------|
| Discrete   | MID     | Mutual information difference | \max_{i \in \Omega_S} \left[ I(i, h) - \frac{1}{|S|} \sum_{j \in S} I(i, j) \right]    (2.5) |
| Discrete   | MIQ     | Mutual information quotient   | \max_{i \in \Omega_S} \left[ I(i, h) \Big/ \frac{1}{|S|} \sum_{j \in S} I(i, j) \right]    (2.6) |
| Continuous | FCD     | F-test correlation difference | \max_{i \in \Omega_S} \left[ F(i, h) - \frac{1}{|S|} \sum_{j \in S} |c(i, j)| \right]    (2.7) |
| Continuous | FCQ     | F-test correlation quotient   | \max_{i \in \Omega_S} \left[ F(i, h) \Big/ \frac{1}{|S|} \sum_{j \in S} |c(i, j)| \right]    (2.8) |

Here S is the set of already-selected features, \Omega_S is the pool of candidate features not yet selected, I(i, h) is the mutual information between feature i and the target class h, F(i, h) is the F-test statistic, and c(i, j) is the correlation between features i and j.
The benefit of this approach is that it represents the phenotypes better than the usual methods, leading to better generalization, and a small set of features can classify samples accurately according to their class, compared with using large feature sets.
2.5
Classification

Classification of microarray data has been studied for years by researchers from various fields, especially computer science and medicine. The purpose of this classification is to assign data to its appropriate class, for example, to separate cancerous from non-cancerous samples. The classification of data is important in the early detection of a disease, more specifically cancer. This can help in treating patients before the cancer reaches a critical stage. The major process involved in classification is training the classifier with training samples so that it learns the patterns of the genes. After the classifier has been trained, it is tested using testing samples. The result yielded is the classification accuracy. Some well-known classifiers are Random Forest, Naïve Bayes and the Probabilistic Neural Network.
2.5.1 Random Forest

Random Forest is a general term for ensemble methods using tree-type classifiers {h(x, Θ_k), k = 1, ...}, where the Θ_k are independent identically distributed random vectors and x is an input pattern (Breiman, 2001). In training, the Random Forest algorithm creates multiple CART-like trees (Breiman et al., 1984), each trained on a bootstrapped sample of the original training data, and searches only across a randomly selected subset of the input variables to determine each split. For classification, each tree in the Random Forest casts a unit vote for the most popular class at input x. The output of the classifier is determined by a majority vote of the trees. The number of variables considered per split is a user-defined parameter, but the algorithm is not sensitive to it; a commonly used value is the square root of the number of inputs. By limiting the number of variables used for a split, the computational complexity of the algorithm is reduced and the correlation between trees is decreased. Finally, the trees in Random Forests are not pruned, further reducing the computational load. As a result, the Random Forest algorithm can handle high-dimensional data and use a large number of trees in the ensemble. Combined with the fact that the random selection of variables for a split seeks to minimize the correlation between the trees in the ensemble, this results in error rates comparable to those of AdaBoost (Freund and Schapire, 1996) while being computationally much lighter. As each tree uses only a portion of the input variables, the algorithm is considerably lighter than conventional bagging with a comparable tree-type classifier.
The analysis of Random Forest (Breiman, 2003) shows that its computational time is

c \, T \sqrt{M} \, N \log N    (2.9)

where c is a constant, T is the number of trees in the ensemble, M is the number of variables and N is the number of samples in the data set. It should be noted that although Random Forests are not computationally intensive, they require a fair amount of memory, as they store an N by T matrix in memory.
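As a hedged illustration (the thesis itself used tools such as Weka and KNIME rather than this code), the following Python sketch shows the algorithm's main knobs via scikit-learn: bootstrapped trees, a random subset of sqrt(M) variables per split, and a majority vote. The toy data stands in for a real expression matrix.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 200))             # toy stand-in for a (samples, genes) matrix
y = np.array(["ALL"] * 47 + ["AML"] * 25)  # class labels as in the leukemia dataset

forest = RandomForestClassifier(
    n_estimators=500,      # T: number of trees in the ensemble
    max_features="sqrt",   # each split searches a random subset of sqrt(M) variables
    bootstrap=True,        # each tree is trained on a bootstrapped sample
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))  # class decided by majority vote of the trees
```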
2.5.2 Naïve Bayes

The Naïve Bayes method originally tackles problems where variables are categorical, although it has natural extensions to other types of variables. It assumes that variables are independent within each class, and simply estimates the probability of observing a certain value in a given class by the ratio of its frequency in the class of interest over the frequency of that class (Jamain and Hand, 2005). That is, for any class c and vector X = (X_j), j = 1, ..., k, of categorical variables,

P(X | c) = \prod_{j=1}^{k} P(X_j | c)

where each factor is estimated as

P(X_j = x | c) = \frac{\#\{X_j = x \text{ in class } c\}}{\#\{\text{class } c\}}    (2.10)

Continuous variables are generally discretised, or a certain parametric form is assumed (e.g. normal). One can also use non-parametric density estimators such as kernel functions; similar frequency ratios are then derived.
2.5.3 Probabilistic Neural Network

A probabilistic neural network (PNN) is an artificial neural network used for data classification tasks. It is a model based on competitive learning with a 'winner takes all' attitude, and its core concept is based on multivariate probability estimation. The PNN was initially developed by Specht (1990). This network provides a general solution to pattern classification problems by following an approach developed in statistics, called Bayesian classification. Bayes theory, developed in the 1950s, takes into account the relative likelihood of events and uses a priori information to improve prediction. The PNN architecture consists of an input layer, two hidden layers and an output layer. The PNN classifier is sometimes taken to belong to the radial basis function (RBF) class, but the difference between PNN and RBF is that PNN estimates a probability density function while RBF performs iterative function approximation (Balasundaram Karthikeyan et al., 2006). PNN has been commonly used in clinical cancer diagnosis (Shan et al., 2002), as a predictive classifier for hospital defibrillation outcomes (Yang et al., 2005) and for cereal grain classification (Visen et al., 2002). The advantage of PNNs is that they usually outperform traditional statistical classifiers on nonlinear problems (Pastell and Kujala, 2007).

Figure 2.5 : Architecture of PNN (Pastell and Kujala, 2007)

PNN has its advantages and disadvantages compared to other classifiers. A PNN trains faster than a multilayer perceptron network and is often more accurate. However, a PNN requires more memory space to store the model. Table 2.2 shows the strong performance of PNN compared to k-NN.
Table 2.2 : Comparison of k-NN and PNN using 4 Datasets

| Classifier | Leukemia | Embryonal CNS Tumor | Medulloblastoma morphology | Medulloblastoma treatment outcome |
|------------|----------|---------------------|----------------------------|-----------------------------------|
| PNN        | 95.4%    | 84%                 | 89.8%                      | 69.2%                             |
| k-NN       | 94%      | 79.6%               | 85.2%                      | 66.4%                             |
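Following the layered description above, here is a minimal Python sketch of a PNN decision: one Gaussian kernel per training sample (pattern layer), an average per class (summation layer), and an argmax (output layer). The single shared smoothing parameter sigma is a simplifying assumption of this sketch.

```python
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=1.0):
    """Minimal PNN sketch: Parzen/Gaussian density per class, argmax decision."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                    # pattern units of class c
        sq_dist = np.sum((Xc - x) ** 2, axis=1)       # squared distance to each unit
        # summation layer: average Gaussian kernel activation for class c
        scores[c] = np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)))
    return max(scores, key=scores.get)                # output layer: winner takes all
```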
2.6 Challenges in Genetic Expression Classification

Gene expression has been a common topic in the medical field because it carries the hereditary information of the human genome. The human genome can determine the physical appearance of a person or a specific condition of a human being, which might lead to the early detection of a fatal disease. However, in the analysis of gene expression profiles, the number of tissue samples available is usually small relative to the number of genes. This can cause overfitting and the curse of dimensionality, which sometimes make the analysis of microarray data fail. Besides the above matters, microarray data is sometimes contaminated by noise from the devices used during the microarray process, or the noise might already exist biologically in the gene itself. These issues add to the number of irrelevant genes, and the computation for classification can become costly. According to previous research, the techniques applied to the classification of microarray data mostly take a long execution time. Not only is the time factor important, but the accuracy of the classification is often low due to the poor performance of the classifier, which might be caused by inadequate training. Thus, it is crucial to develop a method that addresses these matters: selecting informative genes that carry enough information to indicate a disease, and classifying the gene expression data accurately according to its classes in an inexpensive time.
2.7
Summary
This chapter explains the domains of the research, which are feature selection and classification. It gives some background on the problem, including brief explanations of genes and microarray data. It shows that it is crucial to select informative genes from the microarray data and to classify them into their appropriate classes. Several methods have been discussed in this chapter based on previous studies. Example methods implemented for feature selection are the ReliefF algorithm, Information Gain, Chi Square and minimum Redundancy-Maximum Relevance (mRMR), while example classification methods are Random Forest, Naïve Bayes and the Probabilistic Neural Network (PNN). Based on the descriptions of all the above methods, the feature selection methods will be compared, whereas for classification the PNN method is chosen. The methods were chosen based on previous implementations that have shown good outcomes.
CHAPTER 3
METHODOLOGY
3.1
Introduction
This chapter explains the process involved in conducting this project. The project follows the research framework shown later in the chapter. The process involves many stages and requires good time management to ensure that the activities stated in the framework are completed in the given time. Apart from the framework, this chapter also describes the dataset used in the analysis. The software and other tools involved are listed to give a clear view of the project.
3.2
Research Framework
A few phases are involved in achieving the objectives of this research. Each of these phases involves different activities to complete the research systematically and successfully. Figure 3.1 illustrates the overall research framework.
[Figure 3.1 depicts the research framework as a flowchart: Problem Definition → Related Studies (problem domain, existing techniques) → Study on Proposed Method (mRMR, Information Gain, ReliefF and Chi Square for feature selection; the PNN classifier for classification) → Data Preparation → Feature Selection (mRMR, Information Gain, ReliefF and Chi Square select relevant features) → Classification (PNN classifies features according to their classes) → Comparison of Feature Selection Methods (based on classification results) → Evaluation and Validation (10-fold cross validation) → Result Analysis → Report Writing.]

Figure 3.1 : Research Framework
27
3.2.1 Problem Definition
A research study starts with a problem that has caught people's attention and become a critical issue that needs to be solved. This research concerns cancer classification. Cancer is widely known as a fatal disease and has become one of the leading killers in the world. Fortunately, this disease can be cured if it is detected at an early stage. This is where microarray technology comes in.

Microarray technology produces gene expression data that helps in identifying and classifying cancers. However, gene expression data is overwhelming in volume, and this makes classification a difficult task. This overwhelming data can cause misclassification and is computationally very expensive. Therefore, the data needs to be filtered by picking only informative or relevant genes as input to the classifier. This greatly reduces the computation and gives better classification results. Knowing the correctly classified cancer class can help in the early treatment of patients.
3.2.2 Related Studies
Before constructing any design or analysis, it is crucial to study the story behind the problem, the history of previous works and the process needed to conduct the research. The very first step is to understand the problem domain; from there, the next steps can be well planned to achieve better research work. The problem domain in this research involves the area of biology, specifically the genomic field. Thus, before proceeding to the design phase, it is necessary to learn and study this genomic field and its relation to cancer classification.
Microarray technology is a technology in which the expression of all human genes can be measured simultaneously in a very short time. These expressed genes then need to be classified into their appropriate classes. However, classifying these genes accurately requires a very powerful classifier that can handle the enormous number of genes. Alternatively, the genes can be scored and selected before they are classified. Many existing techniques have been tested and experimented with on this problem; however, not many have achieved encouraging results. The need to study and learn from previous works is very important in research, which is one of the reasons a literature review has to be included in the research framework. It helps in better understanding the problem and gives ideas on how to solve it.
3.2.3 Study on Proposed Method
Based on the literature review, a comparison of existing techniques was made and the best techniques were selected based on their results in previous studies. The chosen techniques are Minimum Redundancy Maximum Relevance (mRMR), ReliefF, Chi Square and Information Gain for gene selection, with the Probabilistic Neural Network (PNN) acting as the classifier. In order to implement these techniques, they should first be studied in depth to understand how they work and how to produce better results.

The mRMR, ReliefF, Chi Square and Information Gain techniques will be implemented to select informative genes to be used as input to the classifier. These techniques have to be explored before moving on to the classifier. The PNN classifier is applied right after the feature selection techniques; its input is the set of genes selected by them. PNN has been viewed from different perspectives by different researchers worldwide. Thus, before implementing this classifier, a thorough review should be done to better understand it and thereby yield promising results.
3.2.4 Data Preparation
At the beginning of the experiment, the most important step is data preparation. Data preparation is defined as the process of converting data into a machine-readable form so that it can be entered into a system via an available input device (Data Preparation, Encyclopedia.com, 6th October 2009). Originally, gene expression data is in the form of images, but it is converted to numerical form so that it can be read by a machine (computer).

The numerical form of gene expression data can be easily accessed through public data banks. For this research, the data was retrieved from the Weka software. The data is in the Attribute-Relation File Format (ARFF). However, such data usually contains noise and sometimes needs to be normalized or discretized. The following figures show an example of the leukemia dataset in ARFF format.
Figure 3.2 : Sample of Dataset
Figure 3.3 : Sample of Dataset
In Figure 3.2 above, the first column identifies the samples; there are 72 samples, consisting of 47 ALL and 25 AML. The columns from the second to the 7130th contain the gene values in continuous form, and the last column, shown in Figure 3.3, holds the class of each sample, either ALL or AML.
In this research the data is already numerical, but because it contains a huge range of values it is better to discretize it to reduce the burden on the classifier. That said, discretization and normalization do not always give good results; sometimes they yield poor ones.
3.2.5 Feature Selection
After preparing the data, the next step is to select features by applying the mRMR, ReliefF, Information Gain and Chi Square techniques. In this stage, the preprocessed data becomes the input to these feature selection techniques. The data, which contains 7129 genes and 72 samples, is narrowed down to only a small number of genes; only the number of genes is reduced, not the number of samples. Whether a gene is selected depends on whether it carries information regarding cancer tissues or is merely noise. A subset of informative genes that is sufficient to give information is the output of this feature selection stage. The figure below indicates the process of feature selection.
[Figure 3.4: preprocessed gene expression data (input) → feature selection technique selects genes → subset of informative genes (output)]
Figure 3.4 : Process of Feature Selection
32
3.2.6 Classification

Right after the feature selection step has been completed, the next step is classification. This is the key step in the classification of microarray data. The output from feature selection is used as input to the classifier; as stated before, the classifier implemented in this research is the PNN classifier. The subset of informative genes yielded by feature selection is fed into the PNN classifier and classified into its correct classes. The output of this classification is the number of correctly classified samples and the number of incorrectly classified samples.
[Figure 3.5: subset of informative genes (input) → PNN classifier → number of correctly and incorrectly classified samples (output)]
Figure 3.5 : Process of Classification
[Figure 3.6 shows the overall process: preprocessed gene expression data → feature selection method selects genes → subset of informative genes → split into training set and test set → PNN model trained on the training samples → classification of the test set → number of correctly classified samples, with training repeated until the desired output is met.]
Figure 3.6 : Overall Process of Feature Selection and Classification
3.2.7 Evaluation and Validation
The very last step in the experiment is the evaluation and validation of results. In this research, the evaluation of results is based on a method called k-fold cross validation, where k indicates the number of folds. In this research k is 10, so it is called 10-fold cross validation. In brief, the data samples are first divided into 10 parts; in each run, one tenth of the overall data becomes the test dataset and the remaining nine tenths become the training dataset. This is repeated 10 times (according to the value of k) using a different test part each time, and the final result is calculated by taking the average over the 10 runs. This kind of validation is used to avoid any bias from the data split itself, making the result more convincing.
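As a sketch of this procedure (the thesis ran its validation through its own tools, not this code), the following Python outline performs stratified 10-fold cross validation, training on nine tenths and testing on the held-out tenth in each run; train_and_score is a hypothetical callback that fits a classifier and returns its accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, train_and_score, n_folds=10, seed=0):
    """10-fold cross validation: average accuracy over the folds."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        # nine tenths train, one tenth test in each run
        acc = train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
        accuracies.append(acc)
    return np.mean(accuracies)
```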
3.2.8 Result Analysis
Finally, after all the results have been obtained, they should be analyzed before being presented. Analysis of the results involves producing graphs or charts, calculating averages or percentages, and justifying the possible causes of the results obtained.
3.3 Leukemia
Leukemia is a cancer characterized by a high degree of uncontrolled proliferation of one of the types of white blood cells. Leukemia causes approximately 45,000 deaths each year throughout the world. In leukemia, the bone marrow begins to produce damaged white blood cells, which do not mature properly and which, unlike normal white blood cells, are able to multiply uncontrollably and rapidly displace the normal cells. In addition, the abnormal functioning of the bone marrow reduces the number of red blood cells and platelets formed there, so sufferers become severely anemic and their blood does not clot correctly, leaving them at risk of hemorrhage. The following figure shows the abnormal proliferation of cells in bone marrow compared to normal bone marrow.
Figure 3.7 : Abnormal Proliferation of Cells in Bone Marrow Compared to Normal
Bone Marrow
The causes of most cases of leukemia are unknown, but certain specific factors have been identified, such as excessive exposure to radiation and certain chemicals, especially benzene. There are four types of leukemia:

1. Acute myeloid leukemia (AML)
2. Acute lymphoblastic leukemia (ALL)
3. Chronic myeloid leukemia (CML)
4. Chronic lymphocytic leukemia (CLL)

Here we chose the leukemia dataset, which consists of two classes, AML and ALL. AML is a life-threatening disease in which the cells that normally develop into neutrophils become cancerous and rapidly replace normal cells in the bone marrow, whereas ALL is a life-threatening disease in which the cells that normally develop into lymphocytes become cancerous and likewise rapidly replace normal cells in the bone marrow. The data will be tested, evaluated and classified into the ALL class and the AML class.
3.4
Software Requirement
The major software requirements in this research include Matlab, a numerical computing environment and programming language. Matlab is maintained by The MathWorks and allows easy matrix manipulation, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs in other languages. The second requirement is KNIME, a modular data exploration platform that enables the user to visually create data flows, selectively execute some or all analysis steps, and later investigate the results through interactive views of data and models. KNIME was developed by the Chair for Bioinformatics and Information Mining at the University of Konstanz, Germany. This research is based on the Windows platform, as it is convenient for using KNIME and Matlab. Apart from that, two other software packages are used for minor purposes: Weka and IOS GeneLinker.
3.5
Summary
This chapter briefly describes the research framework for the proposed method and explains the major steps in conducting this research. The framework is presented in text and as a flowchart, and includes activities such as the literature review, the study of the proposed method, data preparation, the implementation of both the feature selection techniques and the PNN classifier, the evaluation and validation of results, and lastly report writing. The dataset (leukemia) and the software requirements were described in Sections 3.3 and 3.4 respectively.
CHAPTER 4
IMPLEMENTATION
4.1
Introduction
This chapter explains in detail the methods used to select the informative genes that serve as input to the classifier, and the PNN classifier that performs the classification. The chapter is divided into four parts: data format, data preprocessing, the feature selection task and the classification task (PNN).
4.2
Data Format
The experiment uses two data formats, namely Comma-Separated Values (CSV) and the Attribute-Relation File Format (ARFF). A CSV file is commonly used to store tabular data, with the values separated by commas. Each line in the file corresponds to a row in the table and, within a line, fields are separated by commas, with each field belonging to one table column. The CSV file format is very simple and is supported by almost all spreadsheets and database management systems (Comma-separated values, Wikipedia, 15 June 2009).

An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes. In the case of gene classification, the instances are samples while the attributes are genes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. ARFF files are partitioned into two sections, Header and Data. The Header contains the name of the relation and a list of attributes and their types, while the Data section contains the data values of each instance.
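As a rough illustration of the conversion performed later in Section 4.3 (ARFF to CSV for mRMR), here is a minimal Python sketch that handles only the simple ARFF subset described above; quoted attribute names, sparse data and other ARFF features are ignored.

```python
import csv

def arff_to_csv(arff_path, csv_path):
    """Tiny ARFF-to-CSV converter sketch (plain numeric/nominal attributes only)."""
    names, in_data, rows = [], False, []
    with open(arff_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):
                continue                       # skip blanks and ARFF comments
            low = line.lower()
            if low.startswith("@attribute"):
                names.append(line.split()[1])  # second token is the attribute name
            elif low.startswith("@data"):
                in_data = True                 # everything after @data is the Data section
            elif in_data:
                rows.append(line.split(","))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(names)
        writer.writerows(rows)
```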
4.3
Data Preprocessing
As mentioned in Chapter 3 (Methodology), before continuing with the experiment the data first needs to be preprocessed to fit the feature selection methods and the classifier. The original data is in ARFF format, as shown in Figure 4.1 and Figure 4.2:
Figure 4.1 : Original Dataset in ARFF Format Showing Genes Values
Figure 4.2 : Original Dataset in ARFF Format Showing Class Name
In the above figures, the first column in Figure 4.1 identifies the samples; there are 72 samples, consisting of 47 ALL and 25 AML. The columns from the second to the 7130th (not shown in the figure) contain the gene values in continuous form, and the last column, in Figure 4.2, shows the class of each sample, either ALL or AML. Since the data varies widely in its values, it was discretized with the aim of achieving a better classification result. There are therefore two types of dataset to be tried in the first stage of the experiment, in order to find the better dataset to use in the subsequent experiments.
In statistics and machine learning, discretization refers to the process of
converting continuous features or variables to discretized or nominal features. This
can be useful when creating probability mass functions – formally, in density
estimation. It is a form of binning, as in making a histogram (Discretization of
Continuous Features, Wikipedia, 23rd Oct 2009). Typically data is discretized into
partitions of K equal lengths (equal intervals) or K% of the total data (equal
frequencies). The following figures show the process of discretization using IOS
GeneLinker software.
Figure 4.3 : Dataset in IOS GeneLinker Software before Discretization
Figure 4.4 : Dataset in IOS GeneLinker Software after Discretization
The discretization performed is quantile discretization, in which each bin
receives an equal number of data values; the data range of each bin varies according
to the data values it contains. In this research, 3 bins are used, labelled 0, 1 and 2.
The discretization target can be genes, samples or all of the data; the above example
uses all data as the discretization target. In brief, the parameters involved in
discretizing this data are the operation (quantile or range), the target (per gene, per
sample or all data) and the number of bins.
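
To make the quantile operation concrete, the following Python sketch reproduces
the equal-frequency binning described above. It is an illustrative approximation of
the GeneLinker step, not its actual implementation; note also that the exported
dataset in Figure 4.5 carries the labels -2, 0 and 2 rather than 0, 1 and 2.

    # A minimal sketch of quantile (equal-frequency) discretization into 3 bins.
    import numpy as np

    def quantile_discretize(values, n_bins=3):
        """Assign each value to one of n_bins equal-frequency bins (0..n_bins-1)."""
        # Bin edges sit at the 1/3 and 2/3 quantiles when n_bins = 3
        edges = np.quantile(values, [i / n_bins for i in range(1, n_bins)])
        return np.digitize(values, edges)

    expr = np.array([-214.0, -139.0, -76.0, 5.0, 88.0, 283.0])
    print(quantile_discretize(expr))   # [0 0 1 1 2 2]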
Now the dataset has two forms: continuous data (original) and discretized
data. Both are in ARFF format, and to fit this data into a feature selection method, it
needs to be converted into a format acceptable to that method. For mRMR, the data
needs to be converted to CSV file format as shown below. The tables below are
excerpts of the leukemia dataset in CSV format, viewed in Microsoft Excel. The first
column indicates the class of leukemia (ALL/AML) while the rest of the table stores
the gene values. Figures 4.5 and 4.6 show both types of data.
class  v1  v2  v3  v4  v5  v6  v7  v8
 -1    -2   2   0   0  -2  -2   0   0
 -1     0   0   0   0  -2   0   0   0
 -1    -2  -2   0   0   2   2   2   0
 -1     2   0   0   0   0   0   2   2
 -1    -2   0   0   0   2   2   0   0
 -1     0   2  -2   0  -2   0  -2  -2
 -1     2  -2   0  -2   0   0   0   0
 -1     2   0   0   0  -2  -2   2   2
 -1    -2  -2   0   0   2   2   2   2
 -1     0   2   0   0   0   0  -2  -2
 -1    -2  -2   0  -2  -2   2  -2   2
 -1     0   0   0   0   2   0  -2  -2
 -1    -2   0   0   0   0   2   0   2
 -1     0   2   0   0   0   2  -2   0
 -1    -2   0   0  -2   2   2   0   0
 -1     0   0   0   0   0   0   0   0
 -1    -2   2   0  -2   0   2   2   2
 -1    -2   0   0  -2   0   0   0  -2

Figure 4.5 : Discretized Data in CSV Format
class  att1  att2  att3  att4  att5  att6  att7  att8
  1    -214  -153   -58    88  -295  -558   199  -176
  1    -139   -73    -1   283  -264  -400  -330  -168
  1     -76   -49  -307   309  -376  -650    33  -367
  1    -135  -114   265    12  -419  -585   158  -253
  1    -106  -125   -76   168  -230  -284     4  -122
  1    -138   -85   215    71  -272  -558    67  -186
  1     -72  -144   238    55  -399  -551   131  -179
  1    -413  -260     7    -2  -541  -790  -275  -463
  1       5  -127   106   268  -210  -535     0  -174
  1     -88  -105    42   219  -178  -246   328  -148
  1    -165  -155   -71    82  -163  -430   100  -109
  1     -67   -93    84    25  -179  -323  -135  -127
  1     -92  -119   -31   173  -233  -227   -49   -62
  1    -113  -147  -118   243  -127  -398  -249  -228
  1    -107   -72  -126   149  -205  -284  -166  -185
  1    -117  -219   -50   257  -218  -402   228  -147
  1    -476  -213   -18   301  -403  -394   -42  -144
  1     -81  -150  -119    78  -152  -340   -36  -141
  1     -44   -51   100   207  -146  -221    83  -198

Figure 4.6 : Continuous Data in CSV Format
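
The conversion itself was done with free online tools, which are not named here.
The following Python sketch, with hypothetical file names, illustrates the reshaping
involved: the class label is moved from the last ARFF attribute to the first CSV
column.

    # A minimal sketch of converting an ARFF file into the CSV layout expected
    # by mRMR (class first, then one column per gene). File names are hypothetical.
    def arff_to_csv(arff_path, csv_path):
        attributes, in_data, rows = [], False, []
        with open(arff_path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('%'):
                    continue                         # skip blanks and ARFF comments
                low = line.lower()
                if low.startswith('@attribute'):
                    attributes.append(line.split()[1])
                elif low.startswith('@data'):
                    in_data = True
                elif in_data:
                    values = [v.strip() for v in line.split(',')]
                    rows.append([values[-1]] + values[:-1])   # class moved to front
        header = ['class'] + attributes[:-1]
        with open(csv_path, 'w') as f:
            f.write('\n'.join(','.join(r) for r in [header] + rows) + '\n')

    arff_to_csv('leukemia.arff', 'leukemia_mrmr.csv')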
4.4
Feature Selection Method
As briefly mentioned earlier, this study involves the comparison of four
feature selection methods, namely mRMR, ReliefF, Information Gain and Chi
Square. The performance of these methods is measured using the PNN classifier.
4.4.1 mRMR Feature Selection Method
mRMR, which stands for minimum Redundancy-Maximum Relevance
feature selection, was introduced by Ding and Peng in 2003. The purpose of this
method is to select a feature subset that best characterizes the statistical property of a
target classification variable. These features have to be mutually as dissimilar to each
other as possible, but marginally as similar to the classification variable as possible
(Peng, 2005). The authors of this method argue that combining a "very effective"
gene with another "very effective" gene often does not form a better feature set; one
reason is that the two genes could be highly correlated, which leads to redundancy in
the feature set. In brief, mRMR minimizes redundancy and uses a series of intuitive
measures of relevance and redundancy to select useful features for both continuous
and discrete datasets.
If a gene's expression is randomly or uniformly distributed across the
different classes, its mutual information with these classes is zero, whereas if a gene
is strongly differentially expressed for different classes, it should have large mutual
information. Thus mutual information is used as a measure of the relevance of genes.
The mutual information I of two variables x and y is defined based on their joint
probability distribution p(x, y) and the respective marginal probabilities p(x) and p(y):
I(x, y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}        (4.1)
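
As an illustration of Eq. (4.1), the following minimal Python sketch estimates the
mutual information of two discrete variables from co-occurrence counts (base-2
logarithm). It is a sketch only, not the mRMR implementation used in this work.

    import numpy as np
    from collections import Counter

    def mutual_information(x, y):
        """I(x, y) for discrete sequences, estimated from joint counts."""
        n = len(x)
        joint = Counter(zip(x, y))
        px, py = Counter(x), Counter(y)
        mi = 0.0
        for (a, b), c in joint.items():
            p_ab = c / n
            mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
        return mi

    # A gene perfectly aligned with the class labels has maximal mutual
    # information; an unrelated gene would score near zero.
    cls  = [0, 0, 0, 1, 1, 1]
    gene = [2, 2, 2, 0, 0, 0]
    print(mutual_information(gene, cls))   # 1.0 bit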
The level of similarity between genes is measured based on their mutual
information. The main idea of minimum redundancy is to select the genes such that
they are mutually maximally dissimilar. Let S denote the subset of features that is the
most relevant for classification. The minimum redundancy condition is
\min W_I, \quad W_I = \frac{1}{|S|^2} \sum_{i,j \in S} I(i, j)        (4.2)

where I(i, j) stands for I(g_i, g_j) for notational simplicity and |S| is the number of
features in S.
To measure the level of discriminant power of genes when they are
differentially expressed for different targeted classes, the mutual information
I(h, g_i) between the targeted classes h = {h_1, h_2, ..., h_K} and the gene
expression g_i is used again. I(h, g_i) quantifies the relevance of g_i for the
classification task, so the maximum relevance condition is to maximize the total
relevance of all genes in S:

\max V_I, \quad V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i)        (4.3)

where I(h, g_i) is referred to as I(h, i).
The minimum redundancy - maximum relevance feature set is obtained by
optimizing the conditions in Eqs. (4.2) and (4.3) simultaneously, which requires
combining them into a single criterion function. Here the two conditions are treated
as equally important, and the two simplest combined criteria are considered:

\max (V_I - W_I)        (4.4)

\max (V_I / W_I)        (4.5)
An exact solution to the mRMR requirements requires an O(N^{|S|}) search
(N is the number of genes in the whole gene set Ω). In practice, a simple incremental
heuristic algorithm is used. The first feature is selected according to Eq. (4.3), i.e.
the feature with the highest I(h, i). The remaining features are selected
incrementally, with earlier selected features staying in the feature set. Suppose a set
of m features has already been selected for the set S, and additional features are to be
selected from the set Ω_S = Ω - S (i.e. all genes except those already selected). The
following two conditions are optimized:

\max_{i \in \Omega_S} I(h, i)        (4.6)

\min_{i \in \Omega_S} \frac{1}{|S|} \sum_{j \in S} I(i, j)        (4.7)

The condition in Eq. (4.6) is equivalent to the condition in Eq. (4.3), while Eq.
(4.7) is an approximation of the condition in Eq. (4.2). Combining relevance and
redundancy as in Eqs. (4.4) and (4.5) leads to two selection criteria for a new
feature:

(1) MID, the Mutual Information Difference criterion,
    \max_{i \in \Omega_S} \left[ I(h, i) - \frac{1}{|S|} \sum_{j \in S} I(i, j) \right]

(2) MIQ, the Mutual Information Quotient criterion,
    \max_{i \in \Omega_S} \left[ I(h, i) \Big/ \frac{1}{|S|} \sum_{j \in S} I(i, j) \right]

These optimizations can be computed efficiently in O(|S|·N) complexity.
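
The incremental search just described can be sketched in a few lines of Python. The
sketch below assumes discretized gene values and the mutual_information() helper
shown after Eq. (4.1); it illustrates the MID/MIQ selection loop only and is not the
mRMR package used in the experiments.

    def mrmr(genes, labels, n_select, scheme='MIQ'):
        """genes: list of per-gene value lists (discretized); labels: class list."""
        relevance = [mutual_information(g, labels) for g in genes]
        # First feature: highest relevance I(h, i), as in Eq. (4.3)
        selected = [max(range(len(genes)), key=lambda i: relevance[i])]
        while len(selected) < n_select:
            best, best_score = None, float('-inf')
            for i in range(len(genes)):
                if i in selected:
                    continue
                # Mean mutual information with the selected genes, as in Eq. (4.7)
                redundancy = sum(mutual_information(genes[i], genes[j])
                                 for j in selected) / len(selected)
                if scheme == 'MID':
                    score = relevance[i] - redundancy
                else:  # MIQ; the epsilon guards against a zero denominator
                    score = relevance[i] / (redundancy + 1e-12)
                if score > best_score:
                    best, best_score = i, score
            selected.append(best)
        return selected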
4.4.2 ReliefF Algorithm
The main idea of the ReliefF algorithm, proposed by Kononenko (1994), is
to estimate the quality of attributes according to how well their values distinguish
between a given instance and its nearest instances: the nearest Hit (from the same
class) and the nearest Miss (from a different class). The ReliefF algorithm is shown
in the following figure.
Input : a vector space of training instances with attribute values and
        class values
Output : a weight W[A] for each attribute A

set all weights W[A] := 0.0
for i := 1 to m do begin
    randomly select an instance Ri
    find the k nearest hits Hj
    for each class C ≠ class(Ri) do
        find the k nearest misses Mj(C) from class C
    for A := 1 to all attributes do
        W[A] := W[A] - Σ_{j=1..k} diff(A, Ri, Hj) / (m·k)
                + Σ_{C ≠ class(Ri)} [ P(C) / (1 - P(class(Ri))) ]
                  · Σ_{j=1..k} diff(A, Ri, Mj(C)) / (m·k)
end

Figure 4.7: ReliefF Algorithm (Park, H., and Kwon, H-C., 2007)
According to the algorithm, the nearest hits Hj are the k nearest neighbors
from the same class, while the nearest misses Mj(C) are the k nearest neighbors from
a different class. W[A], the quality estimate of attribute A, is updated based on the
attribute values of Ri, the hits Hj and the misses Mj(C). Each probability weight is
divided by 1 - P(class(Ri)) because the class of the hits is missing from the sum over
diff(A, Ri, Mj(C)).
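
A minimal Python sketch of this update is given below, simplified to two classes and
k = 1 (a single nearest hit and miss) with a range-normalized diff(); the experiments
themselves rely on the Matlab and Weka implementations mentioned later in this
chapter.

    import numpy as np

    def relieff(X, y, m=50, rng=np.random.default_rng(0)):
        n, d = X.shape
        span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalizes diff() to [0, 1]
        W = np.zeros(d)
        for _ in range(m):
            r = rng.integers(n)
            dist = np.abs(X - X[r]).sum(axis=1)        # Manhattan distance
            dist[r] = np.inf                           # exclude the instance itself
            same = (y == y[r])
            hit  = np.argmin(np.where(same, dist, np.inf))
            miss = np.argmin(np.where(~same, dist, np.inf))
            # P(C) / (1 - P(class(Ri))) reduces to the miss-class prior here
            scale = np.mean(y == y[miss]) / (1 - np.mean(y == y[r]))
            W += (-np.abs(X[r] - X[hit]) + scale * np.abs(X[r] - X[miss])) / span / m
        return W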
4.4.3 Information Gain
The next feature selection method used in the experiment is Information
Gain. Let {c_i}_{i=1}^{m} denote the set of classes and let V be the set of possible
values for feature f. The information gain of a feature f is then defined as:

IG(f) = -\sum_{i=1}^{m} P(c_i) \log P(c_i)
        + \sum_{v \in V} P(f = v) \sum_{i=1}^{m} P(c_i \mid f = v) \log P(c_i \mid f = v)        (4.8)
In information gain, the numeric features need to be discretized. Therefore,
an entropy-based discretization method (Fayyad and Irani, 1993) is used, as
implemented in the WEKA software (Witten and Frank, 1999).
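
As an illustration of Eq. (4.8), the following Python sketch computes the information
gain of one (already discretized) feature as the class entropy minus the class entropy
conditioned on the feature values. It is a sketch only, not the WEKA implementation.

    import numpy as np
    from collections import Counter

    def entropy(labels):
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(feature, labels):
        n = len(labels)
        conditional = 0.0
        for v in set(feature):
            subset = [c for f, c in zip(feature, labels) if f == v]
            conditional += len(subset) / n * entropy(subset)
        return entropy(labels) - conditional

    print(information_gain([0, 0, 1, 1], ['ALL', 'ALL', 'AML', 'AML']))   # 1.0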
4.4.4 Chi Square
Chi-Square, or the χ² statistic, is another feature selection method used for
comparison. Its score for a feature f is:

\chi^2(f) = \sum_{v \in V} \sum_{i=1}^{m} \frac{[A_i(f = v) - E_i(f = v)]^2}{E_i(f = v)}        (4.9)
where V is the set of possible values for feature f, A_i(f = v) is the number of
instances in class c_i with f = v, and E_i(f = v) is the expected value of A_i(f = v),
computed as:

E_i(f = v) = P(f = v) P(c_i) N        (4.10)

where N is the total number of instances.
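
For illustration, Eqs. (4.9) and (4.10) can be evaluated directly from observed and
expected counts, as in the Python sketch below (again a sketch, not the
implementation used in the experiments).

    from collections import Counter

    def chi_square(feature, labels):
        n = len(labels)
        count_v = Counter(feature)                 # occurrences of each value v
        count_c = Counter(labels)                  # occurrences of each class c_i
        observed = Counter(zip(feature, labels))   # A_i(f = v)
        score = 0.0
        for v in count_v:
            for c in count_c:
                expected = (count_v[v] / n) * (count_c[c] / n) * n   # Eq. (4.10)
                score += (observed[(v, c)] - expected) ** 2 / expected
        return score

    print(chi_square([0, 0, 1, 1], ['ALL', 'ALL', 'AML', 'AML']))   # 4.0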
As with information gain, this method requires numeric features to be
discretized. The chi square algorithm below consists of two phases. For
discretization, the first phase begins with a high significance level (sigLevel) for all
numeric attributes. Phase 1 is iterated with a decreasing sigLevel until an
inconsistency rate δ is exceeded in the discretized data. Phase 2 begins with the
sigLevel0 determined in Phase 1; each attribute i is associated with its own
sigLevel[i] and takes turns merging until no attribute's values can be merged. If an
attribute is merged down to a single value at the end of Phase 2, that attribute is not
relevant in representing the original dataset. Feature selection is accomplished when
discretization ends.
Phase 1:
    set sigLevel = 0.5;
    do while (InConsistency(data) < δ) {
        for each numeric attribute {
            Sort(attribute, data);
            chi-sq-initialization(attribute, data);
            do {
                chi-sq-calculation(attribute, data)
            } while (Merge(data))
        }
        sigLevel0 = sigLevel;
        sigLevel = decreSigLevel(sigLevel);   /* tighten the significance level */
    }

Phase 2:
    set all sigLvl[i] = sigLevel0
    do until no-attribute-can-be-merged {
        for each attribute i that can be merged {
            Sort(attribute, data);
            chi-sq-initialization(attribute, data);
            do {
                chi-sq-calculation(attribute, data)
            } while (Merge(data))
            if (InConsistency(data) < δ)
                sigLvl[i] = decreSigLevel(sigLvl[i]);
            else
                attribute i cannot be merged
        }
    }

Figure 4.8: Chi Square Algorithm (Liu, H., and Setiono, R., 1995)
4.5
PNN Classifier
The probabilistic neural network was developed by Donald Specht. This
network provides a general solution to pattern classification problems by following
an approach developed in statistics, called Bayesian classifiers. Bayes decision
theory takes into account the relative likelihood of events and uses a priori
information to improve prediction.
The probabilistic neural network uses a supervised training set to develop
distribution functions within a pattern layer. These functions are used to estimate the
likelihood of an input feature vector being part of a learned category, or class. The
learned patterns can also be combined, or weighted, with the a priori probability, also
called the relative frequency, of each category to determine the most likely class for a
given input vector. If the relative frequency of the categories is unknown, then all
categories can be assumed to be equally likely and the determination of category is
solely based on the closeness of the input feature vector to the distribution function
of a class.
The probabilistic neural network has three layers. The network contains an
input layer, which has as many elements as there are separable parameters needed to
describe the objects to be classified. It has a pattern layer, which organizes the
training set such that each input vector is represented by an individual processing
element. Finally, the network contains an output layer, called the summation layer,
which has as many processing elements as there are classes to be recognized. Each
element in this layer combines the outputs of the pattern-layer processing elements
that relate to the same class and prepares that category for output. Sometimes a
fourth layer is added to normalize the input vector, if the inputs are not already
normalized before they enter the network.
In the pattern layer, there is a processing element for each input vector in the
training set. Normally, there are equal numbers of processing elements for each
output class. Otherwise, one or more classes may be skewed incorrectly and the
network will generate poor results. Each processing element in the pattern layer is
trained once. An element is trained to generate a high output value when an input
vector matches the training vector. The training function may include a global
smoothing factor to better generalize classification results. In any case, the training
vectors do not have to be in any special order in the training set, since the category of
a particular vector is specified by the desired output of the input. The learning
function simply selects the first untrained processing element in the correct output
class and modifies its weights to match the training vector.
The pattern layer operates competitively, where only the highest match to an
input vector wins and generates an output. In this way, only one classification
category is generated for any given input vector. If the input does not relate well to
any patterns programmed into the pattern layer, no output is generated.
The Parzen estimation can be added to the pattern layer to fine-tune the
classification of objects. This is done by adding the frequency of occurrence for each
training pattern built into a processing element. Basically, the probability distribution
of occurrence for each example in a class is multiplied into its respective training
node. In this way, a more accurate expectation of an object is added to the features
which make it recognizable as a class member.
The first layer (input layer) of PNN accepts d-dimensional input vectors. The
second layer calculates Gaussian basis functions (GBFs) as in Eq. 4.11 below:

f_{m,k}(x) = \frac{1}{(2\pi \sigma_{m,k}^2)^{d/2}}
             \exp\left( -\frac{\lVert x - c_{m,k} \rVert^2}{2 \sigma_{m,k}^2} \right)        (4.11)

which specifies the GBF for the m-th cluster in the k-th class, where σ_{m,k}^2 is
the variance, c_{m,k} is the cluster centroid and d represents the dimension of the
input vector.
The third layer of PNN estimates the class conditional probability density
function, given by the formula

p(x \mid C_k) = \sum_{m=1}^{M_k} \beta_{m,k} f_{m,k}(x)        (4.12)

where M_k is the number of clusters for class k and β_{m,k} is the intra-class mixing
coefficient, defined such that

\sum_{m=1}^{M_k} \beta_{m,k} = 1        (4.13)
The flow of PNN can be explained further by the following pseudo-code:
<1> Input layer:
    given an unknown pattern (feature vector) x
<2> Pattern layer: xi is the i-th reference (training) pattern vector
    for i = 1:N
        y(i) = x*xi' - 0.5*(x*x' + xi*xi');   % equals -0.5*||x - xi||^2
        y(i) = exp(y(i)/h^2);                 % Gaussian activation, smoothing factor h
    end
<3> Summation layer:
    for j = 1:n
        sum(j) = 0;
        for all i in {1,...,N} with class(xi) = j   % all instances in the same taxon
            sum(j) = sum(j) + y(i);
        end
        sum(j) = sum(j)/Nj;                   % per-taxon average activation
    end
    pattern x belongs to taxon j with membership
        membership(j) = sum(j) / Σj sum(j),   for all j in [1,n]
<4> Output layer:
    assign pattern x to the taxon sj* with the highest membership such that
        sj* = argmax{membership(j)}, over all j ∈ {1,...,n}
    Conclusion: assign x to taxon j with membership(j).

Figure 4.9: PNN Algorithm (Bi, C. et al, 2007)
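
To complement the pseudo-code, the following is a minimal runnable Python sketch
of the same flow: one Gaussian kernel per training pattern, per-class averaging in the
summation layer, and an argmax over the normalized memberships. The smoothing
factor h is a free parameter, set to 1.0 here purely for illustration.

    import numpy as np

    def pnn_classify(X_train, y_train, x, h=1.0):
        # Pattern layer: Gaussian activation of every training vector
        act = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * h ** 2))
        classes = np.unique(y_train)
        # Summation layer: average activation per class, estimating p(x | class)
        sums = np.array([act[y_train == c].mean() for c in classes])
        membership = sums / sums.sum()             # normalized class memberships
        return classes[np.argmax(membership)], membership

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y = np.array(['ALL', 'ALL', 'AML', 'AML'])
    print(pnn_classify(X, y, np.array([0.1, 0.0])))   # ('ALL', ...)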
4.6
Experimental Settings
Generally, the experiment was divided into two phases: Feature Selection
and Classification. For both phases, the dataset tested was the Leukemia dataset,
which consists of 7129 genes, 72 samples and two classes, namely acute myeloid
leukemia (AML) and acute lymphoblastic leukemia (ALL). Table 4.1 shows the
description of the leukemia dataset. The experiment was conducted on Microsoft
Windows XP Professional Edition (Service Pack 3) using an Intel Core 2 Duo
processor and 2.5 GB of RAM.
Table 4.1 : Leukemia dataset

Class    No. of samples    No. of genes per sample
ALL           47                   7129
AML           25                   7129
Total         72
The experiment first carries out data preparation so that the data fits the
feature selection techniques and the PNN classifier. Data is prepared using the IOS
GeneLinker software, and the data format is converted using free online tools. The
data is divided into two categories, discretized and continuous. The next phase of the
experiment is feature selection, followed by classification.
4.6.1
Feature Selection
The first phase of the experiment is the feature selection phase. In feature
selection, the experiment is conducted using the mRMR, ReliefF, Information Gain
and Chi Square techniques (as also mentioned in Chapter 2). In this experiment, the
techniques are run using Matlab and Weka software. The purpose of these
techniques is to select several subsets of informative genes from the overall 7129
genes of the leukemia dataset. The selected genes then undergo the classification
phase.
4.6.2
Classification
The classification phase is held after the selection phase because it must wait
for the informative genes to be selected by the feature selection techniques first.
Once several subsets of informative genes have been obtained, the classification
phase proceeds using the PNN classifier. This classifier classifies the subsets of
genes according to their classes (AML, ALL), and the end result of this
classification is the percentage of correctly classified samples. To validate this
result, 10-fold cross validation is performed in order to obtain a convincing result.
Furthermore, the results from both phases are compared to existing techniques to
show the effectiveness of the proposed technique.
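
The validation step can be sketched as follows, assuming the pnn_classify() function
sketched in Section 4.5: the samples are shuffled into 10 folds, each fold serves once
as the test set, and the per-fold accuracies are averaged. This is an outline of the
procedure, not the tooling actually used.

    import numpy as np

    def cross_validate(X, y, k=10, rng=np.random.default_rng(0)):
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, k)
        accuracies = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)         # everything outside the fold
            correct = sum(pnn_classify(X[train], y[train], X[i])[0] == y[i]
                          for i in fold)
            accuracies.append(correct / len(fold))
        return np.mean(accuracies), accuracies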
4.7 Summary
This chapter discussed the techniques used in detail and how the experiment
is carried out throughout this research. The first phase of the experiment is the
preparation of the data before it is fed into the feature selection methods. After that,
feature selection is implemented using mRMR, ReliefF, Information Gain and Chi
Square. The selected features are then classified using the PNN classifier to measure
the performance of each subset. Lastly, to evaluate the performance of the classifier,
10-fold cross validation is performed to further analyze the results.
CHAPTER 5
EXPERIMENTAL RESULT ANALYSIS
5.1
Overview
This chapter discusses the results of the experiments conducted in this
research; further analysis is done to verify whether the techniques are suitable for
classification purposes and hence improve the classification result, or otherwise.
There are several sections in this chapter, covering the experimental settings, the
dataset tested and the analysis of the results obtained.
5.2
Analysis of Results

The following results are produced using the mRMR, ReliefF, Information
Gain and Chi Square techniques for gene selection and the PNN classifier for the
classification of genes. Apart from these techniques, a few other classifiers are tested
in this experiment in order to compare and validate the result obtained from the
proposed technique.
In the selection of genes, two types of data are experimented on in order to
investigate which type of data produces higher classification accuracy. As
mentioned earlier, these are discrete data and continuous data. Figure 5.1 below
shows the percentage of classification for both data types using the mRMR
technique and the PNN classifier. mRMR selects 100 genes and the data is then split
into two parts, 70% for training and 30% for testing.
[Bar chart of classification percentage by data type: continuous data 91%,
discrete data 73%]

Figure 5.1 : Classification using PNN for Different Types of Data
According to the figure above, the continuous data produces higher
classification accuracy (91%) compared to the discrete data (73%). The low
accuracy of the discrete data is due to the changes made to the attribute values in the
dataset. Since classification learns from the pattern and uniqueness of each attribute,
it produces more accurate results if the data are very distinct among classes. As can
be seen in the earlier figure, the discrete data are unlikely to be distinct between the
two classes, and this produces low classification accuracy. As for the continuous
data, it is clear that the data in each class are very distinct from each other; this helps
in recognizing the pattern of each class and hence results in better classification
accuracy.
The next experiment investigates the differences between the mRMR
schemes used to select features, which are Mutual Information Difference (MID)
and Mutual Information Quotient (MIQ). A detailed explanation of these schemes
was given in Chapter 4. Figure 5.2 below displays the classification accuracy
obtained using both schemes.
[Bar chart of classification accuracy by mRMR scheme: 100 features — MIQ 91%,
MID 86%; 200 features — MIQ 81%, MID 81%]

Figure 5.2 : Classification Accuracy using PNN for Different Scheme in Feature
Selection using mRMR
Based on the above graph, the result obtained from the MIQ scheme is better
than that of the MID scheme for 100 selected features, but for 200 selected features
both schemes produce the same accuracy (81%). According to previous work (Ding
and Peng, 2005), the MIQ scheme is much better than the MID scheme, and the
graph shown supports this. This is because, in the MIQ scheme, the calculation
eliminates most of the redundant genes compared to MID; the redundant genes in
the MID scheme contribute to misclassification of data. The low result retrieved
from the classification of 200 selected features might be due to redundancy of
features that leads to misinterpretation in classification. It can be said that only 100
genes are enough to classify the data.
[Bar chart of classification accuracy by number of selected features: 10 features
81%, 50 features 91%, 100 features 91%, 200 features 81%]

Figure 5.3 : Classification using PNN by Different Number of Selected Features
The above figure shows the classification using PNN for different numbers
of selected features. The numbers of selected features tested to evaluate the mRMR
performance are 10, 50, 100 and 200. According to the results achieved, the
classification accuracy produced by 50 and 100 selected genes is 91%, whereas for
10 and 200 selected genes the accuracy is a bit lower (81%). The low accuracy
produced by 10 genes is due to the lack of informative genes giving information
about a class. This lack of information gives the classifier very small training data,
not enough to produce a good model, and leads to misclassification of the test set
using a poor model. For 200 selected genes, as explained before, the poor
performance is caused by redundancy among features. Say, of the 200 genes, only
100 give information about the class whereas the other 100 consist of noise and
irrelevant genes. When these genes are fed into the classifier for training, they yield
a poor model because of the irrelevant information. As for the 50 and 100 selected
genes, the information given is sufficient for the classifier to produce a very good
model; these subsets of genes do not contain redundant genes and give great
classification performance.

In feature selection, a few other techniques have been used to select
informative genes, such as Chi Square, ReliefF and Information Gain. Thus, to
ensure that the mRMR technique gives promising results, a comparison of these
techniques is conducted to see which technique gives better classification accuracy.
Figure 5.4 displays the graph of the comparison among feature selection techniques,
while Figure 5.5 shows the comparison of 3 classifiers, namely PNN, Naïve Bayes
and Random Forest, in classifying the genes selected by mRMR.
[Bar chart of classification accuracy by feature selection method using PNN:
mRMR 91%, Information Gain 86%, ReliefF 81%, Chi Square 81%]

Figure 5.4 : Comparison of Classification Accuracy by Different Feature Selection
Method using PNN
Based on the graph above, it is very clear that mRMR achieves the best
performance among the techniques. mRMR produces 91% classification accuracy
compared to the other feature selection techniques, which yield lower accuracy:
86% for Information Gain and 81% for both ReliefF and Chi Square, the lowest
percentages in the comparison. The key to the success of the mRMR technique is
that it focuses on redundancy among genes rather than on relevance only (Peng et
al., 2005). This technique has been proven to give tremendous results in previous
work. By eliminating redundant genes, mRMR performs well, selecting only very
informative genes that strongly contribute to determining the class. In comparison,
ReliefF, Information Gain and Chi Square compute only the relevance of
features/genes and ignore the existence of redundancy in feature subsets. This leads
to misclassification of genes due to irrelevant features.
[Bar chart of classification accuracy by classifier: PNN 96.4%, Random Forest
96.1%, Naïve Bayes 96.1%]

Figure 5.5 : Comparison of Classification Accuracy using Different Classifier
Figure 5.5 shows the classification accuracy obtained by implementing
different classifiers in order to compare which classifier gives the best accuracy.
Based on the percentage of classification accuracy, PNN produces the highest
accuracy (96.4%) compared to the other two classifiers, which only produce 96.1%.
Even though the results differ only slightly, PNN still produces the best
performance. PNN applies Bayes theory to solve pattern classification; the basic
idea of Bayes theory is to take into account the relative likelihood of events as well
as prior information. PNN also avoids the risk of classifying data into the wrong
class. This gives a better performance compared to the other classifiers.
[Line chart of per-fold classification accuracy over folds 1 to 10 for Naïve Bayes,
Random Forest and PNN]

Figure 5.6 : Classification Accuracy using 10-fold Cross Validation
Referring to the above figure, most of the time the classification result is
100% for the 3 classifiers. However, this can only be seen in the early folds; at the
later folds (the end of the dataset), the classification result gets lower, especially for
PNN. Based on observations, the end of the dataset consists of more AML samples,
and from this we can conclude that the AML class is more difficult to distinguish in
an AML/ALL dataset than the ALL class. This is because the number of AML
samples might not be enough to provide information about their pattern. As stated
earlier, ALL consists of 47 samples whereas AML consists of only 25 samples. This
makes it difficult to retrieve much valuable information from the AML samples.
That is why the folds that involve AML samples give poor performance compared
to ALL.
5.3 Discussion
Based on the overall results achieved, it is clear that the combination of
mRMR and the PNN classifier gives better results compared to other existing
techniques. The mRMR technique focuses on the redundancy of genes while at the
same time taking maximum relevance into account, unlike the other feature selection
techniques (Information Gain, ReliefF and Chi Square), which focus only on the
maximum relevance of features and ignore the existence of the redundancy problem
in feature subsets. This gives credit to mRMR, which addresses both problems, and
that is the major reason why it produces better results. As for the classifier, PNN
achieved a slightly better result than the other classifiers. This is due to the nature of
the PNN architecture, which tries to minimize the expected risk of classifying
features into the wrong class. Furthermore, PNN is less sensitive to noise in the data
and it produces outputs with Bayes posterior probabilities.
5.4 Summary
The results obtained from the several experiments have proven that the
mRMR technique selects very useful genes and reduces redundancy, whereas PNN
acts as a great classifier that gives better results than other existing classifiers. The
selection of genes is important since huge data affects the efficiency of classifiers.
Furthermore, by selecting only the relevant genes, biologists do not have to waste
time investigating the wrong genes as causes of cancer; they only have to rely on the
selected genes to carry on their research. Thus, it can be said that the combination of
the mRMR technique and the PNN classifier gives great results in the classification
of microarray data.
CHAPTER 6
DISCUSSION AND CONCLUSION
6.1
Overview
This chapter presents the general conclusion of this research, the
disadvantages and problems of this research, and suggestions for future work. The
problems that arose during this research are analyzed in order to overcome them,
and the conclusion discusses the overall results of the experiments comparing
feature selection techniques for selecting informative genes and PNN for classifying
microarray data.
6.2
Research Contribution
This research has contributed some knowledge in the area of classification of
microarray data. The contributions of this research are listed below:
1. A set of meaningful and informative genes has been selected using mRMR,
Information Gain, ReliefF and Chi Square as input to the classifier.
2. The effectiveness of the four techniques is evaluated using the PNN classifier,
and the comparison showed that of the four techniques, mRMR performs much
better than the others. Thus, it can be concluded that mRMR is the best gene
selection technique for the classification of microarray data.
3. Further evaluation of the performance of mRMR as a gene selection technique
has been done using different classifiers, namely PNN, Random Forest and
Naïve Bayes, in order to compare which classifier gives the best classification
accuracy. Validation using 10-fold cross validation has also been implemented
so that the result is not biased. The result shows that the PNN classifier gives
the highest accuracy. From this result, it can be said that PNN is a very effective
classifier for classifying microarray data.
4. The reduction of genes and an accurate prediction of microarray data are
crucial because they can detect cancer in its early stages and thus save human
lives. This research has helped in reducing the number of genes and gives
biologists insight into the informative genes. The reduction and classification of
genes are experimented on using computer science techniques. Thus, this
research has contributed to both the computing and medical fields.
6.3
Problems and Limitations of Research
During this research, some problems and limitations arose while
implementing the feature selection techniques and the PNN technique. One of them
is the use of a limited dataset. This research only uses one dataset, which is
leukemia. Hence the result obtained does not generalize, and there is no comparison
of results between other datasets, which might have different types of diseases,
different sizes and different numbers of classes.
6.4
Suggestions for Better Research
Since the problems exist because of a limited dataset, it is important to use
different types of datasets, differing in terms of sizes, diseases and classes, to
overcome this situation. This would help in getting a better result, and comparison
between the datasets could show the most suitable dataset to be used with the feature
selection techniques and the PNN technique, further validating the performance of
the designed technique. Examples of other datasets are colon tumor, lung cancer,
breast cancer and prostate cancer.
In this study, the data involved is leukemia, a fatal disease whose incidence
has been increasing every year compared to other diseases. Early detection of
leukemia helps in early treatment and thus in curing the disease. There are various
ways to detect, cure and treat leukemia nowadays, and a lot of research has been
done in this area in order to reduce the number of deaths caused by leukemia.
The new technology of microarrays allows thousands of gene expressions to
be determined simultaneously. This advantage of microarrays has led to applications
in the medical area, namely the management and classification of cancer and
infectious diseases. However, microarray technology also suffers several drawbacks:
the high dimensionality of the data and the irrelevant information it contains make it
hard to classify disease accurately. Nevertheless, there are various feature selection
techniques that can be used to solve this problem and improve classification
accuracy. By using feature selection, only an appropriate subset of genes is selected
from the microarray.
The goal of classification in leukemia is to distinguish between ALL and
AML. Many classifiers have been studied for classifying microarray data, but not all
of them show high accuracy. This research has concluded that mRMR and PNN
serve as the best combination for gene selection and classification of microarray
data. However, these techniques should be explored more deeply, and further
investigation should be conducted to overcome the problems and limitations of this
research and thus obtain a better result. This knowledge can be spread throughout
the world for future generations to learn from and to improve any flaws in this
method or research.
REFERENCES
Ahlers, F.J., Carlo, W.D., Fleiner, C., Godwin, L., Mick, Nath, R.D., Neumaier, A.,
Phillips, J.R., Price, K., Storn, R., Turney, P., Wang, F., Zandt, J.V., Geldon,
H. and Gauden, P.A. Differential Evolution. (accessed May 20, 2009).
http://www.icsi.berkeley.edu/~storn/code.html

Alon, U., Barkai, N., Notterman, D.A., et al. (1999). Broad patterns of gene
expression revealed by clustering analysis of tumor and normal colon tissues
probed by oligonucleotide arrays. PNAS. Vol 96: 6745-6750.

Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA
Microarray and Protein Array Data. New Jersey, USA: Wiley Inter-Science.
8-10.

Babu, B.V. and Chaturvedi, G. Evolutionary Computation Strategy for Optimization
of an Alkylation Reaction. Birla Institute of Technology and Science.

Babu, B.V. and Sastry, K.N.N. (1999). Estimation of Heat Transfer Parameters in a
Trickle-bed Reactor using Differential Evolution and Orthogonal Collocation.
Elsevier Science.

Balasundaram Karthikeyan, Srinivasan Gopal, Srinivasan Venkatesh and
Subramanian Saravanan. (2006). PNN and its Adaptive Version – An Ingenious
Approach to PD Pattern Classification Compared with BPA Network. Journal
of Electrical Engineering. Vol 57: 138-145.

Bi, C., Saunders, M.C. and McPheron, B.A. (2007). Wing Pattern-Based
Classification of the Rhagoletis pomonella Species Complex Using Genetic
Neural Networks. International Journal of Computer Science & Applications.
Vol 4: 1-14.

Breiman, L. (2001). Random Forests. Machine Learning. Vol 40: 5-32.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and
Regression Trees. Belmont: Wadsworth.

Breiman, L. (2003). RF/tools – A class of two eyed algorithms. SIAM Workshop.
http://oz.berkeley.edu/users/breiman/siamtalk2003.pdf

Campbell, N.A. and Reece, J.B. (2002). Biology. Sixth edition. San Francisco:
Benjamin Cummings.

Comma-separated Values. Wikipedia. (accessed June 15, 2009).
http://en.wikipedia.org/wiki/Comma-separated_values

Cover, T. and Thomas, J. (1991). Elements of Information Theory. New York: John
Wiley and Sons.

Data Preparation. Encyclopedia.com. (accessed October 6, 2009).
http://www.encyclopedia.com/doc/1O11-datapreparation.html

Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene Selection and Classification
of Microarray Data using Random Forest. BMC Bioinformatics. Vol 7: 3.

Ding, C. and Peng, H. (2003). Minimum Redundancy Feature Selection from
Microarray Gene Expression Data. Proceedings of the Computational Systems
Bioinformatics.

Ding, C. and Peng, H. (2005). Minimum Redundancy Feature Selection from
Microarray Gene Expression Data. Journal of Bioinformatics and
Computational Biology. Vol 3: 185-205.

Discretization of Continuous Features. Wikipedia. (accessed October 23, 2009).
http://en.wikipedia.org/wiki/Discretization_of_continuous_features

DNA. Wikipedia. (accessed May 11, 2009). http://en.wikipedia.org/wiki/DNA

Dudoit, S. and Gentleman, R. (2003). Classification in Microarray Experiment.

Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm.
Machine Learning: Proceedings of the Thirteenth International Conference.
148-156.

Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Molecular classification of
cancer: Class discovery and class prediction by gene expression monitoring.
Science. Vol 286: 531-537.

Huerta, E.B., Duval, B. and Hao, J.K. A hybrid GA/SVM approach for gene selection
and classification of microarray data.

Jamain, A. and Hand, D.J. (2005). The Naïve Bayes Mystery: A Classification
Detective Story. Pattern Recognition Letters. Vol 26: 1752-1760.

Jin, X., Xu, A., Bie, R. and Guo, P. (2006). Machine Learning and Chi-Square
Feature Selection for Cancer Classification Using SAGE Gene Expression
Profiles. In Li, J. et al. (Eds.) Data Mining for Biomedical Applications.
(pp. 106-115). Berlin Heidelberg: Springer-Verlag.

Kim, Y.B. and Gao, J. (2006). A New Hybrid Approach for Unsupervised Gene
Selection. IEEE Explorer.

Kohavi, R. and John, G.H. (1997). Wrappers for Feature Subset Selection.

Kononenko, I. (1994). Estimating Attributes: Analysis and Extensions of Relief.
Proceedings of the European Conference on Machine Learning. Springer-Verlag
New York. 171-182.

K-nearest_neighbor_algorithm. Wikipedia. (accessed May 21, 2009).
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

Lakshminarasimman, L. and Subramanian, S. (2008). Applications of Differential
Evolution in Power System Optimization. Advances in Differential Evolution.
Vol 143: 257-273.

Langdon, W.B. (2005). Evolving Benchmarks. The 17th Belgian-Dutch Conference
on Artificial Intelligence. 365-367.

Liu, J. and Iba, H. (2002). Selecting Informative Genes Using a Multiobjective
Evolutionary Algorithm. Proceedings of the 2002 Congress. 12-17 May.
297-302.

Mendes, S.P., Pulido, J.A.G., Rodriguez, M.A.V., Simon, M.D.J. and Perez, J.M.S.
(2006). A Differential Evolution Based Algorithm to Optimize the Radio
Network Design Problem.

Mishra, S.K. (2006). Global Optimization by Differential Evolution and Particle
Swarm Methods: Evaluation on Some Benchmark Functions. MPRA Paper
1005. 7 November 2007.

Mitchell, T.M. (1997). Machine Learning. New York: McGraw-Hill.

Muhammad Faiz bin Misman (2007). Pembangunan Program Selari Menggunakan
Message Passing Interface (MPI) Pada Teknik Gabungan Algoritma Genetik
dan Mesin Sokongan Vektor. Universiti Teknologi Malaysia: Tesis Sarjana
Muda.

New Gene Selection Method. The Medical News. July 8, 2004 (accessed May 14,
2009). http://www.news-medical.net/news/2004/07/08/3157.aspx

Nur Safawati binti Mahshos (2008). Pengecaman Imej Kapal Terbang Dengan
Menggunakan Teknik Rangkaian Neural Radial Basis Function Dan
Rambatan Balik. Universiti Teknologi Malaysia: Tesis Sarjana Muda.

Nurulhuda binti Ghazali (2008). A Hybrid of Particle Swarm Optimization and
Support Vector Machine Approach for Genes Selection and Classification of
Microarray Data. Universiti Teknologi Malaysia: Tesis Sarjana Muda.

Park, H. and Kwon, H.-C. (2007). Extended Relief Algorithms in Instance-based
Feature Filtering. Sixth International Conference on Advanced Language
Processing and Web Information Technology. 123-128.

Pastell, M.E. and Kujala, M. (2007). A Probabilistic Neural Network Model for
Lameness Detection. American Dairy Science Association. Vol 90: 2283-2292.

Paul, T.K. and Iba, H. (2005). Gene selection for classification of cancers using
probabilistic model building genetic algorithm. BioSystems. Vol 82: 208-225.

Peng, H., Long, F. and Ding, C. (2005). Feature Selection Based on Mutual
Information: Criteria of Max-Dependency, Max-Relevance, and
Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine
Intelligence. Vol 27: 1226-1238.

Peng, H. (2005). mRMR (minimum Redundancy Maximum Relevance Feature
Selection). (accessed June 1, 2009).
http://penglab.janelia.org/proj/mRMR/index.htm

Principal Component Analysis. Wikipedia. (accessed May 14, 2009).
http://en.wikipedia.org/wiki/Principal_components_analysis

Russell, P.J. (2003). Essential iGenetics. San Francisco: Benjamin Cummings.
226-265.

Savitch, W. (2006). Problem Solving with C++. Sixth edition. USA: Pearson
International Edition.

Shan, Y., Zhao, R., Xu, G., Liebich, H.M. and Zhang, Y. (2002). Application of
Probabilistic Neural Network in the Clinical Diagnosis of Cancers based on
Clinical Chemistry Data. Analytica Chimica Acta. 77-86.

Shen, Q., Shi, W.-M., Kong, W. and Ye, B.-X. (2007). A Combination of Modified
Particle Swarm Optimization Algorithm and Support Vector Machine for
Gene Selection and Tumor Classification. Talanta. Vol 71: 1679-1683.

Shena, M. (2003). Microarray Analysis. New Jersey: John Wiley & Sons, Inc.

Specht, D.F. (1990). Probabilistic Neural Networks. Neural Networks. Vol 3:
110-118.

Storn, R. and Price, K. (1997). Differential Evolution – A Simple and Efficient
Adaptive Scheme for Global Optimization over Continuous Spaces. Journal
of Global Optimization.

Suzila binti Sabil (2007). Aplikasi Prinsip Analisis Komponen dan Rangkaian Neural
Perceptron untuk Mengkelaskan Data Kanser Usus. Universiti Teknologi
Malaysia: Tesis Sarjana Muda.

Vapnik, V.N. (1995). The Nature of Statistical Learning Theory. New York:
Springer.

Vasan, A. and Raju, K.S. Optimal Reservoir Operation Using Differential Evolution.
Birla Institute of Technology and Science.

Visen, N.S., Paliwal, J., Jayas, D.S. and White, N.D.G. (2002). Specialist Neural
Networks for Cereal Grain Classification. Biosystems Engineering. Vol 82:
151-159.

Xu, R. and Wunsch, D.C. (2003). Probabilistic Neural Networks for Multi-class
Tissue Discrimination with Gene Expression Data. Proceedings of the
International Joint Conference on Neural Networks. Vol 3: 1696-1701.

Xue, F. (2004). Multi-objective Differential Evolution: Theory and Applications.
Rensselaer Polytechnic Institute: Doctor of Philosophy.

Yang, Z., Yang, Z., Lu, W., Harrison, R.G., Eftestøl, T. and Steene, P.A. (2005). A
Probabilistic Neural Network as the Predictive Classifier of Out-of-hospital
Defibrillation Outcomes. Resuscitation. Vol 64: 31-36.

Yousefi, H., Handroos, H. and Soleymani, A. (2008). Application of Differential
Evolution in System Identification of a Servo-hydraulic System with a
Flexible Load. Elsevier. Vol 18: 513-528.

Yuan, S.-F. and Chu, F.-L. (2007). Fault Diagnostics based on Particle Swarm
Optimization and Support Vector Machines. Mechanical Systems and Signal
Processing. Vol 21: 1787-1798.

Zhang, L.-X., Wang, J.-X., Zhao, Y.-N. and Yang, Z.-H. (2003). A Novel Hybrid
Feature Selection Algorithm: Using ReliefF Estimation for GA-Wrapper
Search. Proceedings of the Second International Conference on Machine
Learning and Cybernetics. Xi'an. 380-384.

Zheng, Z., Srihari, R. and Srihari, S. (2003). A Feature Selection Framework for Text
Filtering. Proceedings of the Third IEEE International Conference on Data
Mining.
APPENDIX A

PROJECT 1 GANTT CHART

[Gantt chart for Project 1 (11 May 2009 – 1 July 2009, 39 days): planning
(supervisor and topic selection), analysis (background of problem, objectives and
importance of the study, Chapter 1), literature review (Chapter 2), methodology
(Chapter 3), initial result and design of the proposed method (Chapter 4), report
writing, project presentation and submission of the final report, with a duration and
start/finish dates for each task.]
APPENDIX B

PROJECT 2 GANTT CHART

[Gantt chart for Project 2 (6 July 2009 – 30 October 2009, 85 days): data collection,
literature review (studying previous research and understanding the current
research), implementation (coding, testing the program, re-coding, discussion with
supervisor), results and discussion (making conclusions), report writing (draft,
supervisor checking, corrections, submission of the final report) and project
presentation, with a duration and start/finish dates for each task.]