COMPARATIVE STUDY OF FEATURE SELECTION METHOD OF MICROARRAY DATA FOR GENE CLASSIFICATION

NURULHUDA BINTI GHAZALI

A project report submitted in partial fulfillment of the requirements for the award of the degree of Master of Science (Computer Science)

Faculty of Computer Science and Information Systems
Universiti Teknologi Malaysia

OCTOBER 2009

DEDICATION

To my beloved Mummy and Abah… Hazijun bt. Abdullah and Ghazali bin Sulong
My beloved sisters.. Nurhanani and Nur Hafizah
My beloved brother.. Ikmal Hakim
My brothers-in-law.. Saiful Azril and Faridun Naim
My beloved nieces.. Sarah Afrina and Sofea Alisya
My supervisor.. Assoc. Prof. Dr. Puteh Saad
and last but not least to all my supportive friends, especially Syara, Radhiah, Zalikha, Umi and Hidzir..
“Thank you for all the support and love given”

ACKNOWLEDGEMENT

In the name of Allah, Most Gracious, Most Merciful. All praise and thanks be to Allah for His guidance, which has led me in completing this research. His blessings gave me strength and courage throughout this past year and helped me overcome difficulties during this research period.

First and foremost, I would like to take this opportunity to express my sincere gratitude to those who assisted me in finishing this research. To my dear supervisor, Assoc. Prof. Dr. Puteh Saad, thank you for all your support and guidance in showing me the right path towards completing this research. I really appreciate the advice and motivation that you gave me throughout the period of this research.

My infinite thanks are dedicated to my loving and caring family, who have cherished me and given me their full support in every way. I deeply appreciate all the motivation and inspiration. Without them, it would have been impossible for me to finish my research.

And last but not least, endless appreciation to all my fellow friends and classmates for all their support and encouragement. Their friendship never fails to amaze me. May Allah S.W.T bless them all and repay all of their kindness and sacrifices.

ABSTRACT

Recent advances in biotechnology, such as microarrays, offer the ability to measure the expression levels of thousands of genes in parallel. Analysis of microarray data can provide understanding of and insight into gene function and regulatory mechanisms. This analysis is crucial for identifying and classifying cancer diseases. Recent approaches to cancer classification are based on the gene expression profile rather than on the morphological appearance of the tumor. However, this task is made more difficult by the noisy nature of microarray data and the overwhelming number of genes. It is therefore an important issue to select a small subset of genes, referred to as informative genes, to represent the thousands of genes in microarray data. These informative genes are then classified into their appropriate classes. To achieve the best solution to the classification issue, we propose an approach that combines the minimum Redundancy-Maximum Relevance feature selection method with a Probabilistic Neural Network classifier. The minimum Redundancy-Maximum Relevance method is used to select the informative genes, while the Probabilistic Neural Network acts as the classifier. This approach has been tested on a well-known cancer dataset, Leukemia. The results show that the selected genes give high classification accuracy.
This reduction in the number of genes helps relieve some of the burden on biologists, and the improved classification accuracy can be widely used to detect cancer at an early stage.

ABSTRAK

Recent advances in biotechnology, for example microarrays, allow the expression levels of thousands of genes to be measured in parallel. Analysis of microarray data can provide understanding and knowledge of a gene's function and its regulatory mechanisms. This analysis is important for identifying and classifying chronic diseases, especially cancer. The technology recently used in cancer classification is based on information from gene expression rather than on the physical appearance of the tumor. However, this task becomes difficult because of the presence of noise in microarray data processing and the very large number of genes. It is therefore an important issue to select only a small number of genes from the thousands of genes in microarray data; these are called informative genes. These informative genes will be classified according to their appropriate classes. To achieve the best solution to this problem, we propose a gene selection method, minimum Redundancy-Maximum Relevance, together with the Probabilistic Neural Network classifier. Minimum Redundancy-Maximum Relevance is used to select the informative genes, while the Probabilistic Neural Network acts as the classifier. The method was tested on one type of cancer, Leukemia. The experimental results obtained are very satisfactory; they can assist the work of biologists and give society hope of detecting cancer at an early stage.

TABLE OF CONTENTS

CHAPTER TITLE

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF APPENDICES

1 INTRODUCTION
1.1 Introduction
1.2 Background of the Problem
1.3 Problem Statement
1.4 Objectives of Research
1.5 Scope of Research
1.6 Importance of the Study

2 LITERATURE REVIEW
2.1 Introduction
2.2 Genes and Gene Expression
2.3 Microarray Technology
2.4 Feature Selection
2.4.1 ReliefF Algorithm
2.4.2 Information Gain
2.4.3 Chi Square
2.4.5 Minimum Redundancy-Maximum Relevance Feature Selection
2.5 Classification
2.5.1 Random Forest
2.5.2 Naïve Bayes
2.5.3 Probabilistic Neural Network
2.6 Challenges in Genetic Expression Classification
2.7 Summary

3 METHODOLOGY
3.1 Introduction
3.2 Research Framework
3.2.1 Problem Definition
3.2.2 Related Studies
3.2.3 Study on Proposed Method
3.2.4 Data Preparation
3.2.5 Feature Selection
3.2.6 Classification
3.2.7 Evaluation and Validation
3.2.8 Result Analysis
3.3 Leukemia
3.4 Software Requirement
3.5 Summary

4 IMPLEMENTATION
4.1 Introduction
4.2 Data Format
4.3 Data Preprocessing
4.4 Feature Selection Method
4.4.1 mRMR Feature Selection Method
4.4.2 ReliefF Algorithm
4.4.3 Information Gain
4.4.4 Chi Square
4.5 PNN Classifier
4.6 Experimental Settings
4.6.1 Feature Selection
4.6.2 Classification
4.7 Summary

5 EXPERIMENTAL RESULT ANALYSIS
5.1 Overview
5.2 Analysis of Results
5.3 Discussion
5.4 Summary

6 DISCUSSION AND CONCLUSION
6.1 Overview
6.2 Research Contribution
6.3 Problems and Limitation of Research
6.4 Suggestions for Better Research
REFERENCES
APPENDIX A
APPENDIX B

LIST OF TABLES

TABLE NO TITLE
2.1 Schemes in mRMR Optimization Condition
2.2 Comparison of k-NN and PNN using 4 Datasets
4.1 Leukemia Dataset

LIST OF FIGURES

FIGURE NO TITLE
2.1 DNA Structure
2.2 Process of Producing Microarray
2.3 Sample of Microarray
2.4 Comparison of 3 Methods of Feature Selection
2.5 Architecture of PNN
3.1 Research Framework
3.2 Sample of Dataset
3.3 Sample of Dataset
3.4 Process of Feature Selection
3.5 Process of Classification
3.6 Overall Process of Feature Selection and Classification
3.7 Abnormal Proliferation of Cells in Bone Marrow Compared to Normal Bone Marrow
4.1 Original Dataset in ARFF Format Showing Genes Values
4.2 Original Dataset in ARFF Format Showing Class Names
4.3 Dataset in IOS GeneLinker Software before Discretization
4.4 Dataset in IOS GeneLinker Software after Discretization
4.5 Discretized Data in CSV Format
4.6 Continuous Data in CSV Format
4.7 ReliefF Algorithm
4.8 Chi Square Algorithm
5.1 Classification using PNN for Different Types of Data
5.2 Classification Accuracy using PNN for Different Schemes in Feature Selection using mRMR
5.3 Classification using PNN by Different Numbers of Selected Features
5.4 Comparison of Classification Accuracy by Different Feature Selection Methods using PNN
5.5 Comparison of Classification Accuracy using Different Classifiers
5.6 Classification Accuracy using 10-fold Cross Validation

LIST OF ABBREVIATIONS

ALL - Acute Lymphoblastic Leukaemia
AML - Acute Myeloid Leukaemia
ARFF - Attribute-Relation File Format
CSV - Comma-Separated Values
mRMR - Minimum Redundancy Maximum Relevance
PNN - Probabilistic Neural Network
DNA - Deoxyribonucleic Acid
k-NN - k-Nearest Neighbor
RNA - Ribonucleic Acid
mRNA - Messenger Ribonucleic Acid

LIST OF APPENDICES

APPENDIX TITLE
A Project 1 Gantt Chart
B Project 2 Gantt Chart

CHAPTER 1

INTRODUCTION

1.1 Introduction

Every living organism has discrete hereditary units known as genes. Each gene provides some function or mechanism, either by itself or in combination with other genes, that eventually produces some property of its organism. A genome is the complete set of genes of an organism and is described as the “library” of genetic instructions that an organism inherits (Campbell and Reese, 2002). Each gene is made of a deoxyribonucleic acid (DNA) molecule, which consists of two long strands tightly wound together in a spiral structure known as a double helix (Amaratunga and Cabrera, 2004). Along each of these strands lie genes in various forms, whose sequences differ for each organism. This makes each organism unique and different from every other.

The DNA molecule of an organism is located in a cell. A cell is the fundamental unit of all living organisms, and it contains many substructures such as the nucleus, cytoplasm and plasma membrane. The nucleus is where DNA is embedded. Genes in DNA are expressed by transferring their coded information into proteins that dwell in the cytoplasm. This process is called gene expression (Russell, 2003). There are several experimental techniques to measure gene expression, such as expression vectors, reporter genes, northern blots, fluorescent hybridization, and DNA microarrays. DNA microarray technology allows the simultaneous measurement of the expression level of a great number of genes in tissue samples (Paul and Iba, 2005).
It yields a set of floating-point and absolute values. Many researchers have explored classification methods to recognize cancerous and normal tissues by analyzing microarray data. Microarray technology typically produces large datasets with expression values for thousands of genes (2000-20000) in a cell mixture, but only a few samples are available (20-80) (Huerta et al.).

This study focuses on gene selection and classification of DNA microarray data in order to distinguish tumor samples from normal samples. Gene selection is a process in which a set of informative genes is selected from the gene expression data in the form of a microarray dataset. This process helps improve the performance of the classifier. Classification, on the other hand, is a process of assigning microarray data to one of several classes, each with its own characteristics. Several techniques have been used for gene selection, such as the ReliefF algorithm, Information Gain, minimum Redundancy Maximum Relevance (mRMR) and Chi Square. For the classification of microarray data, a few techniques have been applied in the bioinformatics field to classify the highly dimensional data. These techniques include Random Forest, Naïve Bayes and the Probabilistic Neural Network (PNN).

The proposed method involves two stages: the first is the gene selection stage and the second is the classification stage. For gene selection, the chosen technique is minimum Redundancy-Maximum Relevance (mRMR) feature selection, which is compared with three other methods, namely ReliefF, Information Gain and Chi Square. mRMR is a feature selection framework introduced by Ding and Peng in 2003. They supplement the maximum relevance criterion with a minimum redundancy criterion to choose additional features that are maximally dissimilar to those already identified. This can expand the representative power of the feature subset and helps improve its generalization properties.

The classification problem is handled by the Probabilistic Neural Network (PNN) technique. PNN has been widely used in solving classification problems because it can categorize data accurately (Nur Safawati Mahshos, 2008). Both techniques are assessed on a benchmark cancer dataset, Leukemia (Golub et al., 1999).

1.2 Background of the Problem

Cancer is a deadly disease everywhere in the world. At least 100 different types of cancer have been identified. Traditionally, cancer is diagnosed based on the microscopic examination of the patient's tissue. This kind of diagnosis may fail when dealing with unusual or atypical tumors. Currently, cancer diagnosis is based on clinical evaluation, with reference to medical history and physical examination. This diagnosis takes a long time and may limit the detection of tumor cells, especially in early tumor detection (Xu and Wunsch, 2003). If tumor cells are found only at a critical stage, it might be too late to cure the patient.

Thus, classification of cancer diseases has been widely carried out for the past 30 years. Unfortunately, there has been no general or perfect approach to identifying new classes or assigning tumors to known classes. This is because cancer can arise in many different ways, and there are many types of cancer that are sometimes difficult to distinguish. When depending on the morphological appearance of tumors, it is hard to discriminate between two similar types of cancer (Golub et al., 1999).
In order to overcome the above issues, a new technique for cancer classification has been introduced. The technique employs advanced microarray technology that simultaneously measures the expression levels of a great number of genes in tissue samples. Nevertheless, this technique brings a new problem: there exist numerous irrelevant genes and overlapping genes. Hence, selection and classification must be carried out in order to pick the most significant genes from a pool of irrelevant genes and noise.

Nowadays, many selection and classification techniques have been studied and developed to help in better classification of microarray data. Among these techniques, a few give promising results, such as mRMR, ReliefF, Information Gain and Chi Square for gene selection and PNN for classification. mRMR is chosen as the primary technique for gene selection since this technique was originally proposed for gene selection (Ding and Peng, 2003). Its advantage is that it considers the redundancy of genes together with the relevance of genes. In contrast, ReliefF (Kononenko, 1994), Information Gain (Cover and Thomas, 1991) and Chi Square (Zheng et al., 2003) were first proposed for general feature selection rather than for genes. For comparison, these four techniques are used to select genes in order to measure their performance.

As for classification, the technique chosen in this research is the Probabilistic Neural Network (PNN) classifier. PNN has been used in many studies of feature classification (Pastell and Kujala, 2007; Shan et al., 2002). These studies have shown that PNN yields better classification accuracy than other existing classifiers. Thus, this research combines a few feature selection methods with the PNN classifier to classify microarray data according to its classes.

1.3 Problem Statement

The challenging issue in gene expression classification is the enormous number of genes relative to the number of training samples in a gene expression dataset. Not all genes are relevant to distinguishing between different tissue types (classes); irrelevant genes introduce noise (Liu and Iba, 2002) into the classification process and thus drown out the contribution of the relevant genes (Shen et al., 2007). On top of that, a major goal of diagnostic research is to develop diagnostic procedures based on inexpensive microarrays that have an adequate number of genes to detect diseases. Hence, it is crucial to recognize whether a small number of genes will be sufficient for gene expression classification.

1.4 Objectives of Research

The aim of this research is to select a set of meaningful genes using the minimum Redundancy-Maximum Relevance feature selection technique and to classify them using a Probabilistic Neural Network. In order to achieve this aim, the following objectives must be fulfilled:

1. To select a set of meaningful genes using Minimum Redundancy-Maximum Relevance (mRMR), Information Gain, ReliefF and Chi Square.
2. To evaluate the effectiveness of the feature selection methods using the Probabilistic Neural Network (PNN) classifier.
3. To compare the performance of mRMR as a feature selection method using the PNN, Random Forest, and Naïve Bayes classifiers.

1.5 Scope of Research

The scope of the study is stated below:

• mRMR, ReliefF, Information Gain and Chi Square are utilized for gene selection.
• The PNN technique is used for gene expression classification.
• The Leukemia microarray dataset is used for testing (Data source: Weka Software Package, http://www.cs.waikato.ac.nz/ml/weka/).
• 10-fold cross validation is utilized to perform the validation.
• The tools used are Matlab, Knime, Weka and IOS GeneLinker.

1.6 Importance of the Study

This study is carried out to aid the classification of cancer diseases. Cancer diseases are lethal to humans. Several methods have been developed to detect this deadly disease. Unfortunately, the time taken to confirm that someone has the disease is too long. This is because the symptoms can often be seen only after a very long time, by which point the cancer has reached a critical stage. Common examination of patients requires weekly checkups to precisely identify the presence of the disease. Over this long examination period, the disease might become more critical without an exact cure or treatment.

The advanced technology of microarrays lessens the burden on medical staff. Microarrays of human genes can be used to detect cancer diseases earlier.

Despite the fact that microarray technology is said to have the capability to solve these problems, it requires an excellent technique to select only the best subset of all genes, one that gives enough information about a particular cancer disease. This is due to the overwhelming number of genes produced by microarrays from only a few samples. Thus, through this research, the best approach can be found to solve the problems of gene selection and classification. The idea is to apply the minimum Redundancy-Maximum Relevance feature selection technique (compared with other feature selection techniques) together with a Probabilistic Neural Network to give good results in a short time. This research provides knowledge in the field of bioinformatics and gives benefits in the medical area. Apart from that, it helps save human lives by detecting cancer at an early stage.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Every living organism has a basic functional unit called the cell. The cell is the smallest unit in an organism that contains the hereditary information necessary to determine its own function, and it is responsible for passing that information to the next generation of cells. In humans and some other organisms, the hereditary information is contained in a genetic material called Deoxyribonucleic Acid (DNA). DNA has been widely used as an identification method in many applications and fields such as forensics, biometrics, e-commerce and security. The DNA structure is made up of two strands wound tightly around each other in a spiral, and this structure is called the double helix. Each of these strands consists of nucleotides. Each nucleotide consists of three basic components, namely a deoxyribose sugar, a phosphate group and a base. There are four different types of bases in DNA: adenine, guanine, thymine and cytosine. These bases act as the connectors of the double-helix DNA structure. A base of a nucleotide in one of the DNA strands bonds with its base pair on the other strand. This is called base pairing, and the pairs are held together by hydrogen bonds. The bases fall into two categories, purines and pyrimidines. Adenine and guanine belong to the purine group, while thymine and cytosine belong to the pyrimidine group. In base pairing, a purine can only be paired with a pyrimidine; for example, adenine can only be paired with thymine, and guanine with cytosine.
Since the bases differ for each nucleotide, they create varied sequences that uniquely represent a person (DNA, Wikipedia, 11th May 2009).

Figure 2.1 : DNA Structure

2.2 Genes and Gene Expression

Genes are specific sequences of nucleotides that uniquely describe a person's characteristics, mostly physical ones, for example skin colour, the shape of the face, the colour of the eyes and other features. Besides physical characteristics or appearance, genes can also indicate whether a person has a disease such as cancer or diabetes (Russell, 2003). Genes control all aspects of the life of an organism, encoding the products responsible for development, reproduction and so forth (Nurulhuda Ghazali, 2008).

Gene expression is the process by which the genetic information of DNA is converted into a functional protein. There are three major steps in this process: transcription, messenger Ribonucleic Acid (mRNA) processing and translation. The synthesis of mRNA from a DNA template is termed transcription. Transcription occurs in the nucleus. The process begins by breaking the hydrogen bonds in DNA, which splits the double-helix structure. Some parts of these strands act as templates and are transcribed into mRNA. The enzyme RNA polymerase attaches itself to the DNA strand and initiates transcription. Nucleotides with bases complementary to the bases on the DNA are added one at a time to elongate the strand. The RNA polymerase moves along the DNA and, when it reaches the termination triplet code, it detaches and the mRNA strand moves away from the DNA. The two DNA strands are then rejoined by hydrogen bonds.

Translation is the process in which codons in mRNA are used to assemble amino acids in the correct sequence to produce a polypeptide chain (protein). In the first stage of translation, the mRNA binds to ribosomes. Then an amino acid is activated by an enzyme, and the activation produces specific aminoacyl-tRNA molecules. Initiation of the polypeptide chain occurs when the anticodon of the aminoacyl-tRNA molecule carrying methionine binds to the start codon on the mRNA. A second aminoacyl-tRNA with a complementary anticodon binds to the second mRNA codon. A peptide bond is catalysed between the two adjacent amino acids to produce a dipeptide. This process of translation is repeated and finally forms a polypeptide chain (Russell, 2003).

2.3 Microarray Technology

A microarray is an analytical device that allows exploration of the genome quickly and inexpensively. Thousands of genes contained on a glass chip are used to examine fluorescent samples prepared by labeling mRNA from biological sources (cells, tissues). Molecules in the fluorescent sample then undergo a chemical reaction, causing each spot to glow with a different intensity based on the activity of the expressed gene. Since the pattern of gene expression is strongly related to gene function, it helps provide significant information regarding human disease, aging, drugs, mental illness and many other clinical matters.

Figure 2.2 : Process of Producing Microarray

Microarrays have been widely used in gene expression studies, which account for 81% of the scientific publications, but microarrays are also being used for other purposes, including genotyping, tissue analysis and protein studies. Microarray technology is commonly applied to human disease and drug discovery. In the detection of human disease, cancer accounts for the highest percentage of microarray publications, 83.5%, compared with other diseases.
The other diseases include AIDS, stroke, Alzheimer's, diabetes, cardiovascular disease, anemia, autism, Parkinson's and cystic fibrosis (Schena, 2003).

Figure 2.3 : Sample of Microarray

2.4 Feature Selection

There are two major stages involved in the classification of microarray data: the first stage is feature selection, also called gene selection, and the final stage is classification of the selected genes. With microarray technology, the activities of thousands of genes can be measured simultaneously, which helps in the early detection of fatal diseases (New Gene Selection Method, 14 May 2009). However, this technology yields high-dimensional data representing the genes. This high-dimensional data brings difficulties to the classification process, and because human genes are numerous and not all of them contribute to determining a disease, it is crucial to select only the relevant genes to take part in the classification process. Relevant genes, also sometimes called informative genes, are the genes that provide enough information about a disease.

Selecting relevant genes as samples for classification has been a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve high classification accuracy (Diaz and Alvarez, 2006). By selecting only the relevant genes, the dimensionality of the data can be reduced before it is processed by the classifier, shortening the execution time and improving the classification performance. The following are the benefits of gene selection (Nurulhuda Ghazali, 2008):

• Eliminates irrelevant or useless genes
• Reduces the dimension of the input space
• Reduces complexity and execution time
• Reduces cost in clinical settings
• Improves the performance of the classifier

Many methods have been studied to perform the gene selection task. The next sections describe some of these methods, which will then be compared in order to evaluate which is best at selecting genes.

2.4.1 ReliefF Algorithm

ReliefF is an extension of the standard Relief algorithm (Kononenko, 1994). The idea of this algorithm is to estimate the quality of features based on how well their values distinguish between sample points that are near to each other. It is said to be an algorithm that is sensitive to feature interactions. Given a randomly selected instance Insm from class L, ReliefF searches for K of its nearest neighbors from the same class, called nearest hits H, and also K nearest neighbors from each of the different classes, called nearest misses M. It then updates the quality estimate Wi for gene i based on their values for Insm, H and M. If instance Insm and those in H have different values for gene i, then the quality estimate Wi is decreased. On the other hand, if instance Insm and those in M have different values for gene i, then Wi is increased. The whole process is repeated n times, where n is set by the user.

Below is an example result from an experiment carried out by Zhang et al. (2003) on a handwritten Chinese character dataset. The graph compares three methods, Genetic algorithm-wrapper (G-W), ReliefF-Genetic algorithm-wrapper (R-G-W) and ReliefF, in selecting the best subset of features to be classified. The x-axis represents the type of data and the y-axis represents the number of features selected by each method. The label 'All' is the overall number of original features.
Figure 2.4 : Comparison of 3 Methods of Feature Selection (number of features selected by G-W, R-G-W and ReliefF, against the total number of original features, 'All')

From the graph, there is only a slight difference in the number of selected features between the G-W and R-G-W methods, whereas ReliefF alone differs markedly from the other two methods in the number of selected features. This indicates that the ReliefF technique on its own introduces redundancy into the selected features, even though it selects a high number of relevant features. If most of the features are relevant to the concept, it will select most of them even though only a fraction is necessary for concept description (Kohavi and John, 1997). Despite selecting many redundant features, it nevertheless produces a good accuracy percentage in classification.

2.4.2 Information Gain

Information gain is commonly used as a surrogate for approximating a conditional distribution in the classification setting (Cover and Thomas, 1991). According to Mitchell (1997), information gain measures the number of bits of information obtained for class prediction by knowing the value of a feature. The equation of information gain is:

$\mathrm{InfoGain} = H(Y) - H(Y \mid X)$  (2.1)

where X is the feature, Y is the class variable, and

$H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)$  (2.2)

$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x)$  (2.3)

2.4.3 Chi-Square

The Chi-square algorithm evaluates genes individually with respect to the classes. It is based on comparing the observed frequency of a class within each interval to the expected frequency of that class. From the N examples, let $N_{ij}$ be the number of samples of class $C_i$ within the j-th interval and $M_{Ij}$ the number of samples in the j-th interval. The expected frequency of $N_{ij}$ is $E_{ij} = M_{Ij}\,|C_i|/N$. The Chi-squared statistic of a gene is then defined as follows (Jin, X., et al., 2006):

$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{l} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}$  (2.4)

where l is the number of intervals. The larger the $\chi^2$ value, the more informative the corresponding gene is.

2.4.5 Minimum Redundancy-Maximum Relevance Feature Selection

The Minimum Redundancy-Maximum Relevance (mRMR) feature selection method was introduced by Ding and Peng (2003). It was first proposed to reduce the number of genes selected in the classification of microarray data. This method pays particular attention to the redundancy among selected genes that other gene selection methods can introduce. It expands the representative power of the feature set by requiring that features be maximally dissimilar to each other, supplemented by a maximal relevance criterion such as maximal mutual information with the target phenotypes. In mRMR, different schemes are used to search for the next feature under the mRMR optimization condition. The following table lists the schemes:

Table 2.1 : Schemes in mRMR Optimization Condition

Type | Acronym | Full Name | Formula
Discrete | MID | Mutual information difference | $\max_{i \in \Omega_S} \big[ I(h,i) - \frac{1}{|S|} \sum_{j \in S} I(i,j) \big]$  (2.5)
Discrete | MIQ | Mutual information quotient | $\max_{i \in \Omega_S} \big[ I(h,i) \,/\, \frac{1}{|S|} \sum_{j \in S} I(i,j) \big]$  (2.6)
Continuous | FCD | F-test correlation difference | $\max_{i \in \Omega_S} \big[ F(i,h) - \frac{1}{|S|} \sum_{j \in S} |c(i,j)| \big]$  (2.7)
Continuous | FCQ | F-test correlation quotient | $\max_{i \in \Omega_S} \big[ F(i,h) \,/\, \frac{1}{|S|} \sum_{j \in S} |c(i,j)| \big]$  (2.8)

(here S is the set of already-selected features, $\Omega_S$ the remaining candidates, h the target classes, I mutual information, F the F-test statistic and c the correlation)

The benefit of this approach is that the selected features represent the phenotypes better than those chosen by the usual methods, leading to better generalization properties, and a small feature set can classify genes according to their classes as accurately as a much larger one.
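To make the entropy-based criterion of Section 2.4.2 concrete, the following is a minimal Python sketch (illustrative only, not the code used in this research; the array shapes are assumed) that scores one discretized gene by Eqs. (2.1)-(2.3) and ranks all genes by that score:

    import numpy as np

    def entropy(labels):
        # H(Y) = -sum_y p(y) log2 p(y), Eq. (2.2)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def info_gain(feature, labels):
        # InfoGain = H(Y) - H(Y|X), Eqs. (2.1) and (2.3), for one discretized gene
        h_cond = 0.0
        for v in np.unique(feature):
            mask = feature == v
            h_cond += mask.mean() * entropy(labels[mask])
        return entropy(labels) - h_cond

    def top_k_genes(X, y, k):
        # Rank the columns (genes) of X by information gain and keep the k best
        scores = [info_gain(X[:, j], y) for j in range(X.shape[1])]
        return np.argsort(scores)[::-1][:k]

The same ranking skeleton applies to the chi-square score of Eq. (2.4); only the per-gene scoring function changes.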
2.5 Classification

Classification of microarray data has been studied for years by researchers from various fields, especially computer science and medicine. The purpose of this classification is to assign data to its appropriate class, for example to separate cancerous and non-cancerous samples. The classification of data is important in the early detection of a disease, more specifically cancer, as it can help in treating patients before the cancer reaches a critical stage. The major process involved in classification is training the classifier with training samples so that it learns the patterns of the genes. After the classifier has been trained, it is tested using test samples. The result yielded is the accuracy percentage of the classification. Some well-known classifiers are Random Forest, Naïve Bayes and the Probabilistic Neural Network.

2.5.1 Random Forest

Random Forest is a general term for ensemble methods using tree-type classifiers $\{h(x, \Theta_k), k = 1, \ldots\}$, where the $\Theta_k$ are independent, identically distributed random vectors and x is an input pattern (Breiman, 2001). In training, the Random Forest algorithm creates multiple CART-like trees (Breiman et al., 1984), each trained on a bootstrapped sample of the original training data, and searches only across a randomly selected subset of the input variables to determine each split. For classification, each tree in the Random Forest casts a unit vote for the most popular class at input x; the output of the classifier is determined by a majority vote of the trees. The number of variables per split is a user-defined parameter, but the algorithm is not sensitive to it; often this value is simply set to the square root of the number of inputs. By limiting the number of variables used for a split, the computational complexity of the algorithm is reduced, and the correlation between trees is also decreased. Finally, the trees in Random Forests are not pruned, further reducing the computational load. As a result, the Random Forest algorithm can handle high-dimensional data and use a large number of trees in the ensemble. Combined with the fact that the random selection of variables for a split seeks to minimize the correlation between the trees in the ensemble, this results in error rates that have been compared to those of AdaBoost (Freund and Schapire, 1996) while being computationally much lighter. As each tree uses only a portion of the input variables, a Random Forest is considerably lighter than conventional bagging with a comparable tree-type classifier. The analysis of Random Forest (Breiman, 2003) shows that its computational time is

$c\,T\,\sqrt{M}\,N \log N$  (2.9)

where c is a constant, T is the number of trees in the ensemble, M is the number of variables and N is the number of samples in the dataset. It should be noted that although Random Forests are not computationally intensive, they require a fair amount of memory, as they store an N by T matrix in memory.

2.5.2 Naïve Bayes

The Naive Bayes method originally tackles problems where variables are categorical, although it has natural extensions to other types of variables. It assumes that variables are independent within each class, and simply estimates the probability of observing a certain value in a given class by the ratio of its frequency in the class of interest over the prior frequency of that class (Jamain and Hand, 2005). That is, for any class c and vector $X = (X_j)_{j=1,\ldots,k}$ of categorical variables,

$P(X \mid c) = \prod_{j=1}^{k} P(X_j \mid c)$, with $P(X_j = x \mid c) = \dfrac{\#\{\text{instances of class } c \text{ with } X_j = x\}}{\#\{\text{instances of class } c\}}$  (2.10)
Continuous variables are generally discretised, or a certain parametric form is assumed (e.g. normal). One can also use non-parametric density estimators such as kernel functions; similar frequency ratios are then derived.

2.5.3 Probabilistic Neural Network

A probabilistic neural network (PNN) is an artificial neural network used for data classification tasks. It is a model based on competitive learning with a "winner takes all" attitude, and its core concept is based on multivariate probability estimation. The PNN was initially developed by Specht (1990). This network provides a general solution to pattern classification problems by following an approach developed in statistics, namely Bayesian classifiers. Bayesian decision theory takes into account the relative likelihood of events and uses a priori information to improve prediction. The PNN architecture consists of an input layer, two hidden layers and an output layer. The PNN classifier is sometimes taken to belong to the radial basis function (RBF) class, but the difference between PNN and RBF is that PNN works by estimating a probability density function while RBF works by iterative function approximation (Balasundaram Karthikeyan et al., 2006). PNN has been used in clinical cancer diagnosis (Shan et al., 2002), as a predictive classifier for hospital defibrillation outcomes (Yang et al., 2005) and in cereal grain classification (Visen et al., 2002). The advantage of PNNs is that they usually outperform traditional statistical classifiers in nonlinear problems (Pastell and Kujala, 2007).

Figure 2.5 : Architecture of PNN (Pastell and Kujala, 2007)

PNN has its advantages and disadvantages compared with other classifiers. A PNN trains faster than a multilayer perceptron network and is often more accurate. However, a PNN requires more memory space to store the model. Table 2.2 shows the strong performance of PNN compared with k-NN.

Table 2.2 : Comparison of k-NN and PNN using 4 Datasets

Classifier | Leukemia | Embryonal CNS Tumor | Medulloblastoma morphology | Medulloblastoma treatment outcome
PNN | 95.4% | 84% | 89.8% | 69.2%
k-NN | 94% | 79.6% | 85.2% | 66.4%
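As an illustration of this winner-takes-all, density-estimation view, the following is a minimal Python sketch of a PNN-style classifier with a Gaussian Parzen kernel per training sample and one summation unit per class. It is a sketch under assumed conventions (the smoothing parameter sigma is a free parameter), not the implementation used in this research:

    import numpy as np

    def pnn_predict(X_train, y_train, X_test, sigma=1.0):
        # Pattern layer: one Gaussian activation per training sample.
        # Summation layer: average activation per class (a Parzen density estimate).
        # Output layer: winner takes all over the class scores.
        classes = np.unique(y_train)
        preds = []
        for x in X_test:
            d2 = np.sum((X_train - x) ** 2, axis=1)
            act = np.exp(-d2 / (2.0 * sigma ** 2))
            scores = [act[y_train == c].mean() for c in classes]
            preds.append(classes[int(np.argmax(scores))])
        return np.array(preds)

Because training amounts to storing the samples, the sketch also makes the memory trade-off noted above visible: the whole training set is kept inside the model.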
2.6 Challenges in Genetic Expression Classification

Gene expression has been a common issue in the medical field, since it reflects the hereditary information of the human genome. The human genome can determine the physical appearance of a person or a specific condition of a human being, which might lead to early detection of a fatal disease. However, in the analysis of gene expression profiles, the number of tissue samples available is usually small relative to the number of genes. This can cause problems of overfitting and the curse of dimensionality, which sometimes make the analysis of microarray data fail. Besides these matters, microarray data is sometimes contaminated by noise from the devices used when conducting the microarray process, or the noise may already exist biologically in the genes themselves. These matters add to the number of irrelevant genes, and the computation for classification can be costly. According to previous research, the techniques applied to the classification of microarray data mostly take a long execution time. Time is not the only concern: the accuracy of the classification is often low due to poor performance of the classifier, which may be caused by inadequate training. Thus, it is crucial to develop a method to solve these matters. The method includes the selection of informative genes that carry enough information to indicate a disease, and the classification of the gene expression data accurately into its classes in an inexpensive time.

2.7 Summary

This chapter explains the domains of the research, which are feature selection and classification. It gives some background on the problem, including a brief explanation of gene definitions and microarray data. It shows that it is crucial to select informative genes out of the microarray data and classify them into their appropriate classes. Several methods have been discussed in this chapter based on previous studies. Examples of methods implemented for feature selection are the ReliefF Algorithm, Information Gain, Chi Square and minimum Redundancy-Maximum Relevance (mRMR), while the classification methods are Random Forest, Naïve Bayes and the Probabilistic Neural Network (PNN). Based on the descriptions of all the above methods, the feature selection methods will be compared, whereas for classification the PNN method is chosen. The choice of methods was based on previous implementations that had shown good outcomes.

CHAPTER 3

METHODOLOGY

3.1 Introduction

This chapter explains the process involved in conducting this project. The project follows the research framework shown later. The process involves many stages and requires good time management to ensure that the activities stated in the framework can be completed in the given time. Apart from the framework, this chapter also describes and clarifies the dataset used in the analysis. The software involved and other equipment are listed to give a clear view of this project.

3.2 Research Framework

There are a few phases involved in achieving the objectives of this research. Each of these phases involves different activities to complete this research systematically and successfully. Figure 3.1 illustrates the overall research framework.

Figure 3.1 : Research Framework (flowchart: Start → Problem Definition → Related Studies (problem domain; existing techniques) → Study on Proposed Method (mRMR, Information Gain, ReliefF and Chi Square for feature selection; the PNN classifier for feature classification) → beginning of experiment: Data Preparation → Feature Selection (mRMR, Information Gain, ReliefF and Chi Square select relevant features) → Classification (PNN classifies features according to their classes) → Comparison of Feature Selection Methods based on the classification results → Evaluation and Validation (10-fold cross validation) → Result Analysis → Report Writing → End)

3.2.1 Problem Definition

A research study starts with a problem that has caught people's attention and become a critical issue that needs to be solved. This research concerns cancer classification. Cancer is widely known as a fatal disease and has become one of the leading killers in the whole world. Fortunately, this disease can be cured if it is detected at an early stage. This is where microarray technology comes in. Microarray technology produces gene expression data that helps in identifying and classifying cancers. However, gene expression data is overwhelming in volume, and this makes the task difficult.
This overwhelming data can cause misclassification and is computationally very expensive. Therefore, the data needs to be filtered by picking only informative or relevant genes as input to the classifier. This greatly reduces the computation and gives better classification results. Knowing the correctly classified cancer class can help in the early treatment of patients.

3.2.2 Related Studies

Before constructing any design or analysis, it is crucial to study the full story behind the problem, the history of previous work and the process of conducting this research. The very first step in research is to understand the problem domain; from there, the next steps can be well planned to achieve better research work.

The problem domain in this research involves the area of biology, specifically the genomic field. Thus, before proceeding with the design phase, it is necessary to learn and study this genomic field and its relation to cancer classification. Microarray technology is a technology whereby all human genes can be expressed simultaneously in a very short time. These expressed genes then need to be classified into their appropriate classes. However, to classify these genes accurately requires a very powerful classifier that can handle numerous genes. As a complementary approach, the genes are first scored and selected before they are classified.

Many existing techniques have been tested and experimented with on this problem; however, not many achieved encouraging results. The need to study and learn from previous work is very important in research; that is one of the reasons a literature review has to be included in the research framework. It helps in better understanding the problem and gives ideas on how to solve it.

3.2.3 Study on Proposed Method

Based on the literature review, a comparison of the existing techniques was made and the best techniques were selected based on their results in previous studies. The chosen techniques to be compared are Minimum Redundancy Maximum Relevance (mRMR), ReliefF, Chi Square and Information Gain (to select genes), with the Probabilistic Neural Network (PNN) acting as the classifier. In order to implement these techniques, they must first be studied in depth to understand how they work in producing better results.

The mRMR, ReliefF, Chi Square and Information Gain techniques are implemented to select informative genes to be used as input to the classifier. These techniques have to be explored before moving on to the classifier. The PNN classifier is implemented right after the feature selection techniques; its input is the genes selected by the feature selection techniques. PNN is viewed from different perspectives by different researchers worldwide. Thus, before implementing this classifier, a thorough review should be done to better understand the PNN classifier and thereby obtain a promising result.

3.2.4 Data Preparation

At the beginning of the experiment, the most important step is data preparation. Data preparation is defined as the process of converting data into a machine-readable form so that it can be entered into a system via an available input device (Data Preparation, Encyclopedia.com, 6th October 2009). Originally, gene expression data is in the form of images, but it is then converted to numerical form so that it can be read by a machine (computer). The numerical form of gene expression data can be easily accessed through public databanks. For this research, the data source is retrieved from the Weka software package. This data is in the Attribute-Relation File Format (ARFF). However, such data usually contains noise and sometimes needs to be normalized or discretized.
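As an illustration of reading such a file, the following Python sketch loads an ARFF dataset with SciPy and separates the gene values from the class column; the file name leukemia.arff is hypothetical, and it assumes the class is stored as the last attribute, as in the figures below:

    from scipy.io import arff
    import pandas as pd

    # Load the ARFF file into a structured array plus metadata.
    data, meta = arff.loadarff("leukemia.arff")  # hypothetical file name
    df = pd.DataFrame(data)

    # Last column: class labels (ALL/AML), returned as bytes by loadarff.
    y = df.iloc[:, -1].str.decode("utf-8")
    # Remaining columns: continuous gene expression values.
    X = df.iloc[:, :-1].to_numpy(dtype=float)
    print(X.shape)  # expected (72, 7129) for this dataset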
The following figures show an example of the leukemia dataset in ARFF format.

Figure 3.2 : Sample of Dataset

Figure 3.3 : Sample of Dataset

Based on Figure 3.2, the first column indexes the samples, in this case 72 samples consisting of 47 ALL and 25 AML. The remaining columns, from the second up to the 7130th, contain the gene values in continuous form, and the last column, shown in Figure 3.3, holds the class of each sample, ALL or AML.

In this research the data is already numerical, but because it covers a huge range of values, it is better to discretize it to reduce the burden on the classifier. That said, discretization and normalization do not always give good results; sometimes they yield poor ones.

3.2.5 Feature Selection

After the data has been prepared, the next step is to select features by applying the mRMR, ReliefF, Information Gain and Chi Square techniques. In this stage, the preprocessed data becomes the input to these feature selection techniques. The data, which contains 7129 genes and 72 samples, is narrowed down to only a small number of genes; only the number of genes is reduced, not the number of samples. The selection depends on whether a gene carries information regarding cancer tissues or is merely noise. A subset of informative genes that is sufficient to provide this information is the output of this feature selection stage. The figure below indicates the process of feature selection.

Figure 3.4 : Process of Feature Selection (input: preprocessed gene expression data → feature selection technique selects genes → output: subset of informative genes)

3.2.6 Classification

Once the feature selection step has completed, the next step is classification. This is the key step in the classification of microarray data. The output from feature selection is used as input to the classifier; as stated before, the classifier implemented in this research is the PNN classifier. The subset of informative genes yielded by feature selection is fed into the PNN classifier and then classified into its correct classes. The output from this classification is the number of correctly classified samples and the number of incorrectly classified samples.

Figure 3.5 : Process of Classification (input: subset of informative genes → PNN classifier → output: numbers of correctly and incorrectly classified samples)

Figure 3.6 : Overall Process of Feature Selection and Classification (preprocessed gene expression data → feature selection → subset of informative genes → training and test sets → PNN model trained on the training samples → numbers of correctly classified samples, repeated until the desired output is met)

3.2.7 Evaluation and Validation

The very last step of the experiment is the evaluation and validation of results. In this research, the evaluation of results is based on a method called k-fold cross validation, where 'k' indicates the number of folds. For this research, k holds the value 10, so it is called 10-fold cross validation. In brief, the data samples are first divided into 10 parts; in each run, one tenth of the overall data becomes the test set while the remaining nine tenths become the training set. This is repeated 10 times (according to the value of k) using a different train/test split each time, and the final result is calculated by taking the average over the 10 runs. This kind of validation is used to avoid any bias from a particular data split, and the result produced is more convincing, as sketched below.
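The following minimal Python sketch illustrates this 10-fold procedure; train_and_score is an assumed placeholder for any train-then-evaluate routine (for example, training the PNN on the training fold and returning accuracy on the test fold):

    import numpy as np

    def ten_fold_cv(X, y, train_and_score, k=10, seed=0):
        # Shuffle the sample indices, split them into k folds, and average
        # the test accuracy over the k train/test rotations.
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), k)
        accs = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(train_and_score(X[train], y[train], X[test], y[test]))
        return float(np.mean(accs))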
3.2.8 Result Analysis

Finally, after all results have been obtained, they should be analyzed before being presented. Analysis of results involves producing graphs or charts, calculating averages or percentages, and justifying the possible causes of the results obtained.

3.3 Leukemia

Leukemia is a cancer characterized by a great degree of uncontrolled proliferation of one of the types of white blood cells. Leukemia causes approximately 45,000 deaths each year throughout the world. In leukemia, the bone marrow begins to produce damaged white blood cells, which do not mature properly and, unlike normal white blood cells, are able to multiply uncontrollably and rapidly displace the normal cells. In addition, the abnormal functioning of the bone marrow also reduces the number of red blood cells and platelets formed there, so sufferers become severely anemic and their blood does not clot correctly, leaving them open to the risk of hemorrhage. The following figure shows the abnormal proliferation of cells in bone marrow compared to normal bone marrow.

Figure 3.7 : Abnormal Proliferation of Cells in Bone Marrow Compared to Normal Bone Marrow

The causes of most cases of leukemia are unknown, but certain specific factors have been identified, such as excessive exposure to radiation and to certain chemicals, especially benzene. There are four types of leukemia:

1. Acute myeloid leukemia (AML)
2. Acute lymphoblastic leukemia (ALL)
3. Chronic myeloid leukemia (CML)
4. Chronic lymphocytic leukemia (CLL)

Here we chose to use the leukemia dataset, which consists of two classes, AML and ALL. AML is a life-threatening disease in which the cells that normally develop into neutrophils become cancerous and rapidly replace normal cells in the bone marrow; ALL is likewise a life-threatening disease in which the cells that normally develop into lymphocytes become cancerous and rapidly replace normal cells in the bone marrow. The data will be tested, evaluated and classified into the ALL and AML classes.

3.4 Software Requirement

The major software requirements in this research include Matlab, a numerical computing environment and programming language. Matlab is maintained by The MathWorks, and it allows easy matrix manipulation, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs in other languages. The second software requirement is KNIME, a modular data exploration platform that enables the user to visually create data flows, selectively execute some or all analysis steps, and later investigate the results through interactive views of data and models. KNIME was developed by the Chair for Bioinformatics and Information Mining at the University of Konstanz, Germany. This research is based on the Windows platform, as it is convenient for using KNIME and Matlab.
Apart from these, two other software packages, Weka and IOS GeneLinker, are used only for minor purposes.

3.5 Summary

This chapter has briefly described the research framework for the proposed method and explained the major steps in conducting this research. The framework is presented as text and a flowchart and includes activities such as the literature review, the study of the proposed method, the preparation of data, the implementation of both the feature selection techniques and the PNN classifier, the evaluation and validation of results, and lastly the report writing. A brief explanation of the dataset (leukemia) and of the software requirements was given in Sections 3.3 and 3.4 respectively.

CHAPTER 4

IMPLEMENTATION

4.1 Introduction

This chapter explains in detail the methods used to select the informative genes given as input to the classifier, and the PNN classifier that performs the classification. This chapter has four parts: data format, data preprocessing, the feature selection task and the classification task (PNN).

4.2 Data Format

The experiment uses two data formats, namely Comma-Separated Values (CSV) and the Attribute-Relation File Format (ARFF). A CSV file is commonly used to store data in tabular form, with the members separated by commas. Each line in the file corresponds to a row in the table, and within a line, fields are separated by commas, with each field belonging to one table column. The CSV file format is very simple and is supported by almost all spreadsheets and database management systems (Comma-separated_values, Wikipedia, 15 June 2009).

An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes. In the case of gene classification, the instances refer to samples while the attributes refer to genes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. ARFF files are partitioned into two sections, Header and Data. The Header contains the name of the relation and a list of attributes and their types, while the Data section contains the data values of each instance.

4.3 Data Preprocessing

As mentioned in Chapter 3 (Methodology), before continuing with the experiment, the data first needs to be preprocessed to fit the feature selection methods and the classifier. The original data is in ARFF format, as shown in Figure 4.1 and Figure 4.2:

Figure 4.1 : Original Dataset in ARFF Format Showing Genes Values

Figure 4.2 : Original Dataset in ARFF Format Showing Class Names

According to the figures above, the first column in Figure 4.1 indexes the samples, in this case 72 samples consisting of 47 ALL and 25 AML. The remaining columns, from the second up to the 7130th (not shown in the figure), contain the gene values in continuous form, and the last column in Figure 4.2 shows the class of each sample, ALL or AML. Since the data varies widely in its values, it is discretized in the hope of achieving better classification results. Two types of dataset are therefore prepared for the first stage of the experiment, in order to determine the best dataset to use in the subsequent experiments.

In statistics and machine learning, discretization refers to the process of converting continuous features or variables to discretized or nominal features. This can be useful when creating probability mass functions (formally, in density estimation). It is a form of binning, as in making a histogram (Discretization of Continuous Features, Wikipedia, 23rd Oct 2009). Typically, data is discretized into partitions of K equal lengths (equal intervals) or of K% of the total data (equal frequencies).
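As a concrete illustration of equal-frequency binning, the following minimal NumPy sketch maps every value of a data matrix into one of n_bins quantile bins labelled 0 to n_bins-1; it only approximates what IOS GeneLinker does and is not the tool used in this research:

    import numpy as np

    def quantile_discretize(X, n_bins=3):
        # Cut points at the 1/n, 2/n, ... quantiles of the pooled values
        # ('all data' as the discretization target), so each bin receives
        # roughly the same number of values.
        qs = np.quantile(X, [i / n_bins for i in range(1, n_bins)])
        return np.digitize(X, qs)  # bin labels 0 .. n_bins-1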
The following figures show the process of discretization using the IOS GeneLinker software.

Figure 4.3 : Dataset in IOS GeneLinker Software before Discretization

Figure 4.4 : Dataset in IOS GeneLinker Software after Discretization

The discretization performed is quantile discretization, in which each bin receives an equal number of data values; the data range of each bin varies according to the values it contains. For this research, the number of bins used is three, labelled 0, 1 and 2. The discretization target can be per gene, per sample, or all of the data; the example above uses all of the data as the target. In brief, to discretize the data, the parameters involved are the operation (quantile or range), the target (per gene, per sample or all data) and the number of bins.

The dataset now has two forms, continuous data (the original) and discretized data. Both are in ARFF format, and to feed them into the feature selection methods, they need to be converted into a format that those methods accept. For mRMR, the data needs to be converted to the CSV file format, as shown below. The figures below are examples of the leukemia dataset in CSV format, viewed in Microsoft Excel; the first column indicates the leukemia class (ALL/AML) while the rest of the table stores the gene values. Figures 4.5 and 4.6 show both types of data.

Figure 4.5 : Discretized Data in CSV Format (the class column holds the class labels, coded −1/+1, and columns v1, v2, … hold the discretized gene values)

Figure 4.6 : Continuous Data in CSV Format (the class column holds the class labels and columns att1, att2, … hold the raw continuous gene values)
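For illustration, a discretized matrix can be written out in this layout with a few lines of Python; the column names v1, v2, ... mirror Figure 4.5, and the file name is hypothetical:

    import pandas as pd

    def write_mrmr_csv(X_disc, y, path="leukemia_discrete.csv"):
        # Class label first, then one column per gene, as in Figure 4.5.
        cols = {"class": y}
        cols.update({f"v{j + 1}": X_disc[:, j] for j in range(X_disc.shape[1])})
        pd.DataFrame(cols).to_csv(path, index=False)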
4.4 Feature Selection Method

As briefly explained earlier, this study compares four feature selection methods, namely mRMR, ReliefF, Information Gain and Chi Square. The performance of these methods is measured using the PNN classifier.

4.4.1 mRMR Feature Selection Method

mRMR, which stands for minimum Redundancy-Maximum Relevance feature selection, was introduced by Ding and Peng in 2003. The purpose of this method is to select a feature subset that best characterizes the statistical property of a target classification variable. These features have to be mutually as dissimilar to each other as possible, but marginally as similar to the classification variable as possible (Peng, 2005). The authors of this method argue that combining one "very effective" gene with another "very effective" gene often does not form a better feature set, one reason being that the two genes could be highly correlated, which leads to redundancy in the feature set. In brief, mRMR minimizes redundancy and uses a series of intuitive measures of relevance and redundancy to select useful features for both continuous and discrete datasets.

If a gene has expressions randomly or uniformly distributed across the different classes, its mutual information with these classes is zero, whereas if a gene is strongly differentially expressed between classes, it has large mutual information. Thus mutual information is used as the measure of the relevance of genes. The mutual information I of two variables x and y is defined based on their joint probability distribution p(x, y) and the respective marginal probabilities p(x) and p(y):

I(x, y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}    (4.1)

The level of similarity between genes is likewise measured by their mutual information. The main idea of minimum redundancy is to select genes that are mutually maximally dissimilar. Let S denote the subset of features that is most relevant for classification. The minimum redundancy condition is

\min W_I, \qquad W_I = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)    (4.2)

where I(i, j) stands for I(g_i, g_j) for notational simplicity and |S| is the number of features in S. To measure the discriminative power of genes when they are differentially expressed for the different target classes, the mutual information I(h, g_i) between the target classes h = {h_1, h_2, ..., h_K} and the gene expression g_i is used; I(h, g_i) quantifies the relevance of g_i for the classification task. The maximum relevance condition is thus to maximize the total relevance of all genes in S:

\max V_I, \qquad V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i)    (4.3)

where I(h, i) stands for I(h, g_i). The minimum redundancy-maximum relevance feature set is obtained by optimizing the conditions in Eqs. (4.2) and (4.3) simultaneously, which requires combining them into a single criterion function. Here the two conditions are treated as equally important, and the two simplest combined criteria are considered:

\max (V_I - W_I)    (4.4)

\max (V_I / W_I)    (4.5)

An exact solution to the mRMR requirements requires O(N^{|S|}) searches, where N is the number of genes in the whole gene set Ω. In practice, a simple incremental (greedy) algorithm is used. The first feature is selected according to Eq. (4.3), i.e. the feature with the highest I(h, i). The remaining features are selected incrementally, with earlier selected features remaining in the feature set. Suppose a set S of m features has already been selected; additional features are chosen from the set Ω_S = Ω − S (i.e. all genes except those already selected) by optimizing the following two conditions:

\max_{g_i \in \Omega_S} I(h, g_i)    (4.6)

\min_{g_i \in \Omega_S} \frac{1}{|S|} \sum_{g_j \in S} I(g_i, g_j)    (4.7)

The condition in Eq. (4.6) is equivalent to the condition in Eq. (4.3), while Eq. (4.7) approximates the condition of Eq. (4.2). The two combinations of Eqs. (4.4) and (4.5) for relevance and redundancy lead to two selection criteria for a new feature: (1) MID, the Mutual Information Difference criterion, and (2) MIQ, the Mutual Information Quotient criterion. These optimizations can be computed efficiently in O(|S|·N) complexity.
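To make the incremental procedure concrete, the following sketch implements greedy mRMR selection under both criteria. It is an illustrative implementation rather than the authors' code: the names are ours, the gene matrix is assumed to be discretized already, and scikit-learn's mutual_info_score is used to estimate the mutual information.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def mrmr_select(X, y, n_features, scheme="MIQ"):
        # Greedy mRMR over a discretized gene matrix X (samples x genes);
        # returns the indices of the selected genes, most relevant first.
        n_genes = X.shape[1]
        # Relevance I(h, i) of every gene with the class labels (Eqs. 4.3/4.6).
        relevance = np.array([mutual_info_score(y, X[:, i]) for i in range(n_genes)])
        selected = [int(np.argmax(relevance))]  # first pick: highest relevance
        while len(selected) < n_features:
            best_gene, best_score = None, -np.inf
            for i in range(n_genes):
                if i in selected:
                    continue
                # Redundancy: mean mutual information with the selected genes (Eq. 4.7).
                redundancy = np.mean([mutual_info_score(X[:, i], X[:, j]) for j in selected])
                score = (relevance[i] - redundancy if scheme == "MID"   # Eq. (4.4)
                         else relevance[i] / (redundancy + 1e-12))      # Eq. (4.5)
                if score > best_score:
                    best_gene, best_score = i, score
            selected.append(best_gene)
        return selected

For the 7129-gene leukemia data this loop remains affordable because only up to a few hundred genes are ever selected.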
4.4.2 ReliefF Algorithm

The main idea of the ReliefF algorithm, proposed by Kononenko (1994), is to estimate the quality of attributes (retaining those whose weights exceed a threshold) from how well an attribute's values distinguish between a given instance and its nearest instances, called hits and misses. The algorithm is shown in the following figure.

Input : a vector space of training instances with attribute values and class values
Output : a weight W[A] for each attribute A

set all weights W[A] = 0.0
for i = 1 to m do
    randomly select an instance R_i
    find the k nearest hits H_j
    for each class C ≠ class(R_i) do
        find the k nearest misses M_j(C) from class C
    for A = 1 to all attributes do
        W[A] := W[A] - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, H_j)}{m \cdot k}
                + \sum_{C \neq \mathrm{class}(R_i)} \left[ \frac{P(C)}{1 - P(\mathrm{class}(R_i))} \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, M_j(C))}{m \cdot k} \right]
end

Figure 4.7 : ReliefF Algorithm (Park, H., and Kwon, H-C., 2007)

According to the algorithm, the nearest hits H_j are the k nearest neighbours from the same class, while the nearest misses M_j(C) are the k nearest neighbours from a different class C. The quality estimate W[A] of each attribute A is updated based on the attribute's values for R_i, the hits H_j and the misses M_j(C). The contribution of each miss class is weighted by its prior probability P(C) divided by 1 − P(class(R_i)), since the class of the hits does not appear in the sum over diff(A, R_i, M_j(C)).

4.4.3 Information Gain

The next feature selection method used in the experiment is Information Gain. Let {c_i}_{i=1}^{m} denote the set of classes and let V be the set of possible values of feature f. The information gain of a feature f is then defined as:

IG(f) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + \sum_{v \in V} P(f = v) \sum_{i=1}^{m} P(c_i \mid f = v) \log P(c_i \mid f = v)    (4.8)

For information gain, numeric features need to be discretized. Therefore, an entropy-based discretization method (Fayyad and Irani, 1993) is used, as implemented in the WEKA software (Witten and Frank, 1999).

4.4.4 Chi Square

Chi-Square, or the χ²-statistic, is another feature selection method used for comparison. It is defined as:

\chi^2(f) = \sum_{v \in V} \sum_{i=1}^{m} \frac{[A_i(f = v) - E_i(f = v)]^2}{E_i(f = v)}    (4.9)

where V is the set of possible values for feature f, A_i(f = v) is the number of instances in class c_i with f = v, and E_i(f = v) is the expected value of A_i(f = v), computed as:

E_i(f = v) = P(f = v)\, P(c_i)\, N    (4.10)

where N is the total number of instances. As with information gain, this method requires numeric features to be discretized. The full two-phase Chi2 algorithm, which couples this statistic with discretization, is shown in Figure 4.8. In the first phase, the algorithm starts with a high significance level (sigLevel) for all numeric attributes, and the process is iterated with a decreasing sigLevel until an inconsistency rate δ is exceeded in the discretized data. In the second phase, starting from the sigLevel0 determined in Phase 1, each attribute i is associated with its own sigLevel[i] and attributes take turns merging values until no attribute's values can be merged. If an attribute is merged to only one value at the end of Phase 2, that attribute is not relevant in representing the original dataset. Feature selection is accomplished when discretization ends.
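As a concrete illustration of Eqs. (4.9) and (4.10), the following sketch computes the chi-square score of a single discretized feature against the class labels. The function name is ours and the snippet is illustrative only; it shows the scoring statistic, not the full Chi2 discretization algorithm of Figure 4.8.

    import numpy as np

    def chi_square_score(feature, labels):
        # Chi-square statistic of one discretized feature against the classes.
        feature, labels = np.asarray(feature), np.asarray(labels)
        n = len(labels)
        score = 0.0
        for v in np.unique(feature):
            p_v = np.mean(feature == v)                            # P(f = v)
            for c in np.unique(labels):
                observed = np.sum((feature == v) & (labels == c))  # A_i(f = v)
                expected = p_v * np.mean(labels == c) * n          # E_i(f = v), Eq. (4.10)
                score += (observed - expected) ** 2 / expected     # Eq. (4.9)
        return score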
Phase 1:
    set sigLevel = 0.5;
    do while (InConsistency(data) < δ) {
        for each numeric attribute {
            Sort(attribute, data);
            chi-sq-initialization(attribute, data);
            do {
                chi-sq-calculation(attribute, data)
            } while (Merge(data))
        }
        sigLevel0 = sigLevel;
        sigLevel = decreSigLevel(sigLevel);
    }

Phase 2:
    set all sigLvl[i] = sigLevel0;
    do until no attribute can be merged {
        for each attribute i that can be merged {
            Sort(attribute, data);
            chi-sq-initialization(attribute, data);
            do {
                chi-sq-calculation(attribute, data)
            } while (Merge(data))
            if (InConsistency(data) < δ)
                sigLvl[i] = decreSigLevel(sigLvl[i]);
            else
                attribute i cannot be merged;
        }
    }

Figure 4.8 : Chi Square Algorithm (Liu, H., and Setiono, R., 1995)

4.5 PNN Classifier

The probabilistic neural network (PNN) was developed by Donald Specht. This network provides a general solution to pattern classification problems by following an approach developed in statistics, called Bayesian classification. Bayes decision theory, developed in the 1950s, takes into account the relative likelihood of events and uses a priori information to improve prediction. The probabilistic neural network uses a supervised training set to develop distribution functions within a pattern layer. These functions are used to estimate the likelihood of an input feature vector belonging to a learned category, or class. The learned patterns can also be combined, or weighted, with the a priori probability (also called the relative frequency) of each category to determine the most likely class for a given input vector. If the relative frequencies of the categories are unknown, all categories can be assumed to be equally likely, and the category is determined solely by the closeness of the input feature vector to the distribution function of a class.

The probabilistic neural network has three layers. The network contains an input layer, which has as many elements as there are separable parameters needed to describe the objects to be classified. It has a pattern layer, which organizes the training set such that each input vector is represented by an individual processing element. Finally, the network contains an output layer, called the summation layer, which has as many processing elements as there are classes to be recognized. Each element in this layer combines the responses of the pattern-layer processing elements that relate to the same class and prepares that category for output. Sometimes a fourth layer is added to normalize the input vector, if the inputs are not already normalized before they enter the network.

In the pattern layer, there is one processing element for each input vector in the training set. Normally, there are equal numbers of processing elements for each output class; otherwise, one or more classes may be skewed incorrectly and the network will generate poor results. Each processing element in the pattern layer is trained once. An element is trained to generate a high output value when an input vector matches its training vector. The training function may include a global smoothing factor to better generalize classification results. In any case, the training vectors do not have to be in any special order within the training set, since the category of a particular vector is specified by its desired output. The learning function simply selects the first untrained processing element in the correct output class and modifies its weights to match the training vector.
The pattern layer operates competitively: only the highest match to an input vector wins and generates an output, so only one classification category is generated for any given input vector. If the input does not relate well to any pattern programmed into the pattern layer, no output is generated. Parzen estimation can be added to the pattern layer to fine-tune the classification of objects. This is done by building the frequency of occurrence of each training pattern into its processing element: the probability of occurrence of each example in a class is multiplied into its respective training node. In this way, a more accurate expectation of an object is added to the features that make it recognizable as a class member.

The first layer (input layer) of the PNN accepts d-dimensional input vectors. The second layer calculates Gaussian basis functions (GBFs) as in Eq. (4.11) below:

\varphi_{m,k}(x) = \frac{1}{(2\pi)^{d/2} \sigma_{m,k}^{d}} \exp\left( -\frac{\| x - c_{m,k} \|^2}{2 \sigma_{m,k}^2} \right)    (4.11)

where \varphi_{m,k} is the GBF for the m-th cluster in the k-th class, \sigma_{m,k}^2 is the variance, c_{m,k} is the cluster centroid and d is the dimension of the input vector. The third layer of the PNN estimates the class-conditional probability density function, given by

p(x \mid k) = \sum_{m=1}^{M_k} \beta_{m,k}\, \varphi_{m,k}(x)    (4.12)

where M_k is the number of clusters for class k and \beta_{m,k} is the intra-class mixing coefficient, which satisfies

\sum_{m=1}^{M_k} \beta_{m,k} = 1    (4.13)

The flow of the PNN can be explained further by the following pseudo-code:

<1> input layer: given an unknown pattern (feature vector) x
<2> pattern layer: x_i is the i-th reference pattern vector
    for i = 1:N
        y_i = x x_i' - 0.5 (x x' + x_i x_i');   % equals -0.5 ||x - x_i||^2
        y_i = exp(y_i / h^2);                   % go through activation function
    end
<3> summation layer:
    for j = 1:n
        sum(j) = 0;
        for all i in {1, ..., N} in taxon j     % all instances in the same taxon
            sum(j) += y(j, i);
        end
    end
    for all j in [1, n]
        membership(j) = sum(j) / (sum(1) + ... + sum(n));
<4> output layer: assign pattern x to the taxon j* with the highest membership,
    such that j* = argmax{membership(j)} over all j in {1, ..., n}

Figure 4.9 : PNN Algorithm (Bi, C. et al, 2007)
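To connect Eqs. (4.11)-(4.13) with the pseudo-code above, the following sketch classifies a single sample with a basic PNN in which every training vector forms its own pattern unit and the mixing coefficients are equal (β_{m,k} = 1/M_k). It is an illustrative implementation with names of our own choosing, not the code used in these experiments; the smoothing factor sigma would be tuned in practice.

    import numpy as np

    def pnn_predict(X_train, y_train, x, sigma=1.0):
        # Classify one sample x with a probabilistic neural network.
        classes = np.unique(y_train)
        densities = []
        for c in classes:
            members = X_train[y_train == c]
            # Pattern layer: one Gaussian activation per training vector of
            # class c (Eq. 4.11). The Gaussian normalizing constant is omitted
            # because it is common to all classes when sigma is shared.
            sq_dist = np.sum((members - x) ** 2, axis=1)
            activations = np.exp(-sq_dist / (2.0 * sigma ** 2))
            # Summation layer: average the activations, i.e. beta = 1/M_k
            # (Eqs. 4.12 and 4.13).
            densities.append(activations.mean())
        # Output layer: the class with the largest estimated density wins.
        return classes[int(np.argmax(densities))]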
4.6 Experimental Settings

Generally, the experiment was divided into two phases, feature selection and classification. For both phases, the dataset tested was the leukemia dataset, which consists of 7129 genes, 72 samples and 2 classes, namely acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Table 4.1 describes the leukemia dataset. The experiment was conducted on Microsoft Windows XP Professional Edition (Service Pack 3) using an Intel Core 2 Duo processor and 2.5 gigabytes of RAM.

Table 4.1 : Leukemia dataset

Class    No. of samples    No. of genes per sample
ALL      47                7129
AML      25                7129
Total    72

The experiment starts with the preparation of the data so that it fits the feature selection techniques and the PNN classifier. The data is prepared using the IOS GeneLinker software, and the data format is converted using free online tools. The data falls into two categories, discretized and continuous. The next phase of the experiment is feature selection, followed by classification.

4.6.1 Feature Selection

The first phase of the experiment is feature selection, conducted using the four techniques mRMR, ReliefF, Information Gain and Chi Square (also described in Chapter 2). In this experiment, the techniques are run using the Matlab and Weka software. The purpose of these techniques is to select subsets of informative genes from the 7129 genes of the leukemia dataset. The selected genes then undergo the classification phase.

4.6.2 Classification

The classification phase follows the selection phase, since it must wait for the informative genes to be selected by the feature selection techniques. Once several subsets of informative genes have been obtained, classification proceeds using the PNN classifier. The classifier assigns the samples, described by each subset of genes, to their classes (AML, ALL), and the end result is the percentage of correctly classified samples. To validate this result, 10-fold cross-validation is performed so that the result is convincing. Furthermore, the results from both phases are compared against existing techniques to show the effectiveness of the proposed technique.
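For illustration, 10-fold cross-validation of the kind used here can be sketched as follows: the samples are shuffled and split into ten folds, each fold serves once as the test set while the rest are used for training, and the fold accuracies are averaged. This is a generic sketch (reusing the pnn_predict sketch from Section 4.5 purely as an example), not the exact evaluation script of this study.

    import numpy as np

    def cross_validate(X, y, predict_one, n_folds=10, seed=0):
        # Estimate accuracy by k-fold cross-validation. predict_one(X_tr, y_tr, x)
        # returns the predicted class of one sample x, e.g. the pnn_predict sketch.
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(y))
        folds = np.array_split(indices, n_folds)
        accuracies = []
        for fold in folds:
            train = np.setdiff1d(indices, fold)
            correct = sum(predict_one(X[train], y[train], X[i]) == y[i] for i in fold)
            accuracies.append(correct / len(fold))
        return float(np.mean(accuracies))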
4.7 Summary

This chapter discussed the techniques used in detail and how the experiment is carried out throughout this research. The first step of the experiment is the preparation of the data before it is fed into the feature selection methods. After that, feature selection is performed using mRMR, ReliefF, Information Gain and Chi Square. The selected features are then classified using the PNN classifier to measure the performance of each subset. Lastly, to evaluate the performance of the classifier, 10-fold cross-validation is performed to further analyse the results.

CHAPTER 5

EXPERIMENTAL RESULT ANALYSIS

5.1 Overview

This chapter discusses the results of the experiments conducted in this research, together with further analysis to verify whether the techniques are suitable for classification purposes and hence improve the classification result, or otherwise. The sections in this chapter cover the experimental settings, the dataset tested and the analysis of the results obtained.

5.2 Analysis of Results

The following results are produced using the mRMR, ReliefF, Information Gain and Chi Square techniques for gene selection and the PNN classifier for the classification of genes. Apart from these techniques, a few other classifiers are tested in this experiment in order to compare and validate the results obtained with the proposed technique.

For the selection of genes, two types of data are experimented with in order to investigate which type produces higher classification accuracy. As mentioned earlier, these are the discrete data and the continuous data. Figure 5.1 shows the classification percentage for both data types using the mRMR technique and the PNN classifier; mRMR selects 100 genes and the data is then split into 70% for training and 30% for testing.

Figure 5.1 : Classification using PNN for Different Types of Data (continuous data: 91%; discrete data: 73%)

According to the figure, the continuous data produces higher classification accuracy (91%) than the discrete data (73%). The lower accuracy of the discrete data is due to the altered attribute values in the dataset. Since classification learns from the patterns and uniqueness of the attributes, it produces more accurate results when the data are very distinct among classes. As seen in the earlier figures, the discrete data are barely distinct between the two classes, which produces low classification accuracy. For the continuous data, the values in each class are clearly distinct from one another, which helps in recognizing the pattern of each class and hence yields better classification accuracy.

The next experiment investigates the difference between the two mRMR schemes for selecting features, Mutual Information Difference (MID) and Mutual Information Quotient (MIQ); these schemes were explained in Chapter 4. Figure 5.2 displays the classification accuracy obtained with both schemes.

Figure 5.2 : Classification Accuracy using PNN for Different Scheme in Feature Selection using mRMR (100 features: MIQ 91%, MID 86%; 200 features: both 81%)

Based on the graph, the MIQ scheme performs better than the MID scheme with 100 selected features, but with 200 selected features both schemes produce the same accuracy (81%). According to previous work (Ding and Peng, 2005), the MIQ scheme is much better than the MID scheme, and the graph supports this: in the MIQ scheme, the calculation eliminates most of the redundant genes, whereas the redundant genes retained by the MID scheme contribute to misclassification. The lower result for 200 selected features may be due to redundancy among the features, which leads to misinterpretation in classification. It can be said that 100 genes are enough to classify the data.

Figure 5.3 : Classification using PNN by Different Number of Selected Features (10 features: 81%; 50: 91%; 100: 91%; 200: 81%)

Figure 5.3 shows the classification using PNN for different numbers of selected features. The numbers of selected features tested to evaluate the mRMR performance are 10, 50, 100 and 200. According to the results achieved, the classification accuracy produced by 50 and 100 selected genes is 91%, whereas for 10 and 200 selected genes the accuracy is somewhat lower (81%). The low accuracy produced by 10 genes is due to the lack of informative genes giving information about a class: so little information gives the classifier very small training data, not enough to produce a good model, and this leads to misclassification of the test set. For 200 selected genes, as explained before, the poor performance is caused by redundancy among the features. Suppose that, of the 200 genes, only 100 give information about the class while the other 100 consist of noise and irrelevant genes; these genes are then fed into the classifier for training and yield a poor model because of their irrelevant information.
For 50 and 100 selected genes, the information given is sufficient for the classifier to produce a very good model: these subsets do not contain redundant genes and give a strong classification performance.

In feature selection, a few other techniques have been used to select informative genes, namely Chi Square, ReliefF and Information Gain. To ensure that the mRMR technique gives promising results, these techniques are compared to see which gives better classification accuracy. Figure 5.4 displays the comparison among the feature selection techniques, while Figure 5.5 shows the comparison of three classifiers, namely PNN, Naïve Bayes and Random Forest, in classifying the genes selected by mRMR.

Figure 5.4 : Comparison of Classification Accuracy by Different Feature Selection Method using PNN (mRMR: 91%; Information Gain: 86%; ReliefF: 81%; Chi Square: 81%)

Based on the graph, it is very clear that mRMR achieves the best performance among the techniques. mRMR produces 91% classification accuracy, compared to the lower accuracies of the other feature selection techniques: 86% for Information Gain and 81% for ReliefF and Chi Square, the lowest in the comparison. The key to the success of the mRMR technique is that it considers the redundancy of the genes rather than their relevance only (Peng et al, 2005), and it has been shown to give very good results in previous work. By eliminating redundant genes, mRMR performs well, selecting only very informative genes that strongly contribute to determining the class. ReliefF, Information Gain and Chi Square, by contrast, compute only the relevance of the features/genes and ignore the existence of redundancy in the feature subsets, which leads to misclassification due to irrelevant features.

Figure 5.5 : Comparison of Classification Accuracy using Different Classifier (PNN: 96.4%; Random Forest: 96.1%; Naïve Bayes: 96.1%)

Figure 5.5 shows the classification accuracy obtained with different classifiers, in order to compare which classifier gives the best accuracy. PNN produces the highest accuracy (96.4%), compared to the other two classifiers, which produce only 96.1%. Even though the difference is slight, PNN still gives the best performance. PNN applies Bayes decision theory to pattern classification; the basic idea is to use the relative likelihood of events together with prior information. PNN thereby reduces the risk of classifying data into the wrong class, which gives it better performance than the other classifiers.
Figure 5.6 : Classification Accuracy using 10-fold Cross Validation (per-fold accuracy of Naïve Bayes, Random Forest and PNN)

Referring to Figure 5.6, the classification result is 100% for all three classifiers most of the time. However, this is only seen in the early folds; towards the final folds (the end of the dataset), the classification result drops, especially for PNN. Based on observation, the end of the dataset contains more AML samples, from which we can conclude that the AML class is more difficult to distinguish in an AML/ALL dataset than the ALL class. This may be because the number of AML samples is not enough to provide information about their pattern: as stated earlier, ALL consists of 47 samples whereas AML consists of only 25, which makes it difficult to retrieve much valuable information from the AML samples. That is why the folds involving AML samples give poorer performance than those involving ALL.

5.3 Discussion

Based on the overall results achieved, it is clear that mRMR and the PNN classifier give better results than the other techniques tested. The mRMR technique focuses on the redundancy of genes while at the same time taking maximum relevance into account, unlike the other feature selection techniques (Information Gain, ReliefF and Chi Square), which focus only on the maximum relevance of the features and ignore the redundancy problem in the feature subsets. mRMR addresses both problems, and that is the major reason it produces better results. As for the classifier, PNN achieves a slightly better result than the other classifiers. This is due to the nature of the PNN architecture, which tries to minimize the expected risk of classifying features into the wrong class. Furthermore, PNN is less sensitive to noise in the data and produces outputs with Bayes posterior probabilities.

5.4 Summary

The results obtained from the experiments show that the mRMR technique selects very useful genes and reduces redundancy, while PNN acts as a strong classifier that gives better results than the other classifiers tested. The selection of genes is important, since huge input data affects the efficiency of classifiers. Furthermore, by selecting only the relevant genes, biologists do not have to waste time investigating the wrong genes; they can rely on the selected genes to carry on their research. Thus, it can be said that the combination of the mRMR technique and the PNN classifier gives excellent results in the classification of microarray data.

CHAPTER 6

DISCUSSION AND CONCLUSION

6.1 Overview

This chapter presents the general conclusions of this research, its problems and limitations, and suggestions for future work. The problems that arose during this research are analysed so that they can be overcome, and the conclusion discusses the overall results of the experiments comparing feature selection techniques for selecting informative genes and using PNN to classify microarray data.

6.2 Research Contribution

This research contributes some knowledge to the area of classification of microarray data. The contributions of this research are as follows:
1. A set of meaningful and informative genes has been selected using mRMR, Information Gain, ReliefF and Chi Square as input to the classifier.

2. The effectiveness of the four techniques has been evaluated using the PNN classifier; the comparison shows that, of the four techniques, mRMR performs much better than the others. Thus, it can be concluded that mRMR is the best gene selection technique for the classification of microarray data.

3. The performance of mRMR as a gene selection technique has been evaluated further using different classifiers, namely PNN, Random Forest and Naïve Bayes, in order to compare which classifier gives the best classification accuracy. Validation using 10-fold cross-validation has also been carried out so that the results are not biased. The results show that the PNN classifier gives the highest accuracy; from this, it can be said that PNN is a very effective classifier for microarray data.

4. The reduction of genes and accurate prediction from microarray data are crucial because they can support the detection of cancer in its early stages and thereby save human lives. This research has helped reduce the number of genes and gives biologists insight into the informative genes. Since the reduction and classification of genes are carried out using computer science techniques, this research contributes to both the computing and medical fields.

6.3 Problems and Limitations of Research

During this research, some problems and limitations arose while implementing the feature selection techniques and the PNN technique. One limitation is the use of a single dataset: this research only uses the leukemia dataset. Hence the results do not generalize, and there is no comparison of results across other datasets, which might involve different diseases, dataset sizes and numbers of classes.

6.4 Suggestions for Better Research

Since the problems stem from the limited dataset, it is important to use datasets that differ in size, disease and classes. This would give better results, and comparison between the datasets would show the most suitable dataset for the feature selection techniques and the PNN technique, further validating the performance of the designed technique. Examples of other datasets are colon tumor, lung cancer, breast cancer and prostate cancer.

In this study, the data involved is leukemia, a fatal disease whose incidence increases every year compared to other diseases. Early detection of leukemia helps in early treatment and thus in curing the disease. There are various ways to detect, cure and treat leukemia nowadays, and a great deal of research has been done in this area in order to reduce the number of deaths caused by leukemia.

Microarray technology allows the expression of thousands of genes to be determined simultaneously. This advantage has led to applications in the medical area, namely the management and classification of cancer and infectious diseases. However, microarray technology also suffers several drawbacks: the high dimensionality of the data and the presence of irrelevant information make it hard to classify diseases accurately.
Nevertheless, there are various feature selection techniques that can be used to solve these problems and improve the accuracy of classification. Using feature selection, only an appropriate subset of genes is selected from the microarray. The goal of classification on the leukemia data is to distinguish between ALL and AML. Many classifiers have been studied for classifying microarray data, but not all of them show high accuracy. This research concludes that mRMR and PNN serve as the best combination for gene selection and classification of microarray data. However, these techniques should be explored more deeply, and further investigation should be conducted to overcome the problems and limitations of this research and thus obtain better results. This knowledge can be spread throughout the world for future generations to learn from and to improve any flaws in this method or research.

REFERENCES

Ahlers, F.J., Carlo, W.D., Fleiner, C., Godwin, L., Mick, Nath, R.D., Neumaier, A., Phillips, J.R., Price, K., Storn, R., Turney, P., Wang, F., Zandt, J.V., Geldon, H., Gauden, P.A. Differential Evolution. (accessed May 20, 2009). http://www.icsi.berkeley.edu/~storn/code.html

Alon, U., Barkai, N., Notterman, D.A., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS. Vol 96: 6745-6750.

Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA Microarray and Protein Array Data. New Jersey, USA: Wiley Inter-Science. 8-10.

Babu, B.V. and Chaturvedi, G. Evolutionary Computation Strategy for Optimization of an Alkylation Reaction. Birla Institute of Technology and Science.

Babu, B.V. and Sastry, K.N.N. (1999). Estimation of Heat Transfer Parameters in a Trickle-bed Reactor using Differential Evolution and Orthogonal Collocation. Elsevier Science.

Balasundaram Karthikeyan, Srinivasan Gopal, Srinivasan Venkatesh and Subramanian Saravanan. (2006). PNN and its Adaptive Version – An Ingenious Approach to PD Pattern Classification Compared with BPA Network. Journal of Electrical Engineering. Vol 57: 138-145.

Bi, C., Saunders, M.C. and McPheron, B.A. (2007). Wing Pattern-Based Classification of the Rhagoletis pomonella Species Complex Using Genetic Neural Networks. International Journal of Computer Science & Applications. Vol 4: 1-14.

Breiman, L. (2001). Random Forests. Machine Learning. Vol 45: 5-32.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Belmont.

Breiman, L. (2003). RF/tools – A class of two eyed algorithms. In: SIAM Workshop. http://oz.berkeley.edu/users/breiman/siamtalk2003.pdf

Campbell, N.A. and Reece, J.B. (2002). Biology. Sixth edition. San Francisco: Benjamin Cummings.

Comma-separated Values. Wikipedia. (accessed June 15, 2009). http://en.wikipedia.org/wiki/Comma-separated_values

Cover, T. and Thomas, J. (1991). Elements of Information Theory. New York: John Wiley and Sons.

Data Preparation. Encyclopedia.com. (accessed October 6, 2009). http://www.encyclopedia.com/doc/1O11-datapreparation.html

Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene Selection and Classification of Microarray Data using Random Forest. BMC Bioinformatics. Vol 7: 3.

Ding, C. and Peng, H. (2003). Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Proceedings of the Computational Systems Bioinformatics.

Ding, C. and Peng, H. (2005). Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Journal of Bioinformatics and Computational Biology. Vol 3: 185-205.
Discretization of Continuous Features. Wikipedia. (accessed Oct 23, 2009). http://en.wikipedia.org/wiki/Discretization_of_continuous_features

DNA. Wikipedia. (accessed May 11, 2009). http://en.wikipedia.org/wiki/DNA

Dudoit, S. and Gentleman, R. (2003). Classification in Microarray Experiment.

Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference. 148-156.

Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. Vol 286: 531-537.

Huerta, E.B., Duval, B. and Hao, J.K. A hybrid GA/SVM approach for gene selection and classification of microarray data.

Jamain, A. and Hand, D.J. (2005). The Naïve Bayes Mystery: A Classification Detective Story. Pattern Recognition Letters. Vol 26: 1752-1760.

Jin, X., Xu, A., Bie, R. and Guo, P. (2006). Machine Learning and Chi-Square Feature Selection for Cancer Classification Using SAGE Gene Expression Profiles. In Li, J. et al. (Eds.) Data Mining for Biomedical Applications. 106-115. Berlin Heidelberg: Springer-Verlag.

Kim, Y.B. and Gao, J. (2006). A New Hybrid Approach for Unsupervised Gene Selection. IEEE Explorer.

Kohavi, R. and John, G.H. (1997). Wrappers for Feature Subset Selection.

Kononenko, I. (1994). Estimating Attributes: Analysis and Extensions of Relief. Proceedings of the European Conference on Machine Learning. Springer-Verlag New York. 171-182.

K-nearest_neighbor_algorithm. Wikipedia. (accessed May 21, 2009). http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

Lakshminarasimman, L. and Subramanian, S. (2008). Applications of Differential Evolution in Power System Optimization. Advances in Differential Evolution. Vol 143: 257-273.

Langdon, W.B. (2005). Evolving Benchmarks. The 17th Belgian-Dutch Conference on Artificial Intelligence. 365-367.

Liu, J. and Iba, H. (2002). Selecting Informative Genes Using a Multiobjective Evolutionary Algorithm. Proceedings of the 2002 Congress. 12-17 May. 297-302.

Mendes, S.P., Pulido, J.A.G., Rodriguez, M.A.V., Simon, M.D.J. and Perez, J.M.S. (2006). A Differential Evolution Based Algorithm to Optimize the Radio Network Design Problem.

Mishra, S.K. (2006). Global Optimization by Differential Evolution and Particle Swarm Methods: Evaluation on Some Benchmark Functions. MPRA Paper 1005. 7 November 2007.

Mitchell, T.M. (1997). Machine Learning. New York: McGraw-Hill.

Muhammad Faiz bin Misman (2007). Development of a Parallel Program Using the Message Passing Interface (MPI) for a Combined Genetic Algorithm and Support Vector Machine Technique. Universiti Teknologi Malaysia: Bachelor's Thesis.

New Gene Selection Method. The Medical News, July 8, 2004. (accessed May 14, 2009). http://www.news-medical.net/news/2004/07/08/3157.aspx

Nur Safawati binti Mahshos (2008). Aircraft Image Recognition Using Radial Basis Function and Backpropagation Neural Network Techniques. Universiti Teknologi Malaysia: Bachelor's Thesis.

Nurulhuda binti Ghazali (2008). A Hybrid of Particle Swarm Optimization and Support Vector Machine Approach for Genes Selection and Classification of Microarray Data. Universiti Teknologi Malaysia: Bachelor's Thesis.

Park, H. and Kwon, H-C. (2007). Extended Relief Algorithms in Instance-based Feature Filtering. Sixth International Conference on Advanced Language Processing and Web Information Technology. 123-128.
Pastell, M.E. and Kujala, M. (2007). A Probabilistic Neural Network Model for Lameness Detection. American Dairy Science Association. Vol 90: 2283-2292.

Paul, T.K. and Iba, H. (2005). Gene selection for classification of cancers using probabilistic model building genetic algorithm. BioSystems. Vol 82: 208-225.

Peng, H., Long, F. and Ding, C. (2005). Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol 27: 1226-1238.

Peng, H. (2005). mRMR (minimum Redundancy Maximum Relevance Feature Selection). (accessed June 1, 2009). http://penglab.janelia.org/proj/mRMR/index.htm

Principal Component Analysis. Wikipedia. (accessed May 14, 2009). http://en.wikipedia.org/wiki/Principal_components_analysis

Russell, P.J. (2003). Essential iGenetics. San Francisco: Benjamin Cummings. 226-265.

Savitch, W. (2006). Problem Solving with C++. Sixth edition. USA: Pearson International Edition.

Shan, Y., Zhao, R., Xu, G., Liebich, H.M. and Zhang, Y. (2002). Application of Probabilistic Neural Network in the Clinical Diagnosis of Cancers based on Clinical Chemistry Data. Analytica Chimica Acta. 77-86.

Shen, Q., Shi, W.-M., Kong, W. and Ye, B.-X. (2007). A Combination of Modified Particle Swarm Optimization Algorithm and Support Vector Machine for Gene Selection and Tumor Classification. Talanta. Vol 71: 1679-1683.

Shena, M. (2003). Microarray Analysis. New Jersey: John Wiley & Sons, Inc.

Specht, D.F. (1990). Probabilistic neural networks. Neural Networks. Vol 3: 110-118.

Storn, R. and Price, K. (1997). Differential Evolution – A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces. Journal of Global Optimization.

Suzila binti Sabil (2007). Application of Principal Component Analysis and Perceptron Neural Networks for Classifying Colon Cancer Data. Universiti Teknologi Malaysia: Bachelor's Thesis.

Vapnik, V.N. (1995). The Nature of Statistical Learning Theory. Springer, New York.

Vasan, A. and Raju, K.S. Optimal Reservoir Operation Using Differential Evolution. Birla Institute of Technology and Science.

Visen, N.S., Paliwal, J., Jayas, D.S. and White, N.D.G. (2002). Specialist Neural Networks for Cereal Grain Classification. Biosystems Engineering. Vol 82: 151-159.

Xu, R. and Wunsch, D.C. (2003). Probabilistic Neural Networks for Multi-class Tissue Discrimination with Gene Expression Data. Proceedings of the International Joint Conference on Neural Networks. Vol 3: 1696-1701.

Xue, F. (2004). Multi-objective Differential Evolution: Theory and Applications. Rensselaer Polytechnic Institute: Doctor of Philosophy.

Yang, Z., Yang, Z., Lu, W., Harrison, R.G., Eftestøl, T. and Steene, P.A. (2005). A Probabilistic Neural Network as the Predictive Classifier of Out-of-hospital Defibrillation Outcomes. Resuscitation. Vol 64: 31-36.

Yousefi, H., Handroos, H. and Soleymani, A. (2008). Application of Differential Evolution in System Identification of a Servo-hydraulic System with a Flexible Load. Elsevier. Vol 18: 513-528.

Yuan, S.-F. and Chu, F.-L. (2007). Fault Diagnostics based on Particle Swarm Optimization and Support Vector Machines. Mechanical Systems and Signal Processing. Vol 21: 1787-1798.

Zhang, L.-X., Wang, J.-X., Zhao, Y.-N. and Yang, Z.-H. (2003). A Novel Hybrid Feature Selection Algorithm: Using ReliefF Estimation for GA-Wrapper Search. Proceedings of the Second International Conference on Machine Learning and Cybernetics. Xi'an. 380-384.
Zheng, Z., Srihari, R. and Srihari, S. (2003). A Feature Selection Framework for Text Filtering. Proceedings of the Third IEEE International Conference on Data Mining.

APPENDIX A

PROJECT 1 GANTT CHART

[Gantt chart of Project 1 tasks and milestones, May-July 2009]

APPENDIX B

PROJECT 2 GANTT CHART

[Gantt chart of Project 2 tasks and milestones, July-November 2009]