Selecting Informative Genes with Parallel Genetic Algorithms in Tissue Classification

Paper by: Juan Liu, Hitoshi Iba and Mitsuru Ishizuka
Online link: hc.ims.u-tokyo.ac.jp/JSBi/journal/GIW01/GIW01F02.pdf
Comments by: Prashant Jain

1. Introduction

This paper describes the use of parallel genetic algorithms to select informative genes. The selected genes are then combined with the classification method proposed by Golub and Slonim to classify data sets containing tissues of different classes. Before going into the details of the paper, we need a few basics about genes, gene expression and informative genes.

Gene expression is the process of transcribing a gene's DNA sequence into RNA, which in turn serves as the template for protein production. A gene's expression level indicates the number of copies of that gene's RNA produced in the cell.

Another related and important concept is that of DNA arrays. DNA array technologies let us look at the expression patterns of thousands of genes at the same time, which gives us an overall picture of gene expression. They can be used to cluster genes into groups and also to classify tissues on the basis of gene expression patterns. Using DNA array data is not easy, however: the number of genes is huge in comparison to the number of training samples, which makes classification difficult. Another major problem is the noise introduced by genes that are not relevant for distinguishing between tissue types. The methods proposed in this paper perform tissue classification based only on gene expression levels and try to find a small subset of genes.

There are two types of gene subset selection algorithms:
- Filter approach: gene selection is done independently of the classification process.
- Wrapper approach: candidate subsets are evaluated by running classification algorithms on the data sets.

2. Parallel Genetic Algorithm Based Gene Selection Method

The paper defines a genetic algorithm as "a global optimization procedure that uses an analogy of genetic evolution of biological organisms". The basic genetic algorithm described in the paper is as follows:

1. Start
2. Generate initial population
3. Evaluate the population
4. If termination condition satisfied goto 9, else goto 5
5. Select the fittest individuals
6. Apply genetic operators
7. Generate new population
8. Goto 3
9. End

The problem with this, as the authors point out, is that when the population size is large the computation cost is immense. Hence they use a parallel genetic algorithm (PGA):

1. Start
2. Generate initial population
3. Evaluate the population
4. If termination condition satisfied goto 11, else goto 5
5. Select the fittest individuals
6. Apply genetic operators
7. Send migrants if necessary
8. Receive migrants if necessary
9. Generate new population
10. Goto 3
11. End

The PGA divides the entire population into subpopulations and runs a normal GA on each subpopulation, but it also incorporates migration of the fittest individuals among processors. This is the island model, also known as the distributed GA.

Gene subsets are represented as fixed-length binary strings, in which the value 1 means that the gene is included in the subset and 0 means that it is not. The underlying idea behind using a parallel genetic algorithm is to search the entire search space (all possible subsets of the whole gene set) for gene subsets with fewer elements. Both classification accuracy and subset size are used in the fitness evaluation. Accuracy is measured by invoking a classifier that runs on the found gene subset and some training data (which contains only the expression levels of the genes in the subset).
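The selection scheme just described (binary-string subsets, a fitness that rewards both accuracy and small subset size, and islands that exchange their fittest individuals) can be sketched in Python. This is a toy illustration, not the authors' implementation: the data are synthetic, the parameters are scaled down so it runs in a few seconds, and a simple leave-one-out nearest-centroid classifier stands in for the Golub-Slonim classifier.

```python
# Toy sketch of the island-model gene-selection GA described above.
import random

random.seed(0)

N_GENES = 20         # total number of genes (real data sets have thousands)
ISLANDS = 2          # one subpopulation per processor
POP_SIZE = 16
GENERATIONS = 15
P_CROSSOVER = 0.6
P_MUTATION = 0.01
P_MIGRATION = 0.1
W1, W2 = 0.75, 0.25  # weights on accuracy and on low dimensionality

# Synthetic two-class expression data: only genes 0 and 1 are informative.
def make_sample(label):
    x = [random.gauss(0.0, 1.0) for _ in range(N_GENES)]
    x[0] += 3.0 * label
    x[1] -= 3.0 * label
    return x, label

DATA = [make_sample(label) for label in (0, 1) for _ in range(10)]

def accuracy(subset):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted
    to the genes whose bit is 1 in `subset`."""
    genes = [g for g, bit in enumerate(subset) if bit]
    if not genes:
        return 0.0
    correct = 0
    for k, (x, y) in enumerate(DATA):
        train = [d for j, d in enumerate(DATA) if j != k]
        centroids = {}
        for label in (0, 1):
            pts = [p for p, lbl in train if lbl == label]
            centroids[label] = [sum(p[g] for p in pts) / len(pts) for g in genes]
        pred = min(centroids, key=lambda lbl: sum(
            (x[g] - c) ** 2 for g, c in zip(genes, centroids[lbl])))
        correct += pred == y
    return correct / len(DATA)

def fitness(subset):
    # High accuracy and a small subset both raise fitness.
    return W1 * accuracy(subset) + W2 * (1.0 - sum(subset) / N_GENES)

def evolve(pop):
    """One generation of an elitist GA on a single island."""
    ranked = sorted(pop, key=fitness, reverse=True)
    new_pop = [ind[:] for ind in ranked[:2]]             # elitism: keep the best two
    while len(new_pop) < POP_SIZE:
        a, b = random.sample(ranked[:POP_SIZE // 2], 2)  # pick parents from the fitter half
        if random.random() < P_CROSSOVER:                # one-point crossover
            cut = random.randrange(1, N_GENES)
            child = a[:cut] + b[cut:]
        else:
            child = a[:]
        child = [bit ^ (random.random() < P_MUTATION) for bit in child]  # rare bit flips
        new_pop.append(child)
    return new_pop

islands = [[[random.randint(0, 1) for _ in range(N_GENES)]
            for _ in range(POP_SIZE)] for _ in range(ISLANDS)]
for gen in range(GENERATIONS):
    islands = [evolve(pop) for pop in islands]
    if random.random() < P_MIGRATION:    # occasionally migrate the fittest individual
        for i, pop in enumerate(islands):
            neighbour = islands[(i + 1) % ISLANDS]
            neighbour[-1] = max(pop, key=fitness)[:]

best = max((ind for pop in islands for ind in pop), key=fitness)
print("genes selected:", sum(best), "accuracy:", round(accuracy(best), 2))
```

The elitist selection and the migration of a single fittest individual mirror the strategy choices reported in the paper's experiments, though the population size and the crossover, mutation and migration probabilities here are illustrative values for a tiny problem, not the paper's settings.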
The fitness function is as follows:

    fitness(x) = w1 * accuracy(x) + w2 * (1 - dimensionality(x))

where
    accuracy(x) = test accuracy of the classifier built with the gene subset represented by x
    dimensionality(x) ∈ [0, 1] = the normalized dimension of the subset
    w1 = weight assigned to accuracy
    w2 = weight assigned to dimensionality

As we can see from the fitness function, a subset with low dimensionality and high accuracy will have a higher fitness value than any other.

3. Experiments and Results

The experiments were run on two data sets (leukemia and colon tissue). The tests were run on two processors, with the following GA parameters on each processor:

Population size: 1000
Trials: 400000
Crossover probability: 0.6 (kept relatively high to make sure the generated subpopulation differs from its parents, but low enough not to destroy good traits, i.e. fit individuals, of the parents)
Mutation probability: 0.001 (kept low so that the randomness of the selection is reduced)
Selection strategy: elitist (a good strategy here, because the best gene subsets are kept and passed on from each generation to the next, leading to more accurate gene subsets being found)
Migration probability: 0.002 (ensures sufficient exchange of individuals between the two processors, so that there is diversity in the gene subsets formed)

To test the gene selection method they use a classifier proposed by Golub and Slonim, which assigns a positive value to a sample belonging to class 1 and a negative value to a sample belonging to class 2. They take classification accuracy to be more important than subset size and hence use w1 = 0.75 and w2 = 0.25 in their experiments. The results are given in the following table:

For the leukemia data set, a subset of 29 genes was found that correctly predicted 36 of the 38 training samples and 30 of the 34 test samples.
However, that classifier contained three meaningless genes; dropping them resulted in 91% accuracy on the test data set. For the colon data set, a subset of 30 genes was found that gave 92% accuracy on the training data set. These results are better than those of many other algorithms, such as the G-S and NB algorithms, which have accuracies of less than 90% with gene numbers varying from 10 to 500.

The average performance of the PGA is shown in the graphs below. For both the leukemia data set and the colon data set, the average fitness hovers around the 0.9 level, which is very good.

4. Conclusion

The results of the experiments show that the PGA does in fact provide a smaller gene subset that gives better accuracy on the test and training data sets. However, as the authors themselves point out, the final subset found by the PGA might not be the best one, due to the population or to the fitness function being used, which is very simple. Hence they suggest using a more sophisticated formula for fitness evaluation, which could lead to a more effective search and a better subset being found. The problem could also be extended from the current two-class setting to multiclass tissue classification, since the classifier proposed by Golub and Slonim handles only two classes.

5. References

Juan Liu, Hitoshi Iba and Mitsuru Ishizuka, Selecting Informative Genes with Parallel Genetic Algorithms in Tissue Classification, Proceedings of the Genome Informatics Workshop, pp. 14-23, Yebiso, Tokyo, Japan, 2001.
Yang J. and Honavar V., Feature Subset Selection Using a Genetic Algorithm, 1997.