Selecting Informative Genes with Parallel Genetic
Algorithms in Tissue Classification
Paper By: Juan Liu, Hitoshi Iba and Mitsuru Ishizuka
Online link: hc.ims.u-tokyo.ac.jp/JSBi/journal/GIW01/GIW01F02.pdf
Comments By: Prashant Jain
1. Introduction
This paper describes the use of Parallel Genetic Algorithms to select
informative genes. The authors then combine this selection with a
classification method given by Golub and Slonim to classify data sets
containing tissues of different classes.
Before we go into the details of the paper, we need to know a few
basics about genes, gene expression, informative genes, etc.
Gene expression is the process of transcribing a gene's DNA sequence
into RNA, the intermediate that serves as the template for protein
production. A gene's expression level basically indicates the number of
copies of the gene's RNA that have been produced in the cell.
Another related and important concept is that of DNA arrays. DNA
array technologies let us look at the expression patterns of
thousands of genes at the same time. Their great advantage is that
they give us an overall picture of gene expression. They can be
used to cluster genes into groups and also to classify tissues on the
basis of gene expression patterns.
But using DNA array data is not easy. The number of genes is huge
in comparison to the number of training samples, which makes
classification difficult. Another major problem is the noise
introduced by genes that are not relevant for distinguishing between
different tissue types.
The methods proposed in this paper aim to do tissue classification
based on gene expression levels alone, while finding as small a subset
of genes as possible.
There are two types of gene subset selection algorithms:
Filter approach: gene selection is done independently of the
classification process.
Wrapper approach: candidate subsets are evaluated by running
classification algorithms on the data sets.
2. Parallel Genetic Algorithm Based Gene Selection Method
The paper defines a genetic algorithm as "a global optimization
procedure that uses an analogy of genetic evolution of biological
organisms". The basic genetic algorithm described in the paper is as
follows (a small runnable sketch appears after the list):
1. Start
2. Generate initial population
3. Evaluate the population
4. If termination condition satisfied goto 9, else goto 5
5. Select the fittest individuals
6. Apply genetic operators
7. Generate new population
8. Goto 3
9. End
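To make the control flow concrete, here is a minimal sketch of this
loop in Python. The toy fitness (counting 1-bits) and all parameter
values are illustrative choices of this write-up, not taken from the
paper.

import random

def run_ga(n_bits=20, pop_size=30, generations=50,
           p_crossover=0.6, p_mutation=0.01):
    # Step 2: generate the initial population of random bit strings.
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):  # step 4: termination condition
        # Step 3: evaluate the population (toy fitness: number of 1-bits).
        ranked = sorted(pop, key=sum, reverse=True)
        # Step 5: select the fittest half as parents.
        parents = ranked[:pop_size // 2]
        # Step 6: apply genetic operators (one-point crossover, bit-flip mutation).
        new_pop = []
        while len(new_pop) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < p_crossover:
                cut = random.randrange(1, n_bits)
                a = a[:cut] + b[cut:]
            new_pop.append([bit ^ (random.random() < p_mutation) for bit in a])
        pop = new_pop  # step 7: new population; step 8: goto 3
    return max(pop, key=sum)  # step 9: end, returning the fittest individual

print(run_ga())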
The problem with this, as the authors note, is that when the
population size is large the computation cost is immense. Hence, they
use parallel genetic algorithms.
The parallel genetic algorithm used is given below:
1. Start
2. Generate initial population
3. Evaluate the population
4. If termination condition satisfied goto 11, else goto 5
5. Select fittest individuals
6. Apply genetic operators
7. Send migrants if necessary
8. Receive migrants if necessary
9. Generate new population
10. Goto 3
11. End
The P.G.A. basically divides the entire population into subpopulations
and runs a normal G.A. on each subpopulation, but it incorporates
migration of the fittest individuals amongst processors. Hence they
use the island model (also known as the distributed G.A.), sketched
below.
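A rough Python sketch of the island model under this write-up's
assumptions: each island runs one normal GA generation (abstracted
here as evolve_one_generation), and with a small probability its
fittest individual migrates to a neighbour in a ring. The ring
topology and the toy fitness are illustrative, not the paper's exact
scheme.

import random

def evolve_one_generation(pop):
    # Placeholder for steps 3-6 and 9 on one island (evaluate, select,
    # crossover, mutate); substitute the loop body of the basic GA
    # sketch above. Here truncation selection duplicates the fitter half.
    best = sorted(pop, key=sum, reverse=True)[:len(pop) // 2]
    return best + [bits[:] for bits in best]

def island_model(islands, generations=100, p_migration=0.002):
    for _ in range(generations):
        islands = [evolve_one_generation(pop) for pop in islands]
        # Steps 7-8: send and receive migrants if necessary.
        for i, pop in enumerate(islands):
            if random.random() < p_migration:
                migrant = max(pop, key=sum)               # fittest individual
                target = islands[(i + 1) % len(islands)]  # ring neighbour
                target[random.randrange(len(target))] = migrant[:]
    return islands

# Two islands of ten random 8-bit individuals, as in a two-processor run.
start = [[[random.randint(0, 1) for _ in range(8)] for _ in range(10)]
         for _ in range(2)]
final = island_model(start)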
Gene subsets are represented using a fixed-length binary string in
which the value 1 means that the gene is included in the subset and 0
means that it is not.
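A tiny illustration of this encoding (the gene names are made up):

genes = ["g1", "g2", "g3", "g4", "g5"]  # the whole gene set
chromosome = [1, 0, 0, 1, 1]            # encodes the subset {g1, g4, g5}

subset = [g for g, bit in zip(genes, chromosome) if bit]
print(subset)                           # ['g1', 'g4', 'g5']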
The underlying idea behind using a parallel genetic algorithm is to
search for gene subsets with fewer elements while still covering the
entire search space (all possible subsets of the whole gene set).
Both the classification accuracy and the subset size are used in the
fitness evaluation. Accuracy is measured by invoking a classifier that
runs on the found gene subset and some training data (which contains
only the expression levels of the genes in the subset). The fitness
function is as follows:
fitness(x) = w1 * accuracy(x) + w2 * (1 - dimensionality(x))
Here,
accuracy(x) = test accuracy of the classifier built with the gene
subset represented by x
dimensionality(x) ∈ [0, 1] = the (normalized) dimension of the subset
w1 = weight assigned to accuracy
w2 = weight assigned to dimensionality
As we can see from the fitness function, a subset with low
dimensionality and high accuracy gets a higher fitness value than any
other.
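A direct transcription of this formula in Python; accuracy_of is a
stand-in for any classifier evaluation returning a test accuracy in
[0, 1], and dimensionality is interpreted here as the selected-gene
count normalized by the total number of genes, so that it also lies
in [0, 1]:

def fitness(chromosome, accuracy_of, w1=0.75, w2=0.25):
    # Fraction of genes selected; small subsets score closer to 0.
    dimensionality = sum(chromosome) / len(chromosome)
    # Weighted sum of accuracy and (1 - dimensionality), as in the paper.
    return w1 * accuracy_of(chromosome) + w2 * (1 - dimensionality)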
3. Experiments and Results
The experiments were run on two data sets: the leukemia data set and
the colon data set.
The tests were run on two processors with the following parameters of
the G.A. in each processor (collected into a config sketch after the
list):
Population size: 1000
Trials: 400,000
Crossover probability: 0.6 (kept relatively high to make sure that the
generated subpopulation differs from its parents, but low enough not
to destroy good traits, i.e. fit individuals, of the parents)
Mutation probability: 0.001 (kept low so that the randomness of the
search is reduced)
Selection strategy: elitist (a good strategy here because the best
gene subsets are kept and passed on from each generation to the next,
leading to more accurate gene subsets)
Migration probability: 0.002 (ensures sufficient exchange of
individuals between the two processors, so that there is diversity in
the gene subsets formed)
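For reference, the reported settings gathered into a plain Python
config (a convenience of this write-up, not the authors' code):

GA_PARAMS = {
    "population_size": 1000,       # per processor
    "trials": 400_000,
    "crossover_probability": 0.6,
    "mutation_probability": 0.001,
    "selection_strategy": "elitist",
    "migration_probability": 0.002,
}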
To test the gene selection method they use a classifier proposed by
Golub and Slonim, which basically gives a positive value to a sample
belonging to class 1 and a negative value to a sample belonging to
class 2. Since they take the accuracy of the classification to be more
important, they use w1 = 0.75 and w2 = 0.25 in their experiments.
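For context, here is a sketch of the weighted-voting classifier as
commonly described for Golub et al.'s work: each selected gene casts
a vote whose sign points to a class. This is this write-up's reading
of the scheme, not the paper's code; train1 and train2 are assumed
per-class expression matrices (samples x genes) restricted to the
selected subset, with non-constant genes.

import numpy as np

def train_gs(train1, train2):
    mu1, mu2 = train1.mean(axis=0), train2.mean(axis=0)
    sd1, sd2 = train1.std(axis=0), train2.std(axis=0)
    weights = (mu1 - mu2) / (sd1 + sd2)   # signal-to-noise ratio per gene
    boundary = (mu1 + mu2) / 2            # per-gene decision boundary
    return weights, boundary

def classify_gs(sample, weights, boundary):
    votes = weights * (sample - boundary)  # one vote per gene
    return 1 if votes.sum() > 0 else 2     # positive total -> class 1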
The main results were as follows:
For the leukemia data set, a subset of 29 genes was found that
correctly predicted 36 of the 38 training samples and 30 of the 34
test samples. However, that classifier contained three meaningless
genes; dropping them raised accuracy on the test data set to 91%.
For the colon data set, a subset of 30 genes was found that gave 92%
accuracy on the training data set.
These results are better than those of many other algorithms, such as
the G-S and NB algorithms, whose accuracies are less than 90% with
gene numbers varying from 10 to 500.
Graphs of the average performance of the P.G.A. show that, for both
the leukemia data set and the colon data set, the average fitness
hovers around the 0.9 level, which is very good.
4. Conclusion
The results of their experiments show that the P.G.A. does in fact
provide us with a smaller gene subset that gives better accuracy on
the training and test data sets. However, as the authors themselves
point out, the final subset found by the PGA might not be the best
one, owing to the population or to the fitness function being used,
which is very simple. Hence they suggest using a more sophisticated
formula for fitness evaluation, which could lead to a more effective
search and a better subset being found. The problem could also be
extended from the current two-class setting, which is imposed by the
use of the Golub-Slonim classifier, to multiclass tissue
classification.
5. References
Juan Liu, Hitoshi Iba and Mitsuru Ishizuka, "Selecting Informative
Genes with Parallel Genetic Algorithms in Tissue Classification,"
Proceedings of the Genome Informatics Workshop (GIW 2001), pp. 14-23,
Ebisu, Tokyo, Japan, 2001.
Yang, J. and Honavar, V., "Feature Subset Selection Using a Genetic
Algorithm," 1997.