Project Report

ECE 539: Introduction to Artificial Neural Networks and
Fuzzy Systems
Project Report
DNA Microarray Data Analysis
using Artificial Neural Network Models
By
Venkatanand Venkatachalpathy
Email: venkatac@cae.wisc.edu
Student ID: 9016417330
Section # : 1 ( MWF – 11am)
Introduction:
Background:
According to the Central Dogma of Molecular Biology, genes made of
Deoxyribonucleic Acid (DNA) are the basic units of heredity [1]. Genes can also
be called the information molecules of life, because the genetic code, represented as a
sequence of chemical monomers, is decoded in each living cell into the functional
molecular network of proteins and RNA molecules (RNA: Ribonucleic Acid). Proteins
and RNA are termed the functional molecules of life. They are responsible for the
physical, chemical and biological properties of the cell, which in turn are manifested
globally in the behavior of a living organism such as a human, which is made of
trillions of cells. Figure 1 illustrates this principle of genetic information flow from
DNA to proteins.
Figure 1. Genetic information flow in a cell: Genes (DNA) -> RNA intermediate -> Protein.
Gene expression refers to both transcription and translation.
Gene Expression:
The biochemical process by which genes are first transcribed into RNA
molecules (transcription) and then converted into protein molecules (translation) is
referred to as gene expression. This flow of information from gene to protein is
not static. It varies dynamically with time depending on several factors, such as
the stage of development of the cell (or organism) and environmental conditions. For
example, the heat shock genes are overexpressed (that is, more of their proteins are
synthesized) as soon as the cell is subjected to a high-temperature environment. Thus the
expression level of heat shock genes remains at a basal rate under normal or standard
conditions, rises as soon as these conditions change, and returns to the normal level
over time as the cell recovers from the shock. This simple example illustrates how
changes in gene expression at the molecular level represent the chemical (and to some
extent behavioral) response of the cell to changing environmental conditions, the
environment being one of several factors that affect the rate of gene expression.
The state of gene expression under normal (or average) conditions is
referred to as the normal level or basal rate. When the expression level of a gene
rises above this level during an interval of time (as in the heat shock example
above), the gene is said to be switched "ON" during that interval. Similarly, a gene
is said to be switched "OFF" when its expression level falls below the normal rate [2].
Microarray Experiments:
DNA microarray technology is a recent advancement in biotechnology. It
provides biologists with the ability to measure the expression levels of thousands of
genes in a single experiment. These arrays consist of large numbers of specific
oligonucleotides or cDNA sequences, each corresponding to a different gene, affixed to a
solid surface at very precise locations [3]. When an array chip is hybridized to labeled
cDNA derived from a particular tissue of interest, it yields simultaneous measurements of
the mRNA levels in the sample for each gene represented on the chip. Since mRNA
levels are expected to correlate roughly with the levels of their translation products, the
active molecules of interest, array results can be used as a crude approximation to the
protein content and thus the ‘state’ of the sample. Ideally, one would also like to
measure the levels of proteins in a cell directly, and such technology is currently being
developed. The intensity of each spot in the array reflects the expression level of the
gene at the corresponding location. In short, DNA microarrays yield a global view of
gene expression.
Microarray Experimental Data:
Each data point produced by a DNA microarray hybridization experiment
represents the ratio of expression levels of a particular gene under two different
experimental conditions. Typically, one measurement is carried out under
standard (reference) conditions while the other is carried out under the conditions of
interest. Thus the ratio reflects the increase or decrease in the level of gene expression
when the cell is under the probing conditions relative to the basal expression level.
The resulting data from a single experiment with n genes on a single chip is a series of n
expression-level ratios. The numerator of each ratio is the expression level of the gene in
the condition of interest, whereas the denominator is the expression level of the
gene in the reference condition. The data from a series of m such experiments may be
represented as a gene expression matrix, in which each of the n rows is an m-element
expression vector for a single gene. The values of each gene expression vector are
log-transformed and then normalized so that the norm of the vector of log ratios is 1.
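As a minimal sketch of this normalization (not code from the project), the Python fragment below log-transforms the ratios of one gene and scales the result to unit length. The natural logarithm, the Euclidean norm, the function name and the example ratios are all assumptions; the report does not specify them.

```python
import math

def normalize_expression_vector(ratios):
    """Log-transform expression ratios and scale to unit Euclidean norm."""
    logs = [math.log(r) for r in ratios]
    norm = math.sqrt(sum(x * x for x in logs))
    if norm == 0.0:
        # Every ratio is 1: the gene never departs from the reference level.
        return logs
    return [x / norm for x in logs]

# Hypothetical ratios for one gene across four experiments.
ratios = [2.0, 0.5, 1.0, 4.0]
vector = normalize_expression_vector(ratios)
print(vector)
```

A ratio of 1 maps to 0 after the log transform, so positive entries mark overexpression relative to the reference and negative entries mark underexpression.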
Motivation for the project:
The pattern of gene expression in a cell characterizes its current state. Virtually
all differences in cell state or type are correlated with changes in the mRNA levels of
many genes. Expression patterns of many uncharacterized genes provide clues to their
possible function by comparison [1]. This leads to a great many potential applications in
medicine and molecular biology, especially in the identification of metabolic pathways,
the study of complex genetic diseases, drug discovery and toxicology analysis.
One major application of microarray data is in the area of functional genomics
where the functional significance of the genes is studied. This application is based on the
observation that genes of similar function yield similar expression patterns. Thus based
on their microarray expression profile, genes can be grouped into classes of genes that are
functionally related. As data from such experiments accumulates, it will be essential to
have accurate means for extracting biological significance and using the data to assign
functions to genes. With the completion of the genome projects, we have the
sequence information of the genes. What is still missing is the functional
information about the different genes. Data obtained from microarray
experiments could therefore be used extensively to assign functions to unknown genes
by correlating their expression profiles with the profiles of genes whose function is
already known.
Problem statement:
As observed earlier, microarray data can be used to determine the function of
unknown genes by correlating the expression profiles of these genes with the profiles of
genes whose function is already known. At first glance, this can be
recognized as a pattern classification problem in which a functional class is assigned to
each gene based on its microarray expression features. The knowledge needed for
the classification is contained in the mapping between expression profile and functional
class for genes whose functions are already known. In short, the problem is
to decide whether or not a gene belongs to a functional class, based on its expression
profile and a knowledge base of expression profiles of genes whose function is already
known. Brown et al. [4] showed that this kind of classification can be done
using a Support Vector Machine (SVM). In this project I have tried to solve this pattern
classification problem using Multilayer Perceptron (MLP) as well as SVM models. Both
models are well known for their ability to encode a knowledge base and perform
pattern classification based on it, so I decided to apply them to the gene
functional class determination problem described in the earlier sections. In addition, an
easy-to-use GUI developed in the course of the analysis would be a valuable tool.
Chosen problem of gene function analysis:
A well-known problem in this area is the determination of the functional class of
unknown yeast genes using the known genes. Traditionally, a number of yeast
genes have been grouped into biologically relevant functional classes such as TCA
(tricarboxylic acid cycle) and histones. After the completion of the yeast genome
project, we have thousands of other genes whose functional class is still not known, and
determining their function by experimental methods has proved elusive. The microarray
expression profiles of these genes can now be used to assign them to one of the
functional classes that are known so far.
Data specifications:
The yeast microarray expression data were obtained from the Stanford Microarray
Database and the Munich Information Center for Protein Sequences Yeast Genome Database
(MYGD). The data consist of gene expression ratios for 2,467
genes of the budding yeast (Saccharomyces cerevisiae) measured in 79 different DNA
microarray hybridization experiments, plus the normal expression value, giving 80
column values per gene. From these data, we learn to recognize the TCA functional
class as defined in MYGD.
Work Performed:
The project work consisted of the following steps:
1. Statistical data analysis: This was done to extract significant feature vectors.
Since the initial data contain 80 features, the problem is computationally demanding,
and it would be unwise to use redundant features in developing the model.
The correlation coefficient was therefore computed between each pair of features. If
the correlation between two features was high (> 0.8), the two features
were to be combined into a single component by taking their average. The
mean and variance of each feature were also computed to provide
insight into how the features are statistically distributed. The results of
this analysis were used as guidelines in designing a suitable architecture for the
neural network.
2. Building ANN models:
I used two different neural network models for the pattern classification problem
described. The models are as follows:
• Multilayer Perceptron:
The bp.m program given on the course home page was used to implement the
MLP algorithm. Different architectures were tried, with the number of hidden layers
varying from 1 to 3 and the number of hidden neurons ranging from 10 to
40 for hidden layer 1, 5 to 10 for hidden layer 2, and 1 to 4 for hidden layer 3.
Experiments were also performed to optimize the learning rate and momentum
factor of the MLP.
• Support Vector Machine:
Three different kernels were tried for the SVM implementation: a linear kernel, a
polynomial kernel of order 2, and radial basis functions. The support vector machines
were implemented based on the svmdemo.m code given on the class website. Code
from the SVM package was also used.
3. Performance of the ANN models:
The performance of the different ANN models was compared
based on a 4-way cross-validation experiment on the training data and also
on the complexity of the model. Where results were similar, the simpler model was
preferred. Finally, an optimal model was chosen.
4. Providing a GUI. (This was done in parallel with the steps
described above.)
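The 4-way cross-validation used in step 3 can be sketched as follows. This is an illustrative Python outline, not the MATLAB code used in the project: the function names, the way samples are assigned to folds, and the nearest-centroid stand-in classifier are all hypothetical. Any MLP or SVM training routine could be passed in through train_fn and predict_fn instead.

```python
import random

def k_fold_indices(n, k=4, seed=0):
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_rate(samples, labels, train_fn, predict_fn, k=4, seed=0):
    """Return the percentage of held-out samples that are classified correctly."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(samples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train_fn(train_x, train_y)
        correct += sum(predict_fn(model, samples[i]) == labels[i] for i in held_out)
    return 100.0 * correct / len(samples)

# Stand-in classifier (hypothetical): pick the class whose mean vector is nearest.
def train_centroids(xs, ys):
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(col) for col in zip(*rows)] for y, rows in groups.items()}

def predict_centroid(model, x):
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

# Toy data: two well-separated clusters of 8 points each.
samples = [[0.01 * i, 0.0] for i in range(8)] + [[5.0 + 0.01 * i, 5.0] for i in range(8)]
labels = [0] * 8 + [1] * 8
rate = cross_validated_rate(samples, labels, train_centroids, predict_centroid)
print(rate)
```

Every sample is held out exactly once, so the returned rate is an estimate of performance on data the model has not seen during training.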
Results and Discussion:
• Statistical Analysis:
The correlation coefficients between expression feature vectors and the
mean and variance of the features were computed. The average, minimum and maximum
values of these parameters are given in the table below. The correlations are low, so the
experiments are largely independent of one another and all the feature
vectors are needed. The mean and variance of the features are also well distributed, as
seen from the average, minimum and maximum of these values, indicating that the
features are comparably important for functional classification.
Statistical Parameters for the Yeast Gene Expression Feature data
Statistical Parameter      Average value   Minimum value   Maximum value
Correlation Coefficient    0.23            -0.4            0.4
Mean                       -0.01           -0.1            0.3
Variance                   0.3383          0.4121          0.6778
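The redundancy check described under Work Performed (averaging two features when their correlation exceeds 0.8) can be sketched in Python as below. This is only an illustration under stated assumptions: the Pearson coefficient is assumed as the correlation measure, the greedy grouping against the first column of each group is one of several possible merging strategies, and the function names and toy columns are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        return 0.0  # a constant column carries no correlation information
    return cov / (dev_x * dev_y)

def merge_correlated(columns, threshold=0.8):
    """Group each feature column with the first group it correlates with, then average."""
    groups = []
    for col in columns:
        for group in groups:
            if pearson(group[0], col) > threshold:
                group.append(col)
                break
        else:
            groups.append([col])
    return [[sum(vals) / len(vals) for vals in zip(*group)] for group in groups]

# Toy feature columns (one value per gene); the first two are nearly proportional.
columns = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 1, 3, 2]]
merged = merge_correlated(columns)
print(len(merged), merged[0])
```

With a maximum observed correlation of 0.4, as in the table above, no pair of features would cross the 0.8 threshold, which is consistent with all 80 features being retained.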
• MLP Model - Optimization of MLP learning parameters:
For all the MLP and SVM models described below, the classification rate (Crate) was
computed by 4-way cross-validation using the data from the yeast gene expression
database. For a simple 3-layer MLP model, 80-40-1, the learning rate and momentum
factor were varied and the optimum values were determined.
Optimization of MLP learning parameters:
MLP parameter                      Value
Crate (maximum possible value)     > 95%
Learning rate                      0.1
Momentum factor                    0.8
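To show where the learning rate and momentum factor enter the weight update, here is a minimal Python sketch of back-propagation with momentum for a single hidden layer. It is not the course's bp.m program: the sigmoid activations, squared-error gradient, per-sample updates, network size, initialization range and toy data are all assumptions made for illustration. Only the values 0.1 and 0.8 come from the table above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Return the biased input, the biased hidden activations and the output."""
    xb = list(x) + [1.0]
    hb = [sigmoid(sum(w * a for w, a in zip(row, xb))) for row in w1] + [1.0]
    y = sigmoid(sum(w * a for w, a in zip(w2, hb)))
    return xb, hb, y

def train_mlp(samples, targets, n_hidden=4, lr=0.1, momentum=0.8, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0])
    # The last weight in every row multiplies the constant bias input 1.0.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    v1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]  # previous updates, layer 1
    v2 = [0.0] * (n_hidden + 1)                         # previous updates, layer 2
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            xb, hb, y = forward(w1, w2, x)
            d_out = (t - y) * y * (1.0 - y)
            # Hidden deltas use the output weights from before this update.
            d_hid = [d_out * w2[j] * hb[j] * (1.0 - hb[j]) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                v2[j] = momentum * v2[j] + lr * d_out * hb[j]
                w2[j] += v2[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    v1[j][i] = momentum * v1[j][i] + lr * d_hid[j] * xb[i]
                    w1[j][i] += v1[j][i]
    return w1, w2

# Toy two-class problem (hypothetical data, far smaller than the 80-40-1 network).
samples = [[1.0, 0.9], [0.8, 1.0], [-1.0, -0.8], [-0.9, -1.0]]
targets = [1, 1, 0, 0]
w1, w2 = train_mlp(samples, targets)
outputs = [forward(w1, w2, x)[2] for x in samples]
print(outputs)
```

Each update is the current gradient step scaled by the learning rate plus the previous update scaled by the momentum factor, so a momentum of 0.8 lets consistent gradient directions accumulate across samples.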
• MLP models - Optimization of number of hidden layers:
The table below summarizes the results of experiments with different numbers of
hidden layers, using 40 neurons for hidden layer #1, 7 neurons for hidden layer #2
(if present) and 2 neurons for hidden layer #3 (if present). This was done to find
the optimum number of hidden layers for our MLP architecture.
Optimization of number of hidden layers
Number of hidden layers    Crate (maximum possible value)
1                          98.8%
2                          98.9%
3                          98.4%
From the data it is evident that one hidden layer with 40 neurons does a good job
of classification. Additional layers do not produce any meaningful increase in the
classification rate.
• MLP models - Optimization of number of hidden neurons per layer:
The table below summarizes the results of experiments in which the number of
hidden neurons was varied for each hidden layer: from 10 to 40 neurons for hidden
layer #1, 5 to 10 for hidden layer #2, and 1 to 4 for hidden layer #3.
Optimization of number of hidden neurons per layer
Hidden layer    Max Crate (maximum possible value)    Number of hidden neurons
1               98.92%                                39
2               99.08%                                6
3               98.11%                                2
Based on the results from the optimization of the number of hidden layers and the number
of hidden neurons per hidden layer, it is evident that one hidden layer with 39
neurons classifies nearly as well as the best models with more layers. We conclude
that additional layers do not produce any meaningful increase in the classification
rate for our problem.
• SVM Results:
Crate results for SVMs of different kernels
SVM Kernel               Max Crate
Linear                   98.71%
Polynomial order 2       99.03%
Radial basis function    99.52%
The Crate results obtained for the different SVM kernels are summarized in the table
above. The best Crate was obtained with radial basis functions, though the linear
and polynomial kernels were not far behind.
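For reference, the three kernel types compared above can be written in Python as below. These are the textbook forms, given only as a sketch: the offset of 1 in the polynomial kernel and the width sigma = 1 in the radial basis function are assumed defaults, not the settings used in svmdemo.m or in the experiments reported here.

```python
import math

def linear_kernel(x, y):
    """Plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, offset=1.0):
    """Polynomial kernel; degree 2 corresponds to the order-2 case in the table."""
    return (linear_kernel(x, y) + offset) ** degree

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian radial basis function of the squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Two hypothetical expression vectors (the real ones have 80 elements).
u = [1.0, 2.0]
v = [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v))
```

The radial basis function depends only on the distance between two expression profiles, so it equals 1 for identical profiles and decays toward 0 as they move apart.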
• The Graphical User Interface for testing SVM and MLP models:
The GUI, developed in MATLAB, is shown below:
Key Features:
1. Can be used for visualization of the gene expression profiles based on
their classes.
2. Provides easy interface for reading the CSV formatted microarray gene
expression matrix.
3. Provides for rigorous experimentation with the different kinds of MLP
models.
4. Also provides for comparison of the results of the MLP models with the
better-performing SVM models.
5. Provides means for classifying unknown genes based on a trained
network.
6. Serves as a test bed for trying out MLP models and comparing them with
SVM models for functional classification based on gene expression data.
Besides the above key features, a noticeable drawback is the slowness of the application,
especially when testing large MLP models or SVM models of higher kernel order. This is
primarily because of the limited speed of the GUI routines of MATLAB. Porting to
C++ or another compiled language could improve the speed of execution.
The SVM results are comparable to those reported in the literature for SVMs [4].
Conclusions:
Identification of the functional class of genes based on their gene expression profiles
obtained from microarray experiments has a great many applications in finding
the function of unknown genes, which in turn can be used for further study of the genes'
involvement in a particular functional class. Multilayer Perceptron and Support Vector
Machine models have been built and tested for this particular problem. The results
obtained for the identification of genes belonging to the TCA class based on their
expression profiles are very good, with both the MLP and SVM models giving a
classification rate of around 99%, thus providing an important application of neural
network models and learning theory in genomics.
References:
[1] Lander, E.S. (1996). The new genomics: global views of biology. Science 274, 536-539.
[2] DeRisi et al. (1997). Exploring the metabolic and genetic control of gene expression
on a genomic scale. Science 278, 680-686.
[3] Eisen et al. (1998). Cluster analysis and display of genome-wide expression patterns.
Proceedings of the National Academy of Sciences 95, 14863-14868.
[4] Brown et al. (2000). Knowledge-based analysis of microarray gene expression data
by using support vector machines. Proceedings of the National Academy of Sciences 97,
262-267.