ECE 539: Introduction to Artificial Neural Networks and Fuzzy Systems
Project Report

DNA Microarray Data Analysis using Artificial Neural Network Models

By Venkatanand Venkatachalpathy
Email: venkatac@cae.wisc.edu
Student ID: 9016417330
Section #: 1 (MWF, 11am)

Introduction:

Background:
According to the Central Dogma of Molecular Biology, genes made of deoxyribonucleic acid (DNA) are the basic units of heredity [1]. Genes can also be called the information molecules of life, because the genetic code, represented as a sequence of chemical monomers, is decoded in each living cell into a functional molecular network of proteins and RNA (ribonucleic acid) molecules. Proteins and RNA are termed the functional molecules of life. They are responsible for the physical, chemical and biological properties of the cell, which in turn are manifested globally in the behavior of a living organism such as a human, who is made of trillions of cells. Figure 1 illustrates this principle of genetic information flow from DNA to proteins.

Figure 1. Genetic information flow in a cell: Genes (DNA) -> RNA intermediate -> Protein. Gene expression refers to both transcription and translation.

Gene Expression:
The biochemical process by which genes are first transcribed into RNA molecules (transcription) and then converted to protein molecules (translation) is referred to as gene expression. This transfer of information from gene to protein is not static. It varies dynamically with time, depending on factors such as the stage of development of the cell (or organism) and environmental conditions. For example, the heat shock genes are overexpressed (that is, more protein is synthesized) as soon as the cell is subjected to a high-temperature environment. However, the rate of this gene expression decreases as the cell settles down from the shock.
Thus the expression level of the heat shock genes remains at a basal rate under normal or standard conditions, rises as soon as these conditions change, and finally returns to the normal level as the cell recovers from the shock. This simple example illustrates how changes in gene expression level at the molecular scale represent the chemical (and to some extent behavioral) response of the cell to changing environmental conditions, the environment being one of several factors that affect the rate of gene expression. The state of gene expression under normal (or average) conditions is referred to as the normal level or basal rate. When the expression level of a gene rises sharply during an interval of time, as in the heat shock example above, the gene is said to be switched “ON” during that interval. Similarly, a gene is said to be switched “OFF” when its expression level falls below the normal rate [2].

Microarray Experiments:
DNA microarray technology is a recent advance in biotechnology. It gives biologists the ability to measure the expression levels of thousands of genes in a single experiment. The arrays consist of large numbers of specific oligonucleotides or cDNA sequences, each corresponding to a different gene, affixed to a solid surface at very precise locations [3]. When an array chip is hybridized to labeled cDNA derived from a particular tissue of interest, it yields simultaneous measurements of the mRNA levels in the sample for each gene represented on the chip. Since mRNA levels are expected to correlate roughly with the levels of their translation products, which are the active molecules of interest, array results can be used as a crude approximation to the protein content and thus the ‘state’ of the sample.
Ideally, one would also like to measure the levels of proteins in a cell directly, and such technology is currently being developed. The intensity of each spot in the array reflects the expression level of the gene at the corresponding location. In short, DNA microarrays yield a global view of gene expression.

Microarray Experimental Data:
Each data point produced by a DNA microarray hybridization experiment represents the ratio of the expression levels of a particular gene under two different experimental conditions. Typically, one of the experiments is carried out under standard conditions while the other is carried out under the conditions of interest. The ratio therefore reflects the increase or decrease in the level of gene expression under the probing conditions relative to the basal expression level. The result of a single experiment with n genes on a single chip is a series of n expression-level ratios. The numerator of each ratio is the expression level of the gene in the condition of interest, and the denominator is the expression level of the gene in the reference condition. The data from a series of m such experiments may be represented as a gene expression matrix, in which each of the n rows is an m-element expression vector for a single gene. The values of each gene expression vector are expressed on a logarithmic scale and normalized so that the norm of the log-ratio values is 1.

Motivation for the project:
The pattern of gene expression in a cell characterizes its current state. Virtually all differences in cell state or type are correlated with changes in the mRNA levels of many genes. The expression patterns of many uncharacterized genes provide clues to their possible function by comparison with the patterns of characterized genes [1].
This leads to a great many potential applications in medicine and molecular biology, especially in the identification of metabolic pathways, the study of complex genetic diseases, drug discovery and toxicology analysis. One major application of microarray data is in functional genomics, where the functional significance of genes is studied. This application is based on the observation that genes of similar function yield similar expression patterns. Based on their microarray expression profiles, genes can therefore be grouped into classes that are functionally related. As data from such experiments accumulate, it will be essential to have accurate means of extracting biological significance and of using the data to assign functions to genes. With the completion of the genome projects, the sequence information of the genes is available. What is missing is functional information about the different genes. Data obtained from microarray experiments could thus be used extensively to assign functions to unknown genes by correlating their expression profiles with the profiles of genes whose function is already known.

Problem statement:
As observed above, microarray data can be used to determine the function of unknown genes by correlating their expression profiles with the profiles of genes whose function is already known. At a glance, this can be recognized as a pattern classification problem in which a functional class is assigned to each gene based on its microarray expression features. The knowledge needed for the classification is contained in the mapping between expression profile and functional class for genes whose functions are already known. In short, the problem is to decide whether or not a gene belongs to a functional class, given its expression profile and a knowledge base of expression profiles of genes whose function is already known.
Brown et al [4] showed that this kind of classification can be done using a Support Vector Machine (SVM). In this project I have tried to solve this pattern classification problem using Multilayer Perceptron (MLP) as well as SVM models. Both models are well known for their ability to encode a knowledge base and to perform pattern classification based on it, so I decided to apply them to the gene functional class determination problem described in the earlier sections. In addition, an easy-to-use GUI developed in the course of the analysis would be a valuable tool.

Chosen problem of gene function analysis:
A well-known problem in this area is the determination of the functional class of unknown yeast genes using the known genes. Traditionally, a number of yeast genes have been grouped into biologically relevant functional classes such as TCA (tricarboxylic acid cycle) and Histones. Since the completion of the yeast genome project, there are thousands of other genes whose functional class is still not known and whose function has proved elusive to detect by experimental methods. The microarray expression profiles of these genes can now be used to find their function within the domain of the functional classes known so far.

Data specifications:
The yeast microarray expression data were obtained from the Stanford Microarray Database and the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD). The yeast gene expression data consist of gene expression ratios for 2,467 genes from the budding yeast (Saccharomyces cerevisiae), measured in 79 different DNA microarray hybridization experiments, plus the normal expression value, giving 80 column values per gene. From these data, the models learn to recognize membership in the TCA functional class as defined in MYGD.

Work Performed:
The project work consisted of the following steps:

1. Statistical data analysis: This was done to extract significant feature vectors. The initial data contain 80 features, which makes the problem computationally demanding, so it would be unwise to develop the model with redundant feature vectors. Correlation coefficients were computed between the different features of the data, and the significant features were identified. If the correlation between two features was high (> 0.8), the two features would be combined into a single component by taking their average value. The mean and variance of each feature were also computed, to provide insight into how the features are statistically distributed. The results of this analysis were used as guidelines in designing a suitable architecture for the neural network.

2. Building ANN models: I used two different neural network models for the pattern classification problem described. The models are as follows:

Multilayer Perceptron: The bp.m program given on the course home page was used to implement the MLP algorithm. Different architectures were tried, with the number of hidden layers varying from 1 to 3 and the number of hidden neurons ranging from 10 to 40 for hidden layer 1, 5 to 10 for hidden layer 2, and 1 to 4 for hidden layer 3. Experiments were also performed to optimize the learning rate and momentum factor of the MLP.

Support Vector Machine: Three different kernels were tried for the SVM implementation: a linear kernel, a polynomial kernel of order 2, and radial basis functions. The support vector machines were implemented based on the svmdemo.m code given on the class website. Code from the SVM package was also used.

3. Performance of the ANN models: The performance of the different ANN models was compared based on a 4-way cross-validation experiment on the training data and also on the complexity of the model.
Where results were similar, the simpler model was preferred. Finally, an optimal model was chosen.

4. Provide a GUI interface. (This was done in parallel with the steps described earlier.)

Results and Discussion:

Statistical Analysis:
The correlation coefficients between the expression feature vectors, and the mean and variance of each feature, were computed. The average, minimum and maximum values of these parameters are given in the table below. The correlations are low, which indicates that the experiments are largely independent of one another, so all the feature vectors need to be retained. The mean and variance of the features are also well distributed, as seen from the average, minimum and maximum of these values, indicating that the features are equally important for functional classification.

Statistical Parameters for the Yeast Gene Expression Feature Data

Statistical parameter      Average value   Minimum value   Maximum value
Correlation coefficient    0.23            -0.4            0.4
Mean                       -0.01           -0.1            0.3
Variance                   0.3383          0.4121          0.6778

MLP Model - Optimization of MLP learning parameters:
For all the MLP and SVM models described below, the classification rate (Crate) was computed by 4-way cross-validation using the data from the yeast gene expression database. For a simple 3-layer MLP model, 80 - 40 - 1, the learning rate and momentum factor were varied and the optimum values were determined.

Optimization of MLP learning parameters

MLP parameter      Optimum value
Learning rate      0.1
Momentum factor    0.8

Crate (maximum possible value) at these settings: > 95%

MLP models - Optimization of number of hidden layers:
The table below summarizes the results of experiments with different numbers of hidden layers, using 40 neurons for hidden layer #1, 7 neurons for hidden layer #2 (if present) and 2 neurons for hidden layer #3 (if present). This was done to find an optimum number of hidden layers for our MLP architecture.

Optimization of number of hidden layers

Number of hidden layers    Crate (maximum possible value)
1                          98.8%
2                          98.9%
3                          98.4%

From these data it is evident that one hidden layer with 40 neurons does a good job of classification. Additional layers do not produce any appreciable increase in the classification rate.

MLP models - Optimization of number of hidden neurons per layer:
The table below summarizes the results of experiments in which the number of hidden neurons was varied for each hidden layer: from 10 to 40 neurons for hidden layer #1, 5 to 10 for hidden layer #2, and 1 to 4 for hidden layer #3.

Optimization of number of hidden neurons per layer

Hidden layer    Max Crate (maximum possible value)    Number of hidden neurons
1               98.92%                                39
2               99.08%                                6
3               98.11%                                2

Based on the results from the optimization of the number of hidden layers and of the number of hidden neurons per hidden layer, it is evident that one hidden layer with 39 neurons classifies very well, comparably to the best of the models with more layers. We conclude that additional layers do not produce any appreciable increase in the classification rate for our problem.

SVM Results:

Crate results for SVMs of different kernels

SVM kernel               Max Crate
Linear                   98.71%
Polynomial order 2       99.03%
Radial basis function    99.52%

The Crate results obtained for the different SVM kernels are summarized in the table above. The best Crate occurs when radial basis functions are used, though the linear and polynomial kernels are not far behind. The SVM results are comparable to those reported in the literature for SVMs [4].

The Graphic User Interface for testing SVM and MLP models:
The GUI, developed in MATLAB, is shown below:

[Figure: MATLAB GUI for testing SVM and MLP models]

Key Features:
1. Can be used for visualization of the gene expression profiles based on their classes.
2. Provides an easy interface for reading the CSV-formatted microarray gene expression matrix.
3. Provides for rigorous experimentation with the different kinds of MLP models.
4. Also provides for comparison of the results of the MLP models with the better-performing SVM models.
5. Provides a means of classifying unknown genes based on a trained network.
6. Serves as a test bed for trying out MLP models and comparing them with SVM models for functional classification based on gene expression data.

Against these key features, a noticeable drawback is the slowness of the application, especially when testing large MLP models or SVM models of higher kernel order. This is primarily because of the limited speed of the GUI routines of MATLAB. Porting to C++ or another compiled language could improve the speed of execution.

Conclusions:
Identifying the functional class of genes from their microarray gene expression profiles has a great many applications in finding the function of unknown genes, which in turn can be used for further study of the involvement of those genes in a particular functional class. Multilayer Perceptron and Support Vector Machine models have been built and tested for this problem. The results obtained for identifying genes belonging to the TCA class based on their expression profiles are very good, with both the MLP and SVM models giving a classification rate of around 99%, thus providing an important application of neural network models and learning theory in genomics.

References:
[1] Lander, E.S. (1996). The new genomics: global view of biology. Science 274, 536-539.
[2] DeRisi et al (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-686.
[3] Eisen et al (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863-14868.
[4] Brown et al (2000). Knowledge-based analysis of microarray gene expression data. Proceedings of the National Academy of Sciences 97, 262-267.
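
Supplementary sketch for the statistical data analysis step:
The rule described under Work Performed, in which two features whose correlation coefficient exceeds 0.8 are combined by averaging, could be implemented along the following lines. This is a minimal Python sketch, not the MATLAB code used in the project. The function names pearson and merge_correlated_features, the greedy left-to-right grouping of features, the handling of constant features, and the toy matrix are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant feature is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def merge_correlated_features(matrix, threshold=0.8):
    """Average together features (columns) that are highly correlated.

    matrix: list of rows, one per gene; each row lists one value per
    experiment (feature). Grouping is greedy (an assumption): each
    not-yet-grouped column starts a group and absorbs every later
    ungrouped column whose correlation with it exceeds the threshold.
    Returns the reduced matrix and the list of column-index groups.
    """
    n_features = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_features)]
    assigned = [False] * n_features
    groups = []
    for i in range(n_features):
        if assigned[i]:
            continue
        group = [i]
        assigned[i] = True
        for j in range(i + 1, n_features):
            if not assigned[j] and pearson(columns[i], columns[j]) > threshold:
                group.append(j)
                assigned[j] = True
        groups.append(group)
    # Replace each group of columns by its per-gene average.
    reduced = [[sum(row[j] for j in g) / len(g) for g in groups]
               for row in matrix]
    return reduced, groups


# Toy data (hypothetical): 4 genes x 3 experiments. Columns 0 and 1
# move together; column 2 is only weakly related to them.
toy = [
    [1.0, 1.1, -0.5],
    [2.0, 2.1, 0.7],
    [3.0, 2.9, -0.2],
    [4.0, 4.2, 0.4],
]
reduced, groups = merge_correlated_features(toy)
print(groups)  # [[0, 1], [2]]
```

On the yeast data reported above, the maximum observed correlation was 0.4, so a rule of this kind would leave all 80 features in separate groups, which is consistent with the decision to retain every feature.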