ABSTRACT

In this project we propose a set of clustering algorithms and a new approach for the analysis of gene expression data. These algorithms will be applied to gene expression data to find the disease-mediating genes for a particular disease of a genome. Because gene expression data are inherently high dimensional, dimensionality reduction algorithms are further used to reduce the dataset. To do this, we will propose models for dimensionality reduction based on various statistical and mathematical techniques. The clustering algorithms will group the genes of a genome for its diseased and normal conditions, and the results will be further validated by cluster validity indices. In this connection we have already implemented some existing clustering algorithms, mainly based on partitioning, hierarchical and density-based approaches. A further study will be needed to compare their results with those of our proposed methodology, which will incorporate a neural network model for machine learning. The efficiency of the proposed algorithm will be compared against the existing ones. The genes selected by our method will be further validated biologically by studying their biochemical properties. At the end of the work we intend to regroup our methods and programs into a software module, which will serve as a bioinformatics tool for gene expression data analysis.

We are thus trying to establish relationships among different genetic data sets, basically diseased and normal, using neuro-fuzzy logic. Neuro-fuzzy refers to the combination of artificial neural networks and fuzzy logic. Neuro-fuzzy hybridization results in a hybrid intelligent system that synergizes the two techniques by combining the human-like reasoning style of fuzzy systems with the learning and connectionist structure of neural networks. A neural network is a system designed to respond like the human nervous system using neurons, while fuzzy systems allow the gradual assessment of the membership of the genes. For this grouping we first need a data set, i.e. the collection of information on which the analysis of the deviation between diseased and normal genes can be done. Gene expression is the process by which the information in a gene is used in the phenotypic manifestation of a genome; this manifestation is recorded in a data set called the gene expression data.

PROJECT MODEL
[Figure: project model. Gene expression data are normalized using the exhaustive search method; the normalized data are clustered by the existing methods (K-means, DBSCAN, AGNES, DIANA) and by the proposed neuro-fuzzy method built on a neural network clustering layer; each branch is followed by gene selection, validation and results, and the accuracy of the methods is compared.]

INTRODUCTION

GENE EXPRESSION DATA
Gene expression is the process by which the information from a gene is used in the synthesis of a functional gene product: the information in the gene is converted into mRNA via transcription and then into protein via translation, resulting in the phenotypic manifestation of the gene. This manifestation is recorded in a data set called the gene expression data. Such data are produced by experiments carried out in parallel on a large number of genes simultaneously, for example DNA micro-arrays (DNA chips).
DNA –(transcription)→ mRNA –(translation)→ Protein; Transcription + Translation = Gene Expression.

MICRO-ARRAYS AND GENE EXPRESSION DATA SETS:
Micro-arrays are slides or membranes carrying numerous probes that represent various genes of a genome. Micro-arrays are hybridized with labelled cDNA synthesized from an mRNA sample. The intensity (radioactive or fluorescent) of each spot on a micro-array indicates the expression of the corresponding gene. One-dye arrays (radioactive label) show the absolute expression level of each gene. Two-dye arrays (fluorescent label) indicate the relative expression level of the same gene in two samples that are labelled with different colours and mixed before hybridization. A universal reference may also be used to compare samples across different arrays. Micro-array gene expression data sets are therefore commonly very large, and their analytical precision is influenced by a number of variables. They may require further processing aimed at reducing the dimensionality of the data to aid comprehension and more focused analysis; normalization methods may be applied in these cases to help reduce the dimensionality.

USES
Micro-arrays are an automated way of carrying out thousands of experiments at once, and allow scientists to obtain huge amounts of information very quickly. In a gene expression profiling experiment the expression levels of thousands of genes are monitored simultaneously to study the effects of certain treatments, diseases and developmental stages on gene expression. For example, micro-array-based gene expression profiling can be used to identify genes whose expression changes in response to pathogens or other organisms by comparing gene expression in infected cells or tissues with that in uninfected ones.

METHODOLOGY

CLUSTERING
A cluster is a group or accumulation of objects with similar attributes. Conditions for clusters: homogeneity within a cluster and heterogeneity with respect to other clusters. Clustering is thus a technique for discovering the data distribution and for feature selection and identification. There are several algorithms.

Partitioning Algorithms

K-MEANS
K-means clustering is a partitional cluster analysis method in which each cluster is represented by the mean value of all the samples within that cluster. Here n observations are partitioned into k clusters. The distance of each object from the mean of its cluster is the Euclidean distance. A small C sketch of these steps is given after the K-medoids description below.
STEPS:
1. K initial means are selected.
2. K clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence is reached.

K-MEDOIDS
K-medoids is a variant of the K-means algorithm. Instead of computing a new mean, each cluster is represented by the object of the cluster closest to its centroid (the medoid).
STEPS:
1. Randomly place the cluster centroids.
2. Assign every object to the cluster with the nearest centroid.
3. Compute the new cluster centroids.
4. Choose the sample nearest to each computed centroid as the new cluster centroid.
5. Repeat (2), (3) and (4) until the centroids become fixed.
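The following is a minimal illustrative sketch, in C, of the K-means steps listed above, assuming two-dimensional toy data, k = 2, initialization from the first two points, and a fixed iteration budget in place of a full convergence test; these choices are placeholders for illustration, not the project's actual implementation.

#include <stdio.h>
#include <math.h>

#define N 8      /* number of observations (toy data)   */
#define K 2      /* number of clusters                  */
#define DIM 2    /* dimensionality of each observation  */
#define ITERS 20 /* fixed iteration budget (assumption) */

/* Euclidean distance between two DIM-dimensional points. */
static double dist(const double *a, const double *b) {
    double s = 0.0;
    for (int d = 0; d < DIM; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
    return sqrt(s);
}

int main(void) {
    /* Toy expression values; each row is a sample. */
    double x[N][DIM] = {
        {1.0, 1.1}, {0.9, 1.0}, {1.2, 0.8}, {1.1, 1.2},
        {5.0, 5.2}, {5.1, 4.9}, {4.8, 5.1}, {5.2, 5.0}
    };
    double mean[K][DIM];
    int label[N];

    /* Step 1: pick K initial means (here simply the first K points). */
    for (int c = 0; c < K; c++)
        for (int d = 0; d < DIM; d++) mean[c][d] = x[c][d];

    for (int it = 0; it < ITERS; it++) {
        /* Step 2: assign every observation to the nearest mean. */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int c = 1; c < K; c++)
                if (dist(x[i], mean[c]) < dist(x[i], mean[best])) best = c;
            label[i] = best;
        }
        /* Step 3: recompute each mean as the centroid of its cluster. */
        for (int c = 0; c < K; c++) {
            double sum[DIM] = {0.0};
            int count = 0;
            for (int i = 0; i < N; i++)
                if (label[i] == c) {
                    for (int d = 0; d < DIM; d++) sum[d] += x[i][d];
                    count++;
                }
            if (count > 0)
                for (int d = 0; d < DIM; d++) mean[c][d] = sum[d] / count;
        }
        /* Step 4: a full implementation would stop when the assignments
           no longer change; here a fixed iteration budget is used. */
    }

    for (int i = 0; i < N; i++)
        printf("sample %d -> cluster %d\n", i, label[i]);
    return 0;
}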
Hierarchical Algorithms

AGNES
Agglomerative Nesting (AGNES) starts by placing each sample in a separate cluster and then merges these clusters into larger ones. This continues until a single cluster is formed or a termination condition is reached. It is a bottom-up approach.
STEPS:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is taken over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }.
5. If all objects are in one cluster, stop. Else, go to step 2.

DIANA
Divisive clustering (DIANA) starts with one large cluster containing all samples and then subdivides it into smaller clusters until each sample forms its own cluster or a termination condition is reached. It is the opposite of AGNES, i.e. a top-down approach.
STEPS:
1. Begin with a single cluster containing all objects, with level L(0) = 0 and sequence number m = 0.
2. Select the cluster in the current clustering whose members are most dissimilar (e.g. the cluster with the largest diameter).
3. Increment the sequence number: m = m + 1. Split the selected cluster into two smaller clusters to form clustering m, setting the level L(m) to the dissimilarity at which the split is made.
4. If every object forms its own cluster, stop. Else, go to step 2.

Density-based Algorithms

DBSCAN
DBSCAN (Density Based Spatial Clustering of Applications with Noise) uses a density-based concept to discover clusters. The key idea of DBSCAN is that, for each object of a cluster, the neighbourhood of a given radius (denoted ε) has to contain at least a minimum number of data objects; in other words, the density of the neighbourhood must exceed a threshold. DBSCAN distinguishes two kinds of objects: core objects, whose neighbourhood reaches the user-specified minimum density, and non-core objects. At every step the algorithm starts with an unclassified object and examines its neighbourhood to see whether it is sufficiently dense; if its density does not exceed the threshold, the object is marked as a noise object. A C sketch of these steps is given after the list of advantages below.
STEPS (e denotes the neighbourhood radius ε):
1. Input the dataset D, e and MinPts.
2. Set the cluster counter C = 0.
3. For each unvisited point P in dataset D, mark P as visited.
4. N = getNeighbors(P, e).
5. If sizeof(N) < MinPts, mark P as NOISE.
6. Else let C be the next cluster.
7. Expand the cluster (P, N, C, e, MinPts): add P to cluster C; for each point P' in N, if P' is not visited, mark P' as visited and compute N' = getNeighbors(P', e); if sizeof(N') >= MinPts, then N = N joined with N'; if P' is not yet a member of any cluster, add P' to cluster C.
The common distance metric for finding neighbours is the Euclidean distance. DBSCAN does not respond well to data sets with varying densities (so-called hierarchical data sets).

ADVANTAGES OF DBSCAN
1. Minimal requirement of domain knowledge to determine the input parameters.
2. Discovery of clusters with arbitrary shape.
3. Good efficiency.
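The C sketch below illustrates the DBSCAN steps above on toy two-dimensional data. The point set and the values of EPS and MINPTS are placeholders chosen for illustration, and the seed list is kept in a fixed-size array rather than a dynamic structure; it is a sketch of the algorithm, not the project's implementation.

#include <stdio.h>
#include <math.h>

#define N 10          /* number of points (toy data)         */
#define EPS 1.0       /* neighbourhood radius e (assumption) */
#define MINPTS 3      /* minimum neighbourhood size          */
#define NOISE -1

static double pt[N][2] = {
    {1.0, 1.0}, {1.2, 0.9}, {0.8, 1.1}, {1.1, 1.2}, {0.9, 0.8},
    {6.0, 6.0}, {6.2, 5.9}, {5.8, 6.1}, {6.1, 6.2}, {9.5, 0.5}
};
static int visited[N];   /* 0 = unvisited, 1 = visited            */
static int label[N];     /* 0 = unassigned, -1 = noise, >0 = id   */

static double dist(int a, int b) {
    double dx = pt[a][0] - pt[b][0], dy = pt[a][1] - pt[b][1];
    return sqrt(dx * dx + dy * dy);
}

/* getNeighbors(P, e): collect indices of all points within EPS of p. */
static int get_neighbors(int p, int *nbr) {
    int n = 0;
    for (int i = 0; i < N; i++)
        if (dist(p, i) <= EPS) nbr[n++] = i;
    return n;
}

/* Expand cluster c starting from core point p with seed list nbr. */
static void expand_cluster(int p, int *nbr, int n, int c) {
    label[p] = c;
    for (int k = 0; k < n; k++) {              /* seed list may grow */
        int q = nbr[k];
        if (!visited[q]) {
            visited[q] = 1;
            int nbr2[N];
            int n2 = get_neighbors(q, nbr2);
            if (n2 >= MINPTS)                  /* q is a core point: N = N joined with N' */
                for (int j = 0; j < n2; j++) {
                    int seen = 0;
                    for (int t = 0; t < n; t++)
                        if (nbr[t] == nbr2[j]) { seen = 1; break; }
                    if (!seen) nbr[n++] = nbr2[j];
                }
        }
        if (label[q] <= 0) label[q] = c;       /* absorb unassigned or noise points */
    }
}

int main(void) {
    int c = 0;
    for (int p = 0; p < N; p++) {
        if (visited[p]) continue;
        visited[p] = 1;
        int nbr[N];
        int n = get_neighbors(p, nbr);
        if (n < MINPTS) { label[p] = NOISE; continue; }  /* mark as noise */
        c++;                                             /* next cluster  */
        expand_cluster(p, nbr, n, c);
    }
    for (int p = 0; p < N; p++) {
        if (label[p] == NOISE) printf("point %d -> noise\n", p);
        else                   printf("point %d -> cluster %d\n", p, label[p]);
    }
    return 0;
}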
FUZZY LOGIC
Fuzzy logic is a multi-valued logic derived from fuzzy set theory that helps in reasoning. In contrast to binary (crisp) logic, which uses binary sets, fuzzy logic gives each sample a membership function with graded membership values. It therefore deals with reasoning that is approximate rather than precise. The fuzzy approach is based on the premise that there are classes of objects in which the transition from membership to non-membership is gradual rather than abrupt.

Membership Function: the membership function in fuzzy logic is a graphical representation of the magnitude of participation of each input. It associates a weighting with each of the inputs that are processed, defines the functional overlap between inputs, and ultimately determines the output response.

How is Fuzzy Logic used?
1. Define the control objectives and criteria.
2. Determine the input and output relationships and choose a minimum number of variables for input to the fuzzy logic engine (typically the error and the rate of change of the error).
3. Using constraints, break the control problem down into a series of IF X AND Y THEN Z rules that define the desired system output response for given system input conditions. The number and complexity of the rules depend on the number of input parameters.
4. Create fuzzy logic membership functions that define the meaning (values) of the input/output terms used in the rules.
5. Create the necessary pre- and post-processing fuzzy logic routines.
6. Test the system, evaluate the results, tune the rules and membership functions, and retest until satisfactory results are obtained.

[Figure: example of fuzzy logic]

An integration of neural networks and fuzzy logic is known as neuro-fuzzy computing. Applying fuzzy logic within neural networks allows us to incorporate the concept of error handling. Neuro-fuzzy hybridization is done broadly in two ways:
1. A neural network equipped with the capability of handling fuzzy information, termed a fuzzy-neural network (FNN).
2. A fuzzy system augmented by neural networks to enhance some of its characteristics such as flexibility, speed and adaptability, termed a neural-fuzzy system (NFS).
In an FNN the input signals and/or connection weights and/or the outputs are fuzzy subsets or membership values of some fuzzy sets; neural networks with fuzzy neurons are also termed FNNs, as they are capable of processing fuzzy information. An NFS, on the other hand, is designed to realize the process of fuzzy reasoning, where the connection weights of the network correspond to the parameters of the fuzzy reasoning.

NEURAL NETWORK APPROACH
A neural network is a machine learning technique and an information processing system. It consists of a massive number of simple processing units with a high degree of interconnection between them, together with a learning rule. The processing units work cooperatively with each other and achieve massively parallel distributed processing. The design and function of neural networks simulate some functionality of biological brains and nervous systems. The network architecture is organized in layers, and there are generally two types of neural networks:
1. Simple feed-forward networks, of which the perceptron is the basic unit; stacked in layers they form multilayer perceptrons. Feed-forward means that there is no feedback to the input. Each connection of the network carries a weight, which encodes its knowledge. The perceptron models a neuron by taking a weighted sum of its inputs and emitting an output of 1 if the sum is greater than some threshold, and 0 otherwise; a perceptron thus computes a binary function of its input. A small C sketch of this computation follows this list.
2. Feed-back or back-propagation networks, also called auto-associative neural networks. Here the feedback is used to reconstruct the input patterns and make them free from error, thus increasing the performance of the network; it is similar to the way that human beings learn from mistakes. Activation flows from the input layer through the hidden layer and then to the output layer, and the knowledge of the network is encoded in the weights. Unlike a perceptron, a back-propagation network starts out with a random set of weights and adjusts them each time from the output: the forward pass presents a sample input to the network to obtain an output, and the backward pass compares the actual output of the forward pass with the target output and computes the error. The error at the output y is then propagated back to adjust the weights w1 and w2 of the two input nodes x1 and x2.
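As a minimal illustration of the perceptron described in item 1 above, the following C sketch computes the weighted sum of two inputs and thresholds it to a binary output; the weights, threshold and inputs are arbitrary placeholder values, not trained parameters.

#include <stdio.h>

/* Perceptron: output 1 if the weighted sum of the inputs exceeds
   the threshold, otherwise 0 (a binary function of the input). */
static int perceptron(const double *x, const double *w, int n, double threshold) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += w[i] * x[i];
    return sum > threshold ? 1 : 0;
}

int main(void) {
    /* Two inputs x1, x2 with weights w1, w2 (placeholder values). */
    double x[2] = {0.7, 0.2};
    double w[2] = {0.5, 0.9};
    double threshold = 0.5;

    printf("perceptron output = %d\n", perceptron(x, w, 2, threshold));
    return 0;
}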
ADVANTAGES OF NEURAL NETWORKS
1. Adaptive learning
2. Self-organization
3. Fault tolerance capabilities
4. Simple computations

PROPOSED METHODOLOGY

System Configuration:
CORE REQUIREMENTS: INTEL XEON PROCESSOR, 1 GB RAM (MINIMUM), 200 GB HDD (MINIMUM)
OPERATING SYSTEM REQUIREMENTS: LINUX OR WINDOWS XP
LANGUAGE REQUIREMENTS: GCC COMPILER, VC++

The Data Set:
A data set is the collection of information on which the analysis of the deviation between diseased and normal genes can be done. The data sets used here are normal and diseased data sets. The rows represent the sample values of a particular gene and the columns represent the gene samples. Our aim is to carry out a comparative study between them to determine the deviation of the diseased genes from the normal ones. Thus there are two data sets under consideration: normal data sets and cancer data sets. Cancer data sets are the diseased or tumour data sets that will be compared against the normal data sets; based on this comparison we will find the genes that have mutated or deviated to cause a particular disease of a genome.

Normalization:
The next step is to reduce the high dimensionality of the data set, both to reduce the computational complexity and to identify the mediating genes correctly. We use an exhaustive method for the normalization of the data.

The Exhaustive Search Method:
STEPS:
1. Map the two data sets (diseased and normal) onto a scale of 0 to 1, where 0 corresponds to the minimum value and 1 corresponds to the maximum value in both data sets.
2. Select the first feature (gene) and plot it on the x co-ordinate.
3. A threshold of 0.5 is taken, which initially divides the x co-ordinate into two equal segments. Recursively keep dividing the segments into halves.
4. Find the mean of each segment of the particular gene.
5. Find the deviation of each point lying within a segment with respect to this mean; this gives the error in that segment.
6. Sum the errors of all the segments to obtain the total error of the chosen gene.
7. Continue the above steps until a user-defined total error limit is reached. If it is exceeded, the process is stopped and the preceding total error value is considered.
8. After step 7, all the segments are traversed to see which segment contains the maximum number of data points. That segment is considered the representative of the gene, i.e. the gene can be expressed by those data values.
After normalization the data sets are ready for use with the proposed neural network and clustering algorithms. A C sketch of one possible reading of the exhaustive search steps follows.
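The following C sketch is one possible reading of the exhaustive search steps above, applied to a single gene: the values are min-max scaled to [0, 1], the axis is split into progressively finer halves, the per-segment error is taken as the sum of squared deviations from the segment mean, splitting stops once the total error falls below a user-defined limit, and the most populated segment is reported as the representative. The toy data, the error limit and the interpretation of the stopping rule are assumptions made for illustration, not the project's actual implementation.

#include <stdio.h>

#define M 12                 /* number of expression values for one gene    */
#define ERROR_LIMIT 0.05     /* user-defined total error limit (assumption) */
#define MAX_LEVEL 6          /* safety bound on the recursive halving       */

int main(void) {
    /* Toy expression values for a single gene. */
    double g[M] = {2.1, 2.3, 2.2, 8.9, 9.1, 2.4, 2.0, 9.0, 2.2, 2.3, 8.8, 2.1};
    double lo = g[0], hi = g[0];

    /* Step 1: min-max scale the gene to [0, 1]. */
    for (int i = 1; i < M; i++) {
        if (g[i] < lo) lo = g[i];
        if (g[i] > hi) hi = g[i];
    }
    for (int i = 0; i < M; i++) g[i] = (g[i] - lo) / (hi - lo);

    int best_seg = 0, nseg = 1;
    /* Steps 3-7: keep halving the [0, 1] axis until the total error
       (sum of squared deviations from each segment mean) is small enough. */
    for (int level = 1; level <= MAX_LEVEL; level++) {
        nseg = 1 << level;                     /* 2, 4, 8, ... segments */
        double width = 1.0 / nseg, total_error = 0.0;
        int best_count = -1;
        for (int s = 0; s < nseg; s++) {
            double sum = 0.0, err = 0.0;
            int count = 0;
            for (int i = 0; i < M; i++)
                if (g[i] >= s * width && (g[i] < (s + 1) * width || s == nseg - 1)) {
                    sum += g[i];
                    count++;
                }
            if (count > 0) {
                double mean = sum / count;     /* Step 4: segment mean */
                for (int i = 0; i < M; i++)    /* Step 5: segment error */
                    if (g[i] >= s * width && (g[i] < (s + 1) * width || s == nseg - 1))
                        err += (g[i] - mean) * (g[i] - mean);
            }
            total_error += err;
            /* Step 8: remember the most populated segment at this level. */
            if (count > best_count) { best_count = count; best_seg = s; }
        }
        printf("level %d: %d segments, total error %.4f\n", level, nseg, total_error);
        if (total_error < ERROR_LIMIT) break;  /* assumed stopping rule */
    }
    printf("representative segment: [%.3f, %.3f)\n",
           (double)best_seg / nseg, (double)(best_seg + 1) / nseg);
    return 0;
}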
The Proposed Neuro-Fuzzy Algorithm

Read the normal data set into normal[i][j] and the diseased data set into diseased[i][j]. Let X_pi denote the value of feature (gene) i in sample p, let W_i be the weight of feature i, and let s be the number of samples.
1. D = sqrt( Σ_i (Xmax_i − Xmin_i)² ), a constant giving the maximum possible distance over all features.
2. d_pq = sqrt( Σ_i (X_pi − X_qi)² ), the distance between samples p and q in the original space.
3. d'_pq = sqrt( Σ_i W_i² (X_pi − X_qi)² ), the corresponding weighted distance.
4. μO_pq = 1 − d_pq / D when d_pq ≤ D, and μO_pq = 0 otherwise: the membership of the pair (p, q) in the original space.
5. μT_pq = 1 − d'_pq / D when d'_pq ≤ D, and μT_pq = 0 otherwise: the membership in the weighted (transformed) space.
6. E = [1 / (s(s−1))] Σ_p Σ_q [ μT_pq (1 − μO_pq) + μO_pq (1 − μT_pq) ].
7. Evaluate μT and μO and differentiate E with respect to the weights, i.e. compute dE/dW.
8. Update each weight W_j using dE/dW_j. If all the genes have not yet been compared, go to step 2; else continue.
9. The objective E is to be minimized.
10. A larger value of W_j indicates that the corresponding gene X_j is more important.

FEATURE SELECTION

Introduction
Simple feature selection algorithms are ad hoc, but there are also more methodical approaches. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is therefore for a satisfactory set of features rather than an optimal one. Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. Subset selection searches the set of possible features for the optimal subset. In statistics, the most popular form of feature selection is stepwise regression, a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm; in machine learning this is typically done by cross-validation, while in statistics some criterion is optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear networks.

Subset selection
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken into wrappers, filters and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on it; they can be computationally expensive and carry a risk of overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter criterion is evaluated. Embedded techniques are embedded in, and specific to, a model. Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and checks whether the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor- (or operator-) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc. Search approaches include:
Exhaustive
Best first
Simulated annealing
Genetic algorithm
Greedy forward selection
Greedy backward elimination
Two popular filter metrics for classification problems are correlation and mutual information, although neither is a true metric or 'distance measure' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance'; they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category.
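As a small illustration of a correlation-based filter score, the following C sketch computes the Pearson correlation between one candidate feature and a binary class label (normal = 0, diseased = 1); the feature values and labels are placeholder toy data chosen for illustration.

#include <stdio.h>
#include <math.h>

#define M 8   /* number of samples (toy data) */

/* Pearson correlation between a feature vector and the class labels,
   used as a filter score: features with |r| close to 1 are ranked high. */
static double pearson(const double *x, const double *y, int n) {
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (int i = 0; i < n; i++) {
        sx += x[i];  sy += y[i];
        sxx += x[i] * x[i];  syy += y[i] * y[i];  sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void) {
    /* Expression of one candidate gene across 8 samples (placeholder). */
    double feature[M] = {2.1, 2.4, 2.2, 2.3, 7.9, 8.1, 7.8, 8.0};
    /* Class labels: 0 = normal, 1 = diseased. */
    double label[M]   = {0, 0, 0, 0, 1, 1, 1, 1};

    printf("correlation score = %.3f\n", pearson(feature, label, M));
    return 0;
}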
Other available filter metrics include:
Class separability
  o Error probability
  o Inter-class distance
  o Probabilistic distance
  o Entropy
Consistency-based feature selection
Correlation-based feature selection

FUTURE SCOPE
The entire methodology will be implemented on the Java platform. We will develop a software tool with some data preprocessing facilities, accessible through the web. Initially we have written some code in the C language on the Linux platform. This software will be treated as a bioinformatics tool for gene expression data analysis and will support both Windows and Linux environments.

GENE IDENTIFICATION & DRUG DISCOVERY
Complete genome sequences have provided a plethora of potential drug targets, but the hard task of finding their weak spots is just beginning. With the complete sequencing of the human genome, one might think that the identification of new potential drug targets would be coming to a halt. Yet there are still many exciting developments to come in the field of target identification, with technological advances enabling challenging biological questions to be addressed in increasingly creative ways. Biosimulation and mathematical modeling are powerful approaches for characterizing complex biological systems and their dynamic evolution. The modeling process enables research scientists to systematically identify critical gaps in their knowledge and explicitly formulate candidate hypotheses to span them. Biosimulations are being used to explore "what if" scenarios that can lead to recommendations for designing the best, most informative 'next experiment'. This targeted approach to assay development, data interpretation and decision making promises to dramatically narrow the 'predictability gap' between drug discovery and clinical development.