ABSTRACT
In this project, we will propose a set of clustering algorithms and a new approach for
the analysis of gene expression data. These algorithms will be applied to gene
expression data to identify the disease-mediating genes for a particular disease of
a genome. Since gene expression data is inherently high-dimensional,
dimensionality reduction algorithms are further used to reduce the dataset. To
do this, we will propose models for dimensionality reduction using various
statistical and mathematical techniques.
The clustering algorithms will group the genes of a genome under its diseased and
normal conditions. The results will be further validated by cluster validity
indices. In this connection we have already implemented some existing clustering
algorithms, mainly based on partitioning, hierarchical, and density-based
approaches. A further study will be needed to compare their results with our
proposed methodology.
Our proposed methodology will incorporate a neural network model for
machine learning. The efficiency of the proposed algorithm will be compared with
that of the existing algorithms.
The genes selected by our method will be further validated biologically by studying
their bio-chemical properties. At the end of the work, we want to regroup our methods
and programs into a software module. This software will be treated as a
bioinformatics tool for gene expression data analysis.
We are thus trying to establish relationships among different genetic data sets,
basically diseased and normal, using neuro-fuzzy logic.
Neuro-fuzzy refers to combinations of artificial neural networks and fuzzy logic.
Neuro-fuzzy hybridization results in a hybrid intelligent system that synergizes these
two techniques by combining the human-like reasoning style of fuzzy systems with
the learning and connectionist structure of neural networks. The term neural network
is largely self-explanatory: such systems are designed to respond like the human
nervous system, using neurons, while fuzzy systems allow the gradual assessment of
the membership of the genes.
However, for that grouping we first need a data set, which is simply the collection
of information on which the analysis of the deviation between diseased and normal
genes can be done. Gene expression is the process by which information from a gene
is used in the manifestation of a condition of a genome. This manifestation of the
gene is recorded in a data set called the gene expression data.
[Figure: PROJECT-MODEL - gene expression data is normalized using the exhaustive
search method and passed to the clustering layer, where the proposed neuro-fuzzy
logic runs alongside existing methods (DBSCAN, K-means, AGNES, DIANA); each branch
then performs gene selection and validation, and the results are compared for
accuracy.]
INTRODUCTION

GENE EXPRESSION DATA
Gene expression is the process by which information from a gene is used in the
synthesis of a functional gene product i.e. it is the conversion of the information from
the gene into mRNA via the transcription process and then to protein via the
translation process resulting in the phenotypic manifestation of the gene.
This manifestation of the gene is recorded in a data set called the gene expression
data. Such data sets enable the parallel execution of experiments on a large number
of genes simultaneously, for example with DNA micro-arrays (DNA chips).
DNA –transcription→ mRNA –translation→ Protein
Transcription + Translation = Gene Expression

MICRO-ARRAY and GENE EXPRESSION DATA SETS:
Micro-arrays are slides or membranes with numerous probes that represent various
genes of some genome.
Micro-arrays are hybridized with labelled cDNA synthesized from mRNA sample.
The intensity (radioactive or fluorescent) of each spot on a micro-array indicates the
expression of each gene.
One-dye arrays (radioactive label) show the absolute expression level of each
gene.
Two-dye arrays (fluorescent label) indicate the relative expression level of the
same gene in two samples that are labelled with different colours and mixed before
hybridization.
There is also a Universal Reference that is used to compare samples in different
arrays.
Thus micro-array gene expression data sets are commonly very large,
and their analytical precision is influenced by a number of variables. They may
require further processing aimed at reducing the dimensionality of the data to aid
comprehension and more focused analysis. Normalization methods may be used in
these cases to reduce the dimensionality.
USES

 Micro-arrays are an automated way of carrying out thousands of experiments
at once, and allow scientists to obtain huge amounts of information very
quickly.
 In a gene expression profiling experiment, the expression levels of thousands
of genes are monitored simultaneously to study the effects of certain treatments,
diseases, and developmental stages on gene expression. For example,
micro-array based gene expression profiling can be used to identify genes
whose expression changes in response to pathogens or other organisms, by
comparing gene expression in infected cells or tissues to that in uninfected ones.
METHODOLOGY
CLUSTERING
A cluster is a group or accumulation of objects with similar attributes.
Conditions for clusters:
 Homogeneity within a cluster.
 Heterogeneity between clusters.
Thus clustering is a technique for discovery of data distribution and feature selection
and identification.
There are several algorithms:
Partitioning Algorithms
 K-MEANS
K-means clustering is a method of partitional cluster analysis where each cluster is
represented by the mean value of all the samples within that cluster. Here n
observations are partitioned into k clusters, and the distance of each object from
the mean of its cluster is the Euclidean distance.
STEPS:
1. K initial means are selected.
2. K clusters are created by associating every observation with the nearest
mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
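A minimal C sketch of these four steps on one-dimensional expression values (the
data, K, and the iteration limit are hypothetical placeholders, not the project's
actual code):

#include <stdio.h>
#include <math.h>

#define N 8   /* number of observations (hypothetical) */
#define K 2   /* number of clusters */

int main(void) {
    double x[N] = {1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 5.1, 1.1};
    double mean[K] = {1.0, 5.0};   /* step 1: select K initial means */
    int label[N];

    for (int iter = 0; iter < 100; iter++) {
        /* step 2: associate every observation with the nearest mean */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int c = 1; c < K; c++)
                if (fabs(x[i] - mean[c]) < fabs(x[i] - mean[best]))
                    best = c;
            label[i] = best;
        }
        /* step 3: the centroid of each cluster becomes the new mean */
        int changed = 0;
        for (int c = 0; c < K; c++) {
            double sum = 0.0; int cnt = 0;
            for (int i = 0; i < N; i++)
                if (label[i] == c) { sum += x[i]; cnt++; }
            if (cnt > 0) {
                double m = sum / cnt;
                if (fabs(m - mean[c]) > 1e-9) changed = 1;
                mean[c] = m;
            }
        }
        if (!changed) break;   /* step 4: repeat until convergence */
    }
    for (int c = 0; c < K; c++) printf("mean[%d] = %g\n", c, mean[c]);
    return 0;
}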
 K-MEDOIDS
It is a variant of the K-means algorithm. Here, instead of keeping the computed
mean itself, we represent each cluster by the member of the cluster closest to
that mean.
STEPS:
1. Randomly place the cluster centroids.
2. Assign every object to the cluster with the nearest centroid.
3. Compute the new cluster centroids.
4. Choose the sample nearest to each computed centroid as the new cluster centroid
(the medoid).
5. Repeat (2), (3) and (4) until the centroids become fixed.
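Only step 4 differs from K-means, so a hedged C fragment of just that step follows,
reusing the x[] and label[] arrays of the K-means sketch above (fabs comes from
math.h, as before):

/* step 4: replace each computed centroid with the cluster member
   closest to it; the returned value is the new medoid */
double nearest_member(const double *x, const int *label, int n,
                      int cluster_id, double centroid) {
    double best = centroid, bestd = -1.0;
    for (int i = 0; i < n; i++) {
        if (label[i] != cluster_id) continue;
        double d = fabs(x[i] - centroid);
        if (bestd < 0.0 || d < bestd) { bestd = d; best = x[i]; }
    }
    return best;
}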
Hierarchical Algorithms
 AGNES
Agglomerative Nesting (AGNES) starts by placing each sample in a separate
cluster and then merges these clusters into larger ones. This continues until a
single cluster is formed or a termination condition is reached. It is a bottom-up
approach.
STEPS:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number
m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r),
(s), according to d [(r),(s)] = min d [(i),(j)]. Where the minimum is over all pairs
of clusters in the current clustering.
3. Increment the sequence number: m = m +1. Merge clusters (r) and (s) into a
single cluster to form the next clustering m. Set the level of this clustering to
L(m) = d[(r),(s)]
4. Update the proximity matrix, D, by deleting the rows and columns
corresponding to clusters (r) and (s) and adding a row and column
corresponding to the newly formed cluster. The proximity between the new
cluster, denoted (r,s) and old cluster (k) is defined in this way:
d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }
5. If all objects are in one cluster, stop. Else, go to step 2.
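A compact C sketch of steps 2-5 with the single-linkage update of step 4 (the
proximity matrix values are a hypothetical example):

#include <stdio.h>
#include <math.h>

#define N 4   /* number of samples (hypothetical) */

int main(void) {
    /* proximity matrix D between the N initial singleton clusters */
    double d[N][N] = {{0,2,6,10},{2,0,5,9},{6,5,0,4},{10,9,4,0}};
    int active[N] = {1,1,1,1};

    for (int m = 1; m < N; m++) {            /* N-1 merges in total */
        int r = -1, s = -1;
        /* step 2: find the least dissimilar pair of active clusters */
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (active[i] && active[j] &&
                    (r < 0 || d[i][j] < d[r][s])) { r = i; s = j; }
        /* step 3: merge (r) and (s) at level L(m) = d[(r),(s)] */
        printf("level L(%d) = %g: merge clusters %d and %d\n", m, d[r][s], r, s);
        /* step 4: d[(k),(r,s)] = min of d[(k),(r)] and d[(k),(s)] */
        for (int k = 0; k < N; k++)
            if (active[k] && k != r && k != s)
                d[k][r] = d[r][k] = fmin(d[k][r], d[k][s]);
        active[s] = 0;   /* cluster s is absorbed into r */
    }
    return 0;
}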
 DIANA
Divisive Analysis (DIANA) starts with one large cluster and then subdivides it into
smaller clusters until each sample forms its own cluster or a termination condition
is reached. It is the opposite of AGNES: a top-down approach.
STEPS:
1. Begin with the single clustering having level L(0) = 0 and sequence number m
= 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r),
(s), according to d [(r),(s)] = min d [(i),(j)]. Where the minimum is over all pairs
of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Split the selected cluster into two
smaller clusters to form the next clustering m. Set the level of this clustering to
L(m) = d[(r),(s)].
4. If every object has been split into its own cluster, stop. Else, go to step 2.
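A simplified C sketch of one divisive split: seed a splinter group with the object
of largest average dissimilarity, then move over every object that is on average
closer to the splinter group. (Classic DIANA moves one object at a time by the
largest difference; this greedy variant and the matrix are illustrative
assumptions.)

#include <stdio.h>

#define N 4
double d[N][N] = {{0,2,6,10},{2,0,5,9},{6,5,0,4},{10,9,4,0}};

int main(void) {
    int splinter[N] = {0};          /* 1 = moved to the splinter group */

    /* seed: the object with the largest average dissimilarity */
    int seed = 0; double best = -1.0;
    for (int i = 0; i < N; i++) {
        double avg = 0.0;
        for (int j = 0; j < N; j++) avg += d[i][j];
        avg /= (N - 1);
        if (avg > best) { best = avg; seed = i; }
    }
    splinter[seed] = 1;

    /* move objects that are on average closer to the splinter group */
    for (int moved = 1; moved; ) {
        moved = 0;
        for (int i = 0; i < N; i++) {
            if (splinter[i]) continue;
            double din = 0, dout = 0; int nin = 0, nout = 0;
            for (int j = 0; j < N; j++) {
                if (j == i) continue;
                if (splinter[j]) { dout += d[i][j]; nout++; }
                else             { din  += d[i][j]; nin++;  }
            }
            if (nout && (nin == 0 || dout / nout < din / nin)) {
                splinter[i] = 1; moved = 1;
            }
        }
    }
    for (int i = 0; i < N; i++)
        printf("object %d -> group %d\n", i, splinter[i]);
    return 0;
}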
Density-based Algorithms
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses a
density-based concept to discover clusters. The key idea of DBSCAN is that, for
each object of a cluster, the neighbourhood of a given radius (denoted ε) has to
contain at least a minimum number of data objects. In other words, the density of
the neighbourhood must exceed a threshold.
DBSCAN distinguishes two kinds of objects:
 Core objects: objects whose neighbourhood has at least the user-specified
minimum density.
 Non-core objects: all the rest.
At every step, the algorithm starts with an unclassified object and studies its
neighbourhood to see whether it is adequately dense. If the density does not exceed
the threshold, the object is marked as a noise object.
STEPS:
e = neighbourhood radius
1. Input the dataset D, e and MinPts.
2. Set cluster C = 0.
3. For each unvisited point P in dataset D, mark P as visited.
4. N = getNeighbors(P, e).
5. If sizeof(N) < MinPts, then mark P as NOISE.
6. Else C = next cluster.
7. Expand the cluster (P, N, C, e, MinPts). This is done by adding P to cluster C.
For each point P' in N, if P' is not visited, mark P' as visited.
N' = getNeighbors(P', e).
If sizeof(N') >= MinPts, then N = N joined with N'.
If P' is not yet a member of any cluster, add P' to cluster C.
The common distance metric for getting neighbours is the Euclidean distance.
DBSCAN does not respond well to data sets with varying densities (called
hierarchical data sets).
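A self-contained C sketch of the steps above on one-dimensional data, with an
O(n²) neighbour search over the Euclidean distance (EPS, MINPTS and the data are
hypothetical):

#include <stdio.h>
#include <math.h>

#define N      10
#define EPS    1.0   /* e: neighbourhood radius (hypothetical) */
#define MINPTS 3     /* minimum density */

double x[N] = {1.0, 1.3, 1.6, 2.0, 8.0, 8.2, 8.5, 8.9, 15.0, 2.2};
int cluster[N];      /* 0 = unvisited, -1 = noise, >0 = cluster id */

/* getNeighbors(P, e): indices within EPS of point p */
int get_neighbors(int p, int *nbr) {
    int n = 0;
    for (int i = 0; i < N; i++)
        if (fabs(x[i] - x[p]) <= EPS) nbr[n++] = i;
    return n;
}

int main(void) {
    int c = 0;                                         /* step 2: C = 0 */
    for (int p = 0; p < N; p++) {                      /* step 3 */
        if (cluster[p] != 0) continue;
        int seeds[N], queued[N] = {0};
        int n = get_neighbors(p, seeds);               /* step 4 */
        if (n < MINPTS) { cluster[p] = -1; continue; } /* step 5: NOISE */
        c++;                                           /* step 6: next cluster */
        for (int k = 0; k < n; k++) queued[seeds[k]] = 1;
        for (int k = 0; k < n; k++) {                  /* step 7: expand C */
            int q = seeds[k];
            if (cluster[q] == -1) cluster[q] = c;      /* noise becomes border */
            if (cluster[q] != 0) continue;             /* already classified */
            cluster[q] = c;
            int nbr[N], n2 = get_neighbors(q, nbr);
            if (n2 >= MINPTS)                          /* q is a core object */
                for (int t = 0; t < n2; t++)           /* N = N joined with N' */
                    if (!queued[nbr[t]]) { queued[nbr[t]] = 1; seeds[n++] = nbr[t]; }
        }
    }
    for (int i = 0; i < N; i++)
        printf("x = %.1f -> cluster %d\n", x[i], cluster[i]);
    return 0;
}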
ADVANTAGES OF DBSCAN
1. Has minimum requirement of domain knowledge to determine input parameters.
2. Discovery of clusters with arbitrary shape.
3. Good efficiency.
FUZZY LOGIC
Fuzzy logic is a multi-valued logic derived from fuzzy set theory that supports
reasoning. In contrast with binary (crisp) logic, which has binary sets, in fuzzy
logic samples have membership functions with graded membership values. Thus it
deals with reasoning that is approximate rather than precise.
The fuzzy approach is based on the premise that there are classes of objects in
which the transition from membership to non-membership is gradual rather than
abrupt.
 Membership Function:
The membership function in fuzzy logic is a graphical representation of the
magnitude of participation of each input. It associates a weighting with each of the
inputs that are processed, defines the functional overlap between inputs, and
ultimately determines an output response.
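For instance, a triangular membership function might be sketched in C as follows
(the shape and the breakpoints a, b, c are illustrative assumptions):

/* degree of membership of value v in a fuzzy set rising from a to
   full membership at b and falling back to zero at c */
double tri_membership(double v, double a, double b, double c) {
    if (v <= a || v >= c) return 0.0;
    if (v <= b) return (v - a) / (b - a);   /* rising edge */
    return (c - v) / (c - b);               /* falling edge */
}
/* e.g. tri_membership(0.6, 0.0, 0.5, 1.0) = 0.8: the value 0.6
   belongs to this set with grade 0.8 */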
How is Fuzzy Logic used?
1. Define the control objectives and criteria.
2. Determine the input and output relationships and choose a minimum
number of variables for input to the Fuzzy Logic engine (typically error and
rate-of-change-of-error).
3. Using constraints, break the control problem down into a series of IF X
AND Y THEN Z rules that define the desired system output response for
given system input conditions. The number and complexity of the rules depend
on the number of input parameters.
4. Create Fuzzy Logic membership functions that define the meaning (values)
of Input/Output terms used in the rules.
5. Create the necessary pre- and post-processing Fuzzy Logic routines.
6. Test the system, evaluate the results, tune the rules and membership
functions, and retest until satisfactory results are obtained.
Example of Fuzzy-Logic
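A minimal sketch of a single IF X AND Y THEN Z rule, reusing the tri_membership
function sketched above; the fuzzy sets, the values, and the choice of minimum for
AND are illustrative assumptions:

/* inside some function:
   Rule: IF expression is HIGH AND deviation is LARGE THEN relevance is HIGH */
double high      = tri_membership(0.8, 0.5, 1.0, 1.5);  /* grade 0.6 */
double large     = tri_membership(0.7, 0.4, 1.0, 1.6);  /* grade 0.5 */
double relevance = high < large ? high : large;         /* fuzzy AND = min: 0.5 */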
An integration of neural networks and fuzzy logic is known as neuro-fuzzy
computing. Applying fuzzy logic in neural networks allows us to incorporate the
concept of error handling.
Neuro-fuzzy hybridization is done broadly in two ways:
1. A neural network equipped with the capability of handling fuzzy information
[termed fuzzy-neural network (FNN)]
2. A fuzzy system augmented by neural networks to enhance some of its
characteristics like flexibility, speed, and adaptability [termed neural-fuzzy
system (NFS)]
In an FNN the input signals and/or connection weights and/or the outputs are fuzzy
subsets or membership values to some fuzzy sets. Neural networks with fuzzy
neurons are also termed FNN as they are capable of processing fuzzy information.
A NFS, on the other hand, is designed to realize the process of fuzzy reasoning,
where the connection weights of the network correspond to the parameters of fuzzy
reasoning.
NEURAL NETWORK APPROACH
A neural network is a machine learning technique and an information
processing system. It consists of a large number of simple processing units with a
high degree of interconnection, together with a learning rule. The processing
units work cooperatively and achieve massively parallel distributed
processing. The design and function of neural networks simulate some functionality
of biological brains and nervous systems.
[Figure: the layered architecture of a neural network]
There are generally two types of neural networks:
1. Simple neural networks called perceptrons. They are also called Feed-Forward
networks or Multilayer Perceptrons.
Feed-forward means that there is no feedback to the input. Each connection in the
network has a weight assigned to it as its knowledge. The perceptron models a
neuron by taking a weighted sum of its inputs and sending an output of 1 if the sum
is greater than some threshold, else an output of 0.
A perceptron thus computes a binary function of its input.
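A minimal C sketch of this binary computation (the weights and threshold would be
supplied by training and are hypothetical here):

/* weighted sum of the n inputs followed by a hard threshold */
int perceptron(const double *x, const double *w, int n, double threshold) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += w[i] * x[i];
    return sum > threshold ? 1 : 0;   /* binary output */
}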
2. Feed-Back networks or Back-propagation networks. They are also called
Auto-associative Neural Networks.
Here the feedback is used to reconstruct the input patterns and make them free from
error, thus increasing the performance of the neural network; this is similar to the
way human beings learn from mistakes.
Activation flows from the input layer through the hidden layer and then to the
output layer. The knowledge of a network is encoded in the weights. However,
unlike perceptrons, a back-propagation network starts out with a random set of
weights and adjusts them each time based on the output. Thus the
forward pass involves presenting a sample input to the network to get an output,
and the backward pass involves comparing the actual output (from the forward
pass) with the target output and computing the error.
The error at the output y is fed back to x1 and x2 to adjust the weights w1 and w2,
respectively, of the two nodes.
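A sketch of this idea for a single linear neuron with two inputs (the delta rule; a
full back-propagation network adds hidden layers and activation-function
derivatives, and eta is a hypothetical learning rate):

double w1 = 0.1, w2 = -0.2;            /* random initial weights */

void train_step(double x1, double x2, double target, double eta) {
    double y = w1 * x1 + w2 * x2;      /* forward pass: actual output */
    double error = target - y;         /* backward pass: compare with target */
    w1 += eta * error * x1;            /* adjust each weight from the error */
    w2 += eta * error * x2;
}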
ADVANTAGES OF NEURAL NETWORKS
1. Adaptive learning
2. Self-organization
3. Fault tolerance capabilities
4. Simple computations
PROPOSED METHODOLOGY
System Configuration:
 CORE REQUIREMENTS: INTEL XEON PROCESSOR, 1 GB RAM (MINIMUM), 200 GB HDD (MINIMUM)
 OPERATING SYSTEM REQUIREMENTS: LINUX OR WINDOWS XP
 LANGUAGE REQUIREMENTS: GCC COMPILER, VC++
The Data Set:
A data set is the collection of information on which the analysis of the deviation of
diseased and normal genes can be done.
The given data sets that have been used are normal and diseased data sets. The
rows represent the sample values of a particular gene and the columns represent the
gene samples.
Our aim is to do a comparative study between them to determine the deviation of the
diseased genes from the normal ones.
Thus there are two data sets under consideration:
 Normal Data Sets
 Cancer Data Sets
Cancer data sets are the diseased or tumour data sets which will be compared
against the normal data set. Based on this we will find out the genes that have been
mutated or deviated to cause a particular disease of a genome.
Normalization:
The next step is then to reduce the high dimensionality of the data set, both to
reduce the code complexity and to identify the mediating genes correctly. We use an
exhaustive search method for the normalization of the data.
The Exhaustive Search Method:
STEPS:
1. Map the two data sets (diseased and normal) between scales of 0 to 1. Zero
corresponds to the minimum value and 1 corresponds to the maximum value
in both the data sets.
2. Select the first feature (gene) and plot it on the x co-ordinate.
3. A threshold of 0.5 is taken, which initially divides the x co-ordinate into two
equal segments. Recursively keep dividing the segments into halves.
4. Find the mean of each segment of the particular gene.
5. Find the deviation of each point lying within the segment with respect to this
mean. This gives the error in that segment.
6. Find the error in all the segments which is given as the total error of the
chosen gene.
7. Continue the above steps until a user-defined total error limit is exceeded; the
process then stops and the preceding total error value (and its segmentation) is
kept.
8. After step 7, all the segments are traversed to see which segment has the
maximum number of data points. That segment is considered the representative of
the gene, containing the representative data, i.e. that particular gene can be
expressed by those data values.
After normalization the data sets are now ready for use on the proposed neural
network and clustering algorithms.
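A minimal C sketch of steps 1 and 4-6 of the exhaustive search method above (the
squared deviation, the fixed array capacities, and all names are assumptions for
illustration):

#include <math.h>

/* step 1: map a gene's n sample values onto the [0, 1] scale */
void minmax_scale(double *x, int n) {
    double lo = x[0], hi = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    double range = hi - lo;
    for (int i = 0; i < n; i++)
        x[i] = range > 0.0 ? (x[i] - lo) / range : 0.0;
}

/* steps 4-6: split [0, 1] into nseg equal segments, take each
   segment's mean, and accumulate every point's squared deviation
   from its segment mean to get the gene's total error
   (sketch limits: nseg <= 64, n <= 1024) */
double total_error(const double *x, int n, int nseg) {
    double sum[64] = {0}, err = 0.0;
    int cnt[64] = {0}, seg[1024];
    for (int i = 0; i < n; i++) {
        int s = (int)(x[i] * nseg);
        if (s == nseg) s--;          /* the value 1.0 falls in the last segment */
        seg[i] = s; sum[s] += x[i]; cnt[s]++;
    }
    for (int i = 0; i < n; i++) {
        double mean = sum[seg[i]] / cnt[seg[i]];   /* step 4: segment mean */
        err += (x[i] - mean) * (x[i] - mean);      /* step 5: deviation */
    }
    return err;                                    /* step 6: total error */
}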
The proposed Neuro-Fuzzy Algorithm
READ NORMAL DATA SET INTO normal[i][j]
READ DISEASED DATA SET INTO diseased[i][j]
1. D = [Σ_i (Xmax_i - Xmin_i)²]^(1/2), where D is a constant (the maximum possible
distance).
2. d_pq = [Σ_i (X_pi - X_qi)²]^(1/2), the distance between samples p and q in the
original space.
3. dd_pq = [Σ_i W_i² (X_pi - X_qi)²]^(1/2), the weighted distance in the
transformed space.
4. u0_pq = 1 - d_pq/D and uT_pq = 1 - dd_pq/D, where the distance is <= D.
5. u_pq = 0 where the distance is > D.
6. E = [1/(s(s-1))] Σ_{p≠q} [uT_pq (1 - u0_pq) + u0_pq (1 - uT_pq)].
7. After evaluating the values of uT and u0, take the derivative of E with respect
to W, i.e. dE/dW.
8. Update W_j by dE/dW_j. If all the gene pairs have not been compared, go to
step 2; else continue.
9. The objective E is to be minimized.
10. A bigger value of W_j indicates that the corresponding gene X_j is more
important.
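A minimal C sketch of this evaluation index under the reconstruction above (S
samples, M genes; the numerical gradient and all identifiers are illustrative
assumptions, not the project's implementation):

#include <math.h>

#define S 6   /* number of samples (hypothetical) */
#define M 4   /* number of genes (hypothetical) */

/* steps 4-5: membership u = 1 - dist/D, clamped at 0 beyond D */
static double membership(double dist, double D) {
    double u = 1.0 - dist / D;
    return u > 0.0 ? u : 0.0;
}

/* step 6: E = [1/(s(s-1))] sum over pairs of uT(1-u0) + u0(1-uT) */
static double eval_E(double x[S][M], const double W[M], double D) {
    double E = 0.0;
    for (int p = 0; p < S; p++)
        for (int q = p + 1; q < S; q++) {
            double d0 = 0.0, dT = 0.0;
            for (int i = 0; i < M; i++) {
                double diff = x[p][i] - x[q][i];
                d0 += diff * diff;                 /* original space */
                dT += W[i] * W[i] * diff * diff;   /* weighted space */
            }
            double u0 = membership(sqrt(d0), D);
            double uT = membership(sqrt(dT), D);
            E += uT * (1.0 - u0) + u0 * (1.0 - uT);
        }
    return 2.0 * E / (S * (S - 1));   /* each unordered pair counted once */
}

/* steps 7-9: one gradient-descent step on W[j], with dE/dW[j]
   estimated numerically so that E decreases */
static void update_weight(double x[S][M], double W[M], int j,
                          double D, double eta) {
    const double h = 1e-4;
    double e0 = eval_E(x, W, D);
    W[j] += h;
    double e1 = eval_E(x, W, D);
    W[j] -= h;
    W[j] -= eta * (e1 - e0) / h;
}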
FEATURE SELECTION
Introduction
Simple feature selection algorithms are ad hoc, but there are also more
methodical approaches. From a theoretical perspective, it can be shown
that optimal feature selection for supervised learning problems requires
an exhaustive search of all possible subsets of features of the chosen
cardinality. If large numbers of features are available, this is impractical.
For practical supervised learning algorithms, the search is for a
satisfactory set of features instead of an optimal set.
Feature selection algorithms typically fall into two categories: feature
ranking and subset selection. Feature ranking ranks the features by a
metric and eliminates all features that do not achieve an adequate score.
Subset selection searches the set of possible features for the optimal
subset.
In statistics, the most popular form of feature selection is stepwise
regression. It is a greedy algorithm that adds the best feature (or deletes
the worst feature) at each round. The main control issue is deciding
when to stop the algorithm. In machine learning, this is typically done by
cross-validation. In statistics, some criteria are optimized. This leads to
the inherent problem of nesting. More robust methods have been
explored, such as branch and bound and piecewise linear network.
Subset selection
Subset selection evaluates a subset of features as a group for suitability.
Subset selection algorithms can be broken into Wrappers, Filters and
Embedded. Wrappers use a search algorithm to search through the
space of possible features and evaluate each subset by running a model
on the subset. Wrappers can be computationally expensive and risk
overfitting to the model. Filters are similar to Wrappers in the
search approach, but instead of evaluating against a model, a simpler
filter is evaluated. Embedded techniques are embedded in and specific
to a model.
Many popular search approaches use greedy hill climbing, which
iteratively evaluates a candidate subset of features, then modifies the
subset and evaluates if the new subset is an improvement over the old.
Evaluation of the subsets requires a scoring metric that grades a subset
of features. Exhaustive search is generally impractical, so at some
implementor (or operator) defined stopping point, the subset of features
with the highest score discovered up to that point is selected as the
satisfactory feature subset. The stopping criterion varies by algorithm;
possible criteria include: a subset score exceeds a threshold, a
program's maximum allowed run time has been surpassed, etc.
Search approaches include:
 Exhaustive
 Best first
 Simulated annealing
 Genetic algorithm
 Greedy forward selection
 Greedy backward elimination
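As an illustration of one of these approaches, a hedged C sketch of greedy forward
selection, where score() is any caller-supplied metric that grades a feature
subset (names and the callback shape are assumptions):

#include <string.h>

/* starting from the empty set, repeatedly add the single feature
   that most improves the score until no addition helps;
   sel[i] = 1 marks feature i as chosen */
void forward_select(int nfeat, double (*score)(const int *sel), int *sel) {
    memset(sel, 0, nfeat * sizeof(int));
    double best = score(sel);
    for (;;) {
        int pick = -1;
        double bestnew = best;
        for (int i = 0; i < nfeat; i++) {
            if (sel[i]) continue;
            sel[i] = 1;                  /* try adding feature i */
            double s = score(sel);
            sel[i] = 0;
            if (s > bestnew) { bestnew = s; pick = i; }
        }
        if (pick < 0) break;             /* stopping criterion: no improvement */
        sel[pick] = 1;
        best = bestnew;
    }
}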
Two popular filter metrics for classification problems are correlation and
mutual information, although neither are true metrics or 'distance
measures' in the mathematical sense, since they fail to obey the triangle
inequality and thus do not compute any actual 'distance' – they should
rather be regarded as 'scores'. These scores are computed between a
candidate feature (or set of features) and the desired output category.
Other available filter metrics include:
 Class separability
   o Error probability
   o Inter-class distance
   o Probabilistic distance
   o Entropy
 Consistency-based feature selection
 Correlation-based feature selection
FUTURE SCOPE
The entire methodology will be implemented on the Java platform. We will
develop a software tool with some data preprocessing tools, accessible
through the web. Initially we have written some code in the C
language on the LINUX platform. This software will be treated as a
bioinformatics tool for gene expression data analysis. It will support
both the Windows and LINUX environments.
GENE IDENTIFICATION & DRUG DISCOVERY
Complete genome sequences have provided a plethora of potential drug
targets, but the hard task of finding their weak spots is just
beginning. With the complete sequencing of the human genome, one
might think that the identification of new potential drug targets would be
coming to a halt. Yet there are still many exciting developments to come
in the field of target identification, with technological advances enabling
challenging biological questions to be addressed in increasingly creative
ways.
Biosimulation and mathematical modeling are powerful approaches for
characterizing complex biological systems and their dynamic evolution.
The modeling process enables research scientists to systematically
identify critical gaps in their knowledge and explicitly formulate candidate
hypotheses to span them. Biosimulations are being used to explore
“what if” scenarios that can lead to recommendations for designing the
best, most informative ‘next experiment.’ This targeted approach to
assay development, data interpretation and decision making promises to
dramatically narrow the ‘predictability gap’ between drug discovery and
clinical development.