International Journal of Computer Engineering & Technology (IJCET) Volume 10, Issue 1, January – February 2019, pp. 38–47, Article ID: IJCET_10_01_005

REVIEW ON CLUSTERING CANCER GENES

Prabhuraj
Assistant Professor, Dept. of Computer Science & Engineering
EPCET, Bengaluru, Karnataka, India

Dr. P.M Mallikarjuna Shastry
Professor, School of Computing & Information Technology, Reva University,
Kattigenahalli, Yelahanka, Bengaluru, Karnataka, India

Dr. S.S Patil
Professor & Head, University Head, Dept. of Agriculture Statistics, Applied Mathematics & Computer Science
UAS, GKVK, Bengaluru, Karnataka, India

ABSTRACT

Recent developments in genomic technologies have concentrated on large-scale gene data. In the Bioinformatics community, investigating this sizable volume of gene data and distinguishing the behavior of genes under contrasting conditions are challenging tasks. These tasks can be addressed by clustering techniques, which group similar patterns across various features. Moreover, gene expression data indicate the differing levels of gene behavior in various tissue cells and provide feature information effectively. Gene clustering is accurate and helpful in cancer detection because it makes it easier to distinguish cancerous from non-cancerous genes. Early and precise cancer diagnosis is crucial for cancer prevention and treatment. The existing cancer gene clustering techniques have several limitations, such as the time complexity of training and testing on samples, a large number of redundant features, and high dimensional data. These issues severely affect clustering accuracy. This paper surveys various clustering techniques for cancer genes with respect to benchmark cancer gene datasets. Furthermore, the review of existing cancer gene clustering techniques describes their advantages and limitations comprehensively.

Key words: Bioinformatics, Cancer, Clustering technique, Gene expression.

Cite this Article: Prabhuraj, P.M Mallikarjuna Shastry, S.S Patil, Review on Clustering Cancer Genes. International Journal of Computer Engineering and Technology, 10(1), 2019, pp. 38–47.

1. INTRODUCTION

The field of Bioinformatics draws on several disciplines, such as molecular biology, genetics, mathematics, and data-intensive computing. Bioinformatics provides detailed information about specific biological elements, which is usually represented in the form of sequences [1].
At present, genomic sequence data are growing exponentially, and analyzing such a vast cancer genome is demanding. In order to analyze and handle such biological information, several data clustering techniques are utilized [2]. Clustering approaches divide the sequence data into groups, and those groups help to predict gene functions [3]. A clustering technique depends heavily on the distance and similarity between data items. Clustering is applied in different applications such as image segmentation, pattern recognition, and web search [4]. In the medical field, a significant objective of clustering is to reveal the structure of poorly understood molecules, to determine intrinsic hidden patterns, and to find links between molecules. This information helps to identify patterns for diagnosis and treatment [5].

In micro-array technology, genes store significant biological information about each living organism. Gene data analysis discovers the similarity between genes based on their functions and expression values, which are available in the Gene Ontology databases [6-7]. Numerous researchers have used clustering techniques to analyze gene activities and cancer topologies. Cancer gene-based clustering algorithms group thousands of genes into several smaller clusters to find the different levels of gene expression, which is useful for understanding the functions of many genes. Sample-based clustering methods cluster samples that have similar expression patterns to facilitate the discovery of new tumor types [8]. Cancer gene clustering techniques are classified into two types: supervised and unsupervised clustering methods. Unsupervised clustering partitions a set of data items into groups whose members are similar according to particular criteria.
In supervised clustering, the actual class labels of some data points are used to construct a model, which is then used to assign class labels to unknown samples [9]. Several standard clustering algorithms, such as K-means (KM), Fuzzy C-Means (FCM), Self-Organizing Maps (SOM), and Genetic Algorithm-based (GA) clustering, have been utilized for clustering gene expression data [10].

2. TAXONOMY OF GENE EXPRESSION CLUSTERING

Genes are small segments of a chromosome that carry the encapsulated data responsible for generating proteins. Gene sequences vary widely in length within chromosomes, and some genes share specific functions. Figure 1 shows the overall structures of a chromosome and a gene, which are formed by a string of nucleotides A, C, G, T corresponding to the adenine, cytosine, guanine and thymine bases. The analysis of DNA sequences is a crucial application area in computational biology, and finding similarities between genes and DNA subsequences provides essential knowledge of their structures and functions. Figure 1(a) shows a sample image of a human chromosome, which is made of DNA sequences that contain genes, and Figure 1(b) shows a gene sequence.

Figure 1 (a) Structure of human chromosome (b) Gene Sequence

2.1. Significant Gene Sequence Clustering Techniques

Generally, clustering is defined as grouping objects so that objects in the same cluster are similar to one another and dissimilar objects are placed in other clusters. Because a group of data items is represented collectively by a single cluster, clustering is also known as data compression. Clustering algorithms are based on the distance measured between two objects. Basically, the goal is to minimize the distance of every object from the center of the cluster to which the object belongs.
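
The goal stated above, minimizing the distance of every object from the center of its own cluster, can be illustrated with a minimal K-Means style iteration. This is only a sketch under stated assumptions: the function names, the two-dimensional toy expression profiles, and the choice of the first and fourth profile as starting centers are hypothetical and are not taken from any of the surveyed papers.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (empty clusters stay put)."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            moved.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            moved.append(old)
    return moved

def within_cluster_sse(points, labels, centroids):
    """Clustering criterion: total squared distance of objects to their centers."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical toy data: six genes measured under two conditions.
profiles = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8),
            (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]
labels, centroids = k_means(profiles, centroids=[profiles[0], profiles[3]])
print(labels, within_cluster_sse(profiles, labels, centroids))
```

The starting centers are fixed here so that the run is repeatable; with randomly chosen starting centers the same data can end in different partitions.
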
In general, the major clustering methods can be classified into several categories, which are explained below.

Partition technique: This clustering technique segments the data objects into non-overlapping clusters, so that every data object belongs to exactly one subset. Moreover, the partition assigns a cluster index to every data object. In this technique, every cluster is represented by a centroid (for example, in K-Means), by neighboring data points, or by a medoid. Generally, a number of initial cluster centers are selected randomly, and the data points are then reassigned in order to optimize the clustering criterion. Advantage: The major benefit of this approach is a reduced mean squared error between the data points and their cluster centers. Disadvantage: The significant limitation of K-Means is that it can produce many different possible solutions [11].

Hierarchical technique: This technique is extensively employed to identify clusters in genomic data. A sequence of divisions or merges generates the cluster hierarchy based on a linkage criterion such as single linkage, complete linkage or average linkage. The cluster hierarchy is also known as a cluster tree or dendrogram. Hierarchical clustering methods are of two types: the agglomerative (bottom-up) approach and the divisive (top-down) approach. The bottom-up technique starts with single-object clusters and then repeatedly merges the most similar clusters. This process continues until a stopping criterion is reached. Disadvantage: The major limitation of hierarchical clustering is that the time complexity grows as the size of the cluster tree increases [12].

Density based technique: In this approach, clusters are dense regions of objects in the data space that are separated by low density regions, where density is defined by the criterion that each point of a cluster must have at least a minimum number of points in its neighborhood.
The density based clustering technique operates on the dense regions of objects in the data space, for example in the DBSCAN-KM algorithm. Advantage: The major point of the DBSCAN-KM algorithm is its use of a constant radius: the neighborhood of every instance must enclose a minimum number of objects. While counting the objects, the neighborhood density of every object is computed without discretization. Disadvantage: The significant limitation of this technique is that it cannot cluster accurately when the gene densities differ widely [13].

Graph based clustering: In this kind of clustering technique, a graph structure is formed by a set of vertices and the edges that connect pairs of vertices. The technique represents the data as a graph, and the graph is then clustered. Within each cluster the graph has many edges, while only a few edges run between clusters. Advantage: The graph based method reduces information loss and uses a minimal sample; hence, the running time is decreased. Disadvantage: The algorithm seems to work well for randomly scattered data points, but it cannot properly derive regular geometrical patterns and, in such cases, produces clusters with irregular structures [14].

Evolutionary Clustering Technique (ECT): This technique concentrates on resolving the time complexity of clustering data. The ECT includes two significant optimization criteria: (i) the clustering of the data at any time should represent the appropriate clusters, and (ii) the clustering should not shift from one time period to another. Advantage: The ECT effectively removes noise and yields highly consistent clusters with better correspondence. Disadvantage: For high dimensional data, a long time is required to search for the optimal solution in the search space [15-16].
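
The density criterion described above for the density based technique (a constant radius and a minimum number of neighbouring objects) can be sketched as follows. This is a plain DBSCAN-style labelling written for illustration, not the DBSCAN-KM algorithm of [13]; the function names, the toy points, and the values chosen for `radius` and `min_pts` are assumptions. Note that this sketch does not count a point as its own neighbour, whereas some formulations do.

```python
import math

NOISE = -1

def neighbours(points, i, radius):
    """Indices of all other points within `radius` of point i."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= radius]

def dbscan(points, radius, min_pts):
    """Label each point with a cluster index, or NOISE if it is in no dense region."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(points, i, radius)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may be relabelled later as a border point
            continue
        labels[i] = cluster                # i is a core point: start a new cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(points, j, radius)
            if len(more) >= min_pts:       # j is also a core point: keep expanding
                seeds.extend(more)
        cluster += 1
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (10, 0)]
result = dbscan(points, radius=1.5, min_pts=2)
print(result)
```

Because a single radius is used for all points, groups whose densities differ strongly cannot all be captured with one setting, which is the limitation noted for this family of methods.
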

Ensemble Clustering Technique: It is a popular way of combining several clustering strategies to overcome the instabilities of individual algorithms. It scales linearly with the number of data points and the number of repetitions, which makes it feasible to apply to large data sets. Advantage: The technique also improves the ability of a clustering algorithm to find structure in a data set, as it can find clusters of any shape. Disadvantage: Certain cluster structures are mispredicted [17].

3. ANALYSIS OF CANCER GENE CLUSTERING TECHNIQUES

Cancer is a severe disease that is difficult to detect accurately; hence, the detection of cancer is a significant phase for diagnosis as well as treatment. The various kinds of cancer are classified based on the gene activity in the tumor cell. This section evaluates different kinds of cancer gene clustering techniques, for example density based, model based, and ensemble techniques. A brief evaluation of some essential contributions in the existing literature is presented below.

M. Soruri, et al. [18] presented an efficient method for gene clustering, namely the Hidden Markov Model with Particle Swarm Optimization (HMM-PSO) method. The HMM models a particular gene sequence and then calculates the probability of each sequence, which helps to calculate the similarity between sequences. The PSO algorithm optimizes the similarity values, and a symmetric distance matrix is constructed from the clustered sequences. This model based technique clusters gene sequences with an analytical algorithm and cannot model them by feature vectors. Achieving accurate cancer gene clustering requires a large number of iterations.

N. Nidheesh, et al. [19] developed a density based KM algorithm for the estimation of cancer subspace in gene expression data.
Generally, the KM algorithm has several benefits: it is easy to implement and has linear space complexity without a large time cost. However, the KM approach also has disadvantages: it is non-deterministic, because a random set of data points is taken as the initial centroids, and this random selection degrades the clustering efficiency. The density based KM algorithm in [19] has difficulty handling outlier data; hence, its time complexity increases.

S. Saha, et al. [20] established an ensemble based clustering method, namely the Multi-Objective (MO) fuzzy technique, for enhancing the performance of cancer gene classification. Several processes are merged in the ensemble based framework: (i) fuzzy logic is used to detect overlapping clusters; (ii) a symmetry based distance measure is used to identify clusters of various shapes and to calculate the distance between clusters; and (iii) MO-optimization and MO-differential evolution methods are used to improve the efficiency of the search so that the optimal partitioning is found in minimum time. The major limitation of this ensemble clustering method is that it needs prior information about the number of clusters in the dataset.

Huang, X., et al. [21] presented an efficient method, namely Support Vector Machine Recursive Feature Elimination (SVM-RFE), for gene feature selection. Initially, the SVM-RFE method randomly selects genes, then ranks the selected genes, and finally clusters genes with similar expression profiles. According to the experimental analysis, the SVM-RFE algorithm shows better clustering efficiency and lower computational complexity than traditional clustering methods. Also, the SVM-RFE method maximizes the relevant gene features and minimizes the redundant features.

S. R. Kannan, et al. [22] developed a Kernel based Fuzzy clustering (KF) system for evaluating cancer data.
This clustering algorithm considers breast cancer data, which are high dimensional gene expression profiles. The KF method helps to select various levels of non-linearity to handle the complexity of the membership functions. The significant advantages of the KF method are a reduced number of iterations, owing to prototype initialization, and decreased running time. However, because the features are high dimensional, the complexity of data clustering is somewhat increased.

4. SIGNIFICANT CHALLENGES OF GENE CLUSTERING

Gene data cover several levels of genes, and the expression behavior of the genes is monitored under various experiments. In traditional studies, clustering genes from different experiments under different conditions struggles to accomplish its goals because of several limitations. The typical challenges of clustering techniques for gene data are addressed below.

High Dimensionality: Gene data are high dimensional because the gene matrix includes a large number of rows and columns. Moreover, as the number of attributes in the dataset increases, distance measures have difficulty capturing the difference between clusters.

Noisy Data: Gene data samples measure the levels of variation in gene expression between cells. Public gene datasets generally include noise such as missing cell values, unlabeled data, hard-to-identify outliers, and poor quality measurements. These kinds of noise affect the clustering process.

Redundancy: The biological process under scrutiny in a gene study is assumed to be complicated and involves specific gene reactions in different pathways. Some genes may be involved in more than one pathway, while others may not be relevant to the biological process at all. Moreover, gene values depend on other gene values; hence, gene values are redundant.
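
The high dimensionality challenge listed above, where distance measures struggle as the number of attributes grows, can be made concrete with a small experiment. The sketch below is an illustration under loud assumptions: it uses uniformly random points rather than real gene expression data, and the function name `relative_contrast`, the point count, and the dimensions are hypothetical. It compares how much farther the farthest point is than the nearest one from a query point.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)                      # fixed seed for a repeatable run
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

low_dim = relative_contrast(n_points=200, n_dims=2)
high_dim = relative_contrast(n_points=200, n_dims=2000)
print(low_dim, high_dim)
```

In the high dimensional case the nearest and farthest points lie at almost the same distance, so a distance based criterion has little contrast with which to separate clusters.
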

Scalability: Gene datasets are large and include a great number of data items. With existing clustering techniques, the running time grows linearly with the size of the dataset. Sometimes re-scanning the data on servers may be an expensive operation, since the data are generated by an expensive join query over a potentially distributed data warehouse. Thus, a technique should usually require only one scan of the data.

Time and Space Complexity: The computational complexity is linear in the number of input features, objects and iterations. In every iteration, loops search for the nearest neighbor among the clusters and insert clusters into, or remove clusters from, the stack. Removing clusters from the stack affects the other clusters; hence, time and space complexity increase gradually.

5. DISTANCE MEASURES IN GENE CLUSTERS

Generally, clusters are defined and measured using two kinds of methods: model based approaches and distance based approaches. Model based methods describe the data points in high dimensional space, whereas distance based methods calculate the pairwise relations between the data points in high dimensional space. The different distance measures for clustering gene data are briefly described below.

Pearson correlation: This is one kind of similarity measure used in clustering. It is the dot product of the two mean-centered, unit-length profile vectors, that is, the cosine of the angle between the centered vectors. It measures the similarity in the shapes of two gene profiles and does not consider the magnitude of the profiles [23].

Euclidean Distance: In biological samples, this distance measure identifies heterogeneity in gene clusters. It calculates the distance between two data points in the space and reflects the absolute behavior of the genes [24].
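
The contrast between the two measures just described, shape for Pearson correlation and absolute magnitude for Euclidean distance, can be seen on a few toy profiles. This is a minimal sketch; the function names and the three hypothetical gene profiles are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation: cosine of the angle between the mean-centered profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # similar magnitude to gene_a, opposite trend

print(pearson(gene_a, gene_b), euclidean(gene_a, gene_b))
print(pearson(gene_a, gene_c), euclidean(gene_a, gene_c))
```

Pearson correlation groups gene_a with gene_b because their profiles rise together, while Euclidean distance places gene_a closer to gene_c because their absolute values are similar. Which behavior is wanted depends on whether co-regulation or expression level is the focus of the analysis.
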

Jackknife: This is a similarity measure that decreases the effect of outlier gene values on the correlation. The correlation is recomputed with one value left out at a time, so that a single irrelevant value that makes two sequences appear similar is removed. If the sequences do not contain outliers, the correlation value is stable.

Kendall: The traditional Kendall distance considers only lists of the same size that contain the same gene elements. An extended Kendall rank distance measures the difference between the ranked positions of an element across all analyzed lists [25].

Manhattan Distance: It measures the distance between two data points as the sum of the absolute differences of their coordinates. It does not depend on translation or reflection of the coordinate system. One disadvantage is that it depends on the rotation of the coordinate system [26].

The comparative study of the existing cancer gene clustering approaches, with their merits, demerits, standard datasets, and similarity measures, is given in Table 1.

Table 1 Related Work

| Author | Methodology Employed | Technique Category | Advantage | Dataset Name | Limitation | Similarity Measure | Performance Evaluation |
|---|---|---|---|---|---|---|---|
| M. Soruri, et al. [18] | HMM-PSO | Model | Improves cluster quality | Lung cancer-related genes data | Large number of iterations | Distance matrix, similarity matrix | DBL: 0.45 |
| N. Nidheesh, et al. [19] | KM Algorithm | Density | Selects the data points which belong to dense regions | UCI dataset | High time complexity because it is difficult to cluster the outlier data | Euclidean distance | Adjusted Rand Index: 0.714 |
| S. Saha, et al. [20] | FCM-PSO-Differential Evolution | Ensemble | Multi-objective based clustering techniques for the allocation of data points to different clusters | Gene Ontology (GO) annotation database | Difficult to cluster the gene data because of noisy raw data | Euclidean distance | Silhouette index: 0.49, Execution time: 57 sec |
| X. Huang, et al. [21] | SVM-RFE | Ensemble | Decreases the computational complexities | Gene expression dataset | High computational complexity and redundancy among genes | Euclidean distance | Accuracy: 88.24%, Running time: 1404 sec |
| S. R. Kannan, et al. [22] | KF Clustering System | Partition | Introduces a prototype initialization method to avoid a large number of iterations | Breast cancer data | High dimensional features, hence difficult to cluster | Laplacian kernel-induced distance, Canberra distance | Accuracy: 73.1% |
| J. Ramos, [27] | Clustering based Multi Agent system | Partition | Gene clustering through coordinated multiple agents to discover informative gene subsets | Lung cancer data, Colon and Leukemia cancer data | Poor performance with respect to datasets | Euclidean distance | Accuracy: Leukemia dataset 90.2%, CRC dataset 85.4% |
| H. Chen, et al. [28] | Kernel-Based Clustering method for Gene Selection | Distance | Searches for the best weights of genes iteratively while optimizing the clustering objective function | Public cancer datasets | High running time because of the large number of features | Euclidean distance | Accuracy: 94.5%, TPR: 0.94, FPR: 0.06 |
| S. S. Ray and S. Misra, [29] | Supervised Weighted Similarity | Distance | Improves the positive predictive value for gene pairs | Saccharomyces Genome Database | One feature is dependent on other features and all attributes are correlated, hence computational complexity is high | Weighted Pearson correlation | Positive Predictive Value: 0.91 |
| Z. Yu, et al. [30] | Adaptive Random Double Clustering based Cluster Ensemble Framework (ARDCCE) | Partition | Reduces the feature dimension and the sample dimension to lessen the effect of noise | Cancer gene expression profile data | Missing values in the dataset and frequently updated solutions, hence difficult to predict the values | Euclidean distance, Similarity-Rank (SimRank) | Random Index: 7.92, Purity Measure: 7.95 |
| J. Wang, et al. [31] | Laplacian regularized Low-Rank Representation (LLRR) clustering technique | | LLRR method simultaneously captures the global structures and the intrinsic local geometrical information within the data | Public cancer gene dataset | Not able to perform on large scale datasets | Symmetric similarity matrix | Accuracy: 95.83% |
| Z. Zareizadeh, et al. [32] | Multi-Objective clonal selection optimization algorithm | Evolutionary | Fast convergence to the optimal solutions | Gene expression datasets | Large number of iterations, hence high time complexity | Davies–Bouldin index | Dunn Index: 0.1744, Execution time: 1028 sec |

6. CONCLUSIONS

Cancer gene data have high dimensionality and cluster structure. Numerous statistical strategies help to detect cancer genes in order to improve the detection of cancer and of its development stages. To detect differentially expressed genes under comparative conditions, various hypothesis testing methods and the false discovery rate approach are used. The existing cancer gene clustering techniques help to identify cancerous and normal genes effectively. Clustering techniques are classified into several types, such as partition based, density based, and graph based. This paper reviewed the advantages, limitations and similarity measures of existing cancer gene clustering techniques.