A Comparison of Clustering, Biclustering and Hierarchical Biclustering Algorithms

Sejun Kim and Donald C. Wunsch II
Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409

Sejun Kim is with the Applied Computational Intelligence Laboratory, Department of Electrical & Computer Engineering, Missouri University of Science & Technology, Rolla, MO 65409 (phone: 573-341-6811; fax: 573-341-4532; e-mail: skgcf@mst.edu). D. C. Wunsch II is with the Department of Electrical & Computer Engineering, Missouri University of Science & Technology, Rolla, MO 65409 (e-mail: dwunsch@mst.edu).

Abstract—Biclustering has proven to be a more powerful method than conventional clustering algorithms for analyzing high-dimensional data, such as gene microarray samples. It finds a partition of the vectors together with subsets of the dimensions such that correlations among the biclusters are identified and associated automatically. Thus, it can be considered an unsupervised version of heteroassociative learning. Biclustering ARTMAP (BARTMAP) is a recently introduced algorithm that enables high-quality clustering by modifying the ARTMAP structure, and it outperforms previous biclustering approaches. Hierarchical BARTMAP (HBARTMAP), introduced here, offers a biclustering solution to problems in which the degree of attribute-sample association varies. We also developed a hierarchical version of Interrelated Two-Way Clustering for comparison purposes and compared these results with other methods, including various clustering algorithms. Experimental results on multiple genetic datasets reveal that HBARTMAP can offer in-depth interpretation of microarrays, which conventional biclustering or clustering algorithms cannot achieve. Thus, this paper contributes two hierarchical extensions of biclustering, or co-clustering, algorithms and comparatively analyzes their performance in the context of microarray data.

Index Terms—Adaptive Resonance Theory (ART), ARTMAP, Hierarchical clustering, Biclustering, BARTMAP

I. INTRODUCTION

Clustering is a common data-mining technique used to obtain information from raw data sets. However, major challenges arise when large numbers of samples must be analyzed, and these challenges escalate as techniques improve and the speed of data acquisition continues to increase, especially regarding the ability to gather high-dimensional data [1], such as gene expressions. The curse of dimensionality renders the conventional clustering of high-dimensional data infeasible [2]–[5]. Two critical traits of bioinformatics data are noise and high dimensionality, both of which diminish the robustness of clustering results [6]. Thus, biclustering was introduced to overcome computational obstacles and provide higher-quality analyses [7]–[13]. This approach finds subsets of samples correlated to subsets of attributes. Because it decomposes the rows and columns of the data matrix simultaneously, biclustering, unlike clustering, can identify multiple correlated segments within a matrix.

The amount of biological data being produced is increasing at a significant rate [14]–[16]. For instance, since the publication of the H. influenzae genome [17], complete sequences for over 40 organisms have been released, ranging from 450 genes to over 100,000 genes. Given this data, one can imagine the enormous quantity and variety of information being generated in gene expression research. This surge in data has made computers indispensable in biological research.
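To make the notion of a bicluster concrete before the technical sections, the following minimal sketch (our illustration, not data from this paper; it assumes numpy and an arbitrarily chosen planted location) builds a noisy matrix, plants a single coherent sample-attribute block, and checks that the block's internal correlation is high:

```python
# Illustrative only: plant one bicluster (a coherent sample-attribute block)
# inside a noisy matrix and measure its internal row-wise correlation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))            # 30 samples x 20 attributes of noise

rows = np.arange(5, 15)                  # hypothetical bicluster: 10 samples...
cols = np.arange(3, 9)                   # ...across 6 attributes
profile = rng.normal(size=cols.size)     # shared expression pattern
for r in rows:                           # every member follows the pattern
    X[r, cols] = profile + 0.1 * rng.normal(size=cols.size)

B = X[np.ix_(rows, cols)]                # the bicluster as a submatrix
corr = np.corrcoef(B)                    # pairwise correlations of its rows
n = len(rows)
avg = (corr.sum() - n) / (n * (n - 1))   # mean off-diagonal correlation
print(f"within-bicluster correlation: {avg:.2f}")  # close to 1
```

Conventional clustering over all 20 attributes would dilute this structure; a biclustering algorithm seeks the row and column subsets jointly.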
Data sets, such as earth science data and stock market measures, are collected at a rapid rate [18], [19], as are the microarray gene expression data of bioinformatics. The discovery of biclusters allows sets with coherent values to be sought across a subset of transactions or examples. An important example of the utility of biclustering is the discovery of transcription modules from microarray data. These modules denote groups of genes that show coherent activity only across a subset of all conditions constituting the data set and may reveal important information about the regulatory mechanisms operating in a cell [20].

Neural networks have played a major role in data mining and clustering [21]–[24]. Adaptive Resonance Theory (ART) [25] is one of the most well-known neural network-based clustering algorithms. The ARTMAP architecture is a neural network for supervised learning composed of two ART modules and an inter-ART module. Xu and Wunsch revised the ARTMAP architecture to develop Biclustering ARTMAP (BARTMAP) [26]. Biclustering through BARTMAP is achieved by performing row-wise and column-wise Fuzzy ART clustering with the intervention of correlation calculations. The greatest advantage of ART is that its structure, unlike other unsupervised clustering algorithms, allows flexibility in the clustering process. This strength also applies to BARTMAP, as the number of biclusters is adjusted automatically.

This paper contains a discussion of Hierarchical BARTMAP (HBARTMAP), which inherits the advantages of BARTMAP. HBARTMAP automatically generates a BARTMAP tree, with attention given to each cluster obtained at every node, starting from the root BARTMAP node. After generating the tree, the technique uses a correlation comparison method to recursively evaluate the row and column clusters from every terminal node, eventually creating a full hierarchical biclustering classification. We display these results as a heat map, illustrating the relationships among data elements.

The remainder of the paper is organized as follows. Section 2 introduces ART and BARTMAP. Section 3 introduces the HBARTMAP approach and Hierarchical Interrelated Two-Way Clustering (H-ITWC) for comparison, followed by Section 4, which includes the experimental setup, data description and results. Finally, the conclusion is provided in Section 5.

II. BACKGROUND

A. Fuzzy Adaptive Resonance Theory (ART) and ARTMAP

Fuzzy Adaptive Resonance Theory (ART) is a neural network-based unsupervised learning method proposed by Carpenter and Grossberg [25]. The framework is composed of two layers of neurons: the feature representation field F1 and the category representation field F2. The neurons in layer F1 are activated by the input pattern, while the prototypes of the formed clusters are stored in layer F2. The neurons in layer F2 that already represent input patterns are said to be committed; correspondingly, an uncommitted neuron encodes no input patterns. The two layers are connected via adaptive weights $w_j$, emanating from node j in layer F2, which are initially set to 1. Once an input pattern A is registered, the neurons in layer F2 compete by calculating the category choice function

$$T_j = \frac{|A \wedge w_j|}{\alpha + |w_j|}, \quad (1)$$

where $\wedge$ is the fuzzy AND operator defined by

$$(A \wedge w)_i = \min(A_i, w_i), \quad (2)$$

and $\alpha > 0$ is the choice parameter that breaks the tie when more than one prototype vector is a fuzzy subset of the input pattern. The winner is selected based on the winner-take-all rule,

$$T_J = \max\{T_j \mid \forall j\}. \quad (3)$$

The winning neuron, J, then becomes activated, and an expectation is reflected in layer F1 and compared with the input pattern. The orienting subsystem with the pre-specified vigilance parameter $\rho$ $(0 \le \rho \le 1)$ determines whether the expectation and the input pattern are closely matched. The match meets the vigilance criterion if

$$\rho \le \frac{|A \wedge w_J|}{|A|}. \quad (4)$$

In that case, weight adaptation occurs, and the weights are updated using the learning rule

$$w_J(\mathrm{new}) = \beta (A \wedge w_J(\mathrm{old})) + (1 - \beta)\, w_J(\mathrm{old}), \quad (5)$$

where $\beta$ $(0 \le \beta \le 1)$ is the learning rate parameter, and $\beta = 1$ corresponds to fast learning. This procedure is called resonance, which suggests the name of ART. On the other hand, if the vigilance criterion is not met, a reset signal is sent back to layer F2 to ignore the winning neuron. A new competition then occurs among the remaining neurons, excluding the ignored ones; the new expectation is projected into layer F1, and this process repeats until the vigilance criterion is met. If an uncommitted neuron is selected for coding, a new uncommitted neuron is created to represent a potential new cluster, thus maintaining a consistent supply of uncommitted neurons.
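The following sketch illustrates a single input presentation under Eqs. (1)-(5). It is a simplified rendering, not the authors' implementation: complement coding and other practical refinements are omitted, and the function and parameter names are ours.

```python
import numpy as np

def fuzzy_art_step(A, W, rho, alpha=0.001, beta=1.0):
    """Present input A (values in [0, 1]) to Fuzzy ART with committed
    weight vectors W (a list). Returns (winning index, updated W)."""
    T = [np.minimum(A, w).sum() / (alpha + w.sum()) for w in W]  # Eqs. (1)-(2)
    for J in np.argsort(T)[::-1]:           # winner-take-all search, Eq. (3)
        if np.minimum(A, W[J]).sum() / A.sum() >= rho:  # vigilance, Eq. (4)
            W[J] = beta * np.minimum(A, W[J]) + (1 - beta) * W[J]  # Eq. (5)
            return J, W
    # No committed neuron resonated: an uncommitted neuron (w = 1) is
    # selected; with fast learning its weights become A AND 1 = A.
    W.append(A.copy())
    return len(W) - 1, W
```

Repeated calls with rho near 1 yield many small clusters, while a low rho merges inputs into a few broad prototypes; this tunability is the flexibility the text attributes to ART.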
By incorporating two ART modules, which receive input patterns (ARTa) and corresponding labels (ARTb), respectively, together with an inter-ART module, the resulting ARTMAP system can be used for supervised classification [27]. The vigilance parameter of ARTb is set to 1, which causes each label to be represented as a specific cluster. The information regarding the input-output associations is stored in the weights $w^{ab}$ of the inter-ART module; the j-th row of these weights, $w_j^{ab}$, denotes the weight vector from the j-th neuron in ARTa to the map field. When the map field is activated, the output vector of the map field is

$$x^{ab} = y^b \wedge w_J^{ab}, \quad (6)$$

where $y^b$ is the binary output vector of field F2 in ARTb, with $y_i^b = 1$ only if the i-th category wins in ARTb, and J is the winning neuron in ARTa. Similar to the vigilance mechanism in ARTa, the map field also performs a vigilance test, such that a match tracking procedure is activated if

$$\rho_{ab} > \frac{|x^{ab}|}{|y^b|}, \quad (7)$$

where $\rho_{ab}$ $(0 \le \rho_{ab} \le 1)$ is the map field vigilance parameter. In this case, the ARTa vigilance parameter $\rho_a$ is increased from its baseline vigilance to a value just above the current match value. This procedure ensures the shut-off of the current winning neuron in ARTa, whose prediction does not comply with the label represented in ARTb. Another ARTa neuron then is selected, and the match tracking mechanism again verifies its appropriateness. If no such neuron exists, a new ARTa category is created. Once the map field vigilance test criterion is satisfied, the weight $w_J^{ab}$ of the neuron J in ARTa is updated using the learning rule

$$w_J^{ab}(\mathrm{new}) = \gamma (y^b \wedge w_J^{ab}(\mathrm{old})) + (1 - \gamma)\, w_J^{ab}(\mathrm{old}), \quad (8)$$

where $\gamma$ $(0 \le \gamma \le 1)$ is the learning rate parameter. Note that with fast learning ($\gamma = 1$), once neuron J learns to predict the ARTb category I, the association is permanent, i.e., $w_{JI}^{ab} = 1$ for all subsequent input pattern presentations. In a test phase, in which only an input pattern is provided to ARTa without the corresponding label to ARTb, no match tracking occurs. The class prediction is obtained from the map field weights of the winning ARTa neuron. However, if that neuron is uncommitted, the input pattern cannot be classified solely on the basis of prior experience.
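The map-field test and update of Eqs. (6)-(8) can be sketched as follows. This is an illustrative fragment under the assumption of a binary ARTb output vector; the surrounding ARTa search loop and the vigilance raise itself are left to the caller.

```python
import numpy as np

def map_field_step(w_ab_J, y_b, rho_ab, gamma=1.0):
    """Map-field vigilance test for the winning ARTa neuron J.
    Returns (updated map-field row, match_tracking_triggered)."""
    x_ab = np.minimum(y_b, w_ab_J)           # Eq. (6): fuzzy AND
    if x_ab.sum() / y_b.sum() < rho_ab:      # Eq. (7): vigilance fails
        return w_ab_J, True                  # caller raises rho_a (match tracking)
    w_new = gamma * x_ab + (1 - gamma) * w_ab_J   # Eq. (8)
    return w_new, False

# Example: neuron J predicts class 0, but the presented label is class 1,
# so match tracking is triggered and J will be shut off in ARTa.
w_ab_J = np.array([1.0, 0.0, 0.0])
y_b = np.array([0.0, 1.0, 0.0])
print(map_field_step(w_ab_J, y_b, rho_ab=0.9))
```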
B. Biclustering ARTMAP (BARTMAP)

The BARTMAP architecture is derived from Fuzzy ARTMAP and likewise consists of two Fuzzy ART modules communicating through the inter-ART module, as shown in Fig. 1. However, the inputs to the ARTb module are attributes (rows) instead of labels, while the inputs to the ARTa module are samples (columns); the inputs to the modules can be exchanged, depending on the focus of the biclustering procedure. The objective of BARTMAP is to combine the clustering results of the attributes and samples of the data matrix from the ARTa and ARTb modules to create biclusters that capture the correlations between attributes and samples. Thus, BARTMAP can be categorized as two-way clustering.

Fig. 1. Structure of BARTMAP. Gene clusters first form in the ARTb module, and sample clusters form in the ARTa module with the requirement that members of the same cluster behave similarly across at least one of the formed gene clusters. The match tracking mechanism will increase the vigilance parameter of the ARTa module if this condition is not met.

The first step of BARTMAP is to create a set of $K_g$ gene clusters $G_i$, $i = 1, \cdots, K_g$, for N genes by using the ARTb module, which behaves like standard Fuzzy ART. The goal of the following step is to create $K_s$ sample clusters $S_j$, $j = 1, \cdots, K_s$, for M samples within the ARTa module while calculating the correlations between the attribute and sample clusters. When a new data sample is registered to the ARTa module, the candidate sample cluster that is eligible to represent this sample is determined based on the winner-take-all rule and the standard Fuzzy ART vigilance test. If this candidate cluster corresponds to an uncommitted neuron, learning will occur to create a new one-element sample cluster that represents this sample, as in Fuzzy ART. Otherwise, before updating the weights of the winning neuron, BARTMAP checks whether the following condition is satisfied: a sample is absorbed into an existing sample cluster if and only if it displays behavior or patterns similar to those of the other members of the cluster across at least one gene cluster formed in the ARTb module. The similarity between a new sample $s_k$ and the sample cluster $S_j = \{s_{j_1}, \cdots, s_{j_{M_j}}\}$ with $M_j$ samples across a gene cluster $G_i = \{g_{i_1}, \cdots, g_{i_{N_i}}\}$ with $N_i$ genes is calculated as the average Pearson correlation coefficient between the sample and all the samples in the cluster,

$$r_{kj} = \frac{1}{M_j} \sum_{l=1}^{M_j} r_{k,j_l}, \quad (9)$$

where

$$r_{k,j_l} = \frac{\sum_{t=1}^{N_i} (e_{s_k g_{it}} - \bar{e}_{s_k G_i})(e_{s_{j_l} g_{it}} - \bar{e}_{s_{j_l} G_i})}{\sqrt{\sum_{t=1}^{N_i} (e_{s_k g_{it}} - \bar{e}_{s_k G_i})^2} \sqrt{\sum_{t=1}^{N_i} (e_{s_{j_l} g_{it}} - \bar{e}_{s_{j_l} G_i})^2}}, \quad (10)$$

and

$$\bar{e}_{s_k G_i} = \frac{1}{N_i} \sum_{t=1}^{N_i} e_{s_k g_{it}}, \quad (11)$$

$$\bar{e}_{s_{j_l} G_i} = \frac{1}{N_i} \sum_{t=1}^{N_i} e_{s_{j_l} g_{it}}. \quad (12)$$

Here, $e_{s g}$ denotes the expression level of sample s at gene g. The sample $s_k$ is enclosed in cluster $S_j$ only when $r_{kj}$ is above some threshold $\eta$; learning then occurs following the Fuzzy ART updating rule. If the sample does not show behavior similar to that of the sample cluster represented by the winning neuron for any cluster of genes, the match tracking mechanism will increase the ARTa vigilance parameter $\rho_a$ from its baseline vigilance to just above the current match value to disable the current winning neuron in ARTa. This shut-off forces the sample to be included in some other cluster, or a new cluster is created for the sample if no existing sample cluster matches it well.
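A minimal sketch of this inclusion test, Eqs. (9)-(12), follows. It assumes an expression matrix indexed as samples x genes and index lists for the clusters; the names are ours, not the authors' code.

```python
import numpy as np

def avg_corr(E, k, members, genes):
    """Average Pearson correlation (Eqs. (9)-(12)) between candidate sample k
    and the samples in `members`, restricted to the gene cluster `genes`."""
    x = E[k, genes]
    r = [np.corrcoef(x, E[l, genes])[0, 1] for l in members]   # Eq. (10)
    return float(np.mean(r))                                   # Eq. (9)

def may_join(E, k, members, gene_clusters, eta):
    """BARTMAP condition: sample k may join the winning sample cluster only
    if its average correlation exceeds eta across >= 1 gene cluster."""
    return any(avg_corr(E, k, members, g) >= eta for g in gene_clusters)
```

If may_join returns False for the winning neuron, match tracking raises the ARTa vigilance and the search continues, exactly as described above.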
III. HIERARCHICAL BARTMAP

Fig. 2. Main idea of hierarchical biclustering. Within a subset, the biclustering procedure is reiterated to obtain finer results.

In HBARTMAP, increasing the vigilances of the ARTa and ARTb modules, as well as the correlation threshold, by a preset interval enables diversification.

A. Method

The basic idea of Hierarchical BARTMAP (HBARTMAP) is to reiterate BARTMAP within the obtained BARTMAP results in order to obtain sub-biclusters, as shown in Fig. 2. Such subdivision provides insight into reinterpreting the generated biclusters by conjugating or disbanding sub-biclusters of the initial results. The overall procedure is as follows:

Algorithm 1 Pseudocode of the Overall HBARTMAP Algorithm
  Initialize BARTMAP and load data
  Run BARTMAP (whole data set)
  Bicluster evaluation
  for i = 1 to number of sample biclusters do
    Run ChildBARTMAP(SampleBicluster[i], vi_a, vi_b, corth_i)
  end for

In this algorithm, vi_a, vi_b and corth_i are the intervals by which the vigilance parameters of ARTa and ARTb and the correlation threshold of the inter-ART module are increased, respectively. BARTMAP does not have the ability to evaluate and pair the attribute and sample biclusters. Thus, a scatter search [28] is applied to evaluate each attribute-sample bicluster pair. The correlation coefficient between two variables X and Y measures the grade of linear dependency between them and is defined by

$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n \sigma_X \sigma_Y}, \quad (13)$$

where cov(X, Y) is the covariance of the variables X and Y; $\bar{x}$ and $\bar{y}$ are the means of the values of the variables X and Y; and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively. Given a bicluster B composed of N samples and M attributes, $B = [g_1, \cdots, g_N]$, the average correlation of B, $\rho(B)$, is defined as

$$\rho(B) = \frac{1}{\binom{N}{2}} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \rho(g_i, g_j), \quad (14)$$

where $\rho(g_i, g_j)$ is the correlation coefficient between samples i and j. Because $\rho(g_i, g_j) = \rho(g_j, g_i)$, only $\binom{N}{2}$ elements need to be considered. With the calculated average correlation of bicluster B, the scatter search fitness function is applied, which is defined by

$$f(B) = (1 - \rho(B)) + \sigma_\rho + \frac{1}{N} + \frac{1}{M}, \quad (15)$$

where $\sigma_\rho$ is the standard deviation of the values $\rho(g_i, g_j)$. The standard deviation term is included so that a bicluster is not favored merely because its average correlation is high; biclusters whose pairwise correlations fluctuate widely are penalized. The best biclusters are those with the lowest fitness function values.

During the bicluster evaluation process, once the fitness of every bicluster of a sample group is calculated, the most highly correlated attributes are sorted out in accordance with a preset threshold: if the fitness of an attribute is smaller than the fitness threshold, it is selected. Once the attribute scan is complete, the process advances to the next sample group and progresses through the selection step again. However, to prevent previously selected attributes from overlapping in different sample groups, they are excluded from the search.

Algorithm 2 Pseudocode of the ChildBARTMAP Function
  Initialize BARTMAP and load data (SampleBicluster[i])
  Adjust variables (vi_a, vi_b, corth_i)
  Run BARTMAP
  Bicluster evaluation
  if number of sample biclusters == 1 then
    return
  end if
  for i = 1 to number of sample biclusters do
    if number of attributes(SampleBicluster[i]) >= 3 and number of samples(SampleBicluster[i]) >= 3 then
      Run ChildBARTMAP(SampleBicluster[i], vi_a, vi_b, corth_i)
    end if
  end for
  return

The ChildBARTMAP function shown in Algorithm 2 is a recursive function used to generate a tree of BARTMAP modules, each of which computes only its own subset. Biclustering with fewer than three samples or attributes has proven to be meaningless, so the recursion stops under such conditions.
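The fitness of Eqs. (13)-(15) can be computed compactly. This sketch assumes the bicluster is passed as an N x M submatrix (N samples, M attributes) with N >= 2; it is an illustration, not the scatter search of [28] itself.

```python
import numpy as np

def fitness(B):
    """Scatter-search fitness of a bicluster; lower is better (Eq. (15))."""
    N, M = B.shape
    corr = np.corrcoef(B)                  # sample-pair correlations, Eq. (13)
    iu = np.triu_indices(N, k=1)           # the C(N, 2) distinct pairs
    pair_corrs = corr[iu]
    rho_B = pair_corrs.mean()              # average correlation, Eq. (14)
    sigma_rho = pair_corrs.std()           # penalizes uneven correlations
    return (1.0 - rho_B) + sigma_rho + 1.0 / N + 1.0 / M
```

The 1/N and 1/M terms mildly penalize small biclusters, so the search does not collapse onto trivially tiny but perfectly correlated blocks.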
B. Hierarchical Interrelated Two-Way Clustering (H-ITWC)

In order to compare and contrast HBARTMAP with other biclustering algorithms, a hierarchical version of interrelated two-way clustering (ITWC) [29] also was programmed, using the same hierarchical scheme as HBARTMAP. The ITWC technique is a clustering method used widely in cases in which information is spread over a large body of experimental data [30], [31]. It was developed to achieve clustering in high-dimensional data spaces and takes an approach similar to that of BARTMAP. First, ITWC performs clustering on the attribute side, which is the larger dimension, in order to reduce it to a reasonable level. Then, scores such as correlation coefficients [32]–[34] are applied to the patterns in order to sort out important samples. ITWC involves five main steps; a sketch of the Step 5 reduction follows at the end of this subsection.

• Step 1: Clustering in the gene dimension. The data are clustered gene-wise into k groups using any clustering method, such as K-means or self-organizing maps (SOMs).
• Step 2: Clustering in the sample dimension. The samples are clustered into two groups, $S_{i,a}$ and $S_{i,b}$, based on each gene cluster i.
• Step 3: Combining the clustering results. The results from Steps 1 and 2 are combined into $2^k$ groups. If k = 2, the samples can be divided into the following four groups:
  - C1 (all samples clustered into $S_{1,a}$ based on G1 and into $S_{2,a}$ based on G2);
  - C2 (all samples clustered into $S_{1,a}$ based on G1 and into $S_{2,b}$ based on G2);
  - C3 (all samples clustered into $S_{1,b}$ based on G1 and into $S_{2,a}$ based on G2);
  - C4 (all samples clustered into $S_{1,b}$ based on G1 and into $S_{2,b}$ based on G2).
• Step 4: Finding heterogeneous groups. Among the sample groups $C_i$, two distinct groups $C_s$ and $C_t$ are selected such that, for all samples $u \in C_s$ and $v \in C_t$, if $u \in S_{i,r_1}$ and $v \in S_{i,r_2}$, then $r_1 \ne r_2$ $(r_1, r_2 \in \{a, b\})$ for all i $(1 \le i \le k)$. The pair $(C_s, C_t)$ is called a heterogeneous group.
• Step 5: Sorting and reducing. For each heterogeneous group, two patterns are introduced. The vector cosine defined in Eq. (16) is calculated between each gene and each pattern, and then all genes are sorted according to the similarity values in descending order. The first one-third of the sorted gene sequence is kept, and the other two-thirds are removed:

$$\cos\theta = \frac{g_l \cdot E}{\|g_l\| \, \|E\|} = \frac{\sum_{j=1}^{m} w_{l,j} e_j}{\sqrt{\sum_{j=1}^{m} w_{l,j}^2} \sqrt{\sum_{j=1}^{m} e_j^2}}, \quad (16)$$

where $g_l = (w_{l,1}, \cdots, w_{l,m})$ is the expression vector of gene l and $E = (e_1, \cdots, e_m)$ is the pattern vector.

Steps 1 through 5 are reiterated until the terminal conditions are satisfied. The H-ITWC algorithm has the same hierarchical structure as HBARTMAP; however, the individual child nodes apply the ITWC algorithm instead of BARTMAP.
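A minimal sketch of the Step 5 reduction (our reading of Eq. (16), not the reference implementation) ranks the genes by vector cosine against a pattern and keeps the first third:

```python
import numpy as np

def reduce_genes(G, E):
    """G: genes x samples matrix; E: pattern vector over the samples.
    Returns the kept submatrix and the indices of the kept genes."""
    sim = G @ E / (np.linalg.norm(G, axis=1) * np.linalg.norm(E))  # Eq. (16)
    order = np.argsort(sim)[::-1]         # sort genes by similarity, descending
    keep = order[: len(order) // 3]       # keep the first one-third
    return G[keep], keep
```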
IV. EXPERIMENTAL RESULTS

A. Setup

The leukemia data set [35] consists of 72 samples, including bone marrow samples, peripheral blood samples and childhood AML cases. Twenty-five of these samples are acute myeloid leukemia (AML), and 47 are acute lymphoblastic leukemia (ALL), which is composed of two subcategories reflecting the influences of T-cells and B-cells. The expression levels of 7,129 genes were measured across all of the samples by high-density oligonucleotide microarrays. The raw data were preprocessed through a linear transform to fit the HBARTMAP requirement of a [0, 1] interval. Similar preprocessing was performed for H-ITWC.

The result of HBARTMAP is evaluated by comparing the resulting clusters with the real structures in terms of external criteria. Both the Rand index and the adjusted Rand index [36] are applied. We assume that P is a pre-specified partition of dataset X with N data objects, independent of the clustering structure C resulting from the algorithm. A pair of data objects $x_i$ and $x_j$ then yields four different cases based on how $x_i$ and $x_j$ are placed in C and P:

• Case 1: $x_i$ and $x_j$ belong to the same cluster of C and the same category of P.
• Case 2: $x_i$ and $x_j$ belong to the same cluster of C and different categories of P.
• Case 3: $x_i$ and $x_j$ belong to different clusters of C and the same category of P.
• Case 4: $x_i$ and $x_j$ belong to different clusters of C and different categories of P.

Correspondingly, the numbers of pairs of objects for the four cases are denoted as a, b, c and d, respectively. The total number of pairs is N(N − 1)/2, denoted as L; therefore, a + b + c + d = L. The Rand index then can be defined as follows, with larger values indicating greater similarity between C and P:

$$R = \frac{a + d}{L}. \quad (17)$$

In order to correct the Rand index for randomness, it is normalized so that its value is 0 when the two partitions are selected at random and 1 when they match perfectly,

$$\mathrm{AdjR} = \frac{R - E(R)}{\max(R) - E(R)}, \quad (18)$$

where E(R) is the expected value of R under the baseline distribution, and max(R) is the maximum value of R. Specifically, the adjusted Rand index [37] assumes that the model of randomness takes the form of the generalized hypergeometric distribution, which gives

$$\mathrm{AdjR} = \frac{\binom{N}{2}(a + d) - \left[(a + b)(a + c) + (c + d)(b + d)\right]}{\binom{N}{2}^2 - \left[(a + b)(a + c) + (c + d)(b + d)\right]}. \quad (19)$$

The adjusted Rand index has demonstrated consistently good performance in previous studies compared to other indices. For additional performance comparison, a synthetic data set developed by Handl and Knowles [38] also was used for the simulation; it consists of 1286 samples and 100 attributes.
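For concreteness, both indices can be computed directly from the pair counts a, b, c and d defined above. The following is a compact sketch (ours), assuming integer label sequences for C and P.

```python
from itertools import combinations

def pair_counts(C, P):
    """Counts of the four cases over all object pairs: returns (a, b, c, d)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(C)), 2):
        same_c, same_p = C[i] == C[j], P[i] == P[j]
        if same_c and same_p:
            a += 1            # Case 1
        elif same_c:
            b += 1            # Case 2
        elif same_p:
            c += 1            # Case 3
        else:
            d += 1            # Case 4
    return a, b, c, d

def rand_and_adjusted(C, P):
    a, b, c, d = pair_counts(C, P)
    L = a + b + c + d                      # L = N(N-1)/2 = C(N, 2)
    R = (a + d) / L                        # Eq. (17)
    x = (a + b) * (a + c) + (c + d) * (b + d)
    adj = (L * (a + d) - x) / (L * L - x)  # Eq. (19)
    return R, adj

print(rand_and_adjusted([0, 0, 1, 1], [0, 0, 1, 1]))  # -> (1.0, 1.0)
```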
B. Results

Fig. 3. HBARTMAP heat map result on the leukemia data set. The child biclusters are generated within the biclusters generated by the parent BARTMAP node. Certain portions of the attributes (genes) are ignored based on the scatter search evaluation.

Fig. 3 depicts the resulting HBARTMAP heat map on the leukemia data set, showing that bicluster subsets are generated within the biclusters. The main focus of the leukemia data set simulation is to perform biclustering with HBARTMAP and to judge how precisely the structure that this approach computes matches the external criteria. Fig. 4 depicts the Rand index and the adjusted Rand index of each layer, beginning with the biclustering result from the root HBARTMAP node. Because the initial vigilance and threshold parameters are set low, the first result is rough: while the leukemia data set has three conditions, the root node found only two. As deeper layers were evaluated, the growth of both the Rand index and the adjusted Rand index is evident. At layer 3, the Rand index was 0.9711, and the adjusted Rand index was 0.8863, both of which are higher than the best BARTMAP result (0.7926).

Fig. 4. Rand index and adjusted Rand index calculation for each layer. The biclusters generated on layer 3 produce the best result based on the Rand index and adjusted Rand index evaluation.

Fig. 5 compares the two indices using BARTMAP, Fuzzy ART, K-means, hierarchical clustering with four different linkages and various versions of ITWC. The performances of the well-known ITWC and SOM also are presented. This comparison indicates that the hierarchical version of ITWC with K-means is also improved; however, the gain is not as significant as that offered by HBARTMAP.

Fig. 5. Results from HBARTMAP and various clustering algorithms on the leukemia data set in terms of the Rand and adjusted Rand indices. The methods used for the comparison are BARTMAP (BAM), Fuzzy ART (FA), K-means (KM), hierarchical clustering with single linkage (HC-S), complete linkage (HC-C), average linkage (HC-A) and Ward's method (HC-W), and interrelated two-way clustering with SOFM (ITWC-S), with K-means (ITWC-K) and the hierarchical version with SOFM (H-ITWC-S).

On the first iteration over the synthetic data set, HBARTMAP divides the data into two major clusters. Then, the algorithm performs BARTMAP within each cluster to split it into two subclusters. HBARTMAP terminates when the subclusters cannot be divided any further, even with stricter variables. Fig. 6 shows the clustering result on the synthetic data. A comparison of the evaluation results with BARTMAP and H-ITWC is shown in Table I. HBARTMAP clearly performs better than H-ITWC and slightly better than BARTMAP based on the adjusted Rand index criterion.

TABLE I
RAND INDEX AND ADJUSTED RAND INDEX WITH SYNTHETIC DATA

Algorithm              Rand Index    Adjusted Rand Index
HBARTMAP (2nd layer)   0.9576        0.9347
BARTMAP                0.9817        0.8944
H-ITWC                 0.8779        0.6032

Fig. 6. The result of HBARTMAP on the synthetic data. HBARTMAP ends with two layers: two subclusters each under the two root clusters.

V. CONCLUSION

In this paper, we propose Hierarchical BARTMAP, a hierarchical approach to biclustering. The task of clustering high-dimensional data is addressed by incorporating scatter search into the biclustering algorithm. The experimental results show that HBARTMAP provides better biclustering than BARTMAP. The marked increase in the adjusted Rand index from layer to layer indicates that this extended version of BARTMAP can be applied effectively to high-dimensional data analysis, and it suggests that the scatter search evaluation used in HBARTMAP was a major factor in the successful experiments.

ACKNOWLEDGEMENT

Partial support of this research from the National Science Foundation (Grants 1102159 and 1238097), the Mary K. Finley Missouri Endowment, and the Missouri S&T Intelligent Systems Center is gratefully acknowledged.

REFERENCES

[1] T. Havens, J. Bezdek, C. Leckie, L. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, pp. 1130–1146, Dec. 2012.
[2] R. E. Bellman, Dynamic Programming. Courier Dover Publications, 1957.
[3] R. Xu and D. C. Wunsch II, Clustering. IEEE Press Series on Computational Intelligence, John Wiley & Sons, 2008.
[4] J. A. Hartigan, "Direct clustering of a data matrix," Journal of the American Statistical Association, vol. 67, no. 337, pp. 123–129, 1972.
[5] R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
[6] S. Kesh and W. Raghupathi, "Critical issues in bioinformatics and computing," Perspect Health Inf Manag, vol. 1, p. 9, 2004.
[7] Y. Cheng and G. M. Church, "Biclustering of expression data," in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 93–103, AAAI Press, 2000.
[8] S. Busygin, O. Prokopyev, and P. M. Pardalos, "Biclustering in data mining," Computers and Operations Research, vol. 35, no. 9, pp. 2964–2987, 2008.
[9] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 24–45, 2004.
[10] P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, and V. Robles, "Machine learning in bioinformatics," Briefings in Bioinformatics, vol. 7, pp. 86–112, Mar. 2006.
[11] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[12] J. Keller and M. Popescu, "Soft computing in bioinformatics," in Proceedings of the 14th IEEE International Conference on Fuzzy Systems (FUZZ '05), pp. 3–3, IEEE, 2005.
[13] J. Zhang, J. Wang, and H. Yan, "A neural-network approach for biclustering of gene expression data based on the plaid model," in Proceedings of the 2008 International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1082–1087, IEEE, 2008.
[14] T. Reichhardt, "It's sink or swim as a tidal wave of data approaches," Nature, vol. 399, pp. 517–520, June 1999.
[15] N. M. Luscombe, D. Greenbaum, and M. Gerstein, "What is bioinformatics? A proposed definition and overview of the field," Methods Inf Med, vol. 40, no. 4, pp. 346–358, 2001.
[16] R. Xu and D. C. Wunsch, "Clustering algorithms in biomedical research: A review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120–154, 2010.
[17] R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, and J. M. Merrick, "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd," Science, vol. 269, pp. 496–512, Jul. 1995.
[18] G. Pandey, G. Atluri, M. Steinbach, C. L. Myers, and V. Kumar, "An association analysis approach to biclustering," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), New York, NY, USA, pp. 677–686, ACM, 2009.
[19] R. Xu, S. Damelin, B. Nadler, and D. C. Wunsch, "Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps," in Proceedings of the 2008 International Conference on BioMedical Engineering and Informatics (BMEI 2008), vol. 1, pp. 245–249, IEEE, 2008.
[20] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N. Barkai, "Revealing modular organization in the yeast transcriptional network," Nat. Genet., vol. 31, pp. 370–377, Aug. 2002.
[21] G. Lim and J. Bezdek, "Small targets in ladar images using fuzzy clustering," in Proceedings of the 1998 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), vol. 1, pp. 61–66, May 1998.
[22] H. Kim and B. Kosko, "Neural fuzzy motion estimation and compensation," IEEE Transactions on Signal Processing, vol. 45, no. 10, pp. 2515–2532, 1997.
[23] Z. Hou, M. Polycarpou, and H. He, "Editorial to special issue: Neural networks for pattern recognition and data mining," Soft Computing - A Fusion of Foundations, Methodologies and Applications, vol. 12, no. 7, pp. 613–614, 2008.
[24] P. Werbos, "Neurocontrol and elastic fuzzy logic: Capabilities, concepts, and applications," IEEE Transactions on Industrial Electronics, vol. 40, no. 2, pp. 170–180, 1993.
[25] G. A. Carpenter, S. Grossberg, and D. B. Rosen, "Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system," Neural Networks, vol. 4, no. 6, pp. 759–771, 1991.
[26] R. Xu and D. C. Wunsch II, "BARTMAP: A viable structure for biclustering," Neural Networks, vol. 24, no. 7, pp. 709–716, 2011.
[27] G. A. Carpenter, S. Grossberg, N. Markuzon, J. H. Reynolds, and D. B. Rosen, "Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 698–713, 1992.
[28] J. A. Nepomuceno, A. Troncoso, and J. S. Aguilar-Ruiz, "Biclustering of gene expression data by correlation-based scatter search," BioData Mining, vol. 4, no. 1, 2011.
[29] C. Tang and A. Zhang, "Interrelated two-way clustering and its application on gene expression data," International Journal on Artificial Intelligence Tools, vol. 14, no. 4, pp. 577–597, 2005.
[30] C. Tang, L. Zhang, A. Zhang, and M. Ramanathan, "Interrelated two-way clustering: An unsupervised approach for gene expression data analysis," in Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering, pp. 41–48, IEEE, 2001.
[31] B. Chandra, S. Shanker, and S. Mishra, "A new approach: Interrelated two-way clustering of gene expression data," Statistical Methodology, vol. 3, no. 1, pp. 93–102, 2006.
[32] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, et al., "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
[33] A. Jorgensen, "Clustering excipient near infrared spectra using different chemometric methods," Technical report, Dept. of Pharmacy, University of Helsinki, 2000.
[34] J. Devore, Probability and Statistics for Engineering and the Sciences. Duxbury Press, 2011.
[35] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proc. Natl. Acad. Sci. U.S.A., vol. 95, pp. 14863–14868, Dec. 1998.
[36] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[37] D. Steinley, "Properties of the Hubert-Arabie adjusted Rand index," Psychol Methods, vol. 9, pp. 386–396, Sep. 2004.
[38] J. Handl and J. Knowles, "Improvements to the scalability of multiobjective clustering," in Proceedings of the 2005 IEEE Congress on Evolutionary Computation, vol. 3, pp. 2372–2379, IEEE, 2005.