Biclustering of Expression Data

Yizong Cheng†§ and George M. Church†

†Department of Genetics, Harvard Medical School, Boston, MA 02115 (Tel: (617) 432-7266, Fax: (617) 432-7663)
§Department of ECECS, University of Cincinnati, Cincinnati, OH 45221 (Tel: (513) 556-1809, Fax: (513) 556-7326)
yizong.cheng@uc.edu, church@salt2.med.harvard.edu

Copyright © 2000, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores, and it is shown to perform well in finding co-regulation patterns in yeast and human. This introduces "biclustering", or simultaneous clustering of both genes and conditions, to knowledge discovery from expression data. The approach overcomes some problems associated with traditional clustering methods by allowing automatic discovery of similarity based on a subset of attributes, simultaneous clustering of genes and conditions, and overlapped grouping that provides a better representation for genes with multiple functions or genes regulated by many factors.

Keywords: microarray, gene expression pattern, clustering

Introduction

Gene expression data are being generated by DNA chips and other microarray techniques, and they are often presented as matrices of expression levels of genes under different conditions (including environments, individuals, and tissues). One of the usual goals in expression data analysis is to group genes according to their expression under multiple conditions, or to group conditions based on the expression of a number of genes. This may lead to the discovery of regulatory patterns or condition similarities. The current practice is often the application of some agglomerative or divisive clustering algorithm that partitions the genes or conditions into mutually exclusive groups or hierarchies.

The basis for clustering is usually the similarity between genes or conditions, computed as a function of the rows or columns of the expression matrix: the similarity between rows is a function of the row vectors involved, and that between columns a function of the column vectors. Functions that have been used include Euclidean distance (or the related coefficient of correlation and Gaussian similarity) and the dot product (or a nonlinear generalization of it, as used in kernel methods) between the vectors. All conditions are given equal weights in the computation of gene similarity, and vice versa.

One must doubt not only the rationale of weighting all conditions or all genes equally, but also that of giving the same weight to the same condition in the similarity computation between every pair of genes, and vice versa. Any such formula leads to the discovery of some similarity groups at the expense of obscuring other similarity groups. In expression data analysis, beyond grouping genes and conditions based on overall similarity, it is sometimes necessary to salvage information lost during oversimplified similarity and grouping computation. One goal of doing so is to disclose the involvement of a gene or a condition in multiple pathways, some of which can only be discovered under the dominance of more consistent ones.

In this article, we introduce the concept of a bicluster, corresponding to a subset of genes and a subset of conditions with a high similarity score. Similarity is not treated as a function of pairs of genes or pairs of conditions.
Instead, it is a measure of the coherence of the genes and conditions in the bicluster. This measure can be a symmetric function of the genes and conditions involved, and thus the finding of biclusters is a process that groups genes and conditions simultaneously. If we project these biclusters onto the dimension of genes or that of conditions, we can see the result as a clustering of either genes or conditions, into possibly overlapping groups.

A particular score that applies to expression data transformed by a logarithm and augmented by the additive inverse is the mean squared residue. The residue of an element a_ij in the bicluster indicated by the subsets I and J is

a_ij − a_iJ − a_Ij + a_IJ, (1)

where a_iJ is the mean of the ith row in the bicluster, a_Ij the mean of the jth column in the bicluster, and a_IJ the mean of all elements in the bicluster. The mean squared residue equals the mean row variance plus the mean column variance, minus the variance of the set of all elements in the bicluster. We want to find biclusters with low mean squared residue, in particular large and maximal ones with scores below a certain threshold. A special case of a perfect score (a zero mean squared residue) is a constant bicluster, all of whose elements have a single value. When a bicluster has a non-zero score, it is always possible to remove a row or a column to lower the score, until the remaining bicluster becomes constant.

The problem of finding a maximum bicluster with a score lower than a threshold includes, as a special case, the problem of finding a maximum biclique (complete bipartite subgraph) in a bipartite graph. If a maximum biclique is one that maximizes the number of vertices involved (maximizing |I| + |J|), then the problem is equivalent to finding a maximum matching in the bipartite complement and can be solved using polynomial-time max-flow algorithms. However, this approach often produces a submatrix with maximum perimeter and zero area, particularly in the case of expression data, where the number of genes may be hundreds of times larger than the number of conditions. If the goal is to find the largest balanced biclique, for example the largest constant square submatrix, then the problem is proven to be NP-hard (Johnson, 1987). On the other hand, the hardness of finding one with the maximum area is still unknown.

Divisive algorithms for partitioning data into sets with approximately constant values were proposed by Morgan and Sonquist (1963) and Hartigan (1972). The result is a hierarchy of clusters, and the algorithms foretold the more recent decision tree procedures. Hartigan (1972) also mentioned that the criterion for partitioning may be other than a constant value, for example a two-way analysis of variance model, which is quite similar to the mean squared residue scoring proposed in this article. Rather than a divisive algorithm, our approach is more of the type of node deletion (Yannakakis, 1981).

The term biclustering has been used by Mirkin (1996) to describe "simultaneous clustering of both row and column sets in a data matrix". Other terms that have been associated with the same idea include "direct clustering" (Hartigan, 1972) and "box clustering" (Mirkin, 1996). Mirkin (1996) presents a node addition algorithm, starting with a single cell in the matrix, to find a maximal constant bicluster. The algorithms mentioned above either find one constant bicluster, or find a set of mutually exclusive near-constant biclusters that cover the data matrix.
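To see what the residue in (1) measures, consider a submatrix whose rows track one another by additive offsets, which is what a multiplicative pattern becomes after the logarithm transformation. The following is a minimal numpy sketch (ours, not from the paper) showing that such an "ideal" bicluster has zero residue everywhere, while added noise raises the mean squared residue:

import numpy as np

def residues(a):
    """Residue of each element per Eq. (1): a_ij - a_iJ - a_Ij + a_IJ."""
    return (a - a.mean(axis=1, keepdims=True)        # a_iJ, row means
              - a.mean(axis=0, keepdims=True)        # a_Ij, column means
              + a.mean())                            # a_IJ, overall mean

base = np.array([2.0, 5.0, 3.0, 7.0])               # one expression profile
ideal = np.stack([base + s for s in (0.0, 1.5, -0.7)])  # rows fluctuate in unison
print(np.allclose(residues(ideal), 0))               # True: perfect score, H = 0

noisy = ideal + np.random.default_rng(0).normal(0, 0.3, ideal.shape)
print((residues(noisy) ** 2).mean())                 # small positive score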
There are ample reasons to allow biclusters to overlap in expression data analysis. One of the reasons is that a single gene may participate in multiple pathways that may or may not be co-active under all conditions. The problem of finding a minimum set of biclusters, either mutually exclusive or overlapping, to cover all the elements in a data matrix is a generalization of the problem of covering a bipartite graph by a minimum set of bicliques, either mutually exclusive or overlapping, which has been shown to be NP-hard (Orlin, 1977). Nau, Markowsky, Woodbury, and Amos (1978) presented an interesting application of biclique covering to the interpretation of leukocyte-serum immunological reaction matrices, which are not unlike gene-condition expression matrices.

In expression data analysis, the most important goal may not be finding the maximum bicluster or even finding a bicluster cover for the data matrix. More interesting is finding a set of genes showing strikingly similar up-regulation and down-regulation under a set of conditions. A low mean squared residue score plus a large variation from the constant may be a good criterion for identifying these genes and conditions. In the following sections, we present a set of efficient algorithms that find these interesting gene and condition sets. The basic iterate of the method consists of the steps of masking null values and biclusters that have been discovered, coarse and fine node deletion, node addition, and the inclusion of inverted data.

Methods

A gene-condition expression matrix is a matrix of real numbers, with possible null values as some of the elements. Each element represents the logarithm of the relative abundance of the mRNA of a gene under a specific condition. The logarithm transformation is used to convert doubling or other multiplicative changes of the relative abundance into additive increments.

Definition 1. Let X be the set of genes and Y the set of conditions. Let a_ij be the element of the expression matrix A representing the logarithm of the relative abundance of the mRNA of the ith gene under the jth condition. Let I ⊆ X and J ⊆ Y be subsets of genes and conditions. The pair (I, J) specifies a submatrix A_IJ with the following mean squared residue score:

H(I, J) = (1/(|I||J|)) Σ_{i∈I, j∈J} (a_ij − a_iJ − a_Ij + a_IJ)², (2)

where

a_iJ = (1/|J|) Σ_{j∈J} a_ij, a_Ij = (1/|I|) Σ_{i∈I} a_ij, (3)

and

a_IJ = (1/(|I||J|)) Σ_{i∈I, j∈J} a_ij = (1/|I|) Σ_{i∈I} a_iJ = (1/|J|) Σ_{j∈J} a_Ij (4)

are the row means, the column means, and the overall mean of the submatrix (I, J). A submatrix A_IJ is called a δ-bicluster if H(I, J) ≤ δ for some δ ≥ 0.

The lowest score H(I, J) = 0 indicates that the gene expression levels fluctuate in unison. This includes the trivial or constant biclusters, in which there is no fluctuation. These trivial biclusters may not be very interesting, but they need to be discovered and masked so that more interesting ones can be found. The row variance may serve as an accompanying score for rejecting trivial biclusters:

V(I, J) = (1/(|I||J|)) Σ_{i∈I, j∈J} (a_ij − a_iJ)². (5)

The matrix a_ij = 2^{ij}, i, j > 0, has the property that no submatrix with more than one row and more than one column has a score lower than 0.5. A K × K matrix of all 0s except one 1 has the score

h_K = (K − 1)²/K⁴. (6)

A matrix with elements randomly and uniformly generated in the range [a, b] has an expected score of (b − a)²/12, a result independent of the size of the matrix. For example, when the range is [0, 800], the expected score is 53,333.
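The score of Definition 1 is a few lines of array code. Below is a minimal sketch (ours; the names msr_score and row_variance are not from the paper), assuming a dense numpy matrix with null values already replaced and rows/cols given as index arrays:

import numpy as np

def msr_score(A, rows, cols):
    """Mean squared residue H(I, J) of the submatrix A[rows][:, cols], Eq. (2)."""
    sub = A[np.ix_(rows, cols)]
    res = (sub - sub.mean(axis=1, keepdims=True)    # a_iJ, Eq. (3)
               - sub.mean(axis=0, keepdims=True)    # a_Ij, Eq. (3)
               + sub.mean())                        # a_IJ, Eq. (4)
    return (res ** 2).mean()

def row_variance(A, rows, cols):
    """Accompanying score V(I, J), Eq. (5), for rejecting trivially flat biclusters."""
    sub = A[np.ix_(rows, cols)]
    return ((sub - sub.mean(axis=1, keepdims=True)) ** 2).mean()

On a matrix drawn uniformly from [0, 800], msr_score of any sizable random submatrix comes out near 800²/12 ≈ 53,333, the expectation quoted above.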
A translation (addition of a constant) to the matrix does not affect the score H(I, J). A scaling (multiplication by a constant) affects the score by the square of the constant, but has no effect when the score is zero. Neither translation nor scaling affects the ranking of the biclusters in a matrix.

Theorem 1. The problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard.

Proof. We construct a reduction from the BALANCED COMPLETE BIPARTITE SUBGRAPH problem (GT24 in Garey and Johnson, 1979) to this problem. Given a bipartite graph (V1, V2, E) and a positive integer K, form a real-valued matrix A with a_ij = 0 if and only if (i, j) ∈ E, and a_ij = 2^{ij} otherwise, for i, j > 0. If the largest square h_K-bicluster in A has a size larger than or equal to K, then there is a K × K biclique (complete bipartite subgraph). Since the BALANCED COMPLETE BIPARTITE SUBGRAPH problem is NP-complete, the problem of this theorem is NP-hard. □

Node Deletion

Every expression matrix contains a submatrix with the perfect score (H(I, J) = 0); any single element is such a submatrix. Certainly the kind of biclusters we look for should have a maximum size, both in terms of the number of genes involved (|I|) and in terms of the number of conditions (|J|). If we start with a large matrix, say the one containing all the data, the question is how to select a submatrix with a low H score. A greedy method is to remove the row or column whose removal achieves the largest decrease of the score. This requires computing the scores of all the submatrices that may result from any single row or column removal, before each choice of removal can be made. This method (Algorithm 0) requires time in O((n + m)nm), where n and m are the row and column sizes of the expression matrix, to find one bicluster.

Algorithm 0 (Brute-Force Deletion and Addition).
Input: A, a matrix of real numbers, and δ ≥ 0, the maximum acceptable mean squared residue score.
Output: A_IJ, a δ-bicluster that is a submatrix of A with row set I and column set J, with a score no larger than δ.
Initialization: I and J are initialized to the gene and condition sets in the data and A_IJ = A.
Iteration:
1. Compute the score H for each possible row/column addition/deletion and choose the action that decreases H the most. If no action will decrease H, or if H ≤ δ, return A_IJ.

Algorithm 0, although polynomial-time, is not efficient enough for a quick analysis of most expression data matrices. We propose below Algorithm 1, with time complexity in O(nm), and Algorithm 2, in O(m log n). The combination of the two provides a very efficient node-deletion algorithm for finding a bicluster with a low score. The correctness and efficiency of these algorithms are based on a number of lemmas, in which rows (or columns) are treated as points in a space where a distance is defined.

Lemma 1. Let S be a finite set of points in a space with a non-negative real-valued function of two arguments, d. Let m(S) be a point that minimizes the function

f(s) = Σ_{x∈S} d(x, s). (7)

Define the measure

E(S) = (1/|S|) Σ_{x∈S} d(x, m(S)). (8)

Then the removal of any non-empty subset

R ⊆ {x ∈ S : d(x, m(S)) > E(S)} (9)

will only lower the measure:

E(S − R) < E(S). (10)

Proof. Condition (10) can be rewritten as

A′/|S − R| < (A + B)/|S|, (11)

where

A = Σ_{x∈S−R} d(x, m(S)), A′ = Σ_{x∈S−R} d(x, m(S − R)), (12)

B = Σ_{x∈R} d(x, m(S)). (13)

The definition of the function m requires that A′ ≤ A. Thus, a sufficient condition for the inequality (11) is

A/|S − R| < (A + B)/|S|, (14)

which is equivalent to

E(S) = (A + B)/|S| < B/|R| = (1/|R|) Σ_{x∈R} d(x, m(S)). (15)

Clearly, (9) is a sufficient condition for this inequality and therefore also for (10). □
Lemma 2. Suppose the set removed from S is

R ⊆ {x ∈ S : d(x, m(S)) > αE(S)} (16)

with α ≥ 1. Then the reduction rate of the measure E(S) can be characterized as

(E(S) − E(S − R))/E(S) ≥ (α − 1)/(|S|/|R| − 1). (17)

When a single point x is removed, the reduction rate has the bound

(E(S) − E(S − {x}))/E(S) ≥ (d(x, m(S))/E(S) − 1)/(|S| − 1). (18)

Proof. Using the notation of Lemma 1, we now have

αE(S) = α(A + B)/|S| ≤ B/|R| = (1/|R|) Σ_{x∈R} d(x, m(S)). (19)

This leads to

α|R|A ≤ (|S| − α|R|)B, (20)

and then to

|S|A ≤ (|S| − α|R|)(A + B). (21)

Using the inequality A′ ≤ A and the facts that E(S − R) = A′/|S − R| and E(S) = (A + B)/|S|, this gives

E(S − R) ≤ ((|S| − α|R|)/|S − R|) E(S), (22)

which is the same as (17). Inequality (18) can be derived from (17). □

Theorem 2. The set of rows that can be completely or partially removed with the net effect of decreasing the score of a bicluster A_IJ is

R = {i ∈ I : (1/|J|) Σ_{j∈J} (a_ij − a_iJ − a_Ij + a_IJ)² > H(I, J)}. (23)

Proof. Let the points in Lemma 1 be |J|-dimensional real-valued vectors, and let S be the set of vectors b_i with components b_ij = a_ij − a_iJ for i ∈ I and j ∈ J. The function d is defined as

d(b_i, b_k) = (1/|J|) Σ_{j∈J} (b_ij − b_kj)². (24)

In this case m(S) is the centroid, which has the components a_Ij − a_IJ, so that d(b_i, m(S)) = (1/|J|) Σ_{j∈J} (a_ij − a_iJ − a_Ij + a_IJ)² and E(S) = H(I, J). The theorem then follows from Lemma 1. □

There is also a similar result for the columns.

Lemma 2 acts as a guide on the trade-off between two types of node deletion: deleting one node at a time, and deleting a set of nodes at a time, before the score is recalculated. These two algorithms are listed below.

Algorithm 1 (Single Node Deletion).
Input: A, a matrix of real numbers, and δ ≥ 0, the maximum acceptable mean squared residue score.
Output: A_IJ, a δ-bicluster that is a submatrix of A with row set I and column set J, with a score no larger than δ.
Initialization: I and J are initialized to the gene and condition sets in the data and A_IJ = A.
Iteration:
1. Compute a_iJ for all i ∈ I, a_Ij for all j ∈ J, a_IJ, and H(I, J). If H(I, J) ≤ δ, return A_IJ.
2. Find the row i ∈ I with the largest

d(i) = (1/|J|) Σ_{j∈J} (a_ij − a_iJ − a_Ij + a_IJ)² (25)

and the column j ∈ J with the largest

d(j) = (1/|I|) Σ_{i∈I} (a_ij − a_iJ − a_Ij + a_IJ)², (26)

and remove the row or column, whichever has the larger d value, by updating either I or J.

The correctness of Algorithm 1 is shown by Theorem 2, in the sense that every removal decreases the score. Because there is only a finite number of rows and columns to remove, the algorithm terminates in no more than n + m iterates, where n and m are the number of genes and the number of conditions in the initial data matrix. It may happen that all d(i) and d(j) are equal to H(I, J) for i ∈ I and j ∈ J, in which case Theorem 2 does not apply; the removal of one of them may still decrease the score, unless the score is already zero. Step 1 of each iterate requires time in O(nm), and a complete recalculation of all d values in Step 2 is also an O(nm) effort. The selection of the best row and column candidates takes O(log n + log m) time. When the matrix is bi-level, specifying "on" and "off" states of the genes, the update of the various quantities after the removal of a row takes only O(m) time, and after the removal of a column only O(n) time. In this case the algorithm can be made very efficient even for whole-genome expression data, with an overall running time in O(nm).
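The iteration above translates almost line for line into array code. The following is a minimal numpy sketch of Algorithm 1 (ours, not the authors' C implementation); rows and cols are index arrays into the original matrix, the full recalculation of all d values each iterate is the O(nm) step discussed above, and the bi-level update optimization is omitted:

import numpy as np

def single_node_deletion(A, delta, rows=None, cols=None):
    """Algorithm 1 sketch: repeatedly drop the single worst row or column until H <= delta."""
    rows = np.arange(A.shape[0]) if rows is None else np.asarray(rows)
    cols = np.arange(A.shape[1]) if cols is None else np.asarray(cols)
    while True:
        sub = A[np.ix_(rows, cols)]
        res = (sub - sub.mean(axis=1, keepdims=True)
                   - sub.mean(axis=0, keepdims=True) + sub.mean())
        H = (res ** 2).mean()
        if H <= delta:
            return rows, cols
        d_rows = (res ** 2).mean(axis=1)        # d(i), Eq. (25)
        d_cols = (res ** 2).mean(axis=0)        # d(j), Eq. (26)
        i, j = int(d_rows.argmax()), int(d_cols.argmax())
        if d_rows[i] >= d_cols[j] and len(rows) > 1:
            rows = np.delete(rows, i)           # remove the worst row
        elif len(cols) > 1:
            cols = np.delete(cols, j)           # remove the worst column
        else:
            rows = np.delete(rows, i)

Keeping rows and cols as index arrays makes the returned bicluster directly reusable by the deletion and addition phases sketched later.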
For matrices that are not bi-level, updates are more expensive, and it is advisable to use the following Multiple Node Deletion until the matrix is reduced to a manageable size, at which point Single Node Deletion becomes appropriate.

Algorithm 2 (Multiple Node Deletion).
Input: A, a matrix of real numbers, δ ≥ 0, the maximum acceptable mean squared residue score, and α > 1, a threshold for multiple node deletion.
Output: A_IJ, a δ-bicluster that is a submatrix of A with row set I and column set J, with a score no larger than δ.
Initialization: I and J are initialized to the gene and condition sets in the data and A_IJ = A.
Iteration:
1. Compute a_iJ for all i ∈ I, a_Ij for all j ∈ J, a_IJ, and H(I, J). If H(I, J) ≤ δ, return A_IJ.
2. Remove the rows i ∈ I with (1/|J|) Σ_{j∈J} (a_ij − a_iJ − a_Ij + a_IJ)² > αH(I, J).
3. Recompute a_Ij, a_IJ, and H(I, J).
4. Remove the columns j ∈ J with (1/|I|) Σ_{i∈I} (a_ij − a_iJ − a_Ij + a_IJ)² > αH(I, J).
5. If nothing has been removed in the iterate, switch to Algorithm 1.

The correctness of Algorithm 2 is guaranteed by Lemma 2. When α is properly selected, the Multiple Node Deletion phase of Algorithm 2 (before the call of Algorithm 1) requires a number of iterates in O(log n + log m), which is usually extremely fast. Without updating the score after the removal of each node, however, the matrix may shrink too much, and one may miss some large δ-biclusters (although later runs of the same algorithm may find them). One may also choose an adaptive α based on the score and size during the iteration.

Node Addition

After node deletion, the resulting δ-bicluster may not be maximal, in the sense that some rows and columns could be added without increasing the score. Lemma 3 and Theorem 3 below mirror Lemma 1 and Theorem 2 and provide a guideline for node addition.

Lemma 3. Let S, d, m(S), and E(S) be defined as in Lemma 1. Then the addition to S of any non-empty subset

R ⊆ {x ∉ S : d(x, m(S)) ≤ E(S)} (27)

will not increase the measure:

E(S + R) ≤ E(S). (28)

Proof. The condition (28) can be rewritten as

A′/|S + R| ≤ (A − B)/|S|, (29)

where

A = Σ_{x∈S+R} d(x, m(S)), A′ = Σ_{x∈S+R} d(x, m(S + R)), (30)

B = Σ_{x∈R} d(x, m(S)). (31)

The definition of the function m requires that A′ ≤ A. Thus, a sufficient condition for the inequality (29) is

A/|S + R| ≤ (A − B)/|S|, (32)

which is equivalent to

E(S) = (A − B)/|S| ≥ B/|R| = (1/|R|) Σ_{x∈R} d(x, m(S)). (33)

Clearly, (27) is a sufficient condition for this inequality and therefore also for (28). □

Theorem 3. The set of rows that can be completely or partially added without increasing the score of a bicluster A_IJ is

R = {i ∉ I : (1/|J|) Σ_{j∈J} (a_ij − a_iJ − a_Ij + a_IJ)² ≤ H(I, J)}. (34)

Proof. Similar to the proof of Theorem 2. □

There is also a similar result for the columns.

Algorithm 3 (Node Addition).
Input: A, a matrix of real numbers, and I and J signifying a δ-bicluster.
Output: I′ and J′ such that I ⊆ I′ and J ⊆ J′, with the property that H(I′, J′) ≤ H(I, J).
Iteration:
1. Compute a_iJ for all i, a_Ij for all j, a_IJ, and H(I, J).
2. Add the columns j ∉ J with (1/|I|) Σ_{i∈I} (a_ij − a_iJ − a_Ij + a_IJ)² ≤ H(I, J).
3. Recompute a_iJ, a_IJ, and H(I, J).
4. Add the rows i ∉ I with (1/|J|) Σ_{j∈J} (a_ij − a_iJ − a_Ij + a_IJ)² ≤ H(I, J).
5. For each row i still not in I, add its inverse if (1/|J|) Σ_{j∈J} (−a_ij + a_iJ − a_Ij + a_IJ)² ≤ H(I, J).
6. If nothing is added in the iterate, return the final I and J as I′ and J′.

Lemma 3 and Theorem 3 guarantee that the addition of rows and columns in Algorithm 3 will not increase the score.
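Both remaining phases can be sketched in the same array style as before. The functions below are our own illustrations, not the paper's C code: multiple_node_deletion follows Algorithm 2 (returning its current submatrix when a pass removes nothing, at which point the caller switches to Algorithm 1, as in step 5), and node_addition follows steps 1-4 and 6 of Algorithm 3; the inverse-row step 5 would be analogous, testing the negated row, whose residues are −a_ij + a_iJ − a_Ij + a_IJ.

import numpy as np

def _residue(A, rows, cols):
    sub = A[np.ix_(rows, cols)]
    res = (sub - sub.mean(axis=1, keepdims=True)
               - sub.mean(axis=0, keepdims=True) + sub.mean())
    return res, (res ** 2).mean()

def multiple_node_deletion(A, delta, alpha=1.2):
    """Algorithm 2 sketch: each pass deletes every row, then column, with d > alpha * H."""
    rows, cols = np.arange(A.shape[0]), np.arange(A.shape[1])
    while True:
        res, H = _residue(A, rows, cols)
        if H <= delta:
            return rows, cols
        new_rows = rows[(res ** 2).mean(axis=1) <= alpha * H]   # step 2
        res, H = _residue(A, new_rows, cols)                    # step 3
        new_cols = cols[(res ** 2).mean(axis=0) <= alpha * H]   # step 4
        if len(new_rows) == len(rows) and len(new_cols) == len(cols):
            return rows, cols        # step 5: stalled; caller switches to Algorithm 1
        rows, cols = new_rows, new_cols

def node_addition(A, rows, cols):
    """Algorithm 3 sketch (steps 1-4 and 6): grow I and J without increasing H(I, J)."""
    in_I = np.zeros(A.shape[0], bool); in_I[rows] = True
    in_J = np.zeros(A.shape[1], bool); in_J[cols] = True
    while True:
        added = False
        # step 2: score every column against the current bicluster's rows
        AI = A[in_I]
        a_iJ = AI[:, in_J].mean(axis=1, keepdims=True)
        a_IJ = AI[:, in_J].mean()
        _, H = _residue(A, np.where(in_I)[0], np.where(in_J)[0])
        d_cols = ((AI - a_iJ - AI.mean(axis=0, keepdims=True) + a_IJ) ** 2).mean(axis=0)
        grow = (~in_J) & (d_cols <= H)
        if grow.any():
            in_J |= grow; added = True
        # steps 3-4: recompute, then score every row against the (possibly grown) columns
        AJ = A[:, in_J]
        a_Ij = AJ[in_I].mean(axis=0, keepdims=True)
        a_IJ = AJ[in_I].mean()
        _, H = _residue(A, np.where(in_I)[0], np.where(in_J)[0])
        d_rows = ((AJ - AJ.mean(axis=1, keepdims=True) - a_Ij + a_IJ) ** 2).mean(axis=1)
        grow = (~in_I) & (d_rows <= H)
        if grow.any():
            in_I |= grow; added = True
        if not added:                # step 6
            return np.where(in_I)[0], np.where(in_J)[0]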
However, the resulting δ-bicluster may still not be maximal, for two reasons. The first is that Lemma 3 only gives a sufficient condition for adding rows and columns; it is not necessarily a necessary condition. The second is that by adding rows and columns, the score may decrease to the point where it is much smaller than δ, while each iterate of Algorithm 3 only adds rows and columns according to the current score, not δ.

Step 5 of the iteration adds inverted rows into the bicluster. These rows form "mirror images" of the rest of the rows in the bicluster and can be interpreted as co-regulated but receiving the opposite regulation. These inverted rows cannot be added to the data matrix at the beginning, because that would make all a_Ij = 0 and also a_IJ = 0.

Algorithm 3 is very efficient. Its time efficiency is comparable with that of the Multiple Node Deletion phase of Algorithm 2, on the order of O(mn). Clearly, the addition of nodes does not have to take place only after all deletion is done; sometimes an addition may decrease the score more than any deletion. A single node deletion and addition algorithm based on the lemmas, and thus more efficient than Algorithm 0, could be set up.

Experimental Methods

The biclustering algorithms were tested on two sets of expression data, both of which had been clustered using conventional clustering algorithms: the yeast Saccharomyces cerevisiae cell cycle expression data from Cho et al. (1998), and the human B-cell expression data from Alizadeh et al. (2000).

Data Preparation

The yeast data contain 2,884 genes and 17 conditions. These genes were selected according to Tavazoie et al. (1999) and were identified by their SGD ORF names (Ball et al., 2000) from http://arep.med.harvard.edu/network_discovery. The relative abundance values (percentage of the mRNA for the gene among all mRNAs) were taken from a table prepared by Aach, Rindone, and Church (2000). (Two of the ORF names did not have corresponding entries in the table, and thus there were 34 null elements.) These values were transformed by scaling and logarithm, x → 100 log(10^5 x), and the result was a matrix of integers in the range between 0 and 600. (The transformation does not affect the values 0 and −1, the latter marking null elements.)

The human data were downloaded from the web site of supplementary information for the article by Alizadeh et al. (2000). There were 4,026 genes and 96 conditions. The expression levels were reported as log ratios and, after scaling by a factor of 100, we ended up with a matrix of integers in the range between −750 and 650, with 47,639 missing values (12.3% of the matrix elements). The matrices after the above preparation, along with the biclustering results, can be found at http://arep.med.harvard.edu/biclustering.

Missing Data Replacement

Missing data in the matrices were replaced with random numbers. The expectation was that these random values would not form recognizable patterns and thus would be the leading candidates for removal during node deletion. The random numbers used to replace missing values in the yeast data were generated from a uniform distribution between 0 and 800. For the human data, the uniform distribution was between −800 and 800.
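The preparation of the yeast matrix is easy to reproduce. This sketch is ours: it assumes the logarithm is base 10 (consistent with the quoted 0-600 range), takes −1 as the null marker described above, and applies the uniform replacement from this section.

import numpy as np

def prepare_yeast(raw, rng=None):
    """Transform relative abundances by x -> 100 log10(1e5 x); 0 and -1 pass through,
    then the -1 null markers are replaced with uniform noise on [0, 800]."""
    rng = rng or np.random.default_rng(0)
    safe = np.where(raw > 0, raw, 1.0)                # avoid log of non-positive values
    out = np.where(raw > 0, np.rint(100.0 * np.log10(1e5 * safe)), raw)
    nulls = (raw == -1)
    out[nulls] = rng.uniform(0, 800, nulls.sum())     # Missing Data Replacement step
    return out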
Determining Algorithm Parameters

The 30 clusters reported in Tavazoie et al. (1999) were used to determine the δ value for Algorithms 1 and 2. From the discussion before, we know that a completely random submatrix of any size over the value range (0 to 800) has a score of about 53,000. The clusters reported in Tavazoie et al. (1999) have scores in the range between 261 (Cluster 3) and 996 (Cluster 7), with a median of 630 (Clusters 8 and 14). A δ value (300) close to the lower end of this range was used in the experiment, to detect more refined patterns.

Submatrices of different sizes were randomly sampled (one million times for each size) from the yeast matrix, and the distributions of scores, along with the probability that a submatrix of each size has a score lower than 300, were estimated and are listed in Table 1.

rows  columns  low   high  peak  below 300
3     6        10    6870  390   15.5%
3     17       30    6600  480   6.83%
10    6        110   4060  800   0.064%
10    17       240   3470  870   0.002%
30    6        410   2460  960   < 10^-6
30    17       480   2310  1040  < 10^-6
100   6        630   1720  1020  < 10^-6
100   17       700   1630  1080  < 10^-6

Table 1: Score distributions estimated by randomly selecting one million submatrices for each size combination. The columns give the number of rows, the number of columns, the lowest score, the highest score, the peak (most frequent) score, and the percentage of submatrices with scores below 300.

The δ value used in the experiment with human data was 1,200, because of the doubling of the range and the quadrupling of the variance in the data, compared to the yeast data.

Algorithm 1 (Single Node Deletion) becomes quite slow when the number of rows in the matrix is in the thousands, which is common in expression data. A proper α must be determined to run the accelerated Algorithm 2 (Multiple Node Deletion); Lemma 2 gives some guidance for this determination. Our aim was to find an α as large as possible that still allowed the program to find 100 biclusters in less than 10 minutes. When the number of conditions is less than 100, which was the case for both data sets, Steps 3 and 4 were not used in Algorithm 2, so deletion of conditions started only when Algorithm 1 was called. The α used in both experiments was 1.2.

Node Addition

Algorithm 3 was used after Algorithm 2 and Algorithm 1 (called by Algorithm 2) to add conditions and genes, to further reduce the score. Only one iterate of Algorithm 3 was executed for each bicluster, based on the assumption that further iterates would not add much. Step 5 of Algorithm 3 was performed, so many biclusters contain a "mirror image" of the expression pattern. These additions were performed using the original data set (without the masking described below).

Masking Discovered Biclusters

Because the algorithms are all deterministic, repeated runs of them will not discover different biclusters unless discovered ones are masked. Each time a bicluster was discovered, the elements of the submatrix representing it were replaced by random numbers, exactly like those generated for the missing values (see Missing Data Replacement above). This made it very unlikely that elements covered by existing biclusters would contribute to any future pattern discovery. The masks were not used during node addition. The steps described above are summarized in Algorithm 4 below.

Algorithm 4 (Finding a Given Number of Biclusters).
Input: A, a matrix of real numbers with possible missing elements, α > 1, a parameter for multiple node deletion, δ ≥ 0, the maximum acceptable mean squared residue score, and n, the number of δ-biclusters to be found.
Output: n δ-biclusters in A.
Initialization: Missing elements in A are replaced with random numbers from a range covering the range of non-null values. A′ is a copy of A.
Iteration (n times):
1. Apply Algorithm 2 on A′, δ, and α. If the row (column) size is small (less than 100), do not perform multiple node deletion on rows (columns). The matrix after multiple node deletion is B.
2. (Step 5 of Algorithm 2) Apply Algorithm 1 on B and δ; the matrix after single node deletion is C.
3. Apply Algorithm 3 on A and C; the result is the bicluster D.
4. Report D, and replace the elements in A′ that are also in D with random numbers.
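The whole pipeline then reduces to a short driver loop. The sketch below is ours and simply chains the hypothetical helpers from the earlier sketches; for brevity it omits Algorithm 4's refinement of skipping multiple node deletion on a dimension smaller than 100.

import numpy as np

def find_biclusters(A, n, delta, alpha=1.2, mask_range=(0, 800), seed=0):
    """Algorithm 4 sketch: report n delta-biclusters, masking each one after discovery."""
    rng = np.random.default_rng(seed)
    A_prime = A.copy()                      # working copy A'; A itself stays unmasked
    results = []
    for _ in range(n):
        rows, cols = multiple_node_deletion(A_prime, delta, alpha)     # step 1 -> B
        rows, cols = single_node_deletion(A_prime, delta, rows, cols)  # step 2 -> C
        rows, cols = node_addition(A, rows, cols)   # step 3 -> D, on the unmasked data
        results.append((rows, cols))
        # step 4: mask D in A' with random values, like the missing-value replacement
        A_prime[np.ix_(rows, cols)] = rng.uniform(*mask_range, (len(rows), len(cols)))
    return results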
Implementation and Display

These algorithms were implemented in C and run on a Sun Ultra 10 workstation. With the parameters specified above, 100 biclusters were discovered from each data set in less than 10 minutes. Plots were generated for each bicluster, showing the expression levels of its genes under the conditions in the bicluster. Thirty biclusters for the yeast data were plotted in Figures 1, 2, 3, and 4, and 24 for the human data in Figures 5 and 6. In the captions, "Bicluster" denotes a bicluster discovered using our algorithm and "Cluster" denotes a cluster discovered in Tavazoie et al. (1999). Detailed descriptions of these biclusters can be found at http://arep.med.harvard.edu/biclustering.

Figure 1: The first bicluster (Bicluster 0) discovered by Algorithm 4 from the yeast data and the flattest one (Bicluster 47, with a row variance of 39.33) are examples of the "flat" biclusters the algorithm has to find and mask before more "interesting" ones may emerge. The bicluster with the highest row variance (4,162) is Bicluster 93 in Figure 4. The quantization effect visible at the lower ends of the expression levels was due to a lack of significant digits both before and after the logarithm.

Figure 2: Twelve biclusters, with numbers indicating the order in which they were discovered. These biclusters are clearly related to Cluster 2 in Tavazoie et al. (1999), which has a score of 757; they subdivide Cluster 2's profile into similar but distinct ones. These biclusters have scores less than 300. They also include 10 genes from Cluster 14 of Tavazoie et al. (1999) and 6 from other clusters. All biclusters plotted here except Bicluster 95 contain all 17 conditions, indicating that these conditions form a cluster well with respect to the genes included here.

Figure 3: Ten biclusters discovered in the order labeled. Biclusters 17, 67, 71, 80, and 99 (in the left column) contain genes in Clusters 4, 8, and 12 of Tavazoie et al. (1999), while Biclusters 57, 63, 77, 84, and 94 represent Cluster 7.

Figure 4: Six biclusters discovered in the order labeled. Bicluster 36 corresponds to Cluster 20 of Tavazoie et al. (1999). Bicluster 93 corresponds to Cluster 9. Biclusters 52 and 90 correspond to Clusters 14 and 30. Bicluster 46 corresponds to Clusters 1, 3, 11, 13, 19, 21, 25, and 29, and Bicluster 54 corresponds to Clusters 5, 15, 24, and 28. Notice that some biclusters have fewer than half of the 17 conditions and thus represent patterns shared by many clusters discovered using similarity based on all the conditions.

Results

From visual inspection of the plots one can see that this biclustering approach works as well as conventional clustering methods when there are clear patterns over all attributes (conditions when the genes are clustered, or genes when conditions are clustered). The δ parameter gives a powerful tool to fine-tune the similarity requirements. This explains the correspondence between one or more biclusters and each of the better clusters discovered in Tavazoie et al. (1999), including the clusters associated with the highest scoring motifs (Clusters 2, 4, 7, 8, 14, and 30 of Tavazoie et al.).
For other clusters from Tavazoie et al., there was no clear correspondence to our biclusters. Instead, Biclusters 46 and 54 represent the common features, under some of the conditions, of these lesser clusters.

Coverage of the Biclusters

In the yeast data experiment, the 100 biclusters covered 2,801, or 97.12%, of the genes, 100% of the conditions, and 81.47% of the cells of the matrix. The first 100 biclusters from the human data covered 3,687, or 91.58%, of the genes, 100% of the conditions, and 36.81% of the cells of the data matrix.

Sub-Categorization of Tavazoie's Cluster 2

Figure 2 shows 12 biclusters containing mostly genes classified into Cluster 2 in Tavazoie et al. Each of these biclusters clearly represents a variation on the common theme of Cluster 2. For example, Bicluster 87 contains genes with three sharp peaks in expression (CLB6 and SPT21). Genes in Bicluster 62 (ERP3, LPP1, and PLM2) also showed three peaks, but the third of these is rather flat. Bicluster 56 shows a clear-cut double-peak pattern with the DNA replication genes CDC9, CDC21, POL12, POL30, RFA1, and RFA2, among others. Bicluster 66 contains two-peak genes with an even sharper image (CDC45, MSH6, RAD27, SWE1, and PDS5). On the other hand, Bicluster 14 contains genes with barely recognizable double peaks.

Figure 5: Twelve biclusters discovered in the order labeled from the human expression data. Scores were 1,200 or lower. The numbers of genes and conditions in each are reported in the format (bicluster label, number of genes, number of conditions) as follows: (12, 4, 96), (19, 103, 25), (22, 10, 57), (39, 9, 51), (44, 10, 29), (45, 127, 13), (49, 2, 96), (52, 3, 96), (53, 11, 25), (54, ?, 21), (75, 25, 12), (83, 2, ?). [Two values were illegible in the source.]

Figure 6: Another twelve biclusters discovered in the order labeled from the human expression data. The numbers of genes and conditions in each bicluster (whose label is the first figure in the triple) are as follows: (7, 34, 48), (14, 15, 46), (25, 57, 24), (27, 23, 30), (31, ?, 17), (35, 102, 13), (37, 59, 18), (59, 8, 19), (60, ?, ?), (67, 102, 20), (79, 18, 11), (86, 18, 11). Mirror images can be seen in most of these biclusters. [Three values were illegible in the source.]

Broad Strokes and Fine Drawings

Figures 5 and 6 show various biclusters discovered in the human lymphoma expression data. Some of them represent a few genes closely following each other through almost all the conditions. Others show large numbers of genes displaying a broad trend and its mirror image. These "broad strokes" often involve smaller subsets of the conditions, and genes depicted in some of the "fine drawings" may be added to these broad trends during the node addition phase of the algorithm. These biclusters clearly say a lot about the regulatory mechanisms, and also about the classification of conditions.

Comparison with Alizadeh's Clusters

We compared the first 100 biclusters discovered in the human lymphoma data with the clusters discovered in Alizadeh et al. (2000). Hierarchical clustering was used in Alizadeh et al. on both the genes and the conditions. We used the root division in each hierarchy for our comparison, and we call the two clusters generated by the root division in a hierarchy the primary clusters. Out of the 100 biclusters, only 10 have conditions exclusively from one or the other primary cluster on conditions. Bicluster 19 in Figure 5 is heavily biased towards one primary condition cluster, while Bicluster 67 in Figure 6 is heavily biased towards the other. A similar proportion of the biclusters have genes exclusively from one or the other primary cluster on genes.
The tendency is that the higher the row variance (5) and the larger the number of columns (conditions) involved, the more likely the genes in the bicluster are to come from one primary cluster. We define mix as the percentage of genes from the minority primary cluster in the bicluster, and plot mix versus row variance in Figure 7. Each digit in Figure 7 represents a bicluster, with the digit indicating the number of conditions in the bicluster. All the biclusters shown in Figures 5 and 6 have a (square-rooted) row variance greater than 100. Many of them have zero percent mix (Biclusters 7, 12, 14, 22, 39, 44, 49, 52, and 53 from one primary cluster, and Biclusters 25, 54, 59, and 83 from the other). This demonstrates the correctness of using the row variance as a means for separating interesting biclusters from trivial ones.

Some biclusters have high row variances but also high mix scores (those in the upper right quadrant of the plot). These invariably involve a narrower view of the conditions, with the digit "1" indicating that the number of conditions involved is below 20%. Genes in these biclusters are distributed on both sides of the root division of hierarchical clustering. These biclusters show the complementary role that biclustering plays to one-dimensional clustering. Biclustering allows us to focus on the right subsets of conditions to see apparent co-regulatory patterns not visible at the global scope. For example, Bicluster 60 suggests that under its 11 conditions, mostly for chronic lymphocytic leukemia (CLL), FMR2 and the interferon-γ receptor α chain behave against their stereotype and are down-regulated. As another example, Bicluster 35 suggests that under its 13 conditions, CD49B and SIT appear similar to genes classified by hierarchical clustering into the other primary cluster.

Figure 7: A plot of the first 100 biclusters in the human lymphoma data. Three measurements are involved. The horizontal axis is the square root of the row variance, defined in (5). The vertical axis is the mix, or the percentage of genes misclassified across the root division generated by hierarchical clustering in Alizadeh et al. (2000); we assumed that genes forming mirror-image expression patterns are naturally from the other side of the root division. The digits used to represent the biclusters in the plot also indicate the sizes of the condition sets: the digit n indicates that the number of conditions is between 10n and 10(n + 1) percent of the total number of conditions.

Discussion

We have introduced a new paradigm, biclustering, to gene expression data analysis. The concept itself can be traced back to the 1960's and 1970's, although it has rarely been used or even studied. For gene expression data analysis, this paradigm is relevant because of the complexity of gene regulation and expression, and the sometimes low quality of the gathered raw data.

Biclustering is performed on the expression matrix, which can be viewed as a weighted bipartite graph, and the concept of a bicluster is a natural generalization of the concept of a biclique in graph theory. There are one-dimensional clustering methods based on graphs constructed from similarity scores between genes, for example the hierarchical clustering method in Alizadeh et al. (2000), and the finding of highly connected subgraphs in Hartuv et al. (1999). A major difference here
is that biclustering does not start from or require the computation of overall similarity between genes. The relation between biclustering and clustering is similar to that between the instance-based paradigm and the model-based paradigm in supervised learning: instance-based learning (for example, nearest-neighbor classification) views data locally, while model-based learning (for example, feed-forward neural networks) finds globally optimally fitting models.

Biclustering has several obvious advantages over clustering. First, biclustering automatically selects genes and conditions with more coherent measurements and drops those representing random noise, which provides a method for dealing with missing data and corrupted measurements. Secondly, biclustering groups items based on a similarity measure that depends on a context, best defined as a subset of the attributes. It discovers not only the grouping, but the context as well; to some extent the two become inseparable and exchangeable, which is a major difference between biclustering and clustering rows after clustering columns. Most expression data cover more or less complete sets of genes but only very small portions of all the possible conditions. Any similarity measure between genes based on the available conditions is therefore context-dependent anyway, and clustering genes based on such a measure is no more representative than biclustering. Thirdly, biclustering allows rows and columns to be included in multiple biclusters, and thus allows one gene or one condition to be identified by more than one functional category. This added flexibility correctly reflects the reality of the functionality of genes and of overlapping factors in tissue samples and experimental conditions.

By showing the NP-hardness of the problem, we have tried to justify our efficient but greedy algorithms. But the nature of NP-hardness implies that there may be sizable biclusters with good scores that evade the search of any efficient algorithm. Just as with most efficient conventional clustering algorithms, one can say that the best biclusters can be found in most cases, but one cannot say that they will be found in all cases.
Acknowledgments

This research was conducted at the Lipper Center for Computational Genetics at the Harvard Medical School, while the first author was on academic leave from the University of Cincinnati.

References

Aach, J., Rindone, W., and Church, G.M. 2000. Systematic management and analysis of yeast gene expression data. Genome Research, in press.

Alizadeh, A.A. et al. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503-510.

Ball, C.A. et al. 2000. Integrating functional genomic information into the Saccharomyces Genome Database. Nucleic Acids Res. 28:77-80.

Cho, R.J. et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2:65-73.

Garey, M.R., and Johnson, D.S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco: Freeman.

Hartigan, J.A. 1972. Direct clustering of a data matrix. JASA 67:123-129.

Hartuv, E. et al. 1999. An algorithm for clustering cDNAs for gene expression analysis. RECOMB '99, 188-197.

Johnson, D.S. 1987. The NP-completeness column: an ongoing guide. J. Algorithms 8:438-448.

Mirkin, B. 1996. Mathematical Classification and Clustering. Dordrecht: Kluwer.

Morgan, J.N., and Sonquist, J.A. 1963. Problems in the analysis of survey data, and a proposal. JASA 58:415-434.

Nau, D.S., Markowsky, G., Woodbury, M.A., and Amos, D.B. 1978. A mathematical analysis of human leukocyte antigen serology. Math. Biosci. 40:243-270.

Orlin, J. 1977. Containment in graph theory: covering graphs with cliques. Nederl. Akad. Wetensch. Indag. Math. 39:211-218.

Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nature Genetics 22:281-285.

Yannakakis, M. 1981. Node deletion problems on bipartite graphs. SIAM J. Comput. 10:310-327.