From: ISMB-00 Proceedings. Copyright © 2000, AAAI (www.aaai.org). All rights reserved.
Biclustering of Expression Data

Yizong Cheng§* and George M. Church†‡

†Department of Genetics, Harvard Medical School, Boston, MA 02115
§Department of ECECS, University of Cincinnati, Cincinnati, OH 45221
yizong.cheng@uc.edu, church@salt2.med.harvard.edu
Abstract
An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores, and it is shown to perform well in finding co-regulation patterns in yeast and human. This introduces "biclustering", or simultaneous clustering of both genes and conditions, to knowledge discovery from expression data. This approach overcomes some problems associated with traditional clustering methods, by allowing automatic discovery of similarity based on a subset of attributes, simultaneous clustering of genes and conditions, and overlapped grouping that provides a better representation for genes with multiple functions or regulated by many factors.

Keywords: microarray, gene expression pattern, clustering
Introduction
Gene expression data are being generated by DNA chips and other microarray techniques, and they are often presented as matrices of expression levels of genes under different conditions (including environments, individuals, and tissues). One of the usual goals in expression data analysis is to group genes according to their expression under multiple conditions, or to group conditions based on the expression of a number of genes. This may lead to the discovery of regulatory patterns or condition similarities.
The current practice is often the application of some agglomerative or divisive clustering algorithm that partitions the genes or conditions into mutually exclusive groups or hierarchies. The basis for clustering is often the similarity between genes or conditions as a function of the rows or columns in the expression matrix. The similarity between rows is often a function of the row vectors involved, and that between columns a function of the column vectors. Functions that have been used include the Euclidean distance (or the related coefficient of correlation and Gaussian similarity) and the dot product

*Tel: (513) 556-1809, Fax: (513) 556-7326.
‡Tel: (617) 432-7266, Fax: (617) 432-7663.
Copyright © 2000, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
(or a nonlinear generalization of it, as used in kernel methods) between the vectors. All conditions are given equal weights in the computation of gene similarity, and vice versa. One must doubt not only the rationale of weighting all conditions or all genes equally, but also that of giving the same weight to the same condition in the similarity computation between all the genes, and vice versa. Any such formula leads to the discovery of some similarity groups at the expense of obscuring some other similarity groups.
In expression data analysis, beyond grouping genes and conditions based on overall similarity, it is sometimes necessary to salvage information lost during oversimplified similarity and grouping computation. One of the goals of doing so is to disclose the involvement of a gene or a condition in multiple pathways, some of which can only be discovered under the dominance of more consistent ones.
In this article, we introduce the concept of a bicluster, corresponding to a subset of genes and a subset of conditions with a high similarity score. Similarity is not treated as a function of pairs of genes or pairs of conditions. Instead, it is a measure of the coherence of the genes and conditions in the bicluster. This measure can be a symmetric function of the genes and conditions involved, and thus the finding of biclusters is a process that groups genes and conditions simultaneously. If we project these biclusters onto the dimension of genes or that of conditions, then we can see the result as a clustering of either genes or conditions into possibly overlapping groups.
A particular score that applies to expression data transformed by a logarithm and augmented by the additive inverse is the mean squared residue. The residue of an element a_{ij} in the bicluster indicated by the subsets I and J is

    a_{ij} - a_{iJ} - a_{Ij} + a_{IJ},    (1)

where a_{iJ} is the mean of the ith row in the bicluster, a_{Ij} the mean of the jth column in the bicluster, and a_{IJ} that of all elements in the bicluster. The mean squared residue is the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance. We want to find biclusters with low mean squared residue, in particular large and maximal ones with scores below a certain threshold.
A special case of a perfect score (a zero mean squared residue) is a constant bicluster, with all elements having a single value. When a bicluster has a non-zero score, it is always possible to remove a row or a column to lower the score, until the remaining bicluster becomes constant. The problem of finding a maximum bicluster with a score lower than a threshold includes, as a special case, the problem of finding a maximum biclique (complete bipartite subgraph) in a bipartite graph. If a maximum biclique is one that maximizes the number of vertices involved (maximizing |I| + |J|), then the problem is equivalent to finding a maximum matching in the bipartite complement and can be solved using polynomial-time max-flow algorithms. However, this approach often results in a submatrix with maximum perimeter and zero area, particularly in the case of expression data, where the number of genes may be hundreds of times larger than the number of conditions.
If the goal is to find the largest balanced biclique, for example the largest constant square submatrix, then the problem is proven to be NP-hard (Johnson, 1987). On the other hand, the hardness of finding one with the maximum area is still unknown.
Divisive algorithms for partitioning data into sets with approximately constant values have been proposed by Morgan and Sonquist (1963) and Hartigan (1972). The result is a hierarchy of clusters, and the algorithms foretold the more recent decision-tree procedures. Hartigan (1972) also mentioned that the criterion for partitioning may be other than a constant value, for example a two-way analysis of variance model, which is quite similar to the mean squared residue scoring proposed in this article. Rather than a divisive algorithm, our approach is more of the node-deletion type (Yannakakis, 1981).
The term biclustering has been used by Mirkin (1996) to describe "simultaneous clustering of both row and column sets in a data matrix". Other terms that have been associated with the same idea include "direct clustering" (Hartigan, 1972) and "box clustering" (Mirkin, 1996). Mirkin (1996) presents a node-addition algorithm, starting with a single cell in the matrix, to find a maximal constant bicluster.
The algorithms mentioned above either find one constant bicluster, or find a set of mutually exclusive near-constant biclusters that cover the data matrix. There are ample reasons to allow biclusters to overlap in expression data analysis. One of the reasons is that a single gene may participate in multiple pathways that may or may not be co-active under all conditions. The problem of finding a minimum set of biclusters, either mutually exclusive or overlapping, to cover all the elements in a data matrix is a generalization of the problem of covering a bipartite graph by a minimum set of bicliques, either mutually exclusive or overlapping, which has been shown to be NP-hard (Orlin, 1977). Nau, Markowsky, Woodbury, and Amos (1978) gave an interesting application of biclique covering to the interpretation of leukocyte-serum immunological reaction matrices, which are not unlike the gene-condition expression matrices.
In expression data analysis, the most important goal may not be finding the maximum bicluster or even finding a bicluster cover for the data matrix. More interesting is the finding of a set of genes showing strikingly similar up-regulation and down-regulation under a set of conditions. A low mean squared residue score plus a large variation from the constant may be a good criterion for identifying these genes and conditions.
In the following sections, we present a set of efficient algorithms that find these interesting gene and condition sets. The basic iterate in the method consists of the steps of masking null values and biclusters that have been discovered, coarse and fine node deletion, node addition, and the inclusion of inverted data.
Methods
A gene-condition expression matrix is a matrix of real numbers, with possible null values as some of the elements. Each element represents the logarithm of the relative abundance of the mRNA of a gene under a specific condition. The logarithm transformation is used to convert doubling or other multiplicative changes of the relative abundance into additive increments.
Definition 1. Let X be the set of genes and Y the set of conditions. Let a_{ij} be the element of the expression matrix A representing the logarithm of the relative abundance of the mRNA of the ith gene under the jth condition. Let I ⊂ X and J ⊂ Y be subsets of genes and conditions. The pair (I, J) specifies a submatrix A_{IJ} with the following mean squared residue score:

    H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2,    (2)

where

    a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \qquad a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij},    (3)

and

    a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij} = \frac{1}{|I|} \sum_{i \in I} a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{Ij}    (4)

are the row and column means and the mean in the submatrix (I, J). A submatrix A_{IJ} is called a δ-bicluster if H(I, J) ≤ δ for some δ ≥ 0. □
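The score in Definition 1 is straightforward to compute directly. The following sketch (ours, not part of the original paper; it assumes numpy and a dense matrix with no null values) computes H(I, J) for a given row and column subset:

    import numpy as np

    def mean_squared_residue(A, rows, cols):
        # H(I, J) from Definition 1 for the submatrix of A
        # indexed by the index lists `rows` (I) and `cols` (J).
        sub = A[np.ix_(rows, cols)]
        row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
        col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
        overall = sub.mean()                          # a_IJ
        residue = sub - row_means - col_means + overall
        return (residue ** 2).mean()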
The lowest score H(I, J) = 0 indicates that the gene expression levels fluctuate in unison. This includes the trivial, or constant, biclusters where there is no fluctuation. These trivial biclusters may not be very interesting, but they need to be discovered and masked so that more interesting ones can be found. The row variance may serve as an accompanying score to reject trivial biclusters:

    V(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ})^2.    (5)
The matrix a_{ij} = ij, for i, j > 0, has the property that no submatrix of a size larger than a single cell has a score lower than 0.5. A K × K matrix of all 0s except a single 1 has the score

    h_K = \frac{(K - 1)^2}{K^4}.    (6)

A matrix with elements randomly and uniformly generated in the range [a, b] has an expected score of (b - a)^2/12. This result is independent of the size of the matrix. For example, when the range is [0, 800], the expected score is 53,333.
A translation (addition of a constant) applied to the matrix will not affect the H(I, J) score. A scaling (multiplication by a constant) will affect the score (by the square of the constant), but will have no effect if the score is zero. Neither translation nor scaling affects the ranking of the biclusters in a matrix.
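These invariances are easy to verify numerically with the sketch above (the array and index sets here are arbitrary examples, not data from the paper):

    A = np.random.rand(50, 20)
    I, J = list(range(10)), list(range(8))
    h = mean_squared_residue(A, I, J)
    assert np.isclose(mean_squared_residue(A + 7.0, I, J), h)        # translation leaves H unchanged
    assert np.isclose(mean_squared_residue(3.0 * A, I, J), 9.0 * h)  # scaling changes H by the square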
Theorem 1. The problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard.

Proof. We construct a reduction from the BALANCED COMPLETE BIPARTITE SUBGRAPH problem (GT24 in Garey and Johnson, 1979) to this problem. Given a bipartite graph (V_1, V_2, E) and a positive integer K, form a real-valued matrix A with a_{ij} = 0 if and only if (i, j) ∈ E, and a_{ij} = ij otherwise, for i, j > 0. If the largest square δ-bicluster in A, for a δ smaller than h_K, has a size larger than or equal to K, then there is a K × K biclique (complete bipartite subgraph). Since the BALANCED COMPLETE BIPARTITE SUBGRAPH problem is NP-complete, the problem of this theorem is NP-hard. □

Node Deletion

Every expression matrix contains a submatrix with the perfect score (H(I, J) = 0): any single element is such a submatrix. Certainly the kind of biclusters we look for should have a maximum size, both in terms of the number of genes involved (|I|) and in terms of the number of conditions (|J|).

If we start with a large matrix, say the one with all the data, then the question is how to select a submatrix with a low H score. A greedy method is to remove the row or column that achieves the largest decrease of the score. This requires the computation of the scores of all the submatrices that may result from any row or column removal, before each choice of removal can be made. This method (Algorithm 0) requires time in O((n + m)nm), where n and m are the row and column sizes of the expression matrix, to find one bicluster.
Algorithm 0 (Brute-Force Deletion and Addition).
Input: A, a matrix of real numbers, and δ ≥ 0, the maximum acceptable mean squared residue score.
Output: A_{IJ}, a δ-bicluster that is a submatrix of A with row set I and column set J, with a score no larger than δ.
Initialization: I and J are initialized to the gene and condition sets in the data and A_{IJ} = A.
Iteration:
1. Compute the score H for each possible row/column addition/deletion and choose the action that decreases H the most. If no action will decrease H, or if H ≤ δ, return A_{IJ}.
Algorithm 0, although a polynomial-time algorithm, is not efficient enough for a quick analysis of most expression data matrices. We propose below Algorithm 1, with time complexity in O(nm), and Algorithm 2, in O(m log n). The combination of the two provides a very efficient node-deletion algorithm for finding a bicluster with a low score. The correctness and efficiency of these algorithms are based on a number of lemmas, in which rows (or columns) are treated as points in a space where a distance is defined.
Lemma 1. Let S be a finite set of points in a space in which a non-negative real-valued function of two arguments, d, is defined. Let m(S) be a point that minimizes the function

    f(s) = \sum_{x \in S} d(x, s).    (7)

Define the measure

    E(S) = \frac{1}{|S|} \sum_{x \in S} d(x, m(S)).    (8)

Then, the removal of any non-empty subset

    R \subset \{ x \in S : d(x, m(S)) > E(S) \}    (9)

will only make

    E(S - R) < E(S).    (10)

Proof. Condition (10) can be rewritten as

    \frac{A'}{|S - R|} < \frac{A + B}{|S|},    (11)

where

    A = \sum_{x \in S - R} d(x, m(S)), \qquad A' = \sum_{x \in S - R} d(x, m(S - R)),    (12)

and

    B = \sum_{x \in R} d(x, m(S)).    (13)

The definition of the function m requires that A' ≤ A. Thus, a sufficient condition for the inequality (11) is

    \frac{A}{|S - R|} < \frac{A + B}{|S|},    (14)

which is equivalent to

    E(S) = \frac{A + B}{|S|} < \frac{1}{|R|} \sum_{x \in R} d(x, m(S)).    (15)

Clearly, (9) is a sufficient condition for this inequality and therefore also for (10). □
Lemma 2. Suppose the set removed from S is

    R \subset \{ x \in S : d(x, m(S)) \geq \alpha E(S) \}    (16)

with α ≥ 1. Then the reduction rate of the score E(S) can be characterized as

    \frac{E(S) - E(S - R)}{E(S)} \geq \frac{\alpha - 1}{|S|/|R| - 1}.    (17)

When a single point x is removed, the reduction rate has the bound

    \frac{E(S) - E(S - \{x\})}{E(S)} \geq \frac{d(x, m(S))/E(S) - 1}{|S| - 1}.    (18)

Proof. Using the notation in Lemma 1, we now have

    \alpha E(S) = \alpha \frac{A + B}{|S|} \leq \frac{1}{|R|} \sum_{x \in R} d(x, m(S)) = \frac{B}{|R|}.    (19)

This leads to

    \alpha |R| A \leq (|S| - \alpha |R|) B,    (20)

or, equivalently,

    |S| A \leq (|S| - \alpha |R|)(A + B).    (21)

This is the same as

    \frac{A}{|S - R|} \leq \frac{|S| - \alpha |R|}{|S - R|} \cdot \frac{A + B}{|S|}.    (22)

Using the inequality A' ≤ A and the facts that E(S - R) = A'/|S - R| and E(S) = (A + B)/|S|, this leads to

    E(S - R) \leq \frac{|S| - \alpha |R|}{|S - R|} E(S),

which is (17). Inequality (18) can be derived from (17). □

Lemma 2 acts as a guide on the trade-off between two types of node deletion: deleting one node at a time, and deleting a set of nodes at a time, before the score is recalculated. The two algorithms are listed below.
Algorithm 1 (Single Node Deletion).
Input: A, a matrix of real numbers, and δ ≥ 0, the maximum acceptable mean squared residue score.
Output: A_{IJ}, a δ-bicluster that is a submatrix of A with row set I and column set J, with a score no larger than δ.
Initialization: I and J are initialized to the gene and condition sets in the data and A_{IJ} = A.
Iteration:
1. Compute a_{iJ} for all i ∈ I, a_{Ij} for all j ∈ J, a_{IJ}, and H(I, J). If H(I, J) ≤ δ, return A_{IJ}.
2. Find the row i ∈ I with the largest

    d(i) = \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2    (23)

and the column j ∈ J with the largest

    d(j) = \frac{1}{|I|} \sum_{i \in I} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2.    (24)

3. Remove the row or column, whichever has the larger d value, by updating either I or J.

Theorem 2. The set of rows that can be completely or partially removed with the net effect of decreasing the score of a bicluster A_{IJ} is

    R = \left\{ i \in I : \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 > H(I, J) \right\}.    (25)

Proof. Let the points in Lemma 1 be |J|-dimensional real-valued vectors, and let S be the set of vectors b_i, i ∈ I, with components b_{ij} = a_{ij} - a_{iJ} for j ∈ J. The function d is defined as

    d(b_i, b_k) = \frac{1}{|J|} \sum_{j \in J} (b_{ij} - b_{kj})^2.

In this case, m(S) = \frac{1}{|I|} \sum_{i \in I} b_i and has the components a_{Ij} - a_{IJ}, so that d(b_i, m(S)) is exactly the quantity appearing in (25) and E(S) = H(I, J). The theorem then follows from Lemma 1. □

There is also a similar result for the columns.
The correctness of Algorithm 1 is shown by Theorem 2, in the sense that every removal decreases the score. Because there are only a finite number of rows and columns to remove, the algorithm terminates in no more than n + m iterates, where n and m are the number of genes and the number of conditions in the initial data matrix. However, it may happen that all d(i) and d(j) are equal to H(I, J) for i ∈ I and j ∈ J, in which case Theorem 2 does not apply. Even then, the removal of one of them may still decrease the score, unless the score is already zero.

Step 1 in each iterate requires time in O(nm), and a complete recalculation of all d values in Step 2 is also an O(nm) effort. The selection of the best row and column candidates takes O(log n + log m) time. When the matrix is bi-level, specifying "on" and "off" states of the genes, the update of the various quantities after the removal of a row takes only O(m) time, and that after the removal of a column only O(n) time. In this case, the algorithm can be made very efficient even for whole-genome expression data, with overall running time in O(nm). But for non-bi-level matrices, updates are more expensive, and it is advisable to use the following Multiple Node Deletion until the matrix is reduced to a manageable size, at which point Single Node Deletion becomes appropriate.
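To make the deletion step concrete, here is a minimal sketch of Single Node Deletion (ours, not the authors' implementation; it assumes numpy and the mean_squared_residue conventions of Definition 1, and it recomputes the score from scratch in every iterate rather than using the O(nm) incremental updates just discussed):

    def single_node_deletion(A, rows, cols, delta):
        # Algorithm 1: repeatedly drop the row or column with the
        # largest d value until H(I, J) <= delta.
        rows, cols = list(rows), list(cols)
        while True:
            sub = A[np.ix_(rows, cols)]
            residue = (sub - sub.mean(axis=1, keepdims=True)
                           - sub.mean(axis=0, keepdims=True) + sub.mean())
            H = (residue ** 2).mean()
            if H <= delta:
                return rows, cols
            d_rows = (residue ** 2).mean(axis=1)   # d(i) of (23)
            d_cols = (residue ** 2).mean(axis=0)   # d(j) of (24)
            i, j = d_rows.argmax(), d_cols.argmax()
            if d_rows[i] >= d_cols[j]:
                rows.pop(i)
            else:
                cols.pop(j)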
Algorithm 2 (Multiple Node Deletion).
Input: A, a matrix of real numbers, δ ≥ 0, the maximum acceptable mean squared residue score, and α > 1, a threshold for multiple node deletion.
Output: A_{IJ}, a δ-bicluster that is a submatrix of A with row set I and column set J, with a score no larger than δ.
Initialization: I and J are initialized to the gene and condition sets in the data and A_{IJ} = A.
Iteration:
1. Compute a_{iJ} for all i ∈ I, a_{Ij} for all j ∈ J, a_{IJ}, and H(I, J). If H(I, J) ≤ δ, return A_{IJ}.
2. Remove the rows i ∈ I with

    \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 > \alpha H(I, J).    (26)

3. Recompute a_{Ij}, a_{IJ}, and H(I, J).
4. Remove the columns j ∈ J with

    \frac{1}{|I|} \sum_{i \in I} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 > \alpha H(I, J).

5. If nothing has been removed in the iterate, switch to Algorithm 1.
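A matching sketch of the multiple-deletion phase (ours, same assumptions as above; it runs Steps 1 through 4 until nothing is removed, then falls back to Algorithm 1 as in Step 5):

    def multiple_node_deletion(A, rows, cols, delta, alpha=1.2):
        # Algorithm 2: delete all rows, then all columns, whose
        # d value exceeds alpha * H(I, J).
        rows, cols = list(rows), list(cols)
        removed = True
        while removed:
            removed = False
            H = mean_squared_residue(A, rows, cols)
            if H <= delta:
                return rows, cols
            # Step 2: remove rows with d(i) > alpha * H
            sub = A[np.ix_(rows, cols)]
            residue = (sub - sub.mean(axis=1, keepdims=True)
                           - sub.mean(axis=0, keepdims=True) + sub.mean())
            keep = (residue ** 2).mean(axis=1) <= alpha * H
            if not keep.all():
                rows = [r for r, k in zip(rows, keep) if k]
                removed = True
            # Steps 3-4: recompute, then remove columns with d(j) > alpha * H
            H = mean_squared_residue(A, rows, cols)
            sub = A[np.ix_(rows, cols)]
            residue = (sub - sub.mean(axis=1, keepdims=True)
                           - sub.mean(axis=0, keepdims=True) + sub.mean())
            keep = (residue ** 2).mean(axis=0) <= alpha * H
            if not keep.all():
                cols = [c for c, k in zip(cols, keep) if k]
                removed = True
        # Step 5: nothing was removed in the last iterate
        return single_node_deletion(A, rows, cols, delta)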
The correctness of Algorithm 2 is guaranteed by Lemma 2. When α is properly selected, the Multiple Node Deletion phase of Algorithm 2 (before the call to Algorithm 1) requires a number of iterates in O(log n + log m), which is usually extremely fast. Without updating the score after the removal of each node, the matrix may shrink too much and one may miss some large δ-biclusters (although later runs of the same algorithm may find them). One may also choose an adaptive α based on the score and size during the iteration.

Node Addition

After node deletion, the resulting δ-bicluster may not be maximal, in the sense that some rows and columns could be added without increasing the score. Lemma 3 and Theorem 3 below mirror Lemma 1 and Theorem 2 and provide a guideline for node addition.

Lemma 3. Let S, d, m(S), and E(S) be defined as in Lemma 1. Then, the addition to S of any non-empty subset

    R \subset \{ x \notin S : d(x, m(S)) \leq E(S) \}    (27)

will not increase the score:

    E(S + R) \leq E(S).    (28)

Proof. Condition (28) can be rewritten as

    \frac{A'}{|S + R|} \leq \frac{A - B}{|S|},    (29)

where

    A = \sum_{x \in S + R} d(x, m(S)), \qquad A' = \sum_{x \in S + R} d(x, m(S + R)),    (30)

and

    B = \sum_{x \in R} d(x, m(S)).    (31)

The definition of the function m requires that A' ≤ A. Thus, a sufficient condition for the inequality (29) is

    \frac{A}{|S + R|} \leq \frac{A - B}{|S|},    (32)

which is equivalent to

    E(S) = \frac{A - B}{|S|} \geq \frac{B}{|R|} = \frac{1}{|R|} \sum_{x \in R} d(x, m(S)).    (33)

Clearly, (27) is a sufficient condition for this inequality and therefore also for (28). □
Theorem 3. The set of rows that can be completely or partially added, with the net effect of not increasing the score of a bicluster A_{IJ}, is

    R = \left\{ i \notin I : \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \leq H(I, J) \right\}.    (34)

Proof. Similar to the proof of Theorem 2. □

There is also a similar result for the columns.
Algorithm 3 (Node Addition).
Input: A, a matrix of real numbers, and I and J signifying a δ-bicluster.
Output: I' and J' such that I ⊂ I' and J ⊂ J', with the property that H(I', J') ≤ H(I, J).
Iteration:
1. Compute a_{iJ} for all i, a_{Ij} for all j, a_{IJ}, and H(I, J).
2. Add the columns j ∉ J with

    \frac{1}{|I|} \sum_{i \in I} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \leq H(I, J).

3. Recompute a_{iJ}, a_{IJ}, and H(I, J).
4. Add the rows i ∉ I with

    \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \leq H(I, J).

5. For each row i still not in I, add its inverse if

    \frac{1}{|J|} \sum_{j \in J} (-a_{ij} + a_{iJ} - a_{Ij} + a_{IJ})^2 \leq H(I, J).

6. If nothing is added in the iterate, return the final I and J as I' and J'.
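One iterate of Algorithm 3 can be sketched as follows (ours, with the same assumptions as the earlier sketches; a full implementation would have to track the inverted rows added in Step 5 separately, since their values enter all later mean and score computations with the opposite sign):

    def node_addition(A, rows, cols):
        # Algorithm 3, one iterate: add columns, rows, and inverted
        # rows whose d value does not exceed the current H(I, J).
        rows, cols = list(rows), list(cols)
        # Step 2: add columns j not in J with d(j) <= H
        H = mean_squared_residue(A, rows, cols)
        sub = A[np.ix_(rows, cols)]
        a_iJ = sub.mean(axis=1, keepdims=True)         # row means over current J
        a_IJ = sub.mean()
        a_Ij = A[rows, :].mean(axis=0)                 # column means over current I, all j
        d_col = ((A[rows, :] - a_iJ - a_Ij + a_IJ) ** 2).mean(axis=0)
        cols = sorted(set(cols) | {j for j in range(A.shape[1]) if d_col[j] <= H})
        # Steps 3-5: recompute, then add rows (or inverted rows) with d <= H
        H = mean_squared_residue(A, rows, cols)
        sub = A[np.ix_(rows, cols)]
        a_Ij = sub.mean(axis=0)                        # column means over current I
        a_IJ = sub.mean()
        a_iJ = A[:, cols].mean(axis=1, keepdims=True)  # row means over current J, all i
        d_row = ((A[:, cols] - a_iJ - a_Ij + a_IJ) ** 2).mean(axis=1)
        d_inv = ((-A[:, cols] + a_iJ - a_Ij + a_IJ) ** 2).mean(axis=1)
        rows = sorted(set(rows) | {i for i in range(A.shape[0])
                                   if d_row[i] <= H or d_inv[i] <= H})
        return rows, cols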
Lemma 3 and Theorem 3 guarantee that the addition of rows and columns in Algorithm 3 will not increase the score. However, the resulting δ-bicluster may still not be maximal, for two reasons. The first is that Lemma 3 only gives a sufficient condition for adding rows and columns; it is not necessarily a necessary condition. The second is that by adding rows and columns, the score may decrease to the point where it is much smaller than δ. Each iterate in Algorithm 3 only adds rows and columns according to the current score, not δ.

Step 5 in the iteration adds inverted rows into the bicluster. These rows form "mirror images" of the rest of the rows in the bicluster and can be interpreted as co-regulated but receiving the opposite regulation. These inverted rows cannot be added to the data matrix at the beginning, because that would make all a_{Ij} = 0 and also a_{IJ} = 0.

Algorithm 3 is very efficient. Its time efficiency is comparable to that of the Multiple Node Deletion phase of Algorithm 2, in the order of O(mn).

Clearly, the addition of nodes does not have to take place after all deletion is done. Sometimes an addition may decrease the score more than any deletion. A single node deletion and addition algorithm based on the lemmas, and thus more efficient than Algorithm 0, is possible to set up.
Experimental Methods
The biclustering algorithms were tested on two sets of expression data, both of which had previously been clustered using conventional clustering algorithms: the yeast Saccharomyces cerevisiae cell cycle expression data from Cho et al. (1998) and the human B-cell expression data from Alizadeh et al. (2000).
Data Preparation
The yeast data contain 2,884 genes and 17 conditions. The genes were selected according to Tavazoie et al. (1999) and were identified by their SGD ORF names (Ball et al., 2000) from http://arep.med.harvard.edu/network_discovery. The relative abundance values (percentage of the mRNA for the gene among all mRNAs) were taken from a table prepared by Aach, Rindone, and Church (2000). (Two of the ORF names did not have corresponding entries in the table, and thus there were 34 null elements.) These numbers were transformed by scaling and logarithm, x → 100 log(10^5 x), and the result was a matrix of integers in the range between 0 and 600. (The transformation does not affect the values 0 and -1, the latter marking a null element.)
The human data were downloaded from the Web site of supplementary information for the article by Alizadeh et al. (2000). There were 4,026 genes and 96 conditions. The expression levels were reported as log ratios, and after scaling by a factor of 100, we ended up with a matrix of integers in the range between -750 and 650, with 47,639 missing values (12.3% of the matrix elements).
The matrices after the above preparation, along with the biclustering results, can be found at http://arep.med.harvard.edu/biclustering.
Missing Data Replacement
Missing data in the matrices were replaced with random numbers. The expectation was that these random values would not form recognizable patterns and thus would be the leading candidates for removal during node deletion.

The random numbers used to replace missing values in the yeast data were generated from a uniform distribution between 0 and 800. For the human data, the uniform distribution was between -800 and 800.
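A sketch of this replacement (ours; it assumes the matrices are floating-point numpy arrays with missing entries encoded as NaN, and yeast_matrix and human_matrix are placeholder names for the prepared matrices):

    def replace_missing(A, low, high, rng=np.random.default_rng(0)):
        # Fill missing entries with uniform noise so that they are
        # unlikely to form patterns and get removed by node deletion.
        A = A.copy()
        mask = np.isnan(A)
        A[mask] = rng.uniform(low, high, mask.sum())
        return A

    # yeast_filled = replace_missing(yeast_matrix, 0, 800)
    # human_filled = replace_missing(human_matrix, -800, 800)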
Determining Algorithm Parameters
The 30 clusters reported in Tavazoie et al. (1999) were used to determine the δ value in Algorithms 1 and 2. From the discussion above, we know that a completely random submatrix of any size for the value range (0 to 800) has a score of about 53,000. The clusters reported in Tavazoie et al. (1999) have scores in the range between 261 (Cluster 3) and 996 (Cluster 7), with a median of 630 (Clusters 8 and 14). A δ value (300) close to the lower end of this range was used in the experiment, to detect more refined patterns.
rows    columns    low    high    peak    tail
3       6          10     6870    390     15.5%
3       17         30     6600    480     6.83%
10      6          110    4060    800     0.064%
10      17         240    3470    870     0.002%
30      6          410    2460    960     < 10^-6
30      17         480    2310    1040    < 10^-6
100     6          630    1720    1020    < 10^-6
100     17         700    1630    1080    < 10^-6

Table 1: Score distributions estimated by randomly selecting one million submatrices for each size combination. The columns correspond to the number of rows, the number of columns, the lowest score, the highest score, the peak score, and the percentage of submatrices with scores below 300.
Submatrices of different sizes were randomly generated from the yeast matrix (one million for each size), and the distribution of scores, along with the probability that a submatrix of that size has a score lower than 300, was estimated and listed in Table 1.
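Such an estimate can be produced by sampling, along these lines (ours, reusing the mean_squared_residue sketch above; the default trial count is reduced here, since one million samples per size combination is slow in pure Python):

    def score_tail(A, n_rows, n_cols, trials=100_000,
                   threshold=300, rng=np.random.default_rng(0)):
        # Sample random n_rows x n_cols submatrices and estimate the
        # lowest and highest scores and the tail below `threshold`.
        scores = np.empty(trials)
        for t in range(trials):
            rows = rng.choice(A.shape[0], n_rows, replace=False)
            cols = rng.choice(A.shape[1], n_cols, replace=False)
            scores[t] = mean_squared_residue(A, rows, cols)
        return scores.min(), scores.max(), (scores < threshold).mean()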
The δ value used in the experiment with the human data was 1,200, because of the doubling of the range and the quadrupling of the variance in the data, compared to the yeast data.
Algorithm 1 (Single Node Deletion) becomes quite slow when the number of rows in the matrix is in the thousands, which is common in expression data. A proper α must be determined to run the accelerated Algorithm 2 (Multiple Node Deletion). Lemma 2 gives some guidance for the determination of α. Our aim was to find an α as large as possible that still allowed the program to find 100 biclusters in less than 10 minutes. When the number of conditions is less than 100, which was the case for both data sets, Steps 3 and 4 were not used in Algorithm 2, so deletion of conditions started only when Algorithm 1 was called. The α used in both experiments was 1.2.
Node Addition
Algorithm 3 was used after Algorithm 2 and Algorithm 1 (called by Algorithm 2), to add conditions and genes to further reduce the score. Only one iterate of Algorithm 3 was executed for each bicluster, based on the assumption that further iterates would not add much. Step 5 of Algorithm 3 was performed, so many biclusters contain a "mirror image" of the expression pattern. These additions were performed using the original data set (without the masking described below).
Masking Discovered Biclusters
Because the algorithms are all deterministic, repeated runs of them will not discover different biclusters unless the discovered ones are masked.

Each time a bicluster was discovered, the elements in the submatrix representing it were replaced by random numbers, exactly like those generated for the missing values (see Missing Data Replacement above). This made it very unlikely that elements covered by existing biclusters would contribute to any future pattern discovery. The masks were not used during node addition.
The steps described above are summarized in Algorithm 4 below.
Algorithm 4 (Finding a Given Number of Biclusters).
Input: A, a matrix of real numbers with possible missing elements, α > 1, a parameter for multiple node deletion, δ ≥ 0, the maximum acceptable mean squared residue score, and n, the number of δ-biclusters to be found.
Output: n δ-biclusters in A.
Initialization: Missing elements in A are replaced with random numbers from a range covering the range of non-null values. A' is a copy of A.
Iteration for n times:
1. Apply Algorithm 2 on A', δ, and α. If the row (column) size is small (less than 100), do not perform multiple node deletion on rows (columns). The matrix after multiple node deletion is B.
2. (Step 5 of Algorithm 2) Apply Algorithm 1 on B and δ; the matrix after single node deletion is C.
3. Apply Algorithm 3 on A and C; the result is the bicluster D.
4. Report D, and replace the elements in A' that are also in D with random numbers.
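Putting the pieces together, a compact driver in the spirit of Algorithm 4 (ours, not the authors' C implementation; it reuses the sketches above, and mask_range gives the range of the random masking values):

    def find_biclusters(A, n, delta, alpha=1.2, mask_range=(0, 800),
                        rng=np.random.default_rng(0)):
        # Algorithm 4: report n delta-biclusters, masking each one
        # in the working copy A_prime (A' in the paper) once found.
        A_prime = A.copy()
        results = []
        for _ in range(n):
            rows = list(range(A.shape[0]))
            cols = list(range(A.shape[1]))
            # Steps 1-2: coarse, then fine, node deletion on the masked copy
            rows, cols = multiple_node_deletion(A_prime, rows, cols, delta, alpha)
            # Step 3: node addition against the original, unmasked data
            rows, cols = node_addition(A, rows, cols)
            results.append((rows, cols))
            # Step 4: mask the discovered bicluster with random values
            A_prime[np.ix_(rows, cols)] = rng.uniform(
                mask_range[0], mask_range[1], (len(rows), len(cols)))
        return results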
Implementation and Display
These algorithms were implemented in C and run on a Sun Ultra 10 workstation. With the parameters specified above, 100 biclusters were discovered from each data set in less than 10 minutes. Plots were generated for each bicluster, showing the expression levels of its genes under the conditions in the bicluster. Thirty biclusters for the yeast data are plotted in Figures 1, 2, 3, and 4, and 24 for the human data in Figures 5 and 6. In the captions, "Bicluster" denotes a bicluster discovered using our algorithm and "Cluster" denotes a cluster discovered in Tavazoie et al. (1999). Detailed descriptions of these biclusters can be found at http://arep.med.harvard.edu/biclustering.
Results
From visual inspection of the plots, one can see that this biclustering approach works as well as conventional clustering methods when there were clear patterns over all attributes (conditions when the genes are clustered, or genes when the conditions are clustered).
Figure 1: The first bicluster (Bicluster 0) discovered by Algorithm 4 from the yeast data and the flattest one (Bicluster 47, with a row variance of 39.33) are examples of the "flat" biclusters the algorithm has to find and mask before more "interesting" ones may emerge. The bicluster with the highest row variance (4,162) is Bicluster 93 in Figure 3. The quantization effect visible at the lower ends of the expression levels was due to a lack of significant digits both before and after the logarithm.
Figure 3: 10 biclusters discovered in the order labeled. Biclusters 17, 67, 71, 80, and 99 (in the left column) contain genes in Clusters 4, 8, and 12 of Tavazoie et al. (1999), while Biclusters 57, 63, 77, 84, and 94 represent Cluster 7.
Figure 2: 12 biclusters, with numbers indicating the order in which they were discovered using the algorithms. These biclusters are clearly related to Cluster 2 in Tavazoie et al. (1999), which has a score of 757. They subdivide Cluster 2's profile into similar but distinct ones. These biclusters have scores less than 300. They also include 10 genes from Cluster 14 of Tavazoie et al. (1999) and 6 from other clusters. All biclusters plotted here except Bicluster 95 contain all 17 conditions, indicating that these conditions form a cluster well with respect to the genes included here.
Figure 4: 6 biclusters discovered in the order labeled. Bicluster 36 corresponds to Cluster 20 of Tavazoie et al. (1999). Bicluster 93 corresponds to Cluster 9. Biclusters 52 and 90 correspond to Clusters 14 and 30. Bicluster 46 corresponds to Clusters 1, 3, 11, 13, 19, 21, 25, and 29, and Bicluster 54 corresponds to Clusters 5, 15, 24, and 28. Notice that some biclusters have fewer than half of the 17 conditions and thus represent patterns shared by many clusters discovered using similarity based on all the conditions.
The δ parameter gives a powerful tool to fine-tune the similarity requirements. This explains the correspondence of one or more biclusters to each of the better clusters discovered in Tavazoie et al. (1999). These include the clusters associated with the highest-scoring motifs (Clusters 2, 4, 7, 8, 14, and 30 of Tavazoie et al.). For the other clusters from Tavazoie et al., there was no clear correspondence to our biclusters. Instead, Biclusters 46 and 54 represent the common features under some of the conditions of these lesser clusters.
Coverage of the Biclusters
In the yeast data experiment, the 100 biclusters covered 2,801 (97.12%) of the genes, 100% of the conditions, and 81.47% of the cells in the matrix.

The first 100 biclusters from the human data covered 3,687 (91.58%) of the genes, 100% of the conditions, and 36.81% of the cells in the data matrix.
Sub-Categorization of Tavazoie's Cluster 2
Figure 2 shows 12 biclusters containing mostly genes classified into Cluster 2 in Tavazoie et al. (1999). Each of these biclusters clearly represents a variation on the common theme of Cluster 2. For example, Bicluster 87 contains genes with three sharp peaks in expression (CLB6 and SPT21). Genes in Bicluster 62 (ERP3, LPP1, and PLM2) also showed three peaks, but the third of these is rather flat. Bicluster 56 shows a clear-cut double-peak pattern, with the DNA replication genes CDC9, CDC21, POL12, POL30, RFA1, and RFA2, among others. Bicluster 66 contains two-peak genes with an even sharper image (CDC45, MSH6, RAD27, SWE1, and PDS5). On the other hand, Bicluster 14 contains those genes with barely recognizable double peaks.
Figure 5: 12 biclusters discovered in the order labeled from the human expression data. Scores were 1,200 or lower. The numbers of genes and conditions in each are reported in the format (bicluster label, number of genes, number of conditions) as follows: (12, 4, 96), (19, 103, 25), (22, 10, 57), (39, 9, 51), (44, 10, 29), (45, 127, 13), (49, 2, 96), (52, 3, 96), (53, 11, 25), (54, …, 21), (75, 25, 12), (83, 2, …).
Broad Strokes and Fine Drawings
Figures 5 and 6 show various biclusters discovered in the human lymphoma expression data. Some of them represent a few genes closely following each other through almost all the conditions. Others show large numbers of genes displaying a broad trend and its mirror image. These "broad strokes" often involve smaller subsets of the conditions, and genes depicted in some of the "fine drawings" may be added to these broad trends during the node addition phase of the algorithm. These biclusters clearly say a lot about the regulatory mechanisms and also about the classification of conditions.
Comparison with Alizadeh's Clusters
We did a comparison of the first 100 biclusters discovered in the human lymphoma data with the clusters discovered in Alizadeh et al. (2000). Hierarchical clustering was used in Alizadeh et al. on both the genes and the conditions. We used the root division in each hierarchy in our comparison, and we call the two clusters generated by the root division in a hierarchy the primary clusters.

Out of the 100 biclusters, only 10 have conditions exclusively from one or the other primary cluster on conditions.
Figure 6: Another 12 biclusters discovered in the order labeled from the human expression data. The numbers of genes and conditions in each bicluster (whose label is the first figure in the triple) are as follows: (7, 34, 48), (14, 15, 46), (25, 57, 24), (27, 23, 30), (31, …, 17), (35, 102, 13), (37, 59, 18), (59, 8, 19), (60, …, 11), (67, 102, 20), (79, 18, 11), (86, 18, 11). Mirror images can be seen in most of these biclusters.
Bicluster 19 in Figure 5 is heavily biased towards one primary condition cluster, while Bicluster 67 in Figure 6 is heavily biased towards the other.
A similar proportion of the biclusters have genes exclusively from one or the other primary cluster on genes. The tendency is that the higher the row variance (5) and the larger the number of columns (conditions) involved, the more likely the genes in the bicluster are to come from one primary cluster. We define mix as the percentage of genes from the minority primary cluster in the bicluster, and plot mix versus row variance in Figure 7. Each digit in Figure 7 represents a bicluster, with the digit indicating the number of conditions in the bicluster.
All the biclusters shown in Figures 5 and 6 have row variance (square-rooted) greater than 100. Many of them have a zero percent mix (Biclusters 7, 12, 14, 22, 39, 44, 49, 52, and 53 from one primary cluster, and Biclusters 25, 54, 59, and 83 from the other). This demonstrates the correctness of using row variance as a means of separating interesting biclusters from trivial ones. Some biclusters have high row variances but also high mix scores (those in the upper right quadrant of the plot). These invariably involve a narrower view of the conditions, with the digit "1" indicating that the number of conditions involved is below 20%. Genes in these biclusters are distributed on both sides of the root division of hierarchical clustering. These biclusters show the complementary role that biclustering plays to one-dimensional clustering. Biclustering allows us to focus on the right subsets of conditions to see apparent co-regulatory patterns not seen with the global scope.
For example, Bicluster 60 suggests that under its 11 conditions, mostly for chronic lymphocytic leukemia (CLL), FMR2 and the interferon-γ receptor α chain behave against their stereotype and get down-regulated. As another example, Bicluster 35 suggests that under its 13 conditions, CD49B and SIT appear similar to genes classified by hierarchical clustering into the other primary cluster.
Discussion
We have introduced a new paradigm, biclustering, to gene expression data analysis. The concept itself can be traced back to the 1960s and 1970s, although it has rarely been used or even studied. This paradigm is relevant to gene expression data analysis because of the complexity of gene regulation and expression, and the sometimes low quality of the gathered raw data.
Biclustering is performed on the expression matrix, which can be viewed as a weighted bipartite graph. The concept of a bicluster is a natural generalization of the concept of a biclique in graph theory. There are one-dimensional clustering methods based on graphs constructed from similarity scores between genes, for example the hierarchical clustering method in Alizadeh et al. (2000) and the finding of highly connected subgraphs in Hartuv et al. (1999).
Figure 7: A plot of the first 100 biclusters in the human lymphoma data. Three measurements are involved. The horizontal axis is the square root of the row variance, defined in (5). The vertical axis is the mix, or the percentage of genes misclassified across the root division generated by hierarchical clustering in Alizadeh et al. (2000); we assumed that genes forming mirror-image expression patterns are naturally from the other side of the root division. The digits used to represent the biclusters in the plot also indicate the sizes of the condition sets in the biclusters: the digit n indicates that the number of conditions is between 10n and 10(n + 1) percent of the total number of conditions.
A major difference here is that biclustering does not start from, or require the computation of, overall similarity between genes.

The relation between biclustering and clustering is similar to that between the instance-based paradigm and the model-based paradigm in supervised learning. Instance-based learning (for example, nearest-neighbor classification) views data locally, while model-based learning (for example, feed-forward neural networks) finds globally optimal fitting models.
Biclustering has several obvious advantages over clustering. First, biclustering automatically selects genes and conditions with more coherent measurements and drops those representing random noise. This provides a method for dealing with missing data and corrupted measurements.

Secondly, biclustering groups items based on a similarity measure that depends on a context, which is best defined as a subset of the attributes. It discovers not only the grouping, but the context as well. To some extent, these two become inseparable and exchangeable, which is a major difference between biclustering and clustering rows after clustering columns.
Most expression data result from more or less complete sets of genes but very small portions of all the possible conditions. Any similarity measure between genes based on the available conditions is therefore context-dependent anyway. Clustering genes based on such a measure is no more representative than biclustering.
Thirdly, biclustering allows rows and columns to be included in multiple biclusters, and thus allows one gene or one condition to be identified by more than one functional category. This added flexibility correctly reflects the reality of the functionality of genes and of the overlapping factors in tissue samples and experimental conditions.
By showing the NP-hardness of the problem, we have tried to justify our efficient but greedy algorithms. But the nature of NP-hardness implies that there may be sizable biclusters with good scores that evade the search by any efficient algorithm. As with most efficient conventional clustering algorithms, one can say that the best biclusters can be found in most cases, but one cannot say that they will be found in all cases.
Acknowledgments

This research was conducted at the Lipper Center for Computational Genetics at the Harvard Medical School, while the first author was on academic leave from the University of Cincinnati.

References

Aach, J., Rindone, W., and Church, G.M. 2000. Systematic management and analysis of yeast gene expression data. Genome Research, in press.

Alizadeh, A.A. et al. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503-510.

Ball, C.A. et al. 2000. Integrating functional genomic information into the Saccharomyces Genome Database. Nucleic Acids Res. 28:77-80.

Cho, R.J. et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2:65-73.

Garey, M.R., and Johnson, D.S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco: Freeman.

Hartigan, J.A. 1972. Direct clustering of a data matrix. JASA 67:123-129.

Hartuv, E. et al. 1999. An algorithm for clustering cDNAs for gene expression analysis. RECOMB '99, 188-197.

Johnson, D.S. 1987. The NP-completeness column: an ongoing guide. J. Algorithms 8:438-448.

Mirkin, B. 1996. Mathematical Classification and Clustering. Dordrecht: Kluwer.

Morgan, J.N., and Sonquist, J.A. 1963. Problems in the analysis of survey data, and a proposal. JASA 58:415-434.

Nau, D.S., Markowsky, G., Woodbury, M.A., and Amos, D.B. 1978. A mathematical analysis of human leukocyte antigen serology. Math. Biosci. 40:243-270.

Orlin, J. 1977. Containment in graph theory: covering graphs with cliques. Nederl. Akad. Wetensch. Indag. Math. 39:211-218.

Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nature Genetics 22:281-285.

Yannakakis, M. 1981. Node deletion problems on bipartite graphs. SIAM J. Comput. 10:310-327.