To cluster genes from DNA microarray data, we propose an unsupervised methodology based on independent component analysis (ICA). Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA find components that are specific to certain biological processes. Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. We test the statistical significance of enrichment of gene annotations within each cluster. ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. This result supports our model of genomic expression data as a composite effect of independent biological processes. A comparison of clustering performance among various ICA algorithms, including a kernel-based nonlinear ICA algorithm, shows that nonlinear ICA performed best for small datasets and natural-gradient maximum-likelihood ICA worked well for all the datasets.
The expression pattern of genes in a given condition is a composite effect of the independent biological processes that are active in that condition. For example, suppose that there are 9 genes and 3 biological processes taking place inside a cell.
[Figure: Genes 1 to 9 shown for each biological process (panels: Ribosome Biosynthesis, Cell Cycle Regulation).]
Each biological process becomes active by turning on the genes associated with it.
Microarray data display the expression levels of a set of genes measured under various experimental conditions.
[Figure: microarray data matrix of genes G_1, G_2, ..., G_{N-1}, G_N by experiments Exp 1, Exp 2, Exp 3, ..., Exp i, ..., Exp M. One slice gives the expression pattern of genes under an experimental condition Exp i; the other gives the expression levels of a gene G_i across experimental conditions. Experiments may be conditions (heat shock, G phase in cell cycle, etc.) or samples (liver cancer patient, normal person, etc.).]
The observed genomic expression pattern can be seen as a combinational effect of the genomic expression programs of the biological processes that are active in that condition.
[Figure: Genes 1 to 9 under Oxidative Phosphorylation, Cell Cycle Regulation, and Ribosome Biosynthesis.]
The expression measurements of K genes observed in three conditions, denoted by $x_1$, $x_2$ and $x_3$, can be expressed as linear combinations of the genomic expression programs of three biological processes, denoted by $s_1$, $s_2$ and $s_3$.
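As a minimal sketch of this mixing model, the snippet below builds three observed conditions from three expression programs over 9 genes. The program values, the mixing weights, and the names `S`, `A`, and `mix` are all made up for illustration; they are not taken from any dataset.

```python
# Hypothetical expression programs s1, s2, s3 over 9 genes.
# A 1 means the process turns that gene on; the assignments are invented.
S = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],  # s1
    [0, 0, 0, 1, 1, 1, 0, 0, 0],  # s2
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # s3
]

# Hypothetical mixing weights: row i says how active each process is
# in observed condition x_i.
A = [
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 1.0],
]


def mix(A, S):
    """Return X = A S: each row of X is a linear combination of the rows of S."""
    n_genes = len(S[0])
    return [
        [sum(a * s[g] for a, s in zip(weights, S)) for g in range(n_genes)]
        for weights in A
    ]


X = mix(A, S)
for i, row in enumerate(X, start=1):
    print(f"x{i} =", row)
```

In this toy setting each observed row of `X` blends two programs, which is why the programs cannot be read off a single condition directly.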
[Figure: the unknown mixing system acting on Genes 1 to 9.]
We can measure the expression levels of genes using microarrays.
Given a microarray dataset, can we recover genomic expression programs of biological processes?
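$$
x = As, \qquad
\begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix}
=
\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}
\begin{pmatrix} s_1 \\ \vdots \\ s_n \end{pmatrix}
$$
$x_1, \ldots, x_m$: genomic expression patterns in certain experimental conditions (e.g. heat shock, starvation, hyper-osmotic shock).
$s_1, \ldots, s_n$: genomic expression programs of biological processes (e.g. ribosome biogenesis, oxidative phosphorylation, cell cycle regulation).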
In other words, can we decompose a matrix X into A and S so that each row of
S represents a genomic expression program of a biological process?
Using the log-likelihood maximization approach, we can find the $W$ that maximizes the log-likelihood $L(y, W)$.
$$ y = Wx, \qquad p(x) = |\det(W)| \, p(y) $$
The $y_i$'s are assumed to be statistically independent:
$$ p(y) = \prod_{i=1}^{n} p_i(y_i) $$
$$ L(y, W) = \log p(x) = \log |\det(W)| + \sum_{i=1}^{n} \log p_i(y_i) $$
Prior information on $y$: super-Gaussian or sub-Gaussian?
$$ \varphi(y) = -\frac{\partial p(y) / \partial y}{p(y)} = \left[ -\frac{\partial p(y_1) / \partial y_1}{p(y_1)}, \ldots, -\frac{\partial p(y_n) / \partial y_n}{p(y_n)} \right] $$
$$ \Delta W \propto \frac{\partial L(y, W)}{\partial W} = (W^T)^{-1} - \varphi(y) \, x^T $$
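The update rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used for the results here. It uses the natural-gradient form of the update, obtained by right-multiplying the gradient $(W^T)^{-1} - \varphi(y) x^T$ by $W^T W$, which gives $\Delta W \propto (I - \varphi(y) y^T) W$ and avoids the matrix inverse. It also assumes a super-Gaussian prior $p(y) \propto 1/\cosh(y)$, for which $\varphi(y) = -(\partial p / \partial y)/p = \tanh(y)$. The function name, learning rate, iteration count, and the synthetic Laplacian sources are all hypothetical choices.

```python
import math
import random


def natural_gradient_ica(X, n_iter=200, lr=0.1):
    """Estimate W so that the rows of Y = W X are approximately independent.

    X is a list of m rows (observed mixtures), each a list of samples.
    Assumes super-Gaussian sources, so phi(y) = tanh(y).
    """
    m, n_samples = len(X), len(X[0])
    # Center each observed mixture.
    centered = []
    for row in X:
        mean = sum(row) / n_samples
        centered.append([v - mean for v in row])

    def unmix(W):
        return [[sum(W[i][k] * centered[k][t] for k in range(m))
                 for t in range(n_samples)] for i in range(m)]

    W = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(n_iter):
        Y = unmix(W)
        phi = [[math.tanh(v) for v in row] for row in Y]
        # G = I - E[phi(y) y^T], averaged over all samples.
        G = [[(1.0 if i == j else 0.0)
              - sum(p * y for p, y in zip(phi[i], Y[j])) / n_samples
              for j in range(m)] for i in range(m)]
        # Natural-gradient step: W <- W + lr * G W.
        W = [[W[i][j] + lr * sum(G[i][k] * W[k][j] for k in range(m))
              for j in range(m)] for i in range(m)]
    return W, unmix(W)


# Synthetic check: two Laplacian (super-Gaussian) sources, mixed by a known A.
rng = random.Random(0)


def laplace_sample():
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)


S = [[laplace_sample() for _ in range(1000)] for _ in range(2)]
A = [[1.0, 0.5], [0.3, 1.0]]
X = [[sum(A[i][k] * S[k][t] for k in range(2)) for t in range(1000)]
     for i in range(2)]

W, Y = natural_gradient_ica(X)
# If separation worked, P = W A is close to a scaled permutation matrix.
P = [[sum(W[i][k] * A[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
print(P)
```

ICA recovers sources only up to order and scale, so the check looks at whether each row of `P` is dominated by a single entry rather than whether `W` equals the inverse of `A`.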
Apply ICA to microarray data $X$ to obtain $Y$.
Cluster genes based on the independent components, the rows of $Y$.
Based on our gene expression model, the independent components $y_1, \ldots, y_n$ are assumed to be expression programs of biological processes. For each $y_i$, genes are ordered by their activity levels on $y_i$, and the $C\%$ of genes ($C = 7.5$) showing significantly high or low levels are grouped into each cluster.
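The selection step can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes that each component yields two clusters, one for the C% of genes with the highest loads and one for the C% with the lowest, and that the cluster size is rounded to the nearest integer. The function name `clusters_from_component` and the toy loads are hypothetical.

```python
def clusters_from_component(loads, gene_ids, c_percent=7.5):
    """Return (high, low): the C% of genes with the largest and smallest loads.

    loads[g] is the activity level of gene g on one independent component.
    """
    # Assumption: round the cluster size, and keep at least one gene.
    size = max(1, round(len(loads) * c_percent / 100.0))
    order = sorted(range(len(loads)), key=lambda g: loads[g])
    low = [gene_ids[g] for g in order[:size]]           # most negative first
    high = [gene_ids[g] for g in order[-size:]][::-1]   # most positive first
    return high, low


# Toy component over 40 genes: even-indexed genes load positively,
# odd-indexed genes load negatively.
gene_ids = [f"G{g}" for g in range(40)]
loads = [(-1) ** g * g for g in range(40)]

high, low = clusters_from_component(loads, gene_ids)
print(high)  # ['G38', 'G36', 'G34']
print(low)   # ['G39', 'G37', 'G35']
```

With 40 genes and C = 7.5, each cluster holds 3 genes. Applying the function to every row of `Y` would give two clusters per component under this reading.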
For each method, the minimum p-values ($< 10^{-7}$) corresponding to each GO functional class were collected and compared.
The statistical significance of the biological coherence of clusters was measured using gene annotation databases such as Gene Ontology (GO).
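[Figure: each cluster (Cluster 1, Cluster 2, Cluster 3, ..., Cluster i, ...) is compared with each GO category (GO 1, GO 2, ..., GO j, ...); Cluster i and GO j share k genes.]
$$ p_{i,j} = 1 - \sum_{m=0}^{k-1} \frac{\binom{f}{m} \binom{g-f}{n-m}}{\binom{g}{n}} $$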
For every combination of a cluster and a GO category, we calculated the p-value, that is, the chance probability that the cluster and the GO category share at least the observed number of genes, based on the hypergeometric distribution.
g: number of genes in all clusters and GO categories
f: number of genes in GO j
n: number of genes in Cluster i
k: number of genes that GO j and Cluster i share
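The hypergeometric p-value can be computed directly from these four counts. The sketch below is a minimal illustration using only the Python standard library; the function name `enrichment_p_value` is hypothetical. It sums the upper tail (m from k upward), which equals one minus the sum over m from 0 to k-1 but avoids losing precision when the p-value is very small.

```python
from math import comb


def enrichment_p_value(g, f, n, k):
    """Probability that a cluster of n genes shares at least k genes with a
    GO category of f genes, when drawn at random from g genes in total."""
    upper_tail = sum(
        comb(f, m) * comb(g - f, n - m)
        for m in range(k, min(f, n) + 1)
    )
    return upper_tail / comb(g, n)


# Small worked example: 10 genes in total, a GO category of 4 genes,
# a cluster of 3 genes, 2 of which are in the category.
p = enrichment_p_value(g=10, f=4, n=3, k=2)
print(p)  # 40 / 120, about 0.333
```

In the example, the favourable counts are C(4,2)·C(6,1) + C(4,3)·C(6,0) = 36 + 4 = 40 out of C(10,3) = 120 possible clusters.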
For testing, five microarray datasets were used and for each dataset, the clustering performance of our approach was compared with another approach applied to the same dataset.
Dataset  Description                        Genes  Experiments  Compared with
D1       Yeast during cell cycle            5679   22           PCA
D2       Yeast during cell cycle            6616   17           k-means clustering
D3       Yeast under stressful conditions   6152   173          Bayesian approach, Plaid model
D4       C. elegans in various conditions   17817  553          Topomap approach
D5       19 kinds of normal human tissue    7070   59           PCA