Detecting Mutual Functional Gene Clusters from
Multiple Related Diseases
Nan Du∗ , Xiaoyi Li∗ , Yuan Zhang† and Aidong Zhang∗
∗Computer Science and Engineering Department, State University of New York at Buffalo, Buffalo, U.S.A.
{nandu, xiaoyili, azhang}@buffalo.edu
†College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China
zhangyuan@emails.bjut.edu.cn
Abstract—Discovering functional gene clusters from gene expression data is a widely used method that offers a tremendous opportunity for understanding the functional genomics of a specific disease. Because of its power to organize and interpret large sets of genes, many studies have detected and analyzed gene clusters for various diseases. However, more and more evidence suggests that human diseases are not isolated from each other. It is therefore significant and interesting to detect the common functional gene clusters that drive the core mechanisms shared among multiple related diseases. There are two main challenges in this task: first, the gene expression data for each disease may contain noise; second, the common factors underlying multiple diseases are hard to detect. To address these challenges, we propose a novel deep architecture to discover the mutual functional gene clusters across multiple types of diseases. To demonstrate that the proposed method can discover precise and meaningful gene clusters that are not directly obtainable with traditional methods, we perform extensive experimental studies on both synthetic and real datasets, the latter being public gene-expression data of three types of cancer. Experimental results show that the proposed approach is highly effective in discovering the mutual functional gene clusters.
I. INTRODUCTION
Gene cluster detection based on gene expression data is a widely used method proven helpful for understanding gene function, gene regulation, cellular processes, and subtypes of cells. Genes with similar expression patterns (co-expressed genes) are grouped together into a gene cluster, and such genes are likely to be involved in the same cellular processes [11]. These gene clusters may help us further understand the functions of genes for which information has not been previously available [5].
A variety of methods have been proposed or used in the microarray literature for detecting gene clusters for various kinds of diseases. All of these methods cluster the genes under a single disease or a single condition. However, increasing evidence shows that some diseases are very likely to be related to each other. A report from the National Cancer Institute [10] shows that mutations of the two commonly cited breast cancer genes, BRCA1 and BRCA2, are associated with a significantly increased risk of ovarian cancer; [18] presented findings that individuals who have cancer, including women with uterine, ovarian, and breast cancers, are at a statistically significant increased risk of colorectal cancer; moreover, it has been reported that women with breast cancer have a significantly increased risk of developing a subsequent lung cancer [16]. Therefore, we believe that some core mechanisms or hidden factors may influence multiple related diseases simultaneously. Learning the mutual gene clusters shared among the diseases thus provides not only a global view of human diseases, but also potentially new insights into their etiology and into the design of novel therapeutic interventions. So far, few unified mathematical models have been proposed to detect the mutual functional gene clusters across multiple diseases.
The most straightforward way to find the mutual functional gene clusters across various types of diseases is to use clustering ensemble methods, whose key idea is to independently detect the gene clusters from each data source and then combine the multiple clustering results into a single consensus clustering. However, because gene expression data usually contain noise, which may come from sample contamination, experimental design, or measurement errors, the clustering result based directly on each specific dataset is usually not very reliable, let alone the ensemble result. In addition, among the many factors that affect a specific disease, only a few are mutual factors also shared by the other diseases. It is easy to see that separating the mutual factors from the exclusive factors in an unsupervised setting is very challenging.
Recently, much effort has been devoted to developing learning algorithms for deep learning methods such as Deep Belief Networks and Stacked Autoencoders, with impressive results obtained in application domains such as computer vision and natural language processing [6]. The extensive learning power of these models is well suited to our task. Thus, we propose a novel deep architecture which can effectively discover the mutual functional gene clusters among multiple diseases. Note that our work differs from existing deep learning approaches in that we develop an approach to detect mutual clusters across multiple sources, while existing work focuses on a single data source and targets different problems. Our proposed architecture includes three layers, where each layer takes a specific responsibility: the first layer discovers the exclusive hidden factors that can well represent a specific disease; the second layer extracts the mutual hidden factors shared by the diseases; and the third layer groups the genes into clusters based on the mutual hidden factors. The overall structure of the proposed deep architecture is shown in Fig. 1. Since the major goal of the proposed approach is to detect the mutual gene clusters across multiple related diseases, our approach is referred to as Mutual Gene Cluster Detection (MGCD).
Fig. 1: Illustration of Overall Structure. Each disease's visible units feed disease-specific hidden units (1st layer), which feed the mutual hidden units (2nd layer), which in turn feed the cluster units (3rd layer).
In summary, there are three main contributions of this paper:

• We investigate the problem of detecting the mutual gene clusters across multiple related diseases.

• A novel deep architecture, MGCD, is proposed, which is represented as a multilayer network.

• Experiments on synthetic datasets show that mutual clusters are easily detected from multiple sources by the proposed method. On real cancer datasets, meaningful mutual functional gene clusters are detected, and enrichment analysis suggests that the detected mutual gene clusters may reflect core mechanisms of cancers.

II. METHODOLOGY

In this section, we present our deep architecture MGCD for discovering the mutual gene clusters across multiple related diseases.

A. Problem Setting

We consider the problem of discovering the mutual gene clusters across multiple related diseases. To address this problem, we propose a deep architecture whose goal is to group the genes into clusters. Our task is summarized as follows:

Suppose we have a set of gene expression data W = {W^1, ..., W^C} from C different types of diseases. Each gene expression dataset W^c (1 ≤ c ≤ C) is represented as an N × S^c expression matrix, where N denotes the number of genes, S^c denotes the number of samples for the c-th disease, and each cell w_{ij}^c in W^c is the measured expression level of the i-th gene in the j-th sample of the c-th disease. Note that although we assume the genes across the different diseases (i.e., N) are the same, the number of samples for each specific disease (S^c, 1 ≤ c ≤ C) can differ. Based on this set of gene expression matrices, we aim at finding K gene clusters shared among the diseases.

B. Single Disease Representation

As mentioned, each disease's gene expression profile is very likely to be influenced by some hidden factors. Thus, in the first layer of our model, we discover the hidden factors for each specific disease. To represent a specific disease (say, the c-th disease) via hidden factors, we use a Restricted Boltzmann Machine (RBM), which constructs a set of visible units v^c and a set of hidden units h^c. In our case, the visible unit vector v^c ∈ ℜ^{1×V^c} represents the samples' expression profiles on a specific gene, and the hidden unit vector h^c ∈ ℜ^{1×H^c} represents the hidden factors we want to learn.

Fig. 2: Illustration of the First Layer's Network Structure.

Intuitively, the goal of learning in the first layer is to learn the significance of all hidden factors given the observed data, so that the learned hidden factors get as close as possible to the true ones. The distribution over v^c and h^c is defined through the following energy function:

E(v^c, h^c) = -\sum_{i=1}^{V^c} \sum_{j=1}^{H^c} v_i^c w_{ij}^c h_j^c - \sum_{i=1}^{V^c} b_i^c v_i^c - \sum_{j=1}^{H^c} a_j^c h_j^c,    (1)

where V^c denotes the number of visible units v^c, H^c denotes the number of hidden units h^c, a^c ∈ ℜ^{1×H^c} represents the bias units of the hidden layer, and b^c ∈ ℜ^{1×V^c} represents the bias units of the visible layer. Based on this energy function, the joint probability distribution of the visible and hidden units is defined as:

p(v^c, h^c) = exp(-E(v^c, h^c)) / Z,    (2)

where Z = \sum_{v^c, h^c} exp(-E(v^c, h^c)) is a normalization constant that sums over all configurations of the visible and hidden units. Because the RBM is a bipartite graph, as shown in Fig. 2, there are no direct connections between hidden units, and a hidden unit is activated with probability:

p(h_j^c = 1 | v^c) = σ(a_j^c + \sum_{i=1}^{V^c} v_i^c w_{ij}^c),    (3)

where σ(x) = 1/(1 + exp(-x)) is the logistic sigmoid function. Taking h_j^c as the j-th hidden factor for the expression in the c-th disease, the activation probability p(h_j^c = 1 | v^c) indicates how significantly this factor affects the observed data. Using binary states for the hidden units helps avoid unnecessary sampling noise. Similarly, a visible unit's state can be reconstructed from the hidden units as:

p(v_i^c = 1 | h^c) = σ(b_i^c + \sum_{j=1}^{H^c} w_{ij}^c h_j^c).    (4)

Denoting the model parameters by θ = {a^c, b^c, W^c}, the RBM is trained by minimizing the negative log-likelihood with respect to θ:

L(θ) = -\sum_{v^c} log \sum_{h^c} p(v^c, h^c).    (5)
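As a concrete illustration of the first layer, the conditional probabilities of Eqs. (3) and (4) and one contrastive-divergence (CD-1) update, a standard approximation to the gradient of Eq. (5), can be sketched in numpy as below. The paper does not state its exact training procedure, so CD-1, the unit counts, and all array names here are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, H = 6, 4                                   # visible/hidden units for one disease
W = 0.01 * rng.standard_normal((V, H))        # weights w_ij
a = np.zeros(H)                               # hidden biases a_j
b = np.zeros(V)                               # visible biases b_i
v = rng.integers(0, 2, size=V).astype(float)  # one binarized expression profile

# Eq. (3): activation probability of each hidden factor given the data
p_h = sigmoid(a + v @ W)
h = (rng.random(H) < p_h).astype(float)       # sampled binary hidden states

# Eq. (4): reconstruct the visible units from the hidden states
p_v = sigmoid(b + h @ W.T)
v_recon = (rng.random(V) < p_v).astype(float)
p_h_recon = sigmoid(a + v_recon @ W)

# CD-1 parameter update (approximate negative-log-likelihood gradient, Eq. (5))
lr = 0.1
W += lr * (np.outer(v, p_h) - np.outer(v_recon, p_h_recon))
b += lr * (v - v_recon)
a += lr * (p_h - p_h_recon)
```

Repeating this update over all gene profiles of a disease yields the disease-specific hidden factors used by the second layer.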
For the second layer (Section II-C), the reconstruction cross-entropy used in Eq. (8) is defined as:

L(h, z) = -\sum_{j=1}^{H} [ h_j log z_j + (1 - h_j) log(1 - z_j) ].    (9)
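The corruption, mapping, and reconstruction of Eqs. (6)-(9) in Section II-C amount to one denoising-autoencoder pass. A minimal numpy sketch follows; the sizes, untied decoder weights, and random initialization are assumptions, since the paper does not specify them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
H, M = 12, 5                       # concatenated first-layer units, mutual units
W_hat = 0.01 * rng.standard_normal((H, M))
p = np.zeros(M)                    # bias of the mutual hidden units
W_prime = 0.01 * rng.standard_normal((M, H))
p_prime = np.zeros(H)

h = rng.random(H)                  # concatenated hidden activations from layer 1

# stochastic corruption: force a fixed fraction e of the units to zero
e = 0.25
h_tilde = h.copy()
h_tilde[rng.random(H) < e] = 0.0

m = sigmoid(h_tilde @ W_hat + p)          # Eq. (6): mutual hidden units
z = sigmoid(m @ W_prime + p_prime)        # Eq. (7): reconstruction of h

# Eq. (9): reconstruction cross-entropy against the *uncorrupted* h
loss = -np.sum(h * np.log(z) + (1.0 - h) * np.log(1.0 - z))
```

Minimizing this loss over all genes (Eq. (8)) forces m to retain only factors that survive corruption, i.e., the mutual factors.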
C. Mutual Hidden Units Learning

Although the genes may behave very differently in different diseases, there are some subtle common hidden factors shared among the diseases which can help us accurately discover the mutual gene clusters. These common factors may be informative genes, core mechanisms, genetic mutations, or external stimuli that we do not know yet. From the first layer, we have learned the hidden factors with respect to each specific disease. Now, based on these disease-specific factors, we want to find the mutual factors that are shared across the diseases.

To identify the valid hidden factors shared among the diseases in an unsupervised setting, we train the second layer in a feature-selection fashion. Instead of directly mapping the initial input vector h ∈ ℜ^{1×H}, where H is the total number of hidden units from the first layer (i.e., H = H^1 + H^2 + ... + H^C), to a mutual hidden representation m ∈ ℜ^{1×M} (as in Fig. 3), we first corrupt h to a partially destroyed version h̃ by means of a stochastic mapping [21], and then use h̃ to train the model. Specifically, a fixed percentage e of the units in h are chosen at random and forced to zero, while the others are left untouched. The idea behind randomly zeroing some factors is to randomly guess that some factors are not mutual factors.

Fig. 3: Illustration of the Second Layer's Network Structure.

After h̃ is corrupted from h, the mutual hidden units m are mapped from h̃ as:

m = f_θ(h̃) = σ(h̃ Ŵ + p),    (6)

parametrized by θ = {Ŵ, p}, where Ŵ denotes the weight matrix of the second layer and p represents the bias of the mutual hidden units. The resulting m is then mapped back to a reconstruction vector z ∈ ℜ^{1×H}:

z = g_{θ'}(m) = σ(m Ŵ' + p'),    (7)

which is parametrized by θ' = {Ŵ', p'}. The parameters of this model are optimized to minimize the average reconstruction error:

θ*, θ'* = arg min_{θ,θ'} (1/N) \sum_{i=1}^{N} L(h^{[i]}, z^{[i]}) = arg min_{θ,θ'} (1/N) \sum_{i=1}^{N} L(h^{[i]}, g_{θ'}(f_θ(h̃^{[i]}))).    (8)

In Eq. (8), L denotes the reconstruction cross-entropy of Eq. (9), which treats h and z as vectors of Bernoulli bit probabilities.

As we can see, the mutual hidden units m, which can be viewed as a lossy compression of h, form a distributed representation that captures the common factors shared among multiple diseases.

D. Gene Cluster Detection

In the second layer, we have learned the mutual hidden factors underlying multiple diseases. Now, in the third layer, we group the genes into clusters via a competitive learning network, based on the learned mutual factors. Suppose we want to cluster the genes into K clusters; then K cluster units are added. Each cluster unit corresponds to a specific cluster, and there are full connections (i.e., W̃) between the cluster units and the mutual hidden units. The network structure of the third layer is shown in Fig. 4. It is worth noting that the third layer's network is trained with a competitive rule. When a gene arrives, represented by its mutual hidden units m_j (1 ≤ j ≤ M), the cluster unit with the smallest Euclidean distance to the mutual units is selected, as in Eq. (10):

x = arg min_{x'} \sqrt{ \sum_{j=1}^{M} (m_j - w̃_{x'j})^2 }.    (10)

Then the edge weights connecting this winning cluster unit x are updated as:

w̃_{ij}(t+1) = w̃_{ij}(t) + η (m_j - w̃_{ij}(t)),  if i = x;    w̃_{ij}(t+1) = w̃_{ij}(t),  if i ≠ x,    (11)

where η ∈ (0, 1] is a learning rate which controls the speed of learning. On the one hand, the learning rate should be sufficiently large to allow fast learning; on the other hand, it should be small enough to guarantee effectiveness. Thus, we utilize the adaptive learning-rate method [17], which automatically updates η, increasing it when the current η is expected to be far from the optimum and decreasing it when the distance is uncertain. From this update rule, we can see that there is a competition among the cluster units: only the 'winner' (i.e., the cluster unit closest to the input) is updated, while the remaining units' weights stay the same.

Fig. 4: Illustration of the Third Layer's Network Structure.

There are two main advantages of using a competitive network to detect the gene clusters: first, due to the learning-rate adaptation, the convergence speed is improved; second, as a network-based method, it is easy to reconstruct the detected cluster units back into the units of every previous layer. To be more specific, each gene cluster can be represented as a binary vector over the cluster units k_j (1 ≤ j ≤ K), where the corresponding element is 1 and the remaining elements are 0. When this cluster unit backpropagates its value to each layer, we can learn how the cluster is represented by the units of the different layers: k_j W̃^T and k_j W̃^T Ŵ^T denote the j-th gene cluster's representation at the mutual hidden units and the exclusive hidden units, respectively. Interestingly, when the cluster unit backpropagates to the visible units of each disease, we obtain the standard gene expression representation of this gene cluster in each disease.

This provides not only important insights into understanding and validating the gene clusters, but also the opportunity to see the specificity of each gene cluster for a certain disease. The cluster data reconstructed from p(h^c | m, k) can be viewed as the centroid of the gene cluster. Therefore, for a given gene cluster i, we know its member genes' original expression profiles and its reconstructed centroid, and we can calculate its members' distances to the reconstructed centroid via the Root Mean Square Error (RMSE) for each type of disease. Ideally, if a cluster's members have a large distance to the reconstructed centroid in a certain disease, this gene cluster is likely not so enriched in that disease. Although the detected gene clusters are believed to be shared among the diseases, some gene clusters may be more active in some specific diseases than others. Thus, knowing the specificity of the gene clusters corresponding to each disease helps us better understand human diseases.
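The winner-take-all rule of Eqs. (10) and (11) can be sketched as below. The genes' mutual representations and the cluster count are made up for illustration, and a fixed learning rate stands in for the adaptive scheme of [17]:

```python
import numpy as np

rng = np.random.default_rng(2)
M, K = 5, 3                        # mutual hidden units, cluster units
W_tilde = rng.random((K, M))       # cluster-to-mutual weights
genes = rng.random((50, M))        # mutual representations of 50 genes

eta = 0.05                         # fixed here; the paper adapts it per [17]
for _ in range(20):
    for m in genes:
        # Eq. (10): pick the cluster unit closest to the input
        x = np.argmin(np.linalg.norm(W_tilde - m, axis=1))
        # Eq. (11): move only the winner toward the input
        W_tilde[x] += eta * (m - W_tilde[x])

# final cluster assignment of each gene
labels = np.array([np.argmin(np.linalg.norm(W_tilde - m, axis=1)) for m in genes])
```

Because only the winning row of W_tilde moves per presentation, each row settles near the centroid of the genes it wins, which is what makes the backward reconstruction of cluster centroids described above meaningful.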
III. EXPERIMENTS ON SYNTHETIC DATASET
In this section, we conduct experiments on the synthetic
data to perform quantitative analysis on the proposed method.
A. Data Generation and Evaluation Metric
The synthetic data are generated under the assumption that some common clusters are shared across the multiple data sources. We generate 200 objects (i.e., genes), divided into five clusters of 40 objects each, and these objects are measured over a set of features (i.e., samples). If objects belong to the same cluster, their expression profiles over the features are drawn from the same distribution. Following the same rule, we generate four sources, each of which contains the same set of genes but a different number of samples. It is worth noting that a cluster's distribution may differ across sources. Furthermore, α percent of the samples are randomly chosen to receive unreliable expression profiles: their expression profiles are randomly shuffled and noise is added to them. This way of generating the synthetic data simulates the actual situation.
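Under the stated recipe (five clusters of 40 objects, per-source cluster distributions, an α-fraction of noisy samples), one synthetic source can be generated as sketched below; the Gaussian distribution choices and scales are assumptions, since the paper does not specify them:

```python
import numpy as np

rng = np.random.default_rng(3)
n_clusters, per_cluster, n_samples = 5, 40, 30
alpha = 0.3                                    # fraction of unreliable samples

labels = np.repeat(np.arange(n_clusters), per_cluster)       # 200 objects
means = rng.normal(0.0, 3.0, size=(n_clusters, n_samples))   # per-cluster profile
X = means[labels] + rng.normal(0.0, 0.5, size=(200, n_samples))

# corrupt alpha percent of the samples: shuffle their profiles and add noise
noisy = rng.choice(n_samples, size=int(alpha * n_samples), replace=False)
for s in noisy:
    X[:, s] = rng.permutation(X[:, s]) + rng.normal(0.0, 1.0, size=200)
```

Repeating this with fresh cluster means yields the other sources, which share labels but not distributions.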
Because of the way the synthetic data are generated, we know the label of each gene, so we can directly assess the quality of the results by measuring the discrepancy between the predicted mutual clustering and the ground-truth labels. The evaluation metric for this experiment is the Normalized Mutual Information (NMI) criterion [20]. Note that NMI is 0 for a random partition and 1 when the clustering result partitions the data exactly as the ground truth does.
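A self-contained NMI implementation can be written as follows; normalizing by the mean of the two entropies is one of several common conventions, chosen here as an assumption:

```python
import numpy as np

def nmi(pred, truth):
    """Normalized mutual information between two labelings of the same objects."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    n = len(pred)
    pc, tc = np.unique(pred), np.unique(truth)
    # joint distribution of (predicted cluster, true cluster)
    joint = np.array([[np.sum((pred == p) & (truth == t)) for t in tc]
                      for p in pc]) / n
    pp, pt = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pp, pt)[nz]))
    ent = lambda q: -np.sum(q[q > 0] * np.log(q[q > 0]))
    denom = 0.5 * (ent(pp) + ent(pt))
    return mi / denom if denom > 0 else 1.0

# identical partitions score 1.0 regardless of label names
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # ≈ 1.0
```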
B. Performance Comparison

To demonstrate the effectiveness of the proposed model, we first introduce some baselines. As discussed in Section I, one intuitive way to find the mutual gene clusters shared by various diseases is clustering ensembles. Thus, our first baseline is Instance-Based Graph Formulation (IBGF) [20], which first constructs a graph measuring the pairwise similarity among objects based on the clustering result from each source, and then uses this similarity matrix in conjunction with graph partitioning. [20] also proposed Cluster-Based Graph Formulation (CBGF), which is used as the second baseline. Instead of measuring the similarity among objects, CBGF measures the similarity among the different clusters in a given ensemble and partitions the graph into groups so that the clusters of the same group correspond to one another. The third baseline, HBGF [7], models both the objects and the clusters of the ensemble as vertices in a bipartite graph, and then partitions the objects and clusters simultaneously.

Besides graph-based clustering ensemble methods such as the three above, Ayad et al. [1] proposed a voting-based clustering ensemble method, Adaptive Cumulative Voting (Ada-cvote), which is used as the fourth baseline. Non-negative Matrix Factorization (NMF) has also been used to find common clusters from multiple sources [12], and is used as the fifth baseline.
To demonstrate MGCD's noise resistance, we increase the noise rate α from 20% to 70% in steps of 10%. In this experiment, we initialize the learning rate η = 0.01 and the corruption parameter e = 10%. The experimental results of the proposed method and the baselines, measured with NMI, are shown in Fig. 5. Note that each result shown here is the average over 50 runs. Clearly, for every method, the performance of detecting the mutual gene clusters drops as the noise rate increases. As the figure shows, the proposed method performs consistently better than the other methods at every noise rate, which demonstrates its strong noise resistance. In addition, MGCD performs well at detecting the underlying mutual clusters among multiple sources, because the second layer of the proposed framework can extract the common factors effectively.
Fig. 5: Comparison with Different Noise Rates on Synthetic Dataset (NMI of IBGF, CBGF, HBGF, Ada-cvote, NMF, and MGCD as α varies from 0.2 to 0.7).
C. Parameter Sensitivity

In this part, we show how the proposed method performs in various learning scenarios by tuning two variables: the number of hidden units for each disease and the number of mutual hidden units. First, we fix the number of mutual hidden units at 200 and vary the number of hidden units as 10, 40, 100, 200, and 400. Although the number of hidden units for different diseases can be set to different values, for simplicity we set them to be equal here. Fig. 6(a) shows the performance of the proposed method in terms of the number of hidden units. As we can see from the figure, when the number of hidden units is extremely small, i.e., 10, the performance is poor. In addition, the deviation is relatively large when only 10 hidden units are used for each disease. As the number of hidden units increases further, the performance is stable and the deviation decreases. This suggests that each synthetic source can be well represented with 40 hidden units. Note that real-life data are more complicated than the synthetic data, so a larger number of hidden units should be chosen.
Similarly, to demonstrate the influence of different numbers of mutual hidden units, we fix the number of hidden units at 40 and vary the number of mutual hidden units as 10, 40, 100, 200, and 400. Fig. 6(b) shows the performance of the proposed method in terms of the number of mutual hidden units. As we can see, the performance is rather stable, indicating that, for these multiple synthetic sources, 40 mutual hidden units are sufficient. However, for real disease gene expression data, more mutual hidden units may be needed.
Fig. 6: Performance of MGCD in terms of Varying Numbers of (a) Hidden Units per Disease and (b) Mutual Hidden Units (NMI on the y-axis).
IV. EXPERIMENTS ON REAL DATASET
In Section III, we demonstrated that the proposed approach is effective in discovering the mutual clusters shared by multiple sources. In this section, we apply the proposed method to real cancer datasets and show the meaningful mutual gene clusters it detects.
A. Data Set
Microarray gene expression data are collected from three different cancer types: breast, prostate, and lung cancer. The breast dataset was collected from 24 primary breast tumor patients, who were divided into two diagnostic categories based on the patients' response to neoadjuvant treatment (sensitive or resistant) [3]; the prostate dataset [19] includes gene expression measurements for 52 prostate tumor patients; and the lung dataset [2] contains gene expression information on 186 lung tissue samples. We use these three types of cancer as the target diseases because previous studies have shown that they are related to each other [16], [8].

Systematic analysis of mutual gene clusters provides important insights into the cellular defects of cancer. Therefore, it is important and interesting to find the underlying core mechanisms shared by these three related diseases.
B. Results

Based on these three cancer datasets, we detected 50 mutual gene clusters (i.e., K = 50) via the proposed approach. After filtering out the gene clusters with fewer than three members, we obtain 48 gene clusters whose sizes range from 3 to 276. Moreover, we first calculate the average distance between each gene cluster's members (i.e., their raw gene expression profiles) and its corresponding reconstructed disease-related pattern (i.e., the reconstructed gene expression profile introduced in Section II-D), and then sort the clusters in ascending order of this average distance to obtain a ranking list. The average distance measures how faithfully a gene cluster reflects a specific disease. Therefore, the higher a gene cluster's position in the ranking list, the more confidently we believe that it is a mutual gene cluster shared among these three kinds of cancers.
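This ranking step is a straightforward RMSE computation; a numpy sketch follows, where the reconstructed centroids and member profiles are hypothetical stand-ins for the quantities described above:

```python
import numpy as np

rng = np.random.default_rng(4)
n_clusters, n_samples = 4, 20
centroids = rng.random((n_clusters, n_samples))     # reconstructed centroids
# member expression profiles; later clusters are made noisier for illustration
members = {c: centroids[c] + rng.normal(0, 0.1 * (c + 1), size=(10, n_samples))
           for c in range(n_clusters)}

def avg_rmse(cluster):
    # RMSE of each member to the cluster's reconstructed centroid, then averaged
    diffs = members[cluster] - centroids[cluster]
    return np.mean(np.sqrt(np.mean(diffs ** 2, axis=1)))

# ascending order: the tightest (most confidently mutual) clusters come first
ranking = sorted(range(n_clusters), key=avg_rmse)
```

With the noise scaled up per cluster here, the tightest cluster ranks first, mirroring how the paper promotes clusters whose members sit close to their reconstructed pattern.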
A widely used way to analyze gene clusters is to subdivide them into functional categories for biological interpretation, which is most commonly accomplished using Gene Ontology (GO) categories. The GO provides biologists with a list of gene annotations, which serve as inferences for understanding the community-level biological functions of genes rather than investigating each gene individually. When GO analysis is applied to the mutual gene clusters, we can find the strongest and most significant gene functions influencing these types of cancer. Due to the page limit, Table I only shows the top 10 significant Biological Process (BP) gene annotations with the lowest p-values for the top three stable gene clusters detected by the proposed approach. Note that all of these gene annotations are calculated by DAVID [9] and all p-values are below 5.74E-03, far below the common threshold of 0.05. Thus, all the gene annotations from the detected mutual gene clusters are considered strongly enriched in their annotation categories.
Besides the validation from the gene annotations' p-values, we have also found evidence of their relationships to the corresponding types of cancer. Three information sources are mainly considered: 1) the Genes-to-Systems Breast Cancer (G2SBC) Database [14], a bioinformatics resource that collects and integrates data about genes, transcripts, proteins, and ontologies which have been reported in the literature to be altered in breast cancer cells; 2) the cancer miRNA regulatory network (CMRN) [15], constructed by inferring miRNA-mediated regulation from different cancer transcriptome profiling studies, which also provides the gene enrichment detected for cancers in previous studies; and 3) the Gene Information Content statistic [13], which Brig Mecham et al. used to measure the amount of activity of each gene in several prostate datasets, yielding a list of 500 GO terms likely to be related to prostate cancer. For each annotation, the evidence for its relationship with one of the target cancers is also listed in Table I. From this table, we can see that each detected gene cluster has some gene annotations that have been shown to be related to each of the cancers. In other words, the detected mutual gene clusters are very likely to be truly 'shared', at least from the viewpoint of gene annotations, by the three related cancers.
TABLE I: Functional Description of Three Mutual Gene Clusters

Cluster     GO term      p-value
Cluster 1   GO:0019226   6.16E-04
Cluster 1   GO:0048878   6.89E-04
Cluster 1   GO:0007268   9.46E-04
Cluster 1   GO:0006873   9.86E-04
Cluster 1   GO:0055082   1.1E-03
Cluster 2   GO:0008284   1.55E-04
Cluster 2   GO:0042127   2.07E-04
Cluster 2   GO:0035150   2.15E-04
Cluster 2   GO:0050880   2.15E-04
Cluster 2   GO:0051270   3.33E-04
Cluster 2   GO:0003018   3.54E-04
Cluster 2   GO:0030334   4.60E-04
Cluster 2   GO:0042311   7.81E-04
Cluster 3   GO:0007423   1.04E-05
Cluster 3   GO:0035295   5.42E-04
Cluster 3   GO:0060429   6.39E-04
Cluster 3   GO:0001568   9.52E-04
Cluster 3   GO:0001944   1.07E-03

Per-annotation evidence of relationships to breast, lung, and prostate cancer is drawn from [13], [14], and [15].

V. RELATED WORK

Broadly speaking, our work is related to clustering ensembles with various strategies, such as graph-based methods [20], [7] and voting-based methods [1]. All of these studies focus on combining multiple base clustering results to generate a stable and robust consensus clustering. However, as we discussed in Section I, these methods' performance highly depends on the base clustering results. Since in our case the raw gene expression data contain noise, the base clustering results are also unreliable. In addition, the experiments on the synthetic dataset in Section III demonstrated that clustering-ensemble-based methods cannot discover the underlying mutual gene clusters under high noise rates.

Besides the topics mentioned above, Xu et al. [22] made a comparative study on identifying genetic and serum markers for multiple cancer types based on microarray gene expression data. Chen et al. [4] estimated expression-pattern similarities between several different tumor tissues and their corresponding normal tissues. Different from these studies, which focus on detecting the mutual information at the gene level (i.e., informative genes, bio-markers, or differentially expressed genes), we focus on discovering the mutual information at the cluster level.

VI. CONCLUSION
In this paper, we introduced the new problem of detecting mutual gene clusters from the gene expression data of several related diseases. These mutual gene clusters reflect core mechanisms or hidden factors that may influence the related diseases simultaneously. To handle this problem, we proposed a novel deep architecture, MGCD, which is represented as a multilayer network. In this network, exclusive hidden factors for each disease are discovered in the first layer, the mutual hidden factors are then extracted from the exclusive factors in the second layer, and finally the mutual gene clusters are detected in the third layer. Our extensive experimental analysis demonstrated that the proposed method is effective on the synthetic datasets. Case studies on three real cancer datasets showed that meaningful and interesting mutual gene clusters can be revealed by the proposed method.
VII. ACKNOWLEDGMENTS

The materials published in this paper are partially supported by the National Science Foundation under Grants No. 1218393, No. 1016929, and No. 0101244.
REFERENCES

[1] H. G. Ayad and M. S. Kamel. Cumulative voting consensus method for partitions with variable number of clusters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1):160-173, 2008.
[2] A. Bhattacharjee, W. G. Richards, J. Staunton, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America, 98(24):13790-13795, 2001.
[3] J. C. Chang, E. C. Wooten, A. Tsimelzon, et al. Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. Lancet, 362(9381):362-369, 2003.
[4] M. Chen, J. Xiao, Z. Zhang, J. Liu, J. Wu, and J. Yu. Identification of human HK genes and gene expression regulation study in cancer from transcriptomics data analysis. PLoS ONE, 8:e54082, 2013.
[5] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America, 95(25):14863-14868, 1998.
[6] D. Erhan, A. Courville, and P. Vincent. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625-660, 2010.
[7] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, New York, NY, USA, 2004. ACM.
[8] J. H. Hankin, L. P. Zhao, L. R. Wilkens, and L. N. Kolonel. Attributable risk of breast, prostate, and lung cancer in Hawaii due to saturated fat. Cancer Causes & Control, 3(1):17-23, 1992.
[9] D. W. Huang, B. T. Sherman, and R. A. Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1):44-57, 2009.
[10] National Cancer Institute. Genetics of breast and ovarian cancer: Peutz-Jeghers syndrome, Aug. 2006.
[11] D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering, 16:1370-1386, 2004.
[12] C. M. Lee, M. A. V. Mudaliar, et al. Simultaneous non-negative matrix factorization for multiple large scale gene expression datasets in toxicology. PLoS ONE, 7(12):e48238, 2012.
[13] B. Mecham. Top500 GO terms of prostate cancer, Feb. 2011.
[14] E. Mosca, R. Alfieri, I. Merelli, F. Viti, A. Calabria, and L. Milanesi. A multilevel data integration resource for breast cancer study. BMC Systems Biology, 4(1):1-11, 2010.
[15] C. L. Plaisier, M. Pan, and N. S. Baliga. A miRNA-regulatory network explains how dysregulated miRNAs perturb oncogenic processes across diverse cancers. Genome Research, 2012.
[16] M. Prochazka et al. Lung cancer risks in women with previous breast cancer. Annals of Oncology, 285(1):3090-3091, 2002.
[17] R. Ranganath, C. Wang, D. Blei, and E. Xing. An adaptive learning rate for stochastic variational inference. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 298-306, 2013.
[18] R. E. Schoen, J. L. Weissfeld, and L. H. Kuller. Are women with breast, endometrial, or ovarian cancer at increased risk for colorectal cancer? American Journal of Gastroenterology, 89(6):835-842, 1994.
[19] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203-209, 2002.
[20] A. Strehl and J. Ghosh. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583-617, 2003.
[21] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103, 2008.
[22] K. Xu, J. Cui, V. Olman, Q. Yang, D. Puett, and Y. Xu. A comparative analysis of gene-expression data of multiple cancer types. PLoS ONE, 5:e13696, 2010.