Finding Informative Genes for Prostate Cancer: A General
Framework of Integrating Heterogeneous Sources
Liang Ge, Jing Gao, Nan Du, and Aidong Zhang
Computer Science and Engineering Department, State University of New York at Buffalo
Buffalo, 14260, U.S.A.
{liangge, jing, nandu, azhang}@buffalo.edu
ABSTRACT
Finding informative genes for prostate cancer has always
been an important topic in cancer study. With the widespread
use of genomic analysis and microarray experiments, a large
number of genes can be analyzed efficiently to find the informative ones based on high-throughput microarray experiments [5–9]. On the other hand, based on clinical studies,
several genes have already been identified to be important in
prostate cancer development and progression [23–28]. These
research results come from heterogeneous sources, with different formats, and expressing different perspectives of the
problem of finding informative genes for prostate cancer.
In this work, we aim to find the informative genes for prostate cancer by utilizing these heterogeneous sources of information from various research efforts. We propose a general framework that encodes heterogeneous sources including ranked lists of informative genes [5–9], microarray expression data [5–9], and important genes identified in [23–28]. The proposed framework estimates the conditional probability of a gene being informative and ranks the genes by this probability. The estimation of this probability is formulated as an optimization problem, for which we propose an efficient iterative algorithm. Furthermore, we show that the problem formulation is convex and that the iterative algorithm converges to the global optimal value. Extensive experiments
show that the utilization of heterogeneous information is
very helpful in finding informative genes and the proposed
method outperforms many other baseline methods.
Categories and Subject Descriptors
H.4 [Algorithms]: Experimentation
General Terms
Algorithms
1. INTRODUCTION
Finding informative genes for prostate cancer has always
been an important topic in cancer study. With the widespread
use of genomic and microarray analysis, a large number of
genes can be studied and analyzed efficiently to find the informative ones based on microarray expression data. Consequently, various prostate cancer microarray studies have
been reported [5–9]. Since those experiments were carried out in different laboratories, with different materials and microarray platforms, there exist certain discrepancies among their results. On the other hand, based on clinical studies, several genes have already been identified as important in prostate cancer development and progression, including hepsin (HPN), which promotes metastasis formation in an animal model of prostate cancer [23], alpha-Methylacyl-CoA racemase (AMACR), a clinically utilized marker of prostate cancer [24], and fatty acid synthase (FASN), an emerging therapeutic target [25].
Given those endeavors towards the same target, we believe
that each endeavor can be viewed as a unique perspective towards the problem of finding informative genes. Therefore,
by properly integrating those perspectives, we are able to
solve this problem more effectively. Inspired by this thought,
we propose to explore the heterogeneous sources of information regarding the problem of finding informative genes for
prostate cancer. To be specific, we consider the following sources of information: 1) the top-k informative genes produced by five prostate cancer studies [5–9]; 2) the microarray expression data associated with those studies [5–9]; and 3) the informative genes reported by other research in [23–28]. Given this information, we aim to infer a list of informative genes for prostate cancer.
Integrating heterogeneous sources of prostate cancer studies is challenging. The first challenge is that information from different sources comes in different formats. The outputs of the studies [5–9] are ranked lists, in which the top-k informative genes are given in ranked order. The microarray expression data are detailed expression values of genes on tumor and normal samples, forming a full matrix with rows being genes and columns being samples. The informative genes from [23–28] are a set of genes that are deemed important for prostate cancer. In addition, the microarray expression data from each study are not directly comparable because they are obtained in different laboratories and with different techniques. Therefore, designing a framework that can encode these different forms of information is the first challenge to be handled.
The second challenge is that, for prostate cancer studies, different sources bear different reliability, i.e., the confidence we place in each source differs. The informative genes identified in [23–28] are believed to be the most trustworthy, since those findings are based on extensive clinical studies. On the other hand, it is well known that high-throughput experiments contain many errors and much noise; therefore, the microarray expression data bear less credibility than the information from the clinical studies in [23–28]. The ranked lists of informative genes reported by [5–9] are generated from the microarray expression data and are therefore more reliable than the raw expression data, yet less reliable than the studies in [23–28]. Therefore, the problem of finding informative genes is coupled with the problem of estimating the reliability of each information source.
In this work, we propose a general framework to integrate
multiple heterogeneous sources to find the informative genes
for prostate cancer. The framework smoothly encodes heterogeneous sources of information with different formats and
estimates the conditional probability of each gene being informative. The genes are then ranked by this probability. The estimation of this conditional probability is formulated as an optimization problem whose objective function integrates all sources of information into a semi-supervised learning problem. We present an efficient iterative algorithm to solve it. The objective function is shown to be convex, and the iterative algorithm is shown to converge to a stationary point. We apply the proposed method using all sources of information, and the experimental results show that including multiple heterogeneous sources greatly improves the performance of finding informative genes for prostate cancer; the proposed method outperforms many baseline methods.
The major contributions of this paper are:
• It is the first work to find informative genes by integrating multiple heterogeneous sources of information.
• We propose a novel framework that smoothly integrates heterogeneous sources of information, together with an efficient algorithm to solve the resulting convex objective
function.
• The experimental evaluations show that the proposed
method outperforms many other baseline methods.
The organization of the paper is as follows: in Section 2,
we describe the setting of our problem and data sets used in
this work. Section 3 presents the proposed framework that
smoothly integrates heterogeneous sources of information.
An extensive experimental study is reported in Section 4.
Section 5 discusses the related work and we conclude our
work in Section 6.
2. DATA SETS AND PROBLEM SETTING
In this section, we present the data sets used in this paper
and the setting of our problem.
2.1 DATA SETS
In this work, we adopt the following three sources of information about the genes in prostate cancer studies.
• The top-k ranked list of informative genes from five
studies [5–9] as shown in Table 1.
• The microarray expression data associated with each
study in [5–9].
• The important genes for prostate cancer identified by
other studies: HPN [23], AMACR [24], FASN [25],
GUCY1A3 [26], ANK3 [27], STRA13 [28], CCT2 [28],
CANX [28] and TRAP1 [28].
The genes marked with an asterisk (*) in Table 1 are those confirmed to be important by other studies.
2.2 PROBLEM SETTING
Given a pool of n genes G = {g1 , ..., gn }, m experiments
are conducted to find the informative genes. Each experiment is performed on a subset of the gene pool Si and
produces a ranking of the participating genes, in which the
top-k genes are deemed to be informative. Each subset Si
is associated with a microarray expression data matrix Mi .
We also have a set I of informative genes confirmed by other studies. The goal is to infer a ranked list of genes that are
informative for prostate cancer. In this work, we solve this
problem by first estimating the conditional probability of a
gene being informative and then ranking the genes by their
probabilities.
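To make this setting concrete, the following is a minimal sketch of the inputs as simple Python structures; all gene names, list contents and expression values are illustrative placeholders rather than the actual study data.

```python
# Illustrative placeholders for the problem setting (not the actual study data).
import numpy as np

gene_pool = ["HPN", "AMACR", "FASN", "GENE_X", "GENE_Y"]   # G = {g1, ..., gn}

# Each experiment i contributes a top-k ranked list over its gene subset S_i ...
ranked_lists = {
    "study_1": ["HPN", "AMACR", "GENE_X"],
    "study_2": ["AMACR", "FASN", "GENE_Y"],
}

# ... and an expression matrix M_i (rows = genes in S_i, columns = samples).
rng = np.random.default_rng(0)
expression = {name: rng.normal(size=(len(genes), 8))
              for name, genes in ranked_lists.items()}

# Set I of informative genes confirmed by other studies.
confirmed = {"HPN", "AMACR"}
```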
3. METHODOLOGY
In this section, we show how we represent heterogeneous
sources of information and propose a general framework
to integrate all sources of information into an optimization
problem. The conditional probability of each gene being informative is the solution to this optimization problem. Then
we present an efficient iterative algorithm to obtain the optimal value of the objective function.
3.1 Heterogeneous Information Representation
In this section, we show how to represent the three sources of information.
Ranked Lists: For the ranked lists produced by the various studies, we observe that the ranked list from each study can be viewed as a classifier of the gene pool G, i.e., each experiment classifies the gene pool G into two classes: informative and uninformative. The top-k ranked list gives the informative genes predicted by that experiment, and the remaining genes in G are predicted to be uninformative.
Following the observation that each experiment can be viewed as a classifier of the gene pool G, and inspired by [15], each experiment corresponds to two groups: informative and uninformative. For m experiments, we therefore have 2m groups. Each gene g_i in G must belong to exactly m groups, i.e., one group from each experiment. This leads to a natural bipartite graph representation of the ranked lists. Figure 1
illustrates the bipartite graph representation of the ranked
lists.
In Figure 1, the nodes on the left represent groups and the nodes on the right represent genes. t1 and t2 represent the informative and uninformative groups of experiment m1, t3 and t4 represent the informative and uninformative groups of experiment m2, and so on. For each gene g_i, a solid line indicates that the gene belongs to the corresponding group. The bipartite graph naturally captures the observation that the ranked lists can be viewed as classifiers of the gene pool G.
We now introduce some notation. The affinity matrix $A_{n\times v}$ of the bipartite graph is defined by $a_{ij} = 1$ if gene $g_i$ is assigned to group $t_j$ and 0 otherwise, where $v = 2 \times m$ is the number of groups.
Table 1: Prostate Cancer Studies (genes marked with * are confirmed to be important by other studies)

Rank  Luo [5]    Welsh [6]  Dhana [7]  True [8]   Singh [9]
1     HPN*       HPN*       OGT        AMACR*     HPN*
2     AMACR*     AMACR*     AMACR*     HPN*       SLC25A6
3     CYP1B1     OACT2      FASN*      NME2       EEF2
4     ATF5       GDF15      HPN*       CBX3       SAT
5     BRCA1      FASN*      UAP1       GDF15      NME2
6     LGALS3     ANK3*      GUCY1A3*   MTHFD2     LDHA
7     MYC        KRT18      OACT2      MRPL3      CANX*
8     PCDHGC3    UAP1       SLC19A1    SLC25A6    NACA
9     WT1        GRP58      KRT18      NEM1       FASN*
10    TFF3       PPIB       EEF2       COX6C      SND1
11    MARCKS     KRT7       STRA13*    JTV1       KRT18
12    OS-9       NME1       ALCAM      CCNG2      RPL15
13    CCND2      STRA13*    GDF15      AP3S1      TNFSF10
14    NME1       DAPK1      NME1       EEF2       SERP1
15    DYRK1A     TMEM4      CALR       RAN        GRP58
16    TRAP1*     CANX*      SND1       PRKACA     ALCAM
17    FMO5       TRA1       STAT6      RAD23B     GDF15
18    ZHX2       PRSS8      TCEB3      PSAP       TMEM4
19    RPL36AL    ENTPD6     EIF4A1     CCT2*      CCT2*
20    ITPR3      PPP1CA     LMAN1      G3BP       SLC39A6
21    GCSH       ACADSB     MAOA       EPRS       RPL5
22    DDB2       PTPLB      ATP6V0B    CKAP1      RPS13
23    TFCP2      TMEM23     PPIB       LIG3       MTHFD2
24    TRAM1      MRPL3      FMO5       SNX4       G3BP2
25    YTHDF3     SLC19A1    SLC7A5     NSMAF      UAP1
Figure 2: Similarity Matrix built from Microarray Expression Data
Figure 1: Bipartite Graph Representation.
Since we want to estimate the conditional probability of $g_i$ being informative, we denote the conditional probabilities for genes by $F_{n\times c}$ and for groups by $Q_{v\times c}$, where $f_{iz} = \mathrm{Prob}(g_i \text{ is class } z \mid g_i)$ and $q_{jz} = \mathrm{Prob}(t_j \text{ is class } z \mid t_j)$. Here $c = 2$, and class $z$ denotes one of the two classes: informative and uninformative. We also define the initial class labels for the groups, $P_{v\times c}$, as $p_{jz} = 1$ if group $t_j$'s class is $z$ and 0 otherwise.
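A minimal sketch of this bipartite-graph encoding is given below, reusing the toy gene_pool and ranked_lists from the previous sketch (all placeholders); each study contributes an informative and an uninformative group, so v = 2m.

```python
import numpy as np

gene_pool = ["HPN", "AMACR", "FASN", "GENE_X", "GENE_Y"]
ranked_lists = {"study_1": ["HPN", "AMACR", "GENE_X"],
                "study_2": ["AMACR", "FASN", "GENE_Y"]}

n, m, c = len(gene_pool), len(ranked_lists), 2
v = 2 * m                                   # two groups (informative/uninformative) per study
gene_idx = {g: i for i, g in enumerate(gene_pool)}

A = np.zeros((n, v))                        # a_ij = 1 if gene g_i belongs to group t_j
P = np.zeros((v, c))                        # p_jz = 1 if group t_j has initial class z
for s, (study, top_k) in enumerate(ranked_lists.items()):
    informative, uninformative = 2 * s, 2 * s + 1
    P[informative, 0] = 1.0                 # class 0: informative
    P[uninformative, 1] = 1.0               # class 1: uninformative
    for g in gene_pool:
        A[gene_idx[g], informative if g in top_k else uninformative] = 1.0

# F (n x c) and Q (v x c) will hold the conditional class probabilities to be estimated.
```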
Microarray Expression Data: For each experiment, we can obtain the microarray expression data, which show the behavior of the genes in tumor and normal samples. The expression data are used to estimate the similarity between genes, under the assumption that informative genes should have similar expression profiles. For each study, we can obtain the expression data for the top-k informative genes reported by that study. Since different experiments are conducted on different platforms and with different techniques, direct comparisons of expression data across experiments are infeasible. Therefore, we cannot compute every element of the similarity matrix W for the gene pool G. Rather, the microarray expression data from each experiment can only be used to compute a submatrix of the whole similarity matrix W. Figure 2 illustrates the similarity matrix W built from the microarray expression data. Note that even with five microarray expression data sets, we still cannot compute every element of W. W_ij denotes the similarity between gene g_i and gene g_j. For genes that overlap across experiments, we take the average of the similarities computed from each study.
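The sketch below assembles such a partial similarity matrix: each study fills in the sub-block of W for its own genes, and entries covered by several studies are averaged. The particular similarity measure (a correlation mapped to [0, 1]) is an assumption for illustration, as the choice is not fixed above.

```python
import numpy as np

def build_similarity(gene_pool, ranked_lists, expression):
    n = len(gene_pool)
    idx = {g: i for i, g in enumerate(gene_pool)}
    W_sum = np.zeros((n, n))
    W_cnt = np.zeros((n, n))
    for study, genes in ranked_lists.items():
        M = expression[study]                  # rows follow the order of `genes`
        corr = np.corrcoef(M)                  # gene-by-gene correlation within this study
        sim = (corr + 1.0) / 2.0               # map [-1, 1] to [0, 1] (illustrative choice)
        for a, ga in enumerate(genes):
            for b, gb in enumerate(genes):
                W_sum[idx[ga], idx[gb]] += sim[a, b]
                W_cnt[idx[ga], idx[gb]] += 1
    # Average over studies where available; entries never covered stay zero.
    W = np.divide(W_sum, W_cnt, out=np.zeros_like(W_sum), where=W_cnt > 0)
    np.fill_diagonal(W, 0.0)                   # drop self-similarity (a common convention, assumed here)
    return W
```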
Informative Genes from Other Studies: The set of informative genes I can be encoded as prior knowledge of whether a gene is informative or not. Let the n × c matrix Y denote this prior knowledge, where y_iz = 1 if gene g_i is given class z and 0 otherwise. Class z can be informative or uninformative.
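A minimal sketch of this encoding, reusing the toy gene_pool and confirmed set from the earlier sketches:

```python
import numpy as np

gene_pool = ["HPN", "AMACR", "FASN", "GENE_X", "GENE_Y"]
confirmed = {"HPN", "AMACR"}

n, c = len(gene_pool), 2
Y = np.zeros((n, c))            # y_iz = 1 if gene g_i is given class z
for i, g in enumerate(gene_pool):
    if g in confirmed:
        Y[i, 0] = 1.0           # class 0: informative (genes not in I stay unlabeled)
```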
Table 2 summarizes the notation defined in this work.

Table 2: Notation

Symbol        Definition
A_{n×v}       a_{ij}: indicator of gene g_i belonging to group t_j
F_{n×c}       f_{iz}: probability of gene g_i w.r.t. class z
Q_{v×c}       q_{jz}: probability of group t_j w.r.t. class z
P_{v×c}       p_{jz}: indicator of group t_j predicted as class z
Y_{n×c}       y_{iz}: indicator of gene g_i predicted as class z
W_{n×n}       w_{ik}: similarity between gene g_i and g_k
1, ..., c     class indexes
1, ..., n     gene indexes
1, ..., m     experiment indexes
1, ..., v     group indexes

3.2 THE OBJECTIVE FUNCTION

Given the representations of the heterogeneous sources, we solve the problem by estimating F_{n×c}. We formulate this as the following objective function:

$$
\min_{F,Q} J(F, Q) = \sum_{j=1}^{v}\sum_{z=1}^{c}\sum_{i=1}^{n} a_{ij}(q_{jz} - f_{iz})^2
+ \alpha \sum_{j=1}^{v}\sum_{z=1}^{c} (q_{jz} - p_{jz})^2
+ \beta \sum_{i=1}^{n}\sum_{z=1}^{c} h_i (f_{iz} - y_{iz})^2
+ \gamma \sum_{i,j=1}^{n} W_{ij} \|f_i - f_j\|^2 \quad (1)
$$

$$
\text{s.t.} \quad \sum_{z=1}^{c} f_{iz} = 1, \quad \sum_{z=1}^{c} q_{jz} = 1, \quad f_{iz} \in [0, 1], \quad q_{jz} \in [0, 1],
$$

where $h_i = \sum_{z=1}^{c} y_{iz}$.

Interpretation: The first and second parts of objective function (1), i.e., $\sum_{j=1}^{v}\sum_{z=1}^{c}\sum_{i=1}^{n} a_{ij}(q_{jz} - f_{iz})^2 + \alpha \sum_{j=1}^{v}\sum_{z=1}^{c}(q_{jz} - p_{jz})^2$, correspond to the intuition that a group t_j corresponds to class z if the majority of genes in this group belong to class z, while a gene corresponds to class z if the majority of the groups it belongs to correspond to class z. Parameter α denotes the confidence over each group, i.e., the confidence in the information from this source.

The third term in Eq. (1), i.e., $\beta \sum_{i=1}^{n}\sum_{z=1}^{c} h_i (f_{iz} - y_{iz})^2$, enforces that the estimates should not deviate from the confirmed informative genes. β denotes the confidence in this source of information.

The fourth term in Eq. (1), i.e., $\gamma \sum_{i,j=1}^{n} W_{ij} \|f_i - f_j\|^2$, corresponds to the smoothness assumption that informative genes should be similar in terms of expression data. γ denotes the confidence in this source of information.

Note that the third and fourth terms in Eq. (1) correspond to the Label Propagation Model [19] widely used in semi-supervised learning, which propagates label information to unlabeled nodes using the smoothness assumption. The first and second terms in Eq. (1) correspond to the Graph Consensus Model proposed by Gao et al. in [15]. Therefore, the objective function in Eq. (1) naturally connects the Label Propagation Model and the Graph Consensus Model and lets the information from each source influence the others.

3.3 THE ITERATIVE ALGORITHM

We use the Block Coordinate Descent method to solve objective function (1). The algorithm is shown in Algorithm 1.

Algorithm 1 The Iterative Algorithm
Input: A_{n×v}, P_{v×c}, Y_{n×c}, W_{n×n}, parameters α, β, γ, ϵ
Output: Consensus matrix F
1: Initialize F^0, F^1 randomly;
2: t = 1;
3: while ||F^t − F^{t−1}|| > ϵ do
4:    Q^t = (D_v + αK_v)^{−1}(A^T F^{t−1} + αK_v P);
5:    F^t = (D_n + βH_n + γ(I − L̂))^{−1}(AQ^t + βH_n Y);
6:    t = t + 1;
7: end while
8: Output F^t

Here $D_v = \mathrm{diag}\{(\sum_{i=1}^{n} a_{ij})\}_{v\times v}$, $D_n = \mathrm{diag}\{(\sum_{j=1}^{v} a_{ij})\}_{n\times n}$, $K_v = \mathrm{diag}\{(\sum_{z=1}^{c} y_{jz})\}_{v\times v}$, $H_n = \mathrm{diag}\{(h_i)\}_{n\times n}$, and diag denotes a diagonal matrix formed from the given elements. D_v and D_n are normalization factors, and K_v acts as a constraint for the group nodes. $\hat{L}$ is the normalized Laplacian derived from W [19]: $\hat{L} = D_w^{-1/2} W D_w^{-1/2}$, where D_w is a diagonal matrix whose (i, i) element equals the sum of the i-th row of W. After obtaining F, we retrieve the conditional probability of each gene being informative and then rank the genes.
The iterative algorithm can be viewed as a process of information propagation. In each step, the algorithm performs a label propagation, in which label information is propagated to the gene nodes based on the smoothness assumption. After each gene node obtains its estimated label, the gene nodes propagate the information to the group nodes. The group nodes, after receiving the information, adjust Q and pass the information back to the gene nodes. In this way, the information from the heterogeneous sources influences each other. The process stops when it converges.
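To make the updates concrete, the following is a minimal NumPy sketch of Algorithm 1, assuming A, P, Y and W are built as in the earlier sketches; reading K_v as the indicator that each group carries an initial label (the identity matrix here) and taking H_n = diag(h_i) are assumptions of this sketch.

```python
import numpy as np

def iterative_algorithm(A, P, Y, W, alpha=2.0, beta=8.0, gamma=2.0,
                        eps=1e-6, max_iter=100, seed=0):
    n, v = A.shape
    c = P.shape[1]
    rng = np.random.default_rng(seed)

    D_v = np.diag(A.sum(axis=0))              # group degrees
    D_n = np.diag(A.sum(axis=1))              # gene degrees
    K_v = np.diag(P.sum(axis=1))              # assumed: 1 per group, since every group is labeled
    H_n = np.diag(Y.sum(axis=1))              # h_i = sum_z y_iz

    # Normalized operator L_hat = D_w^{-1/2} W D_w^{-1/2}; zero-degree rows are left as zero.
    d_w = W.sum(axis=1)
    inv_sqrt = np.where(d_w > 0, 1.0 / np.sqrt(d_w), 0.0)
    L_hat = (inv_sqrt[:, None] * W) * inv_sqrt[None, :]

    F_prev = rng.random((n, c))
    F = rng.random((n, c))
    t = 0
    while np.linalg.norm(F - F_prev) > eps and t < max_iter:
        Q = np.linalg.solve(D_v + alpha * K_v, A.T @ F + alpha * K_v @ P)
        F_new = np.linalg.solve(D_n + beta * H_n + gamma * (np.eye(n) - L_hat),
                                A @ Q + beta * H_n @ Y)
        F_prev, F = F, F_new
        t += 1
    return F  # F[i, 0]: estimated probability that gene g_i is informative
```

Genes are then ranked by F[:, 0], i.e., by their estimated probability of being informative, in decreasing order.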
3.4 ANALYSIS OF THE ITERATIVE ALGORITHM
In this section, we analyze the proposed algorithm and
show that the iterative algorithm converges to a stationary
point and the objective function is convex. We also present
the time complexity analysis of the iterative algorithm.
Lemma 1. Algorithm 1 converges to a stationary point.
Proof. We obtain the solution to objective function (1) using the Block Coordinate Descent method. It adopts an iterative procedure: at each step, we minimize over one variable while fixing the remaining variables.

At the first step, we fix F and take the derivative of J with respect to Q. The Hessian matrix with respect to Q is a diagonal matrix whose diagonal elements are $\sum_{i=1}^{n} a_{ij} + \alpha > 0$, and it is therefore positive definite, which means $\nabla J(Q, F^{t-1}) = 0$ gives the unique minimum of the objective function in terms of Q. We have

$$Q^t = (D_v + \alpha K_v)^{-1}(A^T F^{t-1} + \alpha K_v P). \quad (2)$$

At the second step, we fix Q and take the derivative of J with respect to F. The Hessian matrix is the sum of a diagonal matrix with entries $\sum_{j=1}^{v} a_{ij} > 0$ and $\gamma(I - \hat{L})$. The diagonal matrix is positive definite, and from [4] we know that $I - \hat{L}$ is positive semi-definite. Therefore, the Hessian matrix is positive definite, indicating that $\nabla J(Q^t, F) = 0$ gives the unique minimum of the objective function in terms of F. We have

$$F^t = (D_n + \beta H_n + \gamma(I - \hat{L}))^{-1}(AQ^t + \beta H_n Y). \quad (3)$$

By Proposition 2.7.1 in [2], the Block Coordinate Descent method converges to a stationary point.
Theorem 1. The objective function in Eq. (1) is convex.

Proof. The objective function (1) can be divided into two parts. The first part is

$$
\sum_{j=1}^{v}\sum_{z=1}^{c}\sum_{i=1}^{n} a_{ij}(q_{jz} - f_{iz})^2 + \alpha \sum_{j=1}^{v}\sum_{z=1}^{c} (q_{jz} - p_{jz})^2
= \sum_{j=1}^{v}\sum_{z=1}^{c}\sum_{i=1}^{n} a_{ij}(q_{jz} - f_{iz})^2 + \alpha \sum_{j=1}^{v}\sum_{z=1}^{c} q_{jz}^2 + \alpha \sum_{j=1}^{v}\sum_{z=1}^{c} (p_{jz}^2 - 2 p_{jz} q_{jz}). \quad (4)
$$

Suppose θ is a vector containing all the variables of Eq. (4), i.e., $\theta = (q_{11}, ..., q_{vc}, f_{11}, ..., f_{nc})$. Consider the standard quadratic form of Eq. (4):

$$\mathrm{Eq.}(4) = \theta^T W \theta + b^T \theta + c, \quad (5)$$

where W, b and c are the coefficient matrix, vector and scalar, respectively. From Eq. (4), we have

$$\theta^T W \theta = \sum_{j=1}^{v}\sum_{z=1}^{c}\sum_{i=1}^{n} a_{ij}(q_{jz} - f_{iz})^2 + \alpha \sum_{j=1}^{v}\sum_{z=1}^{c} q_{jz}^2. \quad (6)$$

Note that $a_{ij}$ and α are non-negative for any i and j. Furthermore, each gene is either informative or uninformative, indicating that there is at least one non-zero entry in each vector. Therefore, $\theta^T W \theta > 0$ if $\theta \neq 0$. The matrix W is strictly positive definite, and so the first part of the objective function, Eq. (4), is strictly convex.

The remaining part of the objective function in Eq. (1) is

$$\beta \sum_{i=1}^{n}\sum_{z=1}^{c} h_i (f_{iz} - y_{iz})^2 + \gamma \sum_{i,j=1}^{n} W_{ij} \|f_i - f_j\|^2. \quad (7)$$

Eq. (7) can be rewritten as

$$\gamma \theta^T L_{norm} \theta + b^T \theta + c, \quad (8)$$

where θ is the vector containing all the variables and $L_{norm}$ is the normalized Laplacian. By [4], $L_{norm}$ is positive semi-definite. Therefore, Eq. (7) is convex.

The sum of two convex functions is also convex. Therefore, the objective function in Eq. (1) is convex.

From Lemma 1 and Theorem 1, and the fact that for a convex problem any local minimum is also a global minimum [2], the solution found by Algorithm 1 converges to the global minimum.
Time Complexity: For the first step of the iterative algorithm, the time complexity is O(v^2 + vcn^2 + v^2 c). Since v = 2m and c = 2 are usually much smaller than n, the time complexity of the first step is O(n^2). The time complexity of the second step is O(n^2 + n^3), i.e., O(n^3). Note that most of the time is spent on matrix multiplication; by the Coppersmith-Winograd algorithm [3], this can be reduced to O(n^{2.3727}). If the number of iterations is t, the time complexity of the whole algorithm is O(tn^3). In our experiments, we observe that t is usually between 3 and 20.
4. EXPERIMENTAL EVALUATION
In this section, we experimentally evaluate the proposed
method. First, we discuss the evaluation metric and the baseline methods. Then we show the benefit of including heterogeneous sources and compare the proposed method with the baseline methods.
4.1 EVALUATION METRIC
Since we already know that some genes are informative, we evaluate the results in terms of those ground-truth genes. Specifically, we evaluate a result by the average rank of the ground-truth genes and by the number of ground-truth genes that appear in the result.
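The following is a minimal sketch of this metric; a ground-truth gene absent from the top-k result is assigned rank k + 1 (26 for k = 25), following the convention stated in Section 4.3.

```python
def evaluate(result_list, ground_truth, k=25):
    ranks = []
    appear = 0
    for gene in ground_truth:
        if gene in result_list[:k]:
            appear += 1
            ranks.append(result_list.index(gene) + 1)   # 1-based rank in the result
        else:
            ranks.append(k + 1)                         # penalty rank for absent genes
    return appear, sum(ranks) / len(ranks)              # (Appear #, Avg. Rank)

# The ground-truth set used in this paper (Section 2.1):
ground_truth = ["HPN", "AMACR", "FASN", "GUCY1A3", "ANK3",
                "STRA13", "CCT2", "CANX", "TRAP1"]
```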
4.2 BASELINE METHODS
Borda Count [20]: Given k full lists $\tau_1, \tau_2, ..., \tau_k$, Borda's method can be thought of as assigning a k-element position vector to each candidate (the positions of the candidate in the k lists) and sorting the candidates by the L1 norm of these vectors. There are also other variants of Borda count: sorting by the Lp norm for p > 1, sorting by the median of the k values, sorting by the geometric mean of the k values, etc. In this paper, we implemented four variants of Borda count: Borda1 (L1 norm), Borda2 (median), Borda3 (geometric mean) and Borda4 (L2 norm).
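As an illustration of these variants, the sketch below scores each gene by the L1 norm (Borda1), median (Borda2), geometric mean (Borda3) or L2 norm (Borda4) of its position vector. Since the inputs here are partial top-k lists, a default position is assigned to absent genes, which is an assumption not specified above.

```python
import numpy as np

def borda(lists, variant="L1", missing_rank=26):
    genes = sorted(set(g for lst in lists for g in lst))
    # k-element position vector for each gene (1-based positions).
    positions = {g: [lst.index(g) + 1 if g in lst else missing_rank for lst in lists]
                 for g in genes}
    score = {
        "L1": lambda p: np.sum(p),
        "median": lambda p: np.median(p),
        "geometric": lambda p: np.exp(np.mean(np.log(p))),
        "L2": lambda p: np.linalg.norm(p),
    }[variant]
    # Smaller aggregated position means a better (higher) final rank.
    return sorted(genes, key=lambda g: score(np.array(positions[g], dtype=float)))
```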
Markov Chain [1] [14]: Markov chains can also be used to generate consensus results. The states of the chain correspond to the n genes to be ranked, the transition probabilities depend in some particular way on the given lists, and the stationary probability distribution is used to sort the n candidates. We implemented three different Markov Chains (a sketch of the MC_1 construction is given after the list below); the details are as follows:
• MC_1: If the current state is gene g, then the next state is chosen uniformly from the multiset of all genes that were ranked higher than or equal to g, i.e., from the multiset $\bigcup_i \{q \mid \tau_i(q) \leq \tau_i(g)\}$. The transition probability is defined as 1/|G|, where |G| is the size of the gene pool. MC_1 is a slight variation of the chain in [14].
• MC_2: If the current state is gene g, then the next state is chosen uniformly from the multiset of all genes that were ranked higher than g in a majority of the input lists. The transition probability is defined as 1/|G|. MC_2 can be seen as an extension of the chain in [1].
• MC_3: If the current state is gene g, the probability of moving to state f is defined as $\sum_{i=1}^{m} I(\tau_i(f) < \tau_i(g)) / (m \times |G|)$, where $\tau_i(f) < \tau_i(g)$ denotes that gene f is ranked higher than gene g in experiment i, and I is the indicator function counting the number of times $\tau_i(f) < \tau_i(g)$. MC_3 can be seen as a modification of the MCT algorithm in [1].
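The sketch below illustrates the MC_1 construction: transition counts are accumulated from genes ranked at least as high in each list, rows are normalized into transition probabilities, and the stationary distribution is approximated by power iteration. The row normalization and the power-iteration solver are implementation choices assumed here rather than details fixed in the text.

```python
import numpy as np

def mc1_ranking(lists, iters=1000):
    genes = sorted(set(g for lst in lists for g in lst))
    idx = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    T = np.zeros((n, n))
    for lst in lists:
        rank = {g: r for r, g in enumerate(lst)}
        for g in lst:
            for q in lst:
                if rank[q] <= rank[g]:           # q ranked higher than or equal to g
                    T[idx[g], idx[q]] += 1.0
    T /= T.sum(axis=1, keepdims=True)            # row-normalize into transition probabilities
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):                       # power iteration for the stationary distribution
        pi = pi @ T
    return sorted(genes, key=lambda g: -pi[idx[g]])   # higher stationary probability = better rank
```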
Cross Entropy Monte Carlo (CEMC) [13]: CEMC is representative of the typical ways of solving the rank aggregation problem. Given k input lists $\tau_1, \tau_2, ..., \tau_k$, the optimal result $\tau^*$ satisfies $\tau^* = \arg\min \sum_{i=1}^{k} w_i d(\tau_i, \tau^*)$, where $w_i$ is the weight of each input list and d is the distance between two lists. Lin et al. [13] proposed to employ the cross entropy Monte Carlo method to solve this optimization problem. We acquired the source code from the authors and experimented with four variants: CEMC_1: Spearman distance with the unweighted case (i.e., the weights w_i are set to be even); CEMC_2: Spearman distance with the weighted case; CEMC_3: Kendall-tau distance with the unweighted case; CEMC_4: Kendall-tau distance with the weighted case.
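The sketch below shows how a candidate aggregate list is scored under this weighted-distance objective, using the Spearman footrule as d(., .); the cross entropy Monte Carlo search over candidate lists itself is not reproduced here, and the handling of genes missing from a list is an assumption.

```python
def spearman_footrule(list_a, list_b, missing_rank=26):
    # L1 distance between the positions of each gene in the two lists.
    genes = set(list_a) | set(list_b)
    pos_a = {g: list_a.index(g) + 1 if g in list_a else missing_rank for g in genes}
    pos_b = {g: list_b.index(g) + 1 if g in list_b else missing_rank for g in genes}
    return sum(abs(pos_a[g] - pos_b[g]) for g in genes)

def aggregation_objective(candidate, input_lists, weights):
    # sum_i w_i * d(tau_i, tau*): the quantity minimized over candidate lists tau*.
    return sum(w * spearman_footrule(lst, candidate)
               for w, lst in zip(weights, input_lists))
```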
4.3 RESULTS AND DISCUSSIONS
Using Only Ranked Lists: First, we show the experiments using the ranked-list information alone. The results are shown in Table 3. In this experiment, we set α = 2, β = 0 and γ = 0.
Table 3: Results of the First Experiment

Method                        Appear #  Avg. Rank
Luo                           3         19.4
Welsh                         6         13.4
Dhana                         5         14.1
True                          3         19.8
Singh                         4         18.4
Borda1                        5         15.2
Borda2                        4         17.4
Borda3                        7         15.8
Borda4                        5         15.2
MC1                           5         16.1
MC2                           6         15.5
MC3                           6         14.7
CEMC1                         6         14.1
CEMC2                         6         14.4
CEMC3                         5         14.8
CEMC4                         6         14.6
Ranked Lists Only             7         13.2
Ranked Lists and Microarray   7         13.0
Table 3 shows the results evaluated by the proposed metric. If a ground-truth gene does not appear in the result, we assign it rank k + 1 (26 in this case). We also list the evaluation results for the five prostate cancer studies. As we can see from the results, the proposed method not only contains the most ground-truth genes (tied with Borda3) but also achieves the best average rank. The Borda count methods are better than the Luo, True and Singh results but worse than those of Welsh and Dhana, as are the Markov Chain methods. CEMC achieves a good average rank compared to the Borda and Markov Chain methods and is only inferior to Welsh's result. An interesting finding is that, for the CEMC method, the Spearman distance with the unweighted case achieves the best performance of all CEMC variants, which is quite contrary to the intuition that the weighted case is normally better than the unweighted case. The first set of experiments shows that, using the ranked lists alone, we can achieve performance no worse than the best performance among the baseline methods. The detailed output of the first experiment is shown in the column First Exp. in Table 5.
Using Ranked Lists and Microarray Expression Data: In the second experiment, we take the microarray data into consideration and set α = 2, β = 0 and γ = 2. The results are shown in the last row of Table 3. As seen from the results, adding the microarray data boosts the performance of inferring informative genes. The detailed output of the second experiment is shown in the column Second Exp. in Table 5.
Using All Sources of Information: In the third experiment, we take all three sources of information into consideration, i.e., we set α = 2, β = 8 and γ = 2. The results are shown in Table 4.
Table 4: Results of the Third Experiment

Gene      Appear #  Avg. Rank
HPN       7         13.0
AMACR     8         12.5
FASN      7         13.0
GUCY1A3   7         11.5
ANK3      7         11.0
STRA13    8         2.0
CCT2      6         13.1
CANX      7         13.2
TRAP1     7         11.3
As seen from the results, adding only one piece of side information significantly improves the quality of the final result. When gene AMACR is assigned to be informative, the proposed method includes 8 informative genes, beating all the previous results. Moreover, gene AMACR is a high-ranked gene in the input lists, and therefore improving the result in this way is quite encouraging. We also notice that adding HPN as side information does not increase the performance at all, mainly because HPN is already predicted to be top-ranked in most input lists; adding such information does not help propagate information in the proposed method. Adding CCT2 to the proposed method decreases the performance in terms of both the number of ground-truth genes that appear and the average rank. This is because gene CCT2 only appears in the True and Singh lists; adding this information raises the probabilities associated with these two lists and correspondingly decreases those of the other three lists, so gene GUCY1A3 from the Dhana experiment is eliminated from the final result. The third experiment shows that by adding informative-gene information, we can significantly increase the performance of the proposed method. The detailed output of the third experiment is shown in the column Third Exp. in Table 5.
As seen from the results in Table 5, Borda successfully includes 7 ground-truth genes, but compared with our results, CANX, STRA13 and GUCY1A3 are ranked relatively low, making its result inferior to ours.
Table 5: Detailed Experimental Outputs

Rank  Borda     MC        CEMC      First Exp.  Second Exp.  Third Exp.
1     HPN       HPN       HPN       HPN         HPN          STRA13
2     AMACR     AMACR     AMACR     GDF15       GDF15        HPN
3     FASN      GDF15     FASN      AMACR       AMACR        GDF15
4     GDF15     NME1      GDF15     NME1        FASN         AMACR
5     NME2      FASN      NME2      FASN        NME1         NME1
6     SLC25A6   EEF2      UAP1      KRT18       KRT18        FASN
7     EEF2      KRT18     OACT2     EEF2        EEF2         KRT18
8     OACT2     UAP1      SLC25A6   ALCAM       ALCAM        EEF2
9     OGT       NME2      KRT18     SND1        SND1         OACT2
10    KRT18     SLC25A6   EEF2      OACT2       OACT2        ALCAM
11    NEM1      OACT2     STRA13    STRA13      STRA13       SND1
12    UAP1      STRA13    NME1      CANX        CANX         CANX
13    CCND2     CANX      CANX      GRP58       GRP58        GRP58
14    CYP1B1    GRP58     ALCAM     TMEM4       TMEM4        TMEM4
15    CBX3      SND1      GRP58     CCT2        CCT2         CCT2
16    SAT       MTHFD2    SND1      NME2        NME2         NME2
17    CANX      ALCAM     FMO5      SLC25A6     SLC25A6      SLC25A6
18    BRCA1     MRPL3     TMEM4     CALR        CALR         CALR
19    GRP58     TMEM4     CCT2      EIF4A1      EIF4A1       EIF4A1
20    MTHFD2    PPIB      PRKACA    GUCY1A3     GUCY1A3      GUCY1A3
21    STRA13    SLC19A1   MTHFD2    LMAN1       LMAN1        LMAN1
22    LGALS3    CCT2      PTPLB     OGT         OGT          OGT
23    ANK3      FMO5      PPIB      STAT6       STAT6        ANK3
24    GUCY1A3   CYP1B1    MRPL3     TCEB3       TCEB3        TCEB3
25    LDHA      ATF5      SLC19A1   LDHA        LDHA         STAT6
In the Markov Chain method, the rankings of CANX and STRA13 are quite close to those of the proposed method, yet it fails to include GUCY1A3, which makes its average rank worse than ours. CEMC is very good at predicting the rankings of the genes ranked at the top of the input lists: HPN, AMACR and FASN, which are ranked high in most experiments. Its rankings of STRA13 and CANX are also close to those of the proposed method. However, it too fails to include GUCY1A3.
In conclusion, we find that adding multiple heterogeneous sources of information improves the performance of finding informative genes. We also notice that the information from [23–28] gives the greatest boost to the performance, while the microarray expression data gives the least boost, which corresponds to the reliability of each source: the information from [23–28] is the most reliable, while the microarray expression data is the least reliable. Furthermore, we notice that although the microarray expression data contain many errors and much noise, they can still improve the performance of finding informative genes for prostate cancer.
Parameter Sensitivity: There are three parameters in the proposed method: α, β and γ. We conducted the sensitivity experiments shown in Figure 3. α represents the confidence of our belief in the initial classes of the group nodes. The classes of the groups are obtained from different experiments and hence may not be completely correct; therefore, a smaller α usually yields better performance. β reflects our confidence in the informative genes from other studies. Those studies are deemed reliable, and thus a larger β is usually better. γ denotes the confidence in the microarray data. Those data may contain noise, so a lower value usually yields better results. The results in Figure 3 confirm these observations.
Figure 3: Parameter Sensitivity Experiment (curves for α, β and γ; x-axis: parameter value)
5. RELATED WORK
Much of the previous work on this topic uses the ranked lists alone and treats the problem as a variant of the rank aggregation problem [1] [13]. The earliest approach to rank aggregation, or voting aggregation, was introduced by Jean-Charles de Borda [20] in 1770. Given k full lists $\tau_1, \tau_2, ..., \tau_k$, Borda's method can be thought of as assigning a k-element position vector to each candidate (the positions of the candidate in the k lists) and sorting the candidates by the L1 norm of these vectors. The intuition behind Borda count is "more wins is better". Its major limitation is its extensibility to partial lists; it has been shown that for partial lists, Borda count sometimes yields undesirable outcomes.
The rank aggregation problem also arises in the information retrieval and web search areas, where it is generally approached in two ways: unsupervised and supervised. Unsupervised approaches usually express the problem as distance minimization: given k input lists $\tau_1, \tau_2, ..., \tau_k$, the optimal result $\tau^*$ satisfies $\tau^* = \arg\min \sum_{i=1}^{k} w_i d(\tau_i, \tau^*)$, where $w_i$ is the weight of each input list and d is the distance between two lists. Choices of d include the Spearman distance [21] and the Kendall-tau distance [22]. This optimization problem is shown to be NP-hard [22], and the methods differ in how they solve it. The most recent work [13] that studies the same problem employed the cross entropy Monte Carlo method to solve the optimization problem. The major limitation of these works is that their methods cannot incorporate information from other sources, while the method proposed in this paper can smoothly utilize heterogeneous sources of information.
In addition, the problem in this paper is related to clustering ensembles or consensus clustering [18]: given a number of clustering results, find a consensus result that is better than any single input clustering. Our work is related to [17] in that we use a bipartite graph to represent the problem. [15] proposed a bipartite graph method to solve the consensus maximization problem, in which, given a set of class labels and clustering results, one finds consensus labels that achieve maximal consistency among them. Our work is similar to Gao et al.'s work [15] in that both enforce information propagation between neighboring nodes in the bipartite graph.
6. CONCLUSIONS
In this work, we targeted the problem of finding informative genes for prostate cancer studies. Different from previous work that uses only one source of information, we proposed a general framework that can utilize heterogeneous sources of information: ranked lists, microarray expression data and confirmed informative genes. We formulated the problem as an optimization problem and proposed an iterative algorithm to solve it. We showed that the formulated objective function is convex and that the iterative algorithm converges to the global minimum. Extensive experiments showed that including heterogeneous sources of information improves the performance and that the proposed method outperforms many baseline methods.
7. REFERENCES
[1] DeConde, R.P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., and Etzioni, R. Combining results of microarray experiments: a rank aggregation approach. Statistical Applications in Genetics and Molecular Biology, 2001.
[2] Bertsekas, D.P. Nonlinear Programming, 2nd Edition. Athena Scientific, 1999.
[3] Coppersmith, D. and Winograd, S. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 1990.
[4] von Luxburg, U. A tutorial on spectral clustering. Technical Report, 2006.
[5] Luo, J., Duggan, D.J., Chen, Y., Sauvageot, J., Ewing, C.M., Bittner, M.L., Trent, J.M., and Isaacs, W.B. Human prostate cancer and benign prostatic hyperplasia: molecular dissection by gene expression profiling. Cancer Research, 2001.
[6] Welsh, J.B., Sapinoso, L.M., Su, A.I., Kern, S.G., Wang-Rodriguez, J., Moskaluk, C.A., Frierson, H.F.Jr., and Hampton, G.M. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Research, 2001.
[7] Dhanasekaran, S.M., Barrette, T.R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K.J., Rubin, M.S., and Chinnaiyan, A.M. Delineation of prognostic biomarkers in prostate cancer. Nature, 2001.
[8] True, L., Coleman, I., Hawley, S., Huang, A., Gikord, D., Coleman, R., Beer, T., Gelman, E., Datta, M., Mostaghel, E., Knudsen, B., Lange, P., Vessella, R., Lin, D., Hood, L., and Nelson, P. A molecular correlate to the Gleason grading system for prostate adenocarcinoma. Proceedings of the National Academy of Sciences, 2006.
[9] Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D'Amico, A.V., Richie, J.P., Lander, E.S., Loda, M., Kantoff, P.W., Golub, T.R., and Sellers, W.R. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002.
[10] Rhodes, D.R., Barrette, T.R., Rubin, M.A., Ghosh, D., and Chinnaiyan, A.M. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Research, 2002.
[11] Rhodes, D.R., Yu, J., Shanker, K., Deshpande, N., Varambally, R., Ghosh, D., Barrette, T., Pandey, A., and Chinnaiyan, A.M. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA, 2004.
[12] Ge, L., Du, N., and Zhang, A. Finding informative genes from multiple microarray experiments: a graph-based consensus maximization model. In Proc. of BIBM, 2011.
[13] Lin, S. and Ding, J. Integration of ranked lists via Cross Entropy Monte Carlo with applications to mRNA and microRNA studies. Technical Report, 2008.
[14] Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. Rank aggregation methods. In Proc. of WWW, 2001.
[15] Gao, J., Liang, F., Fan, W., Sun, Y., and Han, J. Graph-based consensus maximization among multiple supervised and unsupervised models. In Proc. of NIPS, 2009.
[16] Gao, J., Liang, F., Fan, W., Sun, Y., and Han, J. A graph-based consensus maximization approach for combining multiple supervised and unsupervised models. TKDE, 2011.
[17] Fern, X.Z. and Brodley, C.E. Solving cluster ensemble problems by bipartite graph partitioning. In Proc. of ICML, 2004.
[18] Strehl, A. and Ghosh, J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2003.
[19] Zhou, D., Bousquet, O., Lal, T., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Proc. of NIPS, 2003.
[20] Borda, J.C. Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, 1781.
[21] Diaconis, P. and Graham, R. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society Series, 1977.
[22] Bartholdi, J.J., Tovey, C.A., and Trick, M.A. Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 1989.
[23] Klezovitch, O., Chevillet, J., Mirosevich, J., Roberts, R., Matusik, R., and Vasioukhin, V. Hepsin promotes prostate cancer and metastasis. Cancer Cell, 2004.
[24] Kuefer, R., Varambally, S., Zhou, M., Lucas, P.C., Loeffler, M., Wolter, H., Mattfeldt, T., Hautmann, R.E., Gschwend, J.E., Barrette, T.R., Dunn, R.L., Chinnaiyan, A.M., and Rubin, M.A. alpha-Methylacyl-CoA racemase: expression levels of this novel cancer biomarker depend on tumor differentiation. 2002.
[25] Pizer, E.S., Pflug, B.R., Bova, G.S., Han, W.F., Udan, M.S., and Nelson, J.B. Increased fatty acid synthase as a therapeutic target in androgen-independent prostate cancer progression. Prostate, 2001.
[26] Dong, Y., Zhang, H., Gao, A.C., Marshall, J.R., and Ip, C. Androgen receptor signaling intensity is a key factor in determining the sensitivity of prostate cancer cells to selenium inhibition of growth and cancer-specific biomarkers. Mol. Cancer Ther., 2005.
[27] Ignatiuk, A., Quickfall, J.P., Hawrysh, A.D., Camberlain, M.D., and Anderson, D.H. The smaller isoforms of ankyrin 3 bind to the p85 subunit of phosphatidylinositol 3'-kinase and enhance platelet-derived growth factor receptor down-regulation. Journal of Biological Chemistry, 2006.
[28] Ivanova, A., Liao, S.Y., Lerman, M.I., Ivanov, S., and Stanbridge, F.J. STRA13 expression and subcellular localisation in normal and tumour tissues: implications for use as a diagnostic and differentiation marker. Journal of Medical Genetics, 2005.