Finding Informative Genes for Prostate Cancer: A General Framework of Integrating Heterogeneous Sources

Liang Ge, Jing Gao, Nan Du, and Aidong Zhang
Computer Science and Engineering Department, State University of New York at Buffalo, Buffalo, NY 14260, U.S.A.
{liangge, jing, nandu, azhang}@buffalo.edu

ABSTRACT

Finding informative genes for prostate cancer has always been an important topic in cancer study. With the widespread use of genomic analysis and microarray experiments, a large number of genes can be analyzed efficiently to find the informative ones based on high-throughput microarray experiments [5-9]. On the other hand, based on clinical studies, several genes have already been identified as important in prostate cancer development and progression [23-28]. These research results come from heterogeneous sources, in different formats, and express different perspectives on the problem of finding informative genes for prostate cancer. In this work, we aim to find the informative genes for prostate cancer by utilizing these heterogeneous sources of information. We propose a general framework that encodes various heterogeneous sources, including ranked lists of informative genes [5-9], microarray expression data [5-9] and important genes identified by [23-28]. The proposed framework estimates the conditional probability of a gene being informative and ranks the genes by this probability. The estimation of this probability is formulated as an optimization problem, for which we propose an efficient iterative algorithm. Furthermore, we show that the problem formulation is convex and that the iterative algorithm converges to the global optimum. Extensive experiments show that utilizing heterogeneous information is very helpful in finding informative genes and that the proposed method outperforms many baseline methods.

Categories and Subject Descriptors

H.4 [Algorithms]: Experimentation

General Terms

Algorithms

1. INTRODUCTION

Finding informative genes for prostate cancer has always been an important topic in cancer study. With the widespread use of genomic and microarray analysis, a large number of genes can be studied and analyzed efficiently to find the informative ones based on microarray expression data. Consequently, various prostate cancer microarray studies have been reported [5-9]. Since those experiments were carried out in different laboratories, with different materials and microarray platforms, there are certain discrepancies between their results. On the other hand, based on clinical studies, several genes have already been identified as important in prostate cancer development and progression, including hepsin (HPN), which promotes metastasis formation in an animal model of prostate cancer [23], alpha-Methylacyl-CoA racemase (AMACR), a clinically utilized marker of prostate cancer [24], and fatty acid synthase (FASN), an emerging therapeutic target [25].
Given these endeavors towards the same target, we believe that each endeavor can be viewed as a unique perspective on the problem of finding informative genes. Therefore, by properly integrating those perspectives, we are able to solve this problem more effectively. Inspired by this thought, we propose to explore the heterogeneous sources of information regarding the problem of finding informative genes for prostate cancer. To be specific, we consider the following sources of information: 1) the top-k informative genes produced by five prostate cancer studies [5-9]; 2) the microarray expression data associated with the studies [5-9]; 3) the informative genes reported by other research in [23-28]. Given this information, we aim to infer a list of informative genes for prostate cancer.

Integrating heterogeneous sources of prostate cancer studies is challenging. The first challenge is that information from different sources comes in different formats. The outputs of the studies [5-9] are ranked lists, in which the top-k informative genes are given in ranked order. The microarray expression data are detailed expression measurements of genes on tumor and normal samples, i.e., a full matrix whose rows are genes and whose columns are samples. The informative genes from [23-28] are a set of genes that are deemed important for prostate cancer. In addition, the microarray expression data from the individual studies are not directly comparable because those data were obtained in different laboratories and with different techniques. Therefore, how to design a framework that is able to encode these different forms of information is the first challenge to be handled.

The second challenge is that, for prostate cancer study, different sources bear different reliability, i.e., the confidence with which we trust each source differs. The informative genes identified by [23-28] are believed to be the most trustworthy, because those studies are based on extensive clinical work. On the other hand, it is well known that high-throughput experiments contain many errors and much noise; therefore, the microarray expression data bear less credibility than the information from the clinical studies in [23-28]. The ranked lists of informative genes reported by [5-9] are generated from microarray expression data and are therefore more reliable than the raw microarray expression data, yet less reliable than the studies in [23-28]. Therefore, the problem of finding informative genes is coupled with the problem of estimating the reliability of each information source.

In this work, we propose a general framework to integrate multiple heterogeneous sources to find the informative genes for prostate cancer. The framework smoothly encodes heterogeneous sources of information with different formats and estimates the conditional probability of each gene being informative. The genes are then ranked by this conditional probability. The estimation of the conditional probabilities is formulated as an optimization problem whose objective function integrates all sources of information into a semi-supervised learning problem. We present an efficient iterative algorithm to solve the objective function. The objective function is shown to be convex and the iterative algorithm is shown to converge to a stationary point.
We apply the proposed method using all sources of information, and the experimental results show that including multiple heterogeneous sources can greatly improve the performance of finding the informative genes for prostate cancer and that the proposed method outperforms many baseline methods. The major contributions of the paper are:

• To the best of our knowledge, it is the first work to find informative genes by integrating multiple heterogeneous sources of information.

• We propose a novel framework that can smoothly integrate heterogeneous sources of information, together with an efficient algorithm to solve the resulting convex objective function.

• The experimental evaluations show that the proposed method outperforms many other baseline methods.

The organization of the paper is as follows: in Section 2, we describe the setting of our problem and the data sets used in this work. Section 3 presents the proposed framework that smoothly integrates heterogeneous sources of information. An extensive experimental study is reported in Section 4. Section 5 discusses related work and we conclude in Section 6.

2. DATA SETS AND PROBLEM SETTING

In this section, we present the data sets used in this paper and the setting of our problem.

2.1 DATA SETS

In this work, we adopt the following three sources of information about the genes in prostate cancer studies.

• The top-k ranked lists of informative genes from five studies [5-9], as shown in Table 1.

• The microarray expression data associated with each study in [5-9].

• The important genes for prostate cancer identified by other studies: HPN [23], AMACR [24], FASN [25], GUCY1A3 [26], ANK3 [27], STRA13 [28], CCT2 [28], CANX [28] and TRAP1 [28]. The genes marked with an asterisk in Table 1 are those confirmed to be important by other research.

2.2 PROBLEM SETTING

Given a pool of n genes G = {g_1, ..., g_n}, m experiments are conducted to find the informative genes. Each experiment is performed on a subset S_i of the gene pool and produces a ranking of the participating genes, in which the top-k genes are deemed informative. Each subset S_i is associated with a microarray expression data matrix M_i. We also have a set I of informative genes confirmed by other studies. The target is to infer a ranked list of genes that are informative for prostate cancer. In this work, we solve this problem by first estimating the conditional probability of each gene being informative and then ranking the genes by these probabilities.

3. METHODOLOGY

In this section, we show how we represent the heterogeneous sources of information and propose a general framework that integrates all sources into an optimization problem. The conditional probability of each gene being informative is the solution to this optimization problem. We then present an efficient iterative algorithm to obtain the optimal value of the objective function.

3.1 Heterogeneous Information Representation

In this section, we show how to represent the three sources of information.

Ranked Lists: For the ranked lists produced by the various studies, we observe that the ranked list from each study can be viewed as a classifier over the gene pool G, i.e., each experiment classifies the gene pool G into two classes: informative and uninformative. The top-k ranked list gives the informative genes predicted by each experiment, and the remaining genes in G are predicted to be uninformative.
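To make this classifier view concrete, the following minimal Python sketch (with a hypothetical gene pool and hypothetical top-k lists, not the actual outputs of the five studies) shows how each study's top-k list induces a binary informative/uninformative prediction over the gene pool G:

# Hypothetical illustration of the "ranked list as classifier" view.
# The gene pool and the top-k lists below are placeholders, not the
# actual outputs of the five studies.
gene_pool = ["HPN", "AMACR", "FASN", "GENE_X", "GENE_Y", "GENE_Z"]

ranked_lists = [
    ["HPN", "AMACR", "FASN"],    # top-k list of hypothetical experiment 1
    ["AMACR", "GENE_X", "HPN"],  # top-k list of hypothetical experiment 2
]

# Each experiment classifies every gene in G as informative (if it appears
# in that experiment's top-k list) or uninformative (otherwise).
predictions = [
    {g: ("informative" if g in set(top_k) else "uninformative") for g in gene_pool}
    for top_k in ranked_lists
]

for i, pred in enumerate(predictions, start=1):
    print("experiment", i, pred)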
Given the observation that each experiment can be viewed as a classifier for the gene pool G, and inspired by [15], each experiment corresponds to two groups: informative and uninformative. For m experiments, we therefore have 2m groups. Each gene g_i in G must belong to exactly m groups, one group for each experiment. This leads to a natural bipartite graph representation of the ranked lists. Figure 1 illustrates this representation: the nodes on the left represent groups and the nodes on the right represent genes. t_1 and t_2 represent the informative and uninformative groups of experiment m_1, t_3 and t_4 represent the informative and uninformative groups of experiment m_2, and so on. For each gene g_i, a solid line indicates that the gene belongs to a certain group. The bipartite graph naturally formalizes the observation that the ranked lists can be viewed as classifiers over the gene pool G.

Figure 1: Bipartite Graph Representation.

We now introduce some notation. The affinity matrix A_{n×v} of the bipartite graph is defined by a_{ij} = 1 if gene g_i is assigned to group t_j and a_{ij} = 0 otherwise, where v = 2m is the number of groups. Since we want to estimate the conditional probability of g_i being informative, this conditional probability is denoted by F_{n×c}. The conditional probability of each group is also involved, denoted by Q_{v×c}. We have f_{iz} = Prob(g_i is class z | g_i) and q_{jz} = Prob(t_j is class z | t_j). Here c = 2, and class z denotes one of the two classes: informative and uninformative. We also define the initial class labels for the groups, P_{v×c}, by p_{jz} = 1 if group t_j's class is z and p_{jz} = 0 otherwise.

Table 1: Prostate Cancer Studies (an asterisk marks a gene confirmed to be important by other research)

Rank | Luo [5] | Welsh [6] | Dhana [7] | True [8] | Singh [9]
1 | HPN* | HPN* | OGT | AMACR* | HPN*
2 | AMACR* | AMACR* | AMACR* | HPN* | SLC25A6
3 | CYP1B1 | OACT2 | FASN* | NME2 | EEF2
4 | ATF5 | GDF15 | HPN* | CBX3 | SAT
5 | BRCA1 | FASN* | UAP1 | GDF15 | NME2
6 | LGALS3 | ANK3* | GUCY1A3* | MTHFD2 | LDHA
7 | MYC | KRT18 | OACT2 | MRPL3 | CANX*
8 | PCDHGC3 | UAP1 | SLC19A1 | SLC25A6 | NACA
9 | WT1 | GRP58 | KRT18 | NEM1 | FASN*
10 | TFF3 | PPIB | EEF2 | COX6C | SND1
11 | MARCKS | KRT7 | STRA13* | JTV1 | KRT18
12 | OS-9 | NME1 | ALCAM | CCNG2 | RPL15
13 | CCND2 | STRA13* | GDF15 | AP3S1 | TNFSF10
14 | NME1 | DAPK1 | NME1 | EEF2 | SERP1
15 | DYRK1A | TMEM4 | CALR | RAN | GRP58
16 | TRAP1* | CANX* | SND1 | PRKACA | ALCAM
17 | FMO5 | TRA1 | STAT6 | RAD23B | GDF15
18 | ZHX2 | PRSS8 | TCEB3 | PSAP | TMEM4
19 | RPL36AL | ENTPD6 | EIF4A1 | CCT2* | CCT2*
20 | ITPR3 | PPP1CA | LMAN1 | G3BP | SLC39A6
21 | GCSH | ACADSB | MAOA | EPRS | RPL5
22 | DDB2 | PTPLB | ATP6V0B | CKAP1 | RPS13
23 | TFCP2 | TMEM23 | PPIB | LIG3 | MTHFD2
24 | TRAM1 | MRPL3 | FMO5 | SNX4 | G3BP2
25 | YTHDF3 | SLC19A1 | SLC7A5 | NSMAF | UAP1

Microarray Expression Data: For each experiment, we can obtain the microarray expression data. The expression data show the behavior of genes in tumor and normal samples and are used to estimate the similarity between genes, under the assumption that informative genes should have similar microarray expression. For each study, we can obtain the expression data for the top-k informative genes reported by that study. Since different experiments are conducted on different platforms and with different techniques, direct comparisons of expression data across experiments are infeasible. Therefore, we cannot compute every element of the similarity matrix W for the gene pool G. Rather, the microarray expression data from each experiment can only be used to compute a submatrix of the whole similarity matrix W.

Figure 2: Similarity Matrix built from Microarray Expression Data
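Continuing the hypothetical example above, the sketch below builds the affinity matrix A_{n×v} and the initial group-label matrix P_{v×c} from such top-k lists, following the definitions in this section (v = 2m groups, c = 2 classes). The variable names are ours and the lists are placeholders, not the authors' data or code:

import numpy as np

# Hypothetical gene pool and top-k lists (placeholders).
gene_pool = ["HPN", "AMACR", "FASN", "GENE_X", "GENE_Y", "GENE_Z"]
ranked_lists = [
    ["HPN", "AMACR", "FASN"],
    ["AMACR", "GENE_X", "HPN"],
]

n, m, c = len(gene_pool), len(ranked_lists), 2
v = 2 * m                                   # two groups per experiment
gene_index = {g: i for i, g in enumerate(gene_pool)}

A = np.zeros((n, v))   # a_ij = 1 if gene g_i is assigned to group t_j
P = np.zeros((v, c))   # p_jz = 1 if group t_j's initial class is z
                       # (column 0 = informative, column 1 = uninformative)

for e, top_k in enumerate(ranked_lists):
    informative_group, uninformative_group = 2 * e, 2 * e + 1
    P[informative_group, 0] = 1.0
    P[uninformative_group, 1] = 1.0
    top_k_set = set(top_k)
    for g, i in gene_index.items():
        A[i, informative_group if g in top_k_set else uninformative_group] = 1.0

# Every gene belongs to exactly one group per experiment, i.e. m groups in total.
assert np.all(A.sum(axis=1) == m)

A partial similarity matrix W can be filled in analogously, one study at a time, from that study's expression submatrix, with entries for genes shared by several studies averaged as described below.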
Figure 2 illustrates the similarity matrix W built from microarray expression data. Note that even with five sets of microarray expression data, we still cannot compute every element of W. W_{ij} denotes the similarity between genes g_i and g_j. For genes that overlap across experiments, we take the average of the similarities computed from each study.

Informative Genes from Other Research: The set of informative genes I can be encoded as prior knowledge of whether a gene is informative or not. Let the n × c matrix Y denote this prior knowledge, where y_{iz} = 1 if gene g_i is given class z and y_{iz} = 0 otherwise; class z can be informative or uninformative. Table 2 summarizes the notation defined in this work.

Table 2: Notation

Symbol | Definition
A_{n×v} | a_{ij}: indicator of gene g_i belonging to group t_j
F_{n×c} | f_{iz}: probability of gene g_i w.r.t. class z
Q_{v×c} | q_{jz}: probability of group t_j w.r.t. class z
P_{v×c} | p_{jz}: indicator of group t_j predicted as class z
Y_{n×c} | y_{iz}: indicator of gene g_i predicted as class z
W_{n×n} | w_{ik}: similarity between genes g_i and g_k
1, ..., c | class indexes
1, ..., n | gene indexes
1, ..., m | experiment indexes
1, ..., v | group indexes

3.2 THE OBJECTIVE FUNCTION

Given the representations of the heterogeneous sources, we solve the problem by estimating F_{n×c}. We formulate this as the following objective function:

min J(F, Q) = ∑_{j=1}^{v} ∑_{z=1}^{c} ∑_{i=1}^{n} a_{ij}(q_{jz} − f_{iz})² + α ∑_{j=1}^{v} ∑_{z=1}^{c} (q_{jz} − p_{jz})² + β ∑_{i=1}^{n} ∑_{z=1}^{c} h_i(f_{iz} − y_{iz})² + γ ∑_{i,j=1}^{n} W_{ij} ||f_i − f_j||²   (1)

s.t. ∑_{z=1}^{c} f_{iz} = 1, ∑_{z=1}^{c} q_{jz} = 1, f_{iz} ∈ [0, 1], q_{jz} ∈ [0, 1],

where h_i = ∑_{z=1}^{c} y_{iz}.

Interpretation: The first and second terms of objective function (1), i.e., ∑_{j=1}^{v} ∑_{z=1}^{c} ∑_{i=1}^{n} a_{ij}(q_{jz} − f_{iz})² + α ∑_{j=1}^{v} ∑_{z=1}^{c} (q_{jz} − p_{jz})², correspond to the intuition that a group t_j corresponds to class z if the majority of genes in this group belong to class z, and meanwhile a gene corresponds to class z if the majority of the groups it belongs to correspond to class z. The parameter α denotes the confidence in the initial class label of each group, i.e., the confidence in the information from this source.

The third term in Eq. (1), i.e., β ∑_{i=1}^{n} ∑_{z=1}^{c} h_i(f_{iz} − y_{iz})², enforces that the estimates should not deviate from the confirmed informative genes. β denotes the confidence in this source of information.

The fourth term in Eq. (1), i.e., γ ∑_{i,j=1}^{n} W_{ij} ||f_i − f_j||², corresponds to the smoothness assumption that informative genes should be similar in terms of expression data. γ denotes the confidence in this source of information. Note that the third and fourth terms in Eq. (1) correspond to the Label Propagation Model [19] widely used in semi-supervised learning, which propagates label information to unlabeled nodes under the smoothness assumption. The first and second terms in Eq. (1) correspond to the Graph Consensus Model proposed by Gao et al. in [15]. Therefore, the objective function in Eq. (1) naturally connects the Label Propagation Model and the Graph Consensus Model and lets the information from each source influence the others.

3.3 THE ITERATIVE ALGORITHM

We use the Block Coordinate Descent method to solve objective function (1). The procedure is shown in Algorithm 1.

Algorithm 1 The Iterative Algorithm
Input: A_{n×v}, P_{v×c}, Y_{n×c}, W_{n×n}; parameters α, β, γ, ε
Output: Consensus matrix F
1: Initialize F^0, F^1 randomly
2: t = 1
3: while ||F^t − F^{t−1}|| > ε do
4:     Q^t = (D_v + αK_v)^{−1}(A^T F^{t−1} + αK_v P)
5:     F^t = (D_n + βH_n + γ(I − L̂))^{−1}(A Q^t + βH_n Y)
6:     t = t + 1
7: end while
8: Output F^t

Here D_v = diag{(∑_{i=1}^{n} a_{ij})}_{v×v}, D_n = diag{(∑_{j=1}^{v} a_{ij})}_{n×n}, K_v = diag{(∑_{z=1}^{c} p_{jz})}_{v×v} and H_n = diag{(h_i)}_{n×n}, where diag{·} denotes a diagonal matrix with the given diagonal elements. D_v and D_n are normalization factors and K_v acts as the constraint for the group nodes. L̂ = D_w^{−1/2} W D_w^{−1/2} is the symmetrically normalized similarity matrix derived from W [19] (so that I − L̂ is the normalized graph Laplacian), where D_w is a diagonal matrix whose (i, i) element equals the sum of the i-th row of W. After obtaining F, we retrieve the conditional probability of each gene being informative and then rank the genes.

The iterative algorithm can be viewed as information propagation. In each step, the algorithm performs a label propagation in which label information is propagated to the gene nodes under the smoothness assumption. After each gene node obtains its estimated label, the gene nodes propagate the information to the group nodes. The group nodes, after receiving the information, adjust Q and pass the information back to the gene nodes. In this way, the information from the heterogeneous sources influences each other. The process stops when it converges.

3.4 ANALYSIS OF THE ITERATIVE ALGORITHM

In this section, we analyze the proposed algorithm, showing that the iterative algorithm converges to a stationary point and that the objective function is convex. We also present a time complexity analysis of the iterative algorithm.

Lemma 1. Algorithm 1 converges to a stationary point.

Proof. We obtain the solution to objective function (1) using the Block Coordinate Descent method, which adopts an iterative procedure: at each step, we minimize over one variable while fixing the remaining variables.

At the first step, we fix F and take the derivative of J with respect to Q. The Hessian matrix with respect to Q is a diagonal matrix whose diagonal elements are ∑_{i=1}^{n} a_{ij} + α > 0; it is therefore positive definite, which means ∇J(Q, F^{t−1}) = 0 gives the unique minimum of the objective function in terms of Q. We have

Q^t = (D_v + αK_v)^{−1}(A^T F^{t−1} + αK_v P).   (2)

At the second step, we fix Q and take the derivative of J with respect to F. The Hessian matrix is the sum of a diagonal matrix with entries ∑_{j=1}^{v} a_{ij} + βh_i > 0 and γ(I − L̂). The diagonal matrix is positive definite and, from [4], I − L̂ is positive semi-definite. Therefore, the Hessian matrix is positive definite, indicating that ∇J(Q^t, F) = 0 gives the unique minimum of the objective function in terms of F. We have

F^t = (D_n + βH_n + γ(I − L̂))^{−1}(A Q^t + βH_n Y).   (3)

By Proposition 2.7.1 in [2], the Block Coordinate Descent method converges to a stationary point.

Theorem 1. The objective function in Eq. (1) is convex.

Proof. The objective function (1) can be divided into two parts. The first part is

∑_{j=1}^{v} ∑_{z=1}^{c} ∑_{i=1}^{n} a_{ij}(q_{jz} − f_{iz})² + α ∑_{j=1}^{v} ∑_{z=1}^{c} (q_{jz} − p_{jz})² = ∑_{j=1}^{v} ∑_{z=1}^{c} ∑_{i=1}^{n} a_{ij}(q_{jz} − f_{iz})² + α ∑_{j=1}^{v} ∑_{z=1}^{c} q_{jz}² + α ∑_{j=1}^{v} ∑_{z=1}^{c} (p_{jz}² − 2 p_{jz} q_{jz}).   (4)

Suppose θ is a vector containing all the variables of Eq. (4), i.e., θ = (q_{11}, ..., q_{vc}, f_{11}, ..., f_{nc}). Consider the standard quadratic form of Eq. (4):

Eq. (4) = θ^T W θ + b^T θ + c,   (5)

where W, b and c are the coefficient matrix, vector and scalar, respectively. From Eq. (4), we have

θ^T W θ = ∑_{j=1}^{v} ∑_{z=1}^{c} ∑_{i=1}^{n} a_{ij}(q_{jz} − f_{iz})² + α ∑_{j=1}^{v} ∑_{z=1}^{c} q_{jz}².   (6)

Note that a_{ij} and α are non-negative for any i and j.
Furthermore, each gene is classified as either informative or uninformative by every experiment, so each row of A has at least one non-zero entry. Therefore, θ^T W θ > 0 if θ ≠ 0. The matrix W is strictly positive definite, and so the objective function in Eq. (4) is strictly convex. The remaining part of the objective function in Eq. (1) is

β ∑_{i=1}^{n} ∑_{z=1}^{c} h_i(f_{iz} − y_{iz})² + γ ∑_{i,j=1}^{n} W_{ij} ||f_i − f_j||².   (7)

Eq. (7) can be rewritten as

γ θ^T L_norm θ + b^T θ + c,   (8)

where θ is the vector containing all the variables and L_norm is the normalized Laplacian. By [4], L_norm is positive semi-definite; therefore, Eq. (7) is convex. The sum of two convex functions is also convex. Therefore, the objective function in Eq. (1) is convex.

From Lemma 1 and Theorem 1, and since for a convex problem any local minimum is also a global minimum [2], the solution found by Algorithm 1 converges to the global minimum.

Time Complexity: In the first step of the iterative algorithm, the time complexity is O(v² + vcn² + v²c). Since v = 2m and c = 2 are usually much smaller than n, the time complexity of the first step is O(n²). The time complexity of the second step is O(n² + n³), i.e., O(n³). Note that most of the time is spent on matrix multiplication; by the Coppersmith-Winograd algorithm [3], the time complexity can be reduced to O(n^2.3727). If the number of iterations is t, the time complexity of the whole algorithm is O(tn³). In our experiments, we observe that t is usually between 3 and 20.

4. EXPERIMENTAL EVALUATION

In this section, we experimentally evaluate the proposed method. First we discuss the evaluation metric and the baseline methods. Then we show the benefit of including heterogeneous sources and compare the proposed method with the baseline methods.

4.1 EVALUATION METRIC

Since we already know that some genes are informative, we evaluate the results in terms of those ground-truth genes. We propose to evaluate each result by the average rank of the ground-truth genes and by the number of them that appear in the result.

4.2 BASELINE METHODS

Borda Count [20]: Given k full lists τ_1, τ_2, ..., τ_k, Borda's method can be thought of as assigning a k-element position vector to each candidate (the positions of the candidate in the k lists) and sorting the candidates by the L1 norm of these vectors. There are also other variants of Borda count: sorting by the Lp norm for p > 1, sorting by the median of the k values, sorting by the geometric mean of the k values, etc. In this paper, we implemented four variants of Borda count: Borda1 (L1 norm), Borda2 (median), Borda3 (geometric mean) and Borda4 (L2 norm).

Markov Chain [1] [14]: Markov chains can also be used to generate consensus results. The states of the chain correspond to the n genes to be ranked, the transition probabilities depend in some particular way on the given lists, and the stationary probability distribution is used to sort the n candidates. We implemented three different Markov chains, as follows:

• MC1: If the current state is gene g, then the next state is chosen uniformly from the multiset of all genes that were ranked higher than or equal to g, i.e., from the multiset ∪_i {q | τ_i(q) ≤ τ_i(g)}. The transition probability is defined as 1/|G|, where |G| is the size of the gene pool. MC1 is a slight variation of the chain in [14].
• MC2: If the current state is gene g, then the next state is chosen uniformly from the multiset of all genes that were ranked higher than g in a majority of the input lists. The transition probability is defined as 1/|G|. MC2 can be seen as an extension of the chain in [1].

• MC3: If the current state is gene g, the probability of moving to state f is defined as ∑_{i=1}^{m} I(τ_i(f) < τ_i(g)) / (m × |G|), where τ_i(f) < τ_i(g) denotes that gene f is ranked higher than gene g in experiment i, and I is the indicator function counting the number of times τ_i(f) < τ_i(g). MC3 can be seen as a modification of the MCT algorithm in [1].

Cross Entropy Monte Carlo (CEMC) [13]: CEMC is a representative approach to the rank aggregation problem. Given k input lists τ_1, τ_2, ..., τ_k, the optimal result τ* satisfies τ* = argmin ∑_{i=1}^{k} w_i d(τ_i, τ*), where w_i is the weight of each input list and d is the distance between two lists. Lin et al. [13] proposed to employ the CEMC method to solve this optimization problem. We acquired the source code from the authors and experimented with four variants: CEMC1 (Spearman distance, unweighted, i.e., even weights w_i); CEMC2 (Spearman distance, weighted); CEMC3 (Kendall-tau distance, unweighted); CEMC4 (Kendall-tau distance, weighted).

4.3 RESULTS AND DISCUSSIONS

Using Only Ranked Lists: First, we show the experiments using the ranked list information alone. In this experiment, we set α = 2, β = 0 and γ = 0. The results are shown in Table 3.

Table 3: Results of the First and Second Experiments

Method | Appear # | Avg. Rank
Luo | 3 | 19.4
Welsh | 6 | 13.4
Dhana | 5 | 14.1
True | 3 | 19.8
Singh | 4 | 18.4
Borda1 | 5 | 15.2
Borda2 | 4 | 17.4
Borda3 | 7 | 15.8
Borda4 | 5 | 15.2
MC1 | 5 | 16.1
MC2 | 6 | 15.5
MC3 | 6 | 14.7
CEMC1 | 6 | 14.1
CEMC2 | 6 | 14.4
CEMC3 | 5 | 14.8
CEMC4 | 6 | 14.6
Ranked Lists Only | 7 | 13.2
Ranked Lists and Microarray | 7 | 13.0

Table 3 shows the results evaluated by the metric proposed in this section. If a ground-truth gene does not appear in a result, we assign its rank to be k+1 (26 in this case). We also list the evaluation results for the five prostate cancer studies. As we can see, the proposed method not only contains the most ground-truth genes (tying with Borda3) but also achieves the best average rank. The Borda count methods are better than the results of Luo, True and Singh but worse than those of Welsh and Dhana, as are the Markov chain methods. CEMC achieves a good average rank compared to the Borda and Markov chain methods and is only inferior to Welsh's result. An interesting finding is that among the CEMC methods, the Spearman distance with the unweighted case achieves the best performance, which is quite contrary to the intuition that the weighted case is normally better than the unweighted case. The first experiment shows that using the ranked lists alone, we achieve performance no worse than the best of the baseline methods. The detailed output of the first experiment is shown in the column First Exp. in Table 5.

Using Ranked Lists and Microarray Expression Data: In the second experiment, we take the microarray data into consideration. In this experiment, we set α = 2, β = 0 and γ = 2. The results are shown in the last row of Table 3. As seen from the results, adding microarray data boosts the performance of inferring informative genes. The detailed output of the second experiment is shown in the column Second Exp. in Table 5.
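As a concrete illustration of the evaluation metric behind Table 3 (a sketch under our own naming, not the authors' code), the snippet below counts how many ground-truth genes appear in a top-k result and computes their average rank, assigning rank k+1 to ground-truth genes missing from the result:

def evaluate(result_top_k, ground_truth, k=25):
    """Return (# ground-truth genes appearing, their average rank)."""
    ranks, appearances = [], 0
    for gene in ground_truth:
        if gene in result_top_k:
            ranks.append(result_top_k.index(gene) + 1)  # ranks are 1-based
            appearances += 1
        else:
            ranks.append(k + 1)  # missing genes get rank k+1 (26 for k = 25)
    return appearances, sum(ranks) / len(ranks)

# The confirmed informative genes from Section 2.1 serve as ground truth.
ground_truth = ["HPN", "AMACR", "FASN", "GUCY1A3", "ANK3",
                "STRA13", "CCT2", "CANX", "TRAP1"]
example_result = ["HPN", "AMACR", "FASN"]  # placeholder top-k output
print(evaluate(example_result, ground_truth))  # (3, 18.0)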
Using All Sources of Information: In the third experiment, we take all three sources of information into consideration, i.e., we set α = 2, β = 8 and γ = 2. The results are shown in Table 4.

Table 4: Results of the Third Experiment

Gene | Appear # | Avg. Rank
HPN | 7 | 13.0
AMACR | 8 | 12.5
FASN | 7 | 13.0
GUCY1A3 | 7 | 11.5
ANK3 | 7 | 11.0
STRA13 | 8 | 2.0
CCT2 | 6 | 13.1
CANX | 7 | 13.2
TRAP1 | 7 | 11.3

As seen from the results, adding only one piece of side information significantly improves the quality of the final result. When gene AMACR is assigned to be informative, the proposed method includes 8 informative genes, beating all the previous results. Moreover, gene AMACR is a high-ranked gene in the input lists, and therefore improving the result in this way is quite encouraging. We also notice that adding HPN as side information does not increase the performance at all, mainly because HPN is already predicted to be top-ranked in most input lists; adding such information does not help propagate information in the proposed method. Adding CCT2 to the proposed method decreases the performance in terms of both the number of ground-truth gene appearances and the average rank. This is because gene CCT2 appears only in the True and Singh lists; adding this information raises the probabilities of these two lists and correspondingly decreases the probabilities of the other three lists, so gene GUCY1A3 from the Dhana experiment is eliminated from the final result. The third experiment shows that by adding informative gene information, we can significantly increase the performance of the proposed method. The detailed output of the third experiment is shown in the column Third Exp. in Table 5.

Table 5: Detailed Experimental Outputs

Rank | Borda | MC | CEMC | First Exp. | Second Exp. | Third Exp.
1 | HPN | HPN | HPN | HPN | HPN | STRA13
2 | AMACR | AMACR | AMACR | GDF15 | GDF15 | HPN
3 | FASN | GDF15 | FASN | AMACR | AMACR | GDF15
4 | GDF15 | NME1 | GDF15 | NME1 | FASN | AMACR
5 | NME2 | FASN | NME2 | FASN | NME1 | NME1
6 | SLC25A6 | EEF2 | UAP1 | KRT18 | KRT18 | FASN
7 | EEF2 | KRT18 | OACT2 | EEF2 | EEF2 | KRT18
8 | OACT2 | UAP1 | SLC25A6 | ALCAM | ALCAM | EEF2
9 | OGT | NME2 | KRT18 | SND1 | SND1 | OACT2
10 | KRT18 | SLC25A6 | EEF2 | OACT2 | OACT2 | ALCAM
11 | NEM1 | OACT2 | STRA13 | STRA13 | STRA13 | SND1
12 | UAP1 | STRA13 | NME1 | CANX | CANX | CANX
13 | CCND2 | CANX | CANX | GRP58 | GRP58 | GRP58
14 | CYP1B1 | GRP58 | ALCAM | TMEM4 | TMEM4 | TMEM4
15 | CBX3 | SND1 | GRP58 | CCT2 | CCT2 | CCT2
16 | SAT | MTHFD2 | SND1 | NME2 | NME2 | NME2
17 | CANX | ALCAM | FMO5 | SLC25A6 | SLC25A6 | SLC25A6
18 | BRCA1 | MRPL3 | TMEM4 | CALR | CALR | CALR
19 | GRP58 | TMEM4 | CCT2 | EIF4A1 | EIF4A1 | EIF4A1
20 | MTHFD2 | PPIB | PRKACA | GUCY1A3 | GUCY1A3 | GUCY1A3
21 | STRA13 | SLC19A1 | MTHFD2 | LMAN1 | LMAN1 | LMAN1
22 | LGALS3 | CCT2 | PTPLB | OGT | OGT | OGT
23 | ANK3 | FMO5 | PPIB | STAT6 | STAT6 | ANK3
24 | GUCY1A3 | CYP1B1 | MRPL3 | TCEB3 | TCEB3 | TCEB3
25 | LDHA | ATF5 | SLC19A1 | LDHA | LDHA | STAT6

As seen from the results in Table 5, Borda successfully includes 7 ground-truth genes, but compared with our results, CANX, STRA13 and GUCY1A3 are ranked relatively low, making its result inferior to ours. In the Markov chain method, the rankings of CANX and STRA13 are quite close to those of the proposed method, yet it fails to include GUCY1A3, making its average rank worse than ours. CEMC is very good at predicting the ranking of the top-ranked genes in the input lists, namely HPN, AMACR and FASN, which are ranked high in most experiments, and its rankings of STRA13 and CANX are close to those of the proposed method. However, it also fails to include GUCY1A3.

In conclusion, we find that adding multiple heterogeneous sources of information improves the performance of finding informative genes. We also notice that adding the information from [23-28] gives the greatest boost to the performance, while the information from microarray expression data gives the least boost, which corresponds to the reliability of the sources: the information from [23-28] is the most reliable, while the microarray expression data are the least reliable. Furthermore, we notice that although the microarray expression data contain a lot of errors and noise, they can still improve the performance of finding informative genes for prostate cancer.

Parameter Sensitivity: There are three parameters in the proposed method: α, β and γ. We conducted the sensitivity experiments shown in Figure 3. α represents the confidence of our belief in the initial classes of the group nodes. The classes of the groups are obtained from the different experiments and hence may not be completely correct; therefore, a smaller α usually yields better performance. β reflects our confidence in the informative genes from other research. Those studies are deemed reliable, and thus a larger β is usually better. γ denotes the confidence in the microarray data. Those data may contain noise, so a lower value usually yields better results. The results in Figure 3 confirm these observations.

Figure 3: Parameter Sensitivity Experiment (performance as each of α, β and γ varies from 1 to 8)

5. RELATED WORK

Much of the previous work on this topic uses the ranked lists alone and treats the problem as a variant of the rank aggregation problem [1] [13]. The earliest approach to rank aggregation, or voting aggregation, was introduced by Jean-Charles de Borda [20] in 1770. Given k full lists τ_1, τ_2, ..., τ_k, Borda's method can be thought of as assigning a k-element position vector to each candidate (the positions of the candidate in the k lists) and sorting the candidates by the L1 norm of these vectors. The intuition behind Borda count is that "more wins is better". Its major limitation is its extensibility to partial lists: it has been shown that for partial lists, Borda count sometimes yields undesirable outcomes.

The rank aggregation problem has also been studied in the information retrieval and web search areas, where it is generally approached in two categories: unsupervised and supervised. Unsupervised approaches usually express the problem as a distance minimization problem: given k input lists τ_1, τ_2, ..., τ_k, the optimal result τ* satisfies τ* = argmin ∑_{i=1}^{k} w_i d(τ_i, τ*), where w_i is the weight of each input list and d is the distance between two lists. Choices of d include the Spearman distance [21] and the Kendall-tau distance [22]. This optimization problem is known to be NP-hard [22], and the various methods differ in how they solve it. The most recent work on the same problem [13] employed the cross entropy Monte Carlo method to solve this optimization problem. The major limitation of these works is that they are unable to include information from other sources, whereas the method proposed in this paper can utilize heterogeneous sources of information smoothly.

In addition, the problem in this paper is related to clustering ensembles or consensus clustering [18]: given a number of clustering results, find a consensus result that is better than any single input clustering result. Our work is related to [17] in that we use a bipartite graph to represent the problem.
Gao et al. [15] proposed a bipartite graph method to solve the consensus maximization problem: given a set of class labels and clustering results, find the consensus labels that achieve maximal consistency among them. Our work is similar to [15] in that both enforce information propagation between neighboring nodes of the bipartite graph.

6. CONCLUSIONS

In this work, we targeted the problem of finding informative genes for prostate cancer studies. Different from previous work that uses only one source of information, we proposed a general framework that is able to utilize heterogeneous sources of information: ranked lists, microarray expression data and confirmed informative genes. We formulated the problem as an optimization problem and proposed an iterative algorithm to solve it. We showed that the objective function is convex and that the iterative algorithm converges to the global minimum. Extensive experiments showed that including heterogeneous sources of information increases the performance and that the proposed method outperforms many baseline methods.

7. REFERENCES

[1] DeConde, R.P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., and Etzioni, R. Combining results of microarray experiments: a rank aggregation approach. Statistical Applications in Genetics and Molecular Biology, 2001.
[2] Bertsekas, D.P. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
[3] Coppersmith, D. and Winograd, S. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 1990.
[4] von Luxburg, U. A tutorial on spectral clustering. Technical Report, 2006.
[5] Luo, J., Duggan, D.J., Chen, Y., Sauvageot, J., Ewing, C.M., Bittner, M.L., Trent, J.M., and Isaacs, W.B. Human prostate cancer and benign prostatic hyperplasia: molecular dissection by gene expression profiling. Cancer Research, 2001.
[6] Welsh, J.B., Sapinoso, L.M., Su, A.I., Kern, S.G., Wang-Rodriguez, J., Moskaluk, C.A., Frierson, H.F. Jr., and Hampton, G.M. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Research, 2001.
[7] Dhanasekaran, S.M., Barrette, T.R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K.J., Rubin, M.S., and Chinnaiyan, A.M. Delineation of prognostic biomarkers in prostate cancer. Nature, 2001.
[8] True, L., Coleman, I., Hawley, S., Huang, A., Gifford, D., Coleman, R., Beer, T., Gelman, E., Datta, M., Mostaghel, E., Knudsen, B., Lange, P., Vessella, R., Lin, D., Hood, L., and Nelson, P. A molecular correlate to the Gleason grading system for prostate adenocarcinoma. Proceedings of the National Academy of Sciences, 2006.
[9] Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D'Amico, A.V., Richie, J.P., Lander, E.S., Loda, M., Kantoff, P.W., Golub, T.R., and Sellers, W.R. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002.
[10] Rhodes, D.R., Barrette, T.R., Rubin, M.A., Ghosh, D., and Chinnaiyan, A.M. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Research, 2002.
[11] Rhodes, D.R., Yu, J., Shanker, K., Deshpande, N., Varambally, R., Ghosh, D., Barrette, T., Pandey, A., and Chinnaiyan, A.M.
Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proceedings of the National Academy of Sciences USA, 2004.
[12] Ge, L., Du, N., and Zhang, A. Finding informative genes from multiple microarray experiments: a graph-based consensus maximization model. In Proc. of BIBM, 2011.
[13] Lin, S. and Ding, J. Integration of ranked lists via Cross Entropy Monte Carlo with applications to mRNA and microRNA studies. Technical Report, 2008.
[14] Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. Rank aggregation methods. In Proc. of WWW, 2001.
[15] Gao, J., Liang, F., Fan, W., Sun, Y., and Han, J. Graph-based consensus maximization among multiple supervised and unsupervised models. In Proc. of NIPS, 2009.
[16] Gao, J., Liang, F., Fan, W., Sun, Y., and Han, J. A graph-based consensus maximization approach for combining multiple supervised and unsupervised models. TKDE, 2011.
[17] Fern, X.Z. and Brodley, C.E. Solving cluster ensemble problems by bipartite graph partitioning. In Proc. of ICML, 2004.
[18] Strehl, A. and Ghosh, J. Cluster ensembles: a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2003.
[19] Zhou, D., Bousquet, O., Lal, T., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Proc. of NIPS, 2003.
[20] Borda, J.C. Memoire sur les elections au scrutin. Histoire de l'Academie Royale des Sciences, 1781.
[21] Diaconis, P. and Graham, R. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society Series, 1977.
[22] Bartholdi, J.J., Tovey, C.A., and Trick, M.A. Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 1989.
[23] Klezovitch, O., Chevillet, J., Mirosevich, J., Roberts, R., Matusik, R., and Vasioukhin, V. Hepsin promotes prostate cancer and metastasis. Cancer Cell, 2004.
[24] Kuefer, R., Varambally, S., Zhou, M., Lucas, P.C., Loeffler, M., Wolter, H., Mattfeldt, T., Hautmann, R.E., Gschwend, J.E., Barrette, T.R., Dunn, R.L., Chinnaiyan, A.M., and Rubin, M.A. alpha-Methylacyl-CoA racemase: expression levels of this novel cancer biomarker depend on tumor differentiation. 2002.
[25] Pizer, E.S., Pflug, B.R., Bova, G.S., Han, W.F., Udan, M.S., and Nelson, J.B. Increased fatty acid synthase as a therapeutic target in androgen-independent prostate cancer progression. Prostate, 2001.
[26] Dong, Y., Zhang, H., Gao, A.C., Marshall, J.R., and Ip, C. Androgen receptor signaling intensity is a key factor in determining the sensitivity of prostate cancer cells to selenium inhibition of growth and cancer-specific biomarkers. Mol. Cancer Ther., 2005.
[27] Ignatiuk, A., Quickfall, J.P., Hawrysh, A.D., Chamberlain, M.D., and Anderson, D.H. The smaller isoforms of ankyrin 3 bind to the p85 subunit of phosphatidylinositol 3'-kinase and enhance platelet-derived growth factor receptor down-regulation. Journal of Biological Chemistry, 2006.
[28] Ivanova, A., Liao, S.Y., Lerman, M.I., Ivanov, S., and Stanbridge, E.J. STRA13 expression and subcellular localisation in normal and tumour tissues: implications for use as a diagnostic and differentiation marker. Journal of Medical Genetics, 2005.