Proceedings of the IASTED International Conference Computational Intelligence and Bioinformatics (CIB 2011), November 7 - 9, 2011, Pittsburgh, USA
DOI: 10.2316/P.2011.753-019

AN ARTIFICIAL FISH SWARM BASED SUPERVISED GENE RANK AGGREGATION ALGORITHM FOR INFORMATIVE GENES STUDIES

Nan Du, Computer Science and Engineering Department, State University of New York at Buffalo, nandu@buffalo.edu
Supriya D Mahajan, Department of Medicine, State University of New York at Buffalo, smahajan@buffalo.edu
Bindukumar B Nair, Department of Medicine, State University of New York at Buffalo, bnair@buffalo.edu
Stanley A. Schwartz, Department of Medicine, State University of New York at Buffalo, sasimmun@buffalo.edu
Chiu Bin Hsiao, Department of Medicine, State University of New York at Buffalo, chsiao@buffalo.edu
Aidong Zhang, Computer Science and Engineering Department, State University of New York at Buffalo, azhang@buffalo.edu

ABSTRACT

With the widespread use of high-throughput genomic and protein analysis, more and more experiments have been carried out to identify the informative genes for various diseases, which gives researchers an opportunity to aggregate results across multiple microarray experiments via a rank aggregation approach. However, most current microarray rank aggregation methods are either unweighted or use prespecified weights, and both have obvious defects. In this paper, we define a new method that weights each ranked list automatically by considering its distances to the other ranked lists and its agreement with some prior knowledge. The problem of integrating ranked lists can then be formulated as minimizing an objective criterion, which has been proved to be NP-hard. Accordingly, we use an Artificial Fish Swarm Algorithm (AFSA) to solve the well-defined minimization problem of rank aggregation in terms of decision theory. We conduct two sets of experiments to evaluate the performance of our methods. The experimental results show that the proposed approach not only solves the optimization problem effectively but also yields biologically meaningful results.

KEY WORDS

Gene rank aggregation; automated weighting; microarrays; artificial fish swarm algorithm

1. Introduction

With the development of large-scale gene expression profiling technology, gene expression data analysis and modeling have become an important topic in bioinformatics research. Based on gene expression data, the diagnosis and treatment of various diseases at the DNA, RNA, or protein level are important. Informative genes, which are differentially expressed between two conditions (e.g. diseased and normal), are critical for a reliable diagnosis and simplify the experimental analysis. For example, discovering genes involved in multiple types of cancer is of significant therapeutic importance. Therefore, it is a challenge to identify the informative genes among thousands of genes.

With the widespread use of high-throughput genomic and protein analysis, more and more experiments have been carried out to identify the informative genes for various diseases. This provides researchers an opportunity to aggregate results across sets of experiments designed to explore the same biological phenomenon. However, combining information across multiple studies is challenging. In the case of microarray studies, the use of different technologies means that not all studies measure expression levels of the same genes. In addition, technical, biological, and other sources of variability generally lead to inconsistent measurements of gene expression.

In this paper, we propose an Artificial Fish Swarm Algorithm (AFSA) for the gene rank aggregation problem. AFSA is a swarm intelligence method for searching for the global optimum, inspired by the natural schooling behavior of fish. AFSA has proved effective in function optimization, parameter estimation, combinatorial optimization, least squares support vector machines and geotechnical engineering problems. Its remarkable property is that it can search globally in a rather large space, is insensitive to initial values, and does not easily get stuck in a local optimum. Here we use an AFSA-based algorithm to solve the minimization problem of rank aggregation.

We briefly discuss related work on rank aggregation in Section 2. In Section 3, we present a novel method that weights each ranked list automatically, and we explain in detail the AFSA used to solve the minimization problem. Finally, experimental results that show the effectiveness of the proposed method are presented in Section 4.

2. Related Work

Rank aggregation, a classical problem, is defined as aggregating the ranked results of entities from multiple ranking lists with the aim of finding a more robust and reliable ranking. The individual ranking lists may come from different methods, experiments or voters. The problem has a history of more than two hundred years; it was first proposed by Borda in 1781 [3]. In the case of gene rank aggregation, the goal is to aggregate measures of expression at the DNA, RNA, or protein level [25].

To obtain a more reliable and robust aggregated gene list, several approaches have been proposed. DeConde et al. [4] first formulated the rank aggregation problem for identifying informative genes from multiple microarray experiments. Their method combined a relatively small number of ranked lists using Markov chains and is partly based on the algorithms of Dwork et al. [8] for spam webpage detection. Lin et al. [18, 20] used cross entropy Monte Carlo methods to combine lists of microRNA targets. Pihur et al. [22] used a method similar to that of Lin et al. to select informative genes from 20 different cancer-related microarray datasets.

So far, most methods (including rank aggregation methods that do not target gene ranked lists) either weight each ranked list evenly, which is also called unweighted [4], or give each ranked list a prespecified weight [20] based on some domain experience. However, both the unweighted and the prespecified-weight approach have obvious defects. The former ignores the fact that some ranked lists are more important than others. Ranked lists from different experiments, datasets or algorithms have different reliability, so they should not be treated equally. Although the latter may be effective in some cases, the prespecified weights are hard to evaluate. In addition, if we need to combine the results of hundreds or thousands of experiments, manual weighting is nearly impossible in practice. An automated and objective way of weighting the ranked lists is needed. Our goal is to weight each ranked list automatically based on some prior knowledge, which we call supervised gene rank aggregation.

The main contributions of this paper are outlined as follows:

• We define a new method to weight each ranked list automatically by considering its distances to the other ranked lists and its agreement with some prior knowledge. From the consensus assumption, we know that a list that is more consistent with the other ranked lists tends to generate a more accurate ranking of the genes. Based on this assumption, the aggregated ranked list would be more reliable.

• By simulating the schooling behavior of fish with the Artificial Fish Swarm Algorithm (AFSA), we propose a novel method for solving the well-defined minimization problem of gene rank aggregation. As far as we know, no swarm intelligence algorithm has previously been used to solve the rank aggregation problem.

3. Methods

3.1 Preliminaries

Let $S_A(k)$ be the ranked list with $k$ genes from experiment, dataset or algorithm $A$, depending on the context of the problem. Although it is not a necessary condition, for the sake of simplicity we assume that all input lists of the rank aggregation method have the same size $k$. Hereafter, we refer to such a list as a top-k list and drop the $k$ in $S_A(k)$ where there is no ambiguity. Let $S = \cup_A S_A$ be the union of the elements of all the top-k lists. Suppose there are $n$ elements in total in $S$, indexed from 1 through $n$ in no particular order of preference. Let $\tau$ be any ordered subset of $S$ of size $k$ without duplicates. Then the goal is to find the ranked list $\tau^*$ that has minimal disagreement with all the lists:

$$\tau^* = \arg\min_{\tau \subset S} \sum_{A} w_A \, d(\tau, S_A), \tag{1}$$

where $w_A$ is a weight for list $A$, and $d$ denotes a distance measure such as Spearman's footrule or the Kendall tau distance. Such a $\tau^*$ is referred to as the top-k list of the aggregate candidate targets set, and $y^* = \Phi(\tau^*)$ is the optimal (minimum) value of the objective function $\Phi$, i.e. the weighted sum in Eq. (1). For each $i \in S$ with $i \notin S_A$, let $R_A(i) = k + 1$, where $R_A(i)$ is the rank of gene $i$ in list $A$.

Besides the definitions above, we introduce two ways of computing the distance between two ranked lists: Kendall's tau and Spearman's footrule distance [9].

3.1.1 Kendall's Tau

Given two permutations $L_1$, $L_2$ of $n$ items, the Kendall tau distance is defined as:

$$K(L_1, L_2) = \sum_{i=1}^{n} \sum_{j=1}^{n} I\left\{[R_{L_1}(i) - R_{L_1}(j)]\,[R_{L_2}(i) - R_{L_2}(j)] < 0\right\}, \tag{2}$$

where $I$ is the indicator function, equal to 1 if the condition inside the braces is satisfied and 0 otherwise.

3.1.2 Spearman's Footrule

The Spearman's footrule distance between two ranked lists $L_1$ and $L_2$ is defined as:

$$F(L_1, L_2) = \sum_{i=1}^{n} \left|R_{L_1}(i) - R_{L_2}(i)\right|. \tag{3}$$
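The two distance measures can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the helper names (`ranks`, `kendall_tau`, `spearman_footrule`) and the toy gene lists are hypothetical, the universe of items is taken to be the union of the two lists, and an item missing from a top-k list receives rank k + 1 as in Section 3.1. The Kendall function assumes that the indicator in Eq. (2) is applied to the product of rank differences, and it follows the double sum over all ordered pairs, so each discordant pair is counted twice.

```python
def ranks(top_k, universe):
    """Map every item of `universe` to its rank in `top_k`; unranked items get k + 1."""
    k = len(top_k)
    position = {item: r for r, item in enumerate(top_k, start=1)}
    return {item: position.get(item, k + 1) for item in universe}


def kendall_tau(l1, l2):
    """Double sum over ordered pairs (i, j) whose relative order differs in l1 and l2."""
    universe = sorted(set(l1) | set(l2))
    r1, r2 = ranks(l1, universe), ranks(l2, universe)
    return sum(
        1
        for i in universe
        for j in universe
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )


def spearman_footrule(l1, l2):
    """Sum of absolute rank differences over all items."""
    universe = sorted(set(l1) | set(l2))
    r1, r2 = ranks(l1, universe), ranks(l2, universe)
    return sum(abs(r1[i] - r2[i]) for i in universe)


# Two hypothetical full rankings of six genes.
l1 = ["A", "B", "C", "D", "E", "F"]
l2 = ["C", "E", "F", "D", "A", "B"]
print(kendall_tau(l1, l2), spearman_footrule(l1, l2))  # prints: 20 16

# Two hypothetical top-3 lists; genes absent from a list get rank 4.
print(spearman_footrule(["A", "B", "D"], ["A", "E", "F"]))  # prints: 6
```

The more common convention counts each unordered pair once; halving the value returned by `kendall_tau` gives that convention.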
3.2 Supervised Ranked List Weighted Method

With the development of biological technologies, more and more genes have been identified as either important to or independent of a certain disease. We use this prior knowledge to weight each ranked list more reasonably than unweighted or prespecified-weight methods do. One gene list is more reliable than the others if its top-rated genes are more likely to be the truly informative genes. However, we usually know neither how many informative genes there are for a certain disease nor their exact ranking. Therefore, a gene list can alternatively be regarded as "reliable" if it agrees as much as possible with the other ranked lists and with the already known prior (supervised) knowledge. Based on this assumption, we define the weight of each ranked list $A$ as:

$$W_{L_A} = \frac{\log_2\left(\sum_{u \in U} R_{L_A}(u)\right)}{\log_2\left(\sum_{u \in I} R_{L_A}(u)\right) \cdot \sum_{i \neq A} D(L_A, L_i)}, \tag{4}$$

where $I$ is the set of already identified informative genes, $U$ is the set of genes already identified as uninformative, and $D$ is the distance metric between two lists, for which the Kendall's tau or Spearman's footrule distance defined above can be used. The function mainly includes two parts: the penalty for disagreement with the already identified informative genes and the penalty for disagreement with the other ranked lists. According to our consensus assumption, when a ranked list is more consistent with most of the other lists, its weight should be higher. Meanwhile, the penalty for disagreement with the already identified informative genes is defined as $\log_2\left(\sum_{u \in I} R_{L_A}(u)\right)$, which is designed to achieve the following three goals:

• The already identified informative genes should be included in the top-rated items; otherwise the ranked list is less credible.

• Since we know neither how many informative genes there are for a certain disease nor their exact ranking, a ranked list that ranks the informative genes higher should be more reliable than one that ranks them at the bottom of the top-k genes.

• Conversely, if identified uninformative genes are included in the top-rated genes, the list is less credible. In addition, a ranked list that ranks uninformative genes high is less reliable than one that ranks them low among the top-rated genes.

Let us look at two toy examples.

Example 1: There are two top-3 ordered lists $L_1$, $L_2$ which partially rank 6 genes A, B, C, D, E, F as:

L1 = [A ≻ B ≻ D]
L2 = [A ≻ E ≻ F]

Note that "A ≻ B" means that gene A is preferred to gene B, and the relation ≻ has the properties of antisymmetry, transitivity and linearity. When A, B and D are identified by some studies or experiments as informative genes, it is obvious that $L_1$ should be more reliable than $L_2$, since $L_1$ has three identified informative genes in its top-rated list, whereas $L_2$ has only one.

Example 2: There are two top-6 ordered lists $L_1$, $L_2$ which rank 6 genes A, B, C, D, E, F as:

L1 = [A ≻ B ≻ C ≻ D ≻ E ≻ F]
L2 = [C ≻ E ≻ F ≻ D ≻ A ≻ B]

When A and B are identified by some studies or experiments as informative genes, the average rank of these two informative genes is 1.5 in $L_1$ and 5.5 in $L_2$. Since we know neither how many informative genes there are for a certain disease nor their exact ranking, a ranked list that ranks the informative genes higher should be more reliable than one that ranks them at the bottom of the top-k genes, which means that $L_1$ is more reliable than $L_2$. The situation for uninformative genes is just the opposite of that for identified informative genes and is not discussed in detail here.

Intuitively, a ranked list is more reliable if it has less disagreement with the other ranked lists. The penalty for disagreement with the other ranked lists is defined as $\sum_{i \neq A} D(L_A, L_i)$. The higher the penalty a ranked list gets, the less it is weighted, which fits our consensus assumption.

3.3 Artificial Fish Swarm Algorithm

The Artificial Fish Swarm Algorithm (AFSA) is a method for searching for the global optimum proposed by Li et al. [15], inspired by the natural schooling behavior of fish. AFSA shows a strong capability to avoid local optima in order to reach the global optimum. It has proved effective in various fields such as function optimization [17], parameter estimation [16] and combinatorial optimization [14].

3.3.1 Principle of Artificial Fish Swarm Algorithm

In a natural environment, fish usually stay where there is a lot of food, so we simulate the behavior of fish based on this characteristic to find the global optimum; this is the basic idea of AFSA. Accordingly, AFSA builds artificial fish (AF) that search for an optimal solution in the solution space by imitating fish swarm behavior. AFSA imitates three typical fish behaviors: "Prey Behavior", "Swarm Behavior" and "Follow Behavior".

1. Prey: Trying to find food is a basic biological behavior of fish. In general, the fish stroll randomly at the beginning; when they discover an area with a higher food concentration, they swim quickly toward that area.

2. Swarm: Fish naturally assemble in groups while moving, a living habit that guarantees the existence of the colony and helps avoid enemies.

3. Follow: When a single fish or several fish find food (or reach an area with a higher food concentration), the surrounding fish trail them and approach the food quickly.

The detailed process of these AF behaviors for gene rank aggregation is discussed below.

3.3.2 The Artificial Fish Swarm Algorithm for Gene Rank Aggregation

The objective function that we want to minimize is NP-hard to optimize when the Kendall tau distance is used as the distance measure, as shown by Dwork et al. [7]. Since the rank aggregation problem is NP-hard, no efficient algorithm is known that is guaranteed to find the optimal solution, and different algorithms will provide different solutions. Thus, we want an algorithm that converges swiftly, is insensitive to initial values and does not easily get stuck in a local optimum. Therefore, we use the Artificial Fish Swarm Algorithm presented below to predict the top-k ranked informative genes.

Suppose that the search space is k-dimensional (a top-k ranked gene list) and that $m$ AFs form the colony. The current state of the $i$th AF can be expressed by a vector $X_i = (G_1, G_2, \ldots, G_k)$, where each $G_j \in \{1, 2, \ldots, n\}$ is an element of $S$ (the set of all elements present in the top-k lists, defined above) and is the variable to be searched for the optimum. For example, $X_i = (2, 4, 1, 5)$ expresses that the $i$th AF (an artificial fish, which represents a potential solution) has chosen genes 2, 4, 1 and 5 as the top-4 informative genes, with 2 ≻ 4 ≻ 1 ≻ 5. The food concentration at the present position of the $i$th AF is represented by $Y_i = f(X_i)$, where $f$ is the objective function (e.g. based on Kendall's tau or Spearman's footrule); $Visual$ is the vision distance; $Step$ is the maximal step length; and $\delta$, $0 < \delta < 1$, is the congestion degree. Furthermore, $Max$ is the maximum number of iterations and $T$ is the number of prey iterations. In addition, there are some more definitions:

Definition 1: $D(X_i, X_j)$, the distance between two AFs, is defined as

$$D(X_i, X_j) = |X_i - X_j| + |X_j - X_i|, \tag{5}$$

which shows that the distance between AFs $X_i$ and $X_j$ is given by the number of different genes between them. For example, the distance between $X_i = (A, B, C, D)$ and $X_j = (A, C, D, B)$ is 3; here the two states hold different genes at positions 2, 3 and 4.

Definition 2: $N(X, Visual)$, the visual-neighbors of AF $X$, i.e. the AFs whose distance to $X$ is within $Visual$:

$$N(X, Visual) = \{X^* \mid D(X, X^*) < Visual\}, \tag{6}$$

where each $X^*$ is a neighbor of $X$.

Definition 3: $X_c$, the center of the AF set $X_1, X_2, \ldots, X_m$:

$$C(X_1, X_2, \ldots, X_m) = \bigcup_{i=1}^{m} \bigcup_{j=1,\, j \neq i}^{m} (X_i \cap X_j). \tag{7}$$

Prey behavior. Prey is a basic biological behavior adopted by fish looking for food. It is based on a random search with a tendency toward higher food concentration. It has two steps:

Step 1: For an AF $X_i$, randomly select a new position $X_j$ from the visual-neighbors within its $Visual$.

Step 2: If $Y_j < Y_i$, replace $X_i$ with $X_j$, so that $X_j$ becomes the new state of the $i$th AF; otherwise go back to Step 1. If an AF has tried more than $T$ times without finding a better state, it moves to a random position.

Swarm behavior. Fish assemble in swarms to minimize danger or search for food more easily.

Step 1: For an AF $X_i$, calculate $n_f$, the number of visual-neighbors of the $i$th AF.

Step 2: If $n_f \neq 0$ and $n_f / m < \delta$, determine the central position $X_c$ of all surrounding artificial fish. In other words, when the visual scope of the $i$th AF is not empty and the center is not crowded, we calculate the objective function $Y_c = f(X_c)$. If $Y_c < Y_i$, which means that the center has a higher food concentration than the fish's current position $X_i$, replace $X_i$ with $X_c$; otherwise execute the prey behavior. If $n_f = 0$, also execute the prey behavior.

Follow behavior. A fish searches neighboring individuals to follow.
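The state representation, the distance between two AFs and the prey steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the toy lists and the equal weights are hypothetical; the AF distance is read as the number of positions at which two states differ, which reproduces the value 3 in the example of Definition 1; a position "within Visual" is drawn by a single replacement or swap, which is only one possible choice; and the objective uses the Spearman's footrule distance.

```python
import random


def footrule(candidate, ranked_list):
    """Spearman's footrule between two top-k lists; unranked genes get rank k + 1."""
    k = len(ranked_list)
    rc = {g: r for r, g in enumerate(candidate, start=1)}
    rl = {g: r for r, g in enumerate(ranked_list, start=1)}
    universe = set(candidate) | set(ranked_list)
    return sum(abs(rc.get(g, k + 1) - rl.get(g, k + 1)) for g in universe)


def objective(candidate, lists, weights):
    """Y = f(X): weighted sum of distances from the candidate to every input list."""
    return sum(w * footrule(candidate, l) for w, l in zip(weights, lists))


def af_distance(x_i, x_j):
    """Number of positions at which two AF states differ (assumed reading of Definition 1)."""
    return sum(a != b for a, b in zip(x_i, x_j))


def random_neighbor(x, genes, visual, rng):
    """Return a state that differs from x in at most 2 positions (requires visual > 2)."""
    if visual <= 2:
        raise ValueError("this sketch changes up to 2 positions, so visual must exceed 2")
    y = list(x)
    i, j = rng.sample(range(len(y)), 2)
    unused = [g for g in genes if g not in y]
    if unused and rng.random() < 0.5:
        y[i] = rng.choice(unused)   # bring in a gene not yet in the list
    else:
        y[i], y[j] = y[j], y[i]     # swap the ranks of two genes
    return tuple(y)


def prey(x, genes, lists, weights, visual, tries, rng):
    """Prey behavior: probe up to `tries` neighbors, move on improvement, else jump randomly."""
    y_x = objective(x, lists, weights)
    for _ in range(tries):
        candidate = random_neighbor(x, genes, visual, rng)
        if objective(candidate, lists, weights) < y_x:
            return candidate
    return tuple(rng.sample(genes, len(x)))


rng = random.Random(0)
lists = [("A", "B", "C", "D"), ("A", "C", "B", "E"), ("B", "A", "C", "D")]
weights = [1 / 3, 1 / 3, 1 / 3]
genes = sorted({g for l in lists for g in l})

x = tuple(rng.sample(genes, 4))
start = best = objective(x, lists, weights)
best_x = x                          # plays the role of the record of the best state
for _ in range(200):
    x = prey(x, genes, lists, weights, visual=3, tries=5, rng=rng)
    y = objective(x, lists, weights)
    if y < best:
        best, best_x = y, x
print(best_x, best)
```

Because a fish that finds no improvement jumps to a random state, the current state can get worse from one iteration to the next; keeping a separate record of the best state seen so far is what makes the final answer monotone in this sketch.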
Within a fish's visual range, certain fish will be perceived as finding a greater amount of food than others, and the fish will naturally try to follow the best one in order to increase satisfaction.

Step 1: For an AF $X_i$, select the best state $X_j$ with the minimum $Y_j$ within $N(X_i, Visual)$, which means that the $j$th AF has the minimum objective function value among the visual-neighbors of the $i$th AF.

Step 2: Calculate the number of neighbors in the $i$th AF's visual scope, denoted $n_f$. If $Y_j < Y_i$ and $n_f / m < \delta$ (the area around the $i$th AF is not too crowded), replace $X_i$ with $X_j$; otherwise execute the prey behavior.

Besides, AFSA maintains a bulletin board that records the optimal solution and the current performance of the fish during each iteration. In addition, the algorithm terminates when it reaches the maximum number of iterations or when the best AF state on the bulletin board does not change for $\epsilon$ rounds ($\epsilon$ is a small positive tolerance). The procedure of AFSA is shown in Figure 1, which arranges the above behaviors into a single process.

AFSA is thus an optimization method based on swarm intelligence theory. The food concentration can be regarded as the objective function value, and the state of an AF is the potential solution to be optimized. Prey behavior is a self-learning process that lets an AF move randomly to find a higher food concentration. Both swarm behavior and follow behavior are processes in which an AF interacts with its surroundings. They ensure that the area around an AF does not become too crowded and that the moving direction of an AF is consistent with the average moving direction of the surrounding partners that are moving toward the colony extreme, so that the convergence of the colony is maintained. After finishing the process in Figure 1, the AF reaches the place where the food concentration is highest, which is recorded on the bulletin board.

[Figure 1. The flow chart of Artificial Fish Swarm Algorithm. Boxes: Initialize all fishes; Iteration counter n=0; For each AF: Try Follow Behavior, Any Progress?, Do Follow Behavior; Try Swarm Behavior, Any Progress?, Do Swarm Behavior; Try Prey Behavior, Any Progress?, Do Prey Behavior; t++, each T or meet criteria?; Final Solution.]

4. Experiments

In this section, three experiments are carried out to illustrate and compare the performance of the various algorithms. The first experiment uses a contrived dataset to evaluate the optimum-searching effectiveness of AFSA. The last two experiments are applications that aggregate the results of five gene expression studies of prostate cancer and three gene analysis results for HIV, respectively. Through these three experiments, we illustrate that our method not only solves the optimization problem but also produces biologically meaningful results.

4.1 Contrived Dataset

In this experiment, we aim to aggregate three top-40 ranked lists into one aggregated top-40 list. The data was proposed by Lin et al. in 2010 [19]. For the sake of simplicity, the dataset is not listed here again; a detailed description can be found in Lin's paper. It is worth noting that there are 65 different items in the candidate set $S$, which means that there are more than $5 \times 10^{65}$ potential aggregated lists. Obviously, it is impossible to evaluate each potential list by brute force. Moreover, since this data does not include any prior knowledge and our goal is to demonstrate the optimum-searching ability of AFSA, the supervised ranked list weighting method is not used here.

For comparison with our method, nine other rank aggregation algorithms from three main categories (also taken from Lin et al.) are listed: Borda's method [3], Markov chain [8] and cross entropy Monte Carlo [20]. For Borda's method, the ranking is based on a score, and four aggregation functions are used: arithmetic mean (ARM), median (MED), geometric mean (GEM) and l2-norm (L2N). For the Markov chain method, the extended MC1, MC2 and MC3 methods, which are based on different strategies, are used. The cross entropy Monte Carlo method directly minimizes either the Kendall distance (CEK) or the Spearman distance (CES). Similarly, our method is run with each of the two objective functions, denoted $\mathrm{ASFA}_K$ for the Kendall tau distance and $\mathrm{ASFA}_S$ for the Spearman distance. The results of all the above methods are listed in Table 1 (some results are taken directly from [19]).

Table 1. The Result of Proposed Algorithm and Other Algorithms

| Algorithm | Kendall | Spearman |
|---|---|---|
| ARM | 404 | 488 |
| MED | 400 | 450 |
| GEM | 404 | 504 |
| L2N | 401 | 476 |
| MC1 | 402 | 480 |
| MC2 | 402 | 450 |
| MC3 | 400 | 450 |
| CEK | 392 | 450 |
| CES | 400 | 450 |
| $\mathrm{ASFA}_S$ | 412 | 450 |
| $\mathrm{ASFA}_K$ | 392 | 450 |

The results show that our method reaches the minimal objective function value under both the Kendall tau measure and the Spearman measure. The aggregated ranked list based on the Kendall tau distance is: 1 ≻ 2 ≻ 3 ≻ 4 ≻ 5 ≻ 6 ≻ 7 ≻ 8 ≻ 9 ≻ 10 ≻ 11 ≻ 12 ≻ 13 ≻ 14 ≻ 15 ≻ 16 ≻ 17 ≻ 18 ≻ 19 ≻ 20 ≻ 21 ≻ 22 ≻ 23 ≻ 24 ≻ 25 ≻ 26 ≻ 27 ≻ 28 ≻ 29 ≻ 30 ≻ 46 ≻ 47 ≻ 48 ≻ 49 ≻ 50 ≻ 41 ≻ 31 ≻ 42 ≻ 56 ≻ 57. More specifically, our method obtains a Spearman distance of 450, which is as small as that of MED from Borda's method, MC2 and MC3 from the Markov chain methods, and CEK and CES from the stochastic methods, and a Kendall tau distance of 392, which equals the minimal value obtained by CEK. The experiment shows the optimum-searching effectiveness of AFSA on different objective functions.

4.2 Real Dataset

4.2.1 Prostate Cancer Dataset

In this experiment, we evaluate the effectiveness of our supervised rank aggregation method on five ranked lists of informative genes for prostate cancer. The dataset, which gives the rankings of the top-25 genes from five prostate tumor studies, was presented in DeConde et al. [4]. For the sake of simplicity, the dataset is not listed here again. Nine genes listed by DeConde et al. have already been identified by biologists as informative genes for prostate cancer: HPN, AMACR, FASN, GUCY1A3, ANK3, STRA13, CCT2, CANX and TRAP1 [5, 12, 13, 10, 23]. These identified informative genes are used as prior knowledge that helps weight each ranked list via our supervised rank aggregation method. Since we have no information about identified uninformative genes, we do not discuss them. Under our assumption, the more a ranked list agrees with the prior knowledge and the other ranked lists, the higher it is weighted. Based on our supervised rank aggregation method, the weights of the five ranked lists (Luo, Welsh, Dhana, True and Singh) are 0.1634, 0.2278, 0.2245, 0.1896 and 0.1947, respectively. These values are consistent with Eq. (4) evaluated on the "Disagreement" and "Total" columns of Table 2, with the numerator taken as 1 and the weights normalized to sum to one; for Luo, $1 / (\log_2 175 \times 2082) \approx 6.45 \times 10^{-5}$, which is a share of 0.1634 of the total over the five lists. Luo is given the lowest weight, since it has the least in common with the other four studies and its sum of ranks of the identified genes is the second highest. Welsh and Dhana are given relatively higher weights, because both their disagreement and their sum of ranks of the identified genes are low. After aggregation through our method, the aggregated ranked list is: HPN ≻ AMACR ≻ FASN ≻ GDF15 ≻ OACT2 ≻ NME2 ≻ UAP1 ≻ SLC25A6 ≻ NME1 ≻ KRT18 ≻ MRPL3 ≻ EEF2 ≻ STRA13 ≻ SND1 ≻ GRP58 ≻ CANX ≻ ANK3 ≻ TMEM4 ≻ CCT2 ≻ ATF5 ≻ CYP1B1 ≻ MTHFD2 ≻ PPIB ≻ LGALS3 ≻ MYC.

Note that in the experiments on real datasets we do not compare with other algorithms because, as mentioned above, there is no other automatic weighting method based on prior knowledge. Since the weighting methods are different, it is hard to compare the results.

The statistics of the identified informative genes in each ranked list are shown in Table 2, where "Disagreement" denotes the total disagreement with the other ranked lists (computed with the Kendall tau distance here), "Include #" denotes the number of identified informative genes included in the top-25 items, "Total" denotes the sum of ranks of the nine identified informative genes (as mentioned before, any element not ranked by a list is treated as having rank k+1, with k = 25 here) and "Ave" denotes the average rank of the nine identified informative genes. Our method, which minimizes the weighted Kendall tau distance, is denoted $\mathrm{ASFA}_K$. Table 2 shows that the aggregated ranked list from our method has the minimal disagreement with the other five ranked lists and includes the largest number of identified informative genes among the top-25 genes. Therefore, this experiment illustrates that the ranked list aggregated through our method not only has minimal disagreement with the other lists but also has stronger biological significance.

Table 2. The Statistics Result of Five Gene Ranked List for Prostate Cancer and the Aggregated List from the Proposed Algorithm

| Experiment | Disagreement | Include # | Total | Ave |
|---|---|---|---|---|
| Luo | 2082 | 3 | 175 | 19.4 |
| Welsh | 1609 | 6 | 121 | 13.4 |
| Dhana | 1608 | 5 | 130 | 14.4 |
| True | 1789 | 3 | 178 | 19.8 |
| Singh | 1766 | 4 | 166 | 18.4 |
| $\mathrm{ASFA}_K$ | 1379 | 7 | 124 | 13.7 |

4.2.2 HIV Dataset

Our HIV dataset has 24 progressors, which are separated into 2 groups: NP (normal progressors) and LTNP (long-term non-progressors). There are 36 candidate genes in total: CD40L, TNF, FASL, IL-17, TGFb2, CCL2, CCL3, CCL5, CXCL1, CXCL2, CXCL3, CXCL5, BCL-2, CASP3, CASP5, CASP6, CASP7, CASP8, CCR7, CLDN7, COLIAI, MAP2K4, MAP2K5, TIMP-1, TIMP-2, TIMP-3, TIMP-4, VEGF-A, TGFb-1, TNF-a, MAP2K7, MMP-1, MMP-2, MMP-10, MMP-13 and MMP-14. Meanwhile, CCL2, CCL3 and CCL5, which all belong to the chemokine family, have been identified as informative genes among these candidate genes [1, 2, 6, 21, 24]. We chose three classical informative gene selection methods: SAM (Significance Analysis of Microarrays), Student's t-test and the Wilcoxon signed-rank test. Although these are simple methods, they are often used in microarray data analysis and are widely accepted by biologists. For the sake of simplicity, only the top-30 genes are kept from each of these three ranked lists, as shown in Table 3. The aggregated top-30 list from the combined set $S$ that minimizes the weighted sum of Kendall tau distances and agrees with the prior knowledge is listed in Table 3 under the column $\mathrm{ASFA}_K$.

Table 3. Three HIV Gene Ranking Lists and Aggregated List from the Proposed Algorithm

| Rank | $L_{ttest}$ | $L_{SAM}$ | $L_{Wilcoxon}$ | $\mathrm{ASFA}_K$ |
|---|---|---|---|---|
| 1 | TIMP-2 | TIMP-2 | CXCL1 | TIMP-2 |
| 2 | TIMP-1 | TGFb-1 | TIMP-1 | TIMP-1 |
| 3 | CASP8 | TIMP-1 | CCL3 | CASP8 |
| 4 | TGFb-1 | CASP8 | CASP8 | TGFb-1 |
| 5 | CXCL1 | MMP-14 | MMP-14 | CXCL1 |
| 6 | MMP-14 | TNF | TIMP-2 | MMP-14 |
| 7 | FASL | CCR7 | FASL | CCL3 |
| 8 | CCL3 | IL-17 | CXCL5 | FASL |
| 9 | CXCL5 | MAP2K4 | TGFb-1 | CXCL5 |
| 10 | MMP-1 | BCL-2 | MMP-1 | MMP-1 |
| 11 | CXCL2 | CASP6 | CXCL2 | CXCL2 |
| 12 | MAP2K4 | CCL2 | CCL5 | MAP2K4 |
| 13 | MAP2K5 | MMP-1 | MAP2K5 | MAP2K5 |
| 14 | CASP5 | MMP-13 | TGFb2 | CASP3 |
| 15 | CASP3 | CCL3 | CASP3 | TNF |
| 16 | TNF | CASP3 | TIMP-4 | CASP5 |
| 17 | TGFb2 | MAP2K5 | IL-17 | TGFb2 |
| 18 | CCL5 | VEGF-A | CCR7 | CCL5 |
| 19 | TIMP-4 | CXCL3 | CLDN7 | TIMP-4 |
| 20 | CLDN7 | COLIAI | TIMP-3 | IL-17 |
| 21 | IL-17 | CD40L | MAP2K4 | CCR7 |
| 22 | TIMP-3 | CXCL2 | CASP6 | CLDN7 |
| 23 | CASP6 | CXCL5 | COLIAI | TIMP-3 |
| 24 | CCR7 | CASP7 | TNF | CASP6 |
| 25 | CCL2 | CLDN7 | CASP5 | CCL2 |
| 26 | CD40L | MMP-2 | CCL2 | COLIAI |
| 27 | TNF-a | CASP5 | TNF-a | CD40L |
| 28 | COLIAI | MAP2K7 | CD40L | TNF-a |
| 29 | MAP2K7 | CXCL1 | MMP-10 | MAP2K7 |
| 30 | VEGF-A | TIMP-3 | MAP2K7 | BCL-2 |

The information on the identified informative genes in each ranked list is shown in Table 3, where the aggregated ranked list based on the Kendall tau distance is listed under the column $\mathrm{ASFA}_K$. The aggregated ranked list from our method has the most in common with the other three ranked lists and includes all the identified informative genes in the top-30 genes. In Table 4, we can see that the proposed method achieves the best performance by having the most in common with the other three ranked lists and including all the identified informative genes with a relatively low average rank.

Table 4. The Statistics Result of Three Gene Ranked Lists for HIV and the Aggregated List

| Experiment | Disagreement | Include # | Total | Ave |
|---|---|---|---|---|
| $L_{ttest}$ | 284 | 3 | 51 | 17 |
| $L_{SAM}$ | 480 | 2 | 58 | 19.3 |
| $L_{Wilcoxon}$ | 312 | 3 | 41 | 13.7 |
| $\mathrm{ASFA}_K$ | 274 | 3 | 50 | 16.6 |

5. Conclusion

So far, most rank aggregation methods either weight each ranked list evenly or give each ranked list a prespecified weight. However, both approaches have obvious defects. An automatic weighting method based on some prior knowledge, which makes the result more reasonable and reliable, is greatly needed. In this paper, we not only defined a new method to weight each ranked list automatically by considering its distances to the other ranked lists and its agreement with some prior knowledge, but also proposed a novel algorithm based on AFSA, which has been applied successfully to several combinatorial optimization problems, for solving the well-defined minimization problem of gene rank aggregation. We demonstrated the performance of the proposed approach by experiments on one contrived dataset and two real datasets. The experimental results show that the proposed approach not only solves the optimization problem but also produces biologically meaningful results.

Acknowledgements

N.D designed the project, conducted the project and wrote the manuscript. S.D.M and B.B.N modified the manuscript. S.A.S and C.B.H provided the HIV data. A.Z supervised the study and aided in manuscript preparation.

References

[1] An P, Nelson GW, Wang L, Donfield S, Goedert JJ, Phair J, Vlahov D, Buchbinder S, Farrar WL, Modi W, O'Brien SJ, Winkler CA. Modulating influence on HIV/AIDS by interacting RANTES gene variants. Proc. Natl. Acad. Sci. U. S. A. 99, 10002-10007, 2002.
[2] Bleiber G, May M, Martinez R, Meylan P, Ott J, Beckmann JS, Telenti A. Use of a combined ex vivo/in vivo population approach for screening of human genes involved in the human immunodeficiency virus type 1 life cycle for variants influencing disease progression. J. Virol. 79, 12674-12680, 2005.
[3] Borda P. Memoire sur les elections au scrutin, Histoire de l'Academie Royale des Sciences, 1781.
[4] DeConde RP, Hawley S, Falcon S, Clegg N, Knudsen B, Etzioni R. Combining results of microarray experiments: a rank aggregation approach. Statistical Applications in Genetics and Molecular Biology, 5, article 15, 2006.
[5] Dong Y, Zhang H, Gao, Gao AC, Marshall JR, Ip C. Androgen receptor signaling intensity is a key factor in determining the sensitivity of prostate cancer cells to selenium inhibition of growth and cancer-specific biomarkers. Mol Cancer Ther 4(7):1047-1055, 2005.
[6] Duggal P, Winkler CA, An P, Yu XF, Farzadegan H, O'Brien SJ, Beaty TH, Vlahov D. The effect of RANTES chemokine genetic variants on early HIV-1 plasma RNA among African American injection drug users. J. Acquir. Immune Defic. Syndr. 38, 584-589, 2005.
[7] Dwork C, Kumar R, Naor M, Sivakumar D. Rank aggregation revisited. Manuscript, 2001.
[8] Dwork C, Kumar R, Naor M, Sivakumar D. Rank aggregation methods for the web, 2001.
[9] Fagin R, Kumar R, Sivakumar D. Comparing top k lists. In SODA '03, pages 28-36, 2003.
[10] Ignatiuk A, Quickfall JP, Hawrysh AD, Chamberlain MD, Anderson DH. The smaller isoforms of ankyrin 3 bind to the p85 subunit of phosphatidylinositol 3-kinase and enhance platelet-derived growth factor receptor down-regulation. J Biol Chem 281(9):5956-5964, 2006.
[11] Ivanova A, Liao SY, Lerman MI, Ivanov S, Stanbridge EJ. STRA13 expression and subcellular localisation in normal and tumour tissues: implications for use as a diagnostic and differentiation marker. J Med Genet 42(7):556-576, 2005.
[12] Klezovitch O, Chevillet J, Mirosevich J, Roberts RL, Matusik RJ, Vasioukhin V. Hepsin promotes prostate cancer and metastasis. Cancer Cell. 6(2):185-195, 2004.
[13] Kuefer R, Varambally S, Zhou M, Lucas PC, Loeffler M, Wolter H, Mattfeldt T, Hautmann RE. alpha-Methylacyl-CoA racemase: expression levels of this novel cancer biomarker depend on tumor differentiation. Am J Pathol 161(3):841-848, 2002.
[14] Li X, Lu F, et al. Applications of artificial fish school algorithm in combinatorial optimization problems, Journal of Shan Dong University, 34(5):64-67, 2004.
[15] Li X, Qian J. Studies on Artificial Fish Swarm Optimization Algorithm based on Decomposition and Coordination Techniques, Journal of Circuits and Systems, 2003.
[16] Li X, Xue Y, et al. Parameter estimation method based on artificial fish school algorithm, Journal of Shan Dong University, 34(3):84-87, 2004.
[17] Li X, Shao Z, et al. An optimizing method based on autonomous animates: fish swarm algorithm, Systems Engineering Theory and Practice 22:32-38, 2002.
[18] Lin S, Ding J. Integration of Ranked Lists via Cross Entropy Monte Carlo with Applications to mRNA and microRNA Studies, 2010.
[19] Lin S. Rank aggregation methods. Wiley Interdisciplinary Reviews: Computational Statistics, 2:555-570, 2010.
[20] Lin S. Finding cancer genes through meta-analysis of microarray experiments: Rank aggregation via the cross entropy algorithm, 2010.
[21] O'Brien SJ, Nelson GW. Human genes that limit AIDS. Nature Genetics 36, 565-574, 2004.
[22] Pihur V, Datta S, Datta S. Finding common genes in multiple cancer types through meta-analysis of microarray experiments: A rank aggregation approach. Genomics. 92(6):400-403, 2008.
[23] Pizer ES, Pflug BR, Bova GS, Han WF, Udan MS, Nelson JB. Increased fatty acid synthase as a therapeutic target in androgen-independent prostate cancer progression. Prostate 47(2):102-110, 2002.
[24] Rathore A, Chatterjee A, Sivarama P, Yamamoto N, Singhal PK, Dhole TN. Association of RANTES -403 G/A, -28 C/G and In1.1 T/C polymorphism with HIV-1 transmission and progression among North Indians. J. Med. Virol. 80, 1133-1141, 2008.
[25] Varambally S, Yu J, Laxman B, Rhodes DR, Mehra R, Tomlins SA, Shah RB, Chandran U, et al. Integrative genomic and proteomic analysis of prostate cancer reveals signatures of metastatic progression. Cancer Cell. 8(5):393-406, 2005.