AN ARTIFICIAL FISH SWARM BASED SUPERVISED GENE RANK

advertisement
Proceedings of the IASTED International Conference
Computational Intelligence and Bioinformatics (CIB 2011)
November 7 - 9, 2011 Pittsburgh, USA
AN ARTIFICIAL FISH SWARM BASED SUPERVISED GENE RANK
AGGREGATION ALGORITHM FOR INFORMATIVE GENES STUDIES
Nan Du
Computer Science and Engineering Department
State University of New York at Buffalo
nandu@buffalo.edu
Supriya D Mahajan
Department of Medicine
State University of New York at Buffalo
smahajan@buffalo.edu
Bindukumar B Nair
Department of Medicine
State University of New York at Buffalo
bnair@buffalo.edu
Stanley A. Schwartz
Department of Medicine
State University of New York at Buffalo
sasimmun@buffalo.edu
Chiu Bin Hsiao
Department of Medicine
State University of New York at Buffalo
chsiao@buffalo.edu
Aidong Zhang
Computer Science and Engineering Department
State University of New York at Buffalo
azhang@buffalo.edu
ABSTRACT
As the widespread use of high-throughput genomic and
protein analysis, more and more experiments have been
done to identify the informative genes for various diseases,
thus it provides researchers an opportunity to aggregate
across multiple microarray experiments via a rank aggregation approach. However, most of current’s microarray rank
aggregation methods are either unweighted or prespecified
weighted, which has obvious defects. In this paper, We define a new method to weight each ranked list automatically
by considering its distances to the other ranked lists and the
agreement with some priori knowledge. Then the problem
of integrating ranked lists can be formulated as minimizing an objective criterion which is proved to be NP-hard.
Accordingly, we use an Artificial Fish Swarm algorithm
(AFSA) to solve the well-defined minimization problem of
rank aggregation in terms of decision theory. We conduct
two sets of experiments to evaluate the performance of our
methods. The experimental results show that the proposed
approach owns not only the capability of solving optimization problem but also the biological meaning.
ple, discovering genes involved in multiple types of cancers is significant therapeutic importance. Therefore, it is
a challenge to identify the informative genes among thousands of genes.
As the widespread use of high-throughput genomic
and protein analysis, more and more experiments have been
done to identify the informative genes for various diseases.
This is providing researchers an opportunity to aggregate
results across sets of experiments designed to explore the
same biological phenomenon. However, combining information across multiple studies is challenging. In the case of
microarray studies, the use of different technologies means
that not all studies measure expression levels of the same
genes. In addition, technical, biological, and other sources
of variability will generally lead to the Inconsistent measurements of gene expression.
In this paper, we propose an Artificial Fish Swarm
algorithm (AFSA) for the gene rank aggregation problem.
AFSA is a novel method of swarm intelligence for searching the global optimum, which was inspired by the natural
schooling behaviors of fish. The AFSA has been proved
effective in function optimization, parameter estimation,
combinatorial optimization, least squares support vector
machine and geo-technical engineering problems. The remarkable property of this algorithm is that it is capable of
global search in a rather large space, insensitive to initial
values and not easy to stuck in the local optimal solution.
Here we use an AFSA based algorithm to solve the minimization problem of rank aggregation.
In this paper, we briefly discuss the related work on
rank aggregation technologies in Section 2. In Section 3,
we present a novel method which weights each ranked lists
automatically. Besides that, an AFSA which is used to
solve the minimization problem would also be explained
in detail. Finally, the experimental results are illustrated to
show the effectiveness of the proposed method in Section
4.
KEY WORDS
Gene rank aggregation; automated weighted; microarrays; artificial fish swarm algorithm
1. Introduction
With the development of the large scale gene expression
profiling technology, gene expression data analysis and
modeling have become an important topic in the field of
bioinformatics research. Based on the gene expression
data, the diagnosis and treatment of various disease at the
DNA, RNA, or protein levels are important. Informative
genes, which differentially expressed under two sets (e.g.
various diseased and normal), are critical for a reliable diagnosis and simplify the experimental analysis. For exam-
DOI: 10.2316/P.2011.753-019
114
2. Related Work
a novel method for solving the well-defined minimization problem of gene rank aggregation. As far as
we know, there are no any swarm intelligence algorithms that have been used to solve the rank aggregation problem.
Rank aggregation, as a classical problem, is defined as aggregating ranked results of entities from multiple ranking
lists which aims at finding a more robust and reliable one.
The individual ranking list may come from different methods, experiments or voters. This problem has had more
than two hundred years’ history, which was first proposed
by Borda in 1781 [3]. In the case of gene rank aggregation,
the goal is to aggregate measures of expression at the DNA,
RNA, or protein levels [25]. To aggregate more reliable and
robust gene ranked list, several approaches have been proposed. DeConde et al [4] first formulated the rank aggregation problem on identifying informative genes from multiple microarray experiments. This method combined a relatively small number of ranked lists using Markov Chain,
which is partly based on the algorithms of Dwork et al [8]
for spam webpage detection. Lin et al [18, 20] used the
cross entropy Monte Carlo methods to combine the lists of
MicroRNAs targets. Also, Vasyl Pihur et al [22] used a
similar method as Lin et al to make informative genes selection on 20 different cancer related microarray datasets.
By now, most methods either weight each ranked list
evenly which is also called unweighted [4], or give each
ranked list a prespecified weight [20] which is based on
some domain experience (Also include the rank aggregation methods which do not work on gene ranked lists aggregation).
However, both unweighted method and prespecified
weight method have obvious defects. The former method
ignores the fact that some ranked lists are more important than the others. Each ranked list from various experiments, datasets or algorithms should have different reliability, which means they should not be regarded equally.
Although the latter method may be effective in some cases,
the prespecified weights are hard to evaluate. In addition,
if we need to combine hundreds or thousands of experiments’s results, the method of manually weighted is nearly
impossible in practice.
An automated and objective way of weighting the
rank lists is needed. Our goal is to find a way to weight
each ranked list automatically based on some priori knowledge, which could be called supervised gene rank aggregation. The main contributions of this paper are outlined as
follows:
3.
Methods
3.1
Preliminaries
Let SA (k) be the ranked list with k genes, from experiment, dataset or algorithm A, depending on the context of
the problem. Although it is not a necessary condition, for
the sake of simplicity, we assume that all lists as inputs to
the rank aggregation method have the same size k. Hereafter, we refer to such list as top-k list and drop the k in
SA (k) where there is no ambiguity.
Let S = ∪A SA be the union set of elements from all
the top-k lists. Suppose there are total n elements in the set,
indexed from 1 through n without any particular order of
preference. Let τ be any size k ordered subset of S without
duplicate. Then, the goal is to find the ranked list τ ∗ which
has minimal disagreement with all the lists:
τ ∗ = argmin
(
X
wA d(τ, SA ), τ ⊂ S
A
)
,
(1)
where wA is a weight for list A, and d denotes a distance
measure, such as the Spearman’s footrule or the Kendall’s
Tau distance. Such a τ ∗ is referred to as the top-k list of
the aggregate candidate targets set, and y ∗ = Φ(τ ∗ ) is the
optimal (minimum) value of the objective function Φ. For
each i ∈
/ S ∩ SA , let RA (i) = k + 1, where RA (i) is
the rank of gene i in the list A. Besides some definitions
of gene rank aggregation made in the above, we would introduce two methods of computing the distance between
two ranked lists, which are Kendall’s Tau and Spearman’s
Footrule Distance [9].
3.1.1
Kendall’s Tau
Given two permutations L1 , L2 with n items, the Kendall
Tau distance is defined as:
K(L1 , L2 ) =
• We define a new method to weight each ranked
list automatically by considering its distances to the
other ranked lists and the agreement with some priori knowledge. From the consensus assumption, we
know that the list that is more consistent with other
ranked lists tends to generate a more accurate prediction ranking for the genes. Base on this assumption,
the aggregated ranked list would be more reliable.
(2)
where I is the indicator function that is equal to 1 if the
condition inside the parentheses is satisfied, 0 otherwise.
• By simulating the fish’s schooling behavior from the
Artificial Fish Swarm Algorithm (AFSA), we propose
The Spearman’s Footrule distance between two ranked lists
L1 and L2 is defined as:
n X
n
X
I {[RL1 (i) > RL1 (j)][RL2 (i) > RL2 (j)] < 0} ,
i=1 j=1
3.1.2
115
Spearman’s Footrule
F (L1, L2) =
n
X
|RL1 (i) − RL2 (i)| .
Let’s see two toy samples below:
Example 1: There are two top-3 ordered lists L1 , L2
which partially rank 6 genes A, B, C, D, E, F as:
(3)
i=1
3.2
Supervised Ranked List Weighted Method
L1 = [A ≻ B ≻ D]
L2 = [A ≻ E ≻ F ]
As the development of the biological technologies, more
and more genes have already been identified as important
or independent to a certain disease. Then, we try to use
these prior knowledge to weight each ranked list more reasonable than some unweighted or prespecified weighted
methods. One gene list is more reliable than the others if
the top-rated genes of it are more likely to be the truly informative genes. However, we usually don’t know how many
informative genes are there to express one certain disease
nor how these informative genes’ exact ranking. Therefore, it can alternatively be regarded that a gene list is “reliable” if it agrees with the other ranked lists and the already
known prior supervised knowledge as much as possible.
Base on this assumption, we define the weight of each
ranked list A as:
WLA
Note that “A ≻ B” means that gene A is preferred
to gene B, and the operation ≻ owns the properties of antisymmetry, transitivity and linearity. When A, B and D
are identified by some researches or experiments as informative genes, it is obvious that L1 should be more reliable
than L2 . Since L1 has three identified informative genes in
its top-rated list, but L2 just has one.
Example 2: There are two top-6 ordered lists L1 , L2
which rank 6 genes A ≻ B ≻ C ≻ D ≻ E ≻ F as:
L1 = [A ≻ B ≻ C ≻ D ≻ E ≻ F ]
L2 = [C ≻ E ≻ F ≻ D ≻ A ≻ B]
When A and B are identified by some researches or
experiments as informative genes, the average rank of these
two informative genes in L1 is 1.5 and 6.5. Since we neither know how many informative genes are there to express
one certain disease nor how these informative genes’ exact
ranking, the ranked list which rank the informative genes
higher should be more reliable than the one rank them at
the bottom of the top-k genes, which means L1 would be
more reliable than L2 . The situation of the un-informative
genes is just the opposite to the case of handling the identified informative genes, which would not be discussed in
detail here.
Intuitively, one ranked list is more reliable if it has
less disagreement with the other ranked lists. Obviously,
the higher penalty a ranked list gets, the less it would
be weighted, which fits our consensus assumption. The
penalty forP
the disagreement with the other ranked lists, is
defined as i=1&i6=A D(LA , Li ).
P
log2( u∈U RLA (u))
P
P
,
=
log2( u∈I RLA (u)) ∗ i=1&i6=A D(LA , Li )
(4)
where I is the set of the already identified informative genes, U is the set of the already identified as uninformative genes and D is the distance metric between two
lists which could use Kendall’s Tau or Spearman’s Footrule
mentioned above. The above function mainly includes two
parts: the penalty for the disagreement with the already
identified informative genes and the penalty for the disagreement with the other ranked lists. According to our
consensus assumption, it is obvious that when a certain
ranked list is more consistent with most of the other lists,
its weight should be higher. Meanwhile, the penalty for the
disagreement withP
the already identified informative genes
is defined as log2 ( u∈I RLA (u)), which is based on realizing the following three goals:
3.3
• The already identified informative genes should be included in the top-rated items, otherwise the ranked list
would be less credible.
Artificial Fish Swarm Algorithm
Artificial fish swarm algorithm (AFSA) is a novel method
for searching the global optimum proposed by Li et al [15],
which was inspired by the natural fish’s schooling behavior. AFSA presents a strong capability to avoid local optimum in order to achieve global optimization. It has been
proved effective in various fields such as function optimization [17], Parameter estimation [16] and combinatorial optimization [14], etc.
• Since we neither know how many informative genes
are there to express one certain disease nor how
these informative genes’ exact ranking, the ranked list
which rank the informative genes higher should be
more reliable than the one rank them at the bottom
of the top-k genes.
3.3.1
• On the contrary, if the identified un-informative genes
are included in the top-rated genes, the list would be
less credible. In addition, a ranked list which ranks
un-informative genes high would be less reliable than
the one ranks them low in the top-rated genes.
Principle of Artificial Fish Swarm Algorithm
In a natural environment, fish usually stay in the place with
a lot of food, so we simulate the behaviors of fish based on
this characteristic to find the global optimum, which is the
basic idea of the AFSA. According to this phenomenon,
116
AFSA builds some artificial fish, which search an optimal
solution in the solution space by imitating fish swarm behavior. AFSA imitates fish’s three typical behaviors, which
include “Prey Behavior”, “Swarm Behavior”, and “Follow
Behavior”.
more definitions :
Definition 1: D(Xi , Xj ), the distance between two fish is
defined as
D(Xi , Xj ) = |Xi − Xj | + |Xj − Xi | ,
1. Prey: It is a basic biological behavior that the fish try
to find the food. In general, the fish stroll randomly
at the beginning, when the fish discover an area with
stronger food consistency, they will swim quickly toward that area.
which shows that the distance between AFs Xi and Xj is
denoted by the number of different genes between them.
For example, the distance between Xi = (A, B, C, D) and
Xi = (A, C, D, B) is 3.
2. Swarm: The fish will assemble in groups naturally in
the moving process, which is a kind of living habits
in order to guarantee the existence of the colony and
avoid enemies’ invasion.
Definition 2: N (X, V isual), the visual-neighbors of AF
X whose distance with X is in the Visual,
N (X, V isual) = {X ∗ |D(X, X ∗ ) < V isual} ,
3. Follow: When a single fish or several fish find food
(or reach an area with stronger food consistency),
the around fish will trail them and approach the food
quickly.
(6)
where each X ∗ is a neighbor of X.
Definition 3: Xc , the center of AF set X1 , X2 , ..., Xm ,
The detailed process of these AF behaviors for gene
rank aggregation would be discussed as follows.
3.3.2
(5)
m
C(X1 , X2 , ..., Xm ) = ∪m
i=1 ∪j=1i6=j (Xi ∩ Xj ) .
(7)
Prey behavior
Prey is a basic biological behavior adopted by fish
looking for food. It is based on a random search, with a
tendency toward food consistency. There are two steps of
it:
The Artificial Fish Swarm Algorithm for Gene
Rank Aggregation
Obviously, the objective function which we want to minimize is NP-complete when using the Kendall Tau as the
distance measure which was shown by Dwork et al [7].
Since the rank aggregation problem is NP-hard, none of
algorithms are guaranteed to find the optimal solution, and
different algorithms will provide different solutions. Thus,
we hope to solve this problem with a algorithm which owns
the characteristics of swiftly converge, insensitive to initial
values and no easy to stuck in the local optimal solution.
Therefore, we use an Artificial Fish Swarm Algorithm presented below to predict the top-k ranked informative genes.
Suppose that the search space is k-dimensions (top-k
genes ranked list) and m AF form the colony. The current status of the ith AF can be expressed by a vector
X = (G1 , G2 , ..., Gk ), where Gi (1, 2..., S) (S is the set
contains all the elements that are present in each top-k list
which is defined above) is the variable to be searched for
the optimum.
For example, Xi = (2, 4, 1, 5) expressed that the
ith AF (Artificial Fish, which is expressed as a potential
solution) has chosen genes 2, 4, 1 and 5 to be the top-4
informative genes, and 2 ≻ 4 ≻ 1 ≻ 5. The food consistency at present position of the ith AF can be represented
by Yi = f (Xi ), where Yi is the corresponding objective
function (e.g Kendall’s Tau or Spearman’s Footrule);
Visual represents the vision distance; Step is the maximal
step length and δ, 0 < δ < 1, is the congestion degree.
What’s more, Max is the maximum iterations number and
T is the number of prey iteration. In addition, there are
Step1: For a AF Xi , we randomly select a new position
Xj from the visual-neighbors within its visual.
Step2: If Yj < Yi , then we replace Xi with Xj , then Xj
became the new status of the ith AF; otherwise we go back
to Step1.
If an AF has tried over T times and could not find a
better status, then the AF moves to a position randomly.
Swarm behavior
Fish assemble in swarms to easier minimize danger
or search food.
Step1: For an AF Xi , calculate nf , the number of visualneighbors of the ith AF.
Step2: If nf 6= 0 and nf /m < δ, determined the central
position Xc of all around artificial fish. In other words,
when the visual scope of the ith AF is not vacant and the
center site it is not crowded, we calculate the objective
function Yc = f (Xc ). If Yc < Yi , which means the
center site has a greater food consistency than the the fish’s
current position Xi , then replaced Xi with Xc ; otherwise
execute prey behavior. If nf = 0, also execute prey
behavior.
Follow behavior
A fish searches neighboring individuals to follow.
Within a fish’s visual, certain fish will be perceived as
117
ing direction of AF is consistent with the average moving
direction of the around partners which are moving to the
colony extreme, the convergence of colony can be kept. After finishing the process in Figure 1, AF gets to the place
where food consistency is the biggest which is recorded in
the bullet.
Initialize all fishes
Iteration counter n=0
For each AF
Try Follow Behavior
Any
Progress?
Do Follow
Behavior
4.
Experiments
Try Swarm Behavior
Any
Progress?
In this section, three experiments have been made to illustrate and compare the performances between the various
algorithms. The first experiment is made on a contrived
dataset which is used to evaluate AFSA’s optimum searching effectiveness. The last two experiments are two applications which aggregate the results from five gene expression studies of prostate cancer and three gene analysis
results of HIV, respectively. Through these three experiments, we can illustrate that our method owns not only the
capability of solving optimization problem but also the biological meaning.
Do Swarm
Behavior
Try Prey Behavior
Any
Progress?
Do Prey
Behavior
t++, each T or
meet criteria?
Final Solution
4.1
Figure 1. The flow chart of Artificial Fish Swarm Algorithm.
Contrived Dataset
In this experiment, we aim at aggregating three top-40
ranked lists to one aggregated top-40 list. This data was
proposed by Lin et al in 2010 [19]. For the sake of simpleness, the dataset would not be listed here again. It is worth
to note that there are 65 different items in the candidate set
S, which means that there are more than 5 × 1065 potential
aggregated lists. Obviously, it is impossible to use a Bruteforce method to evaluate each potential list. What’s more,
since this data does not include any prior knowledge and
our goal is to prove optimum searching ability of AFSA,
the supervised ranked list weighted method would not be
used here. The detailed description of this data can be
found in Lin’s paper which would not be discussed here.
To contrast with our methods, nine other rank aggregation algorithms are listed from three main categories
(Also from Lin et al): Borda’s method [3], Markov Chain
[8] and Cross Entropy Monte Carlo [20]. For Borda’s
method, the rank bases on the score, and four aggregation functions, arithmetic mean(ARM), median(MED), geometric mean(GEM), and l2norm(L2N) , are used. For
the Markov Chain method the extended MC1, MC2 and
MC3 methods which are based on different strategy have
been used. For Cross Entropy Monte Carlo, it directly minimizes either the Kendall’s distance (CEK) or the Spearman’s distance (CES), respectively. Similarly, our method
is also based on different objective functions respectively,
noted as ASF AK for Kendall’s Tau distance and ASF AS
for Spearman’s distance. All the results of the above mentioned methods are listed in Table 1(Some results were directly referred from [19]).
It can be seen from the experiment result that ASFA
reaches both the minimal objective function value on
Kendall’s Tau measure and Spearman’s measure, respec-
finding a greater amount of food than others, and this fish
will naturally try to follow the best one in order to increase
satisfaction.
Step1: For an AF Xi , select the best status Xj with the
minimum Yj within N (Xi , V isual), which means the j th
AF had the minimum value of objective function among
visual-neighbors of the ith AF.
Step2: Calculate the number of neighbors in the ith AF’s
visual scope, noted as nf . If Yj < Yi and nf /m < δ (the
area around the ith AF is not too crowded), then replaced
Xi with Xj ; otherwise execute prey behavior.
Besides, AFSA sets a bullet which is used to record
the optimal solution and current performance of fish during each iteration. In addition, the algorithm is terminated
when it reaches the maximal iteration number or the best
AF state in the bullet does not change for ǫ ( ǫ is a small
positive tolerance) rounds. The procedure of AFSA algorithm is shown as Figure 1, which arranges the above mentioned behaviors into a certain process.
Obviously, AFSA is an optimization method based on
swarm intelligence theory. The food consistency can be regarded as the objective function’s value, and the status of
an AF is the potential solution to be optimized. Prey behavior is a self-studying process which allows AF moves
randomly to find a stronger food consistency. Both swarm
behavior and follow behavior are processes of AF interaction with surrounding environment. They can ensure the
area around AF would not be too crowded and the mov-
118
Table 1. The Result of Proposed Algorithm and Other
Algorithms
Algorithm
ARM
MEP
GEM
L2N
MC1
MC2
MC3
CEK
CES
ASF AS
ASF AK
Kendall
404
400
404
401
402
402
400
392
400
412
392
ing ranks the second high. Note that, Welsh and Dhana
are given a relative higher weight, because both of their
disagreement and the identified genes’s sum ranking are
low. After aggregation through our method, the aggregated rank list is listed followings: HP N ≻ AM ACR ≻
F ASN ≻ GDF 15 ≻ OACT 2 ≻ N M E2 ≻ U AP 1 ≻
SLC25A6 ≻ N M E1 ≻ KRT 18 ≻ M RP L3 ≻
EEF 2 ≻ ST RA13 ≻ SN D1 ≻ GRP 58 ≻ CAN X ≻
AN K3 ≻ T M EM 4 ≻ CCT 2 ≻ AT F 5 ≻ CY P 1B1 ≻
M T HF D2 ≻ P P IB ≻ LGALS3 ≻ M Y C. Note that
in the experiment of real dataset, we would not compare
with other algorithms since as we mentioned above there
are no automatically weighted method based on the prior
knowledge. Since the weighted methods are different, it is
hard to compare the results.
The statistics information of the identified informative genes in each ranked list are shown in Table 2, where
“Disagreement” denotes the sum disagreement with other
ranked lists (We caculate with Kendall Tau distance here),
“Include #” denotes the number of identified informative
genes included in the top-25 items, “Total” denotes the sum
of rank of the nine identified informative genes (As mentioned before, any element not being ranked by current list
will be seen as ranking k+1, and k is 25 now) and “Ave”
notes the average of the sum of rank of the nine identified informative genes. Our method which minimizes the
weighted Kendall Tau is noted as ASF AK . Seen from Table 2 that the aggregated ranked list from our method has
the minimal disagreement with the other five ranked lists
and includes the maximal quantity of identified informative genes in the top-25 ranking genes. Therefore, this experiment illustrates that the ranked list aggregated through
our method not only has a minimal disagreement with other
lists but also has stronger biological significance.
Spearman
488
450
504
476
480
450
450
450
450
450
450
tively. And the aggregated rank list based on Kendall Tau
is listed followings: 1 ≻ 2 ≻ 3 ≻ 4 ≻ 5 ≻ 6 ≻ 7 ≻
8 ≻ 9 ≻ 10 ≻ 11 ≻ 12 ≻ 13 ≻ 14 ≻ 15 ≻ 16 ≻ 17 ≻
18 ≻ 19 ≻ 20 ≻ 21 ≻ 22 ≻ 23 ≻ 24 ≻ 25 ≻ 26 ≻
27 ≻ 28 ≻ 29 ≻ 30 ≻ 46 ≻ 47 ≻ 48 ≻ 49 ≻ 50 ≻
41 ≻ 31 ≻ 42 ≻ 56 ≻ 57. More specifically, our method’s
result gets 450 Spearman’s distance which is as small as
MED of Borda, MC2 and MC3 from Markov chain, CSK
and CES from Stochastic, and gets 392 Kendall’s Tau’s distance which is the same with the minimal value from CEK.
The experiment result shows AFSA’s optimum searching
effectiveness on different objective functions.
4.2
4.2.1
Real Dataset
Prostate Cancer Dataset
In this experiment, we would evaluate our supervised rank
aggregation method’s effectiveness on five prostate cancer
informative genes’ ranked lists. The detailed dataset which
presents the rankings and top-25 ranked genes which are
from five prostate tumors studies was proposed in DeConde
et al [4]. For the sake of simpleness, the dataset would not
be listed here again.
Nine genes listed by Deconde et al are already identified by biologists as informative genes for the prostate cancer, which are HPN, AMACR, FASN, GUCY1A3, ANK3,
STRA13, CCT2, CANX and TRAP1 [5, 12, 13, 10, 23].
These identified informative genes would be used as prior
knowledge which helps weight each ranked list via our supervised rank aggregation method. Since we do not have
any information about the identified un-informative genes,
we would not discuss about it. As our assumption, the more
the ranked list agree with the prior knowledge and the other
ranked lists, the higher it would be weighted.
Based on our supervised rank aggregation method,
each ranked list is 0.1634, 0.2278, 0.2245, 0.1896 and
0.1947. The Luo is given the lowest weight, since it
has the least common with the other four studies with
the other ranked lists and the identified genes’s sum rank-
Table 2. The Statistics Result of Five Gene Ranked List
for Prostate Cancer and the Aggregated List from the
Proposed Algorithm
Experiment
Luo
Welsh
Dhana
True
Singh
ASF AK
4.2.2
Disagreement
2082
1609
1608
1789
1766
1379
Include #
3
6
5
3
4
7
Total
175
121
130
178
166
124
Ave
19.4
13.4
14.4
19.8
18.4
13.7
HIV Dataset
Our HIV data set has 24 progressors which are separated
into 2 groups which are NP (Normal progressors) and
LTNP (Long Term Normal progressors). There are total 36 candidate genes, which are CD40L, TNF, FASL,
IL-17, TGFb2, CCL2, CCL3, CCL5, CXCL1, CXCL2,
CXCL3, CXCL5, BCL-2, CASP3, CASP5, CASP6,
119
column “ASF AK ”. The aggregated ranked list from our
method has the most common with the other three ranked
lists and includes all the identified informative genes in the
top-30 genes. In Table 4, we could see the proposed method
achieves the best performance by giving the most common
with the other three ranked lists and including all the identified informative genes with relative low average ranking.
CASP7, CASP8, CCR7, CLDN7 COLIAI, MAP2K4,
MAP2K5, TIMP-1, TIMP-2, TIMP-3, TIMP-4, VEGF-A,
TGFb-1, TNF-a, MAP2K7, MMP-1, MMP-2, MMP-10,
MMP-13 and MMP-14. Meanwhile, CCL2, CCL3 and
CCL5 which are all chemokine family have been identified as informative genes among these candidate genes [1,
2, 6, 21, 24].
We have chosen three classical informative genes selection methods, which are SAM (Significant Analysis of
Microarray Data), Student T Test and Wilcoxon signedrank test. Although, all these are simple methods, they are
often used in microarray data analysis and are widely accepted by the biologists. For the sake of simplicity, only the
top-30 genes are kept from these three ranked lists respectively, which are shown in Table 3. The aggregated top-30
list from the combined set S that minimizes the weighted
sum of Kendall’s Tau distances and agrees with the prior
knowledge is listed in Table 3 under the column ASF AK .
Table 4. The Statistics Result of Three Gene Ranked Lists
for HIV and the Aggregated List
Experiment
Lttest
LSAM
LW ilcoxon
ASF AK
Table 3. Three HIV Gene Ranking Lists and Aggregated
List from the Proposed Algorithm
Rank
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Lttest
TIMP-2
TIMP-1
CASP8
TGFb-1
CXCL1
MMP-14
FASL
CCL3
CXCL5
MMP-1
CXCL2
MAP2K4
MAP2K5
CASP5
CASP3
TNF
TGFb2
CCL5
TIMP-4
CLDN7
IL-17
TIMP-3
CASP6
CCR7
CCL2
CD40L
TNF-a
COLIAI
MAP2K7
VEGF-A
LSAM
TIMP-2
TGFb-1
TIMP-1
CASP8
MMP-14
TNF
CCR7
IL-17
MAP2K4
BCL-2
CASP6
CCL2
MMP-1
MMP-13
CCL3
CASP3
MAP2K5
VEGF-A
CXCL3
COLIAI
CD40L
CXCL2
CXCL5
CASP7
CLDN7
MMP-2
CASP5
MAP2K7
CXCL1
TIMP-3
LW ilcoxon
CXCL1
TIMP-1
CCL3
CASP8
MMP-14
TIMP-2
FASL
CXCL5
TGFb-1
MMP-1
CXCL2
CCL5
MAP2K5
TGFb2
CASP3
TIMP-4
IL-17
CCR7
CLDN7
TIMP-3
MAP2K4
CASP6
COLIAI
TNF
CASP5
CCL2
TNF-a
CD40L
MMP-10
MAP2K7
5.
Disagreement
284
480
312
274
Include #
3
2
3
3
Total
51
58
41
50
Ave
17
19.3
13.7
16.6
Conclusion
By now, most rank aggregation methods either weight each
ranked list evenly, or give each ranked list a prespecified
weight. However, both of these two methods have obvious defects. An automatically weighted method based on
some prior knowledge which makes the result more reasonable and reliable is greatly needed. In this paper, we not
only defined a new method to weight each ranked list automatically by considering its distances to the other ranked
list and the agreement with some priori knowledge, but
also propose a novel AFSA which has been applied to several combinatorial optimization successfully based algorithm for solving the well-defined minimization problem
of gene rank aggregation. We demonstrated the performance of the proposed approach by experiments on one
contrived dataset and two real datasets. The experimental
results show that the proposed approach not only has the
capability of solving optimization problem but also the biological meaning.
ASF AK
TIMP-2
TIMP-1
CASP8
TGFb-1
CXCL1
MMP-14
CCL3
FASL
CXCL5
MMP-1
CXCL2
MAP2K4
MAP2K5
CASP3
TNF
CASP5
TGFb2
CCL5
TIMP-4
IL-17
CCR7
CLDN7
TIMP-3
CASP6
CCL2
COLIAI
CD40L
TNF-a
MAP2K7
BCL-2
Acknowledgements
N.D designed the project, conducted the project and wrote
the manuscript. S.D.M and B.B.N modified the manuscript.
S.A.S and C.B.H provided the HIV data. A.Z supervised
the study and aided in manuscript preparation.
References
[1] An P, Nelson GW, Wang L, Donfield S, Goedert JJ,
Phair J, Vlahov D, Buchbinder S, Farrar WL, Modi
W, O’Brien SJ, Winkler CA. Modulating influence
on HIV/AIDS by interacting RANTES gene variants.
Proc. Natl. Acad. Sci. U. S. A. 99, 10002-10007,
2002.
The information of the identified informative genes in
each ranked list is shown in Table 3, where the aggregated
rank list based on Kendall’s Tau distance is listed under the
120
[14] Li X, Lu F, and et al, Applications of artificial fish
school algorithm in combinatorial optimization problems, Journal of Shan Dong University, 34(5):64-67,
2004.
[15] Li X, Qian J, Studies on Artificial Fish Swarm Optimization Algorithm based on Decomposition and Coordination Techniques, Journal of Circuits and Systems, 2003.
[16] Li X, Xue Y, and et al, Parameter estimation method
based on artificial fish school algorithm, Journal of
Shan Dong University,34(3):84-87, 2004.
[17] Li X, Shao Z, and et al, An optimizing method
base on autonomous animates: fish swarm algorithm,
Systems Engineering Theory and Practice 22:32-38,
2002.
[18] Lin S, Ding J. Integration of Ranked Lists via Cross
Entropy Monte Carlo with Applications to mRNA and
microRNA Studies, 2010.
[19] Lin S. Rank aggregation methods. Wiley interdisciplinary Reviews: Computational Statistics, 2:555570, 2010
[20] Lin S. Finding cancer genes through meta-analysis
of microarray experiments: Rank aggregation via the
cross entropy algorithm, 2010.
[21] O’Brien SJ, Nelson GW. Human genes that limit
AIDS. Nature genetics 36, 565-574, 2004.
[22] Pihur V, Datta S, Datta S. Finding common genes in
multiple cancer types through meta-analysis of microarray experiments: A rank aggregation approach .
Genomics. 92(6):400-403, 2008.
[23] Pizer ES, Pflug BR, Bova GS, Han WF, Udan MS,
Nelson JB. Increased fatty acid synthase as a therapeutic target in adrogen-independent prostate cancer
progression. Prostate 47(2):102-110, 20025
[24] Rathore A, Chatterjee A, Sivarama P, Yamamoto N,
Singhal PK, Dhole TN. Association of RANTES -403
G/A, -28 C/G and In1.1 T/C polymorphism with HIV1 transmission and progression among North Indians.
J. Med. Virol. 80, 1133-1141, 2008.
[25] Varambally S, Yu J, Laxman B, Rhodes DR, Mehra
R, Tomlins SA, Shah RB, Chandran U, and et al. Integrative genomic and proteomic analysis of prostate
cancer reveals signatures of metastatic progression.
Cancer Cell. 8(5):393-406, 2005.
[2] Bleiber G, May M, Martinez R, Meylan P, Ott J,
Beckmann JS, Telenti A. Use of a combined ex vivo/in
vivo population approach for screening of human
genes involved in the human immunodeficiency virus
type 1 life cycle for variants influencing disease progression. J. Virol. 79, 12674-12680, 2005.
[3] Borda P. Memoire sur les elections au scrutin, Histoire de l’Academie Royale des Sciences, 1781.
[4] DeConde RP, Hawley S, Falcon S, Clegg N, Knudsen B, Etzioni R. Combining results of microarray
experiments: a rank aggregation approach. Statistical Applications in Genetics and Molecular Biology,
5, article 15. 2006.
[5] Dong Y, Zhang H, Gao, Gao AC, Marshall JR, Ip C.
Androgen receptor signaling intensity is a key factor
in determining the sensitivity of prostate cancer cells
to selenium inhibition of growth and cancer-specific
biomarkers. Mol Cancer Ther 4(7):1047-1055, 2005.
[6] Duggal P, Winkler CA, An P, Yu XF, Farzadegan
H, O’Brien SJ, Beaty TH, Vlahov D. The effect of
RANTES chemokine genetic variants on early HIV-1
plasma RNA among African American injection drug
users. J. Acquir. Immune Defic. Syndr. 38, 584-589,
2005.
[7] Dwork C, Kumar R, Naor M, Sivakumar D. Rank aggregation revisited. Manuscript, 2001.
[8] Dwork C, Kumar R, Naor M, Sivakumar D. Rank aggregation methods for the web, 2001.
[9] Fagin R, Kumar R, Sivakumar D. Comparing top k
lists. In SODA ’03, pages 28-36, 2003.
[10] Ignatiuk A, Quickfall JP, Hawrysh AD, Chamberlain
MD, Anderson DH. The smaller isoforms of ankyrin
3 bind to the p85 subunit of phosphatidylinositol 3kinase and enhance platelet-derived growth factor receptor down-regulation. J Biol Chem 281(9):59565964, 2006.
[11] Ivanova A, Liao SY, Lerman MI, Ivanov S, Stanbridge EJ. STRA13 expression and subcellular localisation in normal and tumour tissues: implications for
use as a diagnostic and differentiaion marker. J Med
Genet 42(7):556-576, 2005.
[12] Klezovitch O, Chevillet J, Mirosevich J, Roberts RL,
Matusik RJ, Vasioukhin V. Hepsin promotes prostate
cancer and metastasis. Cancer Cell. 6(2):185-195,
2004.
[13] Kuefer R, Varambally S, Zhou M, Lucas PC, Loeffler M, Wolter H, Mattfeldt T, Hautmann RE. alphaMethylacyl-CoA racemase: expression levels of this
novel cancer biomarker depend on tumor differentiation. Am J Pathol 161(3):841-848, 2002.
121
Download