
Inferring Multiobjective
Phylogenetic Hypotheses
by Using a Parallel Indicator-Based Evolutionary Algorithm
Sergio Santander-Jiménez* and Miguel A. Vega-Rodríguez
ARCO Research Group. University of Extremadura
*(sesaji@unex.es)
3rd International Conference on the
Theory and Practice of Natural Computing
TPNC 2014
December 9-11, 2014
Granada, Spain
Contents
In this presentation we will see:
• An introduction to phylogenetic inference, a well-known NP-hard problem in bioinformatics.
• Parallel IBEA: a parallel indicator-based proposal designed to perform multiobjective phylogenetic analyses under two optimality criteria:
  o Maximum parsimony.
  o Maximum likelihood.
• Experimental results:
  o Parallel results: speedup and efficiency.
  o Multiobjective and biological results.
• Concluding remarks and future work lines.
AN INTRODUCTION TO PHYLOGENETIC INFERENCE
Phylogenetic Inference
• Phylogenetic inference encompasses a wide range of estimation techniques which aim to describe the natural evolutionary relationships among organisms.
  o Input: a set of N sequences of L characters (sites) which represent molecular characteristics of the organisms under study. This set is defined according to an alphabet 𝛼.
  o Output: a mathematical structure T=(V,E) that represents a hypothesis about the evolution of the species (phylogenetic tree).
• Phylogenetic inference contributes useful knowledge to various fields: evolutionary biology, molecular evolution, physiology, ecology, and paleontology.
An example
5 species, 42 nucleotides each (DNA-based analysis):
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
Optimality criteria
• We can find in the literature several approaches to conduct phylogenetic analyses according to different theories about the way species evolve in nature.
• Maximum parsimony.
  o Ockham’s razor principle.
  o These approaches aim to find those phylogenies that minimize the number of molecular changes needed to explain the observed data.
$$P(T) = \sum_{i=1}^{L} \sum_{(a,b) \in E} C(a_i, b_i), \quad \text{where } C(a_i, b_i) = \begin{cases} 1 & \text{if } a_i \neq b_i, \\ 0 & \text{otherwise.} \end{cases}$$
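To make the parsimony score concrete, the following minimal sketch computes the smallest number of state changes that a rooted binary tree needs to explain an alignment. It assumes Fitch's algorithm for choosing the ancestral states (the slides do not name the scoring routine), and the nested-tuple tree encoding, the function names, and the toy alignment are hypothetical.

```python
# Sketch only: Fitch's algorithm is an assumed way of minimizing P(T);
# the tree encoding (nested tuples) and the toy data are illustrative.
AMBIGUOUS = {"N": set("ACGT")}  # 'N' may stand for any nucleotide


def fitch_site(tree, states):
    """Return (candidate state set, minimum number of changes) for one site."""
    if isinstance(tree, str):  # leaf: tree is the species name
        s = states[tree]
        return AMBIGUOUS.get(s, {s}), 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:  # children can agree: no extra change on this node
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site minimum changes over all L sites."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


toy = {"A": "AAG", "B": "AAC", "C": "CAC", "D": "CAG"}
print(parsimony_score((("A", "B"), ("C", "D")), toy))
print(parsimony_score((("A", "C"), ("B", "D")), toy))
```

Under maximum parsimony, the topology with the lower score would be preferred.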
• Maximum likelihood.
  o Reconstruction of the phylogenetic tree which represents the most likely evolutionary history under the assumptions given by an evolutionary model m.
  o These models give the probabilities of observing mutation events at the molecular level.

$$L[T, m] = \prod_{i=1}^{L} \sum_{x, y \in \alpha} \pi_x \, [P_{xy}(t_{ru}) \, L_p(u_i = y)] \times [P_{xy}(t_{rv}) \, L_p(v_i = y)].$$
Multiobjective Optimization
• These previous approaches only consider a single objective to be optimized.
  o The inference process is carried out in agreement with the chosen criterion.
  o Conflicting phylogenies can be inferred from different criteria.
• This issue can be addressed by using multiobjective optimization.
• Multiobjective approaches aim to infer a set of Pareto solutions that represent a compromise between these different principles by simultaneously optimizing two or more objective functions (i.e. parsimony and likelihood).

optimize F(T) = (f1(T), f2(T)),
where f1(T) = P(T) is minimized and f2(T) = L(T) is maximized.
Computational Complexity
• The inference of ancestor-descendant relationships is a well-known biological problem with NP-hard complexity.
• Modern biological data sets cannot be analyzed by using exhaustive searches, due to the exponential growth of the tree search space.
• In addition, the assessment of solutions involves time-consuming operations whose cost depends on the length of the molecular sequences.
• In order to deal with the additional complexity introduced by multiobjective searches, new approaches based on evolutionary computation and parallelism must be developed.
Species    Number of trees*
5          105
10         34,459,425
12         13,749,310,575
14         7,905,853,580,625
16         6,190,283,353,629,375
18         6,332,659,870,762,850,625
20         8,200,794,532,637,891,559,375
30         4.9518 × 10^38
40         1.00985 × 10^57
50         2.75292 × 10^76
* J. Felsenstein, Inferring Phylogenies.
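The exact entries in this table coincide with the number of rooted bifurcating trees for N species, (2N-3)!! = 3 × 5 × ... × (2N-3). A short sketch that reproduces them (the function name is illustrative):

```python
def rooted_tree_count(n_species: int) -> int:
    """Number of rooted bifurcating trees with n labeled leaves: (2n-3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for odd in range(3, 2 * n_species - 2, 2):
        count *= odd
    return count


for n in (5, 10, 12, 14, 16, 18, 20):
    print(n, f"{rooted_tree_count(n):,}")
```

Python integers have arbitrary precision, so the larger entries (30, 40, 50 species) can be computed exactly in the same way.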
Proposal
• In this work, we aim to solve the phylogenetic inference problem according to the maximum parsimony and maximum likelihood criteria.
• For this purpose, we focus on applying one of the most popular algorithmic design trends in evolutionary multiobjective optimization: indicator-based approaches.
  o Indicator-Based Evolutionary Algorithm (IBEA).
• Due to the complexity of the problem, we propose introducing parallel computing techniques into IBEA, in order to reduce execution times on multicore architectures via OpenMP.
A PARALLEL INDICATOR-BASED EVOLUTIONARY ALGORITHM FOR PHYLOGENETIC INFERENCE
Multiobjective Optimization Terms
• Given a decision space S and an objective space Z = ℝ^n, a multiobjective optimization problem (MOP) consists of finding those solutions s = (s1, s2, ..., sk) ∈ S (defined by k decision variables) which optimize n objective functions f(s) = (f1(s), f2(s), ..., fn(s)) ∈ Z.
• A common way to compare solutions when multiple objectives are involved is the dominance relation:
  Given two solutions s1 and s2 to a MOP, s1 dominates s2 (s1 ≻ s2) iff ∀ i ∈ {1, 2, ..., n}, fi(s1) is not worse than fi(s2), and ∃ i ∈ {1, 2, ..., n} such that fi(s1) is better than fi(s2).
• Those solutions which are non-dominated with regard to the overall decision space compose the Pareto-optimal set, whose representation in the objective space is known as the Pareto front.
• Finding these Pareto-optimal solutions is the main goal of the optimization process.
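The dominance relation can be sketched for the two objectives used in this talk, with each solution reduced to a (parsimony, likelihood) pair in which parsimony is minimized and log-likelihood is maximized. The function names and the scores below are illustrative values, not results from the experiments.

```python
def dominates(a, b):
    """True if a dominates b; a and b are (parsimony, log_likelihood) pairs."""
    not_worse = a[0] <= b[0] and a[1] >= b[1]  # not worse in every objective
    better = a[0] < b[0] or a[1] > b[1]        # strictly better in at least one
    return not_worse and better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical scores: the third tree is worse than the first in both objectives.
scores = [(4874, -21821.11), (4880, -21815.00), (4890, -21830.00)]
print(pareto_front(scores))
```

The first two trees are mutually non-dominated (one has better parsimony, the other better likelihood), so both remain in the front.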
Quality Indicators
• The post-hoc assessment of multiobjective metaheuristics can be carried out by using the concept of quality indicator, a function which maps a Pareto set to a real number that measures its quality.
• Hypervolume metric IH(X).
  o Hypervolume can be defined as the n-dimensional volume of the objective space which is covered by at least one point s ∈ X.
  o For a bi-dimensional MOP, hypervolume returns the area of the objective space weakly dominated by the evaluated outcome.
  o Higher hypervolume values suggest better multiobjective quality.
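For the bi-dimensional case, the hypervolume can be computed with a simple sweep. The sketch below assumes that both objectives have been converted to minimization and that the area is bounded by a reference point `ref`; the function name and the sample points are illustrative.

```python
def hypervolume_2d(points, ref):
    """Area weakly dominated by `points` and bounded by `ref` (minimization)."""
    # Only points strictly better than the reference point contribute.
    candidates = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in candidates:  # ascending f1
        if f2 < best_f2:       # the point adds a new horizontal slab
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


print(hypervolume_2d([(0.0, 1.0), (1.0, 0.0)], ref=(2.0, 2.0)))
print(hypervolume_2d([(0.5, 0.5), (1.0, 1.0)], ref=(2.0, 2.0)))
```

In the second call the point (1.0, 1.0) is dominated by (0.5, 0.5), so it adds nothing to the covered area.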
IBEA I
• The Indicator-Based Evolutionary Algorithm (IBEA) is a population-based algorithm proposed by Zitzler and Künzli (2004).
  o Main idea: integrate the computation of quality indicators into the algorithm for fitness measurement purposes, in order to guide the search towards high-quality Pareto fronts.
  o Therefore, the optimization goal is to obtain the best Pareto set according to the considered quality indicator.
  o In this work we consider a hypervolume-based quality indicator named IHD. Given two sets of Pareto solutions R and S, we can compute IHD as:

$$I_{HD}(R, S) = \begin{cases} I_H(S) - I_H(R) & \text{if } \forall s' \in S, \ \exists s \in R : s \succ s', \\ I_H(R \cup S) - I_H(R) & \text{otherwise.} \end{cases}$$

• IHD(R,S) represents the space dominated by S but not by R.
• This definition can be applied to compare two solutions si and sj, by considering R={si} and S={sj}.
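For two single solutions the indicator reduces to simple rectangle areas. The sketch below assumes two objectives that are both minimized and a reference point such as Z = (2, 2); the helper names are hypothetical.

```python
def dominates(a, b):
    """Dominance for two minimized objectives."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def hv_point(p, ref):
    """Hypervolume of a single point: area of the rectangle up to `ref`."""
    return max(ref[0] - p[0], 0.0) * max(ref[1] - p[1], 0.0)


def ihd(r, s, ref):
    """I_HD({r}, {s}) for two points, following the two-case definition."""
    if dominates(r, s):
        return hv_point(s, ref) - hv_point(r, ref)
    # I_H({r} U {s}) - I_H({r}) equals the area of s minus its overlap with r.
    overlap = hv_point((max(r[0], s[0]), max(r[1], s[1])), ref)
    return hv_point(s, ref) - overlap


ref = (2.0, 2.0)
print(ihd((0.0, 0.0), (1.0, 1.0), ref))  # r dominates s: negative value
print(ihd((1.0, 1.0), (0.0, 0.0), ref))  # space covered by s but not by r
```

A negative value signals that the second solution is dominated by the first, which is what lets the indicator rank individuals.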
IBEA II
• IBEA pseudocode:

Initialize Population (P)
While (stop criterion not reached (maxEvaluations)) do
    Calculate IHD values (for each individual in P)
    Assign fitness values (to each individual in P)
    While P.size > popSize do
        Remove the individual with the smallest fitness
        Update fitness (for each individual in P)
    End while
    For i = 1 to popSize do
        P’i.m = Apply Genetic Operators
        P’i.T = Infer phylogenetic tree (P’i.m)
        P’i.scores = Evaluate solution (P’i.T)
    End for
    P = P U P’, ParetoFront = updateParetoSet (P)
End while

• Input parameters:
  o popSize: number of individuals in the population.
  o maxEvaluations: maximum number of evaluations.
  o crossoverProb: crossover probability.
  o mutationProb: mutation probability.
  o k: scaling factor used in fitness computations.
  o Z: IHD reference point.
Individual Representation
In order to adapt IBEA to phylogenetics, we employ a methodology based on the concept of distance matrix:
• A solution is represented by means of a symmetric N×N matrix (where N is the number of species in the input alignment).
• Each entry m[i,j] defines the genetic distance between species i and j.
• These matrices are generated and processed throughout the execution of the algorithm by means of distance-based evolutionary operators.
• A tree-building method (BIONJ) is used to infer the topologies associated with the processed matrices.
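As an illustration of this representation, the sketch below builds a symmetric N×N matrix from aligned sequences. It assumes the uncorrected p-distance (the proportion of differing sites) purely as an example of a genetic distance; the slides do not state which distance is used to build the initial matrices, and the function name and toy sequences are hypothetical.

```python
def p_distance_matrix(sequences):
    """Symmetric N x N matrix of pairwise proportions of differing sites."""
    n = len(sequences)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = sum(a != b for a, b in zip(sequences[i], sequences[j]))
            # Fill both halves so that m[i][j] == m[j][i] by construction.
            m[i][j] = m[j][i] = diffs / len(sequences[i])
    return m


for row in p_distance_matrix(["AAGC", "AAGT", "CCGT"]):
    print(row)
```

A matrix of this shape is what a distance-based tree-building method such as BIONJ takes as input.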

Fitness Assignment and Environmental Selection
• First step: the current state of P is examined by ranking each individual according to its usefulness with respect to the considered quality indicator.
• The fitness assignment for an individual Pi is carried out as follows:
  o Normalize its objective function scores to the interval [0, 1] and compute IHD values.
  o Pi.Fitness is calculated by summing up its IHD-based terms with regard to each remaining individual Pj:

$$P_i.\mathit{Fitness} = \sum_{P_j \in P \setminus \{P_i\}} -e^{-I_{HD}(\{P_j\}, \{P_i\}) / (c \cdot k)}$$

  In this equation, c refers to the maximum absolute indicator value, which is included to avoid widely spread indicator scores.
• By using these fitness values, an environmental selection is performed in a second step to keep the most promising popSize individuals.
  o This mechanism is implemented by iteratively removing the individual Pworst with the smallest fitness value from P until the size of the population fits the parameter popSize.
  o The fitness values of the remaining individuals are updated:

$$P_i.\mathit{Fitness} = P_i.\mathit{Fitness} + e^{-I_{HD}(\{P_{worst}\}, \{P_i\}) / (c \cdot k)}$$
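The two steps above can be sketched as follows. The code assumes that every individual is already reduced to a pair of normalized, minimized objective values, uses Z = (2, 2) and k = 0.05 as defaults, and relies on hypothetical helper names; it is a sketch of the scheme, not the authors' implementation.

```python
import math


def hv_point(p, ref):
    return max(ref[0] - p[0], 0.0) * max(ref[1] - p[1], 0.0)


def ihd(r, s, ref):
    """I_HD({r}, {s}) for two points with minimized objectives."""
    if r[0] <= s[0] and r[1] <= s[1] and (r[0] < s[0] or r[1] < s[1]):
        return hv_point(s, ref) - hv_point(r, ref)
    overlap = hv_point((max(r[0], s[0]), max(r[1], s[1])), ref)
    return hv_point(s, ref) - overlap


def environmental_selection(points, pop_size, k=0.05, ref=(2.0, 2.0)):
    n = len(points)
    # ind[j][i] holds I_HD({Pj}, {Pi}).
    ind = [[ihd(points[j], points[i], ref) if i != j else 0.0
            for i in range(n)] for j in range(n)]
    c = max(abs(v) for row in ind for v in row) or 1.0  # max absolute value
    fitness = [sum(-math.exp(-ind[j][i] / (c * k)) for j in range(n) if j != i)
               for i in range(n)]
    alive = list(range(n))
    while len(alive) > pop_size:
        worst = min(alive, key=lambda i: fitness[i])
        alive.remove(worst)
        for i in alive:  # undo the contribution of the removed individual
            fitness[i] += math.exp(-ind[worst][i] / (c * k))
    return [points[i] for i in alive]


population = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5), (0.9, 0.9)]
print(environmental_selection(population, pop_size=3))
```

In this toy population, (0.9, 0.9) is dominated by (0.5, 0.5), so it receives by far the smallest fitness and is the individual removed.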
Generating offspring: Evolutionary Operators
• Parent selection: binary tournament, based on IBEA fitness values.
• Crossover: uniform crossover based on the swapping of randomly selected rows from the parent matrices, along with a BLX-alpha repair operator to ensure symmetry in the resulting matrix.
• Mutation: modification of randomly selected entries in accordance with the gamma distribution observed in genetic distances.
• The resulting matrices are mapped to the phylogenetic tree space via BIONJ, topologically optimized, and evaluated according to parsimony and likelihood.
• The offspring individuals are integrated into the population and a new generation takes place.
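A possible reading of the crossover step is sketched below. The slides do not detail how BLX-alpha is applied during the repair, so this code assumes that each mismatched pair m[i][j], m[j][i] is replaced by one value drawn from the BLX-alpha interval around the two entries and clamped at zero; the function name, the 50% row-choice probability, and alpha = 0.5 are all assumptions.

```python
import random


def row_swap_crossover(parent_a, parent_b, alpha=0.5, rng=random):
    """Uniform row crossover followed by an assumed BLX-alpha symmetry repair."""
    n = len(parent_a)
    # Each row of the child is copied from one of the two parents at random.
    child = [list(parent_a[i] if rng.random() < 0.5 else parent_b[i])
             for i in range(n)]
    for i in range(n):
        child[i][i] = 0.0  # the distance from a species to itself is zero
        for j in range(i + 1, n):
            lo, hi = sorted((child[i][j], child[j][i]))
            spread = alpha * (hi - lo)
            # Draw one value from [lo - spread, hi + spread]; keep it >= 0.
            value = max(0.0, rng.uniform(lo - spread, hi + spread))
            child[i][j] = child[j][i] = value
    return child


a = [[0.0, 0.2, 0.4], [0.2, 0.0, 0.6], [0.4, 0.6, 0.0]]
b = [[0.0, 0.3, 0.5], [0.3, 0.0, 0.1], [0.5, 0.1, 0.0]]
for row in row_swap_crossover(a, b, rng=random.Random(7)):
    print(row)
```

Whatever rows are inherited, the repaired child is symmetric again and can be passed to the tree-building method.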
Parallel IBEA I
• According to the profile of the application, the most time-demanding operations in this algorithm are the ones included in the offspring computation loop (calls to the tree-building method and evaluations of parsimony and likelihood).
• The IHD computation loop also represents a meaningful source of complexity in comparison with traditional dominance-based fitness schemes.
• As there are no dependencies between different iterations in these loops, we can design a parallel version of IBEA for multicore machines.
• Our OpenMP-based parallel proposal defines a parallel region (#pragma omp parallel) which encloses the main loop of the algorithm.
  o Those operations which show data dependencies (e.g. environmental selection) are executed by using #pragma omp single directives.
  o The tasks in the IHD and offspring computation loops are distributed among execution threads, using #pragma omp for with a guided scheduling policy (schedule(guided)) to deal with load imbalances.
• This parallel scheme aims to minimize the overhead associated with the repeated creation and destruction of threads that occurs when using #pragma omp parallel for directives inside the main loop.
Parallel IBEA II

#pragma omp parallel num_threads(numThreads)
    Initialize Population (P)
    While (stop criterion not reached (maxEvaluations)) do
        /* Parallel computation of IHD values */
        #pragma omp for schedule(guided)
        Calculate IHD values (for each individual in P)

        /* Data structure management and operations with data dependencies */
        #pragma omp single
        Assign fitness values (to each individual in P)
        While P.size > popSize do
            Remove the individual with the smallest fitness
            Update fitness (for each individual in P)
        End while

        /* Parallel computation of offspring individuals */
        #pragma omp for schedule(guided)
        For i = 1 to popSize do
            P’i.m = Apply Genetic Operators
            P’i.T = Infer phylogenetic tree (P’i.m)
            P’i.scores = Evaluate solution (P’i.T)
        End for

        /* Data structure management */
        #pragma omp single
        P = P U P’, ParetoFront = updateParetoSet (P)
    End while
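The structure of this scheme (workers created once, serial steps interleaved with parallel loops) can be mirrored outside OpenMP. The following Python analogue is only a structural sketch: it uses a thread pool from the standard library instead of OpenMP directives, the function and parameter names are hypothetical, and CPython threads would not reproduce the reported speedups for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_ibea_skeleton(population, pop_size, generations,
                           breed_and_evaluate, select, workers=4):
    """Reuse one pool of workers across all generations of the main loop."""
    # The pool is created once, outside the loop (cf. one parallel region).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Serial step with data dependencies (cf. "omp single").
            population = select(population, pop_size)
            # Independent iterations spread over the workers (cf. "omp for").
            tasks = [pool.submit(breed_and_evaluate, population, i)
                     for i in range(pop_size)]
            offspring = [t.result() for t in tasks]
            # Serial merge of parents and offspring.
            population = population + offspring
    return population


# Toy stand-ins: individuals are integers and larger values are "fitter".
result = parallel_ibea_skeleton(
    [0, 1, 2], pop_size=3, generations=2,
    breed_and_evaluate=lambda pop, i: pop[i] + 1,
    select=lambda pop, n: sorted(pop, reverse=True)[:n],
)
print(result)
```

Creating the pool inside the loop instead would pay the start-up cost in every generation, which is the overhead the single enclosing parallel region is meant to avoid.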
EXPERIMENTAL METHODOLOGY AND RESULTS
Experimental Methodology
• The performance achieved by IBEA is evaluated in terms of speedup factors, efficiencies, and biological quality. For this purpose, we have performed experiments over four real nucleotide data sets:
  o rbcL_55: 55 sequences (1314 nucleotides per sequence) of the rbcL gene from different species of green plants.
  o mtDNA_186: 186 sequences (16608 nucleotides) of human mitochondrial DNA.
  o RDPII_218: 218 sequences (4182 nucleotides) of prokaryotic RNA.
  o ZILLA_500: 500 sequences (759 nucleotides) from the rbcL gene.
• HW: 2 AMD Opteron Magny-Cours 6174 processors (24 cores in total) at 2.2 GHz and 32 GB of DDR3 RAM, under Scientific Linux 6.1. The software was compiled with GCC 4.4.5, and the GOMP_CPU_AFFINITY environment variable was set to ensure CPU-thread affinity.
• Input parameter configuration: maxEvaluations = 10000, popSize = 96, crossoverProb = 70%, mutationProb = 5%, k = 0.05, Z = (2,2).
Parallel Results I
• Parallel scalability of IBEA under configurations of 4, 8, 16, and 24 OpenMP threads.
• Comparison with a POSIX-based multicore implementation of RAxML (maximum likelihood).
• 11 independent runs per dataset and system configuration.

Serial times (sec):
Dataset      IBEA
rbcL_55      5367.60
mtDNA_186    47630.98
RDPII_218    51657.38
ZILLA_500    71754.79

Speedup (SU) and efficiency (Eff., %):
                          4 cores          8 cores          16 cores         24 cores
Dataset      Algorithm    SU     Eff.      SU     Eff.      SU      Eff.     SU      Eff.
rbcL_55      IBEA         3.66   91.56     6.95   86.87     12.32   77.01    16.83   70.14
rbcL_55      RAxML        3.68   91.94     6.26   78.23     8.33    52.04    8.77    36.56
mtDNA_186    IBEA         3.86   95.83     7.21   90.12     12.90   80.60    17.56   73.17
mtDNA_186    RAxML        3.96   99.12     7.24   90.47     10.39   64.93    12.89   53.70
RDPII_218    IBEA         3.86   96.44     7.30   91.21     13.37   83.56    18.01   75.03
RDPII_218    RAxML        3.52   88.06     6.54   81.72     9.31    58.19    11.35   47.27
ZILLA_500    IBEA         3.87   96.73     7.68   96.05     14.57   91.08    20.73   86.36
ZILLA_500    RAxML        3.73   93.33     5.99   74.89     7.41    46.33    7.72    32.17
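For reference, the two metrics in this table follow the usual definitions: speedup is the serial time divided by the parallel time, and efficiency is the speedup divided by the number of cores. A small sketch with illustrative function names:

```python
def speedup(serial_time, parallel_time):
    """Speedup: how many times faster the parallel run is."""
    return serial_time / parallel_time


def efficiency(speedup_value, cores):
    """Efficiency in percent: speedup relative to the ideal value (cores)."""
    return 100.0 * speedup_value / cores


# IBEA on ZILLA_500 with 16 cores, using the rounded speedup from the table.
print(round(efficiency(14.57, 16), 2))
# Parallel time implied by the serial time and that speedup (seconds).
print(round(71754.79 / 14.57, 1))
```

The efficiency obtained from the rounded speedup (about 91.06%) differs slightly from the tabulated 91.08%, which is consistent with the table having been computed from unrounded speedups.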
Parallel Results II
• According to these results, our speedup factors improve as we increase the number of species in the input dataset.
  o Efficiencies for 24 cores: rbcL_55 (70.14%) vs ZILLA_500 (86.36%).
• The implications of Amdahl’s law for multicore machines can be used to discuss these results.
  o As the number of species grows, the generation and evaluation of offspring solutions involve more computations over larger matrix and tree data structures.
  o As these operations take place inside the loops parallelized by #pragma omp for directives, an increase in the parallelizable fraction of this application is expected, leading to better parallel results.
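Amdahl's law, S(n) = 1 / ((1 - p) + p / n), can be inverted to estimate the parallelizable fraction p implied by a measured speedup. The sketch below applies it to the 24-core IBEA speedups for rbcL_55 (16.83) and ZILLA_500 (20.73); since the law ignores synchronization and load-balancing overheads, the resulting fractions are rough estimates only, and the function names are illustrative.

```python
def amdahl_speedup(p, n):
    """Ideal speedup on n cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(s, n):
    """Fraction p that would yield speedup s on n cores under Amdahl's law."""
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n)


print(round(parallel_fraction(16.83, 24), 3))  # rbcL_55, roughly 0.98
print(round(parallel_fraction(20.73, 24), 3))  # ZILLA_500, roughly 0.99
```

Under these assumptions, a seemingly small rise in the parallel fraction between the smallest and the largest dataset is enough to account for the gap in 24-core efficiency.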
Parallel Results III
• Comparisons with PhyloMOEA, a multiobjective genetic algorithm proposed by Cancino et al. These authors developed two parallel versions of their software:
  o Pure MPI-based master-worker scheme.
  o Hybrid MPI-OpenMP system, based on fine-grained parallelism to reduce the time required by likelihood computations.

Speedup comparisons (16 cores):
Dataset      IBEA     PhyloMOEA MPI    PhyloMOEA MPI-OpenMP
rbcL_55      12.32    7.30             8.30
mtDNA_186    12.90    7.40             8.50
RDPII_218    13.37    9.80             10.20
ZILLA_500    14.57    6.70             6.30

• In accordance with this comparison, IBEA significantly improves on the results published for both PhyloMOEA versions in all the considered data sets, showing a proper exploitation of hardware resources.
Multiobjective Results
• Multiobjective assessment of phylogenetic results (using hypervolume).
  o Comparison with NSGA-II, a well-known dominance-based multiobjective metaheuristic.
  o Statistical methodology to compare results: Kolmogorov-Smirnov, Levene, Wilcoxon-Mann-Whitney, and ANOVA tests.
• Hypervolume results for 31 independent runs of IBEA and NSGA-II under the evolutionary model GTR.

             IBEA               NSGA-II            Stat. tests
Dataset      Median    IQR      Median    IQR      Significant diff.?
rbcL_55      71.31%    0.07     71.01%    0.22     Yes
mtDNA_186    69.81%    0.11     69.69%    0.09     Yes
RDPII_218    74.24%    0.08     73.58%    0.06     Yes
ZILLA_500    72.32%    0.04     71.77%    0.03     Yes

• Our experiments suggest that IBEA achieves a statistically significant improvement over NSGA-II in all the considered data sets.
• The introduction of quality indicators as a way to guide the inference process leads to improved multiobjective performance.
Phylogenetic Results I
• Biological assessment of phylogenetic results.
  o Comparisons with single-criterion methods:
    Maximum parsimony: TNT.
    Maximum likelihood: RAxML.
  o Comparisons of maximum parsimony trees by using the Kishino-Hasegawa-Templeton (KHT) test (PHYLIP tools).
  o Comparisons of maximum likelihood trees by using the Shimodaira-Hasegawa (SH) test (CONSEL tools).

Comparison with TNT (best parsimony trees)
Dataset      Parsimony difference    Standard deviation    KHT test output
rbcL_55      0                       8.95                  No stat. significant diff.
mtDNA_186    0                       4.90                  No stat. significant diff.
RDPII_218    31                      51.53                 No stat. significant diff.
ZILLA_500    0                       12.81                 No stat. significant diff.

Comparison with RAxML (best likelihood trees)
Dataset      IBEA P-value    RAxML P-value    SH test output
rbcL_55      0.621           0.379            No stat. significant diff.
mtDNA_186    0.380           0.620            No stat. significant diff.
RDPII_218    0.592           0.408            No stat. significant diff.
ZILLA_500    0.324           0.676            No stat. significant diff.
Phylogenetic Results II
• Comparison of biological results (HKY85 model) with PhyloMOEA.
  o IBEA: 31 additional runs per dataset under this evolutionary model.

             IBEA                                  PhyloMOEA
Dataset      Best parsimony    Best likelihood     Best parsimony    Best likelihood
rbcL_55      4874              -21821.11           4874              -21889.84
mtDNA_186    2431              -39888.07           2437              -39896.44
RDPII_218    41517             -134260.26          41534             -134696.53
ZILLA_500    16218             -80974.93           16219             -81018.06

These comparisons point out the relevance of applying this indicator-based parallel approach, which gives significant results not only from a multiobjective perspective, but also with regard to biological criteria.
CONCLUSION
Concluding Remarks
• In this work, we have applied an indicator-based multiobjective metaheuristic to tackle phylogenetic inference as a MOP.
• We have introduced a parallel design which aims to reduce the time required to perform real phylogenetic analyses on multicore machines.
• Experiments over four nucleotide data sets have pointed out a successful exploitation of a 24-core shared-memory architecture, showing improved scalability with regard to other parallel phylogenetic methods from the literature.
• In addition, statistically reliable comparisons with NSGA-II and single-criterion biological approaches suggest that the introduction of quality indicators in multiobjective searches allows IBEA to infer high-quality Pareto sets from both multiobjective and biological perspectives.
• Future work lines:
  o Development of new algorithmic designs which combine swarm intelligence and quality indicators.
  o Fine-grained parallel approaches based on OpenCL to take advantage of heterogeneous CPU-GPU systems.
  o Comparisons of different multiobjective metaheuristics to find out which proposal leads to better performance from both multiobjective and biological points of view.