Inferring Multiobjective Phylogenetic Hypotheses by Using a Parallel IndicatorBased Evolutionary Algorithm Sergio Santander-Jiménez* and Miguel A. Vega-Rodríguez ARCO Research Group. University of Extremadura *(sesaji@unex.es) 3rd International Conference on the Theory and Practice of Natural Computing TPNC 2014 December 9-11, 2014 Granada, Spain Contents In this presentation we will see: An introduction to Phylogenetic Inference, a well-known NP-Hard problem in Bioinformatics. Parallel IBEA: a parallel indicator-based proposal designed to perform multiobjective phylogenetic analyses under two optimality criteria: Maximum parsimony. Maximum likelihood. Experimental results. Parallel results: speedup and efficiency. Multiobjective and biological results. Concluding remarks and future work lines. AN INTRODUCTION TO PHYLOGENETIC INFERENCE Phylogenetic Inference Phylogenetic inference encloses a wide range of estimation techniques which aim to describe natural evolutionary relationships among organisms. Input: a set of N sequences of L characters (sites) which represent molecular characteristics of the organisms under study. This set is defined according to an alphabet 𝛼. Output: a mathematical structure T=(V,E) that represents a hypothesis about the evolution of species (Phylogenetic Tree). Phylogenetic inference contributes significantly useful knowledge in various fields: evolutionary biology, molecular evolution, physiology, ecology, and paleontology An example 5 species 42 nucleotides (DNA-based analysis) AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA Optimality criteria We can find in the literature several approaches to conduct phylogenetic analyses according to different theories about the way species evolve in nature. Maximum parsimony. o Ockham’s razor principle. o These approaches aim to find those phylogenies that minimize the amount of molecular changes needed to explain the observed data. L P(T ) C (ai , bi ), where C (ai , bi ) i 1 ( a ,b )E 1 if ai bi , 0 otherwise. Maximum likelihood. o Reconstruction of that phylogenetic tree which represents the most likely evolutionary history under the assumptions given by an evolutionary model m. o These models give the probabilities of observing mutation events at molecular level. L L[T , m] [ P i 1 x , y x xy (tru ) Lp (ui y)] [ Pxy (trv ) Lp (vi y)]. Multiobjective Optimization These previous approaches only consider a single objective to be optimized. The inference process is carried out in agreement with the chosen criterion. Conflicting phylogenies can be inferred from different criteria. This issue can be solved by multiobjective optimization. using Multiobjective approaches aim to infer a set of Pareto solutions that represent a compromise between these different principles by optimizing simultaneously two or more objective functions (i.e. parsimony and likelihood). optimize F(T) = (f1(T), f2(T)), where f1(T) = minimize P(T), f2(T) = maximize L(T). Computational Complexity The inference of ancestor-descendant relationships is a well-known biological problem with NP-hard complexity. Modern biological data sets cannot be analyzed by using exhaustive searches, due to the exponential growth of the tree search space. In addition, the assessment of solutions involves time-consuming operations which depend on the length of molecular sequences. In order to deal with the additional complexity introduced by multiobjective searches, the development of new approaches based on evolutionary computation and parallelism must be undertaken. Species Number of trees* 5 105 10 34,459,425 12 14 13,749,310,575 7,905,853,580,625 16 6,190,283,353,629, 375 6,332,659,870,762, 850,625 8,200,794,532,637, 891,559,375 4.9518 X 1038 1.00985 X 1057 2.75292 X 1076 18 20 30 40 50 * J. Felsenstein – Inferring Phylogenies Proposal In this work, we aim to solve the phylogenetic inference problem according to the maximum parsimony and maximum likelihood criteria. For this purpose, we will focus on applying one of the most popular algorithmic design trends in evolutionary multiobjective optimization: indicator-based approaches. Indicator-Based Evolutionary Algorithm (IBEA). Due to the complexity of the problem, we propose the introduction of parallel computing techniques into IBEA, in order to reduce execution times on multicore architectures via OpenMP. A PARALLEL INDICATOR BASED EVOLUTIONARY ALGORITHM FOR PHYLOGENETIC INFERENCE Multiobjective Optimization Terms Given a decision space S and an objective space Z = ℜn, a multiobjective optimization problem (MOP) consists of finding those solutions s = (s1, s2, ..., sk) ∈ S (defined by k decision variables) which optimize n objective functions ~ f(s) = (f1(s), f2(s), ..., fn(s)) ∈ Z. A common way to compare solutions in a context where multiple objectives are involved is the application of the dominance relation: Given two solutions s1 and s2 to a MOP, s1 dominates (≻) s2 iff ∀ i ∈ [1, 2...n], fi(s1) is not worse than fi(s2) and ∃ i ∈ [1, 2...n], fi(s1) is better than fi(s2). Those solutions which are non-dominated with regard to the overall decision space compose the Pareto-optimal set, whose representation in the objective space is known as Pareto front. Finding these Pareto-optimal solutions represents the main goal of the optimization process. Quality Indicators The post-hoc assessment of multiobjective metaheuristics can be carried out by using the concept of quality indicator, a function which maps a Pareto set to a real number for measuring its quality. Hypervolume metrics IH(X). Hypervolume can be defined as the n-dimensional volume of the objective space which is covered by at least one point s ∈ X. For a bi-dimensional MOP, hypervolume returns the area of the objective space weakly-dominated by the evaluated outcome. Higher hypervolume values suggest better multiobjective quality IBEA I The Indicator-Based Evolutionary Algorithm (IBEA) is a population-based algorithm proposed by Zitzler and Künzli (2006). Main idea: integrate the computation of quality indicators into the algorithm for fitness measurement purposes, in order to guide the search for high-quality Pareto fronts. Therefore, the optimization goal is to obtain the best Pareto set according to the considered quality indicator. In this work we will consider a hypervolume-based quality indicator named as IHD. Given two sets of Pareto solutions R and S, we can compute IHD as: I H ( S ) I H ( R) if s S, s R : s s I HD ( R, S ) otherwise. I H ( R S ) I H ( R) IHD (R,S) represents the space dominated by S but not by R. This definition can be applied to compare two solutions si and sj, by considering R={si} and S={sj}. IBEA II IBEA Pseudocode: Initialize Population (P) While (!stop criterion reached (maxEvaluations) do • Calculate IHD values (for each individual in P) • Assign fitness values (to each individual in P) • While P.size > popSize Remove the individual with smallest fitness Update fitness (for each individual in P) • End while • For i=0 to popSize P’i.m = Apply Genetic Operators P’i.T = Infer phylogenetic tree (P’i.m) P’i.scores = Evaluate solution (P’i.T) • End for • P = P U P’, ParetoFront = updateParetoSet(P) End while Input Parameters: popSize: number of individuals in the population. maxEvaluations: maximum number of evaluations. crossoverProb: crossover probability. mutationProb: mutation probability. k: scaling factor used in fitness computations. Z: IHD reference point. Individual Representation In order to adapt IBEA to phylogenetics, we will employ a methodology based on the concept of distance matrix: A solution will be represented by means of symmetric NxN matrix (where N is the number of species in the input alignment). Each entry m[i,j] defines the genetic distance between species i and j. These matrices will be generated and processed throughout the execution of the algorithm by means of distance-based evolutionary operators. A tree-building method (BIONJ) will be used to infer the topologies associated to the processed matrices. Fitness Assignment and Environmental Selection First step: the current state of P is examined by ranking each individual according to how useful it is attending to the considered quality indicator. The fitness assignment for an individual Pi is carried out as follows: Normalize its objective function scores to the interval [0, 1] and compute IHD values. Pi.Fitness will be calculated by summing up its IHD values with regard to each remaining individual Pj: I HD ({ Pj },{ Pi }) / ck Pi .Fitness e Pj P \{ Pi } In this equation, c refers to the maximum absolute indicator value, which is included to avoid widely spread indicator scores. By using these fitness values, an environmental selection is performed in a second step to keep the most promising popSize individuals. This mechanism is implemented by removing iteratively the individual Pworst with the smallest fitness value from P until the size of the population fits the parameter popSize. The fitness values of the remaining individuals is updated: Pi .Fitness Pi .Fitness e I HD ({ Pworst},{ Pi }) / ck Generating offspring: Evolutionary Operators Parent Selection: binary tournament, based on IBEA fitness values. Crossover: uniform crossover based on the swapping of randomly selecting rows from the parent matrices, along with a repair operator BLX-alpha to ensure symmetry in the resulting matrix. Mutation: modification of randomly selected entries in accordance with the gamma distribution observed in genetic distances. The resulting matrices are mapped to the phylogenetic tree space via BIONJ, topologically optimized, and evaluated according to parsimony and likelihood. The offspring individuals are integrated into the population and a new generation takes place. Parallel IBEA I According to the profile of the application, the most time-demanding operations in this algorithm are the ones included in the offspring computation loop (calls to the tree-building method and evaluations of parsimony and likelihood). The IHD computation loop also represents a meaningful source of complexity in comparison with traditional dominance-based fitness schemes. As there are no dependencies between different iterations in these loops, we can design a parallel version of IBEA for multicore machines. Our OpenMP-based parallel proposal implies the definition of a parallel region (#pragma omp parallel) which encloses the main loop of the algorithm. Those operations which show data dependencies (i.e. environmental selection) will be executed by using #pragma omp single directives. The tasks in the IHD and offspring computation loops will be distributed among execution threads, using #pragma omp for with a scheduling policy = guided to deal with load imbalances. This parallel scheme aims to minimize the overhead issues associated to the continuous creation / liberation of threads involved when using #pragma omp parallel for directives inside the main loop. Parallel IBEA II #pragma omp parallel (num threads) Initialize Population (P) While (!stop criterion reached (maxEvaluations) do #pragma omp for schedule (guided) • Calculate IHD values (for each individual in P) #pragma omp single • Assign fitness values (to each individual in P) • While P.size > popSize • • Remove the individual with smallest fitness Update fitness (for each individual in P) End while #pragma omp for schedule (guided) For i=0 to popSize P’i.m = Apply Genetic Operators P’i.T = Infer phylogenetic tree (P’i.m) P’i.scores = Evaluate solution (P’i.T) End for #pragma omp single • P = P U P’, ParetoFront = updateParetoSet (P) End while Parallel computation of IHD values Data structure management and operations with data dependencies Parallel computation of offspring individuales • Data structure management EXPERIMENTAL METHODOLOGY AND RESULTS Experimental Methodology The performance achieved by IBEA will be evaluated in terms of speedup factors, efficiencies and biological quality. For this purpose, we have performed experiments over four real nucleotide data sets. rbcL_55 55 sequences (1314 nucleotides per sequence) of the rbcL gene from different species of green plants. mtDNA_186 186 sequences (16608 nucleotides) of human mitochondrial DNA. RDPII_218 218 sequences (4182 nucleotides) of prokayotic RNA. ZILLA_500 500 sequences (759 nucleotides) from rbcL gene. HW: 2 processors AMD Opteron Magny-Cours 6174 (24 cores) at 2,2Ghz and 32GB DDR3 RAM memory, under Scientific Linux 6.1. The software was compiled by using GCC 4.4.5 enabling the GOMP_CPU_AFFINITY flag to ensure CPU-thread affinity. Input parameter configuration. maxEvaluations= 10000, popSize = 96, crossoverProb = 70%, mutationProb = 5%, k = 0.05, Z = (2,2). Parallel Results I Parallel scalability of IBEA, under configurations of 4, 8, 16, 24 OpenMP threads. Comparison with a POSIX-based multicore implementation of RAxML (max. likelihood). 11 independent runs per dataset and system configuration Serial times (sec): 4 cores Algorithm IBEA RAxML Algorithm IBEA RAxML Algorithm IBEA RAxML Algorithm IBEA RAxML SU Eff.(%) 8 cores SU 3.66 91.56 6.95 3.68 91.94 6.26 SU Eff.(%) SU 3.86 95.83 7.21 3.96 99.12 7.24 SU Eff.(%) SU 3.86 96.44 7.30 3.52 88.06 6.54 SU Eff.(%) SU 3.87 96.73 7.68 3.73 93.33 5.99 16 cores rbcL_55 Eff.(%) SU 86.87 78.23 Eff.(%) SU Eff.(%) Eff.(%) SU Eff.(%) Dataset IBEA rbcL_55 5367.60 mtDNA_186 47630.98 Eff.(%) RDPII_218 51657.38 14.57 91.08 20.73 86.36 7.41 46.33 7.72 32.17 ZILLA_500 71754.79 13.37 83.56 18.01 75.03 9.31 58.19 11.35 47.27 ZILLA_500 Eff.(%) SU 96.05 74.89 Eff.(%) 12.90 80.60 17.56 73.17 10.39 64.93 12.89 53.70 RDPII_218 Eff.(%) SU 91.21 81.72 SU 12.32 77.01 16.83 70.14 8.33 52.04 8.77 36.56 mtDNA_186 Eff.(%) SU 90.12 90.47 Eff.(%) 24 cores Eff.(%) SU Parallel Results II According to these results, our speedup factors show an improvement as we increase the number of species in the input dataset. Efficiencies for 24 cores:rbcL_55 (70.14%) vs ZILLA_500 (86.36%) Amdalh’s law implications for multicore machines can be used to discuss these results. By considering growing number of species, the generation and evaluation of offspring solutions will involve more computations over growing matrix and tree data structures. As these operations take place inside parallel regions defined by #pragma omp for directives, an increase in the parallelizable fraction of this application is expected, leading to better parallel results. Parallel Results III Comparisons with PhyloMOEA, a multiobjective genetic algorithm proposed by Cancino et al. These authors developed two parallel versions of their software: Pure MPI-based master-worker scheme. Hybrid MPI-OpenMP system, based on fine grained parallelism to reduce the times required on likelihood computations. Dataset rbcL_55 mtDNA_186 RDPII_218 ZILLA_500 IBEA 12.32 12.90 13.37 14.57 PhyloMOEA MPI 7.30 7.40 9.80 6.70 PhyloMOEA MPIOpenMP 8.30 8.50 10.20 6.30 Speedup Comparisons (16 cores) In accordance with this comparison, IBEA improves significantly the results published for both PhyloMOEA versions in all the considered data sets, showing a proper exploitation of hardware resources. Multiobjective Results Multiobjective assessment of phylogenetic results (using hypervolume). Comparison with NSGA-II, a well-known dominance-based multiobjective metaheuristic. Statistical methodology to compare results: Kolmogorov-Smirnov, Levene, Wilcoxon-MannWhitney, and ANOVA tests. Hypervolume results for 31 independent runs of IBEA and NSGA-II under the evolutionary model GTR. IBEA Dataset rbcL_55 mtDNA_186 RDPII_218 ZILLA_500 Median 71.31% 69.81 % 74.24 % 72.32 % NSGA-II IQR 0.07 0.11 0.08 0.04 Median 71.01% 69.69 % 73.58 % 71.77 % IQR 0.22 0.09 0.06 0.03 Stat. Tests Significant Diff? Yes Yes Yes Yes Our experiments suggest that IBEA achieves a statistically significant improvement over NSGA-II in all the considered data sets. The introduction of quality indicators as a way to guide the inference process leads to considerable multiobjective performance. Phylogenetic Results I Biological assessment of phylogenetic results. Comparisons with single-criterion methods: Maximum parsimony: TNT. Maximum likelihood: RAxML. Comparisons of maximum parsimony trees by using KishinoHasegawa-Templeton (KHT) test (PHYLIP tools). Comparisons of maximum likelihood trees by using ShimodairaHasegawa (SH) test (CONSEL tools). Comparison with TNT (best parsimony trees) Dataset rbcL_55 mtDNA_186 RDPII_218 ZILLA_500 Parsimony Standard difference deviation 0 0 31 0 8.95 4.90 51.53 12.81 KHT Test output No stat. significant diff. No stat. significant diff. No stat. significant diff. No stat. significant diff. Comparison with RAxML (best likelihood trees) Dataset IBEA Pvalue RAxML P-value SH Test output rbcL_55 mtDNA_186 RDPII_218 ZILLA_500 0.621 0.380 0.592 0.324 0.379 0.620 0.408 0.676 No stat. significant diff. No stat. significant diff. No stat. significant diff. No stat. significant diff. Phylogenetic Results II Comparison of biological results (HKY85 model) with PhyloMOEA. IBEA: 31 additional runs per dataset under this evolutionary model. rbcl_55 mtDNA_186 RDPII_218 ZILLA_500 Method Best parsimony Best Best Best Best Best Best likelihood parsimony likelihood parsimony likelihood parsimony IBEA 4874 -21821.11 2431 -39888.07 41517 -134260.26 16218 -80974.93 PhyloMOEA 4874 -21889.84 2437 -39896.44 41534 -134696.53 16219 -81018.06 These comparisons point out the relevance of applying this indicator-based parallel approach, giving significant results not only from a multiobjective perspective, but also attending to biological criteria. Best likelihood CONCLUSION Concluding Remarks In this work, we have applied an indicator-based multiobjective metaheuristic to tackle phylogenetic inference as a MOP. We have introduced a parallel design which aims to reduce the times required to perform real phylogenetic analyses on multicore machines. Experiments over four nucleotide data sets have pointed out a successful exploitation of a 24-core shared memory architecture, showing improved scalabilities with regard to other parallel phylogenetic methods of the literature. In addition, statistically reliable comparisons with NSGA-II and singlecriterion biological approaches suggest that the introduction of quality indicators in multiobjective searches allows IBEA to infer high-quality Pareto sets attending to both multiobjective and biological perspectives Future work lines: Development of new algorithmic designs which combine swarm intelligence and quality indicators. Fine-grained parallel approaches based on OpenCL to take advantage of heterogeneous CPU-GPU systems. Comparisons of different multiobjective metaheuristics to find out which proposal leads to better performance from both multiobjective and biological points of view. Inferring Multiobjective Phylogenetic Hypotheses by Using a Parallel IndicatorBased Evolutionary Algorithm Sergio Santander-Jiménez* and Miguel A. Vega-Rodríguez ARCO Research Group. University of Extremadura *(sesaji@unex.es) 3rd International Conference on the Theory and Practice of Natural Computing TPNC 2014 December 9-11, 2014 Granada, Spain