CSRD (2009) 23: 211-215
DOI 10.1007/s00450-009-0083-7

SPECIAL ISSUE PAPER

A framework for parallel large-scale global optimization

Yuri Evtushenko • Mikhail Posypkin • Israel Sigal

Published online: 15 May 2009
© Springer-Verlag 2009

Abstract  The paper describes the design and implementation of BNB-Solver, an object-oriented framework for parallel discrete and continuous global optimization. The framework supports exact branch-and-bound algorithms, heuristic methods and hybrid approaches. BNB-Solver supports both distributed and shared memory architectures. The implementation for distributed memory machines is based on MPI and thus can run on almost any computational cluster. To take advantage of multicore processors we provide a separate multi-threaded implementation for shared memory platforms. We introduce a novel collaborative scheme for combining exact and heuristic search methods that supports sophisticated parallel heuristics and convenient balancing between exact and heuristic methods. In the experimental results section we discuss a nonlinear programming solver and a highly efficient knapsack solver that significantly outperforms existing parallel implementations.

Keywords  Parallel global optimization • Branch-and-bound methods • Heuristic methods • Knapsack problems • Nonlinear programming

1 Introduction

Much research effort has been devoted to the parallel implementation of exact and heuristic algorithms for optimization problems; many references can be found in the surveys [2, 4, 5]. Different optimization algorithms often share a common structure, so generic frameworks for such algorithms reduce the programming effort required to implement new optimization problems. Existing frameworks usually focus on one family of methods (e.g. branch-and-bound, tabu search etc.). However, exact methods may benefit from heuristic search and vice versa, and a large piece of functionality can be shared between exact and heuristic methods, e.g. parallel communication routines, local search routines etc. There is therefore a clear need for universal frameworks supporting parallel exact, heuristic and hybrid methods.

This paper describes BNB-Solver, a C++ template library for parallel global discrete and continuous optimization. At the moment the library provides class templates for branch-and-bound, heuristic and hybrid methods. Unlike many other frameworks, BNB-Solver supports both distributed and shared memory platforms. The distributed memory implementation is based on MPI [13] and the shared memory implementation is based on Native POSIX Threads [7]. Special support for shared memory platforms is a crucial issue because of the growing popularity of multicore processors.

2 Overview of optimization problems and resolution methods

The goal of global optimization (GO) is to find an extreme (minimal or maximal) value $f^* = f(x^*)$ of an objective function on a feasible domain $X \subseteq \mathbb{R}^n$. The value $f^*$ and the feasible point $x^*$ are called the optimum and an optimal solution respectively. Without loss of generality we consider only minimization problems: $f(x) \to \min$, $x \in X$.

Resolution methods for GO problems can be roughly classified according to the degree of rigor with which they approach this goal: complete (exact) methods find the optimum and prove its optimality, while incomplete (heuristic) methods find a feasible solution based on reasonable assumptions but do not guarantee optimality. Heuristic methods are applied when complete methods fail to solve the problem due to the lack of computational resources or time.

Branch-and-Bound (B&B) is a general name for exact methods that split the initial problem into subproblems which are sooner or later eliminated by bounding rules. Bounding rules determine whether a subproblem can yield a solution better than the best solution found so far, called the incumbent solution. Bounding is often done by comparing lower and upper bounds: a subproblem can be pruned if the lower bound on its objective is greater than or equal to the current upper bound, i.e. the value of the incumbent solution. A high-quality upper bound thus significantly reduces the search space and in some cases yields dramatic performance improvements. Hybrid algorithms couple B&B and heuristics in order to employ the latter for improving incumbent solutions.
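To make the bounding rule concrete, the following minimal sketch shows a serial depth-first B&B loop for minimization. It is an illustration only, not BNB-Solver source code: the subproblem interface assumed for the type parameter S is hypothetical.

```cpp
#include <algorithm>
#include <deque>
#include <limits>

// Generic serial B&B loop for minimization, parameterized over a
// problem-specific subproblem type (echoing the template approach of
// Sect. 3; the member names below are illustrative assumptions).
//
// S must provide:
//   double lowerBound() const;        lower bound on the objective over s
//   bool   isLeaf() const;            true if s cannot be decomposed further
//   double value() const;             objective value of a leaf
//   std::deque<S> decompose() const;  branching step
template <class S>
double branchAndBound(const S& root) {
    double incumbent = std::numeric_limits<double>::infinity();
    std::deque<S> pool{root};
    while (!pool.empty()) {
        S s = pool.back();
        pool.pop_back();                    // depth-first traversal
        if (s.lowerBound() >= incumbent)
            continue;                       // bounding rule: prune s
        if (s.isLeaf())
            incumbent = std::min(incumbent, s.value());
        else
            for (const S& child : s.decompose())
                pool.push_back(child);      // branching
    }
    return incumbent;                       // optimum, if root covers all of X
}
```

The better (smaller) the incumbent, the more subproblems fail the `lowerBound() >= incumbent` test, which is exactly why feeding heuristic solutions into a B&B search pays off.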
3 Implementation

In the BNB-Solver library we follow the standard approach taken in software frameworks for global optimization [1, 8, 12, 14]: the problem-specific and problem-independent (core) parts are separated and implemented independently. Coupling of the problem-specific and problem-independent parts is done via the C++ template mechanism. Problem-specific branching operations, bounding rules and type definitions for subproblems and feasible points are encapsulated in a problem factory class. This class is substituted as a template argument into specializations of the core classes. Below we consider the core components of the serial, shared memory and distributed memory implementations.

3.1 Serial implementation

The core serial B&B component, the Traverse class template, implements the basic branching interface: its branch method decomposes a subproblem into a set of new subproblems by performing a given number of B&B steps. The concrete implementation of the branch method comes from the problem factory class substituted as a template argument.

3.2 Shared memory implementation

The core class PTraverse implements a multithreaded version of the standard B&B scheme. Its branch method starts several threads that share a common pool of subproblems while also maintaining their own local pools. Each thread performs a given threshold number of B&B steps, storing the resulting subproblems locally; it then copies a chunk of subproblems from its local pool to the shared pool and continues the solution process with a subproblem taken from the local pool. If the local pool is exhausted the thread takes a new subproblem from the shared pool, and if the shared pool is empty the requesting thread blocks until another thread fills it with subproblems.

In B&B methods every branching operation results in dynamic memory allocation for the newly created subproblems. Memory allocation requires thread synchronization because the system heap is shared among the threads; in [3] it is observed that dynamic memory allocation causes severe performance losses in multithreaded B&B implementations. In BNB-Solver this problem is solved by a special memory manager that maintains a local memory pool for every thread. Like the Traverse class template, PTraverse takes problem-specific information from the problem factory class substituted as a template argument.
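The pool discipline just described can be sketched as follows. This is an illustrative C++11 sketch with assumed names (SharedPool, worker, Task, Step); BNB-Solver's actual PTraverse differs in detail, and incumbent propagation and the per-thread memory manager are omitted for brevity.

```cpp
#include <condition_variable>
#include <deque>
#include <iterator>
#include <mutex>

// Shared pool of subproblems guarded by a mutex; threads block on the
// condition variable when the pool runs dry.
template <class Task>
class SharedPool {
    std::deque<Task> pool_;
    std::mutex m_;
    std::condition_variable cv_;
    int idle_ = 0;
    const int nthreads_;
public:
    explicit SharedPool(int nthreads) : nthreads_(nthreads) {}

    // Donate a chunk of locally generated subproblems to the shared pool.
    void put(std::deque<Task>& chunk) {
        std::lock_guard<std::mutex> lk(m_);
        for (Task& t : chunk) pool_.push_back(std::move(t));
        chunk.clear();
        cv_.notify_all();
    }

    // Fetch one subproblem; returns false when every thread is idle and
    // the pool is empty, i.e. the whole search is finished.
    bool get(Task& out) {
        std::unique_lock<std::mutex> lk(m_);
        ++idle_;
        cv_.wait(lk, [this] { return !pool_.empty() || idle_ == nthreads_; });
        if (pool_.empty()) { cv_.notify_all(); return false; }
        --idle_;
        out = std::move(pool_.front());
        pool_.pop_front();
        return true;
    }
};

// Worker loop run by each thread: perform up to `threshold` B&B steps on
// the local pool, then keep one subproblem and donate the surplus.
// `step(cur, local)` performs one B&B step on `cur`, pushing any child
// subproblems that survive bounding into `local`.
template <class Task, class Step>
void worker(SharedPool<Task>& shared, Step step, int threshold) {
    std::deque<Task> local;
    Task t;
    while (true) {
        if (local.empty()) {
            if (!shared.get(t)) return;          // global termination
            local.push_back(std::move(t));
        }
        for (int i = 0; i < threshold && !local.empty(); ++i) {
            Task cur = std::move(local.back());
            local.pop_back();
            step(cur, local);
        }
        if (local.size() > 1) {                  // share the surplus
            std::deque<Task> chunk(std::make_move_iterator(local.begin() + 1),
                                   std::make_move_iterator(local.end()));
            local.resize(1);
            shared.put(chunk);
        }
    }
}
```

Blocking inside get realizes the behavior described above: a thread whose local pool is exhausted sleeps until another thread refills the shared pool, and all workers return once every thread is idle and the shared pool is empty.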
3.3 Distributed memory implementation

The general scheme of the parallel implementation for distributed memory machines is depicted in Fig. 1. In accordance with the distributed memory programming paradigm the application consists of a set of MPI processes interacting via message passing; at run time the processes are normally mapped onto different processors or cores.

Fig. 1  General scheme of the distributed memory implementation

The set of all processes is divided between the B&B part ($B_M$ and $\{B_j\}$) and the heuristic part ($H_M$ and $\{H_j\}$). The master processes $B_M$ and $H_M$ manage the B&B and heuristic slave processes respectively. The supervisor process $M$ provides an external interface (loading initial data, starting and stopping computations etc.) to the parallel application and coordinates the processes $B_M$ and $H_M$. The processes $\{H_j\}$ perform heuristic search by improving the collection of points maintained on the process $H_M$. New points are either generated by $\{H_j\}$ or borrowed from the processes $\{B_j\}$ involved in B&B. New incumbent values produced by the heuristics are transmitted through $H_M$ and $B_M$ to the processes $\{B_j\}$ in order to facilitate bounding.

The distributed memory solver is implemented by the MSSolver class template with one template parameter: a class implementing tree traversal, i.e. providing the branch method. The class substituted as the template argument is used to perform B&B operations on the processes $\{B_j\}$. Substituting PTraverse as the template argument of MSSolver yields a hybrid implementation that combines message passing between nodes with shared memory interaction inside a node; substituting the Traverse class produces a pure distributed memory solver.

4 Experimental results

BNB-Solver has been used to implement four solvers: a B&B solver for the boolean one-dimensional knapsack problem, a B&B solver for constrained continuous optimization, a B&B solver for the traveling salesman problem and a stochastic global optimization solver. Due to space limitations we focus only on the knapsack and constrained continuous optimization examples. All presented experiments were run on a cluster of 11 SMP nodes, each with two 4-core Intel Xeon 5355 processors, installed at the Institute for System Programming of the Russian Academy of Sciences.

4.1 The knapsack solver

The one-dimensional knapsack problem [10] is formulated as follows: given a set of $n$ items with weights $w_1, \dots, w_n$ and profits $p_1, \dots, p_n$, find a subset of $\{1, \dots, n\}$ whose total weight does not exceed a given capacity $C$ and whose total profit is maximized.

The straightforward implementation employed in many knapsack solvers stores the whole vector of fixed variables for each subproblem. Branching then allocates two new vectors for the descendant subproblems and copies data from the parent subproblem to the descendants. A much more efficient B&B scheme proposed by Horowitz and Sahni [10] uses a single binary vector of size $n$: in effect this vector encodes the depth-first search tree without constructing it explicitly. The vector introduces data dependencies between consecutive algorithm steps, which makes the Horowitz-Sahni (HS) scheme hard to parallelize. To build a parallel implementation that retains the advantages of the HS approach we devised the following scheme: the branch method of the knapsack problem factory performs a given number of HS steps and then extracts several pending subproblems from the resulting binary vector. The extracted subproblems are put into the process- or thread-local pool for later transmission to other processes or threads.

Table 1 lists experimental results for two problems. Problem 1 has the form

$\sum_{i=1}^{24} x_i \to \max, \quad \sum_{i=1}^{24} 2 x_i \le 25.$

Though the analytical solution of this problem is evident, the standard B&B procedure expands approximately $8 \times 10^6$ subproblems.
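The single-vector idea can be illustrated by the following simplified depth-first knapsack solver written in the spirit of the HS scheme. It is a sketch, not the BNB-Solver implementation: items are assumed sorted by non-increasing profit/weight ratio, and the bound is the usual fractional relaxation.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Items sorted by profit/weight ratio (assumed), plus the capacity.
struct Knapsack {
    std::vector<double> p, w;   // profits and weights
    double C;                   // capacity
};

// Fractional (linear relaxation) upper bound over items k..n-1, given
// remaining capacity `cap` and profit `prof` accumulated so far.
static double upperBound(const Knapsack& K, std::size_t k,
                         double cap, double prof) {
    for (; k < K.p.size() && K.w[k] <= cap; ++k) { cap -= K.w[k]; prof += K.p[k]; }
    if (k < K.p.size()) prof += K.p[k] * cap / K.w[k];
    return prof;
}

// Depth-first B&B over a single decision vector x: the vector implicitly
// encodes the search tree, so no subproblem objects are allocated.
double solveDFS(const Knapsack& K) {
    const std::size_t n = K.p.size();
    std::vector<char> x(n, 0);          // the single binary vector
    double best = 0.0, prof = 0.0, cap = K.C;
    std::size_t k = 0;                  // next item to decide
    while (true) {
        // Forward phase: decide items greedily while the bound allows it.
        while (k < n && upperBound(K, k, cap, prof) > best) {
            if (K.w[k] <= cap) { x[k] = 1; cap -= K.w[k]; prof += K.p[k]; }
            else              { x[k] = 0; }
            ++k;
        }
        best = std::max(best, prof);    // record the incumbent
        // Backtrack: flip the deepest "take" decision to "skip".
        std::size_t j = k;
        while (j > 0 && x[j - 1] == 0) --j;
        if (j == 0) return best;        // search tree exhausted
        --j;
        x[j] = 0; cap += K.w[j]; prof -= K.p[j];
        k = j + 1;
    }
}

int main() {
    // Problem 1 from the text: 24 items with p_i = 1, w_i = 2, C = 25.
    Knapsack K{std::vector<double>(24, 1.0), std::vector<double>(24, 2.0), 25.0};
    std::printf("optimum = %g\n", solveDFS(K));  // prints 12
}
```

On Problem 1 the fractional bound stays at 12.5 along most of the tree, so pruning is largely ineffective and this sketch, like the standard B&B procedure mentioned above, expands millions of subproblems before terminating.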
Problem 2 is a 2000-item random knapsack instance provided by Eckstein. Running times of the BNB-Solver, ALPS and PEBBL libraries are presented; ALPS was not able to solve Problem 2 in a reasonable time.

Table 1  Running time (s) for two knapsack problem instances

Number of cores      1       2       4       8       16      32
Problem 1
  ALPS               16.7    19.2    6.57    3.12    2.55    1.85
  PEBBL              137.2   56.3    35.5    23.6    17.1    15.2
  BNB-Solver         1.32    0.69    0.34    0.18    0.11    0.07
Problem 2
  ALPS               -       -       -       -       -       -
  PEBBL              743.3   369.5   183.1   99.9    50.2    35.1
  BNB-Solver         2.92    1.86    0.92    0.52    0.50    0.42

From the table it is clear that all three libraries scale relatively well; however, the BNB-Solver implementation is at least 10 times faster. This superiority is explained by the very efficient Horowitz-Sahni B&B scheme employed on the BNB-Solver slave processes.

Problems 1 and 2 are insensitive to heuristics, and therefore in the most efficient configuration all processes belong to the B&B part. However, for problems sensitive to heuristics the effect of sharing cores between the exact and heuristic parts may be significant. For the knapsack solver we devised a parallel heuristic method based on local search in a neighborhood defined by a relatively small tuple of indices. Table 2 lists performance results for three random 10 000-item knapsack instances sensitive to heuristics. The effect of using heuristics depends noticeably on the balance between the cores involved in the B&B and heuristic parts. Table 2 compares the time obtained on one core with the best time obtained on 16 cores and gives the most effective $N_b / N_h$ ratio, where $N_b$ and $N_h$ are the numbers of slave processes devoted to the B&B and heuristic parts respectively. The huge observed speedups are explained by the fact that for the considered instances the rapidly found optimum dramatically reduces the number of expanded subproblems. A detailed description of the knapsack implementation with more experimental data can be found in [11].

Table 2  Performance results for knapsack problems sensitive to heuristics

Pr.   Time (s) on one core   Best time (s) on 16 cores   Speedup   Best $N_b/N_h$ ratio
1     1178.9                 0.46                        2562      11/5
2     1748.5                 1.13                        1556      8/8
3     464.44                 0.69                        673       14/2

4.2 Constrained continuous optimization

A generic nonlinear programming problem (NLP) with $m$ inequality and $k$ equality constraints can be written as follows: $f(x) \to \min$, $g(x) \le 0$, $h(x) = 0$, where $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function and $g : \mathbb{R}^n \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^k$ are the inequality and equality constraints respectively. With the help of the BNB-Solver framework we parallelized the sequential deterministic B&B continuous optimization method proposed in [9], adapted for NLP. To illustrate the speedup we selected the pressure vessel design problem [6], formulated as follows:

$f(x) = 0.6224\, T_s R L + 1.7781\, T_h R^2 + 3.1661\, T_s^2 L + 19.84\, T_s^2 R \to \min,$
$g_1(x) = -T_s + 0.0193\, R \le 0,$
$g_2(x) = -T_h + 0.00954\, R \le 0,$
$g_3(x) = -\pi R^2 L - \tfrac{4}{3}\pi R^3 + 1296000 \le 0,$
$g_4(x) = L - 240 \le 0,$

where $T_s$, $T_h$, $R$, $L$ are the four design variables. The modified covering algorithm found the optimal solution $(T_s, T_h, R, L) = (0.727591, 0.359649, 37.699020, 239.999961)$ with guaranteed optimality to a precision of $10^{-7}$. This solution yields the objective function value 5804.375611, which is notably lower than the best value 6059.946409 reported in [6].
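The reported optimum is easy to check numerically. The following snippet (an illustrative verification added here, not part of the solver) evaluates the objective and the four constraints at the reported point; it reproduces $f \approx 5804.3756$ and shows $g_1$, $g_2$ and $g_3$ essentially active.

```cpp
#include <cmath>
#include <cstdio>

// Evaluates the pressure vessel design problem of Sect. 4.2 at the
// optimal point reported in the text (a verification sketch only).
int main() {
    const double Ts = 0.727591, Th = 0.359649, R = 37.699020, L = 239.999961;
    const double pi = std::acos(-1.0);

    double f = 0.6224 * Ts * R * L + 1.7781 * Th * R * R
             + 3.1661 * Ts * Ts * L + 19.84 * Ts * Ts * R;
    double g1 = -Ts + 0.0193 * R;                                 // <= 0
    double g2 = -Th + 0.00954 * R;                                // <= 0
    double g3 = -pi * R * R * L
              - 4.0 / 3.0 * pi * R * R * R + 1296000.0;           // <= 0
    double g4 = L - 240.0;                                        // <= 0

    std::printf("f  = %.6f\n", f);                 // ~5804.3756
    std::printf("g1 = %.2e  g2 = %.2e  g3 = %.2e  g4 = %.2e\n",
                g1, g2, g3, g4);                   // all non-positive
}
```

The constraints $g_1$, $g_2$ and $g_3$ evaluate to values negligible relative to their scale, confirming that the optimum lies on the boundary of the feasible domain.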
Table 3 presents running times for the shared and distributed memory implementations. Unlike the examples from Table 1, this problem exhibits search anomalies: the number of expanded subproblems varies significantly. Therefore besides the absolute time $T$ we also present the approximate number of expanded subproblems $N$ and the average time $t = T / N$ required to expand one subproblem.

Table 3  Running time (s) for the pressure vessel design problem

                     Number of cores   T       N           t
Shared memory        1                 55.71   12 × 10^6   4.72 × 10^-6
                     2                 29.85   11 × 10^6   2.72 × 10^-6
                     4                 11.7     6 × 10^6   1.77 × 10^-6
                     8                 17.9    11 × 10^6   1.56 × 10^-6
Distributed memory   1                 55.71   12 × 10^6   4.72 × 10^-6
                     2                 27.86   12 × 10^6   2.35 × 10^-6
                     4                 23.11   19 × 10^6   1.23 × 10^-6
                     8                 11.59   16 × 10^6   0.7 × 10^-6
                     16                 9.22   15 × 10^6   0.6 × 10^-6
                     32                 4.45   16 × 10^6   0.3 × 10^-6

The distributed memory implementation scales somewhat better than the shared memory one, which indicates contention of the threads for shared resources. On the other hand, the number of expanded subproblems is smaller for the shared memory implementation thanks to the immediate propagation of incumbent values among threads.

5 Concluding remarks and future work

The paper describes a new C++ framework for implementing branch-and-bound, heuristic and hybrid optimization algorithms on shared and distributed memory platforms. The efficiency of the proposed approach is demonstrated on two examples: knapsack and nonlinear programming problems. Ongoing work focuses on improving the existing implementations, implementing new optimization problems and devising a systematic, mathematically justified approach to load balancing.

Acknowledgement  This work was supported by Intel Corporation (Intel High Performance Competition award) and the Russian Foundation for Basic Research (projects 08-01-00619-a, 08-07-00072-a).

References

1. Alba E, Almeida F, Blesa M et al (2006) Efficient parallel LAN/WAN algorithms for optimization. The MALLBA project. Parall Comput 32(5-6):415-440
2. Ananth G, Kumar V, Pardalos P (1993) Parallel processing of discrete optimization problems. Encycl Microcomput 13:129-147
3. Casado L, Martinez J, Garcia I et al (2008) Branch-and-bound interval global optimization on shared memory multiprocessors. Optim Methods Softw 23(5):689-701
4. Crainic TG, Le Cun B, Roucairol C (2006) Parallel branch and bound algorithms. In: Parallel Combinatorial Optimization, Chap 1. John Wiley & Sons, Hoboken, NJ, pp 1-28
5. Crainic TG, Toulouse M (2002) Parallel strategies for meta-heuristics. In: Glover F, Kochenberger G (eds) State-of-the-Art Handbook in Metaheuristics. Kluwer Academic Publishers, Dordrecht, pp 475-513
6. Coello Coello CA, Mezura-Montes E (2002) Constraint-handling in genetic algorithms through the use of dominance-based tournament selection. Adv Eng Inf 16(3):193-203
7. Drepper U, Molnar I (2005) The Native POSIX Thread Library for Linux. http://people.redhat.com/drepper/nptl-design.pdf. Last access: 13 May 2009
8. Eckstein J, Phillips C, Hart W (2006) PEBBL 1.0 User Guide. RUTCOR Research Report RRR 19-2006. http://rutcor.rutgers.edu/pub/rrr/reports2006/19_2006.ps. Last access: 13 May 2009
9. Evtushenko Y (1971) Numerical methods for finding global extrema (case of a non-uniform mesh). USSR Comput Maths Math Phys 11(6):38-54
10. Kellerer H, Pferschy U, Pisinger D (2004) Knapsack Problems. Springer, Berlin
11. Posypkin M, Sigal I (2008) A combined parallel algorithm for solving the knapsack problem. J Comput Syst Sci Int 47(4):543-551
12. Ralphs T (2006) Parallel branch and cut. In: Parallel Combinatorial Optimization. John Wiley & Sons, Hoboken, NJ, pp 53-103
13. Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J (1996) MPI: The Complete Reference. MIT Press, Cambridge, MA
14. Tschoke S, Holthofer N (1995) A new parallel approach to the constrained two-dimensional cutting stock problem.
In: Parallel Algorithms for Irregularly Structured Problems, LNCS 980. Springer, London, pp 285-300