CSRD (2009) 23: 211-215
DOI 10.1007/s00450-009-0083-7
SPECIAL ISSUE PAPER
A framework for parallel large-scale global optimization
Yuri Evtushenko • Mikhail Posypkin • Israel Sigal
Published online: 15 May 2009
© Springer-Verlag 2009
Abstract The paper describes the design and implementation of BNB-Solver, an object-oriented framework for discrete and continuous parallel global optimization. The framework supports exact branch-and-bound algorithms, heuristic methods and hybrid approaches, and targets both distributed and shared memory architectures. The implementation for distributed memory machines is based on MPI and thus can run on almost any computational cluster. To take advantage of multicore processors we provide a separate multi-threaded implementation for shared memory platforms. We introduce a novel collaborative scheme for combining exact and heuristic search methods that supports sophisticated parallel heuristics and convenient balancing between exact and heuristic methods. In the experimental results section we discuss a nonlinear programming solver and a highly efficient knapsack solver that significantly outperforms existing parallel implementations.
Keywords Parallel global optimization • Branch-and-bound methods • Heuristic methods • Knapsack problems • Nonlinear programming
1 Introduction
Considerable research effort has been devoted to the parallel implementation of exact and heuristic algorithms for optimization problems; many references can be found in the surveys [2, 4, 5]. Different optimization algorithms often share a common structure, so generic frameworks that capture this structure reduce the programming effort needed to implement new optimization problems. Existing frameworks usually focus on a single family of methods (e.g. Branch-and-Bound, Tabu search etc.). However, exact methods may benefit from heuristic search and vice versa, and a large part of the functionality can be shared between exact and heuristic methods, e.g. parallel communication routines and local search routines. Therefore there is a clear need for universal frameworks for implementing parallel exact, heuristic and hybrid methods. This paper describes BNB-Solver: a C++ template library for parallel global discrete and continuous optimization. At the moment the library provides class templates for Branch-and-Bound, heuristic and hybrid methods. Unlike many other frameworks, BNB-Solver supports both distributed and shared memory platforms. The distributed memory implementation is based on MPI [13] and the shared memory implementation on Native POSIX Threads [7]. Special support for shared memory platforms is crucial because of the rapidly growing popularity of multicore processors.
2 Overview of optimization problems and resolution methods
The goal of global optimization (GO) is to find an extreme (minimal or maximal) value $f^* = f(x^*)$ of an objective function on a feasible domain $X \subseteq \mathbb{R}^n$. The value $f^*$ and the feasible point $x^*$ are called the optimum and an optimal solution respectively. Without loss of generality we consider only minimization problems: $f(x) \to \min$, $x \in X$.
Resolution methods for GO problems can be roughly classified according to the degree of rigor with which they approach the goal: complete (exact) methods find the optimum and prove its optimality, while incomplete methods, or heuristics, find a feasible solution based on reasonable assumptions but do not guarantee optimality. Heuristic methods are applied when complete methods fail to solve the problem due to the lack of computational resources or time.
Branch-and-Bound (B&B) is a general name for exact methods that split an initial problem into subproblems which are sooner or later eliminated by bounding rules. Bounding rules determine whether a subproblem can yield a solution better than the best solution found so far, called the incumbent solution. Bounding is often done by comparing lower and upper bounds: a subproblem can be pruned if the lower bound on its objective is larger than or equal to the current upper bound, i.e. the value of the incumbent solution. Thus a high-quality upper bound significantly reduces the search space and in some cases yields dramatic performance improvements. Hybrid algorithms couple B&B and heuristics in order to employ the latter for improving incumbent solutions.
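In outline, the bounding rule of a minimization B&B can be sketched as follows. This is a schematic illustration only, not BNB-Solver's actual interface: the subproblem type and the three callbacks are hypothetical placeholders.

#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Generic B&B skeleton for minimization (illustrative sketch).
template <class Sub>
double branchAndBound(
    Sub root,
    std::function<double(const Sub&)> lowerBound,           // bounding rule
    std::function<bool(const Sub&, double&)> feasibleValue, // feasible objective, if any
    std::function<std::vector<Sub>(const Sub&)> branch)     // branching rule
{
    double incumbent = std::numeric_limits<double>::infinity();
    std::vector<Sub> pool;
    pool.push_back(std::move(root));
    while (!pool.empty()) {
        Sub s = std::move(pool.back());
        pool.pop_back();                          // depth-first traversal
        if (lowerBound(s) >= incumbent) continue; // prune: cannot beat the incumbent
        double v;
        if (feasibleValue(s, v) && v < incumbent)
            incumbent = v;                        // new incumbent solution
        for (Sub& c : branch(s))
            pool.push_back(std::move(c));         // expand the subproblem
    }
    return incumbent;
}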
3 Implementation
In the BNB-Solver library we follow the standard approach taken in software frameworks for global optimization [1, 8, 12, 14]: the problem-specific and problem-independent (core) parts are separated and implemented independently. The two parts are coupled via the C++ template mechanism. Problem-specific branching operations, bounding rules and type definitions for subproblems and feasible points are encapsulated in a problem factory class. This class is substituted as a template argument within specializations of core classes. Below we consider the core components for the serial, shared memory and distributed memory implementations.
3.1 Serial implementation
The core serial B&B component, the Traverse class template, implements the basic interface for branching: its branch method decomposes a subproblem into a set of new subproblems by performing a given number of B&B steps. The concrete implementation of the branch method is taken from the problem factory class substituted as a template argument.
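A simplified sketch of this coupling is given below; the real Traverse interface may differ, and KnapsackFactory is a hypothetical example.

#include <utility>
#include <vector>

// Core class template: problem-independent traversal logic. The problem
// factory supplies the subproblem type and the branching rule.
template <class ProblemFactory>
class Traverse {
public:
    using Set = typename ProblemFactory::Set; // subproblem type from the factory

    explicit Traverse(ProblemFactory& f) : factory(f) {}

    // Perform up to 'steps' B&B steps, consuming and refilling 'pool'.
    void branch(std::vector<Set>& pool, int steps) {
        for (int i = 0; i < steps && !pool.empty(); ++i) {
            Set s = std::move(pool.back());
            pool.pop_back();
            factory.branch(s, pool); // problem-specific decomposition and bounding
        }
    }

private:
    ProblemFactory& factory;
};

// Usage sketch: Traverse<KnapsackFactory> t(kf); t.branch(pool, 1000);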
3.2 Shared memory implementation
The core class PTraverse implements a multithreaded version of the standard B&B scheme. Its branch method starts several threads that share a common pool of subproblems and also maintain their own local pools. Each thread performs a given threshold number of B&B steps, storing the resulting subproblems locally. After that the thread copies a chunk of subproblems from its local pool to the shared pool and continues the solution process with a subproblem taken from the local pool. If the local subproblem pool is exhausted the thread takes a new subproblem from the shared pool; if the shared pool is empty the requesting thread blocks until another thread fills it with subproblems. In B&B methods every branching operation results in dynamic memory allocation for newly created subproblems. Memory allocation requires thread synchronization because the system memory is shared among multiple threads, and in [3] it is observed that dynamic memory allocation causes severe performance losses in multithreaded B&B implementations. In BNB-Solver this problem is solved by a special memory manager that maintains a local memory pool for every thread. Like the Traverse class template, PTraverse takes problem-specific information from the problem factory class substituted as a template argument.
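The work-sharing part of this scheme can be pictured with a condensed sketch; the types are hypothetical, and the real PTraverse additionally handles termination detection and the per-thread memory pools mentioned above.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Shared subproblem pool: threads donate chunks of locally produced
// subproblems and block on an empty pool until work appears.
template <class Set>
class SharedPool {
public:
    // Move a chunk of subproblems from a thread-local pool into the shared one.
    void put(std::vector<Set>& chunk) {
        {
            std::lock_guard<std::mutex> lock(m);
            for (Set& s : chunk) pool.push_back(std::move(s));
        }
        chunk.clear();
        cv.notify_all(); // wake threads waiting for work
    }

    // Take one subproblem; blocks while the shared pool is empty.
    Set get() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !pool.empty(); });
        Set s = std::move(pool.front());
        pool.pop_front();
        return s;
    }

private:
    std::deque<Set> pool;
    std::mutex m;
    std::condition_variable cv;
};

// Each worker thread alternates: perform the threshold number of B&B steps on
// its local pool, put() a chunk into the shared pool, and call get() when the
// local pool dries up.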
3.3 Distributed memory implementation
The general scheme of the parallel implementation for distributed memory machines is depicted in Fig. 1.

Fig. 1 General scheme of the distributed memory implementation

In accordance with the distributed memory programming paradigm the application consists of a set of MPI processes interacting via message-passing. At run-time the processes are normally mapped onto different processors or cores.
The set of all processes is divided into a B&B part ($B_M$ and $\{B_j\}$) and a heuristic part ($H_M$ and $\{H_j\}$). The master processes $B_M$ and $H_M$ manage the B&B and heuristic slave processes respectively. The supervisor process $M$ provides an external interface (loading initial data, starting and stopping computations etc.) to the parallel application and coordinates the processes $B_M$ and $H_M$. The processes $\{H_j\}$ perform heuristic search by improving the collection of points maintained on the process $H_M$. New points are either generated by $\{H_j\}$ or borrowed from the processes $\{B_j\}$ involved in B&B. New incumbent values produced by heuristics are transmitted through the processes $H_M$ and $B_M$ to the processes $\{B_j\}$ in order to facilitate bounding. The distributed memory solver is implemented by the MSSolver class template with one template parameter, a class implementing tree traversal, i.e. implementing the branch method. The class substituted as a template argument is used to perform B&B operations on the processes $\{B_j\}$. Substituting PTraverse as the template argument of the MSSolver class template yields a hybrid implementation combining message-passing between nodes and shared memory interactions inside one node; substituting the Traverse class produces a pure distributed memory solver.
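Schematically the substitution looks as follows (KnapsackFactory and all construction details are hypothetical and omitted):

// Hybrid solver: MPI message-passing between nodes, threads inside a node.
MSSolver<PTraverse<KnapsackFactory>> hybridSolver;

// Pure distributed memory solver: one MPI process per core.
MSSolver<Traverse<KnapsackFactory>> flatSolver;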
4 Experimental results
BNB-Solver has been used to implement four solvers: a B&B solver for the boolean uni-dimensional knapsack problem, a B&B solver for constrained continuous optimization, a B&B solver for the traveling salesman problem and a stochastic global optimization solver. Due to space limitations we focus only on the knapsack and constrained continuous optimization examples. All presented experiments were run on a cluster of 11 SMP nodes, each with two 4-core Intel Xeon 5355 processors, installed at the Institute for System Programming of the Russian Academy of Sciences.
4.1 The knapsack solver
The uni-dimensional knapsack problem [10] is formulated as follows: given a set of $n$ items with weights $w_1, \ldots, w_n$ and profits $p_1, \ldots, p_n$, find a subset of $\{1, \ldots, n\}$ such that its total weight does not exceed the given capacity $C$ and its total profit is maximized. The straightforward implementation employed in many knapsack solvers stores the whole vector of fixed variables for each subproblem. Branching then results in allocating two new vectors for the descendant subproblems and copying data from the parent subproblem to the descendants. A much more efficient B&B scheme proposed by Horowitz and Sahni [10] uses a single binary vector of size $n$. In effect this vector encodes the depth-first search tree without explicitly constructing it. The vector introduces data dependencies among consecutive algorithm steps that make the Horowitz-Sahni (HS) scheme hard to parallelize. To build a parallel implementation retaining the advantages of the HS approach we devised the following scheme: the knapsack problem factory's branch method performs a given number of HS steps and then extracts several pending subproblems from the resulting binary vector. Extracted subproblems are put into the process- or thread-local pool for later transmission to other processes or threads.
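The single-vector idea can be illustrated by a compact depth-first sketch. It is deliberately simplified: it uses a crude remaining-profit bound instead of the fractional bound employed by the HS scheme, and it omits the extraction of pending subproblems.

#include <cstddef>
#include <vector>

struct Knapsack {
    std::vector<double> w, p; // item weights and profits
    double C;                 // knapsack capacity
};

// Depth-first search over a single binary vector x: entries are flipped in
// place while branching and backtracking, so no subproblem objects are
// allocated. rest[i] holds the total profit of items i..n-1, a simple upper
// bound on what the remaining items can still add.
void dfs(const Knapsack& k, const std::vector<double>& rest,
         std::vector<int>& x, std::size_t i,
         double cap, double prof, double& best) {
    if (prof > best) best = prof;                // improve the incumbent
    if (i == x.size() || prof + rest[i] <= best) // bound: prune hopeless branches
        return;
    if (k.w[i] <= cap) {
        x[i] = 1;                                // branch: take item i
        dfs(k, rest, x, i + 1, cap - k.w[i], prof + k.p[i], best);
    }
    x[i] = 0;                                    // backtrack: skip item i
    dfs(k, rest, x, i + 1, cap, prof, best);
}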
Table 1 lists experimental results for two problems. Problem 1 has the form $\sum_{i=1}^{24} x_i \to \max$, $\sum_{i=1}^{24} 2 x_i \le 25$. Though the analytical solution of this problem is evident, the standard B&B procedure expands approximately $8 \times 10^6$ subproblems. Problem 2 is a 2000-item random knapsack instance provided by Eckstein. The running times of the BNB-Solver, ALPS and PEBBL libraries are presented; ALPS was not able to solve Problem 2 in a reasonable time. The table shows that all three libraries provide relatively good scalability. However, the BNB-Solver implementation is at least 10 times faster. This superiority is explained by the very efficient Horowitz-Sahni B&B scheme employed on the BNB-Solver slave processes.
Table 1 Running time (s) for two knapsack problem instances

                     Number of cores
                 1       2       4       8       16      32
Problem 1
  ALPS          16.7    19.2    6.57    3.12    2.55    1.85
  PEBBL        137.2    56.3    35.5    23.6    17.1    15.2
  BNB-Solver     1.32    0.69    0.34    0.18    0.11    0.07
Problem 2
  ALPS           -       -       -       -       -       -
  PEBBL        743.3   369.5   183.1    99.9    50.2    35.1
  BNB-Solver     2.92    1.86    0.92    0.52    0.50    0.42
Problems 1 and 2 are insensitive to heuristics, and therefore in the most efficient configuration all processes belong to the B&B part. However, for problems sensitive to heuristics the effect of sharing cores between the exact and heuristic algorithms may be significant. For the knapsack solver we devised a parallel heuristic method based on a local search in a neighborhood defined by a relatively small tuple of indices. Table 2 lists performance results obtained for three random 10 000-item knapsack instances sensitive to heuristics.

The effect of using heuristics depends notably on the balance between the cores involved in the B&B and heuristic parts. Table 2 compares the time obtained on one core with the best time obtained on 16 cores and gives the most effective $N_b / N_h$ ratio, where $N_b$ and $N_h$ are the numbers of slave processes devoted to the B&B and heuristic parts respectively. The huge observed speedup is explained by the fact that for the considered instances the rapidly found optimum dramatically reduces the number of expanded subproblems. A detailed description of the knapsack implementation with more experimental data can be found in [11].
Table 2 Performance results for knapsack problems sensitive to heuristics

Pr.   Time (s) on one core   Best time (s) on 16 cores   Speedup   Best $N_b/N_h$ ratio
1           1178.9                     0.46                2562           11/5
2           1748.5                     1.13                1556            8/8
3            464.44                    0.69                 673           14/2
Table 3 Running time (s) for pressure vessel design problem

Number of cores       T            N               t
Shared memory
  1                 55.71    12 × 10^6    4.72 × 10^-6
  2                 29.85    11 × 10^6    2.72 × 10^-6
  4                 11.7      6 × 10^6    1.77 × 10^-6
  8                 17.9     11 × 10^6    1.56 × 10^-6
Distributed memory
  1                 55.71    12 × 10^6    4.72 × 10^-6
  2                 27.86    12 × 10^6    2.35 × 10^-6
  4                 23.11    19 × 10^6    1.23 × 10^-6
  8                 11.59    16 × 10^6    0.7 × 10^-6
  16                 9.22    15 × 10^6    0.6 × 10^-6
  32                 4.45    16 × 10^6    0.3 × 10^-6
4.2 Constrained continuous optimization
A generic nonlinear programming problem (NLP) with $m$ inequality and $k$ equality constraints can be written as follows: $f(x) \to \min$, $g(x) \le 0$, $h(x) = 0$, where $f(x): \mathbb{R}^n \to \mathbb{R}$ is an objective function and $g(x): \mathbb{R}^n \to \mathbb{R}^m$ and $h(x): \mathbb{R}^n \to \mathbb{R}^k$ are the inequality and equality constraints respectively. With the help of the BNB-Solver framework we parallelized a sequential deterministic B&B continuous optimization method proposed in [9], adapted for NLP. To illustrate the speedup we selected the pressure vessel design problem [6], formulated as follows:
$$f(x) = 0.6224\,T_s R L + 1.7781\,T_h R^2 + 3.1661\,T_s^2 L + 19.84\,T_s^2 R \to \min,$$
$$g_1(x) = -T_s + 0.0193\,R \le 0,$$
$$g_2(x) = -T_h + 0.00954\,R \le 0,$$
$$g_3(x) = -\pi R^2 L - \tfrac{4}{3}\,\pi R^3 + 1296000 \le 0,$$
$$g_4(x) = L - 240 \le 0,$$
where $T_s$, $T_h$, $L$, $R$ are the four design variables. The modified covering algorithm found the optimal solution $(0.727591, 0.359649, 37.699020, 239.999961)$ with guaranteed optimality to a precision of $10^{-7}$. This solution yields the objective function value 5804.375611, which is notably lower than the best value 6059.946409 given in [6].
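A quick consistency check: at the reported point both thickness constraints are active,
$$T_s = 0.0193 \times 37.699020 \approx 0.727591, \qquad T_h = 0.00954 \times 37.699020 \approx 0.359649,$$
i.e. $g_1(x^*) = g_2(x^*) = 0$ up to the stated precision.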
Table 3 presents running times for the shared and distributed memory implementations. Unlike the examples from Table 1, this problem demonstrates search anomalies: the number of expanded subproblems varies significantly. Therefore besides the absolute time $T$ we also present the approximate number of expanded subproblems $N$ and the average time $t = T/N$ required to expand one subproblem. The distributed memory implementation scales somewhat better than the shared memory one, which indicates thread contention for shared resources. However, the number of expanded subproblems is smaller for the shared memory implementation due to the immediate propagation of incumbent values among threads.
5 Concluding remarks and future work
The paper describes a new C++ framework for implementing Branch-and-Bound, heuristic and hybrid optimization algorithms on shared and distributed memory platforms. The efficiency of the proposed approach is demonstrated on two examples: knapsack and nonlinear programming problems. Ongoing work focuses on improving the existing implementations, implementing new optimization problems and devising a systematic, mathematically justified approach to load balancing.
Acknowledgement This work was supported by Intel Corporation (Intel High Performance Competition award) and the Russian Foundation for Basic Research (projects 08-01-00619-a, 08-07-00072-a).
References
1. Alba E, Almeida F, Blesa M et al (2006) Efficient parallel LAN/WAN algorithms for optimization. The MALLBA project. Parallel Comput 32(5-6):415-440
2. Ananth G, Kumar V, Pardalos P (1993) Parallel processing of discrete optimization problems.
Encycl Microcomput 13:129-147
3. Casado L, Martinez J, Garcia I et al (2008) Branch-and-Bound interval global optimization on
shared memory multiprocessors. Optim Methods Softw 23(5):689-701
4. Crainic TG, Le Cun B, Roucairol C (2006) Parallel branch and bound algorithms. In: Parallel
Combinatorial Optimization, Chap 1, John Wiley & Sons, Hoboken, NJ, pp 1-28
5. Crainic TG, Toulouse M (2002) Parallel strategies for meta-heuristics. In: Glover F, Kochenberger G (eds) Handbook of Metaheuristics, Kluwer Academic Publishers, Dordrecht, pp 475-513
6. Coello Coello CA, Mezura-Montes E (2002) Constraint-handling in genetic algorithms
through the use of dominance-based tournament selection. Adv Eng Inf 16(3):193-203
7. Drepper U, Molnar I (2005) The Native POSIX Thread Library for Linux. http://people.redhat.com/drepper/nptl-design.pdf. Last access: 13 May 2009
8. Eckstein J, Phillips C, Hart W (2006) PEBBL 1.0 User Guide. RUTCOR Research Report RRR
19-2006. http://rutcor.rutgers.edu/pub/rrr/reports2006/19_2006.ps. Last access: 13 May 2009
9. Evtushenko Y (1971) Numerical methods for finding global extrema (case of a non-uniform mesh). USSR Comput Math Math Phys 11(6):38-54
10. Kellerer H, Pferschy U, Pisinger D (2004) Knapsack Problems. Springer, Berlin
11. Posypkin M, Sigal I (2008) A combined parallel algorithm for solving the knapsack problem. J Comput Syst Sci Int 47(4):543-551
12. Ralphs T (2006) Parallel branch and cut. In: Parallel Combinatorial Optimization. John Wiley
& Sons, Hoboken, NJ, pp 53-103
13. Snir M, Otto S, Huss-Lederman S, Walker D, Dongarra J (1996) MPI: The Complete Reference. MIT Press, Cambridge, MA
14. Tschoke S, Holthofer N (1995) A new parallel approach to the constrained two-dimensional
cutting stock problem. In: Parallel Algorithms for Irregularly Structured Problems, LNCS
980, pp 285-300, Springer, London