Hybrid Parallelization of Standard Full Tableau Simplex Method with MPI and OpenMP

Technological Educational Institute of Athens
Department of Informatics
Hybrid Parallelization of Standard Full
Tableau Simplex Method with
MPI and OpenMP
Basilis Mamalis (1), Marios Perlitis (2)
(1) Technological Educational Institute of Athens
(2) Democritus University of Thrace
October 2014
Background – LP & Simplex
 Linear programming is known as one of the most important and well-studied optimization problems.
 The simplex method has been successfully used for solving linear programming problems for many years.
 Parallel approaches have also extensively been
studied due to the intensive computations required.
 Most research (with regard to sequential simplex
method) has been focused on the revised simplex
method since it takes advantage of the sparsity that is
inherent in most linear programming applications.
 The standard simplex method is mostly suitable for dense linear problems.
Background – LP & Simplex
The mathematical formulation of the linear programming
problem, in standard form:
Minimize    z = c^T x
s.t.        Ax = b,  x ≥ 0

where:
  A is an m×n matrix,
  x is an n-dimensional variable vector,
  c is the price vector, and
  b is the right-hand side vector of the constraints (m-dimensional).
(*) Dense and sparse linear problems
(*) The standard method and the revised method
Background – The Standard Simplex Method
The standard full tableau simplex method:
A tableau is an (m+1)x(m+n+1) matrix of the form:
          x1    x2   ...   xn  | xn+1  xn+2  ...  xn+m |  z  | RHS
  z      -c1   -c2   ...  -cn  |   0     0   ...    0  |  1  |  0
  xn+1   a11   a12   ...  a1n  |   1     0   ...    0  |  0  |  b1
  xn+2   a21   a22   ...  a2n  |   0     1   ...    0  |  0  |  b2
  ...    ...   ...   ...  ...  |  ...   ...  ...   ... | ... |  ...
  xn+m   am1   am2   ...  amn  |   0     0   ...    1  |  0  |  bm
(*) more efficient for dense linear problems
(*) it can be easily converted to a distributed version
Related Work (in simplex parallelization)
 Parallel approaches have extensively been studied due to the intensive computations required (especially for large problems). Many of the known parallel (distributed memory) machines have been used in the past (Intel iPSC, MasPar, CM-2, etc.).
 Several attempts have been made to parallelize the (sequentially much faster) revised method; however, high speed-up values cannot be achieved and most of them do not scale well when a distributed environment is involved.
 On the contrary, the standard method can be parallelized more easily; naturally, many attempts have been made in this direction, with very satisfactory results/speed-up and good/high scalability.
 In the last decade, many attempts have naturally focused on the use of tightly-coupled or shared-memory hardware structures, as well as on cluster computer systems, achieving very good speedup and efficiency values and high scalability.
Related Work (in simplex parallelization)
Many typical and generally efficient implementations exist for low/medium-cost clusters; they usually differ in the data distribution scheme (column-based or row-based), e.g. Hall et al., Badr et al., Yarmish et al.
Particularly notable:
 The work of Yarmish, G., and Van Slyke, R. 2009. A Distributed, Scaleable Simplex Method. Journal of Supercomputing, Springer, 49(3), 373-381.
 Some alternative very promising efforts have been made on
distributed-memory many-core implementations, based on the block
angular structure (or decomposition) of the initially transformed
problems – e.g. Hall et al. 2013 to solve large-scale stochastic LP
problems, and K.K. Sivaramakrishnan 2010.
The question remains: is it worth using distributed (shared-nothing) hardware platforms to parallelize simplex? (Since the revised method cannot scale well and the standard method is much slower, any possible speed-up may not be enough.)
Background (MPI & OpenMP)
 MPI (Message Passing Interface) is the dominant approach for
the distributed-memory (message-passing) model.
 OpenMP emerges as the dominant high-level approach for
shared memory with threads.
 Recently, the hybrid model has begun to attract more attention,
for at least two reasons:
 It is relatively easy to pick a language/library instantiation of
the hybrid model.
 Scalable parallel computers now strongly encourage this
model. The fastest machines these days almost all consist of
multi-core nodes connected by a high speed network.
 The idea of using OpenMP threads to exploit the multiple cores per node, while using MPI to communicate among the nodes, is the best-known hybrid approach.
Background (MPI & OpenMP)
 In the last two years, however, another strong alternative has evolved: the MPI 3.0 Shared Memory support mechanism, which significantly improves the previously existing Remote Memory Access utilities of MPI towards optimized operation inside a multicore node.
 As analyzed in the literature, both of the above hybrid models (MPI+OpenMP, MPI+MPI 3.0 Shared Memory) have their pros and cons, and it is not straightforward that they outperform pure MPI implementations in all cases.
 Also, although OpenMP is still regarded as the most efficient hybrid approach, its superiority over the MPI 3.0 Shared Memory approach (especially when the processors/cores need to access shared data only for reading, i.e. without synchronization needs) is not straightforward either.
In Our Work …
 We focus on the standard full tableau simplex method, and we present relevant implementations for all hybrid schemes 
(a) MPI + OpenMP,
(b) MPI + MPI 3.0 Shared Memory, and
(c) pure MPI parallelization (one MPI process on each core).
 We experimentally evaluate our approaches (over a hybrid
environment of up to 4 processors / 16 cores) 
(a) among each other
(b) to the approach presented in [23] (Yarmish et al)
 In all cases the hybrid OpenMP-based parallelization scheme performs considerably better than the other two schemes.
 All schemes lead to high speed-up and efficiency values. The corresponding values for the OpenMP-based scheme are better than the ones achieved by Yarmish et al. [23].
The Standard method – Main steps
Step 0: Initialization: start with a feasible basic solution and construct
the corresponding simplex tableau.
Step 1: Choice of entering variable: find the winning column (the one having the largest negative coefficient of the objective function – entering variable).
Step 2: Choice of leaving variable: find the winning row (apply the
minimum ratio test to the elements of the winning column – choose
the row number with the minimum ratio – leaving variable).
Step 3: Pivoting (involves the most calculations): construct the next
simplex tableau by performing pivoting in the previous tableau rows
based on the new pivot row found in the previous step.
Step 4: Repeat the above steps until the best solution is found or the
problem gets unbounded.
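(*) For reference, the following minimal C sketch (dense arrays, hypothetical names and sizes; initialization/Phase I and degeneracy handling omitted) illustrates how one iteration of Steps 1-3 can be coded on the full tableau:

    #include <float.h>

    #define M 100                 /* number of constraints (rows)           */
    #define N 250                 /* number of decision + slack variables   */

    /* tab[0][*]    : objective row (entries -c_j, as in the tableau above);
       tab[1..M][*] : constraint rows; column N holds the right-hand side.  */
    double tab[M + 1][N + 1];

    /* One simplex iteration: returns 1 to continue, 0 at optimum,
       -1 if the problem is unbounded.                                      */
    int simplex_iteration(void)
    {
        /* Step 1: entering variable = most negative objective-row entry.   */
        int col = -1;
        double best = -1e-9;
        for (int j = 0; j < N; j++)
            if (tab[0][j] < best) { best = tab[0][j]; col = j; }
        if (col < 0) return 0;                      /* optimal              */

        /* Step 2: leaving variable = minimum ratio test over the positive
           entries of the entering column.                                  */
        int row = -1;
        double min_ratio = DBL_MAX;
        for (int i = 1; i <= M; i++)
            if (tab[i][col] > 1e-9) {
                double ratio = tab[i][N] / tab[i][col];
                if (ratio < min_ratio) { min_ratio = ratio; row = i; }
            }
        if (row < 0) return -1;                     /* unbounded            */

        /* Step 3: pivoting - scale the pivot row, then eliminate the
           entering column from all other rows (objective row included).    */
        double piv = tab[row][col];
        for (int j = 0; j <= N; j++) tab[row][j] /= piv;
        for (int i = 0; i <= M; i++) {
            if (i == row) continue;
            double f = tab[i][col];
            for (int j = 0; j <= N; j++) tab[i][j] -= f * tab[row][j];
        }
        return 1;
    }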
Simplex Parallelization – Alternatives
1. Column-based distribution
 sharing the simplex table to all processors by columns
 the most popular and widely used - relatively direct parallelization
 all the computation parts except step 2 of the basic algorithm are fully
parallelized (even the first one which is of less computation)
 theoretically regarded as the most effective one in the general case
2. Row-based distribution
 sharing the simplex table to all processors by rows
 try to take advantage of some probable inherent parallelism in the
computations of step 2 (is it always an advantage or not?)
 an extra broadcast of a row is needed instead of an extra broadcast of
a column that is required in the column distribution scheme
 the parallelization of step 1 becomes more problematic and it’s
usually preferable to execute the corresponding task on P0
11
Column vs. Row-based distribution experimentally
Results drawn by a previous work of ours:
Mamalis, B., Pantziou, G., Dimitropoulos, G., and Kremmydas, D. 2013.
Highly Scalable Parallelization of Standard Simplex Method on a Myrinet
Connected Cluster Platform. ACTA Intl. Journal of Computers and
Applications, 35(4), 152-161.
 For large problems the speed-up given by the column scheme is
clearly better than the one of the row scheme (except the case
when the # of rows is much bigger than the # of columns)
 For medium-sized problems the behaviour of the two schemes is similar; however, the differences are considerably smaller.
 For small problems the speed-up given by the row scheme is
rather a little better than the one of the column scheme in the
general case.
 In the general (most likely to happen) case the column-based
scheme should be the most preferable.
The Parallel Column-based algorithm
1. The simplex table is shared among the processors by columns. Also,
the right-hand constraints vector is broadcasted to all processors.
2. Each processor searches in its local part and chooses the locally best candidate column (as entering variable) – the one with the largest negative coefficient in the objective function part.
3. The local results are gathered in parallel and the winning processor (with the largest negative coefficient) is found and becomes globally known.
4. The processor with the winning column (entering variable) computes
the leaving variable (winning row) using the minimum ratio test over
all the winning column’s elements.
5. The same (winning) processor then broadcasts the winning column
as well as the winning row’s id to all processors.
6. Each processor performs (in parallel) on its own part (columns) of the table all the calculations required for the global row pivoting, based on the pivot data received during the previous step.
7. The above steps are repeated until the best solution is found or the
problem gets unbounded.
A. Pure MPI implementation
 One MPI process on each core
 Without any MPI-2/MPI-3 RMA or SM support:
 The well-known MPI collective communication functions
 MPI_Bcast,
 MPI_Scatter/Scatterv,
 MPI_Gather/Gatherv,
 MPI_Reduce/Allreduce
(with or without MAXLOC/MINLOC operators)
were appropriately used for the efficient implementation
of the data communication required by steps 1, 3 and 5
of the parallel algorithm.
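(*) As an illustration (not the authors' actual code), the skeleton below sketches how these collectives can be combined into one iteration of the column-based algorithm; all names are hypothetical and the pivot arithmetic itself is elided.

    #include <mpi.h>

    /* One iteration of the column-based scheme, one MPI process per core.
       local_tab: this process' block of tableau columns (column-major,
       m+1 entries per column); pivot_col: buffer of m+1 doubles.           */
    void parallel_iteration(double *local_tab, int local_cols, int m,
                            double *pivot_col, int rank)
    {
        /* Step 2: local choice of the entering variable (most negative
           objective-row coefficient among the locally stored columns).     */
        struct { double val; int rank; } loc, glob;
        int best_j = -1;
        loc.val = 0.0;  loc.rank = rank;
        for (int j = 0; j < local_cols; j++) {
            double cj = local_tab[j * (m + 1)];     /* objective-row entry  */
            if (cj < loc.val) { loc.val = cj; best_j = j; }
        }

        /* Step 3: one Allreduce with MAXLOC makes the winning process
           (largest negative coefficient) globally known.                   */
        loc.val = -loc.val;
        MPI_Allreduce(&loc, &glob, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                      MPI_COMM_WORLD);

        /* Steps 4-5: the winner runs the minimum ratio test on its own
           column, then broadcasts the pivot column and the pivot row id.   */
        int pivot_row = -1;
        if (rank == glob.rank) {
            /* ... minimum ratio test over column best_j; fill pivot_col[]
               and pivot_row ...                                            */
        }
        MPI_Bcast(pivot_col, m + 1, MPI_DOUBLE, glob.rank, MPI_COMM_WORLD);
        MPI_Bcast(&pivot_row, 1, MPI_INT, glob.rank, MPI_COMM_WORLD);

        /* Step 6: every process pivots its own columns using pivot_col
           and pivot_row (omitted).                                          */
    }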
B. Hybrid MPI+OpenMP implementation
 Appropriately built parallel for constructs were
used for the efficient thread-based parallelization of
the loops implied by steps 2, 4 and 6.
 With regard to the parallelization of steps 2 (in
cooperation with 3) and 4 (which both involve a
reduction operation), we used the newly added
min/max reduction operators of OpenMP API
specification for C/C++.
(*) The min/max operators were not supported in the OpenMP API specification for C/C++ up to version 3.0 (they were supported only for Fortran); they were added in version 3.1 (in our work we’ve used version 4.0).
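(*) A minimal sketch of such a reduction (hypothetical names; the index of the winning column still has to be recovered, e.g. in a second pass over the objective row):

    #include <omp.h>

    /* Step 2 within one node: value of the most negative objective-row
       coefficient over the locally stored columns, using the max reduction
       operator available for C/C++ since OpenMP 3.1.                       */
    double most_negative(const double *obj_row, int local_cols)
    {
        double best = 0.0;             /* reduce over -coefficient ("max")  */
        #pragma omp parallel for reduction(max:best) schedule(static)
        for (int j = 0; j < local_cols; j++) {
            double v = -obj_row[j];    /* positive iff coefficient < 0      */
            if (v > best) best = v;
        }
        return -best;                  /* 0.0 means no candidate column     */
    }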
B. Hybrid MPI+OpenMP implementation
 With regard to the parallelization of step 6, in order to
achieve even distribution of computations to the
working threads (given that the computational costs of
the main loop iterations cannot be regarded a-priori
equivalent) we’ve used
 collapse-based nested parallelism, with
 dynamic scheduling policy.
 Beyond the OpenMP-based parallelization inside each
node, the collective communication functions of MPI
(MPI_Scatter, MPI_Gather, MPI_Bcast, MPI_Reduce)
were also used here for the communication between
the network connected nodes as in pure MPI
implementation.
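(*) The pivoting loop of step 6 could then look roughly as follows (a sketch assuming column-major local storage and illustrative names); collapse(2) merges the column/row nest into a single iteration space that dynamic scheduling can balance across the threads:

    #include <omp.h>

    /* Step 6 within one node: pivot the locally owned columns.
       tab      : local columns, column-major, m+1 entries per column
       pivot_col: the broadcast entering column (m+1 entries, unscaled)
       r        : pivot row index                                           */
    void pivot_local(double *tab, int local_cols, int m,
                     const double *pivot_col, int r)
    {
        double piv = pivot_col[r];
        #pragma omp parallel
        {
            /* Phase 1: scale the pivot row within the local columns.       */
            #pragma omp for schedule(static)
            for (int j = 0; j < local_cols; j++)
                tab[j * (m + 1) + r] /= piv;
            /* implicit barrier here keeps the two phases ordered           */

            /* Phase 2: eliminate the entering column from all other rows.  */
            #pragma omp for collapse(2) schedule(dynamic, 1024)
            for (int j = 0; j < local_cols; j++)
                for (int i = 0; i <= m; i++)
                    if (i != r)
                        tab[j * (m + 1) + i] -=
                            pivot_col[i] * tab[j * (m + 1) + r];
        }
    }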
C. Hybrid MPI+MPI Shared Memory implementation
 The shared memory support functions of MPI 3.0
 mainly: MPI_Comm_split_type, MPI_Win_allocate_
shared, MPI_Win_shared_query, MPI_Get/Put
 and the synchronization primitives MPI_Win_fence,
MPI_Win_lock/unlock and MPI_Accumulate
were used for the efficient implementation of all the data
communication required by the steps of the parallel
algorithm over the multiple cores of each node.
 Beyond the MPI 3.0 SM-based parallelization inside
each node, the collective communication functions of
MPI (MPI_Bcast, MPI_Scatter, etc.) were also used here
for the communication between the network connected
nodes as in pure MPI implementation.
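(*) The sketch below shows the typical usage pattern of these MPI 3.0 primitives for placing a buffer in node-shared memory (error checks omitted; names are illustrative, not the authors' actual code):

    #include <mpi.h>

    /* Allocate n doubles in a node-shared window; every process on the
       same node obtains a direct load/store pointer to rank 0's segment.   */
    double *alloc_node_shared(MPI_Comm world, int n,
                              MPI_Comm *node_comm, MPI_Win *win)
    {
        int node_rank;
        double *base = NULL;

        /* Split the world communicator into one communicator per node.     */
        MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, node_comm);
        MPI_Comm_rank(*node_comm, &node_rank);

        /* Only the node-local rank 0 contributes memory to the window.     */
        MPI_Aint bytes = (node_rank == 0) ? (MPI_Aint)n * sizeof(double) : 0;
        MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                                *node_comm, &base, win);

        /* The other ranks query rank 0's segment to get a usable pointer.  */
        if (node_rank != 0) {
            MPI_Aint qbytes; int disp;
            MPI_Win_shared_query(*win, 0, &qbytes, &disp, &base);
        }

        /* Delimit epochs of direct load/store access, e.g. with a fence.   */
        MPI_Win_fence(0, *win);
        return base;
    }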
Experimental Results
 The three outlined different parallelization schemes have
been implemented with use of the MPI 3.0 message
passing library and OpenMP 4.0 API, and they have
been extensively tested over a powerful hybrid parallel
environment (distributed memory, multicore nodes).
 Our test environment consists of up to 4 quad-core
processors (making a total of 16 cores) with 4GB RAM
each, connected via a network interface with up to 1Gbps
communication speed.
 The computing components of the above test
environment were mainly available and accessed through
the Okeanos Cyclades cloud computing services and
local infrastructure in T.E.I. of Athens.
Experimental Results
We’ve performed three kinds of experiments:
 experiments that compare our three different schemes to the
corresponding one of Yarmish et al [23]
 experiments that further compare the two hybrid schemes
(MPI+OpenMP vs. MPI+MPI 3.0 Shared Memory), over more
realistic (NETLIB) problems
 experiments wrt the scalability of the MPI+OpenMP scheme
(over larger NETLIB problems)
 Test problems either taken from NETLIB or random ones
(*) We measure response times (T), speed-up (Sp) and efficiency (Ep) values for a varying number of processors/cores (1…16).
(*) Speed-up measure for P processors: Sp = T1 / TP
(*) Efficiency measure for P processors: Ep = Sp / P
Comparing to [23] (Yarmish et al)
Table 2. Comparing to the implementation of Yarmish et al
 P          |        Yarmish et al           |     Hybrid (MPI + OpenMP)
 (#cores)   | Time/iter  Sp=T1/TP  Ep=Sp/P   | Time/iter  Sp=T1/TP  Ep=Sp/P
 1          |  0.6133      1.00    100.0%    |  0.2279      1.00    100.0%
 2          |  0.3115      1.97     98.4%    |  0.1149      1.98     99.2%
 3          |  0.2172      2.82     94.1%    |  0.0766      2.97     99.1%
 4          |  0.1550      3.96     98.9%    |  0.0575      3.96     99.0%
 5          |  0.1311      4.68     93.5%    |  0.0462      4.93     98.7%
 6          |  0.1066      5.75     95.9%    |  0.0386      5.91     98.5%
 7          |  0.0913      6.72     96.0%    |  0.0332      6.87     98.2%
 8          |     –          –        –      |  0.0290      7.85     98.1%
Table 3. Comparing the two MPI-based implementations
 P          |   Hybrid (MPI + MPI 3.0 SM)    |          Pure MPI
 (#cores)   | Time/iter  Sp=T1/TP  Ep=Sp/P   | Time/iter  Sp=T1/TP  Ep=Sp/P
 1          |  0.2279      1.00    100.0%    |  0.2279      1.00    100.0%
 2          |  0.1178      1.93     96.7%    |  0.1257      1.81     90.7%
 3          |  0.0789      2.89     96.3%    |  0.0846      2.70     89.8%
 4          |  0.0592      3.85     96.2%    |  0.0630      3.62     90.4%
 5          |  0.0478      4.77     95.4%    |  0.0513      4.45     88.9%
 6          |  0.0400      5.70     95.0%    |  0.0425      5.36     89.4%
 7          |  0.0347      6.57     93.9%    |  0.0363      6.28     89.7%
 8          |  0.0305      7.48     93.5%    |  0.0317      7.20     90.0%
Comparing to [23] (Yarmish et al)
[Figure: speed-up vs. number of processors (p = 2 … 8) – hybrid MPI+OpenMP vs. Yarmish et al.]
Comparing the three schemes
[Figure: speed-up vs. number of processors (p = 2 … 8) – MPI+OpenMP vs. MPI+MPI 3.0 SM vs. pure MPI.]
Discussion…
 The achieved speed-up values of our hybrid MPI+OpenMP
implementation are better than the ones in [23] in all cases.
 To be more precise, they are slightly better for 2 and 4 processors/cores (which are powers of two) and clearly better for 3, 5, 6 and 7 processors/cores.
 Furthermore, observing the efficiency values in the last column, one can easily notice the high scalability (higher and smoother than in [23]) achieved by our implementation.
 Note also that the achieved speedup remains very high
(close to the maximum / speedup = 7.85, efficiency = 98.1%)
even for 8 processors/cores.
(*) The implementation of [23] has been compared to MINOS, a
well-known implementation of the revised simplex method, and
it has been shown to be highly competitive, even for very low
density problems.
Discussion… – [continued]
 The corresponding speed-up and efficiency values of our hybrid
MPI+MPI 3.0 Shared Memory and pure MPI implementations
are very satisfactory in absolute values.
 However, compared to [23] and to the MPI+OpenMP implementation, the measurements for the hybrid MPI+MPI 3.0 Shared Memory approach are slightly worse, whereas the measurements for the pure MPI approach are clearly worse in all cases.
 The main disadvantage of the pure MPI approach is that it follows the pure process model (using IPC mechanisms for interaction); it naturally cannot match shared-memory access speed, especially for large-scale data-sharing needs.
 The main disadvantage of the MPI+MPI 3.0 SM approach is that its shared window allocation mechanism still cannot scale equally well for large data-sharing needs and numbers of cores (as shown more clearly later). However, it obviously offers a substantial improvement over pure MPI.
Comparing the two hybrid schemes [NETLIB problems]
 Linear               |         MPI+OpenMP            |        MPI+MPI 3.0 SM
 Problems             | 2x4=8 cores  | 4x4=16 cores   | 2x4=8 cores  | 4x4=16 cores
                      |  Sp     Ep   |   Sp     Ep    |  Sp     Ep   |   Sp     Ep
 SC50A (50x48)        | 4.89   61.1% |  6.50   40.6%  | 4.85   60.6% |  6.40   40.0%
 SHARE2B (96x79)      | 5.59   69.8% |  8.24   51.5%  | 5.49   68.6% |  8.00   50.0%
 SC105 (105x103)      | 5.81   72.6% |  8.78   54.9%  | 5.69   71.1% |  8.50   53.1%
 BRANDY (220x249)     | 7.00   87.5% | 12.17   76.1%  | 6.60   82.5% | 10.63   66.5%
 AGG (488x163)        | 6.76   84.5% | 12.11   75.7%  | 6.38   79.8% | 11.27   70.4%
 AGG2 (516x302)       | 7.00   87.5% | 12.89   80.5%  | 6.58   82.3% | 11.99   74.9%
 BANDM (305x472)      | 7.42   92.8% | 13.61   85.0%  | 6.82   85.3% | 12.03   75.2%
 SCFXM3 (990x1371)    | 7.60   95.0% | 14.38   89.9%  | 7.16   89.5% | 13.05   81.5%
Comparing the two hybrid schemes [NETLIB problems]
 The achieved speed-up values of MPI+OpenMP are
better than the ones of the hybrid MPI+MPI 3.0 SM
implementation, in all cases.
 Moreover, the larger the size of the linear problem the
larger the difference in favour of the MPI+OpenMP
implementation.
 For linear problems of small size (e.g. the first two
problems) the corresponding measurements are
almost the same (slightly better for the
MPI+OpenMP approach), whereas
 For problems of larger size the difference is quite
clear (e.g. the last two problems).
Comparing the two hybrid schemes [NETLIB problems]
 Overall, we can say that the shared window allocation
mechanism of MPI 3.0 offers a very good alternative
(with almost equivalent results to the MPI+OpenMP
approach) for shared memory parallelization when the
shared data are of relatively small/medium scale.
 However, it cannot scale equally well (for large windows and large numbers of cores) due to internal protocol limitations and management costs, especially when some kind of synchronization is required.
 As it's also analyzed in [17], the performance of MPI
3.0 Shared Memory can be very close to MPI+OpenMP,
even for large scale shared data, if it's for simple sharing
(e.g. mainly for reading).
Scalability of MPI+OpenMP [large NETLIB problems]
 Linear                 |              Speed-up & Efficiency / MPI+OpenMP
 Problems               |  2x1 cores    |  2x2 cores   |  2x4 cores   |  4x4 cores
                        |   Sp     Ep   |  Sp     Ep   |  Sp     Ep   |   Sp     Ep
 FIT2P (3000x13525)     | 1.977  98.9%  | 3.94   98.5% | 7.80   97.5% | 15.24  95.3%
 80BAU3B (2263x9799)    | 1.969  98.5%  | 3.91   97.8% | 7.72   96.5% | 14.92  93.3%
 QAP15 (6330x22275)     | 1.963  98.2%  | 3.89   97.3% | 7.62   95.3% | 14.47  90.5%
 MAROS-R7 (3136x9408)   | 1.957  97.9%  | 3.87   96.8% | 7.54   94.3% | 14.12  88.3%
 QAP12 (3192x8856)      | 1.953  97.7%  | 3.86   96.5% | 7.50   93.8% | 13.97  87.3%
 DFL001 (6071x12230)    | 1.945  97.3%  | 3.85   96.3% | 7.50   93.8% | 14.04  87.8%
 GREENBEA (2392x5405)   | 1.949  97.5%  | 3.84   96.0% | 7.40   92.5% | 13.59  84.9%
 STOCFOR3 (16675x15695) | 1.925  96.3%  | 3.79   94.8% | 7.23   90.4% | 12.80  80.0%
Discussion…
 The efficiency values decrease as the number of processors increases. However, this decrease is quite slow, and both the speedup and the efficiency remain high (no less than 80%) even for 16 processors/cores, in all cases.
 Moreover, particularly high efficiency values are
achieved for all the high aspect ratio problems (e.g. see
the values for the first three problems where the efficiency
even for 16 processors/cores is over 90%).
 In the case of 16 cores (4x4 network connected nodes)
the communication overhead starts to be quite significant;
however as it is shown in [11] the higher the aspect ratio
of the linear problem the better the performance of the
column distribution scheme with regard to the total
communication overhead.
Conclusion…
A highly scalable parallel implementation framework of the
standard full tableau simplex method on a hybrid (modern
hardware) platform has been presented and evaluated throughout
the paper, in terms of typical performance measures.
 MPI+OpenMP: High scalability, highly appropriate for
modern supercomputers and large-scale cluster
architectures
 MPI 3.0 Shared Memory performance very close to
OpenMP for simple parallelization needs
 Decreased performance (though still quite satisfactory) for large-scale parallelization and increased synchronization needs
End of Time…
Thank you for your attention!
Any questions?