
PERFORMING DENSE LINEAR ALGEBRA COMPUTATIONS IN NATIVE MPI APPLICATIONS
INTRODUCTION
If you have ever tried to perform a dense matrix-multiply or a linear solve in an MPI application,
you have probably heard of ScaLAPACK, a widely-used Fortran 77 library for performing
distributed-memory linear-algebra computations. However, once you picked up the ScaLAPACK
manual (available online here), you were perhaps bewildered by terms like “process grids”, “two-dimensional block-cyclic distribution”, and “matrix descriptor”, and by the long sequence of input arguments to the subroutines. If you looked past these, you probably encountered a relative paucity of ready-to-use examples, particularly if you are a C or C++ programmer on Windows.
The goal of this post is to present a set of concise, self-contained C++ examples that demonstrate
the use of routines in ScaLAPACK and PBLAS to solve a few commonly-occurring problems in
dense linear algebra such as matrix-multiplication and the solution of symmetric and
unsymmetric linear systems.
Before diving into the examples themselves, it might be worth pointing out where they came
from. ScaLAPACK, as you’ve probably already realized, consists of a number of components
summarized in Figure 1. I won’t go into too many details about them except to point out that the
two components at the very bottom, namely BLAS (a set of Basic Linear Algebra Sub-programs)
and MPI effectively determine the performance of routines in the library. Using a BLAS
implementation that is tuned to the underlying hardware or an MPI library that is able to exploit
high-performance interconnects such as InfiniBand can have a dramatic impact on the
performance of a ScaLAPACK routine.
[Diagram: ScaLAPACK built on LAPACK and BLAS for local computation, and on PBLAS, BLACS and MPI for communication.]
Figure 1. Components of ScaLAPACK.
Now, many moons ago, I found myself constantly having to evaluate the effect of different BLAS
(say, a sequential version vs. one using multi-threaded parallelism within a single node) and MPI implementations on the performance of several ScaLAPACK functions. I therefore developed a set of short, self-contained benchmarks that I could quickly compile with different ScaLAPACK/BLAS and MPI implementations to evaluate their performance and scalability.
Over time, I started distributing my “bare-bones” examples purely for their instructional value
since they demonstrated basic use of ScaLAPACK routines. Finally, over several weekends I
upgraded the examples to use newer C++ features (making them more concise in the process),
factored out common code into its own library and switched the build mechanism from a
menagerie of Makefiles to a set of MSBUILD projects so they could be built inside Visual Studio.
In the next section, I briefly describe the examples and provide instructions for building them on Windows. After that, I describe the steps to run the benchmarks locally on your development machine or on an HPC cluster.
An important aspect that is not explicitly covered in the samples is the process of transforming
existing data into a two-dimensional block-cyclic distribution so it can be passed into a
ScaLAPACK routine. If there is sufficient interest, I will cover this issue in a follow-on post.
A TOUR OF THE SAMPLES
The samples consist of three components: a common library, the example programs that
demonstrate various dense linear algebra operations such as performing Cholesky or LU
decomposition on a distributed matrix, and a set of build files.
The common library in turn primarily consists of two classes: blacs_grid_t, which provides a convenient C++ wrapper around a two-dimensional BLACS process grid, and a distributed matrix type, block_cyclic_mat_t, which wraps a ScaLAPACK matrix descriptor object along with the associated local memory. My intent with these two classes is to demonstrate how a concise
C++ wrapper around the lower-level Fortran API can be used to encapsulate details such as
process grid management and implement fundamental operations on distributed matrices. For
example, the code for creating a two-dimensional square process grid is simply:
auto grid = std::make_shared<blacs_grid_t>();
and the code for creating an 𝑁 × 𝑁 block-cyclically distributed random matrix is:
auto a = block_cyclic_mat_t::random(grid, N, N);
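To give a sense of what such a wrapper involves, below is a minimal, illustrative sketch of a grid class in the same spirit. The class and member names are my own (the real blacs_grid_t in the samples may be organized differently), and the extern "C" declarations are hand-written; they correspond to the standard BLACS C interface routines that ship with Intel MKL. MPI is assumed to have been initialized by the application before the grid is constructed.

#include <cmath>

// Hand-written declarations for the BLACS C interface (provided by MKL's BLACS library).
extern "C" {
    void Cblacs_pinfo(int* mypnum, int* nprocs);
    void Cblacs_get(int icontxt, int what, int* val);
    void Cblacs_gridinit(int* icontxt, const char* order, int nprow, int npcol);
    void Cblacs_gridinfo(int icontxt, int* nprow, int* npcol, int* myrow, int* mycol);
    void Cblacs_gridexit(int icontxt);
}

// Illustrative wrapper: creates a process grid that is as close to square as
// possible for the available MPI processes and releases it in the destructor.
class grid_sketch_t {
public:
    grid_sketch_t()
    {
        int iam = 0, nprocs = 1;
        Cblacs_pinfo(&iam, &nprocs);

        // Choose the most "square" factorization nprow x npcol = nprocs.
        nprow_ = static_cast<int>(std::sqrt(static_cast<double>(nprocs)));
        while (nprocs % nprow_ != 0) --nprow_;
        npcol_ = nprocs / nprow_;

        Cblacs_get(-1, 0, &context_);                      // default system context
        Cblacs_gridinit(&context_, "Row-major", nprow_, npcol_);
        Cblacs_gridinfo(context_, &nprow_, &npcol_, &myrow_, &mycol_);
    }

    ~grid_sketch_t() { Cblacs_gridexit(context_); }

    int context() const { return context_; }
    int myrow()   const { return myrow_; }
    int mycol()   const { return mycol_; }

private:
    int context_, nprow_, npcol_, myrow_, mycol_;
};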
The example projects primarily illustrate the use of various ScaLAPACK routines such as matrix
multiplication and factorization. However, they also illustrate a few not-very-well known tricks
to effectively use auxiliary routines in ScaLAPACK to manipulate distributed matrices. For
instance, the Cholesky factorization sample creates a distributed 𝑁 × 𝑁 tri-diagonal matrix
arising from a uniform finite difference discretization of the one-dimensional Poisson equation:
\[
\mathbf{A} = \begin{pmatrix}
 2 & -1     &        &    \\
-1 &  2     & \ddots &    \\
   & \ddots & \ddots & -1 \\
   &        &   -1   &  2
\end{pmatrix}
\]
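One auxiliary routine that is handy for exactly this kind of setup is PDELSET, which assigns a single element of a distributed matrix on whichever process owns it. The sketch below shows how the Poisson matrix above could be assembled element by element; it is an illustration of the auxiliary routine rather than a copy of the sample code, which may build the matrix differently (and more efficiently).

// Hand-written declaration of the ScaLAPACK auxiliary routine:
// sets A(ia, ja) = alpha on the process that owns that element.
extern "C" void pdelset_(double* a, const int* ia, const int* ja,
                         const int* desca, const double* alpha);

// Fill an n x n distributed matrix (local storage a_local, descriptor desca)
// with the 1-D Poisson stencil: 2 on the diagonal, -1 on the off-diagonals.
// Every process executes the full loop; pdelset_ only writes locally owned entries.
void fill_poisson_1d(double* a_local, const int* desca, int n)
{
    const double two = 2.0, minus_one = -1.0;
    for (int i = 1; i <= n; ++i)            // ScaLAPACK uses 1-based global indices
    {
        pdelset_(a_local, &i, &i, desca, &two);
        if (i > 1)
        {
            int j = i - 1;
            pdelset_(a_local, &i, &j, desca, &minus_one);   // sub-diagonal
            pdelset_(a_local, &j, &i, desca, &minus_one);   // super-diagonal
        }
    }
}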
Similarly, the linear solve example illustrates not only the routines to solve an 𝑁 × 𝑁 unsymmetric
linear system of equations,
𝑨𝑥 = 𝑏
but also the steps to compute the norm of the residual with respect to the norm of the matrix:
\[
\varepsilon = \frac{\|\mathbf{A}x - b\|_{\infty}}{\|\mathbf{A}\|_{1} \times N}.
\]
For a well-conditioned linear system, the residual norm should be of the order of the double-precision machine epsilon.
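One way to obtain both norms is via the ScaLAPACK routine PDLANGE, after forming the residual with the PBLAS routine PDGEMV. The sketch below (with hand-written declarations and my own helper name) shows the general shape of such a computation; it is not a transcript of the sample code.

extern "C" {
    // PBLAS distributed matrix-vector product: y := alpha*op(A)*x + beta*y.
    void pdgemv_(const char* trans, const int* m, const int* n,
                 const double* alpha,
                 const double* a, const int* ia, const int* ja, const int* desca,
                 const double* x, const int* ix, const int* jx, const int* descx, const int* incx,
                 const double* beta,
                 double* y, const int* iy, const int* jy, const int* descy, const int* incy);

    // Norm of a distributed matrix: 'I' = infinity norm, '1' = one norm.
    double pdlange_(const char* norm, const int* m, const int* n,
                    const double* a, const int* ia, const int* ja, const int* desca,
                    double* work);
}

// Relative residual ||A*x - b||_inf / (||A||_1 * n).
// r_local initially holds a distributed copy of b and is overwritten with the residual;
// work must be sized as required by PDLANGE for the chosen norms.
double relative_residual(const double* a_local, const int* desca,
                         const double* x_local, const int* descx,
                         double*       r_local, const int* descr,
                         int n, double* work)
{
    const int    ione      = 1;
    const double one       = 1.0;
    const double minus_one = -1.0;

    // r := -A*x + r, i.e. r = b - A*x
    pdgemv_("N", &n, &n, &minus_one,
            a_local, &ione, &ione, desca,
            x_local, &ione, &ione, descx, &ione,
            &one, r_local, &ione, &ione, descr, &ione);

    // Depending on the ScaLAPACK implementation, the scalar returned by pdlange_
    // may only be defined on part of the grid; broadcast it if every rank needs it.
    double res_norm = pdlange_("I", &n, &ione, r_local, &ione, &ione, descr, work);
    double a_norm   = pdlange_("1", &n, &n,    a_local, &ione, &ione, desca, work);

    return res_norm / (a_norm * n);
}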
Finally, since one of the main reasons I developed these benchmark programs was to measure the performance of various ScaLAPACK routines, the examples output not just the absolute time to complete a certain matrix operation but also an estimate of the number of floating-point operations per second (commonly written as Flop/s). Given the peak floating-point performance
of the cluster (usually a constant factor times the clock rate in Hz times the number of processing
elements) this metric can be used to evaluate the efficiency of a particular operation. It is much
more meaningful to say “The Cholesky factorization routine achieved 90% theoretical peak
performance on my cluster” as opposed to “Cholesky factorization took 140 seconds on my cluster”.
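As a concrete illustration, a Cholesky factorization of an N × N matrix takes roughly N³/3 floating-point operations (LU takes roughly 2N³/3, and a matrix-matrix multiply 2N³), so converting a measured wall-clock time into a rate is straightforward; the helper below is mine, not part of the samples.

#include <chrono>
#include <cstdio>

// Convert the elapsed time of an N x N Cholesky factorization into GFlop/s,
// using the standard operation count of ~N^3/3.
void report_cholesky_rate(long long n, std::chrono::duration<double> elapsed)
{
    double flops  = static_cast<double>(n) * n * n / 3.0;
    double gflops = flops / elapsed.count() / 1.0e9;
    std::printf("N = %lld, time = %.3f s, rate = %.2f GFlop/s\n",
                n, elapsed.count(), gflops);
}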
The build files for each sample (compatible with the Visual Studio 2010, Visual Studio 2012 and
Visual Studio 2013 versions of MSBUILD) are divided into two components: a common
Build.Settings file that contains configuration information such as the path on your machine
containing the MKL and MPI libraries, and a per-sample project file listing its source files along with any dependencies. A top-level solution file (ScaLAPACKSamples.sln) is used to tie all
the components together. The build files are configured to generate only 64-bit executables, and I have further constrained them to generate only the “debug” versions so you can easily step through the examples in a (sequential or cluster) debugger.
BUILDING THE EXAMPLES
I have tried to keep the number of prerequisites and the process of building the examples as
simple and short as possible. Once you install all the necessary software and ensure that the
build configuration is correct, you can build and run the samples just by hitting ‘F5’ inside your
Visual Studio IDE.
STEP 1: INSTALL ALL PREREQUISITES
To build the samples, you must have the following prerequisites installed:
1. Visual Studio Professional 2010, 2012 or 2013. The samples use Visual Studio 2010
project and solution formats, C++11 features such as type inference, smart pointers and
lambda expressions, and only generate 64-bit (x64) executables. Therefore, older versions of Visual Studio such as Visual Studio 2008, or editions such as Visual Studio Express that do not support building 64-bit applications, are not supported. If you do not already
have Visual Studio 2010 Professional, Visual Studio 2012 Professional or Visual Studio
2013, you can get a 90-day evaluation version here.
2. Microsoft HPC Pack SDK.
a. If you are using Visual Studio 2012 or Visual Studio 2013, I recommend
downloading the 2012 SDK R2, Update 1 available here.
i. When upgrading from an older version of HPC Pack SDK 2012 R2, you can
download the HPC2012R2_Update1_x64.exe upgrade package.
ii. When installing on a machine without an existing HPC Pack SDK, you
need to download HPC2012R2_Update1_Full.zip.
iii. When installing on a machine with an older version of the HPC Pack SDK
(such as 2008), ensure you uninstall it first and then download
HPC2012R2_Update1_Full.zip.
b. If you are using Visual Studio 2010, I recommend downloading the 2008 R2 SDK
with Service Pack 3 available here.
3. Microsoft MS-MPI Redistributable Package.
a. If you downloaded the 2012 R2 version of the HPC Pack SDK, download Microsoft
MPI v5 here. Install both packages: msmpisdk.msi (which contains the header and
library files) and MSMpiSetup.exe (which contains the mpiexec and smpd
executables).
b. If you downloaded the 2008 R2 version of the HPC Pack SDK, download the
corresponding version of Microsoft MPI here.
4. Intel Composer XE 2013. The particular implementation of ScaLAPACK I chose is from
the Intel Math Kernel Library (MKL), which, besides being highly efficient, is also very convenient to use and works out of the box on Windows with Microsoft MPI. If you do not
have it, you can obtain a free 30-day evaluation version of Intel MKL here.
STEP 2: SETUP INCLUDE AND LIBRARY PATHS
As mentioned in the previous section, the samples come with a configuration file, Build.Settings, that contains the paths to the Microsoft MPI library and Intel MKL, and the set of libraries that the samples must be linked against.
Build.Settings contains the following configuration entries:
<PropertyGroup>
  <MPIInc>C:\Program Files (x86)\Microsoft SDKs\MPI\Include</MPIInc>
  <MPILibDir>C:\Program Files (x86)\Microsoft SDKs\MPI\Lib\x64</MPILibDir>
  <MKLLibDir>C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\lib\intel64</MKLLibDir>
  <MKLLibs>mkl_scalapack_lp64.lib;mkl_intel_lp64.lib;mkl_sequential.lib;mkl_core.lib;mkl_blacs_msmpi_lp64.lib</MKLLibs>
</PropertyGroup>
where MPIInc is the path on your system containing mpi.h from Microsoft MPI; MPILibDir is the path containing the 64-bit version of the MPI library, msmpi.lib; MKLLibDir is the path on your system containing the Intel MKL library files; and MKLLibs is the list of MKL libraries that the samples must be linked against.
In case you are curious to know how I selected the subset of Intel MKL libraries to link against,
I used the incredibly handy Intel link line advisor tool. The options I selected in the tool are
shown in the screenshot below:
Figure 2. Selecting a list of MKL libraries to link against.
STEP 3: BUILD THE EXAMPLES
Once you have set the configuration properties correctly (or have verified that the ones in the
provided sample correspond to valid paths on your system), double-click the solution file, ScaLAPACKSamples.sln, to open it inside Visual Studio. You should see something like the
following:
Figure 3. The sample solution inside Visual Studio 2010 (a) and Visual Studio 2012 (b).
Once the solution is loaded successfully, hit F7 or click on Build -> Build Solution in the IDE
to compile all the examples.
If all build settings are correct, you will see four executable files in the output directory of the
solution:
Figure 4. The executable files generated after successful compilation.
At this point, you are ready to run the benchmark programs, either locally, on an on-premises HPC cluster, or in Azure.
RUNNING THE EXAMPLES AND BENCHMARKING SCALAPACK ROUTINES
The generated executables can be run like any other MPI program. Figure 5 shows the Cholesky factorization benchmark running on my quad-core, 3.4 GHz Intel® Xeon workstation.
Figure 5. A sample run of the Cholesky factorization benchmark.
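For example, launching one of the benchmarks on four local processes looks like the following (the executable name is only a placeholder for whichever benchmark binary Step 3 produced):

mpiexec -n 4 cholesky_benchmark.exe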
You can just as easily deploy and run the benchmarks on an HPC cluster. The executables have all the MKL dependencies statically linked in, so you can either copy them to all the compute nodes on your cluster or to a common file share and run the MPI job from there. In Figure 6 I show the latter scenario, where the Cholesky example is copied to the C:\tmp directory on the head node of our on-premises cluster (named Hydra), and an MPI job is launched on 32 compute cores (each compute node has eight cores, so the job will run on four nodes).
Figure 6. Running the Cholesky factorization example on an HPC cluster.
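For reference, on an HPC Pack cluster the same run can be submitted from the head node with the job command-line utility; the UNC path below assumes the C:\tmp directory has been shared as \\Hydra\tmp, and the executable name is again just a placeholder:

job submit /numcores:32 mpiexec \\Hydra\tmp\cholesky_benchmark.exe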
CONCLUSIONS
Numerical libraries such as ScaLAPACK are written by experts in the respective fields and have
been continuously improved over the past several years or even decades. Using these libraries, as opposed to, say, rolling your own routines, ensures that you will get accurate results with high performance. However, these legacy libraries often have a somewhat steep barrier to entry for
novices. In this blog post, I demonstrated a set of simple, self-contained example programs in
C++ that illustrate the use of a few representative routines in ScaLAPACK on Windows HPC
Server with MSMPI.
In subsequent posts, I hope to demonstrate the process of transforming existing data into the
format required by ScaLAPACK (the two-dimensional block-cyclic distribution) as well as more
detailed walkthroughs of deploying and running the benchmarks on Azure HPC Clusters.
Download