PERFORMING DENSE LINEAR ALGEBRA COMPUTATIONS IN NATIVE MPI APPLICATIONS

INTRODUCTION

If you have ever tried to perform a dense matrix multiply or a linear solve in an MPI application, you have probably heard of ScaLAPACK, a widely used Fortran 77 library for distributed-memory linear algebra computations. However, once you picked up the ScaLAPACK manual (available online here), you were perhaps bewildered by terms like “process grids”, “two-dimensional block-cyclic distribution” and “matrix descriptor”, and by the long sequences of input arguments to the subroutines. If you looked past these, you probably encountered a relative paucity of ready-to-use examples, particularly if you are a C or C++ programmer on Windows.

The goal of this post is to present a set of concise, self-contained C++ examples that demonstrate the use of routines in ScaLAPACK and PBLAS to solve a few commonly occurring problems in dense linear algebra, such as matrix multiplication and the solution of symmetric and unsymmetric linear systems.

Before diving into the examples themselves, it might be worth pointing out where they came from. ScaLAPACK, as you have probably already realized, consists of a number of components, summarized in Figure 1. I won’t go into too many details about them except to point out that the two components at the very bottom, namely BLAS (a set of Basic Linear Algebra Subprograms) and MPI, effectively determine the performance of routines in the library. Using a BLAS implementation that is tuned to the underlying hardware, or an MPI library that is able to exploit high-performance interconnects such as InfiniBand, can have a dramatic impact on the performance of a ScaLAPACK routine.

Figure 1. Components of ScaLAPACK: ScaLAPACK is layered on top of PBLAS and LAPACK; local computation is carried out by LAPACK and BLAS, while communication goes through BLACS, which is built on MPI.

Now, many moons ago, I found myself constantly having to evaluate the effect of different BLAS (say, a sequential version vs. one using multi-threaded parallelism within a single node) and MPI implementations on the performance of several ScaLAPACK functions. I therefore developed a set of short, self-contained benchmarks that I could quickly compile with different ScaLAPACK/BLAS and MPI implementations to evaluate their performance and scalability. Over time, I started distributing my “bare-bones” examples purely for their instructional value, since they demonstrated basic use of ScaLAPACK routines. Finally, over several weekends I upgraded the examples to use newer C++ features (making them more concise in the process), factored out common code into its own library and switched the build mechanism from a menagerie of Makefiles to a set of MSBuild projects so they could be built inside Visual Studio.

In the next section, I briefly describe the examples and provide instructions for building them on Windows. After that, I describe the steps to run the benchmarks locally on your development machine or on an HPC cluster. An important aspect that is not explicitly covered in the samples is the process of transforming existing data into a two-dimensional block-cyclic distribution so it can be passed into a ScaLAPACK routine. If there is sufficient interest, I will cover this issue in a follow-on post.
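In the meantime, for readers who have never met the distribution before, the following sketch (illustrative only, and not part of the samples) shows the standard rule by which a single 0-based global index is assigned to a process coordinate and a local index, assuming the first block is owned by process coordinate 0. ScaLAPACK applies the same rule independently to the rows and the columns of a matrix, which is what makes the distribution “two-dimensional”.

#include <cstdio>

// Illustrative only: the standard block-cyclic mapping for one dimension,
// assuming 0-based indices and the first block owned by process coordinate 0.
// Rows use (mb, nprow); columns use (nb, npcol) with the identical rule.
struct owner_t { int proc; int local; };

owner_t map_block_cyclic(int g, int mb, int nprow)
{
    owner_t o;
    o.proc  = (g / mb) % nprow;                    // owning process coordinate
    o.local = (g / (mb * nprow)) * mb + (g % mb);  // 0-based index within that process
    return o;
}

int main()
{
    const int mb = 2, nprow = 3;  // hypothetical block size and number of process rows
    for (int g = 0; g < 10; ++g)
    {
        owner_t o = map_block_cyclic(g, mb, nprow);
        std::printf("global %d -> process %d, local %d\n", g, o.proc, o.local);
    }
    return 0;
}

With a block size of 2 and three process rows, for example, global rows 0–1 land on process 0, rows 2–3 on process 1, rows 4–5 on process 2, rows 6–7 back on process 0, and so on.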
A TOUR OF THE SAMPLES

The samples consist of three components: a common library, the example programs that demonstrate various dense linear algebra operations (such as performing a Cholesky or LU decomposition of a distributed matrix) and a set of build files.

The common library in turn consists primarily of two classes: blacs_grid_t, which provides a convenient C++ wrapper around a two-dimensional BLACS process grid, and a distributed matrix type, block_cyclic_mat_t, which wraps a ScaLAPACK matrix descriptor along with the associated local memory. My intent with these two classes is to demonstrate how a concise C++ wrapper around the lower-level Fortran API can encapsulate details such as process grid management and implement fundamental operations on distributed matrices. For example, the code for creating a two-dimensional square process grid is simply:

auto grid = std::make_shared<blacs_grid_t>();

and the code for creating an N × N block-cyclically distributed random matrix is:

auto a = block_cyclic_mat_t::random(grid, N, N);

The example projects primarily illustrate the use of various ScaLAPACK routines for operations such as matrix multiplication and factorization. However, they also illustrate a few not-so-well-known tricks for using auxiliary routines in ScaLAPACK to manipulate distributed matrices. For instance, the Cholesky factorization sample creates a distributed N × N tridiagonal matrix arising from a uniform finite-difference discretization of the one-dimensional Poisson equation:

\mathbf{A} = \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix}

Similarly, the linear solve example illustrates not only the routines to solve an N × N unsymmetric linear system of equations, Ax = b, but also the steps to compute the norm of the residual relative to the norm of the matrix:

\varepsilon = \frac{\lVert \mathbf{A}x - b \rVert_\infty}{\lVert \mathbf{A} \rVert_1 \times N}.

For a well-conditioned linear system, the residual norm should be of the order of the double-precision machine epsilon.

Finally, since one of the main reasons I developed these benchmark programs was to measure the performance of various ScaLAPACK routines, the examples output not just the absolute time to complete a given matrix operation but also an estimate of the number of floating-point operations per second (commonly denoted Flop/s). Given the peak floating-point performance of the cluster (usually a constant factor times the clock rate in Hz times the number of processing elements), this metric can be used to evaluate the efficiency of a particular operation. It is much more meaningful to say “the Cholesky factorization routine achieved 90% of theoretical peak performance on my cluster” than “Cholesky factorization took 140 seconds on my cluster”.
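To make that metric concrete, here is a minimal sketch of how a benchmark of this kind might time a distributed operation and convert the elapsed time into GFlop/s. It is not code from the samples: the helper name is my own, the lambda merely marks where the factorization (e.g. pdpotrf, reached through the samples’ wrappers) would go, and the only numerical fact it relies on is that a Cholesky factorization of an N × N matrix costs roughly N³/3 floating-point operations.

#include <mpi.h>
#include <cstdio>
#include <functional>

// Illustrative sketch only: time a distributed operation on all ranks and
// report GFlop/s on rank 0. 'op' stands in for the actual ScaLAPACK call;
// 'flop_count' is the estimated number of floating-point operations.
double report_gflops(const std::function<void()>& op, double flop_count)
{
    MPI_Barrier(MPI_COMM_WORLD);              // start all ranks together
    double t0 = MPI_Wtime();
    op();                                     // the operation being measured
    MPI_Barrier(MPI_COMM_WORLD);              // wait for the slowest rank
    double elapsed = MPI_Wtime() - t0;

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double gflops = flop_count / elapsed / 1e9;
    if (rank == 0)
        std::printf("%.3f s, %.2f GFlop/s\n", elapsed, gflops);
    return gflops;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    const int n = 4096;                       // hypothetical problem size
    // Cholesky factorization of an n x n matrix costs about n^3/3 flops;
    // the empty lambda below marks where the factorization would go.
    report_gflops([] { /* e.g. pdpotrf via the samples' wrappers */ },
                  double(n) * n * n / 3.0);

    MPI_Finalize();
    return 0;
}

Dividing the reported figure by the cluster’s theoretical peak (the constant factor times clock rate times number of processing elements mentioned above) then gives the efficiency percentage.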
The build files for each sample (compatible with the Visual Studio 2010, Visual Studio 2012 and Visual Studio 2013 versions of MSBuild) are divided into two parts: a common Build.Settings file that contains configuration information, such as the paths on your machine containing the MKL and MPI libraries, and a project file containing the list of files for each project along with its dependencies, if any. A top-level solution file (ScaLAPACKSamples.sln) ties all the components together. The build files are configured to generate only 64-bit executables, and I have further constrained them to generate only the “debug” versions of the executables so you can easily step through the examples in a (sequential or cluster) debugger.

BUILDING THE EXAMPLES

I have tried to keep the number of prerequisites and the process of building the examples as simple and short as possible. Once you install all the necessary software and ensure that the build configuration is correct, you can build and run the samples just by hitting ‘F5’ inside the Visual Studio IDE.

STEP 1: INSTALL ALL PREREQUISITES

To build the samples, you must have the following prerequisites installed:

1. Visual Studio Professional 2010, 2012 or 2013. The samples use the Visual Studio 2010 project and solution formats, C++11 features such as type inference, smart pointers and lambda expressions, and only generate 64-bit (x64) executables. Therefore, older versions of Visual Studio such as Visual Studio 2008, or editions such as Visual Studio Express that do not support building 64-bit applications, are not supported. If you do not already have Visual Studio 2010 Professional, Visual Studio 2012 Professional or Visual Studio 2013, you can get a 90-day evaluation version here.

2. Microsoft HPC Pack SDK.
   a. If you are using Visual Studio 2012 or Visual Studio 2013, I recommend downloading the HPC Pack 2012 R2 SDK with Update 1, available here.
      i. When upgrading from an older version of the HPC Pack 2012 R2 SDK, you can download the HPC2012R2_Update1_x64.exe upgrade package.
      ii. When installing on a machine without an existing HPC Pack SDK, you need to download HPC2012R2_Update1_Full.zip.
      iii. When installing on a machine with an older version of the HPC Pack SDK (such as 2008), ensure you uninstall it first and then download HPC2012R2_Update1_Full.zip.
   b. If you are using Visual Studio 2010, I recommend downloading the HPC Pack 2008 R2 SDK with Service Pack 3, available here.

3. Microsoft MS-MPI Redistributable Package.
   a. If you downloaded the 2012 R2 version of the HPC Pack SDK, download Microsoft MPI v5 here. Install both packages: msmpisdk.msi (which contains the header and library files) and MSMpiSetup.exe (which contains the mpiexec and smpd executables).
   b. If you downloaded the 2008 R2 version of the HPC Pack SDK, download the corresponding version of Microsoft MPI here.

4. Intel Composer XE 2013. The particular implementation of ScaLAPACK I chose is the one in the Intel Math Kernel Library (MKL), which, besides being highly efficient, is also very convenient to use and works out of the box on Windows with Microsoft MPI. If you do not have it, you can obtain a free 30-day evaluation version of Intel MKL here.

STEP 2: SET UP INCLUDE AND LIBRARY PATHS

As mentioned in the previous section, the samples come with a configuration file, Build.Settings, that contains the paths to the Microsoft MPI library and Intel MKL as well as the set of libraries that the samples must be linked against. Build.Settings contains the following configuration entries:

<PropertyGroup>
  <MPIInc>C:\Program Files (x86)\Microsoft SDKs\MPI\Include</MPIInc>
  <MPILibDir>C:\Program Files (x86)\Microsoft SDKs\MPI\Lib\x64</MPILibDir>
  <MKLLibDir>C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\lib\intel64</MKLLibDir>
  <MKLLibs>mkl_scalapack_lp64.lib;mkl_intel_lp64.lib;mkl_sequential.lib;mkl_core.lib;mkl_blacs_msmpi_lp64.lib</MKLLibs>
</PropertyGroup>

Here MPIInc is the path on your system containing mpi.h from Microsoft MPI; MPILibDir is the path containing the 64-bit version of the MPI library, msmpi.lib; MKLLibDir is the path on your system containing the Intel MKL library files; and MKLLibs is the list of MKL libraries that the samples must be linked against. In case you are curious how I selected this subset of Intel MKL libraries to link against, I used the incredibly handy Intel link line advisor tool. The options I selected in the tool are shown in the screenshot below:

Figure 2. Selecting a list of MKL libraries to link against.
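If you want to double-check that these paths and libraries are being picked up correctly before building the full solution, a tiny BLACS “hello world” compiled and linked against the same settings will do. The sketch below is my own illustration rather than one of the samples; it assumes that the C-style BLACS entry points (Cblacs_*) are exported by mkl_blacs_msmpi_lp64.lib, so the prototypes are declared by hand.

#include <mpi.h>
#include <cstdio>

// Hand-declared prototypes for the C BLACS entry points; this sketch assumes
// they are exported by mkl_blacs_msmpi_lp64.lib.
extern "C" {
    void Cblacs_pinfo(int* mypnum, int* nprocs);
    void Cblacs_get(int icontxt, int what, int* val);
    void Cblacs_gridinit(int* icontxt, const char* order, int nprow, int npcol);
    void Cblacs_gridinfo(int icontxt, int* nprow, int* npcol, int* myrow, int* mycol);
    void Cblacs_gridexit(int icontxt);
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int myrank = 0, nprocs = 1;
    Cblacs_pinfo(&myrank, &nprocs);

    // Build a 1 x nprocs process grid on the default system context.
    int context = 0;
    Cblacs_get(-1, 0, &context);
    Cblacs_gridinit(&context, "Row", 1, nprocs);

    int nprow = 0, npcol = 0, myrow = 0, mycol = 0;
    Cblacs_gridinfo(context, &nprow, &npcol, &myrow, &mycol);
    std::printf("process %d of %d sits at grid coordinate (%d, %d)\n",
                myrank, nprocs, myrow, mycol);

    Cblacs_gridexit(context);
    MPI_Finalize();
    return 0;
}

If this compiles, links against the libraries listed in MKLLibs and prints one line per rank when launched under mpiexec, the entries in Build.Settings are almost certainly correct.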
STEP 3: BUILD THE EXAMPLES

Once you have set the configuration properties correctly (or have verified that the ones in the provided sample correspond to valid paths on your system), double-click the solution file, ScaLAPACKSamples.sln, to open it inside Visual Studio. You should see something like the following:

Figure 3. The sample solution inside Visual Studio 2010 (a) and Visual Studio 2012 (b).

Once the solution has loaded successfully, hit F7 or click Build -> Build Solution in the IDE to compile all the examples. If all the build settings are correct, you will see four executable files in the output directory of the solution:

Figure 4. The executable files generated after successful compilation.

At this point, you are ready to run the benchmark programs, either locally on your development machine, on an on-premises HPC cluster or in Azure.

RUNNING THE EXAMPLES AND BENCHMARKING SCALAPACK ROUTINES

The generated executables can be run like any other MPI program. Figure 5 shows the Cholesky factorization benchmark running on my quad-core, 3.4 GHz Intel® Xeon workstation.

Figure 5. A sample run of the Cholesky factorization benchmark.

You can just as easily deploy and run the benchmarks on an HPC cluster. The executables have all of their MKL dependencies statically compiled in, so you can either copy them to all the compute nodes of your cluster or to a common file share and run the MPI job from there. In Figure 6 I show the latter scenario, where the Cholesky example is copied to the C:\tmp directory on the head node of our on-premises cluster (named Hydra) and an MPI job is launched on 32 compute cores (each compute node has eight cores, so the job will run on four nodes).

Figure 6. Running the Cholesky factorization example on an HPC cluster.

CONCLUSIONS

Numerical libraries such as ScaLAPACK are written by experts in their respective fields and have been continuously improved over the past several years or even decades. Using these libraries, as opposed to, say, rolling your own routines, ensures that you will get accurate results with high performance. However, these legacy libraries often present a somewhat steep barrier to entry for novices. In this blog post, I demonstrated a set of simple, self-contained example programs in C++ that illustrate the use of a few representative routines in ScaLAPACK on Windows HPC Server with MS-MPI. In subsequent posts, I hope to demonstrate the process of transforming existing data into the format required by ScaLAPACK (the two-dimensional block-cyclic distribution) as well as provide more detailed walkthroughs of deploying and running the benchmarks on Azure HPC clusters.