Parallel Extensions to the Matrix Template Library

Andrew Lumsdaine†   Brian C. McCandless†

To appear in Proc. 8th SIAM Conference on Parallel Processing for Scientific Computing. This work was supported by NSF cooperative grant ASC94-22380. Computational resources provided to the University of Notre Dame under the auspices of the IBM SUR program.

† Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556; {Lumsdaine.1,McCandless.1}@nd.edu; http://www.cse.nd.edu/lsc/research/mtl/.

Abstract

We present the preliminary design for a C++ template library to enable the compositional construction of matrix classes suitable for high-performance numerical linear algebra computations. The library based on our interface definition — the Matrix Template Library (MTL) — is written in C++ and consists of a small number of template classes that can be composed to represent commonly used matrix formats (both sparse and dense) and parallel data distributions. A comprehensive set of generic algorithms provides high performance for all MTL matrix objects without requiring specific functionality for particular types of matrices. We present performance data to demonstrate that there is little or no performance penalty caused by the powerful MTL abstractions.

1 Introduction

There is a common perception in scientific computing that abstraction is the enemy of performance. Although there is continual interest in using languages such as C or C++ and the powerful data abstractions that those languages provide, the conventional wisdom is that data abstractions inherently carry with them a (perhaps severe) performance penalty. Our thesis is that this is not necessarily the case and that, in fact, abstraction can be an effective tool in enabling high performance — but one must choose the right abstractions.

The misperception about abstraction springs from numerous examples of C++ libraries that provide a very nice user interface through polymorphism, operator overloading, and so forth, so that the user can implement an algorithm or a library in a “natural” way (see, e.g., SparseLib++ and IML++ [8]). Such an approach will (by design) hide computational costs from the user and degrade performance. One approach to providing both performance and abstraction is through the use of lazy evaluation (see, e.g., [2]), but this approach can have other performance penalties as well as implementation difficulties.

One of the most important concerns in obtaining high performance on modern workstations is proper exploitation of the memory hierarchy. That is, a high-performance algorithm must be cognizant of the costs of memory accesses and must be structured to maximize use of registers and cache and to minimize cache misses, pipeline stalls, etc. Most importantly, data abstractions can be made that explicitly account for hierarchical memory and that enable a programmer to readily exploit it.

The particular set of abstractions that we present here to bridge the performance-abstraction gap is the Matrix Template Library (MTL), written in C++. In the following sections, we describe the basic design of MTL, discuss parallel extensions to MTL, and present experimental results showing that MTL provides performance competitive with (or better than) traditional mathematical libraries.

We remark that this work is decidedly not an attempt to “prove” that a particular language (in our case, C++) offers higher performance than another language (e.g., Fortran). Such arguments are, ultimately, pointless.
Any language with a mature compiler can offer high performance (the PhiPAC effort definitively settles this question [4]). Software development, even scientific software development, is about more than just performance and, except for academic situations, one must necessarily be concerned with the costs of software over its entire life-cycle. Thus, we contend that the only relevant discussion to have about languages is how particular languages enable the robust construction, maintenance, and evolution of complex software systems. In that light, modern high-level languages have a distinct advantage — most of them were developed specifically for the development of complex software systems, and the more widely used ones have survived only because they are able to meet the needs of software developers.

2 The Standard Template Library

2.1 A New Paradigm, Not Just a New Library

The idea for the Matrix Template Library was inspired to a large extent by the Standard Template Library (STL) for C++ [12]. STL has become extremely popular because of its elegance, richness, and versatility. The original motivation for STL, however, was not to provide yet another library of standard components, but rather to introduce a new programming paradigm [14, 15]. This new paradigm was based on the observation that many algorithms can be abstracted away from the particular representations of the data structures upon which they operate. As long as the data structures provide a standard interface for algorithms to use, algorithms and data structures can be freely mixed and matched. Moreover, this paradigm recognizes that this process of abstraction can be done without sacrificing performance.

Realizing an implementation of an algorithm that is independent of data structure representation requires language support, however. In particular, a language must allow algorithms (and data) to be parameterized not only by the values of the formal parameters (the arguments), but also by the type of the data. Few languages offer this capability, and it has only (relatively) lately become part of C++. In C++, functions and object classes are parameterized through the use of templates [19], hence the realization of a generic algorithm library in C++ as the Standard Template Library.

2.2 The Structure of STL

STL provides the following sets of components: Containers, Generic Algorithms, Iterators, Adapters, Function Objects (“Functors”), and Allocators. Since these types of components also form the framework for MTL, we discuss them briefly here.

Containers are objects that contain other objects (e.g., a list of elements). STL uses templates to parameterize what is contained. Thus, the same template list code can be used to implement a list of integers, or a list of doubles, or a list of lists, etc.

The generic algorithms defined by STL are a set of data-format-independent algorithms and are parameterized by the type of their arguments. The particular algorithms provided by the STL specification are general computer-science algorithms (e.g., sorting, searching, etc.).

Iterators are objects that generalize access to other objects (iterators are sometimes called “generalized pointers”). The definition of the iterator classes in STL provides the uniform interface between algorithms and containers necessary to enable genericity. That is, each container class has certain iterators which can be used to access and perhaps manipulate its contents. STL algorithms are in turn written solely in terms of iterators.
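To make this last point concrete, the following minimal sketch shows the style in which such an algorithm can be written. The function name sum is illustrative only and is not part of the STL specification (the STL algorithm accumulate plays essentially this role); the loop is expressed entirely in terms of iterator operations, so the identical code works for a built-in array, a vector, or a list.

  #include <iostream>
  #include <list>
  #include <vector>

  // A minimal sketch (not from the STL specification) of a generic algorithm
  // written solely in terms of iterators; it knows nothing about the container
  // being traversed.
  template <class InputIterator, class T>
  T sum(InputIterator first, InputIterator last, T init) {
    for (; first != last; ++first)
      init = init + *first;   // only iterator operations: !=, ++, and *
    return init;
  }

  int main() {
    double a[3] = {1.0, 2.0, 3.0};
    std::vector<double> v(a, a + 3);
    std::list<double> l(a, a + 3);

    // The identical algorithm works for an array, a vector, and a list.
    std::cout << sum(a, a + 3, 0.0) << " "
              << sum(v.begin(), v.end(), 0.0) << " "
              << sum(l.begin(), l.end(), 0.0) << std::endl;
    return 0;
  }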
The remaining components are perhaps less pre-eminent in STL. Adapter classes are used to provide new interfaces to other components. Just as an iterator generalizes and makes more powerful the concept of a pointer, a function object generalizes and makes more powerful the concept of a function pointer. Finally, allocators are classes that manage the use of memory.

2.3 Example

The following code fragment demonstrates the use of the generic inner_product() algorithm. Note that the inner product can be taken between containers of arbitrary type.

  int x1[100];
  vector<double> x2(100);
  list<double> x3;

  // ... initialize data ...

  // Compute inner product of x1 and x2 -- an array and a vector
  double result = inner_product(&x1[0], &x1[100], x2.begin(), 0.0);

  // Compute inner product of x2 and x3 -- a vector and a list
  result = inner_product(x2.begin(), x2.end(), x3.begin(), 0.0);

3 The Matrix Template Library

MTL is by no means the first attempt to bring abstraction to scientific programming [3], nor is it the first attempt at a mathematical library in C++. Other notable efforts include HPC++ [20], LAPACK++ [9], SparseLib++/IML++ [8], and the Template Numerical Toolkit [16]. MTL is unique, however, in its general underlying philosophy (see below) and in its particular commitment to self-contained high performance. Other libraries, if they are concerned about performance at all, attain high performance by making (mixed-language) calls to BLAS subroutines. The higher-level C++ code merely provides a syntactically pleasing means for gluing high-performance subroutines together, but does not provide flexible means for obtaining high performance (as MTL does).

3.1 The MTL Philosophy

The goal of MTL is to introduce the philosophy of STL to a particular application domain, namely high-performance numerical linear algebra. The underlying ideas remain the same: to provide a framework in which algorithms and data structures are separated, to provide a classification of standard components, to define a set of interfaces for those components, and to provide a reference implementation of a conforming library.

The basic architectural design of MTL (from the bottom up) begins with a fundamental arithmetic type — this organizes bytes into (say) doubles. Next, we collect groups of the arithmetic types into one-dimensional containers. Finally, we use a two-dimensional container to organize the one-dimensional containers. Note what we have not done to this point. We have not indicated how this two-dimensional container of arithmetic types corresponds to a mathematical matrix of elements. With the two-dimensional container, however, we do explicitly know how data is arranged in memory and hence where opportunities for high performance will be greatest. The transformation of one- and two-dimensional containers into (respectively) vectors and matrices is accomplished through the use of adapter classes.

3.2 The Design of MTL

MTL provides the following sets of components: one-dimensional containers, two-dimensional containers, orientation, shape, generic algorithms, allocators, iterators, and a matrix adapter. Because of space limitations, we can only describe each of them briefly. These components are described more fully in the MTL specification [13].

One-Dimensional Containers

As with containers in STL, one-dimensional containers in MTL are objects that contain other objects. We distinguish these containers as being “one-dimensional” because elements within such containers can be accessed using a single index.
The declarations for MTL's one-dimensional containers are as follows:

  namespace mtl {
    template <class T, class Allocator = allocator> class vector;
    template <class T, class Allocator = allocator> class pair_vector;
    template <class T, class Allocator = allocator> class compressed;
    template <class T, class Allocator = allocator> class list;
    template <class T, class Allocator = allocator> class map;
  };

These declarations may seem somewhat formidable to the C++ novice. (We remark that we are somewhat forward-looking in our definition of MTL in that we use features, such as namespace and default template arguments, that are new to C++ and not yet widely supported by available compilers.) The statement

  template <class T, class Allocator = allocator>

simply indicates that the following class is parameterized by class T and by class Allocator. The class T may be either an object class or a fundamental C++ type, such as int or double, and parameterizes the type of object contained by the container class. The Allocator template argument parameterizes the class that the container uses to allocate memory. In this case, the Allocator class has a default value; containers that are declared with a single template argument (T) will use the default allocator class for Allocator.

The MTL vector class is similar to the STL vector class (which is similar to a C++ array) with an interface tailored to MTL requirements, and provides the basis for dense (mathematical) vectors and matrices. The remaining container classes are associative containers (they explicitly store index-value pairs) and form the basis for different types of sparse matrices. The different container classes have different computational complexity (computational complexity of particular container operations is part of the formal MTL specification), allowing users to choose a format that is most effective for particular applications.

Two-Dimensional Containers

Two-dimensional containers in MTL are containers of other containers. We distinguish these containers as being “two-dimensional” because elements within such containers must be accessed with a pair of indices. MTL provides the following two-dimensional containers:

  namespace mtl {
    template <class OneD, class Allocator = allocator> class vector_of;
    template <class OneD, class Allocator = allocator> class pointers_to;
  };

The two-dimensional container classes have as a template argument a one-dimensional container and provide a linear arrangement of those containers. The fundamental difference between these two classes is in the complexity required to interchange one-dimensional containers — the vector_of class requires linear time whereas the pointers_to class requires constant time.

Orientation and Shape

MTL provides two types of components to map from two-dimensional containers to matrix structure — orientation and shape. The shape class describes the large-scale non-zero structure of the matrix. MTL provides the following shape classes (with the obvious interpretations):

  namespace mtl {
    class general;
    class upper_triangle;
    class lower_triangle;
    class unit_upper_triangle;
    class unit_lower_triangle;
    class banded;
  };

The orientation class maps matrix indices to two-dimensional container indices. MTL provides the following orientation classes:

  namespace mtl {
    class row;
    class col;
    class diagonal;
    class anti_diagonal;
  };

The row, col, diagonal, and anti_diagonal classes align the one-dimensional container along (i.e., the one-dimensional container iterators vary fastest along) the matrix row, column, diagonal, and anti-diagonal, respectively.
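To make the index mapping concrete, the following conceptual sketch shows what an orientation does. The structs row_major and col_major are illustrative stand-ins, not MTL's row and col classes: a row orientation places element (i, j) at position j of the i-th one-dimensional container, while a column orientation places it at position i of the j-th container.

  #include <cstddef>
  #include <iostream>

  // Conceptual sketch only (not MTL's orientation classes): an orientation maps
  // a matrix index pair (i, j) to (which one-dimensional container, position).
  struct row_major {
    static void map(std::size_t i, std::size_t j,
                    std::size_t& container, std::size_t& pos) {
      container = i; pos = j;   // iterators vary fastest along a matrix row
    }
  };

  struct col_major {
    static void map(std::size_t i, std::size_t j,
                    std::size_t& container, std::size_t& pos) {
      container = j; pos = i;   // iterators vary fastest along a matrix column
    }
  };

  int main() {
    std::size_t c, p;
    row_major::map(2, 5, c, p);
    std::cout << "row orientation:    A(2,5) -> container " << c << ", position " << p << "\n";
    col_major::map(2, 5, c, p);
    std::cout << "column orientation: A(2,5) -> container " << c << ", position " << p << "\n";
    return 0;
  }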
Generic Algorithms

MTL provides a number of high-performance generic algorithms for numerical linear algebra. These algorithms can be generally classified as vector arithmetic, operator application, and operator update operations, and they supply a superset of the functionality provided in the level-1, level-2, and level-3 BLAS [11, 7, 6]. It is important to understand at what level these algorithms are generic, however. To obtain high performance on a modern high-performance computer, an algorithm must exploit and manage the memory layout of its data. Thus, MTL provides generic algorithm interfaces at the matrix level, but the algorithms themselves are implemented generically at the (one-dimensional or two-dimensional) container level. Note that the dispatch from the matrix level to the container level is done at compile time; there is no run-time performance penalty.

Allocators

As with STL, MTL allocators are used to manage memory. The default MTL allocator class allows the elements of the two-dimensional matrix to be laid out contiguously in memory. This is an important feature for interfacing MTL to external libraries (such as LAPACK [1]), which expect data to be laid out in one-dimensional fashion.

Iterators

In addition to the native container iterators, MTL one-dimensional containers provide a value iterator for accessing container values, and an index iterator for accessing container indices. Moreover, a block iterator component is provided at the two-dimensional container level to allow iteration over two-dimensional regions. The structure of MTL is thus compatible with (and is in fact an early test vehicle for) the notion of a “lite” BLAS [18].

Matrix Adapter

The matrix adapter class provides a uniform linear algebra interface to MTL classes by wrapping up a two-dimensional container class, an orientation class, and a shape class.

  namespace mtl {
    template <class TwoD, class Orientation, class Shape = general> class matrix;
  };

3.3 Example

The following MTL code fragment computes the product between a compressed-row sparse matrix and a column-oriented dense matrix:

  using namespace mtl;

  double alpha, beta;
  matrix<pointers_to<compressed<double> >, row, general> A;
  matrix<pointers_to<vector<double> >, col, general> B, C;

  // ... initialize scalars and matrices ...

  multiply(C, A, B, alpha, beta);  // C <- alpha*A*B + beta*C

4 Parallel Extensions to MTL

To extend MTL for parallel programming, we introduce four new components: distribution, processor map, distributed vector, and distributed matrix.

Distribution

An MTL distribution class maps global indices to local indices and is thus analogous in function to an orientation class (an illustrative sketch of one such mapping follows the processor map declarations below). Presently, MTL includes the following distribution:

  namespace mtl {
    template <class ProcMap> class block_cyclic;
  };

The distribution class is parameterized by the type of the underlying processor topology. Note that although MTL presently provides only this single distribution, the block_cyclic distribution is general enough to subsume many other types of distributions [17].

Processor Map

The MTL processor map component provides a topological structure to the processors of a parallel computer. MTL includes the following processor maps:

  namespace mtl {
    namespace pmap {
      class one_d;
      class two_d;
      class three_d;
    };
  };

These maps provide one-, two-, and three-dimensional Cartesian topologies, respectively.
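To illustrate the kind of mapping a distribution performs, the following minimal sketch shows the standard block-cyclic scheme along one dimension of a processor map: a global index is assigned to an owning process and translated to a local index, given a block size and a process count. The names block_cyclic_1d, owner, and local are illustrative only and are not MTL's distribution interface.

  #include <cstddef>
  #include <iostream>

  // Illustrative sketch only (not MTL's distribution interface): the standard
  // block-cyclic mapping of a global index onto P processes with block size b.
  struct block_cyclic_1d {
    std::size_t b;   // block size
    std::size_t P;   // number of processes along this dimension

    std::size_t owner(std::size_t i) const { return (i / b) % P; }

    std::size_t local(std::size_t i) const {
      // local offset of the owning block, plus the offset within the block
      return ((i / b) / P) * b + (i % b);
    }
  };

  int main() {
    block_cyclic_1d dist = {4, 3};   // block size 4, three processes
    for (std::size_t i = 0; i < 12; ++i)
      std::cout << "global " << i << " -> process " << dist.owner(i)
                << ", local " << dist.local(i) << "\n";
    return 0;
  }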
Distributed Vector and Distributed Matrix

The distributed vector and distributed matrix components are adapter classes that respectively wrap up sequential MTL one-dimensional and two-dimensional containers together with a distribution. The MTL distributed vector and distributed matrix classes are declared as follows:

  namespace mtl {
    template <class Dist, class SeqOneD> class dist_vector;
    template <class Dist, class SeqTwoD> class dist_matrix;
  };

5 Performance Results

To demonstrate the performance of MTL, we present performance results from multiplying two column-oriented dense matrices. Sequential results were obtained on an IBM RS/6000 model 590 workstation; parallel results were obtained on a thin-node IBM SP-2 (which has slightly lower single-node performance than the 590). All modules were compiled with the highest available level of optimization for the particular language of the module.

Table 1 shows a comparison of IBM's (non-ESSL) version of DGEMM, the “DMR” version of DGEMM obtained from netlib [10], the PhiPAC version of DGEMM, and MTL. In the sequential case, MTL consistently provided the highest performance (except in one instance where it lagged DMR by an insignificant amount). In the parallel case, MTL showed very good scalability as the number of processors was increased.

The blocking parameters for the PhiPAC DGEMM were the “optimal” parameters for the RS/6000 590, as obtained from the PhiPAC web page. Unfortunately, we were not able to reproduce the near-peak performance of PhiPAC as reported on the web page. Evidently, the parameters reported produce near-peak performance for certain matrix sizes, but that performance can fall off dramatically for other matrix sizes. We did not attempt to find blocking parameters with more consistently high performance, but we are fairly certain that such parameters exist and that they would make PhiPAC competitive with the MTL results. Similar exploration of the blocking design space would presumably also allow MTL to eke out a few more Mflops. The point here is not that MTL has the fastest matrix-matrix product, but that it can be made as fast as other subroutines — all within a framework that is more conducive to modern software engineering practice.

              Sequential Mflop rate                    Parallel Mflop rate
    N     DGEMM     DMR    PhiPAC     MTL       MTL(1)  MTL(2)  MTL(4)   MTL(8)
   128    158.56   215.18  209.52    215.04     202.6   167.8   246.7    301.7
   256    159.62   218.80  207.34    221.63     215.5   243.1   398.3    552.8
   512    161.25   220.92   47.58    225.61     218.8   314.0   553.2    885.9
  1024    162.71   216.77   47.47    226.30     219.8   363.3   681.7   1173.5
  2048    162.44   218.94   47.41    226.14     218.1    N/A    759.2   1388.8

TABLE 1: Comparison of Mflop rates for dense matrix-matrix product. Sequential results are shown for DGEMM, the “DMR” version of DGEMM, the PhiPAC version of DGEMM, and MTL. Parallel results are shown for MTL on 1, 2, 4, and 8 SP-2 nodes. Results were not obtained for N = 2048 on 2 nodes because of memory limits.

6 Conclusion

In this paper we have presented a (very brief) description of MTL and its parallel extensions. It should be clear from the discussion and from the performance results that abstractions are not barriers to high performance. Although not shown here due to space limitations, results from sparse matrix computations showed MTL to have superior performance to standard sparse libraries (i.e., the NIST sparse BLAS [5]).
Present work focuses on the development of specific mathematical libraries using (sequential and distributed) MTL: a direct sparse solver library and a preconditioned iterative sparse solver library. Future work will include the complete specification of a low-level “lite” BLAS in C++ to provide the foundation for MTL generic algorithms. The current release of MTL (documentation and reference implementation) can be found at http://www.cse.nd.edu/lsc/research/mtl/.

References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK: A portable linear algebra package for high-performance computers, in Proc. Supercomputing '90, IEEE Press, 1990, pp. 1–10.
[2] S. Atlas et al., POOMA: A high performance distributed simulation environment for scientific applications, in Proc. Supercomputing '95, 1995.
[3] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, Efficient management of parallelism in object-oriented numerical software libraries, in Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen, eds., Birkhauser, 1997.
[4] J. Bilmes, K. Asanovic, J. Demmel, D. Lam, and C.-W. Chin, Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology, Tech. Rep. CS-96-326, University of Tennessee, May 1996. Also available as LAPACK working note 111.
[5] S. Carney et al., A revised proposal for a sparse BLAS toolkit, Preprint 94-034, AHPCRC, 1994. SPARKER working note #3.
[6] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software, 16 (1990), pp. 1–17.
[7] J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, Algorithm 656: An extended set of basic linear algebra subprograms: Model implementations and test programs, ACM Transactions on Mathematical Software, 14 (1988), pp. 18–32.
[8] J. Dongarra, A. Lumsdaine, X. Niu, R. Pozo, and K. Remington, A sparse matrix library in C++ for high performance architectures, in Proc. Object Oriented Numerics Conference, Sun River, OR, 1994.
[9] J. Dongarra, R. Pozo, and D. Walker, LAPACK++: A design overview of object-oriented extensions for high performance linear algebra, in Proc. Supercomputing '93, IEEE Press, 1993, pp. 162–171.
[10] J. J. Dongarra, P. Mayes, and G. R. di Brozolo, The IBM RISC System/6000 and linear algebra operations, Computer Science Technical Report CS-90-122, University of Tennessee, 1990.
[11] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic linear algebra subprograms for Fortran usage, ACM Transactions on Mathematical Software, 5 (1979), pp. 308–323.
[12] M. Lee and A. Stepanov, The standard template library, tech. rep., HP Laboratories, February 1995.
[13] A. Lumsdaine and B. C. McCandless, The matrix template library, BLAIS Working Note #2, University of Notre Dame, 1996.
[14] D. R. Musser and A. A. Stepanov, Generic programming, in Lecture Notes in Computer Science 358, Springer-Verlag, 1989, pp. 13–25.
[15] D. R. Musser and A. A. Stepanov, Algorithm-oriented generic libraries, Software–Practice and Experience, 24 (1994), pp. 623–642.
[16] R. Pozo, Template numerical toolkit for linear algebra: high performance programming with C++ and the standard template library, in Proc. ETPSC III, August 1996.
[17] M. Sidani and B. Harrod, Parallel matrix distributions: Have we been doing it all right?, Tech. Rep. CS-96-340, University of Tennessee, November 1996. LAPACK working note #116.
[18] A. Skjellum and A. Lumsdaine, Software engineering issues for linear algebra standards, BLAIS Working Note #1, Mississippi State University and University of Notre Dame, 1995.
[19] B. Stroustrup, The C++ Programming Language, Addison-Wesley, Reading, Massachusetts, second ed., 1991.
[20] The HPC++ Working Group, HPC++ white papers, tech. rep., Center for Research on Parallel Computation, 1995.