A Framework for Distributed Tensor Computations

Martin Schatz, Bryan Marker, Robert van de Geijn (The University of Texas at Austin)
Tze Meng Low (Carnegie Mellon University)
Tamara G. Kolda (Sandia National Laboratories, Livermore)

Envisioned workflow
1. A new architecture comes out
2. Scientists specify what they want computed on the new architecture to (computer) scientists
3. (Computer) scientists provide an efficient library for the computation on the new architecture
4. Scientists do science
Formality is key!

Goals
• Formally describe the distribution of tensor data on processing grids
• Identify patterns in collective communications to utilize specialized implementations when possible
• Provide a systematic approach to creating algorithms and implementations for problems
• Achieve high performance

Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms

Assumptions
• Assume a computing grid arranged as an order-N object
• Elements of tensors are wrapped elemental-cyclically on the grid
[figure: order-2 processing grid with modes labeled 0 and 1]
For this example, we assume an order-2 tensor (a matrix) on an order-2 grid.

Data distribution notation: The Basics
• Assign a distribution scheme to each mode of the object
  – How indices of columns (mode 0) are distributed: based on mode 0 of the grid
  – How indices of rows (mode 1) are distributed: based on mode 1 of the grid
• The tuple assigned to each mode is referred to as the “mode distribution”

Example 1
• Distribute indices of columns based on mode 0 of the grid
• Distribute indices of rows based on mode 1 of the grid
[animation: elements of the matrix wrapped cyclically onto the grid, one mode at a time]

Notes
• Distributions wrap elements on a logical view of the grid
  – This allows multiple grid modes to be used in a single symbol; for example, a mode distribution containing two grid modes views the order-2 grid as a single combined mode
• The empty mode distribution represents replication

Notes
• We use boldface lowercase Roman letters to refer to mode distributions
• Elements of mode distributions are denoted with subscripts
• Concatenation of mode distributions is given its own symbol
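The wrapping just described can be made concrete with a small sketch. The following Python fragment is illustrative only and is not the authors' library: `owner` and `local_view` are hypothetical helpers for the simplest case, where tensor mode 0 is wrapped over grid mode 0 and tensor mode 1 over grid mode 1.

```python
import numpy as np

# Illustrative sketch of elemental-cyclic wrapping (not the authors'
# library). An order-2 tensor on an order-2 grid, with tensor mode 0
# wrapped over grid mode 0 and tensor mode 1 over grid mode 1.
p0, p1 = 2, 3                        # processing-grid dimensions
X = np.arange(6 * 6).reshape(6, 6)   # a 6x6 matrix to distribute

def owner(i, j):
    """Grid coordinates of the process owning element (i, j)."""
    return (i % p0, j % p1)

def local_view(s, t):
    """The elements stored by process (s, t): a strided slice of X."""
    return X[s::p0, t::p1]

print(owner(4, 5))       # -> (0, 2)
print(local_view(0, 2))  # elements X[0::2, 2::3], a 3x2 local block
```

A mode distribution naming several grid modes would instead wrap the index cyclically over the combined size of those grid modes; the two-mode case above is the special case the running example uses.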
Elemental Notation
• The distributions used by the Elemental library can be viewed in terms of the notation defined here

Parallel matrix-matrix multiplication
• Heuristic:
  – Avoid communicating the “large” matrix
  – Leads to “Stationary” A, B, and C algorithm variants
• Stationary C algorithm:
[figures: the steps of the stationary-C algorithm]

Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms

Tensors and tensor contraction
• Tensor: an order-m (m-mode) operator
  – Each mode is associated with a feature of the application
  – Modes have a fixed length (dimension)

Notation
• Tensors are written with capital script letters
• Elements of tensors with lowercase Greek letters
• An element’s location in the tensor is given as subscripts

Tensor contractions
• Einstein notation[1] implicitly sums over modes shared by the inputs
• Transpose corresponds to an interchange of modes
• An arbitrary number of modes can be involved (any of which can be summed)
[1] A. Einstein. Die Grundlage der allgemeinen Relativitätstheorie. Annalen der Physik, 354:769–822, 1916.

Tensor contractions
• The third-order Møller-Plesset method[2] from computational chemistry is an example
[2] R. J. Bartlett. Many-body perturbation theory and coupled cluster theory for electron correlation in molecules. Annual Review of Physical Chemistry, 32(1):359–401, 1981.

Tensor contraction as matrix-matrix multiplication
• Through permutation of the data, the operands can be arranged so that a matrix-matrix multiplication can be performed
• Results in an algorithm of the form: permute the inputs, multiply, permute the result back
• Requires a large rearrangement of data
  – The cost of this rearrangement is magnified in distributed-memory environments

Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms

Tensor distribution notation
• We have already seen the notation for order-2 tensors on order-2 grids
• What if the tensor is of higher order?
  – There are more modes to assign distribution symbols to
  – Ex.: an order-4 tensor needs four mode distributions
• What if the grid is of higher order?
  – There are more grid modes to choose from when creating distribution symbols
  – Ex.: mode distributions may only contain elements from {0, 1, 2} when computing on an order-3 grid

Redistributions: Allgather
[3] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.

Allgather in action
[figures: the distribution of a matrix before and after an allgather within a grid mode]

Redistributions: Allgather
• An allgather within a grid mode performs the following redistribution of data: the gathered grid mode is removed from the mode distribution, leaving the data replicated within that grid mode

Redistribution rules
• The redistributions that communication within grid modes can perform are specified by rules such as the one above
  – Ex.: [redistribution rule]
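The allgather rule can be checked with a small pure-Python simulation (no MPI; illustrative only): a tensor mode wrapped over grid modes (0, 1) is gathered within grid mode 1, after which its distribution depends only on grid mode 0. The convention that grid mode 0 varies fastest in the combined view is an assumption of this sketch.

```python
import numpy as np

# Simulate an allgather within grid mode 1 on a p0 x p1 grid.
p0, p1 = 2, 3
n = 12
col_index = np.arange(n)

# Before: a tensor mode wrapped elemental-cyclically over the combined
# grid modes (0, 1), i.e. over all p0 * p1 processes (grid mode 0
# varying fastest -- a convention chosen for this sketch).
before = {(s, t): col_index[(s + p0 * t) % (p0 * p1)::p0 * p1]
          for s in range(p0) for t in range(p1)}

# Allgather within grid mode 1: each process collects the pieces held
# by all processes sharing its grid-mode-0 coordinate s.
after = {(s, t): np.sort(np.concatenate([before[(s, u)] for u in range(p1)]))
         for s in range(p0) for t in range(p1)}

# After: the data held depends only on s -- exactly the wrapping of the
# tensor mode over grid mode (0) alone, replicated across grid mode 1.
for s in range(p0):
    for t in range(p1):
        assert np.array_equal(after[(s, t)], col_index[s::p0])
print(after[(0, 0)])   # -> [ 0  2  4  6  8 10]
```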
Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms

Algorithm choices
• For matrix operations, “Stationary” variants are useful
  – Extending these ideas to tensors should also be useful
• There are potentially other “families” of algorithms to choose from
  – For now, we focus only on those we know how to encode

Deriving Algorithms: Stationary (assumed order-4 grid)
• Avoid communicating the stationary tensor
• Distribute its modes similarly during local computation
• Do not reuse modes of the grid

Quick Note
• Blocking the described algorithms should be straightforward (as was done for matrix operations)

Analyzing algorithms
• Communication costs are obtained from Chan et al. [3]

Analyzing the Stationary algorithm (order-4 grid)
• Redistribute the first input
  – All-to-all within grid modes (2, 3)
  – Allgather within grid modes (1, 2)
• Redistribute the second input
  – All-to-all within grid modes (0, 1)
  – Allgather within grid modes (3, 0)
• Local tensor contraction

Analyzing the matrix-mapping approach
• Permute each of the three tensors
• Local tensor contraction (performed as a matrix-matrix multiplication)
• Permute the result back

Picking the “best” algorithm
• Compare the stationary algorithm and the matrix-multiply based algorithm by the collectives involved and the number of processes each communicates over
[table: communication costs of the two algorithms]

How this all fits together
• We have formalized aspects of distributed tensor computation
  – Rules defining valid data distributions
  – Rules specifying how collectives affect distributions
• This gives a mechanical way to go from a problem specification to an implementation
• If other knowledge can be formalized, the search space is reduced

Acknowledgements
• Tamara G. Kolda – Sandia National Laboratories, Livermore
• Robert van de Geijn
• Bryan Marker
• Devin Matthews
• Tze Meng Low
• The FLAME team

Thank you
• This work has been funded by the following:
  – Sandia National Laboratories: Sandia Graduate Fellowship
  – NSF CCF-1320112: SHF: Small: From Matrix Computations to Tensor Computations
  – NSF ACI-1148125/1340293 (supplement): Collaborative Research: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences
  – Argonne National Laboratory, for access to computing resources
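Appendix: the matrix-mapping approach in code (illustrative)

As a closing illustration of the permute-and-multiply mapping analyzed above, the following NumPy sketch contracts two order-4 tensors both directly (Einstein notation via np.einsum) and by permuting and reshaping into a single matrix-matrix multiplication. The particular contraction, its index labels, and all dimensions are hypothetical, chosen only for illustration.

```python
import numpy as np

# Contraction c_{abij} = sum_{ef} a_{efab} * b_{ijef}
# (a made-up example), done two ways.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6, 7))   # modes e, f, a, b
B = rng.standard_normal((8, 9, 4, 5))   # modes i, j, e, f

# Reference result via Einstein-notation contraction.
C_ref = np.einsum('efab,ijef->abij', A, B)

# Matrix mapping: permute so A reads (a b | e f) and B reads (e f | i j),
# flatten each side into a matrix, multiply, then unflatten the result.
Am = A.transpose(2, 3, 0, 1).reshape(6 * 7, 4 * 5)
Bm = B.transpose(2, 3, 0, 1).reshape(4 * 5, 8 * 9)
C = (Am @ Bm).reshape(6, 7, 8, 9)       # modes a, b, i, j

assert np.allclose(C, C_ref)
```

The transposes and reshapes are exactly the “permute” steps whose data-movement cost the analysis above charges against this approach in a distributed-memory setting.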