A Framework for Distributed Tensor Computations

Martin Schatz
Bryan Marker
Robert van de Geijn
The University of Texas at Austin
Tze Meng Low
Carnegie Mellon University
Tamara G. Kolda
Sandia National Labs: Livermore
Envisioned workflow
1. New architecture comes out
2. Scientists specify what they want computed on new
architecture to (computer) scientists
3. (Computer) scientists provide efficient library for the
computation on new architecture
4. Scientists do science
Formality is key!
Goals
• Formally describe distribution of tensor data on processing
grids
• Identify patterns in collective communications to utilize
specialized implementations when possible
• Provide systematic approach to creating algorithms and
implementations for problems
• Achieve high performance
Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms
Assumptions
• Assume a computing grid arranged as an order-N object
• Elements of tensors wrapped elemental-cyclically on the grid
For this example, we assume an order-2 tensor (matrix) on an order-2 grid
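The elemental-cyclic wrapping above can be sketched in a few lines. This is an illustrative model of the assignment rule, not code from the talk; the grid shape and helper names are assumptions.

```python
# Illustrative sketch of elemental-cyclic wrapping on an order-2 grid:
# element (i, j) of an m x n matrix lives on process (i mod d0, j mod d1).

def owner(i, j, grid=(2, 2)):
    """Grid coordinates of the process owning element (i, j)."""
    d0, d1 = grid
    return (i % d0, j % d1)

def local_elements(p, m, n, grid=(2, 2)):
    """Global (i, j) indices stored by process p = (p0, p1)."""
    d0, d1 = grid
    p0, p1 = p
    return [(i, j) for i in range(p0, m, d0) for j in range(p1, n, d1)]
```

With a 2 x 2 grid, process (0, 0) holds every other row and column of the matrix, which is exactly the wrapping the figure illustrates.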
Data distribution notation: The Basics
• Assign a distribution scheme to each mode of the object
– Indices of columns (mode 0) are distributed based on mode 0 of the grid
– Indices of rows (mode 1) are distributed based on mode 1 of the grid
The tuple assigned to each mode is referred to as the “mode distribution”
Example 1
• Distribute indices of columns based on mode 0 of grid
• Distribute indices of rows based on mode 1 of grid
Notes
• Distributions wrap elements on a logical view of the grid
– Allows multiple grid modes to be used in symbols
– Example: a symbol combining several grid modes views them as a single, larger grid mode
• An empty mode distribution represents replication
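One way to picture the “logical view” bullet in code: a mode distribution lists grid modes that are combined into one logical mode. The linearization order below (first listed grid mode varies fastest) is an assumption for illustration, as are the names.

```python
# Sketch: combining the grid modes listed in a mode distribution into one
# logical mode. An empty tuple means replication (every process maps to 0).

def combined_rank(coords, dims, modes):
    """Linear rank of process `coords` within the grid modes in `modes`."""
    rank, stride = 0, 1
    for m in modes:           # first listed grid mode varies fastest
        rank += coords[m] * stride
        stride *= dims[m]
    return rank
```

On a 2 x 3 grid, the symbol (0, 1) views the grid as a single mode of size 6, and reordering the tuple changes which process gets which rank.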
Notes
• We use boldface lowercase Roman letters to refer to mode
distributions
• Elements of mode distributions denoted with subscripts
• Concatenation of mode distributions denoted
Elemental Notation
• The distributions used by the Elemental library can be expressed in terms of the defined notation
Parallel matrix-matrix multiplication
• Heuristic – avoid communicating the “large” matrix
– Leads to “stationary” A, B, and C algorithm variants
• Stationary C algorithm:
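The stationary-C pattern can be simulated without MPI: each process owns a cyclically wrapped piece of C, obtains the rows of A and columns of B it needs (an allgather in a real implementation), and multiplies locally. The grid shape and function name are illustrative assumptions; NumPy indexing stands in for the actual collectives.

```python
# Simulated "stationary C" matrix multiply on a d0 x d1 grid:
# C's elements never move; A and B pieces are gathered to where C lives.
import numpy as np

def stationary_c(A, B, grid=(2, 2)):
    d0, d1 = grid
    C = np.zeros((A.shape[0], B.shape[1]))
    for p0 in range(d0):
        for p1 in range(d1):
            rows = list(range(p0, A.shape[0], d0))  # C rows owned by (p0, p1)
            cols = list(range(p1, B.shape[1], d1))  # C cols owned by (p0, p1)
            # "Allgather": this process obtains all of the summed dimension.
            C[np.ix_(rows, cols)] = A[rows, :] @ B[:, cols]
    return C
```

The loop over (p0, p1) plays the role of the processes running concurrently; the result matches an ordinary matrix multiply.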
Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms
Tensors and tensor contraction
• Tensor: an order-m (m-mode) operator
– Each mode is associated with a feature of the application
– Modes have a fixed length (dimension)
Notation
• Tensors in capital script
• Elements of tensors in lowercase Greek
• Element’s location in tensor as subscripts
Tensor contractions
• Einstein notation1 implicitly sums over modes shared by the inputs
• Transpose corresponds to an interchange of modes
• An arbitrary number of modes may be involved (any of which can be summed)
1A. Einstein. Die Grundlage der allgemeinen Relativitätstheorie. Annalen der Physik, 354:769–822, 1916
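The convention can be made concrete with `np.einsum`; the shapes below are arbitrary examples, not cases from the talk.

```python
# Einstein notation: modes shared by the inputs are summed implicitly.
# c_{ij} = a_{ik} b_{kj} sums over the shared mode k (a matrix multiply).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = np.einsum('ik,kj->ij', A, B)     # shared mode k is contracted

# The same convention covers any number of modes; here an order-3 input
# and an order-2 input share one contracted mode d:
T = rng.standard_normal((2, 4, 3))   # modes (a, d, b)
M = rng.standard_normal((4, 6))      # modes (d, c)
R = np.einsum('adb,dc->abc', T, M)   # shared mode d is summed
```

Interchanging letters in the subscript string ('adb' vs. 'abd') is exactly the transpose-as-mode-interchange bullet above.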
Tensor contractions
• Third-order Møller–Plesset1 method from computational chemistry
1R. J. Bartlett. Many-body perturbation theory and coupled cluster theory for electron correlation in molecules. Annual Review of Physical Chemistry, 32(1):359–401, 1981
Tensor contraction as MMmult
• Through permutation of data, the operands can be arranged so that a matrix-matrix multiplication can be performed
• Results in an algorithm that permutes, multiplies, then permutes back
• Requires a large rearrangement of data
– The cost of this operation is magnified in distributed-memory environments
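The mapping can be shown concretely: permute the contracted mode to the end, flatten the free modes, and call a matrix multiply. The shapes and the particular contraction below are arbitrary examples chosen for illustration.

```python
# Mapping a tensor contraction to a matrix-matrix multiply:
#   C[a,b,c] = sum_d A[a,d,b] * B[d,c]
# The transpose is the "large rearrangement of data" the slide refers to.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 5, 3))   # modes (a, d, b)
B = rng.standard_normal((5, 4))      # modes (d, c)

A_perm = A.transpose(0, 2, 1)        # (a, b, d): contracted mode moved last
C = (A_perm.reshape(2 * 3, 5) @ B).reshape(2, 3, 4)
```

On a single node the transpose is cheap relative to the multiply; in a distributed setting the equivalent permutation moves data between processes, which is why the slide flags its cost.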
Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms
Tensor distribution notation
• We’ve already seen the notation for order-2 tensors on order-2 grids
• What if the tensor is higher order?
– More modes to assign distribution symbols to
– Ex. an order-4 tensor
• What if the grid is higher order?
– More grid modes to choose from when creating distribution symbols
– Ex. mode distributions may only contain elements from {0,1,2} when computing on an order-3 grid
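A sketch of how the notation generalizes: each tensor mode gets a (possibly empty) tuple of grid modes, and together these determine which global indices a process stores. Function and argument names are mine, and the fastest-varying-first linearization is an assumption.

```python
# Which global indices does one process own, per tensor mode, under a
# given distribution? An empty tuple of grid modes means replication.

def owned_indices(mode_dist, shape, coords, dims):
    """mode_dist: one tuple of grid modes per tensor mode.
    Returns, per tensor mode, the global indices stored by `coords`."""
    owned = []
    for mode, grid_modes in enumerate(mode_dist):
        offset, stride = 0, 1
        for g in grid_modes:        # combined grid modes, first fastest
            offset += coords[g] * stride
            stride *= dims[g]
        owned.append(list(range(offset, shape[mode], stride)))
    return owned
```

The order-2 case reduces to the elemental-cyclic wrapping shown earlier, while higher-order tensors and grids just add tuples.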
Redistributions: Allgather
Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective
communication: theory, practice, and experience. Concurrency and Computation: Practice and
Experience, 19(13):1749–1783, 2007
Allgather in action
[Figure: the distribution of data before and after the allgather]
Redistributions: Allgather
• Allgather within mode performs the following redistribution
of data
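The effect can be simulated with plain lists: every process in the gathering mode contributes its cyclically wrapped indices, and all end with the union. In terms of the notation, my reading of the rule is that the gathered grid mode disappears from the corresponding mode distribution (e.g. a mode distributed as (1) becomes distributed as the empty tuple, i.e. replicated).

```python
# Simulated allgather within one grid mode: each participating process
# holds a cyclic slice of the indices; afterwards all hold everything.

def allgather_mode(locals_):
    merged = sorted(x for chunk in locals_ for x in chunk)
    return [merged[:] for _ in locals_]   # every process now replicated
```

Two processes holding the even and odd indices both end with the full index set, which is exactly the "before/after" picture on the previous slide.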
Redistribution rules
• Communication within grid modes can perform the following redistributions
– Ex. an allgather within a grid mode
Outline
• Description of parallel matrix-matrix multiplication
• Quick overview of tensors and tensor contractions
• A notation for distributing/redistributing tensors
• A method for deriving algorithms
Algorithm choices
• For matrix operations, “stationary” variants are useful
– Is extending these ideas to tensors also useful?
• There are potentially other “families” of algorithms to choose from
– For now, we focus only on those we know how to encode
Deriving Algorithms: Stationary
Assumed order-4 grid
• Avoid communicating the stationary operand
• Distribute modes similarly during local computation
• Do not reuse modes of the grid
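The third rule can be checked mechanically. This small validator (the name and representation are mine, not the talk's) treats a proposed distribution as valid only if no grid mode is claimed twice.

```python
# "Do not reuse modes of the grid": a proposed set of mode distributions
# is valid only if each grid mode appears at most once across all of them.

def grid_modes_unique(mode_dist):
    used = [g for grid_modes in mode_dist for g in grid_modes]
    return len(used) == len(set(used))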
Quick Note
• Blocking the described algorithms should be straightforward (as was done for matrix operations)
Analyzing algorithms
• Communication costs are obtained from
Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.
Analyzing the Stationary algorithm (assumed order-4 grid)
• Redistribute the first input tensor
– All-to-all within modes (2,3)
– Allgather within modes (1,2)
• Redistribute the second input tensor
– All-to-all within modes (0,1)
– Allgather within modes (3,0)
• Local tensor contraction
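The redistribution steps can be costed with the alpha-beta model from the cited Chan et al. paper: alpha is the per-message latency, beta the per-item transfer time, n the data per process, and p the number of processes in the collective. The formulas below are standard textbook estimates, not the talk's exact expressions.

```python
# Alpha-beta cost estimates for the two collectives used above.
import math

def allgather_cost(p, n, alpha, beta):
    # Latency term from a log-depth schedule; bandwidth term from each
    # process receiving the other (p-1)/p fraction of the data.
    return math.ceil(math.log2(p)) * alpha + (p - 1) / p * n * beta

def all_to_all_cost(p, n, alpha, beta):
    # Pairwise-exchange estimate: p-1 message rounds.
    return (p - 1) * alpha + (p - 1) / p * n * beta
```

Summing these terms over the redistributions on each slide gives the per-algorithm cost expressions used to compare variants.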
Analyzing the matrix-mapping approach
• Permute each of the three operands
• Local tensor contraction
• Permute the result back
Picking the “best” algorithm
• Stationary algorithm vs. matrix-multiply based algorithm
[Table: the collectives involved and the number of processes participating in each, for both algorithms]
How this all fits together
• Formalized aspects of distributed tensor computation
– Rules defining valid data distributions
– Rules specifying how collectives affect distributions
• Given a mechanical way to go from problem specification to
an implementation
• If other knowledge can be formalized, search space reduced
Acknowledgements
• Tamara G. Kolda – Sandia National Laboratories: Livermore
• Robert van de Geijn
• Bryan Marker
• Devin Matthews
• Tze Meng Low
• The FLAME team
Thank you
• This work has been funded by the following
– Sandia National Laboratories: Sandia Graduate Fellowship
– NSF CCF-1320112: SHF: Small: From Matrix Computations to Tensor Computations
– NSF ACI-1148125/1340293 (supplement): Collaborative Research: SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences
– Argonne National Laboratory for access to computing resources