Methods for High-Throughput Computation of Elementary Functions Marat Dukhan Richard Vuduc

Methods for High-Throughput Computation of
Elementary Functions
Marat Dukhan
Richard Vuduc
School of Computational Science and Engineering
College of Computing
Georgia Institute of Technology
September 10, 2013
M. Dukhan, R. Vuduc (Georgia Tech)
High-Throughput Elementary Functions
PPAM 10
1 / 33
Outline
1. Introduction
2. New Design Principles
   - Avoid Table Lookups
   - Avoid Hardware Division and Square Root
   - Avoid Unpredictable Branches
   - Use Pipelined Horner Scheme and Newton-Raphson Iterations
3. Performance and Accuracy Evaluation
Vector Elementary Functions
Figure: an input array ..., x_{i-1}, x_i, x_{i+1}, ... is mapped element by element to an output array ..., y_{i-1}, y_i, y_{i+1}, ..., where y_i = f(x_i) and f is one of log, exp, sin, tan, asin, atan, sinh, tanh, asinh, atanh.
In this research we consider the problem of computing functions such as exp, log, sin, and tan on a large array of inputs, producing an array of outputs.
We assume that the arrays are large enough that only throughput matters for the purposes of performance optimization.
Importance for Applications
Elementary functions are among the best-researched objects in mathematics, and their properties are well known.
Thus, they are widely used in mathematical models in different areas of science.
When these models are applied to arrays of data, vector elementary functions naturally arise.
Where are vector elementary functions used?
Non-parametric statistics
Machine learning
Information theory
Markov models
Random number generation
Contributions
Four design principles for portable high-throughput vector elementary functions, motivated by processor microarchitecture:
   - No table lookups
   - No division or square root operations
   - No branches except for special cases
   - Use the Horner scheme and Newton-Raphson iterations with software pipelining
A design for elementary functions that uses only hardware-efficient operations.
Performance evaluation of the proposed elementary function design on three architectures.
History of FPU and L1D Cache Performance
Legacy in Algorithms for Elementary Functions
Components of Floating-Point Performance
SIMD Table Lookup Options
Figure: data flow of an emulated SIMD table lookup (labels: SIMD register holding indices n0 ... n3, GP registers, L1 cache, table, SIMD registers holding the loaded entries x0 ... x3).
There are two options for extracting indices from a SIMD register:
   - Transfer each lane of the SIMD register directly to a GP register
   - Store the SIMD register to memory and load the individual components
Both options further reduce the (already low) table lookup performance.
SIMD Table Lookup Performance
The latency of a simulated SIMD table lookup is too high to hide with conventional hardware or software techniques.
Code that uses such table lookups is likely to suffer from pipeline stalls due to long dependency chains.
History of DIV/SQRT Performance
Source: throughput of DIV/SQRT instructions from A. Fog (2013).
DIV and SQRT on Modern Processors
Source: throughput of DIV/SQRT instructions from A. Fog (2013).
Branch Misprediction Cost: Cycles
Source: data on branch misprediction cycles from A. Fog (2013).
Branch Misprediction Cost: FLOPs
Source: data on branch misprediction cycles from A. Fog (2013).
Polynomial Approximations
An elementary function can be approximated by a polynomial:
$f(x) \approx P_n(x) = \sum_{k=0}^{n} c_k x^k$
Thanks to recent progress in polynomial approximation algorithms, the approximation error can be reduced to almost zero.
Figure: Error analysis for approximations of $\log(1 + x)$ by a degree-21 polynomial on $\frac{\sqrt{2}}{2} - 1 \le x \le \sqrt{2} - 1$
Horner Scheme
Definition
The Horner scheme is a method of evaluating a polynomial, which can be illustrated by the formula
$P_n(x) = c_0 + x \cdot (c_1 + x \cdot (c_2 + \dots + x \cdot (c_{n-1} + x \cdot c_n) \dots))$
The Horner scheme is optimal in terms of accuracy and number of FLOPs.
There are two parts in evaluating a polynomial with the Horner scheme:
1. Initialization: set $y \leftarrow c_n$
2. Iterations for $k = n - 1, \dots, 0$: $y \leftarrow c_k + x \cdot y$
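The two parts above can be sketched in a few lines of Python. This is a minimal scalar illustration, not the authors' assembly implementation; the function name `horner` and the coefficient ordering (lowest degree first) are assumptions made for this example.

```python
def horner(coeffs, x):
    """Evaluate c[0] + c[1]*x + ... + c[n]*x**n with the Horner scheme.

    coeffs is ordered from c_0 to c_n (an assumption of this sketch).
    """
    y = coeffs[-1]                      # initialization: y <- c_n
    for c in reversed(coeffs[:-1]):     # iterations for k = n-1, ..., 0
        y = c + x * y                   # y <- c_k + x * y
    return y

# P(x) = 1 + 2x + 3x^2 evaluated at x = 2
print(horner([1.0, 2.0, 3.0], 2.0))  # 17.0
```

Each iteration is one multiplication followed by one addition, so a degree-n polynomial costs n multiply-add steps, and every step depends on the result of the previous one.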
Batch Processing
Figure: a stream of inputs x_{i-1}, ..., x_{i+8} and outputs y_{i-1}, ..., y_{i+8}; eight consecutive elements are processed as one batch, $y_{i \dots i+7} = f(x_{i \dots i+7})$.
To hide the latency and fully utilize hardware capabilities, implementations must process a batch of elements at a time.
Each batch must contain
$\frac{\mathrm{Latency}(\mathrm{MUL}) + \mathrm{Latency}(\mathrm{ADD})}{\mathrm{Throughput}(\mathrm{MUL} + \mathrm{ADD})}$ or $\frac{\mathrm{Latency}(\mathrm{FMA})}{\mathrm{Throughput}(\mathrm{FMA})}$
SIMD registers, e.g.
   - AMD Piledriver: 5 AVX registers (20 elements)
   - Intel Sandy Bridge: 8 AVX registers (16 elements)
   - Intel Haswell: 10 AVX registers (40 elements)
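The batch-size rule can be checked with a small calculation. The sketch below assumes that "Throughput" in the formula means reciprocal throughput, that is, the number of cycles between the starts of two independent instructions; the helper name `batch_registers` is hypothetical. The Sandy Bridge latencies (ADD 3 cycles, MUL 5 cycles, one of each started per cycle) are the ones quoted on the pipelined Horner scheme slide. The Haswell figures (FMA latency 5 cycles, two FMAs started per cycle) are not stated in the slides and are an assumption of this example.

```python
import math

def batch_registers(latency_cycles, reciprocal_throughput_cycles):
    """Number of independent SIMD registers needed to keep the pipeline full."""
    return math.ceil(latency_cycles / reciprocal_throughput_cycles)

# Sandy Bridge: a MUL followed by a dependent ADD takes 5 + 3 cycles, and a
# new MUL+ADD pair can start every cycle.
sandy_bridge = batch_registers(3 + 5, 1.0)

# Haswell (assumed figures, see the note above): FMA latency 5 cycles,
# reciprocal throughput 0.5 cycles.
haswell = batch_registers(5, 0.5)

print(sandy_bridge, haswell)  # 8 10
```

Both results agree with the register counts listed for these two processors.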
Pipelined Horner Scheme
With a batch of input elements and some hand-scheduled assembly, the hardware can evaluate polynomials at peak floating-point performance.
Figure: A Sandy Bridge processor can start one FP ADD and one FP MUL each cycle. An FP ADD delivers its result after 3 cycles, and an FP MUL after 5. With a batch of 8 AVX registers it is possible to hide the latency completely.
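The idea behind the pipelined scheme can be sketched in plain Python by swapping the loop order: the outer loop walks over the coefficients and the inner loop over the batch. Python itself gains nothing from this ordering; the sketch only shows the dependency structure that a hand-scheduled SIMD kernel exploits. The function name `horner_batch` is hypothetical, and each list element stands in for one SIMD register.

```python
def horner_batch(coeffs, xs):
    """Evaluate the same polynomial at every element of xs.

    The outer loop runs over coefficients and the inner loop over the batch,
    so consecutive multiply-add steps belong to different elements and do
    not depend on each other.
    """
    ys = [coeffs[-1]] * len(xs)                    # y_j <- c_n for every element
    for c in reversed(coeffs[:-1]):                # k = n-1, ..., 0
        ys = [c + x * y for x, y in zip(xs, ys)]   # independent updates
    return ys

# P(x) = 1 + 2x + 3x^2 on a batch of three inputs
print(horner_batch([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))  # [1.0, 6.0, 17.0]
```

With a batch as large as the pipeline depth, the multiply-add for one element can be issued while the results for the other elements are still in flight.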
Newton-Raphson Iterations
If we have a rough approximation $y_n$ to the reciprocal of $x$, we can improve it with a Newton-Raphson iteration.
With FMA, where $\mathrm{FMA}(a, b, c) = a \cdot b + c$ and $\mathrm{FNMA}(a, b, c) = c - a \cdot b$:
$\varepsilon_n = \mathrm{FNMA}(y_n, x, 1)$
$y_{n+1} = \mathrm{FMA}(y_n, \varepsilon_n, y_n)$
Converges to 0.5 ULP accuracy (P. Markstein, 2000).
Without FMA:
$y_{n+1} = y_n (2 - y_n \cdot x)$
Converges to 1.5 ULP accuracy.
In a few iterations the Newton-Raphson algorithm converges to a good approximation of the reciprocal.
On x86, the RCPPS instruction provides an initial approximation.
Newton-Raphson iterations can also be used to compute square roots.
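Both forms of the reciprocal iteration can be tried out in Python. This is a sketch under stated assumptions: Python floats are double precision and the arithmetic here is unfused, so the residual form below only mirrors the structure of the FMA version and does not reproduce its 0.5 ULP behaviour. The starting value 0.333 is a hypothetical stand-in for a hardware estimate, and the function names are made up for this example.

```python
def refine_reciprocal(x, y, iterations=3):
    """Improve y ~ 1/x using y <- y * (2 - y * x), with no division."""
    for _ in range(iterations):
        y = y * (2.0 - y * x)
    return y

def refine_reciprocal_residual(x, y, iterations=3):
    """Same iteration in residual form: e = 1 - y*x, then y <- y + y*e.

    With hardware FMA each of the two steps would be rounded once; plain
    Python arithmetic rounds the product and the sum separately.
    """
    for _ in range(iterations):
        e = 1.0 - y * x
        y = y + y * e
    return y

x = 3.0
y0 = 0.333   # hypothetical rough estimate of 1/3, relative error about 1e-3
print(refine_reciprocal(x, y0), refine_reciprocal_residual(x, y0))
```

The relative error is roughly squared on every step (about 1e-3, 1e-6, 1e-12, then rounding level), which is why a low-precision hardware estimate needs only a few iterations.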
Experimental Implementation
To validate the four design principles, we developed an experimental implementation of several vector elementary functions:
   - log function (full domain)
   - exp function (full domain)
   - sin and cos functions (reduced domain $|x| < 1.6 \times 10^6$)
   - tan function (reduced domain $|x| < 1.6 \times 10^6$, for AMD Piledriver only)
We have versions of these functions for 6 x86 microarchitectures:
   - Intel Haswell
   - Intel Sandy Bridge
   - Intel Nehalem (not presented here)
   - AMD Piledriver
   - AMD K10 (not presented here)
   - AMD Bobcat (not presented here)
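To make the design principles concrete, here is a scalar Python sketch of how an exp function might be built without table lookups, division, or data-dependent branches. It is not the authors' implementation. It assumes a Taylor polynomial in place of tuned minimax coefficients, uses the well-known fdlibm split of ln(2) into high and low parts, and handles only moderate inputs: overflow, underflow, NaN, and infinities are ignored, whereas a full-domain version would need them. The name `exp_sketch` and all constant names are chosen for this example.

```python
import math
import struct

LOG2E = 1.4426950408889634            # 1 / ln(2), so no division is needed
LN2_HI = 6.93147180369123816490e-01   # high part of ln(2) (fdlibm constant)
LN2_LO = 1.90821492927058770002e-10   # low part of ln(2) (fdlibm constant)
MAGIC = 6755399441055744.0            # 1.5 * 2**52, rounds to nearest integer
# Taylor coefficients 1/k! for k = 0..13, precomputed once (an assumption of
# this sketch; a tuned implementation would use minimax coefficients).
COEFFS = [1.0 / math.factorial(k) for k in range(14)]

def exp_sketch(x):
    """exp(x) for moderate |x| using only add, multiply and bit operations."""
    # Range reduction: x = n*ln(2) + r with |r| <= ln(2)/2 (approximately).
    t = x * LOG2E + MAGIC
    n = t - MAGIC                     # nearest integer to x / ln(2), as a float
    r = (x - n * LN2_HI) - n * LN2_LO
    # Horner evaluation of the polynomial approximating exp(r).
    p = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        p = c + r * p
    # Reconstruction: build 2**n directly from its IEEE 754 exponent bits.
    bits = (int(n) + 1023) << 52
    scale = struct.unpack("<d", struct.pack("<q", bits))[0]
    return p * scale

print(exp_sketch(1.0), math.exp(1.0))
```

The per-element work consists of floating-point multiplications and additions plus one integer shift and a bit reinterpretation, all of which map directly onto SIMD instructions. The loop over coefficients has a fixed trip count, so it introduces no unpredictable branches.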
Log function/AMD Piledriver
Exp function/Intel Haswell
Sin function/Intel Sandy Bridge
Tan function/AMD Piledriver
Public availability
We open-sourced the software that was developed as part of this research:
   - Hysteria, a tool for testing the performance and accuracy of LibM libraries, is available at bitbucket.org/MDukhan/hysteria
   - Peach-Py, a Python framework for writing assembly kernels, is hosted at bitbucket.org/MDukhan/peachpy
   - The Yeppp! library provides the vector elementary functions presented here as well as other SIMD-accelerated vector operations. The library is publicly available at www.yeppp.info
Summary
We suggest that algorithms for vector elementary functions would be faster if they followed four design rules: avoid table lookups, avoid hardware division and square root, avoid branches, and use pipelined polynomial evaluations and Newton-Raphson iterations.
We demonstrated that four elementary functions can be implemented with only floating-point addition and multiplication together with integer and logical operations.
We evaluated the effect of the proposed design rules on performance and found that the resulting implementations are competitive with the best alternatives.
Funding
This research was supported in part by:
   - The National Science Foundation (NSF) under NSF CAREER award number 0953100.
   - The U.S. Dept. of Energy (DOE), Office of Science, Advanced Scientific Computing Research under award DE-FC02-10ER26006/DE-SC0004915.
   - Grants from the Defense Advanced Research Projects Agency (DARPA) Computer Science Study Group program.
Disclaimer
Any opinions, conclusions or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of NSF, DOE, or DARPA.