Methods for High-Throughput Computation of Elementary Functions

Marat Dukhan, Richard Vuduc
School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology
September 10, 2013

M. Dukhan, R. Vuduc (Georgia Tech), High-Throughput Elementary Functions, PPAM 10

Outline

1 Introduction
2 New Design Principles
  - Avoid Table Lookups
  - Avoid Hardware Division and Square Root
  - Avoid Unpredictable Branches
  - Use Pipelined Horner Scheme and Newton-Raphson Iterations
3 Performance and Accuracy Evaluation

Vector Elementary Functions

[Diagram: an input array $\dots, x_{i-1}, x_i, x_{i+1}, \dots$ is mapped element by element to an output array $\dots, y_{i-1}, y_i, y_{i+1}, \dots$, where $y_i = f(x_i)$ and $f$ is one of log, exp, sin, tan, asin, atan, sinh, tanh, asinh, atanh.]

In this research we consider the problem of computing functions such as exp, log, sin, and tan on a large array of inputs, producing an array of outputs. We assume that the arrays are large enough that only throughput matters for performance optimization.

Importance for Applications

Elementary functions are among the best-researched objects in mathematics, and their properties are well known. Thus, they are widely used in mathematical models in different areas of science. When these models are applied to arrays of data, vector elementary functions naturally arise.

Where are vector elementary functions used?
- Non-parametric statistics
- Machine learning
- Information theory
- Markov models
- Random number generation

Contributions

- Four design principles for portable high-throughput vector elementary functions, motivated by processor microarchitecture.
  - No table lookups
  - No division or square root operations
  - No branches except for special cases
  - Use Horner scheme and Newton-Raphson with software pipelining
- Elementary functions design using only hardware-efficient operations.
- Performance evaluation of the proposed elementary functions design on three architectures.

2 New Design Principles

Avoid Table Lookups

[Figure: History of FPU and L1D Cache Performance]

[Figure: Legacy in Algorithms for Elementary Functions]

[Figure: Components of Floating-Point Performance]

SIMD Table Lookup Options

[Diagram: indices n0, n1, n2, n3 move from a SIMD register to GP registers, and the table entries x0, x1, x2, x3 are loaded from the table in L1 cache into SIMD registers.]

There are two options for extracting indices from a SIMD register:
- Use direct transfers from the lanes of the SIMD register to GP registers
- Store the SIMD register to memory and load the individual components

Both options reduce the (already low) table lookup performance.

SIMD Table Lookup Performance

The latency of a simulated SIMD table lookup operation is too high to hide with conventional hardware or software techniques. Code that uses such table lookups is likely to suffer from pipeline stalls due to long dependency chains.

Avoid Hardware Division and Square Root

[Figure: History of DIV/SQRT Performance. Throughput of DIV/SQRT instructions from A. Fog (2013).]

[Figure: DIV and SQRT on Modern Processors. Throughput of DIV/SQRT instructions from A. Fog (2013).]

Avoid Unpredictable Branches

[Figure: Branch Misprediction Cost: Cycles. Data on branch misprediction cycles from A. Fog (2013).]

[Figure: Branch Misprediction Cost: FLOPs. Data on branch misprediction cycles from A. Fog (2013).]

Use Pipelined Horner Scheme and Newton-Raphson Iterations

Polynomial Approximations

An elementary function can be approximated by a polynomial:

$$f(x) \approx P_n(x) = \sum_{k=0}^{n} c_k x^k$$

Due to recent progress in polynomial approximation algorithms, the approximation error can be reduced to almost zero.
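As a rough illustration of the formula above, the following Python sketch approximates $\log(1 + x)$ by a degree-21 polynomial and measures its worst error on $\frac{\sqrt{2}}{2} - 1 \le x \le \sqrt{2} - 1$. It is a sketch under stated assumptions: truncated Taylor coefficients $c_k = (-1)^{k+1}/k$ stand in for the optimized coefficients a real implementation would use, and the names `poly_log1p` and `max_abs_error` are hypothetical.

```python
import math

DEGREE = 21

# ASSUMPTION: Taylor coefficients of log(1 + x), c_0 = 0 and c_k = (-1)**(k + 1) / k.
# They are a simple stand-in, not the coefficients used in the talk.
COEFFS = [0.0] + [(-1.0) ** (k + 1) / k for k in range(1, DEGREE + 1)]

# Approximation interval: sqrt(2)/2 - 1 <= x <= sqrt(2) - 1.
LO = math.sqrt(2.0) / 2.0 - 1.0
HI = math.sqrt(2.0) - 1.0


def poly_log1p(x):
    """Evaluate P_n(x) = sum of c_k * x**k, term by term as the sum is written."""
    return sum(c * x ** k for k, c in enumerate(COEFFS))


def max_abs_error(samples=1001):
    """Largest |P_n(x) - log(1 + x)| over evenly spaced points in [LO, HI]."""
    worst = 0.0
    for i in range(samples):
        x = LO + (HI - LO) * i / (samples - 1)
        worst = max(worst, abs(poly_log1p(x) - math.log1p(x)))
    return worst


if __name__ == "__main__":
    print("max abs error on the interval:", max_abs_error())
```

On this interval the truncated Taylor series is already accurate to better than $10^{-9}$; a polynomial fitted to minimize the maximum error would do considerably better at the same degree.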
Figure: Error analysis for approximations of $\log(1 + x)$ by a degree-21 polynomial on $\frac{\sqrt{2}}{2} - 1 \le x \le \sqrt{2} - 1$.

Horner Scheme

Definition: The Horner scheme is a method of evaluating a polynomial, which can be illustrated by the formula

$$P_n(x) = c_0 + x \cdot (c_1 + x \cdot (c_2 + \dots + x \cdot (c_{n-1} + x \cdot c_n) \dots))$$

The Horner scheme is optimal in terms of accuracy and number of FLOPs.

There are two parts in evaluating a polynomial with the Horner scheme:
1. Initialization: set $y \leftarrow c_n$
2. Iterations for $k = n - 1, \dots, 0$: $y \leftarrow c_k + x \cdot y$

Batch Processing

[Diagram: eight consecutive inputs $x_i, \dots, x_{i+7}$ are processed together as one batch, $y_{i \dots i+7} = f(x_{i \dots i+7})$.]

To hide latency and fully utilize hardware capabilities, implementations must process a batch of elements at a time. Each batch must contain

$$\frac{\mathrm{Latency}(\mathrm{MUL}) + \mathrm{Latency}(\mathrm{ADD})}{\mathrm{Throughput}(\mathrm{MUL} + \mathrm{ADD})} \quad \text{or} \quad \frac{\mathrm{Latency}(\mathrm{FMA})}{\mathrm{Throughput}(\mathrm{FMA})}$$

SIMD registers (with throughput expressed in cycles per instruction), e.g.
- AMD Piledriver: 5 AVX registers (20 elements)
- Intel Sandy Bridge: 8 AVX registers (16 elements)
- Intel Haswell: 10 AVX registers (40 elements)

Pipelined Horner Scheme

With a batch of input elements and some assembly magic, hardware can evaluate polynomials at peak floating-point performance.

Figure: A Sandy Bridge processor can start one FP ADD and one FP MUL each cycle. FP ADD delivers its result after 3 cycles, and FP MUL after 5. With a batch of 8 AVX registers it is possible to completely hide the latency.

Newton-Raphson Iterations

If we have a rough approximation to a reciprocal, we can improve it with a Newton-Raphson iteration.
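A minimal Python sketch of this refinement, assuming the FMA-free update $y \leftarrow y \cdot (2 - y \cdot x)$ that results from applying Newton's method to $f(y) = 1/y - x$. The function name `newton_reciprocal`, the linear starting guess, and the iteration count are illustrative assumptions rather than details from the talk. The sketch handles only finite $x > 0$.

```python
import math


def newton_reciprocal(x, iterations=4):
    """Approximate 1/x for finite x > 0 without using division."""
    # Split x = m * 2**e with 0.5 <= m < 1, so only 1/m needs refinement.
    m, e = math.frexp(x)
    # ASSUMPTION: crude linear starting guess 48/17 - (32/17) * m,
    # whose relative error on [0.5, 1) is at most 1/17.
    y = 2.823529411764706 - 1.8823529411764706 * m
    for _ in range(iterations):
        # Newton-Raphson step: the relative error is roughly squared each time.
        y = y * (2.0 - y * m)
    # Undo the scaling: 1/x = (1/m) * 2**(-e).
    return math.ldexp(y, -e)


if __name__ == "__main__":
    for value in (3.0, 0.1, 7.5, 1e10):
        print(value, newton_reciprocal(value), 1.0 / value)
```

Because each step roughly squares the relative error, a starting error of 1/17 drops below double-precision rounding error within four steps; a more accurate starting estimate would need fewer.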
With FMA:

$$\varepsilon_n = \mathrm{FNMA}(y_n, x, 1) = 1 - y_n \cdot x, \qquad y_{n+1} = \mathrm{FMA}(y_n, \varepsilon_n, y_n) = y_n + y_n \cdot \varepsilon_n$$

Converges to 0.5 ULP accuracy (P. Markstein, 2000).

Without FMA:

$$y_{n+1} = y_n (2 - y_n \cdot x)$$

Converges to 1.5 ULP accuracy.

Here $x$ is the input and $y_n$ is the current approximation of $1/x$. In a few iterations the Newton-Raphson algorithm converges to a good approximation of the reciprocal. On x86 the RCPPS instruction provides an initial approximation. Newton-Raphson iterations can also be used to compute square roots.

3 Performance and Accuracy Evaluation

Experimental Implementation

To validate the four design principles, we developed an experimental implementation of several vector elementary functions:
- log function (full domain)
- exp function (full domain)
- sin and cos functions (reduced domain $|x| < 1.6 \times 10^6$)
- tan function (reduced domain $|x| < 1.6 \times 10^6$, for AMD Piledriver only)

We have versions of these functions for 6 x86 microarchitectures:
- Intel Haswell
- Intel Sandy Bridge
- Intel Nehalem (not presented here)
- AMD Piledriver
- AMD K10 (not presented here)
- AMD Bobcat (not presented here)

[Figure: Log function / AMD Piledriver]

[Figure: Exp function / Intel Haswell]

[Figure: Sin function / Intel Sandy Bridge]

[Figure: Tan function / AMD Piledriver]

Public availability

We open-sourced the software that was developed as part of this research:
- Hysteria, a tool for testing the performance and accuracy of LibM libraries, is available at bitbucket.org/MDukhan/hysteria
- Peach-Py, a Python framework for writing assembly kernels, is hosted at bitbucket.org/MDukhan/peachpy
- The Yeppp! library provides the vector elementary functions presented here as well as other SIMD-accelerated vector operations. The library is publicly available at www.yeppp.info

Summary

- We suggest that algorithms for vector elementary functions would be faster if they followed four design rules: avoid table lookups, avoid hardware division and square root, avoid branches, and use pipelined polynomial evaluations and Newton-Raphson iterations.
- We demonstrated that four elementary functions can be implemented with only floating-point addition and multiplication together with integer and logical operations.
- We evaluated the effect of the proposed design rules on performance and found that the resulting implementations are competitive with the best alternatives.

Funding

This research was supported in part by:
- The National Science Foundation (NSF) under NSF CAREER award number 0953100.
- The U.S. Dept. of Energy (DOE), Office of Science, Advanced Scientific Computing Research under award DE-FC0210ER26006/DESC0004915.
- Grants from the Defense Advanced Research Projects Agency (DARPA) Computer Science Study Group program.

Disclaimer

Any opinions, conclusions or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of NSF, DOE, or DARPA.