Number Theory Meets Cache Locality – Efficient Implementation of a Small Prime FFT for the GNU Multiple Precision Arithmetic Library

Tommy Färnqvist

TRITA-NA-E05091

Master's Thesis in Computer Science (20 credits) within the First Degree Programme in Mathematics and Computer Science, Stockholm University, 2005
Supervisor at Nada was Stefan Nilsson
Examiner was Stefan Arnborg

Department of Numerical Analysis and Computer Science (Numerisk analys och datalogi), Royal Institute of Technology (KTH), SE-100 44 Stockholm, Sweden

Abstract

When multiplying really large integer operands, the GNU Multiple Precision Arithmetic Library uses a method based on the Fast Fourier Transform. To make an algorithm execute quickly on a modern computer, its data has to be available in the cache memory. If that is not the case, a large portion of the execution time will be spent accessing the main memory. It can therefore pay off to perform a considerable amount of extra work to achieve good cache locality; in extreme cases, 500 primitive operations may be performed in the time of a single memory access. This report describes the implementation of a cache friendly variant of the Fast Fourier Transform and its application to integer multiplication. The variant uses arithmetic modulo primes near machine word-size. The multiplication method is shown to be competitive with its counterpart in version 4.1.4 of the GNU Multiple Precision Arithmetic Library on the platforms of interest.

Preface

This is my Master's thesis in Computer Science. The project was performed within the First Degree Programme in Mathematics and Computer Science at the Department of Numerical Analysis and Computer Science, Stockholm University. I would like to sincerely thank my supervisor at Swox AB, Torbjörn Granlund, for inviting me into the exciting world of bignum arithmetic and for showing me what the essence of a nice hack truly is. I am obliged to my supervisor at NADA, Stefan Nilsson, who taught me algorithms, complexity theory and computer architecture – tools that have come in very handy during all phases of this project. Thanks also go to Niels Möller, of KTH, Stockholm, for his mathematical insights. Last, but not least, I am grateful to the Medicis team at École Polytechnique, France, for assisting the project with computer time and resources.
Contents

1 Introduction
  1.1 Big Numbers, Who Needs Them?
  1.2 GMP and Multiple-precision Multiplication
    1.2.1 Schoolbook multiplication
    1.2.2 Karatsuba's algorithm
    1.2.3 Toom's algorithm
    1.2.4 Fermat Style FFT Multiplication
  1.3 Goals of this Project
2 Preliminaries
  2.1 The FFT and Integer Multiplication
    2.1.1 Primitive Roots of Unity
    2.1.2 The Discrete Fourier Transform
    2.1.3 The Fast Fourier Transform Algorithm
    2.1.4 Avoiding Recursion
  2.2 Computer Memory Hierarchy
3 FFT Based Integer Multiplication in a Finite Field
  3.1 Feasibility of mod p FFTs
  3.2 "Three primes" Algorithm for Integer Multiplication
4 Implementation Issues
  4.1 Cache Locality
  4.2 Modular Reduction
  4.3 Two Wrongs Make a Right
5 Results
References
Appendix: Acronyms

Chapter 1

Introduction

This first chapter presents the background setting for the project. Section 1.1 gives a brief account of why the project is interesting. Section 1.2 introduces the concept of multiple-precision integers and the algorithms used in the GNU Multiple Precision Arithmetic Library, GMP [14], for multiplication of such integers. (A list of acronyms and abbreviations used in this thesis can be found in the Appendix.) The goals of the project and the further outline of this thesis are given in Section 1.3.

1.1 Big Numbers, Who Needs Them?

This thesis deals with multiplication of large integers, specifically multiplication based on the Fast Fourier Transform, FFT, algorithm. Multiplication is a very basic operation, and if we cannot solve that computational problem satisfactorily, the chances of conquering more complex ground are slim. It is well worth noting that the currently best known multiplication algorithm, published in [32] in 1971, is not known to have asymptotically optimal time complexity! There might be room for improvement, and that alone should be incentive enough for anyone to get interested.

But what good are such large integers? Of course there are applications, such as the very popular RSA algorithm, see [31], where integers represented by several hundred bits are needed. When experimenting with conjectures from number theory it is not uncommon to need as high a precision as current funds can buy. Then there are those who seek out larger and larger primes, try to factor big numbers, or just like to see what the first few billion decimals of π look like.

1.2 GMP and Multiple-precision Multiplication

Our normal way of thinking of an integer, in a positional number system representation with base 10, generalises to any base B ∈ Z, B ≥ 2. An m-digit base B number, (u_{m-1} ... u_0)_B, represents the quantity u_{m-1} B^{m-1} + ... + u_1 B + u_0, where 0 ≤ u_i < B for each u_i.
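For example, with B = 2^{64} the integer 2^{70} + 5 = 64 · 2^{64} + 5 is the two-digit number (u_1, u_0) = (64, 5) in this representation.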
A computer's hardware can only perform operations on integers smaller than the machine word-size W. If bigger numbers are needed, the programmer has to provide routines for the representation and manipulation of multiple-precision integers – integers of the form u = (u_{m-1} ... u_1 u_0)_B, where B ≤ W. This facilitates development of the classical arithmetical algorithms with a base B much larger than 10.

GMP is a free software library written in C that provides routines for manipulating arbitrary precision numbers. The library obtains its speed by operating on full words, by using the best algorithm for the given problem size, by including hand optimised assembly code for the most common inner loops on many CPUs, and by a general emphasis on speed. GMP implements four different integer multiplication algorithms (schoolbook multiplication, Karatsuba, Toom-3 and a Fermat style FFT multiplication). This is in line with one of the governing design paradigms of GMP: the asymptotically better algorithms are not always faster than the slower ones. The asymptotically fast algorithms are often associated with a fair amount of overhead that does not pay off until larger instances of the problem are to be solved. The choice of when to switch algorithms in GMP is made at compile time.

The hardware is assumed to be capable of the following arithmetic operations:

• Addition of two single-digit numbers u and v, giving a single-digit result u + v (or u + v − B if u + v ≥ B).
• Subtraction of one single-digit number v from a single-digit number u, giving a single-digit result u − v (or u − v + B if u − v < 0).
• Multiplication of two single-digit numbers u and v, giving a two-digit result uv.
• Division of a two-digit number u by a single-digit number v, giving a single-digit quotient u/v and a single-digit remainder u mod v. (u/v < B is required.)

Throughout the following descriptions of multiplication algorithms the input operands, U = u_{n-1} B^{n-1} + u_{n-2} B^{n-2} + ... + u_1 B + u_0 and V = v_{m-1} B^{m-1} + v_{m-2} B^{m-2} + ... + v_1 B + v_0, are assumed to be nonnegative.

1.2.1 Schoolbook multiplication

This method is very similar to the one taught in grade school, except that storage of all partial products is not required. It is just the usual set of cross-products, as presented in Algorithm 1.

Algorithm 1: Schoolbook multiplication.
Input: U, V
Output: The product of U and V.
M(U, V)
(1)  r_i ← 0 for all 0 ≤ i < m
(2)  j ← 0
(3)  while j < n
(4)    k ← 0
(5)    q_{-1} ← 0
(6)    while k < m
(7)      t_k ← r_{j+k} + q_{k-1} + u_j v_k
(8)      r_{j+k} ← t_k mod B
(9)      q_k ← ⌊t_k / B⌋
(10)     k ← k + 1
(11)   r_{j+m} ← q_{m-1}
(12)   j ← j + 1
(13) return R ← r_{n+m-1} B^{n+m-1} + ... + r_1 B + r_0

It is straightforward to prove, by induction over j, that Algorithm 1 returns the product R of U and V. Obviously the algorithm requires Θ(nm) time. If multiplication truly were a quadratic operation, there would not be much hope of multiplying large numbers at all. Fortunately, there are faster algorithms.
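For concreteness, a minimal C sketch of Algorithm 1 on arrays of 64-bit digits is given below (so B = 2^{64}; the two-digit products are obtained through the unsigned __int128 extension available in, e.g., GCC). This is only an illustration, not GMP's tuned mpn code.

#include <stddef.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* Schoolbook multiplication: r[0..n+m-1] = u[0..n-1] * v[0..m-1].
   All numbers are stored least significant digit first, with B = 2^64. */
void schoolbook_mul(uint64_t *r, const uint64_t *u, size_t n,
                    const uint64_t *v, size_t m)
{
    for (size_t i = 0; i < n + m; i++)
        r[i] = 0;
    for (size_t j = 0; j < n; j++) {
        uint64_t carry = 0;                       /* plays the role of q   */
        for (size_t k = 0; k < m; k++) {
            u128 t = (u128) u[j] * v[k] + r[j + k] + carry;
            r[j + k] = (uint64_t) t;              /* t mod B               */
            carry    = (uint64_t) (t >> 64);      /* floor(t / B)          */
        }
        r[j + m] = carry;
    }
}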
1.2.2 Karatsuba's algorithm

We want to multiply two n-digit numbers u and v. Assume n is a power of two and write u = x + B^{n/2} y and v = w + B^{n/2} z, where x, y, w, z < B^{n/2}. What we want is uv = (x + B^{n/2} y)(w + B^{n/2} z) = xw + B^{n/2}(xz + yw) + B^n yz. Using that xz + yw = (x + y)(w + z) − xw − yz, it is enough to calculate three n/2-digit products, xw, yz and (x + y)(w + z), which can be computed by recursion. (The factors (x + y) and (w + z) might have n/2 + 1 digits each; this extra digit should be handled separately, but that is just a linear amount of extra work and would only complicate the analysis.) On each recursion level there is also need for a few shifts and additions.

Algorithm 2: Karatsuba's algorithm.
Input: U, V, n
Output: The product of U and V.
M(U, V, n)
(1) if n = 1
(2)   return uv
(3) Let u ← x + B^{n/2} y and v ← w + B^{n/2} z
(4) Compute t_1 ← xw using Karatsuba's algorithm.
(5) Compute t_2 ← yz using Karatsuba's algorithm.
(6) Compute t_3 ← (x + y)(w + z) using Karatsuba's algorithm.
(7) t_4 ← t_3 − t_2 − t_1
(8) return t_1 + B^{n/2} t_4 + B^n t_2

Algorithm 2 was first published in [22]. If it requires M(n) time, M(n) = 3M(n/2) + O(n). This recurrence relation has the solution M(n) = O(n^{log_2 3}). Since log_2 3 ≈ 1.585, this is considerably faster than schoolbook multiplication, and operands do not have to be very large at all for Karatsuba to beat the naive algorithm.

1.2.3 Toom's algorithm

Karatsuba's method is the simplest case of a general way to split input operands that leads to both the Toom and FFT algorithms. Let r ≥ 2 be the, small, splitting parameter. The operands are split in r pieces and viewed as degree r − 1 polynomials. The polynomials are then evaluated at 2r − 1 points, pointwise multiplied and interpolated, as in Algorithm 3.

Algorithm 3: Toom's algorithm.
Input: U, V, n, r
Output: The product of U and V.
M(U, V, n, r)
(1) if n < r
(2)   Compute the product using schoolbook multiplication.
(3) Split the operands into r pieces of ⌈n/r⌉ digits each and set s ← B^{⌈n/r⌉}.
(4) Let u ← u_{r-1} s^{r-1} + ... + u_1 s + u_0 and v ← v_{r-1} s^{r-1} + ... + v_1 s + v_0.
(5) Let U(t) ← u_{r-1} t^{r-1} + ... + u_1 t + u_0 and V(t) ← v_{r-1} t^{r-1} + ... + v_1 t + v_0.
(6) Evaluate U(t) and V(t) at small values of t.
(7) Compute W(t) ← U(t)V(t) for those t using 2r − 1 recursive calls.
(8) Compute the 2r − 1 coefficients of the product polynomial Z(t) by interpolation.
(9) return Z(s)

Algorithm 3 was published in [34]. If M(n) is the execution time of the algorithm, M(n) = (2r − 1)M(n/r) + O(n), which can be solved as M(n) = O(n^{log_r(2r−1)}). By letting r → ∞ we get a multiplication algorithm with O(n^{1+ε}) time complexity for any ε > 0. Of course the overhead grows with r, and in practice only small values of r are worth implementing. GMP implements Toom-3, an O(n^{log_3 5}) = O(n^{1.465}) algorithm. The larger overhead of Toom, from more work in evaluation and interpolation, shows up here, and so it pays off from operand sizes of a few hundred machine words.

1.2.4 Fermat Style FFT Multiplication

If we pick the evaluation points in Algorithm 3 as roots of unity, instead of arbitrarily, the evaluation is a Discrete Fourier Transform, and if r is chosen wisely it can be computed by the Fast Fourier Transform algorithm. For large operand sizes GMP uses a Fermat style FFT multiplication, as in [32], an O(n log n log log n) algorithm. The product computed is uv mod 2^N + 1 with N ≥ bits(u) + bits(v), padding u and v with high zero words. The algorithm follows a scheme similar to that of Karatsuba and Toom (splitting, pointwise multiplication, interpolation). The points chosen for evaluation are powers of two, so the operations needed for the FFT are only additions, shifts and negations. For more on multiplication algorithms, see [23], and the GMP manual for GMP specifics. See also [20], or why not the GMP source code; it is free. In Figure 1.1, the typical behaviour of GMP's multiplication algorithms for different problem sizes is shown.
Figure 1.1. Execution times (in seconds) of GMP's multiplication routines as a function of operand size (in words). Karatsuba is faster than schoolbook multiplication at 22 words, Toom-3 overtakes Karatsuba at 274 words and the FFT based code outperforms Toom-3 at 8448 words. Measurements are from an Intel Pentium 4, 2.4 GHz.

1.3 Goals of this Project

The goal of this project was to improve upon the FFT based multiplication scheme in GMP 4.1.4 by implementing a faster FFT scheme over the finite field Z/Zp, for primes p near machine word size. The reason this scheme is relatively untried is that conventional FFT implementations use IEEE double precision floating point numbers with 53 bits of precision, giving them a clear advantage over the usual 32 bit integer precision. The approach used in GMP tries to circumvent this problem by leaving out the costly multiplications (floating point multiplications are usually much faster than integer ones) that appear in a Z/Zp-based scheme. It was the hope of the commissioner of this project that the new platforms with strong integer multiply support (Intel Itanium and AMD Opteron, for example) and 64-bit precision would give enough power to the Z/Zp-based scheme to make it a winner. Since it was deemed unlikely that a 32-bit implementation could be successful, the project concentrated on the 64-bit machines, with a view to fulfilling the demands of the planned release dubbed GMP 5.

The rest of this thesis proceeds by introducing the Fast Fourier Transform and its application to integer multiplication in Chapter 2. This is followed by a brief glimpse of how modern computer memory works. It will be apparent that the Fast Fourier Transform is particularly ill suited for modern computers when it comes to memory access patterns. A further challenge of the project is therefore to formulate the algorithm in a manner that better suits the modern computer. In Chapter 3 the theory behind the algorithm to be implemented in the project is presented. Chapter 4, then, gives an account of some of the more interesting issues that arose in the implementation phase of the project. Lastly, Chapter 5 presents the results of the project and discusses some future work to be done.

Chapter 2

Preliminaries

In this chapter the theoretical foundation of the FFT and its application to integer multiplication is presented. A short introduction to modern computer memory architecture is also given.

2.1 The FFT and Integer Multiplication

The Fourier transform of a continuous function a(t) is given by

A(f) = \int_{-\infty}^{\infty} a(t) e^{2\pi i f t} \, dt.

The discrete analogue to this continuous transform is the discrete Fourier transform (DFT), which applies to samples a_0, a_1, ..., a_{N-1} of a(t):

A_j = \sum_{k=0}^{N-1} a_k e^{2\pi i jk/N},  j = 0, ..., N − 1.

It was the problem of computing sums of this kind that J. W. Cooley and J. W. Tukey addressed in [8], the paper in which they essentially discovered the FFT. Actually the history of the FFT is long and convoluted, dating back to Gauss in 1805. See [7] and [18] for further historical remarks.

It is not hard to make a list of areas where the speed of the FFT comes in handy. Such a list would most certainly contain:

• RADAR Processing
• Digital Pulse Compression
• Digital Filtering
• Spectrum Analysis
• Optics
• Speech Analysis
• Crystallography
• Computational Fluid Dynamics
• Microlithography
• Image Analysis
• Convolution/multiplication

A longer list may be found in [21].
Since the FFT has found many areas of application, the amount of research regarding the algorithm is large; [5] lists 4577 documents relating to the FFT. The reader who wishes to delve deeper into the realm of the FFT is referred to [4], an excellent source of pointers to more documents on the subject. For a more thorough introduction to the basic algorithm than the one given here, see [6], [12] or [9].

To make the application of the FFT to integer multiplication obvious, the algorithm is derived as one solving the problem of polynomial multiplication. All that then has to be done is to utilise the observation that an integer U written in some base B (2^{32} or 2^{64} in our applications),

U = u_0 + u_1 B + u_2 B^2 + ... + u_{n-1} B^{n-1},

can be seen as a polynomial if B is substituted with an unbound variable. This fascinating duality is a subject in its own right; [24] contains more information relating to that.

Now, a polynomial in coefficient form is described as p(x) = \sum_{i=0}^{n-1} u_i x^i. The degree of p is the largest index of a nonzero coefficient. If q(x) = \sum_{i=0}^{n-1} v_i x^i is another polynomial, the sum of the polynomials is given as p(x) + q(x) = \sum_{i=0}^{n-1} (u_i + v_i) x^i. The coefficient form also allows for efficient evaluation of p(x) by Horner's rule: p(x) = u_0 + x(u_1 + x(u_2 + ... + x(u_{n-2} + x u_{n-1}) ...)). The time complexity for addition and evaluation of polynomials in coefficient representation is apparently O(n). The product of p and q, however, is a bit more tricky: p(x)q(x) = u_0 v_0 + (u_0 v_1 + u_1 v_0)x + (u_0 v_2 + u_1 v_1 + u_2 v_0)x^2 + ... + u_{n-1} v_{n-1} x^{2n-2}. That is, p(x)q(x) = \sum_{i=0}^{2n-2} r_i x^i, where r_i = \sum_{j=0}^{i} u_j v_{i-j}, for i = 0, 1, ..., 2n − 2. The sequence r_i is called the convolution of the sequences u_i and v_i. For symmetry reasons, the r_i are regarded as a sequence of length 2n, with r_{2n-1} = 0. Viewing the u_i and v_i as vectors u = [u_0, u_1, ..., u_{n-1}] and v = [v_0, v_1, ..., v_{n-1}], the convolution of the two vectors is denoted u ∗ v. Applying the definition of convolution directly, it takes Θ(n^2) time to multiply p and q.

Another way of representing a polynomial p of degree n − 1 is by its value on n distinct inputs, i.e. point-value pairs {(x_0, y_0), (x_1, y_1), ..., (x_{n-1}, y_{n-1})} such that all x_i are distinct and y_i = p(x_i) for i = 0, 1, ..., n − 1. This representation gives an alternative method for multiplying p and q. Namely, evaluate p and q at 2n different inputs and compute the representation of the product as {(x_0, p(x_0)q(x_0)), (x_1, p(x_1)q(x_1)), ..., (x_{2n-1}, p(x_{2n-1})q(x_{2n-1}))}. This computation clearly takes just O(n) time, given the 2n point-value pairs. To effectively use this approach the point-value pairs have to be found quickly. Applying Horner's rule on 2n different inputs would yield Θ(n^2) time complexity. Since the choice of the inputs is arbitrary, the question arises whether it is possible to pick "easy" ones; for instance, p(0) = u_0 comes to mind. This is where the FFT will help.

The inverse of evaluation is, of course, interpolation. The following theorem shows that this is well defined.

Theorem (Uniqueness of an interpolating polynomial). For any set {(x_0, y_0), (x_1, y_1), ..., (x_{n-1}, y_{n-1})} of n point-value pairs such that all the x_i are distinct, there is a unique polynomial p(x), of degree n − 1 or less, such that y_i = p(x_i) for i = 0, 1, ..., n − 1.

Proof. y_i = p(x_i) is equivalent to the matrix equation

\begin{pmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^{n-1} \\
1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^{n-1}
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_{n-1} \end{pmatrix}
=
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \end{pmatrix}.
The left matrix is denoted V(x_0, x_1, ..., x_{n-1}) and is known as a Vandermonde matrix. V has determinant \prod_{0 ≤ j < k ≤ n-1} (x_k − x_j) and is therefore nonsingular if the x_i are distinct. Thus, the u_i can be solved for uniquely.

One algorithm for n-point interpolation is based on Lagrange's formula:

p(x) = \sum_{i=0}^{n-1} y_i \frac{\prod_{j \ne i} (x − x_j)}{\prod_{j \ne i} (x_i − x_j)}.

The coefficients of p can now be computed in time Θ(n^2). More on interpolation in a numerical context can be found in [17].

But how to find a set of inputs that will make evaluation and interpolation go quicker than applying Horner and Lagrange? One answer is roots of unity, and this will lead to the FFT.

2.1.1 Primitive Roots of Unity

Now some concepts from algebra will come in handy. For a more solid background in group and field theory and related items, see [3]. A most important idea for the FFT is that of roots of unity:

Definition 2.1 (Primitive nth root of unity). A number ω is a primitive nth root of unity, for n ≥ 2, if it satisfies:
1. ω^n − 1 = 0
2. The numbers 1, ω, ω^2, ..., ω^{n-1} are distinct.

Note that Definition 2.1 implies that a primitive nth root of unity has a multiplicative inverse, ω^{-1} = ω^{n-1}, since ω^{-1} ω = ω^{n-1} ω = ω^n = 1. In other words: negative exponents of ω are well defined.

For nth roots of unity to work as "easy" inputs to the polynomials to be evaluated, a few properties are needed, including the following.

Property 2.2 (Cancellation property). If ω is a primitive nth root of unity, then \sum_{i=0}^{n-1} ω^{ki} = 0 for any integer k ≠ 0 with −n < k < n.

Proof. Since k ≠ 0 and −n < k < n, we have ω^k ≠ 1, and therefore

\sum_{i=0}^{n-1} ω^{ki} = \frac{(ω^k)^n − 1}{ω^k − 1} = \frac{(ω^n)^k − 1}{ω^k − 1} = \frac{1^k − 1}{ω^k − 1} = \frac{1 − 1}{ω^k − 1} = 0.

Property 2.3 (Reduction property). If ω is a primitive (2n)th root of unity, then ω^2 is a primitive nth root of unity.

Proof. Since 1, ω, ω^2, ..., ω^{2n-1} are distinct, 1, ω^2, (ω^2)^2, ..., (ω^2)^{n-1} are also distinct.

Property 2.4 (Reflective property). If ω is a primitive nth root of unity and n is even, then ω^{n/2} = −1.

Proof. If k = n/2, then by Property 2.2,

0 = \sum_{i=0}^{n-1} (ω^{n/2})^i = ω^0 + ω^{n/2} + ω^n + ... + ω^{(n/2)(n-2)} + ω^{(n/2)(n-1)} = ω^0 + ω^{n/2} + ω^0 + ... + ω^0 + ω^{n/2} = (n/2)(1 + ω^{n/2}).

Thus, 0 = 1 + ω^{n/2}.

With the reflective property at hand, the fact that ω^{k+n/2} = −ω^k, if ω is a primitive nth root of unity and n ≥ 2 is even, is readily seen.

2.1.2 The Discrete Fourier Transform

As the reader undoubtedly has surmised, the DFT is precisely the technique of evaluating a polynomial p(x) = \sum_{i=0}^{n-1} u_i x^i at the nth roots of unity ω^0, ω^1, ..., ω^{n-1}. As this only yields n point-value pairs, p is "padded" with zeros by setting u_i = 0 for n ≤ i ≤ 2n − 1. p can then be viewed as a polynomial of degree 2n − 1, and the primitive (2n)th roots of unity can be used as inputs. In the following, p is assumed to have been padded with as many zeros as necessary.

Formally, the DFT for p, as represented by the coefficient vector u, is defined as the vector y of values y_j = p(ω^j), that is,

y_j = \sum_{i=0}^{n-1} u_i ω^{ij},

where ω is a primitive nth root of unity. Alternatively, if u and y are thought of as column vectors, y = Fu, where F is an n × n matrix such that F[i, j] = ω^{ij}.

The DFT would not be much of a transform if it did not have an inverse. Fortunately F has an inverse, F^{-1}, so that F^{-1}(F(u)) = u for all u. F^{-1} has a simple form, F^{-1}[i, j] = ω^{-ij}/n. Thus, given a vector y of the values of p at the nth roots of unity, the coefficient u_i can be recovered as u_i = \sum_{j=0}^{n-1} y_j ω^{-ij}/n. The following lemma justifies this.
Lemma 2.5. For any vector u, F^{-1} · Fu = u.

Proof.

(F^{-1} · F)[i, j] = \frac{1}{n} \sum_{k=0}^{n-1} ω^{-ik} ω^{kj}.

If i = j this reduces to (F^{-1} · F)[i, i] = \frac{1}{n} \sum_{k=0}^{n-1} ω^0 = \frac{1}{n} · n = 1. Consider the case i ≠ j and let m = j − i. Then (F^{-1} · F)[i, j] = \frac{1}{n} \sum_{k=0}^{n-1} ω^{mk}, where −n < m < n, m ≠ 0. By Property 2.2, the right-hand side reduces to zero. Hence, (F^{-1} · F)[i, j] = 0 for i ≠ j.

The scheme to multiply two polynomials p and q now becomes (also illustrated in Figure 2.1):

1. Pad the coefficient vectors of p and q, u and v, with n zeros and view them as column vectors u' = [u_0, u_1, ..., u_{n-1}, 0, 0, ..., 0]^T and v' = [v_0, v_1, ..., v_{n-1}, 0, 0, ..., 0]^T.
2. Compute the DFTs y = Fu' and z = Fv'.
3. Multiply y and z component-wise, to get (y · z)[i] = y_i · z_i for i = 0, 1, ..., 2n − 1.
4. Compute the inverse DFT of the product, that is, r = F^{-1}(Fu' · Fv').

Figure 2.1. Using the computation of r = u ∗ v for polynomial multiplication: the coefficient vectors u_0, ..., u_{n-1} and v_0, ..., v_{n-1} are padded with n zeros, transformed by the DFT into y_0, ..., y_{2n-1} and z_0, ..., z_{2n-1}, multiplied pointwise, and transformed back by the inverse DFT into r_0, ..., r_{2n-1}.

The success of the above approach hinges on the following theorem.

Theorem 2.6 (The convolution theorem). Given two length n vectors u and v, padded with zeros to length 2n vectors u' and v', u ∗ v = F^{-1}(Fu' · Fv').

Proof. By Lemma 2.5 it is enough to show that F(u ∗ v) = Fu' · Fv'. Since the upper halves of u' and v' are zero,

(Fu' · Fv')[i] = \left( \sum_{j=0}^{n-1} u_j ω^{ij} \right) \left( \sum_{k=0}^{n-1} v_k ω^{ik} \right) = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} u_j v_k ω^{i(j+k)}, for i = 0, 1, ..., 2n − 1.

Now consider F(u ∗ v). By the definition of convolution and the discrete Fourier transform,

F(u ∗ v)[i] = \sum_{l=0}^{2n-1} \sum_{j=0}^{2n-1} u_j v_{l-j} ω^{il}.

Change the order of the summations and substitute k for l − j to get

F(u ∗ v)[i] = \sum_{j=0}^{2n-1} \sum_{k=-j}^{2n-1-j} u_j v_k ω^{i(j+k)}.

v_k is undefined for k < 0, so the second sum can be started at k = 0. Furthermore, u_j = 0 for j > n − 1, and so the upper limit in the first sum can be lowered. Once that substitution is made, the upper limit on the second summation is always at least n. Thus, since v_k = 0 for k > n − 1, the upper limit in the second sum may also be lowered to n − 1. Consequently,

F(u ∗ v)[i] = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} u_j v_k ω^{i(j+k)}.

The method for multiplying two polynomials now involves computing two DFTs, doing a linear time pointwise multiplication and computing an inverse DFT. For this to be any faster than the obvious Θ(n^2) time algorithm, the crux is to find a fast algorithm for computing the DFTs.
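As a tiny illustration of the scheme, take p(x) = 1 + 2x and q(x) = 3 + 4x, so u = [1, 2] and v = [3, 4]. Padding to length 4 and convolving gives r = u ∗ v = [1·3, 1·4 + 2·3, 2·4, 0] = [3, 10, 8, 0], which is exactly the coefficient vector of p(x)q(x) = 3 + 10x + 8x^2, with the extra zero in position 2n − 1.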
2.1.3 The Fast Fourier Transform Algorithm

The Fast Fourier Transform algorithm computes the DFT of a length n vector in O(n log n) time. The algorithm applies the divide-and-conquer approach to polynomial evaluation by observing that if n is even, a polynomial of degree n − 1, p(x) = u_0 + u_1 x + u_2 x^2 + ... + u_{n-1} x^{n-1}, can be divided into two polynomials of degree n/2 − 1,

p_e(x) = u_0 + u_2 x + u_4 x^2 + ... + u_{n-2} x^{n/2-1}  and  p_o(x) = u_1 + u_3 x + u_5 x^2 + ... + u_{n-1} x^{n/2-1},

such that p(x) = p_e(x^2) + x p_o(x^2). Now, the DFT evaluates p at the nth roots of unity and, by Property 2.3, the values (ω^2)^0, (ω^2)^1, (ω^2)^2, ... are (n/2)th roots of unity. Hence, p_e and p_o can be evaluated at these values, and the same values can be used to evaluate p. This is the key observation used in Algorithm 4.

Algorithm 4: Recursive FFT algorithm.
Input: u, ω, n
Output: The DFT of u.
FFT(u, ω, n)
(1)  if n = 1
(2)    return y ← u
(3)  r ← ω^0
(4)  u_e ← [u_0, u_2, u_4, ..., u_{n-2}]
(5)  u_o ← [u_1, u_3, u_5, ..., u_{n-1}]
(6)  y_e ← FFT(u_e, ω^2, n/2)
(7)  y_o ← FFT(u_o, ω^2, n/2)
(8)  for i ← 0 to n/2 − 1
(9)    y_i ← y_{e,i} + r · y_{o,i}
(10)   y_{i+n/2} ← y_{e,i} − r · y_{o,i}
(11)   r ← r · ω
(12) return y

Why does the pseudo-code in Algorithm 4 work correctly? Lines 1–2 constitute the basis of the recursion, correctly returning a vector with the one entry u_0. Lines 3 and 11 keep r updated properly, so that when lines 9–10 are executed, r = ω^i. Lines 6–7 perform the recursive DFT computations, setting y_{e,i} = p_e(ω^{2i}) and y_{o,i} = p_o(ω^{2i}). Lines 9–10 combine the results of the recursive calculations to get y_i = y_{e,i} + ω^i y_{o,i} = p_e(ω^{2i}) + ω^i p_o(ω^{2i}) = p(ω^i) for i = 0, 1, ..., n/2 − 1. For y_{n/2}, y_{n/2+1}, ..., y_{n-1},

y_{i+n/2} = y_{e,i} − ω^i y_{o,i} = y_{e,i} + ω^{i+n/2} y_{o,i} = p_e(ω^{2i}) + ω^{i+n/2} p_o(ω^{2i}) = p_e(ω^{2i+n}) + ω^{i+n/2} p_o(ω^{2i+n}) = p(ω^{i+n/2}).

Thus, the vector y returned by Algorithm 4 is the DFT of u.

Within the for loop each value y_{o,i} is multiplied by ω^i. The product is both added to and subtracted from y_{e,i}. The factors ω^i are known as twiddle factors, and the operation of multiplying y_{o,i} by the twiddle factor and adding and subtracting the product from y_{e,i} is known as a butterfly operation; it is usually drawn as in Figure 2.2.

Figure 2.2. A schematic drawing of a butterfly operation: from the inputs y_{e,i} and y_{o,i} the outputs y_{e,i} + ω^i y_{o,i} and y_{e,i} − ω^i y_{o,i} are formed.

What about running time? Exclusive of the recursive calls, each invocation takes time Θ(n). The recurrence for the running time is therefore T(n) = 2T(n/2) + Θ(n) = Θ(n log n). By Lemma 2.5, Algorithm 4 can be used for computing the inverse DFT as well: switch the roles of u and y, replace ω by ω^{-1} and divide each element of the result by n. Consequently, the inverse DFT can also be computed in time Θ(n log n).

2.1.4 Avoiding Recursion

The main obstacle in the way of avoiding recursion in the FFT algorithm is the splitting of the input array into odd and even halves. What is the net effect of these permutations? The recursion call tree for input size n = 8 is shown in Figure 2.3. In the first recursive call, all elements whose indices have 0 as their least significant bit are brought to the even half. Likewise, all indices with 1 as their least significant bit are brought to the odd half. Thinking of an in-place implementation of the algorithm, an element starting at an index with b as its least significant bit ends up at an index with b as its most significant bit. In the next level of recursion this "inverse shuffle" is repeated on each half of the input array, now addressing the second least significant bit of the indices. Thus, if an element starts out at index [b_{l-1} ... b_2 b_1 b_0], it ends up at index [b_0 b_1 b_2 ... b_{l-1}], where l = log n. This permutation is known as the bit reversal of the input array.

Figure 2.3. The input vectors to the calls of the recursive FFT algorithm, for an initial invocation with n = 8: [u_0, u_1, ..., u_7] is split into [u_0, u_2, u_4, u_6] and [u_1, u_3, u_5, u_7], which are split into [u_0, u_4], [u_2, u_6], [u_1, u_5], [u_3, u_7], and finally into the singletons [u_0], [u_4], [u_2], [u_6], [u_1], [u_5], [u_3], [u_7].
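The permutation itself is cheap to compute. As an illustration, a minimal C sketch (not the project code) that reorders an array of length n = 2^l in place could look as follows.

#include <stddef.h>
#include <stdint.h>

/* In-place bit-reversal permutation of y[0..n-1], n a power of two.
   rev is maintained as the bit-reversed counterpart of i, so each
   pair (i, rev) is swapped exactly once. */
void bit_reverse(uint64_t *y, size_t n)
{
    size_t rev = 0;
    for (size_t i = 1; i < n; i++) {
        size_t bit = n >> 1;
        while (rev & bit) {      /* bit-reversed increment of rev */
            rev ^= bit;
            bit >>= 1;
        }
        rev |= bit;
        if (i < rev) {           /* swap each pair only once */
            uint64_t t = y[i];
            y[i] = y[rev];
            y[rev] = t;
        }
    }
}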
Thus, if the input array could be arranged in bit reversed order, the behaviour of the recursive FFT could be mimicked. First, take the elements in pairs, compute the DFT of each pair and replace the pair with its DFT. Next, take the n/2 DFTs in pairs and compute the DFT of the four vector elements they come from by two butterfly operations, replacing two 2-element DFTs with one 4-element DFT. Continue in this manner until the vector holds two (n/2)-element DFTs, which are combined into the final n-element DFT using n/2 butterfly operations. This computational flow is illustrated in Figure 2.4.

Figure 2.4. Computational flow of the iterative FFT algorithm for n = 8 inputs (the inputs u_0, ..., u_7 enter in bit reversed order and the outputs y_0, ..., y_7 leave in natural order). Also known as an FFT circuit.

Now it should be easy to translate the essence of Figure 2.4 to the pseudo-code of Algorithm 5. An array holds the elements of the input vector, but in bit reversed order. Since the combining has to be done on each level of the tree, a variable is introduced to count the levels.

Algorithm 5: Iterative FFT algorithm.
Input: u, ω, n
Output: The DFT of u.
FFT(u, ω, n)
(1)  BRC(u, y)
(2)  for lev ← 1 to log n
(3)    m ← 2^{lev} and r_m ← ω^{n/m}
(4)    for k ← 0 to n − 1 by m
(5)      r ← 1
(6)      for j ← 0 to m/2 − 1
(7)        t ← r · y[k + j + m/2]
(8)        u ← y[k + j]
(9)        y[k + j] ← u + t
(10)       y[k + j + m/2] ← u − t
(11)       r ← r · r_m
(12) return y

The call to BRC, which produces the bit reversed copy, runs in O(n log n) time since an integer with log n bits certainly can be reversed in time O(log n), and this is done n times. The loop over k (lines 4–11) iterates n/m = n/2^{lev} times for each level, and the innermost for loop iterates 2^{lev-1} times. If L(n) is the number of times lines 7–11 are executed, then

L(n) = \sum_{lev=1}^{\log n} \frac{n}{2^{lev}} \cdot 2^{lev-1} = \sum_{lev=1}^{\log n} \frac{n}{2} = Θ(n log n).

Thus Algorithm 5 has running time Θ(n log n).

2.2 Computer Memory Hierarchy

For a more complete overview than given here, see [33]. A nice exposition of the subject is available in [19].

During the last 20 years CPUs have been faster than memories. As it has become possible to put more and more circuits on a chip, CPU designers have used this to make CPUs go even faster, by introducing pipelining and superscalarity. Memory designers, usually, have increased the capacity of their chips, not the speed. This has resulted in an imbalance where a CPU may be able to perform hundreds of primitive operations in the time it takes for a word from memory to arrive at the CPU after it has been requested. If the CPU operation is dependent on the memory word en route, there is no choice but to stall, be it in hardware or software. Actually, it is possible to build memories that are as fast as CPUs, but to run at full speed they have to be located on the CPU chip, since the memory bus is very slow. So it is a trade-off between chip size and cost. To give the illusion of having a large amount of fast memory, the concept of a cache is introduced. The basic idea is simple: the most heavily used memory words are kept in the cache. When the CPU needs a word, it first looks in the cache. If the word is not there, it goes to the main memory.
Then there is a unified L2 cache, in the same capsule as the CPU, of size 512 KB to a few MB. A thirdlevel cache, consisting of a few megabytes of SDRAM, may be situated on the processor board. The caches are generally inclusive, so that the contents of the L1 cache is present in the L2 cache and so on. Caches depend heavily on locality to work good. Temporal locality occurs when recently used memory words are accessed again. This may be the case, for example, inside a loop or when addressing memory close to the top of the stack. Spatial locality is the phenomenon that memory locations close in numerical address are likely to be accessed in rapid sequence. Think of a matrix of values being manipulated, for example. All caches use the model of dividing main memory into fixed-size blocks called cache lines. A cache line may consist of 4 to 64 consecutive bytes. When memory is referenced, the cache controller checks to see if the word referenced is in the cache. If the word is not there, some line entry is removed and the line needed is fetched from memory. The simplest cache is a direct-mapped cache. Each entry in the cache can hold exactly one cache line from main memory and a given memory word can be stored in exactly one place within the cache. This mapping scheme puts consecutive memory lines in consecutive cache entries, up to the cache size. However, two lines that differ in their address by precisely the size of the cache, or a multiple thereof, can not be stored in the cache at the same time. For example, if 16 a program accesses data at location X and next needs data at location X + sizeof(cache), the second instruction will force the cache entry to be reloaded. If this happens often enough, it results in very poor performance. In fact, the worst-case behaviour of a cache is worse than having no cache at all, since each memory operation involves reading an entire cache line instead of just one word. A solution to the above problem is to allow two or more lines in each cache entry. An n-way set-associative cache can hold n possible entries for each address. Of course, this kind of cache is more complicated than a direct-mapped cache. But, experience shows that two-way and four-way caches perform well enough to make the introduction of extra circuitry worthwhile. When a new entry arrives to a set-associative cache, the question is, which present item is to be discarded? For most purposes the LRU (Least Recently Used) strategy is a pretty good one. When it becomes time to replace an entry, the least recently accessed is discarded. 17 Chapter 3 FFT Based Integer Multiplication in a Finite Field In most FFT based integer multiplication implementations the nth root of unity is chosen as the complex number e2πi/n . The roots of unity are then approximated by floating point complex numbers. Since it has been very difficult to guarantee an exact result (to some number of bits, see [29] for some tighter bounds though) this is not an option for GMP, where exactness is an absolute demand. Instead, think of implementing FFTs in Z/Zm, the ring of integers modulo m ∈ Z. What properties of the ring are needed? Obviously addition, subtraction and multiplication are no problem. Division is only needed for dividing by the transform length in the inverse FFT, but this still rules out some choices of m. Furthermore nth roots of unity are needed. For choices of m that meet these demands the FFTs are called Number Theoretic Transforms (NTTs). In this project only prime moduli will be considered. 
Since all numbers involved in an NTT are integers, no round-off errors can occur. This makes very long transforms possible. Also, if the result overflows (calculations are done modulo p) the calculation can be repeated modulo several different primes and combined using the Chinese Remainder Theorem. The greatest risk factor is that integer arithmetic is slower than floating point on most machines. One goal of the project is therefore to see if the new 64 bit architectures with strong integer multiply support will be able to pick up the slack. One other thing to note is that when using NTTs, all physical meaning of the DFT is lost. The transform is no longer one between the time and frequency domains, but simply a game of numbers. 3.1 Feasibility of mod p FFTs The idea to perform the FFT in a finite field is due to J. M. Pollard. This idea and its application to integer multiplication, amongst many other things, are presented in [30]. This section will deal with showing that the necessary conditions for applying the FFT over the finite field of integers modulo a prime are met. 18 The goal is to investigate primes near the project target machines word size, i.e. 64 bits, with the intention of using the Chinese remainder theorem and as many primes as needed. For this to become reality, fields Z p having primitive n = 2m th roots of unity (m ≥ 50, to meet the demands of GMP 5) needs to exist, possibly there has to be plenty of them. When given a field Z p with a primitive nth root of unity, it is of interest to be able to find the root efficiently, since Z p may have many elements. Theorem 3.1. Z p has a primitive nth root of unity if and only if n|(p − 1). Proof ⇐ Let n|(p − 1). Now Z p is a finite field and so Z∗p is cyclic. (A proof of this can be found in [3], theorem 6.5.10.) Thus, Z∗p has a generator, say g. By Fermat’s little theorem gn = g(p−1)/n has multiplicative order n in Z∗p . Hence gn is a primitive nth root of unity. ⇒ The order of a group element divides the order of the group. (This follows from Lagrange’s theorem, proved in [3], theorem 3.2.10.) Z∗p has order p − 1. A mod p FFT of length n = 2m apparently requires primes such that n|(p − 1), i.e., of the form p = 2d k + 1, d ≥ m. Any such prime could be used for computing an FFT of length 2m for m ≤ d. How many primes on this form are there? It turns out that there are plenty of primes of the correct form, mainly due to: Theorem Generalised prime number theorem. Let a, b ∈ Z satisfy gcd(a, b) = 1. Then the number of primes ≤ n in the progression ak + b (k = 1, 2, . . .) is approximately (n/ ln n)/φ(a), where φ is Euler’s totient function. Proof See [10]. Lemma 3.2. If m = pα is a power of a prime then φ(m) = pα (1 − 1p ). Proof The totient function, φ(n) counts the number of positive integers ≤ n that are relatively prime to n. If m = pα is a power of a prime then the numbers having a common factor with m are multiples of p. There are pα−1 of these, so φ(m) = φ(pα ) = pα − pα−1 = pα−1 (p − 1) = pα 1 − 1 . p As a consequence of the Generalised prime number theorem and Lemma 3.2 the number of primes on the form p = 2 f k + 1 ≤ n, k odd, is approximately (n/ ln n)/2 f −1 . If n = 264 and f = 50, the conclusion is that there are around seven hundred primes p = 2d k + 1, k odd, with d ≥ f = 50. In what proportion does generators of Z∗p occur? It is known, from group theory, that the number of generators is φ(p − 1). Number theory says that the average value of φ(n) over all integers n is 6n/π2 (see [16]). 
In probabilistic terms, an element drawn at random from Z_p is expected to be a generator with probability greater than 3/π^2 ≈ 0.3. The following theorem gives an efficient method for testing whether an element of a cyclic group generates the group or not.

Theorem 3.3. a ∈ Z*_p is a generator ⇔ a^{(p−1)/q} ≠ 1 in Z*_p for every prime factor q of p − 1.

Proof. ⇒ Indeed. ⇐ (by contraposition) Assume a has order k < p − 1. k | (p − 1) by Lagrange's theorem, so p − 1 = mk. Let m = qr, where q is a prime factor of m. Now p − 1 = qrk, so q is also a factor in the prime factorisation of p − 1, and a^{(p−1)/q} = a^{rk} = (a^k)^r = 1.

With Theorem 3.3 in hand, the method of testing the elements 2, 3, 4, ... of Z_p until a generator is found seems as good as any. This is also what is suggested in algorithm 4.80 in [26]. The conclusion is that FFTs over Z_p are possible: there exist plenty of primes having the desired characteristics, and finding these primes along with generators of their multiplicative groups should not pose any problems.

3.2 "Three primes" Algorithm for Integer Multiplication

The aim is now to derive an integer multiplication algorithm based on FFTs over Z_p. Consider the integers of Chapter 2, U = (u_{n-1} ... u_0)_B and V = (v_{n-1} ... v_0)_B, and their corresponding polynomials p(x) = \sum_{i=0}^{n-1} u_i x^i and q(x) = \sum_{i=0}^{n-1} v_i x^i. The first proposition is to choose B < W (W being the machine word size), with a view to computing the polynomial product r^{(k)}(x) = p(x)q(x) mod p_k for an adequate number K of primes p_k, B ≤ p_k ≤ W. r(x) may then be recovered thanks to the Chinese remainder theorem. The algorithm was dubbed the "three primes" algorithm in [25].

Algorithm 6: "Three primes" algorithm for integer multiplication.
Input: U = (u_{n-1} ... u_0)_B, V = (v_{n-1} ... v_0)_B
Output: The product of U and V.
TP(U, V)
(1) for k ← 0 to K − 1
(2)   Compute r^{(k)}(x) = p(x)q(x) over Z_{p_k}[x] using the scheme of Figure 2.1.
(3) Solve the (polynomial) linear integer congruences r(x) ≡ r^{(k)}(x) (mod p_k).
(4) r(x) ← the least positive solution to the congruences.
(5) return R ← r(B)

What can be said about K? Each coefficient r_m = \sum_{i+j=m} u_i v_j of r(x) is < nB^2. Thus, the first step of Algorithm 6 calculates r(x) correctly if p_0 p_1 ··· p_{K-1} ≥ nB^2. Now, each p_k ≥ B, so the critical condition is satisfied if B^K ≥ nB^2, i.e., if K ≥ log_B(n) + 2. In the range where n ≤ B, three primes are sufficient; hence the name "three primes" algorithm. Naturally, the constraint that there has to be a suitable root of unity for each p_k also restricts the size of n. (n ≤ 2^{D-1} has to be satisfied, if D is the largest integer for which three primes of the form p = 2^d l + 1, with d ≥ D, exist.)

Step 1 of Algorithm 6 takes O(n log n) time. The Chinese remaindering step involves 3·(2n − 1) integer congruences, yielding O(n) time. It should be obvious that the evaluation-at-radix step requires O(n log n) time. If the computer has a fixed word size W, making mod p operations possible in O(1) time, Algorithm 6 multiplies two n-digit base B integers, B < W, in time O(n log n) (observing the tacit constraints on n).

The range of n where the three primes algorithm is O(n log n) cannot be extended indefinitely while retaining the same time complexity. To extend the range, W would have to be increased, and the assumption that mod p operations take O(1) time would soon become indefensible. Since the project target machines have a 64 bit word size, the limits now imposed on the problem size are of no practical importance.
Indeed multiplication of operands larger than the available RAM on the typical workstation is possible. Of course, nothing says that the polynomial coefficients have to be taken as word sized pieces of the input operands. As long as p0 p1 · · · pK−1 ≥ nB2 , and suitable roots of unity exists, any number of consecutive bits can be used as polynomial coefficients. There is an interesting trade-off between fewer primes (polynomials of higher degree, lower hidden constant in complexity terms) and more primes (polynomials of lower degree, bigger hidden constant). In Figure 3.1 the maximum possible B for different number of primes and input sizes is plotted. 160 5 primes 4 primes 3 primes 2 primes 1 prime 140 chunk size (bits) 120 100 80 60 40 20 0 20 25 210 215 220 operand size (64 bit words) 225 230 Figure 3.1. The maximum number of bits per machine word for different operand sizes. Assumes high bit of primes is set. 21 Chapter 4 Implementation Issues The project code was written in C and based on primitives from the GMP framework. To get a feel for the FFT algorithm and the way it manipulates its operands, straightforward implementations of Algorithms 4 and 5 were made. As could be expected, the iterative formulation was found to be faster than the recursive one, when the operands were small enough to fit in the L1 cache. As soon as the operands were bigger than that, the recursive variant performed better. This is because it just takes a few levels of recursion to reach a working set that resides in L1 cache. Could there be a way to combine the speed of the iterative inner-loop with the nice locality properties of the recursive approach? This is described in Section 4.1. The true work horse of this multiplication scheme is the inner-loop of the FFT. In this setting, working modulo p, a word-sized prime, the main obstacle is the modular reduction of the product that results from multiplying with the twiddle factors (line 8 of Algorithm 5. If nothing special is done, this involves trial division, an operation that would make the inner-loop far too slow to be competitative at this level. Section 4.2 describes the project answer to this problem. In Section 4.3 a scheme to make the performance of the code behave less like a set of stairs with increasing operand size is presented. 4.1 Cache Locality FFT algorithms in general have very poor memory locality. The data is accessed in strides with large skips. The project code FFT only accepts power-of-two input lengths, also known as a radix-2 FFT. In this kind of algorithm the skips are, naturally, powers of two, making them particularly bad on direct-mapped cache systems. It could be the case that we get 100 % cache misses! One way to tackle this problem could be to try to rearrange the usual operand order in a more cache-friendly manner. A possibility would be to treat the computational flow of the FFT, see Figure 2.4, as a call tree and traverse it as locally as possible. This has been tried, successfully, with the kind of FFT that GMP is using. For the project, this scheme was implemented and discarded, due to it being less fit for an FFT over Zp , mainly due 22 to the extra computation of powers of ω needed with this method. (With the Fermat style FFT, the extra computations reduces to additions and subtractions, while a mod p FFT gets an extra multiplication and modular reduction inside the inner-loop.) 
In [1], Bailey uses a method for improving memory locality of FFTs, originating in ideas presented in [11] concerning the FFT and hierarchical memory systems. Assume that the transform length n = n_1 n_2 and view the data as stored in an n_1 × n_2 matrix, A_{jk}. The "Four-step" FFT algorithm proceeds as follows:

1. Do n_2 transforms of length n_1.
2. Multiply each matrix element, A_{jk}, by ω^{±jk}, with the sign following the sign of the transform.
3. Transpose the matrix.
4. Do n_1 transforms of length n_2.

If n_1 and n_2 are chosen to be as close to √n as possible, this algorithm will have much better memory performance due to the shorter transform lengths. For the above formulation to be optimal, the memory would have to be arranged as in Fortran, with the columns in linear memory. Assume, instead, that memory is ordered as in C, i.e., the element A_{jk} is stored at address j n_2 + k. First, let us verify that the algorithm really does produce the DFT of the data, A(k) = \sum_{j=0}^{n-1} ω^{jk} a(j).

A transform of length n_1 is performed over column k_2. The correct root of unity is ω^{n_2}, so

A_1(k_1 n_2 + k_2) = \sum_{j_1=0}^{n_1-1} ω^{j_1 k_1 n_2} a(j_1 n_2 + k_2)

is the element in row k_1 and column k_2. Then, the multiplication step:

A_2(k_1 n_2 + k_2) = ω^{k_1 k_2} A_1(k_1 n_2 + k_2).

After transposition:

A_3(k_2 n_1 + k_1) = A_2(k_1 n_2 + k_2).

Now a transform of length n_2 over column k_1 is performed. The root of unity used is ω^{n_1}, so

A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} ω^{j_2 k_2 n_1} A_3(j_2 n_1 + k_1).

Substitute A_2 for A_3 (with j_2 taking the place of k_2 in the transposition relation) to get

A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} ω^{j_2 k_2 n_1} A_2(k_1 n_2 + j_2).

Further, substituting A_1 for A_2 yields

A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} ω^{j_2 k_2 n_1 + j_2 k_1} A_1(k_1 n_2 + j_2),

which, by the definition of A_1, leaves us with

A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} ω^{j_2 k_2 n_1 + j_2 k_1} \sum_{j_1=0}^{n_1-1} ω^{j_1 k_1 n_2} a(j_1 n_2 + j_2).

By changing the order of summation:

A_4(k_2 n_1 + k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} ω^{j_1 k_1 n_2 + j_2 k_2 n_1 + j_2 k_1} a(j_1 n_2 + j_2).

Notice that (k_2 n_1 + k_1)(j_1 n_2 + j_2) = j_1 k_2 n_1 n_2 + j_1 k_1 n_2 + j_2 k_2 n_1 + j_2 k_1. Now, ω^{n_1 n_2} = ω^n = 1 ⇒ ω^{j_1 k_2 n_1 n_2} = 1, and finally

A_4(k_2 n_1 + k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} ω^{(k_2 n_1 + k_1)(j_1 n_2 + j_2)} a(j_1 n_2 + j_2).

Identify k with k_2 n_1 + k_1 and note that letting j go from 0 to n − 1 is the same as letting j = j_1 n_2 + j_2 with j_1 going from 0 to n_1 − 1 and, for each j_1, letting j_2 go from 0 to n_2 − 1. It is now clear that A_4(k_2 n_1 + k_1) is precisely A(k), the DFT of a(j), with the output index partitioned into two parts.

Since the project code is written in C, the original scheme was modified slightly. The column FFTs of the first step are performed by gathering the elements into a small scratch space, making only the copying forth and back slow; the copying back is interleaved with the multiplications by ω. To further increase the cache friendliness, one cache line of padding is inserted after each row of the matrix, as per Figure 4.1.

Figure 4.1. A cache line of unused memory is inserted after each row of the n_1 × n_2 matrix to alleviate cache associativity problems.
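To make the data flow concrete, the following C sketch outlines how the four steps map onto the row-major layout described above. It is only a schematic illustration under stated assumptions: fft_inplace(), mulmod() and pow_mod() are hypothetical helper routines (an ordinary cache-resident in-place FFT, modular multiplication and modular exponentiation), not the project's actual functions, and the physical transpose of step 3 is skipped, leaving the output in transposed order – acceptable for convolution as long as the inverse transform uses the same convention.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, assumed to exist for the sake of illustration. */
uint64_t mulmod(uint64_t x, uint64_t y, uint64_t p);
uint64_t pow_mod(uint64_t b, uint64_t e, uint64_t p);
void fft_inplace(uint64_t *v, size_t len, uint64_t root, uint64_t p);

/* Schematic "four-step" FFT of length n = n1*n2 over Z/pZ.  The data is an
   n1 x n2 matrix stored row-major with `pad` unused words after each row,
   so A[j][k] lives at a[j*(n2+pad) + k].  scratch holds n1 words. */
void fft_four_step(uint64_t *a, size_t n1, size_t n2, size_t pad,
                   uint64_t omega, uint64_t p, uint64_t *scratch)
{
    size_t stride = n2 + pad;

    /* Steps 1-2: n2 column transforms of length n1, with the twiddle
       factors omega^(j*k) applied while scattering the columns back. */
    for (size_t k = 0; k < n2; k++) {
        for (size_t j = 0; j < n1; j++)
            scratch[j] = a[j * stride + k];          /* gather column k */
        fft_inplace(scratch, n1, pow_mod(omega, n2, p), p);
        uint64_t wk = pow_mod(omega, k, p), w = 1;
        for (size_t j = 0; j < n1; j++) {
            a[j * stride + k] = mulmod(scratch[j], w, p);
            w = mulmod(w, wk, p);                    /* w = omega^(j*k) */
        }
    }

    /* Steps 3-4: instead of transposing, do the length-n2 transforms on
       the rows, which are contiguous in memory. */
    for (size_t j = 0; j < n1; j++)
        fft_inplace(a + j * stride, n2, pow_mod(omega, n1, p), p);
}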
This scheme has become very widely used and is called Montgomery reduction after it’s inventor: For a fixed N a new residue system is chosen; for 0 ≤ i < N, i represents the residue class containing iR−1 , where R > N and gcd(N, R) = 1. This will help in computing T R−1 mod N when 0 ≤ T < RN as in Algorithm 7. Algorithm 7: Reduction algorithm. Input: T with 0 ≤ T < RN Output: T R−1 mod N REDC(T ) (1) m ← (T mod R)N −1 mod R (2) t ← (T − mN)/R (3) if t ≥ 0 then return t (4) else return t + N mN ≡ T mod R, so t is an integer. Furthermore, tR ≡ modN, so t ≡ T R−1 mod N. Also, 0 ≤ T < RN and 0 ≤ mN < RN, so −N < t < N. Now, given our x and y, let z = REDC(xy). Then z = (xy)R−1 , so (xR−1 )(yR−1 ) ≡ −1 zR mod N. 0 ≤ z < N holds, so z is the product of x and y in this representation. Observe that addition and subtraction, as well as other operations, become unchanged. In the project code R was chosen as 264 , making division become just choosing the high word of the product and mod the low word. The primes were, initially, meant to be any 64-bit primes offering long enough transform lengths. Then a tip reached the author via Niels Möller and [28]: 25 Consider primes on the form p = 2n − k2m + 1, where 1 ≤ m < n and k < 2n−m is odd and positive. When n <= 2m, p−1 mod 2n ≡ k2m + 1 . Now Montgomery reduction mod p becomes (In pseudo-code style): IN: x in 0 .. 2^n*p-1 OUT: r in 0 .. p-1 with r == x/2^n (mod p) STEP STEP STEP STEP 1: 2: 3: 4: Set x0 = x & (2^n - 1) and set x1 = X >> n Compute t = (((k*x0) << m) + x0) & (2^n - 1) Compute u = ((t << n) - ((k*t) << m) + t) >> n If x1 >= u then set r = x1 - u Otherwise set r = x1 - u + p This looked very nice, since every step only involves shifts, additions and subtractions and for small k it could be done without multiplications. Furthermore, there exists three very nice 64-bit primes fitting this purpose: p0 = 264 − 232 + 1, p1 = 264 − 234 + 1 and p2 = 264 − 240 + 1. After a bit of experimentation these nice primes were abandoned. Mainly due to the fact that, although elegant, the reduction function became hard to properly software pipeline, thus ultimately slowing down the inner-loop on the project target machines with strong integer multiply support. Also, since the inception of the project, the demands on transform length had changed. Now transforms of size 250 had to be possible. Clearly neither p0 , p1 nor p2 could provide the roots of unity for that. It is possible to take the concept of precomputation further in this context. After all, at run time the powers of ω to be used are known, could this not be utilised in some fashion? Exactly this point has been addressed by Victor Shoup and was communicated to the author by Torbjörn Granlund via [13]: We have x and y, two p-residues, and want xy mod p. If W is the machine word-size, let p be a (W − 1)-bit prime. Pre-compute y pinv = 2 · 2W−1 y/p. Now, q = x · y pinv /2W ≈ xy/p. Let r = xy − qp. If r is computed mod2W , it will be either the correct answer or p to big. To check this, replace the floor expressions by the appropriate remainders: y pinv = 2 q= 2W−1 y − r1 , 0 ≤ r1 < p p x · y pinv − r2 , 0 ≤ r2 < 2w 2W 26 ⇒q= x · y pinv r2 2r1 r2 2W−1 y − r1 r2 xy − W − W. − = 2x − W = W W W p 2 p 2 2 2 p2 2 Then r = x·y−q· p = 2r1 r2 p + w. 2w 2 Clearly, r ≥ 0. And also, r≤ 2w − 1 2(p − 1) + p < 1 + p. 2w 2w When using Montgomery reduction, the full double word product, x · y has to be computed first. 
Then follows the application of Algorithm 7, holding one single word product and one double word product. Shoup’s scheme has two single-precision multiplications and one double-precision multiplication where only the high word is needed, making it a bit more inexpensive. Also, there is more potential parallelism in Shoup’s scheme, possibly giving it better characteristics for software pipelining. As it requires both the powers of ω and their pre-computed duals it does perform worse than Montgomery reduction with respect to memory accesses. Also, it requires performing modular reductions without any precomputation, for instance when performing the pointwise multiplications of the transformed polynomials. This has to be accomplished using standard methods from, e.g., [15]. For the final version of the project code, both Montgomery reduction using 64 bit primes and Shoup’s reduction using 63 bit primes, was implemented. 4.3 Two Wrongs Make a Right In the first implementations the convolution of the two n-long sequences u and v was performed by padding the sequences with zeros to length 2k, where k is the smallest power of two greater than or equal to n. This clearly favours power-of-two operand sizes and gives sudden increases in computational cost when the size exceeds a power of two. In [2], Bailey describes a method for making the computational cost a more continuous function of n and, consequently, reducing the cost for certain problem sizes. This method was implemented in the project code and is described in the following example: Consider the case when n = k + 2, where k = 2m . I.e., uk , uk+1 , uk+2 and vk , vk+1 , vk+2 may be nonzero. First pad u and v with zeros to length 2k = 2m+1 . Then apply forward and inverse FFTs to the extended sequences to produce the following circular convolution: r0 = u0 v0 + uk−1 vk+1 + uk vk + uk+1 vk−1 r1 = u0 v1 + u1 v0 + uk vk+1 + uk+1 vk r3 = u0 v3 + u1 v2 + u2 v1 + u3 v0 .. . . = .. rk = u0 vk + u1 vk−1 + . . . + uk v0 .. . . = .. r2k−1 = uk−2 vk+1 + uk−1 vk + uk vk−1 + uk+1 vk−2 27 Two items make this result differ from the desired 2n-long linear convolution: • the first three values are corrupted by some additional terms • the final four members of the sequence are missing. The missing values are r2n−4 = r2k = uk−1 vk+1 + uk vk + uk+1 vk−1 r2n−3 = r2k+1 = uk vk+1 + uk+1 vk r2n−2 = r2k+2 = uk+1 vk+1 r2n−1 = r2k+3 = 0 Ignoring the last zero value, these expressions are precisely the values that have corrupted the first three members of the desired sequence r. So, by computing three expressions separately, the r sequence can be corrected to the sought 2n-point linear convolution. From the example it is clear that this method can be applied when evaluating the linear convolution of sequences of size n = k + d for any d < k = 2m . Extend the input sequences with zeros to length 2k, and calculate the 2k-long circular convolution by FFTs. Then compute a correction sequence by computing a linear convolution of the two (2d − 1)-long sequences u = {uk−d+1 , uk−d+2 , . . . , uk+d−1 } and v = {vk−d+1 , vk−d+2 , . . . , vk+d−1 }. Discard the first 2d − 2 values (as well as the final zero value), and correct the 2k-long sequence to the desired 2n-long sequence. Clearly, there is no point in doing this for d much larger than 2m−1 , since for this value of d the size of the correction convolution is about the same size as the 2m -point convolution, making the two convolutions nearly as costly as one convolution on inputs of length 2m+1 . 
Chapter 5

Results

Four different variants of the project code were implemented in C, using the GMP framework:

• Straight iterative code based on Montgomery reduction. Also uses the ideas from Section 4.3.
• Straight iterative code based on Victor Shoup's reduction scheme. Also uses the ideas from Section 4.3.
• Bailey Four-step FFT based on Montgomery reduction.
• Bailey Four-step FFT based on Shoup's reduction scheme.

The versions of the code utilising Montgomery reduction were implemented for one to five 64-bit primes, and the Shoup-based code uses one to five 63-bit primes.

As has been shown during the development of GMP, the use of hand-written assembly code is essential to really get the best performance on every platform. The FFT present in GMP 4.1.4 relies heavily on the very tight addition and subtraction assembly loops of GMP. The project C code would not stand much of a chance on a platform where GMP assembly support is available. However, GMP 4.1.4 lacks assembly support for the very interesting AMD64 architecture, making the comparison a much fairer one. In Figures 5.1 and 5.2 the quotient of execution times between the project code and GMP 4.1.4 is shown for two different machines.

[Figure 5.1. The quotient of execution times (0 to 1) between the project code and the GMP 4.1.4 code for different operand sizes (2^6 to 2^20 words) using the two reduction schemes. Measurements are from an AMD Opteron 246 2 GHz machine.]

Several conclusions can be drawn from the data in Figures 5.1 and 5.2. The most obvious one is that the project code is up to twice as fast as the code in GMP 4.1.4. It is also the case that Montgomery reduction is more successful than Shoup's scheme. This is mainly due to the fact that the linear work is heavier in Shoup's scheme when nothing special has been done to simplify the modular reductions not supported by the precomputed powers of ω. What may not be as evident are some general characteristics of the quotient curves. On all interesting platforms it is the case that when operands fit in L1 cache, the straight implementations are faster than their Four-step counterparts. Just above L1 cache size there is a peak in the curves. This is where the Four-step code has overtaken the straight code in performance, but the larger overhead is showing through.

It could be expected, and is indeed the case, that of two platforms differing only in memory bandwidth, the project code would compare more favourably with the GMP code on the platform with the lower bandwidth, since the project code has better cache performance characteristics. Likewise, the project code would be better off on a platform with a higher CPU clock frequency if that were the only difference, since the project code performs more operations than the GMP code. The two machines used to generate the data in Figures 5.1 and 5.2 are similar, but this particular Opteron 246 platform has a faster processor as well as higher memory bandwidth. It seems that the two effects mentioned above almost balance each other out. On the Intel Itanium 2 architecture, another instance where GMP 4.1.4 does not have assembly support, similar, if not as spectacular, performance was achieved. The author and Torbjörn Granlund have written proof-of-concept assembly code, showing that the project approach could very well be competitive on machines with GMP assembly support.
To give the reader a feel for how fast the implemented algorithm is, Figure 5.3 shows timing data for the Opteron 246 test machine. Plotted together with the timing data is the number of primes used to obtain each measurement. It is well worth noting that it can pay off to make the input operands longer, i.e., to use only two primes. Naturally enough, this is only an option for relatively small operands.

[Figure 5.2. The quotient of execution times (0 to 1) between the project code and the GMP 4.1.4 code for different operand sizes (2^6 to 2^20 words) using the two reduction schemes. Measurements are from an AMD Opteron 240 1.4 GHz machine.]

[Figure 5.3. Execution times (0.0001 s to 1 s, logarithmic scale) for multiplication based on the project code using Montgomery reduction, plotted together with the number of primes (two to five) used to achieve each timing measurement, for operand sizes from 2^6 to 2^20 words.]

There is still much work to be done if the project code is to be inserted into the GMP repository. The main obstacle is to write a function that accurately predicts how many primes to use for a particular problem size. At this point, when the inner loop of the FFT code dominates the execution time, a simple comparison of the operations executed in that loop is probably accurate enough. As the inner loop grows tighter, weight probably has to be given to the linear work as well. Other possible directions of development arise from the demands of GMP 5, where truly huge operands have to be supported. To accommodate this, one suggestion would be to implement support for more primes and a recursive Four-step algorithm. It could also be interesting to experiment with Victor Shoup's scheme on 32-bit platforms. In any case, the author would like to deem the project successful and looks forward to meeting the remaining challenges mentioned above.

References

[1] D. H. Bailey. FFTs in external or hierarchical memory. Journal of Supercomputing, 4(1):23–35, March 1990.
[2] D. H. Bailey. On the Computational Cost of FFT-Based Linear Convolutions. http://crd.lbl.gov/~dhbailey/dhbpapers/, June 1996. Last visited May 2005.
[3] J. A. Beachy and W. D. Blair. Abstract Algebra. Waveland Press, second edition, 1996. ISBN 0-88133-866-4.
[4] C. S. Burrus. Notes on the FFT. http://www.fftw.org/burrus-notes.html, September 1997. Last visited May 2005.
[5] CiteSeer.IST Scientific Literature Digital Library. http://citeseer.ist.psu.edu/. Last visited May 2005.
[6] W. T. Cochran, J. W. Cooley, D. L. Favin, H. D. Helms, R. A. Kaenel, W. W. Lang, G. C. Maling, D. E. Nelson, C. M. Rader, and P. D. Welch. What is the fast Fourier transform? IEEE Transactions on Audio and Electroacoustics, AU-15(2):45–55, June 1967.
[7] J. W. Cooley, P. A. W. Lewis, and P. D. Welch. Historical notes on the fast Fourier transform. IEEE Transactions on Audio and Electroacoustics, AU-15(2):76–79, June 1967.
[8] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, chapter 30. MIT Press, second edition, 2001. ISBN 0-262-53196-8.
[10] T. Estermann. Introduction to Modern Prime Number Theory, chapter 2. Cambridge University Press, 1961.
[11] W. M. Gentleman and G. Sande. Fast Fourier transforms – for fun and profit.
In 1966 Fall Joint Computer Conference, volume 29 of AFIPS Proceedings, pages 563–578, 1966.
[12] M. T. Goodrich and R. Tamassia. Algorithm Design: Foundations, Analysis, and Internet Examples, chapter 10, pages 488–507. John Wiley, 2002. ISBN 0-471-38365-1.
[13] T. M. Granlund. E-mail communication with Victor Shoup of New York University, USA, February 2005.
[14] T. M. Granlund et al. GNU Multiple Precision Arithmetic Library 4.1.4. http://swox.com/gmp/, September 2004.
[15] T. M. Granlund and P. L. Montgomery. Division by invariant integers using multiplication. ACM SIGPLAN Notices, 29(6):61–72, 1994.
[16] G. H. Hardy and E. M. Wright. An Introduction to the Theory of Numbers. The Clarendon Press, fifth edition, 1979. ISBN 0-19-853170-2.
[17] M. T. Heath. Scientific Computing: An Introductory Survey. McGraw-Hill, second edition, 2002. ISBN 0-07-239910-4.
[18] M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the fast Fourier transform. IEEE Acoustics, Speech, and Signal Processing Magazine, 1:14–21, October 1984.
[19] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, chapter 5, pages 372–483. Morgan Kaufmann, second edition, 1996.
[20] J. Håstad. Notes for the course Advanced Algorithms. http://www.nada.kth.se/~johanh/algnotes.pdf, 2000. Last visited May 2005.
[21] J. Johnson and R. Johnson. Challenges of computing the fast Fourier transform. In Proc. Optimized Portable Application Libraries (OPAL) Workshop. DARPA/NSF, June 1997.
[22] A. Karatsuba and Y. Ofman. Multiplication of multidigit numbers on automata. Doklady Akademii Nauk SSSR, 145(2):293–294, 1962.
[23] D. E. Knuth. The Art of Computer Programming: Seminumerical Algorithms, volume 2, chapter 4. Addison-Wesley, third edition, 1998. ISBN 0-201-89684-2.
[24] S. Landau and N. Immerman. The Similarities (and Differences) between Polynomials and Integers. http://citeseer.ist.psu.edu/landau94similarities.html, September 1994. Last visited May 2005.
[25] J. D. Lipson. Elements of Algebra and Algebraic Computing. Benjamin/Cummings, 1981. ISBN 0-201-04480-3.
[26] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996. ISBN 0-8493-8523-7.
[27] P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44:519–521, 1985.
[28] N. Möller. E-mail communication with François G. Dorais of Dartmouth College, USA, June 2004. Originating from C. Pomerance.
[29] C. Percival. Rapid multiplication modulo the sum and difference of highly composite numbers. Mathematics of Computation, 72:387–395, 2003.
[30] J. M. Pollard. The fast Fourier transform in a finite field. Mathematics of Computation, 25:365–374, 1971.
[31] R. L. Rivest, A. Shamir, and L. M. Adleman. A Method for Obtaining Digital Signatures and Public-key Cryptosystems. Technical Report MIT/LCS/TM-82, MIT, 1977.
[32] A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7(3-4):281–292, 1971.
[33] A. S. Tanenbaum. Structured Computer Organization. Prentice Hall, fourth edition, 1999. ISBN 0-13-020435-8.
[34] A. L. Toom. Complexity of a scheme of functional elements realizing the multiplication of integers. Doklady Akademii Nauk SSSR, 150:496–498, 1963.

Appendix: Acronyms

The following acronyms and abbreviations are used throughout the report.

CRT: Chinese Remainder Theorem.
DFT: Discrete Fourier Transform.
FFT: Fast Fourier Transform.
GMP: GNU Multiple Precision Arithmetic Library.
NTT: Number Theoretic Transform.