Number Theory Meets Cache Locality
– Efficient Implementation of a Small Prime FFT
for the GNU Multiple Precision Arithmetic Library
Tommy Färnqvist
TRITA-NA-E05091
Numerisk analys och datalogi
KTH
100 44 Stockholm
Department of Numerical Analysis
and Computer Science
Royal Institute of Technology
SE-100 44 Stockholm, Sweden
Master’s Thesis in Computer Science (20 credits)
within the First Degree Programme in Mathematics and Computer Science,
Stockholm University 2005
Supervisor at Nada was Stefan Nilsson
Examiner was Stefan Arnborg
Abstract
When multiplying really large integer operands, the GNU Multiple Precision Arithmetic
Library uses a method based on the Fast Fourier Transform.
To make an algorithm execute quickly on a modern computer, data has to be available
in the cache memory. If that is not the case, a large portion of the execution time will be
spent accessing the main memory. It might pay off to perform much extra work to achieve
good cache locality. In extreme cases, 500 primitive operations may be performed in the
time of a single memory access.
This report describes the implementation of a cache friendly variant of the Fast Fourier
Transform and its application to integer multiplication. The variant uses arithmetic modulo
primes near machine word-size. The multiplication method is shown to be competitive
with its counterpart in version 4.1.4 of the GNU Multiple Precision Arithmetic Library on
the platforms of interest.
Number Theory Meets Cache Locality
– Efficient Implementation of a Small Prime FFT
for the GNU Multiple Precision Arithmetic Library
Referat
For multiplication of really large integer operands, the GNU Multiple Precision Arithmetic Library uses a method based on a variant of the Fast Fourier Transform.
To make an algorithm execute quickly on a modern computer, its data must be accessible in the cache memory. If that is not the case, a large share of the running time will be spent on accesses to the machine's main memory. It can pay off to perform a great deal of extra work to obtain good cache locality. In extreme cases, 500 primitive operations can be performed in the time it takes to make a single memory access.
This report describes the implementation of a cache-friendly variant of the Fast Fourier Transform and its application to integer multiplication. The variant uses arithmetic modulo primes near machine word size. The multiplication method turns out to be competitive with its counterpart in version 4.1.4 of the GNU Multiple Precision Arithmetic Library on the platforms of interest.
Preface
This is my Master’s thesis in Computer Science. The project was performed within the First
Degree Programme in Mathematics and Computer Science at the Department of Numerical
Analysis and Computer Science, Stockholm University.
I would like to sincerely thank my supervisor at Swox AB, Torbjörn Granlund, for
inviting me into the exciting world of bignum arithmetic and for showing me what the
essence of a nice hack truly is. I am obliged to my supervisor at NADA, Stefan Nilsson,
who taught me algorithms, complexity theory and computer architecture; tools that have
come in very handy during all phases of this project. Thanks also go to Niels Möller, of
KTH, Stockholm, for his mathematical insights. Last, but not least, I am grateful to the
Medicis team at École Polytechnique, France, for assisting the project with computer time
and resources.
Contents

1 Introduction
  1.1 Big Numbers, Who Needs Them?
  1.2 GMP and Multiple-precision Multiplication
    1.2.1 Schoolbook multiplication
    1.2.2 Karatsuba's algorithm
    1.2.3 Toom's algorithm
    1.2.4 Fermat Style FFT Multiplication
  1.3 Goals of this Project

2 Preliminaries
  2.1 The FFT and Integer Multiplication
    2.1.1 Primitive Roots of Unity
    2.1.2 The Discrete Fourier Transform
    2.1.3 The Fast Fourier Transform Algorithm
    2.1.4 Avoiding Recursion
  2.2 Computer Memory Hierarchy

3 FFT Based Integer Multiplication in a Finite Field
  3.1 Feasibility of mod p FFTs
  3.2 “Three primes” Algorithm for Integer Multiplication

4 Implementation Issues
  4.1 Cache Locality
  4.2 Modular Reduction
  4.3 Two Wrongs Make a Right

5 Results

References

Appendix: Acronyms
Chapter 1
Introduction
This first chapter presents the background setting for the project. Section 1.1 gives a brief
account of why the project is interesting. Section 1.2 introduces the concept of
multiple-precision integers and the algorithms used in the GNU Multiple Precision Arithmetic Library, GMP¹ [14], for multiplication of such integers. The goals of the project and
the further outline of this thesis are given in Section 1.3.
1.1 Big Numbers, Who Needs Them?
This thesis deals with multiplication of large integers, specifically multiplication based on
the Fast Fourier Transform (FFT) algorithm. Multiplication is a very basic operation, and
if we are not able to solve that computational problem satisfactorily, the
chances of conquering more complex ground are slim. It is well worth noting that the
best multiplication algorithm currently known, published in [32] in 1971, is not known to
have asymptotically optimal time complexity! There might be room for improvement, and
that alone should be incentive enough for anyone to get interested.
But, what good are such large integers? Of course there are applications, such as the
very popular RSA algorithm, see [31], where integers represented by several hundred bits
are needed. When experimenting with conjectures from number theory it is not uncommon
to need as high a precision as current funds can buy. Then there are those who seek out
larger and larger primes, try to factor big numbers, or just like to see what the first few
billion decimals of π look like.
1.2 GMP and Multiple-precision Multiplication
Our normal way of thinking of an integer, in a positional number system representation with
base 10, generalises to any base $B \in \mathbb{Z}$, $B \ge 2$. An $m$-digit base $B$ number, $(u_{m-1} \ldots u_0)_B$,
represents the quantity $u_{m-1}B^{m-1} + \ldots + u_1B + u_0$, where $0 \le u_i < B$ for each $u_i$.
A computer's hardware can only perform operations on integers smaller than the machine word-size $W$. If bigger numbers are needed, the programmer has to provide routines
for the representation and manipulation of multiple-precision integers – integers of the
form $u = (u_{m-1} \ldots u_1 u_0)_B$, where $B \le W$. This facilitates development of the classical
arithmetical algorithms with a base $B$ much larger than 10.
¹ A list of acronyms and abbreviations used in this thesis can be found in the Appendix.
GMP is a free software library written in C that provides routines for manipulating
arbitrary precision numbers. The library obtains its speed by operating on full words, by
using the best algorithm for the given problem size, by including hand-optimised assembly code
for the most common inner loops on many CPUs, and by a general emphasis on speed.
GMP implements four different integer multiplication algorithms (schoolbook multiplication, Karatsuba, Toom-3 and a Fermat style FFT multiplication). This is in line with
one of the governing design paradigms of GMP: the asymptotically
better algorithms are not always better in practice than the slower ones. The asymptotically fast algorithms
are often associated with a fair amount of overhead that does not pay off until larger instances of the problem are to be solved. The choice of when to switch algorithms in GMP
is made at compile time.
The hardware is assumed to be capable of the following arithmetic operations:
• Addition of two single-digit numbers u and v, giving a single digit result u + v (or
u + v − B if u + v ≥ B).
• Subtraction of one single-digit number v from a single digit number u, giving a
single-digit result u − v (or u − v + B if u − v < 0).
• Multiplication of two single-digit numbers u and v, giving a two-digit result uv.
• Division of a two-digit number u by a single-digit number v, giving a single digit
quotient u/v and a single-digit remainder u mod v. (u/v < B is required.)
Throughout the following descriptions of multiplication algorithms the input operands,
$U = u_{n-1}B^{n-1} + u_{n-2}B^{n-2} + \ldots + u_1B + u_0$ and $V = v_{m-1}B^{m-1} + v_{m-2}B^{m-2} + \ldots + v_1B + v_0$,
are assumed to be nonnegative.
1.2.1 Schoolbook multiplication
This method is very similar to the one taught in grade school, except that storage of all
partial products is not required. It is just the usual set of cross-products (U and V have to
be nonnegative), as presented in Algorithm 1.
It is straightforward to prove, by induction over j, that Algorithm 1 returns the product
R of U and V. Obviously the algorithm requires Θ(nm) time. If multiplication truly were
a quadratic operation, there would not be much hope of multiplying large numbers at all.
Fortunately, there are faster algorithms.
1.2.2 Karatsuba's algorithm
We want to multiply two n-digit numbers u and v. Assume n is a power of two and
write $u = x + B^{n/2}y$ and $v = w + B^{n/2}z$, where $x, y, w, z < B^{n/2}$. What we want is
$uv = (x+B^{n/2}y)(w+B^{n/2}z) = xw + B^{n/2}(xz+yw) + B^{n}yz$.
Algorithm 1: Schoolbook multiplication.
Input: U, V
Output: The product of U and V.
M(U, V)
(1)   r_i ← 0 for all 0 ≤ i < m
(2)   j ← 0
(3)   while j < n
(4)       k ← 0
(5)       q_{k-1} ← 0
(6)       while k < m
(7)           t_k ← r_{j+k} + q_{k-1} + u_j v_k
(8)           r_{j+k} ← t_k mod B
(9)           q_k ← t_k / B
(10)          k ← k + 1
(11)      r_{j+m} ← q_{k-1}
(12)      j ← j + 1
(13)  return R ← r_{n+m-1}B^{n+m-1} + r_{n+m-2}B^{n+m-2} + . . . + r_1 B + r_0
Using the identity $xz + yw = (x+y)(w+z) - xw - yz$, it is enough to calculate three n/2-digit products²: $xw$, $yz$ and $(x+y)(w+z)$, which can be
computed by recursion. On each recursion level there is also a need for a few shifts and
additions.
Algorithm 2: Karatsuba's algorithm.
Input: U, V, n
Output: The product of U and V.
M(U, V, n)
(1)   if n = 1
(2)       return uv
(3)   Let u ← x + B^{n/2} y and v ← w + B^{n/2} z
(4)   Compute t_1 ← xw using Karatsuba's algorithm.
(5)   Compute t_2 ← yz using Karatsuba's algorithm.
(6)   Compute t_3 ← (x + y)(w + z) using Karatsuba's algorithm.
(7)   t_4 ← t_3 − t_2 − t_1
(8)   return t_1 + B^{n/2} t_4 + B^{n} t_2
Algorithm 2 was first published in [22]. If it requires M(n) time, then $M(n) = 3M(n/2) + O(n)$. This recurrence relation has the solution $M(n) = O(n^{\log_2 3})$. Since $\log_2 3 \approx 1.585$, this
is considerably faster than schoolbook multiplication, and operands do not have to be very
large at all for Karatsuba to beat the naive algorithm.
² $(x + y)$ and $(w + z)$ might have n/2 + 1 digits each and this extra digit should be handled separately. This
is just a linear amount of extra work and only complicates this analysis.
1.2.3 Toom's algorithm
Karatsuba's method is the simplest case of a general way to split input operands that leads
to both the Toom and FFT algorithms.
Let r ≥ 2 be the (small) splitting parameter. The operands are split into r pieces and
viewed as degree r − 1 polynomials. The polynomials are then evaluated at 2r − 1 points, pointwise multiplied and interpolated, as in Algorithm 3.
Algorithm 3: Toom's algorithm.
Input: U, V, n, r
Output: The product of U and V.
M(U, V, n, r)
(1)   if n < r
(2)       Compute the product using schoolbook multiplication.
(3)   s ← B^{⌈n/r⌉}
(4)   Let u ← u_{r−1} s^{r−1} + . . . + u_1 s + u_0 and v ← v_{r−1} s^{r−1} + . . . + v_1 s + v_0.
(5)   Let U(t) ← u_{r−1} t^{r−1} + . . . + u_1 t + u_0 and V(t) ← v_{r−1} t^{r−1} + . . . + v_1 t + v_0.
(6)   Evaluate U(t) and V(t) at small values of t.
(7)   Compute W(t) ← U(t)V(t) for those t using 2r − 1 recursive calls.
(8)   Compute the 2r − 1 coefficients of Z(t).
(9)   return Z(s)
Algorithm 3 was published in [34]. If M(n) is the execution time of the algorithm, then
$M(n) = (2r-1)M(n/r) + O(n)$. This recurrence can be solved as $M(n) = O(n^{\log_r(2r-1)})$. By
letting $r \to \infty$ we get a multiplication algorithm with $O(n^{1+\epsilon})$ time complexity for any
$\epsilon > 0$. Of course the overhead grows with r, and in practice only small values of r are worth
implementing.
GMP implements Toom-3, an $O(n^{\log_3 5}) = O(n^{1.465})$ algorithm. The larger overhead of Toom,
from more work in evaluation and interpolation, shows up here, and so it pays off only
from operand sizes of a few hundred machine words.
1.2.4 Fermat Style FFT Multiplication
If we pick the evaluation points in Algorithm 3 as roots of unity, instead of arbitrarily, the
evaluation is a Discrete Fourier Transform, and if r is chosen wisely, it can be computed by
the Fast Fourier Transform algorithm.
For large operand sizes GMP uses a Fermat style FFT multiplication, as in [32], an
$O(n \log n \log\log n)$ algorithm. The product computed is $uv \bmod (2^N + 1)$ with $N \ge \mathrm{bits}(u) + \mathrm{bits}(v)$, padding u and v with high zero words.
The algorithm follows a scheme similar to that of Karatsuba and Toom (splitting, pointwise multiplication, interpolation). The points chosen for evaluation are powers of two, so
the operations needed for the FFT are only additions, shifts and negations.
For more on multiplication algorithms, see [23], and the GMP manual for GMP specifics. See also [20], or why not the GMP source code; it is free. In Figure 1.1, the typical
behaviour of GMP's multiplication algorithms for different problem sizes is shown.
[Figure: log-log plot of execution time (s) versus operand size (words) for Schoolbook, Karatsuba, Toom-3 and Fermat FFT multiplication.]
Figure 1.1. Execution times for GMP multiplication routines. Karatsuba is faster than schoolbook multiplication at 22 words, Toom-3 overtakes Karatsuba at 274 words and the FFT based
code outperforms Toom-3 at 8448 words. Measurements are from an Intel Pentium 4, 2.4 GHz.
1.3 Goals of this Project
The goal of this project was to improve upon the FFT based multiplication scheme in GMP
4.1.4 by implementing a faster FFT scheme over the finite field Z/Zp, for primes p near
machine word size.
The reason this scheme is relatively untried is that conventional FFT implementations
use IEEE double precision floating point numbers with 53 bits of precision, giving them
a clear advantage over the usual 32 bit integer precision. The approach used in GMP
tries to circumvent this problem by leaving out the costly multiplications (floating point
multiplications are usually much faster than integer ones) that appear in a Z/Zp-based
scheme.
It was the hope of the commissioner of this project that the new platforms with strong
integer multiply support (Intel Itanium and AMD Opteron, for example) and 64-bit precision would give enough power to the Z/Zp-based scheme to make it a winner. Since
it was deemed unlikely that a 32-bit implementation could be successful, the project was
concentrated on the 64-bit machines, with a view to fulfilling the demands of the planned release
dubbed GMP 5.
The rest of this thesis proceeds with introducing the Fast Fourier Transform and its
application to integer multiplication in Chapter 2. This is followed by a brief glimpse of
how modern computer memory works. It will be apparent that the Fast Fourier Transform
is particularly ill suited for modern computers when it comes to memory access patterns.
A further challenge of the project is therefore to formulate the algorithm in a manner that
better suits the modern computer.
In Chapter 3 the nice theory behind the algorithm to be implemented in the project is
presented. Chapter 4, then, gives an account of some of the more interesting issues that
arose in the implementation phase of the project. Lastly, Chapter 5 presents the results of
the project and discusses some future work to be done.
Chapter 2
Preliminaries
In this chapter the theoretical foundation of the FFT and its application to integer multiplication is presented. A short introduction to modern computer memory architecture is also
given.
2.1 The FFT and Integer Multiplication
The Fourier transform of a continuous function $a(t)$ is given by
$$A(f) = \int_{-\infty}^{\infty} a(t)e^{2\pi i f t}\, dt.$$
The discrete analogue to this continuous transform is the discrete Fourier transform (DFT),
which applies to samples $a_0, a_1, \ldots, a_{N-1}$ of $a(t)$:
$$A_j = \sum_{k=0}^{N-1} a_k e^{2\pi i jk/N}, \quad j = 0, \ldots, N-1.$$
It was the problem of computing sums of this kind that J. W. Cooley and J. W. Tukey addressed in [8], the paper in which they essentially discovered the FFT. Actually the history
of the FFT is long and convoluted, dating back to Gauss in 1805. See [7] and [18] for
further historical remarks.
It is not hard to make a list of areas where the speed of the FFT comes in handy. Such
a list would most certainly contain:
• RADAR Processing
• Digital Pulse Compression
• Digital Filtering
• Spectrum Analysis
• Optics
• Speech Analysis
• Crystallography
• Computational Fluid Dynamics
• Microlithography
• Image Analysis
• Convolution/multiplication
A longer list may be found in [21].
Since the FFT has found many areas of application the amount of research regarding
the algorithm is large. [5] lists 4577 documents relating to the FFT. The reader who wishes
to delve deeper into the realm of the FFT is referred to [4], an excellent source of pointers
to more documents on the subject. For a more thorough introduction to the basic algorithm
than the one given here see [6], [12] or [9].
To make the application of the FFT to integer multiplication obvious, the algorithm is
derived as one solving the problem of polynomial multiplication. All that has to be done is
then to utilise the casual observation that an integer U written in some base B ($2^{32}$ or $2^{64}$
for our applications), $U = u_0 + u_1 B + u_2 B^2 + \ldots + u_{n-1}B^{n-1}$, can be seen as a polynomial
if B is substituted with an unbound variable. This fascinating duality is a subject in its own
right; [24] contains more information relating to that.
Now, a polynomial in coefficient form is described as $p(x) = \sum_{i=0}^{n-1} u_i x^i$. The degree of
p is the largest index of a nonzero coefficient $u_i$.
If $q(x) = \sum_{i=0}^{n-1} v_i x^i$ is another polynomial, the sum of the polynomials is given as
$p(x) + q(x) = \sum_{i=0}^{n-1} (u_i + v_i)x^i$. The coefficient form also allows for efficient evaluation of $p(x)$ by
Horner's rule: $p(x) = u_0 + x(u_1 + x(u_2 + \ldots + x(u_{n-2} + xu_{n-1})\ldots))$.
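As a small illustration, Horner's rule is a short loop in C (horner is a made-up name, and real coefficients are used only for the example):

#include <stddef.h>

/* Horner evaluation of p(x) = u[0] + u[1]*x + ... + u[n-1]*x^{n-1}. */
double horner(const double *u, size_t n, double x)
{
    double acc = 0.0;
    while (n--)
        acc = acc * x + u[n];   /* peel off the highest remaining coefficient */
    return acc;
}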
The time complexity for addition and evaluation of polynomials in coefficient representation is apparently O(n). The product of p and q, however, is a bit more tricky:
$p(x)q(x) = u_0v_0 + (u_0v_1 + u_1v_0)x + (u_0v_2 + u_1v_1 + u_2v_0)x^2 + \ldots + u_{n-1}v_{n-1}x^{2n-2}$. That
is, $p(x)q(x) = \sum_{i=0}^{2n-2} r_i x^i$, where $r_i = \sum_{j=0}^{i} u_j v_{i-j}$, for $i = 0, 1, \ldots, 2n-2$. The sequence
$r_i$ is called the convolution of the sequences $u_i$ and $v_i$. For symmetry reasons, the $r_i$ are
regarded as a sequence of length 2n, with $r_{2n-1} = 0$. Viewing the $u_i$ and $v_i$ as vectors
$u = [u_0, u_1, \ldots, u_{n-1}]$ and $v = [v_0, v_1, \ldots, v_{n-1}]$, the convolution of the two vectors is denoted $u * v$. Applying the definition of convolution directly, it takes $\Theta(n^2)$ time to multiply
p and q.
Another way of representing a polynomial p, of degree n−1, is by its value on n distinct
inputs, i.e. point-value pairs $\{(x_0, y_0), (x_1, y_1), \ldots, (x_{n-1}, y_{n-1})\}$ such that all $x_i$ are distinct
and $y_i = p(x_i)$ for $i = 0, 1, \ldots, n-1$.
This representation gives an alternative method for multiplying p and q. Namely,
evaluate p and q at 2n different inputs and compute the representation of the product as
$\{(x_0, p(x_0)q(x_0)), (x_1, p(x_1)q(x_1)), \ldots, (x_{2n-1}, p(x_{2n-1})q(x_{2n-1}))\}$. This computation clearly
just takes O(n) time, given the 2n point-value pairs.
To effectively use this approach the point-value pairs have to be found quickly. Applying Horner's rule on 2n different inputs would yield $\Theta(n^2)$ time complexity. Since the
choice of the (discrete) inputs is arbitrary, the question arises as to whether it is possible to
pick "easy" ones. For instance, $p(0) = u_0$ comes to mind. This is where the FFT will help.
The inverse of evaluation is, of course, interpolation. The following theorem shows that
this is well defined.
Theorem (Uniqueness of an interpolating polynomial). For any set
$\{(x_0, y_0), (x_1, y_1), \ldots, (x_{n-1}, y_{n-1})\}$ of n point-value pairs such that all the $x_i$ are distinct, there
is a unique polynomial p(x), of degree n − 1 or less, such that $y_i = p(x_i)$ for $i = 0, 1, \ldots, n-1$.
Proof. $y_i = p(x_i)$ is equivalent to the matrix equation
$$\begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^{n-1} \\ 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^{n-1} \end{pmatrix} \begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_{n-1} \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{n-1} \end{pmatrix}.$$
The left matrix is denoted $V(x_0, x_1, \ldots, x_{n-1})$ and is known as a Vandermonde matrix. V
has determinant $\prod_{0 \le j < k \le n-1} (x_k - x_j)$ and is therefore nonsingular if the $x_i$ are distinct. Thus,
the $u_i$ can be solved for uniquely.
One algorithm for n-point interpolation is based on Lagrange's formula:
$$p(x) = \sum_{i=0}^{n-1} y_i \frac{\prod_{j \ne i}(x - x_j)}{\prod_{j \ne i}(x_i - x_j)}.$$
The coefficients of p can now be computed in time $\Theta(n^2)$. More on interpolation in a
numerical context can be found in [17]. But how does one find a set of inputs that will make
evaluation and interpolation go quicker than applying Horner and Lagrange? One answer
is roots of unity, and this will lead to the FFT.
2.1.1 Primitive Roots of Unity
Now some concepts from algebra will come in handy. For a more solid background in
group and field theory and related items see [3].
A most important idea for the FFT is that of roots of unity:
Definition 2.1 (Primitive nth root of unity). A number ω is a primitive nth root of unity,
for n ≥ 2, if it satisfies:
1. $\omega^n - 1 = 0$
2. The numbers $1, \omega, \omega^2, \ldots, \omega^{n-1}$ are distinct.
Note that Definition 2.1 implies that a primitive nth root of unity has a multiplicative inverse, $\omega^{-1} = \omega^{n-1}$, since $\omega^{-1}\omega = \omega^{n-1}\omega = \omega^n = 1$. In other words: negative exponents of
ω are well defined.
For nth roots of unity to work as "easy" inputs to the polynomials to be evaluated, a few
properties are needed, including the following.

Property 2.2 (Cancellation property). If ω is a primitive nth root of unity, then $\sum_{i=0}^{n-1} \omega^{ki} = 0$ for any
integer $k \ne 0$, $-n < k < n$.

Proof. Since $\omega^k \ne 1$,
$$\sum_{i=0}^{n-1} \omega^{ki} = \frac{(\omega^k)^n - 1}{\omega^k - 1} = \frac{(\omega^n)^k - 1}{\omega^k - 1} = \frac{1^k - 1}{\omega^k - 1} = \frac{1 - 1}{\omega^k - 1} = 0.$$
Property 2.3 (Reduction property). If ω is a primitive (2n)th root of unity, then $\omega^2$ is a primitive nth root of unity.

Proof. Since $1, \omega, \omega^2, \ldots, \omega^{2n-1}$ are distinct, $1, \omega^2, (\omega^2)^2, \ldots, (\omega^2)^{n-1}$ are also distinct.

Property 2.4 (Reflective property). If ω is a primitive nth root of unity and n is even, then
$\omega^{n/2} = -1$.

Proof. If k = n/2, by Property 2.2,
$$0 = \sum_{i=0}^{n-1} (\omega^{n/2})^i = \omega^0 + \omega^{n/2} + \omega^n + \ldots + \omega^{(n/2)(n-2)} + \omega^{(n/2)(n-1)} = \omega^0 + \omega^{n/2} + \omega^0 + \ldots + \omega^0 + \omega^{n/2} = (n/2)(1 + \omega^{n/2}).$$
Thus, $0 = 1 + \omega^{n/2}$.

With the reflective property at hand, the fact that if ω is a primitive nth root of unity and
n ≥ 2 is even, then $\omega^{k+n/2} = -\omega^k$, is readily seen.
2.1.2 The Discrete Fourier Transform
As the reader undoubtedly has surmised, the DFT is precisely the technique of evaluating a
polynomial $p(x) = \sum_{i=0}^{n-1} u_i x^i$ at the nth roots of unity $\omega^0, \omega^1, \ldots, \omega^{n-1}$. As this only yields
n point-value pairs, p is "padded" with zeros by setting $u_i = 0$ for $n \le i \le 2n-1$. p can
now be viewed as a degree 2n−1 polynomial, and the primitive (2n)th roots of unity can be
used as inputs. In the following, p is assumed to have been padded with as many zeros as
necessary.
Formally, the DFT for p, as represented by the coefficient vector u, is defined as the
vector y of values $y_j = p(\omega^j)$, that is $y_j = \sum_{i=0}^{n-1} u_i \omega^{ij}$, where ω is a primitive nth root of
unity. Alternatively, if u and y are thought of as column vectors, $y = Fu$, where F is an
$n \times n$ matrix such that $F[i, j] = \omega^{ij}$.
The DFT would not be much of a transform if it did not have an inverse. Fortunately
F has an inverse, $F^{-1}$, so that $F^{-1}(F(u)) = u$ for all u. $F^{-1}$ has a simple form, $F^{-1}[i, j] = \omega^{-ij}/n$. Thus, given a vector y of the values of p at the nth roots of unity, the coefficient $u_i$
can be recovered as $u_i = \sum_{j=0}^{n-1} y_j \omega^{-ij}/n$. The following lemma justifies this.
Lemma 2.5. For any vector u, $F^{-1} \cdot Fu = u$.

Proof.
$$(F^{-1} \cdot F)[i, j] = \frac{1}{n}\sum_{k=0}^{n-1} \omega^{-ik}\omega^{kj}.$$
If i = j this reduces to
$$(F^{-1} \cdot F)[i, i] = \frac{1}{n}\sum_{k=0}^{n-1} \omega^0 = \frac{1}{n} \cdot n = 1.$$
Consider the case when $i \ne j$ and let $m = j - i$. Then $(F^{-1} \cdot F)[i, j] = (1/n)\sum_{k=0}^{n-1} \omega^{mk}$, where
$-n < m < n$, $m \ne 0$. By Property 2.2, the right-hand side reduces to zero. Hence, $(F^{-1} \cdot F)[i, j] = 0$ for
$i \ne j$.
The scheme to multiply two polynomials p and q now becomes (also illustrated in
Figure 2.1):
1. Pad the coefficient vectors of p and q, u and v, with n zeros and view them as column
vectors $u' = [u_0, u_1, \ldots, u_{n-1}, 0, 0, \ldots, 0]^T$ and $v' = [v_0, v_1, \ldots, v_{n-1}, 0, 0, \ldots, 0]^T$.
2. Compute the DFTs $y = Fu'$ and $z = Fv'$.
3. Multiply y and z component-wise, to get $(y \cdot z)[i] = y_i \cdot z_i$ for $i = 0, 1, \ldots, 2n-1$.
4. Compute the inverse DFT of the product, that is $r = F^{-1}(Fu' \cdot Fv')$.
[Figure: flow of the scheme – pad $u_0, \ldots, u_{n-1}$ and $v_0, \ldots, v_{n-1}$ with n zeros, take DFTs, multiply pointwise, take the inverse DFT to get $r_0, r_1, \ldots, r_{2n-1}$.]
Figure 2.1. Using the computation of $r = u * v$ for polynomial multiplication.
The success of the above approach hinges on the following theorem:

Theorem 2.6 (The convolution theorem). Given two length-n vectors u and v, padded with
zeros to length-2n vectors u' and v', $u * v = F^{-1}(Fu' \cdot Fv')$.
Proof. By Lemma 2.5 it is enough to show that $F(u * v) = Fu' \cdot Fv'$. Since the upper halves
of u' and v' are zero,
$$(Fu' \cdot Fv')[i] = \left(\sum_{j=0}^{n-1} u_j \omega^{ij}\right) \cdot \left(\sum_{k=0}^{n-1} v_k \omega^{ik}\right) = \sum_{j=0}^{n-1}\sum_{k=0}^{n-1} u_j v_k \omega^{i(j+k)}, \quad \text{for } i = 0, 1, \ldots, 2n-1.$$
Now consider $F(u * v)$. By the definition of convolution and the discrete Fourier transform,
$$F(u * v)[i] = \sum_{l=0}^{2n-1} \omega^{il} \sum_{j=0}^{2n-1} u_j v_{l-j}.$$
Change the order of the summations and substitute k for l − j to get
$$F(u * v)[i] = \sum_{j=0}^{2n-1} \sum_{k=-j}^{2n-1-j} u_j v_k \omega^{i(j+k)}.$$
$v_k$ is undefined for k < 0, so the second sum can be started at k = 0. Furthermore, $u_j = 0$
for j > n − 1, and so the upper limit in the first sum can be lowered. Once that substitution
is made, the upper limit on the second summation is always at least n. Thus, since $v_k = 0$
for k > n − 1, the upper limit in the second sum may also be lowered to n − 1. Consequently,
$$F(u * v)[i] = \sum_{j=0}^{n-1}\sum_{k=0}^{n-1} u_j v_k \omega^{i(j+k)}.$$
The method for multiplying two polynomials now involves computing two DFTs, doing a linear time pointwise multiplication and computing an inverse DFT. For this to be
any faster than the obvious $\Theta(n^2)$ time algorithm, the crux is to find a fast algorithm for
computing the DFTs.
2.1.3 The Fast Fourier Transform Algorithm
The Fast Fourier Transform algorithm computes the DFT of a length-n vector in O(n log n)
time. The algorithm applies the divide-and-conquer approach to polynomial evaluation by
observing that if n is even, a polynomial of degree n − 1, $p(x) = u_0 + u_1 x + u_2 x^2 + \ldots + u_{n-1}x^{n-1}$, can be divided into two polynomials of degree n/2 − 1, $p_e(x) = u_0 + u_2 x + u_4 x^2 + \ldots + u_{n-2}x^{n/2-1}$ and $p_o(x) = u_1 + u_3 x + u_5 x^2 + \ldots + u_{n-1}x^{n/2-1}$, such that $p(x) = p_e(x^2) + x\,p_o(x^2)$. Now, the DFT evaluates p at the nth roots of unity, and by Property 2.3, the values
$(\omega^2)^0, \omega^2, (\omega^2)^2, \ldots, (\omega^2)^{n-1}$ are (n/2)th roots of unity. Hence, $p_e$ and $p_o$ can be evaluated
at these values, and the same values can be used to evaluate p. This is the key observation
used in Algorithm 4.
Why does the pseudo-code in Algorithm 4 work correctly? Lines 1–2 constitute the
basis of the recursion, correctly returning a vector with the one entry $u_0$. Lines 3 and
11 keep ω updated properly, so that when the for loop is executed, $r = \omega^i$. Lines 6–7
perform the recursive DFT computations, setting $y^e_i = p_e(\omega^{2i})$ and $y^o_i = p_o(\omega^{2i})$.
Algorithm 4: Recursive FFT algorithm.
Input: u, ω, n
Output: The DFT of u.
FFT(u, ω, n)
(1)   if n = 1
(2)       return y ← u
(3)   r ← ω^0
(4)   u^e ← [u_0, u_2, u_4, . . . , u_{n−2}]
(5)   u^o ← [u_1, u_3, u_5, . . . , u_{n−1}]
(6)   y^e ← FFT(u^e, ω^2, n/2)
(7)   y^o ← FFT(u^o, ω^2, n/2)
(8)   for i ← 0 to n/2 − 1
(9)       y_i ← y^e_i + r · y^o_i
(10)      y_{i+n/2} ← y^e_i − r · y^o_i
(11)      r ← r · ω
(12)  return y
Lines 9–10 combine the results of the recursive calculations to get $y_i = y^e_i + \omega^i y^o_i = p_e(\omega^{2i}) + \omega^i p_o(\omega^{2i}) = p(\omega^i)$ for $i = 0, 1, \ldots, n/2-1$. For $y_{n/2}, y_{n/2+1}, \ldots, y_{n-1}$, we have $y_{i+n/2} = y^e_i - \omega^i y^o_i = y^e_i + \omega^{i+n/2}y^o_i = p_e(\omega^{2i}) + \omega^{i+n/2}p_o(\omega^{2i}) = p_e(\omega^{2i+n}) + \omega^{i+n/2}p_o(\omega^{2i+n}) = p(\omega^{i+n/2})$. Thus, the vector y
returned by Algorithm 4 is the DFT of u.
Within the for loop each value $y^o_i$ is multiplied by $\omega^i$. The product is both added to and
subtracted from $y^e_i$. The factors $\omega^i$ are known as twiddle factors. The operation of multiplying $y^o_i$ by the twiddle factor and adding and subtracting the product from $y^e_i$ is known
as a butterfly operation and is usually drawn as in Figure 2.2.
[Figure: the butterfly takes the inputs $y^e_i$ and $y^o_i$ to the outputs $y^e_i + \omega^i y^o_i$ and $y^e_i - \omega^i y^o_i$.]
Figure 2.2. A schematic drawing of a butterfly operation.
What about running time? Exclusive of the recursive calls, each invocation takes time
Θ(n). The recurrence for the running time is therefore $T(n) = 2T(n/2) + \Theta(n) = \Theta(n \log n)$.
By Lemma 2.5, Algorithm 4 can be used for computing the inverse DFT: switch the
roles of u and y, replace ω by $\omega^{-1}$ and divide each element of the result by n. Consequently,
the inverse DFT can be computed in time Θ(n log n) as well.
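For concreteness, Algorithm 4 transcribes almost line by line into C99, here over complex doubles with $\omega = e^{2\pi i/n}$ rather than over $Z_p$ (the field used by the project code, see Chapter 3). The function name fft_rec is illustrative and the sketch favours clarity over speed:

#include <complex.h>
#include <stdlib.h>

/* Recursive radix-2 FFT, a direct transcription of Algorithm 4 using C99
   complex doubles; n must be a power of two and w a primitive nth root of
   unity (e.g. cexp(2*M_PI*I/n) for the forward transform). */
void fft_rec(const double complex *u, double complex *y, size_t n, double complex w)
{
    if (n == 1) { y[0] = u[0]; return; }

    double complex *ue = malloc(n / 2 * sizeof *ue);
    double complex *uo = malloc(n / 2 * sizeof *uo);
    double complex *ye = malloc(n / 2 * sizeof *ye);
    double complex *yo = malloc(n / 2 * sizeof *yo);

    for (size_t i = 0; i < n / 2; i++) {     /* split into even and odd parts */
        ue[i] = u[2 * i];
        uo[i] = u[2 * i + 1];
    }
    fft_rec(ue, ye, n / 2, w * w);           /* w^2 is a primitive (n/2)th root */
    fft_rec(uo, yo, n / 2, w * w);

    double complex r = 1.0;                  /* r = w^i, the twiddle factor */
    for (size_t i = 0; i < n / 2; i++) {
        y[i]         = ye[i] + r * yo[i];    /* butterfly */
        y[i + n / 2] = ye[i] - r * yo[i];
        r *= w;
    }
    free(ue); free(uo); free(ye); free(yo);
}

/* Inverse transform: call with w replaced by its reciprocal and divide every
   element of the result by n, exactly as described above. */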
2.1.4 Avoiding Recursion
The main obstacle in the way of avoiding recursion in the FFT algorithm is the splitting of
the input array into even and odd halves. What is the net effect of these permutations? The
recursion call tree for input size n = 8 is shown in Figure 2.3. In the first recursive call,
all elements whose indices have 0 as their least significant bit are brought to the even
half. Likewise, all indices with 1 as their least significant bit are brought to the odd half.
Thinking of an in-place implementation of the algorithm, an element starting at an index
with b as its least significant bit ends up at an index with b as its most significant bit.
In the next level of recursion this "inverse shuffle" is repeated on each half of the input
array, now addressing the second least significant bit of the indices. Thus, if an element
starts out at index $[b_{l-1} \ldots b_2 b_1 b_0]$, it ends up at index $[b_0 b_1 b_2 \ldots b_{l-1}]$, where $l = \log n$. This
permutation is known as the bit reversal of the input array.
[Figure: recursion tree splitting $[u_0, u_1, \ldots, u_7]$ into even- and odd-indexed halves down to single elements.]
Figure 2.3. The input vectors to the calls of the recursive FFT algorithm. The initial invocation is
for n = 8.
Thus, if the input array could be arranged in bit reversed order, the behaviour of the
recursive FFT could be mimicked. First, take the elements in pairs, compute the DFT of
each pair and replace the pair with its DFT. Next, take the n/2 DFTs in pairs and compute
the DFT of the four vector elements they come from by two butterfly operations, replacing
two 2-element DFTs with one 4-element DFT. Continue in this manner until the vector
holds two (n/2)-element DFTs, which are combined into the final n-element DFT using
n/2 butterfly operations. This computational flow is illustrated in Figure 2.4.
Now it should be easy to translate the essence of Figure 2.4 to the pseudo-code of
Algorithm 5. An array holds the elements of the input vector, but in bit reversed order.
Since the combining has to be done on each level of the tree, a variable is introduced to
count the levels.
The call to BRC runs in O(n log n) time since an integer with log n bits
certainly can be reversed in time O(log n), and this is done n times. The loop of lines 5–12
iterates n/m = n/2lev times for each level and the innermost for loop iterates 2lev −1 times.
[Figure: FFT circuit for n = 8 – three stages of butterflies taking the bit-reversed inputs $u_0, \ldots, u_7$ to the outputs $y_0, \ldots, y_7$, with twiddle factors $(\omega^4)^0$, $(\omega^2)^j$ and $\omega^j$ at the successive stages.]
Figure 2.4. Computational flow of the iterative FFT algorithm for n = 8 inputs. Also known as
an FFT circuit.
If L(n) is the number of times lines 9–12 are executed, then
$$L(n) = \sum_{lev=1}^{\log n} \frac{n}{2^{lev}} \cdot 2^{lev-1} = \sum_{lev=1}^{\log n} \frac{n}{2} = \Theta(n \log n).$$
Thus Algorithm 5 has running time Θ(n log n).
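A direct, illustrative C99 transcription of the bit-reversal copy and the iterative butterflies might look as follows; as above it is written over complex doubles for brevity, and the names brc and fft_iter are made up for the example:

#include <complex.h>
#include <stddef.h>

/* Bit-reversal copy (the BRC step): y[rev(i)] = u[i] over l = log2(n) bits. */
static void brc(const double complex *u, double complex *y, size_t n)
{
    size_t l = 0;
    while (((size_t)1 << l) < n) l++;
    for (size_t i = 0; i < n; i++) {
        size_t r = 0;
        for (size_t b = 0; b < l; b++)          /* reverse the l index bits */
            r |= ((i >> b) & 1) << (l - 1 - b);
        y[r] = u[i];
    }
}

/* Iterative radix-2 FFT in the spirit of Algorithm 5: n a power of two,
   w a primitive nth root of unity. */
void fft_iter(const double complex *u, double complex *y, size_t n, double complex w)
{
    brc(u, y, n);
    for (size_t m = 2; m <= n; m *= 2) {        /* m = 2^lev */
        double complex rm = cpow(w, (double)(n / m));   /* w^(n/m) */
        for (size_t k = 0; k < n; k += m) {     /* each m-point sub-DFT */
            double complex r = 1.0;
            for (size_t j = 0; j < m / 2; j++) {
                double complex t = r * y[k + j + m / 2];
                double complex s = y[k + j];
                y[k + j]         = s + t;       /* butterfly */
                y[k + j + m / 2] = s - t;
                r *= rm;
            }
        }
    }
}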
2.2 Computer Memory Hierarchy
For a more complete overview than given here, see [33]. A nice exposition of the subject is
available in [19].
During the last 20 years CPUs have been faster than memories. As it has become
possible to put more and more circuits on a chip, CPU designers have used this to make
CPUs go even faster, by introducing pipelining and superscalarity. Memory designers,
usually, have increased the capacity of their chips, not the speed. This has resulted in an
imbalance where a CPU may be able to perform hundreds of primitive operations in the
time it takes for a word from memory to arrive at the CPU after it has been requested. If the
CPU operation is dependent on the memory word en route, there is no choice but to stall, be
it in hardware or software.
Actually, it is possible to build memories that are as fast as CPUs, but to run at full
speed, they have to be located on the CPU chip. This is because the memory bus is very
slow. So, it is a trade-off between chip size and cost.
To give the illusion of having a large amount of fast memory the concept of a cache is
introduced. The basic idea is simple: the most heavily used memory words are kept in the
cache. When the CPU needs a word, it first looks in the cache. If the word is not there, it
goes to the main memory.
Algorithm 5: Iterative FFT algorithm.
Input: u, ω, n
Output: The DFT of u.
FFT(u, ω, n)
(1)   BRC(u, y)
(2)   for lev ← 1 to log n
(3)       m ← 2^{lev} and r_m ← ω^{n/m}
(4)       for k ← 0 to n − 1 by m
(5)           r ← 1
(6)           for j ← 0 to m/2 − 1
(7)               t ← r · y[k + j + m/2]
(8)               u ← y[k + j]
(9)               y[k + j] ← u + t
(10)              y[k + j + m/2] ← u − t
(11)              r ← r · r_m
(12)  return y
A most effective technique for improving both memory bandwidth and latency is the
use of multiple caches. A basic approach is to have separate caches for instructions and
data. This is called a split cache, as opposed to a unified one and memory operations can
be initiated independently in each cache.
Many modern memory systems use an additional cache, called a level 2 cache, that
resides between the data and instruction caches and main memory. On a typical system
the CPU chip itself contains a small split cache, usually 16 KB to 64 KB. Then there is a
unified L2 cache, in the same package as the CPU, of size 512 KB to a few MB. A third-level cache, consisting of a few megabytes of SDRAM, may be situated on the processor
board. The caches are generally inclusive, so that the contents of the L1 cache is present in
the L2 cache and so on.
Caches depend heavily on locality to work well. Temporal locality occurs when recently used memory words are accessed again. This may be the case, for example, inside
a loop or when addressing memory close to the top of the stack. Spatial locality is the
phenomenon that memory locations close in numerical address are likely to be accessed in
rapid sequence. Think of a matrix of values being manipulated, for example.
All caches use the model of dividing main memory into fixed-size blocks called cache
lines. A cache line may consist of 4 to 64 consecutive bytes. When memory is referenced,
the cache controller checks to see if the word referenced is in the cache. If the word is not
there, some line entry is removed and the line needed is fetched from memory.
The simplest cache is a direct-mapped cache. Each entry in the cache can hold exactly
one cache line from main memory and a given memory word can be stored in exactly one
place within the cache.
This mapping scheme puts consecutive memory lines in consecutive cache entries, up
to the cache size. However, two lines that differ in their address by precisely the size of the
cache, or a multiple thereof, can not be stored in the cache at the same time. For example, if
16
a program accesses data at location X and next needs data at location X + sizeof(cache), the
second instruction will force the cache entry to be reloaded. If this happens often enough,
it results in very poor performance. In fact, the worst-case behaviour of a cache is worse
than having no cache at all, since each memory operation involves reading an entire cache
line instead of just one word.
A solution to the above problem is to allow two or more lines in each cache entry. An
n-way set-associative cache can hold n possible entries for each address. Of course, this
kind of cache is more complicated than a direct-mapped cache. But, experience shows
that two-way and four-way caches perform well enough to make the introduction of extra
circuitry worthwhile.
When a new entry arrives at a set-associative cache, the question is which resident
item is to be discarded. For most purposes the LRU (Least Recently Used) strategy is a
pretty good one. When it becomes time to replace an entry, the least recently accessed is
discarded.
Chapter 3
FFT Based Integer Multiplication in a Finite Field
In most FFT based integer multiplication implementations the nth root of unity is chosen
as the complex number e2πi/n . The roots of unity are then approximated by floating point
complex numbers. Since it has been very difficult to guarantee an exact result (to some
number of bits, see [29] for some tighter bounds though) this is not an option for GMP,
where exactness is an absolute demand.
Instead, think of implementing FFTs in Z/Zm, the ring of integers modulo m ∈ Z. What
properties of the ring are needed? Obviously addition, subtraction and multiplication are no
problem. Division is only needed for dividing by the transform length in the inverse FFT,
but this still rules out some choices of m. Furthermore nth roots of unity are needed. For
choices of m that meet these demands the FFTs are called Number Theoretic Transforms
(NTTs). In this project only prime moduli will be considered.
Since all numbers involved in an NTT are integers, no round-off errors can occur. This
makes very long transforms possible. Also, if the result overflows (calculations are done
modulo p) the calculation can be repeated modulo several different primes and combined
using the Chinese Remainder Theorem.
The greatest risk factor is that integer arithmetic is slower than floating point on most
machines. One goal of the project is therefore to see if the new 64 bit architectures with
strong integer multiply support will be able to pick up the slack.
One other thing to note is that when using NTTs, all physical meaning of the DFT is
lost. The transform is no longer one between the time and frequency domains, but simply
a game of numbers.
3.1 Feasibility of mod p FFTs
The idea to perform the FFT in a finite field is due to J. M. Pollard. This idea and its
application to integer multiplication, amongst many other things, are presented in [30].
This section will deal with showing that the necessary conditions for applying the FFT
over the finite field of integers modulo a prime are met.
The goal is to investigate primes near the project target machines' word size, i.e. 64 bits,
with the intention of using the Chinese remainder theorem and as many primes as needed.
For this to become reality, fields $Z_p$ having primitive $n = 2^m$th roots of unity ($m \ge 50$, to
meet the demands of GMP 5) need to exist, and preferably there should be plenty of them. When
given a field $Z_p$ with a primitive nth root of unity, it is of interest to be able to find the root
efficiently, since $Z_p$ may have many elements.
Theorem 3.1. $Z_p$ has a primitive nth root of unity if and only if $n \mid (p-1)$.

Proof. (⇐) Let $n \mid (p-1)$. Now $Z_p$ is a finite field and so $Z_p^*$ is cyclic. (A proof of this can be
found in [3], theorem 6.5.10.) Thus, $Z_p^*$ has a generator, say g. By Fermat's little theorem,
$g_n = g^{(p-1)/n}$ has multiplicative order n in $Z_p^*$. Hence $g_n$ is a primitive nth root of unity.
(⇒) The order of a group element divides the order of the group. (This follows from
Lagrange's theorem, proved in [3], theorem 3.2.10.) $Z_p^*$ has order p − 1.
A mod p FFT of length $n = 2^m$ apparently requires primes such that $n \mid (p-1)$, i.e., of
the form $p = 2^d k + 1$, $d \ge m$. Any such prime could be used for computing an FFT of
length $2^m$ for $m \le d$. How many primes of this form are there?
It turns out that there are plenty of primes of the correct form, mainly due to:

Theorem (Generalised prime number theorem). Let $a, b \in \mathbb{Z}$ satisfy $\gcd(a, b) = 1$. Then
the number of primes $\le n$ in the progression $ak + b$ ($k = 1, 2, \ldots$) is approximately
$(n/\ln n)/\varphi(a)$, where φ is Euler's totient function.
Proof. See [10].

Lemma 3.2. If $m = p^\alpha$ is a power of a prime then $\varphi(m) = p^\alpha(1 - \frac{1}{p})$.

Proof. The totient function φ(n) counts the number of positive integers ≤ n that are relatively prime to n. If $m = p^\alpha$ is a power of a prime, then the numbers having a common factor
with m are the multiples of p. There are $p^{\alpha-1}$ of these, so
$$\varphi(m) = \varphi(p^\alpha) = p^\alpha - p^{\alpha-1} = p^{\alpha-1}(p-1) = p^\alpha\left(1 - \frac{1}{p}\right).$$
As a consequence of the Generalised prime number theorem and Lemma 3.2, the number
of primes of the form $p = 2^f k + 1 \le n$, k odd, is approximately $(n/\ln n)/2^{f-1}$.
If $n = 2^{64}$ and $f = 50$, the conclusion is that there are around seven hundred primes
$p = 2^d k + 1$, k odd, with $d \ge f = 50$.
In what proportion do generators of $Z_p^*$ occur? It is known from group theory that
the number of generators is $\varphi(p-1)$. Number theory says that the average value of φ(n)
over all integers n is $6n/\pi^2$ (see [16]). In probabilistic terms, an element drawn at random
from $Z_p$ is expected to be a generator with probability greater than $3/\pi^2 \approx 0.3$.
The following theorem gives an efficient method for testing whether an element of a cyclic
group generates the group or not.

Theorem 3.3. $a \in Z_p^*$ is a generator $\iff a^{(p-1)/q} \ne 1$ in $Z_p^*$ for every prime factor q of p − 1.

Proof. (⇒) Indeed.
(⇐) Assume a has order $k < p - 1$. $k \mid (p-1)$ by Lagrange's theorem, so $p - 1 = mk$. Let
$m = qr$, where q is a prime factor of m. Now $p - 1 = qrk$, so q is also a factor in the
prime factorisation of p − 1. So $a^{(p-1)/q} = a^{rk} = (a^k)^r = 1$, proving the contrapositive.
With Theorem 3.3 in hand, the method of testing the elements 2, 3, 4, . . . of $Z_p$ until a
generator is found seems as good as any. This is also what is suggested in algorithm 4.80
in [26].
The conclusion is that FFTs over $Z_p$ are possible; there exist plenty of primes having
the desired characteristics. Finding these primes, along with generators of their multiplicative groups, should not pose any problems.
3.2 “Three primes” Algorithm for Integer Multiplication
The aim is now to derive an integer multiplication algorithm based on FFTs over $Z_p$. Consider the integers of Chapter 2, $U = (u_{n-1} \ldots u_0)_B$ and $V = (v_{n-1} \ldots v_0)_B$, and their corresponding polynomials $p(x) = \sum_{i=0}^{n-1} u_i x^i$ and $q(x) = \sum_{i=0}^{n-1} v_i x^i$. The first proposition is to
choose B < W (W being the machine word size), with a view to computing the polynomial
product $r^{(k)}(x) = p(x)q(x) \bmod p_k$ for an adequate number K of primes $p_k$, $B \le p_k \le W$.
$r(x)$ may then be recovered thanks to the Chinese remainder theorem. The algorithm was
dubbed the "three primes" algorithm in [25].
Algorithm 6: "Three primes" algorithm for integer multiplication.
Input: U = (u_{n−1} . . . u_0)_B, V = (v_{n−1} . . . v_0)_B
Output: The product of U and V.
TP(U, V)
(1)   for k ← 0 to K − 1
(2)       Compute r^{(k)} = p(x)q(x) over Z_{p_k}[x] using the scheme of Figure 2.1.
(3)   Solve the (polynomial) linear integer congruences u(x) ≡ r^{(k)}(x) (mod p_k).
(4)   r(x) ← the least positive solution to the congruences.
(5)   return R ← r(B)
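The Chinese remaindering of steps 3 and 4 can, for each coefficient, be realised with GMP's mpz layer roughly as in the following sketch (crt_combine is an illustrative name; this is not the project code):

#include <gmp.h>

/* Combine residues r[k] modulo the distinct word-sized primes p[k] into the
   least nonnegative x with x = r[k] (mod p[k]) for all k = 0 .. K-1. */
void crt_combine(mpz_t x, const unsigned long *r, const unsigned long *p, int K)
{
    mpz_t M, pk, t, inv;
    mpz_init(M); mpz_init(pk); mpz_init(t); mpz_init(inv);

    mpz_set_ui(x, r[0]);
    mpz_set_ui(M, p[0]);                 /* M = product of primes used so far */
    for (int k = 1; k < K; k++) {
        mpz_set_ui(pk, p[k]);
        mpz_set_ui(t, r[k]);
        mpz_sub(t, t, x);                /* t = (r[k] - x) * M^{-1} mod p[k]  */
        mpz_mod(t, t, pk);
        mpz_invert(inv, M, pk);          /* exists: the p[k] are distinct primes */
        mpz_mul(t, t, inv);
        mpz_mod(t, t, pk);
        mpz_addmul(x, M, t);             /* x now satisfies the first k+1 congruences */
        mpz_mul(M, M, pk);
    }
    mpz_clear(M); mpz_clear(pk); mpz_clear(t); mpz_clear(inv);
}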
What can be said about K? Each coefficient $r_m = \sum_{i+j=m} u_i v_j$ of r(x) is $< nB^2$. Thus, the first
step of Algorithm 6 calculates r(x) correctly if $p_0 p_1 \cdots p_{K-1} \ge nB^2$. Now, each $p_k \ge B$, so
the critical condition is satisfied if $B^K \ge nB^2$, i.e., if $K \ge \log_B(n) + 2$. In the range where
$n \le B$, three primes are sufficient, hence the name "three primes" algorithm. Naturally, the
constraint that there has to be a suitable root of unity for each $p_k$ also restricts the size of
n. ($n \le 2^{D-1}$ has to be satisfied, if D is the largest integer for which three primes of the
form $p = 2^d l + 1$, with $d \ge D$, exist.)
Step 1 of Algorithm 6 takes O(n log n) time. The Chinese remaindering step involves
$3\cdot(2n-1)$ integer congruences, yielding O(n) time. It should be obvious that the evaluation-at-radix step requires O(n log n) time. If the computer has a fixed word size W, making
operations mod p possible in O(1) time, Algorithm 6 multiplies two n-digit base-B integers,
B < W, in time O(n log n) (observing the tacit constraints on n).
The range of n where the three primes algorithm is O(n log n) cannot be extended indefinitely while retaining the same time complexity. To extend the range, W would have to be
increased, and the assumption that mod p operations take O(1) time would soon become indefensible.
Since the project target machines have a 64-bit word size, the limits now imposed on the
problem size are of no practical importance. Indeed, multiplication of operands larger than
the available RAM on a typical workstation is possible.
Of course, nothing says that the polynomial coefficients have to be taken as word-sized
pieces of the input operands. As long as $p_0 p_1 \cdots p_{K-1} \ge nB^2$, and suitable roots of unity
exist, any number of consecutive bits can be used as polynomial coefficients. There is
an interesting trade-off between fewer primes (polynomials of higher degree, lower hidden
constant in complexity terms) and more primes (polynomials of lower degree, bigger hidden constant). In Figure 3.1 the maximum possible B for different numbers of primes and
input sizes is plotted.
[Figure: chunk size (bits) versus operand size (64-bit words) for one to five primes.]
Figure 3.1. The maximum number of bits per machine word for different operand sizes. Assumes
the high bit of the primes is set.
Chapter 4
Implementation Issues
The project code was written in C and based on primitives from the GMP framework. To
get a feel for the FFT algorithm and the way it manipulates its operands, straightforward
implementations of Algorithms 4 and 5 were made.
As could be expected, the iterative formulation was found to be faster than the recursive
one when the operands were small enough to fit in the L1 cache. As soon as the operands
were bigger than that, the recursive variant performed better. This is because it takes
just a few levels of recursion to reach a working set that resides in the L1 cache. Could there be a
way to combine the speed of the iterative inner loop with the nice locality properties of the
recursive approach? This is described in Section 4.1.
The true workhorse of this multiplication scheme is the inner loop of the FFT. In this
setting, working modulo p, a word-sized prime, the main obstacle is the modular reduction
of the product that results from multiplying with the twiddle factors (line 8 of Algorithm
5). If nothing special is done, this involves trial division, an operation that would make the
inner loop far too slow to be competitive at this level. Section 4.2 describes the project's
answer to this problem.
In Section 4.3 a scheme is presented to make the performance of the code behave less like a set of
stairs with increasing operand size.
4.1 Cache Locality
FFT algorithms in general have very poor memory locality. The data is accessed in strides
with large skips. The project code FFT only accepts power-of-two input lengths, also
known as a radix-2 FFT. In this kind of algorithm the skips are, naturally, powers of two,
making them particularly bad on direct-mapped cache systems. It could be the case that we
get 100 % cache misses!
One way to tackle this problem could be to try to rearrange the usual operand order in
a more cache-friendly manner. A possibility would be to treat the computational flow of
the FFT, see Figure 2.4, as a call tree and traverse it as locally as possible. This has been
tried, successfully, with the kind of FFT that GMP is using. For the project, this scheme
was implemented and discarded, due to it being less fit for an FFT over $Z_p$, mainly because
of the extra computation of powers of ω needed with this method. (With the Fermat style
FFT, the extra computations reduce to additions and subtractions, while a mod p FFT gets
an extra multiplication and modular reduction inside the inner loop.)
In [1], Bailey uses a method for improving memory locality of FFTs, originating in
ideas presented in [11] concerning the FFT and hierarchical memory systems:
Assume that the transform length $n = n_1 n_2$. View the data as stored in an $n_1 \times n_2$ matrix,
$A_{jk}$. The "Four-step" FFT algorithm proceeds as follows:
1. Do $n_2$ transforms of length $n_1$.
2. Multiply each matrix element $A_{jk}$ by $\omega^{\pm jk}$, with the sign following the sign of the transform.
3. Transpose the matrix.
4. Do $n_1$ transforms of length $n_2$.
If $n_1$ and $n_2$ are chosen to be as close to $\sqrt{n}$ as possible, this algorithm will have much
better memory performance due to the shorter transform lengths. For the above formulation
to be optimal, the memory would have to be arranged as in Fortran, with the columns in
linear memory. Assume, instead, that memory is ordered as in C, i.e., the element $A_{jk}$ is
stored at address $jn_2 + k$. First, let us verify that the algorithm really does produce the DFT
of the data:
$$A(k) = \sum_{j=0}^{n-1} \omega^{jk} a(j).$$
A transform of length $n_1$ is performed over column $k_2$. The correct root of unity is $\omega^{n_2}$,
so
$$A_1(k_1 n_2 + k_2) = \sum_{j_1=0}^{n_1-1} \omega^{j_1 k_1 n_2} a(j_1 n_2 + k_2)$$
is the element in row $k_1$ and column $k_2$. Then, the multiplication step:
$$A_2(k_1 n_2 + k_2) = \omega^{k_1 k_2} A_1(k_1 n_2 + k_2).$$
After transposition:
$$A_3(k_2 n_1 + k_1) = A_2(k_1 n_2 + k_2).$$
Now a transform of length $n_2$ over column $k_1$ is performed. The root of unity used is $\omega^{n_1}$,
so
$$A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} \omega^{j_2 k_2 n_1} A_3(j_2 n_1 + k_1).$$
Substitute $A_2$ for $A_3$ to get
$$A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} \omega^{j_2 k_2 n_1} A_2(k_1 n_2 + j_2),$$
since $A_3(j_2 n_1 + k_1) = A_2(k_1 n_2 + j_2)$. Further, substituting $A_1$ for $A_2$ yields
$$A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} \omega^{j_2 k_2 n_1 + j_2 k_1} A_1(k_1 n_2 + j_2),$$
which, by the definition of $A_1$, leaves us with
$$A_4(k_2 n_1 + k_1) = \sum_{j_2=0}^{n_2-1} \omega^{j_2 k_2 n_1 + j_2 k_1} \sum_{j_1=0}^{n_1-1} \omega^{j_1 k_1 n_2} a(j_1 n_2 + j_2).$$
By changing the order of summation:
$$A_4(k_2 n_1 + k_1) = \sum_{j_1=0}^{n_1-1}\sum_{j_2=0}^{n_2-1} \omega^{j_1 k_1 n_2 + j_2 k_2 n_1 + j_2 k_1} a(j_1 n_2 + j_2).$$
Notice that
$$(k_2 n_1 + k_1)(j_1 n_2 + j_2) = j_1 k_2 n_1 n_2 + j_1 k_1 n_2 + j_2 k_2 n_1 + j_2 k_1.$$
Now, $\omega^{n_1 n_2} = \omega^n = 1 \Rightarrow \omega^{j_1 k_2 n_1 n_2} = 1$, and finally,
$$A_4(k_2 n_1 + k_1) = \sum_{j_1=0}^{n_1-1}\sum_{j_2=0}^{n_2-1} \omega^{(k_2 n_1 + k_1)(j_1 n_2 + j_2)} a(j_1 n_2 + j_2).$$
Identify k with $k_2 n_1 + k_1$ and note that letting j go from 0 to n − 1 is the same as running
$j_1 n_2 + j_2$ with $j_1$ going from 0 to $n_1 - 1$ and, for each $j_1$, letting $j_2$ go from 0 to $n_2 - 1$. It is
now clear that $A_4(k_2 n_1 + k_1)$ is precisely $A(k)$, the DFT of $a(j)$, partitioned into two parts.
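The decomposition is easy to prototype. In the sketch below, a naive O(len²) strided DFT stands in for the short transforms, so the code demonstrates only the index bookkeeping of the four steps, not the performance; four_step_dft and dft_strided are made-up names, and sign selects the forward (+1) or inverse (−1) direction.

#include <complex.h>
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive strided DFT: out[k] = sum_j in[j*stride] * exp(sign*2*pi*i*j*k/len). */
static void dft_strided(const double complex *in, double complex *out,
                        int len, int stride, int sign)
{
    for (int k = 0; k < len; k++) {
        double complex s = 0;
        for (int j = 0; j < len; j++)
            s += in[j * stride] * cexp(sign * 2.0 * M_PI * I * (double)j * k / len);
        out[k] = s;
    }
}

/* Four-step DFT of a[0..n1*n2-1], result in y; the final transform of each
   column of the transposed matrix leaves A(k) at y[k2*n1 + k1] = y[k]. */
void four_step_dft(const double complex *a, double complex *y, int n1, int n2, int sign)
{
    int n = n1 * n2;
    double complex *t   = malloc(n * sizeof *t);
    double complex *col = malloc((n1 > n2 ? n1 : n2) * sizeof *col);

    /* Steps 1 and 2: n2 column transforms of length n1 (stride n2), then
       multiply element (k1, j2) by w^{sign*k1*j2}. */
    for (int j2 = 0; j2 < n2; j2++) {
        dft_strided(a + j2, col, n1, n2, sign);
        for (int k1 = 0; k1 < n1; k1++)
            t[k1 * n2 + j2] = col[k1] *
                cexp(sign * 2.0 * M_PI * I * (double)k1 * j2 / n);
    }

    /* Step 3: transpose the n1 x n2 matrix t into the n2 x n1 matrix y. */
    for (int k1 = 0; k1 < n1; k1++)
        for (int j2 = 0; j2 < n2; j2++)
            y[j2 * n1 + k1] = t[k1 * n2 + j2];

    /* Step 4: n1 column transforms of length n2 (stride n1) in place. */
    for (int k1 = 0; k1 < n1; k1++) {
        dft_strided(y + k1, col, n2, n1, sign);
        for (int k2 = 0; k2 < n2; k2++)
            y[k2 * n1 + k1] = col[k2];
    }
    free(col);
    free(t);
}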
Since the project code is written in C, the original scheme was modified slightly. The
column FFTs of the first step are made by gathering the elements into a small scratch
space, so that only the copying back and forth is slow. The copying back is interleaved
with the multiplications by ω. To further increase the cache friendliness, one cache line of
padding is inserted after each row of the matrix, as per Figure 4.1.
4.2 Modular Reduction
To get the inner loop of the FFT running smoothly, the problem of modular reduction has to
be solved efficiently. We have xy, with 0 ≤ x, y < p, and want xy mod p, preferably without
having to do any costly trial division.
The project code is aimed at large operands, so a linear amount of extra work, such as
some precomputation, could well be worth the effort. In [27] such a scheme is proposed.
[Figure: an $n_1 \times n_2$ matrix with one cache line of padding after each row.]
Figure 4.1. A cache line of unused memory is inserted after each row to alleviate cache associativity problems.
This scheme has become very widely used and is called Montgomery reduction after its
inventor:
For a fixed N a new residue system is chosen; for 0 ≤ i < N, i represents the residue
class containing $iR^{-1}$, where R > N and $\gcd(N, R) = 1$. This will help in computing
$TR^{-1} \bmod N$ when $0 \le T < RN$, as in Algorithm 7.
Algorithm 7: Reduction algorithm.
Input: T with 0 ≤ T < RN
Output: T R^{−1} mod N
REDC(T)
(1)   m ← (T mod R) N^{−1} mod R
(2)   t ← (T − mN)/R
(3)   if t ≥ 0 then return t
(4)   else return t + N
$mN \equiv T \pmod R$, so t is an integer. Furthermore, $tR \equiv T \pmod N$, so $t \equiv TR^{-1} \pmod N$.
Also, $0 \le T < RN$ and $0 \le mN < RN$, so $-N < t < N$.
Now, given our x and y, let z = REDC(xy). Then $z = (xy)R^{-1}$, so $(xR^{-1})(yR^{-1}) \equiv zR^{-1} \pmod N$. Since $0 \le z < N$ holds, z is the product of x and y in this representation. Observe
that addition and subtraction, as well as other operations, remain unchanged.
In the project code R was chosen as $2^{64}$, making division by R just a matter of taking the high
word of a product, and reduction mod R just a matter of taking the low word.
The primes were initially meant to be any 64-bit primes offering long enough transform lengths. Then a tip reached the author via Niels Möller and [28]:
Consider primes of the form $p = 2^n - k2^m + 1$, where $1 \le m < n$ and $k < 2^{n-m}$ is odd
and positive. When $n \le 2m$, $p^{-1} \bmod 2^n = k2^m + 1$. Now Montgomery reduction mod p
becomes (in pseudo-code style):
IN:  x in 0 .. 2^n*p - 1
OUT: r in 0 .. p-1 with r == x/2^n (mod p)

STEP 1: Set x0 = x & (2^n - 1) and set x1 = x >> n
STEP 2: Compute t = (((k*x0) << m) + x0) & (2^n - 1)
STEP 3: Compute u = ((t << n) - ((k*t) << m) + t) >> n
STEP 4: If x1 >= u then set r = x1 - u
        Otherwise set r = x1 - u + p
This looked very nice, since every step only involves shifts, additions and subtractions,
and for small k it can be done without multiplications. Furthermore, there exist three
very nice 64-bit primes fitting this purpose: $p_0 = 2^{64} - 2^{32} + 1$, $p_1 = 2^{64} - 2^{34} + 1$ and
$p_2 = 2^{64} - 2^{40} + 1$.
After a bit of experimentation these nice primes were abandoned, mainly due to the fact
that, although elegant, the reduction function became hard to software pipeline properly,
thus ultimately slowing down the inner loop on the project target machines with strong
integer multiply support.
Also, since the inception of the project, the demands on transform length had changed.
Now transforms of size $2^{50}$ had to be possible. Clearly neither $p_0$, $p_1$ nor $p_2$ could provide
the roots of unity for that.
It is possible to take the concept of precomputation further in this context. After all, at
run time the powers of ω to be used are known; could this not be utilised in some fashion?
Exactly this point has been addressed by Victor Shoup and was communicated to the author
by Torbjörn Granlund via [13]:
We have x and y, two p-residues, and want xy mod p. If W is the machine word-size,
let p be a (W − 1)-bit prime. Pre-compute
$$y_{pinv} = \lfloor 2 \cdot 2^{W-1} y / p \rfloor.$$
Now,
$$q = \lfloor x \cdot y_{pinv} / 2^W \rfloor \approx xy/p.$$
Let $r = xy - qp$. If r is computed mod $2^W$, it will be either the correct answer or p too
big.
To check this, replace the floor expressions by the appropriate remainders:
$$y_{pinv} = 2\,\frac{2^{W-1}y - r_1}{p}, \quad 0 \le r_1 < p,$$
$$q = \frac{x \cdot y_{pinv} - r_2}{2^W}, \quad 0 \le r_2 < 2^W$$
$$\Rightarrow \quad q = \frac{x \cdot y_{pinv}}{2^W} - \frac{r_2}{2^W} = \frac{2x}{p}\cdot\frac{2^{W-1}y - r_1}{2^W} - \frac{r_2}{2^W} = \frac{xy}{p} - \frac{2 r_1 x}{p\,2^W} - \frac{r_2}{2^W}.$$
Then
$$r = x \cdot y - q \cdot p = \frac{2 r_1 x}{2^W} + \frac{r_2\, p}{2^W}.$$
Clearly, $r \ge 0$. And also, since $x < p \le 2^{W-1}$ and $r_1 < p$,
$$r \le \frac{2(p-1)x}{2^W} + \frac{2^W - 1}{2^W}\,p < (p-1) + p < 2p,$$
so r is either the correct result or exactly p too big, and a single conditional subtraction of p suffices.
When using Montgomery reduction, the full double word product, x · y has to be computed first. Then follows the application of Algorithm 7, holding one single word product
and one double word product. Shoup’s scheme has two single-precision multiplications
and one double-precision multiplication where only the high word is needed, making it a
bit more inexpensive. Also, there is more potential parallelism in Shoup’s scheme, possibly giving it better characteristics for software pipelining. As it requires both the powers
of ω and their pre-computed duals it does perform worse than Montgomery reduction with
respect to memory accesses. Also, it requires performing modular reductions without any
precomputation, for instance when performing the pointwise multiplications of the transformed polynomials. This has to be accomplished using standard methods from, e.g., [15].
For the final version of the project code, both Montgomery reduction using 64-bit primes and Shoup's reduction using 63-bit primes were implemented.
4.3 Two Wrongs Make a Right
In the first implementations the convolution of the two n-long sequences u and v was performed by padding the sequences with zeros to length 2k, where k is the smallest power of
two greater than or equal to n. This clearly favours power-of-two operand sizes and gives
sudden increases in computational cost when the size exceeds a power of two.
In [2], Bailey describes a method for making the computational cost a more continuous
function of n and, consequently, reducing the cost for certain problem sizes. This method
was implemented in the project code and is described in the following example:
Consider the case when n = k + 2, where k = 2^m, i.e., u_k, u_{k+1} and v_k, v_{k+1} may be nonzero. First pad u and v with zeros to length 2k = 2^{m+1}. Then apply forward and inverse FFTs to the extended sequences to produce the following circular convolution:

r_0 = u_0v_0 + u_{k−1}v_{k+1} + u_kv_k + u_{k+1}v_{k−1}
r_1 = u_0v_1 + u_1v_0 + u_kv_{k+1} + u_{k+1}v_k
r_2 = u_0v_2 + u_1v_1 + u_2v_0 + u_{k+1}v_{k+1}
r_3 = u_0v_3 + u_1v_2 + u_2v_1 + u_3v_0
...
r_k = u_0v_k + u_1v_{k−1} + . . . + u_kv_0
...
r_{2k−1} = u_{k−2}v_{k+1} + u_{k−1}v_k + u_kv_{k−1} + u_{k+1}v_{k−2}
Two items make this result differ from the desired 2n-long linear convolution:
• the first three values are corrupted by some additional terms
• the final four members of the sequence are missing. The missing values are

r_{2n−4} = r_{2k} = u_{k−1}v_{k+1} + u_kv_k + u_{k+1}v_{k−1}
r_{2n−3} = r_{2k+1} = u_kv_{k+1} + u_{k+1}v_k
r_{2n−2} = r_{2k+2} = u_{k+1}v_{k+1}
r_{2n−1} = r_{2k+3} = 0
Ignoring the last zero value, these expressions are precisely the terms that have corrupted the first three members of the desired sequence r. So, by computing these three expressions separately, the r sequence can be corrected to the sought 2n-point linear convolution.
From the example it is clear that this method can be applied when evaluating the linear convolution of sequences of size n = k + d for any d < k = 2^m. Extend the input sequences with zeros to length 2k, and calculate the 2k-long circular convolution by FFTs. Then compute a correction sequence as the linear convolution of the two (2d − 1)-long sequences u' = {u_{k−d+1}, u_{k−d+2}, . . . , u_{k+d−1}} and v' = {v_{k−d+1}, v_{k−d+2}, . . . , v_{k+d−1}}. Discard the first 2d − 2 values of this correction sequence; the remaining values are exactly the corrupting terms and the missing tail, so they can be used to correct the 2k-long sequence into the desired 2n-long sequence (whose final value is zero).
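A minimal C sketch of this correction step might look as follows. It is an illustration only, not the project code; mulmod stands for any modular multiplication routine (for instance one of the sketches above), and the prime p is assumed to be below 2^{63} so that modular additions cannot overflow a word:

#include <stdint.h>
#include <stdlib.h>

/* Correct a 2k-point circular convolution into the 2n-point linear convolution,
   n = k + d with 0 < d < k, all arithmetic modulo the prime p (p < 2^63 assumed).
   circ[0..2k-1] is the circular convolution of the zero-padded u[0..n-1], v[0..n-1];
   out[0..2n-1] receives the linear convolution. */
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p);  /* assumed helper */

void bailey_correct(uint64_t *out, const uint64_t *circ,
                    const uint64_t *u, const uint64_t *v,
                    size_t k, size_t d, uint64_t p)
{
    size_t n = k + d, taillen = 2 * d - 1;
    uint64_t *c = calloc(2 * taillen - 1, sizeof *c);

    /* Schoolbook linear convolution of the tails u[k-d+1..k+d-1], v[k-d+1..k+d-1]. */
    for (size_t i = 0; i < taillen; i++)
        for (size_t j = 0; j < taillen; j++) {
            uint64_t t = mulmod(u[k - d + 1 + i], v[k - d + 1 + j], p);
            c[i + j] = (c[i + j] + t) % p;
        }

    /* circ[j] = r[j] + r[j+2k] for j = 0..2d-2 and circ[j] = r[j] otherwise,
       while c[2d-2+j] = r[2k+j]: remove the wrapped terms and append the tail. */
    for (size_t j = 0; j < 2 * k; j++)
        out[j] = circ[j];
    for (size_t j = 0; j + 1 < 2 * d; j++) {
        uint64_t miss = c[taillen - 1 + j];
        out[j] = (out[j] + p - miss) % p;   /* the previously corrupted values */
        out[2 * k + j] = miss;              /* the previously missing values   */
    }
    out[2 * n - 1] = 0;                     /* the last coefficient is always zero */

    free(c);
}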
Clearly, there is no point in doing this for d much larger than 2^{m−1}, since for such d the correction convolution is about as large as the 2^m-point convolution, making the two convolutions together nearly as costly as one convolution on inputs of length 2^{m+1}.
Chapter 5
Results
Four different variants of the project code were implemented in C, using the GMP framework:
• Straight iterative code based on Montgomery reduction. Also uses the ideas from
Chapter 4.3.
• Straight iterative code based on Victor Shoup’s reduction scheme. Also uses the ideas
from Chapter 4.3.
• Bailey Four-step FFT based on Montgomery reduction.
• Bailey Four-step FFT based on Shoup’s reduction scheme.
The versions of the code utilising Montgomery reduction were implemented for one to five 64-bit primes, and the Shoup-based code uses one to five 63-bit primes.
As has been shown during the development of GMP, the use of hand-written assembly
code is essential to really get the best performance on every platform. The FFT present
in GMP 4.1.4 relies heavily on the very tight addition and subtraction assembly loops of
GMP. The project C code would not stand much of a chance on a platform where GMP
assembly support is available.
However, GMP 4.1.4 lacks assembly support for the very interesting AMD64 architecture, making the comparison a much fairer one. In Figures 5.1 and 5.2 the quotient of
execution times between the project code and GMP 4.1.4 is shown for two different machines.
Several conclusions can be drawn from the data in Figures 5.1 and 5.2. The most obvious is that the project code is up to twice as fast as the code in GMP 4.1.4. It
is also the case that Montgomery reduction is more successful than Shoup’s scheme. This
is mainly due to the fact that the linear work is heavier in Shoup’s scheme when nothing
special has been done to simplify the modular reductions not supported by the precomputed
powers of ω. What may not be as evident are some general characteristics of the quotient
curves. On all interesting platforms it is the case that when operands fit in L1 cache, the
straight implementations are faster than their Four-step counterparts. Just above L1 cache
[Plot: execution time quotient (0 to 1) versus operand size in words (2^6 to 2^20), for Shoup's and Montgomery's reduction.]
Figure 5.1. The quotient of execution times between the project code and GMP 4.1.4 code
for different operand sizes using different reduction schemes. Measurements are from an AMD
Opteron 246 2 GHz machine.
size there is a peak in the curves. This is where the Four-step code has overtaken the straight
code in performance, but the larger overhead is showing through.
It could be expected, and is indeed the case, that on two platforms differing only in memory bandwidth, the project code would be more successful, in comparison with the GMP code, on the platform with lower bandwidth, since it has better cache performance characteristics. Likewise, the project code would be better off on a platform with higher CPU clock frequency, if that were the only difference, since a faster processor reduces the relative cost of the extra operations the project code performs compared with the GMP code.
The two machines used to generate the data in Figures 5.1 and 5.2 are similar, but this
particular Opteron 246 platform has a faster processor as well as higher memory bandwidth.
It seems that the two above-mentioned effects almost balance each other out on these machines.
On the Intel Itanium 2 architecture, another instance where GMP 4.1.4 does not have assembly support, similar, if not quite as spectacular, performance was achieved. The author and
Torbjörn Granlund have written proof-of-concept assembly code, showing that the project
approach could very well be competitive on machines with GMP assembly.
To give the reader a feel for how fast the implemented algorithm is, Figure 5.3 shows timing data for the Opteron 246 test machine. Together with the timing data, the number of primes used to obtain each measurement is also plotted. It is well worth noting that it can pay off to make the input operands longer, i.e., to use only two primes. Naturally enough, this is only an option for relatively small operands.
[Plot: execution time quotient (0 to 1) versus operand size in words (2^6 to 2^20), for Shoup's and Montgomery's reduction.]
Figure 5.2. The quotient of execution times between the project code and GMP 4.1.4 code
for different operand sizes using different reduction schemes. Measurements are from an AMD
Opteron 240 1.4 GHz machine.
[Plot: execution time in seconds (logarithmic, 10^-4 to 1) and number of primes used (2 to 5) versus operand size in words (2^6 to 2^20).]
Figure 5.3. Execution times for multiplication based on the project code using Montgomery
reduction plotted together with the number of primes used to achieve that timing measurement.
There is still much work to be done if the project code is to be inserted into the GMP
repository. The main obstacle is to write a function that accurately predicts how many
primes to use for a particular problem size. At this point, when the inner-loop of the FFT
code dominates the execution time, a simple comparison of operations executed in that loop
probably is accurate enough. As the inner-loop grows tighter, weight probably has to be
given to the linear work as well.
Other possible directions of development arise from the demands of GMP 5, where truly huge operands have to be supported. To meet this, one suggestion would be to implement support for more primes and a recursive Four-step algorithm. It could also be interesting to experiment with Victor Shoup's scheme on 32-bit platforms.
In any case, the author would like to deem the project as successful and looks forward
to meeting the remaining challenges mentioned above.
References
[1] D. H. Bailey. FFTs in external or hierarchical memory. Journal of Supercomputing,
4(1):23–35, March 1990.
[2] D. H. Bailey. On the Computational Cost of FFT-Based Linear Convolutions. http:
//crd.lbl.gov/~dhbailey/dhbpapers/, June 1996. Last visited May 2005.
[3] J. A. Beachy and W. D. Blair. Abstract Algebra. Waveland Press, second edition,
1996. ISBN 0-88133-866-4.
[4] C. S. Burrus. Notes on the FFT. http://www.fftw.org/burrus-notes.html, September
1997. Last visited May 2005.
[5] CiteSeer.IST Scientific Literature Digital Library. http://citeseer.ist.psu.edu/. Last
visited May 2005.
[6] W. T. Cochran, J. W. Cooley, D. L. Favin, H. D. Helms, R. A. Kaenel, W. W. Lang,
G. C. Maling, D. E. Nelson, C. M. Rader, and P. D. Welch. What is the fast Fourier transform? IEEE Transactions on Audio and Electroacoustics, AU-15(2):45–55, June
1967.
[7] J. W. Cooley, P. A. W. Lewis, and P. D. Welch. Historical notes on the fast Fourier transform. IEEE Transactions on Audio and Electroacoustics, AU-15(2):76–79, June
1967.
[8] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301, 1965.
[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms,
chapter 30. MIT Press, second edition, 2001. ISBN 0-262-53196-8.
[10] T. Estermann. Introduction to Modern Prime Number Theory, chapter 2. Cambridge
University Press, 1961.
[11] W. M. Gentleman and G. Sande. Fast Fourier transforms – for fun and profit. In 1966
Fall Joint Computer Conference, volume 29 of AFIPS Proceedings, pages 563–578,
1966.
[12] M. T. Goodrich and R. Tamassia. Algorithm Design: Foundations, Analysis, and
Internet Examples, chapter 10, pages 488–507. John Wiley, 2002. ISBN 0-471-38365-1.
[13] T. M. Granlund. E-mail communication with Victor Shoup of New York University,
USA, February 2005.
[14] T. M. Granlund et al. GNU Multiple Precision Arithmetic Library 4.1.4. http://swox.
com/gmp/, September 2004.
[15] T. M. Granlund and P. L. Montgomery. Division by invariant integers using multiplication. ACM SIGPLAN Notices, 29(6):61–72, 1994.
[16] G. H. Hardy and E. M. Wright. An introduction to the theory of numbers. The
Clarendon Press, fifth edition, 1979. ISBN 0-19-853170-2.
[17] M. T. Heath. Scientific Computing: An Introductory Survey. McGraw-Hill, second
edition, 2002. ISBN 0-07-239910-4.
[18] M. T. Heideman, D. H. Johnson, and C. S. Burrus. Gauss and the history of the FFT.
IEEE Acoustics, Speech, and Signal Processing Magazine, 1:14–21, October 1984.
[19] J. L. Hennessy and D. A. Patterson. Computer Architecture a Quantitative Approach,
chapter 5, pages 372–483. Morgan Kaufmann, second edition, 1996.
[20] J. Håstad. Notes for the course advanced algorithms. http://www.nada.kth.se/~johanh/algnotes.pdf, 2000. Last visited May 2005.
[21] J. Johnson and R. Johnson. Challenges of computing the fast Fourier transform.
In Proc. Optimized Portable Application Libraries (OPAL) Workshop. DARPA/NSF,
June 1997.
[22] A. Karatsuba and Y. Ofman. Multiplication of multidigit numbers on automata. Doklady Akademii Nauk SSSR, 145(2):293–294, 1962.
[23] D. E. Knuth. The Art of Computer Programming : Seminumerical Algorithms,
volume 2, chapter 4. Addison-Wesley, third edition, 1998. ISBN 0-201-89684-2.
[24] S. Landau and N. Immerman. The Similarities (and Differences) between Polynomials
and Integers. http://citeseer.ist.psu.edu/landau94similarities.html, September 1994.
Last visited May 2005.
[25] J. D. Lipson. Elements of algebra and algebraic computing. Benjamin/Cummings,
1981. ISBN 0-201-04480-3.
[26] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996. ISBN 0-8493-8523-7.
[27] P. L. Montgomery. Modular multiplication without trial division. Mathematics of
Computation, 44:519–521, 1985.
[28] N. Möller. E-mail communication with François G. Dorais of Dartmouth College, USA, June 2004. Originating from C. Pomerance.
[29] C. Percival. Rapid multiplication modulo the sum and difference of highly composite
numbers. Mathematics of Computation, 72:387–395, 2003.
[30] J. M. Pollard. The fast Fourier transform in a finite field. Mathematics of Computation,
25:365–374, 1971.
[31] R. L. Rivest, A. Shamir, and L. M. Adleman. A Method for Obtaining Digital Signatures and Public-key Cryptosystems. Technical Report MIT/LCS/TM-82, MIT, 1977.
[32] A. Schönhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing,
7(3-4):281–292, 1971.
[33] A. S. Tanenbaum. Structured Computer Organization. Prentice Hall, fourth edition,
1999. ISBN 0-13-020435-8.
[34] A. L. Toom. Complexity of a scheme of functional elements realizing the multiplication of integers. Doklady Akademii Nauk SSSR, 150:496–498, 1963.
Appendix
Acronyms
The following acronyms and abbreviations are used throughout the report.
CRT: Chinese Remainder Theorem.
DFT: Discrete Fourier Transform.
FFT: Fast Fourier Transform.
GMP: GNU Multiple Precision Arithmetic Library.
NTT: Number Theoretic Transform.