Three Dimensional Fast Fourier Transform CUDA Implementation

Kumar Aatish, Boyan Zhang
Contents

1 Introduction
  1.1 Discrete Fourier Transform (DFT)

2 Three Dimensional FFT Algorithms
  2.1 Basis
  2.2 In-Order Decimation in Time FFT

3 CUDA Implementation
  3.1 Threadblock Geometry
  3.2 Twiddle Multiplication
  3.3 Transpose Kernel

4 Experiment Results
  4.1 Occupancy Calculation and Analysis
  4.2 Benchmarking

5 Scope for Further Improvements
  5.1 Algorithmic Improvements
    5.1.1 CUDA FFT
    5.1.2 Transpose Kernel
  5.2 Kepler Architecture Specific Improvements
    5.2.1 Registers Per Thread
    5.2.2 Shuffle Instruction

A Occupancy Calculation for Different Thread Block Heights
1 Introduction

1.1 Discrete Fourier Transform (DFT)
The one-dimensional Discrete Fourier Transform (DFT) and Inverse Discrete Fourier Transform (IDFT) are given below [Discrete Fourier Transform]:

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn}, \quad 0 \le k \le N - 1,    (1)

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) W_N^{-kn}, \quad 0 \le n \le N - 1,    (2)

where

W_N^{kn} = e^{-i \frac{2\pi kn}{N}}.    (3)

In the above discrete transform, the input is an N-tuple complex vector x(n) and the output is a vector of the same length. In matrix form,

X = Wx.    (4)
If a complex function h(n_1, n_2) is defined over the two-dimensional grid n_1 \in [0, N_1 - 1], n_2 \in [0, N_2 - 1], we can define its two-dimensional DFT as a complex function H(k_1, k_2) over the same grid:

H(k_1, k_2) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} e^{-2\pi i k_2 n_2 / N_2} \, e^{-2\pi i k_1 n_1 / N_1} \, h(n_1, n_2).    (5)

Correspondingly, the inverse transform can be expressed as

h(n_1, n_2) = \frac{1}{N_1 N_2} \sum_{k_2=0}^{N_2-1} \sum_{k_1=0}^{N_1-1} e^{2\pi i k_2 n_2 / N_2} \, e^{2\pi i k_1 n_1 / N_1} \, H(k_1, k_2).    (6)
Since the Fourier Transform and the Discrete Fourier Transform are separable, a two dimensional DFT can be decomposed into two one dimensional DFTs. Thus the two dimensional DFT can be computed by applying a one dimensional DFT to all rows of the two dimensional complex matrix and then to all columns (or vice versa). Higher dimensional DFTs generalize this process: the same one dimensional computation is applied along each transform dimension [Separability of 2D Fourier Transform].
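As a concrete illustration of separability, the following host-side sketch (our own, not part of the implementation; the helper name dft_1d and the direct O(N^2) evaluation of Eq. (1) are assumptions made for brevity) computes a 2D DFT by transforming all rows and then all columns. The same pattern extends to three dimensions.

#include <complex>
#include <cmath>

using cplx = std::complex<double>;

// Direct evaluation of Eq. (1) on a strided vector of length n.
static void dft_1d(const cplx* in, cplx* out, int n, int stride) {
    const double pi = std::acos(-1.0);
    for (int k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (int j = 0; j < n; ++j)
            acc += in[j * stride] * std::exp(cplx(0.0, -2.0 * pi * j * k / n));
        out[k * stride] = acc;
    }
}

// 2D DFT of an n1 x n2 row-major array: all rows first, then all columns.
void dft_2d(const cplx* in, cplx* tmp, cplx* out, int n1, int n2) {
    for (int r = 0; r < n1; ++r)              // row transforms
        dft_1d(&in[r * n2], &tmp[r * n2], n2, 1);
    for (int c = 0; c < n2; ++c)              // column transforms (stride n2)
        dft_1d(&tmp[c], &out[c], n1, n2);
}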
2 Three Dimensional FFT Algorithms
As explained in the previous section, a 3-dimensional DFT can be expressed as a sequence of 1-dimensional DFTs applied along each dimension of the 3-dimensional data. Each of these 1-dimensional DFTs can be computed efficiently owing to the properties of the transform. This class of algorithms is known as the Fast Fourier Transform (FFT). We introduce the one dimensional FFT algorithm in this section, which will be used in our GPU implementation.
2.1 Basis
The DFT of a vector of size N can be rewritten as a sum of two smaller DFTs, each of size N/2, operating on the odd and even elements of the vector (Fig 1). This is known as the Danielson-Lanczos Lemma [G. C. Danielson and C. Lanczos] and is the basis of the FFT. The resulting fast Fourier transform (FFT) algorithm reduces the complexity to O(N log N).
The Danielson-Lanczos Lemma:
X(k) = \sum_{n=0}^{N/2-1} x(2n) e^{-i \frac{2\pi (2n) k}{N}} + \sum_{n=0}^{N/2-1} x(2n+1) e^{-i \frac{2\pi (2n+1) k}{N}}

     = \sum_{n=0}^{N/2-1} x(2n) e^{-i \frac{2\pi n k}{N/2}} + W_N^k \sum_{n=0}^{N/2-1} x(2n+1) e^{-i \frac{2\pi n k}{N/2}}

     = DFT_{N/2}[x_0, x_2, \ldots, x_{N-2}] + W_N^k \, DFT_{N/2}[x_1, x_3, \ldots, x_{N-1}]
If N is a power of 2, this divide and conquer strategy can be used to recursively break the transform of size N down to transforms of length 1 in log_2(N) steps, leading to a performance increase (O(N log_2 N)) over a naive matrix multiplication of a vector with the DFT matrix (O(N^2)).
The exact method of decomposing the larger problem into smaller problems can vary widely. A number of algorithms have been proposed over the years, the most famous of which is the Cooley-Tukey algorithm [Cooley, J. W. and Tukey, J. W.], which has two variations called Decimation in Time (DIT) and Decimation in Frequency (DIF).
Figure 1: Recursive Relation (DIT FFT) : 1. 2 * N/2-radix DFT 2. Twiddle Multiplication 3. N/2 * 2-radix DFTs[Decimation-in-time (DIT) Radix-2 FFT]
In the DIT FFT (Fig 1), if a vector is of size N = N1 * N2, the DFT formulation is done by:
Figure 2: Recursive Relation (DIF FFT) : 1. N/2 * 2-radix DFTs 2. Twiddle Multiplication 3. 2 * N/2-radix DFT[Decimation-in-time (DIT) Radix-2 FFT]
• Performing N1 DFTs of size N2, called a radix-N2 FFT.
• Multiplication by complex roots of unity called twiddle factors.
• Performing N2 DFTs of size N1, called a radix-N1 FFT.
For the DIF FFT, the DFT formulation is:
• Performing N2 DFTs of size N1, called a radix-N1 FFT.
• Multiplication by complex roots of unity called twiddle factors.
• Performing N1 DFTs of size N2, called a radix-N2 FFT.
In this paper, we implement the DIT FFT for length 128, although, according to our hypothesis, an equivalent DIF FFT would not differ in performance. Using this 128-element FFT, we can further construct FFT algorithms for other sizes by utilizing the recursive property of FFTs. However, such an exercise is beyond the scope of our project.
In the DIT scheme, we apply 2 FFTs, each of size N/2, which can be further broken down into more FFTs recursively. After applying each such recursive relation, we get a computation structure which is popularly referred to as a butterfly.
W(N, N/2) = -1    (7)

Therefore, W(N, k) can be expressed as

-W(N, k - N/2), \quad N/2 \le k \le N - 1.    (8)

Since we can express half of the twiddle factors in terms of the other half, we have N/2 twiddle multiplications at each stage, excluding multiplications by -1. For an input pair (G(i), H(i)), the butterfly function is B(G(i), H(i)) = (G(i) + W_N^i * H(i), G(i) - W_N^i * H(i)).
Figure 3: Butterfly [Signal Flow Graphs of Cooley-Tukey FFTs]
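To make the recursive relation concrete, the following host-side sketch (our own illustration, not the implemented kernel; the function name fft_dit is an assumption) applies the Danielson-Lanczos split and the butterfly B(G, H) = (G + W*H, G - W*H) recursively:

#include <complex>
#include <vector>
#include <cmath>

using cplx = std::complex<double>;

// Recursive radix-2 DIT FFT: split into even/odd halves, transform each,
// then combine with the butterfly B(G, H) = (G + W*H, G - W*H).
std::vector<cplx> fft_dit(const std::vector<cplx>& x) {
    const size_t n = x.size();                 // n must be a power of two
    if (n == 1) return x;                      // a length-1 DFT is the identity

    std::vector<cplx> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    std::vector<cplx> E = fft_dit(even), O = fft_dit(odd);

    const double pi = std::acos(-1.0);
    std::vector<cplx> X(n);
    for (size_t k = 0; k < n / 2; ++k) {
        cplx w = std::polar(1.0, -2.0 * pi * double(k) / double(n));  // W_N^k
        X[k]         = E[k] + w * O[k];
        X[k + n / 2] = E[k] - w * O[k];
    }
    return X;
}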
It should be noted that while breaking down the FFT of size N into two FFTs of size N/2, the even and odd inputs are mapped out of order to the output. Decomposing the problem further into FFTs of the smallest possible size leads to an interesting ordering of the input relative to the output. Upon inspection, we can see (Fig 4) that an input having index k has a corresponding output with index bitreverse(k, log_2(N)), i.e. the bit reversal is done in log_2(N) bits.
Figure 4: Radix-2 DIT FFT on a vector of length 8 [Decimation-in-time (DIT) Radix-2 FFT]

Input | Input (binary) | Output (binary) | Output
0     | 000            | 000             | 0
1     | 001            | 100             | 4
2     | 010            | 010             | 2
3     | 011            | 110             | 6
4     | 100            | 001             | 1
5     | 101            | 101             | 5
6     | 110            | 011             | 3
7     | 111            | 111             | 7
This pattern can be explained by the fact that applying an FFT of length N/2 requires reshuffling the input indices into even-odd order relative to the new transform length after each decomposition, e.g. N/2 after the first decomposition. This is the same as sorting subvectors of length N/2^(p-1) with respect to the (p-1)th bit first and then by the remaining bits in the pth decomposition.
For the length-8 example: after the first decomposition the input order becomes 0, 2, 4, 6, 1, 3, 5, 7; after the second decomposition it becomes 0, 4, 2, 6, 1, 5, 3, 7, which is exactly the bit-reversed order 000, 100, 010, 110, 001, 101, 011, 111 and maps to the outputs 0 through 7 in sequence.
After the complete decomposition of the problem, we are left with FFTs of length 2, and the number of butterfly steps required is log_2(N). Thus we can compute the FFT in log_2(N) time steps; each such step is referred to as a Stage in this paper.
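For reference, a straightforward bit-reversal helper (a sketch; the function name and loop form are our own) that reproduces the index mapping shown above:

#include <cstdint>

// Reverse the lowest `bits` bits of k, e.g. bitreverse(1, 3) == 4 (001 -> 100).
uint32_t bitreverse(uint32_t k, int bits) {
    uint32_t r = 0;
    for (int b = 0; b < bits; ++b) {
        r = (r << 1) | (k & 1u);   // shift the lowest bit of k into r
        k >>= 1;
    }
    return r;
}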
2.2 In-Order Decimation in Time FFT
A DIT FFT does not necessarily imply an out-of-order input and an in-order output. We can reshuffle the input order at the expense of the ordering of the output. For example, in Fig 5, for the case of FFT length = 16, the reshuffled butterfly diagram has in-order input and out-of-order output. For the development of our GPU algorithm, this approach is more conducive to coalesced memory access.
As mentioned before, it is easy to observe that every stage always has N/2 twiddle multiplications. This is because at decomposition stage p, N/2^p twiddle factors are introduced by each of the 2^(p-1) DFTs. As the recursive relation leads to smaller transform lengths, the range of twiddle factors introduced at each step varies. It is for this reason that the difference between a DIT and a DIF implementation of the FFT is the range of twiddle factors W^Q used to multiply the intermediate values before every stage. Fortunately, the values that the twiddle factors take can be determined, although the relationship between the twiddle factors and the input index is not so obvious.
Figure 5: DIT FFT, In-order input, out-of-order output. Every stage has N/2 twiddle
factors [Decimation-in-time (DIT) Radix-2 FFT]
Stage | k                       | input index               | range(Q)
1     | 0-3, 4-7                | 4-7, 12-15                | 0, 4
2     | 0-1, 2-3, 4-5, 6-7      | 2-3, 6-7, 10-11, 14-15    | 0, 4, 2, 6
3     | 0, 1, 2, 3, 4, 5, 6, 7  | 1, 3, 5, 7, 9, 11, 13, 15 | 0, 4, 2, 6, 1, 5, 3, 7
At every stage, the Stride is defined to be N/2^p, and each such stage consists of 2^(p-1) sets of butterflies. Depending upon whether the input element index falls on the lower half or the upper half of a butterfly set, elements are either added to or subtracted from elements across a stride. More precisely, the following pseudocode demonstrates the ADDSUB step of each stage:
if ((tid & STRIDE(p)) == STRIDE(p))
{
    val[i] = val[i - STRIDE(p)] - val[i];
}
else
{
    val[i] = val[i + STRIDE(p)] + val[i];
}
After the ADDSUB step, the MUL_TWIDDLE step takes place for all stages except the last one:
if ((tid & STRIDE(p+1)) == STRIDE(p+1))
{
    val[i] = val[i] * TWIDDLE_KERNEL(i, p, N);
}
Our implementation does not make function calls to generate twiddle factors; they are instead hard coded. A general relation between the kth twiddle factor and its value Q can be expressed as Q = bitreverse(FLOOR[k/(N/2^p)], log_2(N) - 1) [Computing FFT Twiddle Factors]. This idea has been applied to our implemented case of length 128, and a detailed relation between the input index and Q is given in the next section.
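As a sketch of this relation (our own helper names; the actual kernel hard-codes the specialized per-stage reversals described in Section 3.2), the exponent Q of the twiddle factor W_N^Q applied to input index k at stage p can be computed with the bitreverse helper from Section 2.1:

// Twiddle exponent Q for input (thread) index k at stage p of an N-point FFT:
// Q = bitreverse(floor(k / (N / 2^p)), log2(N) - 1), and the factor is W_N^Q.
uint32_t twiddle_exponent(uint32_t k, int p, int N) {
    int log2N = 0;
    while ((1 << log2N) < N) ++log2N;          // log2(N), N a power of two
    uint32_t group = k / (uint32_t)(N >> p);   // floor(k / (N / 2^p))
    return bitreverse(group, log2N - 1);       // reversal in log2(N)-1 bits
}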
3 CUDA Implementation
Our implementation performs an FFT along the row-major dimension of a given three dimensional matrix, one dimension at a time. Thus, the complete 3D FFT is a sequence of 1D FFT kernels and transpose kernels, where the transposes bring the desired coordinate axis into the row-major position to enable coalesced global reads. The implemented kernel performs a single precision 1D FFT and uses the fast math functions for calculating the sin and cos of the phases corresponding to the twiddle factors.
The GPU used is the C2050 Fermi GPU on the DIRAC cluster on the Carver system
provided by the National Energy Research Scientific Computing Center. The NVIDIA
Tesla C2050 has 448 parallel CUDA processor cores with 3 GB of memory.
3.1 Threadblock Geometry
The threadblock geometry and global data mapping onto the block can be done in more
ways than one.
Figure 6: Serial Configuration of thread block mapping
(Fig 6) shows a configuration where each CUDA thread within a thread block handles an entire row vector transform. This configuration does not meet the memory usage requirements for large input sizes because of the limited local storage space (registers per thread). Moreover, reads from global memory are not coalesced.
Figure 7: Serial Configuration of thread block mapping
(Fig 7) shows a different mapping of the thread block structure: the threads within one thread block work on a single row vector, so the required communication among threads happens only within each thread block.
Figure 8: Serial Configuration of thread block mapping
(Fig 8) is a modified version of the configuration shown in (Fig 7): a single row of threads grouped within one thread block is responsible for calculating the transform of more than one row of data. If we assume the length of the transform vector is 128, up to 4 rows can be grouped within one thread block. The optimal number is a trade-off between multiprocessor compute capability and resources.
We have implemented the third alternative, which transforms more than one row of data with a single row of threads, resulting in higher Instruction Level Parallelism. Each such threadblock is assigned a shared memory block of height H and row size TLEN = 128. Since we are dealing with complex numbers, two identical shared memory blocks are maintained, one for the real and one for the imaginary component of the data.
The real and imaginary components of the data are loaded into shared memory from global memory and are also read into the registers. After this step, each of the 7 stages of the FFT is computed. Each of these stages consists of the ADDSUB step, in which every thread either reads the data element of the thread STRIDE(p) ahead and adds that value to its own, or reads from the thread STRIDE(p) behind and subtracts its own value from that value.
The next stride value, STRIDE(p+1), decides which of these elements get multiplied by the twiddle factor. The threads that satisfy the condition multiply their data with the twiddle factor, after which the resultant data is written back to shared memory to be used by the threadblock. This sequence of steps continues till the FFT is completed. The last stage, however, does not involve multiplication by a twiddle factor and only consists of the ADDSUB step, after which the value of each thread is first written to the bit-reversed column in shared memory. The data is then read back into the registers of every thread in a contiguous pattern and written to global memory in a coalesced fashion. Although an increased H would apparently increase the instruction level parallelism, there are limits to the range of H, as it can cause lower occupancy due to the limited shared memory resource. Apart from shared memory, an increased H also causes increased register pressure and could lead to register spilling, resulting in inefficient execution and lowered performance. Even if we somehow limit the row size so that H * TLEN is less than 32 KB (the available register memory), a severe limitation of the Fermi architecture is that it only allows 63 registers per thread.
Due to these concerns, we experimentally determined the optimal height H for which the 1D FFT kernel is most efficient. These experiments are dealt with in the Experimental Results section.
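The following is a simplified sketch of this kernel structure (our own reconstruction under the stated assumptions, not the authors' exact code): one block of TLEN = 128 threads transforms H rows, shared memory is split into real and imaginary planes, and the 6-bit twiddle reversal is written generically instead of with the hard-coded per-stage variants of Section 3.2.

#define TLEN 128   // transform length
#define H    4     // data rows handled per thread block

// One block of TLEN threads transforms H consecutive rows of length TLEN.
// Launch (row count assumed to be a multiple of H):
//   fft1d_rows<<<num_rows / H, TLEN>>>(d_in_re, d_in_im, d_out_re, d_out_im);
__global__ void fft1d_rows(const float* g_re, const float* g_im,
                           float* o_re, float* o_im) {
    __shared__ float s_re[H][TLEN];
    __shared__ float s_im[H][TLEN];

    const int tid = threadIdx.x;                     // column within a row
    const size_t base = (size_t)blockIdx.x * H * TLEN;

    for (int r = 0; r < H; ++r) {                    // coalesced global loads
        s_re[r][tid] = g_re[base + r * TLEN + tid];
        s_im[r][tid] = g_im[base + r * TLEN + tid];
    }
    __syncthreads();

    float re[H], im[H];
    for (int p = 1; p <= 7; ++p) {                   // log2(128) = 7 stages
        const int stride = TLEN >> p;                // STRIDE(p) = N / 2^p
        for (int r = 0; r < H; ++r) {                // ADDSUB
            if (tid & stride) {
                re[r] = s_re[r][tid - stride] - s_re[r][tid];
                im[r] = s_im[r][tid - stride] - s_im[r][tid];
            } else {
                re[r] = s_re[r][tid + stride] + s_re[r][tid];
                im[r] = s_im[r][tid + stride] + s_im[r][tid];
            }
        }
        __syncthreads();                             // all reads before any write
        if (tid & (stride >> 1)) {                   // MUL_TWIDDLE (never true in the last stage)
            unsigned int q = 0, grp = tid / stride;  // Q = 6-bit reversal of grp
            for (int b = 0; b < 6; ++b) { q = (q << 1) | (grp & 1u); grp >>= 1; }
            float s, c;
            __sincosf(-2.0f * 3.14159265f * (float)q / TLEN, &s, &c);
            for (int r = 0; r < H; ++r) {
                float t = re[r] * c - im[r] * s;     // complex multiply by W_128^Q
                im[r]   = re[r] * s + im[r] * c;
                re[r]   = t;
            }
        }
        for (int r = 0; r < H; ++r) {
            s_re[r][tid] = re[r];
            s_im[r][tid] = im[r];
        }
        __syncthreads();
    }

    // Scatter to the bit-reversed column, then store coalesced to global memory.
    unsigned int rev = 0, k = tid;
    for (int b = 0; b < 7; ++b) { rev = (rev << 1) | (k & 1u); k >>= 1; }
    for (int r = 0; r < H; ++r) {
        s_re[r][rev] = re[r];
        s_im[r][rev] = im[r];
    }
    __syncthreads();
    for (int r = 0; r < H; ++r) {
        o_re[base + r * TLEN + tid] = s_re[r][tid];
        o_im[base + r * TLEN + tid] = s_im[r][tid];
    }
}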
3.2 Twiddle Multiplication
As explained in the Algorithms section, the relationship between the kth twiddle factor and the twiddle factor value Q is given by bitreverse(FLOOR[k/(N/2^p)], log_2(N) - 1). Since our implementation maps the elements of the transformed vector along the threadblock, the thread index maps directly to the input index.
For our case, N = 128 and log_2(N) - 1 = 6. For p = 1:
Figure 9: Twiddle factors for stage 1
For p = 2,
Figure 10: Twiddle factors for stage 2
A general pattern emerges from these two tables. For any given thread id in the pth stage, the most significant p bits are reversed in log_2(N) - 1 bits and the remaining bits are ignored. For example, for stage 5, if the thread id corresponding to a particular data item that is to be multiplied with a twiddle factor is of the form ABCDEXY, then the value of Q with which the data item is multiplied is EDCBA0. For each of the stages for N = 128, a different bit reversal technique was used to maximize performance; more often than not, unnecessary operations were removed from a generalized 6-bit bit reversal.
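As a hypothetical illustration of such a hard-coded, per-stage reversal (our own code, not the kernel's exact form), the stage p = 5 case for N = 128 takes the 7-bit thread id ABCDEXY and produces Q = EDCBA0 directly:

// Stage p = 5, N = 128: thread id ABCDEXY (7 bits) -> Q = EDCBA0 (6 bits).
__device__ unsigned int twiddle_q_stage5(unsigned int tid) {
    unsigned int a = (tid >> 6) & 1u;   // A
    unsigned int b = (tid >> 5) & 1u;   // B
    unsigned int c = (tid >> 4) & 1u;   // C
    unsigned int d = (tid >> 3) & 1u;   // D
    unsigned int e = (tid >> 2) & 1u;   // E
    return (e << 5) | (d << 4) | (c << 3) | (b << 2) | (a << 1);  // bits X, Y ignored
}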
3.3 Transpose Kernel
Our 1D FFT kernel is designed to work along the row-major dimension. Because of this, after performing a 1D FFT we must transpose the global data so that the 1D FFT kernel can work on a different data dimension. Since the 3D FFT therefore encompasses transpose operations, it is imperative that an efficient transpose kernel be developed. The CUDA SDK provides a good starting example for a two dimensional square matrix transpose [Optimizing Matrix Transpose in CUDA]. The optimized kernel removes uncoalesced memory accesses and shared memory bank conflicts. We modified the transpose kernel to adapt it to our requirement of a three dimensional transpose.
Because the stride between sequence elements is large, performance tends to degrade for the higher dimensions. A global transpose involves a significant amount of global memory traffic; consequently, an optimized transpose operation is critical for achieving good overall performance. Before performing the 1D FFT, the data dimension in which the transform is desired is brought down to the lowest dimension. After this process is completed for all the data dimensions, the data is transposed back to the original dimension order. In total, the 1D FFT kernel is called three times, interspersed with 4 transpose kernels.
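A sketch of the tiled transpose idea adapted to three dimensional data is given below (our own illustration modeled on the SDK example, not the implemented kernel; it exchanges the X and Y dimensions slice by slice, and the remaining dimension exchanges follow the same tiling pattern):

#define TILE 16

// Transpose the X and Y dimensions of an nx * ny * nz volume, one z-slice per
// blockIdx.z. Staging through shared memory keeps both the global loads and
// the global stores coalesced; the +1 padding avoids shared memory bank conflicts.
// Launch: transpose_xy<<<dim3(nx/TILE, ny/TILE, nz), dim3(TILE, TILE)>>>(in, out, nx, ny);
__global__ void transpose_xy(const float* in, float* out, int nx, int ny) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    size_t slice = (size_t)blockIdx.z * nx * ny;

    if (x < nx && y < ny)
        tile[threadIdx.y][threadIdx.x] = in[slice + (size_t)y * nx + x];
    __syncthreads();

    // Write the transposed tile: the roles of the x and y block indices swap.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < ny && ty < nx)
        out[slice + (size_t)ty * ny + tx] = tile[threadIdx.x][threadIdx.y];
}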
4 Experiment Results

4.1 Occupancy Calculation and Analysis
Appendix A shows the results of occupancy testing, i.e. the multiprocessor warp occupancy under different thread block height configurations. It can be seen that when H = 16, shared memory usage is the limiting factor for the warp occupancy rate: each thread block is assigned a 16 KB shared memory block, and on the Dirac GPUs with compute capability 2.0 [Node and GPU Configuration], each multiprocessor has only 48 KB of shared memory available. Applying the same calculation to the case H = 8, we find that six thread blocks can execute concurrently on each multiprocessor, which is below the physical limit on active thread blocks. The optimal height is H = 4: it delivers the same occupancy rate as smaller heights, but its better local memory bandwidth utilization helps hide memory access latency. However, the occupancy calculation also shows room for further improvement, since each threadblock could hold more threads without reaching the critical register and shared memory usage points. Based on the discussion above, we expect that an implementation scheme assigning 2 x TLEN threads to one thread block would deliver better performance, although we have no implementation or test results to support this hypothesis.
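As a rough check of these numbers (assuming single-precision data stored as separate real and imaginary planes of H x TLEN floats, as in Section 3.1), the shared memory footprint per thread block is

2 \times H \times TLEN \times 4 \text{ bytes} = 1024 \cdot H \text{ bytes},

i.e. 16 KB for H = 16 (3 resident blocks per 48 KB multiprocessor), 8 KB for H = 8 (6 blocks), and 4 KB for H = 4, at which point limits other than shared memory, such as the compute capability 2.0 cap of 8 resident blocks per multiprocessor, become the constraint.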
Figure 11: FFT 1D Implementation Vs. CUFFT 1D
Figure 12: FFT 1D Implementation Vs. CUFFT 3D
Figure 13: FFT 3D Execution Time (ms)
4.2 Benchmarking
Two testing schemes are used to benchmark our kernel against the CUFFT library:
• 1D CUFFT plan (1D in-place transform) over three dimensional data
• 3D CUFFT plan (3D in-place transform) over three dimensional data
A sketch of how the two plans are set up follows.
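The reference side of the benchmark might look like the following (a sketch only; the buffer d_data, the cube size N and the omission of error checking are our assumptions):

#include <cufft.h>

// Transform an N x N x N complex volume with CUFFT in the two benchmarked ways.
void cufft_reference(cufftComplex* d_data, int N) {
    // Scheme 1: batched 1D in-place transforms along the fastest dimension.
    cufftHandle plan1d;
    cufftPlan1d(&plan1d, N, CUFFT_C2C, N * N);           // N*N rows of length N
    cufftExecC2C(plan1d, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan1d);

    // Scheme 2: a single 3D in-place transform of the whole volume.
    cufftHandle plan3d;
    cufftPlan3d(&plan3d, N, N, N, CUFFT_C2C);
    cufftExecC2C(plan3d, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan3d);
}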
5 Scope for Further Improvements

5.1 Algorithmic Improvements

5.1.1 CUDA FFT
The algorithm employed in this project can be improved further by utilizing register memory for the FFT transform. Recall that the optimum height of a global data chunk belonging to a thread block is 4. Before starting a 1D FFT along the row-major direction, suppose that in a thread block every thread with global Y index tid_y were to load the data row with index bitreverse(tid_y). This would be followed by a 1D FFT in the row-major direction. Since the height is 4, every thread has 4 elements in the Y direction, but because we loaded the rows corresponding to bitreverse(tid_y), each thread would be able to finish the first two stages of the FFT in the Y direction using register memory, without synchronization coming into the picture. This would be followed by 2 transposes so that the Y data dimension becomes row major and the Z data dimension becomes column major, although its indices would be bit-reversed. This would be followed by a 1D FFT in row-major order. Since we read the rows in bit-reversed order in the previous 1D FFT, the input to this 1D FFT would be out of order; the data would effectively already be in the third stage of the 1D FFT computation, and the final output data would be in order. This would be followed by the same sequence of instructions till the 3D FFT is computed.
5.1.2 Transpose Kernel
The current implementation of the 3D FFT involves calls to two transpose kernels after the third 1D FFT is completed. These two calls individually exchange the data dimensions XY and YZ to undo the effect of the transposes required by the 1D FFT kernels. It is possible to merge these two transpose kernels and transpose the X, Y and Z dimensions simultaneously. Since each transpose kernel spends non-trivial time calculating its new index, instruction level parallelism can be utilized to hide the memory latency.
5.2 Kepler Architecture Specific Improvements

5.2.1 Registers Per Thread
One of the prime reasons why the height of the data chunk assigned to a thread block could not be increased is register pressure. It would be possible to increase this height on Kepler GK110, since the maximum number of registers per thread has been raised to 255, as compared to 63 on Fermi. An increased height could mean improved performance, since it would allow even more stages of the column-major FFT computation (as explained in the previous subsection) to be completed by a single thread using register memory only.
5.2.2 Shuffle Instruction
It is quite possible that the shared memory usage of the 1D FFT kernel could be drastically reduced by the shuffle instruction. The shuffle instruction enables data exchange between the threads of a warp, so shared memory usage would go down as a result. For an out-of-order input and in-order output FFT, this would be especially beneficial because the first few stages would eliminate the need for inter-warp communication.
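A sketch of how a butterfly stage could exchange data through warp shuffles instead of shared memory (our own illustration; Kepler introduced __shfl_xor, shown here in its modern __shfl_xor_sync form):

// Butterfly ADDSUB via warp shuffle: each lane holds one element, and the
// partner value at lane (tid ^ stride) is fetched without shared memory.
// Valid only while stride < 32, i.e. the partner lies in the same warp.
__device__ void butterfly_shfl(float& re, float& im, int stride) {
    float pre = __shfl_xor_sync(0xffffffff, re, stride);
    float pim = __shfl_xor_sync(0xffffffff, im, stride);
    if (threadIdx.x & stride) {        // upper half: partner - self
        re = pre - re;
        im = pim - im;
    } else {                           // lower half: partner + self
        re = pre + re;
        im = pim + im;
    }
}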
A Occupancy Calculation for Different Thread Block Heights
H=1
Figure 14: Threads Per Block Vs. SM Warp Occupancy
Figure 15: Registers Per Thread Vs. SM Warp Occupancy
Figure 16: Shared Memory Usage Vs. SM Warp Occupancy
H=2
Figure 17: Threads Per Block Vs. SM Warp Occupancy
Figure 18: Registers Per Block Vs. SM Warp Occupancy
Figure 19: Shared Memory Usage Vs. SM Warp Occupancy
H=4
Figure 20: Threads Per Block Vs. SM Warp Occupancy
Figure 21: Registers Per Block Vs. SM Warp Occupancy
H=8
Figure 22: Threads Per Block Vs. SM Warp Occupancy
Figure 23: Registers Per Block Vs. SM Warp Occupancy
Figure 24: Shared Memory Usage Vs. SM Warp Occupancy
H = 16
Figure 25: Threads Per Block Vs. SM Warp Occupancy
Figure 26: Registers Per Block Vs. SM Warp Occupancy
References

[Discrete Fourier Transform] http://en.wikipedia.org/wiki/Discrete_Fourier_transform
[Separability of 2D Fourier Transform] www.cs.unm.edu/~williams/cs530/ft2d.pdf
[The Cooley-Tukey Fast Fourier Transform Algorithm] http://cnx.org/content/m16334/latest/#uid8
[Node and GPU Configuration] http://www.nersc.gov/users/computational-systems/dirac/node-and-gpu-configuration/
[Optimizing Matrix Transpose in CUDA]
[G. C. Danielson and C. Lanczos] "Some improvements in practical Fourier analysis and their application to X-ray scattering from liquids", J. Franklin Inst. 233, 365 (1942).
[Cooley, J. W. and Tukey, J. W.] "An Algorithm for the Machine Calculation of Complex Fourier Series." Math. Comput. 19, 297-301, 1965.
[Computing FFT Twiddle Factors] http://www.dsprelated.com/showarticle/107.php
[Decimation-in-time (DIT) Radix-2 FFT] http://cnx.org/content/m12016/latest/
[Signal Flow Graphs of Cooley-Tukey FFTs] http://cnx.org/content/m16352/latest/