Accelerating Fast Fourier Transform for Wideband Channelization
Carlo del Mundo*, Vignesh Adhinarayanan§, Wu-chun Feng*§
* Department of Electrical and Computer Engineering,
§ Department of Computer Science, Virginia Tech
synergy.cs.vt.edu
Forecast
• Goal: Accelerate the Fast Fourier Transform (FFT) using graphics processing units (GPUs)
– Replace fixed hardware ASICs with programmable GPUs
Accelerating Fast Fourier Transform for Wideband Channelization
Carlo del Mundo, cdel@vt.edu, carlodelmundo.com
synergy.cs.vt.edu
Motivation
• FFT is a critical building block across many disciplines
Introduction
• Wideband Channelization
– Purpose: To isolate channels within a wideband signal
Introduction (Channelization)
• Algorithm: Polyphase filter bank (PFB) channelizer
– Problem: FFT stage grows fastest in channelization
Figure: Stages in a PFB Channelizer
Choosing the Right Processor
• Criteria: Programmability & Performance
Outline
• Motivation
• Introduction
• Background
• Approach
– System-level optimizations
– Algorithm-level optimizations
• Results
– Optimizations in isolation
– Optimizations in concert
• Conclusion
Background (GPUs)
• GPU Memory Hierarchy
– Global Memory
– Image Memory
– Constant Memory
– Local Memory
– Registers

Table: Memory Read Bandwidth for Radeon HD 6970
Memory Unit     Read Bandwidth (TB/s)
Registers       16.2
Constant        5.4
Local           2.7
L1/L2 Cache     1.35 / 0.45
Global          0.17
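
To put the bandwidth table in perspective, the following minimal sketch computes how many times faster each memory unit can be read than global memory. It uses only the Radeon HD 6970 figures from the table; the enum, array, and function names are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Memory units in the order of the Radeon HD 6970 table (L1 and L2 split). */
enum { MEM_REGISTERS, MEM_CONSTANT, MEM_LOCAL, MEM_L1, MEM_L2, MEM_GLOBAL, MEM_COUNT };

/* Read bandwidths in TB/s, copied from the table. */
static const double read_bw_tbs[MEM_COUNT] = {16.2, 5.4, 2.7, 1.35, 0.45, 0.17};

/* Hypothetical helper: read bandwidth of `unit` relative to global memory. */
double speedup_over_global(int unit)
{
    assert(unit >= 0 && unit < MEM_COUNT);
    return read_bw_tbs[unit] / read_bw_tbs[MEM_GLOBAL];
}
```

By these figures, registers can be read roughly 95 times faster than global memory and local memory roughly 16 times faster, which is why the optimizations that follow try to keep data on chip.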
Approach
• Act as the “human compiler”
1. Derive a candidate set of optimizations for FFT on GPUs
2. Apply optimizations in isolation
3. Apply optimizations in concert
Approach
• System-level Optimizations (applicable to any application)
1. Register Preloading
2. Vector Access/{Vector,Scalar} Arithmetic
3. Constant Memory Usage
4. Dynamic Instruction Reduction
5. Memory Coalescing
6. Image Memory
• Algorithm-level Optimizations
1. Naïve Transpose (LM-CM)
2. Compute/Transpose via LM (LM-CC)
3. Compute/No Transpose via LM (LM-CT)
C. del Mundo et al., “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.
System-level Optimizations
1. Register Preloading (RP)
– Load operands from global memory into registers before computing on them

Without Register Preloading
    __kernel void unoptimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // The butterfly reads its operands directly from global memory.
        FFT4_in_order_output(&buffer[0], &buffer[4],
                             &buffer[8], &buffer[12]);
        // …
    }

With Register Preloading
    __kernel void optimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        __private float2 r0, r1, r2, r3; // Register declaration
        // Explicit loads from global memory into registers
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];
        // The butterfly now operates on registers only.
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
        // …
    }
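
The slides do not show the body of FFT4_in_order_output. As a rough plain-C model of the preloading pattern, the sketch below copies four samples into local variables, runs a 4-point forward DFT on them, and writes the results back once. The float2 struct stands in for OpenCL's float2, and fft4_in_order and fft4_preloaded are hypothetical names for this illustration, not the authors' implementation.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's float2: x = real part, y = imaginary part. */
typedef struct { float x, y; } float2;

static float2 f2_add(float2 a, float2 b) { return (float2){a.x + b.x, a.y + b.y}; }
static float2 f2_sub(float2 a, float2 b) { return (float2){a.x - b.x, a.y - b.y}; }
/* Multiply by -i: (x + iy)(-i) = y - ix. */
static float2 f2_mul_neg_i(float2 a) { return (float2){a.y, -a.x}; }

/* Hypothetical 4-point forward DFT, in place, outputs in natural order. */
void fft4_in_order(float2 *a, float2 *b, float2 *c, float2 *d)
{
    assert(a && b && c && d);
    float2 s0 = f2_add(*a, *c);               /* x0 + x2        */
    float2 s1 = f2_sub(*a, *c);               /* x0 - x2        */
    float2 s2 = f2_add(*b, *d);               /* x1 + x3        */
    float2 s3 = f2_mul_neg_i(f2_sub(*b, *d)); /* -i * (x1 - x3) */
    *a = f2_add(s0, s2);                      /* X[0] */
    *b = f2_add(s1, s3);                      /* X[1] */
    *c = f2_sub(s0, s2);                      /* X[2] */
    *d = f2_sub(s1, s3);                      /* X[3] */
}

/* Model of register preloading: read each input from the buffer once into
 * locals, compute on the locals, then write each result back once. */
void fft4_preloaded(float2 *buffer)
{
    float2 r0 = buffer[0], r1 = buffer[1], r2 = buffer[2], r3 = buffer[3];
    fft4_in_order(&r0, &r1, &r2, &r3);
    buffer[0] = r0; buffer[1] = r1; buffer[2] = r2; buffer[3] = r3;
}
```

In this model the buffer is touched exactly eight times (four loads, four stores); all intermediate butterfly values live in locals, which is the effect register preloading aims for on the GPU.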
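System-level Optimizations
2. Vector Access (float{2, 4, 8, 16})
Figure: a single vector access fetching a[0], a[1], a[2], a[3]
– Scalar Math (VASM): vector access, arithmetic issued per component
• float + float
– Vector Math (VAVM): vector access, arithmetic issued on the whole vector
• float4 + float4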
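
The difference between the two variants can be sketched in plain C. The example below assumes a GCC or Clang compiler, whose vector extension is used as a stand-in for OpenCL's float4; the function names add_vasm, add_vavm, and check_agreement are made up for this illustration.

```c
#include <assert.h>

/* GCC/Clang vector extension used as a stand-in for OpenCL's float4. */
typedef float float4 __attribute__((vector_size(16)));

/* VASM-style: each operand arrives as one float4 (vector access), but the
 * arithmetic is written as four independent float + float operations. */
float4 add_vasm(float4 a, float4 b)
{
    float4 r;
    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] + b[2];
    r[3] = a[3] + b[3];
    return r;
}

/* VAVM-style: same vector access, but a single float4 + float4 operation. */
float4 add_vavm(float4 a, float4 b)
{
    return a + b;
}

/* Sanity check: both formulations must agree component by component. */
void check_agreement(float4 a, float4 b)
{
    float4 s = add_vasm(a, b), v = add_vavm(a, b);
    for (int i = 0; i < 4; ++i)
        assert(s[i] == v[i]);
}
```

Both functions compute the same result; what changes is the form of the arithmetic handed to the compiler, which is the distinction the VASM and VAVM variants explore.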
Algorithm-level optimizations
• Transpose – elements across the diagonal are exchanged
Figure: 4x4 matrix and its transposed matrix
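
As a minimal plain-C illustration of the transpose just described, the sketch below swaps each element above the diagonal with its mirror image below it. The function name transpose_square is hypothetical, and the matrix is assumed to be square and stored in row-major order.

```c
#include <assert.h>

/* In-place transpose of an n x n row-major matrix: every element above the
 * diagonal is exchanged with its mirror image below the diagonal. */
void transpose_square(float *m, int n)
{
    assert(m && n > 0);
    for (int r = 0; r < n; ++r) {
        for (int c = r + 1; c < n; ++c) {
            float tmp = m[r * n + c];
            m[r * n + c] = m[c * n + r];
            m[c * n + r] = tmp;
        }
    }
}
```

For the 4x4 case on the slide, six pairs are exchanged and the four diagonal elements stay in place.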
Algorithm-level optimizations
1. Naïve Transpose (LM-CM)
Figure: threads t0, t1, t2, t3 move the original matrix from the register file through local memory to obtain the transposed matrix
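
The following sequential C model sketches how a communication-only transpose through local memory could work for a 4x4 tile. It assumes each of the four threads holds one row of the tile in its registers, which is one plausible reading of the figure; TILE, regs, lm, and the function name are illustrative.

```c
#include <assert.h>

#define TILE 4  /* 4 threads with 4 registers each: a 4x4 tile */

/* Sequential model of a local-memory transpose. regs[t][i] is register i of
 * thread t; lm models the local-memory tile shared by the threads. */
void transpose_via_local_memory(float regs[TILE][TILE])
{
    float lm[TILE * TILE];
    assert(regs != 0);

    /* Phase 1: every thread writes its row of registers to local memory. */
    for (int t = 0; t < TILE; ++t)
        for (int i = 0; i < TILE; ++i)
            lm[t * TILE + i] = regs[t][i];

    /* A real OpenCL kernel would need barrier(CLK_LOCAL_MEM_FENCE) here so
     * that all writes are visible before any thread reads. */

    /* Phase 2: every thread reads back a column, so thread t ends up with
     * what used to be register t of threads 0..TILE-1. */
    for (int t = 0; t < TILE; ++t)
        for (int i = 0; i < TILE; ++i)
            regs[t][i] = lm[i * TILE + t];
}
```

Local memory is used purely for communication here: no arithmetic happens between the write and the read, which matches the "Communication Only" expansion of LM-CM.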
Algorithm-level optimizations
3. The pseudo transpose (LM-CT)
– Idea:
• Load data to local memory
• Perform computation on columns, then rows.
– Advantage:
• Skips the transpose step
– Disadvantage:
• Local memory has lower throughput than registers (2.7 TB/s vs. 16.2 TB/s read bandwidth on the Radeon HD 6970).
Figure: original and transposed matrices in local memory
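
One way to read the columns-then-rows idea is as a 4x4 Cooley-Tukey decomposition of the 16-point FFT. The plain-C sketch below follows that reading under stated assumptions: the tile array models local memory, twiddle factors are tabulated rather than computed, and every name (float2, W16, fft4_strided, fft16_columns_then_rows) is hypothetical rather than taken from the authors' kernels.

```c
#include <assert.h>

typedef struct { float x, y; } float2;   /* stand-in for OpenCL float2 */

static float2 f2_add(float2 a, float2 b) { return (float2){a.x + b.x, a.y + b.y}; }
static float2 f2_sub(float2 a, float2 b) { return (float2){a.x - b.x, a.y - b.y}; }
static float2 f2_mul(float2 a, float2 b)
{
    return (float2){a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x};
}

/* Twiddle factors W16^m = exp(-2*pi*i*m/16), tabulated as (cos, -sin). */
static const float2 W16[16] = {
    { 1.00000000f,  0.00000000f}, { 0.92387953f, -0.38268343f},
    { 0.70710678f, -0.70710678f}, { 0.38268343f, -0.92387953f},
    { 0.00000000f, -1.00000000f}, {-0.38268343f, -0.92387953f},
    {-0.70710678f, -0.70710678f}, {-0.92387953f, -0.38268343f},
    {-1.00000000f,  0.00000000f}, {-0.92387953f,  0.38268343f},
    {-0.70710678f,  0.70710678f}, {-0.38268343f,  0.92387953f},
    { 0.00000000f,  1.00000000f}, { 0.38268343f,  0.92387953f},
    { 0.70710678f,  0.70710678f}, { 0.92387953f,  0.38268343f},
};

/* 4-point in-order forward DFT on elements spaced `stride` apart. */
static void fft4_strided(float2 *p, int stride)
{
    float2 a = p[0], b = p[stride], c = p[2 * stride], d = p[3 * stride];
    float2 s0 = f2_add(a, c), s1 = f2_sub(a, c);
    float2 s2 = f2_add(b, d), t = f2_sub(b, d);
    float2 s3 = {t.y, -t.x};                 /* (b - d) * -i */
    p[0]          = f2_add(s0, s2);
    p[stride]     = f2_add(s1, s3);
    p[2 * stride] = f2_sub(s0, s2);
    p[3 * stride] = f2_sub(s1, s3);
}

/* 16-point forward FFT on a 4x4 row-major tile, tile[4*r + c] = in[4*r + c]. */
void fft16_columns_then_rows(const float2 *in, float2 *out)
{
    float2 tile[16];                         /* models the local-memory buffer */
    assert(in && out);
    for (int i = 0; i < 16; ++i) tile[i] = in[i];

    for (int c = 0; c < 4; ++c)              /* FFT down each column */
        fft4_strided(tile + c, 4);

    for (int r = 0; r < 4; ++r)              /* twiddle element (r, c) by W16^(r*c) */
        for (int c = 0; c < 4; ++c)
            tile[4 * r + c] = f2_mul(tile[4 * r + c], W16[r * c]);

    for (int r = 0; r < 4; ++r)              /* FFT along each row */
        fft4_strided(tile + 4 * r, 1);

    /* Element (r, c) now holds X[r + 4*c]; the index swap is folded into the
     * write-out instead of being a separate transpose pass. */
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r + 4 * c] = tile[4 * r + c];
}
```

Because the second pass simply walks the tile along the other dimension, no explicit transpose is needed between the two passes; the cost is that every butterfly operand is read from the shared tile rather than from registers.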
Results (Experimental Testbed)
• Algorithm:
– 1D FFT (batched), N = 16 pts
– Cooley-Tukey Decomposition

Table: GPU Testbed
Device (AMD Radeon)   Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)
HD 7970               2048    3788                        264
HD 6970 (VLIW)        1536    2703                        176
HD 5870 (VLIW)        1600    2720                        154
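
A back-of-the-envelope calculation can relate the testbed figures to the 16-point FFT. The sketch below compares each GPU's peak FLOPs per byte of global-memory bandwidth with the FLOPs per byte of one FFT. The peak numbers come from the table; the cost model (5·N·log2 N FLOPs per FFT, each 8-byte sample read once and written once) is a common convention assumed here, not a figure from the slides, and all names are illustrative.

```c
#include <assert.h>

/* Peak figures copied from the GPU testbed table. */
typedef struct { const char *name; double gflops; double gbps; } gpu_spec;

static const gpu_spec testbed[3] = {
    {"HD 7970", 3788.0, 264.0},
    {"HD 6970", 2703.0, 176.0},
    {"HD 5870", 2720.0, 154.0},
};

/* FLOPs the device can issue per byte fetched from global memory at peak. */
double machine_balance(const gpu_spec *g)
{
    assert(g && g->gbps > 0.0);
    return g->gflops / g->gbps;
}

/* Assumed cost model (not from the slides): 5*N*log2(N) FLOPs per complex
 * FFT, with each 8-byte float2 sample read once and written once. */
double fft_flops_per_byte(int n, int log2n)
{
    assert(n > 0 && log2n > 0);
    double flops = 5.0 * n * log2n;
    double bytes = 2.0 * 8.0 * n;
    return flops / bytes;
}
```

Under these assumptions a 16-point FFT performs 1.25 FLOPs per byte, well below the roughly 14 to 18 FLOPs per byte the three GPUs offer at peak, which suggests the kernel is bounded by memory traffic rather than arithmetic.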
Results (in isolation)
Improvements to Baseline (Max. % Increase)
1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT)
2. 40% - Coalesce memory accesses (CGAP)
3. 20% - Use scalar math (VASM2/VASM4)
Neutral/Detrimental to Baseline (Min. % Decrease)
1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L)
2. 0% - Dynamic instruction reduction (LU, CSE, IL)
3. 18% - Avoid large vectors & vector math (VASM16, VAVM8/16)
[Chart panels: AMD Radeon HD 7970 (Scalar, non-VLIW); AMD Radeon HD 5870/6970 (VLIW). Callouts on this build: 34%, 18%, 53%]
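The "% increase" and "% decrease" figures on this slide can be related to the multiplicative speedups quoted on the "in concert" slides with a one-line conversion. The C sketch below shows that arithmetic. It assumes the percentages are relative changes in throughput against the baseline, which the slide does not state explicitly, and the function name `speedup_from_percent_change` is hypothetical.

```c
#include <assert.h>

/* Convert a percent change relative to the baseline into a speedup factor.
   Assumption: the slide's percentages are relative throughput changes,
   so +160% means 2.6 times the baseline and -18% means 0.82 times. */
double speedup_from_percent_change(double percent)
{
    assert(percent > -100.0);  /* -100% would mean zero throughput */
    return 1.0 + percent / 100.0;
}
```

Under that reading, calling `speedup_from_percent_change(160.0)` gives a factor of about 2.6 for the on-chip optimizations, and a negative argument gives a factor below 1 for the detrimental ones.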
Results (in concert)
• Improvements (Max. Increase)
– {RP + LM-CM} best on-chip optimization
– Use Constant Memory (CM) for twiddle calculations
– Use global memory (instead of image memory)
– Optimal set for AMD GPUs
• RP – Register Preloading
• LM-CM – Transpose via local memory
• CM – Constant memory usage
• CGAP – Coalesced Global Access Pattern
• VASM2 – Vector Access, Scalar Math (float2)
[Chart: speedup over baseline for combinations of optimizations; bar labels include 1.5x, 1.8x, 2.1x, 2.4x, 2.9x, 5.6x, 6.3x, and 6.5x]
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
2 All implementations are coalesced (CGAP) and use VASM2.
3 The speedups listed in the graph apply only to the Radeon HD 7970 (for brevity).
Results (1D FFT 16-pts, GPU versions)
• Optimized GPU implementation is faster than the baseline GPU implementation by a factor of 14.5
Conclusions
• Contributions:
– A portable building block for FFT towards GPU-based radios
– Architecture-aware insights for mapping and optimizing FFT across three generations of AMD GPUs
• Contact:
– Carlo del Mundo
– cdel@vt.edu
• Optimal set for AMD GPUs
– RP: Register Preloading
– LM-CM: Transpose via local memory
– CM: Constant memory usage
– CGAP: Coalesced Global Access Pattern
– VASM2: Vector Access, Scalar Math (float2)
http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html
http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg
Appendix Slides
Introduction (FFT)
• Fast Fourier Transform (FFT)
– A spectral method
• Key computational idiom for present and future applications (dwarf)§
List of Dwarfs
1. Finite State Machine
2. Circuits
3. Graph Algorithms
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral Methods
8. Dynamic Prog.
9. Particle Methods
10. Backtrack/B&B
11. Graphical Models
12. Unstructured Grids
13. Map Reduce
§ Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.
Background (Optimizing on GPUs)
1. RP (Register Preloading) - All data elements are first preloaded into the register file of the respective GPU. Computation is performed solely in registers.
2. CGAP (Coalesced Global Access Pattern) - Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}) - Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only) - Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose) - Data elements are loaded into local memory for computation. The communication step is avoided by algorithm reorganization.
6. LM-CC (Local Memory, Computation and Communication) - All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-K (Constant Memory - Kernel Argument) - The twiddle factors for the twiddle multiplication stage of the FFT are precomputed on the CPU and stored in GPU constant memory for fast lookup.
8. CSE (Common Subexpression Elimination) - A traditional optimization that collapses identical expressions to save computation. This optimization may increase register live time, thereby increasing register pressure.
9. IL (Function Inlining) - A function's code body is inserted in place of a function call. It is used primarily for functions that are called frequently.
10. IM (Image Memory) - A texture image is used in place of global memory.
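The CGAP entry can be made concrete with a small index calculation. The C sketch below is a host-side illustration, not the authors' kernel. It contrasts a coalesced mapping, where neighbouring threads touch neighbouring elements on every pass, with a strided mapping, where each thread owns a contiguous run. The function names and the idea of a thread making several passes are assumptions introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Coalesced mapping (hypothetical helper): on pass i, thread t touches
   element i * num_threads + t. On pass 0 the kth thread reads element k,
   and on every pass adjacent threads read adjacent elements. */
size_t coalesced_index(size_t t, size_t i, size_t num_threads)
{
    assert(t < num_threads);
    return i * num_threads + t;
}

/* Strided mapping for comparison: thread t owns elements
   t * elems_per_thread ... t * elems_per_thread + elems_per_thread - 1,
   so adjacent threads are elems_per_thread elements apart. */
size_t strided_index(size_t t, size_t i, size_t elems_per_thread)
{
    assert(i < elems_per_thread);
    return t * elems_per_thread + i;
}
```

With 64 threads, `coalesced_index(5, 0, 64)` is 5 and threads 5 and 6 are always one element apart. With the strided mapping and 4 elements per thread, they are 4 elements apart.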
Motivation (GPU FFT vs. CPU FFT)
• GPU FFT outperforms CPU FFT by factors as high as 6.5*
– 1D batched FFT, N = 16 pts
* Device-Host Data Transfer Not Included
Introduction (Channelizer Architecture)
• Channelizer Architecture
– FIR Filtering, FFT, and Channel Mapping.
S3: Constant Memory
• Fast cached lookup for
frequently used data
__constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f),
    (float2)(1.0f,0.0f), (float2)(1.0f,0.0f),
    ... more sin/cos values };
Without Constant Memory
// Note: in OpenCL C, '*' on float2 is component-wise; a complex multiply is intended.
for (int j = 1; j < 4; ++j)
{
    // Recompute the twiddle factor on every iteration.
    double theta = -2.0 * M_PI * tid * j / 16;
    float2 twid = (float2)((float)cos(theta), (float)sin(theta));
    result[j] = buffer[j*4] * twid;
}
With Constant Memory
for (int j = 1; j < 4; ++j)
    // Look up the precomputed twiddle factor instead of recomputing it.
    result[j] = buffer[j*4] * twiddles[4*j+tid];