Accelerating Fast Fourier Transform for Wideband Channelization
Carlo del Mundo*, Vignesh Adhinarayanan§, Wu-chun Feng*§
* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech
Carlo del Mundo, cdel@vt.edu, carlodelmundo.com
synergy.cs.vt.edu

Forecast
• Goal: Accelerate the Fast Fourier Transform (FFT) using graphics processing units (GPUs)
  – Replace fixed hardware ASICs with programmable GPUs
Image sources:
  http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga
  http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg

Motivation
• FFT is a critical building block across many disciplines
Image sources:
  http://www.ajnr.org/content/27/6/1230/F1.large.jpg
  http://www.elektrodaily.com/wp-content/uploads/2013/02/shazam-app.png
  http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Introduction
• Wideband Channelization
  – Purpose: To isolate channels within a wideband signal
Image source:
  http://www.wireless.vt.edu/symposium/2012/tutorials/sessionA2.html

Introduction (Channelization)
• Algorithm: Polyphase filter bank (PFB) channelizer
  – Problem: Among the channelizer stages, the cost of the FFT stage grows fastest
Figure: Stages in a PFB Channelizer

Choosing the Right Processor
• Criteria: Programmability & Performance
Image sources:
  GPU: http://images.bit-tech.net/content_images/2011/12/amd-radeon-hd-7970-3gb-review/amd-radeon-hd7970-e.jpg
  CPU: http://www.maximumpc.com/files/u154082/intel_cpu_socket3.jpg
  FPGA: http://fr.academic.ru/pictures/frwiki/70/Fpga_xilinx_spartan.jpg
  ASIC: http://upload.wikimedia.org/wikipedia/commons/7/79/SSDTR-ASIC_technology.jpga

Outline
• Motivation
• Introduction
• Background
• Approach
  – System-level optimizations
  – Algorithm-level optimizations
• Results
  – Optimizations in isolation
  – Optimizations in concert
• Conclusion
Background (GPUs)
• GPU Memory Hierarchy
  – Global Memory
  – Image Memory
  – Constant Memory
  – Local Memory
  – Registers

Table: Memory Read Bandwidth for Radeon HD 6970
  Memory Unit      Read Bandwidth (TB/s)
  Registers        16.2
  Constant         5.4
  Local            2.7
  L1/L2 Cache      1.35 / 0.45
  Global           0.17

Approach
• Act as the “human compiler”
  1. Derive a candidate set of optimizations for FFT on GPUs
  2. Apply optimizations in isolation
  3. Apply optimizations in concert
Figure labels: Candidate Optimizations; Optimizations in Isolation; Optimizations in Concert

Approach
• System-level Optimizations (applicable to any application)
  1. Register Preloading
  2. Vector Access/{Vector,Scalar} Arithmetic
  3. Constant Memory Usage
  4. Dynamic Instruction Reduction
  5. Memory Coalescing
  6. Image Memory
• Algorithm-level Optimizations (LM: local memory)
  1. Naïve Transpose (LM-CM)
  2. Compute/Transpose via LM (LM-CC)
  3. Compute/No Transpose via LM (LM-CT)
C. del Mundo et al., “Accelerating Fast Fourier Transform for Wideband Channelization,” IEEE ICC, Budapest, Hungary, June 2013.
System-level Optimizations
1. Register Preloading (RP)
  – Load operands from global memory into registers before computing on them

Without Register Preloading
    __kernel void unoptimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        // Operands are passed as pointers into global memory
        FFT4_in_order_output(&buffer[0], &buffer[4], &buffer[8], &buffer[12]);
        // (remainder of kernel omitted)
    }

With Register Preloading
    __kernel void optimized(__global float2 *buffer)
    {
        int index = …;
        buffer += index;

        __private float2 r0, r1, r2, r3; // Register Declaration
        // Explicit Loads
        r0 = buffer[0]; r1 = buffer[1]; r2 = buffer[2]; r3 = buffer[3];
        // The FFT now operates on private (register) copies
        FFT4_in_order_output(&r0, &r1, &r2, &r3);
        // (remainder of kernel omitted)
    }
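The register preloading idea can be tried out on the host in plain C. The sketch below is only an illustration under stated assumptions: `fft4_in_order` is a hypothetical stand-in for `FFT4_in_order_output` (whose body is not shown in the slides), `float2` is a hand-written struct rather than the OpenCL type, and local variables stand in for `__private` registers. Whether locals really end up in registers is up to the compiler, so this shows the data flow, not the speedup.

```c
typedef struct { float x, y; } float2;   /* x = real part, y = imaginary part */

static float2 cadd(float2 a, float2 b) { return (float2){ a.x + b.x, a.y + b.y }; }
static float2 csub(float2 a, float2 b) { return (float2){ a.x - b.x, a.y - b.y }; }
static float2 mul_neg_j(float2 a)      { return (float2){ a.y, -a.x }; }  /* a * (-j) */

/* Hypothetical stand-in for FFT4_in_order_output:
   in-place 4-point forward DFT with outputs in natural order. */
static void fft4_in_order(float2 *a, float2 *b, float2 *c, float2 *d)
{
    float2 s0 = cadd(*a, *c);
    float2 d0 = csub(*a, *c);
    float2 s1 = cadd(*b, *d);
    float2 d1 = mul_neg_j(csub(*b, *d));
    *a = cadd(s0, s1);   /* X0 = a + b + c + d     */
    *b = cadd(d0, d1);   /* X1 = a - jb - c + jd   */
    *c = csub(s0, s1);   /* X2 = a - b + c - d     */
    *d = csub(d0, d1);   /* X3 = a + jb - c - jd   */
}

/* No preloading: the butterfly reads and writes the buffer through pointers. */
static void fft4_direct(float2 *buffer)
{
    fft4_in_order(&buffer[0], &buffer[1], &buffer[2], &buffer[3]);
}

/* Preloading: copy the operands into locals once, compute on the locals,
   then write each result back once. */
static void fft4_preloaded(float2 *buffer)
{
    float2 r0 = buffer[0], r1 = buffer[1], r2 = buffer[2], r3 = buffer[3];
    fft4_in_order(&r0, &r1, &r2, &r3);
    buffer[0] = r0; buffer[1] = r1; buffer[2] = r2; buffer[3] = r3;
}
```

Both functions produce the same result; the preloaded variant only makes the loads and stores explicit, which is the property the RP optimization relies on when the buffer lives in slow global memory.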
System-level Optimizations
2. Vector Access (float{2, 4, 8, 16}): load and store data as vector types
Figure: a vector a with elements a[0], a[1], a[2], a[3]
  – Scalar Math (VASM): vectorized access, arithmetic on one element at a time
    • float + float
  – Vector Math (VAVM): vectorized access, arithmetic on the whole vector
    • float4 + float4
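The difference between VASM and VAVM can be sketched in C. This is a hedged illustration, not the kernels from the talk: `float4` here is defined with the GCC/Clang `vector_size` extension (an assumption, since standard C has no such type and OpenCL C provides it natively), and `add_vasm` / `add_vavm` are hypothetical names.

```c
#include <stddef.h>

/* float4 via the GCC/Clang vector extension (assumption of this sketch;
   OpenCL C has a built-in float4 type instead). */
typedef float float4 __attribute__((vector_size(16)));

/* VASM: vector access, scalar math.
   Each operand is fetched as one float4, then added element by element. */
static float4 add_vasm(const float4 *a, const float4 *b, size_t i)
{
    float4 va = a[i];   /* one vector load */
    float4 vb = b[i];   /* one vector load */
    float4 r;
    r[0] = va[0] + vb[0];   /* float + float */
    r[1] = va[1] + vb[1];
    r[2] = va[2] + vb[2];
    r[3] = va[3] + vb[3];
    return r;
}

/* VAVM: vector access, vector math.
   Same loads, but the addition is written as a single float4 operation. */
static float4 add_vavm(const float4 *a, const float4 *b, size_t i)
{
    return a[i] + b[i];     /* float4 + float4 */
}
```

The two variants compute identical values; which one runs faster depends on how the target hardware issues scalar and vector instructions, which is what the VASM/VAVM comparison in the results measures.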
Algorithm-level optimizations
• Transpose – elements across the diagonal are exchanged
Figure: 4x4 matrix (Original) and its transposed matrix (Transposed)

Algorithm-level optimizations
1. Naïve Transpose (LM-CM)
Figure labels: Original; Transposed; threads t0, t1, t2, t3; Register File; Local Memory

Algorithm-level optimizations
3. The pseudo transpose (LM-CT): Compute/No Transpose via LM
  – Idea:
    • Load data to local memory
    • Perform computation on columns, then rows.
  – Advantage:
    • Skips the transpose step
  – Disadvantage:
    • Local memory has lower throughput than registers (2.7 TB/s vs. 16.2 TB/s read bandwidth on the Radeon HD 6970).
Figure labels: Original; Transposed; Local Memory
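To see where the transpose comes from, a 16-point FFT can be written as a 4x4 Cooley-Tukey decomposition: 4-point DFTs down the columns, a twiddle multiplication, then 4-point DFTs along the rows. The C sketch below illustrates that "columns, then rows" order on the host. It is an assumption-laden illustration rather than the kernel from the talk: the array `m` merely stands in for a tile in local memory, the function names are hypothetical, and the inner 4-point DFTs are written as plain sums for clarity instead of optimized butterflies.

```c
typedef struct { float x, y; } float2;   /* x = real part, y = imaginary part */

/* W16[m] = exp(-2*pi*i*m/16), tabulated so the sketch needs no math library. */
static const float2 W16[16] = {
    { 1.0f,         0.0f       }, { 0.92387953f, -0.38268343f},
    { 0.70710678f, -0.70710678f}, { 0.38268343f, -0.92387953f},
    { 0.0f,        -1.0f       }, {-0.38268343f, -0.92387953f},
    {-0.70710678f, -0.70710678f}, {-0.92387953f, -0.38268343f},
    {-1.0f,         0.0f       }, {-0.92387953f,  0.38268343f},
    {-0.70710678f,  0.70710678f}, {-0.38268343f,  0.92387953f},
    { 0.0f,         1.0f       }, { 0.38268343f,  0.92387953f},
    { 0.70710678f,  0.70710678f}, { 0.92387953f,  0.38268343f},
};

static float2 cadd(float2 a, float2 b) { return (float2){ a.x + b.x, a.y + b.y }; }
static float2 cmul(float2 a, float2 b)
{
    return (float2){ a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
}

/* Reference: direct 16-point DFT, X[k] = sum over n of x[n] * W16^(n*k). */
static void dft16_direct(const float2 *x, float2 *X)
{
    for (int k = 0; k < 16; ++k) {
        float2 acc = { 0.0f, 0.0f };
        for (int n = 0; n < 16; ++n)
            acc = cadd(acc, cmul(x[n], W16[(n * k) % 16]));
        X[k] = acc;
    }
}

/* 4x4 decomposition with n = 4*n1 + n2 and k = k1 + 4*k2. */
static void fft16_4x4(const float2 *x, float2 *X)
{
    float2 m[4][4];   /* stand-in for the tile kept in local memory */

    /* Pass 1: 4-point DFT down each column n2, then twiddle by W16^(n2*k1). */
    for (int n2 = 0; n2 < 4; ++n2) {
        for (int k1 = 0; k1 < 4; ++k1) {
            float2 acc = { 0.0f, 0.0f };
            for (int n1 = 0; n1 < 4; ++n1)
                acc = cadd(acc, cmul(x[4 * n1 + n2], W16[(4 * n1 * k1) % 16]));
            m[k1][n2] = cmul(acc, W16[n2 * k1]);
        }
    }

    /* Pass 2: 4-point DFT along each row k1. Element (k1, k2) is written to
       X[k1 + 4*k2], so the index arithmetic absorbs the transposition and
       no separate transpose pass is needed. */
    for (int k1 = 0; k1 < 4; ++k1) {
        for (int k2 = 0; k2 < 4; ++k2) {
            float2 acc = { 0.0f, 0.0f };
            for (int n2 = 0; n2 < 4; ++n2)
                acc = cadd(acc, cmul(m[k1][n2], W16[(4 * n2 * k2) % 16]));
            X[k1 + 4 * k2] = acc;
        }
    }
}
```

Comparing `fft16_4x4` against `dft16_direct` on the same input is a quick way to check the index mapping; on a GPU the trade-off is that both passes read the tile from local memory rather than from registers.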
Results (Experimental Testbed)
• Algorithm:
  – 1D FFT (batched), N = 16 pts
  – Cooley-Tukey Decomposition

GPU Testbed
  Device (AMD Radeon)   Cores   Peak Performance (GFLOPS)   Peak Bandwidth (GB/s)
  HD 7970               2048    3788                        264
  HD 6970 (VLIW)        1536    2703                        176
  HD 5870 (VLIW)        1600    2720                        154

Results (in isolation)
Figure panels: AMD Radeon HD 7970 (Scalar, non-VLIW); AMD Radeon HD 5870/6970 (VLIW)
Improvements to Baseline (Max. % Increase)
1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT)
   Chart callouts: 100%; 160%
2. 40% - Coalesce memory accesses (CGAP)
   Chart callouts: 40%; 0% (No Change)
3. 20% - Use scalar math (VASM2/VASM4)
   Chart callouts: 20%; 10%; 41%; 0% (No Change)
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; CGAP: Coalesced Access Pattern; LU: Loop unrolling; CSE: Common subexpression elimination; IL: Function inlining; Baseline: VASM2.
*Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations.
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 
20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 20% *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 20% - Use scalar math (VASM2/VASM4) 20% AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 40% *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 
20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 0% (No Change) *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 2. 0% - Dynamic instruction reduction (LU, CSE, IL) *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. 
synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 2. 0% - Dynamic instruction reduction (LU, CSE, IL) 0% (No Change) *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 2. 0% - Dynamic instruction reduction (LU, CSE, IL) 3. 18% - Avoid large vectors & vector math (VASM16, VAVM8/16) *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 2. 0% - Dynamic instruction reduction (LU, CSE, IL) 3. 18% - Avoid large vectors & vector math (VASM16, VAVM8/16) 50% 39% 61% *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in isolation) Improvements to Baseline (Max. % Increase) 1. 160% - Minimize bus traffic via on-chip optimizations (RP, LM-CC, LM-CT) 2. 40% - Coalesce memory accesses (CGAP) 3. 
20% - Use scalar math (VASM2/VASM4) AMD Radeon HD 7970 (Scalar, non-VLIW) AMD Radeon HD 5870/6970 (VLIW) Neutral/Detrimental to Baseline (Min. % Decrease) 1. 20% - Naïve transpose (LM-CM), 40% - Constant Memory (CM-K, CM-L) 2. 0% - Dynamic instruction reduction (LU, CSE, IL) 3. 18% - Avoid large vectors & vector math (VASM16, VAVM8/16) 34% 18% 53% *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}:Fourier Local Memory-{Communication Only; Compute, No Transpose; Computation and Accelerating Fast Transform for Wideband Channelization Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. Increase) *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. Increase) 2.9x 2.4x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 
3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. Increase) 2.9x 1.8x 2.4x 2.4x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. Increase) – {RP + LM-CM} best on-chip optimization 2.9x 1.8x 1.5x 2.4x 2.4x 2.1x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). 
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. % Increase) – {RP + LM-CM} best on-chip optimization – Use Constant Memory (CM) for twiddle calculations 6.5x 5.6x 2.9x 1.8x 1.5x 2.4x 2.4x 2.1x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. % Increase) – {RP + LM-CM} best on-chip optimization – Use Constant Memory (CM) for twiddle calculations – Use global memory (instead of image memory) 5.6x 5.6x 5.6x 5.6x 2.9x 1.8x 1.5x 6.5x 2.4x 2.4x 2.1x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). 
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. % Increase) – {RP + LM-CM} best on-chip optimization – Use Constant Memory (CM) for twiddle calculations – Use global memory (instead of image memory) 5.6x 5.6x 5.6x 5.6x 2.9x 1.8x 1.5x 6.5x 6.5x 2.4x 2.4x 2.1x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. % Increase) – {RP + LM-CM} best on-chip optimization – Use Constant Memory (CM) for twiddle calculations – Use global memory (instead of image memory) 5.6x 5.6x 5.6x 5.6x 2.9x 1.8x 1.5x 6.5x 6.5x 6.3x 2.4x 2.4x 2.1x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). 
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. % Increase) – {RP + LM-CM} best on-chip optimization – Use Constant Memory (CM) for twiddle calculations – Use global memory (instead of image memory) 5.6x 5.6x 5.6x 5.6x 2.9x 1.8x 1.5x 2.4x 2.4x 2.1x 6.5x 6.5x 6.3x 2.4x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. % Increase) – {RP + LM-CM} best on-chip optimization – Use Constant Memory (CM) for twiddle calculations – Use global memory (instead of image memory) 5.6x 5.6x 5.6x 5.6x 2.9x 1.8x 1.5x 2.4x 2.4x 2.1x 6.5x 6.5x 2.4x 6.3x 2.4x *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). 
IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. synergy.cs.vt.edu Results (in concert) • Improvements (Max. % Increase) – {RP + LM-CM} best on-chip optimization – Use Constant Memory (CM) for twiddle calculations – Use global memory (instead of image memory) – Optimal set for AMD GPUs 5.6x 5.6x 5.6x 5.6x 2.9x 1.8x 1.5x 2.4x 2.4x 2.1x 6.5x 6.5x 2.4x 6.3x 2.4x • RP – Register Preloading • LM-CM – Transpose via local memory • CM – Constant memory usage • CGAP – Coalesced Global Access Pattern • VASM2 – Vector Access, Scalar Math (float2) *Baseline refers to a functionally correct GPU implementation with VASM2 and no optimizations. 2 All implementations are coalesced (CGAP) and use VASM2. 3 The speedups listed in the graph only applies to the Radeon HD 7970 (for brevity). IM: Image memory; RP: Register Preloading; LM-{CM, CT, CC}: Local Memory-{Communication Only; Compute, No Transpose; Computation and Communication}; VASM{n}: Vectorized Access & Scalar Math{floatn}; VAVM{n}: Vectorized Access & Vector Math{floatn}; CM: Constant Memory Usage; Carlo CSE: del Mundo, cdel@vt.edu, carlodelmundo.com CGAP: Coalesced Access Pattern; LU: Loop unrolling; Common subexpression elimination; IL: Function inlining; Baseline: VASM2. 
Results (1D FFT 16-pts, GPU versions)
• Optimized GPU faster by factors of 14.5 over baseline GPU

Conclusions
• Contributions:
– A portable building block for FFT towards GPU-based radios
– Architecture-aware insights for mapping and optimizing FFT across three generations of AMD GPUs
• Optimal set for AMD GPUs
– RP – Register Preloading
– LM-CM – Transpose via local memory
– CM – Constant memory usage
– CGAP – Coalesced Global Access Pattern
– VASM2 – Vector Access, Scalar Math (float2)
• Contact:
– Carlo del Mundo – cdel@vt.edu

Appendix Slides

Introduction (FFT)
• Fast Fourier Transform (FFT)
– A spectral method
• Key computational idiom for present and future applications (dwarf)§

List of Dwarfs
1. Finite State Machine
2. Circuits
3. Graph Algorithms
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral Methods
8. Dynamic Prog.
9. Particle Methods
10. Backtrack/B&B
11. Graphical Models
12. Unstructured Grids
13. Map Reduce

§ Asanovic et al. A View of the Parallel Computing Landscape. CACM, 2009.

Background (Optimizing on GPUs)
1. RP (Register Preloading) - All data elements are first preloaded into the register file of the respective GPU. Computation is performed solely in registers.
2. CGAP (Coalesced Global Access Pattern) - Threads access memory contiguously (the kth thread accesses memory element k).
3. VASM2/4 (Vector Access, Scalar Math, float{2/4}) - Data elements are loaded as the listed vector type. Arithmetic operations are scalar (float x float).
4. LM-CM (Local Memory, Communication Only) - Data elements are loaded into local memory only for communication. Threads swap data elements solely in local memory.
5. LM-CT (Local Memory, Computation, No Transpose) - Data elements are loaded into local memory for computation. The communication step is avoided by reorganizing the algorithm.
6. LM-CC (Local Memory, Computation and Communication) - All data elements are preloaded into local memory. Computation is performed in local memory, while registers are used for scratchpad communication.
7. CM-K (Constant Memory - Kernel Argument) - The twiddle factors for the twiddle multiplication stage of the FFT are precomputed on the CPU and stored in GPU constant memory for fast lookup.
8. CSE (Common Subexpression Elimination) - A traditional optimization that collapses identical expressions to save computation. It may lengthen register live ranges and therefore increase register pressure.
9. IL (Function Inlining) - A function's code body is inserted in place of a function call. It is used primarily for frequently called functions.
10. IM (Image Memory) - A texture image is used in place of global memory.

Motivation (GPU FFT vs. CPU FFT)
• GPU FFT outperforms CPU FFT by factors as high as 6.5*
– 1D batched FFT, N = 16 pts
* Device-Host Data Transfer Not Included

Introduction (Channelizer Architecture)
• Channelizer Architecture
– FIR Filtering, FFT, and Channel Mapping.

S3: Constant Memory
• Fast cached lookup for frequently used data

    __constant float2 twiddles[16] = { (float2)(1.0f,0.0f), (float2)(1.0f,0.0f),
        (float2)(1.0f,0.0f), (float2)(1.0f,0.0f), ... more sin/cos values};

Without Constant Memory

    // Each thread recomputes its twiddle factors with cos/sin on every call.
    // '*' below is slide shorthand for a complex multiply; OpenCL's built-in
    // float2 '*' is component-wise.
    for (int j = 1; j < 4; ++j)
    {
        double theta = -2.0 * M_PI * tid * j / 16;
        float2 twid = (float2)((float)cos(theta), (float)sin(theta));
        result[j] = buffer[j*4] * twid;
    }

With Constant Memory

    // The twiddle factor is read from the precomputed table instead.
    for (int j = 1; j < 4; ++j)
        result[j] = buffer[j*4] * twiddles[4*j+tid];
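The constant-memory lookup above can be sketched on the host in plain C. This is a minimal illustration, not the authors' kernel. The names `cfloat`, `cmul`, `w16_pow`, `build_twiddles`, and `twiddle_stage` are hypothetical, `cfloat` stands in for OpenCL's `float2`, and the table is filled by repeated complex multiplication with the primitive 16th root of unity rather than by `cos`/`sin` calls. Passing `buffer[0]` through as `result[0]` is also an assumption, since the slide's loop starts at j = 1.

```c
#include <assert.h>

/* Stand-in for OpenCL's float2 holding one complex value (assumption). */
typedef struct { float re, im; } cfloat;

/* Complex multiply: (a.re + i*a.im) * (b.re + i*b.im). */
cfloat cmul(cfloat a, cfloat b) {
    cfloat r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* W16^k for k >= 0, where W16 = exp(-2*pi*i/16) = cos(pi/8) - i*sin(pi/8).
 * Built by repeated multiplication, so no trigonometric calls are needed. */
cfloat w16_pow(int k) {
    const cfloat w16 = { 0.92387953f, -0.38268343f };
    cfloat r = { 1.0f, 0.0f };
    for (int i = 0; i < k; ++i)
        r = cmul(r, w16);
    return r;
}

/* Fill the table once, before any "kernel" runs. Entry 4*j + tid holds
 * W16^(tid*j), matching the slide's twiddles[4*j+tid] indexing. */
void build_twiddles(cfloat table[16]) {
    for (int j = 0; j < 4; ++j)
        for (int tid = 0; tid < 4; ++tid)
            table[4 * j + tid] = w16_pow(tid * j);
}

/* Twiddle stage for one thread, mirroring the "With Constant Memory" loop:
 * each product costs one table read and one complex multiply. */
void twiddle_stage(const cfloat table[16], const cfloat buffer[16],
                   int tid, cfloat result[4]) {
    assert(tid >= 0 && tid < 4);
    result[0] = buffer[0];  /* j = 0 twiddle is 1 (pass-through assumed) */
    for (int j = 1; j < 4; ++j)
        result[j] = cmul(buffer[j * 4], table[4 * j + tid]);
}
```

A caller would run `build_twiddles` once and then invoke `twiddle_stage` for each `tid` in 0..3. In an OpenCL version the table would presumably live in `__constant` memory, as on the slide, so that every work-item reads it through the constant cache instead of recomputing sines and cosines.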