FPGA vs GPU Performance Comparison on the Implementation of FIR Filters

GPGPU Platforms
• GP: general-purpose computation using the GPU
• GPU: Graphics Processing Unit
• CUDA and OpenCL are both frameworks for task-based and data-based general-purpose parallel execution. Their architectures are very similar. The key difference is that OpenCL is a cross-platform framework (implemented on CPUs, GPUs, DSPs, etc.), whereas CUDA is supported only on NVIDIA GPUs.

Memory Hierarchy
• The memory hierarchy of GPGPU architectures resembles the CPU memory hierarchy.
• The slowest but largest memory type sits at the bottom level of the hierarchy. In CUDA terminology it is called global memory. Global memory is typically 2 or 4 GB in size and resides outside the GPU chip.
• Global memory is usually built from DRAM.
• Constant memory is another memory type in CUDA devices. It is optimized for broadcast operations, so it can be accessed faster than global memory.
• Like the caches in a CPU memory hierarchy, the CUDA memory hierarchy has a faster but smaller memory type called shared memory.
• Registers are further storage units in the CUDA memory hierarchy and are private to each thread. Registers have the lowest latency and the highest throughput, but their number is very limited.
[Figure: CUDA Memory Hierarchy]

FIR Filter Overview
• The FIR filter structure is constructed from its transfer function and from its linear difference equation, which is obtained by taking the inverse Z-transform of the transfer function.
• The output stream y(n) is calculated by multiplying the input samples [x(n), x(n-1), …, x(n-M+1)] by the corresponding filter coefficients [b0, b1, …, bM-1] and adding all the products together: y(n) = b0·x(n) + b1·x(n-1) + … + bM-1·x(n-M+1).

GPU Implementation
• Three different implementation techniques are designed to compare the performance of GPUs with that of FPGAs.
• Two of the designs are implemented using CUDA.
• The third design is an OpenCL kernel implementation.
• The first CUDA design is a naïve, simple kernel without any significant optimization.
• The second, optimized CUDA kernel uses shared memory and coalesced global memory accesses.
• The OpenCL design is a port of the optimized CUDA FIR filter implementation.

[Figure: Basic CUDA FIR Filter]

FPGA Implementation
• Three different implementation techniques are selected to synthesize FIR filters on various FPGAs: direct-form, symmetric-form, and distributed arithmetic.
• A massive level of parallelism can be achieved by utilizing the multiplier resources of the FPGAs. Most Xilinx FPGAs have DSP48 macro blocks embedded in their chips. These slices contain 18x18-bit multiplier units with pre-adders, 48-bit accumulators, and selection multiplexers in order to speed up DSP operations.
• The direct-form and symmetric-form FPGA implementations use Xilinx's DSP48 macro slices.
• The distributed arithmetic (DA) technique is an efficient method for implementing multiplication operations without using the DSP macro blocks of the FPGA. In the DA technique, the coefficients of the FIR filter are represented in two's complement binary form, and all possible sum combinations of the filter coefficients are stored in look-up tables (LUTs). Using the classical shift-and-add method, the multiplication can then be performed effectively without using the multiplier units of the FPGA. We used the 4-input LUTs of the FPGA to implement the DA form of the FIR filter structure.
• We chose three different FPGAs to compare the performance results of the FIR filters. The FPGAs used and their properties are given in Table 1. Xilinx ISE v14.1 software is used to synthesize the circuits.
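
The distributed arithmetic idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the synthesized design: the function names (build_da_tables, da_dot, fir_da, fir_direct) are hypothetical, coefficients and samples are plain integers, samples are assumed to fit in 8-bit two's complement, and taps are grouped four per table to mirror the 4-input LUTs mentioned above. The direct-form function is included only as a reference to check the DA result against.

```python
def build_da_tables(coeffs, group=4):
    """For each group of taps, precompute every possible sum of its coefficients."""
    tables = []
    for start in range(0, len(coeffs), group):
        chunk = coeffs[start:start + group]
        table = []
        for addr in range(1 << len(chunk)):
            # Bit i of addr selects whether chunk[i] contributes to the sum.
            table.append(sum(c for i, c in enumerate(chunk) if (addr >> i) & 1))
        tables.append(table)
    return tables


def da_dot(window, tables, bits=8, group=4):
    """Inner product of the coefficients with `bits`-bit two's complement
    samples, using only table look-ups, shifts, and additions."""
    mask = (1 << bits) - 1
    words = [s & mask for s in window]  # two's complement bit patterns
    acc = 0
    for j in range(bits):  # process one bit plane of the samples at a time
        partial = 0
        for t, table in enumerate(tables):
            chunk = words[t * group:(t + 1) * group]
            addr = 0
            for i, w in enumerate(chunk):
                addr |= ((w >> j) & 1) << i
            partial += table[addr]
        if j == bits - 1:
            acc -= partial << j  # the sign bit carries weight -2^(bits-1)
        else:
            acc += partial << j
    return acc


def fir_direct(x, coeffs):
    """Reference direct-form FIR: y(n) = sum over k of b_k * x(n-k)."""
    out = []
    for n in range(len(x)):
        acc = 0
        for k, b in enumerate(coeffs):
            if n - k >= 0:
                acc += b * x[n - k]
        out.append(acc)
    return out


def fir_da(x, coeffs, bits=8, group=4):
    """FIR filter computed with the distributed arithmetic model above."""
    tables = build_da_tables(coeffs, group)
    out = []
    for n in range(len(x)):
        # Samples before the start of the stream are treated as zero.
        window = [x[n - k] if n - k >= 0 else 0 for k in range(len(coeffs))]
        out.append(da_dot(window, tables, bits, group))
    return out
```

For integer inputs within the assumed 8-bit range, fir_da(x, coeffs) and fir_direct(x, coeffs) should return the same list. In this model, a filter with M taps needs about M/4 tables of 16 entries each, and each output takes one pass per sample bit.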

Results and Discussions
[Table: GPU and CPU Performance Results of FIR Filter Application (Million Samples per Second)]

Conclusions
• FIR filter order has a noticeable effect on performance.
• Both the FPGA and the GPU achieved better performance for lower-order FIR filters than for higher-order ones.
• Serialization caused by an insufficient number of multiplier units is the main cause of performance loss on FPGAs.
• The logic resource capacity of an FPGA is another limiting factor when implementing high-order FIR filters.
• FPGAs are relatively cheaper than GPUs, but GPUs are easier to program, whereas FPGAs remain difficult to program.
• In general, FPGA performance is higher than GPU performance when the FIR filter is fully parallelized on the FPGA device.
• However, the GPU outperforms the FPGA when parts of the FIR filter have to be implemented serially on the FPGA.

Questions?
• No!