CLFFT: An FFT code generator for heterogeneous systems Krishna G Pai Rejith George Joseph Girish Ravunnikutty Agenda FFT Intro to CLFFT Brief intro to OpenCL Comparisons of FFT Algorithms Comparison with CUFFT Future work Discrete Fourier Transform Takes O(n2) with a naive implementation. Fast Fourier Transforms (FFTs) are O(nlogn) implementation of DFTs. Image from Intel.com Why FFT’s ? Lots of Ongoing Research ◦ FFTW ( http://fftw.org/) ◦ Spiral (http://spiral.net/) Spectral methods are one of the 13 Dwarfs of Parallel Computing Rich set of Algorithms each optimal for certain ‘N’. And of course, Wide applicability. Our Approach FFTW generates code that adapts to a particular architecture (CPU’s) Spiral also the same but optimizes at compile time (Also CPU’s) Other research that is optimized for GPU’s most notably Govindaraju et al. Use all the available computing resources to make FFT’s really fast ! Heterogeneous Computing Intel Core 2 Duo Nvidia Tesla Use both these resources simultaneously Intro to CLFFT Future systems are going to be heterogeneous (multi core CPUs and GPGPUs as co processors) in nature. Study various FFT algorithms and implement them on a GPGPU and multi-core CPUs. Explore how FFT's can be scheduled across both these computing resources and the performance thus obtained. OpenCL to program the GPGPU and OpenMp to parallelize on CPU's. FFT’s Studied .. SlowFFT (Naive Implementation) Cooley-Tookey (Radix 2 , for N = 2k) Stockham (Radix 2 , for N = 2k) Sande-Tookey (Radix 2 , Decimation in Frequency, for N = 2k) Bluesteins (Radix 2 , for any N) Cooley-Tookey and SlowFFT also parallelized with OpenMp. Computational Parity Intel Xeon has about 70 GFlops at peak performance nVidia Tesla has about 933 Gflops So not much computational parity on the hpc tesla machines Better parity on Laptops with GPGPU’s. Thus more work can be shared b/w CPU and GPU. Source Intel and Nvidia Open CL Standard for parallel programming of heterogeneous systems involving CPU, GPU(s), CPU + GPU, IBM cell blade etc. So we can have portability across various architectures without a very great performance penalty* More on this when we compare matrix multiplication… Differences w.r.t CUDA No stand alone compiler to produce binaries. We compile at run time . Command Queues for launching kernels and Memory operations. Device memory managed via buffer objects, which provides richer functionality than in CUDA Allows a host memory region to be used by the device directly OpenCl requires memcpy between device and host to be explicitly synchronized OpenCL Implementations On August 5, 2009, AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program On August 28, 2009, Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL September 28, 2009, NVIDIA released OpenCL drivers and SDK implementation. Limitations Nvidia openCL supports only GPU as the openCL device. Driver doesnt consider CPU as an openCL device. Hence cannot invoke an openCL kernel on CPU. Had to use openMP for CPU AMD stream openCL has support for openCL on CPU Work Flow Currently , we split work b/w CPU and GPU’s only for Cooley-Tukey. Cooley-Tukey here is radix-2 Results are merged On Tesla, bias is highly in favor of GPU computation From the host one thread invokes OpenMp kernel and other threads equivalent to number of GPU’s invoke OpenCL Kernels Comparison of all Radix 2 Power 2 FFT 0.08 0.07 0.06 Time in Second 0.05 CL-1GPU 0.04 ST-1 GPU Santook-1GPU 0.03 0.02 0.01 0 32 64 128 256 512 1024 2048 4096 8192 16384 Comparison of Cooley-Tukey Cooley -Tukey 0.12 Time in Seconds 0.1 0.08 0.06 0.04 0.02 0 CPU-GPU GPU only CLFFT vs CuFFT for Power of 2 CooleyTukey vs CuFFT 0.007 0.006 Time in Seconds 0.005 0.004 CL-2GPU-1 CUFFT 0.003 0.002 0.001 0 32 64 128 256 512 1024 2048 4096 8192 16384 Performance Comparison OpenCL vs CUDA 100000 Time in MilliSeconds 10000 1000 Cuda OpenCL 100 10 1 512x512 1024x1024 2048x2048 Matrix Size 4096x4096 8192x8192 513 516 519 522 525 528 531 534 537 540 543 546 549 552 555 558 561 564 567 570 573 576 579 582 585 588 591 594 597 600 603 606 609 612 Time in Ms CuFFT vs CLFFT for any n CuFFT vs CLFFT for n from 513 to 615 0.3 0.25 0.2 0.15 Bluesteins CL CUFFT 0.1 0.05 0 Future Work Split Radix and Mixed Radix Algorithms 2^p3^q5^r point FFTs Winnograds Prime Number FFT Optimize CPU implementations Create Plan for an n and implement across multiple compute devices. Thank You