Advanced Technologies Division Channelization and Resampling Using a Graphics Processing Unit SAIC September 23rd, 2008 8/1/2008 1 Overview Advanced Technologies Division • Channelization and resampling of digital signals are common elements in SoftwareDefined/Cognitive Radio Applications. • However, channelization and resampling consume a significant portion of the overall system computation budget • This can limit the data rates at which the system can operate. • SAIC’s polyphase GPU implementation was typically on the order of 4-10 times faster than the associated CPU implementation. 8/1/2008 2 Polyphase GPU Implementation Advanced Technologies Division Interleave by Q, Segment into N blocks Data manipulation of N*Q*(F-L+1) elements 1,055,872 elements (3.6% time) Example: P=1,Q=64, L=31,N=73, F=256 (~0.0276 sec on 8800GTX incl file I/O) 8/1/2008 FFT Multiply with each 'P-block' of H(w) IFFT (N*Q) FFTs of length F P*N*Q*F complex multiplies (P*N*Q) IFFTs of length F 4672 256-point FFTs (1.1% time) 1,196, 032 complex multiplies (3.4% time) 4672 256-point FFTs (1.7% time) P: Upsample Rate Q: Downsample Rate L: (Filter Length / (P * Q)) N: Number of input blocks to process in parallel F: FFT length Rearrange into Q channels Move P*N*Q*(F-L+1) elements 1,055,872 elements (5.3% time) FFT (P*N*(F-L+1)) FFTs of length Q 16,498 64-point FFTs (1.6% time) •FFT length determined by filter length •This is related to max(P, Q) / min(P, Q) •Number of parallel input blocks N is constrained by memory and computing resources •~83% of time spent in data conversion and file I/O for this example 3 Implementation Details Advanced Technologies Division Host (CPU) implementation: •Uses fftw floating-point libraries (3.1.2, not multi-threaded) •Uses parallel dft plans GPU implementation: •Uses NVIDIA's CUFFT library for parallel FFTs •Custom kernels for data manipulations, block multiplies, and floating-point conversion •Tests run on 8800 GTX card under RHEL5 (32-bit) •8800 GTX has 16 multiprocessors •Maximum active threads = (16*768) = 12,288, each running at 1.35Mhz General Notes: •All timing results include reading data from the file •GPU timing results include transferring data to and from the GPU •All computations use single-precision IEEE-754 arithmetic 8/1/2008 4 Results Advanced Technologies Division GPU implementation was typically on the order of 4-10 times faster than the associated CPU polyphase implementation across a wide range of channelization levels. Channelization Time for 1GB 16-bit Complex Signed Data File 8/1/2008 Resample Time for 1GB 16-bit Complex, Signed Data File 5