Fast ARFTIS Reconstruction Algorithms using CUDA

Deqi Cui*, Ningfang Liao, Wenmin Wu, Boneng Tan, Yu Lin

Department of Optical and Electronical Engineering, Beijing Institute of Technology, No.5 Zhongguancun South Street, Haidian District, Beijing, 100081, China
cuideqi@bit.edu.cn

Abstract. To achieve fast reconstruction of all-reflective Fourier Transform Imaging Spectrometer (ARFTIS) data, we implement the reconstruction algorithms on a GPU using the Compute Unified Device Architecture (CUDA). We use both the CUDA 1D FFT library and a custom CUDA parallel kernel to accelerate the spectrum reconstruction processing of ARFTIS. The results show that CUDA can drive the hundreds of processing elements ("manycore" processors) of GPU hardware and can significantly improve the efficiency of spectrum reconstruction. Furthermore, a single computer equipped with CUDA graphics cards will be able to perform real-time ARFTIS data reconstruction on its own.

Keywords. ARFTIS; Reconstruction; CUDA; Parallel.

1 Introduction

Data from a hyperspectral Fourier transform imaging spectrometer (FTIS) have two distinct characteristics: the volume of original data is large, and the reconstruction algorithm requires a large amount of computation [1, 2]. This type of imaging spectrometer has been adopted as an optical payload on the Chinese "Chang'e I" lunar exploration satellite and the Chinese "Huanjing I" environmental satellite. Because of the particular demands of hyperspectral data, processing is time-consuming: one track of data currently takes one day on a general-purpose computer. This makes it difficult to meet user demand, especially for emergency command (such as after the 5·12 Wenchuan earthquake). Improving processing speed is therefore particularly important. Graphics processing units (GPUs), originally designed for computer video cards, have emerged as the most powerful chip in a high-performance workstation.
Unlike multicore CPU architectures, which currently ship with two or four cores, GPU architectures are "manycore", with hundreds of cores capable of running thousands of threads in parallel [3]. This gives the GPU its powerful computational capability. Moreover, as GPUs have become programmable and parallel, they have begun to be used in non-graphics applications, a practice known as general-purpose computing on the GPU (GPGPU). The emergence of CUDA (Compute Unified Device Architecture) technology meets the demands of GPGPU to some degree. CUDA brings a C-like development environment to programmers for the first time: it uses a C compiler to compile programs and provides several CUDA extension libraries [4-6]. Users no longer need to map programs onto graphics APIs, so GPGPU program development becomes more flexible and efficient. Our goal in this paper is to examine the effectiveness of CUDA as a tool for expressing parallel computation in FTIS applications.

2 Interference Data Reconstruction Theory

We first introduce the all-reflective Fourier transform imaging spectrometer (ARFTIS) designed by our team. ARFTIS works on the FTIS imaging principle. Its optical structure is shown in Figure 1. It consists of an all-reflective three-mirror anastigmatic (TMA) telescope, an entrance slit, a Fresnel double mirror, a reflective collimating system, a reflective cylindrical system and a focal plane array (FPA) detector. The incident beam is first imaged onto the entrance slit by the TMA telescope system. The wavefront is then split by the Fresnel double mirror, collimated by the collimating system, and passed through the cylindrical system, which images the interference fringes onto the FPA [7]. The interference data cube is transformed into a spectral data cube by data reconstruction processing, which includes interference frame filtering, window filtering, spectral reconstruction, phase correction, etc. [8-10].
Spectral reconstruction and window filtering are the steps that are well suited to acceleration. The following subsections introduce the basic theory of each.

Fig. 1 All-Reflective Fourier Transform Imaging Spectrometer Principle

2.1 Spectral Reconstruction

The spectral cube is obtained by applying a one-dimensional FFT separately to each line of each frame within the interference cube. If the original interference data are acquired single-sided, they must first be made symmetric before the FFT can be applied. Symmetric processing yields a double-sided interferogram I(l), and the measured spectrum B(\sigma) is then the modulus of the Fourier transform of I(l):

$$B(\sigma_k) = \left| F[I(l)] \right| = \left| \sum_{l=0}^{N-1} I(l) \exp\left(-j \frac{2\pi l k}{N}\right) \right|, \qquad k = 0, 1, 2, 3, \ldots, N-1 \qquad (1)$$

2.2 Window Filtering

Theoretically, a monochromatic spectrum B(\sigma) (an impulse function) is obtained by Fourier transforming an interferogram I(l) (a cosine function) whose optical path difference extends from negative infinity to positive infinity. In practice, however, the interferogram I'(l) acquired by the FTIS covers only a limited optical path difference range (-L to +L). This is equivalent to applying a gate (rectangular) function T(l) to the cosine function, and the resulting spectrum is B'(\sigma). After the inverse Fourier transform of the interferogram, the result is no longer an impulse function but a sinc function, which introduces errors into the spectral reconstruction. Window filtering is therefore required before the Fourier transform. The window function can be chosen from Bartlett (triangular), Blackman, Hamming, Hanning, etc.

3 Parallel Optimization

3.1 CUDA Introduction

The hardware architecture is shown in Figure 2. CUDA-enabled cards contain many SIMD streaming multiprocessors (SM), and each SM contains several streaming processors (SP) and special function units (SFU). These cards use on-chip memories and caches to accelerate memory access.
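The per-line pre-processing described in Sections 2.1 and 2.2 is independent for every interferogram line, which is what makes it a natural fit for the hardware above. As a minimal host-side reference sketch in C++ (not the authors' kernel), the code below mirrors a single-sided line into a double-sided one and applies a Hanning window before the FFT. The function names `symmetrize`, `hanning` and `apply_window`, the assumption that zero path difference sits at index 0, and the resulting odd length 2n - 1 are all illustrative choices that the paper does not specify.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: mirror a single-sided interferogram into a
// double-sided one. Assumes index 0 is zero optical path difference,
// so the output has length 2*n - 1 with the centre at index n - 1.
std::vector<float> symmetrize(const std::vector<float>& single_sided) {
    assert(!single_sided.empty());
    const std::size_t n = single_sided.size();
    std::vector<float> out(2 * n - 1);
    for (std::size_t i = 0; i < n; ++i) {
        out[n - 1 + i] = single_sided[i];  // positive path differences
        out[n - 1 - i] = single_sided[i];  // mirrored negative side
    }
    return out;
}

// Hanning window coefficients w(l) = 0.5 * (1 - cos(2*pi*l / (n - 1))).
std::vector<float> hanning(std::size_t n) {
    assert(n > 1);
    const double pi = std::acos(-1.0);
    std::vector<float> w(n);
    for (std::size_t l = 0; l < n; ++l) {
        w[l] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * l / (n - 1))));
    }
    return w;
}

// Multiply one interferogram line by the window, element by element.
// On the GPU each multiplication could be handled by its own thread.
void apply_window(std::vector<float>& line, const std::vector<float>& w) {
    assert(line.size() == w.size());
    for (std::size_t l = 0; l < line.size(); ++l) {
        line[l] *= w[l];
    }
}
```

Because the window coefficients depend only on the line length, they can be computed once and reused for every line of every frame.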
Fig. 2 CUDA Hardware Block Diagram

Fig. 3 Compiling CUDA

On the software side, CUDA mainly extends the driver program and the function library. The software stack consists of the driver, the runtime library, several APIs and the NVCC compiler developed by NVIDIA [11]. As shown in Figure 3, the integrated GPU and CPU program is compiled by NVCC, which separates the GPU code from the CPU code.

3.2 Implementation of Spectral Reconstruction

For the Fourier transform, CUDA provides the CUFFT library, which contains many highly optimized function interfaces [12] that can be called directly. Before using the CUFFT library, the user only needs to create an FFT plan and then call the APIs. Device memory must be allocated when the plan is created, and this allocation does not change in subsequent computations. For window filtering we use a custom CUDA parallel kernel to implement the matrix computation. The framework exploits the independence of the reconstruction processing across the two spatial dimensions, which means the reconstruction can be parallelized over the spatial dimensions. We designed a parallel reconstruction algorithm that processes, frame by frame in parallel, the interference data belonging to each point's spectrum. The flow chart is shown in Figure 4.

Fig. 4 Flow Chart of Parallel Optimization (steps: Beginning of Parallel Processing; Parallel Execution by Frame; Window Filtering; Spectral Reconstruction, calling the CUFFT library)

Partial code:

    // Complex is assumed to be a typedef of float2 (same layout as cufftComplex).
    // CUDA_SAFE_CALL and CUFFT_SAFE_CALL are the error-checking macros
    // from the CUDA SDK header cutil.h.

    // Allocate host memory for the signal
    const unsigned int mem_size = sizeof(Complex) * SIGNAL_N * lines;
    Complex* h_signal = (Complex*)malloc(mem_size);

    // Allocate device memory for the signal
    Complex* d_signal;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_signal, mem_size));

    // Copy host memory to device
    CUDA_SAFE_CALL(cudaMemcpy(d_signal, h_signal, mem_size,
                              cudaMemcpyHostToDevice));

    // Create a CUFFT plan for one SIGNAL_N-point complex-to-complex transform
    cufftHandle plan;
    CUFFT_SAFE_CALL(cufftPlan1d(&plan, SIGNAL_N, CUFFT_C2C, 1));

    // Transform each line in place
    Complex* p_signal = (Complex*)d_signal;
    for (int line = 0; line < lines; line++) {
        CUFFT_SAFE_CALL(cufftExecC2C(plan, (cufftComplex*)p_signal,
                                     (cufftComplex*)p_signal, CUFFT_INVERSE));
        p_signal += SIGNAL_N;
    }

    // Destroy the CUFFT context
    CUFFT_SAFE_CALL(cufftDestroy(plan));

    // Copy device memory back to host
    CUDA_SAFE_CALL(cudaMemcpy(h_signal, d_signal, mem_size,
                              cudaMemcpyDeviceToHost));

    // Clean up memory
    free(h_signal);
    CUDA_SAFE_CALL(cudaFree(d_signal));

4 Experiment Results and Analysis

We experiment with simulation data generated by the FTIS Simulation Software (developed by CSEL, Beijing Institute of Technology). The data parameters are as follows. Each scene of the interference data cube contains 512 (cross-track width) × 512 (along-track width) × 512 (interference dimension, double-sided) pixels. The input data precision is a 12-bit integer per pixel. The output spectral data have 115 bands per point, saved as 16-bit float. Reconstruction processing includes window filtering and a 512-point FFT. We carried out the parallel optimization on an NVIDIA GTX 280, whose hardware specifications are shown in Table 1. The development language is C++, and the compilers are NVCC and the Microsoft Visual Studio 2008 compiler. The experiment environment is an Intel Core 2 Duo E8500 CPU (dual-core, 3.16 GHz) running the Windows XP 32-bit OS.
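As a rough sizing aid for the scene dimensions given above, the following C++ sketch counts the independent 512-point FFTs in one scene and the number of host-to-device transfers needed for a given block size. It rests on assumptions the paper does not state: each sample is stored as an 8-byte single-precision complex value, a "block" is a contiguous run of whole interferogram lines copied to the device in one transfer, and 1 MB means 2^20 bytes. The names `ComplexSample`, `ffts_per_scene`, `scene_bytes`, `lines_per_block` and `blocks_per_scene` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed storage format: one single-precision complex sample (8 bytes).
struct ComplexSample { float re; float im; };
static_assert(sizeof(ComplexSample) == 8, "expected an 8-byte complex sample");

const std::uint64_t kPoints = 512;  // samples per interferogram line
const std::uint64_t kLines  = 512;  // lines per frame
const std::uint64_t kFrames = 512;  // frames per scene

// Number of independent 512-point FFTs in one scene.
std::uint64_t ffts_per_scene() { return kLines * kFrames; }

// Bytes needed to hold the whole scene as complex samples.
std::uint64_t scene_bytes() {
    return kPoints * kLines * kFrames * sizeof(ComplexSample);
}

// Whole interferogram lines that fit into one block of the given size.
std::uint64_t lines_per_block(std::uint64_t block_bytes) {
    const std::uint64_t line_bytes = kPoints * sizeof(ComplexSample);
    assert(block_bytes >= line_bytes);
    return block_bytes / line_bytes;
}

// Number of transfers needed to move one scene, rounded up.
std::uint64_t blocks_per_scene(std::uint64_t block_bytes) {
    const std::uint64_t per_block = lines_per_block(block_bytes);
    return (ffts_per_scene() + per_block - 1) / per_block;
}
```

Under these assumptions a scene holds 262,144 lines and occupies 1 GiB as complex data. A 1 MB block carries 256 lines and needs 1,024 transfers, while a 512 MB block carries 131,072 lines and needs only 2, which is one plausible reading of why the block size matters in the timing results.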
Table 1 NVIDIA GTX 280 Hardware Specifications

Texture processor clusters (TPC)              10
Streaming multiprocessors (SM) per TPC        3
Streaming processors (SP) per SM              8
Special function units (SFU) per SM           2
Total SPs in entire processor array           240
Peak floating point performance (GFLOPS)      933

The experiment uses one scene of data (512 frames × 512 lines × 512 points × 12 bit). The interference pattern is shown in Figure 5, the reconstructed image in Figure 6, and the spectral result in Figure 7.

Fig. 5 The Interference Pattern of One Scene

Fig. 6 Reconstruction Image. Left: Original Simulation Image; Middle: 510nm Reconstruction Band Image; Right: 890nm Reconstruction Band Image.

Fig. 7 Reconstruction Spectrum of Point A. Left: Original Simulation Spectrum; Right: Reconstruction Spectrum.

Figures 6 and 7 show that the reconstructed image matches the original simulation image and that the spectral characteristics are correct. However, there are some small differences between the reconstructed spectral values and the original simulated spectral values. We attribute these to the sampling precision of the simulation and the computing precision of the reconstruction, and they lie within an acceptable error range. The optimization results are shown in Table 2. Table 2 clearly shows that the speed-up from GPU processing is substantial. A large memory block on the GPU accelerates the stream processors' shared-bus access better than system memory does.

Table 2 Optimization Results: CPU Compared with GPU

Mode                       Time (s)   Speed-up
CPU                        116.8      1
CPU + GPU, 1MB block       8.3        14.07
CPU + GPU, 512MB block     5.5        21.24

5 Conclusions

In this paper, we introduced novel implementations of ARFTIS reconstruction algorithms based on a new technology and a new type of hardware. Using CUDA programming and CUDA-enabled GPUs, we can easily implement basic reconstruction algorithms.
The results demonstrate that this approach significantly improves the efficiency of spectrum reconstruction processing. However, the performance and efficiency advantages of CUDA depend on allocating memory blocks of an appropriate size. This work provides a feasible example of parallel optimization of ARFTIS data reconstruction using CUDA, as well as a reference for using GPGPU computing effectively. Future work includes implementing more optional algorithms for ARFTIS. Another direction is to optimize CUDA efficiency on multi-core CPUs.

Acknowledgment

The authors gratefully acknowledge the support of the National Natural Science Foundation of China, No.60377042; the National 863 Foundation of China, No.2006AA12Z124; the Aviation Science Foundation, No.20070172003; the Innovative Team Development Plan of the Ministry of Education, No.IRT0606; and the National 973 Foundation of China, No.2009CB7240054.

References

[1] CHU Jianjun. The Study of Imaging Fourier Transform Spectroscopy[D]. Beijing: Beijing Institute of Technology, 2002. (in Chinese)
[2] Leonard John Otten III, R. Glenn Sellar, and Bruce Rafert. MightySat II.1 Fourier-transform hyperspectral imager payload performance[J]. Proc. SPIE, 1995, 2583: 566.
[3] Hiroyuki Takizawa, Noboru Yamada, Seigo Sakai et al. Radiative Heat Transfer Simulation Using Programmable Graphics Hardware[J]. Proceedings of the 5th IEEE/ACIS International Conference on Computer and Information Science, 2006, 0-7695-2613-6/06.
[4] NVIDIA. GPU Computing: Programming a Massively Parallel Processor, 2006.
[5] NVIDIA. NVIDIA CUDA Programming Guide version 2.0. Available: http://www.developer.download.nvidia.com
[6] J. Tölke. Implementation of a Lattice Boltzmann kernel using the Compute Unified Device Architecture. Submitted to Computing and Visualization in Science, 2007.
[7] Deqi Cui, Ningfang Liao, Ling Ma et al. Spectral calibration method for all-reflected Fourier transform imaging spectrometer[J]. Proc. SPIE, 2008, 716022.
[8] MA Ling. The Study of Spectral Reconstruction for Imaging Fourier Transform Spectrometer[D]. Beijing: Beijing Institute of Technology, 2008. (in Chinese)
[9] Griffiths PR. Fourier Transform Infrared Spectrometry[M]. New York: Wiley Interscience Publication, 1986.
[10] Andre J. Villemaire, Serge Fortin, Jean Giroux et al. Imaging Fourier transform spectrometer[J]. Proc. SPIE, 1995, 2480: 387.
[11] NVIDIA. The CUDA Compiler Driver NVCC. Available: http://www.developer.download.nvidia.com
[12] NVIDIA. CUFFT Library version 1.1. Available: http://www.developer.download.nvidia.com