Ministry of Education and Science of the Russian Federation
Tomsk Polytechnic University

Final Paper

TECHNOLOGY OF PARALLEL COMPUTING ON MULTIPROCESSOR SYSTEMS FOR DIGITAL SIGNAL PROCESSING
(Materials of scientific research)

Trainee      ____________ (signature, date)   E.E. Luneva
Instructor   ____________ (signature, date)   ________________
Consultant   ____________ (signature, date)   ________________

Tomsk – 2013

Annotation

This paper considers technologies for increasing the efficiency of mathematical calculations on multiprocessor systems, taking the computation of the time-frequency correlation function as an example. It is shown that graphics processing units outperform general-purpose central processing units in solving compute-intensive problems.

Contents

Introduction
1. Theoretical analysis
2. The experimental part
Conclusion
References

Introduction

Digital signal processing is today an important branch of modern science and engineering. As the processing power of modern computers has grown, the volume of information being processed has increased, new signal processing algorithms have been developed, and the range of applications of signal processing has become wider. Strict requirements are imposed on newly developed methods and on the software based on them: processing speed, the ability to operate in real time, and measurement accuracy. These requirements are difficult to meet without modern computing hardware and technology.

The aim of this work is to investigate ways of using computer hardware resources more efficiently for the correlation analysis of signals.

1. Theoretical analysis

Correlation analysis of signals is widely used for solving problems of non-destructive testing, diagnostics and fault detection in electrical systems, and digital image processing [1]. Correlation functions are straightforward to compute with the help of the discrete Fourier transform (DFT) [1]. The efficiency of the computation in this case depends on the chosen method of implementing the discrete Fourier transform; the maximum efficiency is reached when the fast Fourier transform (FFT) is used. The main drawback of this approach is that it does not show how the signals are related in different frequency ranges. The method of time-frequency correlation analysis is free from this drawback [2] and can significantly enhance the information content of the analysis.
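Both the ordinary and the time-frequency correlation estimates rely on the FFT mentioned above. As a purely illustrative sketch (not the implementation used in this work), a minimal recursive radix-2 Cooley-Tukey FFT in C# might look as follows; it assumes the input length is a power of two, and the class and method names (FftSketch, Fft) are introduced here only for the example:

using System;
using System.Numerics;

static class FftSketch
{
    // Recursive radix-2 Cooley-Tukey FFT; the length of "a" must be a power of two.
    public static Complex[] Fft(Complex[] a)
    {
        int n = a.Length;
        if (n == 1)
            return new[] { a[0] };

        // Split the input into even- and odd-indexed samples.
        var even = new Complex[n / 2];
        var odd  = new Complex[n / 2];
        for (int i = 0; i < n / 2; i++)
        {
            even[i] = a[2 * i];
            odd[i]  = a[2 * i + 1];
        }

        Complex[] e = Fft(even);
        Complex[] o = Fft(odd);

        // Combine the two half-size transforms with the twiddle factors exp(-2*pi*i*k/n).
        var result = new Complex[n];
        for (int k = 0; k < n / 2; k++)
        {
            Complex t = Complex.FromPolarCoordinates(1.0, -2.0 * Math.PI * k / n) * o[k];
            result[k]         = e[k] + t;
            result[k + n / 2] = e[k] - t;
        }
        return result;
    }
}

An ordinary cross-correlation estimate is then obtained as the inverse FFT of the element-wise product of FFT(x) and the complex conjugate of FFT(y); this product is exactly the vector P used in the algorithm described below.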
However, application of the time-frequency correlation method requires considerable computational power, because the method repeatedly executes the fast Fourier transform, and the number of FFT executions required depends directly on the number of copies formed from the signal. To estimate the computational complexity of the method, the aggregate algorithm of the necessary mathematical operations is given in Fig. 1. Despite good algorithmization and the availability of optimized FFT libraries, limited computational power remains a bottleneck of the time-frequency correlation method and of any software based on it. In this situation the efficiency of the developed software can be increased with the help of special-purpose technologies.

The peculiarity of the algorithm (Fig. 1) is as follows. It is assumed that there are two discrete signals x_i and y_i whose interconnection, if any, is to be identified. The task is therefore to identify an interconnection between the two signals and to identify the frequency bands in which the interconnection appears. The two signals x_i and y_i of length 2^n are fed to the blocks that calculate the direct Fourier transform. From the resulting product P_j, m signals M_k are formed, where j = 0, 1, ..., 2^(n-1)+1; m = 2, 3, ..., 2^(n-1); k = 0, 1, ..., m-1. The signals M_k are then submitted to the inverse Fourier transform, and from the results of the inverse Fourier transform the time-frequency correlation function is determined. Thus, in an implementation of this algorithm, parallel processing is possible within the discrete Fourier transform itself, as well as at the level of the loop iterations of the three blocks, through the creation of multiple threads and their subsequent destruction.

[Fig. 1 flowchart: determine the number of copies m; compute the direct FFT of x_i and y_i; form the vector P as the product of FFT(x_i) and FFT*(y_i); form m copies of signals M_k, k = 1...m, on the basis of the vector P; perform FFT^(-1) for each M_k; determine the time-frequency correlation function. Parallelization is possible only at the level of loop iterations.]

Fig. 1. Algorithm for calculating the time-frequency correlation function: m – the number of copies formed from the original signal; x_i, y_i – discrete samples of the signals; FFT* – the complex conjugate of the direct DFT; FFT^(-1) – the inverse DFT; P – the vector of products of the direct DFT of the signal x_i with the complex conjugate of the direct DFT of the signal y_i.

For multiprocessor systems in which all calculations are performed solely on the central processing unit (CPU), the computational throughput can be increased by parallelizing the computational processes. Microsoft has developed a general-purpose set of tools for parallel execution of tasks called «ParallelExtensions». These tools are part of the Microsoft .NET Framework 4.0 [3] and make it possible to automatically use all available processors for selected blocks of existing sequential code that are suitable for parallelization. Division into tasks is done by calling specially designed static methods of the «Parallel» class, which is part of the «ParallelExtensions» tools. In particular, the methods «Parallel.For()» and «Parallel.ForEach()» turn a sequentially executed loop into a parallel one, and separate blocks of code can be extracted into individual methods and executed in parallel by the «Parallel.Invoke()» method.
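To illustrate how the loop over the signal copies in Fig. 1 maps onto «Parallel.For()», a simplified sketch is given below. It is not the software developed in this work: the helper FormCopy, which stands for the block of Fig. 1 that builds the copy M_k from the vector P, is a hypothetical placeholder (the actual rule is defined by the method of [2]), and the inverse FFT reuses the illustrative FftSketch.Fft routine given earlier.

using System;
using System.Numerics;
using System.Threading.Tasks;

static class TimeFrequencyCorrelationSketch
{
    // Inverse FFT via the standard conjugation trick, reusing FftSketch.Fft from the sketch above.
    static Complex[] InverseFft(Complex[] a)
    {
        int n = a.Length;
        var conj = new Complex[n];
        for (int i = 0; i < n; i++) conj[i] = Complex.Conjugate(a[i]);
        Complex[] f = FftSketch.Fft(conj);
        var result = new Complex[n];
        for (int i = 0; i < n; i++) result[i] = Complex.Conjugate(f[i]) / n;
        return result;
    }

    // Placeholder for the block of Fig. 1 that forms the copy M_k from the vector P;
    // the actual rule is defined by the time-frequency method described in [2].
    static Complex[] FormCopy(Complex[] p, int k, int m)
    {
        throw new NotImplementedException();
    }

    // Computes one row of the time-frequency correlation function per copy M_k.
    public static Complex[][] Compute(Complex[] x, Complex[] y, int m)
    {
        Complex[] fx = FftSketch.Fft(x);
        Complex[] fy = FftSketch.Fft(y);

        // Vector P = FFT(x) multiplied element-wise by the complex conjugate of FFT(y).
        var p = new Complex[fx.Length];
        for (int j = 0; j < p.Length; j++)
            p[j] = fx[j] * Complex.Conjugate(fy[j]);

        // The m copies are independent, so the loop over k is distributed
        // across all available CPU cores by Parallel.For.
        var rows = new Complex[m][];
        Parallel.For(0, m, k =>
        {
            rows[k] = InverseFft(FormCopy(p, k, m));
        });
        return rows;
    }
}

Writing rows[k] from different iterations is safe here because each iteration touches only its own element; as discussed below, parallel tasks that share the same mutable data require much more careful coordination.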
However, the complexity of application development based on the «ParallelExtensions» tools increases when parallel tasks have to work with the same data. This makes it necessary to control carefully both the data themselves and the exchange of messages between parallel tasks. A typical example is the popular class of applications based on «WindowsForms», in which data is processed by background threads and then has to be visualized in the main thread (a short sketch of this situation is given at the end of this section). In addition, the universal «ParallelExtensions» tools do not allow a specific parallel architecture to be exploited to the full.

Moreover, further growth of the computing power of central processing units is today restrained by fundamental limitations of integrated-circuit manufacturing, so it is reasonable to consider other ways of increasing the computing power of the system. According to the literature [4, 5], applications that use NVIDIA graphics processors (GPUs) for non-graphical computing demonstrate a significant increase in computational efficiency compared with implementations based solely on central processing units. CUDA (Compute Unified Device Architecture) is a hardware and software architecture that makes it possible to use NVIDIA graphics processors for non-graphical calculations through the Runtime API, which is an extension of the C/C++ language. This extension makes it possible to allocate memory on the GPU with the «cudaMalloc()» function and to copy data between main memory and GPU memory with the «cudaMemcpy()» function. For parallel processing of data arrays, the mechanism of thread blocks can be used, which allows the elements of an array to be processed in parallel. Fig. 2 shows the implementation of the algorithm for calculating the time-frequency correlation function using CUDA technology.

[Fig. 2 diagram: the CPU allocates memory for the data in main memory and in GPU memory, copies the data into GPU memory, the GPU performs the processing, and the results are returned to main memory. Code fragment for calculating the vector P on the GPU:]

// Allocate memory on the GPU for the vector P (2^(n-1) elements)
cudaMalloc((void **)&P, ((size_t)1 << (n - 1)) * sizeof(*P));
...
// Launch the kernel in 2^(n-1) parallel blocks of one thread each
Calculate_P<<<(1 << (n - 1)), 1>>>(P, bpf_x, kbpf_y);
...

Fig. 2. Implementation of the algorithm for calculating the time-frequency correlation function using CUDA technology: Calculate_P – the kernel function that calculates the vector P.

For handling matrices it is also possible to work with grids of parallel threads using the CUDA type «dim3». A significant advantage of CUDA is direct access to OpenGL and DirectX [4], which is certainly worth using in image visualization problems.

General-purpose GPU programming models, including their complex memory hierarchies and vector operations, have traditionally been platform-dependent. These limitations make it difficult for developers to reuse a broad base of source code across CPUs, GPUs and other types of processors. In particular, CUDA technology works only with NVIDIA graphics processors. To work with GPUs from other manufacturers, the open, royalty-free OpenCL standard for general-purpose parallel programming can be used; OpenCL comprises a programming language and an API (Application Programming Interface). However, applications based on OpenCL are slower than applications developed with a technology designed for a specific GPU, such as CUDA [6].
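As noted earlier in this section, a common difficulty in «WindowsForms» applications is returning results produced on a background thread to the main (UI) thread. The following minimal sketch illustrates this; it is not taken from the software developed in this work, and the form, the label control and the ComputeCorrelation method are hypothetical. The essential point is only that UI controls must be updated through «Control.Invoke()».

using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public class MainForm : Form
{
    private readonly Label statusLabel = new Label { Dock = DockStyle.Top };

    public MainForm()
    {
        Controls.Add(statusLabel);
        Load += (s, e) => StartProcessing();
    }

    // Hypothetical long-running computation, e.g. the correlation transform.
    private static double ComputeCorrelation()
    {
        System.Threading.Thread.Sleep(1000);  // stands in for real work
        return 0.42;
    }

    private void StartProcessing()
    {
        // The computation runs on a background (thread-pool) thread...
        Task.Factory.StartNew(() =>
        {
            double result = ComputeCorrelation();

            // ...but WindowsForms controls may only be touched from the UI thread,
            // so the result is marshalled back with Control.Invoke().
            Invoke((Action)(() => statusLabel.Text = "Result: " + result));
        });
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new MainForm());
    }
}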
2. The experimental part

On the basis of the considered technologies, software was created that calculates the time-frequency correlation function. As the FFT algorithm, the most widespread Cooley-Tukey algorithm with fixed radix 2 was selected, since it is easy to implement, clear, and can be effectively parallelized [1, 7].

The efficiency of the considered technologies was compared on a number of test cases. The experimental results were obtained for sample sizes of 2048, 4096, 8192, 16384 and 32768 samples; the number of generated copies m is 1121. The results of these experiments are summarized in the table. The maximum execution time of the transform was obtained for the CPU application without parallel execution of the tasks. For clarity, the efficiency estimate shown in the table for each technology is the ratio of its computation time to the maximum computation time obtained for the given sample size, expressed as a percentage.

Table. Results of the experiments

Sample size, samples | CPU Intel Celeron E1500, 2 Core, without Framework, % | CPU Intel Celeron E1500, 2 Core, with Framework, % | CUDA NVIDIA GeForce GTX640, %
2048                 | 100 | 67 | 3.2
4096                 | 100 | 61 | 2.9
8192                 | 100 | 54 | 2.7
16384                | 100 | 52 | 2.5
32768                | 100 | 51 | 2.4

On the basis of these results it can be stated that using the Microsoft .NET Framework technology on a dual-core processor is justified: it reduced the run time of the transforms. The maximum gain was obtained at a length of 32768 sample points and the minimum at 2048 sample points. This difference is easily explained by the fact that the time needed to create threads can be comparable to, and sometimes exceed, the execution time of the transform itself. In accordance with the algorithm presented in Fig. 1, parallel data processing can be performed only at the level of loop iterations and within the calculation of the direct and inverse Fourier transforms, which requires significant CPU resources for creating and terminating threads. The use of CUDA technology led to a significant reduction in the signal processing time owing to the high speed of the grid and thread-block mechanism.

Conclusion

This work has considered technologies that make it possible to increase the efficiency of calculations. Owing to their specialized computing architecture, graphics processing units are faster than general-purpose central processing units in solving compute-intensive problems. It was shown that the efficiency of using the Microsoft .NET Framework technology depends essentially on the number of CPU cores: the reduction in computation time corresponds to the number of cores, which was confirmed by the experimental results. CUDA technology engages the distinctive computing architecture of graphics processors and significantly reduces the total execution time of the transforms. The effect is greatest when large amounts of data are processed, because for small data sets the time spent creating threads can exceed the time of the transform itself. At the same time, the computational efficiency depends on the implementation of the fast Fourier transform algorithm. Thanks to its distinctive computing architecture, CUDA technology now makes it possible to create efficient software.

References

1. Ifeachor E.C., Jervis B.W. Digital Signal Processing: A Practical Approach. 2nd ed. – Moscow: Williams, 2008. – 992 p.
2. Avramchuk V.S., Tran Viet Tyau. Time-frequency correlation analysis of digital signals // Bulletin of the Tomsk Polytechnic University. – 2009. – Vol. 315. – No. 5. – P. 112–115.
3. Richter J. CLR via C#. Programming on the Microsoft .NET Framework 4.0 Platform in C#. 3rd ed. – St. Petersburg: Piter, 2012. – 928 p.
4. Sanders J., Kandrot E. CUDA Technology in Examples: An Introduction to Programming Graphics Processors. – Moscow: DMK Press, 2011. – 232 p.
5. Lagunenko A.I. NVIDIA CUDA technology significantly accelerates research [Electronic resource]. – Mode of access: http://www.nvidia.com/object/io_1230126782852.html (accessed 27.08.2012).
6. The OpenCL Specification, Version 1.1 [Electronic resource]. – Mode of access: http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf (accessed 27.08.2012).
7. Nussbaumer H. Fast Fourier Transform and Convolution Algorithms. – Moscow: Radio and Communication, 1985. – 248 p.