
Ministry of Education and Science of the Russian Federation
Tomsk Polytechnic University
Final Paper
TECHNOLOGY OF PARALLEL COMPUTING ON MULTIPROCESSOR
SYSTEMS FOR DIGITAL SIGNAL PROCESSING
(Materials of scientific research)
Trainee     ___________ (signature)   _____ (date)   E.E. Luneva (name)
Instructor  ___________ (signature)   _____ (date)   ___________ (name)
Consultant  ___________ (signature)   _____ (date)   ___________ (name)
Tomsk – 2013
Annotation
This work considers technologies for improving the efficiency of mathematical
calculations on multiprocessor systems, using the calculation of the
time-frequency correlation function as an example. It is shown that graphics
processing units are faster than general-purpose central processing units in
solving computational problems.
Contents
Introduction
1. Theoretical analysis
2. The experimental part
Conclusion
References
Introduction
Digital signal processing is an important branch of modern science and
engineering. As the processing power of computers has grown, the volume of
information being processed has increased, new signal processing algorithms
have been developed, and the range of applications of signal processing has
widened.
Newly developed methods, and the software based on them, must satisfy strict
requirements on processing speed, the ability to operate in real time, and
measurement accuracy. These requirements are difficult to meet without modern
computing devices and technologies.
The aim of this work is to study ways of using computer hardware resources
more efficiently for the correlation analysis of signals.
1. Theoretical analysis
Correlation analysis of signals is widely used in nondestructive testing, in
the diagnostics and feature identification of electrical systems, and in
digital image processing [1]. Correlation functions are computed quite simply
with the help of the discrete Fourier transform (DFT) [1]. The efficiency of
the computation depends on how the discrete Fourier transform is implemented;
the maximum efficiency is reached with the fast Fourier transform (FFT). The
main drawback of this method is that it does not show how the signals are
related in different frequency ranges. The method of time-frequency
correlation analysis is free from this drawback [2] and can significantly
enhance the information content of the analysis. However, this method requires
a significant amount of computational power because it executes the fast
Fourier transform repeatedly. The number of fast Fourier transforms required
depends directly on the number of copies of the signal that are formed.
To estimate the computational complexity of this method, a generalized
algorithm of the necessary mathematical operations is given (Fig. 1). Although
the method lends itself well to algorithmic implementation and optimized
libraries for FFT computation are available, limited computational power
remains a bottleneck for the time-frequency correlation method and for any
software based on it. In this case the efficiency of the software can be
increased with the help of special-purpose technologies.
The algorithm (Fig. 1) works as follows. Suppose there are two discrete
signals x_i and y_i whose interconnection, if any, should be identified. The
task is to identify both the interconnection of the two signals and the
frequency range in which it appears.
The two signals x_i and y_i, each of length 2^n, are passed to the blocks that
calculate the direct Fourier transform. From the resulting product P_j,
m signals M_k are formed, where j = 0, 1, ..., 2^(n-1)+1; m = 2, 3, ..., 2^(n-1);
k = 0, 1, ..., m-1. The signals M_k are then passed to the inverse Fourier
transform. The results of the inverse Fourier transform form the
time-frequency correlation function.
Thus, in an implementation of this algorithm, parallel processing is possible
within the discrete Fourier transform itself and at the level of the loop
iterations of the three blocks, by creating multiple threads and destroying
them afterwards.
[Fig. 1: flowchart. Begin; determine the number of copies m; form the vector P
as the product of FFT(x_i) and FFT*(y_i); form m copies of the signal
(M_k, k ∈ 1...m) on the basis of the vector P; perform FFT^-1 for the signals
M_k, k ∈ 1...m; determine the time-frequency correlation function; End. The
loops over k = 1, ..., m-1 and j = 1, ..., 2^(n-1) carry the note:
parallelization is only possible at the level of loop iterations.]
Fig. 1. Algorithm for calculating the time-frequency correlation function:
m – the number of copies formed from the original signal; x_i, y_i – discrete
samples of the signals; FFT* – the complex conjugate of the direct DFT;
FFT^-1 – inverse DFT; P – the vector of products of the direct DFT of signal
x_i and the complex conjugate of the direct DFT of signal y_i.
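The steps of Fig. 1 can be sketched in Python as follows. This is a minimal
illustration, not the implementation used in the experiments. The function
names `dft` and `time_frequency_correlation` are hypothetical, a naive DFT
stands in for the FFT to keep the sketch short, and the way each copy M_k is
formed (one contiguous frequency band of P is kept and the rest is zeroed) is
an assumption, since the text does not spell it out.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(N^2) DFT; a stand-in for the FFT to keep the sketch short."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t, v in enumerate(values))
        result.append(total / n if inverse else total)
    return result


def time_frequency_correlation(x, y, m):
    """Return m rows; row k is the correlation of x and y within band k."""
    n = len(x)
    fx, fy = dft(x), dft(y)
    # Vector P: spectrum of x times the complex conjugate of the spectrum of y.
    p = [a * b.conjugate() for a, b in zip(fx, fy)]
    half = n // 2
    rows = []
    for k in range(m):  # iterations are independent of each other
        lo = k * (half + 1) // m
        hi = (k + 1) * (half + 1) // m
        # Copy M_k (assumed form): keep bins lo..hi-1 together with their
        # mirrored negative-frequency bins, and zero everything else.
        mk = [0j] * n
        for j in range(lo, hi):
            mk[j] = p[j]
            mk[(n - j) % n] = p[(n - j) % n]
        rows.append([v.real for v in dft(mk, inverse=True)])
    return rows
```

Because the assumed bands do not overlap and together cover the whole
spectrum, the rows sum to the ordinary circular cross-correlation of x and y,
which gives a simple consistency check for such an implementation.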
For multiprocessor systems in which all calculations are performed purely on
the central processing unit (CPU), computational power can be increased by
parallelizing the computational processes. Microsoft has developed a
general-purpose set of tools for the parallel execution of tasks, called
«Parallel Extensions». These tools are part of Microsoft .NET Framework 4.0
[3] and make it possible to use all available processors automatically for
selected blocks of existing sequential code that are suitable for
parallelization.
Work is divided into tasks by calling specially designed static methods of the
class «Parallel», which is part of the «Parallel Extensions» tools. In
particular, the methods «Parallel.For()» and «Parallel.ForEach()» turn a
sequentially executed loop into a parallel one. Some blocks of code can be
moved into separate methods and executed in parallel with the method
«Parallel.Invoke()».
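The effect of replacing a sequential loop with a parallel one can be
illustrated by the following sketch. It is written in Python with the standard
`concurrent.futures` module rather than in C#, so it is only an analogy of
«Parallel.For()», not the .NET API; `process_copy` is a hypothetical
placeholder for the work performed on one signal copy.

```python
from concurrent.futures import ThreadPoolExecutor


def process_copy(k):
    """Hypothetical stand-in for the work done on copy number k."""
    return k * k


def run_sequential(m):
    """Process the m copies one after another."""
    return [process_copy(k) for k in range(m)]


def run_parallel(m, workers=4):
    """Process the m copies with a pool of worker threads."""
    # Executor.map distributes the iterations over the pool and
    # returns the results in iteration order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_copy, range(m)))
```

In CPython, threads do not speed up purely computational work because of the
global interpreter lock; `ProcessPoolExecutor` offers the same `map` interface
with separate processes.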
However, application development with the «Parallel Extensions» tools becomes
more complex when parallel tasks use the same data. This makes it necessary to
control carefully both the data and the exchange of messages between parallel
tasks. The situation arises, for example, in the popular class of
«WindowsForms» applications, where data is processed in background threads and
has to be visualized in the main thread. In addition, the general-purpose
«Parallel Extensions» tools do not allow a specific parallel architecture to
be used to the full.
Besides that, growth in the computing power of central processing units is now
held back by fundamental limitations of integrated circuit manufacturing. It
is therefore reasonable to consider other ways of increasing the computing
power of the system.
According to the literature [4, 5], applications that use NVIDIA graphics
processing units (GPUs) for non-graphical computing demonstrate a significant
increase in computational efficiency compared with implementations based
solely on central processing units. CUDA (Compute Unified Device Architecture)
is a hardware and software architecture that enables NVIDIA graphics
processors to be used for non-graphical calculations through the Runtime API.
The Runtime API is used together with an extension of the C/C++ language. It
makes it possible to allocate memory on the GPU with the function
«cudaMalloc()», to copy data between main memory and GPU memory with the
function «cudaMemcpy()», and to pass parts of the code to the GPU for
processing.
Data arrays can be processed in parallel using the mechanism of blocks, which
allows the elements of an array to be processed concurrently. Fig. 2 shows the
implementation of the algorithm for calculating the time-frequency correlation
function using CUDA.
[Fig. 2: diagram. CPU with main memory: allocate memory for the data; allocate
memory for processing the data; copy the data into GPU memory. GPU with GPU
memory: ...; return the results.]
The code snippet calculating the vector P in parallel on the GPU:
// Number of elements in the vector P: 2^(n-1)
int count = 1 << (n - 1);
// Allocate memory on the GPU for the vector P
cudaMalloc((void **)&P, count * sizeof(*P));
…
// Run Calculate_P in 2^(n-1) parallel blocks of one thread each
Calculate_P<<<count, 1>>>(P, bpf_x, kbpf_y);
…
Fig. 2. Implementation of the algorithm for calculating the time-frequency
correlation function using CUDA:
Calculate_P – the function for calculating the vector P.
Grids of parallel threads, described with the CUDA type «dim3», make it
possible to process matrices. A significant advantage of CUDA is its direct
interoperability with OpenGL and DirectX [4], which is useful in problems that
involve image visualization.
General-purpose GPU programming models, with their complex memory hierarchies
and vector operations, have traditionally been platform-dependent. These
limitations make it difficult for developers to reuse a broad base of source
code across CPUs, GPUs, and other types of processors. In particular, CUDA
works only with NVIDIA graphics processors. To work with GPUs from other
manufacturers, it is possible to use OpenCL, an open, royalty-free standard
for general-purpose parallel programming.
OpenCL comprises a programming language and an application programming
interface (API). However, applications based on OpenCL are less efficient than
applications developed with a technology designed for a specific GPU,
including CUDA [6].
2. The experimental part
Software that calculates time-frequency correlation functions was created on
the basis of the technologies considered. The FFT algorithm chosen was the
most common one, the radix-2 Cooley-Tukey algorithm, which is easy to
implement, clear, and can be parallelized effectively [1, 7]. The efficiency
of the technologies was compared on a number of test cases. The experimental
results were obtained for sample sizes of 2048, 4096, 8192, 16384, and 32768
samples. The number of copies formed, m, was 1121.
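For reference, the radix-2 Cooley-Tukey scheme can be sketched in a few lines
of Python. This is a textbook recursive decimation-in-time form under the
assumption that the input length is a power of two; it is not the code used in
the experiments, and the name `fft_radix2` is hypothetical.

```python
import cmath


def fft_radix2(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(values)
    # Transform the even-indexed and odd-indexed samples separately.
    even = fft_radix2(values[0::2])
    odd = fft_radix2(values[1::2])
    result = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms.
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result
```

The two half-size transforms do not depend on each other, and neither do the
butterflies within one stage, which is what makes the algorithm suitable for
parallel execution.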
The results of these experiments are summarized in the table. The maximum
execution time was obtained for the CPU application without parallel execution
of tasks. For clarity, the efficiency of each technology is shown in the table
as the ratio of its computation time to this maximum computation time for the
same sample size, expressed as a percentage.
Table. Results of experiments

Sample size,   CPU Intel Celeron E1500,   CPU Intel Celeron E1500,   CUDA, NVIDIA
samples        2 cores, without           2 cores + Framework, %     GeForce GTX640, %
               Framework, %
2048           100                        67                         3.2
4096           100                        61                         2.9
8192           100                        54                         2.7
16384          100                        52                         2.5
32768          100                        51                         2.4
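The percentages in the table can also be read as speedup factors relative to
the sequential CPU run, since a relative time of p percent corresponds to a
speedup of 100/p. The short sketch below performs this conversion for the
tabulated values; the names are hypothetical and no new measurements are
introduced.

```python
# Relative run times from the table, in percent of the sequential CPU run.
SAMPLE_SIZES = [2048, 4096, 8192, 16384, 32768]
FRAMEWORK_PERCENT = [67, 61, 54, 52, 51]
CUDA_PERCENT = [3.2, 2.9, 2.7, 2.5, 2.4]


def speedup(percent):
    """Convert a relative run time in percent into a speedup factor."""
    return 100.0 / percent


for size, fw, gpu in zip(SAMPLE_SIZES, FRAMEWORK_PERCENT, CUDA_PERCENT):
    print(size, round(speedup(fw), 2), round(speedup(gpu), 2))
```

Read this way, the dual-core run with the Framework is roughly 1.5 to 2 times
faster than the sequential run, and the CUDA run is roughly 31 to 42 times
faster.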
The results show that using the Microsoft .NET Framework parallel tools on a
dual-core processor is justified: it reduced the run time of the transforms.
The greatest reduction was obtained at a length of 32768 samples and the
smallest at 2048 samples. This difference is explained by the fact that the
time needed to create threads can be comparable to, and sometimes exceed, the
execution time of the transform itself. In the algorithm presented in Fig. 1,
data can be processed in parallel only at the level of loop iterations and
within the direct and inverse Fourier transforms. Both require significant CPU
resources for creating and terminating threads. Using CUDA led to a
significant reduction in the signal processing time owing to the high speed of
its grid of thread blocks.
Conclusion
This work considered technologies that make it possible to increase the
efficiency of calculations. Owing to their specialized architecture, graphics
processing units are faster than general-purpose central processing units in
solving computational problems.
It was shown that the efficiency of the Microsoft .NET Framework parallel
tools depends on the number of cores in the CPU. The reduction in computation
time corresponds to the number of cores, which was confirmed by the results of
the experiments.
CUDA makes it possible to exploit the distinctive computing architecture of
graphics processors and to reduce the total execution time of the transforms
significantly. The effect is significant when large amounts of data are
processed, because for small amounts the time needed to create the threads can
exceed the time of the transform itself. At the same time, the computational
efficiency depends on the implementation of the fast Fourier transform
algorithm. CUDA thus makes it possible to create efficient software.
References
1. Ifeachor E.C., Jervis B.W. Digital Signal Processing: A Practical Approach.
2nd ed. – Moscow: Williams, 2008. – 992 p.
2. Avramchuk V.S., Tran Viet Tyau. Time-frequency correlation analysis of
digital signals // Bulletin of the Tomsk Polytechnic University. – 2009. –
Vol. 315. – № 5. – P. 112-115.
3. Richter J. CLR via C#. Programming on the Microsoft .NET Framework 4.0
platform in C#. 3rd ed. – St. Petersburg: Piter, 2012. – 928 p.
4. Sanders J., Kandrot E. CUDA technology in examples: an introduction to
programming graphics processors. – Moscow: DMK Press, 2011. – 232 p.
5. Lagunenko A.I. NVIDIA CUDA technology significantly accelerates research
[Electronic resource]. – Mode of access:
http://www.nvidia.com/object/io_1230126782852.html – 27.08.2012
6. The OpenCL Specification V 1.1 [Electronic resource]. – Mode of access:
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf – 27.08.2012
7. Nussbaumer H. Fast Fourier transform and convolution algorithms. – Moscow:
Radio and Communication, 1985. – 248 p.