Parallelization and CUDA libraries
Lei Zhou, Yafeng Yin, Hong Man
Outline
GPU & CUDA
Manual CUDA Coding
CUDA Libraries
FIR Realization
Auto-Parallelizing Tools
GPU & CUDA
• GPUs are massively multithreaded many-core chips
• Hundreds of scalar processors
• Tens of thousands of concurrent threads
• CUDA stands for Compute Unified Device Architecture
• A parallel computing architecture developed by NVIDIA
• The computing engine in NVIDIA GPUs
• CUDA is accessible to software developers through industry-standard programming languages
[Figures: GeForce 8800 GTX (128 cores); Tesla C1060 (240 cores)]
Processing Flow
Serial code executes on the host, while parallel code executes on the device.
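The host/device split can be mimicked in plain C without a GPU. The sketch below is an emulation, not CUDA code: a separate heap buffer stands in for device memory, `memcpy` stands in for the host-to-device and device-to-host copies, and an ordinary loop stands in for the kernel. The function names and the squaring operation are hypothetical, chosen only to show the order of the steps.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for a kernel: squares every element of the "device" buffer. */
static void square_kernel_emulated(float *d_data, size_t n)
{
    for (size_t i = 0; i < n; ++i)   /* on a GPU, one thread per i */
        d_data[i] *= d_data[i];
}

/* Returns 0 on success, -1 if the "device" allocation fails. */
int square_on_emulated_device(float *h_data, size_t n)
{
    size_t bytes = n * sizeof(float);

    float *d_data = malloc(bytes);          /* 1. allocate "device" memory */
    if (d_data == NULL)
        return -1;

    memcpy(d_data, h_data, bytes);          /* 2. copy host -> device      */
    square_kernel_emulated(d_data, n);      /* 3. run the parallel part    */
    memcpy(h_data, d_data, bytes);          /* 4. copy device -> host      */

    free(d_data);                           /* 5. release "device" memory  */
    return 0;
}
```

Steps 2 and 4 are the transfers that cross the PCIe bus on real hardware, which is why keeping them few and large matters.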
Manual CUDA Coding
Find parallel kernels
Improve data reuse inside kernels to raise compute intensity
Access memory in a GPU-friendly way
Take advantage of the complex memory hierarchy that makes the GPU fast
Reduce the copy-in and copy-out transfers that pile up on the PCIe bus
Reduce memory usage on the GPU
Limit inter-block synchronizations
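The first item, finding a parallel kernel, can be illustrated in plain C. The sketch below is not CUDA: `saxpy_thread` (a hypothetical name) holds the work one thread would do, and two nested loops stand in for the grid of blocks and threads. The SAXPY operation and the block size of 256 are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Work of one "thread": computes a single element y[i] = a * x[i] + y[i].
   No element depends on another, so every index could run in parallel. */
static void saxpy_thread(size_t i, size_t n, float a, const float *x, float *y)
{
    if (i < n)            /* guard: the last block may be only partly used */
        y[i] = a * x[i] + y[i];
}

/* Host-side emulation of a kernel launch: the outer loop plays the role of
   the block index, the inner loop the thread index within a block. */
void saxpy_emulated_grid(size_t n, float a, const float *x, float *y)
{
    const size_t block_size = 256;                        /* assumed value */
    const size_t num_blocks = (n + block_size - 1) / block_size;

    for (size_t block = 0; block < num_blocks; ++block)
        for (size_t thread = 0; thread < block_size; ++thread)
            saxpy_thread(block * block_size + thread, n, a, x, y);
}
```

A loop qualifies as a kernel candidate when its body can be moved into a function of the index alone, as here, with no value carried from one iteration to the next.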
CUDA Libraries
Basic CUDA computation libraries
• CUBLAS
• CUFFT
• GPULib
Advanced CUDA computation libraries
• CULA
• MAGMA
• VSIPL
Basic libraries
• CUBLAS provides a set of functions for basic vector and matrix operations
• vector copy, swap, dot product, Euclidean norm, etc.
• CUFFT is the CUDA FFT library
• cufftPlan1d(), cufftPlan2d(), cufftPlan3d()
• GPULib provides a library of mathematical functions
• addition, subtraction, multiplication, and division, as well as unary functions, including sin(), cos(), gamma(), and exp()
• interpolation, array reshaping, array slicing, and reduction operations
Advanced libraries
CULA: GPU-Accelerated Linear Algebra
Provides LAPACK (Linear Algebra PACKage) functions on CUDA GPUs
MAGMA: Matrix Algebra on GPU and Multicore Architectures
Aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures and "Multicore+GPU" systems
Advanced libraries: VSIPL
VSIPL: Vector Signal Image Processing Library
Generalized matrix product
Fast FIR filtering
Correlation
Fast Fourier Transform
QR decomposition
Random number generation
Elementwise arithmetic, logical, and comparison operators, linear algebra procedures
Example
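// Allocate device memory for filter kernel
Complex* d_filter_kernel;
cutilSafeCall(cudaMalloc((void**)&d_filter_kernel, mem_size));
// Copy host memory to device
cutilSafeCall(cudaMemcpy(d_filter_kernel, h_padded_filter_kernel,
                         mem_size, cudaMemcpyHostToDevice));
// CUFFT plan: one 1D complex-to-complex transform of length new_size
cufftHandle plan;
cufftSafeCall(cufftPlan1d(&plan, new_size, CUFFT_C2C, 1));
// Transform signal and kernel in place
// (d_signal is assumed to be allocated and copied like d_filter_kernel)
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal,
                           (cufftComplex *)d_signal, CUFFT_FORWARD));
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_filter_kernel,
                           (cufftComplex *)d_filter_kernel, CUFFT_FORWARD));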
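The excerpt above stops after the forward transform. In FFT-based filtering, the usual next step is to multiply the two spectra element by element before transforming back. The sketch below shows that step as a plain C reference, not as the deck's GPU code; the type `complex_f` and both function names are hypothetical. The `scale` parameter is there because CUFFT transforms are unnormalized, so a forward and inverse pair scales the data by the transform length, and a factor of 1/N is commonly folded into this step.

```c
#include <stddef.h>

typedef struct { float re, im; } complex_f;   /* hypothetical type name */

/* Complex product (a.re + i a.im) * (b.re + i b.im). */
static complex_f complex_mul(complex_f a, complex_f b)
{
    complex_f c;
    c.re = a.re * b.re - a.im * b.im;
    c.im = a.re * b.im + a.im * b.re;
    return c;
}

/* Multiply two spectra element by element and scale the result in place. */
void pointwise_mul_and_scale(complex_f *signal, const complex_f *kernel,
                             size_t n, float scale)
{
    for (size_t i = 0; i < n; ++i) {        /* on a GPU, one thread per i */
        complex_f p = complex_mul(signal[i], kernel[i]);
        signal[i].re = scale * p.re;
        signal[i].im = scale * p.im;
    }
}
```

Each element is independent of the others, so on a GPU this loop maps naturally to one thread per frequency bin.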
FIR Realization on CUDA
[Figure: FIR computation on CUDA, with axes labeled "Threads" and "t"]
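As a reference for what an FIR filter computes, here is a minimal direct-form sketch in plain C. It assumes one output sample per thread, which is one natural mapping but not necessarily the scheme shown in the original figure. The function names are hypothetical, and the loop in `fir_filter` stands in for the thread grid.

```c
#include <stddef.h>

/* Work of one "thread": output sample n of y[n] = sum_k h[k] * x[n - k].
   Samples before the start of the input are treated as zero. */
static float fir_sample(size_t n, const float *x, const float *h, size_t taps)
{
    float acc = 0.0f;
    for (size_t k = 0; k < taps && k <= n; ++k)
        acc += h[k] * x[n - k];
    return acc;
}

/* Serial stand-in for the thread grid: each iteration is independent. */
void fir_filter(const float *x, size_t len, const float *h, size_t taps,
                float *y)
{
    for (size_t n = 0; n < len; ++n)
        y[n] = fir_sample(n, x, h, taps);
}
```

For example, filtering the input 1, 2, 3, 4 with the two-tap filter 1, 1 gives 1, 3, 5, 7: each output is the sum of the current and previous input samples.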
CUDA Demo (FIR)
GPU: NVIDIA GeForce 8600 GT
CPU: Intel Duo CPU, 2.33 GHz
Software: Visual Studio 2005
FIR Performance
[Figure: run time in msec (vertical axis, 0 to 5000) for CPU and CPU+GPU; horizontal axis from 1,000 to 10,000,000]
Auto-Parallelizing Tools
Par4All (open source environment): C and Fortran to CUDA C
PGI Accelerator: Fortran and C to CUDA C auto-parallelizing compiler
CAPS HMPP: C and Fortran to CUDA C auto-parallelizing compiler
Goose: C to CUDA C auto-parallelizing compiler
NOAA F2C: Fortran to CUDA C translator