GPU acceleration in Matlab
Jan Kamenický
UTIA Friday seminar, 9.11.2012

GPU acceleration
• CPU
  – fast
  – general-purpose
• GPU
  – highly parallel
  – handles specific tasks with large amounts of data
  – memory transfers needed

GPU acceleration in Matlab
• Built-in functions
  – many Matlab functions support GPU acceleration natively
• arrayfun
  – specific element-wise processing
• CUDA kernels
  – write “.cu” files
  – compile to “.ptx” (parallel thread execution)
  – run using feval

Prerequisites
• Matlab 2010b or newer
• Parallel Computing Toolbox (check with ver)

>> ver
------------------------------------------------------------------------------------
MATLAB Version 7.13.0.564 (R2011b)
MATLAB License Number: XXXXXX
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Java VM Version: Java 1.6.0_17-b04 with Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM mixed mode
------------------------------------------------------------------------------------
MATLAB                                   Version 7.13    (R2011b)
Simulink                                 Version 7.8     (R2011b)
Computer Vision System Toolbox           Version 4.1     (R2011b)
Curve Fitting Toolbox                    Version 3.2     (R2011b)
DSP System Toolbox                       Version 8.1     (R2011b)
Data Acquisition Toolbox                 Version 3.0     (R2011b)
Filter Design HDL Coder                  Version 2.9     (R2011b)
Fixed-Point Toolbox                      Version 3.4     (R2011b)
Global Optimization Toolbox              Version 3.2     (R2011b)
Image Acquisition Toolbox                Version 4.2     (R2011b)
Image Processing Toolbox                 Version 7.3     (R2011b)
MATLAB Compiler                          Version 4.16    (R2011b)
MATLAB Distributed Computing Server      Version 5.2     (R2011b)
Neural Network Toolbox                   Version 7.0.2   (R2011b)
Optimization Toolbox                     Version 6.1     (R2011b)
Parallel Computing Toolbox               Version 5.2     (R2011b)
Partial Differential Equation Toolbox    Version 1.0.19  (R2011b)
Signal Processing Toolbox                Version 6.16    (R2011b)
Simulink 3D Animation                    Version 6.0     (R2011b)
Statistics Toolbox                       Version 7.6     (R2011b)
Symbolic Math Toolbox                    Version 5.7     (R2011b)
Wavelet Toolbox                          Version 4.8     (R2011b)

Prerequisites
• Matlab 2010b or newer
• Parallel Computing Toolbox (check with ver)
• NVIDIA GPU with CUDA compute capability 1.3 or higher (check with gpuDevice)

>> gpuDevice
ans =
  parallel.gpu.CUDADevice handle
  Package: parallel.gpu

  Properties:
                      Name: 'GeForce GTX 285'
                     Index: 1
         ComputeCapability: '1.3'
            SupportsDouble: 1
             DriverVersion: 5
        MaxThreadsPerBlock: 512
          MaxShmemPerBlock: 16384
        MaxThreadBlockSize: [512 512 64]
               MaxGridSize: [65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1475e+009
                FreeMemory: 1.9656e+009
       MultiprocessorCount: 30
              ClockRateKHz: 1476000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

  Methods, Events, Superclasses

Basic usage
• Send data to GPU
  – either allocate there or transfer from workspace
• Run Matlab functions
  – GPU acceleration is used automatically
• Retrieve the output data

GPUArray class
• parallel.gpu.GPUArray
  – main data class for GPU computations
  – stored in the GPU memory
  – create directly using static methods
      zeros  ones  eye  inf  nan  true  false  rand  randn  randi  colon  linspace  logspace
  – copy from existing data
      gpuArray(img)
• Supported data types: (u)int8, (u)int16, (u)int32, (u)int64, single, double, logical
  – determine the type using classUnderlying(gpuVar)
• Retrieve the data using
      workspaceVar = gather(gpuVar)

GPU accelerated Matlab functions (2012b)
• List them using methods('parallel.gpu.GPUArray')

abs acos acosh acot acoth acsc acsch all angle any arrayfun asec asech asin asinh
atan atan2 atanh beta betaln bitand bitcmp bitget bitor bitset bitshift bitxor
blkdiag bsxfun cast cat ceil chol circshift classUnderlying colon complex cond conj
conv conv2 convn cos cosh cot coth cov cross csc csch ctranspose cumprod cumsum det
diag diff disp display dot double eig eps eq erf erfc erfcinv erfcx erfinv exp expm1
fft fft2 fftn fftshift filter filter2 find fix fliplr flipud flipdim floor fprintf
full gamma gammaln gather ge gt horzcat hypot ifft ifft2 ifftn ifftshift imag ind2sub int16
int2str int32 int64 int8 inv ipermute iscolumn isempty isequal isequaln isfinite
isinf islogical ismatrix isnan isreal isrow issorted issparse isvector kron ldivide
le length log log10 log1p log2 logical lt lu mat2str max mean meshgrid min minus
mldivide mod mpower mrdivide mtimes ndgrid ndims ne nnz norm normest not num2str
numel perms permute plot (and related) plus pow2 power prod qr rank rdivide real
reallog realpow realsqrt rem repmat reshape rot90 round sec sech shiftdim sign sin
single sinh size sort sprintf sqrt squeeze std sub2ind subsasgn subsindex subsref
sum svd tan tanh times trace transpose tril triu uint16 uint32 uint64 uint8 uminus
uplus var vertcat

Simple example
• Solve system of linear equations (Ax = b)

    A = gpuArray(A);
    b = gpuArray(b);
    x = A\b;
    x = gather(x);

Simple example
• Compute convolution using FFT

    img = gpuArray(img);
    msk = padarray(msk,size(img)-size(msk),0,'post');
    msk = gpuArray(msk);
    I = fft2(img);
    M = fft2(msk);    % or, without padarray: M = fft2(msk,size(img,1),size(img,2));
    res = real(ifft2(I.*M));
    res = gather(res);

Linear system solution benchmark
[Figure: speedup of computations on GPU compared to CPU, plotted against matrix size (number of equations), with separate curves for single-precision and double-precision; speedup axis from 0 to 3.5]

Convolution benchmark
[Figure: speedup of computations on GPU compared to CPU, plotted against matrix size, with separate curves for single-precision and double-precision; speedup axis from 0 to 5]

Profiling
• Before optimizing (trying to use GPU), locate promising parts of the code, such as
  – custom code consuming the majority of time
  – built-in functions that support GPUArray (consuming the majority of time)
  – large input/output data, simple data types
• Test the speed afterwards
• GPU code cannot be profiled
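
One way to "test the speed afterwards" is a plain tic/toc comparison of the same computation on the CPU and on the GPU. The following is only a minimal sketch, not the code behind the benchmarks above: the problem size n, the random test data and the variable names (tCpu, tGpu, xCpu, xGpu) are assumptions made for illustration. The GPU timing deliberately includes the transfers to and from the device, because they are part of the real cost.

```matlab
% Assumed problem size and random single-precision test data (illustration only).
n = 2000;
A = rand(n, 'single');
b = rand(n, 1, 'single');

% CPU reference: solve A*x = b in workspace memory.
tic;
xCpu = A \ b;
tCpu = toc;

% GPU version: transfer the data, solve, and bring the result back.
tic;
Ag = gpuArray(A);
bg = gpuArray(b);
xg = Ag \ bg;
xGpu = gather(xg);   % gather returns only after the GPU result is available
tGpu = toc;

fprintf('CPU: %.3f s, GPU incl. transfers: %.3f s, speedup: %.2fx\n', ...
        tCpu, tGpu, tCpu / tGpu);
```

The first GPU call in a session usually carries a one-off initialization cost, so it may be worth running the script twice and reading the second result. Repeating the measurement for several values of n gives a speedup curve of the same kind as in the benchmark slides.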