
NVVP, Existing Libraries, Q/A

Textbook / resources

Nsight Eclipse Edition, NVIDIA Visual Profiler

Available libraries

Questions

Certificate dispersal

(Optional) Multiple GPUs: Where’s Pixel-Waldo?

TEXTBOOK

Programming Massively Parallel Processors: A Hands-on Approach

David Kirk, Wen-mei Hwu

NVIDIA DEVELOPER ZONE

Early access to driver and software updates

Heavily curated help forum

Requires registration and approval (nearly automated): developer.nvidia.com

US!

We’re pretty passionate about this GPU computing stuff.

Collaboration is cool

If you think you’ve got a problem that can benefit from GPU computation we may have some ideas.

IDE with an Eclipse foundation

CUDA-aware syntax highlighting / suggestions / recognition

Hooked into NVVP

Deep profiling of every aspect of GPU execution ( memory bandwidth, branch divergence, bank conflicts, compute / transfer overlap, and more! )

Provides suggestions for optimization

Graphical view of GPU performance

Nsight and NVVP are available on our cuda# machines

ssh -X <user>@<cuda machine>

Nsight demo on Week 3 code

Why re-invent the wheel?

• There are many GPU-enabled tools built on CUDA that are already available

These tools have been extensively tested for efficiency and in most cases will outperform custom solutions

Some require CUDA-like code structure

Linear Algebra, cuBLAS

CUDA-enabled basic linear algebra subroutines

• GPU-accelerated version of the complete standard BLAS library

Provided with the CUDA toolkit; code examples are included

Callable from C and Fortran
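As a taste of the cuBLAS v2 API (a minimal sketch, not the workshop's own example; the vector length and values are made up), a single-precision SAXPY, y = alpha*x + y, looks like this:

// saxpy_cublas.cu -- minimal cuBLAS v2 sketch; compile with: nvcc saxpy_cublas.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;
    float alpha = 2.0f;
    float *h_x = (float *)malloc(n * sizeof(float));
    float *h_y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);                               // set up a cuBLAS context
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);   // host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);      // y = alpha*x + y on the GPU
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);   // device -> host

    printf("y[0] = %f\n", h_y[0]);                       // expect 4.0
    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y); free(h_x); free(h_y);
    return 0;
}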


Linear Algebra, CULA, MAGMA

CULA and MAGMA extend BLAS

• CULA (Paid)

• CULA-dense: LAPACK and BLAS implementations, solvers, decompositions, basic matrix operations

• CULA-sparse: specialized sparse-matrix routines, specialized storage structures, iterative methods

• MAGMA (Free, BSD) (Fortran bindings)

• LAPACK and BLAS implementations, developed by the same development team as LAPACK.

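A rough sketch of how a MAGMA dense solve might look (hedged: this assumes MAGMA's CPU-interface magma_sgesv mirrors LAPACK's sgesv argument list, and header/library names vary between MAGMA versions; it is not the workshop's example):

// magma_solve.cpp -- illustrative only; assumes magma_sgesv follows
// LAPACK's sgesv(n, nrhs, A, lda, ipiv, B, ldb, info) convention.
#include <magma.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    magma_init();                                    // start MAGMA / attach to the GPU

    magma_int_t n = 512, nrhs = 1, info = 0;
    float *A = (float *)malloc(n * n * sizeof(float));
    float *B = (float *)malloc(n * nrhs * sizeof(float));
    magma_int_t *ipiv = (magma_int_t *)malloc(n * sizeof(magma_int_t));

    // Fill A (column-major) with a trivially solvable system: 2*I * x = 1.
    for (magma_int_t j = 0; j < n; ++j)
        for (magma_int_t i = 0; i < n; ++i)
            A[i + j * n] = (i == j) ? 2.0f : 0.0f;
    for (magma_int_t i = 0; i < n; ++i) B[i] = 1.0f;

    // LAPACK-style dense solve; MAGMA schedules the heavy work on the GPU.
    magma_sgesv(n, nrhs, A, n, ipiv, B, n, &info);
    printf("info = %d, B[0] = %f\n", (int)info, B[0]);   // expect info = 0, B[0] = 0.5

    free(A); free(B); free(ipiv);
    magma_finalize();
    return 0;
}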

IMSL Fortran/C Numerical Library

Large collection of GPU-accelerated mathematical and statistical functions

• Free evaluation, paid extension: http://www.roguewave.com/products/imslnumerical-libraries/fortran-library.aspx

Image/Signal Processing: NVIDIA Performance Primitives

1900 image-processing and 600 signal-processing algorithms

• Free and provided with the CUDA toolkit, code examples included.

Can be used in tandem with graphics APIs like OpenGL and DirectX.
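A tiny signal-processing sketch (hedged: this assumes the npps routines nppsAdd_32f, nppsMalloc_32f, and nppsFree; which NPP library you link against depends on the toolkit version, and this is not the workshop's example):

// npps_add.cu -- element-wise add of two device vectors with NPP (illustrative)
#include <npp.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20;

    // nppsMalloc_32f hands back device memory sized for n Npp32f elements.
    Npp32f *d_a = nppsMalloc_32f(n);
    Npp32f *d_b = nppsMalloc_32f(n);
    Npp32f *d_c = nppsMalloc_32f(n);

    // Fill the inputs from the host with plain CUDA copies for simplicity.
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.5f;
    cudaMemcpy(d_a, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h, n * sizeof(float), cudaMemcpyHostToDevice);

    nppsAdd_32f(d_a, d_b, d_c, n);                // c = a + b, entirely on the GPU

    cudaMemcpy(h, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h[0]);                  // expect 3.0

    nppsFree(d_a); nppsFree(d_b); nppsFree(d_c); free(h);
    return 0;
}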


CUDA without the CUDA: Thrust Library

Thrust is a high-level interface to GPU computing.

• Offers a templated, STL-like interface to sort, scan, reduce, etc.

A production-tested version is provided with the CUDA toolkit.
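For example, sorting and summing a million integers without writing a single kernel (a minimal sketch; the random data is just for illustration):

// thrust_demo.cu -- sort and reduce on the GPU with no hand-written kernels
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>
#include <iostream>

int main(void)
{
    // Generate some data on the host.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = std::rand() % 1000;

    // One-line transfer to the device -- no cudaMalloc/cudaMemcpy in sight.
    thrust::device_vector<int> d = h;

    thrust::sort(d.begin(), d.end());                    // GPU sort
    int sum = thrust::reduce(d.begin(), d.end(), 0);     // GPU reduction

    std::cout << "sum = " << sum << ", max = " << d.back() << std::endl;
    return 0;
}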


Python and CUDA

PyCUDA

• Python interface to CUDA functions.

• Simply a collection of wrappers, but effective.

NumbaPro (Paid)

• Announced at GTC 2013; a native CUDA Python compiler

Python = 4th major CUDA language

R and CUDA

R+GPU

• Package with accelerated alternatives for common R statistical functions

Rpud / rpudplus

• Package with accelerated alternatives for common R statistical functions

Rcuda

• … Package with accelerated alternatives for common R statistical functions


Where’s Pixel-Waldo?

Motivation: Given two images that both contain the same unique suspect and a number of distinct bystanders, identify the suspect by pairwise comparison.

This is hard

We’ll simplify the problem by reducing the targets to pixel triples.

0: Upload an image and an empty list to store targets to each GPU.

(Diagram: GPU0 holds f.bmp and a zeroed target list 0 | 0 | 0 | …; GPU1 holds s.bmp and a zeroed target list 0 | 0 | 0 | ….)
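A sketch of this step (a fragment, not the workshop source; IMG_BYTES, MAX_TARGETS, and h_img are illustrative placeholders), switching GPU context with cudaSetDevice and uploading on a stream per GPU:

// Step 0 sketch: give each GPU its own image plus a zeroed target list.
#include <cuda_runtime.h>

#define IMG_BYTES   (1024 * 1024 * 3)   /* placeholder image size  */
#define MAX_TARGETS 4096                /* placeholder list length */

unsigned char *d_img[2];
int           *d_targets[2];
cudaStream_t   stream[2];

void upload_images(unsigned char *h_img[2])   // h_img: host copies of f.bmp and s.bmp
{
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);                               // switch GPU context
        cudaStreamCreate(&stream[dev]);
        cudaMalloc((void **)&d_img[dev], IMG_BYTES);
        cudaMalloc((void **)&d_targets[dev], MAX_TARGETS * sizeof(int));

        // Asynchronous copies let both uploads run concurrently
        // (true overlap needs pinned host memory, e.g. cudaMallocHost).
        cudaMemcpyAsync(d_img[dev], h_img[dev], IMG_BYTES,
                        cudaMemcpyHostToDevice, stream[dev]);
        cudaMemsetAsync(d_targets[dev], 0, MAX_TARGETS * sizeof(int), stream[dev]);
    }
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);               // wait for both uploads
    }
}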

1: Find all positions of potential targets (triples) within each image, using both GPUs independently.

(Diagram: GPU0 now holds f.bmp plus candidate positions 11 | 143 | 243 | …; GPU1 holds s.bmp plus candidate positions 3 | 1632 | 54321 | ….)

2: Allow GPU0 to access GPU1's memory, and use both images and target lists to compare potential suspects.

(Diagram: GPU0, holding f.bmp and its candidate list, reads s.bmp and GPU1's candidate list across the PCI bus to compare candidates.)
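Step 2 rests on unified virtual addressing plus peer-to-peer access; a minimal sketch of enabling it (device numbers are assumed to be 0 and 1, and the comparison kernel is only hinted at in the comments):

// Step 2 sketch: let GPU0 dereference GPU1's memory directly.
#include <cuda_runtime.h>

void enable_peer_access(void)
{
    cudaSetDevice(0);                              // subsequent calls target GPU0

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);    // can GPU0 reach GPU1?
    if (can_access)
        cudaDeviceEnablePeerAccess(1, 0);          // map GPU1's allocations into GPU0's view

    // With UVA and peer access enabled, a kernel launched on GPU0 can read
    // GPU1's image and target list through ordinary pointers; the traffic
    // crosses the PCI bus. If peer access is unavailable, an explicit
    // cudaMemcpyPeer(dst, 0, src, 1, nbytes) is the fallback.
}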

3: Print the positions of the single matching suspect.

(Diagram: GPU0 sends the matching position, 132 | 629, back to the CPU over the PCI bus.)

Walk through the source code.

Things to note:

• This is unoptimized and known to be inefficient, but it covers the concepts of asynchronous streams, GPU context switching, universal addressing, and peer-to-peer access.

The source code requires the TCLAP library to compile.

The source code will be made available in a GitHub repository after the workshop.
