NVVP, Existing Libraries, Q/A
Text book / resources
Eclipse Nsight, NVIDIA Visual Profiler
Available libraries
Questions
Certificate dispersal
(Optional) Multiple GPUs: Where’s Pixel-Waldo?
TEXT BOOK
Programming Massively Parallel Processors: A Hands-on Approach
David Kirk, Wen-mei Hwu
NVIDIA DEVELOPER ZONE
Early access to updated drivers / updates
Heavily curated help forum
Requires registration and approval (nearly automated)
developer.nvidia.com
US!
We’re pretty passionate about this GPU computing stuff.
Collaboration is cool
If you think you’ve got a problem that can benefit from GPU computation, we may have some ideas.
Eclipse Nsight
IDE with an Eclipse foundation
CUDA-aware syntax highlighting / suggestions / recognition
Hooked into NVVP
NVIDIA Visual Profiler (NVVP)
Deep profiling of every aspect of GPU execution (memory bandwidth, branch divergence, bank conflicts, compute / transfer overlap, and more!)
Provides suggestions for optimization
Graphical view of GPU performance
Nsight and NVVP are available on our cuda# machines
ssh -X <user>@<cuda machine>
Nsight demo on Week 3 code
Why re-invent the wheel?
• There are many GPU-enabled tools built on CUDA that are already available
• These tools have been extensively tested for efficiency and in most cases will outperform custom solutions
• Some require CUDA-like code structure
Linear Algebra, cuBLAS
CUDA-enabled Basic Linear Algebra Subroutines
• GPU-accelerated version of the complete standard BLAS library
• Provided with the CUDA toolkit. Code examples are also provided
• Callable from C and Fortran
Linear Algebra, CULA, MAGMA
CULA and MAGMA extend BLAS
• CULA (Paid)
  • CULA-dense: LAPACK and BLAS implementations, solvers, decompositions, basic matrix operations
  • CULA-sparse: sparse matrix specialized routines, specialized storage structures, iterative methods
• MAGMA (Free, BSD) (Fortran bindings)
  • LAPACK and BLAS implementations, developed by the same team as LAPACK
IMSL Fortran/C Numerical Library
• Large collection of GPU-accelerated mathematical and statistical functions
• Free evaluation, paid extension
http://www.roguewave.com/products/imslnumerical-libraries/fortran-library.aspx
Image/Signal Processing: NVIDIA Performance Primitives
• 1900 image processing and 600 signal processing algorithms
• Free and provided with the CUDA toolkit, code examples included
• Can be used in tandem with visualization libraries like OpenGL, DirectX
CUDA without the CUDA: Thrust Library
• Thrust is a high-level interface to GPU computing.
• Offers template-interface access to sort, scan, reduce, etc.
• A production-tested version is provided with the CUDA toolkit.
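For readers new to these primitives, here is a minimal CPU-only sketch in plain Python of what sort, scan, and reduce compute. It is not Thrust code and does not touch a GPU; the sample data and variable names are made up for illustration, and the comments name the Thrust algorithm each line loosely corresponds to.

```python
from functools import reduce
from itertools import accumulate
import operator

# Hypothetical sample input, chosen only for illustration.
data = [3, 1, 4, 1, 5]

# Sort: ascending order (compare thrust::sort).
sorted_data = sorted(data)

# Inclusive scan: running sum that includes the current element
# (compare thrust::inclusive_scan).
inclusive = list(accumulate(data))

# Exclusive scan: running sum of everything before the current element,
# starting from 0 (compare thrust::exclusive_scan).
exclusive = [0] + inclusive[:-1]

# Reduce: collapse the whole sequence to a single value with a binary
# operator, here addition (compare thrust::reduce).
total = reduce(operator.add, data, 0)

print(sorted_data, inclusive, exclusive, total)
```

On the GPU these operations are performed in parallel, but the results should match this sequential reference for integer addition.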
Python and CUDA
PyCUDA
• Python interface to CUDA functions.
• Simply a collection of wrappers, but effective.
NumbaPro (Paid)
• Announced this year at GTC 2013, native CUDA Python compiler
• Python = 4th major CUDA language
R and CUDA
R+GPU
• Package with accelerated alternatives for common R statistical functions
Rpud / rpudplus
• Package with accelerated alternatives for common R statistical functions
Rcuda
• … Package with accelerated alternatives for common R statistical functions
Where’s Pixel-Waldo?
Motivation: Given two images which contain a unique suspect and a number of distinct bystanders, identify the suspect by pairwise comparison.
This is hard
We’ll simplify the problem by reducing the targets to pixel triples.
0: Upload an image and a list to store targets to each GPU.
[Figure: GPU0 holds f.bmp and target list 0 | 0 | 0 | …; GPU1 holds s.bmp and target list 0 | 0 | 0 | …]
1: Find all positions of potential targets (triples) within each image using both GPUs independently.
[Figure: GPU0 holds f.bmp and target list 11 | 143 | 243 | …; GPU1 holds s.bmp and target list 3 | 1632 | 54321 | …]
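2: Allow GPU0 to access GPU1 memory, use both images and target lists to compare potential suspects.
[Figure: GPU0 holds f.bmp, target list 11 | 143 | 243 | …, and a two-entry list 0 | 0; GPU1 holds s.bmp and target list 3 | 1632 | 54321 | …; the two GPUs are connected over the PCI bus]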
3: Print the positions of the single matching suspect.
[Figure: GPU0 holds f.bmp, target list 11 | 143 | 243 | …, and the two-entry list 132 | 629; GPU0 is connected to the CPU over the PCI bus]
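The logic of steps 1 to 3 can be sketched on the CPU in plain Python. This is a reference sketch, not the workshop source code, and it rests on assumptions that are not in the slides: each image is a flat list of integer pixel values, `BACKGROUND` is 0, and a "target" is a run of three consecutive non-background pixels. The function names are hypothetical.

```python
BACKGROUND = 0  # assumed background pixel value (not from the slides)


def find_targets(pixels):
    """Step 1: return the start index of every pixel triple.

    Assumption: a triple is three consecutive non-background pixels.
    """
    positions = []
    i = 0
    while i + 2 < len(pixels):
        if all(p != BACKGROUND for p in pixels[i:i + 3]):
            positions.append(i)
            i += 3  # skip past this triple
        else:
            i += 1
    return positions


def find_suspect(first, second):
    """Steps 2-3: compare every pair of targets across the two images.

    Returns a list of (position_in_first, position_in_second) pairs whose
    triples are identical. With one unique suspect, the list has one entry.
    """
    targets_first = find_targets(first)
    targets_second = find_targets(second)
    return [
        (a, b)
        for a in targets_first
        for b in targets_second
        if first[a:a + 3] == second[b:b + 3]
    ]


# Two tiny made-up "images": only the triple 4, 5, 6 appears in both.
f_img = [0, 1, 2, 3, 0, 4, 5, 6, 0, 0, 7, 8, 9]
s_img = [0, 0, 9, 9, 9, 0, 4, 5, 6, 0, 2, 2, 2]

print(find_targets(f_img))
print(find_targets(s_img))
print(find_suspect(f_img, s_img))
```

In the multi-GPU version, each call to `find_targets` would run on its own GPU, and the pairwise comparison would run on GPU0 while reading GPU1's image and target list through peer-to-peer access.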
Walk through the source code.
Things to note:
• This is unoptimized and known to be inefficient, but the concepts of asynchronous streams, GPU context switching, universal addressing, and peer-to-peer access are covered
• Source code requires the tclap library to compile.
• Source code will be made available in a GitHub repository after the workshop.