GPU Computing and CUDA

Cijo Thomas
Janith Kaiprath Valiyalappil
CS566 Parallel Programming, Spring '13
 GPU has 100s of cores compared to 4-8 cores for CPU
 CPU - executes a single thread very quickly
 GPU - executes many concurrent threads, each relatively slowly; traditionally
excels at embarrassingly parallel tasks
 GPU and CPU have complementary properties.
 Solve general-purpose problems using the GPU.
 Core idea is to map data parallel algorithms into equivalent
graphics concepts
 Have to make heavy use of graphics APIs.
 Traditionally a cumbersome task
 Never gained prominence among developers.
Compute Unified Device Architecture
Released in 2006 by NVIDIA
Easy programming of GPU using C extension
Transparently scales, harnessing the ever-growing power of GPUs
 Programs portable to newer GPU releases
 Scalable array of multi-threaded SMs (Streaming Multiprocessors)
 Each SM consists of multiple Streaming Processors (SPs)
 Inter-thread communication using shared memory
 CUDA terms – Host = CPU, Device = GPU
 Threads are grouped into thread blocks, and execute
concurrently on a single SM
 Thread blocks are grouped into grids, and are executed
independently and in parallel
 SIMT- Single Instruction Multiple Thread
 Thread creation, management, scheduling and execution
occur in groups of 32 threads called warps
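The hierarchy above can be sketched in code. In this illustrative kernel (the name and sizes are assumptions, not from the source), each thread derives a unique global index from the built-in block and thread coordinates:

```cuda
// Illustrative kernel: one thread per array element.
__global__ void scaleKernel(float *data, int n)
{
    // blockIdx, blockDim and threadIdx are CUDA built-in variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may be larger than n
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const int threadsPerBlock = 256;  // a multiple of the 32-thread warp size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // <<<grid, block>>> launches a grid of thread blocks on one or more SMs.
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The block size of 256 is a common choice because warps are scheduled 32 threads at a time.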
 Each thread has its own local memory apart from register and
stack space (Physically located on device memory off-chip)
 Next in hierarchy is a low-latency shared memory between
threads in a thread block
 Then there is high-latency global memory, shared by all threads
 All the above memories are physically and logically separate
from system memory.
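As a hedged sketch of the per-block shared memory tier described above (kernel name and block size are illustrative), a tree reduction can keep its intermediate sums in low-latency on-chip shared memory and touch global memory only at the start and end:

```cuda
// Per-block partial sum using on-chip shared memory.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];            // low-latency, per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // all loads finish before any reads

    // Tree reduction within the thread block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];   // one partial sum per block, in global memory
}
```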
[Source: Nvidia]
 cudaMalloc, cudaFree are used for allocating and releasing
memory on the device.
 cudaMemcpy is used to transfer data in 2 directions:
a) host to device memory - cudaMemcpyHostToDevice
b) device to host memory - cudaMemcpyDeviceToHost
 Device memory refers to global shared memory, and not
thread block shared memory
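A minimal sketch of these runtime calls (buffer names and sizes are illustrative):

```cuda
float h_a[256];   // host (CPU) buffer in system memory
float *d_a;       // device (GPU) pointer into global device memory

cudaMalloc((void **)&d_a, sizeof(h_a));                     // allocate on device
cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);  // host -> device
// ... launch kernels that read/write d_a ...
cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_a);                                              // release device memory
```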
 CUDA programs are heterogeneous CPU+GPU co-processing
 Use CPU core for serial portions, GPU for parallel portions
 CUDA kernel - can be a simple function or a program on its own
 GPU needs 1000s of threads for full efficiency
 CUDA threads are extremely light-weight with little or no
overhead in creation/switching
Allocate memory in device (GPU)
Copy data from system memory into device memory
Invoke the CUDA kernel, which processes the data
Copy results back from device memory to system memory.
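The four steps above can be sketched as one complete program; vector addition is used here only as a stand-in workload:

```cuda
#include <cstdio>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);                              // 1. allocate device memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // 2. copy inputs to device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // 3. invoke the kernel

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // 4. copy results back
    printf("h_c[10] = %.1f\n", h_c[10]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```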
 CUDA 5 - The latest release of CUDA
 Released Oct 2012
 Kepler Architecture vs Fermi Architecture
 GPU thread can launch parallel GPU kernels
[Harris, GPU Tech Conf,2012]
 Recursive parallel algorithms
 More efficient
– GPU kept more occupied
 Simplify CPU/GPU divide
 Library calls can be made from kernel
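A sketch of this dynamic-parallelism feature, assuming a Kepler GPU of compute capability 3.5 or later and compilation with relocatable device code (e.g. `nvcc -arch=sm_35 -rdc=true`); the kernel names are illustrative:

```cuda
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parentKernel(float *data, int n)
{
    // A GPU thread launches a child grid directly -- no CPU round trip,
    // so the GPU is kept occupied and recursive algorithms become natural.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}
```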
 GPU Object Linking
[Harris, GPU Tech Conf,2012]
 RDMA: Remote Direct Memory Access between any GPUs
in cluster
[Harris, GPU Tech Conf,2012]
 A source-to-source translation tool that relieves the programmer
from handling the memory hierarchy
[Ueng, LCPC , 2008]
 Makes the CUDA architecture run on regular multi-core CPUs
 Proves the effectiveness of the CUDA model on non-GPU
systems as well
 CUDA not as simple as it sounds
 People have questioned the future of CUDA
 CUDA has a strong reputation for performance, but at the
expense of ease of programming
 Alternatives like XMT have been developed, challenging CUDA
 XMT – a many-core general-purpose parallel architecture
[Caragea, HotPar 2010]
375 million CUDA-capable GPUs sold by NVIDIA
1 million toolkit downloads
>120,000 active developers
Active research community
New domains like Big-Data Analytics
Shazam – top-5 music app in the Apple App Store – real-time Twitter data analysis
 and many more….
Source : NVIDIA
CUDA is promising but only supports NVIDIA GPUs
OpenCL, AMD Brook are not mainstream yet
Automatic extraction of parallelism
Automatic conversion of existing code bases
written in popular models, e.g. Java threads
More support for higher-level languages
[Buck,SC08,2008] : Massimiliano Fatica (NVIDIA), Patrick LeGresley (NVIDIA),Ian Buck
(NVIDIA) ,John Stone (University of Illinois at Urbana-Champaign) , Jim Phillips
(University of Illinois at Urbana-Champaign), Scott Morton (Hess Corporation), Paulius
Micikevicius (NVIDIA), "High Performance Computing with CUDA" Nov.2008
[Ueng, LCPC, 2008]: Sain-Zee Ueng, Melvin Lathara, Sara S., Wen-mei W. Hwu, CUDA-lite: Reducing GPU
Programming Complexity, International Workshop, LCPC 2008,
Edmonton, Canada, July 31 - August 2, 2008
[Nickolls,IEEE,2010]: Nickolls, J, The GPU Computing Era, Micro IEEE, 2010
[Harris, GPU Tech Conf 2012]: Mark Harris, CUDA 5 and Beyond, GPU Tech Conference 2012
[Nickolls,ACM,2008] : John Nickolls, Ian Buck, Michael Garland, Kevin Skadron, Scalable
Parallel Programming with CUDA ,Queue – GPU Computing Vol 6, Issue 2, ACM Digital
Library April 2008
[Kirk,2010]: Programming Massively Parallel Processors: A Hands-on Approach 2010,
David B. Kirk, Wen-mei W. Hwu
[Caragea, HotPar 2010]: G.C. Caragea, F. Keceli, A. Tzannes, U. Vishkin, Proc. HotPar, 2010