GPU Computing and CUDA

Cijo Thomas
Janith Kaiprath Valiyalappil
CS566 Parallel Programming, Spring '13
GPU vs CPU
 GPU has hundreds of cores, compared to 4-8 cores for a CPU
 CPU - executes a single thread very quickly
 GPU - executes many concurrent threads, each more slowly; traditionally excels at embarrassingly parallel tasks
 GPU and CPU have complementary properties
 GPGPU: solving general-purpose problems using the GPU
 Core idea was to map data-parallel algorithms onto equivalent graphics concepts
 Required heavy use of graphics APIs
 Traditionally a cumbersome task
 Never gained prominence among developers
Until......




Compute Unified Device Architecture
 Released in 2006 by NVIDIA
 Easy programming of the GPU using extensions to C
 Transparently scales to harness the ever-growing power of NVIDIA GPUs
 Programs are portable to newer GPU releases
 Scalable array of multi-threaded SMs (Streaming Multiprocessors)
 Each SM consists of multiple SPs (Streaming Processors)
 Inter-thread communication via shared memory
 CUDA terms: Host = CPU, Device = GPU
[Nickolls,ACM,2008]
 Threads are grouped into thread blocks; a block's threads execute concurrently on a single SM
 Thread blocks are grouped into grids, and are executed independently and in parallel
 SIMT - Single Instruction, Multiple Thread
 Thread creation, management, scheduling, and execution occur in groups of 32 threads called warps
[Nickolls,ACM,2008]
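As a minimal sketch of the thread/block/grid hierarchy (the kernel name and launch sizes below are illustrative, not from the slides), each thread derives a unique global index from the built-in block and thread coordinates:

#include <cuda_runtime.h>

// Each thread computes a unique global index from its block and
// thread coordinates; the hardware schedules these in 32-thread warps.
__global__ void whoAmI(int *out) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalId] = globalId;
}

int main(void) {
    const int blocks = 4, threadsPerBlock = 64;   // 64 threads = 2 warps per block
    int *d_out;
    cudaMalloc(&d_out, blocks * threadsPerBlock * sizeof(int));
    whoAmI<<<blocks, threadsPerBlock>>>(d_out);   // a grid of 4 thread blocks
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}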
 Each thread has its own local memory, apart from register and stack space (physically located in off-chip device memory)
 Next in the hierarchy is low-latency shared memory, shared between threads within a thread block
 Then there is high-latency global memory, shared across all threads
 All of the above memories are physically and logically separate from system memory
[Source: Nvidia]
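A minimal sketch of this hierarchy in use (the 256-thread block size is an assumption): data is staged from high-latency global memory into low-latency on-chip shared memory before being operated on.

// Reverse each 256-element segment of an array within its thread block.
__global__ void reverseWithinBlock(float *data) {
    __shared__ float tile[256];                 // low-latency, per-block shared memory
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];                   // global -> shared
    __syncthreads();                            // wait for every thread in the block
    data[base + t] = tile[blockDim.x - 1 - t];  // shared -> global, reversed
}
// Launched as, e.g.: reverseWithinBlock<<<numBlocks, 256>>>(d_data);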
 cudaMalloc and cudaFree are used for allocating and releasing memory on the device
 cudaMemcpy is used to transfer data in two directions:
a) host to device memory - cudaMemcpyHostToDevice
b) device to host memory - cudaMemcpyDeviceToHost
 Device memory here refers to global shared memory, not thread-block shared memory
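To make the allocation and transfer directions concrete, a minimal sketch (buffer names and sizes are illustrative):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = 1024 * sizeof(float);
    float *h_buf = (float *)malloc(bytes);  // host (system) memory
    float *d_buf;                           // device (global) memory

    cudaMalloc(&d_buf, bytes);                                // allocate on the device
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // host -> device
    // ... kernels read and write d_buf here ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_buf);                                          // release device memory
    free(h_buf);
    return 0;
}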
 CUDA programs are heterogeneous CPU+GPU co-processing systems
 Use the CPU for serial portions, the GPU for parallel portions
 A CUDA kernel can be a simple function or a program on its own
 The GPU needs thousands of threads for full efficiency
 CUDA threads are extremely lightweight, with little or no overhead in creation/switching




 Allocate memory on the device (GPU)
 Copy data from system memory into device memory
 Invoke the CUDA kernel, which processes the data
 Copy results back from device memory to system memory
[Kirk,2010] [Nickolls,ACM,2008]
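Putting the four steps together, a complete sketch of a CUDA program (vector addition is a standard illustration, not taken from the slides):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: each of the many lightweight threads adds one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 1. Allocate memory on the device
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // 2. Copy inputs from system memory into device memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. Invoke the kernel: enough 256-thread blocks to cover n elements
    int threads = 256;
    vecAdd<<<(n + threads - 1) / threads, threads>>>(d_a, d_b, d_c, n);

    // 4. Copy results back from device memory to system memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}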
 CUDA 5 - the latest release of CUDA
 Released October 2012
 Kepler Architecture vs Fermi Architecture
 Dynamic parallelism: a GPU thread can launch parallel GPU kernels
[Harris, GPU Tech Conf,2012]
Advantages
 Recursive parallel algorithms
 More efficient
– GPU kept more occupied
 Simplifies the CPU/GPU divide
 Library calls can be made from within a kernel
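A minimal sketch of a GPU thread launching a child kernel (an assumed illustration; dynamic parallelism requires compute capability 3.5+ and compilation with nvcc -arch=sm_35 -rdc=true):

#include <cstdio>

__global__ void childKernel(int parent) {
    printf("child of parent thread %d, child thread %d\n", parent, threadIdx.x);
}

__global__ void parentKernel(void) {
    // Each GPU thread launches further work with no CPU round-trip,
    // which keeps the GPU occupied and enables recursive algorithms.
    childKernel<<<1, 4>>>(threadIdx.x);
}

int main(void) {
    parentKernel<<<1, 2>>>();
    cudaDeviceSynchronize();  // wait for the parent and all child kernels
    return 0;
}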
 GPU Object Linking
[Harris, GPU Tech Conf,2012]
 RDMA: Remote Direct Memory Access between any GPUs in a cluster
[Harris, GPU Tech Conf,2012]
CUDA-lite
 A source-to-source translation tool that relieves the programmer from handling the memory hierarchy
[Ueng, LCPC, 2008]
MCUDA
 Makes the CUDA programming model run on regular multi-core CPU systems
 Demonstrates the effectiveness of the CUDA model on non-GPU systems as well
[Buck,SC08,2008]
 CUDA is not as simple as it sounds
 Some have questioned the future of CUDA
 CUDA has a strong reputation for performance, but at the expense of ease of programming
 Alternatives like XMT have been developed, challenging CUDA
 XMT - a many-core general-purpose parallel architecture
[Caragea, HotPar 2010]





 375 million CUDA-capable GPUs sold by NVIDIA
 1 million toolkit downloads
 >120,000 active developers
 Active research community
 New domains like big-data analytics
 Shazam - a top-5 music app in the Apple App Store
 Salesforce.com - real-time Twitter data analysis
 and many more...
Source: NVIDIA
[Nickolls,IEEE,2010]





 CUDA is promising, but only supports NVIDIA GPUs
 OpenCL and AMD Brook+ are not mainstream yet
 Automatic extraction of parallelism
 Automatic conversion of existing code bases written in popular models, e.g. Java threads
 More support for higher-level languages







[Buck,SC08,2008]: Massimiliano Fatica (NVIDIA), Patrick LeGresley (NVIDIA), Ian Buck (NVIDIA), John Stone (University of Illinois at Urbana-Champaign), Jim Phillips (University of Illinois at Urbana-Champaign), Scott Morton (Hess Corporation), Paulius Micikevicius (NVIDIA), "High Performance Computing with CUDA", Nov. 2008
[Ueng, LCPC, 2008]: Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, Wen-mei W. Hwu, "CUDA-lite: Reducing GPU Programming Complexity", International Workshop, LCPC 2008, Edmonton, Canada, July 31 - August 2, 2008
[Nickolls,IEEE,2010]: John Nickolls, William J. Dally, "The GPU Computing Era", IEEE Micro, 2010
[Harris, GPU Tech Conf,2012]: Mark Harris, "CUDA 5 and Beyond", GPU Technology Conference, 2012
[Nickolls,ACM,2008]: John Nickolls, Ian Buck, Michael Garland, Kevin Skadron, "Scalable Parallel Programming with CUDA", ACM Queue - GPU Computing, Vol. 6, Issue 2, April 2008
[Kirk,2010]: David B. Kirk, Wen-mei W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach", 2010
[Caragea, HotPar 2010]: G.C. Caragea, F. Keceli, A. Tzannes, U. Vishkin, Proc. HotPar, 2010
Thank You!