Summary of CUDA

SUMMARY
PARALLEL COMPUTING EXPERIENCES WITH CUDA
Michael Garland, Scott Le Grand, John Nickolls, Joshua Anderson, Jim Hardwick, Scott Morton, Everett Phillips, Yao Zhang, and Vasily Volkov
IEEE Micro, vol. 28, no. 4, 2008, pp. 13-27
After introducing the NVIDIA Tesla hardware and the corresponding Compute Unified Device
Architecture (CUDA), this week we continue with some practical experiences of CUDA
programming.
1. MOTIVATION
As concluded in my first summary, general-purpose computation on graphics processing units
(GPGPU) needs a high-level programming model and tools to support productive development
of parallel programs. As the hardware design of graphics processing units (GPUs) has grown
more and more flexible, NVIDIA developed the CUDA programming model and software
environment to let programmers write scalable parallel programs using a straightforward
extension of the C language.
The CUDA programming model guides the programmer to expose substantial fine-grained
parallelism sufficient for utilizing massively multithreaded GPUs, while at the same time
providing scalability across the broad spectrum of physical parallelism available in the range
of GPU devices. Because it provides a fairly simple, minimalist abstraction of parallelism and
inherits all the well-known semantics of C, it lets programmers develop massively parallel
programs with relative ease.
2. TESLA ARCHITECTURE
The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram
of a GTX 280 processor with 240 streaming processor (SP) cores, organized in 30 streaming
multiprocessors (SM). Each multithreaded SP core executes up to 128 concurrent threads
sharing a register file of 2,048 entries; in total, the GPU executes up to 30,720 concurrent
threads.
Figure 1: Tesla unified graphics and computing architecture of a GeForce GTX 280 or Tesla T10 GPU with 240 SP
streaming processor cores, organized in 30 SM multithreaded multiprocessors. Each multithreaded SP core
executes up to 128 concurrent threads; the GPU executes up to 30,720 concurrent threads.
SM multithreading
To efficiently execute hundreds of threads in parallel while running several different
programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent
threads in hardware with zero scheduling overhead.
Single-instruction, multiple-thread
To manage and execute hundreds of threads running several different programs efficiently,
the Tesla SM uses a new processor architecture we call single-instruction, multiple-thread
(SIMT). The SM’s SIMT multithreaded instruction unit creates, manages, schedules, and
executes threads in groups of 32 parallel threads called warps.
SIMT warp scheduling
As a unified graphics processor, the SM schedules and executes multiple warp types
concurrently. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At
each cycle, it selects one of the 24 warps to execute a SIMT warp instruction. An issued warp
instruction executes as two sets of 16 threads over four processor cycles. The SP cores and
special-function units (SFUs) execute instructions independently, and by issuing instructions
between them on alternate cycles, the scheduler can keep both fully occupied.
3. PROGRAMMING MODEL
The CUDA parallel programming model emphasizes two key design goals:
1) Extend standard C/C++ with a minimalist set of abstractions for expressing parallelism,
letting the programmer focus on writing efficient parallel code.
2) Design for writing parallel code that scales transparently and efficiently across tens of
thousands of concurrent threads and hundreds of processor cores.
A CUDA program is organized into a host program, consisting of one or more sequential
threads running on the host CPU, and one or more parallel kernels running on the GPU.
A kernel executes as a set of parallel threads, which the programmer organizes into a grid of
thread blocks. The threads of a single thread block are allowed to synchronize with each other
via barriers and have access to a shared memory for inter-thread communication. Threads
from different blocks in the same grid can share data via global memory.
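To make these concepts concrete, here is a minimal sketch of a CUDA program; the kernel, the
names, and the sizes are my own illustrative choices, not taken from the paper. The host
program launches a grid of thread blocks, and the threads of each block cooperate through
shared memory and a __syncthreads() barrier to produce one partial sum per block.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread block cooperatively sums a 256-element tile of the input.
// Threads within a block communicate through __shared__ memory and
// synchronize with the __syncthreads() barrier; blocks write their
// partial sums independently to global memory.
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                       // barrier: tile fully loaded

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];   // one result per block
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    // ... fill dIn with input data (omitted) ...

    // The host program launches the parallel kernel over a grid of blocks.
    blockSum<<<blocks, threads>>>(dIn, dOut, n);
    cudaDeviceSynchronize();

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}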
4. APPLICATION EXPERIENCE WITH CUDA
Many applications—both academic research and industrial products—have been accelerated
using CUDA to achieve significant parallel speedups. Of the many available examples, this
article surveys a few representative cases.
Area 1: Molecular Dynamics
Molecular dynamics is a simulation technique that computes the movement of a number of
atoms, starting from some initial configuration, and tracks their trajectories over a specified
time interval. Bead-spring polymer simulation is a typical example.
Two typical example applications are introduced here: 1) Highly Optimized Object-Oriented
Molecular Dynamics (HOOMD); 2) Folding@Home.
Process with CUDA
The problem is decomposed into many time steps. In each time step of the simulation, the
program must calculate the forces acting on all atoms and then integrate the atom positions.
Each atom can be processed independently of the others during a single time step, so it is
natural to map each atom to a single thread.
Molecular dynamics simulations are inherently parallel computations and are ideally suited to
the CUDA programming model.
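As an illustration of this one-thread-per-atom mapping, here is a naive all-pairs
Lennard-Jones force kernel of my own; HOOMD's real implementation uses neighbor lists,
shared-memory tiling, and other optimizations that are not shown here.

#include <cuda_runtime.h>

// One thread per atom: each thread accumulates the Lennard-Jones force that
// every other atom exerts on "its" atom. This naive O(N^2) version only
// illustrates the mapping of atoms to threads.
__global__ void computeForces(const float4 *pos, float4 *force, int nAtoms,
                              float epsilon, float sigma)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;

    float4 pi = pos[i];
    float3 f = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < nAtoms; ++j) {
        if (j == i) continue;
        float dx = pi.x - pos[j].x;
        float dy = pi.y - pos[j].y;
        float dz = pi.z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float s2 = sigma * sigma / r2;
        float s6 = s2 * s2 * s2;
        // Lennard-Jones force magnitude divided by the distance r.
        float fr = 24.0f * epsilon * (2.0f * s6 * s6 - s6) / r2;
        f.x += fr * dx;  f.y += fr * dy;  f.z += fr * dz;
    }
    force[i] = make_float4(f.x, f.y, f.z, 0.0f);
}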
Performance
1) HOOMD
- Serial calculation on a single core of a 2.4-GHz Opteron 280 CPU: 6.46 time steps per
second, which corresponds to 32 days of calculation
- CUDA (HOOMD) on a GeForce 8800 GTX: 203 time steps per second, equivalent to using 32
Opteron 280 nodes
2) Folding@Home
- The CUDA implementation on the GTX 280 runs 3.6 times faster than the Brook-based ATI
implementation and 6.7 times faster than the Cell implementation.
Area 2: Numerical Linear Algebra
Two typical examples in this research area are also introduced here:
1) Dense matrix-matrix multiplication is one of the fundamental building blocks of
numerical linear algebra algorithms.
2) Matrix factorizations are widely used to solve systems of linear equations and linear
least-squares problems. The LU, Cholesky, and QR factorizations are the most common.
Process with CUDA
1) Dense matrix-matrix multiplication is a natural fit for CUDA and the GPU because it is
inherently parallel and can naturally be expressed as a blocked computation (see the tiled
sketch after this list).
2) Matrix factorizations can be implemented using blocked algorithms that do most of their
work with bulk matrix-matrix multiplies, which have high arithmetic intensity and expose
substantial amounts of data- and thread-level parallelism. The remaining work consists of
panel factorization and pivoting. Panel factorization does not expose sufficient parallelism,
so it runs on the CPU, while pivoting can be handled efficiently by keeping matrices
transposed in GPU memory.
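The blocked structure mentioned above can be illustrated with a textbook shared-memory tiled
matrix multiplication. This sketch is not Volkov's tuned SGEMM kernel from the paper, only
the basic tiling idea; the tile size is my own choice.

#define TILE 16

// C = A * B for n x n matrices, computed tile by tile. Each thread block
// stages a TILE x TILE tile of A and of B in shared memory so that every
// element loaded from global memory is reused TILE times. Assumes n is a
// multiple of TILE and the block dimensions are (TILE, TILE).
__global__ void matmulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                 // both tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done with these tiles
    }
    C[row * n + col] = acc;
}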
Performance
1) Matrix-matrix multiplication kernel (SGEMM)
- Achieves up to 206 Gflops
- A 2.9-times improvement over the highest rate achieved by the Core2 Quad, and roughly
60 percent of the GeForce 8800 GTX peak multiply-add rate
2) Comparison of the performance of some factorization routines
- The CUDA factorization running on the GeForce 8800 GTX and a 2.67-GHz Core2 Duo is up
to 5.5 times faster than the MKL factorization code running on the 2.4-GHz Core2 Quad
Area 3: Medical Imaging
TechniScan Medical Systems has been developing advanced inverse-scattering algorithms to
generate 3D volumetric images of the breast with ultrasound.
Process with CUDA
The inverse-scattering algorithm uses 2D convolution by FFT to simulate ultrasound
propagation. A 2D convolution routine can be implemented using CUFFT, the FFT library
supplied with CUDA.
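As a sketch of how such a frequency-domain convolution could be written with CUFFT: the
helper names and the simple complex-to-complex formulation below are my own assumptions;
TechniScan's actual routine is not shown in the paper.

#include <cufft.h>
#include <cuda_runtime.h>

// Pointwise complex multiply in the frequency domain, with 1/(nx*ny)
// scaling folded in because CUFFT's inverse transform is unnormalized.
__global__ void pointwiseMul(cufftComplex *a, const cufftComplex *b,
                             int count, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    cufftComplex x = a[i], y = b[i];
    a[i].x = (x.x * y.x - x.y * y.y) * scale;
    a[i].y = (x.x * y.y + x.y * y.x) * scale;
}

// 2D circular convolution of two nx x ny complex images held in device
// memory: FFT both, multiply pointwise, inverse FFT the product in place.
void convolve2d(cufftComplex *dImage, cufftComplex *dKernel, int nx, int ny)
{
    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);

    cufftExecC2C(plan, dImage,  dImage,  CUFFT_FORWARD);
    cufftExecC2C(plan, dKernel, dKernel, CUFFT_FORWARD);

    int count = nx * ny;
    int threads = 256;
    int blocks = (count + threads - 1) / threads;
    pointwiseMul<<<blocks, threads>>>(dImage, dKernel, count, 1.0f / count);

    cufftExecC2C(plan, dImage, dImage, CUFFT_INVERSE);
    cufftDestroy(plan);
}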
Performance
Using CUFFT, the Fast Fourier Transform library supplied with CUDA:
- In general, running the algorithm on two Tesla D870s provided performance as fast as a
16-core Intel Core2 CPU cluster.
Area 4: Fluid Dynamics
Physical simulations based on finite element, finite-difference, finite-volume, and similar
methods are not as trivially parallelized as molecular dynamics.
The 2D compressible Euler equations are often considered; they are used in the design of
aerospace vehicle components such as rocket nozzles and supersonic airfoils.
Process with CUDA
Phillips and colleagues developed their CUDA-based solver by adopting a blocking strategy
similar to those used in matrix multiplication and image processing, which transforms the
problem into a highly parallel computation.
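The paper's solver code is not reproduced in this summary; the blocking idea can be
illustrated with a simple 2D five-point stencil kernel in which each thread block stages a
tile plus its halo in shared memory. The tile sizes and names are assumptions, and the kernel
assumes the grid dimensions are multiples of the tile size and a block size of (BX, BY).

#define BX 16
#define BY 16

// Blocked 2D five-point stencil: each thread block loads a (BY+2) x (BX+2)
// tile (its interior cells plus a one-cell halo) into shared memory, then
// each thread updates one interior cell from its four neighbors.
__global__ void stencilStep(const float *in, float *out, int nx, int ny)
{
    __shared__ float tile[BY + 2][BX + 2];

    int gx = blockIdx.x * BX + threadIdx.x;   // global column index
    int gy = blockIdx.y * BY + threadIdx.y;   // global row index
    int lx = threadIdx.x + 1;                 // local column inside the tile
    int ly = threadIdx.y + 1;                 // local row inside the tile

    tile[ly][lx] = in[gy * nx + gx];
    // Edge threads also fetch the halo cells, clamped at the domain boundary.
    if (threadIdx.x == 0)      tile[ly][0]      = in[gy * nx + max(gx - 1, 0)];
    if (threadIdx.x == BX - 1) tile[ly][BX + 1] = in[gy * nx + min(gx + 1, nx - 1)];
    if (threadIdx.y == 0)      tile[0][lx]      = in[max(gy - 1, 0) * nx + gx];
    if (threadIdx.y == BY - 1) tile[BY + 1][lx] = in[min(gy + 1, ny - 1) * nx + gx];
    __syncthreads();                          // the whole tile is now loaded

    // Simple averaging stencil; a real finite-volume solver computes fluxes
    // here, but the blocked memory access pattern is the same.
    out[gy * nx + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                 tile[ly - 1][lx] + tile[ly + 1][lx]);
}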
Performance
Example simulations of a rocket nozzle and a supersonic airfoil with 6.4 million nodes:
- The cluster consists of four nodes, each with two Quadro FX 5600 GPUs and dual Opteron
2216 CPUs, connected by gigabit Ethernet.
- Compared with a 2.4-GHz Core2 Duo, the solver delivers 22 times the serial performance
on one GPU and 160 times on eight GPUs.
Area 5: Seismic Imaging
The petroleum industry makes heavy use of seismic data to construct images of the earth’s
subsurface structure. Large amount of parallel computation is involved in the seismic imaging
process.
Process with CUDA
A prototype wave-equation solver was implemented in CUDA to address this problem.
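The solver itself is not shown in this summary; the following sketch of a single 2D
finite-difference time step for the scalar wave equation (a simplified second-order scheme
of my own, not the production code) conveys the kind of kernel involved.

// One time step of the 2D scalar wave equation, discretized with
// second-order centered differences in space and time:
//   p_next = 2*p_cur - p_prev + (v*dt/dx)^2 * laplacian(p_cur)
// Each thread updates one grid point; seismic imaging repeats such a
// kernel for thousands of time steps per shot.
__global__ void waveStep(const float *pPrev, const float *pCur, float *pNext,
                         const float *vel, int nx, int ny,
                         float dt, float dx)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix <= 0 || iy <= 0 || ix >= nx - 1 || iy >= ny - 1) return;

    int i = iy * nx + ix;
    float lap = pCur[i - 1] + pCur[i + 1] + pCur[i - nx] + pCur[i + nx]
              - 4.0f * pCur[i];
    float c = vel[i] * dt / dx;
    pNext[i] = 2.0f * pCur[i] - pPrev[i] + c * c * lap;
}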
Performance
- A Tesla C870 outperforms two dual 3.6-GHz Intel Xeons and six dual 3.0-GHz quad-core
Xeons.
5. CONCLUSION
From all of the above experiences, we can conclude some useful techniques for writing CUDA
programs:
1) Expose fine-grained parallelism to the hardware.
2) Block data and computations.
3) Try to make the threads within a warp execute along the same path.
4) Take advantage of the on-chip shared memory and the extremely large register file on the GPU.
5) Use stream programming between the CPU and GPU to hide the overhead of host-device
communication (see the sketch after this list).
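As an example of point 5, a common pattern (a generic sketch of my own, not taken from any of
the applications above) is to split the data into chunks and pipeline asynchronous copies and
kernel launches over CUDA streams.

#include <cuda_runtime.h>

// Placeholder kernel processing one chunk of data.
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Split the work into chunks and pipeline them over two streams so that the
// host-device copy of one chunk overlaps with the kernel execution of
// another. hData must be pinned (allocated with cudaMallocHost) for
// cudaMemcpyAsync to be truly asynchronous; n is assumed to divide evenly.
void pipelined(float *hData, int n)
{
    const int nChunks = 8;
    const int chunk = n / nChunks;

    float *dData;
    cudaMalloc(&dData, n * sizeof(float));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % 2];
        float *hChunk = hData + c * chunk;
        float *dChunk = dData + c * chunk;

        cudaMemcpyAsync(dChunk, hChunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        process<<<(chunk + 255) / 256, 256, 0, s>>>(dChunk, chunk);
        cudaMemcpyAsync(hChunk, dChunk, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(dData);
}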
Max Lv
2010/1/16