SUMMARY: PARALLEL COMPUTING EXPERIENCES WITH CUDA
Michael Garland, Scott Le Grand, John Nickolls, Joshua Anderson, Jim Hardwick, Scott Morton, Everett Phillips, Yao Zhang, and Vasily Volkov
IEEE Micro, vol. 28, no. 4, 2008, pp. 13-27

After introducing the NVIDIA Tesla hardware and the corresponding Compute Unified Device Architecture (CUDA), this week we continue with a focus on practical experiences of CUDA programming.

1. MOTIVATION

As concluded in my first summary, general-purpose computation on graphics processing units (GPGPU) needs a high-level programming model and tools to support productive development of parallel programs. As the hardware design of graphics processing units (GPUs) became more and more flexible, NVIDIA developed the CUDA programming model and software environment to let programmers write scalable parallel programs using a straightforward extension of the C language. The CUDA programming model guides the programmer to expose substantial fine-grained parallelism, sufficient for utilizing massively multithreaded GPUs, while at the same time providing scalability across the broad spectrum of physical parallelism available in the range of GPU devices. Because it provides a fairly simple, minimalist abstraction of parallelism and inherits all the well-known semantics of C, it lets programmers develop massively parallel programs with relative ease.

2. TESLA ARCHITECTURE

The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GTX 280 processor with 240 streaming-processor (SP) cores, organized in 30 streaming multiprocessors (SMs). Each multithreaded SP core executes up to 128 concurrent threads sharing a register file of 2,048 entries; in total, the GPU executes up to 30,720 concurrent threads.

Figure 1: Tesla unified graphics and computing architecture of a GeForce GTX 280 or Tesla T10 GPU with 240 SP streaming-processor cores, organized in 30 SM multithreaded multiprocessors. Each multithreaded SP core executes up to 128 concurrent threads; the GPU executes up to 30,720 concurrent threads.

SM multithreading
To efficiently execute hundreds of threads in parallel while running several different programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead.

Single-instruction, multiple-thread
To manage and execute hundreds of threads running several different programs efficiently, the Tesla SM uses a new processor architecture called single-instruction, multiple-thread (SIMT). The SM's SIMT multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

SIMT warp scheduling
As a unified graphics processor, the SM schedules and executes multiple warp types concurrently. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At each cycle, it selects one of the 24 warps to execute a SIMT warp instruction (as Figure 3 in the paper shows). An issued warp instruction executes as two sets of 16 threads over four processor cycles. The SP cores and SFU units execute instructions independently, and by issuing instructions to them on alternate cycles, the scheduler can keep both fully occupied.

3. PROGRAMMING MODEL

The CUDA parallel programming model emphasizes two key design goals:
1) Extend standard C/C++ with a minimalist set of abstractions for expressing parallelism, letting the programmer focus on writing efficient parallel code.
2) Let programmers write parallel code that runs transparently and efficiently across tens of thousands of concurrent threads and hundreds of processor cores.

A CUDA program is organized into a host program, consisting of one or more sequential threads running on the host CPU, and one or more parallel kernels running on the GPU. The programmer organizes the threads of a kernel into a grid of thread blocks. The threads of a single thread block are allowed to synchronize with each other via barriers and have access to a shared memory for inter-thread communication. Threads from different blocks in the same grid can share data via global memory.
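To make these abstractions concrete, here is a minimal sketch of my own (not code from the paper): each thread of a one-dimensional grid handles one array element, each thread block stages its elements in on-chip shared memory, and a barrier separates the cooperative load from the subsequent reads. The kernel name, the block size of 256, and the data are arbitrary assumptions chosen for illustration.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread block stages a tile of the input in
    // on-chip shared memory, synchronizes at a barrier, and then writes the
    // tile back out reversed within the block. One thread handles one element.
    // Assumes blockDim.x <= 256 to match the shared-memory tile size.
    __global__ void reverseTiles(const float *in, float *out, int n)
    {
        __shared__ float tile[256];                      // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
        if (i < n)
            tile[threadIdx.x] = in[i];
        __syncthreads();                                 // barrier: tile fully loaded
        int j = blockDim.x - 1 - threadIdx.x;            // mirrored position in the tile
        if (i < n && blockIdx.x * blockDim.x + j < n)
            out[i] = tile[j];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h_in = (float *)malloc(bytes);
        float *h_out = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        float *d_in, *d_out;
        cudaMalloc((void **)&d_in, bytes);
        cudaMalloc((void **)&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

        // One-dimensional grid of one-dimensional thread blocks, 256 threads each.
        const int block = 256;
        const int grid = (n + block - 1) / block;
        reverseTiles<<<grid, block>>>(d_in, d_out, n);

        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        printf("h_out[0] = %f (expected 255)\n", h_out[0]);
        cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
        return 0;
    }

The same pattern scales transparently: the grid dimension grows with the problem size, and the hardware schedules as many blocks concurrently as the particular GPU can hold.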
4. APPLICATION EXPERIENCE WITH CUDA

Many applications, both academic research and industrial products, have been accelerated using CUDA to achieve significant parallel speedups. Of the many available examples, the article surveys a few representative cases.

Area 1: Molecular Dynamics
Molecular dynamics is a simulation technique that computes the movement of a number of atoms, beginning from some initial configuration, and tracks their trajectories over specified time intervals. Bead-spring polymer simulation is a typical example. Two representative applications are introduced here:
1) Highly Optimized Object-Oriented Molecular Dynamics (HOOMD);
2) Folding@Home.

Process with CUDA
The simulation is decomposed into many time steps. In each time step, the program must calculate the forces acting on all atoms and then integrate the atom positions. Each atom can be processed independently of the others during a single time step, so it is natural to map each atom to a single thread. Molecular dynamics simulations are therefore inherently parallel computations and are ideally suited to the CUDA programming model.

Performance
1) HOOMD
- Serial calculation on a single core of a 2.4-GHz Opteron 280 CPU: 6.46 time steps per second (TPS), about 32 days for the full calculation.
- CUDA implementation (HOOMD) on a GeForce 8800 GTX: 203 TPS, equivalent to using 32 Opteron 280 nodes.
2) Folding@Home
- The CUDA implementation on the GTX 280 runs 3.6 times faster than the Brook-based ATI implementation and 6.7 times faster than the Cell implementation.

Area 2: Numerical Linear Algebra
Two typical examples in this research area are introduced:
1) Dense matrix-matrix multiplication, one of the fundamental building blocks of numerical linear algebra algorithms.
2) Matrix factorizations, which are widely used to solve systems of linear equations and linear least-squares problems. The LU, Cholesky, and QR factorizations are the most common.

Process with CUDA
1) Dense matrix-matrix multiplication is a natural fit for CUDA and the GPU because it is inherently parallel and can naturally be expressed as a blocked computation (see the sketch below).
2) Matrix factorizations can be implemented using blocked algorithms that do most of their work with bulk matrix-matrix multiplies, which have high arithmetic intensity and expose substantial amounts of data- and thread-level parallelism. The remaining work consists of panel factorization and pivoting. Panel factorization does not expose sufficient parallelism, so it runs on the CPU, while pivoting can be handled efficiently by keeping the matrices transposed in GPU memory.
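To illustrate the blocked structure described above, the following sketch shows the general shared-memory tiling technique. It is my own simplified illustration, not the optimized SGEMM kernel evaluated in the paper; the tile width, the names, and the square row-major matrix assumption are mine.

    #include <cuda_runtime.h>

    #define TILE 16  // each thread block computes a TILE x TILE tile of C

    // Simplified blocked multiply, C = A * B, for n x n row-major matrices.
    // Each thread computes one element of C, accumulating partial products
    // from square tiles of A and B staged in on-chip shared memory.
    __global__ void matmulTiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;   // row of C this thread owns
        int col = blockIdx.x * TILE + threadIdx.x;   // column of C this thread owns
        float acc = 0.0f;

        for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
            int aCol = t * TILE + threadIdx.x;
            int bRow = t * TILE + threadIdx.y;
            // Cooperative load of one tile of A and one tile of B.
            As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
            __syncthreads();                         // wait until both tiles are loaded

            for (int k = 0; k < TILE; ++k)           // multiply-accumulate within the tile
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                         // wait before overwriting the tiles
        }
        if (row < n && col < n)
            C[row * n + col] = acc;
    }

A host-side launch would use dim3 block(TILE, TILE) and dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE). The SGEMM kernel whose performance is reported below is far more heavily optimized than this sketch, but it relies on the same blocking idea.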
Performance
1) Matrix-matrix multiplication kernel (SGEMM):
- Achieves up to 206 Gflops, a 2.9 times improvement over the highest rate achieved by the Core2 Quad, and roughly 60 percent of the GeForce 8800 GTX peak multiply-add rate.
2) Factorization routines:
- The CUDA factorization, running on the GeForce 8800 GTX together with a 2.67-GHz Core2 Duo, is up to 5.5 times faster than the MKL factorization code running on the 2.4-GHz Core2 Quad.

Area 3: Medical Imaging
TechniScan Medical Systems has been developing advanced inverse-scattering algorithms to generate 3D volumetric images of the breast with ultrasound.

Process with CUDA
The inverse-scattering algorithm uses FFT-based 2D convolution to simulate ultrasound propagation. The 2D convolution routine can be implemented using CUFFT, the Fast Fourier Transform library supplied with CUDA.

Performance
- In general, running the algorithm on two Tesla D870s provided performance roughly as fast as a 16-core Intel Core2 CPU cluster.

Area 4: Fluid Dynamics
Physical simulations based on finite-element, finite-difference, finite-volume, and similar methods are not as trivially parallelized as molecular dynamics. The example considered here is a solver for the 2D compressible Euler equations, which are often used in the design of aerospace vehicle components such as rocket nozzles and supersonic airfoils.

Process with CUDA
Phillips and colleagues developed a CUDA-based solver by adopting a blocking strategy similar to those used in matrix multiplication and image processing, which transforms the solver into a highly parallel computation.

Performance
Example simulations of a rocket nozzle and a supersonic airfoil with 6.4 million nodes:
- The cluster consists of four nodes, each with two Quadro FX 5600 GPUs and dual Opteron 2216 CPUs, connected by gigabit Ethernet.
- Compared with serial execution on one 2.4-GHz Core2 Duo, the solver delivers 22x the serial performance on one GPU and 160x on eight GPUs.

Area 5: Seismic Imaging
The petroleum industry makes heavy use of seismic data to construct images of the earth's subsurface structure. A large amount of parallel computation is involved in the seismic imaging process.

Process with CUDA
A prototype wave-equation solver was implemented in CUDA to accelerate this computation.

Performance
- A Tesla C870 outperforms two dual 3.6-GHz Intel Xeons and six dual 3.0-GHz quad-core Xeons.

5. CONCLUSION

From all the experiences above, we can conclude some useful techniques for writing CUDA programs:
1) Expose fine-grained parallelism to the hardware.
2) Block data and computations.
3) Try to make the threads in a warp execute along the same path (avoid warp divergence).
4) Take advantage of the on-chip shared memory and the extremely large register file on the GPU.
5) Use stream programming between the CPU and GPU to hide the overhead of host-device communication (a minimal sketch of this technique follows at the end of this summary).

Max Lv, 2010/1/16
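As a closing illustration of technique 5 above, here is a minimal sketch of my own, with a hypothetical kernel, chunk size, and stream count: the input is split into chunks, and the copy-in, kernel launch, and copy-out for each chunk are issued asynchronously into a separate CUDA stream, so transfers for one chunk can overlap with computation on another. Page-locked (pinned) host memory is required for the copies to be truly asynchronous.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial per-element kernel, used only to give the streams work to overlap.
    __global__ void scale(float *d, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= s;
    }

    int main()
    {
        const int n = 1 << 22;
        const int nStreams = 4;
        const int chunk = n / nStreams;              // n chosen divisible by nStreams

        float *h, *d;
        cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault); // pinned memory
        cudaMalloc((void **)&d, n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaStream_t streams[nStreams];
        for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

        for (int s = 0; s < nStreams; ++s) {
            int off = s * chunk;
            // Copy-in, compute, and copy-out for chunk s are queued in stream s,
            // so work in different streams can overlap on the device.
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();                     // wait for all streams to drain

        printf("h[0] = %f (expected 2.0)\n", h[0]);
        for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }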