GPGPU Applications and Simulations
Mike Metzger
mmetzg3@uic.edu
MS - ECE

GPU History
- Term coined in 1999 by NVIDIA with the release of the GeForce 256
- Real “first” GPU was the 1985 Commodore Amiga: a graphics coprocessor with a primitive instruction set
- Modern day: GPUs are commercially available at low cost and prolific in computing systems
- Primarily used (obviously) for displaying and rapidly altering video data, which is inherently data-parallel
- Architecture has become increasingly parallel over the last decade
- GPUs are structured as SIMD machines to exploit high levels of data-level parallelism (DLP)

GPU Capabilities
- Large matrix/vector operations
- Protein folding (molecular dynamics) modeling
- FFT (signal processing)
- Physics simulations
- Sequence matching
- Speech recognition
- Database manipulation
- Sort/search algorithms
- Medical imaging

GPGPU Origins
C. Thompson, S. Hahn, M. Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002
- Use the parallel architecture of GPUs to exploit data-level parallelism in common processes
- Wrote a C++ framework for testing various data-heavy operations, implemented through the OpenGL programming interface
- Ran tests of arithmetic, exponential, factorial, and multiplicative operations on large vectors (10 thousand to 10 million elements)
- Used NVIDIA's GeForce4 with 128 MB of VRAM (18 specialized cores); compared results to a 1.5 GHz Pentium 4 with 1 GB of RAM
- No modifications to GPU or CPU hardware; test programs compiled with Microsoft Visual C++ 6

GeForce4 Architecture
- Provides an ISA for vertex programming: registers hold and process quad-valued FP numbers
- Input and output attribute registers hold various graphical data
- No access to main memory: 96 constant registers are used instead (video memory can be filled pre-runtime by the CPU)
- 21 instructions available, most of which operate on all 4 input components
- Vertex programs have an instruction limit of 128
[Thompson, MICRO 2002]

Programming Framework
- C++ framework for general-purpose programs using vector operations, implemented through OpenGL (GPU API)
- Abstracts the GPU functionality using C++ data types operated on by GPU assembly programs
- DVector: vector class, allocated a buffer of video memory
- DProgram: contains a GPU assembly program written as an array of strings
- DFunction: contains a DProgram, input/output DVectors, and bindings for constant registers; executes the DProgram and converts vectors to quad-value format, reducing CPU usage in scalar computation by using quad-floats
- DSemaphore: object to stall the CPU while waiting for GPU results
[Thompson, MICRO 2002]

2002 Results
[Figure]
[Thompson, MICRO 2002]

2002 Results
- Arithmetic vector operation: GPU ~6.4 times faster for large vectors
- CPU run time doubles with doubled program complexity; GPU run time only triples with 12x program size
- Matrix multiplication: GPU ~3.2 times faster
- Boolean SAT: GPU ~2 times faster at large input sizes
- Demonstrated that GPUs can be used for general-purpose computation and yield significant speedup for applications with DLP
[Thompson, MICRO 2002]

Moving Forward – Stream Processing
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH), 2004.
- In 2004, GPGPUs were becoming a legitimate tool, but there was no universal tool for programming a GPU to be used this way
- Brook was an attempt by Stanford to create and share a stream programming model for GPGPU computing
- Extension of the C language to include data-parallel constructs (streams and kernels), allowing for SIMD-type operation
- Stream: collection of data that can be operated on in parallel
- Kernel: special function built to operate upon streams, called with input and output stream(s)
- Brook compiler maps this language to existing GPU APIs (specifically DirectX and OpenGL)

Brook Goals & Methods
- Purpose: extend C to include data-parallel constructs for using a GPU as a stream processor
- Uses streams (collections of data) and kernels (functions that operate on streams) to express the DLP native to various applications
- Improves arithmetic intensity by containing program computation within kernels
- Implementation is free of explicit graphics constructs and can therefore be used on any architecture and API (NVIDIA/ATI and DirectX/OpenGL)
- Abstracts GPU computing to a higher level, removing the need for knowledge of DirectX or OpenGL
[Buck, SIGGRAPH, 2004]

Brook Implementation
- Maps kernels to Cg shaders; streams are represented as floating-point textures
- Brook Runtime (BRT) library allows input/output streams to be rendered to a display
- Streams can be mapped to multiple textures to allow sizes larger than those available on the GPU architecture (2048x2048 or 4096x4096)
- Uses the fragment processor to execute kernels over the streams held in textures: non-stream arguments are passed via constant registers, the shader compiler generates GPU assembly, and the process is mapped to fragment shaders
[Buck, SIGGRAPH, 2004]

Brook Performance Results (2004)
- Compares optimized reference implementations to Brook DirectX and OpenGL variants
- Normalized by CPU performance (black)
[Buck, SIGGRAPH, 2004]

Brook Results (cont)
- SAXPY: vector scaling and addition (y = ax + y)
- SGEMV: matrix-vector product with scaled vector addition (y = nAx + my)
- Segment: nonlinear diffusion-based region-growing algorithm, primarily used in medical image processing
- FFT: fast Fourier transform, used in graphical post-processing
- GPGPU performance increases with limited data reuse (SAXPY vs FFT) and increased arithmetic intensity (Segment vs SGEMV)
- Brook implementations within 80% of hand-coded (optimized) versions
- Important factor: read/write bandwidth; NVIDIA performed worse than ATI because of this difference (1.2 Gfloats/s vs 4.5 Gfloats/s)
- Made available as open source code for GPGPU software developers
[Buck, SIGGRAPH, 2004]

Moving Forward – CUDA (2007)
- CUDA: Compute Unified Device Architecture
- Parallel computing architecture from NVIDIA, released in 2007 and compatible with the GeForce 8 series and beyond (2006+)
- As GPGPUs became popular, there was a need for a universal tool to access the virtual instruction set and parallel architecture of commercial GPUs
- CUDA provides an API for software developers to use with a public SDK
- Modern GPU: GTX 690 has 3072 CUDA cores, 4096 MB of device memory, 6.9 billion transistors

Modern GPGPU Uses
- Arithmetic: matrix and vector operations
- Modeling molecular dynamics (protein folding, etc.)
- FFT (signal processing, graphical post-processing)
- Physics simulations and engines (e.g., modern games)
- Speech recognition
- Medical imaging

- Instruction Set Simulator
- Parity-Check Decoding

ISA Simulator
S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L. Benini: “Scalable instruction set simulator for thousand-core architectures running on GPGPUs”, Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, June/July 2010.
- Improve current standards of processor simulation by exploiting the parallelism available in GPGPUs
- Accurate sequential simulators already exist (Cotson, m5, mpi-sim); it is much harder to efficiently simulate more complex environments
- Two fields: high-performance (x86) and embedded (ARM)
- CUDA threads simulate one or more cores; global memory provides a context structure and control logic for each simulated CPU
- Written in C++ and CUDA; simulates both instruction sets on an NVIDIA GTX 295 with an Intel i7 running Linux
- GTX 295: 2 GTX200 GPUs with 30 Streaming Multiprocessors (SMs) including 240 stream processors, 938 MB VRAM

Instruction Set Simulator - ARM
- Supports all non-Thumb ARM instructions
- Functional blocks for Fetch, Decode, and Execute placed on the CUDA model
- Texture memory used to hold LUTs of instructions (working like a cache): 16 KB available
- SPMD simulation allows CUDA threads to run concurrently; MIMD task-based applications sometimes become serialized if branches are data-dependent
- 16 GP registers, plus status & auxiliary registers
- Large matrix holds the execution context for each processor
[Raghav, HPCS, 2010]

Instruction Set Simulator - x86
- Simulator must support the Intel IA-32 ISA
- Context is held in 8 32-bit GP registers, 6 segment registers, and various control registers
- CISC architecture with complex decoding logic leads to some serialization: threads may branch to different functions/kernels depending on the parsed operation
- Task-based parallel applications incur a performance hit when branches are data-dependent
- CUDA concurrency is compromised by variable-length instructions
[Raghav, HPCS, 2010]

ISS Testing & Results
- Best Case (BC): application has SIMD DLP; the same kernel runs on different data subsets and all cores fetch the same instructions
- Worst Case (WC): application has task-level parallelism (MIMD); cores may operate on different data sets and diverge in instruction retrieval due to data-dependent branches
- Single Kernel (SK): entire ISS runs in one CUDA kernel; components are simulated in successive steps of one function
- Multiple Kernels (MK): system components are modeled in separate CUDA kernels; requires many memory transfers of device state when kernels swap & launch
- ARM performance depends on kernel swaps (SK vs MK)
- x86 performance depends on application type (BC vs WC)
[Raghav, HPCS, 2010]

Simulation Results
[Figure: panels labeled MK, SK, WC, BC]
[Raghav, HPCS, 2010]

Simulation Results
[Figure: panels labeled MK, SK, WC, BC]
[Raghav, HPCS, 2010]

ISS Testing – Real Workloads
- Test the ISS using real workloads to see if the theoretical speedup is achievable with real applications
- Matrix Multiplication, IDCT, FFT
- Use a parallelization scheme like OpenMP to distribute the workload
- Static loop parallelization: an identical number of consecutive iterations is assigned to each parallel thread
- Processor ID determines which dataset to use (HW2/3 scheme B)
- Stack-allocated variables determine the lower and upper bounds of functional loops
- Simulation speedup: speedup relative to serial simulation of a varying number of cores
- Application speedup: speedup of parallel simulation over a single simulated core
[Raghav, HPCS, 2010]

Speedup Results
- Takeaway: architecture is scalable to and beyond 1000 cores
- ~500-1000x speedup for best-case scenarios (near the ideal 1024)
[Raghav, HPCS, 2010]

Parallel Nonbinary LDPC Decoding
G. Wang, H. Shen, B. Yin, M. Wu, Y. Sun, J. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.
- Low-Density Parity-Check (LDPC) codes are error-correcting codes over a Galois (or finite) field
- Finite field: commutative ring in abstract algebra containing a multiplicative inverse for every non-zero element
- Current implementations of LDPC decoding algorithms have poor flexibility & scalability
- Complexity of LDPC decoding algorithms increases greatly going from binary to nonbinary codes (q > 2 for GF(q))
- Goal: create a highly parallel and flexible decoder supporting different code types, variable code lengths, and the ability to run on various devices
- Use OpenCL to employ a SIMT model that exploits LDPC decoding's inherent DLP

LDPC Decoding – Nonbinary LDPC
- Parity check matrix H (sparse q-ary MxN matrix) with elements defined in a Galois field GF(q)
- Can be represented by a Tanner graph: each row of H → check node, each column of H → variable node
- M(n) is the set of check nodes for variable node n
- N(m) is the set of variable nodes for check node m
- Row weight of a check node = dc
- Belief Propagation (BP) is one of the best decoding algorithms; this implementation uses the Min-Max approximation algorithm to exploit DLP
[Wang, ASILOMAR, 2012]

Implementation – Complexity Analysis
- Computation kernels of nonbinary (q > 2) LDPC become more complex for check node processing (O(dc*q^2) vs O(dc))
- Check node processing (CNP) and variable node processing (VNP) take up 91.64% and 6.43% of serial runtime, respectively
[Wang, ASILOMAR, 2012]

Implementation – Algorithm Mapping
- Develop the work flow of the decoding process
- All computation is done on the GPU to keep intermediate messages in device memory
- Use 5 OpenCL kernels to exploit DLP and distribute work effectively
- Work items (q) become CUDA threads, work groups (M) become CUDA thread blocks: all have the same computation path and memory access patterns
[Wang, ASILOMAR, 2012]

Implementation – Nonbinary Arithmetic & Efficient Data Structures
- Addition and subtraction of nonbinary elements are achieved through XOR operations
- Use LUTs expq & logq for multiplication & division:
  a*b = expq[(logq[a]+logq[b]+q-1)%(q-1)]
  a/b = expq[(logq[a]-logq[b]+q-1)%(q-1)]
- expq and logq are used frequently → keep in local memory
- Compress H horizontally & vertically to create a more efficient structure
[Wang, ASILOMAR, 2012]

Implementation – Accelerating Forward-Backward Algorithm in CNP
- Original algorithm has O(q^dc) complexity; the revised algorithm has O(dc*q^2)
- Forward message vector Fi(a) is stored in local memory and updated by q work items in parallel for each stage (i)
- Use a barrier function after each stage for synchronization
- Requires 2*sizeof(cl_float)*q*dc bytes of local memory: 1.5 KB for the (3,6)-regular GF(32) code used in this implementation
[Wang, ASILOMAR, 2012]

Implementation – Coalescing Global Memory Access
- Rm,n(a) and Qm,n(a) are complex 3D structures located in global memory
- Arrange in [N,q,M] format rather than [M,N,q] so that q work items always access data stored contiguously
- Enables coalesced memory access → ~4-5x speedup
[Wang, ASILOMAR, 2012]

Nonbinary LDPC Decoding - Results
- Run on 2 CPUs and 1 GPU:
  - Intel i7-640LM (dual core, 2.93 GHz)
  - AMD Phenom II X4-940 (quad core, 2.9 GHz)
  - NVIDIA GTX470 (448 stream processors, 1.215 GHz, 1280 MB device memory)
- 2.47x speedup for OpenCL over serial C on the Intel i7
- 6.67x speedup for OpenCL over serial C on the AMD Phenom II
- GPU has 69.92x speedup over the Intel i7 and 33.46x over the AMD Phenom II
- Worst-case speedups: 38.48x and 18.41x
[Wang, ASILOMAR, 2012]

Nonbinary LDPC Decoding - Results
- GPU algorithm had 693.5 Kbps throughput, and 1260 Kbps throughput with early termination
- Nonbinary decoders have 2q^2 to 3q^2 times the complexity of binary decoders
- With q=32 in the samples run, this results in a 2000-3000x increase in complexity
- Due to massive parallelization in the decoding algorithm and the GPU, the gap between binary and nonbinary implementations is reduced to 50x
- This type of LDPC decoding (with short codewords and high GF(q) values) is most common in LDPC research & application, although better speedups & throughput values are found in this implementation with longer codewords & lower GF(q) values
[Wang, ASILOMAR, 2012]

Citations
- Chris J. Thompson, Sahngyun Hahn, Mark Oskin: “Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis.” International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002.
- Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan: “Brook for GPUs: Stream Computing on Graphics Hardware.” Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH), Los Angeles, Aug. 2004.
- Shivani Raghav, Martino Ruggiero, David Atienza, Christian Pinto, Andrea Marongiu, Luca Benini: “Scalable Instruction Set Simulator for Thousand-core Architectures Running on GPGPUs.” Proceedings of High Performance Computing and Simulation (HPCS), pp. 459-466, France, June/July 2010.
- Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, Joseph R. Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU.” 46th Asilomar Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.
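
Supplementary sketch for the nonbinary arithmetic slide: one way the expq/logq lookup tables could be built and used, written in Python rather than OpenCL for brevity. It assumes q = 2^p and uses the primitive polynomial x^5 + x^2 + 1 for GF(32); the polynomial choice and the helper names (build_tables, gf_add, gf_mul, gf_div) are illustrative assumptions, not taken from [Wang, ASILOMAR, 2012]. Division is the inverse of multiplication, so it subtracts the logarithms instead of adding them.

```python
def build_tables(p=5, prim_poly=0b100101):
    """Build exp/log tables for GF(2^p).

    prim_poly encodes x^5 + x^2 + 1 by default (an assumed choice for GF(32)).
    expq[i] is alpha^i; logq[x] is the exponent i with alpha^i == x.
    """
    q = 1 << p
    expq = [0] * (q - 1)
    logq = [0] * q          # logq[0] is unused: zero has no logarithm
    x = 1
    for i in range(q - 1):
        expq[i] = x
        logq[x] = i
        x <<= 1             # multiply by alpha
        if x & q:           # degree reached p: reduce by the polynomial
            x ^= prim_poly
    return q, expq, logq


def gf_add(a, b):
    """Addition and subtraction in GF(2^p) are both bitwise XOR."""
    return a ^ b


def gf_mul(a, b, q, expq, logq):
    """a*b via table lookup; zero is handled separately."""
    if a == 0 or b == 0:
        return 0
    return expq[(logq[a] + logq[b]) % (q - 1)]


def gf_div(a, b, q, expq, logq):
    """a/b via table lookup; the + q - 1 keeps the index non-negative."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(q)")
    if a == 0:
        return 0
    return expq[(logq[a] - logq[b] + q - 1) % (q - 1)]


if __name__ == "__main__":
    q, expq, logq = build_tables()
    # alpha^4 * alpha = alpha^5 = x^2 + 1, i.e. the integer 5
    print(gf_mul(16, 2, q, expq, logq))
    # dividing the product by one factor recovers the other
    print(gf_div(gf_mul(7, 19, q, expq, logq), 19, q, expq, logq))
```

In a GPU kernel the same two tables would sit in local memory, as the slide suggests, so each multiply or divide costs two table reads, one modular add, and one more table read.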