GP GPU Applications and Simulations Mike Metzger MS - ECE

advertisement
GP GPU Applications
and Simulations
Mike Metzger
mmetzg3@uic.edu
MS - ECE
GPU History






Term coined in 1999 by NVIDIA with the release of the
GeForce 256
Real “first” GPU was the 1985 Commodore Amiga: graphics
coprocessor with a primitive instruction set
Modern day: GPUs are commercially available at low cost
and prolific in computing systems
Primarily used (obviously) for displaying and rapidly altering
video data which is inherently data-parallel
Architecture has become increasingly parallel over the last
decade
GPUs are structured as SIMD machines to exploit high
levels of DLP
GPU Capabilities

Large matrix/vector operations

Protein folding (molecular dynamics) modeling

FFT (signal processing)

Physics simulations

Sequence matching

Speech recognition

Database manipulation

Sort/search algorthims

Medical imaging
GPGPU Origins






C. Thompson, S. Hahn, M. Oskin: “Using Modern Graphics Architectures
for General-Purpose Computing: A Framework and Analysis.”
International Symposium on Microarchitecture (MICRO), Turkey, Nov.
2002
Use the parallel architecture of GPUs to exploit data-level
parallelism in common processes
Wrote framework for testing various data-heavy operations in C++,
implemented through OpenGL programming interface
Run tests on Arithmetic, exponential, factorial, and multiplicative
operation on large (10k-10 million member) vectors
Uses NVIDIA's GeForce 4 with 128 MB of VRAM (18 specialized cores),
compare results to 1.5 GHz Pentium IV with 1 GB of RAM
No modifications to GPU or CPU hardware, test programs compiled with
Microsoft Visual C++ 6
GeForce4 Architecture





Provides ISA for vertex programming –
registers hold and process quad-valued
FP numbers
Input and output attribute registers hold
various graphical data
No access to main memory – 96 constant
registers used instead (video memory can
be filled pre-runtime by CPU)
21 instructions available – mostly operate
on all 4 input components
Vertex programs have an instruction limit
of 128
[Thompson, MICRO 2002]
Programming Framework






C++ framework for general-purpose programs using
vector operations implemented through OpenGL
(GPU API)
Abstract the GPU functionality using C++ data types
operated on by GPU assembly programs
DVector: vector class, allocated a buffer of video
memory
DProgram: contains GPU assembly program written
via array of strings
DFunction: contains a DProgram, input/output
DVectors, bindings for constant registers ; executes
the Dprogram and converts vectors to quad-value
format, reduce CPU usage in scalar computation
using quad-floats
Dsemaphore: object to stall CPU when waiting for
GPU results
[Thompson, MICRO 2002]
2002 results
[Thompson, MICRO 2002]
2002 Results


Arithmetic vector operation – GPU ~6.4 times faster for
large vectors
CPU run time doubles with doubled program complexity,
GPU only triples with 12x program size

Matrix multiplication: GPU ~3.2 times faster

Boolean SAT: GPU ~2 times faster at large input sizes

Proved that GPUs can be used for general-purpose
computation and will result in significant speedup for
applications with DLP
[Thompson, MICRO 2002]
Moving Forward – Stream Processing







I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P.
Hanrahan: “Brook for GPUs: Stream Computing on Graphics
Hardware.” Special Interest Group on Graphics and Interactive
Techniques (SIGGRAPH), 2004.
In 2004, GPGPUs were becoming a legitimate tool, but there was no
universal tool for programming a GPU to be used this way
Brook was an attempt by Stanford to create and share a stream
programming model for GPGPU computing
Extension of C language to include data-parallel constructs: streams
and kernels, allowing for SIMD-type operation
Stream: collection of data that can be opened in parallel
Kernel: special function built to operate upon streams, called with input
and output stream(s)
Brook compiler maps this language to existing GPU APIs (specifically
DirectX and OpenGL)
Brook Goals & Methods





Purpose: extend C to include data-parallel constructs for using a
GPU as a stream processor
Uses streams (collection of data) and kernels (functions that operate
on streams) to express DLP native to various applications
Improves arithmetic intensity by containing program computation
within kernels
Implementation is void of explicit graphics constructs and thus
capable of being used on any architecture and API (NVIDIA/ATI and
DirectX/OpenGL)
Abstracts the GPU computing a higher level to remove need for
knowledge of DirectX or OpenGL
[Buck, SIGGRAPH, 2004]
Brook Implementation




Map kernels to Cg shaders, streams represented as floating point
textures
Brook Runtime (BRT) library allows for input/output streams to be
rendered to a display
Streams can be mapped to multiple textures to allow larger sizes
than available on GPU architecture (2048x2048 or 4096x4096)
Use fragment processor to execute kernels over the streams present
in textures: non-stream arguments passed via constant registers,
apply shader compiler to create GPU assembly, map process to
fragment shaders
[Buck, SIGGRAPH, 2004]
Brook Performance Results (2004)


Compares optimized reference implementation to Brook DirectX and
OpenGL variants
Normalized by CPU performance (black)
[Buck, SIGGRAPH, 2004]
Brook Results (cont)

SAXPY: vector scaling and addition (y = ax + y)

SGEMV: matrix-vector product, scaled vector addition (y = nAx + my)






Segment: nonlinear diffusion-based region-growing algorithm, primarily
used in medical image processing
FFT: fast Fourier transform, used in graphical post-processing
GPGPU performance increases with limited data reuse (SAXPY vs
FFT) and increased arithmetic intensity (Segment vs SGEMV)
Brook implementations within 80% of hand-coded (optimized) versions
Important factor: read/write bandwidth – resulted in NVIDIA performing
worse than ATI due to this difference (1.2 Gfloats/s vs 4.5 Gfloats/s)
Made available as open source code for GPGPU software developers
[Buck, SIGGRAPH, 2004]
Moving forward – CUDA (2007)





CUDA: Compute Unified Device Architecture
Parallel computing architecture from NVIDIA, released in
2007 and compatible with GeForce 8 series and beyond
(2006+)
As GPGPUs became popular, there was a need for a
universal tool to access the virtual instruction set and
parallel architecture of commercial GPUs
CUDA provides an API for software developers to use
with a public SDK
Modern GPU: GTX 690 has 3072 CUDA cores, 4096 MB
of device memory, 6.9 billion transistors
Modern GPGPU uses

Arithmetic: matrix and vector operations

Modeling molecular dynamics (protein folding, etc)

FFT (signal processing, graphical post-processing)

Physics simulations and engines (ex: modern games)

Speech recognition

Medical imaging

Instruction Set Simulator

Parity-Check Decoding
ISA Simulator







S. Raghav, M. Ruggiero, D. Atienza, C. Pinto, A. Marongiu and L.
Benini: “Scalable instruction set simulator for thousand-core
architectures running on GPGPUs”, Proceedings of High
Performance Computing and Simulation (HPCS), pp.459-466,
June/July 2010.
Improve current standards of processor simulation by exploiting
parallelism available in GPGPUs
Accurate sequential simulators already exist (Cotson, m5, mpi-sim),
much harder to efficiently simulate more complex environments
Two fields: high-performance (x86) and embedded (ARM)
CUDA threads simulate one or more cores, global memory provides a
context structure and control logic for each simulated CPU
Written in C++ and CUDA, simulates both instruction sets on NVIDIA
GTX 295 with Intel i7 running Linux
GTX 295: 2 GTX200 GPUs with 30 Streaming Multiprocessors (SM)
including 240 stream processors, 938 MB VRAM
Instruction Set Simulator - ARM






Supports all non-Thumb ARM
instructions
Functional blocks for Fetch, Decode, and
Execute placed on CUDA model
Texture memory used to hold LUTs of
instructions (working like a cache) –
16KB available
SPMD simulation allows CUDA threads
to run concurrently, MIMD task-based
applications sometimes become
serialized if branches are datadependent
16 GP registers, status & auxiliary
registers
Large matrix holds execution context for
each processor
[Raghav, HPCS, 2010]
Instruction Set Simulator - x86





Simulator must support Intel IA-32 ISA
Context is held in 8 32-bit GP registers,
6 segment regs and various control
registers
CISC architecture with complex
decoding logic leads to some
serialization: threads may branch to
different functions/kernels depending on
the parsed operation
Task-based parallel applications incur
performance hit when branches are data
dependent
CUDA concurrency is compromised by
variable length instructions
[Raghav, HPCS, 2010]
ISS Testing & Results




Best Case (BC): application has SIMD DLP, same kernel is running
on different data subsets, all cores fetch same instructions
Worst Case (WC): application has task-level parallelism (MIMD),
cores may operate on different data sets, cores diverge in instruction
retrieval due to data dependent branches
Single Kernel (SK): entire ISS run in one CUDA kernel, components
simulated in successive steps of one function
Multiple Kernels (MK): system components modeled in separate
CUDA kernels, requires many memory tranfers of device state when
kernels swap & launch

ARM performance dependent upon kernel swaps (SK vs MK)

X86 performance dependent upon application type (BC vs WC)
[Raghav, HPCS, 2010]
MK
Simulation Results
SK
WC
[Raghav, HPCS, 2010]
BC
MK
Simulation Results
SK
WC
[Raghav, HPCS, 2010]
BC
ISS Testing – Real Workloads

Test the ISS using real workloads to see if theoretical speedup is
possible with real applications

Matrix Multiplication, IDCT, FFT

Use parallelization scheme like OpenMP to distribute workload





Static loop parallelization: identical # of consecutive iterations are
assigned to parallel threads
Processor ID determines which dataset to use (HW2/3 scheme B)
Stack-allocated variables determine lower and upper bounds of
functional loops
Simulation speedup – speedup relative to serial simulation of varying
number of cores
Application speedup – speedup of parallel simulation over a single
simulated core
[Raghav, HPCS, 2010]
Speedup Results

Takeaway: architecture is scalable to and beyond 1000 cores

~500-1000x speedup for best case scenarios (near ideal 1024)
[Raghav, HPCS, 2010]
Parallel Nonbinary LDPC Decoding







G. Wang, H. Shen, B. Yin, M. Wu, Y. Sun, J. Cavallaro: “Parallel
Nonbinary LDPC Decoding on GPU”, 46th Asilomar Conference on
Signals, Systems, and Computers (ASILOMAR), Nov. 4-7, 2012.
Low-Density Parity-Check Codes (LDPC) are error-correcting codes
over a Galois (or finite) field
Finite Field: commutative ring in abstract algebra containing
multiplicative inverse for every non-zero element
Current implementations of LDPC decoding algorithms have poor
flexibility & scalability
Complexity of LDPC decoding algorithms increases greatly going
from binary to nonbinary codes (with q>2 for GF(q))
Goal: create a highly parallel and flexible decoder supporting different
code types, variable code lengths, and the ability to run of various
devices
Use OpenCL to employ a SIMT model to exploit LDPC decoding's
inherent DLP
LDPC Decoding – Nonbinary LDPC





Parity check matrix H (spare q-ary MxN matrix) with elements defined
in a Galois field GF(q)
Can be represented by a Tanner graph: each row of H → check node,
each column of H → variable node
M(n) is the set of check nodes for variable node n
N(m) is the set of variable nodes for check node m
Row weight of a check node = dc
Belief Propagation (BP)
algorithm is one of the
best decoding
algorithms, this
implementation uses the
Min-Max approximation
algorithm to exploit DLP
[Wang, ASILOMAR, 2012]
Implementation – Complexity Analysis


Computation kernels of
nonbinary (q>2) LDPC becomes
more complex for check node
processing (O(dc*q2) vs O(dc))
CNP and VNP take up 91.64%
and 6.43% of serial runtime
respectively
[Wang, ASILOMAR, 2012]
Implementation – Algorithm Mapping




Develop work flow of decoding process
Computation is all done on GPU to keep
intermediate messages in device
memory
Use 5 OpenCL kernels to exploit DLP,
distribute effectively
Work items (q) become CUDA threads,
work groups (M) become CUDA thread
blocks: all have the same computation
path and memory access patterns
[Wang, ASILOMAR, 2012]
Implementation – Nonbinary Arithmetic &
Efficient Data Structures


Addition and subtraction of nonbinary elements achieved through
XOR operations
Use LUT with expq & logq for multiplication & division:
a*b = expq[(logq[a]+logq[b]+q-1)%(q-1)]
a/b = expq[(logq[a]+logq[b]+q-1)%(q-1)]

Expq and logq are used frequently → keep in local memory

Compress H horizontally & vertically to create more efficient structure
[Wang, ASILOMAR, 2012]
Implementation – Accelerating ForwardBackward Algorithm in CNP


Original algorithm shown has O(qdc), revised has O(dc*q2)
Forwarded messages vector Fi(a) stored in local memory, updated by
q work items in parallel for each stage (i)

Use a barrier function after each stage for synchronization

Requires 2*sizeof(cl_float)*q*dc of local memory

1.5KB for (3,6)-regular GF(32)
(used in this implementation)
[Wang, ASILOMAR, 2012]
Implementation – Coalescing Global
Memory Access



Rm,n(a) and Qm,n(a) are complex 3D structures located in global
memory
Arrange in [N,q,M] format rather than [M,N,q] so that q work items
always access data stored contiguously
Enables coalesced memory access → ~4-5x speedup
[Wang, ASILOMAR, 2012]
Nonbinary LDPC Decoding - Results



Run on 2 CPUs and 1 GPU:
Intel i7-640LM (dual core, 2.93 GHz)
AMD Phenom II X9-940 (quad core, 2.9 GHz)
NVIDIA GTX470 (448 stream processors, 1.215 GHz, 1280MB
device memory)
2.47 speedup for OpenCL over serial C on Intel i7
6.67 speedup for OpenCL over serial C on AMD Phenom II
GPU has 69.92 speedup over Intel i7 and 33.46 over AMD Phenom II
Worse case speedups: 38.48 and 18.41
[Wang, ASILOMAR, 2012]
Nonbinary LDPC Decoding - Results





GPU algorithm had 693.5 Kbps throughput and 1260 Kbps
throughput with early termination
Nonbinary decoders have complexity of 2q2 ~ 3q2 higher than
binary decoders
With q=32 in the samples run, this results in a 2000~3000x increase
in complexity
Due to massive parallelization in the decoding algorithm and the
GPU, the gap between binary and nonbinary implementation is
reduced to 50x
This type of LDPC decoding (with short codewords and high GF(q)
values) is most common in LDPC research & application, although
better speedups & throughput values are found in this implementation
with longer codewords & lower GF(q) values
[Wang, ASILOMAR, 2012]
Citations




Chris J. Thompson, Sahngyun Hahn, Mark Oskin: “Using Modern Graphics
Architectures for General-Purpose Computing: A Framework and Analysis.”
International Symposium on Microarchitecture (MICRO), Turkey, Nov. 2002.
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian,
Mike Houston, Pat Hanrahan: “Brook for GPUs: Stream Computing on
Graphics Hardware.” Special Interest Group for Graphics and Interactive
Techniques (SIGGRAPH), Los Angeles, Aug. 2004.
Shivani Raghav, Martino Ruggiero, David Atienza, Christian Pinto, Andrea
Marongiu, Luca Benini: “Scalable Instruction Set Simulator for Thousandcore Architectures Running on GPGPUs.” Proceedings of High Performance
Computing and Simulation (HPCS), pp. 459-466, France, June/July 2010.
Guohui Wang, Hao Shen, Bei Yin, Michael Wu, Yang Sun, Joseph R.
Cavallaro: “Parallel Nonbinary LDPC Decoding on GPU.” 46th Asilomar
Conference on Signals, Systems, and Computers (ASILOMAR), Nov. 4-7,
2012.
Download