
Unit-4

INTRODUCTION TO MASSIVELY PARALLEL COMPUTING
TWO MAIN TRAJECTORIES

• Since 2003, the semiconductor industry has followed two main trajectories:
• Multicore: seeks to maintain the execution speed of sequential programs; reduces latency.
• Many-core: improves the execution throughput of parallel applications. Each heavily multithreaded core is much smaller, and cores share control logic and instruction cache.
CPUS AND GPUS HAVE FUNDAMENTALLY DIFFERENT DESIGN PHILOSOPHIES

[Figure: a CPU die is dominated by control logic and a large cache beside a few ALUs; a GPU die is dominated by many small ALUs with minimal control and cache. Both attach to DRAM.]
MULTICORE CPU

• Optimized for sequential programs: sophisticated control logic allows instructions from a single thread to execute faster, and a large on-chip cache turns long-latency memory accesses into cache accesses, so the execution latency of each thread is reduced. However, the large cache memory (multiple megabytes), low-latency arithmetic units, and sophisticated operand delivery logic consume chip area and power.
• Latency-oriented design.
MULTICORE CPU

• Many applications are limited by the speed at which data can be moved from memory to processor.
• The CPU has to satisfy requirements from legacy operating systems and I/O devices, which makes it harder to increase memory bandwidth; CPU memory bandwidth is usually about 1/6 that of a GPU.
MANY-CORE GPU

• Shaped by the fast-growing video game industry, which expects a massive number of floating-point calculations per video frame.
• The motivation is to maximize the chip area and power budget dedicated to floating-point calculations. The solution is to optimize for the execution throughput of a massive number of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. The area and power saved on memory and arithmetic allow designers to put more cores on a chip, increasing execution throughput.
MANY-CORE GPU

• A large number of threads allows the hardware to find work to do while some threads are waiting for long-latency memory accesses or arithmetic operations. Small caches are provided to help control bandwidth requirements, so multiple threads that access the same memory do not all need to go to DRAM.
• Throughput-oriented design: strives to maximize the total execution throughput of a large number of threads, while allowing individual threads to take a potentially much longer time to execute.
CPU + GPU

• A GPU will not perform well on tasks on which CPUs are designed to perform well. For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs.
• When a program has a large number of threads, GPUs with higher execution throughput can achieve much higher performance than CPUs. Many applications therefore use both, executing the sequential parts on the CPU and the numerically intensive parts on the GPU (see the sketch below).
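
To make this division of labor concrete, here is a minimal CUDA sketch (not from the original slides; the kernel name vecAdd and all buffer names are illustrative): the CPU does the sequential setup and data transfers, and the GPU runs one thread per element for the numerically intensive part.

// Host (CPU) handles sequential work and transfers; device (GPU) does the
// data-parallel computation.
#include <cstdio>
#include <cuda_runtime.h>

// Device kernel: each thread handles one element of the vectors.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // bounds check: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Sequential part on the CPU: allocate and initialize host data.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Programmer-managed transfers between CPU and GPU memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Numerically intensive part on the GPU: one thread per element.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);   // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}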
GPU ADOPTION

• The processors of choice must have a very large presence in the marketplace.
• 400 million CUDA-enabled GPUs in use to date.
• Practical form factors and easy accessibility.
• Until 2006, parallel programs were run in data centers or on clusters. Actual clinical applications on MRI machines are based on a PC and special hardware accelerators; GE and Siemens cannot sell racks into clinical settings. NIH refused to fund parallel programming projects. Today, NIH funds research using GPUs.
WHY MASSIVELY PARALLEL PROCESSORS

• A quiet revolution and potential build-up
• Calculation: 367 GFLOPS vs. 32 GFLOPS
• Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
• Until recently, programmed only through graphics APIs
• GPU in every PC and workstation – massive volume and potential impact
ARCHITECTURE OF A CUDA-CAPABLE GPU

• Two streaming multiprocessors form a building block; each has a number of streaming processors that share control logic and an instruction cache.
• Each GPU comes with multiple gigabytes of DRAM (global memory). It offers high bandwidth off-chip, though with longer latency than typical system memory. The high bandwidth makes up for the longer latency in massively parallel applications.
• G80: 86.4 GB/s of memory bandwidth, plus 8 GB/s of communication bandwidth with the CPU (4 GB/s up and 4 GB/s down).

[Figure: the host feeds an input assembler and thread execution manager; arrays of streaming processors with parallel data caches and texture units connect through load/store paths to global memory.]
GT200 CHARACTERISTICS

• A good application runs 5,000 to 12,000 threads; CPUs support 2 to 8 threads.
• 1 TFLOPS peak performance (25-50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for applications such as VMD
• Massively parallel, 128 cores, 90W
• Massively threaded, sustains 1000s of threads per application
• 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
FUTURE APPS REFLECT A CONCURRENT WORLD

• Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications"
• Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
• These "super-apps" represent and model a physical, concurrent world
• Various granularities of parallelism exist, but…
• the programming model must not hinder parallel implementation
• data delivery needs careful management
STRETCHING TRADITIONAL ARCHITECTURES

• Traditional parallel architectures cover some super-applications
• DSP, GPU, network apps, scientific computing
• The game is to grow mainstream architectures "out" or domain-specific architectures "in"
• CUDA is the latter

[Figure: traditional applications fall within current architecture coverage; new applications lie between that coverage and domain-specific architecture coverage, separated by obstacles.]
TEXTURE MAPPING EXAMPLE

Texture mapping example: painting a world map texture image onto a globe object.
SOFTWARE EVOLUTION

• MPI: scales up to 100,000 nodes.
• CUDA: shared memory for parallel execution. Programmers manage the data transfer between CPU and GPU and the detailed parallel code constructs.
• OpenMP: shared memory. Not able to scale beyond a couple of hundred cores due to thread management overhead and cache coherence. Compilers do most of the automation in managing parallel execution.
• OpenCL (2009): Apple, Intel, AMD/ATI, and NVIDIA proposed a standard programming model. Defines language extensions and a run-time API. An application developed in OpenCL can run on any processor that supports the OpenCL language extensions and API without code modification.
• OpenACC (2011): compiler directives specify loops and regions of code to offload from CPU to GPU. More like OpenMP.
SPEEDUP OF APPLICATIONS

[Chart: GPU kernel and whole-application speedups relative to CPU for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; kernel speedups reach 79×, 210×, 263×, 316×, 431×, and 457× in the largest cases.]
• GeForce 8800 GTX vs. 2.2 GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized
• "Need for Speed" seminar series organized by Patel and Hwu this semester.
HISTORY OF GPU COMPUTING

GRAPHICS PIPELINE

• GPUs evolved as hardware and software algorithms evolved
• Pipeline stages: scene transformations → lighting & shading → viewing transformations → rasterization
EARLY GRAPHICS

• Originally, no specialized graphics hardware
• All processing in software on the CPU
• Results transmitted to the frame buffer
 first, external frame buffers
 later, internal frame buffers

[Diagram: CPU → frame buffer → display]
MORE DETAILED PIPELINE

• Simple functionality transferred to specialized hardware.
• Stages: geometry data → transform & lighting → culling, perspective divide, viewport mapping → rasterization → simple texturing → depth test → frame buffer blending
ADD MORE FUNCTIONALITY TO GPU

• Simple functionality transferred to specialized hardware
• Stages: geometry data → transform & lighting → culling, perspective divide, viewport mapping → rasterization → simple texturing → depth test → frame buffer blending
FIXED FUNCTION GPU PIPELINE

• Pipeline implemented in hardware
• Each stage does a fixed task
• Tasks are parameterized
• Inflexible – fixed, parameterized functions
• Vector-matrix operations (some parallelism)

[Diagram: the CPU performs scene transformations; the GPU performs lighting & shading, viewing transformations, and rasterization into the frame buffer for display.]
TECHNOLOGY ADVANCES

• Hardware gets cheaper, smaller, and more powerful
• Parallel architectures develop
• Graphics processing gets more sophisticated (environment mapping, displacement mapping, sub-surface scattering)
• Need more flexibility in GPUs.
MAKE THIS PROGRAMMABLE

[Diagram: in the pipeline, the transform & lighting stage becomes a programmable Vertex Shader, and the per-fragment stages become a programmable Fragment Shader (complex texturing; depth, alpha, and stencil tests; frame buffer blending), with culling, perspective divide, viewport mapping, and rasterization between them.]
VERTEX SHADER

• Graphics systems: convert everything to triangles
• Pass vertices, normals, colors, and texture coordinates to the GPU processor
• GPU: vertex-based computations,
 independent of other vertices
• Later, assembled into triangles.
FRAGMENT SHADER

• A fragment is a triangle clipped to a pixel
 interpolate values
• Multiple textures; alpha, stencil, depth
 independent of other fragments
• Blend with contents of the frame buffer.
INTRODUCE PARALLELISM: ADD MULTIPLE UNITS

[Diagram: geometry data flows through multiple parallel Vertex Shader units; then culling, perspective divide, and viewport mapping; then rasterization; then multiple parallel Fragment Shader units; then alpha, depth, and stencil tests and frame buffer blending.]
A FIXED FUNCTION GPU PIPELINE

[Diagram: the host CPU feeds the GPU through a host interface; vertex control (with a vertex cache) feeds VS/T&L, then triangle setup, raster, shader, and ROP stages, with a texture cache and FBI (frame buffer interface) in front of frame buffer memory.]
ANTI-ALIASING EXAMPLE

[Figure: the same triangle geometry rendered aliased (jagged edges) and anti-aliased (smoothed edges).]
PROGRAMMABLE VERTEX AND PIXEL PROCESSORS

[Diagram: a 3D application or game issues 3D API commands through OpenGL or Direct3D on the CPU; across the CPU–GPU boundary, the GPU front end consumes the command and data stream. Primitive assembly turns a vertex index stream of pre-transformed vertices into assembled polygons, lines, and points; the programmable vertex processor produces transformed vertices; rasterization and interpolation emit a pixel location stream of pre-transformed fragments; the programmable fragment processor produces transformed fragments; raster operations apply pixel updates to the framebuffer.]

An example of a separate vertex processor and fragment processor in a programmable graphics pipeline.
POWER

• GPUs have moved away from the traditional fixed-function 3D graphics pipeline toward a flexible, general-purpose computational engine.
• The raw computational power of a GPU dwarfs that of the most powerful CPU, and the gap is steadily widening.

NEXT: UNIFY SHADERS

• One set of shaders
• Allocate to either vertices or fragments
UNIFIED GRAPHICS PIPELINE

[Diagram: the host feeds a data assembler; vertex, geometry, and pixel thread issue units dispatch work through a thread processor onto an array of streaming processors (SP) grouped with texture fetch units (TF) and L1 caches, backed by L2 caches and frame buffer (FB) partitions; setup/raster/ZCull sits between geometry and pixel work.]
[Diagram: a shader's per-thread input registers feed the fragment program, which may read textures and constants (per-shader and per-context state) and use per-thread temp registers, writing only to output registers that go to the frame buffer and memory.]

The restricted input and output capabilities of a shader programming model.
PIPELINE EVOLVED
GPGPU

• Make the GPU more general – adapt certain types of programs to its pipelined, parallel architecture
• A single GeForce 8800 chip achieves a sustained 330 billion floating-point operations per second (GFLOPS) on simple benchmarks
• Cost-effective: graphics drives demand up, supply up, price down for GPUs
• Finding uses in non-graphics applications.
GEFORCE 8800 GTX
SHADERS IN DIRECT3D
• DirectX 9: Vertex Shader, Pixel Shader
• DirectX 10: Vertex Shader, Geometry Shader, Pixel Shader
• DirectX 11: Vertex Shader, Hull Shader, Domain Shader,
Geometry Shader, Pixel Shader, Compute Shader
• Observation: all of these shaders require the same basic functionality: texturing (or data loads) and math ops.
GPU ARCHITECTURES

• Processing is highly data-parallel
• GPUs are highly multithreaded
• Use thread switching to hide memory latency
• Less reliance on multi-level caches
• Graphics memory is wide and high-bandwidth

Trend toward general-purpose GPUs
• Heterogeneous CPU/GPU systems
• CPU for sequential code, GPU for parallel code

Programming languages/APIs
• DirectX, OpenGL
• C for Graphics (Cg), High Level Shader Language (HLSL)
• Compute Unified Device Architecture (CUDA)
EVOLUTION OF GPU

STREAM PROCESSING

THE SUPERCOMPUTING REVOLUTION (1)

THE SUPERCOMPUTING REVOLUTION (2)
WHAT ACCOUNTS FOR THIS DIFFERENCE?

• Need to understand how CPUs and GPUs differ
• Latency intolerance versus latency tolerance
• Task parallelism versus data parallelism
• Multi-threaded cores versus SIMT (Single Instruction Multiple Thread) cores
• 10s of threads versus 10,000s of threads
LATENCY (1)

• GPUs are designed for tasks that can tolerate latency
• Example: graphics in a game (simplified scenario):

[Timeline: the CPU generates frame 0, then frame 1, then frame 2; the GPU sits idle for one frame, then renders frame 0 while the CPU generates frame 1, and so on. The latency between frame generation and rendering is on the order of milliseconds.]

• To be efficient, GPUs must have high throughput, i.e. process millions of pixels in a single frame
LATENCY (2)

• CPUs are designed to minimize latency
• Example: mouse or keyboard input
• Caches are needed to minimize latency
• CPUs are designed to maximize running operations out of cache
• Instruction pre-fetch
• Out-of-order execution, flow control
• → CPUs need a large cache; GPUs do not
• GPUs can dedicate more of the transistor area to computation horsepower
CPU VERSUS GPU TRANSISTOR ALLOCATION

• GPUs can have more ALUs for the same sized chip and therefore run many more threads of computation

[Figure: as before, the CPU die is dominated by control logic and cache with a few ALUs, while the GPU die is dominated by ALUs. Both attach to DRAM.]
MANAGING THREADS ON A GPU

• Modern GPUs run 10,000s of threads concurrently

How do we:
• Avoid synchronization issues between so many threads?
• Dispatch, schedule, cache, and context switch 10,000s of threads?
• Program 10,000s of threads?

Design GPUs to run specific types of threads:
• Independent of each other – no synchronization issues
• SIMD (Single Instruction Multiple Data) threads – minimize thread management
• Reduce hardware overhead for scheduling, caching, etc.
• Program blocks of threads (e.g. one pixel shader per draw call, or group of pixels)

Any problems which can be solved with this type of computation?
DATA PARALLEL PROBLEMS

• Plenty of problems fall into this category (luckily!): graphics, image & video processing, physics, scientific computing, …
• This type of parallelism is called data parallelism, and GPUs are the perfect solution for it!
• In fact, the more the data, the more efficient GPUs become at these algorithms
• Bonus: you can relatively easily add more processing cores to a GPU and increase the throughput
STREAM PROCESSING

• What we just described:
• Given a (typically large) set of data (a "stream")
• Run the same series of operations (a "kernel" or "shader") on all of the data (SIMD) – see the SAXPY sketch below
• GPUs use various optimizations to improve throughput:
• Some on-chip memory and local caches to reduce bandwidth to external memory
• Batch groups of threads to minimize incoherent memory access
• Bad access patterns will lead to higher latency and/or thread stalls
• Eliminate unnecessary operations by exiting or killing threads
• Example: Z-culling and early-Z to kill pixels which will not be displayed
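
As a minimal illustration of stream processing, here is a SAXPY-style CUDA kernel (an assumed example, not from the slides): the same kernel runs over every element of the stream, and threads past the end of the data simply exit.

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;      // early exit: threads with no work are killed
    y[i] = a * x[i] + y[i];  // consecutive threads read consecutive addresses,
                             // a coherent access pattern that batches well
}
// Launched, e.g., as: saxpy<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, n);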
TO SUMMARIZE

• GPUs use stream processing to achieve high throughput
• GPUs are designed to solve problems that tolerate high latencies
• High latency tolerance → lower cache requirements
• Less transistor area for cache → more area for computing units
• More computing units → 10,000s of SIMD threads and high throughput
• GPUs win
• Additionally:
• Threads are managed by hardware → you are not required to write code for each thread and manage them yourself
• Easier to increase parallelism by adding more processors
• So, the fundamental unit of a modern GPU is a stream processor…
G80 AND GT200 STREAMING PROCESSOR ARCHITECTURE

BUILDING A PROGRAMMABLE GPU

• The future of high-throughput computing is programmable stream processing
• So build the architecture around the unified scalar stream processing cores
• GeForce 8800 GTX (G80) was the first GPU architecture built with this new paradigm
G80 REPLACES THE PIPELINE MODEL

• 128 unified streaming processors

[Diagram: the host feeds an input assembler; vertex, geometry, and pixel thread issue units dispatch work through a thread processor onto the array of streaming processors (SP) with texture fetch units (TF), L1 and L2 caches, and frame buffer (FB) partitions; setup/raster/ZCull handles rasterization.]
GT200 ADDS MORE PROCESSING POWER

[Diagram: the host CPU with system memory talks to the GPU through a host interface; the input assembler and the vertex, geometry, pixel, and compute work distribution units feed the processor clusters; viewport/clip/setup/raster/ZCull handles rasterization; an interconnection network links the clusters to eight ROP/L2 partitions, each with its own DRAM channel.]
8800 GTX (HIGH-END G80)

• 16 streaming multiprocessors
• Each one contains 8 unified streaming processors – 128 in total

GTX 280 (HIGH-END GT200)

• 30 streaming multiprocessors
• Each one contains 8 unified streaming processors – 240 in total
INSIDE A STREAMING MULTIPROCESSOR (SM)

• Scalar register-based ISA
• Multithreaded instruction unit
• Up to 1024 concurrent threads
• Hardware thread scheduling
• In-order issue
• 8 SP: thread processors
• IEEE 754 32-bit floating point
• 32-bit and 64-bit integer
• 16K 32-bit registers
• 2 SFU: special function units
• sin, cos, log, exp
• Double precision unit
• IEEE 754 64-bit floating point
• Fused multiply-add
• 16KB shared memory

[Diagram: an SM contains an I-cache, MT issue unit, and C-cache feeding 8 SPs, 2 SFUs, a DP unit, and shared memory.]
MULTIPROCESSOR PROGRAMMING MODEL

Workloads are partitioned into blocks of threads among multiprocessors (see the sketch below)
• a block runs to completion
• a block doesn't run until resources are available

Allocation of hardware resources
• shared memory is partitioned among blocks
• registers are partitioned among threads

Hardware thread scheduling
• any thread not waiting for something can run
• context switching is free – every cycle
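
A minimal CUDA sketch of this model (the kernel name scale is hypothetical): the execution configuration partitions the workload into blocks, and the third launch parameter requests per-block shared memory, which the hardware partitions among the blocks resident on an SM.

// Each block stages its slice of the data in shared memory, then processes it.
__global__ void scale(float* data, int n) {
    extern __shared__ float tile[];               // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = data[i];
    __syncthreads();                              // the whole block synchronizes
    if (i < n) data[i] = 2.0f * tile[threadIdx.x];
}

// Launch: the grid of blocks is distributed among the SMs; each block runs
// to completion once its shared memory and registers become available.
// scale<<<128, 256, 256 * sizeof(float)>>>(d_data, n);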
MEMORY HIERARCHY OF G80 AND GT200

• An SM can directly access device memory (video memory): not cached; read & write; GT200: 140 GB/s peak
• An SM can access device memory via the texture unit: cached; read-only, for textures and constants; GT200: 48 GTexels/s peak
• On-chip shared memory is shared among the threads in an SM: important for communication amongst threads; provides low-latency temporary storage; G80 & GT200: 16KB per SM (see the sketch below)
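
A minimal sketch (the kernel name blockSum is hypothetical) of using the 16KB per-SM shared memory for low-latency communication among the threads of a block: each block reduces 256 inputs to one partial sum without repeated trips to device memory.

__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float s[256];                      // assumes 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // one read from device memory
    __syncthreads();
    // Tree reduction performed entirely in on-chip shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = s[0];  // one write per block
}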
PERFORMANCE PER MILLIMETER

• For a GPU, performance == throughput
• Caches are limited in the memory hierarchy
• Strategy: hide latency with computation, not caching
• Heavy multithreading
• Switch to another group of threads when the current group is waiting for memory access
• Implication: need a large number of threads to hide latency
• Occupancy: typically 128 threads/SM minimum
• Maximum 1024 threads/SM on GT200 (total 1024 × 24 = 24,576 threads)
SIMT THREAD EXECUTION

• Strategy: Single Instruction Multiple Thread (SIMT)
• Group 32 threads (vertices, pixels, or primitives) into warps
• Threads in a warp execute the same instruction at a time
• Shared instruction fetch/dispatch
• Hardware automatically handles divergence (branches)
• Warps are the primitive unit of scheduling
• Pick 1 of 24 warps for each instruction slot
• SIMT execution is an implementation choice
• Shared control logic leaves more space for ALUs
• Largely invisible to the programmer
SHADER BRANCHING PERFORMANCE

• G8x/G9x/GT200 branch efficiency is 32 threads (1 warp)
• If threads diverge, both sides of the branch will execute on all 32 (see the sketch below)
• More efficient compared to an architecture with a branch efficiency of 48 threads

[Chart: PS branching efficiency, plotting the number of coherent 4×4 tiles (0-16) against percent efficiency (0-120%), for G80 with 32-pixel coherence versus an architecture with 48-pixel coherence.]
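
A minimal CUDA sketch of warp divergence (the kernel name branchy is illustrative): when the 32 threads of a warp disagree on a branch, the hardware serializes both paths; a condition that is uniform within each warp avoids the penalty.

__global__ void branchy(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)   // uniform within each 32-thread warp: no divergence
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
    // By contrast, branching on (i % 2 == 0) would split every warp, making
    // each warp execute both sides with half its threads masked off.
}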
G80 AND GT200 STREAMING PROCESSOR ARCHITECTURE

• Executing in blocks maximally exploits data parallelism
• Minimizes incoherent memory access
• Adding more ALUs yields better performance
• Performs data processing in SIMT fashion
• Groups 32 threads into warps
• Threads in a warp execute the same instruction at a time
• Thread scheduling is automatically handled by hardware
• Context switching is free (every cycle)
• Transparent scalability; easy to program
• Memory latency is covered by a large number of in-flight threads
• Cache is mainly used for read-only memory accesses (textures, constants).
GPU BEYOND GRAPHICS

ARCHITECTURE OF A GPU

• Same components as a typical CPU; however, there are more computing elements and more types of memory
• Original GPUs had vertex and pixel shaders, specifically for graphics
• Modern GPUs are slightly different: CUDA – Compute Unified Device Architecture
COMPUTATIONAL ELEMENTS OF A GPU

Streaming Processor – the core of the design
 the place where all of the computation takes place
Streaming Multiprocessor – a group of streaming processors
 in addition to the SPs, these also contain the Special Function Units and Load/Store Units
Instruction schedulers
Complex control logic
STREAMING MULTIPROCESSOR ARCHITECTURE
TYPES OF GPU MEMORY

• Global: DRAM; slowest performance
• Texture: cached global memory; "bound" at runtime
• Constant: cached global memory
• Shared: local to a block of threads
TERMINOLOGY

• Thread – the smallest grain of the hierarchy of device computation
• Block – a group of threads
• Grid – a group of blocks
• Warp – a group of 32 threads that are executed simultaneously on the device
• Kernel – the creator of a grid for GPU execution
GRIDS, BLOCKS, AND THREADS

A minimal CUDA sketch of the hierarchy appears below.
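
In this sketch (the kernel name brighten, the image buffer, and its dimensions are all hypothetical), a grid of blocks tiles a 2D image, each block is a tile of threads, and each thread computes one pixel.

__global__ void brighten(unsigned char* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index in the image
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index in the image
    if (x < width && y < height)
        img[y * width + x] = (unsigned char)min(255, img[y * width + x] + 10);
}

// A 16x16 block is 256 threads (8 warps); the grid covers the whole image:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height);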
CUDA MEMORY

• Registers: fastest, per-thread
• Shared memory: faster, per-block
• Global memory: slower
• Constant/texture memory: read-only, cached

A sketch of these spaces appears below.
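
The following sketch shows how these spaces map onto CUDA qualifiers (the names coeff, apply, tile, and scale are all illustrative).

__constant__ float coeff[16];      // constant memory: read-only, cached
                                   // (the host fills it via cudaMemcpyToSymbol)

__global__ void apply(const float* in, float* out, int n) {
    __shared__ float tile[256];    // shared memory: fast, visible to one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // in/out live in global memory
    __syncthreads();
    float scale = 2.0f;            // local variable: per-thread registers
    if (i < n)
        out[i] = scale * coeff[i % 16] * tile[threadIdx.x];
}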
HETEROGENEOUS COMPUTING

• Host: the CPU and its memory
• Device: the GPU and its memory