Chapter 10: GPU Introduction - Southern Illinois University

Programming Massively Parallel Processors
Lecture Slides for Chapter 1: Introduction
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign
Two Main Trajectories
• Since 2003, the semiconductor industry has followed two main trajectories:
– Multicore: seeks to maintain the execution speed of sequential programs; reduces latency.
– Many-core: seeks to improve the execution throughput of parallel applications. Each heavily multithreaded core is much smaller, and some cores share control logic and an instruction cache.
CPUs and GPUs have fundamentally different design philosophies
[Figure: CPU vs. GPU chip layout. The CPU devotes most of its die area to control logic and a large cache serving a few ALUs; the GPU fills most of its area with many small ALUs and modest caches. Each connects to its own DRAM.]
Multicore CPU
• Optimized for sequential programs: sophisticated control logic allows instructions from a single thread to execute faster, and large on-chip caches turn long-latency memory accesses into cache accesses, reducing the execution latency of each thread. However, the large cache memory (multiple megabytes), low-latency arithmetic units, and sophisticated operand-delivery logic consume chip area and power.
– Latency-oriented design
Multicore CPU
• Many applications are limited by the speed at which data can be moved from memory to the processor.
– Because the CPU must satisfy requirements from legacy operating systems and I/O devices, it is harder to increase its memory bandwidth, which is typically about 1/6 that of a GPU.
Many-core GPU
• Shaped by the fast-growing video game industry, which expects a massive number of floating-point calculations per video frame.
• The motive is to maximize the chip area and power budget dedicated to floating-point calculations. The solution is to optimize for the execution throughput of a massive number of threads: the design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. The area and power saved on memory and arithmetic allow designers to put more cores on a chip, increasing execution throughput.
Many-core GPU
• A large number of threads lets the hardware find work to do while some threads are waiting for long-latency memory accesses or arithmetic operations. Small caches are provided to help control bandwidth requirements, so multiple threads that access the same memory do not all need to go to DRAM.
– Throughput-oriented design: strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.
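A minimal CUDA sketch of this idea, with illustrative kernel and launch parameters not tied to any particular GPU: roughly a million threads each perform one long-latency load and a little arithmetic, and while some warps wait on DRAM the streaming multiprocessors issue instructions from other resident warps.

#include <cstdio>
#include <cuda_runtime.h>

// Memory-bound kernel: each thread reads one element, does a little
// arithmetic, and writes one element. Individual threads spend most of
// their time waiting on DRAM; throughput comes from having many of them.
__global__ void scale(const float *in, float *out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = factor * in[i];   // long-latency load hidden by other warps
    }
}

int main() {
    const int n = 1 << 20;                      // ~1M elements
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));  // unified memory keeps the sketch short
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // 4096 blocks x 256 threads = ~1M threads in flight across the SMs,
    // vastly oversubscribing the physical cores so stalls can be hidden.
    int block = 256;
    int grid = (n + block - 1) / block;
    scale<<<grid, block>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);            // expect 2.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}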
CPU + GPU
• GPUs will not perform well on tasks on which CPUs are designed to perform well. For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs.
• When a program has a large number of threads, GPUs with higher execution throughput can achieve much higher performance than CPUs. Many applications use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPU.
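A minimal sketch of this division of labor (vecAddKernel and the h_/d_ names are illustrative): the CPU runs the sequential parts (allocation, initialization, and the explicit copies to and from device memory), while the GPU runs the numerically intensive element-wise work.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Numerically intensive part: runs on the GPU, one thread per element.
__global__ void vecAddKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Sequential part: runs on the CPU (allocation, initialization, I/O).
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // The programmer explicitly manages CPU<->GPU data transfers.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int block = 256;
    vecAddKernel<<<(n + block - 1) / block, block>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("h_c[100] = %f\n", h_c[100]);   // expect 300.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}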
GPU adoption
• The processors of choice must have a very large presence in the marketplace.
– 400 million CUDA-enabled GPUs in use to date.
• Practical form factors and easy accessibility
– Until 2006, parallel programs were run in data centers or on clusters. Actual clinical applications on MRI machines are based on a PC plus special hardware accelerators; GE and Siemens cannot sell racks into clinical settings. NIH refused to fund parallel programming projects; today NIH funds research using GPUs.
Why Massively Parallel Processor
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics API
– GPU in every PC and workstation – massive volume and potential impact
Architecture of a CUDA-capable GPU
[Figure: block diagram of a CUDA-capable GPU. A host interface and input assembler feed a thread execution manager, which distributes work across an array of streaming multiprocessors, each with its parallel data cache, texture units, and load/store units, all connected to off-chip global memory.]
• Two streaming multiprocessors form a building block; each has a number of streaming processors that share control logic and an instruction cache.
• Each GPU comes with multiple gigabytes of DRAM (global memory). It offers high off-chip bandwidth, though with longer latency than typical system memory; for massively parallel applications, the high bandwidth makes up for the longer latency.
• G80: 86.4 GB/s of memory bandwidth, plus 8 GB/s of communication bandwidth with the CPU (4 GB/s up and 4 GB/s down).
• A good application runs 5,000 to 12,000 threads, while CPUs support 2 to 8 threads.
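These architectural parameters can be inspected on an actual card through the CUDA runtime's cudaGetDeviceProperties call; a small sketch follows (the bandwidth estimate uses the usual 2 x memory clock x bus-width/8 rule of thumb for DDR memory, and some clock fields may be unavailable on very recent GPUs).

#include <cstdio>
#include <cuda_runtime.h>

// Print the architecture-related properties discussed above for device 0:
// number of streaming multiprocessors, global memory size, and memory interface.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device name:               %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Global memory:             %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("Memory bus width:          %d bits\n", prop.memoryBusWidth);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);

    // Rough peak bandwidth estimate: 2 * clock (Hz) * bus width (bytes), in GB/s.
    double bw = 2.0 * (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Peak memory bandwidth:     ~%.1f GB/s\n", bw);
    return 0;
}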
GT200 Characteristics
• 1 TFLOPS peak performance (25-50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for applications such as VMD
• Massively parallel, 128 cores, 90 W
• Massively threaded, sustains thousands of threads per application
• 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
“I think they're right on the money, but the huge performance
differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s)
will invite close scrutiny so I have to be careful what I say
publically until I triple check those numbers.”
-John Stone, VMD group, Physics UIUC
Future Apps Reflect a Concurrent World
• Exciting applications in future mass-computing markets have traditionally been considered “supercomputing applications”
– Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
– These “super-apps” represent and model the physical, concurrent world
• Various granularities of parallelism exist, but…
– the programming model must not hinder parallel implementation
– data delivery needs careful management
Stretching Traditional Architectures
• Traditional parallel architectures cover some super-applications
– DSP, GPU, network apps, scientific computing
• The game is to grow mainstream architectures “out” or domain-specific architectures “in”
– CUDA takes the latter approach
[Figure: current architecture coverage spans traditional applications; domain-specific architecture coverage targets new applications, with obstacles in between.]
Software Evolution
• MPI: scales up to 100,000 nodes.
• CUDA: shared memory for parallel execution; programmers manage the data transfer between CPU and GPU and write the detailed parallel code constructs.
• OpenMP: shared memory. Not able to scale beyond a couple of hundred cores due to thread-management overhead and cache coherence. Compilers do most of the automation in managing parallel execution.
• OpenCL (2009): Apple, Intel, AMD/ATI, and NVIDIA proposed a standard programming model. It defines language extensions and a run-time API. Applications developed in OpenCL can run on any processor that supports the OpenCL language extensions and API without code modification.
• OpenACC (2011): compiler directives that specify loops and regions of code to offload from the CPU to the GPU. More like OpenMP.