Introduction: Programming of Massively Parallel Processors

Based on slides by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, and those used by Sanjay Patel in ECE 498AL, Spring 2010, University of Illinois, Urbana-Champaign
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
Goals
• Learn how to program massively parallel
processors and achieve
– high performance
– functionality and maintainability
– scalability across future generations
• Acquire technical knowledge required to
achieve the above goals
– principles and patterns of parallel programming
– processor architecture features and constraints
– programming API, tools and techniques
Machine Problems
• Your own PCs running G80 emulators
– Better debugging environment
– Sufficient for first couple of weeks
• NVIDIA boards
– The Cyprus Institute Computation-based Science
and Technology Research Center (CaSToRC)
NVIDIA GPU cluster accounts (more on
hardware later)
– Much faster, but with less debugging support (see the build sketch below)
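A minimal smoke-test sketch of the two build paths (the file name is hypothetical; -deviceemu is the device-emulation flag of the pre-3.0 CUDA toolkits of this era):

// mp0.cu (hypothetical file name) -- builds in both environments.
//
// Device emulation on your own PC (pre-CUDA-3.0 toolkits):
//   nvcc -deviceemu mp0.cu -o mp0_emu
// Native build for the NVIDIA boards:
//   nvcc mp0.cu -o mp0
#include <cstdio>

__global__ void fill(int *out) {
    // Each thread writes its own index into the output array.
    out[threadIdx.x] = threadIdx.x;
}

int main() {
    const int N = 32;
    int h_out[N];
    int *d_out;
    cudaMalloc((void **)&d_out, N * sizeof(int));
    fill<<<1, N>>>(d_out);  // one block of N threads
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[%d] = %d (expected %d)\n", N - 1, h_out[N - 1], N - 1);
    cudaFree(d_out);
    return 0;
}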
Texts
1. Draft textbook by Prof. Hwu and Prof. Kirk available
at the UIUC course website
http://courses.ece.illinois.edu/ece498/al/Syllabus.html
2. NVIDIA, CUDA Programming Guide, NVIDIA, 2007 (reference book)
Tentative Lecture Plan
• Introduction / GPU Computing and CUDA Intro
• CUDA threading model
• CUDA memory model
• CUDA memory model, tiling
• CUDA performance
Why Massively Parallel Processing?
• A quiet revolution and potential build-up
– Calculation: 1 TFLOPS vs. 100 GFLOPS
– Memory Bandwidth: ~10x
[Figure 1.1: Enlarging performance gap between many-core GPUs and multi-core CPUs. Courtesy: John Owens.]
– GPU in every PC: massive volume and potential impact
CPU & GPU Different Design Philosophies
• CPU design is optimized for sequential code
performance
– Sophisticated control logic allows instructions from a single
thread to execute in parallel or out of order
– Large caches to reduce access latencies to main memory
• GPU design optimizes for the execution of a
massive number of threads
– the hardware uses a large pool of threads to find work to do
while some threads wait on long-latency memory accesses
(no large caches); see the sketch below
– much more chip area is dedicated to floating-point
calculations
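As a concrete illustration of that philosophy (a sketch, not from the slides; all names are hypothetical), the CUDA style is to launch far more threads than there are SPs, so that whenever some warps stall on global-memory loads, the SM can issue instructions from other resident warps:

#include <cuda_runtime.h>

// Hypothetical element-wise kernel: one thread per element.
// Launching ~1M threads on a 128-core G80 is the intended usage; while
// some warps wait on the loads of a[i] and b[i], the hardware switches
// to other ready warps, hiding the memory latency without large caches.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

// Launch with enough blocks to cover all n elements:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);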
G80 GPU Chip (2006)
• 16 Streaming Multiprocessors (SMs)
• Each SM has 8 streaming processors (SPs), for a
total of 128 cores (16 × 8); 90 W
– supports up to 768 resident threads per SM
– about 12,000 threads total (see the arithmetic sketch below)
• Each SP has a multiply-add (MAD) unit plus an extra MUL
unit; separate special function units handle transcendentals
• A pair of SMs forms a building block
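The slide's numbers can be reproduced with a little host-side arithmetic; a sketch assuming only the G80 limits quoted above:

#include <cstdio>

int main() {
    const int sms          = 16;   // streaming multiprocessors
    const int spsPerSM     = 8;    // streaming processors per SM
    const int threadsPerSM = 768;  // resident-thread limit per SM

    printf("cores:         %d\n", sms * spsPerSM);      // 16 * 8   = 128
    printf("total threads: %d\n", sms * threadsPerSM);  // 16 * 768 = 12288

    // With 256-thread blocks, 768 / 256 = 3 blocks fit on each SM at once.
    const int blockSize = 256;
    printf("blocks per SM: %d\n", threadsPerSM / blockSize);
    return 0;
}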
More G80 Characteristics
• 367 GFLOPS peak performance (25-50× that of contemporary
high-end microprocessors)
• 265 GFLOPS sustained for some applications
• 30-100× speedups over high-end microprocessors on
scientific and media applications: medical imaging,
molecular dynamics
G80 Chip
[Block diagram: the host feeds an input assembler and a thread execution manager that dispatch work across the SM array; the SMs share parallel data caches and texture units, and load/store units connect them to global memory.]
Fermi (2010)
• ~1.5 TFLOPS single precision / ~800 GFLOPS double precision
• 230 GB/s DRAM bandwidth
Future Applications Reflect a Concurrent World
• Exciting applications in the future mass-computing market
have traditionally been considered “supercomputing
applications”
– molecular dynamics simulation, video and audio coding and
manipulation, 3D imaging and visualization, consumer game
physics, and virtual reality products
– these “super-apps” represent and model the physical, concurrent
world
• Various granularities of parallelism exist, but…
– programming model must not hinder parallel implementation
– data delivery needs careful management
Speedup of Applications
[Bar chart: GPU speedup relative to CPU, shown per kernel and per whole application, for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; kernel speedups reach up to 457×.]
• GeForce 8800 GTX vs. 2.2 GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy
enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow
suit the GPU and the application is optimized (a SAXPY sketch follows)
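Since SAXPY appears in the chart above, here is a minimal CUDA version of that kernel (a sketch, not the benchmark's actual code); it computes y = a*x + y over n elements:

#include <cuda_runtime.h>

// SAXPY: y[i] = a * x[i] + y[i], one thread per element. The kernel is
// memory-bound (two loads and one store per multiply-add), so its speedup
// tracks the GPU/CPU bandwidth gap rather than the peak-FLOPS gap.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Typical launch for n elements, with 256 threads per block:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);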