VT PowerPoint Template - Computer Science & Engineering

Carlo del Mundo
Department of Electrical and Computer Engineering
Ubiquitous Parallelism
Are You Equipped To Code For Multi- and Many-Core Platforms?
Agenda
• Introduction/Motivation
• Why Parallelism? Why now?
• Survey of Parallel Hardware
• CPUs vs. GPUs
• Conclusion
• How Can I Start?
2
Talk Goal
• Encourage undergraduates to answer the call to the era of parallelism
• Education
• Software Engineering
3
Why Parallelism? Why now?
• You’ve already been exposed to parallelism
• Bit Level Parallelism
• Instruction Level Parallelism
• Thread Level Parallelism
4
Why Parallelism? Why now?
• Single-threaded performance has plateaued
• Silicon Trends
• Power Consumption
• Heat Dissipation
5
Why Parallelism? Why now?
6
Power Chart: P = CV^2F
7
Heat Chart (Feature Size)
8
Why Parallelism? Why now?
• Issue: Power & Heat
• Good: Cheaper to have more cores, but each core is slower
• Bad: Breaks the hardware/software contract
9
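The power argument can be made concrete with the P = CV^2F relation from the power chart. The following is a minimal sketch, not a measurement: the normalized values, the 0.8x voltage figure, and the assumption that the work splits perfectly across two cores are all illustrative assumptions that do not come from the talk.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power from P = C * V^2 * F (all values normalized)."""
    return capacitance * voltage ** 2 * frequency

# Baseline: one core at normalized voltage and frequency.
one_fast_core = dynamic_power(1.0, 1.0, 1.0)

# Assumption: halving the clock lets the voltage drop to 0.8x (illustrative).
# Two such cores match the baseline's aggregate clock rate (2 * 0.5 = 1.0).
two_slow_cores = 2 * dynamic_power(1.0, 0.8, 0.5)

print(f"one fast core:  {one_fast_core:.2f}")
print(f"two slow cores: {two_slow_cores:.2f}")
```

Under these assumptions the two slower cores draw about 0.64 of the baseline power, which is the sense in which more, slower cores are cheaper. Software only benefits if it can actually use both cores.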
Why Parallelism? Why now?
• Hardware/Software Contract
• Maintain backwards-compatibility with existing codes
10
Why Parallelism? Why now?
11
Personal Mobile Device Space
• iPhone 5: 2 CPU cores / 3 GPU cores
• Galaxy S3: 4 CPU cores / 4 GPU cores
15
Desktop Space
16
Desktop Space
AMD Opteron 6272: 16 CPU cores
• Rare To Have “Single Core” CPU
• Clock Speeds < 3.0 GHz
• Power Wall
• Heat Dissipation
17
Desktop Space
AMD Radeon 7970: 2048 GPU Cores
• General Purpose
• Power Efficient
• High Performance
• Not All Problems Can Be Done on GPU
18
Warehouse Space (HokieSpeed)
• Each node:
• 2x Intel Xeon 5645 (6 cores each)
• 2x NVIDIA C2050 (448 GPU cores each)
• 209 nodes
★ 2508 CPU cores
★ 187264 GPU cores
21
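The starred totals follow directly from the per-node figures on the slide. A short check of the arithmetic (the constant names are chosen here for illustration):

```python
# Per-node figures as listed for HokieSpeed.
NODES = 209
CPUS_PER_NODE = 2
CORES_PER_CPU = 6
GPUS_PER_NODE = 2
CORES_PER_GPU = 448

cpu_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
gpu_cores = NODES * GPUS_PER_NODE * CORES_PER_GPU

print(cpu_cores)  # 2508
print(gpu_cores)  # 187264
```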
All Spaces
22
Convergence in Computing
• Three Classes:
• Warehouse
• Desktop
• Personal Mobile Device
• Main Criteria
• Power, Performance, Programmability
23
What is a CPU?
• CPU
• SR71 Jet
• Capacity
• 2 passengers
• Top Speed
• 2200 mph
25
What is the GPU?
• GPU
• Boeing 747
• Capacity
• 605 passengers
• Top Speed
• 570 mph
26
CPU vs. GPU
                      Capacity (passengers)   Speed (mph)   Throughput (passengers * mph)
“CPU” Fighter Jet     2                       2200          4400
“GPU” 747             452                     555           250,860
27
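The throughput column is just capacity multiplied by speed. A small sketch that recomputes it from the table values (the dictionary layout and labels are chosen here for illustration):

```python
# (capacity in passengers, speed in mph), taken from the table.
vehicles = {
    "CPU (fighter jet)": (2, 2200),
    "GPU (747)": (452, 555),
}

# Throughput = capacity * speed, in passengers * mph.
throughput = {
    name: capacity * speed for name, (capacity, speed) in vehicles.items()
}

for name, value in throughput.items():
    print(f"{name}: {value}")

ratio = throughput["GPU (747)"] / throughput["CPU (fighter jet)"]
print(f"GPU/CPU throughput ratio: {ratio:.1f}")
```

The jet wins on speed (latency), but the 747 moves roughly 57 times more passenger-miles per hour in this analogy.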
CPU Architecture
• Latency Oriented (Speculation)
28
GPU Architecture
29
APU = CPU + GPU
• Accelerated Processing Unit
• Both CPU + GPU on the same die
30
CPUs, GPUs, APUs
• How to handle parallelism?
• How to extract performance?
• Can I just throw processors at a
problem?
31
CPUs, GPUs, APUs
• CPUs: Multi-threading (2-16 threads)
• GPUs: Massive multi-threading (100,000+ threads)
• Depends on Your Problem
32
How Can I Start?
• CUDA Programming
• You most likely have a CUDA-enabled GPU if you have a recent NVIDIA card
34
How Can I Start?
• CPU or GPU Programming
• Use OpenCL (your laptop could potentially run it)
35
How Can I Start?
• Undergraduate research
• Senior/Grad Courses:
• CS 4234 – Parallel Computation
• CS 5510 – Multiprocessor Programming
• ECE 4504/5504 – Computer Architecture
• CS 5984 – Advanced Computer Graphics
36
In Summary …
• Parallelism is here to stay
• How does this affect you?
• How fast is fast enough?
• Are we content with current computer performance?
37
Thank you!
• Carlo del Mundo, Senior, Computer Engineering
• Website: http://filebox.vt.edu/users/cdel/
• E-mail: cdel@vt.edu
Previous Internships @
38
Appendix
39
Programming Models
• pthreads
• MPI
• CUDA
• OpenCL
40
pthreads
• A POSIX (UNIX) API to create and destroy threads
41
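The create-and-join pattern that pthreads exposes can be sketched in Python, used here only as a stand-in: the `threading` module is not the pthreads API itself, and the `worker` function and `results` list are hypothetical names for illustration.

```python
import threading

results = [None] * 4

def worker(index):
    """Each thread writes only its own slot, so no lock is needed here."""
    results[index] = index * index

# Build one thread per work item, then start them all.
# start() is roughly where a C program would call pthread_create.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()

# Wait for every thread to finish, comparable in spirit to pthread_join.
for t in threads:
    t.join()

print(results)  # [0, 1, 4, 9]
```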
MPI
• A message-passing standard (Message Passing Interface)
• “Send and Receive” messages between nodes
42
CUDA
• Massive multithreading (100,000+ threads)
• Thread-level parallelism
43
OpenCL
• Heterogeneous programming model that targets several device types (CPUs, GPUs, APUs)
44
Comparisons
                      pthreads    MPI            CUDA          OpenCL
Number of Threads     2-16        --             100,000+      2 – 100,000+
Platform              CPU only    Any Platform   NVIDIA Only   Any Platform
Productivity†         Easy        Medium         Hard          Hard
Parallelism through   Threads     Messages       Threads       Threads
† Productivity is subjective and draws from my experiences
Parallel Applications
• Vector Add
• Matrix Multiplication
46
Vector Add
[Figure: vector + vector = vector]
47
Vector Add
• Serial
• Loop N times
• N cycles†
• Parallel
• Assume you have N cores
• 1 cycle†
† Assume 1 add = 1 cycle
48
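The serial and parallel cases can be sketched as follows. This is an illustration of how the work is divided, not a benchmark: the function names and the choice of four workers are assumptions, and in CPython the global interpreter lock means these threads will not actually deliver the ideal speedup described on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

def vector_add_serial(a, b):
    """One add per loop iteration: N iterations for N elements."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def vector_add_parallel(a, b, workers=4):
    """Split the index range into chunks; each worker adds one chunk."""
    n = len(a)
    c = [0] * n
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division

    def add_chunk(start):
        # Chunks never overlap, so workers never write the same element.
        for i in range(start, min(start + chunk, n)):
            c[i] = a[i] + b[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(add_chunk, range(0, n, chunk)))
    return c

a = list(range(8))
b = [10 * x for x in a]
serial = vector_add_serial(a, b)
parallel = vector_add_parallel(a, b)
print(serial)
print(serial == parallel)
```

Every element of the result depends only on the matching elements of the inputs, which is why the work can be handed out in any order.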
Matrix Multiplication
[Figure: matrices A, B, and C]
49
Matrix Multiplication
• Embarrassingly Parallel
• Let L be the length of each side of the square matrices
• L^2 output elements, each element requires L multiplies and L adds
52
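The operation count above can be checked with a naive multiply that tallies its own work. This is a minimal sketch: `matmul_counted` is a hypothetical helper written for illustration, and it counts the first accumulation into a zero total as an add so that the tally matches the slide's L adds per element.

```python
def matmul_counted(a, b):
    """Naive square matrix multiply that also counts multiplies and adds."""
    size = len(a)
    c = [[0] * size for _ in range(size)]
    multiplies = 0
    adds = 0
    for i in range(size):
        for j in range(size):
            # Each c[i][j] is independent of every other output element,
            # which is what makes the problem embarrassingly parallel.
            total = 0
            for k in range(size):
                total += a[i][k] * b[k][j]
                multiplies += 1
                adds += 1
            c[i][j] = total
    return c, multiplies, adds

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c, multiplies, adds = matmul_counted(a, b)
print(c)                 # [[19, 22], [43, 50]]
print(multiplies, adds)  # 8 8
```

For L = 2 there are L^2 = 4 output elements at L = 2 multiplies and 2 adds each, so 8 of each in total, or L^3 in general.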
Performance
• Operations/Second (FLOPS)
• Power (W)
• Throughput (# things/unit time)
• FLOPS/W
53
Puss In Boots
• “Renders that took hours now take minutes”
• Ken Museth, Effects R&D Supervisor, DreamWorks Animation
54
Computational Finance
• Black-Scholes
– A PDE which governs the price of an option, essentially “eliminating” risk
55
Genome Sequencing
• Knowledge of the human genome can provide insights into new medicine and biotechnology
• E.g.: genetic engineering, hybridization
56
Applications
57
Why Should You Care?
• Trends:
• CPU Core Counts Double Every 2 years
• 2006 – 2 cores, AMD Athlon 64 X2
• 2010 – 8-12 cores, AMD Magny Cours
• Power Wall
58
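The doubling trend can be written as a simple projection. This is a rough sketch: it takes the 2006 dual-core data point from the slide as the baseline and assumes a clean doubling every two years, which real products only approximate. The function and parameter names are chosen here for illustration.

```python
def projected_cores(year, base_year=2006, base_cores=2, doubling_period=2):
    """Core count if it doubles every `doubling_period` years from the base."""
    return base_cores * 2 ** ((year - base_year) / doubling_period)

for year in (2006, 2008, 2010):
    print(year, projected_cores(year))
```

The 2010 projection of 8 cores falls at the low end of the 8-12 cores listed for that year.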
Then And Now
• Today’s state-of-the-art hardware is
yesterday’s supercomputer
• 1998 – Intel TFLOPS supercomputer
• 1.8 trillion floating point ops / sec (1.8 TFLOPS)
• 2008 – AMD Radeon 4870 GPU x2
• 2.4 trillion floating point ops / sec (2.4 TFLOPS)
59