Cell Briefing - Euro

IBM Research
Multicore Programming Challenges
Michael Perrone
IBM Master Inventor
Mgr., Multicore Computing Dept.
© 2009
IBM Research
Multicore Performance Challenge
[Figure: performance vs. number of cores]
Take Home Messages
“Who needs 100 cores to run MS Word?”
– Dave Patterson, Berkeley
• Performance is critical and it's not free!
• Data movement is critical to performance!
• Which curve are you on?
[Figure: performance vs. number of cores]
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
What’s happening?
• Industry shift to multicore
  – Intel, IBM, AMD, Sun, nVidia, Cray, etc.
  – Single core → Homogeneous Multicore → Heterogeneous Multicore
• Increasing
  – # Cores
  – Heterogeneity (e.g., Cell processor, system level)
• Decreasing
  – Core complexity (e.g., Cell processor, GPUs)
  – Bytes per FLOP (decreasing since the single-core Pentium4)
Heterogeneity: Amdahl’s Law for Multicore
[Figure: serial and parallel execution time on a unicore, a homogeneous multicore, and a heterogeneous multicore]
• Even for square-root performance growth (Hill & Marty, 2008)
• Loophole: Have cores work in concert on serial code…
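For reference, a sketch of the model behind the citation (my transcription of Hill & Marty's formulas, not taken from the slide): f is the parallel fraction, n the chip's total base-core resources, r the resources spent on a rich core, and a core built from r resources delivers perf(r) ≈ √r.

    \[
    \mathrm{Speedup}_{\mathrm{sym}}(f,n,r) = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f\,r}{n\,\mathrm{perf}(r)}},
    \qquad
    \mathrm{Speedup}_{\mathrm{asym}}(f,n,r) = \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}},
    \qquad
    \mathrm{perf}(r) \approx \sqrt{r}
    \]

In the paper the asymmetric (heterogeneous) design generally achieves the higher speedup, which is the motivation for heterogeneity here.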
Good & Bad News
GOOD NEWS
Multicore programming is parallel programming
BAD NEWS
Multicore programming is parallel programming
Many Levels of Parallelism
• Node
• Socket
• Chip
• Core
• Thread
• Register/SIMD
• Multiple instruction pipelines
• Need to be aware of all of them!
Additional System Types
[Diagram: four accelerator attachment models]
• Homogeneous bus attached: multicore CPUs sharing a system bus and main memory
• Heterogeneous bus attached: multicore CPU with on-chip Power core and accelerators on the system bus
• IO bus attached: accelerators with their own memory behind a bridge on the PCIe/I/O bus
• Network attached: accelerators with their own memory reached through a NIC over Ethernet or InfiniBand
Multicore Programming Challenge
[Chart: performance (lower → higher) vs. programmability (easier → harder), with regions labeled "Nirvana", "Interesting research!", "Danger Zone!", and "“Lazy” Programming", and arrows for better tools and better programming]
Outline
• What’s happening?
• Why is it happening?
  – HW Challenges
  – BW Challenges
• What are the implications?
• What can we do about it?
Power Density – The fundamental problem
[Figure: power density (W/cm², log scale) vs. process generation (1.5 µm down to 0.07 µm) for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III, approaching hot-plate and then nuclear-reactor power densities]
Source: Fred Pollack, Intel. "New Microprocessor Challenges in the Coming Generations of CMOS Technologies," Micro32.
What’s causing the problem?
[Figure: power density (W/cm², log scale) vs. gate length, with a 65 nm (CMOS 10S) gate-stack cross-section, Tox ≈ 11 Å]
• Gate dielectric approaching a fundamental limit (a few atomic layers)
Microprocessor Clock Speed Trends
Managing power dissipation is limiting clock speed increases
[Figure: microprocessor clock frequency (MHz, log scale, 10² to 10⁴) vs. year, 1990–2010]
Intuition: Power vs. Performance Trade Off
[Figure: relative power vs. relative performance for a single core (performance points 0.7, 0.8, 1, 1.3, 1.6); power rises much faster than performance]
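A hedged back-of-the-envelope version of this intuition (standard CMOS scaling argument, not taken from the slide): dynamic power scales with capacitance, voltage squared, and frequency, and sustainable frequency scales roughly with voltage, so power grows roughly as the cube of single-core performance:

    \[
    P_{\mathrm{dyn}} \approx C\,V^{2}\,f, \qquad f \propto V \;\Rightarrow\; P \propto f^{3}
    \]

Two cores at 0.8× frequency then cost about 2 × 0.8³ ≈ 1.0 units of power for 1.6× aggregate throughput, while one core pushed to 1.6× frequency costs about 1.6³ ≈ 4.1 units: the multicore trade.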
Outline
• What’s happening?
• Why is it happening?
  – HW Challenges
  – BW Challenges
• What are the implications?
• What can we do about it?
The Hungry Beast
[Diagram: data ("food") flowing through a data pipe to the processor ("beast")]
• Pipe too small = starved beast
• Pipe big enough = well-fed beast
• Pipe too big = wasted resources
• If flops grow faster than pipe capacity… the beast gets hungrier!
Move the food closer
[Diagram: a cache between the data and the processor]
• Load more food while the beast eats
What happens if the beast is still hungry?
• If the data set doesn’t fit in cache
  – Cache misses
  – Memory latency exposed
  – Performance degraded
• Several important application classes don’t fit
  – Graph searching algorithms
  – Network security
  – Natural language processing
  – Bioinformatics
  – Many HPC workloads
Make the food bowl larger
• Cache size steadily increasing
• Implications
  – Chip real estate reserved for cache
  – Less space on chip for computes
  – More power required for fewer FLOPS
• But…
  – Important application working sets are growing faster
  – Multicore even more demanding on cache than unicore
The beast is hungry!
Data pipe not growing fast enough!
The beast had babies
• Multicore makes the data problem worse!
  – Efficient data movement is critical
  – Latency hiding is critical
GOAL: The proper care and feeding of hungry beasts
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
Example: The Cell/B.E. Processor
Feeding the Cell Processor
• 8 SPEs, each with
  – LS (local store)
  – MFC (memory flow controller)
  – SXU (execution unit)
• PPE (64-bit Power Architecture with VMX)
  – OS functions
  – Disk IO
  – Network IO
[Diagram: the 8 SPEs (SPU/SXU, LS, MFC) and the PPE (PPU, L1, L2, PXU) each attach to the EIB with 16B/cycle ports; the EIB carries up to 96B/cycle; the MIC connects to dual XDR memory (16B/cycle ×2) and the BIC to FlexIO (32B/cycle and 16B/cycle links)]
Cell Approach: Feed the beast more efficiently
• Explicitly “orchestrate” the data flow
  – Enables detailed programmer control of data flow
    • Get/Put data when & where you want it
    • Hides latency: simultaneous reads, writes & computes
  – Avoids restrictive HW cache management
    • Unlikely to determine optimal data flow
    • Potentially very inefficient
  – Allows more efficient use of the existing bandwidth
• BOTTOM LINE: It’s all about the data!
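As a concrete illustration of explicitly orchestrated data flow, here is a minimal double-buffered SPE loop using the Cell SDK MFC intrinsics (a sketch only: the chunk size and the process_chunk() kernel are assumptions, not from the talk; the source address is assumed 128-byte aligned).

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 4096                       /* bytes per DMA chunk (assumption) */

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void process_chunk(char *data, unsigned n);   /* hypothetical compute kernel */

    /* Stream 'total' bytes from effective address 'ea_in', overlapping DMA with compute. */
    void stream_in(uint64_t ea_in, unsigned total)
    {
        unsigned i, nchunks = total / CHUNK;
        int cur = 0;

        /* Prime the pipeline: start fetching chunk 0 into buffer 0 (tag 0). */
        mfc_get(buf[cur], ea_in, CHUNK, cur, 0, 0);

        for (i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;

            /* Kick off the next transfer before waiting on the current one. */
            if (i + 1 < nchunks)
                mfc_get(buf[nxt], ea_in + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

            /* Wait only for the tag that owns the current buffer, then compute. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process_chunk(buf[cur], CHUNK);

            cur = nxt;
        }
    }

The DMA for the next chunk is issued before waiting on the current one, so transfer and compute overlap, which is exactly the "simultaneous reads, writes & computes" point above.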
Lessons Learned: Cell Processor
• Core simplicity impacted algorithmic design
– Increased predictability
– Avoid recursion & branches
– Simpler code is better code
– e.g., bubble vs. comb sort
• Heterogeneity
– Serial core must balance parallel cores well
• Programmability suffered
– Forced to address data flow directly
– Led to better algorithms & performance portability
What are the implications?
• Computational Complexity
• Parallel programming
• Communication
• Synchronization
• Collecting metadata
• Merging Operations
• Grouping Operations
• Memory Layout
• Memory Conflicts
• Debugging
(Some general, some Cell specific)
Computational complexity is inadequate
• Focus on computes: O(N), O(N²), O(ln N), etc.
• Ignores BW analysis
  – Memory flows are now the bottlenecks
  – Memory hierarchies are critical to performance
  – Need to incorporate memory into the picture
• Need “Data Complexity”
  – Necessarily HW dependent
  – Calculate data movement (track where the data come from) and divide by BW to get the time for data
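One hedged way to write that prescription down (my notation, not the talk's): if a kernel does W flops and must move Q bytes over the slowest link, with peak compute F, bandwidth B, and perfect overlap, then

    \[
    T \;\approx\; \max\!\left(\frac{W}{F},\; \frac{Q}{B}\right),
    \qquad
    \text{bandwidth-bound when } \frac{W}{Q} < \frac{F}{B}
    \]

where W/Q is the arithmetic intensity of the algorithm and F/B is the machine balance.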
Don’t apply computational complexity blindly
[Figure: run time vs. N for O(N) and O(N²) algorithms; at small N ("you are here") the O(N²) curve can lie below the O(N) curve]
• O(N) isn’t always better than O(N²)
• More cores can lead to smaller N per core…
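A hedged way to see the crossover (illustrative constants, not from the slide): with measured run times aN for the "better" algorithm and bN² for the "worse" one,

    \[
    bN^{2} < aN \;\Longleftrightarrow\; N < \frac{a}{b}
    \]

so if b is much smaller than a (the simpler algorithm has the smaller constant), the O(N²) method wins for all N below a/b, and spreading the data over more cores shrinks each core's N toward exactly that regime.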
Where is your data?
Localize your data!
[Figure: run time vs. N depending on where the data lives: tape, disk, L3 cache, L2 cache, L1 cache ("locality")]
Put your data where you want it when you want it!
Example: Compression
• Compress to reduce data flow
• Increases slope of O(N)
• But reduces run time
[Figure: run time components (read, compute, write) vs. N, with and without compression; the compressed version adds compute but lowers total run time]
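A hedged cost model for when compression pays off (my notation): with Q bytes to move at bandwidth B, compression ratio r > 1, and extra (de)compression work W_c at compute rate F, compressing wins when

    \[
    \frac{Q}{rB} + \frac{W + W_c}{F} \;<\; \frac{Q}{B} + \frac{W}{F}
    \;\Longleftrightarrow\;
    \frac{W_c}{F} \;<\; \frac{Q}{B}\left(1 - \frac{1}{r}\right)
    \]

i.e., the added compute must cost less than the bandwidth time it saves.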
Implication: Communication Overhead
[Figure: two partitions exchanging data]
• BW can swamp compute
• Minimize communication
Implication: Communication Overhead
[Figure: two partitionings of the same L×L domain, one with total internal boundary 9L, the other with 4L]
• Modify partitioning to reduce communications
• Trade off with synchronization
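A hedged rule of thumb behind this trade-off (not from the slide): for a 2-D domain decomposition, per-block compute scales with area while communication scales with perimeter, so compact blocks communicate less per unit of work:

    \[
    \frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \;\propto\; \frac{4n}{n^{2}} \;=\; \frac{4}{n}
    \qquad \text{for an } n \times n \text{ block}
    \]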
Implications: Synchronization Overhead
[Figure: execution timeline; synchronization overhead adds to run time]
Implications: Synchronization – Load Balancing
[Figure: uniform vs. adaptive data partitioning]
• Modify data partitioning to balance workloads
Implications: Synchronization – Nondeterminism
[Figure: probability distribution of run time; a deterministic run, the average nondeterministic run, and the max over N threads, which shifts toward longer run times]
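A hedged way to quantify the figure (standard extreme-value reasoning, not from the slide): a barrier waits for the slowest of N threads, so the step time is a maximum of random variables,

    \[
    \mathbb{E}\!\left[\max_{1 \le i \le N} T_i\right] \;\ge\; \max_i \mathbb{E}[T_i],
    \qquad
    \mathbb{E}\!\left[\max_i T_i\right] \approx \mu + \sigma\sqrt{2\ln N}
    \;\;\text{for i.i.d. Gaussian } T_i
    \]

so even modest per-thread jitter stretches the synchronized run time as N grows.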
Implications: Metadata - Parallel sort example
[Figure: unsorted data → metadata (histogram) → sorted data]
• Collect histogram in first pass
• Use histogram to parallelize second pass
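A minimal sketch of the idea in C (bucket boundaries from a first-pass histogram, then independent per-bucket sorts; the key type, bucket count, and use of qsort are assumptions, not from the slide):

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 256   /* number of buckets = number of parallel chunks (assumption) */

    static int cmp_u32(const void *a, const void *b) {
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return (x > y) - (x < y);
    }

    /* Sort 32-bit keys by bucketing on the top 8 bits, then sorting each bucket.
     * Pass 1 builds the histogram (the "metadata"); pass 2 scatters into buckets;
     * the per-bucket sorts are independent and can run on separate cores. */
    void histogram_sort(unsigned *keys, size_t n, unsigned *tmp)
    {
        size_t count[NBUCKETS] = {0}, start[NBUCKETS], offset[NBUCKETS];
        size_t i, b;

        for (i = 0; i < n; i++)                       /* pass 1: histogram */
            count[keys[i] >> 24]++;

        for (b = 0, i = 0; b < NBUCKETS; b++) {       /* prefix sum -> bucket starts */
            start[b] = offset[b] = i;
            i += count[b];
        }

        for (i = 0; i < n; i++)                       /* pass 2: scatter into buckets */
            tmp[offset[keys[i] >> 24]++] = keys[i];

        for (b = 0; b < NBUCKETS; b++)                /* independent per-bucket sorts */
            qsort(tmp + start[b], count[b], sizeof *tmp, cmp_u32);

        memcpy(keys, tmp, n * sizeof *keys);
    }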
Implications: Merge Operations – FFT Example
• Naive
  – 1D FFT (x axis)
  – Transpose
  – 1D FFT (y axis)
  – Transpose
• Improved: merge steps
  – FFT/Transpose (x axis)
  – FFT/Transpose (y axis)
  – Avoid unnecessary data movement
[Figure: input image, tile, buffer, transposed tile, transposed buffer, transposed image]
Implications: Restructure to Avoid Data Movement
[Figure: two schedules of the same work]
• Before: Compute A, Transform A to B, Compute B, Transform B to A, Compute A, Transform A to B, Compute B, …
• After: group the work: Compute A, Compute A, Compute A, then a single Transform A to B, then Compute B, Compute B, Compute B
Implications: Streaming Data & Finite Automata
[Figure: a single DFA over the data stream vs. replicated DFAs over overlapping chunks]
• Replicate & overlap
• Enables loop unrolling & software pipelining
Implications: Streaming Data – NID Example
Sample word list: “the”, “that”, “math”
• Find (lots of) substrings in (long) string
• Build graph of words & represent as DFA
Implications: Streaming Data – NID Example
Random access to large state transition table (STT)
Implications: Streaming Data – Hiding Latency
• Enables loop unrolling & software pipelining
Roofline Model (S. Williams)
[Figure: processing rate vs. data locality; latency bound at low locality, compute bound at high locality, with software pipelining raising the curve in between]
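For reference, the roofline bound itself (Williams et al.), in my notation: attainable throughput is capped either by peak compute or by bandwidth times arithmetic intensity I (flops per byte actually moved), and latency effects pull you below the bandwidth roof unless transfers are pipelined:

    \[
    P_{\mathrm{attainable}} \;=\; \min\left(P_{\mathrm{peak}},\; B \times I\right)
    \]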
Implications: Group Like Operations – Tokenization Ex.
[Figure: data stream driving a DFA that triggers actions]
• Intuitive
  – Get data (serial)
  – State transition (serial)
  – Action (branchy & nondeterministic)
  – Repeat
Implications: Group Like Operations – Tokenization Ex.
[Figure: data stream driving a DFA that appends to action lists 1–3 for later processing]
• Better
  – Get data (serial)
  – State transition (serial)
  – Add action to list (serial)
  – Repeat
  – Process action lists (serial)
• Loop unrolling
• SIMD
• Load balance
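A minimal sketch of this two-pass structure in C (the DFA tables, action codes, and emit_token() handler are hypothetical placeholders, not the talk's tokenizer):

    #include <stddef.h>

    #define NSTATES 64

    /* Hypothetical DFA tables: next state and an action code per (state, byte). */
    extern const unsigned char next_state[NSTATES][256];
    extern const unsigned char action_code[NSTATES][256];    /* 0 = no action */

    extern void emit_token(size_t pos, unsigned char code);  /* hypothetical action handler */

    struct action { size_t pos; unsigned char code; };

    /* Pass 1: tight, branch-light DFA loop that only records actions. */
    static size_t scan(const unsigned char *data, size_t n, struct action *list)
    {
        unsigned s = 0;
        size_t nact = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned char a = action_code[s][data[i]];
            s = next_state[s][data[i]];
            if (a) {                       /* defer the "branchy" work */
                list[nact].pos  = i;
                list[nact].code = a;
                nact++;
            }
        }
        return nact;
    }

    /* Pass 2: process the recorded actions; independent list chunks can be
     * unrolled, SIMDized, or handed to different cores for load balance. */
    static void process(const struct action *list, size_t nact)
    {
        for (size_t j = 0; j < nact; j++)
            emit_token(list[j].pos, list[j].code);
    }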
Implications: Convert BW to Compute Bound – NN Ex.
[Figure: neural network F: a D-dimensional input X passes through N basis functions (dot product + nonlinearity), parameterized by a D×N matrix, to the output]
• Neural net function F(X)
  – RBF, MLP, KNN, etc.
• If too big for cache, BW Bound
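A hedged estimate of why this goes bandwidth-bound (my numbers, not the slide's): evaluating N basis functions on a D-dimensional input touches all D×N parameters once for roughly 2DN flops, so per evaluation

    \[
    I \;\approx\; \frac{2DN \ \text{flops}}{4DN \ \text{bytes}} \;=\; 0.5 \ \text{flops/byte (single precision)}
    \]

which is far below typical machine balance once the parameter matrix no longer fits in cache; batching inputs or keeping matrix blocks resident in local stores (next slide) raises the effective intensity.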
Implications: Convert BW to Compute Bound – NN Ex.
[Figure: the parameter matrix split across multiple SPEs, with a final merge of partial results]
• Split function over multiple SPEs
• Avoids unnecessary memory traffic
• Reduce compute time per SPE
• Minimal merge overhead
Implications: Pay Attention to Memory Hierarchy
Register File → L1 → L2 → Main Memory
• BW: high → low
• Latency: low → high
• Size: small → larger
Implications: Pay Attention to Memory Hierarchy
[Diagram: cores with private L1 caches, groups of cores sharing L2 caches, and the L2s sharing an L3]
• Data eviction rate
• Optimal tiling
• Shared memory space can impact load balancing
Implications: Memory Hierarchy & Tiling
[Figure: blocked matrix multiply, C = A × B, computed tile by tile]
• Optimal tiling depends on cache size
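A standard illustration of the point (a sketch, not code from the talk): a cache-blocked matrix multiply whose tile size T would be tuned so that roughly three T×T tiles fit in the cache level being targeted.

    #include <stddef.h>

    #define T 64   /* tile size: pick so 3*T*T*sizeof(float) fits in the target cache (assumption) */

    static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* C += A * B for n x n row-major matrices, processed in T x T tiles so each
     * tile of A, B, and C is reused from cache instead of streaming from memory. */
    void matmul_tiled(const float *A, const float *B, float *C, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += T)
            for (size_t kk = 0; kk < n; kk += T)
                for (size_t jj = 0; jj < n; jj += T)
                    for (size_t i = ii; i < min_sz(ii + T, n); i++)
                        for (size_t k = kk; k < min_sz(kk + T, n); k++) {
                            float a = A[i * n + k];
                            for (size_t j = jj; j < min_sz(jj + T, n); j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }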
Implications: Data Re-Use – FFT Revisited
• Long stride trashes cache
• Use full cachelines where possible
[Figure: stride-N² access touches a single element per cacheline; stride-1 access uses the full cacheline ("data envelope")]
Implications: Handle Race Conditions (Debugging)
[Figure: thread 1 writes, thread 2 reads, thread 1 writes again; whether the read sees good or bad data depends on timing]
• Heisenberg Uncertainty Principle
  – Instrumenting the code changes behavior
  – Problem with maintaining exact timing
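A hedged, generic illustration of the race and of one fix (plain C with Pthreads and C11 atomics, not from the talk): the racy counter loses updates depending on timing, while the atomic one gives a deterministic result without the kind of heavyweight instrumentation that perturbs the schedule.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static long plain_counter = 0;                /* racy: concurrent ++ loses updates */
    static atomic_long safe_counter = 0;          /* atomic fetch-add is well-defined  */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            plain_counter++;                      /* data race: undefined behavior     */
            atomic_fetch_add(&safe_counter, 1);   /* race-free                         */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);

        /* plain_counter usually comes out short; safe_counter is exactly 4000000. */
        printf("plain=%ld atomic=%ld\n", plain_counter, atomic_load(&safe_counter));
        return 0;
    }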
Implications: More Cores – More Memory Conflicts
[Figure: 8 threads accessing 8 memory banks; when every thread hits bank 1 it becomes a hot spot, while accesses spread across banks 1–8 proceed in parallel]
• Avoid bank conflicts
  – Plan data layout
  – Avoid multiples of the number of banks
  – Randomize start points
  – Make critical data sizes and number of threads relatively prime
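A hedged illustration of the layout advice (generic example, not from the talk): if each of 8 threads walks its own row of an array whose row length is a multiple of the bank count, all rows start in the same bank; padding the row by one element (making its length relatively prime to the bank count) spreads simultaneous accesses across banks.

    #include <stddef.h>

    #define NBANKS   8
    #define NTHREADS 8
    #define COLS     1024                 /* multiple of NBANKS: conflict-prone        */
    #define COLS_PAD (COLS + 1)           /* relatively prime to NBANKS: conflict-free */

    /* Bank of element i for a word-interleaved memory (illustrative model). */
    static inline int bank_of(size_t index) { return (int)(index % NBANKS); }

    float conflict_prone[NTHREADS][COLS];  /* row starts 0, 1024, 2048, ... -> all bank 0     */
    float padded[NTHREADS][COLS_PAD];      /* row starts 0, 1025, 2050, ... -> banks 0, 1, 2  */

    /* Bank hit by thread t when it touches column j; with padding, simultaneous
     * accesses to the same column land in different banks instead of one hot spot. */
    static inline int bank_unpadded(int t, int j) { return bank_of((size_t)t * COLS + j); }
    static inline int bank_padded(int t, int j)   { return bank_of((size_t)t * COLS_PAD + j); }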
Implications: Reduce Data Movement
[Figure: data array with a Green's function stencil centered at (X,Y)]
\[ \sum_{i,j} D(x+i,\, y+j)\, G(x,y,i,j) \]
• New G at each (x,y)
• Radial symmetry of G reduces BW requirements
Implications: Reduce Data Movement
[Figure: the data array partitioned across SPE 0 through SPE 7]
Implications: Reduce Data Movement
[Figure: a 2R+1 column window over an H-tall data buffer and a Green's index buffer, centered at (X,Y), with two buffers cycling]
• For each X
  – Load next column of data
  – Load next column of indices
  – For each Y
    • Load Green's functions
    • SIMDize Green's functions
    • Compute convolution at (X,Y)
  – Cycle buffers
Outline
• What’s happening?
• Why is it happening?
• What are the implications?
• What can we do about it?
What can we do about it?
• We want
  – High performance
  – Low power
  – Easy programmability
  (Choose any two!)
• We need
  – “Magic” compiler
  – Multicore enabled libraries
  – Multicore enabled tools
  – New algorithms
What can we do about it?
• Compiler “magic”
  – OpenMP, autovectorization
  – BUT… Doesn’t encourage parallel thinking
• Programming models
  – CUDA, OpenCL, Pthreads, UPC, PGAS, etc.
• Tools
  – Cell SDK, RapidMind (Intel), PeakStream (Google), Cilk (Intel), Gedae, VSIPL++, Charm++, Atlas, FFTW, PHiPAC
• If you want performance…
  – No substitute for better algorithms & hand-tuning!
  – Performance analyzers
    » HPCToolkit, FDPR-Pro, Code Analyzer, Diablo, TAU, Paraver, VTune, SunStudio Performance Analyzer, PDT, Trace Analyzer, Thor, etc.
What can we do about it?
Example: OpenCL
• Open “standard”
• Based on C - not difficult to learn
• Allows natural transition from (proprietary) CUDA programs
• Interoperates with MPI
• Provides application portability
– Hides specifics of underlying accelerator architecture
– Avoids HW lock-in: “future-proofs” applications
• Weaknesses
– No DP (double precision), no recursion & accelerator model only
Portability does not equal performance portability!
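For flavor, a minimal OpenCL C kernel (illustrative only; the host-side platform/context/buffer/queue setup, where most of the verbosity lives, is omitted):

    /* OpenCL C kernel: one work-item per output element.
     * The host enqueues this over an N-element NDRange after copying a and b
     * to device buffers; get_global_id(0) is the work-item's global index. */
    __kernel void saxpy(const float alpha,
                        __global const float *a,
                        __global float *b,
                        const unsigned int n)
    {
        size_t i = get_global_id(0);
        if (i < n)
            b[i] = alpha * a[i] + b[i];
    }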
What can we do about it?
Hide Complexity in Libraries
• Manually
– Slow, expensive, new library for each architecture
• Autotuners
– Search program space for optimal performance
– Examples: Atlas (BLAS), FFTW (FFT), Spiral (DSP), OSKI (Sparse BLAS), PhiPAC (BLAS)
• Local Optimality Problem:
– F() & G() may be optimal, but will F(G()) be?
What can we do about it?
It’s all about the data!
• The data problem is growing
• Intelligent software prefetching
  – Use DMA engines
  – Don't rely on HW prefetching
• Efficient data management
  – Multibuffering: Hide the latency!
  – BW utilization: Make every byte count!
  – SIMDization: Make every vector count!
  – Problem/data partitioning: Make every core work!
  – Software multithreading: Keep every core busy!
Conclusions
• Programmability will continue to suffer
  – No pain - no gain
• Incorporate data flow into algorithmic development
  – Computational complexity vs. “data flow” complexity
• Restructure algorithms to minimize:
  – Synchronization, communication, non-determinism, load imbalance, non-locality
• Data management is the key to better performance
  – Merge/Group data operations to minimize memory traffic
  – Restructure data traffic: Tile, Align, SIMDize, Compress
  – Minimize memory bottlenecks
Backup Slides
Abstract
The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to
restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock
frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a
result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same
time, competitive market conditions are giving business advantage to those companies that can field new streaming applications,
handle larger data sets, and update their models to market conditions faster. This desire for newer, faster and larger is driving
continued demand for higher computer performance.
The industry’s response to address these challenges has been to embrace “multicore” technology by designing processors that have
multiple processing cores on each silicon chip. Increasing the number of cores per chip has enabled processor peak performance to
double with each doubling of the number of cores. With performance doubling occurring at approximately constant clock frequency
so that energy costs can be controlled, multicore technology is poised to deliver the performance users need for their next generation
applications while at the same time reducing total cost of ownership per FLOP.
The multicore solution to the clock frequency problem comes at a cost: Performance scaling on multicore is generally sub-linear and
frequently decreases beyond some number of cores. For a variety of technical reasons, off-chip bandwidth is not increasing as fast
as the number of cores per chip which is making memory and communication bottlenecks the main barriers to improved
performance. What these bottlenecks mean to multicore users is that precise and flexible control of data flows will be crucial to
achieving high performance. Simple mappings of their existing algorithms to multicore will not result in the naïve performance
scaling one might expect from increasing the number of cores per chip. Algorithmic changes, in many cases major, will have to be
made to get value out of multicore. Multicore users will have to re-think and in many cases re-write their applications if they want to
achieve high performance. Multicore forces each programmer to become a parallel programmer; to think of their chips as clusters;
and to deal with the issues of communication, synchronization, data transfer and non-determinism as integral elements of their
algorithms. And for those already familiar with parallel programming, multicore processors add a new level of parallelism and
additional layers of complexity.
This talk will highlight some of the challenges that need to be overcome in order to get better performance scaling on multicore, and will
suggest some solutions.
Cell Comparison: ~4x the FLOPS @ ~½ the power
Both 65nm technology (to scale)
[Figure: die comparison]
To Scale Comparison of L2
[Figure: L2 cache areas of AMD, Cell/B.E., Intel, and IBM processors, drawn to scale]
Intel Multi-Core Forum (2006)
The Issue
[Figure: Linux SDET throughput vs. number of processors (0–24), scaling to about 9.8x]
The “Yale Patt Ladder”
• Problem
• Algorithm
• Program
• ISA (Instruction Set Architecture)
• Microarchitecture
• Circuits
• Electrons
To improve performance, we need people who can cross between levels.