(How) Can Programmers
Conquer the
Multicore Menace?
Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
2
Today: The Happily Oblivious
Average Joe Programmer
• Joe is oblivious about the processor
– Moore’s law brings Joe performance
– Sufficient for Joe’s requirements
• Joe has built a solid boundary between
Hardware and Software
– High level languages abstract away the processors
– Ex: Java bytecode is machine independent
• This abstraction has provided a lot of freedom for Joe
• Parallel Programming is only practiced by a few experts
3
Moore’s Law
[Chart: number of transistors (1,000,000 to 1,000,000,000) and performance (vs. VAX-11/780, 1 to 100,000) for processors from the 8086 and 286 through the Pentium, P2, P3, P4, and Itanium 2, 1978–2016. Performance growth: 25%/year, then 52%/year, then ??%/year.]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
4
From David Patterson
Uniprocessor Performance
(SPECint)
[Chart: the same data, highlighting SPECint performance (vs. VAX-11/780) for the 8086 through Itanium 2, 1978–2016: 25%/year, then 52%/year since the mid-1980s.]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
5
From David Patterson
Uniprocessor Performance
(SPECint)
[Same chart, with recent years annotated “??%/year”: the 52%/year run of single-processor performance gains has ended.]
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
6
From David Patterson
Squandering of the
Moore’s Dividend
• 10,000x performance gain in 30 years! (~46% per year)
• Where did this performance go?
• Last decade we concentrated on correctness and
programmer productivity
• Little to no emphasis on performance
• This is reflected in:
– Languages
– Tools
– Research
– Education
• Software Engineering: Only engineering discipline where
performance or efficiency is not a central theme
7
Matrix Multiply
An Example of Unchecked Excesses
• Abstraction and Software Engineering
– Immutable Types
– Dynamic Dispatch
– Object Oriented
• High Level Languages
• Memory Management
– Transpose for unit stride
– Tile for cache locality
• Vectorization
• Prefetching
• Parallelization
[Chart: cumulative performance gap at each level — 296,260x, 87,042x, 33,453x, 12,316x, 2,271x, 7,514x, 1,117x, 522x, 220x]
Matrix Multiply
An Example of Unchecked Excesses
• Typical Software Engineering Approach
– In Java
– Object oriented
– Immutable
– Abstract types
– No memory optimizations
– No parallelization
• Good Performance Engineering Approach
– In C/Assembly
– Memory optimized (blocked)
– BLAS libraries
– Parallelized (to 4 cores)
• In Comparison: Lowest to Highest MPG in transportation
[Chart: 296,260x performance gap between the two approaches; transportation comparison figures: 14,700x, 294,000x]
9
Joe the Parallel Programmer
• Moore’s law is no longer bringing performance gains
• If Joe needs performance he has to deal with multicores
– Joe has to deal with performance
– Joe has to deal with parallelism
10
Why Parallelism is Hard
• A huge increase in complexity and work for the programmer
– Programmer has to think about performance!
– Parallelism has to be designed in at every level
• Programmers are trained to think sequentially
– Deconstructing problems into parallel tasks is hard for many of us
• Parallelism is not easy to implement
– Parallelism cannot be abstracted or layered away
– Code and data have to be restructured in very different (non-intuitive) ways
• Parallel programs are very hard to debug
– Combinatorial explosion of possible execution orderings
– Race condition and deadlock bugs are non-deterministic and elusive
– Non-deterministic bugs go away in the lab environment and with instrumentation
11
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
– Joint work with Marek Olszewski and Jason Ansel
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
12
Racing for Lock Acquisition
• Two threads
– Start at the same time
– 1st thread: 1000 instructions to the lock acquisition
– 2nd thread: 1100 instructions to the lock acquisition
[Diagram: instruction count vs. time for the two threads racing to the lock]
13
Non-Determinism
• Inherent in parallel applications
– Accesses to shared data can experience many possible
interleavings
– New! Was not the case for sequential applications!
– Almost never part of program specifications
– Even the simplest parallel programs, e.g. a work queue, are non-deterministic
• Non-determinism is undesirable
– Hard to create programs with repeatable results
– Difficult to perform cyclic debugging
– Testing offers weaker guarantees
14
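The interleaving problem can be made concrete with a small sketch (not from the talk; names are ours). Two simulated "threads" each perform `x += 1` as a separate read and write, and a schedule array decides which thread takes the next step:

```c
#include <assert.h>

/* Two "threads" each perform x += 1 as a separate read and write
 * (tmp = x; x = tmp + 1). sched[] says which thread takes the next
 * step; a thread's first step is its read, its second is its write.
 * The final value of x depends on the interleaving chosen. */
int run_schedule(const int sched[4]) {
    int x = 0;
    int tmp[2]  = {0, 0};
    int step[2] = {0, 0};
    for (int i = 0; i < 4; i++) {
        int t = sched[i];
        if (step[t] == 0)
            tmp[t] = x;        /* read shared x */
        else
            x = tmp[t] + 1;    /* write back incremented value */
        step[t]++;
    }
    return x;
}
```

The serial schedule {0,0,1,1} yields 2, while {0,1,0,1} loses an update and yields 1: the result is a function of scheduling, which is exactly the non-determinism at issue.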
Deterministic Multithreading
• Observation:
– Non-determinism need not be a required property of threads
– We can interleave thread communication in a deterministic manner
– Call this Deterministic Multithreading
• Deterministic multithreading:
– Makes debugging easier
– Tests offer guarantees again
– Supports existing programming models/languages
– Allows programmers to “determinize” computations that have previously been hard to make deterministic using today’s programming idioms
– e.g.: Radiosity (Singh et al. 1994), LocusRoute (Rose 1988), and
Delaunay Triangulation (Kulkarni et al. 2008)
Deterministic Multithreading
• Strong Determinism
– Deterministic interleaving for all accesses to shared data for a given input
– Attractive, but difficult to achieve efficiently without hardware support
• Weak Determinism
– Deterministic interleaving of all lock acquisitions for a given input
– Cheaper to enforce
– Offers same guarantees as strong determinism for data-race-free program
executions
– Can be checked with a dynamic race detector!
Kendo
• A Prototype Deterministic Locking Framework
– Provides Weak Determinism for C and C++ code
– Runs on commodity hardware today!
– Implements a subset of the pthreads API
– Enforces determinism without sacrificing load balance
– Tracks progress of threads to dynamically construct the
deterministic interleaving:
– Deterministic Logical Time
– Incurs low performance overhead (16% geomean on
Splash2)
Deterministic Logical Time
• Abstract counterpart to physical time
– Used to deterministically order events on an SMP machine
– Necessary to construct the deterministic interleaving
• Represented as P independently updated deterministic logical clocks
– Not updated based on the progress of other threads
(unlike Lamport clocks)
– Event1 (on Thread 1) occurs before Event2 (on Thread 2) in Deterministic
Logical Time if:
– Thread 1 has lower deterministic logical clock than Thread 2 at time of events
Deterministic Logical Clocks
• Requirements
– Must be based on events that are deterministically
reproducible from run to run
– Track progress of threads in physical time as closely as
possible (for better load balancing of the deterministic
interleaving)
– Must be cheap to compute
– Must be portable over micro-architecture
– Must be stored in memory for other threads to observe
Deterministic Logical Clocks
• Some x86 performance counter events satisfy many of
these requirements
– Chose the “Retired Store Instructions” event
• Required changes to Linux Kernel
– Performance counters are accessible only at kernel level
– Added an interrupt service routine
– Increments each thread’s deterministic logical clock (in memory) on
every performance counter overflow
– Frequency of overflows can be controlled
Locking Algorithm
• Construct a deterministic interleaving of lock acquires
from deterministic logical clocks
– Simulate the interleaving that would occur if running in
deterministic logical time
• Uses concept of a turn
– It’s a thread’s turn when:
– All threads with smaller ID have greater deterministic logical clocks
– All threads with larger ID have greater or equal deterministic logical clocks
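The turn condition can be written down directly. This is an illustrative sketch (the function name and array representation are ours, not Kendo's actual API):

```c
#include <assert.h>

/* clocks[i] is thread i's deterministic logical clock. Thread id may
 * proceed when every smaller-ID thread is strictly ahead of it and
 * every larger-ID thread is at least even with it. Ties therefore
 * resolve in favor of the smaller ID, so exactly one thread has the
 * turn at any point in deterministic logical time. */
int is_my_turn(int id, const unsigned long *clocks, int nthreads) {
    for (int j = 0; j < nthreads; j++) {
        if (j == id)
            continue;
        if (j < id && clocks[j] <= clocks[id])
            return 0;  /* a smaller ID must have a greater clock */
        if (j > id && clocks[j] < clocks[id])
            return 0;  /* a larger ID must have a greater-or-equal clock */
    }
    return 1;
}
```

Because the clocks are deterministic from run to run, the sequence of turns, and hence the order of lock acquisitions, is also deterministic.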
Locking Algorithm
function det_mutex_lock(l) {
    pause_logical_clock();
    wait_for_turn();
    lock(l);
    inc_logical_clock();
    enable_logical_clock();
}

function det_mutex_unlock(l) {
    unlock(l);
}
Example
[Animation, plotting both threads’ progress in deterministic logical time vs. physical time. The two deterministic logical clocks advance independently (t=3 vs. t=5, then t=6 vs. t=11: “It’s a race!”). Both threads call det_lock(a). The thread arriving with the larger deterministic logical clock (t=25) must spin in wait_for_turn(), while the thread with the smaller clock (t=22) takes its turn and calls lock(): Thread 2 will always acquire the lock first, in every run. After Thread 2 releases the lock with det_unlock(a) (t=32), the waiting thread’s turn arrives (its clock having advanced through t=26 and t=28) and it acquires the lock.]
Locking Algorithm Improvements
• Eliminate deadlocks in nested locks
– Make thread increment its deterministic logical clock
while it spins on the lock
– Must do so deterministically
• Queuing for fairness
• Lock priority boosting
• See ASPLOS09 Paper on Kendo for details
Evaluation
• Methodology
– Converted the Splash2 benchmark suite to use the Kendo framework
– Eliminated data-races
– Checked determinism by examining output and the final
deterministic logical clocks of each thread
• Experimental Framework
– Processor: Intel Core2 Quad-core running at 2.66GHz
– OS: Linux 2.6.23 (modified for performance counter support)
Results
[Chart: execution time relative to non-deterministic execution (0 to 1.6) for tsp, quicksort, ocean, barnes, radiosity, raytrace, fmm, volrend, waternsqrd, and the mean; each bar split into Application Time, Interrupt Overhead, and Deterministic Wait Overhead.]
Effect of interrupt frequency
[Chart: execution time relative to non-deterministic execution (0 to 5) as the interrupt period varies from 64 to 16K; bars split into Application Time, Interrupt Overhead, and Deterministic Wait Overhead.]
Related Work
• DMP – Deterministic Multiprocessing
– Hardware design that provides Strong Determinism
• StreamIt Language
– Streaming programming model only allows one interleaving of interthread communication
• Cilk Language
– Fork/join programming model that can produce programs with
semantics that always match a deterministic “serialization” of the
code
– Cannot be used with locks
– Must be data-race free (can be checked with a Cilk race detector)
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
– Joint work with Jason Ansel, Cy Chan, Yee Lok
Wong, Qin Zhao, and Alan Edelman
• Conquering the Multicore Menace
48
Observation 1: Algorithmic Choice
• For many problems there are multiple algorithms
– Most cases there is no single winner
– An algorithm will be the best performing for a given:
– Input size
– Amount of parallelism
– Communication bandwidth / synchronization cost
– Data layout
– Data itself (sparse data, convergence criteria etc.)
• Multicores expose many of these to the programmer
– Exponential growth of cores (impact of Moore’s law)
– Wide variation of memory systems, type of cores etc.
• No single algorithm can be the best for all the cases
49
Observation 2: Natural Parallelism
• World is a parallel place
– It is natural to many, e.g. mathematicians
– ∑, sets, simultaneous equations, etc.
• It seems that computer scientists have a hard time
thinking in parallel
– We have unnecessarily imposed sequential ordering on the world
– Statements executed in sequence
– for i= 1 to n
– Recursive decomposition (given f(n) find f(n+1))
• This was useful at one time to limit the complexity….
But a big problem in the era of multicores
50
Observation 3: Autotuning
• Good old days → model-based optimization
• Now
– Machines are too complex to accurately model
– Compiler passes have many subtle interactions
– Thousands of knobs and billions of choices
• But…
– Computers are cheap
– We can do end-to-end execution of multiple runs
– Then use machine learning to find the best choice
51
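The end-to-end idea can be sketched in a few lines: evaluate each candidate configuration, measure it, keep the winner. This is an illustrative stand-in (our names, and a synthetic cost function in place of real timed runs), not the PetaBricks tuner, which uses a hybrid genetic search:

```c
#include <assert.h>

/* Illustrative autotuning loop: try each candidate end-to-end and keep
 * the cheapest. cost() stands in for "run the program with this
 * configuration and time it". */
typedef double (*cost_fn)(int config);

int autotune(const int *candidates, int n, cost_fn cost) {
    int best = candidates[0];
    double best_cost = cost(best);
    for (int i = 1; i < n; i++) {
        double c = cost(candidates[i]);
        if (c < best_cost) {
            best_cost = c;
            best = candidates[i];
        }
    }
    return best;
}

/* Synthetic cost surface with its minimum at config = 32. */
double fake_cost(int config) {
    double d = (double)(config - 32);
    return d * d;
}
```

Exhaustive search like this only works for small spaces; with thousands of knobs, a learned or genetic search over the same measure-and-compare loop takes its place.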
PetaBricks Language
• Implicitly parallel description

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
    // Base case, compute a single element
    to(AB.cell(x,y) out)
    from(A.row(y) a, B.column(x) b) {
        out = dot(a, b);
    }
}

[Diagram: row y of A (c wide, h tall) times column x of B (w wide, c tall) produces cell (x,y) of AB (w wide, h tall)]
52
PetaBricks Language
• Implicitly parallel description
• Algorithmic choice

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
    // Base case, compute a single element
    to(AB.cell(x,y) out)
    from(A.row(y) a, B.column(x) b) {
        out = dot(a, b);
    }

    // Recursively decompose in c
    to(AB ab)
    from(A.region(0, 0, c/2, h) a1,
         A.region(c/2, 0, c, h) a2,
         B.region(0, 0, w, c/2) b1,
         B.region(0, c/2, w, c) b2) {
        ab = MatrixAdd(MatrixMultiply(a1, b1),
                       MatrixMultiply(a2, b2));
    }
}

[Diagram: A split into a1 | a2, B split into b1 over b2]
53
PetaBricks Language

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
    // Base case, compute a single element
    to(AB.cell(x,y) out)
    from(A.row(y) a, B.column(x) b) {
        out = dot(a, b);
    }

    // Recursively decompose in c
    to(AB ab)
    from(A.region(0, 0, c/2, h) a1,
         A.region(c/2, 0, c, h) a2,
         B.region(0, 0, w, c/2) b1,
         B.region(0, c/2, w, c) b2) {
        ab = MatrixAdd(MatrixMultiply(a1, b1),
                       MatrixMultiply(a2, b2));
    }

    // Recursively decompose in w
    to(AB.region(0, 0, w/2, h) ab1,
       AB.region(w/2, 0, w, h) ab2)
    from(A a,
         B.region(0, 0, w/2, c) b1,
         B.region(w/2, 0, w, c) b2) {
        ab1 = MatrixMultiply(a, b1);
        ab2 = MatrixMultiply(a, b2);
    }
}

[Diagram: AB split into ab1 | ab2, B split into b1 | b2]
54
PetaBricks Language

transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
    // Base case, compute a single element
    to(AB.cell(x,y) out)
    from(A.row(y) a, B.column(x) b) {
        out = dot(a, b);
    }

    // Recursively decompose in w
    to(AB.region(0, 0, w/2, h) ab1,
       AB.region(w/2, 0, w, h) ab2)
    from(A a,
         B.region(0, 0, w/2, c) b1,
         B.region(w/2, 0, w, c) b2) {
        ab1 = MatrixMultiply(a, b1);
        ab2 = MatrixMultiply(a, b2);
    }

    // Recursively decompose in c
    to(AB ab)
    from(A.region(0, 0, c/2, h) a1,
         A.region(c/2, 0, c, h) a2,
         B.region(0, 0, w, c/2) b1,
         B.region(0, c/2, w, c) b2) {
        ab = MatrixAdd(MatrixMultiply(a1, b1),
                       MatrixMultiply(a2, b2));
    }

    // Recursively decompose in h
    to(AB.region(0, 0, w, h/2) ab1,
       AB.region(0, h/2, w, h) ab2)
    from(A.region(0, 0, c, h/2) a1,
         A.region(0, h/2, c, h) a2,
         B b) {
        ab1 = MatrixMultiply(a1, b);
        ab2 = MatrixMultiply(a2, b);
    }
}

55
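The algebraic fact the transform relies on can be checked in plain C (an illustrative sketch of ours, not PetaBricks output): computing the left and right halves of AB separately, as in the "decompose in w" rule, gives the same result as the direct triple loop.

```c
#include <assert.h>
#include <string.h>

#define H 4
#define W 4
#define C 4

/* Direct triple loop: AB[y][x] = dot(row y of A, column x of B). */
void mm_base(const int A[H][C], const int B[C][W], int AB[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            AB[y][x] = 0;
            for (int k = 0; k < C; k++)
                AB[y][x] += A[y][k] * B[k][x];
        }
}

/* Compute only columns [x0, x1) of AB. Calling this for the left and
 * right halves mirrors the "decompose in w" rule: the two halves are
 * independent, so they could run in parallel. */
void mm_wslice(const int A[H][C], const int B[C][W], int AB[H][W],
               int x0, int x1) {
    for (int y = 0; y < H; y++)
        for (int x = x0; x < x1; x++) {
            AB[y][x] = 0;
            for (int k = 0; k < C; k++)
                AB[y][x] += A[y][k] * B[k][x];
        }
}
```

The decompositions in c and h work the same way; which one wins depends on matrix shape, cache sizes, and core count, which is why the choice is left to the autotuner.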
PetaBricks Compiler Internals
[Diagram: compiler passes split PetaBricks source code into rule/transform headers and rule bodies. The headers become ChoiceGrids and a Choice Dependency Graph; the rule bodies become IR. Code generation emits C++: parallel dynamically scheduled code plus sequential leaf code.]
Choice Grids
transform RollingSum from A[n] to B[n] {
    Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
    Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
}

[Choice grid: A is the input over [0, n). For B, cell 0 can only use Rule2; cells [1, n) can use Rule1 or Rule2.]
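The two RollingSum rules can be written out in plain C for illustration (our rendering, not generated code). Rule 1 reuses the previous output cell, so it is O(n) total but inherently sequential; Rule 2 recomputes each cell from A alone, O(n²) total, but every cell is independent and can run in parallel:

```c
#include <assert.h>

/* Rule 1: B[i] depends on B[i-1] -- cheap, but sequential. */
void rolling_sum_rule1(const int *A, int *B, int n) {
    for (int i = 0; i < n; i++)
        B[i] = (i > 0 ? B[i - 1] : 0) + A[i];
}

/* Rule 2: B[i] depends only on A[0..i] -- more work, but every
 * cell is independent, so all cells can be computed in parallel. */
void rolling_sum_rule2(const int *A, int *B, int n) {
    for (int i = 0; i < n; i++) {
        int sum = 0;
        for (int j = 0; j <= i; j++)
            sum += A[j];
        B[i] = sum;
    }
}
```

Both rules produce the same B; the compiler is free to pick either per region, which is exactly what the choice grid records.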
Choice Dependency Graph
transform RollingSum from A[n] to B[n] {
    Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
    Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
}

[Graph: edges from the input to the “Rule2” and “Rule1 or Rule2” regions of B, annotated with dependency directions: (r1, <), (r1, =), (r2, <=), (r2, =), (r1, =, -1)]
PetaBricks Autotuning
[Diagram: the same compiler pipeline (headers → ChoiceGrids and Choice Dependency Graph, rule bodies → IR, code generation → C++), with an Autotuner in the loop: it runs the compiled program and user code on the parallel runtime engine and searches over the choice configuration file.]
PetaBricks Execution
[Diagram: at execution time the compiled program, the user code, and the selected choice configuration file run on the parallel runtime engine; the choice dependency graph is pruned down to the chosen rules.]
Experimental Setup
• Test System
– Dual-quad core (8 cores)
– Xeon X5460 @ 3.16GHz w/ 8GB RAM
– CSAIL Debian 4.0 (etch), kernel 2.6.18
• Training
– Using our hybrid genetic tuner
– Trained using all 8 cores
– Training times varied from ~1 min to ~1 hour
Sort
[Chart: time (0–0.010 s) vs. input size (0–2000) for Insertion Sort, Quick Sort, Merge Sort, and Radix Sort]
62

Sort
[Chart: same comparison, with the Autotuned algorithm added]
63
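What an autotuned sort typically looks like in practice is a hybrid: a recursive sort that switches to insertion sort below a tuned cutoff. This is an illustrative sketch of ours (the cutoff value is a placeholder an autotuner would set per machine), not PetaBricks output:

```c
#include <string.h>

#define CUTOFF 32  /* placeholder; a real autotuner picks this per machine */

static void insertion_sort(int *a, int n) {
    for (int i = 1; i < n; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

/* Merge sort that makes an algorithmic choice at every level:
 * below CUTOFF elements, insertion sort wins; above it, divide
 * and merge. tmp must hold at least n ints. */
void merge_sort(int *a, int *tmp, int n) {
    if (n <= CUTOFF) { insertion_sort(a, n); return; }
    int mid = n / 2;
    merge_sort(a, tmp, mid);
    merge_sort(a + mid, tmp, n - mid);
    int i = 0, j = mid, k = 0;
    while (i < mid && j < n) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < n)   tmp[k++] = a[j++];
    memcpy(a, tmp, (size_t)n * sizeof(int));
}
```

The charts above show why the cutoff matters: insertion sort is fastest for small inputs, the divide-and-conquer sorts for large ones, and the crossover point shifts from machine to machine.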
Eigenvector Solve
[Chart: time (0–0.05 s) vs. matrix size (0–500) for Bisection, DC, and QR]
64

Eigenvector Solve
[Chart: same comparison, with the Autotuned algorithm added]
65
Poisson
[Chart, log scale: time (1.5e-05 to 256 s) vs. matrix size (3 to 2049) for Direct, Jacobi, SOR, and Multigrid]
66

Poisson
[Chart: same comparison, with the Autotuned algorithm added]
67
Scalability
[Chart: speedup (0–8) vs. number of cores (1–8) for MM, Sort, Poisson, and Eigenvector Solve]
68
Impact of Autotuning
• Custom hybrid genetic tuner
• Huge gains by training on the target architecture:

                                   Trained On:
Run On:                            SunFire T200 Niagara (8 cores)   Xeon E7340 (8 cores)
SunFire T200 Niagara (8 cores)     1.00x                            0.72x
Xeon E7340 (8 cores)               0.43x                            1.00x
Xeon E7340 (1 core)                0.30x
Related Work
• SPARSITY, OSKI – Sparse Matrices
• ATLAS, FLAME – Linear Algebra
• FFTW
• STAPL – Template Framework Library
• SPL – Digital signal processing
• High level optimization via automated statistical
modeling. (Eric Brewer)
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace
71
Conquering the Menace
• Parallelism Extraction
– The world is parallel,
but most computer science is based in sequential thinking
– Parallel Languages
– Natural way to describe the maximal concurrency in the problem
– Parallel Thinking
– Theory, Algorithms, Data Structures → Education
• Parallelism Management
– Mapping algorithmic parallelism to a given architecture
– New hardware support
– Easier to enforce correctness
– Reduce the cost of bad decisions
– A Universal Parallel Compiler
72
Hardware Opportunities
• Don’t have to contend with uniprocessors
• Not your same old multiprocessor problem
– How does going from Multiprocessors to Multicores
impact programs?
– What changed?
– Where is the Impact?
– Communication Bandwidth
– Communication Latency
73
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed?
– Number of Wires
– Clock rate
– Multiplexing
– ~10,000X increase: from 32 Giga bits/sec to ~300 Tera bits/sec
• Impact on programming model?
– Massive data exchange is possible
– Data movement is not the bottleneck
→ processor affinity not that important
74
Parallel Language Opportunities
• We need a lot more innovation!
Languages that…..
– require no non-intuitive reorganization of data or code.
– make the programmer focus on concurrency, but not performance;
off-load the parallelism and performance issues to the compiler
(akin to ILP compilation to VLIW machines)
– eliminate hard problems such as race conditions and deadlocks
(akin to the elimination of memory bugs in Java)
– inform the programmer if they have done something illegal
(akin to a type system or runtime null-pointer checks)
– take advantage of domains to reduce the parallelization burden
(akin to the StreamIt language for the streaming domain)
– use novel hardware to eliminate problems & help the
programmer (akin to cache coherence hardware)
75
Compilation Opportunities
• Universal Parallel Compiler: GCC for Uniprocessors
– Easily portable to any uniprocessor
– Able to obtain respectable performance
→ Single program (in C) runs on all uniprocessors
• MultiCompiler: Universal Compiler for Parallel Systems
– Language exposes maximal parallelism → Compiler manages it
– Unlike uniprocessors, many single decisions are performance critical
– Candidates: Don’t bind a single decision, keep multiple tracks
– Learning: Learn and improve heuristics
– Adaptation: Dynamically choose candidates and adapt the program to
resources and runtime conditions
76
Conclusions
• Kendo
– The first system to efficiently provide weak determinism on commodity hardware
– Provides a systematic method of reproducing many non-deterministic bugs
– Incurs modest performance overhead when running on 4 processors
– This low overhead makes it possible to leave it on while an application is deployed
• PetaBricks
– First language where micro-level algorithmic choice can be naturally expressed
– Autotuning can find the best choice
– Can switch between choices as the solution is constructed
• Switching to multicores without losing the gains in programmer productivity may be the Grandest of the Grand Challenges
– Half a century of work, still no winning solution
– Will affect everyone!
– A lot more work to do to solve this problem!!!
77
Download: http://groups.csail.mit.edu/commit/