Optimizing Code for Memory Hierarchies

Towards a Science of Parallel Programming
Keshav Pingali
University of Texas, Austin
1
Problem Statement
• Community has worked on parallel programming for more than 30 years
– programming models
– machine models
– programming languages
– ….
• However, parallel programming is still a research problem
– matrix computations, stencil computations, FFTs etc. are well-understood
– each new application is a “new phenomenon”
• few insights for irregular applications
• Thesis: we need a science of parallel programming
– analysis: framework for thinking about parallelism in application
– synthesis: produce an efficient parallel implementation of application
[Figure: “The Alchemist”, Cornelius Bega (1663)]
2
Analogy: science of electro-magnetism
Seemingly unrelated phenomena
Unifying abstractions
Specialized models that exploit structure
3
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
– Stencil codes
– Delaunay mesh refinement
– Event-driven simulation
– Graph reduction of functional languages
– …
• Unifying abstractions
– Operator formulation of algorithms
– Amorphous data-parallelism
– Galois programming model
– Baseline parallel implementation
• Specialized implementations that exploit structure
– Structure of algorithms
– Optimized compiler and runtime system support for different kinds of structure
• Ongoing work
4
Some parallel algorithms
5
Examples
Application/domain         Algorithm
Meshing                    Generation/refinement/partitioning
Compilers                  Iterative and elimination-based dataflow algorithms
Functional interpreters    Graph reduction, static and dynamic dataflow
Maxflow                    Preflow-push, augmenting paths
Minimal spanning trees     Prim, Kruskal, Boruvka
Event-driven simulation    Chandy-Misra-Bryant, Jefferson Timewarp
AI                         Message-passing algorithms
Stencil computations       Jacobi, Gauss-Seidel, red-black ordering
Sparse linear solvers      Sparse MVM, sparse Cholesky factorization
6
Stencil computation: Jacobi iteration
• Finite-difference method for solving PDEs
– discrete representation of domain: grid
• Values at interior points are updated using values at neighbors
– values at boundary points are fixed
• Data structure:
– dense arrays
• Parallelism:
– values at all interior points can be computed simultaneously
– parallelism is not dependent on input values
• Compiler can find the parallelism
– spatial loops are DO-ALL loops
[Figure: 5-point stencil, point (i,j) with neighbors (i-1,j), (i+1,j), (i,j-1), (i,j+1)]

//Jacobi iteration with 5-point stencil
//initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1]x[2,n-1]
    temp(i,j)=0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
  for <i,j> in [2,n-1]x[2,n-1]
    A(i,j) = temp(i,j)
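The Jacobi pseudocode on this slide can be turned into a small runnable sketch. The version below is an illustration only: it uses plain Python lists instead of dense arrays, 0-based indices instead of the slide's 1-based ones, and a hypothetical 4x4 grid whose top boundary is held at 1.

```python
def jacobi(A, nsteps):
    """Run nsteps Jacobi sweeps on a square grid A; boundary values stay fixed."""
    n = len(A)
    for _ in range(nsteps):
        # Each interior point reads only the old grid A and writes only temp,
        # so the two spatial loops form a DO-ALL loop nest.
        temp = [row[:] for row in A]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                temp[i][j] = 0.25 * (A[i - 1][j] + A[i + 1][j]
                                     + A[i][j - 1] + A[i][j + 1])
        A = temp
    return A


# Hypothetical input: 4x4 grid, top boundary fixed at 1, everything else 0.
grid = [[0.0] * 4 for _ in range(4)]
grid[0] = [1.0] * 4
result = jacobi(grid, 1)
```

Because every update reads the previous time step only, no iteration of the spatial loops depends on another, which is the property a compiler exploits.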
7
Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:
  while there are bad triangles do {
    pick a bad triangle;
    find its cavity;
    retriangulate cavity; // may create new bad triangles
  }
• Don’t-care non-determinism:
– final mesh depends on order in which bad triangles are processed
– applications do not care which mesh is produced
• Data structure:
– graph in which nodes represent triangles and edges represent triangle adjacencies
• Parallelism:
– bad triangles with cavities that do not overlap can be processed in parallel
– parallelism is very “input-dependent”
• compilers cannot determine this parallelism
– (Miller et al.) at runtime, repeatedly build interference graph and find maximal independent sets for parallel execution

Mesh m = /* read in mesh */
WorkList wl;
wl.add(m.badTriangles());
while ( !wl.empty() ) {
  Element e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e); //determine new cavity
  c.expand();
  c.retriangulate(); //re-triangulate region
  m.update(c); //update mesh
  wl.add(c.badTriangles());
}
8
Event-driven simulation
• Stations communicate by sending messages with time-stamps on FIFO channels
• Stations have internal state that is updated when a message is processed
• Messages must be processed in time-order at each station
• Data structure:
– Messages in event-queue, sorted in time-order
• Parallelism:
– conservative: Chandy-Misra-Bryant
• station fires when it has messages on all incoming edges and processes earliest message
• requires null messages to avoid deadlock
– optimistic: Jefferson time-warp
• station can fire when it has an incoming message on any edge
• requires roll-back if speculative conflict is detected
[Figure: stations A and B with messages time-stamped 2, 4, 5, 6]
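As a point of reference for the parallel schemes above, here is a minimal sequential sketch of event-driven simulation with a time-ordered event queue. It is neither Chandy-Misra-Bryant nor Time Warp; the `simulate` function, the handler signature, and the two example stations are assumptions made for illustration.

```python
import heapq


def simulate(initial_events, handlers, horizon):
    """Process (time, station, payload) events in time-stamp order.

    handlers maps a station name to a function (time, payload) -> list of
    new events; this interface is hypothetical, not taken from the talk.
    """
    queue = list(initial_events)
    heapq.heapify(queue)  # event queue sorted in time-order
    trace = []
    while queue:
        time, station, payload = heapq.heappop(queue)
        if time > horizon:
            break
        trace.append((time, station, payload))
        # Processing a message may create new messages dynamically.
        for event in handlers[station](time, payload):
            heapq.heappush(queue, event)
    return trace


def forward_to_b(time, payload):
    # Station A forwards each message to B with a delay of 2 time units.
    return [(time + 2, "B", payload)]


def absorb(time, payload):
    # Station B consumes messages and sends nothing.
    return []


handlers = {"A": forward_to_b, "B": absorb}
trace = simulate([(2, "A", "x"), (4, "B", "y"), (5, "A", "z")], handlers, horizon=10)
```

The message that A sends at time 2 arrives at B at time 4, so work that did not exist at the start must still be ordered against the events already in the queue. This is what makes the ordered case harder to parallelize than Jacobi or DMR.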
9
Remarks on algorithms
• Diverse algorithms and data structures
• Exploiting parallelism in irregular algorithms is very complex
– Miller et al. DMR implementation: interference graph + maximal
independent sets
– Jefferson Timewarp algorithm for event-driven simulation
• Algorithms:
– parallelism can be very input-dependent
• DMR, event-driven simulation, graph reduction,….
– don’t-care non-determinism
• has nothing to do with concurrency
• DMR, graph reduction
– activities created dynamically may interfere with existing activities
• event-driven simulation…
• Data structures:
– relatively few algorithms use dense arrays
– more common: graphs, trees, lists, priority queues,…
10
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
– Stencil codes
– Delaunay mesh refinement
– Event-driven simulation
– Graph reduction of functional languages
– ………
• Unifying abstractions
– Amorphous data-parallelism
– Baseline parallel implementation for exploiting amorphous data-parallelism
• Specialized implementations that exploit structure
– Structure of algorithms
– Optimized compiler and runtime system support for different kinds of structure
• Ongoing work
11
Requirements
• Provide a model of parallelism in irregular
algorithms
• Unified treatment of parallelism in regular and
irregular algorithms
– parallelism in regular algorithms must emerge as a
special case of general model
– (cf.) correspondence principles in Physics
• Abstractions should be effective
– should be possible to write an interpreter to execute
algorithms in parallel
12
Traditional abstraction
• Computation graph
– nodes are computations
– edges are dependences
• Parallelism
– width of the computation graph
• Effective parallel computation graph model
– dataflow model of Dennis, Arvind et al.
• Inadequate for irregular applications
– dependences between computations are a function of input data
– don’t-care non-determinism
– conflicting work may be created dynamically
– …
• Data structures play almost no role in this abstraction
– in most programs, parallelism comes from data-parallelism (concurrent operations on data structure elements)
• New abstraction
– data-centric: data structures play a central role
– we will use graph ADT to illustrate concepts
13
Operator formulation of algorithms
• Algorithm = repeated application of operator to graph
– active element:
• node or edge where operator is applied
– Jacobi: interior nodes of mesh
– DMR: nodes representing bad triangles
– Event-driven simulation: station with incoming message
– neighborhood:
• set of nodes and edges read/written to perform computation
– Jacobi: nodes in stencil
– DMR: cavity of bad triangle
– Event-driven simulation: station
• usually distinct from neighbors in graph
– ordering:
• order in which active elements must be executed in a sequential implementation
– any order (Jacobi, DMR, graph reduction)
– some problem-dependent order (event-driven simulation)
[Figure: graph with active nodes i1 to i5 and their neighborhoods; legend: active node, neighborhood]
14
Parallelism
• Amorphous data-parallelism:
– parallelism in processing active nodes subject to
• neighborhood constraints
• ordering constraints
• Computations at two active elements are independent if
– Neighborhoods do not overlap
– More generally, neither of them writes to an element in the intersection of the neighborhoods
• Unordered active elements
– In principle, independent active elements can be processed in parallel
– How do we find independent active elements?
• Ordered active elements
– Independence is not enough since elements can become active dynamically (see example)
– How do we determine what is safe to execute in parallel?
• How do we make this model effective?
[Figures: graph with active nodes i1 to i5; event-driven simulation example with stations A, B, C and messages time-stamped 2, 3, 4, 5, 6]
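The independence condition on this slide can be written down directly. The sketch below assumes each activity is described by explicit read and write sets; the `Activity` class and the element numbering are hypothetical and are not part of the Galois system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical record of what one operator application touches."""
    name: str
    reads: frozenset
    writes: frozenset

    @property
    def neighborhood(self):
        # Neighborhood = every element read or written.
        return self.reads | self.writes


def independent(a, b):
    """True if neither activity writes an element in the shared region."""
    shared = a.neighborhood & b.neighborhood
    return not (shared & a.writes) and not (shared & b.writes)


a = Activity("a", reads=frozenset({1, 2}), writes=frozenset({2}))
b = Activity("b", reads=frozenset({1, 3}), writes=frozenset({3}))
c = Activity("c", reads=frozenset({2}), writes=frozenset({4}))
```

Here `a` and `b` overlap on element 1 but only read it, so they are independent under the more general rule, while `a` and `c` conflict because `a` writes element 2, which `c` reads.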
15
Galois programming model (PLDI 2007)
• Program written in terms of abstractions in model
• Programming model: sequential, OO
• Graph class: provided by Galois library
– specialized versions to exploit structure (see later)
• Galois set iterators: for iterating over unordered and ordered sets of active elements
– for each e in Set S do B(e)
• evaluate B(e) for each element in set S
• no a priori order on iterations
• set S may get new elements during execution
– for each e in OrderedSet S do B(e)
• evaluate B(e) for each element in set S
• perform iterations in order specified by OrderedSet
• set S may get new elements during execution

DMR using Galois iterators:
Mesh m = /* read in mesh */
Set ws;
ws.add(m.badTriangles()); // initialize ws
for each tr in Set ws do { //unordered Set iterator
  if (tr no longer in mesh) continue;
  Cavity c = new Cavity(tr);
  c.expand();
  c.retriangulate();
  m.update(c);
  ws.add(c.badTriangles()); //bad triangles
}
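The sequential semantics of the unordered set iterator can be imitated in a few lines. This is a stand-in for illustration, not the Galois library API: `for_each_unordered` is a hypothetical helper, and a toy "split until size 1" body replaces the DMR operator.

```python
import random


def for_each_unordered(workset, body, seed=None):
    """Apply body to every element; body returns any newly created elements."""
    rng = random.Random(seed)
    pending = list(workset)
    while pending:
        # No a priori order on iterations: pick an arbitrary pending element.
        item = pending.pop(rng.randrange(len(pending)))
        # The set may get new elements during execution.
        pending.extend(body(item))


leaves = []


def split(length):
    # Toy operator: a piece longer than 1 is "bad" and is split in two.
    if length <= 1:
        leaves.append(length)
        return []
    return [length // 2, length - length // 2]


for_each_unordered([8, 4], split, seed=0)
```

Whatever order the iterator picks, the loop ends with 12 unit pieces. Here the final result happens to be identical for every order; in DMR the final mesh can differ, and the application accepts any of the possible outcomes.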
16
Galois parallel execution model
• Parallel execution model:
– shared-memory
– optimistic execution of Galois iterators
• Implementation:
– master thread begins execution of program
– when it encounters iterator, worker threads help by executing iterations concurrently
– barrier synchronization at end of iterator
• Independence of neighborhoods:
– software TM variety
– logical locks on nodes and edges
• Ordering constraints for ordered set iterator:
– execute iterations out of order but commit in order
– cf. out-of-order CPUs
[Figure: program (main() containing a for each iterator) run by a master thread; threads process active nodes i1 to i5 in shared memory]
17
Parameter tool (PPoPP 2009)
• Idealized execution model:
– unbounded number of processors
– applying operator at an active node takes one time
step
– execute a maximal set of active nodes, subject to
neighborhood and ordering constraints
• Measures amorphous data-parallelism in
irregular program execution
• Useful as an analysis tool
18
Example: DMR
• Input mesh:
– Produced by Triangle (Shewchuk)
– 550K triangles
– Roughly half are badly
shaped
• Available parallelism:
– How many non-conflicting
triangles can be expanded
at each time step?
• Parallelism intensity:
– What fraction of the total
number of bad triangles
can be expanded at each
step?
19
Examples
• Boruvka MST algorithm
– Builds MST bottom-up
– Unordered active elements
• Agglomerative clustering (AC)
– Data-mining algorithm
– Ordered active elements
• Similarity in parallelism profiles arises from similarity in algorithmic structure
[Figures: parallelism profiles. Boruvka: 10K node graph, avg degree 5. AC: 20K random points in 2D]
20
Summary
• Old abstraction: computation graphs
• New abstraction: operator formulation of algorithms
– active elements
– neighborhoods
– ordering of active elements
• Amorphous data-parallelism
– generalizes conventional data-parallelism
• Baseline execution model
– Galois programming model
• sequential, OO
• uses new abstractions
– optimistic parallel execution
• Parameter tool
– provides estimates of amorphous data-parallelism in programs written using Galois programming model
[Figure: graph with active nodes i1 to i5]
21
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
– Stencil codes
– Delaunay mesh refinement
– Event-driven simulation
– Graph reduction of functional languages
– ………
• Unifying abstractions
– Operator formulation of algorithms
– Amorphous data-parallelism
– Galois programming model
– Baseline parallel implementation
• Specialized implementations that exploit structure
– Structure of algorithms
– Optimized compiler and runtime system support for different kinds of structure
• Ongoing work
22
Key idea
• Baseline implementation is general but usually inefficient
– (e.g.) dynamic scheduling of iterations is not needed for Jacobi
since grid structure is known at compile-time
– (e.g.) hand-written parallel implementations of DMR and Jacobi
do not buffer updates to neighborhood until commit point
• Efficient execution requires exploiting structure in
algorithms and data structures
• How do we talk about structure in algorithms?
– Previous approaches: like descriptive biology
• Mattson et al. book
• Parallel programming patterns (PPP): Snir et al.
• Berkeley dwarfs
• …
– Our approach: like molecular biology
• based on amorphous data-parallelism framework
23
Algorithm abstractions
iterative algorithms
• topology
– general graph
– grid
– tree
• operator
– morph: modifies structure of graph
– local computation: only updates values on nodes/edges
– reader: does not modify graph in any way
• ordering
– unordered
– ordered
Jacobi: topology: grid, operator: local computation, ordering: unordered
DMR, graph reduction: topology: graph, operator: morph, ordering: unordered
Event-driven simulation: topology: graph, operator: local computation, ordering: ordered
24
Morphs
[Figures: edge contraction (nodes u and v merge into uv); node elimination]
morph operator
• refinement: DMR, Prim MST, Barnes-Hut tree building
• coarsening
– edge contraction: Metis, Kruskal MST, Boruvka MST, AC
– node elimination: sparse Cholesky factorization
– sub-graph elimination: elimination-based dataflow analysis
– ….
• general: graph reduction
• …..
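Edge contraction, one of the morphs named on this slide, can be sketched on an adjacency-set graph. The representation, the `contract_edge` helper, and the small example graph are assumptions for illustration; the merged node keeps the name `u` rather than becoming a new node `uv`.

```python
def contract_edge(adj, u, v):
    """Merge node v into node u in an undirected simple graph.

    adj maps each node to the set of its neighbors and is modified in place.
    """
    if v not in adj[u]:
        raise ValueError("u and v must be adjacent")
    for w in adj.pop(v):        # remove v and walk its former neighbors
        adj[w].discard(v)
        if w != u:              # avoid creating a self-loop on u
            adj[w].add(u)
            adj[u].add(w)
    return adj


# Hypothetical graph: u is adjacent to v and a; v is adjacent to n and m.
graph = {
    "u": {"v", "a"},
    "v": {"u", "n", "m"},
    "a": {"u"},
    "n": {"v"},
    "m": {"v"},
}
contract_edge(graph, "u", "v")
```

The operator changes the structure of the graph rather than only the values stored on it, which is what separates a morph from a local computation.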
25
Reducing Overheads of Optimistic Parallel Execution
26
Graph partitioning (ASPLOS 2008)
• Algorithm structure:
– general graph/grid + unordered active elements
• Optimization I:
– partition the graph/grid and work-set between cores
– data-centric work assignment: core gets active elements from its own partition
• Pros and cons:
– eliminates contention for worklist
– improves locality and can dramatically reduce conflicts
– dynamic load-balancing may be needed
• Optimization II:
– lock coarsening: associate logical locks with partitions, not graph elements
– reduces overhead of lock management
• Over-decomposition may improve core utilization
[Figure: graph partitioned across cores]
27
Zero-copy implementation
• Cautious operator:
– reads all the elements in its neighborhood
before modifying any of them
– (e.g.) Delaunay mesh refinement
• Algorithm structure:
– cautious operator + unordered active
elements
• Optimization: optimistic execution w/o
buffering updates
– grab locks on elements during read phase
• conflict: someone else has lock, so release
your locks
– once update phase begins, no new locks
will be acquired
• update in-place w/o making copies
– note: this is not two-phase locking
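The zero-copy scheme for cautious operators can be illustrated with non-blocking lock acquisition. This is a sketch under stated assumptions, not the Galois runtime: `Node` and `try_apply` are hypothetical names, Python's `threading.Lock` stands in for the logical locks, and the neighborhood is assumed to contain distinct nodes.

```python
import threading


class Node:
    """Graph element carrying a value and a logical lock (illustrative only)."""

    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()


def try_apply(neighborhood, update):
    """Attempt one cautious activity; return True if it ran, False on conflict."""
    held = []
    # Read phase: lock every element of the neighborhood before any write.
    for node in neighborhood:
        if node.lock.acquire(blocking=False):
            held.append(node)
        else:
            # Conflict: someone else holds the lock. Nothing has been
            # modified yet, so releasing our locks is the whole rollback.
            for h in held:
                h.lock.release()
            return False
    try:
        # Update phase: no new locks are acquired, so update in place.
        update(neighborhood)
    finally:
        for h in held:
            h.lock.release()
    return True


def swap(nodes):
    nodes[0].value, nodes[1].value = nodes[1].value, nodes[0].value


a, b = Node(1), Node(2)
b.lock.acquire()                      # simulate another activity owning b
first_attempt = try_apply([a, b], swap)
b.lock.release()
second_attempt = try_apply([a, b], swap)
```

Because the operator is cautious, a conflict can only be detected before the first write, so no buffered copies of the neighborhood are needed to undo a failed activity.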
28
Delaunay mesh refinement
• Algorithm structure:
– general graph/grid + cautious operator + unordered active elements
• Optimizations:
– partitioning + lock-coarsening + zero-buffering
– very efficient implementations possible
• Maverick@TACC
– 128-core Sun Fire E25K 1.05 GHz
– 64 dual-core processors
– Sun Solaris
• Speed-up of 20 on 32 cores for refinement
• Mesh partitioning is still sequential
– time for mesh partitioning starts to dominate after 8 processors (32 partitions)
– Need parallel mesh partitioning
29
Survey propagation on Maverick
• SP is a heuristic for
solving difficult SAT
problems
• SP: general graph +
cautious operator +
unordered elements
• Implementation:
– partitioning
– lock coarsening
– zero-buffering
[Figure: Survey propagation on Maverick (roughly 1500 clauses, 250-500 variables)]
30
Eliminating the Need for Optimistic Parallel Execution
31
Scheduling
• Baseline implementation
– autonomous scheduling: no coordination between execution of different active elements
• Global coordination possible for some algorithms
– Run-time scheduling: cautious operator + unordered active elements
• execute all activities partially to determine neighborhoods
• create interference graph and find independent set of activities
• execute independent set of activities in parallel w/o synchronization
• used in Gary Miller’s implementation of DMR
– Just-in-time scheduling: local computation + structure-driven + cautious, unordered (e.g.) sparse MVM
• Inspector-executor approach
– Compile-time scheduling: previous case + graph is known at compile-time (e.g.) Jacobi
• make all scheduling decisions at compile-time
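The run-time scheduling steps above can be sketched as follows. The code assumes the neighborhoods have already been determined and stay fixed, and that executing an activity creates no new work; real DMR violates both assumptions, which is why the interference graph has to be rebuilt repeatedly. All function names and the four-activity example are hypothetical.

```python
from itertools import combinations


def interference_graph(neighborhoods):
    """Connect two activities whenever their neighborhoods overlap."""
    graph = {a: set() for a in neighborhoods}
    for a, b in combinations(neighborhoods, 2):
        if neighborhoods[a] & neighborhoods[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def independent_set(graph, candidates):
    """Greedy maximal independent set among the candidate activities."""
    chosen, blocked = [], set()
    for a in candidates:
        if a not in blocked:
            chosen.append(a)
            blocked |= graph[a]
    return chosen


def schedule_rounds(neighborhoods):
    """Group activities into rounds; each round can run w/o synchronization."""
    graph = interference_graph(neighborhoods)
    remaining = list(neighborhoods)
    rounds = []
    while remaining:
        batch = independent_set(graph, remaining)
        rounds.append(batch)
        remaining = [a for a in remaining if a not in batch]
    return rounds


# Hypothetical neighborhoods: activity name -> set of graph elements touched.
neighborhoods = {"t1": {1, 2}, "t2": {2, 3}, "t3": {4}, "t4": {3, 4}}
rounds = schedule_rounds(neighborhoods)
```

The greedy pass finds a maximal independent set, not a maximum one, so the number of rounds depends on the order in which activities are considered.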
32
Ongoing work
• Algorithm studies:
– divide-and-conquer algorithms
– transforming ordered algorithms into unordered algorithms
– intra-operator parallelism
• important for some algorithms on dense graphs
– locality
• Language/programming model:
– incorporating scheduling information into Galois
• program refinements?
• Compiler analysis
– analyze and optimize code for operators
• Runtime system
– adaptive control system for managing threads
• Application studies
– Case studies of hand-optimized codes
• understand hand optimizations
• figure out how to incorporate them into system
– Lonestar benchmark suite for irregular programs
• joint work with Calin Cascaval’s group at IBM Yorktown Heights
[Figure: graphs with nodes n1 to n4 and h1 to h4]
33
Acknowledgements
(in alphabetical order)
• Kavita Bala (Cornell)
• Martin Burtscher (UT Austin)
• Patrick Carribault (UT Austin)
• Calin Cascaval (IBM)
• Paul Chew (Cornell)
• Amber Hassaan (UT Austin)
• Tony Ingraffea (Cornell)
• Milind Kulkarni (UT Austin)
• Mario Mendez (UT Austin)
• Rajasekhar Inkulu (UT Austin)
• Donald Nguyen (UT Austin)
• Dimitrios Prountzos (UT Austin)
• Ganesh Ramanarayanan (Microsoft)
• Xin Sui (UT Austin)
• Bruce Walter (Cornell)
• Zifei Zhong (UT Austin)
34
Science of Parallel Programming
[Figure: graph with active nodes i1 to i5; event-driven simulation stations A and B]
Seemingly unrelated algorithms
Unifying abstractions
Specialized models that exploit structure
35