Parallel Programming with Object Assemblies
Swarat Chaudhuri (Penn State)
Roberto Lublinerman (Penn State)
Pavol Cerny (IST Austria)
Taming parallelism
• Task parallelism
• Message passing
• Data parallelism:
  – Highly coarse-grained (MapReduce)
  – Highly fine-grained (numeric computations on dense arrays)
  – Problem-specific methods
Taming parallelism
Our target:
Data-parallel computations over large, unstructured, shared-memory graphs
Unknown granularity
High-level correctness as well as efficiency.
Delaunay mesh refinement
• Triangulate a given set of points.
• Delaunay property: no point is contained within the circumcircle of a triangle.
• Quality property: no bad triangles, i.e., triangles with an angle > 120°.
• Mesh refinement: fix bad triangles through an iterative algorithm.
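The two properties above can be checked with purely local geometry. The following is a minimal sketch, assuming points are stored as {x, y} arrays and triangles are non-degenerate; the class and method names (DelaunayChecks, inCircumcircle, isBad) are hypothetical and are not taken from the talk or from Chorus.

```java
// Hypothetical helper class; the names are illustrative and do not come from the talk.
final class DelaunayChecks {
    // Twice the signed area of triangle (a, b, c); positive when a, b, c are counter-clockwise.
    static double orient(double[] a, double[] b, double[] c) {
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]);
    }

    // True if p lies strictly inside the circumcircle of triangle (a, b, c).
    // Uses the standard 3x3 "in-circle" determinant, translated so that p is the origin.
    static boolean inCircumcircle(double[] a, double[] b, double[] c, double[] p) {
        double adx = a[0] - p[0], ady = a[1] - p[1];
        double bdx = b[0] - p[0], bdy = b[1] - p[1];
        double cdx = c[0] - p[0], cdy = c[1] - p[1];
        double det = (adx * adx + ady * ady) * (bdx * cdy - cdx * bdy)
                   - (bdx * bdx + bdy * bdy) * (adx * cdy - cdx * ady)
                   + (cdx * cdx + cdy * cdy) * (adx * bdy - bdx * ady);
        // The sign of the determinant flips with the orientation of the triangle.
        return orient(a, b, c) > 0 ? det > 0 : det < 0;
    }

    // Interior angle at vertex p, in degrees, between the edges p->q and p->r.
    static double angleAt(double[] p, double[] q, double[] r) {
        double ux = q[0] - p[0], uy = q[1] - p[1];
        double vx = r[0] - p[0], vy = r[1] - p[1];
        return Math.toDegrees(Math.atan2(Math.abs(ux * vy - uy * vx), ux * vx + uy * vy));
    }

    // Quality check from the slide: a triangle is bad if some angle exceeds 120 degrees.
    static boolean isBad(double[] a, double[] b, double[] c) {
        double largest = Math.max(angleAt(a, b, c),
                         Math.max(angleAt(b, c, a), angleAt(c, a, b)));
        return largest > 120.0;
    }
}
```

Both checks read only the vertices of one triangle (plus one candidate point), which is why the later slides can describe them as local checks.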
Retriangulation
Cavity: all triangles whose circumcircle contains the new point.
The quality constraint may not hold for all new triangles.
Sequential mesh refinement
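Mesh m = /* read input mesh */;
Worklist wl = new Worklist(m.getBad());   // all initially bad triangles
foreach triangle t in wl {
    Cavity c = new Cavity(t);
    c.expand();                           // grow the cavity around t
    c.retriangulate();                    // replace the cavity with new triangles
    m.updateMesh(c);
    wl.add(c.getBad());                   // new triangles may themselves be bad
}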
• Cavities are contiguous “regions” in the mesh.
• Worst-case cavities can encompass the whole mesh.
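Because a cavity is a contiguous region, the c.expand() step can be pictured as a graph search outward from the bad triangle. The sketch below is one plausible way to do that, not the talk's implementation: triangles are plain integer ids, the adjacency map and the predicate circumcircleContainsNewPoint are hypothetical stand-ins for the mesh and the geometric test, and the start triangle is assumed to belong to the cavity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntPredicate;

// Hypothetical sketch of cavity expansion as a breadth-first search over triangle adjacency.
final class CavitySketch {
    // Returns the connected set of triangles, reachable from start through adjacency,
    // whose circumcircle contains the new point (as decided by the supplied predicate).
    static Set<Integer> expand(int start,
                               Map<Integer, List<Integer>> neighbors,
                               IntPredicate circumcircleContainsNewPoint) {
        Set<Integer> cavity = new LinkedHashSet<>();
        Deque<Integer> frontier = new ArrayDeque<>();
        cavity.add(start);          // assumption: the start triangle is part of the cavity
        frontier.add(start);
        while (!frontier.isEmpty()) {
            int t = frontier.remove();
            for (int n : neighbors.getOrDefault(t, List.of())) {
                // Only grow into neighbors that the new point would invalidate.
                if (!cavity.contains(n) && circumcircleContainsNewPoint.test(n)) {
                    cavity.add(n);
                    frontier.add(n);
                }
            }
        }
        return cavity;
    }
}
```

The search touches only the cavity and its immediate border, which is why typical cavities stay small even though the worst case can cover the whole mesh.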
Parallelization
• Computation over complex, unstructured graphs:
  Mesh = heap-allocated graph. Nodes = triangles. Edges = adjacency.
• Atomicity: cavities must be retriangulated atomically.
• Non-overlapping cavities can be processed in parallel.
• Seems impossible to handle with static analysis:
  – The shape of the data structure changes greatly over time.
  – The shape of the data structure is highly input-dependent.
  – Without deep algorithmic knowledge, it is impossible to say statically whether cavities will overlap.
• Lots of recent work, notably by Pingali et al.
List of similar applications
• Delaunay mesh refinement, Delaunay triangulation
• Agglomerative clustering, ray tracing
• Social network maintenance
• Minimum spanning tree, maximum flow
• N-body simulation, epidemiological simulation
• Sparse matrix-vector multiplication, sparse Cholesky factorization
• Belief propagation, survey propagation in Bayesian inference
• Iterative dataflow analysis, Petri net simulation
• Finite-difference PDE solution
Locality of updates in Chorus
• On a mesh of ~100,000 triangles from the Lonestar benchmarks:
  – Average cavity size = 3.75 triangles.
  – Maximum cavity size = 12 triangles.
• Average-case locality is the essence of parallelism.
• Chorus: parallel computation driven by “neighborhoods” in heaps.
Heaps, regions, assemblies
• Heap = directed graph. Nodes = objects. Labeled edges = pointers.
• Region = induced subgraph.
• Assembly = region + thread of control. Typically speculative and short-lived.
Programs, assembly classes
• Assembly class = set of local variables + set of guarded updates + constructor + public variables.
• Program = set of classes
• Synchronization happens in guard evaluation.
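• Guarded update syntax: :: Guard: Update
[Figure: assembly state diagram with the states "busy" (executing an update), "ready" (to be preempted or to execute the next update), and "terminated".]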
Guards can merge assemblies
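:: merge (u.f): S
:: merge (u.f) when g: S
• g is a condition on the local variables and owned objects of the merging assembly.
• The merging assembly gets a bigger region and keeps its local state.
• The target assembly (the owner of the object that u.f points to) dies.
• The target assembly must be in the ready state while the merge happens.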
Updates can split an assembly
split(T)
• Splits the assembly into assemblies of class T.
• Other assemblies are not affected.
• Not a synchronization construct.
Local updates
• Attempts to access objects outside the region lead to exceptions.
x = u.f;
x.f = y;
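The ownership rules behind merge, split, and local updates can be modeled sequentially in a few lines. This is only an illustrative model, not Chorus or JChorus code: the class RegionModel and everything in it are hypothetical, there is no thread of control or guard evaluation, and split is assumed here to produce one single-object assembly per owned object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sequential model of regions and ownership (not the Chorus runtime).
final class RegionModel {
    static final class Node {
        final String name;
        final Map<String, Node> fields = new HashMap<>(); // labeled edges = pointers
        Assembly owner;                                    // exactly one owner: no aliasing of ownership
        Node(String name) { this.name = name; }
    }

    static final class Assembly {
        final Set<Node> region = new HashSet<>();
        boolean alive = true;

        Assembly(Node... nodes) {
            for (Node n : nodes) { region.add(n); n.owner = this; }
        }

        // x = u.f : reading a field is allowed only on an owned object.
        Node read(Node u, String f) { checkOwned(u); return u.fields.get(f); }

        // x.f = y : writing a field is allowed only on an owned object.
        void write(Node x, String f, Node y) { checkOwned(x); x.fields.put(f, y); }

        // merge(u.f): absorb the assembly owning the target of u.f; that assembly dies.
        void merge(Node u, String f) {
            Node target = read(u, f);
            if (target == null) throw new IllegalArgumentException("no edge " + f);
            Assembly other = target.owner;
            if (other == this) return;
            for (Node n : other.region) { n.owner = this; region.add(n); }
            other.region.clear();
            other.alive = false;
        }

        // split: this assembly dies; assumed here to yield one assembly per owned object.
        List<Assembly> split() {
            List<Assembly> parts = new ArrayList<>();
            for (Node n : region) parts.add(new Assembly(n));
            region.clear();
            alive = false;
            return parts;
        }

        private void checkOwned(Node n) {
            if (n.owner != this) throw new IllegalStateException(n.name + " is outside the region");
        }
    }
}
```

In this model the slide's two statements behave as described: x = u.f succeeds when u is owned, while x.f = y throws until a merge along u.f has brought x into the region.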
Delaunay mesh refinement
• Use two assembly classes: Triangle and Cavity.
  – Cavity = local region in mesh.
• Each triangle:
  – Determines if it is bad (local check).
  – If so, merges with neighbors to become a cavity.
• Each cavity:
  – Determines if it is complete (local check).
  – If not, merges with a neighbor.
  – If so, retriangulates (locally) and splits.
Delaunay mesh refinement: sketch
assembly Triangle:: ...
  action::
    merge (v.f, Cavity) when isBad:
      skip
assembly Cavity:: ...
  action::
    merge (v.f) when (not isComplete):
      ...
    isComplete:
      retriangulate();
      split(Triangle)
Delaunay mesh refinement: sketch
assem Triangle:: ...
  action::
    merge (v.f, Cavity, u) when bad?:
      skip
assem Cavity:: ...
  action::
    merge (v.f) when (not complete?):
      skip
    complete?:
      retriangulate();
      split(Triangle)
What happens on a conflict?
• Cavity i is “absorbed” by cavity j.
• Cavity j now has some “unnecessary” triangles.
• j will later split.
Boruvka’s algorithm for minimum spanning tree
• Assembly = spanning tree
• Initially, each assembly has one node.
• As the algorithm progresses, trees merge.
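For reference, here is a minimal sequential sketch of Boruvka's algorithm in which each union-find set plays the role of one assembly (a partial spanning tree) and each union corresponds to a merge. It is not Chorus code; the class BoruvkaSketch, the {u, v, weight} edge encoding, and the index-based tie-breaking are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical sequential sketch: union-find sets stand in for assemblies (spanning trees).
final class BoruvkaSketch {
    // Finds the representative of x's set, halving paths along the way.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Edge i is preferred over edge j if it is lighter; ties are broken by index
    // so that all trees agree on one total order of edges.
    static boolean better(int[][] edges, int i, int j) {
        return edges[i][2] < edges[j][2] || (edges[i][2] == edges[j][2] && i < j);
    }

    // edges[i] = {u, v, weight}. Returns the weight of a minimum spanning tree
    // (or of a minimum spanning forest if the graph is disconnected).
    static long mstWeight(int n, int[][] edges) {
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;   // initially, one node per tree
        long total = 0;
        boolean merged = true;
        while (merged) {
            merged = false;
            // Step 1: every tree picks its cheapest outgoing edge.
            int[] cheapest = new int[n];
            Arrays.fill(cheapest, -1);
            for (int i = 0; i < edges.length; i++) {
                int ru = find(parent, edges[i][0]);
                int rv = find(parent, edges[i][1]);
                if (ru == rv) continue;              // edge is internal to one tree
                if (cheapest[ru] == -1 || better(edges, i, cheapest[ru])) cheapest[ru] = i;
                if (cheapest[rv] == -1 || better(edges, i, cheapest[rv])) cheapest[rv] = i;
            }
            // Step 2: trees merge along the chosen edges.
            for (int r = 0; r < n; r++) {
                int i = cheapest[r];
                if (i == -1) continue;
                int ru = find(parent, edges[i][0]);
                int rv = find(parent, edges[i][1]);
                if (ru == rv) continue;              // already merged earlier in this round
                parent[ru] = rv;
                total += edges[i][2];
                merged = true;
            }
        }
        return total;
    }
}
```

Step 1 needs only information local to one tree and its outgoing edges, and step 2 is a pairwise merge, which is what makes the algorithm a natural fit for assemblies.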
Race-freedom
• No aliasing, only ownership transfer.
• An assembly can merge with another assembly only when the latter is not in the middle of an update.
Deadlock-freedom
• Classic definition: process P waits for a resource from Q and vice versa.
• Deadlock in Chorus:
  – Assembly A has a locally enabled merge with assembly B.
  – B has a locally enabled merge with A.
  – No other progress is possible.
• But one of the merges can always be carried out. (An assembly can always be killed at its ready state.)
JChorus
• Chorus + sequential Java.
• Assembly classes in addition to object classes.
assembly Cavity {
  action { // expand cavity
    merge(outgoingedges, TriangleObject t):
      { outgoingedges.remove(t);
        frontier.add(t);
        build(); }
  }
  Set members; Set border;
  Queue frontier; // current frontier
  List outgoingedges; // outgoing edges on which to merge
  TriangleObject initial;
  ...
Division-based implementation
• Division = set of assemblies mapped to a core.
• Local access:
  – Merge-actions within a division
  – Split-actions
  – Local updates
• Remote access:
  – Merge-actions issued across divisions
• Uses assembly-level locks.
Implementation strategies
• Adaptive divisions: heuristic for reducing the number of remote merges.
  – During a merge, not only the target assembly but also the assemblies reachable by k pointer indirections are migrated.
  – The adaptation heuristic does elementary load balancing.
• Union-find data structure to relate objects to the assemblies they belong to.
  – Needed for splits and merges.
• Token passing for deadlock prevention and termination detection.
Experiments: Delaunay refinement from Lonestar benchmarks
• Large dataset from Lonestar benchmarks.
– 100,364 triangles.
– 47,768 initially bad.
• 1 to 8 threads.
• Competing approaches:
– Object-level locking
– DSTM (Software transactions)
Locality: mesh snapshots
[Figure: the initial mesh and divisions]
[Figure: the mesh after several thousand retriangulations]
[Plot: Delaunay, speedup over sequential]
[Plot: Delaunay, self-relative speedup]
[Plot: Delaunay, conflicts]
Related models
• Threads + explicit locking: global heap abstraction, arbitrary aliasing.
• Software transactions: burden of reasoning passed to the transaction manager. In most implementations, the heap is viewed as global.
• Static data partitioning: the unpredictable nature of the computation makes static analysis hard.
• Actors: based on low-level messaging. Sending references risks races; copying triangles is inefficient.
• Pingali et al.'s Galois: targets the same problem; Chorus is an alternative approach.
More information
Roberto Lublinerman, Swarat Chaudhuri, Pavol Cerny. Parallel programming with object assemblies. OOPSLA 2009.
http://www.cse.psu.edu/~swarat/chorus