Galois Performance
Mario Mendez-Lojo
Donald Nguyen
Overview
• The Galois system is a test bed for exploring optimizations
– Safe but not fast out of the box
• Important optimizations
– Select the lowest transactional overhead
– Select the right scheduling
– Select an appropriate data structure
• Quantify optimizations on applications
Algorithms
• Irregular algorithms, classified by
– Topology: general graph, grid, tree
– Operator: morph, local computation, reader
– Ordering: unordered, ordered
• Applications
1. Barnes-Hut
2. Delaunay Mesh Refinement
3. Preflow-push
Methodology
[Figure: time vs. threads, split into Serial, Idle, GC, and Compute]
• Abort ratio: aborted iterations / total iterations
• GC options
– UseParallelGC
– UseParallelOldGC
– NewRatio=1
Terms
• Base
– Default scheduling, Default graph
• Serial
– Galois classes replaced by classes without concurrency control
• Speedup
– Relative to the best mean performance of a serial variant
• Throughput
– Number of serial iterations / time
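The three ratios used in this deck (abort ratio, speedup, throughput) can be sketched as plain arithmetic. This is a minimal illustration only: the class and method names are hypothetical and not part of Galois, and the sample timings are the Barnes-Hut numbers reported later in the deck.

```java
// Illustrative helper; class and method names are hypothetical, not Galois API.
final class RunMetrics {
    /** Abort ratio: aborted iterations divided by total iterations. */
    static double abortRatio(long abortedIterations, long totalIterations) {
        if (totalIterations <= 0) {
            throw new IllegalArgumentException("totalIterations must be positive");
        }
        return (double) abortedIterations / totalIterations;
    }

    /** Speedup: best serial time divided by parallel time. */
    static double speedup(double bestSerialMillis, double parallelMillis) {
        return bestSerialMillis / parallelMillis;
    }

    /** Throughput: serial iteration count divided by elapsed time. */
    static double throughput(long serialIterations, double elapsedMillis) {
        return serialIterations / elapsedMillis;
    }

    public static void main(String[] args) {
        // Barnes-Hut timings from the results slide: 10271 ms serial, 1553 ms parallel.
        System.out.println("speedup = " + speedup(10271, 1553));
    }
}
```

With the Barnes-Hut timings, the ratio rounds to the 6.6X quoted on the results slide.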
Numbers
• Runtime
– Last of 5 runs in same VM
– Ignore time to read and construct initial graph
• Other statistics
– Last of 5 runs
Test Environment
• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20GB heap size
BARNES-HUT
[Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field]
Barnes-Hut
• N-body algorithm
– Oct-tree acceleration structure
– Serial
• Tree build, center of mass, particle update
– Parallel
• Force computation
• Structure
– Reader on tree
• Variants
– Splash2, Reader Galois
Reader Optimization
base:
child = octree.getNeighbor(nn, 1);
optimized:
child = octree.getNeighbor(nn, 1, MethodFlag.NONE);
ParaMeter Profile
Barnes-Hut Results
Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X
100,000 points, 1 time step
Barnes-Hut Scalability
DELAUNAY MESH REFINEMENT
Delaunay Mesh Refinement
• Refine “bad” triangles
– Maintained in worklist
• Structure
– Cautious operator on graph
• Variants
– Flag optimized, locallifo
base:
Priority.defaultOrder()
local lifo:
Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)
Cautious Optimization
• No need to save undo info
• Only check conflicts up to first write
base:
mesh.contains(item);
...
mesh.remove(preNodes.get(i));
...
mesh.add(node);
flag optimized:
mesh.contains(item, MethodFlag.CHECK_CONFLICT);
...
mesh.remove(preNodes.get(i), MethodFlag.NONE);
...
mesh.add(node, MethodFlag.NONE);
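A cautious operator reads its whole neighborhood before its first write, so if it aborts at all it aborts before changing anything. The toy model below counts the bookkeeping that this saves. It is a sketch under stated assumptions, not Galois code: `Flag` and `TrackedSet` are hypothetical stand-ins, and the model assumes that a call without a flag both checks conflicts and saves undo information.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: Flag and TrackedSet are hypothetical stand-ins, not Galois classes.
final class MethodFlagSketch {
    enum Flag { ALL, CHECK_CONFLICT, NONE }

    /** A set that counts the bookkeeping each call would trigger under a given flag. */
    static final class TrackedSet<T> {
        private final Set<T> items = new HashSet<>();
        int conflictChecks;
        int undoRecords;

        boolean contains(T item, Flag flag) {
            account(flag, false);
            return items.contains(item);
        }

        boolean add(T item, Flag flag) {
            account(flag, true);
            return items.add(item);
        }

        boolean remove(T item, Flag flag) {
            account(flag, true);
            return items.remove(item);
        }

        private void account(Flag flag, boolean isWrite) {
            if (flag != Flag.NONE) conflictChecks++;        // ALL and CHECK_CONFLICT both check
            if (flag == Flag.ALL && isWrite) undoRecords++; // only ALL logs undo information
        }
    }

    /** Runs one read-then-write operator and returns {conflictChecks, undoRecords}. */
    static int[] runOperator(Flag readFlag, Flag writeFlag) {
        TrackedSet<String> mesh = new TrackedSet<>();
        mesh.add("old", Flag.NONE);            // set-up, adds nothing to the counters
        if (mesh.contains("old", readFlag)) {  // read phase comes first
            mesh.remove("old", writeFlag);     // first write: every read is already done
            mesh.add("new", writeFlag);
        }
        return new int[] { mesh.conflictChecks, mesh.undoRecords };
    }

    public static void main(String[] args) {
        int[] base = runOperator(Flag.ALL, Flag.ALL);
        int[] optimized = runOperator(Flag.CHECK_CONFLICT, Flag.NONE);
        System.out.println("base: checks=" + base[0] + ", undo=" + base[1]);
        System.out.println("optimized: checks=" + optimized[0] + ", undo=" + optimized[1]);
    }
}
```

In this model the unflagged operator performs three conflict checks and two undo records, while the flagged version performs one conflict check and no undo records.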
LIFO Optimization
base:
GaloisRuntime.foreach(
...,
Priority.defaultOrder());
local lifo:
GaloisRuntime.foreach(
...,
Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
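The effect of a locally LIFO policy on processing order can be illustrated with a single-threaded toy worklist. This is a sketch, not the Galois scheduler: the names are hypothetical, chunking is omitted, and the plain FIFO used for contrast is an assumption, since the deck does not say what `Priority.defaultOrder()` does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

// Single-threaded toy; names are hypothetical and chunking is omitted.
final class LocalLifoSketch {
    /**
     * Processes items until no work is left and returns the processing order.
     * With localLifo, work created by the operator goes on a local stack that is
     * drained before the next initial item; otherwise it joins the FIFO tail.
     */
    static <T> List<T> run(List<T> initial, Function<T, List<T>> operator, boolean localLifo) {
        Deque<T> global = new ArrayDeque<>(initial); // initial work, taken in FIFO order
        Deque<T> local = new ArrayDeque<>();         // newly created work, taken in LIFO order
        List<T> order = new ArrayList<>();
        while (!global.isEmpty() || !local.isEmpty()) {
            T item = local.isEmpty() ? global.pollFirst() : local.pollFirst();
            order.add(item);
            for (T created : operator.apply(item)) {
                if (localLifo) {
                    local.addFirst(created);  // newest work is processed next
                } else {
                    global.addLast(created);  // newest work waits behind older work
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical operator: items ending in 0 or 1 create one follow-up item.
        Function<Integer, List<Integer>> refine = n -> {
            List<Integer> created = new ArrayList<>();
            if (n % 10 < 2) created.add(n + 1);
            return created;
        };
        System.out.println("fifo:       " + run(Arrays.asList(10, 20), refine, false));
        System.out.println("local lifo: " + run(Arrays.asList(10, 20), refine, true));
    }
}
```

Starting from items 10 and 20, the FIFO order interleaves the two chains (10, 20, 11, 21, 12, 22), while the local LIFO order finishes one chain before starting the next (10, 11, 12, 20, 21, 22). For mesh refinement this presumably keeps freshly created triangles close to the thread that produced them.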
ParaMeter Profile
DMR Results
Best serial: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X
0.5M triangles, 0.25M bad triangles
PREFLOW-PUSH
Preflow-push
• Max-flow algorithm
– Nodes push flow downhill
• Structure
– Cautious, local computation
• Variants
– Flag optimized, local computation graph
base (discharge):
Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel):
Priority.first(ChunkedFIFO.class, 8)
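"Nodes push flow downhill" means a node with excess flow may push only to a neighbor exactly one height level below it, and is relabeled (raised) when no such neighbor exists. A minimal serial sketch of that idea follows. It is not the Galois implementation: it uses one FIFO worklist instead of the bucketed schedule above, a dense capacity matrix instead of a graph class, and hypothetical names throughout.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Serial sketch with a single FIFO worklist; not the Galois implementation.
final class PreflowPushSketch {
    /** Returns the maximum flow from source to sink; cap[u][v] is the capacity of edge u -> v. */
    static long maxFlow(long[][] cap, int source, int sink) {
        int n = cap.length;
        long[][] residual = new long[n][];
        for (int u = 0; u < n; u++) residual[u] = cap[u].clone();
        int[] height = new int[n];
        long[] excess = new long[n];
        Deque<Integer> active = new ArrayDeque<>(); // nodes that still hold excess flow

        // Initial preflow: lift the source and saturate every edge leaving it.
        height[source] = n;
        for (int v = 0; v < n; v++) {
            long c = residual[source][v];
            if (v == source || c == 0) continue;
            residual[source][v] = 0;
            residual[v][source] += c;
            excess[v] += c;
            if (v != sink) active.addLast(v);
        }

        while (!active.isEmpty()) {
            int u = active.pollFirst();
            while (excess[u] > 0) { // discharge u completely
                for (int v = 0; v < n && excess[u] > 0; v++) {
                    // Push only downhill: v must be exactly one level below u.
                    if (residual[u][v] > 0 && height[u] == height[v] + 1) {
                        long delta = Math.min(excess[u], residual[u][v]);
                        residual[u][v] -= delta;
                        residual[v][u] += delta;
                        excess[u] -= delta;
                        if (excess[v] == 0 && v != source && v != sink) active.addLast(v);
                        excess[v] += delta;
                    }
                }
                if (excess[u] == 0) break;
                // Relabel: raise u just above its lowest residual neighbor.
                int lowest = Integer.MAX_VALUE;
                for (int v = 0; v < n; v++) {
                    if (residual[u][v] > 0) lowest = Math.min(lowest, height[v]);
                }
                if (lowest == Integer.MAX_VALUE) break; // defensive: no residual edge left
                height[u] = lowest + 1;
            }
        }
        return excess[sink];
    }
}
```

For example, `maxFlow(new long[][] {{0, 5}, {0, 0}}, 0, 1)` returns 5, the capacity of the only edge.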
Local Computation Optimization
base:
graph = ...
local computation:
graph = ...
b = new LocalComputationGraph.ObjectGraphBuilder();
graph = b.from(graph).create();
ParaMeter Profile
Preflow-push Results
C: 11450 ms
Java: 30234 ms
Best serial: lc.flagopt
Serial time: 57121 ms
Best parallel time: 18242 ms
Best speedup: 3.1X
From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm
Preflow-push Scalability
What performance did we expect?
[Figure: time vs. threads. Labels: Measured, Indirectly, Error. Components: parallel compute, serial, GC, idle, mis-speculation, synchronization, …]
What performance did we expect?
• Naïve:
$r(x) = t_1 / x$
• Amdahl:
$r(x) = t_p / x + t_s$
$t_1 = t_p + t_s$
$t_s = t_{idle} + t_{gc} + t_{serial}$
• Simple:
$r(x) = (t_p \cdot (i_x / i_1)) / x + t_s$
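The three models differ only in how the single-thread time is split and scaled, which a few lines of arithmetic make concrete. This is a sketch with hypothetical class and method names and made-up inputs. It also assumes that ix and i1 are the iteration counts with x threads and with one thread, which the slide does not state.

```java
// Expected runtime r(x) on x threads under the three models; names are hypothetical.
final class ExpectedRuntime {
    /** Naive: r(x) = t1 / x, all single-thread time parallelizes perfectly. */
    static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl: r(x) = tp / x + ts, only the parallel part tp scales. */
    static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /** Simple: r(x) = (tp * (ix / i1)) / x + ts, tp grows with the iteration ratio. */
    static double simple(double tp, double ts, double ix, double i1, int x) {
        return (tp * (ix / i1)) / x + ts;
    }

    public static void main(String[] args) {
        // Made-up inputs for illustration: t1 = 1000 ms, of which ts = 200 ms is serial.
        double ts = 200, tp = 800;
        int x = 8;
        System.out.println("naive:  " + naive(tp + ts, x));
        System.out.println("amdahl: " + amdahl(tp, ts, x));
        System.out.println("simple: " + simple(tp, ts, 125, 100, x));
    }
}
```

With these inputs the estimates are 125 ms, 300 ms, and 325 ms, so each refinement predicts a longer runtime than the one before it.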
Barnes-Hut
Delaunay Mesh Refinement
Preflow-push
Summary
• Many profitable optimizations
– Selecting among method flags, worklists, graph variants
• Open topics
– Automation
– Static, dynamic and performance analysis
– Efficient ordered algorithms