Galois Performance
Mario Mendez-Lojo, Donald Nguyen

Overview
• The Galois system is a test bed for exploring optimizations
  – Safe but not fast out of the box
• Important optimizations
  – Select the least transactional overhead
  – Select the right scheduling
  – Select an appropriate data structure
• Quantify optimizations on applications

Algorithms
• Irregular algorithms, classified by
  – Topology: general graph, grid, tree
  – Operator: morph, local computation, reader
  – Ordering: unordered, ordered
• Applications
  1. Barnes-Hut
  2. Delaunay Mesh Refinement
  3. Preflow-push

Methodology
[Figure: time vs. threads, broken down into Serial, Idle, GC, Compute]
• Abort ratio: aborted iterations / total iterations
• GC options
  – UseParallelGC
  – UseParallelOldGC
  – NewRatio=1

Terms
• Base
  – Default scheduling, default graph
• Serial
  – Galois classes replaced by classes without concurrency control
• Speedup
  – Relative to the best mean performance of a serial variant
• Throughput
  – # serial iterations / time

Numbers
• Runtime
  – Last of 5 runs in the same VM
  – Ignores time to read and construct the initial graph
• Other statistics
  – Last of 5 runs

Test Environment
• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20 GB heap size

BARNES-HUT
[Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field]

Barnes-Hut
• N-body algorithm
  – Oct-tree acceleration structure
  – Serial: tree build, center of mass, particle update
  – Parallel: force computation
• Structure
  – Reader on tree
• Variants
  – Splash2, Reader Galois

Reader Optimization
Before:
    child = octree.getNeighbor(nn, 1);
After:
    child = octree.getNeighbor(nn, 1, MethodFlag.NONE);

ParaMeter Profile
[Figure]

Barnes-Hut Results
• Best serial: base
• Serial time: 10271 ms
• Best parallel time: 1553 ms
• Best speedup: 6.6X
• 100,000 points, 1 time step

Barnes-Hut Scalability
[Figure]

DELAUNAY MESH REFINEMENT

Delaunay Mesh Refinement
• Refine “bad” triangles
  – Maintained in worklist
• Structure
  – Cautious operator on graph
• Variants
  – Flag optimized, locallifo
base:       Priority.defaultOrder()
local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

Cautious Optimization
• No need to save undo info
• Only check conflicts up to first write
Before:
    mesh.contains(item);
    ...
    mesh.remove(preNodes.get(i));
    ...
    mesh.add(node);
After:
    mesh.contains(item, MethodFlag.CHECK_CONFLICT);
    ...
    mesh.remove(preNodes.get(i), MethodFlag.NONE);
    ...
    mesh.add(node, MethodFlag.NONE);

LIFO Optimization
Before:
    GaloisRuntime.foreach( ..., Priority.defaultOrder());
After:
    GaloisRuntime.foreach( ..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));

ParaMeter Profile
[Figure]

DMR Results
• Best serial: locallifo.flagopt
• Serial time: 17002 ms
• Best parallel time: 3745 ms
• Best speedup: 4.5X
• 0.5M triangles, 0.25M bad triangles

PREFLOW-PUSH

Preflow-push
• Max-flow algorithm
  – Nodes push flow downhill
• Structure
  – Cautious, local computation
• Variants
  – Flag optimized, local computation graph
base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)
base (relabel):   Priority.first(ChunkedFIFO.class, 8)

Local Computation Optimization
Before:
    graph = ...
After:
    graph = ...
    b = new LocalComputationGraph.ObjectGraphBuilder();
    graph = b.from(graph).create()

ParaMeter Profile
[Figure]

Preflow-push Results
• C: 11450 ms
• Java: 30234 ms
• Best serial: lc.flagopt
• Serial time: 57121 ms
• Best parallel time: 18242 ms
• Best speedup: 3.1X
• From challenge problem (genmf-wide)
  – 14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
  – http://avglab.com/andrew/CATS/maxflow_synthetic.htm

Preflow-push Scalability
[Figure]

What performance did we expect?
[Figure: time vs. threads, with labels Measured, Indirectly, Error, //Compute, Serial, GC, Idle, Misspeculation, Synchronization, …]

What performance did we expect?
• Naïve: $r(x) = t_1 / x$
• Amdahl: $r(x) = t_p / x + t_s$
  – $t_1 = t_p + t_s$
  – $t_s = t_{\mathrm{idle}} + t_{\mathrm{gc}} + t_{\mathrm{serial}}$
• Simple: $r(x) = \left(t_p \, (i_x / i_1)\right) / x + t_s$

Barnes-Hut
[Figure]

Delaunay Mesh Refinement
[Figure]

Preflow-push
[Figure]

Summary
• Many profitable optimizations
  – Selecting among method flags, worklists, graph variants
• Open topics
  – Automation
  – Static, dynamic and performance analysis
  – Efficient ordered algorithms
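
Appendix: runtime models as code
The three models from the "What performance did we expect?" slide (Naïve, Amdahl, Simple) can be written down as a minimal Java sketch. The class and method names are illustrative and not part of Galois. The slides do not define i_x and i_1; the sketch assumes they are the number of iterations executed with x threads and with 1 thread, so that i_x / i_1 scales the parallel work to account for extra (for example, aborted) iterations. The numbers in main are hypothetical and are not measurements from these slides.

```java
// Sketch of the runtime models r(x) for x threads.
// Names are illustrative; nothing here comes from the Galois code base.
final class RuntimeModel {
    private RuntimeModel() {}

    /** Serial part: ts = tidle + tgc + tserial. */
    static double serialPart(double tIdle, double tGc, double tSerial) {
        return tIdle + tGc + tSerial;
    }

    /** Naive model: r(x) = t1 / x. */
    static double naive(double t1, int threads) {
        return t1 / threads;
    }

    /** Amdahl model: r(x) = tp / x + ts, where t1 = tp + ts. */
    static double amdahl(double tp, double ts, int threads) {
        return tp / threads + ts;
    }

    /**
     * Simple model: r(x) = (tp * (ix / i1)) / x + ts.
     * ASSUMPTION: ix and i1 are iteration counts with x threads and 1 thread.
     */
    static double simple(double tp, double ts, long ix, long i1, int threads) {
        double scaledParallelWork = tp * ((double) ix / i1);
        return scaledParallelWork / threads + ts;
    }

    public static void main(String[] args) {
        // Hypothetical single-thread profile in ms (not from the slides).
        double ts = serialPart(100.0, 300.0, 600.0);
        double tp = 9000.0;
        double t1 = tp + ts;

        // Hypothetical iteration counts: 10% more iterations with 8 threads.
        long i1 = 100000L;
        long i8 = 110000L;
        int threads = 8;

        System.out.println("naive:  " + naive(t1, threads) + " ms");
        System.out.println("amdahl: " + amdahl(tp, ts, threads) + " ms");
        System.out.println("simple: " + simple(tp, ts, i8, i1, threads) + " ms");
    }
}
```

When i_x equals i_1 the Simple model reduces to the Amdahl model, so any gap between the two predictions comes only from the change in iteration count.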