Heap Shape Scalability
Scalable Garbage Collection on Highly Parallel Platforms
Kathy Barabash, Erez Petrank
Computer Science Department, Technion, Israel
ISMM 2010

Outline
  Is tracing GC ready for the many-core?
  Evaluating the heap shape scalability
    Idealized Trace Utilization
  Improving the heap shape scalability
    How is the heap shape related?
    Solution 1: Reshaping with Shortcut References
    Solution 2: Tracing with Speculative Roots
  Related work & conclusion

Is Tracing GC Ready for Many-core?
  GC tracing
    Traverse lots of objects
  Sequential trace
    Each live object is touched (BFS, DFS)
  Parallel trace
    Load balancing
  1K cores really soon
  [Figure: roots pointing into a heap object graph with objects a to m]

Can Heaps Spoil the Scalability?
  4M live objects
    Singly linked list
  Sequential trace
    4M steps
  Parallel trace
    Not any faster
  [Figure: roots pointing to a linked list of heap objects 1, 2, 3, ..., 4M]

Deep Object Graphs Can be Evil
  Definition: Object Depth
    Length of the minimal path from some root object
  Object-Graph Depth
    Maximal live object depth
  Example: Object Depths
    [Figure: heap objects labeled with depths 0, 1, 2, 3]
  How deep are object graphs of Java programs?
    SpecJVM, DaCapo, SpecJBB
    Instrumented BFS trace

Object-Graph Depths of Java Benchmarks

  Suite    Name   Description                Heap Size (MB)  GC Cycles  Max Depth
  SpecJVM  javac  Java compiler run 3 times  32              15         1,234
  SpecJVM  mtrt   3D raytracer               32              8          1,416
  DaCapo   bloat  Java byte code analyzer    48              344        1,195
  DaCapo   pmd    Java code analyzer         48              59         18,482
  DaCapo   xalan  Transforms XML into HTML   128             129        8,476

  Other 15 benchmarks: 128

Not all Deep Object Graphs are Evil
  Object graph
    1K same-sized linked lists of 4K objects
  Sequential trace
    4M steps
  Parallel trace
    4K steps
    Scales well for up to 1K processors
  [Figure: roots pointing to parallel linked lists, each running from object 1 to object 4K]

Deep and Narrow Object Graphs are Evil
  Definition: Object Depths Distribution
    Number of objects at different depths
  Example: Graphical Representation (object-graph shape)
    [Figure: example heap next to its histogram of # objects against depth]

Object-Graph Shapes of Java Benchmarks
  [Figure: # objects against depth for jython and xalan]

Object-Graph Shapes of Java Benchmarks
  [Figure: # objects (log 10) against depth (log 10) for db, jython, jess, bloat, jack, javac, lusearch, mtrt, hsqldb, xalan, antlr, pmd]

The Idealized Trace Utilization
  Simulate the idealized traversal by N threads
    Perfect load balancing
    Perfect cache behavior
    BFS traversal
    Single time tick object scan
  During the traversal, count
    Objects available to be scanned at every time tick
    Processor slots: some are busy and some are wasted
  At the end, report the utilization (ITU)
    \[ \mathrm{ITU} = \frac{\text{Total Scanned Objects}}{\text{Total Processor Slots}} \times 100\% \]

Idealized Trace Utilization Example
  4 tracers (Core 1 to Core 4)

  Time ticks                    1  2  3   4   5   6   7   8
  Scanned objects (cumulative)  2  5  9  11  12  13  14  15

  \[ \mathrm{ITU} = \frac{\text{Total Scanned Objects}}{\text{Total Processor Slots}} \times 100\% = \frac{15}{8 \times 4} \times 100\% \approx 47\% \]

Graphical Representation
  1. Simulate and compute
  2. Draw the graph
  [Figure: object-graph shape (# objects against depth) next to utilization (0 to 100) against processors (1, 2, 4, 8)]

Worst Case ITU for Java Benchmarks
  [Figure: utilization (0 to 100) against processors (1 to 1024, powers of two) for check, compress, db, jack, javac, jess, mpegaudio, mtrt, antlr, bloat, hsqldb, jython, lusearch, pmd, xalan]

Average ITU for Java Benchmarks
  [Figure: utilization (0 to 100) against processors (1 to 1024, powers of two) for the same benchmarks]

What's Next?
  Problematic heaps exist
    javac, mtrt, pmd, bloat, xalan
  Can we improve the trace scalability without modifying the benchmarks?
    Reshape with Shortcut References
    Trace with Speculative Roots

Reshape with Shortcut References
  New references are added
    Invisible to the program
    Useful for the tracers
  Sequential trace
    16K steps
  Parallel trace
    4K steps
    Scales for 4 processors
  [Figure: roots pointing to a linked list of 16K heap objects, with shortcut references into the list]

Evaluation Prototype
  Devise a shortcut strategy
    Where shortcuts are needed
  When the program is stopped for GC
    Compute the Idealized Trace Utilization
    Run the shortcut-adding algorithm
    Compute the ITU for the modified heap
  Report
    ITU improvement
    Number of shortcuts added

Shortcut Strategy and Parameters
  Identify candidate subgraphs
    With at least size objects
    With depth-to-size ratio no less than ratio
  Add shortcut to the root of the subgraph
    Leading to the objects length pointers away
    Next shortcut introduced not closer than distance pointers away
  [Figure: example subgraph of objects 1 to 9 with Size=5, Depth=4, Ratio=0.8, Distance (2), Length (4)]

Results for SpecJVM mtrt
  [Figure: utilization (0 to 100) against processors (1 to 1024); series: Worst before, Worst after, Avg before, Avg after]
  Size=50, Ratio=0.2, Length=50, Distance=25
  ~500K live objects
  Max shortcuts: 110
  Avg shortcuts: 94

Results for DaCapo xalan
  [Figure: utilization (0 to 100) against processors (1 to 1024); series: Worst before, Worst after, Avg before, Avg after]
  Size=50, Ratio=0.2, Length=50, Distance=25
  ~400K live objects
  Max shortcuts: 888
  Avg shortcuts: 536

Results for DaCapo bloat
  [Figure: utilization (0 to 100) against processors (1 to 1024); series: Worst before, Worst after, Avg before, Avg after]
  Size=50, Ratio=0.2, Length=50, Distance=25
  ~400K live objects
  Max shortcuts: 940
  Avg shortcuts: 378

Results for DaCapo pmd
  [Figure: utilization (0 to 100) against processors (1 to 1024); series: Worst before, Worst after, Avg before, Avg after]
  Size=600, Ratio=0.1, Length=120, Distance=40
  ~434K live objects
  Max shortcuts: 5,874
  Avg shortcuts: 432

Results for SpecJVM javac
  [Figure: utilization (0 to 100) against processors (1 to 1024); series: Worst before, Worst after, Avg before, Avg after]
  Size=500, Ratio=0.1, Length=100, Distance=50
  ~383K live objects
  Max shortcuts: 292
  Avg shortcuts: 16

Trace with Speculative Roots
  Sequential trace
    16M steps
  Helper tracers
    Pick random roots
    Trace using custom colors
  Parallel trace
    Scales for 4 processors
  [Figure: roots pointing to a long linked list in the heap, with labels 4M and 4K]

Speculative Trace
  Helper tracer
    Pick up the root
    Pick up the color, e.g. red
    Trace; if a blue object is discovered, mark blue as reachable from red
  Regular trace
    Trace from root; if a blue object is discovered, mark blue as live
  Complete trace
    All colors reachable from live colors are marked live
    All objects marked by live colors survive the collection

Evaluation Prototype
  4 regular tracers, 4 helper tracers
  Speculative roots: random unmarked objects
  ITU before and after the colored trace
  [Figure: heap object graph with objects a to m, colored by tracer]
    Useful helpers' work: live objects colored by live colors
    Wasted helpers' work: dead objects colored by dead colors
    Floating garbage: dead objects colored by live colors

Limit the floating garbage
  Maximal number of objects colored by a single color
    Helpers must save discovered but not traced objects
    Trace completion phase takes care of the saved fronts
Make the random roots choices smarter
  To avoid choosing dead objects
  To reach deeper parts of the live object graph
  Filter for the recursive objects
    Objects with referents of their own type

Results
  Lots of floating garbage
    Even with the filter
  Hard to find good roots
    Progressively harder as the live objects get marked
  Trace completion phase is complex
    Can defeat the purpose
  Modest improvement in the Idealized Trace Utilization scores

Results for DaCapo xalan
  Worst case ITU improvement, with the random choices filter
  [Figure: utilization (0 to 100) against processors (1 to 1024); series: Before, After]

Results for DaCapo bloat
  Worst case ITU improvement, with the random choices filter
  [Figure: utilization (0 to 100) against processors (1 to 1024); series: Before, After]

Related Work
  Parallel Garbage Collection Folklore
    There are heap structures that can foil any clever load balancing scheme
  Siebert (ISMM'08)
    Reported object graph depths for SpecJVM benchmarks
    Proposed an upper bound on the worst case scalability as a way to compute RT guarantees for the GC tracing
  Random tracing originally proposed by Click

Summary
  Studied the heap shape properties of Java benchmarks
    Out of twenty considered benchmarks, five had non-scalable heap shapes during the run
  Devised a measure to quantify the heap shape scalability
    Idealized Trace Utilization
  Proposed, prototyped and evaluated two approaches to improve the tracing scalability
    Reshaping with Shortcuts appears to be more promising than Tracing from Speculative Roots

Thank You!
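
Supplementary sketch: Object Depth and Idealized Trace Utilization

The two measures used throughout the talk (object depth and ITU) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' instrumented tracer: the function names `object_depths` and `idealized_trace_utilization`, the dictionary heap representation, and the FIFO scheduling of ready objects are all hypothetical choices made for illustration. The shortcut references in the usage example are hand-picked and do not follow the size/ratio/length/distance strategy from the slides.

```python
from collections import deque


def object_depths(heap, roots):
    """Return {object: depth}, where depth is the length of the
    shortest reference path from any root (computed by BFS)."""
    depth = {r: 0 for r in roots}
    queue = deque(depth)  # iterating the dict de-duplicates the roots
    while queue:
        obj = queue.popleft()
        for ref in heap.get(obj, ()):
            if ref not in depth:
                depth[ref] = depth[obj] + 1
                queue.append(ref)
    return depth


def idealized_trace_utilization(heap, roots, processors):
    """Simulate an idealized BFS trace: every tick, up to `processors`
    ready objects are scanned (one tick per object, perfect balancing).
    Returns scanned objects / processor slots, as a percentage."""
    marked = set(roots)
    ready = deque(dict.fromkeys(roots))  # assumed FIFO order of ready objects
    scanned = ticks = 0
    while ready:
        ticks += 1
        batch = [ready.popleft() for _ in range(min(processors, len(ready)))]
        scanned += len(batch)
        for obj in batch:
            for ref in heap.get(obj, ()):
                if ref not in marked:
                    marked.add(ref)
                    ready.append(ref)
    if ticks == 0:
        return 0.0  # nothing to trace
    return 100.0 * scanned / (ticks * processors)


if __name__ == "__main__":
    # A 16-object linked list: deep and narrow, the "evil" shape.
    chain = {i: [i + 1] for i in range(15)}
    chain[15] = []
    print(idealized_trace_utilization(chain, [0], 4))  # 25.0

    # Same list with three hand-picked shortcut references from object 0.
    reshaped = dict(chain)
    reshaped[0] = [1, 4, 8, 12]
    print(idealized_trace_utilization(reshaped, [0], 4))  # 80.0
```

In this toy heap the list alone keeps three of four tracers idle on every tick, while the three extra references let all four tracers work on separate segments, which mirrors the intuition behind reshaping with shortcut references.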