CS380C Lecture 19

• Where are we & where we are going
  – Managed languages
    • Dynamic compilation
    • Inlining
    • Garbage collection
      – Opportunity to improve data locality on-the-fly
      – Other opportunities?
  – Why you need to care about workloads
  – Alias analysis
  – Dependence analysis
  – Loop transformations
  – EDGE architectures

Garbage Collection Advantage: Improving Program Locality
Xianglong Huang (UT), Stephen M Blackburn (ANU), Kathryn S McKinley (UT),
J Eliot B Moss (UMass), Zhenlin Wang (MTU), Perry Cheng (IBM)

Today: Advanced Topics
• Generational Garbage Collection
• Copying objects is an opportunity
• Xianglong Huang (UT), Stephen M Blackburn (ANU), Kathryn S McKinley (UT), J Eliot B Moss (UMass), Zhenlin Wang (MTU), Perry Cheng (IBM), “The Garbage Collection Advantage: Improving Program Locality,” OOPSLA 2004.

Motivation
• Memory gap problem
• OO programs are becoming more popular
• OO programs exacerbate the memory gap problem
  – Automatic memory management
  – Pointer data structures
  – Many small methods
Goal: improve OO program locality

Allocation Mechanisms
• Bump-Pointer
  – Fast (increment & bounds check)
  – Contemporaneous object locality
  – Can't incrementally free & reuse: must free en masse
• Free-List
  – Slightly slower (consult list for fit)
  – Mystery locality
  – Can incrementally free & reuse cells

State-of-the-art throughput
Copying Generational GC
[Diagram: heap divided into ‘nursery’ and ‘older generation’]
• Requirements
  – write-barrier to track inter-generation pointers
    • remsets, cards
  – copy reserve
• Advantages:
  – Minimizes copying of older objects
  – Compaction of long-lived objects
• Problems:
  – Not very incremental
  – Very youngest objects always copied
  – What order should GC use to copy objects?

Opportunity
• Generational copying garbage collector reorders objects at runtime

Copying of Linked Objects
[Figure: the same linked objects, numbered 1 to 7, laid out in three copy orders: Breadth First, Depth First, and Online Object Reordering]

Outline
• Motivation
• Online Object Reordering (OOR)
• Methodology
• Experimental Results
• Conclusion

Cache Performance Matters
[Figure: _213_javac, total cycles (in billions) across cache configurations with perfect and real L1/L2 caches]

Online Object Reordering
• Where are the cache misses?
• How to identify hot field accesses at runtime?
• How to reorder the objects?

Where Are The Cache Misses?
• Heap structure:
  [Diagram: VM Objects, Stack, Older Generation, Nursery (not to scale)]

Where Are The Cache Misses?
[Figure: _209_db, total accesses (in millions) split into L2 hits and L2 misses, by heap region: VM Objects, Stack, Older Generation, Nursery]

Where Are The Cache Misses?
• Two opportunities to reorder objects in the older generation
  – Promote nursery objects
  – Full heap collection

How to Find Hot Fields?
• Runtime info (intercept every read)?
• Compiler analysis?
• Runtime information + compiler analysis
Key: Low overhead estimation

Which Classes Need Reordering?
Step 1: Compiler analysis
  – Excludes cold basic blocks
  – Identifies field accesses
Step 2: JIT adaptive sampling identifies hot methods
  – Marks field accesses in hot methods as hot
Key: Low overhead estimation

Example: Compiler Analysis
  Method Foo {
    Class A a;
    try {
      …=a.b;                  // Hot BB: collect access info
      …
    } catch(Exception e){
      …a.c                    // Cold BB: ignore
    }
  }
Compiler Access List:
  1. A.b
  2. ….
  ….

Example: Adaptive Sampling
  Method Foo {
    Class A a;
    try {
      …=a.b;
      …
    } catch(Exception e){
      …a.c
    }
  }
• Adaptive sampling: Foo is hot
• Foo Accesses:
  1. A.b
  2. ….
  ….
• A.b is hot (recorded in A's type information)
[Diagram: type information for classes A and B, showing fields b and c]

Copying of Linked Objects
[Figure: Online Object Reordering consults type information while copying linked objects 1 to 7 into a hot space and a cold space]

OOR System Overview
[Diagram; legend: JikesRVM component, OOR addition. Boxes and edge labels: Source Code, Baseline Compiler, Executing Code, Input/Output, Adaptive Sampling, Hot Methods, Optimizing Compiler, Access Info Database, Look Up, Adds Entries, Register Hot Field Accesses, Advice, GC: Copies Objects, Affects, Improves Locality]

Outline
• Motivation
• Online Object Reordering
• Methodology
• Experimental Results
• Conclusion

Methodology: Virtual Machine
• Jikes RVM
  – VM written in Java
  – High performance
  – Timer-based adaptive sampling
  – Dynamic optimization
• Experiment setup
  – Pseudo-adaptive
  – 2nd iteration [Eeckhout et al.]
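The copy orders compared in this lecture (breadth first, depth first, and a hot-field-first order in the spirit of OOR) can be sketched in a few lines of Java. This is an illustrative sketch only, not the Jikes RVM or MMTk implementation: the class `CopyOrderSketch`, the two-field `Node`, the per-field hot flags, the seven-object sample tree, and the choice to defer cold referents to a FIFO queue are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CopyOrderSketch {
    /** Hypothetical heap object with two reference fields. */
    static final class Node {
        final int id;
        Node left, right;
        Node(int id) { this.id = id; }
    }

    /**
     * Returns object ids in the order a copying collector would lay them out.
     * Referents of hot fields go on a stack, so they are copied right after
     * their parent; referents of cold fields wait in a FIFO queue.
     * (Assumed policy for illustration, not the collector from the paper.)
     */
    static List<Integer> copyOrder(Node root, boolean leftHot, boolean rightHot) {
        List<Integer> order = new ArrayList<>();
        if (root == null) return order;
        Set<Node> copied = new HashSet<>();      // identity-based: Node does not override equals
        Deque<Node> hot = new ArrayDeque<>();    // used as a stack
        Deque<Node> cold = new ArrayDeque<>();   // used as a FIFO queue
        cold.addLast(root);
        while (!hot.isEmpty() || !cold.isEmpty()) {
            Node n = hot.isEmpty() ? cold.pollFirst() : hot.pop();
            if (!copied.add(n)) continue;        // already copied (shared or cyclic reference)
            order.add(n.id);
            Node[] fields = { n.left, n.right };
            boolean[] isHot = { leftHot, rightHot };
            // Cold referents are queued in field order.
            for (int i = 0; i < fields.length; i++) {
                if (fields[i] != null && !isHot[i]) cold.addLast(fields[i]);
            }
            // Hot referents are pushed in reverse so the first field is popped first.
            for (int i = fields.length - 1; i >= 0; i--) {
                if (fields[i] != null && isHot[i]) hot.push(fields[i]);
            }
        }
        return order;
    }

    /** No hot fields: every referent waits in the queue, giving breadth-first order. */
    static List<Integer> breadthFirst(Node root) { return copyOrder(root, false, false); }

    /** All fields hot: every referent goes on the stack, giving depth-first order. */
    static List<Integer> depthFirst(Node root) { return copyOrder(root, true, true); }

    /** Hypothetical complete binary tree: 1 -> (2, 3), 2 -> (4, 5), 3 -> (6, 7). */
    static Node sampleTree() {
        Node[] n = new Node[8];
        for (int i = 1; i <= 7; i++) n[i] = new Node(i);
        for (int i = 1; i <= 3; i++) {
            n[i].left = n[2 * i];
            n[i].right = n[2 * i + 1];
        }
        return n[1];
    }

    public static void main(String[] args) {
        Node root = sampleTree();
        System.out.println(breadthFirst(root));           // [1, 2, 3, 4, 5, 6, 7]
        System.out.println(depthFirst(root));             // [1, 2, 4, 5, 3, 6, 7]
        System.out.println(copyOrder(root, true, false)); // [1, 2, 4, 3, 6, 5, 7]
    }
}
```

In this sketch the two static orders are the extreme settings of the same loop: with no field marked hot it degenerates to breadth first, with every field marked hot to depth first. Marking only `left` as hot places each object next to its hot referent and pushes cold referents later in the copy space.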

Methodology: Memory Management
• Memory Management Toolkit (MMTk):
  – Allocators and garbage collectors
  – Multi-space heap
    • Boot image
    • Large object space (LOS)
    • Immortal space
• Experiment setup
  – Generational copying GC with 4M bounded nursery

Overhead: OOR Analysis Only

  Benchmark    Base Execution Time (sec)    w/ only OOR Analysis (sec)    Overhead
  jess                   4.39                         4.43                  0.84%
  jack                   5.79                         5.82                  0.57%
  raytrace               4.63                         4.61                 -0.59%
  mtrt                   4.95                         4.99                  0.70%
  javac                 12.83                        12.70                 -1.05%
  compress               8.56                         8.54                  0.20%
  pseudojbb             13.39                        13.43                  0.36%
  db                    18.88                        18.88                 -0.03%
  antlr                  0.94                         0.91                 -2.90%
  hsqldb               160.56                       158.46                 -1.30%
  ipsixql               41.62                        42.43                  1.93%
  jython                37.71                        37.16                 -1.44%
  ps-fun               129.24                       128.04                 -1.03%
  Mean                                                                     -0.19%

Detailed Experiments
• Separate application and GC time
• Vary thresholds for method heat
• Vary thresholds for cold basic blocks
• Three architectures
  – x86, AMD, PowerPC
• x86 Performance counter:
  – DL1, trace cache, L2, DTLB, ITLB

Performance: javac

Performance: db

Performance: jython
Any static ordering leaves you vulnerable to pathological cases.

Phase Changes

Related Work
• Evaluate static orderings [Wilson et al.]
  – Large performance variation
• Static profiling [Chilimbi et al., and others]
  – Lack of flexibility
• Instance-based object reordering [Chilimbi et al.]
  – Too expensive

Conclusion
• Static traversal orders have up to 25% variation
• OOR improves or matches best static ordering
• OOR has very low overhead
• Past predicts future

380C
• Where are we & where we are going
  – Managed languages
    • Dynamic compilation
    • Inlining
    • Garbage collection
  – Why you need to care about workloads & methodology
    • Read: Blackburn et al., Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century, Communications of the ACM, 51(8): 83–89, August 2008.
  – Alias analysis
  – Dependence analysis
  – Loop transformations
  – EDGE architectures
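
Appendix: returning to the Allocation Mechanisms slide, the contrast between the two allocators can be sketched in Java as follows. This is a toy model under stated assumptions, not MMTk code: addresses are plain integer offsets into an imaginary region, and the class names, the first-fit policy, the `-1` failure value, and the absence of coalescing are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class AllocatorSketch {
    /** Bump-pointer allocation: one increment plus a bounds check. */
    static final class BumpPointer {
        private final int limit;
        private int cursor = 0;
        BumpPointer(int capacity) { this.limit = capacity; }

        /** Returns the offset of the new object, or -1 if the region is full. */
        int alloc(int size) {
            if (cursor + size > limit) return -1;
            int addr = cursor;
            cursor += size;          // consecutive allocations are adjacent in memory
            return addr;
        }

        /** Individual objects cannot be freed; the whole region is released at once. */
        void freeAll() { cursor = 0; }
    }

    /** Free-list allocation with a first-fit search (assumed policy). */
    static final class FreeList {
        private final List<int[]> cells = new ArrayList<>(); // each entry is {addr, size}
        FreeList(int capacity) { cells.add(new int[] { 0, capacity }); }

        /** Walks the list for the first cell that fits; returns -1 if none does. */
        int alloc(int size) {
            for (int i = 0; i < cells.size(); i++) {
                int[] cell = cells.get(i);
                if (cell[1] >= size) {
                    int addr = cell[0];
                    cell[0] += size;
                    cell[1] -= size;
                    if (cell[1] == 0) cells.remove(i);
                    return addr;
                }
            }
            return -1;
        }

        /** Returns a single cell to the front of the list (no coalescing in this sketch). */
        void free(int addr, int size) { cells.add(0, new int[] { addr, size }); }
    }

    public static void main(String[] args) {
        BumpPointer bump = new BumpPointer(64);
        System.out.println(bump.alloc(16) + " " + bump.alloc(8)); // 0 16

        FreeList list = new FreeList(64);
        int a = list.alloc(16);
        list.alloc(16);
        list.free(a, 16);
        System.out.println(list.alloc(8)); // 0: the freed cell is reused
    }
}
```

The bump-pointer allocator places objects allocated together next to each other, which is the contemporaneous locality noted on the slide. The free list can reuse individual cells, but where a new object lands depends on the state of the list.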