Improving Instruction Locality with Just-In-Time Code Layout
J. Bradley Chen and Bradley D. D. Leupen
Division of Engineering and Applied Sciences
Harvard University

Goals
• Improve instruction reference locality
  – big problem for commodity applications
• Eliminate need for profile information
  – required by current compiler-based solutions

How?
Implement layout dynamically using Activation Order:
• A new heuristic for code layout.
• Locate procedures in order of use.

Requirements
• No special hardware support.
• Minimal changes to the operating system.
• Minimal system overhead.

Optimizing Procedure Layout
[Figure: two panels, "Bad Layout" and "Better Layout".]

Current Practice: Pettis and Hansen
• Nodes are procedures.
• Edges are caller/callee pairs.
• Weights are call frequency.
[Figure: weighted call graph of an example program. Nodes: WinMain(), Initialize(), EventLoop(), GetEvent(), React(), CheckForInputError(), HandleRareCase(), HandleCommonCase(), HandleInputError(). Edge weights shown: 1, 1, 129394, 68754, 128404, 1, 68753, 10.]

Pettis and Hansen Layout
[Figure: the call graph is collapsed step by step into merged nodes Node-1 through Node-4; the layout grows with each merge.]
layout: []
layout: [GetEvent, CheckForInputError]
layout: [EventLoop, GetEvent, CheckForInputError]
layout: [React, EventLoop, GetEvent, CheckForInputError]
layout: [HandleCommonCase, React, EventLoop, GetEvent, CheckForInputError]

A New Heuristic
Activation Order: Co-locate procedures that are activated sequentially.
Example:
Code:
    void main() {
        Initialize();
        Do some stuff;
        for (10000 iterations) {
            Do some stuff;
            foo();
            bar();
        }
    }
P&H layout: foo(), main(), bar(), Initialize()
AO layout: main(), Initialize(), foo(), bar()

Implementing JITCL
    __start:
        perform initializations
        call thunk_main
    thunk_main:
        . . .
    thunk_foo:
        . . .
    __InstructionMemory:
Thunk routines implement code layout on-the-fly.

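The Activation Order heuristic and the first-call placement done by the thunks can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the class JitLayout, its fields, the procedure sizes, and the extra procedure unused_helper are all hypothetical, and the model tracks placement only (it does not model call-site patching or the cache).

```python
from dataclasses import dataclass, field


@dataclass
class JitLayout:
    """Toy model of just-in-time layout: a procedure is placed on its first call."""

    lengths: dict                                 # procedure name -> size in bytes (made-up values)
    offsets: dict = field(default_factory=dict)   # procedure name -> offset in instruction memory
    next_free: int = 0                            # next unused offset in instruction memory
    copies: int = 0                               # number of procedures copied so far

    def call(self, name):
        # Stands in for the thunk: only the first activation copies the
        # procedure, appending it at the next free offset.
        if name not in self.offsets:
            self.offsets[name] = self.next_free
            self.next_free += self.lengths[name]
            self.copies += 1
        return self.offsets[name]

    def layout(self):
        # Procedures ordered by their position in instruction memory.
        return sorted(self.offsets, key=self.offsets.get)


# Hypothetical sizes; unused_helper is never called, so it is never placed.
jit = JitLayout(lengths={
    "main": 64, "Initialize": 128, "foo": 32, "bar": 48, "unused_helper": 256,
})

# Call sequence of the example program on the "A New Heuristic" slide.
jit.call("main")
jit.call("Initialize")
for _ in range(10000):
    jit.call("foo")
    jit.call("bar")

print(jit.layout())     # ['main', 'Initialize', 'foo', 'bar']
print(jit.copies)       # 4
print(jit.next_free)    # 272
```

In this toy model the resulting order matches the AO layout of the example, each procedure is copied once no matter how often it is called, and a procedure that is never activated takes no space in instruction memory.
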
Thunk routines
    // Global variables:
    //   ProcPointers[] - one element per procedure
    //   INDEX_proc and LENGTH_proc for each procedure
    thunk_main:
        if (InCodeSegment(ProcPointers[INDEX_main]))
            ProcPointers[INDEX_main] =
                CopyToTextSegment(ProcPointers[INDEX_main], LENGTH_main);
        PatchCallSite(ProcPointers[INDEX_main],
                      ComputeCallSiteFromReturnAddress(RA));
        jmp ProcPointers[INDEX_main];
The thunk routines copy procedures into the text segment and update call sites at run-time.

Simulation Methodology
              Cache Size   Associativity   Simulation
UNIX/RISC     8K           Direct-Mapped   ATOM
Win32/x86     8K           2-Way           Etch

Workloads
Benchmark     Description                     Text Size
UNIX
compress      file compression                112
Gcc           The GNU C compiler              1552
m88ksim       Simulation of Motorola 88K      160
Perl          The perl scripting language     376
Raytrace      Image rendering                 192
Xanim         MPEG player                     2024
Win32
Mazelord      Maze game                       1445
Perfmon       Windows NT system utility       2805
Wordpro 96    Lotus document preparation      5148
Word 7        Microsoft document preparation  7694
IE302         Microsoft web browser product   4990

Results
• The AO heuristic is effective.
• The overhead of JITCL is negligible.
• JITCL improves procedure layout without requiring profile information.
• JITCL reduces program memory requirements.

Results: The AO Heuristic
[Figure: bar chart of improvement in I-cache miss rate (y-axis 0 to 0.04), Pettis & Hansen vs. Activation Order, for xanim, raytrace, perl, m88ksim, gcc, compress.]
Conclusion: Effectiveness of heuristic is comparable to P&H.

Overhead of JITCL
• Copy overhead
  – instruction overhead
  – cache overhead
• Cache consistency
• Disk overhead: comparable to demand-loaded text; not evaluated.

Results: Overhead
[Figure: bar chart of overhead instructions (%) (y-axis 0 to 0.1) for xanim, raytrace, perl, m88ksim, gcc, compress.]
Conclusion: JITCL overhead is less than 0.1% in all cases.

Results: Performance
[Figure: bar chart of saved cycles per instruction (y-axis 0 to 1.2), Pettis & Hansen vs. JITCL, for xanim, raytrace, perl, m88ksim, gcc, compress.]
Conclusion: Overall performance is comparable to P&H.

JITCL for Win32 Applications
• Windows applications are composed of multiple executable modules.
• When transitions between modules are frequent, intra-module code layout is less effective.
• With JITCL, inter-module code layout is possible and beneficial.

Win32 Cache Miss Rates
[Figure: bar chart of L1 cache miss rate (y-axis 0 to 0.06), default vs. P&H vs. JITCL, for mazelord, perfmon, wordpro 96, word 7, ie302.]
Conclusion: Careful layout did not help Win32 applications.

Text Segment Size
[Figure: bar chart of text size in megabytes (y-axis 0 to 3500), default vs. JITCL, for mazelord, perfmon, wordpro 96, word 7, ie302.]
Conclusion: JITCL typically reduces text size by 50%.

JITCL vs. PBO
• JITCL provides an alternative to feedback-based procedure layout.
• Many important optimizations still require profile information.
  – instruction scheduling
  – register allocation
  – other intra-procedural optimizations
• Don’t expect profile-based optimization to go away!

Conclusions
Just-In-Time code layout achieves comparable benefit to profile-based code layout without the need for profiles.
• The AO heuristic is effective.
• The overhead of procedure copying is low.
• Benefit in I-Cache is comparable to Pettis and Hansen layout.
• JITCL can reduce working set size.

The Morph Project
For more information: http://www.eecs.harvard.edu/morph/