Improving Instruction Locality
with Just-In-Time Code Layout
J. Bradley Chen and Bradley D. D. Leupen
Division of Engineering and Applied Sciences
Harvard University
1
Goals
• Improve instruction reference locality
– big problem for commodity applications
• Eliminate need for profile information
– required by current compiler-based solutions
2
How?
Implement layout dynamically using
Activation Order:
• A new heuristic for code layout.
• Locate procedures in order of use.
3
Requirements
• No special hardware support.
• Minimal changes to the operating system.
• Minimal system overhead.
4
Optimizing Procedure Layout
[Figure: Bad Layout vs. Better Layout]
5
Current Practice: Pettis and Hansen
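• Nodes are procedures.
• Edges are caller/callee pairs.
• Weights are call frequency.
Example call graph (caller -> callee: weight):
WinMain() -> Initialize(): 1
WinMain() -> EventLoop(): 1
EventLoop() -> GetEvent(): 129394
EventLoop() -> React(): 68754
GetEvent() -> CheckForInputError(): 128404
React() -> HandleRareCase(): 1
React() -> HandleCommonCase(): 68753
CheckForInputError() -> HandleInputError(): 10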
6
Pettis and Hansen Layout
Hot subgraph (caller -> callee: weight):
EventLoop() -> GetEvent(): 129394
EventLoop() -> React(): 68754
GetEvent() -> CheckForInputError(): 128404
React() -> HandleCommonCase(): 68753
Step 0. layout: []
Step 1. Merge GetEvent() and CheckForInputError() into Node-1.
layout: [GetEvent, CheckForInputError]
Step 2. Merge EventLoop() and Node-1 into Node-2.
layout: [EventLoop, GetEvent, CheckForInputError]
Step 3. Merge React() and Node-2 into Node-3.
layout: [React, EventLoop, GetEvent, CheckForInputError]
Step 4. Merge HandleCommonCase() and Node-3 into Node-4.
layout: [HandleCommonCase, React, EventLoop, GetEvent, CheckForInputError]
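The merging idea can be sketched in a few lines of Python. This is a simplified greedy chain merge in the spirit of Pettis and Hansen, not their exact procedure: it visits edges from heaviest to lightest, so its merge order may differ from the sequence of layouts shown above, and the function name `greedy_chain_layout` and its chain-orientation rule are assumptions made for illustration. The edge weights are taken from the call-graph slide, with the caller/callee pairing of each weight assumed from the figure.

```python
def greedy_chain_layout(edges):
    """edges maps (caller, callee) -> call count. Returns a list of chains."""
    chain_of = {}
    for caller, callee in edges:
        chain_of.setdefault(caller, [caller])
        chain_of.setdefault(callee, [callee])

    # Visit edges from heaviest to lightest (ties keep insertion order).
    for (u, v), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[u], chain_of[v]
        if a is b:
            continue  # both procedures are already in the same chain
        # Orient the chains so that u and v become adjacent whenever
        # each of them sits at an end of its chain (illustrative rule).
        if a[0] == u and a[-1] != u:
            a.reverse()
        if b[-1] == v and b[0] != v:
            b.reverse()
        a.extend(b)
        for proc in b:
            chain_of[proc] = a

    distinct = {id(chain): chain for chain in chain_of.values()}
    return list(distinct.values())


call_graph = {
    ("WinMain", "Initialize"): 1,
    ("WinMain", "EventLoop"): 1,
    ("EventLoop", "GetEvent"): 129394,
    ("EventLoop", "React"): 68754,
    ("GetEvent", "CheckForInputError"): 128404,
    ("React", "HandleRareCase"): 1,
    ("React", "HandleCommonCase"): 68753,
    ("CheckForInputError", "HandleInputError"): 10,
}

chains = greedy_chain_layout(call_graph)
print(chains[0])
```

On this graph the five hot procedures end up contiguous in the single resulting chain, while the rarely called procedures are pushed to its ends.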
7
A New Heuristic
Activation Order: Co-locate procedures that are activated
sequentially.
Example:
Code:
int main(void)
{
    Initialize();
    /* do some stuff */
    for (int i = 0; i < 10000; i++) {
        /* do some stuff */
        foo();
        bar();
    }
    return 0;
}
P&H layout: [foo, main, bar, Initialize]
AO layout: [main, Initialize, foo, bar]
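As a minimal sketch, Activation Order can be modeled as "place each procedure the first time it is activated". The function name `activation_order` and the representation of the run as a list of procedure names are assumptions made for illustration; the trace mirrors the example program above.

```python
def activation_order(trace):
    """Return procedures in the order in which they are first activated."""
    seen = set()
    layout = []
    for proc in trace:
        if proc not in seen:
            seen.add(proc)
            layout.append(proc)
    return layout


# Trace of the example: main runs, calls Initialize once,
# then calls foo and bar in each of 10000 loop iterations.
trace = ["main", "Initialize"] + ["foo", "bar"] * 10000
ao_layout = activation_order(trace)
print(ao_layout)
```

Only the first activation of each procedure matters, so no call counts or profile data are needed to produce the layout.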
8
Implementing JITCL
__start:
perform initializations
call thunk_main
thunk_main:
. . .
thunk_foo:
. . .
__InstructionMemory:
Thunk routines implement code layout on-the-fly.
9
Thunk routines
// Global variables:
//   ProcPointers[] - one element per procedure
//   INDEX_proc and LENGTH_proc for each procedure
thunk_main:
    // Copy the procedure into the text segment.
    if (InCodeSegment(ProcPointers[INDEX_main]))
        ProcPointers[INDEX_main] =
            CopyToTextSegment(ProcPointers[INDEX_main],
                              LENGTH_main);
    // Update the call site that invoked this thunk.
    PatchCallSite(ProcPointers[INDEX_main],
                  ComputeCallSiteFromReturnAddress(RA));
    jmp ProcPointers[INDEX_main];
The thunk routines copy procedures into the text
segment and update call sites at run-time.
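The behavior of the thunks can be imitated with a small simulation. This is a sketch under stated assumptions, not the actual implementation: the class `JitTextSegment`, the string call-site labels, and the procedure lengths are all hypothetical, and the model assumes that a procedure is copied the first time any thunk call reaches it and that a patched call site never enters the thunk again.

```python
class JitTextSegment:
    """Toy model of copy-on-first-call layout with call-site patching."""

    def __init__(self, lengths):
        self.lengths = dict(lengths)  # procedure -> size in bytes (hypothetical)
        self.address = {}             # procedure -> address in the new text segment
        self.next_free = 0            # next unused address in the text segment
        self.patched_sites = set()    # call sites that now call the procedure directly
        self.thunk_calls = 0          # number of calls that went through a thunk

    def call(self, call_site, proc):
        if call_site in self.patched_sites:
            return self.address[proc]  # direct call, thunk bypassed
        self.thunk_calls += 1
        if proc not in self.address:
            # First activation: copy the procedure to the next free address.
            self.address[proc] = self.next_free
            self.next_free += self.lengths[proc]
        self.patched_sites.add(call_site)
        return self.address[proc]


seg = JitTextSegment({"main": 100, "Initialize": 40, "foo": 60, "bar": 20})
seg.call("__start+8", "main")
seg.call("main+4", "Initialize")
for _ in range(10000):
    seg.call("main+32", "foo")
    seg.call("main+36", "bar")
print(seg.address, seg.thunk_calls)
```

In this model the thunk cost is paid once per call site rather than once per call, which is consistent with the low overhead the slides report.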
10
Simulation Methodology
             Cache Size   Associativity   Simulation
UNIX/RISC    8K           Direct-Mapped   ATOM
Win32/x86    8K           2-Way           Etch
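To illustrate why layout matters for caches of this size, the following sketch counts instruction-cache misses for two placements of the same two procedures. It is not ATOM or Etch: the 32-byte line size, 4-byte instructions, 64-byte procedures, LRU replacement, and the helper names `count_misses` and `fetches` are all assumptions made for illustration. Only the 8K capacity and the direct-mapped and 2-way organizations come from the table above.

```python
def count_misses(fetch_addrs, cache_size=8192, line_size=32, ways=1):
    """Count misses in a set-associative cache with LRU replacement."""
    num_sets = cache_size // (line_size * ways)
    sets = [[] for _ in range(num_sets)]  # each set lists tags, least recent first
    misses = 0
    for addr in fetch_addrs:
        line = addr // line_size
        index, tag = line % num_sets, line // num_sets
        resident = sets[index]
        if tag in resident:
            resident.remove(tag)      # hit: refreshed below as most recent
        else:
            misses += 1
            if len(resident) == ways:
                resident.pop(0)       # evict the least recently used line
        resident.append(tag)
    return misses


def fetches(layout, calls, size=64, step=4):
    """Addresses fetched when each called procedure runs straight through."""
    for proc in calls:
        base = layout[proc]
        yield from range(base, base + size, step)


calls = ["foo", "bar"] * 100
bad = {"foo": 0, "bar": 8192}   # both procedures map to the same cache sets
better = {"foo": 0, "bar": 64}  # procedures placed back to back

bad_misses = count_misses(fetches(bad, calls))
better_misses = count_misses(fetches(better, calls))
two_way_misses = count_misses(fetches(bad, calls), ways=2)
print(bad_misses, better_misses, two_way_misses)
```

In the direct-mapped case the conflicting placement misses on every line in every iteration, while the adjacent placement misses only on the first touch of each line. Two-way associativity absorbs this particular conflict.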
11
Workloads
Benchmark    Description                       Text Size
UNIX
compress     file compression                  112
Gcc          The GNU C compiler                1552
m88ksim      Simulation of Motorola 88K        160
Perl         The perl scripting language       376
Raytrace     Image rendering                   192
Xanim        MPEG player                       2024
Win32
Mazelord     Maze game                         1445
Perfmon      Windows NT system utility         2805
Wordpro 96   Lotus document preparation        5148
Word 7       Microsoft document preparation    7694
IE302        Microsoft web browser product     4990
12
Results
• The AO heuristic is effective.
• The overhead of JITCL is negligible.
• JITCL improves procedure layout without
requiring profile information.
• JITCL reduces program memory requirements.
13
Results: The AO Heuristic
[Figure: bar chart of improvement in I-cache miss rate (y-axis 0 to 0.04) for Pettis & Hansen and Activation Order on compress, gcc, m88ksim, perl, raytrace, and xanim]
Conclusion: Effectiveness of heuristic is comparable to P&H.
14
Overhead of JITCL
• Copy overhead
– instruction overhead
– cache overhead
• Cache consistency
• Disk overhead: comparable to demand-loaded
text; not evaluated.
15
Results: Overhead
[Figure: bar chart of JITCL instruction overhead in percent (y-axis 0 to 0.1) on compress, gcc, m88ksim, perl, raytrace, and xanim]
Conclusion: JITCL Overhead is less than 0.1% in all cases.
16
Results: Performance
[Figure: bar chart of saved cycles per instruction (y-axis 0 to 1.2) for Pettis & Hansen and JITCL on compress, gcc, m88ksim, perl, raytrace, and xanim]
Conclusion: Overall performance is comparable to P&H.
17
JITCL for Win32 Applications
• Windows applications are composed of
multiple executable modules.
• When transitions between modules are
frequent, intra-module code layout is
less effective.
• With JITCL, inter-module code layout is
possible and beneficial.
18
Win32 Cache Miss Rates
[Figure: bar chart of L1 cache miss rate (y-axis 0 to 0.06) for default, P&H, and JITCL layouts on mazelord, perfmon, wordpro 96, word 7, and ie302]
Conclusion: Careful layout did not help Win32 applications.
19
Text Segment Size
[Figure: bar chart of text segment size (y-axis labeled "Text size in megabytes", 0 to 3500) for default and JITCL layouts on mazelord, perfmon, wordpro 96, word 7, and ie302]
Conclusion: JITCL typically reduces text size by 50%.
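One plausible reading of this result is that a procedure enters the text segment only when it is first activated, so procedures that never run take no space there. The sketch below illustrates that reading; the function name `text_sizes`, the procedure lengths, and the trace are hypothetical and are not measurements from the slides.

```python
def text_sizes(lengths, trace):
    """Return (static text size, text size if only activated procedures are placed)."""
    static_size = sum(lengths.values())
    activated = set(trace)
    jit_size = sum(lengths[proc] for proc in activated)
    return static_size, jit_size


# Hypothetical program: HandleRareCase is linked in but never called in this run.
lengths = {"main": 100, "Initialize": 40, "foo": 60, "bar": 20, "HandleRareCase": 220}
trace = ["main", "Initialize", "foo", "bar", "foo", "bar"]

static_size, jit_size = text_sizes(lengths, trace)
print(static_size, jit_size)
```

Under this assumption the saving depends entirely on how much of the program a given run actually executes.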
20
JITCL vs. PBO
• JITCL provides an alternative to
feedback-based procedure layout.
• Many important optimizations still require
profile information.
– instruction scheduling
– register allocation
– other intra-procedural optimizations
• Don’t expect profile-based optimization
to go away!
21
Conclusions
Just-In-Time code layout achieves comparable
benefit to profile-based code layout without the
need for profiles.
• The AO heuristic is effective.
• The overhead of procedure copying is low.
• Benefit in I-Cache is comparable to Pettis and
Hansen layout.
• JITCL can reduce working set size.
22
The Morph Project
For more information:
http://www.eecs.harvard.edu/morph/
23