Introduction to Programming Languages and Compilers

Dynamic Binary Translation
Lecture 24
acknowledgement: E. Duesterwald (IBM), S. Amarasinghe (MIT)
Ras Bodik CS 164 Lecture 24
Lecture Outline
• Binary Translation: Why, What, and When.
• Why: Guarding against buffer overruns
• What, when: overview of two dynamic translators:
– Dynamo-RIO by HP, MIT
– CodeMorph by Transmeta
• Techniques used in dynamic translators
– Path profiling
Motivation: preventing buffer overruns
Recall the typical buffer overrun attack:
1. program calls a method foo()
2. foo() copies a string into an on-stack array:
– string supplied by the user
– user’s malicious code copied into foo’s array
– foo’s return address overwritten to point to user code
3. foo() returns
– unknowingly jumping to the user code
Preventing buffer overrun attacks
Two general approaches:
• static (compile-time): analyze the program
– find all array writes that may fall outside array bounds
– program proven safe before you run it
• dynamic (run-time): analyze the execution
– make sure no write outside an array happens
– execution proven safe (enough to achieve security)
Dynamic buffer overrun prevention
the idea, again:
• prevent writes outside the intended array
– as is done in Java
– harder in C: must add “size” to each array
• done in CCured, a Berkeley project
A different idea
perhaps less safe, but easier to implement:
– goal: detect that the return address was overwritten.
instrument the program so that
– it keeps an extra copy of the return address:
1. store aside the return address when the function is
called (store it in an inaccessible shadow stack)
2. when returning, check that the return address in
the AR (activation record) matches the stored one;
3. if mismatch, terminate program
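The three steps can be sketched in a few lines. This is a toy model, not how any product implements it: the class and method names are made up for illustration, and the "inaccessible" shadow stack is simply a second Python list that the simulated program never writes to.

```python
class ReturnAddressOverwritten(Exception):
    """Raised when the return address in the activation record was changed."""


class ShadowStackMachine:
    """Hypothetical model of a call stack guarded by a shadow stack."""

    def __init__(self):
        self.stack = []         # program-visible stack (return addresses only)
        self.shadow_stack = []  # extra copies, assumed inaccessible to the program

    def call(self, return_address):
        # Step 1: on a call, store the return address aside.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def overwrite_return_address(self, new_address):
        # Models a buffer overrun clobbering the topmost return address.
        self.stack[-1] = new_address

    def ret(self):
        # Step 2: on return, compare the AR's copy with the shadow copy.
        actual = self.stack.pop()
        expected = self.shadow_stack.pop()
        if actual != expected:
            # Step 3: on mismatch, terminate the program.
            raise ReturnAddressOverwritten(hex(actual))
        return actual
```

A normal call/return pair passes the check; a return after overwrite_return_address() raises instead of jumping to the attacker's address.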
Commercially interesting
• Similar idea behind the product by
determina.com
• key problem:
– reducing overhead of instrumentation
• what’s instrumentation, anyway?
– adding statements to an existing program
– in our case, to x86 executables
• Determina uses binary translation
What is Binary Translation?
• Translating a program in one binary format to
another, for example:
– MIPS → x86 (to port programs across platforms)
• We can view “binary format” liberally:
– Java bytecode → x86 (to avoid interpretation)
– x86 → x86 (to optimize the executable)
When does the translation happen?
• Static (off-line): before the program is run
– Pros: no serious translation-time constraints
• Dynamic (on-line): while the program is running
– Pros:
• access to complete program (program is fully linked)
• access to program state (including values of data structures)
• can adapt to changes in program behavior
• Note: Pros(dynamic) = Cons(static)
Why? Translation Allows Program Modification
[Figure: tool chain running from static to dynamic stages: Program, Compiler, Linker, Loader, Runtime System]
• Instrumenters
• Debuggers
• Load time optimizers
• Shared library mechanism
• Interpreters
• Just-In-Time Compilers
• Dynamic Optimizers
• Profilers
• Dynamic Checkers
• Instrumenters
• Etc.
Applications, in more detail
• profilers:
– add instrumentation instructions to count basic
block execution counts (e.g., gprof)
• load-time optimizers:
– remove caller/callee save instructions
(callers/callees known after DLLs are linked)
– replace long jumps with short jumps
(code position known after linking)
• dynamic checkers
– finding memory access bugs (e.g., Rational Purify)
Dynamic Program Modifiers
Running Program
Dynamic Program Modifier:
Observe/Manipulate Every Instruction in the Running Program
Hardware Platform
In more detail
Software stacks, listed top to bottom:
• common setup: application, DLL, OS, CPU
• CodeMorph (Transmeta): application, DLL, OS, CodeMorph, CPU=VLIW
• Dynamo-RIO (HP, MIT): application, DLL, Dynamo, OS, CPU=x86
Dynamic Program Modifiers
Requirements:
• Ability to intercept execution at arbitrary points
• Observe executing instructions
• Modify executing instructions
• Transparency
– modified program is not specially prepared
• Efficiency
– amortize overhead and achieve near-native performance
• Robustness
• Maintain full control and capture all code
– sampling is not an option (there are security applications)
HP Dynamo-RIO
• Building a dynamic program modifier
– Trick I: adding a code cache
– Trick II: linking
– Trick III: efficient indirect branch handling
– Trick IV: picking traces
• Dynamo-RIO performance
• Run-time trace optimizations
System I: Basic Interpreter
[Figure: instruction interpreter loop: take the next VPC, fetch the next instruction, decode, execute, update VPC; exception handling alongside]
✓ Intercept execution
✓ Observe & modify executing instructions
✓ Transparency
Efficiency? A slowdown of up to several hundred times
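To make the fetch/decode/execute/update-VPC loop concrete, here is a minimal sketch over a toy three-opcode instruction set. The opcodes, the tuple encoding, and the function name are assumptions made for this example; a real x86 interpreter decodes machine code and is far larger.

```python
def interpret(program, regs, max_steps=10_000):
    """Fetch-decode-execute loop; the VPC is an index into `program`."""
    vpc = 0
    steps = 0
    while vpc < len(program) and steps < max_steps:
        op, *args = program[vpc]          # fetch next instruction and decode
        next_vpc = vpc + 1
        if op == "addi":                  # regs[dst] += constant
            dst, imm = args
            regs[dst] += imm
        elif op == "add":                 # regs[dst] += regs[src]
            dst, src = args
            regs[dst] += regs[src]
        elif op == "jlt":                 # branch to target if regs[a] < constant
            a, imm, target = args
            if regs[a] < imm:
                next_vpc = target
        else:
            raise ValueError(f"unknown opcode {op!r}")  # exception handling
        vpc = next_vpc                    # update VPC
        steps += 1
    return steps


# Toy loop: acc = 0 + 1 + ... + 9.
LOOP = [
    ("add", "acc", "i"),
    ("addi", "i", 1),
    ("jlt", "i", 10, 0),
]
```

Every instruction pays for a dispatch through the if/elif chain, which is the per-instruction overhead behind the large slowdown quoted on the slide.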
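Trick I: Adding a Code Cache
[Figure: runtime loop: look up the next VPC in the cache; on a miss, fetch the block at VPC and emit it into the cache; context switch into the cache to execute the block; exception handling alongside. The BASIC BLOCK CACHE holds the non-control-flow instructions.]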
Example Basic Block Fragment
Original code:
        add  %eax, %ecx
        cmp  $4, %eax
        jle  $0x40106f
Fragment in the code cache:
frag7:  add  %eax, %ecx
        cmp  $4, %eax
        jle  <stub1>
        jmp  <stub2>
stub1:  mov  %eax, eax-slot     # spill eax
        mov  &dstub1, %eax      # store ptr to stub table
        jmp  context_switch
stub2:  mov  %eax, eax-slot     # spill eax
        mov  &dstub2, %eax      # store ptr to stub table
        jmp  context_switch
Runtime System with Code Cache
[Figure: the basic block builder takes the next VPC and fills the BASIC BLOCK CACHE, which holds the non-control-flow instructions; a context switch separates the runtime from the cache]
Improves performance:
• slowdown reduced from 100x to 17-26x
• remaining bottleneck: frequent (costly) context switches
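A rough sketch of a runtime with a basic block cache, using the same kind of toy instruction tuples as an interpreter would. Python closures stand in for emitted machine code, and the names build_block and run_with_block_cache are invented for this example. The counter shows the remaining bottleneck: every block exit goes back through the dispatcher.

```python
def build_block(program, start):
    """Collect instructions from `start` up to and including the next branch."""
    end = start
    while end < len(program) and program[end][0] != "jlt":
        end += 1
    body = program[start:end]                 # non-control-flow instructions
    branch = program[end] if end < len(program) else None
    fallthrough = end + 1

    def block(regs):
        for op, dst, src in body:
            # "add" reads a register operand, "addi" uses an immediate.
            regs[dst] += regs[src] if op == "add" else src
        if branch is not None:
            _, reg, imm, target = branch
            if regs[reg] < imm:
                return target                 # branch taken
        return fallthrough                    # next VPC
    return block


def run_with_block_cache(program, regs):
    cache = {}                                # VPC -> translated block
    stats = {"blocks_built": 0, "context_switches": 0}
    vpc = 0
    while vpc < len(program):
        if vpc not in cache:                  # lookup VPC; on a miss, build and emit
            cache[vpc] = build_block(program, vpc)
            stats["blocks_built"] += 1
        stats["context_switches"] += 1        # each block execution is entered from here
        vpc = cache[vpc](regs)                # execute block
    return stats


PROGRAM = [
    ("add", "acc", "i"),
    ("addi", "i", 1),
    ("jlt", "i", 10, 0),
]
```

For the ten-iteration loop in PROGRAM the block is built once but the dispatcher is entered ten times, once per iteration.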
Linking a Basic Block Fragment
Original code:
        add  %eax, %ecx
        cmp  $4, %eax
        jle  $0x40106f
Linked fragment in the code cache:
frag7:  add  %eax, %ecx
        cmp  $4, %eax
        jle  <frag42>
        jmp  <frag8>
stub1:  mov  %eax, eax-slot
        mov  &dstub1, %eax
        jmp  context_switch
stub2:  mov  %eax, eax-slot
        mov  &dstub2, %eax
        jmp  context_switch
Trick II: Linking
[Figure: runtime loop with linking: look up the next VPC; on a miss, fetch the block at VPC, link the block, and emit it; context switch into the BASIC BLOCK CACHE (non-control-flow instructions) and execute until a cache miss; exception handling alongside]
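The effect of linking can be illustrated with a small model in which each fragment remembers which fragment each of its exits leads to. The Fragment class, run_linked, and loop_block are hypothetical names for this sketch; real linking patches the jump in the exit stub rather than filling a dictionary.

```python
class Fragment:
    def __init__(self, vpc, run):
        self.vpc = vpc
        self.run = run        # callable(regs) -> next VPC
        self.links = {}       # exit target VPC -> Fragment (models patched exits)


def run_linked(blocks, regs, entry, exit_vpc):
    """`blocks` maps a VPC to a callable standing in for a translated block."""
    cache = {}
    context_switches = 0
    vpc, prev = entry, None
    while vpc != exit_vpc:
        # Dispatcher: reached only on a cache miss or an unlinked exit.
        context_switches += 1
        frag = cache.get(vpc)
        if frag is None:
            frag = cache[vpc] = Fragment(vpc, blocks[vpc])
        if prev is not None:
            prev.links[vpc] = frag            # link the exit we just came from
        # Execute inside the cache until an exit has no link.
        while True:
            vpc = frag.run(regs)
            nxt = frag.links.get(vpc)
            if nxt is None:
                prev = frag
                break
            frag = nxt
    return context_switches


def loop_block(regs):
    # One basic block: acc += i; i += 1; branch back to VPC 0 while i < 10.
    regs["acc"] += regs["i"]
    regs["i"] += 1
    return 0 if regs["i"] < 10 else 3
```

With the loop block at VPC 0, the dispatcher is entered twice (once to build the fragment, once to link its back edge) instead of once per iteration.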
Performance Effect of Basic Block Cache with direct branch linking
[Figure: bar chart of slowdown over native execution for vpr (Spec2000), data sets 1 and 2. Block cache: 26.03 and 17.45. Block cache with direct linking: 2.97 and 3.63.]
Performance Problem: mispredicted indirect branches
Indirect Branch Handling
Conditionally “inline” a preferred indirect branch target as the
continuation of the trace
Original code:
        ret
        <preferred target>
Code in the trace:
        mov  %edx, edx_slot       # save app’s edx
        pop  %edx                 # load actual target
        <save flags>
        cmp  $0x77f44708, %edx    # compare to preferred target
        jne  <exit stub>
        mov  edx_slot, %edx       # restore app’s edx
        <restore flags>
        <inlined preferred target>
Indirect Branch Linking
<load actual target>
<compare to inlined target>
if equal goto <inlined target>
lookup IBT table
if (! tag-match)
    goto <exit stub>
jump to tag-value
[Figure: Shared Indirect Branch Target (IBT) Table mapping original targets (F, H) to linked targets in the code cache; blocks H, I, J, K, L, <inlined target>, and <exit stub>]
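The lookup sequence above can be sketched as a single function. The IBT is modeled as a Python dictionary from original target addresses (tags) to code-cache fragments; resolve_indirect, the stats counters, and the string return values are assumptions for illustration only.

```python
def resolve_indirect(actual, inlined_target, ibt, stats):
    """Decide where execution continues for an indirect branch to `actual`."""
    if actual == inlined_target:          # compare to inlined target
        stats["inlined"] += 1
        return "inlined"                  # fall through into the inlined code
    frag = ibt.get(actual)                # lookup IBT table
    if frag is None:                      # no tag match
        stats["exit_stub"] += 1
        return "exit_stub"                # context switch back to the runtime
    stats["ibt_hit"] += 1
    return frag                           # jump to tag value


def new_stats():
    return {"inlined": 0, "ibt_hit": 0, "exit_stub": 0}
```

Only the third outcome leaves the code cache, so a well-chosen inlined target plus a populated table keeps most indirect branches cheap.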
Trick III: Efficient Indirect Branch Handling
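[Figure: runtime with indirect branch lookup: the basic block builder takes the next VPC and fills the BASIC BLOCK CACHE (non-control-flow instructions); an indirect branch lookup runs inside the cache; a miss in either the cache or the indirect branch lookup triggers a context switch back to the runtime]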
Performance Effect of indirect branch linking
[Figure: bar chart of slowdown over native execution for vpr (Spec2000), data sets 1 and 2. Block cache: 26.03 and 17.45. Block cache with direct linking: 3.63 and 2.97. Block cache with linking (direct+indirect): 1.20 and 1.15.]
Performance Problem: poor code layout in code cache
Trick IV: Picking Traces
Block Cache has poor execution efficiency:
• Increased branching, poor locality
Pick traces to:
• reduce branching & improve layout and locality
• create new optimization opportunities across block boundaries
[Figure: basic blocks A through L as laid out in the Block Cache versus grouped into traces in the Trace Cache]
Picking Traces
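[Figure: runtime with trace selection: START, dispatch, basic block builder, and trace selector on the runtime side; a context switch leads to the BASIC BLOCK CACHE and the TRACE CACHE (each holding non-control-flow instructions), with an indirect branch lookup inside the cache]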
Picking hot traces
• The goal: path profiling
– find frequently executed control-flow paths
– Connect basic blocks along these paths into
contiguous sequences, called traces.
• The problem: find a good trade-off between
– profiling overhead (counting execution events), and
– accuracy of the profile.
Alternative 1: Edge profiling
The algorithm:
• Edge profiling: measure frequencies of all control-flow edges, then after a while
• Trace selection: select hot traces by following the
highest-frequency branch outcome.
Disadvantages:
• Inaccurate: may select infeasible paths (due to
branch correlation)
• Overhead: must profile all control-flow edges
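The inaccuracy can be reproduced on a made-up control-flow graph. In the sketch below, block A branches three ways (to B, C1, or C2), all paths merge at D, and the branch at D is correlated with the earlier one: after B execution always goes to E, after C1 or C2 always to F. The block names, counts, and function names are invented for this example.

```python
from collections import Counter


def edge_profile(paths):
    """Sum per-edge frequencies from a {path: execution count} mapping."""
    edges = Counter()
    for path, count in paths.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges


def select_trace(edges, start):
    """Greedily follow the highest-frequency outgoing edge from `start`."""
    trace = [start]
    while True:
        out = {dst: n for (src, dst), n in edges.items() if src == trace[-1]}
        if not out:
            return tuple(trace)
        nxt = max(out, key=out.get)
        if nxt in trace:                  # stop when the trace would loop
            return tuple(trace)
        trace.append(nxt)


# Paths that actually executed, with their counts.
EXECUTED = {
    ("A", "B", "D", "E"): 40,
    ("A", "C1", "D", "F"): 30,
    ("A", "C2", "D", "F"): 30,
}
```

Here A→B (40) is the hottest edge out of A and D→F (60) the hottest out of D, so the greedy selection yields A, B, D, F, a path that never executed.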
Alternative 2: Bit-tracing path profiling
The algorithm:
– collect path signatures and their frequencies
– path signature = <start addr>.history
– example: <label7>.0101101
– must include addresses of indirect branches
Advantages:
– accuracy
Disadvantages:
– overhead: need to monitor every branch
– overhead: counter storage (one counter per path!)
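A minimal sketch of how a path signature might be formed and counted, assuming one history bit per conditional branch (1 for taken, 0 for not taken). It leaves out the indirect-branch addresses that a full signature must include, and the function names are hypothetical.

```python
from collections import Counter


def path_signature(start_label, branch_outcomes):
    """Encode a path as <start addr>.history, one bit per conditional branch."""
    history = "".join("1" if taken else "0" for taken in branch_outcomes)
    return f"<{start_label}>.{history}"


def profile_paths(executions):
    """Count how often each signature occurs; one counter per distinct path."""
    counts = Counter()
    for start_label, outcomes in executions:
        counts[path_signature(start_label, outcomes)] += 1
    return counts
```

For example, path_signature("label7", [False, True, False, True, True, False, True]) produces the signature shown on the slide, <label7>.0101101. The counter table grows with the number of distinct paths, which is the storage overhead listed above.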
Alternative 3: Next Executing Tail (NET)
This is the algorithm of Dynamo:
– profiling: count only frequencies of start-of-trace
points (which are targets of original backedges)
– trace selection: when a start-of-trace point
becomes sufficiently hot, select the sequence of
basic blocks executed next.
– may select a rare (cold) path, but statistically
selects a hot path!
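The NET idea can be sketched as a small state machine fed one executed basic block at a time. This is an illustration under simplifying assumptions, not Dynamo's code: the threshold value, the class name, and the rule that a trace ends at the next backward-branch target are choices made for this example.

```python
from collections import Counter

HOT_THRESHOLD = 50  # assumed value, for illustration only


class NetSelector:
    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counters = Counter()   # one counter per start-of-trace point
        self.traces = {}            # start-of-trace point -> selected trace
        self.recording = None       # (head, blocks) while a tail is recorded

    def on_block(self, block, is_backward_branch_target):
        """Call once per executed basic block, in execution order."""
        if self.recording is not None:
            head, blocks = self.recording
            if not is_backward_branch_target:
                blocks.append(block)          # still inside the executing tail
                return
            # Simplification: the tail ends at the next backward-branch target.
            self.traces[head] = tuple(blocks)
            self.recording = None
        if is_backward_branch_target and block not in self.traces:
            self.counters[block] += 1         # profile start-of-trace points only
            if self.counters[block] >= self.threshold:
                self.recording = (block, [block])  # hot: record what runs next
```

Only targets of backward branches carry counters; the blocks in between are never instrumented, which is what keeps the profiling overhead low.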
NET (continued)
Advantages of NET:
• very light-weight
– #instrumentation points = #targets of backward branches
– #counters = #targets of backward branches
• statistically likely to pick the hottest path
• picks only feasible paths
• easy to implement
[Figure: control-flow graph of basic blocks A through L]
Spec2000 Performance on Windows
(w/o trace optimizations)
[Figure: bar chart of slowdown vs. native execution (axis 0.0 to 2.2) for art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, parser, perlbmk, twolf, vortex, vpr, and H_MEAN; bar values not recoverable]
Spec2000 Performance on Linux
(w/o trace optimizations)
[Figure: bar chart of slowdown vs. native execution (axis 0.0 to 1.7) for ammp, applu, apsi, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, perlbmk, sixtrack, swim, twolf, vortex, vpr, wupwise, and H_MEAN; bar values not recoverable]
Performance on Desktop Applications
[Figure: bar chart of slowdown vs. native execution (axis 0.0 to 1.6) for Adobe Acrobat, Microsoft Excel, Microsoft PowerPoint, and Microsoft Word; bar values not recoverable]
Performance Breakdown
• code cache: 86%
• indirect branch lookup: 11%
• trace branch taken: 2%
• rest of system: 1%
Trace optimizations
• Now that we built the traces, let’s optimize them
• But what’s left to optimize in statically optimized
code?
• Limitations of static compiler optimization:
– cost of call-specific interprocedural optimization
– cost of path-specific optimization in presence of complex
control flow
– difficulty of predicting indirect branch targets
– lack of access to shared libraries
– sub-optimal register allocation decisions
– register allocation for individual array elements or pointers
Maintaining Control (in the real world)
• Capture all code: execution only takes place out of
the code cache
• Challenging for abnormal control flow
• System must intercept all abnormal control flow
events:
– Exceptions
– Callbacks in Windows
– Asynchronous procedure calls
– Setjmp/longjmp
– Set thread context