Dynamic Binary Translation
Ras Bodik, CS 164, Lecture 24
Acknowledgement: E. Duesterwald (IBM), S. Amarasinghe (MIT)

Lecture Outline
• Binary Translation: Why, What, and When.
• Why: Guarding against buffer overruns
• What, when: overview of two dynamic translators:
  – Dynamo-RIO by HP, MIT
  – CodeMorph by Transmeta
• Techniques used in dynamic translators
  – Path profiling

Motivation: preventing buffer overruns
Recall the typical buffer overrun attack:
1. program calls a method foo()
2. foo() copies a string into an on-stack array:
  – string supplied by the user
  – user’s malicious code copied into foo’s array
  – foo’s return address overwritten to point to user code
3. foo() returns
  – unknowingly jumping to the user code

Preventing buffer overrun attacks
Two general approaches:
• static (compile-time): analyze the program
  – find all array writes that may fall outside array bounds
  – program proven safe before you run it
• dynamic (run-time): analyze the execution
  – make sure no write outside an array happens
  – execution proven safe (enough to achieve security)

Dynamic buffer overrun prevention
The idea, again:
• prevent writes outside the intended array
  – as is done in Java
  – harder in C: must add “size” to each array
• done in CCured, a Berkeley project

A different idea
Perhaps less safe, but easier to implement:
  – goal: detect that the return address was overwritten.
Instrument the program so that
  – it keeps an extra copy of the return address:
  1. store aside the return address when a function is called (store it in an inaccessible shadow stack)
  2. when returning, check that the return address in the activation record (AR) matches the stored one
  3. if mismatch, terminate the program

Commercially interesting
• Similar idea behind the product by determina.com
• key problem:
  – reducing overhead of instrumentation
• what’s instrumentation, anyway?
  – adding statements to an existing program
  – in our case, to x86 executables
• Determina uses binary translation

What is Binary Translation?
• Translating a program in one binary format to another, for example:
  – MIPS → x86 (to port programs across platforms)
• We can view “binary format” liberally:
  – Java bytecode → x86 (to avoid interpretation)
  – x86 → x86 (to optimize the executable)

When does the translation happen?
• Static (off-line): before the program is run
  – Pros: no serious translation-time constraints
• Dynamic (on-line): while the program is running
  – Pros:
    • access to complete program (program is fully linked)
    • access to program state (including values of data structures)
    • can adapt to changes in program behavior
• Note: Pros(dynamic) = Cons(static)

Why? Translation Allows Program Modification
[Figure: the stages Program, Compiler, Linker, Loader, Runtime System, ordered from static to dynamic, annotated with two lists of tools that modify the program.]
• Instrumenters
• Debuggers
• Load time optimizers
• Shared library mechanism

• Interpreters
• Just-In-Time Compilers
• Dynamic Optimizers
• Profilers
• Dynamic Checkers
• Instrumenters
• Etc.
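
As an illustration of the instrumentation discussed in the motivation, the return-address check (store the address aside on call, compare on return, terminate on mismatch) can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `ShadowStack`, `ReturnAddressMismatch`, and `simulate_call` are hypothetical names, the activation record is a plain dictionary, and "terminate program" is modeled as raising an exception. A real tool would insert equivalent checks into the x86 code itself.

```python
class ReturnAddressMismatch(Exception):
    """Raised when the return address in the activation record was altered."""


class ShadowStack:
    """Protected copy of every return address (hypothetical helper)."""

    def __init__(self):
        self._saved = []

    def on_call(self, return_address):
        # Step 1: store the return address aside when a function is called.
        self._saved.append(return_address)

    def on_return(self, return_address_in_ar):
        # Step 2: compare the address found in the activation record (AR)
        # with the stored copy.
        expected = self._saved.pop()
        if return_address_in_ar != expected:
            # Step 3: on mismatch, terminate (modeled here as an exception).
            raise ReturnAddressMismatch(
                f"expected {expected:#x}, found {return_address_in_ar:#x}")
        return return_address_in_ar


def simulate_call(shadow, return_address, overwrite_with=None):
    """Model one call to foo(); overwrite_with simulates a buffer overrun."""
    shadow.on_call(return_address)
    activation_record = {"return_address": return_address}
    if overwrite_with is not None:
        # The overrun clobbers the copy in the AR, not the shadow copy.
        activation_record["return_address"] = overwrite_with
    return shadow.on_return(activation_record["return_address"])
```

Calling `simulate_call(ShadowStack(), 0x401000)` returns normally, while passing `overwrite_with=0xdeadbeef` raises `ReturnAddressMismatch`. The check only detects an overwritten return address; it does not prevent the out-of-bounds write itself.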

Applications, in more detail
• profilers:
  – add instrumentation instructions to count basic block executions (e.g., gprof)
• load-time optimizers:
  – remove caller/callee save instructions (callers/callees known after DLLs are linked)
  – replace long jumps with short jumps (code position known after linking)
• dynamic checkers
  – finding memory access bugs (e.g., Rational Purify)

Dynamic Program Modifiers
[Figure: three layers, top to bottom.]
  Running Program
  Dynamic Program Modifier: Observe/Manipulate Every Instruction in the Running Program
  Hardware Platform

In more detail
[Figure: three software stacks, top to bottom.]
  common setup     CodeMorph (Transmeta)     Dynamo-RIO (HP, MIT)
  application      application               application
  DLL              DLL                       DLL
  OS               OS                        Dynamo
  CPU              CodeMorph                 OS
                   CPU=VLIW                  CPU=x86

Dynamic Program Modifiers
Requirements:
• Ability to intercept execution at arbitrary points
• Observe executing instructions
• Modify executing instructions
• Transparency - modified program is not specially prepared
• Efficiency - amortize overhead and achieve near-native performance
• Robustness
• Maintain full control and capture all code - sampling is not an option (there are security applications)

HP Dynamo-RIO
• Building a dynamic program modifier
  • Trick I: adding a code cache
  • Trick II: linking
  • Trick III: efficient indirect branch handling
  • Trick IV: picking traces
• Dynamo-RIO performance
• Run-time trace optimizations

System I: Basic Interpreter
[Figure: instruction interpreter loop: next VPC → fetch next instruction → decode → execute → update VPC, with exception handling.]
• Intercept execution
• Observe & modify executing instructions
• Transparency
• Efficiency? Up to a several-hundred-fold slowdown

Trick I: Adding a Code Cache
[Figure: next VPC → lookup VPC → on a miss, fetch block at VPC → emit block into the BASIC BLOCK CACHE → context switch → execute block (non-control-flow instructions); exception handling.]

Example Basic Block Fragment
Original code:
    add  %eax, %ecx
    cmp  $4, %eax
    jle  $0x40106f

Fragment in the code cache:
    frag7:  add  %eax, %ecx
            cmp  $4, %eax
            jle  <stub1>
            jmp  <stub2>
    stub1:  mov  %eax, eax-slot      # spill eax
            mov  &dstub1, %eax       # store ptr to stub table
            jmp  context_switch
    stub2:  mov  %eax, eax-slot      # spill eax
            mov  &dstub2, %eax       # store ptr to stub table
            jmp  context_switch

Runtime System with Code Cache
[Figure: next VPC → basic block builder → context switch → BASIC BLOCK CACHE (non-control-flow instructions).]
Improves performance:
• slowdown reduced from 100x to 17-26x
• remaining bottleneck: frequent (costly) context switches

Linking a Basic Block Fragment
Original code:
    add  %eax, %ecx
    cmp  $4, %eax
    jle  $0x40106f

Linked fragment:
    frag7:  add  %eax, %ecx
            cmp  $4, %eax
            jle  <frag42>
            jmp  <frag8>
    stub1:  mov  %eax, eax-slot
            mov  &dstub1, %eax
            jmp  context_switch
    stub2:  mov  %eax, eax-slot
            mov  &dstub2, %eax
            jmp  context_switch

Trick II: Linking
[Figure: next VPC → lookup VPC → on a miss, fetch block at VPC → link block → emit block into the BASIC BLOCK CACHE → context switch → execute until cache miss; exception handling.]

Performance Effect of Basic Block Cache with direct branch linking
[Chart: slowdown over native execution, vpr (Spec2000), data sets 1 and 2.]
• block cache: 26.03 and 17.45
• block cache with direct linking: 3.63 and 2.97
Performance Problem: mispredicted indirect branches

Indirect Branch Handling
Conditionally “inline” a preferred indirect branch target as the continuation of the trace.
Original code:
    ret
    <preferred target>

Translated code:
    mov  %edx, edx_slot          # save app’s edx
    pop  %edx                    # load actual target
    <save flags>
    cmp  %edx, $0x77f44708       # compare to preferred target
    jne  <exit stub>
    mov  edx_slot, %edx          # restore app’s edx
    <restore flags>
    <inlined preferred target>

Indirect Branch Linking
    <load actual target>
    <compare to inlined target>
    if equal goto <inlined target>
    lookup IBT table
    if (! tag-match) goto <exit stub>
    jump to tag-value
    <inlined target>
    <exit stub>
[Figure: Shared Indirect Branch Target (IBT) Table, mapping original targets to linked targets in the code cache.]

Trick III: Efficient Indirect Branch Handling
[Figure: next VPC → basic block builder → context switch → BASIC BLOCK CACHE (non-control-flow instructions) → indirect branch lookup; a lookup miss returns to the runtime system.]

Performance Effect of indirect branch linking
[Chart: slowdown over native execution, vpr (Spec2000), data sets 1 and 2.]
• block cache: 26.03 and 17.45
• block cache with direct linking: 3.63 and 2.97
• block cache with linking (direct+indirect): 1.20 and 1.15
Performance Problem: poor code layout in code cache

Trick IV: Picking Traces
Block Cache has poor execution efficiency:
• Increased branching, poor locality
Pick traces to:
• reduce branching & improve layout and locality
• New optimization opportunities across block boundaries
[Figure: blocks A through L as laid out in the Block Cache versus the Trace Cache.]

Picking Traces
[Figure: START → dispatch → basic block builder and trace selector → context switch → BASIC BLOCK CACHE and TRACE CACHE (non-control-flow instructions) → indirect branch lookup.]

Picking hot traces
• The goal: path profiling
  – find frequently executed control-flow paths
  – connect basic blocks along these paths into contiguous sequences, called traces.
• The problem: find a good trade-off between
  – profiling overhead (counting execution events), and
  – accuracy of the profile.
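
The effect of Tricks I and II can be illustrated with a toy model of the dispatch loop. This is a sketch under loud assumptions, not Dynamo-RIO's implementation: Python callables stand in for machine-code blocks, `run`, `loop_head`, `loop_tail`, and `PROGRAM` are hypothetical names, and a "context switch" is simply one trip through the outer loop, which plays the role of the runtime system.

```python
def run(program, entry, state, link=True):
    """Run `program` out of a toy block cache; return the context-switch count.

    `program` maps a block address to a callable standing in for the block's
    code: it updates `state` and returns the address of the next block, or
    None when the program exits.
    """
    cache = {}             # block address -> fragment in the block cache
    switches = 0
    vpc, prev = entry, None
    while True:
        switches += 1      # control is back in the runtime system
        if vpc not in cache:
            # Cache miss: "build" the block and emit it into the cache.
            cache[vpc] = {"code": program[vpc], "links": {}}
        frag = cache[vpc]
        if link and prev is not None:
            # Trick II: patch the exit we just came through so that next
            # time it jumps straight to the target fragment.
            prev["links"][vpc] = frag
        while True:        # execute inside the cache
            target = frag["code"](state)
            if target is None:
                return switches
            linked = frag["links"].get(target)
            if linked is None:
                vpc, prev = target, frag   # unlinked exit: context switch
                break
            frag = linked


def loop_head(state):
    state["i"] += 1
    return "B"


def loop_tail(state):
    return "A" if state["i"] < 100 else None


PROGRAM = {"A": loop_head, "B": loop_tail}
unlinked = run(PROGRAM, "A", {"i": 0}, link=False)
linked = run(PROGRAM, "A", {"i": 0}, link=True)
```

In this toy run the 100-iteration loop enters the runtime 200 times without linking and 3 times with it. The counts illustrate the mechanism only and say nothing about the measured slowdowns reported on the slides.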

Alternative 1: Edge profiling
The algorithm:
• Edge profiling: measure frequencies of all control-flow edges, then after a while
• Trace selection: select hot traces by following the highest-frequency branch outcome.
Disadvantages:
• Inaccurate: may select infeasible paths (due to branch correlation)
• Overhead: must profile all control-flow edges

Alternative 2: Bit-tracing path profiling
The algorithm:
  – collect path signatures and their frequencies
  – path signature = <start addr>.history
  – example: <label7>.0101101
  – must include addresses of indirect branches
Advantages:
  – accuracy
Disadvantages:
  – overhead: need to monitor every branch
  – overhead: counter storage (one counter per path!)

Alternative 3: Next Executing Tail (NET)
This is the algorithm of Dynamo:
  – profiling: count only frequencies of start-of-trace points (which are targets of original backedges)
  – trace selection: when a start-of-trace point becomes sufficiently hot, select the sequence of basic blocks executed next.
  – may select a rare (cold) path, but statistically selects a hot path!

NET (continued)
Advantages of NET:
• very light-weight
  – #instrumentation points = #targets of backward branches
  – #counters = #targets of backward branches
• statistically likely to pick the hottest path
• picks only feasible paths
• easy to implement
[Figure: control-flow graph with blocks A through L.]

Spec2000 Performance on Windows (w/o trace optimizations)
[Chart: slowdown vs. native execution (axis 0.0 to 2.2) for art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, parser, perlbmk, twolf, vortex, vpr, and H_MEAN.]

Spec2000 Performance on Linux (w/o trace optimizations)
[Chart: slowdown vs. native execution (axis 0.0 to 1.7) for ammp, applu, apsi, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, perlbmk, sixtrack, swim, twolf, vortex, vpr, wupwise, and H_MEAN.]

Performance on Desktop Applications
[Chart: slowdown vs. native execution (axis 0.0 to 1.6) for Adobe Acrobat, Microsoft Excel, Microsoft PowerPoint, and Microsoft Word.]

Performance Breakdown
• code cache: 86%
• indirect branch lookup: 11%
• trace branch taken: 2%
• rest of system: 1%

Trace optimizations
• Now that we built the traces, let’s optimize them
• But what’s left to optimize in statically optimized code?
• Limitations of static compiler optimization:
  – cost of call-specific interprocedural optimization
  – cost of path-specific optimization in presence of complex control flow
  – difficulty of predicting indirect branch targets
  – lack of access to shared libraries
  – sub-optimal register allocation decisions
  – register allocation for individual array elements or pointers

Maintaining Control (in the real world)
• Capture all code: execution only takes place out of the code cache
• Challenging for abnormal control flow
• System must intercept all abnormal control flow events:
  • Exceptions
  • Callbacks in Windows
  • Asynchronous procedure calls
  • Setjmp/longjmp
  • Set thread context
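
Returning to trace selection, the NET scheme (Alternative 3) can be sketched as follows. This is a simplified model, not Dynamo's code: `select_traces`, `hot_threshold`, and `max_len` are hypothetical names, blocks are identified by integer addresses, and a transition to a lower-or-equal address is assumed to be a backward branch. Only targets of backward branches get a counter; once a counter reaches the threshold, the blocks that execute next are recorded as the trace.

```python
from collections import Counter


def select_traces(executed_blocks, hot_threshold=3, max_len=8):
    """Return {head: trace} chosen by a simplified NET scheme.

    `executed_blocks` is the sequence of basic-block addresses in execution
    order. A trace ends at the next backward branch or after `max_len` blocks.
    """
    counters = Counter()   # one counter per target of a backward branch
    traces = {}
    recording = None       # trace currently being collected, if any
    prev = None
    for block in executed_blocks:
        backward = prev is not None and block <= prev
        if recording is not None:
            if backward or len(recording) >= max_len:
                traces[recording[0]] = tuple(recording)   # trace complete
                recording = None
            else:
                recording.append(block)
        if recording is None and backward and block not in traces:
            counters[block] += 1
            if counters[block] >= hot_threshold:
                # Head became hot: record whatever executes next.
                recording = [block]
        prev = block
    return traces


# A loop whose hot path is 1 -> 2 -> 4, with one cold iteration through 3.
hot_run = [1, 2, 4, 1, 2, 4, 1, 3, 4, 1, 2, 4, 1, 2, 4, 1, 2, 4]
# Same loop, but the cold iteration happens right when the head turns hot.
unlucky_run = [1, 2, 4, 1, 2, 4, 1, 2, 4, 1, 3, 4, 1, 2, 4]
```

For `hot_run` the sketch selects the trace `(1, 2, 4)`. For `unlucky_run` it selects `(1, 3, 4)`, which shows the caveat from the slide: NET may pick a cold path, and it relies on the hot path being the statistically likely one to execute after the head becomes hot.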