Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation
CK Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood (Intel)
Vijay Janapa Reddi (University of Colorado)
http://rogue.colorado.edu/Pin
PLDI’05

Instrumentation
• Insert extra code into programs to collect information about execution
  – Program analysis:
    • Code coverage, call-graph generation, memory-leak detection
  – Architectural study:
    • Processor simulation, fault injection
• Existing binary-level instrumentation systems:
  – Static:
    • ATOM, EEL, Etch, Morph
  – Dynamic:
    • Dyninst, Vulcan, DTrace, Valgrind, Strata, DynamoRIO
=> Pin is a new dynamic binary instrumentation system

Advantages of Pin Instrumentation
1. Easy-to-use instrumentation API
   – Instrumentation code written in C/C++/asm
   – ATOM-like API, based on procedure calls
2. Instrumentation tools portable across platforms
   – Same tools work on IA32, EM64T (x86-64), Itanium, ARM
   – Same tools work on Linux and Windows (ongoing work)
3. Low instrumentation overhead
   – Pin automatically optimizes instrumentation code
   – Pin can attach instrumentation to a running process
4. Robust
   – Handles mixed code and data, variable-length instructions, dynamically-generated code
5. Transparent
   – Application sees original addresses, values, and stack content

A Pintool for Tracing Memory Writes

    #include <cstdio>
    #include "pin.H"

    FILE* trace;

    // Executed immediately before a write is executed
    VOID RecordMemWrite(VOID* ip, VOID* addr, UINT32 size) {
        fprintf(trace, "%p: W %p %u\n", ip, addr, size);
    }

    // Executed when an instruction is dynamically compiled
    VOID Instruction(INS ins, VOID* v) {
        if (INS_IsMemoryWrite(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                           IARG_INST_PTR, IARG_MEMORYWRITE_EA,
                           IARG_MEMORYWRITE_SIZE, IARG_END);
    }

    int main(int argc, char* argv[]) {
        PIN_Init(argc, argv);
        trace = fopen("atrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();
        return 0;
    }

• Same source code works on the 4 architectures
  => Pin takes care of different addressing modes
• No need to manually save/restore application state
  => Pin does it for you automatically and efficiently

Dynamic Instrumentation
[Figure: original code (blocks 1 to 7) next to the code cache holding translated blocks; exits from the code cache point back to Pin]
• Pin fetches the trace starting at block 1 and starts instrumentation (code cache: 1’, 2’, 7’)
• Pin transfers control into the code cache (block 1)
• Pin fetches and instruments a new trace (3’, 5’, 6’); the traces are connected by trace linking

Pin’s Software Architecture
[Figure: one address space containing the Pintool, Pin (instrumentation APIs, virtual machine (VM) with JIT compiler and emulation unit, code cache), and the application, on top of the operating system and hardware]
• 3 programs (Pin, Pintool, App) in same address space:
  – User-level only
• Instrumentation APIs:
  – Through which Pintool communicates with Pin
• JIT compiler:
  – Dynamically compiles and instruments
• Emulation unit:
  – Handles insts that can’t be directly executed (e.g., syscalls)
• Code cache:
  – Stores compiled code
=> Coordinated by VM

Pin Internal Details
• Loading of Pin, Pintool, & Application
• An Improved Trace
  Linking Technique
• Register Re-allocation
• Instrumentation Optimizations
• Multithreading Support

Register Re-allocation
• Instrumented code needs extra registers. E.g.:
  – Virtual registers available to the tool
  – A virtual stack pointer pointing to the instrumentation stack
  – Many more …
• Approaches to get extra registers:
  1. Ad-hoc (e.g., DynamoRIO, Strata, DynInst)
     – Whenever you need a register, spill one and fill it afterward
  2. Re-allocate all registers during compilation
     a. Local allocation (e.g., Valgrind)
        – Allocate registers independently within each trace
     b. Global allocation (Pin)
        – Allocate registers across traces (can be inter-procedural)

Valgrind’s Register Re-allocation
Original code:
        mov 1, %eax
        mov 2, %ebx
        cmp %ecx, %edx
        jz t
    t:  add 1, %eax
        sub 2, %ebx
Trace 1 (re-allocated):
        mov 1, %eax
        mov 2, %esi
        cmp %ecx, %edx
        mov %eax, SPILLeax
        mov %esi, SPILLebx
        jz t’
    Virtual:   %eax  %ebx  %ecx  %edx
    Physical:  %eax  %esi  %ecx  %edx
Trace 2:
    t’: mov SPILLeax, %eax
        mov SPILLebx, %edi
        add 1, %eax
        sub 2, %edi
    Virtual:   %eax  %ebx  %ecx  %edx
    Physical:  %eax  %edi  %ecx  %edx
=> Simple but inefficient
• All modified registers are spilled at a trace’s end
• Refill registers at a trace’s beginning

Pin’s Register Re-allocation
Scenario (1): Compiling a new trace at a trace exit
Original code:
        mov 1, %eax
        mov 2, %ebx
        cmp %ecx, %edx
        jz t
    t:  add 1, %eax
        sub 2, %ebx
Trace 1 (re-allocated):
        mov 1, %eax
        mov 2, %esi
        cmp %ecx, %edx
        jz t’
Trace 2:
    t’: add 1, %eax
        sub 2, %esi
Compile Trace 2 using the binding at Trace 1’s exit:
    Virtual:   %eax  %ebx  %ecx  %edx
    Physical:  %eax  %esi  %ecx  %edx
=> No spilling/filling needed across traces

Pin’s Register Re-allocation
Scenario (2): Targeting an already generated trace at a trace exit
Original code:
        mov 1, %eax
        mov 2, %ebx
        cmp %ecx, %edx
        jz t
    t:  add 1, %eax
        sub 2, %ebx
Trace 1 (being compiled):
        mov 1, %eax
        mov 2, %esi
        cmp %ecx, %edx
        mov %esi, SPILLebx
        mov SPILLebx, %edi
        jz t’
    Virtual:   %eax  %ebx  %ecx  %edx
    Physical:  %eax  %esi  %ecx  %edx
Trace 2 (in code cache):
    t’: add 1, %eax
        sub 2, %edi
    Virtual:   %eax  %ebx  %ecx  %edx
    Physical:  %eax  %edi  %ecx  %edx
=> Minimal spilling/filling code

Instrumentation Optimizations
1. Inline instrumentation code into the application
2. Avoid saving/restoring eflags with liveness analysis
3. Schedule inlined instrumentation code

Example: Instruction Counting
Instrument without applying any optimization
    BBL_InsertCall(bbl, IPOINT_BEFORE, docount(), IARG_UINT32, BBL_NumIns(bbl), IARG_END)
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
Trace:
    mov %esp,SPILLappsp
    mov SPILLpinsp,%esp
    call <bridge>
    cmov %esi, %edi
    mov SPILLappsp,%esp
    cmp %edi, (%esp)
    jle <target1’>
    mov %esp,SPILLappsp
    mov SPILLpinsp,%esp
    call <bridge>
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>
bridge():
    pushf
    push %edx
    push %ecx
    push %eax
    movl 0x3, %eax
    call docount
    pop %eax
    pop %ecx
    pop %edx
    popf
    ret
docount():
    add %eax,icount
    ret
=> 33 extra instructions executed altogether

Example: Instruction Counting
Inlining
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
Trace:
    mov %esp,SPILLappsp
    mov SPILLpinsp,%esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp,%esp
    cmp %edi, (%esp)
    jle <target1’>
    mov %esp,SPILLappsp
    mov SPILLpinsp,%esp
    pushf
    add 0x3, icount
    popf
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>
=> 11 extra instructions executed

Example: Instruction Counting
Inlining + eflags liveness analysis
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
Trace:
    mov %esp,SPILLappsp
    mov SPILLpinsp,%esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp,%esp
    cmp %edi, (%esp)
    jle <target1’>
    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>
=> 7 extra instructions executed

Example: Instruction Counting
Inlining + eflags liveness analysis + scheduling
Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>
    add %ecx, %edx
    cmp %edx, 0
    je <target2>
Trace:
    cmov %esi, %edi
    add 0x3, icount
    cmp %edi, (%esp)
    jle <target1’>
    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>
=> 2 extra instructions executed

Pin Instrumentation Performance
Runtime overhead of basic-block counting with Pin on IA32 (SPEC2K using reference data sets)

    Average slowdown                                    SPECINT  SPECFP
    Without optimization                                10.4     3.9
    Inlining                                            7.8      3.5
    Inlining + eflags liveness analysis                 2.8      1.5
    Inlining + eflags liveness analysis + scheduling    2.5      1.4

Comparison among Dynamic Instrumentation Tools
Runtime overhead of basic-block counting with three different tools

    Average slowdown    SPECINT
    Valgrind            8.3
    DynamoRIO           5.1
    Pin                 2.5

• Valgrind is a popular instrumentation tool on Linux
  • Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in binary dynamic optimization
  • Manually inline, no eflags liveness analysis and scheduling
=> Pin automatically provides efficient instrumentation

Pin Applications
• Sample tools in the Pin distribution:
  – Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler
• Some tools developed and used inside Intel:
  – Opcodemix (analyze code generated by compilers)
  – PinPoints (find representative regions in programs to simulate)
  – A tool for detecting memory bugs
• Some companies are writing their own Pintools:
  – A major database vendor, a major search engine provider
• Some universities using Pin in teaching and research:
  – U.
    of Colorado, MIT, Harvard, Princeton, U of Minnesota, Northeastern, Tufts, University of Rochester, …

Conclusions
• Pin
  – A dynamic instrumentation system for building your own program analysis tools
  – Easy to use, robust, transparent, efficient
  – Tool source compatible on IA32, EM64T, Itanium, ARM
  – Works on large applications
    • database, search engine, web browsers, …
  – Available on Linux; Windows version coming soon
• Downloadable from http://rogue.colorado.edu/Pin
  – User manual, many example tools, tutorials
  – 3300 downloads since July 2004

Acknowledgments
• Prof Dan Connors
  – Hosting Pin website at U of Colorado
• Intel Bistro Team
  – Providing the Falcon decoder/encoder
  – Suggesting instrumentation scheduling
• Mark Charney
  – Providing the XED decoder/encoder
• Ramesh Peri
  – Implementing part of Itanium instrumentation

Backup

Talk Outline
• A Sample Pintool
• Pin Internal Details
• Experimental Results
• Pin Applications
• Conclusions

Trace Linking
• Trace linking is a very effective optimization
  – Bypasses the VM when transferring from one trace to another
  – Slowdown without trace linking can be as much as 100x
• Linking direct branches/calls
  – Straightforward as targets are unique
• Linking indirect branches/calls & returns
  – More challenging because the target can be different each time
  – Our approach:
    • For all indirect control transfers, use chaining
    • For returns, further optimize with function cloning

Indirect Trace Linking
Original indirect jump:
    jmp [%eax]
Chain of predicted targets:
                mov [%eax], T
                jmp target_1’
    target_1’:  if (T != target_1) jmp target_2’
                …
    target_N’:  if (T != target_N) jmp LookupHtab
                …
    LookupHtab: if (hit) jmp translated[T]
                else call Pin (slow path)
• Chains are built incrementally
  – Most recent target inserted at the chain’s head
• Hash table is local to each indirect jump
=> Improved prediction accuracy over existing schemes

Return-Address Prediction
• Distinguish different
  callers to a function by cloning:

Original code:
    A():    call F()
    B():    call F()
    F():    ret
Without cloning (one translated copy of F):
    F’():   pop T
            jmp A’
    A’:     if (T != A) jmp B’
            …
    B’:     if (T != B) jmp Lookuphtab1
            …
With cloning (one translated copy of F per caller):
    F_A’(): pop T
            jmp A’
    A’:     if (T != A) jmp Lookuphtab1
            …
    F_B’(): pop T
            jmp B’
    B’:     if (T != B) jmp Lookuphtab2
            …
=> Prediction accuracy further improved

Pin Multithreading Support
• For instrumenting multithreaded programs:
  – Pin intercepts all threading-related system calls:
    • Creates and starts jitting a thread if a clone() is seen
  – Pin provides a “thread id” for pintools to index thread-local storage
  – Pin’s virtual registers are backed by a per-thread spilling area
• For writing multithreaded pintools:
  – Pin cannot link libpthread into the pintool (to avoid conflicts in setting up signal handlers by two libpthreads), so:
    • Pin implements a subset of libpthread itself
    • Pin can also redirect libpthread calls in the pintool to the application’s libpthread

Instrumenting Multithreaded Programs
• Pin instruments multithreaded programs:
  – Spilling area has to be thread-local
    • Create a new per-thread spilling area when a thread-create system call (e.g., clone()) is intercepted
• How is the per-thread spilling area accessed?
  – Steal a physical register to point to the per-thread spilling area
  – x86-specific optimization:
    • Initially assume a single-threaded program
      – Access the spilling area via its absolute address
    • If multiple threads are detected later:
      – Flush the code cache
      – Recompile with a physical register pointing to the per-thread spilling area

Optimizing Instrumentation Performance
Observations:
  – Slowdown is largely due to executing instrumentation code rather than dynamic compilation
    => Makes sense to spend more time optimizing
  – Focus on optimizing simple instrumentation tools:
    • Performance depends on how fast we can transition between the application and the tool
    • Simple yet commonly used (e.g., basic-block profiling)

Pin Source Code Organization
• Pin source organized into generic, architecture-dependent, and OS-dependent modules:

    Architecture             #source files   #source lines
    Generic                  87 (48%)        53595 (47%)
    x86 (32-bit + 64-bit)    34 (19%)        22794 (20%)
    Itanium                  34 (19%)        20474 (18%)
    ARM                      27 (14%)        17933 (15%)
    TOTAL                    182 (100%)      114796 (100%)

=> ~50% code shared among architectures

Pin Instrumentation Performance
Performance of basic-block counting with Pin/IA32
[Figure: normalized execution time (%) for each SPEC2K benchmark (bzip2 through vpr with INT-AriMean, ammp through wupwise with FP-AriMean) under the four optimization levels]

    Average slowdown                           INT     FP
    Without optimization                       10.4x   3.9x
    Inlining                                   7.8x    3.5x
    Inlining + eflags analysis                 2.8x    1.5x
    Inlining + eflags analysis + scheduling    2.5x    1.4x

Comparison among Dynamic Instrumentation Tools
Performance of basic-block counting with three different tools
[Figure: normalized execution time (%) for each SPECINT benchmark and INT-AriMean with Valgrind, DynamoRIO, and Pin/IA32]
• Valgrind is a popular instrumentation tool on Linux
  • Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in dynamic optimization
  • Manually inline, no eflags liveness analysis and scheduling
=> Pin automatically provides efficient instrumentation

Pin/IA32 Performance (no instrumentation)
[Figure: normalized execution time (%) for each SPEC2K benchmark, with INT-AriMean and FP-AriMean; bars show Total, broken down into Code Cache, JIT-Decode, JIT-Regalloc, JIT-Other, and VM]

Pin/EM64T Performance (no instrumentation)
[Figure: normalized execution time (%) for each SPEC2K benchmark, with INT-AriMean and FP-AriMean; bars show Total, broken down into Code Cache, JIT-Decode, JIT-Regalloc, JIT-Other, and VM]

Pin0/IPF Performance (no instrumentation)
[Figure: normalized execution time (%) for each SPEC2K benchmark, with INT-AriMean and FP-AriMean; bars show Total]
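
Sketch: Indirect Trace Linking
The chaining scheme on the Indirect Trace Linking slide can be illustrated with a small stand-alone C++ model. This is only a sketch under stated assumptions, not Pin's code: the class name IndirectJumpSite, the Path enum, the translate() stand-in, and the cap on chain length (maxChain, with eviction of the oldest entry) are all hypothetical, since the slides do not specify them. It models one indirect jump whose targets are first compared against a chain of predicted targets, then looked up in a hash table local to that jump, and finally handed to the VM slow path, which inserts the new target at the chain's head.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

using Addr = std::uint64_t;  // an original-code target address (T on the slide)

enum class Path { ChainHit, HashHit, SlowPath };

// Hypothetical model of ONE indirect jump site; not Pin's actual code.
class IndirectJumpSite {
public:
    // maxChain is an assumed cap on the chain length (the slides give none).
    explicit IndirectJumpSite(std::size_t maxChain) : maxChain_(maxChain) {
        assert(maxChain_ > 0);
    }

    // Models one dynamic execution of the jump with target t and reports
    // which mechanism resolved it.
    Path resolve(Addr t) {
        // Chain of predicted targets: compare T against target_1..target_N.
        for (Addr predicted : chain_) {
            if (predicted == t) return Path::ChainHit;
        }
        // LookupHtab: hash table local to this indirect jump.
        if (htab_.find(t) != htab_.end()) return Path::HashHit;
        // Miss: the VM slow path translates the target and inserts it at the
        // head of the chain (most recent target first).
        htab_[t] = translate(t);
        chain_.push_front(t);
        if (chain_.size() > maxChain_) chain_.pop_back();  // assumed eviction
        return Path::SlowPath;
    }

    Addr chainHead() const { return chain_.front(); }  // precondition: non-empty
    std::size_t chainLength() const { return chain_.size(); }

private:
    // Stand-in for compiling a trace: returns a made-up code-cache address.
    static Addr translate(Addr t) { return t + 0x100000; }

    std::size_t maxChain_;
    std::deque<Addr> chain_;               // predicted targets, newest first
    std::unordered_map<Addr, Addr> htab_;  // original target -> translated
};
```

With a cap of 2, first-time jumps to 0x10, 0x20, and 0x30 each take the slow path; a later jump to 0x10 misses the two-entry chain but is resolved by the hash table, while 0x30 is found at the chain's head.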