Rapid Identification of Architectural Bottlenecks via Precise Event Counting
John Demme, Simha Sethumadhavan
Columbia University
{jdd,simha}@cs.columbia.edu

2002 Language Popularity
[Figure: pie chart of language popularity in 2002. Languages shown: Objective-C, Scheme, C#, Lisp, Python, Delphi, Javascript, Other, Java, PHP, Perl, C, Visual Basic, C++. Side label: Platforms.]
Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index
CASTL: Computer Architecture and Security Technologies Lab

2011 Language Popularity
[Figure: pie chart of language popularity in 2011. Languages shown: Go, Other, Lua, Java, Ruby, Objective-C, C#, C, Scheme, Ada, Lisp, Python, Delphi, C++, PHP, Javascript, Perl, Visual Basic. Platforms: Moore's Law, Multicore.]
Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index

HOW CAN WE POSSIBLY KEEP UP?

Architectural Lifecycle
[Diagram: cycle connecting Performance Data Collection, Human Analysis, and Architectural Improvement]

Performance Data Collection
• Analytical Models
  – Fast, but questionable accuracy
• Simulation
  – Often the gold standard
  – Very detailed information
  – Very slow
• Production Hardware (performance counters)
  – Very fast
  – Not very detailed

Performance Data Collection
• Analytical Models
  – Fast, but questionable accuracy
• Simulation
  – Often the gold standard
  – Very detailed information
  – Very slow
• Production Hardware (Performance Counters)
  – Very fast
  – Not very detailed
  – Relatively detailed

ACCURACY, PRECISION & PERTURBATION
A comparison of performance monitoring techniques and the uncertainty principle

Accuracy, Precision & Perturbation
[Diagram: timeline of normal program execution alongside the corresponding machine state (cache, branch predictor, etc.)]
• In normal execution, the program interacts with the microarchitecture as expected
Precise Instrumentation
[Diagram: timeline of monitored program execution with instrumentation at the start of mutex_lock, mutex_unlock, and barrier_wait, comparing the measured machine state with the "correct" machine state (cache, branch predictor, etc.)]
• When instrumentation is inserted, the machine state is disrupted and measurements are inaccurate

Performance Counter SW Landscape
• Precise: reads counters whenever program or instrumentation requests a read
  – Heavyweight examples: PAPI, perf_event
  – Overhead: proportional to # of reads (PAPI: 1048 ns; perf_event: 262 ns)

Sampling vs. Instrumentation
[Diagram: traditional instrumented program execution reads at the start of mutex_lock, mutex_unlock, and barrier_wait; sampled program execution is interrupted every n cycles]
• Traditional instrumentation like polling
• Sampling uses interrupts

Performance Counter SW Landscape
• Sampling: interrupts every n cycles and extrapolates
  – Examples: vTune, OProfile
  – Overhead: inversely proportional to n; up to 20%, usually much less
• Precise: reads counters whenever program or instrumentation requests a read
  – Heavyweight examples: PAPI, perf_event
  – Overhead: proportional to # of reads (PAPI: 1048 ns; perf_event: 262 ns)

The Problem with Sampling
    if (info->s->concurrent_insert)
      rw_rdlock(&info->s->key_root_lock[inx]);
    changed=_mi_test_if_changed(info);
    if (!flag) {
      switch(info->s->keyinfo[inx].key_alg) {
      /* 37 lines omitted */
    }
    if (info->s->concurrent_insert) {
      if (!error) {
        while (...) {
          /* 10 lines omitted */
        }
      }
      rw_unlock(&info->s->key_root_lock[inx]);
    }
[Slide annotations: "Sample Interrupt"; "Is this a critical section?"]
Conditional Locks

Corrected with Precision
    if (info->s->concurrent_insert)
      rw_rdlock(&info->s->key_root_lock[inx]);
    changed=_mi_test_if_changed(info);
    if (!flag) {
      switch(info->s->keyinfo[inx].key_alg) {
      /* 37 lines omitted */
    }
    if (info->s->concurrent_insert) {
      if (!error) {
        while (...) {
          /* 10 lines omitted */
        }
      }
      rw_unlock(&info->s->key_root_lock[inx]);
    }
[Slide annotations: "Read counter" at two points in the code]
Conditional Locks

But, Precision Adds Overhead
[Diagram: timeline of monitored program execution comparing the measured machine state with the "correct" machine state (cache, branch predictor, etc.)]

Instrumentation Adds Perturbation
[Diagram: timeline of monitored program execution comparing the measured machine state with the "correct" machine state (cache, branch predictor, etc.)]
• If instrumentation sections are short, perturbation is reduced and measurements become more accurate

Performance Counter SW Landscape
• Sampling: interrupts every n cycles and extrapolates
  – Examples: vTune, OProfile
  – Overhead: inversely proportional to n; up to 20%, usually much less
• Precise: reads counters whenever program or instrumentation requests a read
  – Heavyweight examples: PAPI, perf_event
  – Overhead: proportional to # of reads (PAPI: 1048 ns; perf_event: 262 ns)
  – Lightweight: [no entry on this slide]

Performance Counter SW Landscape
• Sampling: interrupts every n cycles and extrapolates
  – Examples: vTune, OProfile
  – Overhead: inversely proportional to n; up to 20%, usually much less
• Precise: reads counters whenever program or instrumentation requests a read
  – Heavyweight examples: PAPI, perf_event
  – Overhead: proportional to # of reads (PAPI: 1048 ns; perf_event: 262 ns)
  – Lightweight example: LiMiT
  – Overhead: proportional to # of reads (11 ns)

Related Work
• No recent papers for better precise counting
  – Original PAPI paper: Browne et al. 2000
  – Some software, none offering LiMiT's features
• Characterizing performance counters
  – Weaver & Dongarra 2010
• Sampling
  – Counter multiplexing techniques
    • Mytkowicz et al. 2007
    • Azimi et al. 2005
  – Trace Alignment
    • Mytkowicz et al. 2006

REDUCING COUNTER READ OVERHEADS
Implementing lightweight, precise monitoring

Why Precision is Slow
• Perfmon2 & perf_event: program requests counter read; kernel reads counter and returns result; program uses value
• LiMiT: program reads counter; program uses value
• Avoid system calls to avoid overhead
• Why is this so hard?

A Self-Monitoring Process

Run, process, run
[Animation frame: counter registers for L1 Misses, Branches, and Cycles increment as the process runs; values shown: 53, 24, 39]

Overflow
[Animation frame: counters L1 Misses 7, Branches 24; the Cycles counter reaches 100 (intermediate values shown: 39, 95). Speech bubble: "Psst!"]

Overflow
[Animation frame: counters L1 Misses 7, Branches 24, Cycles 100; overflow space L1 Misses 0, Branches 0, Cycles 0; the value 100 moves from the Cycles counter into the overflow space]

Modified Read
[Animation frame: counters L1 Misses 7, Branches 24, Cycles 20; overflow space L1 Misses 0, Branches 0, Cycles 100; the read returns 20 + 100 = 120]

Overflow During Read
[Animation frame: the read has fetched 99; counters L1 Misses 7, Branches 24, Cycles 99; overflow space L1 Misses 0, Branches 0, Cycles 0]

Overflow!
[Animation frame: the read still holds 99; counters L1 Misses 7, Branches 24, Cycles 100; overflow space L1 Misses 0, Branches 0, Cycles changes from 0 to 100]

Atomicity Violation!
[Animation frame: the read returns 99 + 100 = 199; counters L1 Misses 7, Branches 24, Cycles 0; overflow space L1 Misses 0, Branches 0, Cycles 100]

OS Detection & Correction
[Animation frame: the read holds 99; counters L1 Misses 7, Branches 24, Cycles 100; overflow space L1 Misses 0, Branches 0, Cycles changes from 0 to 100]

OS Detection & Correction
[Animation frame: the value held by the read is changed from 99 to 0; counters L1 Misses 7, Branches 24, Cycles 0; overflow space L1 Misses 0, Branches 0, Cycles 100. Speech bubble: "Looks like he was reading that…"]

Atomicity Violation Corrected
[Animation frame: the read returns 0 + 100 = 100; counters L1 Misses 7, Branches 24, Cycles 0; overflow space L1 Misses 0, Branches 0, Cycles 100]
So what does all this effort buy us?

Time to collect 3×10^7 readings
Time      PAPI       perf_event   LiMiT     Speedup
User      1.26 s     0.53 s       0.34 s    3.7x / 1.56x
System    30.10 s    7.30 s       0         ∞
Wall      31.44 s    7.87 s       0.34 s    92x / 23.1x

Average LiMiT Readout
Number of instructions    5
Number of cycles          37.14
Time                      11.3 ns

LiMiT Enables Detailed Study
• Short counter reads decrease perturbation
• Little perturbation allows detailed study of
  – Short synchronization regions
  – Short function calls
• Three Case Studies
  – Synchronization in production web applications
    • Not presented here, see paper
  – Synchronization changes in MySQL over time
  – User/Kernel code behavior in runtime libraries

CASE STUDY: LONGITUDINAL STUDY OF LOCKING BEHAVIOR IN MYSQL
Has MySQL gotten better since the advent of multi-cores?

Evolution of Locking in MySQL
• Questions to answer
  – Has MySQL gotten better at locking?
  – What techniques have been used?
• Methodology
  – Intercept pthread locking calls
  – Count overheads and critical sections

MySQL Synchronization Times
[Figure: stacked bar chart of percentage of execution (0% to 100%) split into Free, Locking, Lock Held, and Unlocking for MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), and MySQL 5.5 (Beta, 2009)]

MySQL Critical Sections
[Figure: overall time with lock held (percentage of execution with lock held, 0% to 45%) and avg. lock hold time (average number of cycles lock is held, 0 to 1400) for MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), and MySQL 5.5 (Beta, 2009)]

Number of Locks in MySQL
[Figure: static locks and dynamic locks, on separate axes, for MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), and MySQL 5.5 (Beta, 2009)]

Observations & Implications
• Coarser granularity, better performance
  – Total critical section time has decreased
  – Average CS times have increased
  – Number of locks has decreased
• Performance counters are useful for software engineering studies

CASE STUDY: KERNEL/USERSPACE OVERHEADS IN RUNTIME LIBRARY
Does code in the kernel and runtime library behave?

Full System Analysis w/o Simulation
• Questions to answer
  – How much time do system applications spend in runtime libraries?
  – How well do they perform in them? Why?
• Methodology
  – Intercept common libc, libm and libpthread calls
  – Count user-/kernel-space events during the calls
  – Break down by purpose (I/O, Memory, Pthread)
• Applications
  – MySQL, Apache
• Intel Nehalem Microarchitecture

Execution Cycles in Library Calls
[Figure: percentage of total cycles (0% to 50%) by category (Pthreads, Memory, I/O) for MySQL (User), MySQL (Kernel), Apache (User), and Apache (Kernel)]

MySQL Clocks per Instruction
[Figure: clocks per instruction (0 to 2), User and Kernel, for Libc and Program]

L3 Cache MPKI
[Figure: L3 MPKI by category (I/O, Memory, Pthreads), on two axes (0 to 35 and 0 to 2), for MySQL (User), MySQL (Kernel), Apache (User), and Apache (Kernel)]

I-Cache Stall Cycles
[Figure: percentage of total cycles (0.0% to 3.0%) by category (I/O, Memory, Pthreads) for MySQL (User), MySQL (Kernel), Apache (User), and Apache (Kernel); two bars carry the labels 22.4% and 12.0%]

Observations & Implications
• Apache is fundamentally I/O bound
  – Optimization of the I/O subsystem necessary
• Kernel code suffers from I-Cache stalls
  – Speculation: bad interrupt instruction prefetching
• LiMiT yields detailed performance data
  – Not as accurate or detailed as simulation
  – But gathered in hours rather than weeks

CONCLUSIONS
Research Methodology Implications, Closing thoughts

Conclusions
• Implications from case studies
  – MySQL's multicore experience helped scalability
  – Performance counting for non-architecture
  – Libraries and kernels perform very differently
  – I/O subsystems can be slow
• Research Methodology
  – LiMiT can provide detailed results quickly
  – Simulators are more detailed but slow
  – Opportunity to build microbenchmarks
    • Identify bottlenecks with counters
    • Verify representativeness with counters
    • Then simulate

QUESTIONS?

BACKUP SLIDES
Man down! Need backup!

Performance Evaluation Methods
                       Accuracy   Precision   Speed   Cost
Simulators             ↑          ↑           ↓       ↑/↓
Analytical Models      ?          ?           ↑       ↓
Prototype Hardware     ↑          ↑           ↑       ↑
Production Hardware    ↑/↓        ↑/↓         ↑       ↓
Accuracy and Precision are traded off
• Production hardware provides performance counters
• However, existing interfaces make accuracy/precision tradeoff difficult

Sampling vs. LiMiT
[Diagram: LiMiT instrumented program execution reads at the start of mutex_lock, mutex_unlock, and barrier_wait; sampled program execution is interrupted every n cycles]

Another process runs
[Animation frame: counters relabeled Miles, Pushups, Situps; values shown: 75, 9, 24, 39]

Fix: Virtualization
[Animation frame: counters Miles 30, Pushups 24, Situps 39; another value shown: 7. Speech bubbles: "30 Miles! I did pretty well today." / "No you didn't."]

Avoiding Communication
[Animation frame: one set of counters with Miles 30, Pushups 0, Situps 0; another with Miles 7, Pushups 24, Situps 39]

LiMiT Operation
[Flowchart; labels as recovered from the slide]
• Program Execution
• Kernel Scheduling (Timer Interrupt Handler)
• Counter Reading Code:
    mov $0, %ecx
    rdpmc
    shl $32, %rdx
    orq %rax, %rdx
    addq ovfl, %rdx
• Timer Interrupts: transition to kernel
• Process Swap: kernel saves PMC; different program executes
• Process Swap: kernel attempts to restore PMC; return to program
• PMC0 < 2³¹ or PMC0 >= 2³¹? No: regular timer interrupt processing
• Special kernel handling required to avoid double counting
• Atomicity Violation! Error handler: reset %rdx, %rax before returning to program
• Detect Counter Read (Yes branch)
• Counter Overflow!
• Is the program currently executing a PMC read? Examine interrupted instructions and look for read pattern
• Kernel increments overflow variable and resets counter: ovfl += PMC0; PMC0 = 0

RDTSC
[Figure: chart text not recoverable from the extraction; surviving category labels: No Resource Sharing, Core Sharing (SMT), Process Swapping]

MySQL Instrumentation Overhead
[Figure: MySQL execution cycles (user time), 0.00E+00 to 2.50E+12, for None, LiMiT, perf_event, and PAPI]

CASE STUDY A: LOCKING IN WEB WORKLOADS
How does web-related software use locks?

Locking on the Web
• Questions to answer
  – Is locking a significant concern?
  – How can architects help?
  – Are traditional benchmarks similar?
• Methodology
  – Intercept pthread mutex calls, time w/ LiMiT
• Applications
  – Firefox
  – Apache
  – MySQL
  – PARSEC

Execution Time by Region
[Figure: stacked bar chart of percentage of total user cycles (0% to 100%) split into Free, Lock, Lock Held, and Unlock for Firefox, Apache, Parsec, and MySQL measured with LiMiT, and Apache, Parsec, and MySQL measured with PAPI]

Locking Statistics
                                Firefox   Apache   PARSEC   MySQL
Avg. Lock Held Time (cycles)    789       149      118      1076
Dynamic Locks per 10k Cycles    3.24      1.12     0.545    3.18
Static Locks                    57        1        17       13853

Observations & Implications
• Applications like Firefox and MySQL use locks differently from Apache and PARSEC
  – Many notions of synchronization based on scientific computing probably don't apply
• Locking overheads of up to 8-13%
  – More efficient mechanisms may be helpful
  – But 13% is the upper bound on speedup
• MySQL has some very long critical sections
  – Prime targets for micro-arch optimization
  – If they run faster, MySQL scales better

Hardware Enhancements
• 64-bit Reads and Writes
  – Overflows are the primary source of complexity
  – 64-bit counters w/ full read/write eliminate it
• Destructive Reads
  – Difference = 2 reads, store, load & subtract
  – Destructive read difference = 2 reads
• Combined Reads
  – x86 counter read requires 2 instructions
  – Combining should reduce overhead
• AMD's Lightweight Profiling Proposal
  – Really good, depending on microarchitecture
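
The Hardware Enhancements slide names counter overflows as the primary source of complexity. The sketch below is a toy Python model of that problem, not LiMiT's actual code: a two-step read (fetch the hardware counter, then add the overflow value) can double count if an overflow lands between the two steps, and the modelled kernel handler corrects this by resetting the value the read already holds. The class name, the `fixup` flag, the `in_read` marker, and the overflow threshold of 100 are all assumptions made for illustration; the threshold simply mirrors the numbers used in the animation slides, whereas the LiMiT Operation slide checks PMC0 against 2³¹ and detects an interrupted read by examining the interrupted instructions.

```python
class VirtualCounter:
    """Toy model of one virtualized counter; not LiMiT's actual code."""

    def __init__(self, limit=100):
        self.limit = limit    # hypothetical overflow threshold (slides use 100)
        self.pmc = 0          # hardware counter register
        self.ovfl = 0         # overflow accumulator kept in process memory
        self.reg = 0          # stands in for %rdx:%rax during a read
        self.in_read = False  # True between the counter fetch and the ovfl add

    def tick(self, events=1, fixup=True):
        """Count events; on overflow, run the modelled kernel handler."""
        for _ in range(events):
            self.pmc += 1
            if self.pmc >= self.limit:
                self.ovfl += self.pmc   # ovfl += PMC0
                self.pmc = 0            # PMC0 = 0
                if fixup and self.in_read:
                    self.reg = 0        # reset the interrupted read's register

    def read(self, events_mid_read=0, fixup=True):
        """Two-step read: fetch the counter, then add the overflow value."""
        self.in_read = True
        self.reg = self.pmc                 # step 1: fetch hardware counter
        self.tick(events_mid_read, fixup)   # events landing between the steps
        total = self.reg + self.ovfl        # step 2: add overflow accumulator
        self.in_read = False
        return total


def demo():
    plain = VirtualCounter()
    plain.tick(120)                 # one overflow, then 20 more events
    racy = VirtualCounter()
    racy.tick(99)                   # one event short of overflow
    fixed = VirtualCounter()
    fixed.tick(99)
    return (plain.read(),
            racy.read(events_mid_read=1, fixup=False),
            fixed.read(events_mid_read=1, fixup=True))


print(demo())  # (120, 199, 100)
```

The three results correspond to the Modified Read (20 + 100), Atomicity Violation (99 + 100), and Atomicity Violation Corrected (0 + 100) frames. With full 64-bit readable and writable counters, as the slide proposes, the accumulator and the fix-up path would not be needed in this model.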