Using Criticality to Attack Performance Bottlenecks
Brian Fields, UC Berkeley
(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Bottleneck Analysis
Bottleneck Analysis: Determining the performance effect of an event on execution time
An event could be:
• an instruction's execution
• an instruction-window-full stall
• a branch mispredict
• a network request
• inter-processor communication
• etc.

Why is Bottleneck Analysis Important?
Bottleneck Analysis Applications
Run-time Optimization
• Resource arbitration
  • e.g., how to schedule memory accesses?
• Effective speculation
  • e.g., which branches to predicate?
• Dynamic reconfiguration
  • e.g., when to enable hyperthreading?
• Energy efficiency
  • e.g., when to throttle frequency?
Design Decisions
• Overcoming technology constraints
  • e.g., how to mitigate the effect of long wire latencies?
Programmer Performance Tuning
• Where have the cycles gone?
  • e.g., which cache misses should be prefetched?

Why is Bottleneck Analysis Hard?
Current state of the art: event counts
Exe. time = (CPU cycles + Mem. cycles) * Clock cycle time
where: Mem. cycles = Number of cache misses * Miss penalty
[Figure: miss1 (100 cycles) and miss2 (100 cycles) overlap in time: 2 misses, but only 1 miss penalty]

Parallelism
Parallelism in systems complicates performance understanding
• Two parallel cache misses
• Two parallel threads
• A branch mispredict and a full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

Criticality Challenges
• Cost
  • How much speedup is possible from optimizing an event?
• Slack
  • How much can an event be "slowed down" before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware

Our Approach: Criticality
Bottleneck Analysis: Determining the performance effect of an event on execution time
Critical events affect execution time; non-critical events do not

Defining criticality
Need performance sensitivity:
• slowing down a "critical" event should slow down the entire program
• speeding up a "noncritical" event should leave execution time unchanged

Standard Waterfall Diagram
[Figure: waterfall diagram over cycles 1-15, with F/E/C pipeline stages for the loop body: R5 = 0; R3 = 0; R1 = #array + R3; R6 = ld[R1]; R3 = R3 + 1; R5 = R6 + R5; cmp R6, 0; bf L1 (mispredicted); R5 = R5 + 100; R0 = R5; Ret R0]

Annotated with Dependence Edges
[Figure: the same waterfall annotated with dependence edges: Fetch BW (F-to-F), Data Dep (E-to-E), ROB (C-to-F), Branch Misp. (E-to-F)]
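The next slides turn this annotated waterfall into a weighted dependence graph whose longest path is the execution time. As a companion, here is a minimal Python sketch of that model; the three-nodes-per-instruction encoding follows the slides, but the helper names, the tiny three-instruction example, and its latencies are invented for illustration.

```python
from collections import defaultdict

def arrival_times(nodes, edges):
    """nodes: list in topological (program/pipeline) order.
    edges: dict (src, dst) -> latency in cycles."""
    succ = defaultdict(list)
    for (u, v), lat in edges.items():
        succ[u].append((v, lat))
    arrival = {n: 0 for n in nodes}
    for u in nodes:                          # relax edges in topological order
        for v, lat in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + lat)
    return arrival

# Three instructions, each with F(etch), E(xecute), C(ommit) nodes.
nodes = [(i, s) for i in range(3) for s in "FEC"]
edges = {}
for i in range(3):
    edges[((i, "F"), (i, "E"))] = 1          # FE: execute after fetch
    edges[((i, "E"), (i, "C"))] = 1          # EC: commit after execute
    if i > 0:
        edges[((i - 1, "F"), (i, "F"))] = 1  # FF: in-order fetch (fetch BW)
        edges[((i - 1, "C"), (i, "C"))] = 1  # CC: in-order commit
edges[((0, "E"), (2, "E"))] = 10             # EE: data dep through a cache miss

arr = arrival_times(nodes, edges)
print("execution time:", arr[(2, "C")])      # longest (critical) path: 12
```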
Edge Weights Added
[Figure: each dependence edge in the waterfall labeled with its latency in cycles]

Convert to Graph
[Figure: the waterfall redrawn as a weighted dependence graph; each instruction contributes F, E, and C nodes, and execution time is the longest weighted path from the first F node to the last C node]

Smaller graph instance
[Figure: a small example graph with a 10-cycle icache-miss edge on the critical path and a non-critical edge off it]
• Critical icache miss, but how costly?
• Non-critical, but how much slack?

Add "hidden" constraints
[Figure: the same graph with the implicit machine constraints added as edges]
• Cost = 13 - 7 = 6 cycles
• Slack = 13 - 7 = 6 cycles

Slack "sharing"
[Figure: two non-critical edges, each with slack = 6 cycles]
Can delay one edge by 6 cycles, but not both!

Machine Imbalance
[Plot: percent of dynamic instructions vs. number of cycles of slack (perl), with curves for global and apportioned slack]
~80% of instructions have at least 5 cycles of apportioned slack

Criticality Challenges (recap)
• Cost • Slack • Interactions • Exploit in Hardware

Simple criticality not always enough
Sometimes events have nearly equal criticality:
  miss #1 (99 cycles) in parallel with miss #2 (100 cycles)
Want to know:
• how critical is each event?
• how far from critical is each event?
Actually, even that is not enough

Our solution: measure interactions
Two parallel cache misses: miss #1 (99 cycles), miss #2 (100 cycles)
Cost(miss #1) = 0
Cost(miss #2) = 1
Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs: a parallel interaction
icost = aggregate cost - sum of individual costs = 100 - 0 - 1 = 99

Interaction cost (icost)
icost = aggregate cost - sum of individual costs
1. Positive icost: parallel interaction (miss #1 alongside miss #2)
2. Zero icost: independent (miss #1 ... miss #2)
3. Negative icost: ?

Negative icost
Two serial cache misses (data dependent):
  miss #1 (100 cycles) feeding miss #2 (100 cycles), in parallel with an ALU chain (110 cycles)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90
Negative icost: a serial interaction

Interaction cost (icost)
icost = aggregate cost - sum of individual costs
1. Positive icost: parallel interaction
2. Zero icost: independent
3. Negative icost: serial interaction
   (e.g., fetch BW, branch mispredict, load-replay, LSQ stall, trap, ALU latency)
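Cost, slack, and icost all fall out of the same longest-path model. Below is a hedged sketch: the cost of a set of events is execution time with the events at full latency minus execution time with them idealized to zero latency, slack is how far one edge can stretch before the program gets slower, and icost is aggregate cost minus the individual costs. The two-miss example reproduces the 99/100-cycle slide; the function names and graph shape are otherwise invented.

```python
from collections import defaultdict

def exec_time(nodes, edges, zeroed=()):
    succ = defaultdict(list)
    for e, lat in edges.items():
        succ[e[0]].append((e[1], 0 if e in zeroed else lat))
    arrival = {n: 0 for n in nodes}
    for u in nodes:                          # nodes are in topological order
        for v, lat in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + lat)
    return max(arrival.values())

def cost(nodes, edges, events):
    return exec_time(nodes, edges) - exec_time(nodes, edges, zeroed=events)

def icost(nodes, edges, e1, e2):
    return (cost(nodes, edges, (e1, e2))
            - cost(nodes, edges, (e1,)) - cost(nodes, edges, (e2,)))

def slack(nodes, edges, e):
    """Cycles edge e can grow before execution time increases."""
    base, s = exec_time(nodes, edges), 0
    grown = dict(edges)
    while True:
        grown[e] = edges[e] + s + 1
        if exec_time(nodes, grown) > base:
            return s
        s += 1

# Two parallel cache misses from the slides: 99 and 100 cycles.
nodes = ["src", "a", "b", "sink"]
m1, m2 = ("src", "a"), ("src", "b")
edges = {m1: 99, m2: 100, ("a", "sink"): 0, ("b", "sink"): 0}
print(cost(nodes, edges, (m1,)))    # 0  (the 100-cycle miss still covers it)
print(cost(nodes, edges, (m2,)))    # 1
print(icost(nodes, edges, m1, m2))  # 99: parallel interaction
print(slack(nodes, edges, m1))      # 1 cycle of slack on the 99-cycle miss
```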
Why care about serial interactions?
  miss #1 (100 cycles) feeding miss #2 (100 cycles), in parallel with ALU latency (110 cycles)
Reason #1: We are over-optimizing!
  Prefetching miss #2 doesn't help if miss #1 is already prefetched (but the overhead still costs us)
Reason #2: We have a choice of what to optimize
  Prefetching miss #2 has the same effect as prefetching miss #1

Icost Case Study: Deep pipelines
Assume a 4-cycle DL1 (level-one data cache) access; how do we mitigate it?
Really, we are looking for serial interactions!

Icost Breakdown (6 wide, 64-entry window)
                 gcc      gzip     vortex
DL1             18.3 %   30.5 %   25.8 %
DL1+window      -4.2    -15.3    -24.5
DL1+bw          10.0      6.0     15.5
DL1+bmisp       -7.0     -3.4     -0.3
DL1+dmiss       -1.4     -0.4     -1.4
DL1+alu         -1.6     -8.2     -4.7
DL1+imiss        0.1      0.0      0.4
...              ...      ...      ...
Total          100.0    100.0    100.0

[Figure: graph fragment for instructions i1-i6 showing the 5-cycle DL1-access edges interacting with the window edge]
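A breakdown table like the one above can be generated mechanically once a cost oracle over sets of events exists (in practice, from the graph model). The sketch below stubs the oracle with numbers chosen to reproduce the vortex column, so the DL1+window row comes out serial (negative) and the DL1+bw row parallel (positive); the stub values are otherwise invented.

```python
EXEC_TIME = 1000.0     # total cycles of the run being broken down

COSTS = {              # stub cost oracle; real values come from the graph
    frozenset({"DL1"}): 258,
    frozenset({"window"}): 300,
    frozenset({"DL1", "window"}): 313,   # far less than 258+300: serial
    frozenset({"bw"}): 100,
    frozenset({"DL1", "bw"}): 513,       # more than 258+100: parallel
}

def cost(*events):
    return COSTS[frozenset(events)]

def icost(a, b):       # pairwise interaction cost
    return cost(a, b) - cost(a) - cost(b)

print(f"DL1          {100 * cost('DL1') / EXEC_TIME:6.1f} %")
for other in ("window", "bw"):
    print(f"DL1+{other:<9}{100 * icost('DL1', other) / EXEC_TIME:6.1f} %")
# -> DL1 25.8 %, DL1+window -24.5 %, DL1+bw 15.5 % (the vortex column)
```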
Criticality Challenges (recap)
• Cost • Slack • Interactions • Exploit in Hardware

Exploit in Hardware
• Criticality Analyzer
  • Online, fast feedback
  • Limited to critical/not critical
• Replacement for Performance Counters
  • Requires offline analysis
  • Constructs the entire graph

Only last-arriving edges can be critical
Observation: for an instruction R1 = R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last
  critical implies arrives last; a dependence resolved early cannot be critical

Determining last-arrive edges
Observe events within the machine:
last_arrive[F] = EF if branch mispredict; CF if ROB stall; FF otherwise
last_arrive[E] = FE if data ready on fetch; EE otherwise (observe the arrival order of operands)
last_arrive[C] = EC if the commit pointer is delayed; CC otherwise

The last-arrive rule
The critical path (CP) consists only of last-arrive edges
[Figure: F/E/C graph with only the last-arrive edges highlighted]

Prune the graph
Only need to put last-arrive edges in the graph; no other edges could be on the CP

...and we've found the critical path!
Backward propagate along last-arrive edges
Found the CP by only observing last-arrive edges, but this still requires constructing the entire graph

Step 2. Reducing storage requirements
The CP is a "long" chain of last-arrive edges: the longer a given chain of last-arrive edges, the more likely it is part of the CP
Algorithm: find sufficiently long last-arrive chains
1. Plant a token into a node n
2. Propagate it forward, only along last-arrive edges
3. Check for the token after several hundred cycles
4. If the token is alive, n is assumed critical

Online Criticality Detection
[Figure: a token planted at a node propagates forward along last-arrive edges; tokens on non-critical chains "die", while a token on the critical chain survives]

Putting it all together
Prediction path: the PC indexes a CP prediction table that answers "E-critical?"
Training path: the token-passing analyzer observes last-arrive edges (producer, retired instruction) from the OOO core

Results
• Performance (speed)
  • Scheduling in clustered machines: 10% speedup
  • Selective value prediction
  • Deferred scheduling (Crowe, et al.): 11% speedup
  • Heterogeneous cache (Rakvic, et al.): 17% speedup
• Energy
  • Non-uniform machine, fast and slow pipelines: ~25% less energy
  • Instruction queue resizing (Sasanka, et al.)
  • Multiple frequency scaling (Semeraro, et al.): 19% less energy with 3% less performance
  • Selective pre-execution (Petric, et al.)
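A software model of the token-passing analyzer, under the assumption that each node's unique last-arriving in-edge is observable via the rules above: plant a token, propagate it only along last-arrive edges for a fixed window, and call the planted node critical if the token is still alive at the end. The five-instruction demo and the window length are invented.

```python
def is_critical(nodes, last_arrive, plant, window=500):
    """nodes: topological order; last_arrive: node -> its unique
    last-arriving predecessor; plant: node under test."""
    tokened = {plant}
    start = nodes.index(plant)
    horizon = nodes[start + 1 : start + 1 + window]
    for v in horizon:
        if last_arrive.get(v) in tokened:
            tokened.add(v)               # token flows along a last-arrive edge
    # alive iff the token reached one of the newest nodes in the window
    return any(v in tokened for v in horizon[-3:])

# Demo: an EE dependence chain keeps the token alive, so E of inst 0 is
# detected as critical; C of inst 0 leads nowhere, so its token dies.
nodes = [(i, s) for i in range(5) for s in "FEC"]
last_arrive = {}
for i in range(5):
    last_arrive[(i, "E")] = (i - 1, "E") if i else (i, "F")  # EE chain
    last_arrive[(i, "C")] = (i, "E")                         # EC
    if i:
        last_arrive[(i, "F")] = (i - 1, "F")                 # FF
print(is_critical(nodes, last_arrive, (0, "E"), window=14))  # True
print(is_critical(nodes, last_arrive, (0, "C"), window=14))  # False
```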
Outline
Simple Criticality
• Definition (ISCA '01)
• Detection (ISCA '01)
• Application (ISCA '01-'02)
Advanced Criticality
• Interpretation (MICRO '03)
  • What types of interactions are possible?
• Hardware Support (MICRO '03, TACO '04)
  • Enhancement to performance counters

Exploit in Hardware
• Criticality Analyzer: online, fast feedback; limited to critical/not critical
• Replacement for Performance Counters: requires offline analysis; constructs the entire graph

Profiling goal
Goal: construct the graph for many dynamic instructions
Constraint: can only sample sparsely

Genome sequencing analogy
Goal: construct the DNA strand
Constraint: can only sample sparsely

"Shotgun" genome sequencing
[Figure: many short random samples are taken from the DNA strand, then assembled by finding overlaps among the samples]

Mapping "shotgun" to our situation
[Figure: a long stream of dynamic instructions, annotated per instruction with icache miss, dcache miss, branch mispredict, or no event]

Profiler hardware requirements
[Figure: overlapping sample fragments are compared against each other until a match is found]
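In software terms the matching step might look like the sketch below: treat each sample as a short string of per-instruction event codes and glue fragments whose ends overlap, exactly the shotgun-sequencing trick. The event codes, fragments, and minimum-overlap threshold are all invented for illustration.

```python
def overlap(a, b, min_len=4):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def stitch(fragments, min_len=4):
    """Greedy left-to-right assembly: fine for a sketch, not real data."""
    assembled = fragments[0]
    for frag in fragments[1:]:
        k = overlap(assembled, frag, min_len)
        assembled += frag[k:]
    return assembled

# Event codes: I = icache miss, D = dcache miss, B = branch misp., . = none
samples = ["..D.BI..", "BI...D.B", ".D.B..I."]
print(stitch(samples))   # ..D.BI...D.B..I.
```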
Sources of error
Error Source                           Gcc      Parser   Twolf
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Errors in graph construction           5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Total                                 12.2 %   14.0 %    8.9 %

Conclusion: Grand Challenges
• Cost: how much speedup is possible from optimizing an event? (addressed by modeling)
• Slack: how much can an event be "slowed down" before increasing execution time? (addressed by the token-passing analyzer)
• Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice? (parallel and serial interactions; shotgun profiling)

Conclusion: Bottleneck Analysis Applications
Run-time Optimization
• Effective speculation: selective value prediction
• Resource arbitration: scheduling and steering in clustered processors
• Dynamic reconfiguration: resizing the instruction window
• Energy efficiency: non-uniform machines
Design Decisions
• Overcoming technology constraints: helped cope with a high-latency dcache
Programmer Performance Tuning
• Where have the cycles gone? Measured the cost of cache misses and branch mispredicts

Backup Slides

Related Work: Criticality Prior Work
Critical-Path Method, PERT charts
• Developed for the Navy's "Polaris" project, 1957
• Used as a project management tool
• Simple critical-path and slack concepts
"Attribution" heuristics
• Rosenblum et al.: SOSP-1995, and many others
• Marks the instruction at the head of the ROB as critical, etc.
• Empirically, has limited accuracy
• Does not account for interactions between events

Related Work: Microprocessor Criticality
• Latency tolerance analysis: Srinivasan and Lebeck, MICRO-1998
• Heuristics-driven criticality predictors: Tune et al., HPCA-2001; Srinivasan et al., ISCA-2001
• "Local" slack detector: Casmira and Grunwald, Kool Chips Workshop-2000
• ProfileMe with pair-wise sampling: Dean et al., MICRO-1997

Alternative I: Addressing Unresolved Issues
Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs)
• Efficient algorithms
• Release a tool for measuring cost/slack
Hardware
• Detailed design for the criticality analyzer
• Shotgun profiler simplifications
  • gradual path from counters
Optimization
• Explore heuristics for exploiting interactions

Alternative II: Chip-Multiprocessors
Programmer Performance Tuning
Parallelizing applications
• What makes a good division into threads?
• How can we find them automatically, or at least help programmers to find them?
Design Decisions
• Should each core support out-of-order execution?
• Should SMT be supported?
• How many processors are useful?
• What is the effect of inter-processor latency?

Unresolved issues: Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs); in other words, unanticipated side effects
Example: Original execution (to compute the cost of inst #3's cache miss):
1. ld r2, [Mem]   (cache miss)
2. add r3 <- r2 + 1
3. ld r4, [Mem]   (cache miss)
4. add r6 <- r4 + 1
[Figure: graph for the original execution, with a contention edge between the two adds due to adder contention]
Altered execution: inst #3 becomes a cache hit, so there is no contention, but the model keeps the contention edge
[Figure: the incorrect critical path runs through a contention edge that should not be there]

Unresolved issues: Modeling and Measurement (cont.)
• How should processor policies be modeled?
  • relationship to the icost definition
• Efficient algorithms for measuring icosts
  • pairs of events, etc.
• Release a tool for measuring cost/slack

Unresolved issues: Hardware and Optimization
Hardware
• Detailed design for the criticality analyzer
  • help convince industry types to build it
• Shotgun profiler simplifications
  • gradual path from counters
Optimization
• Explore icost optimization heuristics
  • icosts are difficult to interpret

Validation: can we trust our model?
Run two simulations:
• Reduce CP latencies: expect a "big" speedup
• Reduce non-CP latencies: expect no speedup
[Plot: speedup per cycle reduced for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; reducing CP latencies yields large speedups, reducing non-CP latencies yields almost none]

Validation
Two steps:
1. Increase the latencies of instructions by their apportioned slack, for three apportioning strategies:
   1) latency + 1
   2) 5 cycles to as many instructions as possible
   3) 12 cycles to as many loads as possible
2. Compare to baseline (no delays inserted)
[Plot: percent of execution time for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average, under baseline and the three strategies]
Worst case: inaccuracy of 0.6%

Slack Measurements
Three slack variants:
• Local slack: # cycles latency can be increased without delaying any subsequent instructions
• Global slack: # cycles latency can be increased without delaying the last instruction in the program
• Apportioned slack: distribute global slack among instructions using an apportioning strategy
[Plot: percent of dynamic instructions vs. number of cycles of slack (perl): ~21% of instructions have at least 5 cycles of local slack, ~90% at least 5 cycles of global slack, ~80% at least 5 cycles of apportioned slack]
A large amount of exploitable slack exists

Application-centered Slack Measurements
Load slack: can we tolerate a long-latency L1 hit?
• design: wire-constrained machine, e.g. Grid
• non-uniformity: multi-latency L1
• apportioning strategy: apportion ALL slack to load instructions
[Plot: percent of dynamic loads vs. cycles of slack on load instructions, for perl, gcc, gzip]
Most loads can tolerate an L2 cache hit

Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
• design: fast/slow ALUs
• non-uniformity: multi-latency execution latency, bypass
• apportioning strategy: give slack equal to original latency + 1
[Plot: percent of dynamic instructions for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average]
Most instructions can tolerate doubling their latency

Slack Locality and Prediction
Predicting slack: two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction
2. Store it in an array indexed by the PC of the static instruction
Two requirements:
1. Locality of slack
2. Ability to measure the slack of a dynamic instruction
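A hedged sketch of the PC-indexed table those two steps describe: train with measured slack from dynamic instances, keep the most conservative value per static PC, and predict by thresholding. The table size, hashing, and min-policy are assumptions, not the exact hardware design.

```python
class SlackPredictor:
    def __init__(self, entries=4096):
        self.entries = entries
        self.table = {}                      # index -> minimum slack seen

    def _index(self, pc):
        return (pc >> 2) % self.entries      # simple direct-mapped hash

    def train(self, pc, measured_slack):
        i = self._index(pc)
        self.table[i] = min(self.table.get(i, measured_slack), measured_slack)

    def predict(self, pc, needed):
        """Does this static instruction likely have `needed` cycles of slack?"""
        return self.table.get(self._index(pc), 0) >= needed

p = SlackPredictor()
p.train(0x400120, 7)            # dynamic instances of one static load
p.train(0x400120, 5)
print(p.predict(0x400120, 5))   # True: at least 5 cycles of slack
print(p.predict(0x400120, 12))  # False
```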
Locality of slack
[Plot: percent of (weighted) static instructions for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for ideal and for predictors correct 90%, 95%, and 100% of the time]
A PC-indexed, history-based predictor can capture most of the available slack

Slack Detector: delay and observe
Effective for a hardware predictor
Problem #1: iterating repeatedly over the same dynamic instruction
  Solution: only sample each dynamic instruction once
Problem #2: determining whether overall execution time increased
  Solution: check whether the delay made the instruction critical

Slack Detector
Goal: determine whether an instruction has n cycles of slack
1. Delay the instruction by n cycles
2. Check if it became critical (via the critical-path analyzer)
   • No: the instruction has n cycles of slack
   • Yes: the instruction does not have n cycles of slack

Slack Application
Fast/slow cluster microarchitecture: save ~37% core power
[Figure: Fetch + Rename feeds a Steer stage, which feeds a fast, 3-wide cluster (window, registers, ALUs) and a slow, 3-wide cluster, connected by a bypass bus to the data cache]
Aggressive non-uniform design:
• Higher execution latencies
• Increased (cross-domain) bypass latency
• Decreased effective issue bandwidth

Picking bins for the slack predictor
Two decisions:
1. Steer to the fast/slow cluster
2. Schedule with high/low priority within a cluster
Use an implicit slack predictor with four bins (see the sketch after the next plot):
1. Steer to fast cluster + schedule with high priority
2. Steer to fast cluster + schedule with low priority
3. Steer to slow cluster + schedule with high priority
4. Steer to slow cluster + schedule with low priority

Slack-based policies
[Plot: normalized IPC for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for 2 fast high-power clusters, the slack-based policy, and reg-dep steering]
10% better performance from hiding non-uniformities
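Putting the detector and the bins together, a sketch of the control loop might read as below. `check_critical` stands in for the token-passing analyzer (here a stub), and the two penalty parameters stand in for the fast/slow and scheduling-priority non-uniformities; all are illustrative.

```python
def has_slack(check_critical, inst, n):
    """Delay-and-observe: delay one dynamic instance of `inst` by n cycles;
    it has n cycles of slack iff the delay did not make it critical."""
    return not check_critical(inst, n)

def pick_bin(slack_cycles, steer_penalty=2, priority_penalty=1):
    """Map measured slack onto the four steering/scheduling bins."""
    if slack_cycles >= steer_penalty + priority_penalty:
        return "slow cluster, low priority"
    if slack_cycles >= steer_penalty:
        return "slow cluster, high priority"
    if slack_cycles >= priority_penalty:
        return "fast cluster, low priority"
    return "fast cluster, high priority"

def stub_checker(inst, n):        # stand-in for the hardware analyzer
    return inst % 2 == 1          # pretend odd "PCs" become critical

for inst in range(4):
    slack = 3 if has_slack(stub_checker, inst, 3) else 0
    print(inst, pick_bin(slack))
```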
CMP Case Study: Multithreaded Execution
Two questions:
• How should a program be divided into threads?
  • what makes a good cutpoint?
  • how can we find cutpoints automatically, or at least help programmers find them?
• What should a multiple-core design look like?
  • should each core support out-of-order execution?
  • should SMT be supported?
  • how many processors are useful?
  • what is the effect of inter-processor latency?

Parallelizing an application
Why parallelize a single-thread application?
• Legacy code, large code bases
• Difficult-to-parallelize apps
• Interpreted code, kernels of operating systems
• We would like to use better programming languages
  • Scheme, Java instead of C/C++

Parallelizing an application
Simplifying assumption: the program binary is unchanged
Simplified problem statement:
  Given a program of length L, find the cutpoint that divides the program into two threads providing maximum speedup
Must consider: data dependences, execution latencies, control dependences, proper load balancing

Parallelizing an application
Naive solution: try every possible cutpoint
Our solution:
• efficiently determine the effect of every possible cutpoint
• model execution before and after every cut
[Figure: the dependence graph from the first instruction to the last instruction, with a cut splitting it into two threads]

Parallelizing an application: considerations
• Synchronization overhead
  • add latency to EE edges
  • synchronization may involve turning EE edges into EF edges
• Scheduling of threads
  • additional CF edges
Challenges:
• State behavior (one thread spread over multiple processors)
  • caches, branch predictor
• Control behavior
  • limits where cutpoints can be made

Parallelizing an application: the general problem
• Divide a program into N threads: NP-complete
Icost can help:
• icost(p1, p2) << 0 implies p1 and p2 are redundant
• action: move p1 and p2 further apart

Preliminary Results
Experimental setup:
• Simulator, based loosely on SimpleScalar
• Alpha SpecInt binaries
Procedure:
1. Assume the execution trace is known
2. Look at each 1k-instruction run
3. Test every possible cutpoint using 1k-instruction graphs (a sketch follows below)

Dynamic Cutpoints
[Plot: cumulative percent of dynamic cutpoints vs. execution-time reduction in cycles, for bzip, crafty, eon, gap, gcc, parser, perl, twolf, vpr]
Only 20% of cuts yield benefits of > 20 cycles

Usefulness of cost-based policy
[Plot: speedups from parallelizing programs for a two-processor system, fixed-interval vs. simple cost-based policies, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr]

Static Cutpoints
[Plot: cumulative percent of static instructions vs. average per-dynamic-instance cost of static instructions, same benchmarks]
Up to 60% of cuts yield benefits of > 20 cycles

Future Avenues of Research
• Map cutpoints back to actual code
• Compare automatically generated cutpoints to human-generated ones
• See what the performance gains are in a simulator, as opposed to just on the graph
• Look at the effect of synchronization operations
  • What additional overhead do they introduce?
• Deal with the state and control problems
  • Might need some technique outside of the graph
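The "test every possible cutpoint" procedure can be phrased directly on a toy graph, as below: break the in-order fetch chain at the cut, charge a synchronization penalty on data dependences that cross it, and compare the modeled two-thread time against the uncut run. The trace, latencies, and sync cost are invented.

```python
from collections import defaultdict

def model_time(n, data_deps, cut=None, sync=2):
    """n instructions; data_deps: (producer, consumer) -> latency.
    A 1-wide in-order fetch chain is broken at `cut`; EE edges that
    cross the cut pay an extra `sync` cycles."""
    edges = defaultdict(int)
    for i in range(1, n):
        if cut is None or i != cut:
            edges[(i - 1, i)] = 1                      # fetch order
    for (p, c), lat in data_deps.items():
        crossing = cut is not None and p < cut <= c
        edges[(p, c)] = max(edges[(p, c)], lat + (sync if crossing else 0))
    finish = [0] * n
    for i in range(n):                                 # longest-path pass
        for (p, c), lat in edges.items():
            if c == i:
                finish[i] = max(finish[i], finish[p] + lat)
    return max(finish)

deps = {(0, 3): 4, (2, 9): 3, (5, 8): 6}               # invented EE edges
base = model_time(10, deps)
for k in range(1, 10):                                  # every cutpoint
    print(f"cut at {k}: benefit {base - model_time(10, deps, cut=k)} cycles")
```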
CMP design study
What we can do:
• Try out many configurations quickly
  • dramatic changes in architecture are often only small changes in the graph
• Identify bottlenecks
  • especially interactions

CMP design study: Out-of-orderness
Is out-of-order execution necessary in a CMP?
Procedure:
• model execution with different configurations
  • adjust CD edges
• compute breakdowns
  • notice which resources/events interact with CD edges
[Figure: the dependence graph from the first instruction to the last instruction, used to model each configuration]

CMP design study: Out-of-orderness
Results summary:
• Single core: performance taps out at 256 entries
• CMP: performance gains up through 1024 entries
  • some benchmarks see gains up to 16k entries
Why more beneficial? Use breakdowns to find out...

CMP design study: Out-of-orderness
Components of window cost:
• cache misses holding up retirement?
• long strands of data dependencies?
• predictable control flow?
Icost breakdowns give quantitative and qualitative answers
[Figure: three cases (independent, parallel interaction, serial interaction) of ALU and cache-miss events vs. window cost, showing how cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) is apportioned in each case]

Summary of Preliminary Results
icost(window, ALU operations) << 0
• primarily communication between processors
• the window is often stalled waiting for data
Implications:
• a larger window may be overkill
• need a cheap non-blocking solution
  • e.g., continual-flow pipelines

CMP design study: SMT?
Benefits:
• reduced thread start-up latency
• reduced communication costs
How we could help:
• distribution of thread lengths
• breakdowns to understand the effect of communication

CMP design study: How many processors?
[Figure: a program divided among successively more processors]

CMP design study: Other Questions
• What is the effect of inter-processor communication latency?
  • understand hidden vs. exposed communication
• Allocating processors to programs
  • methodology for the O/S to better assign programs to processors
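The "try many configurations quickly" claim is concrete in code: a window of size W is just an edge from C(i-W) to F(i), so sweeping W means re-running one longest-path pass per configuration instead of a detailed simulation. The toy pipeline below (1-wide fetch, a cache miss every 7th instruction) is invented; only the shape of the computation is the point.

```python
def exec_time(n, window, miss_lat=20, miss_every=7):
    F = [0] * n; E = [0] * n; C = [0] * n
    for i in range(n):
        F[i] = F[i - 1] + 1 if i else 0          # FF: 1-wide fetch
        if i >= window:
            F[i] = max(F[i], C[i - window])      # CF: window-full stall
        lat = miss_lat if i % miss_every == 0 else 1
        E[i] = F[i] + 1 + lat                    # FE edge, then execution
        C[i] = max(E[i] + 1, (C[i - 1] + 1) if i else 0)  # in-order commit
    return C[n - 1]

for w in (8, 16, 32, 64, 128):                   # one pass per window size
    print(f"window {w:4d}: {exec_time(2000, w)} cycles")
```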
Waterfall To Graph Story
Standard Waterfall Diagram / Annotated with Dependence Edges / Edge Weights Added / Convert to Graph
[Figures: the waterfall-to-graph sequence from the main talk, repeated for reference]
Find Critical Path
[Figure: the longest weighted path highlighted in the dependence graph]
Add Non-last-arriving Edges
[Figure: the full graph with both last-arriving and non-last-arriving edges]
Graph Alterations
[Figure: the graph re-weighted as if the branch misprediction were made correct]

Token-passing analyzer
Step 1. Observing
Observation: for an instruction R1 = R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last (a dependence resolved early cannot be critical)
Determining last-arrive edges: observe events within the machine
last_arrive[F] = EF if branch mispredict; CF if ROB stall; FF otherwise
last_arrive[E] = FE if data ready on fetch; EE otherwise (observe the arrival order of operands)
last_arrive[C] = EC if the commit pointer is delayed; CC otherwise

Last-arrive edges: a CPU stethoscope
[Figure: the CPU emitting its last-arrive edge types: CF, FE, EE, EF, EC, FF, CC]

Remove latencies
Do not need explicit weights

The last-arrive rule
The CP consists only of last-arrive edges

Prune the graph
Only need to put last-arrive edges in the graph; no other edges could be on the CP

...and we've found the critical path!
Backward propagate along last-arrive edges: found the CP by only observing last-arrive edges, but this still requires constructing the entire graph

Step 2. Efficient analysis
The CP is a "long" chain of last-arrive edges: the longer a given chain, the more likely it is part of the CP
Algorithm: find sufficiently long last-arrive chains
1. Plant a token into a node n
2. Propagate it forward, only along last-arrive edges
3. Check for the token after several hundred cycles
4. If the token is alive, n is assumed critical

Token-passing example
1. Plant the token
2. Propagate the token
3. Is the token alive?
4. Yes: train as critical
[Figure: a token planted and propagated across a ROB-sized window of the last-arrive graph]
Found the CP without constructing the entire graph
Implementation: a small SRAM array
[Figure: token queue SRAM; written with the last-arrive producer node (inst id, type) as instructions retire, read as committed (inst id, type) entries to propagate tokens]
Size of SRAM: 3 bits x ROB size, < 200 bytes
Simply replicate for additional tokens

Scheduling and Steering
Case Study #1: Clustered architectures (issue window, steering, scheduling)
1. Current state of the art (Base)
2. Base + CP scheduling
3. Base + CP scheduling + CP steering

Current State of the Art
Constant issue width and clock frequency
[Plot: normalized IPC for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa: unclustered vs. 2-cluster vs. 4-cluster]
Average clustering penalty for 4 clusters: 19%

CP Optimizations
[Plot: the same benchmarks with Base + CP scheduling, then with Base + CP scheduling + CP steering]
Average clustering penalty reduced from 19% to 6%

Token-passing vs. Heuristics: Local vs. Global Analysis
Previous CP predictors: local, resource-sensitive predictions (HPCA '01, ISCA '01)
[Plot: speedup of token-passing vs. oldest-uncommitted and oldest-unissued heuristics for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa]
CP exploitation seems to require global analysis
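A sketch of what a CP-steering policy could look like in software, assuming a CP prediction table keyed by PC (as in "Putting it all together"): predicted-critical instructions take the least-loaded cluster and high scheduling priority, while everything else tolerates the busier cluster and low priority. The load metric and table contents are stand-ins, not the evaluated design.

```python
def steer(pc, cp_table, cluster_load):
    """cp_table: PC -> predicted E-critical?  cluster_load: name -> count."""
    critical = cp_table.get(pc, False)
    if critical:                       # critical work gets the best cluster
        cluster = min(cluster_load, key=cluster_load.get)
    else:                              # non-critical work absorbs imbalance
        cluster = max(cluster_load, key=cluster_load.get)
    cluster_load[cluster] += 1
    return cluster, "high" if critical else "low"

load = {"fast": 0, "slow": 3}          # invented occupancy counts
table = {0x400100: True, 0x400104: False}
for pc in (0x400100, 0x400104, 0x400100):
    print(hex(pc), steer(pc, table, load))
```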
Icost Case Study: Deep pipelines
Deep pipelines cause long latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, ...
But we can often mitigate them indirectly
Assume a 4-cycle DL1 access; how do we mitigate it?
Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses?
Really, we are looking for serial interactions!
[Figure: graph fragment for instructions i1-i6 showing the DL1-access edges and the window edge]

Vortex Breakdowns, enlarging the window
                  64      128     256
DL1              25.8     8.9     3.9
DL1+window      -24.5    -7.7    -2.6
DL1+bw           15.5    16.7    13.2
DL1+bmisp        -0.3    -0.6    -0.8
DL1+dmiss        -1.4    -2.1    -2.8
DL1+alu          -4.7    -2.5    -0.4
DL1+imiss         0.4     0.5     0.3
...               ...     ...     ...
Total           100.0    80.8    75.0

Shotgun Profiling: Offline Profiler Algorithm
[Figure: one long sample and many detailed samples are pieced together]
Design issues:
• Choosing signature bits
  • identify the microexecution context: if the signatures match, the contexts are assumed to match
• Determining PCs (for better detailed-sample matching)
  • walk the long sample's start PC through the static code (e.g., PCs 12, 16, 20, 24, 56, 60, ...); at each branch, encode the taken/not-taken bit in the signature

Icost vs. Sensitivity Study
Corollary to the DL1 and ROB serial interaction:
  As load latency increases, the benefit from enlarging the ROB increases.
[Figure: graph fragment for instructions i1-i6 showing the DL1-access and window edges]
[Plot: speedup vs. ROB size (64 to 256 entries), one curve per DL1 latency (1, 2, 3, 4, 5, 10)]

Compare Icost and Sensitivity Study
Sensitivity-study advantages:
• More information
  • e.g., concave or convex curves
Interaction-cost advantages:
• Easy (automatic) interpretation
  • sign and magnitude have well-defined meanings
• Concise communication
  • "DL1 and ROB interact serially"
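The sensitivity study being compared against can be emulated with the same toy pipeline used in the out-of-orderness sketch earlier: sweep a grid of (DL1 latency, window size) points and watch the window-enlarging benefit grow with load latency, which is the corollary stated above. The model and its parameters are invented.

```python
def exec_time(n, window, dl1, load_every=3):
    F = [0] * n; E = [0] * n; C = [0] * n
    for i in range(n):
        F[i] = F[i - 1] + 1 if i else 0
        if i >= window:
            F[i] = max(F[i], C[i - window])          # window-full stall
        lat = dl1 if i % load_every == 0 else 1      # every 3rd inst is a load
        E[i] = F[i] + 1 + lat
        C[i] = max(E[i] + 1, (C[i - 1] + 1) if i else 0)
    return C[n - 1]

for dl1 in (4, 8, 16, 32):
    small, big = exec_time(3000, 8, dl1), exec_time(3000, 32, dl1)
    print(f"DL1 latency {dl1:2d}: speedup from 8 -> 32 entries = "
          f"{small / big:.2f}x")
```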
Outline
• Definition (ISCA '01)
  • what does it mean for an event to be critical?
• Detection (ISCA '01)
  • how can we determine what events are critical?
• Interpretation (MICRO '04, TACO '04)
  • what does it mean for two events to interact?
• Application (ISCA '01-'02, TACO '04)
  • how can we exploit criticality in hardware?

Criticality Analyzer (ISCA '01)
Goal: detect the criticality of dynamic instructions
Procedure:
1. Observe last-arriving edges
   • uses simple rules
2. Propagate a token forward along last-arriving edges
   • at worst, a read-modify-write sequence to a small array
3. If the token dies, non-critical; otherwise, critical

Slack Analyzer (ISCA '02)
Goal: detect the likely slack of static instructions
Procedure:
1. Delay the instruction by n cycles
2. Check if it became critical (via the critical-path analyzer)
   • No: the instruction has n cycles of slack
   • Yes: the instruction does not have n cycles of slack

Shotgun Profiling (TACO '04)
Goal: create representative graph fragments
Procedure:
• Enhance ProfileMe-style counters with context
• Use the context to piece together counter samples