ELEC 669 Low Power Design Techniques
Lecture 2
Amirali Baniasadi
amirali@ece.uvic.ca

How to write a review?
- Think critically. What if? What is the next step? Any other applications?

Branches
- Instructions that can alter the flow of instruction execution in a program.
- Direct branches:
  - Conditional: if-then-else, for loops (beqz, bnez, etc.)
  - Unconditional: procedure calls (jal), goto (j)

Motivation
- A branch is fetched, but takes N cycles to execute.
- [Figure: overlapped F-D-A-M-W pipeline stages]
- Pipelined execution: a new instruction enters the pipeline every cycle, but each one still takes several cycles to execute.
- Control flow changes: there are two possible paths after a branch is fetched. This introduces pipeline "bubbles".
- Branch delay slots are one answer; prediction offers a chance to avoid these bubbles.
- [Figure: a pipeline bubble in the F-D-A-M-W pipeline]

Techniques for handling branches
- Stalling
- Branch delay slots:
  - Rely on the programmer/compiler to fill them.
  - Depend on being able to find suitable instructions.
  - Tie the resolution delay to a particular pipeline.

Why aren't these techniques acceptable?
- Branches are frequent: 15-25% of instructions.
- Today's pipelines are deeper and wider, so the performance penalty for stalling is higher.
- Misprediction penalty = issue width x resolution delay (in cycles). For example, a 4-wide machine with a 5-cycle resolution delay can waste up to 20 issue slots per misprediction.
- A lot of cycles can be wasted!

Branch Prediction
- Predicting the outcome of a branch:
  - Direction: taken / not taken -> direction predictors.
  - Target address: PC + offset (taken) / PC + 4 (not taken) -> target address predictors, e.g., the Branch Target Buffer (BTB).

Why do we need branch prediction?
- It increases the number of instructions available for the scheduler to issue, which increases instruction-level parallelism (ILP).
- It allows useful work to be completed while waiting for the branch to resolve.

Branch Prediction Strategies
- Static: decided before runtime. Examples: always-not-taken, always-taken, backwards taken/forward not taken (BTFNT), profile-driven prediction.
- Dynamic: prediction decisions may change during the execution of the program.

What happens when a branch is predicted?
- On a misprediction, no speculative state may commit:
  - Squash the instructions in the pipeline.
  - Stores in the pipeline must not be allowed to occur: a store that would not have happened cannot be allowed to commit.
- Even with good branch predictors, more than half of the fetched instructions are squashed.

Instruction traffic due to misprediction
- [Chart: fraction of fetched, decoded, issued, and completed instructions for amm, bzp, cmp, equ, gcc, mcf, mes, prs, and the average]
- Half of the fetched instructions are wasted; most of the waste is in the front end.

Energy Loss due to Mispredictions
- [Chart: energy loss per benchmark (amm, bzp, cmp, equ, gcc, mcf, mes, prs, AVG); lower is better]
- 21% average energy loss; integer benchmarks waste more energy.

Simple Static Predictors
- Simple heuristics:
  - Always taken
  - Always not taken
  - Backwards taken / forward not taken (relies on the compiler to arrange the code following this assertion)
  - Certain opcodes taken
  - Programmer-provided hints
  - Profiling

Simple Static Predictors
- [Chart: prediction accuracy (%) of taken, not-taken, and BTFNT for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 20-80%]

Dynamic Hardware Predictors
- Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not?
- The hardware can look for clues based on the instructions, or it can use past history. We will discuss both of these directions.

A Generic Branch Predictor
- [Figure: fetch uses f(PC, x) to produce a predicted stream of (PC, T or NT); resolve produces the actual stream]
- What is f(PC, x)?
- x can be any relevant information; so far x was empty (static prediction). A minimal interface sketch follows.
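To make f(PC, x) concrete, here is a minimal C sketch (the type and function names are my own, not from the slides), with two of the static predictors above as instances:

    #include <stdbool.h>
    #include <stdint.h>

    /* Generic direction predictor: f(PC, x) -> taken (true) / not taken (false).
       The opaque argument x carries whatever extra information the scheme
       uses; for static schemes it is empty (NULL). */
    typedef bool (*direction_predictor_t)(uint32_t pc, const void *x);

    /* Always-not-taken: ignores both the PC and the context. */
    static bool predict_not_taken(uint32_t pc, const void *x) {
        (void)pc; (void)x;
        return false;
    }

    /* BTFNT: backward branches (target below the PC, i.e. loops) are
       predicted taken, forward branches not taken. Here x is assumed
       to point at the decoded branch target address. */
    static bool predict_btfnt(uint32_t pc, const void *x) {
        uint32_t target = *(const uint32_t *)x;
        return target < pc;
    }

The dynamic predictors that follow are simply richer choices of x: last outcomes, saturating counters, and branch history.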
Bimodal Branch Predictors
- Dynamically store information about branch behaviour:
  - Branches tend to behave in a fixed way.
  - Branches tend to behave in the same way across program execution.
- Index a Pattern History Table (PHT) using the branch address:
  - 1 bit: the branch behaves as it did last time.
  - Saturating 2-bit counter: the branch behaves as it usually does.

Saturating-Counter Predictors
- Consider a strongly biased branch with an infrequent outcome: TTTTTTTTNTTTTTTTTNTTTT
- Last-outcome prediction will mispredict twice per infrequent outcome encountered.
- Idea: remember the most frequent case. A saturating counter adds hysteresis; this is often called a bimodal predictor.

Bimodal Prediction
- A table of 2-bit saturating counters, indexed by the PC; predict the most common direction.
- State machine (predict taken in states 11 and 10, not taken in 01 and 00):
  - 11 (strongly taken): taken -> 11, not taken -> 10
  - 10 (weakly taken): taken -> 11, not taken -> 01
  - 01 (weakly not taken): taken -> 10, not taken -> 00
  - 00 (strongly not taken): taken -> 01, not taken -> 00
- Advantages: simple, cheap, "good" accuracy.
- Bimodal will mispredict only once per infrequent outcome encountered: TTTTTTTTNTTTTTTTTNTTTT

Bimodal Branch Predictors
- [Chart: accuracy (%) of bimodal vs. BTFNT for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 75-100%]

Correlating Predictors
- From the program's perspective, different branches may be correlated:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) then ...
- Can be viewed as a pattern detector: instead of keeping aggregate history information (i.e., the most frequent outcome), keep exact history information, the pattern of the n most recent outcomes.
- Example: a Branch History Register (BHR) holds the n most recent branch outcomes; use the PC and the BHR (XORed?) to access the prediction table.

Pattern-based Prediction
- Nested loops:
    for i = 0 to N
      for j = 0 to 3
        ...
- Branch outcome stream for the j-loop branch: 11101110111011101110
- Patterns:
  - 111 -> 0
  - 110 -> 1
  - 101 -> 1
  - 011 -> 1
- 100% accuracy; learning time is 4 instances; the table is indexed by (PC, 3-bit history).

Two-level Branch Predictors
- A branch outcome depends on the outcomes of previous branches.
- First level: Branch History Registers (BHR)
  - Global history / branch correlation: past executions of all branches.
  - Self history / private history: past executions of the same branch.
- Second level: Pattern History Table (PHT)
  - Use the first-level information to index a table, possibly XORed with the branch address.
  - Usually saturating 2-bit counters; can also be private, shared, or global.

Gshare Predictor (McFarling)
- [Figure: the PC and the global BHR are combined by f to index the branch history table, producing the prediction]
- The PC and BHR can be concatenated, completely overlapped, partially overlapped, XORed, etc.
- How deep should the BHR be? It really depends on the program: deeper increases learning time, but may increase the quality of the information.

Two-level Branch Predictors (II)
- [Chart: accuracy (%) of gshare vs. bimodal for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 85-99%]

Hybrid Prediction
- Combining branch predictors: use two different branch predictors and access both in parallel; a third table determines which prediction to use.
- Two or more predictor components are combined; different branches benefit from different types of history.
- [Figure: the PC indexes gshare and bimodal in parallel; a selector chooses between their T/NT outputs]
- A combined sketch of bimodal, gshare, and a selector follows.
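A compact C sketch of such a hybrid predictor (table sizes, index hashing, and names are illustrative assumptions, not the exact configuration from the slides): bimodal and gshare are both tables of 2-bit saturating counters, and a third counter table chooses between them:

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 4096              /* 12-bit index; illustrative size    */
    static uint8_t bimodal[ENTRIES];  /* 2-bit counters, PC-indexed         */
    static uint8_t gshare[ENTRIES];   /* 2-bit counters, (PC ^ BHR)-indexed */
    static uint8_t chooser[ENTRIES];  /* 2-bit counters: >= 2 favors gshare */
    static uint16_t bhr;              /* global branch history register     */

    static void bump(uint8_t *c, bool up) {   /* saturate at 0 and 3 */
        if (up) { if (*c < 3) (*c)++; }
        else    { if (*c > 0) (*c)--; }
    }

    bool predict(uint32_t pc) {
        uint32_t bi = (pc >> 2) % ENTRIES;
        uint32_t gi = ((pc >> 2) ^ bhr) % ENTRIES;
        bool b = bimodal[bi] >= 2, g = gshare[gi] >= 2;
        return (chooser[bi] >= 2) ? g : b;
    }

    void update(uint32_t pc, bool taken) {
        uint32_t bi = (pc >> 2) % ENTRIES;
        uint32_t gi = ((pc >> 2) ^ bhr) % ENTRIES;
        bool b = bimodal[bi] >= 2, g = gshare[gi] >= 2;
        if (b != g)                   /* train chooser only on disagreement */
            bump(&chooser[bi], g == taken);
        bump(&bimodal[bi], taken);
        bump(&gshare[gi], taken);
        bhr = (uint16_t)((bhr << 1) | taken);  /* shift in the outcome */
    }

Note the design choice visible in update(): the chooser is only trained when the two components disagree, so it learns which type of history each branch benefits from.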
Hybrid Branch Predictors (II)
- [Chart: accuracy (%) of Gshare+Bimod (12K) vs. Gshare (16K) vs. Gshare (4K) for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 90-100%]

Issues Affecting Accurate Branch Prediction
- Aliasing: more than one branch may use the same BHT/PHT entry.
  - Constructive: a prediction that would have been incorrect is predicted correctly.
  - Destructive: a prediction that would have been correct is predicted incorrectly.
  - Neutral: no change in accuracy.

More Issues
- Training time: the predictor needs to see enough branches to uncover the pattern, and enough time to reach steady state.
- "Wrong" history: an incorrect type of history for the branch.
- Stale state: the predictor is updated after the information is needed.
- Operating system context switches: more aliasing, caused by branches in different programs.

Performance Metrics
- Misprediction rate: mispredicted branches per executed branch. Unfortunately, this is the most commonly reported metric.
- Instructions per mispredicted branch: gives a better idea of program behaviour, since branches are not evenly spaced.

Impact of Realistic Branch Prediction
- Limiting the type of branch prediction:
- [Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd, and tomcatv under five schemes: perfect, selective predictor, standard 2-bit, static, and none. FP programs reach 15-45 IPC; integer programs 6-12.]

BPP: A Power-Aware Branch Predictor
- Outline: combined predictors, branch instruction behavior, BPP (Branch Predictor Prediction), results.

Combined Predictors
- Different behaviors call for different sub-predictors; a selector picks the sub-predictor.
- Improved performance over processors using only one sub-predictor.
- Consequence: extra power (~50%).
- [Figure: bimodal and gshare sub-predictors with a selector]

Branch Predictors & Power
- Direct effect: up to 10%.
- Indirect effect: wrong-path instructions. Smaller/less complex predictors waste more energy.
- Power-aware predictors MUST be highly accurate.

Branch Instruction Behavior
- Branches tend to keep using the same sub-predictor:
- [Chart: fraction of branches that use the same sub-predictor, per benchmark (amm, bzp, cmp, equ, gcc, mcf, mes, prs, vor, vpr, wlf, AVG)]

Branch Predictor Prediction (BPP)
- A BPP buffer maps a branch PC to hints about the next two branches.
- How? Hint encoding:
  - 11: mispredicted branch
  - 00: the branch used bimodal last time
  - 01: the branch used gshare last time

BPP: example, first appearance
- [Figure: code sequence A-F; on the first appearance, the BPP buffer is filled with a hint per branch PC, recording whether each upcoming instruction is a bimodal branch (BMD), a gshare branch (GSH), mispredicted, or not a branch]

BPP: example, second appearance
- [Figure: code sequence A-F; on the second appearance, the buffered hints are read ahead of each branch]
- When the hint says the upcoming branch used gshare, gate the selector and bimodal next cycle; when it says bimodal, gate the selector and gshare; otherwise do nothing. A decoding sketch is given after the results below.

Results
- Power (total, and the branch predictor's own) and performance, compared to three base cases:
  a) Non-gated combined predictor (CMB)
  b) Bimodal (BMD)
  c) Gshare (GSH)
- Reported for 32K-entry banked predictors.

Performance
- [Chart: relative performance per benchmark (amm, bzp, cmp, equ, gcc, mcf, mes, prs, vor, vpr, wlf, AVG) vs. CMB, BMD, and GSH]
- Within 0.4% of CMB; better than BMD (by 7%) and GSH (by 3%).

Branch Predictor's Energy
- [Chart: relative branch predictor energy per benchmark vs. CMB, BMD, and GSH]
- 13% less than CMB; more than BMD (by 35%) and GSH (by 22%).

Total Energy
- [Chart: relative total energy per benchmark vs. CMB, BMD, and GSH]
- 0.3%, 4.5%, and 1.8% less than CMB, BMD, and GSH, respectively.
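To illustrate the hint mechanism, here is a small C sketch of how a fetched hint could gate the sub-predictors for the next cycle (the struct, names, and gating policy details are my assumptions; only the 00/01/11 encoding comes from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    enum hint { HINT_BIMODAL = 0x0,    /* 00: branch used bimodal last time */
                HINT_GSHARE  = 0x1,    /* 01: branch used gshare last time  */
                HINT_MISPRED = 0x3 };  /* 11: branch was mispredicted       */

    struct gating { bool bimodal_on, gshare_on, selector_on; };

    /* Given the hint for the upcoming branch, power-gate whatever the
       hint says will not be needed next cycle. */
    struct gating decode_hint(uint8_t h) {
        struct gating g = { true, true, true };   /* default: all on */
        if (h == HINT_BIMODAL) { g.gshare_on  = false; g.selector_on = false; }
        if (h == HINT_GSHARE)  { g.bimodal_on = false; g.selector_on = false; }
        /* HINT_MISPRED: do nothing; keep every structure on. */
        return g;
    }

The intuition is the one measured above: since most branches keep using the same sub-predictor, last time's choice is a reliable hint, and the other sub-predictor plus the selector can sit idle.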
ILP: Benefits and Costs
- How can we extract more ILP? What are the costs?

Upper Limit to ILP: Ideal Machine
- The amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies.
- Instructions that could theoretically be issued per cycle (IPC):
    gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1
- FP programs: 75-150; integer programs: 18-60.

Complexity-Effective Designs
- History: "brainiacs" and "speed demons".
  - Brainiacs: maximize the number of instructions issued per clock cycle.
  - Speed demons: a simpler implementation with a very fast clock.
- A complexity-effective architecture takes both the benefits of complex issue schemes and the benefits of a simpler implementation with a fast clock cycle.
- Complexity measurement: the delay of the critical path.
- Proposed architecture: high performance (high IPC) with a very high clock frequency.

Extracting More Parallelism
- [Figure: today's machines, 4-wide issue with 128-entry windows; future machines, 8-wide with 256 entries?]
- Higher IPC, but what happens to clock and power?
- Want: high IPC + a fast clock + low power.

Generic pipeline description
- Baseline superscalar model.
- Criteria for sources of complexity (delay):
  - structures whose delay is a function of issue window size and issue width;
  - structures which tend to rely on broadcast operations over long wires.

Sources of complexity
- Register renaming logic: translates logical register designators into physical register designators.
- Wakeup logic: responsible for waking up instructions waiting for their source operands to become available.
- Selection logic: responsible for selecting instructions for execution from the pool of ready instructions.
- Bypass logic: bypasses operand values from instructions that have completed execution to dependent instructions.
- Other structures are not considered here:
  - The access time of the register file varies with the number of registers and the number of ports.
  - The access time of a cache is a function of the size of the cache and its associativity.

Register rename logic complexity
- [Figure: structure of the register rename logic]

Delay analysis for rename logic
- The RAM scheme operates like a standard RAM.
- Issue width affects delay through its impact on wire lengths: increasing the issue width increases the number of bit/word lines, so the delay of the rename logic is a linear function of the issue width.
- SPICE results:
  - The total delay and each component's delay increase linearly with issue width.
  - Bit-line and word-line delays worsen as the feature size is reduced. (Logic delay shrinks linearly with feature size, but wire delay falls at a slower rate.)

Wakeup logic
- Responsible for updating source dependences for instructions in the issue window that are waiting for their source operands to become available.
- Basic structure: 2 OR gates and 2 x IW comparators per issue-window entry (see the sketch below).
- Delay analysis: almost a linear function at feature sizes of 0.35um and above; quadratic below 0.35um.

Delay analysis for wakeup logic
- SPICE results (figure 5, at 0.18um): issue width has a greater impact on the delay than window size. Window size mainly affects the tag drive delay (Tdrive); issue width affects Tdrive, Ttagmatch, and TmatchOR.
- (Figure 6: 8-way, 64-entry window): the tag drive and tag match delays are less scalable than the match-OR delay. Tdrive + Ttagmatch is 52% of the total at 0.8um and 62% at 0.18um.
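A behavioral C sketch of the wakeup operation (a software model of what the CAM does, not the circuit; names and sizes are illustrative). It shows why the hardware needs 2 x IW comparators per window entry: every broadcast result tag is compared against both source tags of every entry, each cycle:

    #include <stdbool.h>
    #include <stdint.h>

    #define IW       8      /* issue width: result tags broadcast per cycle */
    #define WINSIZE 64      /* issue window entries */

    struct entry {
        uint8_t src_tag[2];   /* physical register tags of the two sources */
        bool    rdy[2];       /* OR-gate outputs: is each source available? */
    };

    /* One cycle of wakeup: 2 comparators per entry per result bus, i.e.
       2 * IW * WINSIZE = 1024 comparators for this configuration. */
    void wakeup(struct entry win[WINSIZE], const uint8_t result_tag[IW]) {
        for (int e = 0; e < WINSIZE; e++)
            for (int b = 0; b < IW; b++)
                for (int s = 0; s < 2; s++)
                    if (win[e].src_tag[s] == result_tag[b])
                        win[e].rdy[s] = true;   /* match line fires */
    }

    /* An entry raises its REQ to the selection logic when both
       sources are ready. */
    bool request(const struct entry *e) { return e->rdy[0] && e->rdy[1]; }

The nested loops make the scaling problem explicit: comparator count grows with the product of issue width and window size, and in hardware the tag buses also get physically longer.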
Selection Logic
- Responsible for choosing instructions for execution from the pool of ready instructions in the issue window.
- Basic structure: a tree of arbiter cells with REQ (input) and GRANT (output) signals.
- Operation, in two phases:
  - REQ signals propagate up to the root.
  - The GRANT for the highest-priority requester at each arbiter cell propagates back down to the leaf.
- Selection policy (oldest first), one implementation:
  - The left-most entries have the highest priority.
  - The issue window is compacted to the left every time instructions are issued, and new instructions are inserted at the right end.

Delay analysis for selection logic
- The optimal number of arbiter inputs turns out to be four here.
- SPICE results (assuming a single functional unit):
  - The various components of the total delay scale well as the feature size is reduced.
  - All of the delays are logic delays (wires are not considered).
  - It is possible to minimize the effect of wire delays if the ready signals are stored in a smaller, more compact array.

Data bypass logic
- Responsible for forwarding result values from completing instructions to dependent instructions, bypassing the register file.
- Basic structure: in a fully bypassed design,
    bypass paths = 2 x IW^2 x S
  where S is the number of pipeline stages after the first output-producing stage.
- The current trends of deeper pipelining and wider issue make bypass critically important.

Delay analysis for data bypass logic
- The delay is a function of the length of the result wires; increasing the issue width increases the length of the result wires.
- SPICE results (based on the basic structure and layout): the delays are essentially the same for the three technologies (feature sizes), since wire delay does not scale.

Summary of Delays and Pipeline Issues
- Pipeline delay results:
  - For the 4-way machine, the window logic is on the critical path.
  - For the 8-way machine, the bypass logic is on the critical path.
- In future machines, window logic (WL) and bypass logic (BL) will pose the largest problems. Both are difficult to divide into more pipeline segments (they are atomic operations):
  - In WL (wakeup/select): wakeup and select must complete within a cycle for dependent instructions to issue in consecutive cycles.
  - In BL (bypass logic): for dependent operations to execute in consecutive cycles, the bypass value must be made available to the dependent instruction within a cycle. Solution: stall (a trade-off between the cycle time and the bypass bottleneck at wider issue widths).

A Complexity-Effective Microarchitecture
- The dependence-based microarchitecture replaces the issue window with a simpler structure that facilitates a faster clock while exploiting similar levels of parallelism. It naturally lends itself to clustering and helps the bypass problem to a large extent.
- Simple description: "dependent instructions cannot execute in parallel, only consecutively."
  - The issue window is replaced by a small number of FIFO buffers.
  - The FIFO buffers are constrained to issue in order, and dependent instructions are steered to the same FIFO.
  - Register availability only needs to be fanned out to the heads of the FIFO buffers. (In a typical issue window, result tags have to be broadcast to all the entries.)
  - The instructions at the FIFO heads monitor reservation bits (one per physical register) to check for operand availability.
  - An SRC_FIFO table steers instructions to the appropriate buffers:
    - indexed using logical register designators;
    - SRC_FIFO(Ra) gives the identity of the FIFO buffer holding the producer of Ra.

Instruction Steering Heuristics
- Applied heuristics (a sketch follows this list):
  - Case 1: all operands of I are available -> steer I to a new (free) FIFO.
  - Case 2: a single outstanding operand of I, produced by Isource in FIFO fa:
    - if there are no instructions behind Isource in FIFO fa, steer I to FIFO fa;
    - else steer I to a new FIFO.
  - Case 3: two outstanding operands of I -> apply case 2 to one of the two operands.
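A C sketch of this steering logic (the arrays are a software model of the SRC_FIFO table; names and the completion bookkeeping are simplified assumptions):

    #include <stdbool.h>

    #define NFIFOS 8
    #define NREGS  128

    /* src_fifo entries would be cleared to -1 when the producer
       completes; that bookkeeping is not modeled here. */
    static int src_fifo[NREGS];   /* FIFO of each register's producer, -1 = ready */
    static int tail_of[NFIFOS];   /* dest reg of each FIFO's tail instr., -1 = empty */

    static void steer_init(void) {
        for (int r = 0; r < NREGS;  r++) src_fifo[r] = -1;
        for (int f = 0; f < NFIFOS; f++) tail_of[f]  = -1;
    }

    static int alloc_free_fifo(void) {
        for (int f = 0; f < NFIFOS; f++)
            if (tail_of[f] < 0) return f;
        return 0;                 /* all FIFOs busy: a real design stalls here */
    }

    /* Steer instruction I (destination rd, sources ra/rb; -1 = no such
       source). Returns the FIFO that I is appended to. */
    static int steer(int rd, int ra, int rb) {
        int fa = (ra >= 0) ? src_fifo[ra] : -1;
        int fb = (rb >= 0) ? src_fifo[rb] : -1;
        int f;
        if (fa < 0 && fb < 0) {
            f = alloc_free_fifo();          /* case 1: all operands available */
        } else {
            int fs = (fa >= 0) ? fa : fb;   /* case 3 reduces to case 2 by    */
            int rs = (fa >= 0) ? ra : rb;   /* picking one outstanding operand */
            f = (tail_of[fs] == rs)         /* nothing behind the producer?   */
                    ? fs                    /* case 2: same FIFO as producer  */
                    : alloc_free_fifo();    /* otherwise start a new chain    */
        }
        src_fifo[rd] = f;                   /* rd is now produced in FIFO f   */
        tail_of[f]   = rd;
        return f;
    }

Steering a consumer directly behind its producer is what lets dependent chains issue back to back from one FIFO head without any broadcast wakeup.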
Performance results
- Proposed architecture: 8 FIFOs with 8 entries each; baseline architecture: a 64-entry issue window.
- The dependence-based microarchitecture is nearly as effective as the typical window-based microarchitecture (it extracts similar parallelism); the maximum IPC loss is 8%.

Complexity analysis
- Reservation table: if the instruction Ia at the head of FIFO fa is dependent on an instruction Ib still waiting in a FIFO, Ia cannot issue until Ib completes.
- The delay of the wakeup logic is determined by the delay of accessing the reservation table.
- The selection logic is simple, because only the instructions at the FIFO heads need to be considered for selection.
- Effect: the suggested architecture can improve the clock period by as much as 39% in 0.18um technology.

Clustering
- Clustering the dependence-based microarchitecture. Advantages:
  - Wakeup and selection logic are simplified.
  - Because dependent instructions are assigned to the same FIFOs, local bypasses are used more frequently than inter-cluster bypasses, so the overall delay is reduced.
  - Multiple copies of the register file reduce the number of ports per copy (faster register file access).

Performance of Clustering
- Comparison between a 2 x 4-way dependence-based architecture and a conventional 8-way, 64-entry window-based architecture.
- Assuming a 1-cycle local bypass delay and a 2-cycle inter-cluster bypass delay.
- Overall performance, taking clock speed into account: a 16% improvement on average; the IPC difference alone is at most 12%.

Conclusion
- Some important results:
  - The logic associated with the issue window and the data bypass logic will become increasingly critical as future designs employ wider issue widths, bigger windows, and smaller feature sizes.
  - Wire delays will increasingly dominate total delay in future technologies. (Window logic and bypass logic are atomic operations.)
- Complexity-effective architecture: an architecture that facilitates a fast clock while exploiting similar levels of ILP.
- The dependence-based architecture is complexity-effective:
  - it simplifies the window logic;
  - it naturally lends itself to clustering by grouping dependent instructions.

The Motivation for Caches
- [Figure: memory system with Processor - Cache - DRAM]
- Motivation: large memories (DRAM) are slow; small memories (SRAM) are fast.
- Make the average access time small by servicing most accesses from a small, fast memory. (A worked average-access-time example is given at the end of this section.)
- Reduce the bandwidth required of the large memory.

Levels of the Memory Hierarchy
- Upper levels are faster; lower levels are larger.

    Level         Capacity     Access time   Cost                 Xfer unit        Xfer size      Managed by
    Registers     100s bytes   < 10s ns      -                    instr. operands  1-8 bytes      program/compiler
    Cache         K bytes      10-100 ns     $.01-.001/bit        blocks           8-128 bytes    cache controller
    Main memory   M bytes      100 ns-1 us   $.01-.001            pages            512-4K bytes   OS
    Disk          G bytes      ms            10^-3 - 10^-4 cents  files            Mbytes         user/operator
    Tape          infinite     sec-min       10^-6 cents          -                -              -

The Principle of Locality
- [Figure: probability of reference across the address space]
- The principle of locality: programs access a relatively small portion of the address space at any instant of time. Example: 90% of the time is spent in 10% of the code.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.

Memory Hierarchy: Principles of Operation
- At any given time, data is copied between only two adjacent levels:
  - Upper level (cache): the one closer to the processor. Smaller, faster, and uses more expensive technology.
  - Lower level (memory): the one further away from the processor. Bigger, slower, and uses less expensive technology.
- Block: the minimum unit of information that can either be present or not present in the two-level hierarchy.
- [Figure: blocks X and Y transferred between the upper level (cache) and the lower level (memory) on behalf of the processor]
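The "small average access time" goal can be quantified with the standard average memory access time formula (the formula is textbook-standard; the numbers below are illustrative, not from the slides):

    \mathrm{AMAT} \;=\; t_{\mathrm{hit}} \;+\; \text{miss rate} \times t_{\mathrm{penalty}}

For example, assuming a 1 ns SRAM hit time, a 100 ns DRAM miss penalty, and a 95% hit rate:

    \mathrm{AMAT} \;=\; 1\,\mathrm{ns} \;+\; 0.05 \times 100\,\mathrm{ns} \;=\; 6\,\mathrm{ns}

Even though the DRAM is 100x slower than the SRAM, the average access costs only 6x the hit time, which is exactly why servicing most accesses from the small, fast level works.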