ELEC 669 Low Power Design Techniques
Lecture 2
Amirali Baniasadi
amirali@ece.uvic.ca

How to write a review?
- Think critically. What if? What is the next step? Any other applications?

Branches
- Instructions that can alter the flow of instruction execution in a program.
- Direct branches:
  - Conditional: if-then-else, for loops (beqz, bnez, etc.)
  - Unconditional: procedure calls (jal), goto (j)

Motivation
- A branch is fetched, but takes N cycles to execute.
- [Figure: overlapped F-D-A-M-W pipeline stages]
- Pipelined execution: a new instruction enters the pipeline every cycle, but each one still takes several cycles to execute.
- Control flow changes: there are two possible paths after a branch is fetched. This introduces pipeline "bubbles".
- Branch delay slots are one answer; prediction offers a chance to avoid these bubbles.
- [Figure: a pipeline bubble in the F-D-A-M-W pipeline]

Techniques for handling branches
- Stalling
- Branch delay slots:
  - Rely on the programmer/compiler to fill them.
  - Depend on being able to find suitable instructions.
  - Tie the resolution delay to a particular pipeline.

Why aren't these techniques acceptable?
- Branches are frequent: 15-25% of instructions.
- Today's pipelines are deeper and wider, so the performance penalty for stalling is higher.
- Misprediction penalty = issue width x resolution delay (in cycles). For example, a 4-wide machine with a 5-cycle resolution delay can waste up to 20 issue slots per misprediction.
- A lot of cycles can be wasted!

Branch Prediction
- Predicting the outcome of a branch:
  - Direction: taken / not taken -> direction predictors.
  - Target address: PC + offset (taken) / PC + 4 (not taken) -> target address predictors, e.g., the Branch Target Buffer (BTB).

Why do we need branch prediction?
- It increases the number of instructions available for the scheduler to issue, which increases instruction-level parallelism (ILP).
- It allows useful work to be completed while waiting for the branch to resolve.

Branch Prediction Strategies
- Static: decided before runtime. Examples: always-not-taken, always-taken, backwards taken/forward not taken (BTFNT), profile-driven prediction.
- Dynamic: prediction decisions may change during the execution of the program.

What happens when a branch is predicted?
- On a misprediction, no speculative state may commit:
  - Squash the instructions in the pipeline.
  - Stores in the pipeline must not be allowed to occur: a store that would not have happened cannot be allowed to commit.
- Even with good branch predictors, more than half of the fetched instructions are squashed.

Instruction traffic due to misprediction
- [Chart: fraction of fetched, decoded, issued, and completed instructions for amm, bzp, cmp, equ, gcc, mcf, mes, prs, and the average]
- Half of the fetched instructions are wasted; most of the waste is in the front end.

Energy Loss due to Mispredictions
- [Chart: energy loss per benchmark (amm, bzp, cmp, equ, gcc, mcf, mes, prs, AVG); lower is better]
- 21% average energy loss; integer benchmarks waste more energy.

Simple Static Predictors
- Simple heuristics:
  - Always taken
  - Always not taken
  - Backwards taken / forward not taken (relies on the compiler to arrange the code following this assertion)
  - Certain opcodes taken
  - Programmer-provided hints
  - Profiling

Simple Static Predictors
- [Chart: prediction accuracy (%) of taken, not-taken, and BTFNT for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 20-80%]

Dynamic Hardware Predictors
- Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not?
- The hardware can look for clues based on the instructions, or it can use past history. We will discuss both of these directions.

A Generic Branch Predictor
- [Figure: fetch uses f(PC, x) to produce a predicted stream of (PC, T or NT); resolve produces the actual stream]
- What is f(PC, x)?
- x can be any relevant information; so far x was empty (static prediction). A minimal interface sketch follows.
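To make f(PC, x) concrete, here is a minimal C sketch (the type and function names are my own, not from the slides), with two of the static predictors above as instances:

    #include <stdbool.h>
    #include <stdint.h>

    /* Generic direction predictor: f(PC, x) -> taken (true) / not taken (false).
       The opaque argument x carries whatever extra information the scheme
       uses; for static schemes it is empty (NULL). */
    typedef bool (*direction_predictor_t)(uint32_t pc, const void *x);

    /* Always-not-taken: ignores both the PC and the context. */
    static bool predict_not_taken(uint32_t pc, const void *x) {
        (void)pc; (void)x;
        return false;
    }

    /* BTFNT: backward branches (target below the PC, i.e. loops) are
       predicted taken, forward branches not taken. Here x is assumed
       to point at the decoded branch target address. */
    static bool predict_btfnt(uint32_t pc, const void *x) {
        uint32_t target = *(const uint32_t *)x;
        return target < pc;
    }

The dynamic predictors that follow are simply richer choices of x: last outcomes, saturating counters, and branch history.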
Bimodal Branch Predictors
- Dynamically store information about branch behaviour:
  - Branches tend to behave in a fixed way.
  - Branches tend to behave in the same way across program execution.
- Index a Pattern History Table (PHT) using the branch address:
  - 1 bit: the branch behaves as it did last time.
  - Saturating 2-bit counter: the branch behaves as it usually does.

Saturating-Counter Predictors
- Consider a strongly biased branch with an infrequent outcome: TTTTTTTTNTTTTTTTTNTTTT
- Last-outcome prediction will mispredict twice per infrequent outcome encountered.
- Idea: remember the most frequent case. A saturating counter adds hysteresis; this is often called a bimodal predictor.

Bimodal Prediction
- A table of 2-bit saturating counters, indexed by the PC; predict the most common direction.
- State machine (predict taken in states 11 and 10, not taken in 01 and 00):
  - 11 (strongly taken): taken -> 11, not taken -> 10
  - 10 (weakly taken): taken -> 11, not taken -> 01
  - 01 (weakly not taken): taken -> 10, not taken -> 00
  - 00 (strongly not taken): taken -> 01, not taken -> 00
- Advantages: simple, cheap, "good" accuracy.
- Bimodal will mispredict only once per infrequent outcome encountered: TTTTTTTTNTTTTTTTTNTTTT

Bimodal Branch Predictors
- [Chart: accuracy (%) of bimodal vs. BTFNT for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 75-100%]

Correlating Predictors
- From the program's perspective, different branches may be correlated:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) then ...
- Can be viewed as a pattern detector: instead of keeping aggregate history information (i.e., the most frequent outcome), keep exact history information, the pattern of the n most recent outcomes.
- Example: a Branch History Register (BHR) holds the n most recent branch outcomes; use the PC and the BHR (XORed?) to access the prediction table.

Pattern-based Prediction
- Nested loops:
    for i = 0 to N
      for j = 0 to 3
        ...
- Branch outcome stream for the j-loop branch: 11101110111011101110
- Patterns:
  - 111 -> 0
  - 110 -> 1
  - 101 -> 1
  - 011 -> 1
- 100% accuracy; learning time is 4 instances; the table is indexed by (PC, 3-bit history).

Two-level Branch Predictors
- A branch outcome depends on the outcomes of previous branches.
- First level: Branch History Registers (BHR)
  - Global history / branch correlation: past executions of all branches.
  - Self history / private history: past executions of the same branch.
- Second level: Pattern History Table (PHT)
  - Use the first-level information to index a table, possibly XORed with the branch address.
  - Usually saturating 2-bit counters; can also be private, shared, or global.

Gshare Predictor (McFarling)
- [Figure: the PC and the global BHR are combined by f to index the branch history table, producing the prediction]
- The PC and BHR can be concatenated, completely overlapped, partially overlapped, XORed, etc.
- How deep should the BHR be? It really depends on the program: deeper increases learning time, but may increase the quality of the information.

Two-level Branch Predictors (II)
- [Chart: accuracy (%) of gshare vs. bimodal for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 85-99%]

Hybrid Prediction
- Combining branch predictors: use two different branch predictors and access both in parallel; a third table determines which prediction to use.
- Two or more predictor components are combined; different branches benefit from different types of history.
- [Figure: the PC indexes gshare and bimodal in parallel; a selector chooses between their T/NT outputs]
- A combined sketch of bimodal, gshare, and a selector follows.
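A compact C sketch of such a hybrid predictor (table sizes, index hashing, and names are illustrative assumptions, not the exact configuration from the slides): bimodal and gshare are both tables of 2-bit saturating counters, and a third counter table chooses between them:

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 4096              /* 12-bit index; illustrative size    */
    static uint8_t bimodal[ENTRIES];  /* 2-bit counters, PC-indexed         */
    static uint8_t gshare[ENTRIES];   /* 2-bit counters, (PC ^ BHR)-indexed */
    static uint8_t chooser[ENTRIES];  /* 2-bit counters: >= 2 favors gshare */
    static uint16_t bhr;              /* global branch history register     */

    static void bump(uint8_t *c, bool up) {   /* saturate at 0 and 3 */
        if (up) { if (*c < 3) (*c)++; }
        else    { if (*c > 0) (*c)--; }
    }

    bool predict(uint32_t pc) {
        uint32_t bi = (pc >> 2) % ENTRIES;
        uint32_t gi = ((pc >> 2) ^ bhr) % ENTRIES;
        bool b = bimodal[bi] >= 2, g = gshare[gi] >= 2;
        return (chooser[bi] >= 2) ? g : b;
    }

    void update(uint32_t pc, bool taken) {
        uint32_t bi = (pc >> 2) % ENTRIES;
        uint32_t gi = ((pc >> 2) ^ bhr) % ENTRIES;
        bool b = bimodal[bi] >= 2, g = gshare[gi] >= 2;
        if (b != g)                   /* train chooser only on disagreement */
            bump(&chooser[bi], g == taken);
        bump(&bimodal[bi], taken);
        bump(&gshare[gi], taken);
        bhr = (uint16_t)((bhr << 1) | taken);  /* shift in the outcome */
    }

Note the design choice visible in update(): the chooser is only trained when the two components disagree, so it learns which type of history each branch benefits from.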
Hybrid Branch Predictors (II)
- [Chart: accuracy (%) of Gshare+Bimod (12K) vs. Gshare (16K) vs. Gshare (4K) for postgres, vortex, perl, ijpeg, li, gcc, m88ksim; roughly 90-100%]

Issues Affecting Accurate Branch Prediction
- Aliasing: more than one branch may use the same BHT/PHT entry.
  - Constructive: a prediction that would have been incorrect is predicted correctly.
  - Destructive: a prediction that would have been correct is predicted incorrectly.
  - Neutral: no change in accuracy.

More Issues
- Training time: the predictor needs to see enough branches to uncover the pattern, and enough time to reach steady state.
- "Wrong" history: an incorrect type of history for the branch.
- Stale state: the predictor is updated after the information is needed.
- Operating system context switches: more aliasing, caused by branches in different programs.

Performance Metrics
- Misprediction rate: mispredicted branches per executed branch. Unfortunately, this is the most commonly reported metric.
- Instructions per mispredicted branch: gives a better idea of program behaviour, since branches are not evenly spaced.

Impact of Realistic Branch Prediction
- Limiting the type of branch prediction:
- [Chart: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd, and tomcatv under five schemes: perfect, selective predictor, standard 2-bit, static, and none. FP programs reach 15-45 IPC; integer programs 6-12.]

BPP: A Power-Aware Branch Predictor
- Outline: combined predictors, branch instruction behavior, BPP (Branch Predictor Prediction), results.

Combined Predictors
- Different behaviors call for different sub-predictors; a selector picks the sub-predictor.
- Improved performance over processors using only one sub-predictor.
- Consequence: extra power (~50%).
- [Figure: bimodal and gshare sub-predictors with a selector]

Branch Predictors & Power
- Direct effect: up to 10%.
- Indirect effect: wrong-path instructions. Smaller/less complex predictors waste more energy.
- Power-aware predictors MUST be highly accurate.

Branch Instruction Behavior
- Branches tend to keep using the same sub-predictor:
- [Chart: fraction of branches that use the same sub-predictor, per benchmark (amm, bzp, cmp, equ, gcc, mcf, mes, prs, vor, vpr, wlf, AVG)]

Branch Predictor Prediction (BPP)
- A BPP buffer maps a branch PC to hints about the next two branches.
- How? Hint encoding:
  - 11: mispredicted branch
  - 00: the branch used bimodal last time
  - 01: the branch used gshare last time

BPP: example, first appearance
- [Figure: code sequence A-F; on the first appearance, the BPP buffer is filled with a hint per branch PC, recording whether each upcoming instruction is a bimodal branch (BMD), a gshare branch (GSH), mispredicted, or not a branch]

BPP: example, second appearance
- [Figure: code sequence A-F; on the second appearance, the buffered hints are read ahead of each branch]
- When the hint says the upcoming branch used gshare, gate the selector and bimodal next cycle; when it says bimodal, gate the selector and gshare; otherwise do nothing. A decoding sketch is given after the results below.

Results
- Power (total, and the branch predictor's own) and performance, compared to three base cases:
  a) Non-gated combined predictor (CMB)
  b) Bimodal (BMD)
  c) Gshare (GSH)
- Reported for 32K-entry banked predictors.

Performance
- [Chart: relative performance per benchmark (amm, bzp, cmp, equ, gcc, mcf, mes, prs, vor, vpr, wlf, AVG) vs. CMB, BMD, and GSH]
- Within 0.4% of CMB; better than BMD (by 7%) and GSH (by 3%).

Branch Predictor's Energy
- [Chart: relative branch predictor energy per benchmark vs. CMB, BMD, and GSH]
- 13% less than CMB; more than BMD (by 35%) and GSH (by 22%).

Total Energy
- [Chart: relative total energy per benchmark vs. CMB, BMD, and GSH]
- 0.3%, 4.5%, and 1.8% less than CMB, BMD, and GSH, respectively.
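To illustrate the hint mechanism, here is a small C sketch of how a fetched hint could gate the sub-predictors for the next cycle (the struct, names, and gating policy details are my assumptions; only the 00/01/11 encoding comes from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    enum hint { HINT_BIMODAL = 0x0,    /* 00: branch used bimodal last time */
                HINT_GSHARE  = 0x1,    /* 01: branch used gshare last time  */
                HINT_MISPRED = 0x3 };  /* 11: branch was mispredicted       */

    struct gating { bool bimodal_on, gshare_on, selector_on; };

    /* Given the hint for the upcoming branch, power-gate whatever the
       hint says will not be needed next cycle. */
    struct gating decode_hint(uint8_t h) {
        struct gating g = { true, true, true };   /* default: all on */
        if (h == HINT_BIMODAL) { g.gshare_on  = false; g.selector_on = false; }
        if (h == HINT_GSHARE)  { g.bimodal_on = false; g.selector_on = false; }
        /* HINT_MISPRED: do nothing; keep every structure on. */
        return g;
    }

The intuition is the one measured above: since most branches keep using the same sub-predictor, last time's choice is a reliable hint, and the other sub-predictor plus the selector can sit idle.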
ILP: Benefits and Costs
- How can we extract more ILP? What are the costs?

Upper Limit to ILP: Ideal Machine
- The amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies.
- Instructions that could theoretically be issued per cycle (IPC):
    gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1
- FP programs: 75-150; integer programs: 18-60.

Complexity-Effective Designs
- History: "brainiacs" and "speed demons".
  - Brainiacs: maximize the number of instructions issued per clock cycle.
  - Speed demons: a simpler implementation with a very fast clock.
- A complexity-effective architecture takes both the benefits of complex issue schemes and the benefits of a simpler implementation with a fast clock cycle.
- Complexity measurement: the delay of the critical path.
- Proposed architecture: high performance (high IPC) with a very high clock frequency.

Extracting More Parallelism
- [Figure: today's machines, 4-wide issue with 128-entry windows; future machines, 8-wide with 256 entries?]
- Higher IPC, but what happens to clock and power?
- Want: high IPC + a fast clock + low power.

Generic pipeline description
- Baseline superscalar model.
- Criteria for sources of complexity (delay):
  - structures whose delay is a function of issue window size and issue width;
  - structures which tend to rely on broadcast operations over long wires.

Sources of complexity
- Register renaming logic: translates logical register designators into physical register designators.
- Wakeup logic: responsible for waking up instructions waiting for their source operands to become available.
- Selection logic: responsible for selecting instructions for execution from the pool of ready instructions.
- Bypass logic: bypasses operand values from instructions that have completed execution to dependent instructions.
- Other structures are not considered here:
  - The access time of the register file varies with the number of registers and the number of ports.
  - The access time of a cache is a function of the size of the cache and its associativity.

Register rename logic complexity
- [Figure: structure of the register rename logic]

Delay analysis for rename logic
- The RAM scheme operates like a standard RAM.
- Issue width affects delay through its impact on wire lengths: increasing the issue width increases the number of bit/word lines, so the delay of the rename logic is a linear function of the issue width.
- SPICE results:
  - The total delay and each component's delay increase linearly with issue width.
  - Bit-line and word-line delays worsen as the feature size is reduced. (Logic delay shrinks linearly with feature size, but wire delay falls at a slower rate.)

Wakeup logic
- Responsible for updating source dependences for instructions in the issue window that are waiting for their source operands to become available.
- Basic structure: 2 OR gates and 2 x IW comparators per issue-window entry (see the sketch below).
- Delay analysis: almost a linear function at feature sizes of 0.35um and above; quadratic below 0.35um.

Delay analysis for wakeup logic
- SPICE results (figure 5, at 0.18um): issue width has a greater impact on the delay than window size. Window size mainly affects the tag drive delay (Tdrive); issue width affects Tdrive, Ttagmatch, and TmatchOR.
- (Figure 6: 8-way, 64-entry window): the tag drive and tag match delays are less scalable than the match-OR delay. Tdrive + Ttagmatch is 52% of the total at 0.8um and 62% at 0.18um.
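A behavioral C sketch of the wakeup operation (a software model of what the CAM does, not the circuit; names and sizes are illustrative). It shows why the hardware needs 2 x IW comparators per window entry: every broadcast result tag is compared against both source tags of every entry, each cycle:

    #include <stdbool.h>
    #include <stdint.h>

    #define IW       8      /* issue width: result tags broadcast per cycle */
    #define WINSIZE 64      /* issue window entries */

    struct entry {
        uint8_t src_tag[2];   /* physical register tags of the two sources */
        bool    rdy[2];       /* OR-gate outputs: is each source available? */
    };

    /* One cycle of wakeup: 2 comparators per entry per result bus, i.e.
       2 * IW * WINSIZE = 1024 comparators for this configuration. */
    void wakeup(struct entry win[WINSIZE], const uint8_t result_tag[IW]) {
        for (int e = 0; e < WINSIZE; e++)
            for (int b = 0; b < IW; b++)
                for (int s = 0; s < 2; s++)
                    if (win[e].src_tag[s] == result_tag[b])
                        win[e].rdy[s] = true;   /* match line fires */
    }

    /* An entry raises its REQ to the selection logic when both
       sources are ready. */
    bool request(const struct entry *e) { return e->rdy[0] && e->rdy[1]; }

The nested loops make the scaling problem explicit: comparator count grows with the product of issue width and window size, and in hardware the tag buses also get physically longer.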
Selection Logic
- Responsible for choosing instructions for execution from the pool of ready instructions in the issue window.
- Basic structure: a tree of arbiter cells with REQ (input) and GRANT (output) signals.
- Operation, in two phases:
  - REQ signals propagate up to the root.
  - The GRANT for the highest-priority requester at each arbiter cell propagates back down to the leaf.
- Selection policy (oldest first), one implementation:
  - The left-most entries have the highest priority.
  - The issue window is compacted to the left every time instructions are issued, and new instructions are inserted at the right end.

Delay analysis for selection logic
- The optimal number of arbiter inputs turns out to be four here.
- SPICE results (assuming a single functional unit):
  - The various components of the total delay scale well as the feature size is reduced.
  - All of the delays are logic delays (wires are not considered).
  - It is possible to minimize the effect of wire delays if the ready signals are stored in a smaller, more compact array.

Data bypass logic
- Responsible for forwarding result values from completing instructions to dependent instructions, bypassing the register file.
- Basic structure: in a fully bypassed design,
    bypass paths = 2 x IW^2 x S
  where S is the number of pipeline stages after the first output-producing stage.
- The current trends of deeper pipelining and wider issue make bypass critically important.

Delay analysis for data bypass logic
- The delay is a function of the length of the result wires; increasing the issue width increases the length of the result wires.
- SPICE results (based on the basic structure and layout): the delays are essentially the same for the three technologies (feature sizes), since wire delay does not scale.

Summary of Delays and Pipeline Issues
- Pipeline delay results:
  - For the 4-way machine, the window logic is on the critical path.
  - For the 8-way machine, the bypass logic is on the critical path.
- In future machines, window logic (WL) and bypass logic (BL) will pose the largest problems. Both are difficult to divide into more pipeline segments (they are atomic operations):
  - In WL (wakeup/select): wakeup and select must complete within a cycle for dependent instructions to issue in consecutive cycles.
  - In BL (bypass logic): for dependent operations to execute in consecutive cycles, the bypass value must be made available to the dependent instruction within a cycle. Solution: stall (a trade-off between the cycle time and the bypass bottleneck at wider issue widths).

A Complexity-Effective Microarchitecture
- The dependence-based microarchitecture replaces the issue window with a simpler structure that facilitates a faster clock while exploiting similar levels of parallelism. It naturally lends itself to clustering and helps the bypass problem to a large extent.
- Simple description: "dependent instructions cannot execute in parallel, only consecutively."
  - The issue window is replaced by a small number of FIFO buffers.
  - The FIFO buffers are constrained to issue in order, and dependent instructions are steered to the same FIFO.
  - Register availability only needs to be fanned out to the heads of the FIFO buffers. (In a typical issue window, result tags have to be broadcast to all the entries.)
  - The instructions at the FIFO heads monitor reservation bits (one per physical register) to check for operand availability.
  - An SRC_FIFO table steers instructions to the appropriate buffers:
    - indexed using logical register designators;
    - SRC_FIFO(Ra) gives the identity of the FIFO buffer holding the producer of Ra.

Instruction Steering Heuristics
- Applied heuristics (a sketch follows this list):
  - Case 1: all operands of I are available -> steer I to a new (free) FIFO.
  - Case 2: a single outstanding operand of I, produced by Isource in FIFO fa:
    - if there are no instructions behind Isource in FIFO fa, steer I to FIFO fa;
    - else steer I to a new FIFO.
  - Case 3: two outstanding operands of I -> apply case 2 to one of the two operands.
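A C sketch of this steering logic (the arrays are a software model of the SRC_FIFO table; names and the completion bookkeeping are simplified assumptions):

    #include <stdbool.h>

    #define NFIFOS 8
    #define NREGS  128

    /* src_fifo entries would be cleared to -1 when the producer
       completes; that bookkeeping is not modeled here. */
    static int src_fifo[NREGS];   /* FIFO of each register's producer, -1 = ready */
    static int tail_of[NFIFOS];   /* dest reg of each FIFO's tail instr., -1 = empty */

    static void steer_init(void) {
        for (int r = 0; r < NREGS;  r++) src_fifo[r] = -1;
        for (int f = 0; f < NFIFOS; f++) tail_of[f]  = -1;
    }

    static int alloc_free_fifo(void) {
        for (int f = 0; f < NFIFOS; f++)
            if (tail_of[f] < 0) return f;
        return 0;                 /* all FIFOs busy: a real design stalls here */
    }

    /* Steer instruction I (destination rd, sources ra/rb; -1 = no such
       source). Returns the FIFO that I is appended to. */
    static int steer(int rd, int ra, int rb) {
        int fa = (ra >= 0) ? src_fifo[ra] : -1;
        int fb = (rb >= 0) ? src_fifo[rb] : -1;
        int f;
        if (fa < 0 && fb < 0) {
            f = alloc_free_fifo();          /* case 1: all operands available */
        } else {
            int fs = (fa >= 0) ? fa : fb;   /* case 3 reduces to case 2 by    */
            int rs = (fa >= 0) ? ra : rb;   /* picking one outstanding operand */
            f = (tail_of[fs] == rs)         /* nothing behind the producer?   */
                    ? fs                    /* case 2: same FIFO as producer  */
                    : alloc_free_fifo();    /* otherwise start a new chain    */
        }
        src_fifo[rd] = f;                   /* rd is now produced in FIFO f   */
        tail_of[f]   = rd;
        return f;
    }

Steering a consumer directly behind its producer is what lets dependent chains issue back to back from one FIFO head without any broadcast wakeup.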
Performance results
- Proposed architecture: 8 FIFOs with 8 entries each; baseline architecture: a 64-entry issue window.
- The dependence-based microarchitecture is nearly as effective as the typical window-based microarchitecture (it extracts similar parallelism); the maximum IPC loss is 8%.

Complexity analysis
- Reservation table: if the instruction Ia at the head of FIFO fa is dependent on an instruction Ib still waiting in a FIFO, Ia cannot issue until Ib completes.
- The delay of the wakeup logic is determined by the delay of accessing the reservation table.
- The selection logic is simple, because only the instructions at the FIFO heads need to be considered for selection.
- Effect: the suggested architecture can improve the clock period by as much as 39% in 0.18um technology.

Clustering
- Clustering the dependence-based microarchitecture. Advantages:
  - Wakeup and selection logic are simplified.
  - Because dependent instructions are assigned to the same FIFOs, local bypasses are used more frequently than inter-cluster bypasses, so the overall delay is reduced.
  - Multiple copies of the register file reduce the number of ports per copy (faster register file access).

Performance of Clustering
- Comparison between a 2 x 4-way dependence-based architecture and a conventional 8-way, 64-entry window-based architecture.
- Assuming a 1-cycle local bypass delay and a 2-cycle inter-cluster bypass delay.
- Overall performance, taking clock speed into account: a 16% improvement on average; the IPC difference alone is at most 12%.

Conclusion
- Some important results:
  - The logic associated with the issue window and the data bypass logic will become increasingly critical as future designs employ wider issue widths, bigger windows, and smaller feature sizes.
  - Wire delays will increasingly dominate total delay in future technologies. (Window logic and bypass logic are atomic operations.)
- Complexity-effective architecture: an architecture that facilitates a fast clock while exploiting similar levels of ILP.
- The dependence-based architecture is complexity-effective:
  - it simplifies the window logic;
  - it naturally lends itself to clustering by grouping dependent instructions.

The Motivation for Caches
- [Figure: memory system with Processor - Cache - DRAM]
- Motivation: large memories (DRAM) are slow; small memories (SRAM) are fast.
- Make the average access time small by servicing most accesses from a small, fast memory. (A worked average-access-time example is given at the end of this section.)
- Reduce the bandwidth required of the large memory.

Levels of the Memory Hierarchy
- Upper levels are faster; lower levels are larger.

    Level         Capacity     Access time   Cost                 Xfer unit        Xfer size      Managed by
    Registers     100s bytes   < 10s ns      -                    instr. operands  1-8 bytes      program/compiler
    Cache         K bytes      10-100 ns     $.01-.001/bit        blocks           8-128 bytes    cache controller
    Main memory   M bytes      100 ns-1 us   $.01-.001            pages            512-4K bytes   OS
    Disk          G bytes      ms            10^-3 - 10^-4 cents  files            Mbytes         user/operator
    Tape          infinite     sec-min       10^-6 cents          -                -              -

The Principle of Locality
- [Figure: probability of reference across the address space]
- The principle of locality: programs access a relatively small portion of the address space at any instant of time. Example: 90% of the time is spent in 10% of the code.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.

Memory Hierarchy: Principles of Operation
- At any given time, data is copied between only two adjacent levels:
  - Upper level (cache): the one closer to the processor. Smaller, faster, and uses more expensive technology.
  - Lower level (memory): the one further away from the processor. Bigger, slower, and uses less expensive technology.
- Block: the minimum unit of information that can either be present or not present in the two-level hierarchy.
- [Figure: blocks X and Y transferred between the upper level (cache) and the lower level (memory) on behalf of the processor]
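The "small average access time" goal can be quantified with the standard average memory access time formula (the formula is textbook-standard; the numbers below are illustrative, not from the slides):

    \mathrm{AMAT} \;=\; t_{\mathrm{hit}} \;+\; \text{miss rate} \times t_{\mathrm{penalty}}

For example, assuming a 1 ns SRAM hit time, a 100 ns DRAM miss penalty, and a 95% hit rate:

    \mathrm{AMAT} \;=\; 1\,\mathrm{ns} \;+\; 0.05 \times 100\,\mathrm{ns} \;=\; 6\,\mathrm{ns}

Even though the DRAM is 100x slower than the SRAM, the average access costs only 6x the hit time, which is exactly why servicing most accesses from the small, fast level works.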