CS 152 Computer Architecture and Engineering Lecture 16 -- Midterm I Review Session 2014-3-13 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ Play: CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Today - Midterm I Review Session Study tips, test ground rules All questions answered (almost ...) Short break HW 1, problem by problem ... Recall: HW 1 was Fall 05 Mid-term I. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB On Tuesday Mid-term I ... Ground rules ... When is it? Where is it? Ground rules. 9:30 AM sharp, Tuesday March 18th, 306 Soda. Every-other-seat seating, except for the front row, where every-seat is permitted. No blue-books needed. We will be handing out a paper test. Pencil is preferred. Pencils down @ 10:55 AM, so we can collect papers before next class comes in. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB When is it? Where is it? Ground rules. No use of calculators, smartphones, laptops, etc ... during the exam. Closed-book, closed-notes. Just pencils, erasers. No consulting with students. Restroom breaks are OK, but you’ll still need to hand in your exam @ 10:55. Questions are reserved for serious concerns about a bug in the question. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB What does it cover? First 8 lectures. Not to be taken legalistically. For example, WAW hazards were covered in the pipelining lecture, and so it’s fair for me to show a pipeline and say “does this have a WAW hazard”, even if that example was also seen in a later lecture. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Mid-term: How to do well ... Problem intro often features a lecture slide. If you have to teach yourself that slide during the test, you’re starting out behind. Getting the problem correct requires thinking on your feet to do a new design or analyze one given to you. There will not be “you can only get it if do the reading” problems ... but the reading helps you understand how to think through the problem. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Mid-term: There may be math ... No memorization: If we ask about Amdahl’s Law, we will show its definition lecture slide. Understanding is needed: A problem may require you to apply equation to a design, etc. Cannot use You may need to do: electronic devices simple algebra and calculus, ... more add a few numbers by hand, administrative etc. info after we do some content. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Example starting slides ... These are meant to be examples, not a complete list! A problem may start with a slide that is not in this part of the presentation! CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB CS 152 Computer Architecture and Engineering Lecture 2 – Single Cycle Wrap-up 2014-1-23 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Merging data paths ... Add muxes R-format N N N How many ?2 I-format (ignore ALU control) Where ? CS 152: Single-Cycle Design UC Regents Spring 2014 © UCB Adding branch testing to the data path 5 5 5 RegFile rs1 rd1 rs2 32 ws 32 wd RegDest 32 rd2 ALUctr WE Ext RegWr ExtOp MemToReg ALUsrc MemWr Equal (wire into control) Syntax: BEQ $1, $2, 12 Action: If ($1 != $2), PC = PC + 4 Action: If ($1 == $2), PC = PC + 4 + 48 CS 152: Single-Cycle Design UC Regents Spring 2014 © UCB Josh Fisher: idea grew out of his Ph.D (1979) in compilers VLIW Very Long Instruction Words CS 152: L2 Single-Cycle Wrap-up Led to a startup (MultiFlow) whose computers worked, but which went out of business ... the ideas remain influential. UC Regents Spring 2014 © UCB 32-bit & 64-bit semantics different? Yes! Assume: $7 = 7, $8 = 8, $9 = 9, $10 = 10 (decimal) 32-bit MIPS: ADD $8 $9 $10; Result: $8 = 19 ADD $7 $8 $9; Result: $7 = 28 VLIW: Instr: ADD $8 $9 $10 ADD $7 $8 $9 CS 152: Single-Cycle Design ; result $8 = 19 ; result $7 = 17 (not 28) UC Regents Spring 2014 © UCB Branch policy: All instr operators execute BNE $8 $9 Label opcode rs rt ADD $7 $8 $9 rd shamt funct opcode rs rt rd shamt funct ADD executes if branch is taken or not taken. Problem: Large N machines find it hard to fill all operators with useful work. Solution: New “predication” operator. Syntax: SELECT $7 $8 $9 $10 Semantics: If $8 == 0, $7 = $10, else $7 = $9 Permits simple branches to be converted to inline code. CS 152: Single-Cycle Design UC Regents Spring 2014 © UCB Branch nesting in a single instruction ... BEQ $8 $9 LabelOne opcode rs rt rd shamt funct opcode rs rt rd shamt funct BEQ $11 $12 LabelTwo Conundrum: How to define the semantics of multiple branches in one instruction? Solution: Nested branch semantics If $8 == $9, branch to LabelOne Else $11 == $12, branch to LabelTwo CS 152: Single-Cycle Design UC Regents Spring 2014 © UCB CS 152 Computer Architecture and Engineering Lecture 3 – Metrics 2014-1-28 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB CPU time: Proportional to Instruction Count Q. Once ISA is set, who can influence instruction count? A. Compiler writer, application developer. CPU time Program ∝ Q. Static count? (lines of program printout) Or dynamic count? (trace of execution) A. Dynamic. Machine Instructions Program Rationale: Every additional instruction you execute takes time. CS 152 L6: Performance Q. How does a architect influence the number of machine instructions needed to run an algorithm? A. Create new instructions: instruction set architect. UC Regents Fall 2006 © UCB Recall Lecture 2: Multi-flow VLIW CPU Q. Which right-hand-side term decreases with “N” ? Seconds Program = Instructions Program Cycles Seconds Cycle Instruction A. This one gets smaller. Syntax: ADD $8 $9 $10 Semantics:$8 A. We hope this one doesn’t =grow. $9 + $10 opcode rs rt rd shamt funct opcode rs rt rd shamt funct Syntax: ADD $7 $8 $9 Semantics:$7 = $8 + $9 N x 32-bit VLIW yields factor of N speedup! Multiflow: N = 7, 14, or 28 (3 CPUs in product family) CS 152 L3: Metrics + Microcode UC Regents Spring 2014 © UCB A closer look at fan-out ... Driving more gates adds delay. Linear model works for reasonable fan-out FO4: Fanout of four delay. CS 250 L3: Timing Delay time of an inverter UC Regents Fall 2013 © UCB Clock skew also eats into “time budget” CLKd CLKd As T →0, which circuit fails first? CLKd CS 250 L3: Timing UC Regents Fall 2013 © UCB CS 152 Computer Architecture and Engineering Lecture 4 – Pipelining 2014-1-30 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Hazards: An instruction is not a car ... Stage #3 Stage #1 Stage #2 Instr Fetch Decode & Reg Fetch IR ADD R4,R3,R2 OR R5,R4,R2 IR ... wrong value of R4 fetched from RegFile, contract with programmer A broken! Oops! M B CS 152: L4 Pipelining IR R4 not written yet ... New sample program ADD R4,R3,R2 OR R5,R4,R2 An example of a “hazard” -- we must (1) detect and (2) resolve all hazards to make a CPU that matches ISA UC Regents Spring 2014 © UCB Performance Equation and Pipelining Seconds Program = Instr Fetch Instructions Program IR Cycles Instruction Decode & Reg Fetch Stage #3 IR CPI == 1 Once pipe is fill, one instruction completes per A cycle M B CS 152: L4 Pipelining Seconds Cycle IR Clock period is shorter Less work to do in each cycle To get shortest clock period, balance the work to do in each pipeline stage. UC Regents Spring 2014 © UCB Data Hazards: 3 Types (RAW, WAR, WAW) Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. But instead, I2 writes too early, and I1 sees the new value. Write After Write (WAW) hazards. Instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect. WAR and WAW not possible in our 5-stage pipeline. But are possible in other pipeline designs. CS 152: L4 Pipelining UC Regents Spring 2014 © UCB Resolving a RAW hazard by forwarding 1 “IF” Stage 2 3 “ID/RF” Stage Decode & Reg Fetch “EX” Stage Execution OR R5,R4,R2 ADD R4,R3,R2 Instr Fetch Sample program ADD R4,R3,R2 IR OR R5,R4,R2 IR ALU computes R4 in the EX stage, so ... Just forward it back! A Y M M B CS 152: L4 Pipelining IR Unlike stalling, does not change CPI. May hurt cycle time. UC Regents Spring 2014 © UCB CS 152 Computer Architecture and Engineering Lecture 5 – ISA Design + Microcode + Cost 2014-2-4 Born on this day in 2004. Facebook John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ CS 152: L5: ISA Design + Microcode UC Regents Spring 2014 © UCB Register-register 1990s technology was ready for RISC Machine code for c = a + b; Transistors were available for on-chip instruction cache. So, larger code size would not monopolize bandwidth to off-chip DRAM memory. Fixed-length instructions made fast pipelining practical. Not really quantitative ... -----------> For the right target ISA, compiled code quality could match hand-coded assembly. Instruction modes: “C is a high-level assembler” origin w += i w += 3 w += a[100 + i] w += a[i] w += a[i + j] w += a[1001] w += a[*p] a[i++] a[i--] w += a[100 + i + d*j] How compiler technology can inform an ISA decision Example: f=b+c+d-1 becomes: Temporaries: a, e During code generation, a compiler allocates registers to temps when available, because registers are faster than memory. In the general case, register allocation task is NP-complete ... There are good heuristic solutions, but they require 16 free registers (preferably more) to work well. This line of reasoning quantifies one advantage of 32 general purpose registers An example of a complex instruction 8 byte instruction fetch amortized by 28 byte data move MOVEM.L D0/D4-D7/A4/A5,40(A6) Move the 32-bit data stored in 7 registers (D0, D4, D5, D6, D7, A4, A5) to the region of memory pointed to by A^, displaced by 28H bytes. Takes 58 clock cycles to execute. Requires non-architected state to keep track of memory and register indices. But ... what exactly is microcode? Each clock cycle, we can think of this data path as executing an 11-bit microcode instruction word: One microcode instruction - binary format DN 0 Assembler format OE1, WEP; WE1 WE2 WE3 WE4 WEP OE1 OE2 OE3 OE4 OEP 0 0 0 0 1 1 0 R1 D 0 0 a list of “1” columns. P R4 Q 32 32 D ... OE WE 0 D Q 32 32 WE OE Q 32 32 WE OE DN WE1 OE1 WE4 OE4 WEP OEP 32 CS 152: L7: Power and Energy UC Regents Spring 2014 © UCB 353 Haswell CPU dies on this wafer ==> $4.80 per die! 11 11 But if die were twice as big ... $9.60 per die! This is one reason 21 21 why die size 27 27 matters. 31 31 Die counts 33 35 33 37 35 per column This analysis is optimistic ... CS 152 Computer Architecture and Engineering Lecture 6 – Superpipelining + Branch Prediction 2014-2-6 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ CS 152: L6: Superpipelining + Branch Prediction UC Regents Spring 2014 © UCB Pipelining a 256 byte instruction memory. Fully combinational (and slow). Only read behavior shown. Can we add two pipeline stages? A7-A0: 8-bit read address { 3 { A7 A6 A5 A4 A3 A2 3 OE --> Tri-state Q outputs! OE 1 D E M U . X . . OE Byte 0-31 256 Q 256 Byte 32-63 Q ... OE . . . Byte 224-255 Q 256 M U X Data 3 output is 32 bits D0-D31 32 i.e. 4 bytes 256 Each register holds 32 bytes (256 bits) CS 152: L6: Superpipelining + Branch Prediction UC Regents Spring 2014 © UCB Spatial Predictors C code snippet: b1 b2 b3 After compilation: We want to predict this branch. Idea: Devote hardware to four 2-bit predictors for BEQZ branch. P1: Use if b1 and b2 not taken. P2: Use if b1 taken, b2 not taken. P3: Use if b1 not taken, b2 taken. P4: Use if b1 and b2 taken. Track the current taken/not-taken status of b1 and b2, and use it to choose from P1 ... P4 for BEQZ ... How? b1 b1 b2 b2 b3 Can b1 and b2 help us predict it? CS 152 Computer Architecture and Engineering Lecture 7 -- Power and Energy 2014-2-11 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ CS 152: L7: Power and Energy UC Regents Spring 2014 © UCB The Watt: Unit of power. A rate of energy (J/s). A gas pump hose delivers 6 MW. 120 KW: The power delivered by a Tesla Supercharger. Tesla Model S has a 306 MJ battery 1J=1W (good for 265 miles). CS 152: L7: Power and Energy The Joule: Unit of energy. A 1 Gallon gas container holds 130 MJ of energy. 1 W = 1 J/s. UC Regents Spring 2014 © UCB And so, we can transform this: Gate delay roughly linear with Vdd 2 P ~ F ⨯ Vdd 2 P~1⨯1 Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”. Into this: Top block processes “left”, bottom “right”. 2 Vdd P ~ #blks ⨯ F ⨯ P ~ 2 ⨯ 1/2 ⨯ 1/4 = 1/4 CV2 power only This magic trick brought to you by Cory Hall ... CS 152: L7: Power and Energy UC Regents Spring 2014 © UCB CS 152 Computer Architecture and Engineering Lecture 8 -- CPU Verification 2014-2-13 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love www-inst.eecs.berkeley.edu/~cs152/ CS 152: L7: Power and Energy UC Regents Spring 2014 © UCB “CPU program” diagnosis is tricky ... Observation: On a buggy CPU model, the correctness of every executed instruction is suspect. Consequence: One needs to verify the correctness of instructions that surround the suspected buggy instruction. Depends on: (1) number of “instructions in flight” in the machine, and (2) lifetime of non-architected state (may be “indefinite”). CS 250 L11: Design Verification UC Regents Fall 2012 © UCB Combinational Unit Testing: 32-bit Adder Number of input bits ? 65 Cin A Total number of possible input values? 32 + B 32 Sum 2 65 = 3.689e+19 32 Just test them all? Cout Exhaustive testing does not “scale”. “Combinatorial explosion!” CS 250 L11: Design Verification UC Regents Fall 2012 © UCB On Tuesday Mid-term I ... Ground rules ... When is it? Where is it? Ground rules. 9:30 AM sharp, Tuesday March 18th, 306 Soda. Every-other-seat seating, except for the front row, where every-seat is permitted. No blue-books needed. We will be handing out a paper test. Pencil is preferred. Pencils down @ 10:55 AM, so we can collect papers before next class comes in. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB When is it? Where is it? Ground rules. No use of calculators, smartphones, laptops, etc ... during the exam. Closed-book, closed-notes. Just pencils, erasers. No consulting with students. Restroom breaks are OK, but you’ll still need to hand in your exam @ 10:55. Questions are reserved for serious concerns about a bug in the question. CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Today - Midterm I Review Session All questions answered (almost ...) CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Break Play: CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB # Points 1 10 Name: 2 15 SSID: 3 10 “All the work is my own. I have no prior knowledge of the exam contents, aside from guidance from class staff. I will not share the contents with others in CS152 who have not taken it yet.” 4 10 5 15 Signature: 6 15 Please write clearly, and put your name on each page. Please abide by word limits. Good luck! 7 10 8 15 CS152 Midterm I October 4th 2005 Now at Splunk (log files in the cloud) Now at Redux (10second David Marquardt Udam Saini John Lazzaro (still @ berkeley) Tot 100 Q1. Register File Design (10 points) clk ws 5 WE R0 - The constant 0 D D E M U . X . . D Q En R1 Q En R2 Q ... D 32 wd En R31 Q Q1: The actual question ... CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB ws1 clk 5 R0 - Constant 0 D Draw your answer here ... WE1 WE2 D Q En R1 Q En R2 Q ... D 5 ws2 32 wd1 32 wd2 En R31 Q ws1 clk 5 WE1 de mu x R0 - Constant 0 a0 a1 a2 or . . . D En Q a1 b1 R1 Q 1 a31 0 32 or D En a2 b2 R2 Q 1 WE2 de mu x 0 b0 b1 b2 ws2 ... or . . . D 1 b31 5 32 32 32 wd1 wd2 0 32 En a31 b31 R31 Q Q2: Single Cycle Design (part A) LWA: Load Word and Auto-update Index CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB R-format op rs rt I-format op rs rt rs rt 1 rt 5 5 5 rs 0 RegDest RegFile rs1 rd1 rs2 32 ws 32 wd 32 rd2 WE RegWr rd shamt funct imm ALUctr 1 1 imm Ext ExtOp Equal 0 0 MemToReg ALUsrc MemWr Mux control: 0 is lower mux input, 1 upper mux input RegWr, MemWr: 1 = write, 0 = no write. ExtOp: 1 =sign-extend, 0 = zero-extend. Q2: Single Cycle Design (part A) CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Q2: Single Cycle Design (part B) SWA: Store Word and Auto-update Index CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB R-format op rs rt I-format op rs rt rs rt 1 rt 5 5 5 rs 0 RegDest RegFile rs1 rd1 rs2 32 ws 32 wd 32 rd2 WE RegWr rd shamt funct imm ALUctr 1 1 imm Ext ExtOp Equal 0 0 MemToReg ALUsrc MemWr Mux control: 0 is lower mux input, 1 upper mux input RegWr, MemWr: 1 = write, 0 = no write. ExtOp: 1 =sign-extend, 0 = zero-extend. Q2: Single Cycle Design (part B) CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Q2: Single Cycle Design (part C) BEQR: Branch if EQual to address in Register CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB R-format op rs rt I-format op rs rt rs rt 1 rt 5 5 5 rs 0 RegDest RegFile rs1 rd1 rs2 32 ws 32 wd 32 rd2 WE RegWr rd shamt funct imm ALUctr 1 1 imm Ext ExtOp Equal 0 0 MemToReg ALUsrc MemWr Mux control: 0 is lower mux input, 1 upper mux input RegWr, MemWr: 1 = write, 0 = no write. ExtOp: 1 =sign-extend, 0 = zero-extend. 32 PC 32 Instr Mem 32 32 1 D + 32 Q 0 0x4 Addr Data 32 32 PCSrc Clk imm + Ext 1 rd1 0 BRSrc 32 Note: imm is immediate field from Iformat bitfield. Ext unit sign extends, does word->byte shift. Note: rd1 is output from register file. Mux control: 0 is lower mux input, 1 upper mux input Q2: Single Cycle Design (part C) CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Q3: Single-Cycle Branch Delay Slot CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Single-Cycle I-Fetch (no delay slot) 32 PC Instr Mem 32 32 1 32 D + 32 Q 0 0x4 Addr Data 32 32 PCSrc Ex te nd Clk + 32 Mux control: 0 is lower mux input, 1 upper mux inpu Design Instruction Fetch WITH delay slot 32 PC 1 32 D Q + 32 0 32 PCSrc Ex te nd + Clk Instr Mem 32 32 Addr Data Design Instruction Fetch WITH delay slot 32 PC 32 NEWREG 1 32 0x4 0x0 D + 1 32 Q Q 32 0 0 32 PCSrc PCSrc Ex te nd D + 32 Clk Clk On reset: PC = 32’d0 NEWREG = 32’d4 Instr Mem 32 Addr Data Q4: Interpreting Schmoo Plots ... CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB A “Schmoo” plot for a Cell SPU ... The energy equations: E 2 1 C 2 2 1 C 2 E1V V = = dd current 1 Joule of dissipated >1 energy is dd >0 by a 1 Amp 0- Operating point Q flowing through a 1 Ohm resistor for 1 second. Also, 1 Joule of energy is 1 Watt (1 amp into 1 ohm) dissipating for 1 second. Operating point R Operating point P Each square shows chip temperature (C) and power Q4: Part A ... Operating point P: 1.3 V, 4.8 GHz, 10 W. Operating point Q: 1.3 V, 2.4 GHz, 5 W. Q4: Part A answer Q4: Part B ... Operating point P: 1.3 V, 4.8 GHz, 10 W. Operating point Q: 1.3 V, 2.4 GHz, 5 W. Operating point R: 0.9 V, 2.4 GHz, 1 W. Q4: Part B answer Q5: Visualizing Stalls and Kills Note: no forwarding muxes, no “==” ID ALU 1 “IF” Stage Instr Fetch 2 3 5 4 “ID” Stage “EX” Stage “MEM” Stage WB Memory Write Decode & Reg Fetch Execution Back IR IR IR IR WE, MemToReg Mux,Logic A Y R To branch logic M B M Notes: Program I1: OR R5,R1,R2 In BEQ, the I7 denotes the branch target instruction (if the branch is taken). Look at the I2: OR R6,R1,R2 I3: BEQ R6,R5,I7 code to figure out if branch is taken or not. I4: LW R3 0(R5) Use N to denote a stage with a muxed-in I5: OR R7,R5,R6 NOP instruction. I6: OR R0,R3,R7 I7: OR R6,R0,R3 Fill out the table until all slots of t13 are I8: OR R5,R0,R1 filled in. Do not add and fill in t14, t15, etc. I9: OR R11,R9,R9 I10: OR R12,R9,R9 We filled in I1 to get you started. t1 IF: I1 ID: EX: MEM: WB: t2 t3 t4 t5 I1 I1 I1 I1 t6 t7 t8 t9 t10 t11 t12 t13 Notes: Program I1: OR R5,R1,R2 In BEQ, the I7 denotes the branch target instruction (if the branch is taken). Look at the I2: OR R6,R1,R2 I3: BEQ R6,R5,I7 code to figure out if branch is taken or not. I4: LW R3 0(R5) Use N to denote a stage with a muxed-in I5: OR R7,R5,R6 NOP instruction. I6: OR R0,R3,R7 I7: OR R6,R0,R3 Fill out the table until all slots of t13 are I8: OR R5,R0,R1 filled in. Do not add and fill in t14, t15, etc. I9: OR R11,R9,R9 I10: OR R12,R9,R9 We filled in I1 to get you started. t1 IF: I1 ID: EX: MEM: WB: t8 t9 t10 t11 t12 t13 t2 t3 t4 t5 t6 t7 I2 I1 I3 I2 I1 I4 I3 I2 I1 I4 I3 N I2 I1 I4 I3 N N I2 I4 I5 I7 I3 I4 N N I3 I4 N N I3 N N N I8 I7 N I4 I3 I8 I8 I9 I7 I7 I8 N N I7 N N N N I4 N Q6: Unified Memory and Pipelines 1 “IF” Stage Instr Fetch NOP mux into IR not shown 2 3 5 4 “ID/RF” Stage “EX” Stage “MEM” Stage WB Memory Write Decode & Reg Fetch Execution Back IR IR IR IR WE, MemToReg Mux,Logic PC PC update logic not shown A Y R To branch logic M M MemToReg B Policy: Data reads and writes take precedence over instruction fetches. Use N to denote a stage holding a NOP. Fill out the table until all slots of t13 are filled in. Do not add and fill in t14, t15, etc. We filled in I1 to get you started. t1 IF: I1 ID: EX: MEM: WB: t2 t3 t4 t5 I1 I1 I1 I1 t6 t7 Program I1: LW R1, 0(R0) I2: LW R2, 0(R1) I3: LW R3, 0(R1) I4: LW R4, 0(R3) I5: LW R5, 0(R3) I6: LW R6, 0(R4) I7: OR R5,R6,R5 t8 t9 t10 t11 t12 t13 Program I1: LW R1, 0(R0) I2: LW R2, 0(R1) I3: LW R3, 0(R1) I4: LW R4, 0(R3) I5: LW R5, 0(R3) I6: LW R6, 0(R4) I7: OR R5,R6,R5 t1 t2 t3 t4 t5 IF: I1 ID: EX: MEM: WB: I2 I1 I3 I2 I1 N I3 I2 I1 N I3 N I2 I1 t6 I4 I3 N N I2 t7 t8 I5 N I4 I5 I3 I4 N I3 N N t9 t10 t11 t12 t13 N I6 I7 N N I5 I5 I6 I7 I7 N I5 I6 N N I4 N N I5 I6 I3 I4 N N I5 Q7: Forwarding Networks Forwarding muxes, with numbers inputs ID (Decode) EX IR IR WB MEM IR IR To branch logic == Mux,Logic From WB From mux outputs 4 5 2 1 1 2 3 4 A 1 2 3 5 M 3 Y R M H CS 152 L16: Midterm I Review UC Regents Spring 2014 © UCB Program I1: OR R5,R1,R4 OR ws,rs1,rs2 LW ws,imm(rs1) I2: OR R4,R1,R2 BEQ rs1, rs2, branch target label I3: OR R3,R5,R4 (1) Fill in IF/ID/EX/MEM/WB rows with instruction I4: OR R3,R1,R2 number (I1, I2, etc) or N for a stage that holds a I5: BEQ R3,R4,I8 NOP. (2) Fill in A# with the selected input of the mux I6: LW R3 0(R3) driving the A register needed to fulfill the programmers contract (1,2,3, 4, or X for don’t care). I7: OR R6,R9,R3 (3) Fill in M# with the selected input of the mux driving I8: OR R3,R3,R9 the M register needed to fulfill the programmers contractI9: OR R9,R6,R3 I10: OR R3,R6,R6 (1,2,3, 5, or X for don’t care). Opcodes to datapath mapping: t1 IF: I1 ID: EX: MEM: WB: A#: X M#: X t2 t3 t4 t5 I1 I1 I1 I1 t6 t7 t8 t9 t10 Program I1: OR R5,R1,R4 I2: OR R4,R1,R2 I3: OR R3,R5,R4 I4: OR R3,R1,R2 I5: BEQ R3,R4,I8 I6: LW R3 0(R3) I7: OR R6,R9,R3 I8: OR R3,R3,R9 I9: OR R9,R6,R3 I10: OR R3,R6,R6 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 IF: I1 ID: EX: MEM: WB: A#: X M#: X I2 I1 I3 I2 I1 I4 I3 I2 I1 4 5 4 5 2 1 I5 I4 I3 I2 I1 4 5 I6 I5 I4 I3 I2 1 3 I8 I6 I5 I4 I3 2 X I9 I8 I6 I5 I4 X X I9 I10 I8 I9 N I8 I6 N I5 I6 2 4 5 1 Q8: Forwarding Through Registers A novel forward scheme (regfile mux) ID (Decode) EX IR IR WB MEM IR IR Mux,Logic 3 4 1 2 3 3 4 A Y R 5 3 3 5 M B 1 M 2 3 2 CS 152 L16: Midterm I Review 1 UC Regents Spring 2014 © UCB Opcodes to datapath mapping: OR ws,rs1,rs2 Fill in A# with the selected input of the mux driving the A register needed to fufill the programmers LW ws,imm(rs1) contract (3, 4, or X for don’t care). Fill in M# with the selected input of the mux driving the M register needed to fufill the programmers contract (3, 5, or X for don’t care). Fill in wd with the selected input of the mux driving the wd register file input (1, 2, 3, or X for “don’t care because there is no write this cycle”) IF: ID: EX: MEM: WB: A#: M#: wd: Program I1: OR R5,R1,R2 I2: OR R8,R3,R5 I3: OR R7,R8,R5 I4: LW R4 0(R8) I5: OR R9,R8,R7 I6: OR R3,R9,R7 I7: OR R2,R4,R3 I8: OR R7,R3,R4 I9: OR R5,R7,R9 t8 t9 t10 t11 t12 t13 t1 t2 t3 t4 t5 t6 t7 I1 I2 I1 I3 I2 I1 I4 I3 I2 I1 I5 I4 I3 I2 I1 I6 I5 I4 I3 I2 I7 I8 I9 I6 I7 I8 I5 I6 I7 I4 I5 I6 I3 I4 I5 X X X I9 I8 I9 I7 I8 I9 I6 I7 I8 I9 IF: ID: EX: MEM: WB: A#: M#: wd: Program I1: OR R5,R1,R2 I2: OR R8,R3,R5 I3: OR R7,R8,R5 I4: LW R4 0(R8) I5: OR R9,R8,R7 I6: OR R3,R9,R7 I7: OR R2,R4,R3 I8: OR R7,R3,R4 I9: OR R5,R7,R9 t8 t9 t10 t11 t12 t13 t1 t2 t3 t4 t5 t6 t7 I1 I2 I1 I3 I2 I1 I4 I3 I2 I1 4 5 X 4 3 3 3 5 3 I5 I4 I3 I2 I1 4 X 3 I6 I5 I4 I3 I2 4 5 X I7 I8 I9 I6 I7 I8 I5 I6 I7 I4 I5 I6 I3 I4 I5 4 3 4 3 5 5 2 3 1 X X X I9 I8 I9 I7 I8 I9 I6 I7 I8 I9 3 5 Q8: Forwarding Through Registers Q9: Simple branch predictor Address of BNEZ instruction 0b0110[...]01001000 28 bits 2 bits Branch Target Buffer (BTB) line index 28-bit address tag target address 0b00 0b01 0b10 0b0110[...]0100 0b11 = Hit CS 152 L16: Midterm I Review PC + 4 + Loop BNEZ R1 Loop Branch History Table (BHT) N L Update BHT once taken/ not taken status is known On a miss, replace BTB for the line with the new branch tag & target. Next slide defines initial BHT N and UC Regents Spring 2014 © UCB Simple (”2-bit”) Branch History State “N bit” Prediction for Next branch (1 = take, 0 = not take) D “L bit” Was Last prediction correct? (1 = yes, 0 = no) D Q L N old N 0 0 0 0 1 1 1 Q old L branch new N new L 0 0 1 1 0 0 1 not taken taken not taken taken not taken taken not taken 0 1 0 0 0 1 1 1 1 1 0 1 1 0 When replacing the tag value for a line, initialize branch history state to (N = 1, L = 1) (for taken branches) or to (N = 0, L = 1) (for “not taken” branches). Branch predictor state before first inst. in trace executes 1 28-bit address tag target address 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 007 PC + 4 + Lab8 0x 0000 0000 N L 0 0 1 0 0 1 1 BEQ R1 R2 Lab1 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 007 PC + 4 + Lab8 1 line index 0b00 0b01 0b10 0b11 Taken 1 1 0 1 1 0 1 0b00 1 0b11 0b01 0b10 2 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 007 PC + 4 + Lab8 0x 0000 0034 1 1 0 1 BEQ R7 R8 Lab4 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 007 PC + 4 + Lab8 1 0 1 0b00 1 0b11 0b01 0b10 Not Taken 1 0 0 1 1 1 1 0b00 1 0b11 0b01 0b10 3 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 007 PC + 4 + Lab8 0x 0000 006C 1 0 0 1 BEQ R13 R14 Lab7 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 1 1 1 0b00 1 0b11 0b01 0b10 Not Taken 1 0 0 0 1 1 1 0b00 1 0b11 0b01 0b10 4 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 0x 0000 0058 1 0 0 0 BEQ R11 R12 Lab6 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 1 1 1 0b00 1 0b11 0b01 0b10 Taken 1 0 0 0 1 1 0 0b00 1 0b11 0b01 0b10 5 0x 0000 000 PC + 4 + Lab1 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 0x 0000 0020 1 0 0 0 BNE R5 R6 Lab3 0x 0000 002 PC + 4 + Lab3 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 1 1 0 0b00 1 0b11 0b01 0b10 Taken 1 0 0 0 1 1 0 0b00 1 0b11 0b01 0b10 6 0x 0000 002 PC + 4 + Lab3 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 0x 0000 0034 1 0 0 0 BEQ R7 R8 Lab4 0x 0000 002 PC + 4 + Lab3 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 1 1 0 0b00 1 0b11 0b01 0b10 Taken 1 0 0 0 1 0 0 0b00 1 0b11 0b01 0b10 7 Q4 Answer: Branch predictor state after 7 branches complete 0x 0000 002 PC + 4 + Lab3 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 0x 0000 006C 1 0 0 0 BEQ R13 R14 Lab7 0x 0000 002 PC + 4 + Lab3 0x 0000 003 PC + 4 + Lab4 0x 0000 005 PC + 4 + Lab6 0x 0000 006 PC + 4 + Lab7 1 0 0 0b00 1 0b11 0b01 0b10 Not Taken 1 0 0 0 1 0 0 0b00 1 0b11 0b01 0b10 On Tuesday Mid-term I ... Good luck !