CS 152 Computer Architecture and Engineering
Lecture 15 -- Advanced CPUs
2014-3-11
John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L15: Superscalars and Scoreboards, UC Regents Spring 2014 © UCB

DEC Alpha 21164
Top performing microprocessor in its day (1995). 300 MFLOPS in 0.5µ CMOS, @ 300 MHz.
Uses techniques we cover in Part I of lecture:
Lockup-free cache integration.
Use of many functional units.
Many instructions issued per cycle (superscalar).
Most of chip is cache (in blue).
This 4-issue chip was the high-water mark for in-order designs.

In 2014, in-order superscalar lives in the cost-sensitive sector ...
Chromecast: Web browser in a flash-drive form factor. Plugs into the HDMI port on a TV. Includes a Wi-Fi chip so you can control the browser from your cell phone.
Marvell Embedded CPU: In-order dual-core superscalar.
$35 retail implies Bill of Materials (BOM) in the $20 range ...
[Board photo labels: ARM CPU, Wi-Fi (Marvell), 2 GB Flash, 512 MB DRAM]

Key Issue: Overcoming data hazards
Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction, but I2 executes “too early” and reads the wrong copy of the data.
Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. But instead, I2 writes too early, and I1 sees the new value.
Write After Write (WAW) hazards. Instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect.
CS 194-6 L9: Advanced Processors I UC Regents Fall 2008 © UCB

Key issue: Structural Hazards ...
Floating Point Pipeline of Alpha 21164:
Insufficient register write ports to service all sources every clock cycle.
Not every arithmetic unit is fully pipelined.

Topic #1: CPU side of our hit-over-miss cache ...
CPU requests a read by placing MTYPE, TAG, MADDR in Queue 1.
“We” == L1 D-Cache controller.
We do a normal cache access. If there is a hit, we place the load result in Queue 2 ...
In the case of a miss, we use the Inverted Miss Status Holding Register.
[Figure: Queue 1 carries requests from the CPU; Queue 2 carries results to the CPU.]

Integrating queues into the pipeline ...
A memory pipe splits off from the main pipeline, after the ALU calculates the index.
CPU uses 5 bits of TAG to encode the target/source register for LW/SW.

LockBits: a scoreboard data structure
Each register has a lock bit, initialized to 0. An example of a scoreboard data structure.
In decode stage, we lock the target register of any LW we issue.
In decode stage, we stall any instruction that reads or writes a locked register.
[Figure: LockBits array alongside the register file, with ports rs, ws, wd, rd, WE.]

How lock bits are cleared ...
When data is returned to CPU via Queue 2, CPU writes data into register file, and clears the associated lock bit.
Dedicated write ports are needed to avoid structural hazards.

Memory semantics and lock-free caches
The CPU expects that loads and stores to the same memory location are applied in queued order.
The simple (low-performance) approach for the data cache is to “snoop” Queue 1, and delay accepting writes to addresses that are being read.
Finally, note the lack of sequential consistency.

Topic #2: Pipelines and latency ...
This pipeline splits after the RF stage, feeding functional units with different latencies.

Split pipelines: a write-after-write hazard
Solution: SUB detects R1 clash in decode stage and stalls, via a pipe-write scoreboard.
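The lock-bit idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names (LockBits, must_stall, lock, clear) are hypothetical and not from the lecture, and the sketch assumes the same one-bit-per-register structure is used both for LW targets and for the destination of a long-latency instruction such as DIV.

```python
class LockBits:
    """One lock bit per register, all initialized to 0 (unlocked)."""

    def __init__(self, num_regs=32):
        self.locked = [False] * num_regs

    def must_stall(self, sources, dest=None):
        """Decode-stage check: stall if the instruction reads or writes a locked register."""
        regs = list(sources)
        if dest is not None:
            regs.append(dest)
        return any(self.locked[r] for r in regs)

    def lock(self, dest):
        """Called when an instruction with a pending result (LW, DIV, ...) issues."""
        self.locked[dest] = True

    def clear(self, dest):
        """Called when the result is finally written into the register file."""
        self.locked[dest] = False


# Hypothetical usage: DIV R1, R2, R3 issues and locks R1.
bits = LockBits()
bits.lock(1)

# SUB R1, R2, R3 writes the locked register: WAW clash, stall in decode.
print("SUB R1,R2,R3 stalls:", bits.must_stall(sources=[2, 3], dest=1))
# ADD R4, R1, R2 reads the locked register: RAW clash, stall in decode.
print("ADD R4,R1,R2 stalls:", bits.must_stall(sources=[1, 2], dest=4))
# ADD R5, R2, R3 touches no locked register and may proceed.
print("ADD R5,R2,R3 stalls:", bits.must_stall(sources=[2, 3], dest=5))

# When DIV writes R1, the bit is cleared and the stalled instructions may go.
bits.clear(1)
print("SUB R1,R2,R3 stalls after clear:", bits.must_stall(sources=[2, 3], dest=1))
```

A single bit per register is enough here because a second writer to a locked register is never allowed past decode, so at most one result is ever pending per register.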
WAW Hazard:
DIV R1, R2, R3
SUB R1, R2, R3
If long latency DIV and short latency SUB are sent to parallel pipes, SUB may finish first.
The pipeline splits after the RF stage, feeding functional units with different latencies.

Register write port: a structural hazard
Structural Hazard:
DIV R1, R2, R3
[...]
SUB R5, R2, R3
DIV and SUB may need to write register file at the same time.
Solution: A scoreboard structure to reserve future slots of the write port. Stall SUB in decode until slot opens.
Other solutions possible ... above, solution of separate ...

Functional unit input: a structural hazard
Structural Hazard:
DIV R1, R2, R3
DIV R5, R2, R3
Divide is usually not fully pipelined, and cannot accept new operands every cycle.
Solution: A scoreboard structure to detect busy functional units. Stall DIV R5, ... in decode until divider is ready.

Imprecise exceptions: A difficult issue
Exceptions:
DIV R1, R2, R3
SUB R4, R2, R3
If DIV throws an exception after SUB writes back, the contract with the programmer breaks.
Solutions: Too complicated for a slide. See page C-58 in CA-AQA.

Superscalar: Multiple issues per cycle
Goal: Improve CPI by issuing several instructions per cycle.
Example: CPU with floating point ALUs: Issue 1 FP + 1 Integer instruction per cycle.
Difficulties: Load and branch delays affect more instructions.
Ultimate Limiter: Programs may be a poor match to issue rules.

Recall VLIW: Super-sized Instructions
Example: All instructions are 64-bit.
Each instruction consists of two 32-bit MIPS instructions that execute in parallel.
A 64-bit VLIW instruction (each 32-bit half has the fields opcode, rs, rt, rd, shamt, funct):
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9     Semantics: $7 = $8 + $9
But what if we can’t change ISA execution semantics?
CS 194-6 L3: Single-Cycle CPU UC Regents Fall 2008 © UCB

Superscalar R machine
[Figure: two-pipe datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB. Labels: PC and Sequencer, Instr Mem (Addr, Data), Instruction Issue Logic, RegFile with read ports rs1-rs4 / rd1-rd4 and write ports ws1/wd1/WE1 and ws2/wd2/WE2.]

Sustaining Dual Instr Issues (no forwarding)
ADD R8,R0,R0
ADD R11,R0,R0
ADD R27,R26,R25
ADD R30,R29,R28
ADD R21,R20,R19
ADD R24,R23,R22
ADD R15,R14,R13
ADD R18,R17,R16
ADD R9,R8,R7
ADD R12,R11,R10
[Figure: pipeline snapshot. One pipe holds ADD R9,R8,R7 (ID), ADD R15,R14,R13 (EX), ADD R21,R20,R19 (MEM), ADD R27 (WB). The other pipe holds ADD R12,R11,R10 (ID), ADD R18,R17,R16 (EX), ADD R24,R23,R22 (MEM), ADD R30 (WB).]
It’s rarely this good ...

Worst-Case Instruction Issue
ADD R8,R0,R0
ADD R9,R8,R0
ADD R10,R9,R0
ADD R11,R10,R0
We add 12 forwarding buses (not shown). (6 to each ID from stages of both pipes).
Dependencies force “serialization”.
[Figure: pipeline snapshot. One pipe holds ADD R11,R10,R0 (ID), ADD R10,R9,R0 (EX), ADD R9,R8,R0 (MEM), ADD R8 (WB). The other pipe holds NOP in ID, EX, MEM, and WB.]

Superscalar: A simple example ...
Example: Superscalar MIPS. Fetches 2 instructions at a time.
If the first is an integer instruction and the second is floating point, issue both in the same cycle.

Integer instruction    FP instruction
LD  F0,0(R1)
LD  F6,-8(R1)
LD  F10,-16(R1)        ADDD F4,F0,F2
LD  F14,-24(R1)        ADDD F8,F6,F2
LD  F18,-32(R1)        ADDD F12,F10,F2
SD  0(R1),F4           ADDD F16,F14,F2
SD  -8(R1),F8          ADDD F20,F18,F2
SD  -16(R1),F12
SD  -24(R1),F16

Rows with both an integer and an FP instruction: two issues per cycle. Rows with a single instruction: one issue per cycle.

Superscalar: Visualizing the pipeline

Type               Pipe Stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  MEM WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  MEM WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  MEM WB

Three instructions are potentially affected by a single cycle of load delay, as FP register loads are done in the “integer” pipeline.

Limitations of “lockstep” superscalar
Gets 0.5 CPI only for a 50/50 float/int mix with no hazards. For games/media, may be OK.
Extending scheme to speed up general apps (Microsoft Office, ...) is complicated.
If one accepts building a complicated machine, there are better ways to do it.
Dynamic Scheduling: After spring break.

DEC Alpha 21164
This 4-issue chip was the high-water mark for in-order superscalar designs.
[Figure label: Final paragraph]
DEC was sold off to Compaq a few years later ... who sold off Digital Semiconductor to Intel ... who still makes Alpha chips in small batches for HP (who bought Compaq).

Break

The CDC 6600 was the world’s fastest computer for 5 years (1964-1969).
The design team was located in a small town in Wisconsin, the home town of its leader, Seymour Cray.
The lab was placed far from CDC headquarters in Minneapolis, to limit interference from upper management.
[Figure labels: Operator Console; Top-down view]
Transistor-based design, running with a 100 ns clock cycle.
64K of 60-bit words, implemented with magnetic core memory.
Entire mainframe was liquid cooled with Freon.
Bus wires: twisted wire pairs that were trimmed by hand to meet cycle time.

Architecture
Out-of-order execution.
The first RISC machine.
Peripheral processor invented multithreading.
“Scoreboard”
10 functional units.
Register File: Includes eight 60-bit floating point registers.
Long, variable latency.

Instruction Fetch and the Scoreboard
The scoreboard controls the execution flow of all instructions. Its goal is to maintain a CPI of 1.
The instruction fetch unit is decoupled. Its goal is to pass one decoded instruction to the scoreboard every cycle.
The scoreboard holds decoded copies of all in-flight instructions, and tracks the status of all elements cycle-by-cycle.

Lifecycle of an instruction in the scoreboard
States, in order: Pending Issue, Awaiting operands, Execution in progress, Execution has completed, Result is written.

Part 1: Pending Issue
Newly arrived instructions are placed in this state, until (1) a functional unit becomes free, and (2) no other issued instruction wants to write the register it wants to write. Prevents WAW hazards.
If an instruction is in pending issue, the scoreboard stalls the instruction fetch unit.

Part 2: Awaiting operands
An instruction remains in this state until neither of its operand registers is waiting to be written by a functional unit. Prevents RAW hazards.

Part 3: Execution in progress
This state can last many cycles, as functional units have long latency.

Part 4: Execution has completed
Instructions may pass through this state, unless there is an instruction in Pending or Awaiting mode that (1) preceded it in the instruction stream, and (2) needs to read the register this instruction plans to write. Prevents WAR hazards.
Once the instruction leaves this state, its result is written.

What the scoreboard keeps score of.
The full status of each functional unit:
(1) Is it running an instruction? Which one?
(2) What are its source/destination registers?
(3) For each source: waiting/ready-to-read/read.
(4) For each source: who will be writing it?
For each register, which functional unit is planning to write it?
Current state of all in-flight instructions.

Limitations of scoreboard control ...
If one accepts building a complicated machine, there are better ways to do it.
Dynamic Scheduling: After spring break.

On Thursday: Midterm Review Lecture
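
As a recap of the scoreboard bookkeeping listed under "What the scoreboard keeps score of", here is a minimal Python sketch of the three hazard checks from the instruction lifecycle. It is an illustration under simplifying assumptions, not the CDC 6600 design: every name (FuncUnit, Scoreboard, can_issue, and so on) is hypothetical, units are identified by made-up names, and timing is ignored so that only the stall conditions remain.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FuncUnit:
    name: str
    busy: bool = False               # (1) is it running an instruction?
    dest: Optional[int] = None       # (2) destination register
    srcs: tuple = ()                 # (2) source registers
    src_writer: list = field(default_factory=list)  # (4) pending writer per source, or None
    operands_read: bool = False      # (3) have the sources been read yet?
    seq: int = -1                    # position in the instruction stream


class Scoreboard:
    def __init__(self, unit_names, num_regs=8):
        self.units = {n: FuncUnit(n) for n in unit_names}
        self.reg_writer = [None] * num_regs  # per register: unit planning to write it
        self.next_seq = 0

    def can_issue(self, unit, dest):
        # Pending Issue: need a free unit and no other pending writer of dest (WAW).
        return not self.units[unit].busy and self.reg_writer[dest] is None

    def issue(self, unit, dest, srcs):
        assert self.can_issue(unit, dest)
        fu = self.units[unit]
        fu.busy, fu.dest, fu.srcs = True, dest, tuple(srcs)
        fu.src_writer = [self.reg_writer[s] for s in srcs]
        fu.operands_read = False
        fu.seq = self.next_seq
        self.next_seq += 1
        self.reg_writer[dest] = unit

    def can_read_operands(self, unit):
        # Awaiting operands: no source may still be waiting on a writer (RAW).
        return all(w is None for w in self.units[unit].src_writer)

    def read_operands(self, unit):
        assert self.can_read_operands(unit)
        self.units[unit].operands_read = True

    def can_write_result(self, unit):
        # Execution has completed: an earlier instruction that has not yet read
        # its operands must not lose a source to this write (WAR).
        fu = self.units[unit]
        return not any(
            o.busy and o.seq < fu.seq and not o.operands_read and fu.dest in o.srcs
            for o in self.units.values()
        )

    def write_result(self, unit):
        assert self.can_write_result(unit)
        fu = self.units[unit]
        self.reg_writer[fu.dest] = None
        for o in self.units.values():
            o.src_writer = [None if w == unit else w for w in o.src_writer]
        fu.busy = False


# Hypothetical instruction stream: DIV R1,R2,R3; ADD R4,R1,R5; MUL R5,R6,R7
sb = Scoreboard(["div", "add", "sub", "mul"])
sb.issue("div", dest=1, srcs=[2, 3])
sb.issue("add", dest=4, srcs=[1, 5])
print("ADD may read operands:", sb.can_read_operands("add"))   # RAW on R1
print("SUB R1,... may issue:", sb.can_issue("sub", dest=1))     # WAW on R1
sb.issue("mul", dest=5, srcs=[6, 7])
sb.read_operands("mul")
print("MUL may write R5:", sb.can_write_result("mul"))          # WAR with ADD
```

In this sketch the per-register reg_writer table does double duty: it blocks a second writer at issue, and it tells each newly issued instruction which unit to wait on for each source.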