RISC Pipeline Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University See: P&H Chapter 4.6 A Processor memory +4 inst register file +4 =? PC control offset new pc alu cmp addr din dout memory target imm extend 2 A Processor memory inst register file alu +4 addr p PC din control new pc Instruction Fetch imm extend Instruction Decode dout memory compute jump/branch targets Execute Memory WriteBack 3 Basic Pipeline Five stage “RISC” load-store architecture 1. Instruction fetch (IF) – get instruction from memory, increment PC 2. Instruction Decode (ID) – translate opcode into control signals and read registers 3. Execute (EX) – perform ALU operation, compute jump/branch targets 4. Memory (MEM) – access memory if needed 5. Writeback (WB) – update register file Slides thanks to Sally McKee & Kavita Bala 4 Pipelined Implementation Break instructions across multiple clock cycles (five, in this case) Design a separate stage for the execution performed during each clock cycle Add pipeline registers to isolate signals between different stages 5 register file B alu D memory D A Pipelined Processor +4 IF/ID M B ID/EX Execute EX/MEM Memory ctrl Instruction Decode Instruction Fetch dout compute jump/branch targets ctrl extend din memory imm new pc control ctrl inst PC addr WriteBack MEM/WB 6 IF Stage 1: Instruction Fetch Fetch a new instruction every cycle • Current PC is index to instruction memory • Increment the PC at end of cycle (assume no branches for now) Write values of interest to pipeline register (IF/ID) • Instruction bits (for later decoding) • PC+4 (for later computing branch targets) 7 IF instruction memory mc 00 = read word 1 PC+4 +4 inst addr WE PC pcreg new pc pcsel Rest of pipeline 1 pcrel pcabs IF/ID 8 ID Stage 2: Instruction Decode On every cycle: • Read IF/ID pipeline register to get instruction bits • Decode instruction, generate control signals • Read from register file Write values of interest to pipeline register (ID/EX) • Control information, Rd index, immediates, offsets, … • Contents of Ra, Rb • PC+4 (for computing branch targets later) 9 ctrl PC+4 imm inst PC+4 Stage 1: Instruction Fetch WE Rd register D file A A IF/ID ID/EX decode extend Rest of pipeline B Ra Rb B result ID dest 10 EX Stage 3: Execute On every cycle: • • • • Read ID/EX pipeline register to get values and control bits Perform ALU operation Compute targets (PC+4+offset, etc.) in case this is a branch Decide if jump/branch should be taken Write values of interest to pipeline register (EX/MEM) • Control information, Rd index, … • Result of ALU operation • Value in case this is a memory store instruction 11 ctrl pcabs ctrl pcrel B imm B D alu + Rest of pipeline PC+4 Stage 2: Instruction Decode A pcsel EX branch? pcreg || ID/EX EX/MEM 12 MEM Stage 4: Memory On every cycle: • Read EX/MEM pipeline register to get values and control bits • Perform memory load/store if needed – address is ALU result Write values of interest to pipeline register (MEM/WB) • Control information, Rd index, … • Result of memory operation • Pass result of ALU operation 13 ctrl ctrl B Stage 3: Execute din dout addr memory Rest of pipeline M D D MEM mc EX/MEM MEM/WB 14 WB Stage 5: Write-back On every cycle: • Read MEM/WB pipeline register to get values and control bits • Select value and write to register file 15 ctrl M Stage 4: Memory D WB result dest MEM/WB 16 IF/ID D M B D A B ID/EX addr din dout OP Rd OP EX/MEM Rd mem PC+4 Rd OP PC+4 +4 PC B Ra Rb imm inst inst mem A Rd D MEM/WB 17 Example add nand lw add sw r3, r6, r4, r5, r7, r1, r2; r4, r5; 20(r2); r2, r5; 12(r3); 18 IF/ID nand lw r4, add sw r7, r3, r5, r6,20(r2) 12(r3) r1, r2, r4, r5 r2 ID/EX D addr din dout M B B D A nand lw add sw r4, r7, r3, r5, r6, 20(r2) 12(r3) r1, r2, r4,r5 r2r5 OP Rd OP EX/MEM Rd mem PC+4 imm 0 36A 9 12 18 7B 41 Rb 77 22 OP PC+4 +4 PC r0 r1 r2 Rd r3 Dr4 r5 r6 Ra r7 nand lw add sw r4, r7, r3, r5, r6, 20(r2) 12(r3) r1, r2, r4,r5 r2r5 Rd 0:add 1:nand inst 2:lw 3:add mem 4:sw nand lw add sw r4, r7, r3, r5, r6, 20(r2) 12(r3) r1, r2, r4,r5 r2r5 inst nand lw add sw r4, r7, r3, r5, r6, 20(r2) 12(r3) r1, r2, r4,r5 r2r5 MEM/WB 19 Time Graphs Clock cycle add nand 1 2 IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID lw add sw Latency: Throughput: Concurrency: 3 4 5 6 7 8 9 EX MEM WB CPI = 20 Pipelining Recap Powerful technique for masking latencies • Logically, instructions execute one at a time • Physically, instructions execute in parallel – Instruction level parallelism Abstraction promotes decoupling • Interface (ISA) vs. implementation (Pipeline) 21