CS 61C: Great Ideas in Computer Architecture
Control and Pipelining, Part I
Instructors: Krste Asanovic, Randy H. Katz
http://inst.eecs.Berkeley.edu/~cs61c/fa12
Fall 2012 -- Lecture #27

You Are Here!
• Parallel Requests: assigned to a computer, e.g., search "Katz" (Warehouse Scale Computer)
• Parallel Threads: assigned to a core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions -- Today's Lecture
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3)
• Hardware descriptions: all gates @ one time
• Programming Languages
These levels harness parallelism & achieve high performance across the hardware stack: Smart Phone / Warehouse Scale Computer; Computer (Core, Memory, Cache, Input/Output); Core (Instruction Unit(s), Functional Unit(s)); Main Memory; Logic Gates

Levels of Representation/Interpretation
• High Level Language Program (e.g., C):
    temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;
  Compiler
• Assembly Language Program (e.g., MIPS):
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
  Assembler
• Machine Language Program (MIPS): the same instructions as 32-bit binary words
  [Figure: the four instructions above shown in binary]
  Anything can be represented as a number, i.e., data or instructions
  Machine Interpretation
• Hardware Architecture Description (e.g., block diagrams)
  Architecture Implementation
• Logic Circuit Description (circuit schematic diagrams)

Agenda
• Pipelined Execution
• Administrivia
• Pipelined Datapath

Review: Single-Cycle Processor
• Five steps to design a processor (datapath + control + memory + input + output):
  1. Analyze instruction set => datapath requirements
  2. Select set of datapath components & establish clock methodology
  3.
Assemble datapath meeting the requirements; re-examine it for pipelining
  4. Analyze the implementation of each instruction to determine the setting of control points that effects the register transfer
  5. Assemble the control logic
     • Formulate logic equations
     • Design circuits

Pipeline Analogy: Doing Laundry
• Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, fold, and put away
  – Washer takes 30 minutes
  – Dryer takes 30 minutes
  – "Folder" takes 30 minutes
  – "Stasher" takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; loads A-D each run wash, dry, fold, stash back to back, 30 minutes per step]
• Sequential laundry takes 8 hours for 4 loads

Pipelined Laundry
[Figure: timeline from 6 PM; loads A-D overlap, each step starting as soon as the needed resource is free]
• Pipelined laundry takes 3.5 hours for 4 loads!

Pipelining Lessons (1/2)
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages (4 in this case)
• Time to fill the pipeline and time to drain it reduce the speedup: 8 hours / 3.5 hours ≈ 2.3X vs. the potential 4X in this example

Pipelining Lessons (2/2)
• Suppose a new washer takes 20 minutes and a new stasher takes 20 minutes. How much faster is the pipeline?
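The laundry arithmetic, and the question just posed, can be sketched in a few lines of Python. This is a minimal sketch, assuming (as the slides' lesson does) that every pipeline step advances at the rate of the slowest stage; the function names and stage lists are illustrative, not from the slides.

```python
# Laundry pipeline arithmetic from the slides: 4 loads, 4 stages
# (wash, dry, fold, stash), 30 minutes per stage.

def sequential_minutes(stage_minutes, loads):
    """Each load runs all stages alone before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_minutes(stage_minutes, loads):
    """Every pipeline step waits for the slowest stage (synchronous model)."""
    cycle = max(stage_minutes)
    return (len(stage_minutes) + loads - 1) * cycle

balanced = [30, 30, 30, 30]
print(sequential_minutes(balanced, 4) / 60)   # 8.0 hours
print(pipelined_minutes(balanced, 4) / 60)    # 3.5 hours

# A 20-minute washer and 20-minute stasher do not help at all:
# the 30-minute dryer and folder still set the pipeline rate.
print(pipelined_minutes([20, 30, 30, 20], 4) / 60)  # 3.5 hours
```

The last line previews the answer to the question: no faster, which is exactly the lesson that follows.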
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce the speedup

Administrivia
• Project #4, Labs #10 and #11, and the (last) HW #6 are posted
  – Project due 11/11 (the Sunday after next)
  – The project is not difficult, but it is long: don't wait for the weekend it is due -- start early!
  – Look at the Logisim labs before you start Project #4: they contain lots of useful tips and tricks for using Logisim
  – TA Sung Roa will hold extended office hours on the project this coming weekend
• Reweighting of projects -- still 40% of the grade overall
  – Project 1: 5%
  – Project 2: 12.5%
  – Project 3: 10%
  – Project 4: 12.5%
  – Optional extra-credit Project 5: up to 5% extra
  – Gold, silver, and bronze medals to the three fastest projects
  – Code size and aesthetics are also criteria for recognition
  – Winning projects, as selected by the TAs, will be highlighted in class

Review: RISC Design Principles
• "A simpler core is a faster core"
• Reducing the number and complexity of instructions in the ISA simplifies a pipelined implementation
• Common RISC strategies:
  – Fixed instruction length, generally a single word (MIPS = 32b): simplifies fetching instructions from memory
  – Simplified addressing modes (MIPS: just register + offset): simplifies fetching operands from memory
  – Fewer and simpler instructions in the instruction set: simplifies executing instructions
  – Simplified memory access: only load and store instructions access memory
  – Let the compiler do it.
Use a good compiler to break complex high-level language statements into a number of simple assembly language statements

Review: Single-Cycle Datapath
[Figure: single-cycle datapath. I-format instruction fields: op (bits 31:26), rs (25:21), rt (20:16), immediate (15:0). Control signals: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg. Components: instruction fetch unit, register file (Ra, Rb, Rw ports; busA, busB, busW), sign extender, ALU (zero output), data memory (Adr, Data In, WrEn)]
• Register transfer shown (store): Data Memory{R[rs] + SignExt[imm16]} = R[rt]

Steps in Executing MIPS
1) IF: Instruction Fetch, increment PC
2) ID: Instruction Decode, read registers
3) EX: Execution
   – Mem-ref: calculate address
   – Arith-log: perform operation
4) MEM:
   – Load: read data from memory
   – Store: write data to memory
5) WB: Write data back to register

Redrawn Single-Cycle Datapath
[Figure: the datapath drawn left to right as five steps: 1. Instruction Fetch (PC, +4, instruction memory), 2. Decode/Register Read (rs, rt, rd, imm, registers), 3. Execute (ALU), 4. Memory (data memory), 5. Write Back]

Pipelined Datapath
[Figure: the same five-step datapath with pipeline registers inserted between the stages]
• Add registers between stages
  – They hold the information produced in the previous cycle
• 5-stage pipeline; clock rate potentially 5X faster

More Detailed Pipeline
• Pipeline registers are named for the adjacent stages, e.g., IF/ID

IF for Load, Store, …
• Highlight the combinational logic components used, plus the right half of a state element on read and the left half on write

ID for Load, Store, …
[Figure: components active in the ID stage]

EX for Load
[Figure: components active in the EX stage]

MEM for Load
[Figure: components active in the MEM stage]

WB for Load
• Has a bug that was in the 1st edition of the textbook!
• The bug: wrong register number -- the write-register number is taken from the instruction currently being decoded rather than carried along with the load itself

Corrected Datapath for Load
• Correct register number: the write-register number travels through the pipeline registers with its instruction

Pipelined Execution Representation
[Diagram: six instructions over time, each occupying IF, ID, EX, MEM, WB, staggered one cycle apart]
• Every instruction must take the same number of steps, also called pipeline stages, so some stages will go idle sometimes

Graphical Pipeline Diagrams
• Use the datapath figure to represent the pipeline stages:
  IF = I$ (instruction memory), ID = Reg (register read), EX = ALU, MEM = D$ (data memory), WB = Reg (register write)

Graphical Pipeline Representation
(In the Reg box, the right half is highlighted on read, the left half on write)
[Diagram: instructions Load, Add, Store, Sub, Or flowing through I$, Reg, ALU, D$, Reg, staggered one clock cycle apart]

Pipeline Performance
• Assume the stage times are:
  – 100 ps for register read or write
  – 200 ps for the other stages
• What is the pipelined clock rate?
  – Compare the pipelined datapath with the single-cycle datapath

  Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
  lw       | 200 ps      | 100 ps        | 200 ps | 200 ps        | 100 ps         | 800 ps
  sw       | 200 ps      | 100 ps        | 200 ps | 200 ps        |                | 700 ps
  R-format | 200 ps      | 100 ps        | 200 ps |               | 100 ps         | 600 ps
  beq      | 200 ps      | 100 ps        | 200 ps |               |                | 500 ps
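The totals in the table above can be recomputed from the two stage latencies. A small sketch (the dictionary and key names below are illustrative, not from the slides):

```python
# Recompute the per-instruction totals from the stage latencies:
# 100 ps for register read/write, 200 ps for everything else.

STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200,
            "mem": 200, "reg_write": 100}

# Which stages each instruction class actually uses:
INSTR_STAGES = {
    "lw":       ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw":       ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq":      ["fetch", "reg_read", "alu"],
}

for name, stages in INSTR_STAGES.items():
    print(name, sum(STAGE_PS[s] for s in stages), "ps")
# lw 800 ps, sw 700 ps, R-format 600 ps, beq 500 ps
```

Note that only lw, the longest instruction, uses all five stages; in the single-cycle design its 800 ps sets the clock period for every instruction.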
Pipeline Performance
[Diagram: single-cycle (Tc = 800 ps) -- each instruction finishes before the next begins; pipelined (Tc = 200 ps) -- instructions overlap, one starting every 200 ps]

Pipeline Speedup
• If all stages are balanced, i.e., all take the same time:
  Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of stages
• If not balanced, the speedup is less
• The speedup comes from increased throughput
  – Latency (the time for each instruction) does not decrease

Instruction Level Parallelism (ILP)
• Another form of parallelism to go with Request Level Parallelism and Data Level Parallelism
  – RLP -- e.g., Warehouse Scale Computing
  – DLP -- e.g., SIMD, Map-Reduce
• ILP -- e.g., pipelined instruction execution
  – A 5-stage pipeline => 5 instructions executing simultaneously, one at each pipeline stage

And in Conclusion, … The BIG Picture
• Pipelining improves performance by increasing instruction throughput: it exploits ILP
  – Executes multiple instructions in parallel
  – Each instruction has the same latency
• The key enabler is placing registers between pipeline stages
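The speedup lessons above can be checked numerically with the cycle times from the performance slides (Tc = 800 ps single-cycle, Tc = 200 ps pipelined). This is a sketch with illustrative function names; it assumes the pipeline needs (stages - 1) cycles to fill before completing one instruction per cycle.

```python
# Throughput comparison: single-cycle vs. 5-stage pipelined design,
# using the cycle times from the slides.

def single_cycle_ps(n_instr, tc=800):
    """Total time: one 800 ps cycle per instruction."""
    return n_instr * tc

def pipelined_ps(n_instr, tc=200, stages=5):
    """(stages - 1) cycles to fill the pipe, then 1 instruction/cycle."""
    return (stages - 1 + n_instr) * tc

for n in (3, 100, 1_000_000):
    print(n, round(single_cycle_ps(n) / pipelined_ps(n), 3))

# The speedup approaches 800/200 = 4x, not the ideal 5x, because the
# stages are unbalanced: Tc is set by the slowest 200 ps stage even
# though register read/write need only 100 ps.
```

For small instruction counts the fill time dominates and the speedup is well below 4x, which restates the fill/drain lesson from the laundry analogy.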