
61C in the News
Is organic computing finally here?
SPOTLIGHT | NOVEMBER 2, 2011
Japanese scientists have made organic molecules perform parallel computations like neurons in the human brain. They created this promising new approach with a ring-like molecule called 2,3-dichloro-5,6-dicyano-p-benzoquinone, or DDQ.
Today, computer chips can process data at 10 trillion (10^13) bits per second. But even though neurons in the human brain fire at a rate of 100 times per second, the brain still outperforms the best computers at various tasks. The main reason is that calculations done by computer chips happen in isolated pipelines, one at a time.
CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Lecture 30: Pipeline Parallelism 1
Instructors: Mike Franklin, Dan Garcia
http://inst.eecs.Berkeley.edu/~cs61c/fa11
Datapath Control Signals
• ExtOp:    "zero", "sign"
• ALUsrc:   0 → regB; 1 → immed
• ALUctr:   "ADD", "SUB", "OR"
• MemWr:    1 → write memory
• MemtoReg: 0 → ALU; 1 → Mem
• nPC_sel:  0 → "+4"; 1 → "br"
• RegDst:   0 → "rt"; 1 → "rd"
• RegWr:    1 → write register
[Datapath diagram: PC with a +4 adder and branch adder ("00" appended to the extended immediate, selected by nPC_sel & Equal), instruction memory (Inst Address), register file (Ra, Rb, Rw; busA, busB, busW), sign/zero extender for imm16, ALU, and data memory (Adr, Data In, WrEn), with each control signal above driving its mux or write enable.]
Where Do Control Signals Come From?
[Diagram: Instruction<31:0> comes out of the instruction memory (Adr in, Inst out). The fields Rs<21:25>, Rt<16:20>, Rd<11:15>, and Imm16<0:15> feed the DATA PATH directly, while Op<26:31> and Fun<0:5> feed the Control block, which generates nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg.]
P&H Figure 4.17
Summary of the Control Signals (1/2)
inst  Register Transfer
add   R[rd] ← R[rs] + R[rt]; PC ← PC + 4
      ALUsrc=RegB, ALUctr="ADD", RegDst=rd, RegWr, nPC_sel="+4"
sub   R[rd] ← R[rs] – R[rt]; PC ← PC + 4
      ALUsrc=RegB, ALUctr="SUB", RegDst=rd, RegWr, nPC_sel="+4"
ori   R[rt] ← R[rs] | zero_ext(Imm16); PC ← PC + 4
      ALUsrc=Im, ExtOp="Z", ALUctr="OR", RegDst=rt, RegWr, nPC_sel="+4"
lw    R[rt] ← MEM[ R[rs] + sign_ext(Imm16) ]; PC ← PC + 4
      ALUsrc=Im, ExtOp="sn", ALUctr="ADD", MemtoReg, RegDst=rt, RegWr, nPC_sel="+4"
sw    MEM[ R[rs] + sign_ext(Imm16) ] ← R[rt]; PC ← PC + 4
      ALUsrc=Im, ExtOp="sn", ALUctr="ADD", MemWr, nPC_sel="+4"
beq   if (R[rs] == R[rt]) then PC ← PC + 4 + ( sign_ext(Imm16) || 00 ) else PC ← PC + 4
      nPC_sel="br", ALUctr="SUB"
Summary of the Control Signals (2/2)
See Appendix A

              add      sub      ori      lw       sw       beq      jump
func          10 0000  10 0010
op            00 0000  00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
RegDst        1        1        0        0        x        x        x
ALUSrc        0        0        1        1        1        0        x
MemtoReg      0        0        0        1        x        x        x
RegWrite      1        1        1        1        0        0        0
MemWrite      0        0        0        0        1        0        0
nPCsel        0        0        0        0        0        1        ?
Jump          0        0        0        0        0        0        1
ExtOp         x        x        0        1        1        x        x
ALUctr<2:0>   Add      Subtract Or       Add      Add      Subtract x

x = We Don't Care :-)

Instruction formats:
R-type (add, sub):          op(31:26) rs(25:21) rt(20:16) rd(15:11) shamt(10:6) funct(5:0)
I-type (ori, lw, sw, beq):  op(31:26) rs(25:21) rt(20:16) immediate(15:0)
J-type (jump):              op(31:26) target address(25:0)
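The table above is exactly what a control ROM or lookup table would hold. As a hedged illustration (not from the slides), here is a minimal C sketch of that table; the names (ControlSignals, control_rom, DC) are invented, and the jump row's nPCsel, marked "?" on the slide, is treated as don't-care.

/* Hedged sketch: one row of the control table per instruction. */
#include <stdio.h>

#define DC (-1)   /* "x" in the table: we don't care */

typedef struct {
    const char *inst;
    int RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, nPCsel, Jump, ExtOp;
    const char *ALUctr;              /* "Add", "Subtract", "Or", or NULL for x */
} ControlSignals;

static const ControlSignals control_rom[] = {
    /* inst   RegDst ALUSrc MemtoReg RegWr MemWr nPCsel Jump ExtOp ALUctr */
    { "add",  1,     0,     0,       1,    0,    0,     0,   DC,   "Add"      },
    { "sub",  1,     0,     0,       1,    0,    0,     0,   DC,   "Subtract" },
    { "ori",  0,     1,     0,       1,    0,    0,     0,   0,    "Or"       },
    { "lw",   0,     1,     1,       1,    0,    0,     0,   1,    "Add"      },
    { "sw",   DC,    1,     DC,      0,    1,    0,     0,   1,    "Add"      },
    { "beq",  DC,    0,     DC,      0,    0,    1,     0,   DC,   "Subtract" },
    { "jump", DC,    DC,    DC,      0,    0,    DC,    1,   DC,   NULL       },
};

int main(void) {
    for (unsigned i = 0; i < sizeof control_rom / sizeof control_rom[0]; i++)
        printf("%-4s: RegWrite=%d MemWrite=%d ALUctr=%s\n", control_rom[i].inst,
               control_rom[i].RegWrite, control_rom[i].MemWrite,
               control_rom[i].ALUctr ? control_rom[i].ALUctr : "x");
    return 0;
}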
Boolean Exprs for Controller
[Instruction-field diagram repeated from the "Where Do Control Signals Come From?" slide.]

rtype = ~op5 • ~op4 • ~op3 • ~op2 • ~op1 • ~op0
ori   = ~op5 • ~op4 •  op3 •  op2 • ~op1 •  op0
lw    =  op5 • ~op4 • ~op3 • ~op2 •  op1 •  op0
sw    =  op5 • ~op4 •  op3 • ~op2 •  op1 •  op0
beq   = ~op5 • ~op4 • ~op3 •  op2 • ~op1 • ~op0
jump  = ~op5 • ~op4 • ~op3 • ~op2 •  op1 • ~op0

add = rtype • func5 • ~func4 • ~func3 • ~func2 • ~func1 • ~func0
sub = rtype • func5 • ~func4 • ~func3 • ~func2 •  func1 • ~func0

Op 0-5 are really Instruction bits 26-31
Func 0-5 are really Instruction bits 0-5

How do we implement this in gates? (see the sketch below)
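As a hedged sketch (not from the slides), the same decode can be written in C by slicing the op and func fields out of a 32-bit instruction word; the example encoding and variable names are my own.

/* Minimal sketch: decode the op and func fields of a MIPS instruction word
   into one-hot instruction lines, mirroring the boolean expressions above. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t inst = 0x8C880004;          /* example encoding: lw $t0, 4($a0) */
    uint32_t op   = (inst >> 26) & 0x3F; /* Instruction bits 31:26 */
    uint32_t func =  inst        & 0x3F; /* Instruction bits 5:0   */

    int rtype = (op == 0x00);
    int ori   = (op == 0x0D);
    int lw    = (op == 0x23);
    int sw    = (op == 0x2B);
    int beq   = (op == 0x04);
    int jump  = (op == 0x02);
    int add   = rtype && (func == 0x20);
    int sub   = rtype && (func == 0x22);

    printf("add=%d sub=%d ori=%d lw=%d sw=%d beq=%d jump=%d\n",
           add, sub, ori, lw, sw, beq, jump);
    return 0;
}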
Controller Implementation
[Diagram: the opcode and func fields feed "AND" logic that produces one line per instruction (add, sub, ori, lw, sw, beq, jump); those lines feed "OR" logic that produces RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, nPCsel, Jump, ExtOp, ALUctr[0], and ALUctr[1].]
Boolean Exprs for Controller

RegDst    = add + sub
ALUSrc    = ori + lw + sw
MemtoReg  = lw
RegWrite  = add + sub + ori + lw
MemWrite  = sw
nPCsel    = beq
Jump      = jump
ExtOp     = lw + sw
ALUctr[0] = sub + beq
ALUctr[1] = ori
(assume ALUctr is 00 ADD, 01 SUB, 10 OR)

How do we implement this in gates? (a sketch follows below)
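One way to see the "OR" stage in software: a minimal sketch (not from the slides) that combines the one-hot instruction lines into control outputs per the equations above. The chosen values, pretending the current instruction is ori, are my own.

/* Minimal sketch of the controller's "OR" logic. */
#include <stdio.h>

int main(void) {
    /* One-hot instruction lines, as produced by the decode ("AND") stage. */
    int add = 0, sub = 0, ori = 1, lw = 0, sw = 0, beq = 0, jump = 0;

    int RegDst   = add | sub;
    int ALUSrc   = ori | lw | sw;
    int MemtoReg = lw;
    int RegWrite = add | sub | ori | lw;
    int MemWrite = sw;
    int nPCsel   = beq;
    int Jump     = jump;
    int ExtOp    = lw | sw;
    int ALUctr0  = sub | beq;          /* ALUctr = 00 ADD, 01 SUB, 10 OR */
    int ALUctr1  = ori;

    printf("RegDst=%d ALUSrc=%d MemtoReg=%d RegWrite=%d MemWrite=%d "
           "nPCsel=%d Jump=%d ExtOp=%d ALUctr=%d%d\n",
           RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite,
           nPCsel, Jump, ExtOp, ALUctr1, ALUctr0);
    return 0;
}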
Call home, we’ve made HW/SW contact!

High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
        ↓ Compiler
Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)
        ↓ Assembler
Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
        ↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
        ↓ Architecture Implementation
Logic Circuit Description (Circuit Schematic Diagrams)
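As a small runnable version of the top level of this stack (not on the slide), here is the same swap wrapped in a complete C program; the function name swap_adjacent and the test array are invented for the example.

/* Hedged sketch: the slide's three-line swap, wrapped in a complete program. */
#include <stdio.h>

static void swap_adjacent(int v[], int k) {
    int temp = v[k];      /* temp = v[k];     ->  lw $t0, 0($2)                  */
    v[k] = v[k + 1];      /* v[k] = v[k+1];   ->  lw $t1, 4($2); sw $t1, 0($2)   */
    v[k + 1] = temp;      /* v[k+1] = temp;   ->  sw $t0, 4($2)                  */
}

int main(void) {
    int v[] = { 10, 20, 30 };
    swap_adjacent(v, 0);
    printf("%d %d %d\n", v[0], v[1], v[2]);   /* prints: 20 10 30 */
    return 0;
}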
Administrivia
• Due to time constraints, we can only allow the use of a maximum of 2 slip days on Project 4.
• So, while we always encourage you to get your work done on time, if you still have 3 slip days left, you may want to consider using one prior to Project 4.
Review: Single-cycle Processor
• Five steps to design a processor:
  1. Analyze instruction set → datapath requirements
  2. Select set of datapath components & establish clock methodology
  3. Assemble datapath meeting the requirements
  4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer
  5. Assemble the control logic
     • Formulate Logic Equations
     • Design Circuits
[Block diagram: Processor = Control + Datapath, connected to Memory, Input, and Output.]
Single Cycle Performance
• Assume the time for each action is
  – 100 ps for a register read or write; 200 ps for other events
• Clock rate is? (worked out below)

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps

• What can we do to improve clock rate?
• Will this improve performance as well?
  Want increased clock rate to mean faster programs
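A quick worked check (not spelled out on the slide): in a single-cycle design the clock period must cover the slowest instruction, lw at 800 ps, so the clock rate is at most 1/800 ps = 1.25 GHz. The sketch below just recomputes that from the table's totals.

/* Hedged sketch: single-cycle clock period = latency of the slowest instruction. */
#include <stdio.h>

int main(void) {
    double total_ps[] = { 800, 700, 600, 500 };   /* lw, sw, R-format, beq (from the table) */
    double period_ps = 0;
    for (int i = 0; i < 4; i++)
        if (total_ps[i] > period_ps)
            period_ps = total_ps[i];              /* slowest instruction sets the period */
    printf("clock period = %.0f ps, clock rate = %.2f GHz\n",
           period_ps, 1000.0 / period_ps);        /* 1000 ps per ns, so GHz = 1000/ps */
    return 0;
}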
Gotta Do Laundry
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away
  – Washer takes 30 minutes
  – Dryer takes 30 minutes
  – "Folder" takes 30 minutes
  – "Stasher" takes 30 minutes to put clothes into drawers
Sequential Laundry
[Timeline, 6 PM to 2 AM: each of the four loads goes through wash, dry, fold, stash back to back, 30 minutes per stage, with no overlap between loads.]
• Sequential laundry takes 8 hours for 4 loads
Pipelined Laundry
[Timeline, starting at 6 PM: load B starts washing as soon as load A moves to the dryer, and so on, so the four loads overlap in 30-minute steps.]
• Pipelined laundry takes 3.5 hours for 4 loads!
Pipelining Lessons (1/2)
[Pipelined-laundry timeline, repeated for reference.]
• Pipelining doesn't help latency of a single task, it helps throughput of the entire workload
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number of pipe stages
• Time to "fill" the pipeline and time to "drain" it reduces speedup: 2.3X v. 4X in this example (worked out below)
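Where the 2.3X comes from (my arithmetic, consistent with the timelines): sequential laundry is 4 loads x 4 stages x 30 min = 480 min; pipelined, the first load takes 4 x 30 = 120 min and each later load adds one 30-minute stage, so 120 + 3 x 30 = 210 min, giving 480 / 210 ≈ 2.3X versus the ideal 4X. A minimal sketch of the formula:

/* Hedged sketch: pipeline speedup with fill/drain overhead, using the laundry numbers. */
#include <stdio.h>

int main(void) {
    int loads = 4, stages = 4, stage_min = 30;
    int sequential = loads * stages * stage_min;   /* 480 minutes                        */
    int pipelined  = stages * stage_min            /* fill: first load takes 120 minutes */
                   + (loads - 1) * stage_min;      /* then one more load every 30 minutes */
    printf("sequential=%d min, pipelined=%d min, speedup=%.1fx (ideal %dx)\n",
           sequential, pipelined, (double)sequential / pipelined, stages);
    return 0;
}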
Pipelining Lessons (2/2)
[Pipelined-laundry timeline, repeated for reference.]
• Suppose a new Washer takes 20 minutes and a new Stasher takes 20 minutes. How much faster is the pipeline?
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
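A rough check with my own numbers: the pipeline advances only as fast as its slowest stage, so with a 20-minute washer and stasher a new load still finishes only every 30 minutes (the dryer and folder time); only the fill and drain shrink slightly. A minimal sketch:

/* Hedged sketch: steady-state pipeline throughput is set by the slowest stage. */
#include <stdio.h>

int main(void) {
    int old_stages[] = { 30, 30, 30, 30 };   /* wash, dry, fold, stash (minutes) */
    int new_stages[] = { 20, 30, 30, 20 };   /* faster washer and stasher        */
    int old_max = 0, new_max = 0;
    for (int i = 0; i < 4; i++) {
        if (old_stages[i] > old_max) old_max = old_stages[i];
        if (new_stages[i] > new_max) new_max = new_stages[i];
    }
    /* One load completes every "max stage time" minutes once the pipe is full. */
    printf("old: one load per %d min; new: one load per %d min\n", old_max, new_max);
    return 0;
}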
Steps in Executing MIPS
1) IFtch: Instruction Fetch, Increment PC
2) Dcd:   Instruction Decode, Read Registers
3) Exec:  Mem-ref: Calculate Address
          Arith-log: Perform Operation
4) Mem:   Load: Read Data from Memory
          Store: Write Data to Memory
5) WB: Write Data Back to Register
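As a hedged sketch (not from the slides), the five steps for a single instruction can be written as five placeholder functions called in order; every name and value here is invented, and a pipelined machine would overlap these stages across instructions.

/* Hedged sketch of the five steps for one instruction; all names are placeholders. */
#include <stdint.h>
#include <stdio.h>

static uint32_t instruction_fetch(uint32_t *pc)   { *pc += 4; return 0x8C880004; } /* 1) IFtch */
static uint32_t decode_read_reg(uint32_t inst)    { return (inst >> 21) & 0x1F; }  /* 2) Dcd   */
static uint32_t execute(uint32_t base, int16_t o) { return base + o; }             /* 3) Exec  */
static uint32_t memory_read(uint32_t addr)        { (void)addr; return 42; }       /* 4) Mem   */
static void     write_back(uint32_t value)        { printf("WB: %u\n", value); }   /* 5) WB    */

int main(void) {
    uint32_t pc   = 0x00400000;
    uint32_t inst = instruction_fetch(&pc);
    uint32_t rs   = decode_read_reg(inst);        /* register number, standing in for a reg read */
    uint32_t addr = execute(rs, (int16_t)(inst & 0xFFFF));
    uint32_t data = memory_read(addr);
    write_back(data);
    return 0;
}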
Single Cycle Datapath
[Diagram: PC (with a +4 adder) drives the instruction memory; the rs, rt, rd, and imm fields drive the register file and ALU; the ALU and data memory feed write-back. Stage labels: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back.]
Pipeline registers
[Same datapath, redrawn with pipeline registers between the five stages: 1. Instruction Fetch, 2. Decode/Register Read, 3. Execute, 4. Memory, 5. Write Back.]
• Need registers between stages
  – To hold information produced in previous cycle
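A minimal sketch (not from the slides) of what one such pipeline register might hold, loosely modeled on an IF/ID boundary; the struct and field names are invented, and a real design also carries control bits forward.

/* Hedged sketch: a pipeline register is just state captured at a stage boundary
   on each clock edge. The IF/ID example below uses invented field names. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t pc_plus_4;    /* kept for later stages, e.g. branch target computation */
    uint32_t instruction;  /* raw bits, decoded by the next stage                   */
} IF_ID_Register;

int main(void) {
    IF_ID_Register if_id = { 0, 0 };     /* the register itself */
    /* On a rising clock edge, the fetch stage's outputs are latched: */
    if_id.pc_plus_4   = 0x00400004;
    if_id.instruction = 0x8C880004;      /* example: a lw encoding */
    printf("IF/ID holds pc+4=0x%08X inst=0x%08X\n", if_id.pc_plus_4, if_id.instruction);
    return 0;
}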
More Detailed Pipeline
[P&H figures walking the pipelined datapath one stage at a time:]
IF for Load, Store, …
ID for Load, Store, …
EX for Load
MEM for Load
WB for Load – Oops! Wrong register number: by write-back time, the write-register field on the wire belongs to a later instruction, not the load's own.
Corrected Datapath for Load
[The write-register number is carried along through the pipeline registers so WB writes the correct register.]
So, in conclusion
• You now know how to implement the control logic for the single-cycle CPU.
  – (actually, you already knew it!)
• Pipelining improves performance by increasing instruction throughput: it exploits ILP
  – Executes multiple instructions in parallel
  – Each instruction has the same latency
• Next: hazards in pipelining:
  – Structural, data, control