Components of a Computer
Datorarkitektur och operativsystem, Lecture 5

Processor
Datapath: the component of the processor that performs the arithmetic operations.
Control: the component of the processor that commands the datapath, memory, and I/O devices according to the instructions of the memory.

Datapath With Control
Goal: we can correctly interpret a datapath and its control from a schematic diagram.
Chapter 4 — The Processor — 4

Problem: Consider the instruction AND Rd, Rt, Rs. What are the control signals generated by the control unit in this figure?

ANSWER:
RegWrite asserted, to write the result to the registers.
The mux before the ALU signaled to read from the registers, not the immediate.
The mux before the registers signaled to use the ALU output, not data memory.
The ALU signaled to perform AND.
Branch not asserted.
MemWrite not asserted.
MemRead not asserted.

For some instructions, some control signals do not matter: e.g., the mux before the registers in a sw instruction.

Problem: If the only thing we need to do in a processor is fetch consecutive instructions (see figure), what would the cycle time be? Consider that the logic blocks have the following latencies; the other units have negligible latencies.

I-Mem: 200 ps, Add: 70 ps, Mux: 20 ps, ALU: 90 ps, Regs: 90 ps, D-Mem: 250 ps, Sign-Extend: 15 ps, Shift-left-2: 10 ps

ANSWER: 200 ps, because I-Mem has a larger latency than the Add unit, and I-Mem and Add are on parallel paths.

A Recap: Combinational Elements
AND gate: Y = A & B
Adder: Y = A + B
Multiplexer: Y = S ? I1 : I0
Arithmetic/Logic Unit: Y = F(A, B)

A Recap: State Elements
Registers, data memory, instruction memory.
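The control-signal answer above for AND Rd, Rt, Rs can be collected into a short sketch. The signal names (RegWrite, ALUSrc, MemtoReg, Branch, MemRead, MemWrite) follow the classic single-cycle MIPS datapath; the function below is an illustrative reconstruction, not code from the lecture.

```python
# Illustrative sketch of the single-cycle control unit's outputs for an
# R-format instruction such as AND Rd, Rt, Rs (1 = asserted, 0 = deasserted).
def control_signals(instr_class):
    if instr_class == "R-format":  # and, add, sub, or, slt, ...
        return {
            "RegWrite": 1,  # write the ALU result into Rd
            "ALUSrc":   0,  # mux before the ALU selects a register, not the immediate
            "MemtoReg": 0,  # mux before the registers selects the ALU, not data memory
            "Branch":   0,  # not a branch
            "MemRead":  0,  # data memory is not read
            "MemWrite": 0,  # data memory is not written
        }
    raise NotImplementedError(instr_class)

print(control_signals("R-format"))
```

This also shows why some signals can be "don't care": for a sw instruction, RegWrite is deasserted, so the setting of the mux before the registers (MemtoReg) never matters.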
Clocks are needed to decide when an element that contains state should be updated.

Clocking Methodology
We study the edge-triggered methodology, because it is simple. In an edge-triggered methodology, all state changes occur on a clock edge.

Single Cycle: Performance
Assume the time for the stages is 100 ps for a register read or write and 200 ps for the other stages.

Instr      Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw         200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw         200 ps       100 ps         200 ps  200 ps                         700 ps
R-format   200 ps       100 ps         200 ps                 100 ps          600 ps
beq        200 ps       100 ps         200 ps                                 500 ps

Performance Issues
The longest delay determines the clock period. The critical path is the load instruction: instruction memory, register file, ALU, data memory, register file. This violates the design principle of making the common case fast (read the text on pp. 329-330). Improve performance by pipelining!

The 5 Stages
Recap: we can look at the datapath as five stages, one step per stage:
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
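The totals in the single-cycle performance table above can be reproduced by summing the latencies of the stages each instruction class actually uses. The stage names below are ad hoc labels for this sketch.

```python
# Sketch: single-cycle instruction times from the slide's assumptions
# (200 ps for fetch/ALU/memory stages, 100 ps for a register read or write).
STAGE_PS = {"IF": 200, "READ": 100, "ALU": 200, "MEM": 200, "WRITE": 100}

USED_STAGES = {
    "lw":       ["IF", "READ", "ALU", "MEM", "WRITE"],
    "sw":       ["IF", "READ", "ALU", "MEM"],
    "R-format": ["IF", "READ", "ALU", "WRITE"],
    "beq":      ["IF", "READ", "ALU"],
}

totals = {i: sum(STAGE_PS[s] for s in used) for i, used in USED_STAGES.items()}
print(totals)  # {'lw': 800, 'sw': 700, 'R-format': 600, 'beq': 500}
```

The longest of these, 800 ps for lw, is what forces the single-cycle clock period, even though most instructions finish sooner.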
Pipeline
Pipelining = overlap the stages. In our example, a memory or ALU stage has a 200 ps latency and a register stage has a 100 ps latency.

Clock Cycle
Single-cycle: Tc = 800 ps. Pipelined: Tc = 200 ps. Even if some stages take only 100 ps instead of 200 ps, the pipelined execution clock cycle must be the worst-case clock cycle time of 200 ps.

Speedup from Pipelining
If all stages are balanced, i.e., all take the same time:

Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of stages

If the stages are not balanced, the speedup is less. Look at our example:
Time between instructions (non-pipelined) = 800 ps
Number of stages = 5
Ideal time between instructions (pipelined) = 800/5 = 160 ps
But what did we get? 200 ps, because the slowest stage sets the clock cycle. Also read the text on p. 334.

Recall from Chapter 1: throughput versus latency. The speedup from pipelining is due to increased throughput; the latency (the time for each instruction) does not decrease.

Hazards
Situations that prevent starting the next instruction in the next cycle are called hazards. There are 3 kinds of hazards:
Structure hazards
Data hazards
Control hazards

Structure Hazards
A structure hazard occurs when an instruction cannot execute in its proper cycle due to a conflict for the use of a resource.

Example of a structural hazard: imagine a MIPS pipeline with a single shared data and instruction memory, and a fourth additional instruction in the pipeline. When this instruction is trying to fetch from instruction memory, the first instruction is fetching from data memory, leading to a hazard. Hence, pipelined datapaths require separate instruction/data memories, or separate instruction/data caches.
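The shared-memory conflict can be made concrete with a tiny sketch: in an ideal 5-stage pipeline, instruction i occupies stage s in cycle i + s, and both IF (stage 0) and MEM (stage 3) touch memory. The stage numbering and function name here are my own illustration, not from the slides.

```python
# Sketch: why a single shared memory causes a structural hazard in a
# 5-stage pipeline. IF (stage 0) and MEM (stage 3) both access memory;
# with one shared memory, two instructions must not use it in one cycle.
def memory_users(n_instructions):
    """Map cycle -> list of (instruction, stage name) pairs that access memory."""
    users = {}
    for i in range(n_instructions):
        for stage, name in [(0, "IF"), (3, "MEM")]:
            users.setdefault(i + stage, []).append((i, name))
    return users

conflicts = {c: u for c, u in memory_users(5).items() if len(u) > 1}
print(conflicts)  # cycles 3 and 4 each see one MEM and one IF at once
```

This reproduces the slide's scenario: in cycle 3, instruction 0 is in MEM exactly when instruction 3 (the fourth instruction) is in IF.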
Data Hazards
A data hazard occurs when an instruction depends on the completion of a data access by a previous instruction:

add $s0, $t0, $t1
sub $t2, $s0, $t3

For the above code, when can we start the second instruction?

A bubble, or pipeline stall, can be used to resolve the hazard, BUT it leads to wasted cycles, i.e., performance deterioration. Solve the performance problem by 'forwarding' the required data.

Forwarding (aka Bypassing)
Use a result as soon as it is computed: don't wait for it to be stored in a register. This requires extra connections in the datapath.

add $s0, $t0, $t1
sub $t2, $s0, $t3

For the above code, draw the datapath with forwarding.

Load-Use Data Hazard
We can't always avoid stalls by forwarding: the value may not be computed yet when it is needed.

Code Scheduling to Avoid Stalls
Reorder code to avoid the use of a load result in the next instruction. C code for A = B + E; C = B + F;

Original order (two load-use stalls, 13 cycles):

lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2    # stall: $t2 was just loaded
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4    # stall: $t4 was just loaded
sw  $t5, 16($t0)

Reordered (no stalls, 11 cycles):

lw  $t1, 0($t0)
lw  $t2, 4($t0)
lw  $t4, 8($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

(Seven instructions take 11 cycles in a 5-stage pipeline; the two load-use stalls add 2 more, giving 13.)

Control Hazards
A control hazard occurs when the proper instruction cannot execute in the proper pipeline clock cycle, because the instruction that was fetched is not the one that is needed. A branch determines the flow of control, so fetching the next instruction depends on the branch outcome.

Stall on Branch
Naive/simple solution: stall if we fetch a branch instruction, i.e., wait until the branch outcome is determined before fetching the next instruction. Drawback: several cycles lost. Improve by adding extra hardware so that already in the second stage we know the address of the next instruction, minimizing the stalled cycles.

Branch Prediction
Longer pipelines can't readily determine the branch outcome early (in the second stage), so the stall penalty becomes unacceptable. Solution: predict the outcome of the branch, and only stall if the prediction is wrong. MIPS predicts branches as not taken. We will not study the details.

Recap: Hazards
Situations that prevent starting the next instruction in the next cycle:
Structure hazard: a required resource is busy.
Data hazard: need to wait for a previous instruction to complete its data read/write.
Control hazard: deciding on a control action depends on a previous instruction.

Pipeline Summary
Pipelining improves performance by increasing instruction throughput: it executes multiple instructions in parallel, while each instruction keeps the same latency. It is subject to structure, data, and control hazards. Instruction set design affects the complexity of the pipeline implementation (p. 335).

Administrative Detail: A Rule for Homework
Homework 2 is coming up very soon. The deadline may NOT be extended.
No exceptions for sickness, forgetfulness, … Why?

To ensure fairness to your classmates who worked hard and could submit only half of the questions. If someone submits later and completes all the questions, he/she should NOT be rewarded! It is just like an exam; so if you miss it, it is over. Only that, for this special exam, you do it at home!

From the Textbook
Last week: 4.1, 4.2, 4.3. Today: 4.5.