Datorarkitektur och operativsystem Lecture 5 1 Components of a Computer Processor Datapath Component of the p processor that performs arithmetic operations p Control Component of the p processor that commands the datapath, p memory, y I/O devices according to the instructions of the memory Goal: We can correctly interpret a datapath and control from a schematic diagram Datapath With Control Chapter 4 — The Processor — 4 Problem: Consider the instruction: AND Rd, Rt, Rs. What are the control signals generated by the control unit in this figure ? Chapter 4 — The Processor — 5 ANSWER : Regwrite asserted to write Mux (before ALUs) signaled to read from registers and not immediate Mux M (before re registers) isters) si signaled naled to use se ALU and not data memory ALU is signaled l d to perform f AND Branch not asserted Memwrite not asserted Memread not asserted For some instructions, some control signals do not matter. E.g, Mux i i l i l d (before registers) in a sw instruction. 6 Problem: If the only thing we need to do in a processor is fetch consecutive instructions (see figure), what would the cycle time be ? Consider that the logic blocks have the following latencies. Other units have negligible latencies units have negligible latencies. I ‐Mem ADD Mux ALU Regs D‐Mem Sign‐ Extend Shift‐left‐ 2 200 ps 70ps 20ps 90ps 90ps 250ps 15ps 10ps ANSWER: 200ps because I-Mem has larger latency than the Add unit I-Mem and Add are in parallel paths 8 A Recap: Combinational Elements AND-gate AND gate Y =A & B A B Multiplexer A + Y = A + B B Y Adder Y = S ? I1 : I0 I0 I1 M u x S Chapter 4 — The Processor — 9 Arithmetic/Logic Unit / Y = F(A, B) ( , ) A ALU Y B F Y Y A Recap: State Elements Registers Data Memory Instruction Memory Clocks are needed to decide when an element that contains state should be updated 10 Clocking Methodology We study Ed Edge triggered i d methodology h d l • Because it is simple Edge triggered methodology: All state changes occur on a clock edge Chapter 4 — The Processor — 11 Clocking Methodology Longest delay determines clock period Chapter 4 — The Processor — 12 Single Cycle: Performance Assume time for stages is 100ps for register read or write 200ps p for other stages g Instr Instr fetch Register read ALU op Memory access Register write Total time lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps R-format 200ps 100 ps 200ps beq 200ps 100 ps 200ps 700ps 100 ps 600ps 500ps 200 ps latency 100 ps latency Single Cycle: Performance Chapter 4 — The Processor — 14 Pipelining = Overlap the stages Performance Issues LLongest delay d l d determines i clock l k period i d Critical path: load instruction Instruction memory register file ALU data memoryy register g file Performance Issues Violates design principle Making the common case fast • (read text in p.329-330) Improve performance by pipelining! Recap: the stages of the datapath We can look W l k at the h datapath d h as five fi stages, one step per stage 1. IF: Instruction fetch from memory g read 2. ID: Instruction decode & register 3. EX: Execute operation or calculate address 4 MEM: Access memory operand 4. 5. WB: Write result back to register The 5 Stages g 200 ps latency 100 ps latency Pipelining = Overlap the stages Chapter 4 — The Processor — 19 Pipeline Chapter 4 — The Processor — 20 Clock Cycle Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) Chapter 4 — The Processor — 21 Even if some stages take only 100ps instead of 200ps, the pipelined execution clock cycle must have the worst case clock cycle time of 200ps 22 Speedup from Pipelining If all stages are balanced i.e., all take the same time Time between instructionspipelined = Time between instructionsnonpipelined i li d Number of stages Chapter 4 — The Processor — 23 Speedup from Pipelining If the stages are not balanced (not equal), speedup is less . Look at our example: Time between instructions (non-pipelined) = 800ps Number of stages = 5 Time between instructions (pipelined) = 800/5 =160 ps But, what did we get ? Also read text in p.334 p Chapter 4 — The Processor — 24 Speedup from Pipelining Recall from Chapter 1: Throughput versus latency p p in ppipelining p g is due to increased throughput g p Speedup Latency (time for each instruction) does not decrease Chapter 4 — The Processor — 25 Hazards Situations that prevent starting the next instruction in the next cycle are called hazards There are 3 kinds of hazards Structure hazards Data hazards Control hazards Chapter 4 — The Processor — 26 Structure Hazards When an instruction cannot execute in proper cycle due to a conflict for use of a resource Chapter 4 — The Processor — 27 Example of structural hazard Imagine a MIPS p pipeline p with a single g data and instruction memoryy a fourth additional instruction in the pipeline; when this instruction is trying to fetch from instruction memory, the first instruction is fetching of data memory – leading to a hazard Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches Data Hazards An instruction depends on completion of data access b a previous by i i instruction i Chapter 4 — The Processor — 29 Data Hazards add $s0, $t0, $t1 sub b $t2, $ 2 $s0, $ 0 $t3 $ 3 For the above code, when can we start the second instruction ? Data Hazards add sub Chapter 4 — The Processor — 31 $s0, , $t0, , $t1 $t2, $s0, $t3 Bubble or the ppipeline p stall is used to resolve a hazard BUT it leads to wastage of cycles = performance deterioration Solve the performance problem by ‘forwarding’ the required data 32 Forwarding (aka Bypassing) Use result when it is computed : don don’tt wait for it to be stored in a register Requires extra connections in the datapath Forwarding (aka Bypassing) add $s0, $t0, $t1 sub $t2, $s0, $t3 For the above code, draw the datapath p with forwarding Forwarding (aka Bypassing) Chapter 4 — The Processor — 35 Load Use Data Hazard Load-Use Can’t always avoid stalls by forwarding • If value not computed when needed Chapter 4 — The Processor — 36 Code Scheduling g to Avoid Stalls Reorder code to avoid use of load result in the next i instruction i C code for A = B + E; ; C = B + F; ; lw lw add sw lw add sw $t1, $t1 $t2, $t3, $t3, $t4, $ $t5, $t5, 0($t0) 4($t0) $t1, $t2 12($t0) 8($t0) $t1, $ $t4 $ 16($t0) 13 cycles l Chapter 4 — The Processor — 37 Code Scheduling g to Avoid Stalls Reorder code to avoid use of load result in the next i instruction i C code for A = B + E; ; C = B + F; ; stall stall lw lw add sw lw add sw $t1, $t1 $t2, $t3, $t3, $t4, $ $t5, $t5, 0($t0) 4($t0) $t1, $t2 12($t0) 8($t0) $t1, $ $ $t4 16($t0) 13 cycles l Chapter 4 — The Processor — 38 Code Scheduling g to Avoid Stalls Reorder code to avoid use of load result in the next i instruction i C code for A = B + E; ; C = B + F; ; stall stall lw lw add sw lw add sw $t1, $t1 $t2, $t3, $t3, $t4, $ $t5, $t5, 0($t0) 4($t0) $t1, $t2 12($t0) 8($t0) $t1, $ $ $t4 16($t0) 13 cycles l lw lw lw add sw add sw $t1, $t1 $t2, $t4, $t3, $t3, $ $t5, $t5, 0($t0) 4($t0) 8($t0) $t1, $t2 12($t0) $t1, $ $ $t4 16($t0) 11 cycles l Chapter 4 — The Processor — 39 Control Hazards When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is need Chapter 4 — The Processor — 40 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Naïve/Simple p Solution: Stall if we fetch a branch instruction Drawback: Several cycles y lost Improve by adding hardware so that already in second stage g we can know the address of the next instruction Chapter 4 — The Processor — 41 Stall on Branch Wait until branch outcome determined before fetching next instruction but with extra hardware to minimize the stalled cycles y Chapter 4 — The Processor — 42 Branch Prediction Longer g pipelines pp can’t readilyy determine branch outcome early (in second stage) Stall penalty becomes unacceptable Solution: predict outcome of branch Only stall if prediction is wrong We will not study y the details Chapter 4 — The Processor — 43 MIPS with Predict Not Taken Prediction correct Prediction incorrect Chapter 4 — The Processor — 44 Recap: Hazards Situations that prevent starting the next instruction in the next cycle Structure S hhazards d A required resource is busy Data hazard Need to wait for previous instruction to complete its data read/write Control hazard Deciding on control action depends on previous i t ti instruction Chapter 4 — The Processor — 45 Pipeline Summary Pi Pipelining li i improves i performance f bby iincreasing i instruction throughput Executes multiple instructions in parallel Each instruction has the same latency Subject to hazards Structure, data, control Instruction set design affects complexity of pipeline implementation (p.335) Chapter 4 — The Processor — 46 Administrative Detail Homework 2 is coming up very soon 47 Administrative Detail: A Rule for Homework The deadline may NOT be extended. No exceptions p for sickness,, forgetfulness, g ,… Why? 48 Administrative Detail: A Rule for Homework The deadline may NOT be extended. No exceptions p for sickness,, forgetfulness, g ,… Why ? To T ensure fairness f i to your classmates l who h worked hard and could submit only half of the questions. i If someone submits b i llater and d completes l all questions, he/she should NOT be rewarded! It is just like an exam; so if you miss it, it is over. Only that, for this special exam, you do it at home! 49 From the Textbook Last week: 4.1, 4.2, 4.3 Today: 44.55 50 50