Datorarkitektur och operativsystem Lecture 5

Components of a Computer Processor
• Datapath: the component of the processor that performs arithmetic operations
• Control: the component of the processor that commands the datapath, memory, and I/O devices according to the instructions in memory
Goal: We can correctly interpret a datapath and control from a schematic diagram

Datapath With Control
Chapter 4: The Processor
[Figure: datapath with control]

Problem: Consider the instruction AND Rd, Rt, Rs. What are the control signals generated by the control unit in this figure?

ANSWER:
• RegWrite asserted to write
• Mux (before ALU) signaled to read from registers and not the immediate
• Mux (before registers) signaled to use the ALU result and not data memory
• ALU is signaled to perform AND
• Branch not asserted
• MemWrite not asserted
• MemRead not asserted
For some instructions, some control signals do not matter, e.g., the Mux (before registers) in a sw instruction.

Problem: If the only thing we need to do in a processor is fetch consecutive instructions (see figure), what would the cycle time be? Consider that the logic blocks have the following latencies. Other units have negligible latencies.

  Unit            Latency
  I-Mem           200 ps
  ADD             70 ps
  Mux             20 ps
  ALU             90 ps
  Regs            90 ps
  D-Mem           250 ps
  Sign-Extend     15 ps
  Shift-left-2    10 ps

ANSWER: 200 ps, because
• I-Mem has a larger latency than the Add unit
• I-Mem and Add are in parallel paths

A Recap: Combinational Elements
• AND gate: Y = A & B
• Adder: Y = A + B
• Multiplexer: Y = S ? I1 : I0
• Arithmetic/Logic Unit: Y = F(A, B)

A Recap: State Elements
• Registers
• Data Memory
• Instruction Memory
• Clocks are needed to decide when an element that contains state should be updated

Clocking Methodology
• We study the edge-triggered methodology
  - Because it is simple
• Edge-triggered methodology: all state changes occur on a clock edge
• Longest delay determines clock period

Single Cycle: Performance
• Assume time for stages is
  - 100 ps for register read or write
  - 200 ps for other stages

  Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
  lw         200 ps        100 ps          200 ps   200 ps          100 ps           800 ps
  sw         200 ps        100 ps          200 ps   200 ps                           700 ps
  R-format   200 ps        100 ps          200 ps                   100 ps           600 ps
  beq        200 ps        100 ps          200 ps                                    500 ps

[Figure: single-cycle execution, with 200 ps and 100 ps stage latencies]

Performance Issues
• Longest delay determines clock period
• Critical path: load instruction
  - Instruction memory -> register file -> ALU -> data memory -> register file
• Violates design principle
  - Making the common case fast (read text in p.329-330)
• Improve performance by pipelining!

Recap: the stages of the datapath
• We can look at the datapath as five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

The 5 Stages
[Figure: the five stages, with 200 ps and 100 ps stage latencies]
• Pipelining = Overlap the stages

Pipeline
[Figure: pipelined execution]

Clock Cycle
[Figure: single-cycle (Tc = 800 ps) versus pipelined (Tc = 200 ps)]
• Even if some stages take only 100 ps instead of 200 ps, the pipelined clock cycle must have the worst-case cycle time of 200 ps

Speedup from Pipelining
• If all stages are balanced, i.e., all take the same time:
  Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If the stages are not balanced (not equal), the speedup is less. Look at our example:
  - Time between instructions (nonpipelined) = 800 ps
  - Number of stages = 5
  - Ideal time between instructions (pipelined) = 800/5 = 160 ps
  - But what did we get? Tc = 200 ps, so the speedup is 800/200 = 4 rather than 5
  - Also read text in p.334
• Recall from Chapter 1: throughput versus latency
  - Speedup in pipelining is due to increased throughput
  - Latency (time for each instruction) does not decrease

Hazards
• Situations that prevent starting the next instruction in the next cycle are called hazards
• There are 3 kinds of hazards
  - Structure hazards
  - Data hazards
  - Control hazards

Structure Hazards
• When an instruction cannot execute in the proper cycle due to a conflict for use of a resource

Example of structural hazard
• Imagine
  - a MIPS pipeline with a single memory for both data and instructions
  - a fourth additional instruction in the pipeline; when this instruction is trying to fetch from instruction memory, the first instruction is accessing data memory, leading to a hazard
• Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches

Data Hazards
• An instruction depends on completion of data access by a previous instruction

    add $s0, $t0, $t1
    sub $t2, $s0, $t3

• For the above code, when can we start the second instruction?
• A bubble (pipeline stall) is used to resolve a hazard
• BUT it leads to wasted cycles = performance deterioration
• Solve the performance problem by 'forwarding' the required data

Forwarding (aka Bypassing)
• Use the result when it is computed: don't wait for it to be stored in a register
• Requires extra connections in the datapath

    add $s0, $t0, $t1
    sub $t2, $s0, $t3

• For the above code, draw the datapath with forwarding
[Figure: datapath with forwarding]

Load-Use Data Hazard
• Can't always avoid stalls by forwarding
  - If the value is not computed when needed
[Figure: load-use data hazard]

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of a load result in the next instruction
• C code for A = B + E; C = B + F;

  Original order (13 cycles):

    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    (stall)
    add  $t3, $t1, $t2
    sw   $t3, 12($t0)
    lw   $t4, 8($t0)
    (stall)
    add  $t5, $t1, $t4
    sw   $t5, 16($t0)

  Reordered (11 cycles):

    lw   $t1, 0($t0)
    lw   $t2, 4($t0)
    lw   $t4, 8($t0)
    add  $t3, $t1, $t2
    sw   $t3, 12($t0)
    add  $t5, $t1, $t4
    sw   $t5, 16($t0)

Control Hazards
• When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed
• Branch determines flow of control
  - Fetching the next instruction depends on the branch outcome
• Naïve/simple solution: stall if we fetch a branch instruction
  - Drawback: several cycles lost
• Improve by adding hardware so that already in the second stage we can know the address of the next instruction

Stall on Branch
• Wait until the branch outcome is determined before fetching the next instruction
  - but with extra hardware to minimize the stalled cycles
[Figure: stall on branch]

Branch Prediction
• Longer pipelines can't readily determine the branch outcome early (in the second stage)
  - Stall penalty becomes unacceptable
• Solution: predict the outcome of the branch
  - Only stall if the prediction is wrong
• We will not study the details

MIPS with Predict Not Taken
[Figure: two cases, prediction correct and prediction incorrect]

Recap: Hazards
• Situations that prevent starting the next instruction in the next cycle
• Structure hazards
  - A required resource is busy
• Data hazard
  - Need to wait for a previous instruction to complete its data read/write
• Control hazard
  - Deciding on a control action depends on a previous instruction

Pipeline Summary
• Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
• Subject to hazards
  - Structure, data, control
• Instruction set design affects the complexity of the pipeline implementation (p.335)

Administrative Detail
• Homework 2 is coming up very soon

Administrative Detail: A Rule for Homework
• The deadline may NOT be extended. No exceptions for sickness, forgetfulness, ...
• Why?
  - To ensure fairness to your classmates who worked hard and could submit only half of the questions. If someone submits later and completes all questions, he/she should NOT be rewarded!
  - It is just like an exam; so if you miss it, it is over. Only that, for this special exam, you do it at home!

From the Textbook
• Last week: 4.1, 4.2, 4.3
• Today: 4.5
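
Worked example: clock period and speedup

The timing figures from the Single Cycle: Performance and Speedup from Pipelining slides can be reproduced with a few lines of Python. This is only a sketch under the slide's assumptions (100 ps for register read or write, 200 ps for the other stages). The names STAGE_PS, USED_STAGES and the helper functions are made up for illustration and do not come from the lecture or the textbook.

```python
# Stage latencies in picoseconds, as assumed on the slide:
# 100 ps for register read/write, 200 ps for the other stages.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses (from the slide's table).
USED_STAGES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instruction_time(instr):
    """Total time of one instruction: the sum of the stages it uses."""
    return sum(STAGE_PS[stage] for stage in USED_STAGES[instr])


def single_cycle_period():
    """Single-cycle clock period: set by the slowest instruction (lw)."""
    return max(instruction_time(instr) for instr in USED_STAGES)


def pipelined_period():
    """Pipelined clock period: set by the slowest single stage."""
    return max(STAGE_PS.values())


def ideal_pipelined_period():
    """Period if all stages were perfectly balanced."""
    return single_cycle_period() / len(STAGE_PS)


if __name__ == "__main__":
    for instr in USED_STAGES:
        print(instr, instruction_time(instr), "ps")
    print("single-cycle Tc:", single_cycle_period(), "ps")
    print("pipelined Tc:", pipelined_period(), "ps")
    print("ideal pipelined Tc:", ideal_pipelined_period(), "ps")
    print("speedup:", single_cycle_period() / pipelined_period())
```

The gap between the ideal 160 ps and the actual 200 ps comes from the unbalanced stages: the 100 ps register stages still occupy a full 200 ps clock cycle.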
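
Worked example: counting load-use stalls

The 13-cycle and 11-cycle counts on the Code Scheduling to Avoid Stalls slide can be checked with a small cycle counter. This is a simplified sketch, not a pipeline simulator. It assumes a 5-stage pipeline with full forwarding, where the only stall is one bubble when an instruction reads a register loaded by the lw immediately before it. The functions parse and count_cycles are hypothetical helpers, and the parser only handles the lw, sw, and add forms used on the slide.

```python
import re

PIPELINE_STAGES = 5  # IF, ID, EX, MEM, WB


def parse(line):
    """Return (opcode, destination register or None, set of source registers)."""
    opcode = line.split()[0]
    regs = re.findall(r"\$\w+", line)
    if opcode == "sw":
        # sw reads both of its registers and writes none.
        return opcode, None, set(regs)
    # lw/add: the first register is the destination, the rest are sources.
    return opcode, regs[0], set(regs[1:])


def count_cycles(program):
    """Cycles to run `program`, assuming one stall per load-use pair."""
    stalls = 0
    prev_opcode, prev_dest = None, None
    for line in program:
        opcode, dest, sources = parse(line)
        # Load-use hazard: the previous instruction is a load whose
        # result this instruction needs in the very next cycle.
        if prev_opcode == "lw" and prev_dest in sources:
            stalls += 1
        prev_opcode, prev_dest = opcode, dest
    # First instruction takes 5 cycles, each later one adds 1, plus stalls.
    return PIPELINE_STAGES + (len(program) - 1) + stalls


ORIGINAL = [
    "lw $t1, 0($t0)",
    "lw $t2, 4($t0)",
    "add $t3, $t1, $t2",
    "sw $t3, 12($t0)",
    "lw $t4, 8($t0)",
    "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]

REORDERED = [
    "lw $t1, 0($t0)",
    "lw $t2, 4($t0)",
    "lw $t4, 8($t0)",
    "add $t3, $t1, $t2",
    "sw $t3, 12($t0)",
    "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]

if __name__ == "__main__":
    print("original:", count_cycles(ORIGINAL), "cycles")
    print("reordered:", count_cycles(REORDERED), "cycles")
```

Moving the third lw up separates each load from the add that uses it, which is why both stalls disappear in the reordered version.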