Chapter 4 The Processor CPU 效能因素(performance factors) 指令數(Instruction count) 由CPU硬體決定(Determined by CPU hardware) 兩種MIPS實作( two MIPS implementations) 由ISA及編譯器決定(Determined by ISA and compiler) 指令週期數 & 週期時間(CPI and Cycle time) §4.1 Introduction 簡介Introduction 簡易版(A simplified version) 實際管線版(A more realistic pipelined version) 檢單子集Simple subset, shows most aspects 記憶題存取(Memory reference): lw, sw 算術邏輯(Arithmetic/logical): add, sub, and, or, slt 控制轉移(Control transfer): beq, j Chapter 4 — The Processor — 2 執行指令Instruction Execution 擷取指令(PC instruction memory, fetch instruction) 暫存器(Register numbers register file, read registers) 與指令類別有關(Depending on instruction class) 計算(Use ALU to calculate) 記憶體存取(Memory address for load/store) Arithmetic result Access data memory for load/store 下一指令(Branch target address) PC target address or PC + 4 Chapter 4 — The Processor — 3 CPU Overview Chapter 4 — The Processor — 4 多工器Multiplexers 不能將線直接連接:多工器 Can’t just join wires together 程式計數器 指令記憶體 暫存器檔案 Use multiplexers 資料記憶體 Chapter 4 — The Processor — 5 控制線路Control Chapter 4 — The Processor — 6 二元資訊Information encoded in binary 組合邏輯Combinational element Low voltage = 0, High voltage = 1 One wire per bit Multi-bit data encoded on multi-wire buses §4.2 Logic Design Conventions 邏輯設計Logic Design Basics 資料運算Operate on data 輸出入函數關係Output is a function of input 序向邏輯State (sequential) elements Store information Chapter 4 — The Processor — 7 Combinational Elements AND-gate Y=A&B A B Multiplexer A + Y=A+B Y B Y Adder Arithmetic/Logic Unit Y = F(A, B) Y = S ? I1 : I0 A I0 I1 M u x S ALU Y Y B F Chapter 4 — The Processor — 8 Sequential Elements 暫存器(Register): stores data in a circuit 利用clock信號決定何時更新資料(Uses a clock signal to determine when to update the stored value) 邊緣觸發Clk改變(上升緣)觸發(Edge-triggered: update when Clk changes from 0 to 1) Clk D Q D Clk Q Chapter 4 — The Processor — 9 Sequential Elements 暫存器具寫入控制(Register with write control) 當Clk上升緣且Write為High時才寫入(Only updates on clock edge when write control input is 1) Used when stored value is required later Clk D Write Clk Q Write D Q Chapter 4 — The Processor — 10 Clocking Methodology 組合邏輯在時脈週期轉換資料(Combinational logic transforms data during clock cycles) 位準觸發(Between clock edges) 輸出入皆為序向元件(Input from state elements, output to state element) 最常延遲決定時脈寬度(Longest delay determines clock period) Chapter 4 — The Processor — 11 資料路徑(Datapath) CPU中處理資料及位址之元件(Elements that process data and addresses in the CPU) §4.3 Building a Datapath 建立資料路徑Building a Datapath 暫存器、ALU、多工器(Registers, ALUs, mux’s,memories, …) 建構MIPS資料路徑(Build a MIPS datapath incrementally) Refining the overview design Chapter 4 — The Processor — 12 Instruction Fetch 32-bit register 32位元暫存器 Increment by 4 for next instruction 下一指令 PC+4 Chapter 4 — The Processor — 13 R格式指令(R-Format Instructions) 讀取兩個傳存器運算元(Read two register operands) 執行算術邏輯運算(Perform arithmetic/logical operation) 寫回傳暫存器結果(Write register result) 5條輸入線表示有25=32個暫存器 Chapter 4 — The Processor — 14 Load/Store Instructions 讀取暫存器運算元(Read register operands) 使用16bit偏移計算位址(Calculate address using 16bit offset) Use ALU, but sign-extend offset Load: Read memory and update PC register Store: Write register value to memory 0110 0000 0110 1010 1111 1010 Chapter 4 — The Processor — 15 分支指令Branch Instructions 讀取運算元(Read register operands) 比較運算元(Compare operands) 使用ALU、相減並檢查零輸出(Use ALU, subtract and check Zero output) 計算目標位址(Calculate target address) 符號擴展置換(Sign-extend displacement) 位移2個位置(字組位移) Shift left 2 places (word displacement) PC值加4(Add to PC + 4) 指令擷取時已計算Already calculated by instruction fetch Chapter 4 — The Processor — 16 Branch Instructions Just re-routes wires Sign-bit wire replicated Chapter 4 — The Processor — 17 Composing the Elements 將指令的資料路徑切在一個時脈週期內 (First-cut data path does an instruction in one clock cycle) 任一資歷路徑元件只能值一次行一個功能 (Each datapath element can only do one function at a time) 將指令與資料記憶體分開 (separate instruction and data memories) 使用多工器選擇不同指令的資料來源 (Use multiplexers where alternate data sources are used for different instructions) Chapter 4 — The Processor — 18 R-Type/Load/Store Datapath R型/載入/儲存指令 資料路徑 add addi lw sw $t1, $s3, $t0, $t0, $t1, $s6 $s3, 1 32($s3) 48($s3) Chapter 4 — The Processor — 19 Full Datapath(完整資料路徑) add addi lw sw $t1, $s3, $t0, $t0, $t1, $s6 $s3, 1 32($s3) 48($s3) Chapter 4 — The Processor — 20 ALU之用途(ALU used for) Load/Store: F = add Branch: F = subtract R-type: F depends on funct field(功能欄) ALU control(ALU控制信號) Function(功能) 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR §4.4 A Simple Implementation Scheme ALU控制ALU Control Chapter 4 — The Processor — 21 ALU控制ALU Control 假設2位元ALUOp 從opcode中取得(Assume 2-bit ALUOp derived from opcode) Combinational logic derives ALU control opcode ALUOp Operation funct ALU function lw 00 load word XXXXXX add 0010 sw 00 store word XXXXXX add 0010 beq 01 branch equal XXXXXX subtract 0110 R-type 10 add 100000 add 0010 subtract 100010 subtract 0110 AND 100100 AND 0000 OR 100101 OR 0001 set-on-less-than 101010 set-on-less-than 0111 Functl功能欄只有在R型指令時才有作用 ALU control 主要控制單元(The Main Control Unit) Control signals derived from instruction R-type Load/ Store Branch 0 rs rt rd shamt funct 31:26 25:21 20:16 15:11 10:6 5:0 35 or 43 rs rt address 31:26 25:21 20:16 15:0 4 rs rt address 31:26 25:21 20:16 15:0 opcode always read read, except for load write for R-type and load sign-extend and add 運算碼 一定要讀 一定要讀 除了load 寫入Reg R、Load 符號擴展、加 Chapter 4 — The Processor — 23 資料路徑+控制(Datapath With Control) Chapter 4 — The Processor — 24 R型指令(R-Type Instruction) Chapter 4 — The Processor — 25 R型指令(R-Type Instruction) 執行指令 add $t1, $t2, $t3 執行步驟 1.擷取指令,PC值+4 2.從暫存器檔案中讀出$t2,$t3; 同時,CU解碼指令(計算所需之控制訊號線之值) 3.ALU根據功能碼(funct)對$t2,$t3做運算 4.ALU結果寫入目的暫存器 Chapter 4 — The Processor — 26 載入指令(Load Instruction) lw $t0, 32($s3) ; 35 19 8 32 35 or 43 rs rt address 31:26 25:21 20:16 15:0 Chapter 4 — The Processor — 27 載入指令(Load Instruction) 執行指令 lw $t0, 32($s3) 執行步驟: 1.擷取指令,PC值+4 2.從暫存器檔案中讀出$s3; 3.ALU計算$s3與經過符號延伸之32之和 4. ALU計算結果為資料記憶體的輸入位址 5.資料記憶體傳回之值寫入暫存器檔案($t0) Chapter 4 — The Processor — 28 (條件分支指令)Branch-on-Equal Instruction beq $s1, $s2, 100 ; 4 17 18 25 4 rs rt address 31:26 25:21 20:16 15:0 Chapter 4 — The Processor — 29 (條件分支指令)Branch-on-Equal Instruction 執行指令 beq $s0, $s1, 100 執行步驟 1.擷取指令,PC值+4 2.從暫存器檔案中讀出$s0, $s1 3.ALU執行減法, (PC+4) 值與經過符號延伸之 並左移2位之值相加之和(i.e.分支目的位址) 4. ALU之Zero輸出決定哪一個加法器結果寫回 PC Chapter 4 — The Processor — 30 絕對跳躍指令(Implementing Jumps) Jump 2 address 31:26 25:0 Jump uses word address Update PC with concatenation of Top 4 bits of old PC 26-bit jump address 002 需要從運算碼(opcode)多一個額外控制訊號 (Need an extra control signal decoded from opcode) Chapter 4 — The Processor — 31 跳躍指令增加之資料路徑 Datapath With Jumps Added Chapter 4 — The Processor — 32 效能議題Performance Issues Longest delay determines clock period 關鍵路徑:Load指令(Critical path: load instruction) 指令記憶體暫存器檔案 ALU 資料記憶體傳存器檔案 (Instruction memory register file ALU data memory register file) 對不同指令沒有彈性可以改變週期 (Not feasible to vary period for different instructions) 違反設計原則Violates design principle 讓一般情況加快(Making the common case fast) 利用管線處理來增進效能 (We will improve performance by pipelining) Chapter 4 — The Processor — 33 管線式洗衣(Pipelined laundry: overlapping execution) 平行處理增進效能(Parallelism improves performance) Four loads: §4.5 An Overview of Pipelining 管線之比喻Pipelining Analogy Speedup = 8/3.5 = 2.3 Non-stop: Speedup = 2n/0.5n + 1.5 ≈ 4 = number of stages Chapter 4 — The Processor — 34 MIPS管線處理(MIPS Pipeline) 管線處理(pipelining)之定義 將指令區分成數個步驟,分別由不同的功能單元同時加以執行, 以增進整體程式之效能 管線處理五個步驟(Five stages, one step per stage) 1. IF: Instruction Fetch(擷取指令) 從記憶體擷取指令(Instruction fetch from memory) 2. ID: Instruction Decode(指令解碼) 解碼指令並讀取暫存器(Instruction decode & register read) 3. EX: Execution(執行指令) 執行運算或計算位址(Execute operation or calculate address) 4. MEM: Memory Access(記憶體存取) 存取記憶體運算元(Access memory operand) 5. WB: Write Back(寫回) 將結果寫回至暫存器(Write result back to register) Chapter 4 — The Processor — 35 管線處理效能Pipeline Performance 假設每一階段的時間為(Assume time for stages is) 暫存器讀寫:100ps for register read or write 其他階段:200ps for other stages 比較管線處理與單一週期的資料路徑 (Compare pipelined datapath with single-cycle datapath) Instr Instr fetch Register read ALU op Memory access Register write Total time lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps R-format 200ps 100 ps 200ps beq 200ps 100 ps 200ps 700ps 100 ps 600ps 500ps Chapter 4 — The Processor — 36 Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) Chapter 4 — The Processor — 37 管線處理加速Pipeline Speedup 所有階段都一致If all stages are balanced 每一階段時間都相同(i.e., all take the same time) Time between instructionspipelined = Time between instructionsnonpipelined Number of stages 若階段不一致,加速值較少If not balanced, speedup is less 管線處理的「加速』源自增加處理量(產量) Speedup due to increased throughput 延遲時間(latency每一指令的時間)沒有減少 Latency (time for each instruction) does not decrease Chapter 4 — The Processor — 38 Pipelining and ISA Design MIPS ISA專為管線化處理所設計 (MIPS ISA designed for pipelining) 所有指令皆為32位元(All instructions are 32-bits) 較容易在一個週期內擷取並解碼 (Easier to fetch and decode in one cycle) c.f. x86: 1- to 17-byte instructions 少量且規則的指令格式(Few and regular instruction formats) 能在一個步驟內解碼並讀取暫存器 (Can decode and read registers in one step) 載入與儲存定址(Load/store addressing) 能在第3階段計算位址,在第4階段存犬記憶體 (Can calculate address in 3rd stage, access memory in 4th stage) 記憶體運算元對齊(Alignment of memory operands) 記憶體存取只需一個週期 (Memory access takes only one cycle) Chapter 4 — The Processor — 39 危障Hazards 下一週期的起始位址不是下一指令 Situations that prevent starting the next instruction in the next cycle 危障種類Hazard types 結構危障(Structure hazards) 所需資源忙碌中(A required resource is busy) 資料危障(Data hazard) 等待前一指令完成資料讀寫 Need to wait for previous instruction to complete its data read/write 控制危障(Control hazard) 依前一指令結果決定控制動作 Deciding on control action depends on previous instruction Chapter 4 — The Processor — 40 結構危障Structure Hazards 使用資源衝突Conflict for use of a resource MIPS中只有一個記憶體In MIPS pipeline with a single memory Load/store 需要做資料存取Load/store requires data access 該週期的指令擷取必須延遲(stall),需管線泡泡 Instruction fetch would have to stall for that cycle Would cause a pipeline “bubble” 管線式資料路徑需要獨立的指令/資料記憶體 或獨立的指令/資料快取(記憶體) Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches Chapter 4 — The Processor — 41 資料危障Data Hazards 與前依指令資料存取完成結果有關 An instruction depends on completion of data access by a previous instruction add sub $s0, $t0, $t1 $t2, $s0, $t3 Chapter 4 — The Processor — 42 前饋Forwarding (aka Bypassing) 使用已經計算完成的結果Use result when it is computed 不需等到存至暫存器中Don’t wait for it to be stored in a register 資料路徑需要額外的連接線 Requires extra connections in the datapath Chapter 4 — The Processor — 43 Load指令-資料危障(Load-Use Data Hazard) 使用前饋仍無法避免要用延遲(stall) Can’t always avoid stalls by forwarding 當所需要的值尚未計算完成 If value not computed when needed 無法前饋至之前的時間 Can’t forward backward in time! Chapter 4 — The Processor — 44 Code Scheduling to Avoid Stalls 利用指令重排避免下一 指令為Load指令 Reorder code to avoid use of load result in the next instruction A = B + E; C = B + F; C code for stall stall lw lw add sw lw add sw $t1, $t2, $t3, $t3, $t4, $t5, $t5, 0($t0) 4($t0) $t1, $t2 12($t0) 8($t0) $t1, $t4 16($t0) 13 cycles lw lw lw add sw add sw $t1, $t2, $t4, $t3, $t3, $t5, $t5, 0($t0) 4($t0) 8($t0) $t1, $t2 12($t0) $t1, $t4 16($t0) 11 cycles Chapter 4 — The Processor — 45 控制危障Control Hazards 分支決定控制流程Branch determines flow of control 擷取下一指令取決於分支結果 Fetching next instruction depends on branch outcome 管線處理不能永遠擷取正確的下一個指令 Pipeline can’t always fetch correct instruction 仍在分支指令的ID階段 Still working on ID stage of branch MIPS的管線處理中In MIPS pipeline 需要在管線中比較暫存器與提早計算目標位址 Need to compare registers and compute target early in the pipeline 在ID階段增加硬體來處理 Add hardware to do it in ID stage Chapter 4 — The Processor — 46 分支中的延遲Stall on Branch 等到分支結果來決定擷取下一指令 Wait until branch outcome determined before fetching next instruction Chapter 4 — The Processor — 47 分支預測Branch Prediction 較長的管線無法完全提早決定分之結果 Longer pipelines can’t readily determine branch outcome early 延遲時間變得無法接受 Stall penalty becomes unacceptable 分支預測結果Predict outcome of branch 預測錯誤只有造成延遲Only stall if prediction is wrong MIPS管線處理中In MIPS pipeline 可以預測分支未發生Can predict branches not taken 分支後的擷取指令沒有延遲 Fetch instruction after branch, with no delay Chapter 4 — The Processor — 48 MIPS with Predict Not Taken Prediction correct Prediction incorrect Chapter 4 — The Processor — 49 More-Realistic Branch Prediction 靜態分支預測Static branch prediction 基於典型分支行為Based on typical branch behavior 範例:迴圈與if指令Example: loop and if-statement branches 預測反向分支會發生Predict backward branches taken 預測前向分支不會發生Predict forward branches not taken 動態分支預測Dynamic branch prediction 硬體測量實際分支行為Hardware measures actual branch behavior 例如:記錄每一分支最近結果的歷史 e.g., record recent history of each branch 假設未來行為會持續趨勢 Assume future behavior will continue the trend 當猜錯時,使用重新擷取時延遲stall、並更新歷史紀錄 When wrong, stall while re-fetching, and update history Chapter 4 — The Processor — 50 管線處理總結Pipeline Summary The BIG Picture 管線處理增加效能是利用增加指令處理量 Pipelining improves performance by increasing instruction throughput 危障Subject to hazards 同時執行多個指令Executes multiple instructions in parallel 每一指令有相同延遲時間Each instruction has the same latency 結構、資料、控制Structure, data, control 指令集設計影響實現管線處理的複雜度 Instruction set design affects complexity of pipeline implementation Chapter 4 — The Processor — 51 §4.6 Pipelined Datapath and Control MIPS Pipelined Datapath MEM Right-to-left flow leads to hazards WB Chapter 4 — The Processor — 52 管線暫存器Pipeline registers 管線各階段間需要暫存器Need registers between stages 保存上一階段產生的資訊To hold information produced in previous cycle Chapter 4 — The Processor — 53 Pipeline Operation 管線化資料路徑「逐一週期」的流程 Cycle-by-cycle flow of instructions through the pipelined datapath “Single-clock-cycle” pipeline diagram c.f. “multi-clock-cycle” diagram Shows pipeline usage in a single cycle Highlight resources used Graph of operation over time We’ll look at “single-clock-cycle” diagrams for load & store Chapter 4 — The Processor — 54 IF for Load, Store, … Chapter 4 — The Processor — 55 ID for Load, Store, … Chapter 4 — The Processor — 56 EX for Load Chapter 4 — The Processor — 57 MEM for Load Chapter 4 — The Processor — 58 WB for Load Wrong register number Chapter 4 — The Processor — 59 Load指令之修正資料路徑 (Corrected Datapath for Load) Chapter 4 — The Processor — 60 EX for Store Chapter 4 — The Processor — 61 MEM for Store Chapter 4 — The Processor — 62 WB for Store Chapter 4 — The Processor — 63 Multi-Cycle Pipeline Diagram 顯示資源利用(Form showing resource) usage Chapter 4 — The Processor — 64 Multi-Cycle Pipeline Diagram 傳統形式(Traditional form) Chapter 4 — The Processor — 65 Single-Cycle Pipeline Diagram State of pipeline in a given cycle Chapter 4 — The Processor — 66 Pipelined Control (Simplified) Chapter 4 — The Processor — 67 Pipelined Control Control signals derived from instruction As in single-cycle implementation Chapter 4 — The Processor — 68 Pipelined Control Chapter 4 — The Processor — 69 Consider this sequence: sub and or add sw $2, $1,$3 $12,$2,$5 $13,$6,$2 $14,$2,$2 $15,100($2) We can resolve hazards with forwarding §4.7 Data Hazards: Forwarding vs. Stalling Data Hazards in ALU Instructions How do we detect when to forward? Chapter 4 — The Processor — 70 Dependencies & Forwarding Chapter 4 — The Processor — 71 Detecting the Need to Forward Pass register numbers along pipeline ALU operand register numbers in EX stage are given by e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register ID/EX.RegisterRs, ID/EX.RegisterRt Data hazards when 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt Fwd from EX/MEM pipeline reg Fwd from MEM/WB pipeline reg Chapter 4 — The Processor — 72 Detecting the Need to Forward But only if forwarding instruction will write to a register! EX/MEM.RegWrite, MEM/WB.RegWrite And only if Rd for that instruction is not $zero EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0 Chapter 4 — The Processor — 73 Forwarding Paths Chapter 4 — The Processor — 74 Forwarding Conditions EX hazard if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10 MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 Chapter 4 — The Processor — 75 Double Data Hazard Consider the sequence: add $1,$1,$2 add $1,$1,$3 add $1,$1,$4 Both hazards occur Want to use the most recent Revise MEM hazard condition Only fwd if EX hazard condition isn’t true Chapter 4 — The Processor — 76 Revised Forwarding Condition MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 Chapter 4 — The Processor — 77 Datapath with Forwarding Chapter 4 — The Processor — 78 Load-Use Data Hazard Need to stall for one cycle Chapter 4 — The Processor — 79 Load-Use Hazard Detection Check when using instruction is decoded in ID stage ALU operand register numbers in ID stage are given by Load-use hazard when IF/ID.RegisterRs, IF/ID.RegisterRt ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) If detected, stall and insert bubble Chapter 4 — The Processor — 80 How to Stall the Pipeline Force control values in ID/EX register to 0 EX, MEM and WB do nop (no-operation) Prevent update of PC and IF/ID register Using instruction is decoded again Following instruction is fetched again 1-cycle stall allows MEM to read data for lw Can subsequently forward to EX stage Chapter 4 — The Processor — 81 Stall/Bubble in the Pipeline Stall inserted here Chapter 4 — The Processor — 82 Stall/Bubble in the Pipeline Or, more accurately… Chapter 4 — The Processor — 83 Datapath with Hazard Detection Chapter 4 — The Processor — 84 Stalls and Performance The BIG Picture Stalls reduce performance But are required to get correct results Compiler can arrange code to avoid hazards and stalls Requires knowledge of the pipeline structure Chapter 4 — The Processor — 85 If branch outcome determined in MEM §4.8 Control Hazards Branch Hazards Flush these instructions (Set control values to 0) PC Chapter 4 — The Processor — 86 Reducing Branch Delay Move hardware to determine outcome to ID stage Target address adder Register comparator Example: branch taken 36: 40: 44: 48: 52: 56: 72: sub beq and or add slt ... lw $10, $1, $12, $13, $14, $15, $4, $3, $2, $2, $4, $6, $8 7 $5 $6 $2 $7 $4, 50($7) Chapter 4 — The Processor — 87 Example: Branch Taken Chapter 4 — The Processor — 88 Example: Branch Taken Chapter 4 — The Processor — 89 Data Hazards for Branches If a comparison register is a destination of 2nd or 3rd preceding ALU instruction add $1, $2, $3 IF add $4, $5, $6 … beq $1, $4, target ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB Can resolve using forwarding Chapter 4 — The Processor — 90 Data Hazards for Branches If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction lw Need 1 stall cycle $1, addr IF add $4, $5, $6 beq stalled beq $1, $4, target ID EX MEM WB IF ID EX MEM WB IF ID ID EX MEM WB Chapter 4 — The Processor — 91 Data Hazards for Branches If a comparison register is a destination of immediately preceding load instruction lw Need 2 stall cycles $1, addr IF beq stalled beq stalled beq $1, $0, target ID EX IF ID MEM WB ID ID EX MEM WB Chapter 4 — The Processor — 92 Dynamic Branch Prediction In deeper and superscalar pipelines, branch penalty is more significant Use dynamic prediction Branch prediction buffer (aka branch history table) Indexed by recent branch instruction addresses Stores outcome (taken/not taken) To execute a branch Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction Chapter 4 — The Processor — 93 1-Bit Predictor: Shortcoming Inner loop branches mispredicted twice! outer: … … inner: … … beq …, …, inner … beq …, …, outer Mispredict as taken on last iteration of inner loop Then mispredict as not taken on first iteration of inner loop next time around Chapter 4 — The Processor — 94 2-Bit Predictor Only change prediction on two successive mispredictions Chapter 4 — The Processor — 95 Calculating the Branch Target Even with predictor, still need to calculate the target address 1-cycle penalty for a taken branch Branch target buffer Cache of target addresses Indexed by PC when instruction fetched If hit and instruction is branch predicted taken, can fetch target immediately Chapter 4 — The Processor — 96 “Unexpected” events requiring change in flow of control Different ISAs use the terms differently Exception Arises within the CPU e.g., undefined opcode, overflow, syscall, … Interrupt §4.9 Exceptions Exceptions and Interrupts From an external I/O controller Dealing with them without sacrificing performance is hard Chapter 4 — The Processor — 97 Handling Exceptions In MIPS, exceptions managed by a System Control Coprocessor (CP0) Save PC of offending (or interrupted) instruction In MIPS: Exception Program Counter (EPC) Save indication of the problem In MIPS: Cause register We’ll assume 1-bit 0 for undefined opcode, 1 for overflow Jump to handler at 8000 00180 Chapter 4 — The Processor — 98 An Alternate Mechanism Vectored Interrupts Example: Handler address determined by the cause Undefined opcode: Overflow: …: C000 0000 C000 0020 C000 0040 Instructions either Deal with the interrupt, or Jump to real handler Chapter 4 — The Processor — 99 Handler Actions Read cause, and transfer to relevant handler Determine action required If restartable Take corrective action use EPC to return to program Otherwise Terminate program Report error using EPC, cause, … Chapter 4 — The Processor — 100 Exceptions in a Pipeline Another form of control hazard Consider overflow on add in EX stage add $1, $2, $1 Prevent $1 from being clobbered Complete previous instructions Flush add and subsequent instructions Set Cause and EPC register values Transfer control to handler Similar to mispredicted branch Use much of the same hardware Chapter 4 — The Processor — 101 Pipeline with Exceptions Chapter 4 — The Processor — 102 Exception Properties Restartable exceptions Pipeline can flush the instruction Handler executes, then returns to the instruction Refetched and executed from scratch PC saved in EPC register Identifies causing instruction Actually PC + 4 is saved Handler must adjust Chapter 4 — The Processor — 103 Exception Example Exception on add in 40 44 48 4C 50 54 … sub and or add slt lw $11, $12, $13, $1, $15, $16, $2, $4 $2, $5 $2, $6 $2, $1 $6, $7 50($7) sw sw $25, 1000($0) $26, 1004($0) Handler 80000180 80000184 … Chapter 4 — The Processor — 104 Exception Example Chapter 4 — The Processor — 105 Exception Example Chapter 4 — The Processor — 106 Multiple Exceptions Pipelining overlaps multiple instructions Simple approach: deal with exception from earliest instruction Could have multiple exceptions at once Flush subsequent instructions “Precise” exceptions In complex pipelines Multiple instructions issued per cycle Out-of-order completion Maintaining precise exceptions is difficult! Chapter 4 — The Processor — 107 Imprecise Exceptions Just stop pipeline and save state Including exception cause(s) Let the handler work out Which instruction(s) had exceptions Which to complete or flush May require “manual” completion Simplifies hardware, but more complex handler software Not feasible for complex multiple-issue out-of-order pipelines Chapter 4 — The Processor — 108 Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle Multiple issue Replicate pipeline stages multiple pipelines Start multiple instructions per clock cycle CPI < 1, so use Instructions Per Cycle (IPC) E.g., 4GHz 4-way multiple-issue 16 BIPS, peak CPI = 0.25, peak IPC = 4 But dependencies reduce this in practice §4.10 Parallelism and Advanced Instruction Level Parallelism Instruction-Level Parallelism (ILP) Chapter 4 — The Processor — 109 Multiple Issue Static multiple issue Compiler groups instructions to be issued together Packages them into “issue slots” Compiler detects and avoids hazards Dynamic multiple issue CPU examines instruction stream and chooses instructions to issue each cycle Compiler can help by reordering instructions CPU resolves hazards using advanced techniques at runtime Chapter 4 — The Processor — 110 Speculation “Guess” what to do with an instruction Start operation as soon as possible Check whether guess was right If so, complete the operation If not, roll-back and do the right thing Common to static and dynamic multiple issue Examples Speculate on branch outcome Roll back if path taken is different Speculate on load Roll back if location is updated Chapter 4 — The Processor — 111 Compiler/Hardware Speculation Compiler can reorder instructions e.g., move load before branch Can include “fix-up” instructions to recover from incorrect guess Hardware can look ahead for instructions to execute Buffer results until it determines they are actually needed Flush buffers on incorrect speculation Chapter 4 — The Processor — 112 Speculation and Exceptions What if exception occurs on a speculatively executed instruction? Static speculation e.g., speculative load before null-pointer check Can add ISA support for deferring exceptions Dynamic speculation Can buffer exceptions until instruction completion (which may not occur) Chapter 4 — The Processor — 113 Static Multiple Issue Compiler groups instructions into “issue packets” Group of instructions that can be issued on a single cycle Determined by pipeline resources required Think of an issue packet as a very long instruction Specifies multiple concurrent operations Very Long Instruction Word (VLIW) Chapter 4 — The Processor — 114 Scheduling Static Multiple Issue Compiler must remove some/all hazards Reorder instructions into issue packets No dependencies with a packet Possibly some dependencies between packets Varies between ISAs; compiler must know! Pad with nop if necessary Chapter 4 — The Processor — 115 MIPS with Static Dual Issue Two-issue packets One ALU/branch instruction One load/store instruction 64-bit aligned ALU/branch, then load/store Pad an unused instruction with nop Address Instruction type Pipeline Stages n ALU/branch IF ID EX MEM WB n+4 Load/store IF ID EX MEM WB n+8 ALU/branch IF ID EX MEM WB n + 12 Load/store IF ID EX MEM WB n + 16 ALU/branch IF ID EX MEM WB n + 20 Load/store IF ID EX MEM WB Chapter 4 — The Processor — 116 MIPS with Static Dual Issue Chapter 4 — The Processor — 117 Hazards in the Dual-Issue MIPS More instructions executing in parallel EX data hazard Forwarding avoided stalls with single-issue Now can’t use ALU result in load/store in same packet Load-use hazard add $t0, $s0, $s1 load $s2, 0($t0) Split into two packets, effectively a stall Still one cycle use latency, but now two instructions More aggressive scheduling required Chapter 4 — The Processor — 118 Scheduling Example Schedule this for dual-issue MIPS Loop: lw addu sw addi bne Loop: $t0, $t0, $t0, $s1, $s1, 0($s1) $t0, $s2 0($s1) $s1,–4 $zero, Loop # # # # # $t0=array element add scalar in $s2 store result decrement pointer branch $s1!=0 ALU/branch Load/store cycle nop lw 1 addi $s1, $s1,–4 nop 2 addu $t0, $t0, $s2 nop 3 bne sw $s1, $zero, Loop $t0, 0($s1) $t0, 4($s1) 4 IPC = 5/4 = 1.25 (c.f. peak IPC = 2) Chapter 4 — The Processor — 119 Loop Unrolling Replicate loop body to expose more parallelism Reduces loop-control overhead Use different registers per replication Called “register renaming” Avoid loop-carried “anti-dependencies” Store followed by a load of the same register Aka “name dependence” Reuse of a register name Chapter 4 — The Processor — 120 Loop Unrolling Example Loop: ALU/branch Load/store cycle addi $s1, $s1,–16 lw $t0, 0($s1) 1 nop lw $t1, 12($s1) 2 addu $t0, $t0, $s2 lw $t2, 8($s1) 3 addu $t1, $t1, $s2 lw $t3, 4($s1) 4 addu $t2, $t2, $s2 sw $t0, 16($s1) 5 addu $t3, $t4, $s2 sw $t1, 12($s1) 6 nop sw $t2, 8($s1) 7 sw $t3, 4($s1) 8 bne $s1, $zero, Loop IPC = 14/8 = 1.75 Closer to 2, but at cost of registers and code size Chapter 4 — The Processor — 121 Dynamic Multiple Issue “Superscalar” processors CPU decides whether to issue 0, 1, 2, … each cycle Avoiding structural and data hazards Avoids the need for compiler scheduling Though it may still help Code semantics ensured by the CPU Chapter 4 — The Processor — 122 Dynamic Pipeline Scheduling Allow the CPU to execute instructions out of order to avoid stalls But commit result to registers in order Example lw $t0, 20($s2) addu $t1, $t0, $t2 sub $s4, $s4, $t3 slti $t5, $s4, 20 Can start sub while addu is waiting for lw Chapter 4 — The Processor — 123 Dynamically Scheduled CPU Preserves dependencies Hold pending operands Results also sent to any waiting reservation stations Reorders buffer for register writes Can supply operands for issued instructions Chapter 4 — The Processor — 124 Register Renaming Reservation stations and reorder buffer effectively provide register renaming On instruction issue to reservation station If operand is available in register file or reorder buffer Copied to reservation station No longer required in the register; can be overwritten If operand is not yet available It will be provided to the reservation station by a function unit Register update may not be required Chapter 4 — The Processor — 125 Speculation Predict branch and continue issuing Don’t commit until branch outcome determined Load speculation Avoid load and cache miss delay Predict the effective address Predict loaded value Load before completing outstanding stores Bypass stored values to load unit Don’t commit load until speculation cleared Chapter 4 — The Processor — 126 Why Do Dynamic Scheduling? Why not just let the compiler schedule code? Not all stalls are predicable Can’t always schedule around branches e.g., cache misses Branch outcome is dynamically determined Different implementations of an ISA have different latencies and hazards Chapter 4 — The Processor — 127 Does Multiple Issue Work? The BIG Picture Yes, but not as much as we’d like Programs have real dependencies that limit ILP Some dependencies are hard to eliminate Some parallelism is hard to expose Limited window size during instruction issue Memory delays and limited bandwidth e.g., pointer aliasing Hard to keep pipelines full Speculation can help if done well Chapter 4 — The Processor — 128 Power Efficiency Complexity of dynamic scheduling and speculations requires power Multiple simpler cores may be better Microprocessor Year Clock Rate Pipeline Stages Issue width Out-of-order/ Speculation Cores Power i486 1989 25MHz 5 1 No 1 5W Pentium 1993 66MHz 5 2 No 1 10W Pentium Pro 1997 200MHz 10 3 Yes 1 29W P4 Willamette 2001 2000MHz 22 3 Yes 1 75W P4 Prescott 2004 3600MHz 31 3 Yes 1 103W Core 2006 2930MHz 14 4 Yes 2 75W UltraSparc III 2003 1950MHz 14 4 No 1 90W UltraSparc T1 2005 1200MHz 6 1 No 8 70W Chapter 4 — The Processor — 129 72 physical registers §4.11 Real Stuff: The AMD Opteron X4 (Barcelona) Pipeline The Opteron X4 Microarchitecture Chapter 4 — The Processor — 130 The Opteron X4 Pipeline Flow For integer operations FP is 5 stages longer Up to 106 RISC-ops in progress Bottlenecks Complex instructions with long dependencies Branch mispredictions Memory access delays Chapter 4 — The Processor — 131 §4.13 Fallacies and Pitfalls Fallacies Pipelining is easy (!) The basic idea is easy The devil is in the details e.g., detecting data hazards Pipelining is independent of technology So why haven’t we always done pipelining? More transistors make more advanced techniques feasible Pipeline-related ISA design needs to take account of technology trends e.g., predicated instructions Chapter 4 — The Processor — 132 Pitfalls Poor ISA design can make pipelining harder e.g., complex instruction sets (VAX, IA-32) e.g., complex addressing modes Significant overhead to make pipelining work IA-32 micro-op approach Register update side effects, memory indirection e.g., delayed branches Advanced pipelines have long delay slots Chapter 4 — The Processor — 133 ISA influences design of datapath and control Datapath and control influence design of ISA Pipelining improves instruction throughput using parallelism §4.14 Concluding Remarks Concluding Remarks More instructions completed per second Latency for each instruction not reduced Hazards: structural, data, control Multiple issue and dynamic scheduling (ILP) Dependencies limit achievable parallelism Complexity leads to the power wall Chapter 4 — The Processor — 134