Computer Organization & Design 5th. Chapter 4 The Processor: Datapath and Control 處理器:資料路徑與控制 ROBERT CHEN SHU-TE UNIVERSITY CSIE DEPT. 4-1 Computer Organization & Design 5th. Outlines • • • • • • Introduction Logic Design Conventions Building a Datapath A Simple Implementation Scheme A Multicycle Implementation Exception SHU-TE UNIVERSITY CSIE DEPT. 4-2 Computer Organization & Design 5th. Introduction • 計算機的效能受到下面三個因素影響: – 指令的數目(instruction count) – 每個指令的時脈週期數目 (CPI) • 整數指令, 算數邏輯指令, 記憶體相關指令及分支 – 時脈週期的長短(clock cycle time) • 編譯器(compiler)和指令集架構(ISA)決定了一個程式所 需的指令數目的多寡。 • 時脈週期的長度和每個指令的時脈週期數目(CPI)卻是由 處理器本身的製作方式來決定。 • 在本章中,我們分別對於兩種不同的MIPS指令製作方式, 建構出其資料路徑和控制單元。 – 單一時脈製作方法 – 多重時脈製作方法 SHU-TE UNIVERSITY CSIE DEPT. 4-3 Computer Organization & Design 5th. Introduction • 製作MIPS時,其功能單元包含兩個不同的邏輯元件: – 能運算資料的元件 • 例:ALU • 組合式(元件的輸出值僅取決於現有的輸入值) – 含狀態的元件 • 例:記憶體和暫存器檔案 • 循序式(輸出值決定在輸入值及其內部的狀態) – 循序邏輯 SHU-TE UNIVERSITY CSIE DEPT. 4-4 Computer Organization & Design 5th. Introduction • 執行指令的階段 – – – – – • 指令擷取(Instruction Fetch) 解碼 (Decode) 運算元擷取 (Operand Fetch) 執行(Execute) 寫回(Write back) 圖4.1以高階的概觀圖來說明MIPS的製作方式 SHU-TE UNIVERSITY CSIE DEPT. 4-5 Computer Organization & Design 5th. Introduction • We're ready to look at an implementation of the MIPS • Simplified to contain only: – memory-reference instructions: • lw, sw – arithmetic-logical instructions: • add, sub, and, or, slt – control flow instructions: • beq, j SHU-TE UNIVERSITY CSIE DEPT. 4-6 Computer Organization & Design 5th. Introduction • State Elements – Unclocked vs. Clocked – Clocks used in synchronous logic • when should an element that contains state be updated? falling edge cycle time rising edge SHU-TE UNIVERSITY CSIE DEPT. 4-7 Computer Organization & Design 5th. Introduction • An unclocked state element – The set-reset latch • output depends on present inputs and also on past inputs • Latches and Flip-flops – Latches and flip-flops are the simplest memory elements. – Output is equal to the stored value inside the element (don't need to ask for permission to look at the value) – Change of state (value) is based on the clock – Latches: whenever the inputs change, and the clock is asserted – Flip-flop: state changes only on a clock edge (edge-triggered methodology) • • A clocking methodology defines when signals can be read and written Wouldn't want to read a signal at the same time it was being written SHU-TE UNIVERSITY CSIE DEPT. 4-8 Computer Organization & Design 5th. Introduction • D-latch – Two inputs: • the data value to be stored (D) • the clock signal (C) indicating when to read & store D – Two outputs: • the value of the internal state (Q) and it's complement – When the latch is open (C asserted), the value of Q changes as D changes transparent latch. C Q D C _ Q D SHU-TE UNIVERSITY CSIE DEPT. Q 4-9 Computer Organization & Design 5th. Introduction • D flip-flop(D型正反器) – Flip-flops are not transparent – Output changes only on the clock edge – The first latch, called the master, is open and follows the input D when C is asserted. When the clock input falls, the first latch is closed, but the 2nd latch, called the slave, is open and gets its input from the output of the master latch. D D C D latch Q D Q D latch _ C Q Q D _ Q C C Q SHU-TE UNIVERSITY CSIE DEPT. 4 - 10 Computer Organization & Design 5th. Introduction • Set-up time and Hold time – Set-up time: the minimum time that the input must remain valid before the clock edge – Hold time: the minimum time that the input must be valid after the clock edge (usually very small) D Set-up time Hold time C SHU-TE UNIVERSITY CSIE DEPT. 4 - 11 Computer Organization & Design 5th. Introduction • An edge triggered methodology(邊緣觸發) – Decide signals when to be read, when to be written • Typical execution: – read contents of some state elements, – send values through some combinational logic – write results to one or more state elements State element 1 Combinational logic State element 2 Clock cycle SHU-TE UNIVERSITY CSIE DEPT. 4 - 12 Computer Organization & Design 5th. Introduction • Register File(暫存器檔案) – A register file consists of a set of registers that can be read and written by supplying a register number to be accessed. – Built using D flip-flops and decoders (specify register number) – Read part (left) : supply a register number as input, and the output is the information stored in that register. – A register file with 2 read ports and 1 write ports. (right) Read register number 1 Register 0 Register 1 Register n 1 M u x Read register number 1 Read data 1 Register n Read register number 2 Register file Write register Read register number 2 Write data M u x SHU-TE UNIVERSITY CSIE DEPT. Read data 1 Read data 2 W rite Read data 2 4 - 13 Computer Organization & Design 5th. Introduction • Register File – Write part: need 3 inputs: a register number, the data to write, and a clock that controls the writing into the register. – Note: we still use the real clock to determine when to write Write 0 Register number C Register 0 1 D n-to-1 decoder C n -1 Register 1 D n C Register n -1 D C Register n Register data SHU-TE UNIVERSITY CSIE DEPT. D 4 - 14 Computer Organization & Design 5th. Introduction • Simple Implementation – Basic components: • two state elements instruction memory (指令記憶體)and program counter (PC) are needed to store and access instructions. • An adder is needed to compute the next instruction address. – Since the instruction memory is read-only(唯讀), we can treat it as combinational logic. Instruction address PC Instruction Add Sum Instruction memory a. Instruction memory SHU-TE UNIVERSITY CSIE DEPT. b. Program counter c. Adder 4 - 15 Computer Organization & Design 5th. Introduction • Fetching instruction and incrementing PC (擷取指令並遞增PC) – A portion of the datapath used for fetching instructions and incrementing Program Counter Add PC送出位址讀取指令之後, 立刻PC+4,指到下一個指令 4 PC Read address Instruction Instruction memory SHU-TE UNIVERSITY CSIE DEPT. 4 - 16 Computer Organization & Design 5th. Introduction • R-Format ALU operations – R-format instruction has 3 register operands, 2 read and 1 write – Rg. add $t0, $t1, $t2 – Register numbers are 5 bits to indicate 32 registers, data bus are 32 bits and ALU control has 4 bits 5 4 Re ad register 1 R egister 5 R e ad data 1 Re ad Ze ro register 2 numbers Registers 5 D ata W rite ALU ALU result register D ata ALU control Re ad data 2 W rite data R egW rite a. Registers SHU-TE UNIVERSITY CSIE DEPT. b. ALU 4 - 17 Computer Organization & Design 5th. Introduction • Datapath for R-type Instruction – Eg. add $t0, $t1, $t2 4 Read register 1 Instruction Read register 2 Registers Write register Write data ALU operation Read data 1 Zero ALU ALU result Read data 2 RegWrite SHU-TE UNIVERSITY CSIE DEPT. 4 - 18 Computer Organization & Design 5th. Introduction • Load and Store Instructions – Load and store instructions compute a memory address by adding the base register, to a 16-bit signed offset field contained in the instruction – “Sign extension unit” extends the 16-bit data to 32-bit data by replicating the high-order sign bit to the extra higher 16-bit data – Eg. lw $t0, 40($t1) sw $t0, 32($t1) MemWrite Address Write data Read data Data memory 16 Sign extend 32 MemRead a. Data memory unit SHU-TE UNIVERSITY CSIE DEPT. b. Sign-extension unit 4 - 19 Computer Organization & Design 5th. Introduction • Datapath for load and store instructions – 資料路徑的載入和儲存動作 • 暫存器的存取發生在記憶體位址計算之後。 • 對記憶體的讀取。 • 如果是載入指令,會有一個寫入動作到暫存器檔案中。 lw $t0, 40($t1) sw $t0, 32($t1) t1 Read register 1 Instruction t0 Read register 2 Registers Write register Write data Read data 1 Zero ALU ALU result Address Read data 2 Write data 40 16 SHU-TE UNIVERSITY CSIE DEPT. Sign extend Read data Data memory 32 4 - 20 Computer Organization & Design 5th. Introduction • J-type Instruction – Branch datapath • Needs to compute the branch target address (計算分支目標位址) – PC+4 is the address of the next instruction – Offset field is left-shifted two bits to make a word offset. (PC0-27 Offset 25-0 +00 ) • Needs to compare register contents(比較暫存器內容) PC + 4 from instruction datapath Add Sum Branch target Shift left 2 beq $t1, $t2, offset Instruction Read register 1 4 ALU operation Read data 1 Read register 2 Registers Write register Read data 2 Write data ALU Zero To branch control logic RegWrite 16 SHU-TE UNIVERSITY CSIE DEPT. Sign extend 32 4 - 21 Computer Organization & Design 5th. Introduction • 聖戰士組合 – 利用多工器(MUX)或資料選擇器(data selector)將R形態指令和記憶體指 令的資料路徑組合起來, 而不用重複增加相同的功能單元 4 SHU-TE UNIVERSITY CSIE DEPT. 4 - 22 Computer Organization & Design 5th. Introduction • 聖戰士組合 – 加入指令擷取部份的資料路徑 SHU-TE UNIVERSITY CSIE DEPT. 4 - 23 Computer Organization & Design 5th. Introduction • 聖戰士組合 – 加入分支部份的資料路徑 – 跳躍指令目標位址=指令之偏移量+跳躍指令之位址 SHU-TE UNIVERSITY CSIE DEPT. 4 - 24 Computer Organization & Design 5th. Introduction • 大功告成? – 最難的是Control Unit 之設計 SHU-TE UNIVERSITY CSIE DEPT. 4 - 25 Computer Organization & Design 5th. A Simple Implementation Scheme • 這個簡易的製作方式包含 – 載入字組 (lw) 及儲存字組 (sw) – 相等分支 (beq) – ALU 指令: add, sub, and , or, 及 set on less than • 根據不同的指令形態,ALU需要可以做下列運算 – 加法 計算 lw 及 sw 的記憶體位址 – 減法 為了相等分支 – AND, OR, subtraction, add, 或 slt 為了 R-形態指令需要 (由6位元的功能 欄決定) • ALU 控制輸入 – – – – – – 0000 : AND 0001 : OR a 0010 : 加法 0110 : 減法 0111 : 小於時設定 set on less than b 1100 :NOR (for other MIPS instructions) ALU-operation 4 ALU Zero Result Overflow CarryOut SHU-TE UNIVERSITY CSIE DEPT. 4 - 26 Computer Organization & Design 5th. A Simple Implementation Scheme • Purpose – Selecting the operations to perform (ALU, read/write, etc.) – Controlling the flow of data (multiplexor inputs) • How you get these control signals: – Information comes from the 32 bits of the instruction Example: add $8, $17, $18 Instruction Format: 000000 10001 10010 01000 00000 100000 op rs rt rd shamt funct • ALU's operation based on instruction type and function code SHU-TE UNIVERSITY CSIE DEPT. 4 - 27 Computer Organization & Design 5th. What Control Signals Do We Need? SHU-TE UNIVERSITY CSIE DEPT. 4 - 28 Computer Organization & Design 5th. Design Method for Control • Multi-level control (decoding) • Instruction opcode: main control unit (first level) – ALU control • Sub-control for arithmetic – MUX control • • • • Which source registers and destination registers ALU input source Input source of destination register Input source of PC – Result for first level • Seven 1-bit control lines • 2-bit ALUOP control signals • The above control signals can be set based solely on the opcode field of the instruction – Exception: PCSrc (depends on the beq result) SHU-TE UNIVERSITY CSIE DEPT. 4 - 29 Computer Organization & Design 5th. A Simple Implementation Scheme • ALU控制位元的控制是由 ALUOp 控制位元所決定 • ALUOp是來用決定不同的指令型態 需要的ALU運算 ALU的控制輸入 XXXXXX 加法 0010 儲存字組 XXXXXX 加法 0010 01 相等分支 XXXXXX 減法 0110 10 加法 100000 加法 0010 R-type 10 減法 100010 減法 0110 R-type 10 AND 100100 and 0000 R-type 10 OR 100101 or 0001 R-type 10 小於時設定 101010 小於時設定slt 0111 指令運算碼 ALUOp LW 00 載入字組 SW 00 Branch equal R-type 指令的運算 功能欄位 SHU-TE UNIVERSITY CSIE DEPT. 4 - 30 Computer Organization & Design 5th. ALU Control • ALU Control ALUOp – Instructions using ALU • Load/store • address calculation – add lw $t1, offset(t2) – Branch eq 2 6 function field • Subtract for comparison • ‘taken’ or ‘not taken’ • add/subtract for address calculation beq $t1, $t2, offset ALU control ALU operation 4 ALU – R-type • and/or • set-on-less-than SHU-TE UNIVERSITY CSIE DEPT. 4- 31 Computer Organization & Design 5th. ALU Control • Multi-level control (decoding) – Instruction opcode: main control unit – first level 00 = lw, sw 01 = beq, 10 = arithmetic • 2nd level: function code for arithmetic : sub control – Main CU generates the ALUOP bits as inputs of the ALU control unit – Reduce the size of main control but may increase the delay SHU-TE UNIVERSITY CSIE DEPT. 4 - 32 Computer Organization & Design 5th. ALU Control • Truth table – X : don’t care term – All zeros or don’t care terms are eliminated Output Input ALUOp ALUOp1 ALUOp0 0 0 X 1 1 X 1 X 1 X 1 X 1 X F5 X X X X X X X SHU-TE UNIVERSITY CSIE DEPT. Funct field F4 F3 F2 F1 X X X X X X X X X 0 0 0 X 0 0 1 X 0 1 0 X 0 1 0 X 1 0 1 Operation F0 X X 0 0 0 1 0 0010 0110 0010 0110 0000 0001 0111 注意事項: 1.ALUOP 目前無 ’11’項 所以原來的’10’改成’1X 2.Funct field中F5F4皆為 ’10’故改成’XX’ 4 - 33 Computer Organization & Design 5th. Design Main Control Unit(設計主要的控制單元) • 指令的格式 – Op 欄位:Op[5 : 0] – R 型指令、相等則分支(beq)指令及儲存指令中, 暫存器:指令的25 : 21 位元及20 : 16 位元的rs 欄位及 rt 欄位 – 載入及儲存指令中的基底暫存器:指令的25 : 21 位元(rs) – 相等則分支(beq)指令﹑載入指令及儲存指令的16 位元偏移量(offset): 指令的15 : 0 位元 SHU-TE UNIVERSITY CSIE DEPT. 4 - 34 Computer Organization & Design 5th. A Simple Implementation Scheme • Seven single-bit control lines, one 2-bit ALUOp control signal • Except for PCSrc, the control signal can be set solely based on the opcode field of the instruction. • To generate PCSrc, we need to AND together a signal from the control unit, which we call Branch, with the Zero signal out of the ALU. Signal RegDst Regwrite ALUSrc PCSrc MemRead MemWrite MemtoReg Deasserted 未設定 dest register from rt (20-16) none 2nd operand from reg output 2 PC<-- PC+4 none none put ALU result to reg SHU-TE UNIVERSITY CSIE DEPT. 設定 Asserted from rd (15-11) write to dest register from 16 bit sign extension PC<--branch dest read from memory write to memory put memory read data to reg 4 - 35 Computer Organization & Design 5th. The Simple Datapath with the Control Unit 0 M u x Add 4 Instruction[31 26] Read address Instruction memory Instruction[15 11] Zero ALU ALU result Address Read register 1 Instruction[20 16] Instruction [31 0] 1 Shift left 2 RegDst Branch MemRead MemtoReg Control ALUOp MemWrite ALUSrc RegWrite Instruction[25 21] PC ALU Add result 0 M u x 1 Read data1 Read register 2 Registers Read Write data2 register 0 M u x 1 Write data Write data Instruction[15 0] 16 Sign extend Read data Data memory 1 M u x 0 32 ALU control Instruction[5 0] SHU-TE UNIVERSITY CSIE DEPT. 4 - 36 Computer Organization & Design 5th. The Simple Datapath with the Control Unit • 定義所有控制訊號線應該如何對每一種運作碼來設定 • 第一列對應到R 格式的指令(add、sub、and、or 及slt), – 來源暫存器都是rs 和rt,目的暫存器都是rd;定義了ALUSrc 和RegDst 控制訊 號線如何設定 – R-型指令將運算結果寫入暫存器(RegWrite=1),但不會存取(讀寫)數據記憶體。 – 當Branch 控制訊號等於0 時,PC 的值無條件地由PC+4 取代;否則,如果 ALU 的Zero 輸出值也為1 時,PC 的值由分支目標位址取代 – R-型指令的ALUOp 欄位被設定為10,表示ALU 的控制是由功能欄位(funct field)來產生 SHU-TE UNIVERSITY CSIE DEPT. 4 - 37 Computer Organization & Design 5th. The Simple Datapath with the Control Unit • 第二列及第三列說明lw 指令及sw 指令的控制訊號設定。 – ALUSrc 和ALUOp 欄位設定成執行位址的計算。 – Mem-Read 及MemWrite 設定成執行記憶體的存取。 – 載入指令,RegDst 及RegWrite 設定成將結果儲存到rt 暫存器中。 SHU-TE UNIVERSITY CSIE DEPT. 4 - 38 Computer Organization & Design 5th. The Simple Datapath with the Control Unit • 分支指令的ALUOp 欄位設定成執行減法(ALU 控制=01),用 來測試兩個值是否相等。 – 如果RegWrite 控制訊號為0 時MemtoReg 欄位是無關緊要的:因為暫 存器不會被寫入,所以暫存器的Write data 輸入值不會被用到。 – 表格最後兩列的MemtoReg 項目用X 來表示,代表「don’t care」。 – 當RegWrite 為0 時,RegDst 也可以使用X 來表示。 SHU-TE UNIVERSITY CSIE DEPT. 4 - 39 Computer Organization & Design 5th. Datapath operation(數據通道運作) • 數據通道對R 型指令:add $t1, $t2, $t3 – 一個時脈週期內可以把它想像成執行了四個步驟: 指令被擷取,並且遞增PC 的值 兩個暫存器從暫存器檔案中讀出;同時,主控制單元計算 控制訊號線的值 ALU 根據功能碼(指令中的5 :0 位元功能欄位)來產生ALU 功能的控制,以對從暫存器檔案中讀出的值運算 ALU 的運算結果使用指令的15:11 位元來選擇目的暫存器 ($t1),以寫入暫存器檔案 SHU-TE UNIVERSITY CSIE DEPT. 4 - 40 Computer Organization & Design 5th. lw $t0, 32($s3) ; 35 19 8 32 35 or 43 rs rt address 31:26 25:21 20:16 15:0 SHU-TE UNIVERSITY CSIE DEPT. 4 - 41 Computer Organization & Design 5th. Datapath operation(數據通道運作) • 載入指令中有動作的功能單元和被設定的控制訊號 lw $t0, 32($s3) – 想像成執行了五個步驟: 指令從指令記憶體中被擷取,並且遞增PC 的值 暫存器的值從暫存器檔案中讀出 ALU 計算由暫存器檔案中讀出的值和符號延伸過的指令中 較低的16 位元偏移量的和 ALU 所得的和作為數據記憶體的位址 記憶體傳回的數據寫入暫存器檔案;暫存器的目的地可由 指令中的位元20:16 得知 SHU-TE UNIVERSITY CSIE DEPT. 4 - 42 Computer Organization & Design 5th. beq $s1, $s2, 100 ; 4 17 18 25 4 rs rt address 31:26 25:21 20:16 15:0 SHU-TE UNIVERSITY CSIE DEPT. 4 - 43 Computer Organization & Design 5th. Datapath operation(數據通道運作) • beq指令: beq $s0, $s1, 100 – 想像成執行時的四個步驟: 指令從指令記憶體中被擷取,並且遞增PC 的值 兩個暫存器$t1 和$t2 從暫存器檔案中被讀出 ALU 對暫存器檔案中讀出的值執行減法。PC+4 的值與符 號延伸並左移兩位的指令中較低的16 位元(偏移量)相加; 其結果即為分支目的位址 ALU 的Zero 輸出被用來決定哪一個加法器的結果要寫回 PC SHU-TE UNIVERSITY CSIE DEPT. 4 - 44 Computer Organization & Design 5th. beq $s1, $s2, 100 ; 4 17 18 25 4 rs rt address 31:26 25:21 20:16 15:0 SHU-TE UNIVERSITY CSIE DEPT. 4 - 45 Computer Organization & Design 5th. Complete Control Unit(完成控制單元) • 加入跳躍(jump)指令,以便說明如何在基本的數據通道 及控制中,再作延伸以處理其他的指令 SHU-TE UNIVERSITY CSIE DEPT. 4 - 46 Computer Organization & Design 5th. Jump 2 address 31:26 25:0 加入跳躍(jump)指令 SHU-TE UNIVERSITY CSIE DEPT. 4 - 47 Computer Organization & Design 5th. 效能議題Performance Issues • Longest delay determines clock period – 關鍵路徑:Load指令(Critical path: load instruction) – 指令記憶體暫存器檔案 ALU 資料記憶體傳存器檔案 (Instruction memory register file ALU data memory register file) • 對不同指令沒有彈性可以改變週期 (Not feasible to vary period for different instructions) • 違反設計原則 Violates design principle – 讓一般情況加快(Making the common case fast) • 利用管線處理來增進效能 (We will improve performance by pipelining) SHU-TE UNIVERSITY CSIE DEPT. 4 - 48 Computer Organization & Design 5th. A Simple Implementation Scheme • 為什麼單一時脈週期的製作方式不被採用? – 每個指令的時脈週期都必須有相同長度(因此,CPI = 1) – 計算機的運算處理指令中最長的路徑將決定時脈週期的長度 – 整體效能似乎不是很好 • 範例:單一時脈計算機的效能,假設功能單元的運算時間如下: – – – – 記憶體單元: 2 ns ALU 及加法器: 2 ns 暫存器檔案 (讀取或寫入): 1 ns 下列的製作方式那一種會比較快? 1. 每個指令在一個固定長度的時脈週期內運作完成 2. 每個指令在一個時脈週期內運作完成,但時脈週期長度是可變動 SHU-TE UNIVERSITY CSIE DEPT. 4 - 49 Computer Organization & Design 5th. A Simple Implementation Scheme • 範例 (續) 為了計算效能,假設我們使用下 列指令的混合比例: 24% 載入, 12% 儲存, 44% R 形態指令, 18% 分支及 2%跳躍 指令 • 解答 1. CPU 時脈週期為 8 ns. 2. CPU 時脈週期 = 8*24% + 7*12% + 6*44% + 5*18% + 2*2% = 6.3 ns 效能改進的比例為 8/6.3 = 1.27. SHU-TE UNIVERSITY CSIE DEPT. 指令種類所用到的功能單元 R格式 指令擷取 暫存器存取 ALU 暫存器存取 載入字組 指令擷取 暫存器存取 ALU 記憶體存取 儲存字組 指令擷取 暫存器存取 ALU 記憶體存取 分支 指令擷取 暫存器存取 ALU 跳躍 指令擷取 暫存器存取 指令 種類 指令記 憶體 暫存 器讀 取 ALU 運算 資料記 憶體 暫存器 寫入 總和 R格式 2 1 2 0 1 6ns 載入字組 2 1 2 2 1 8ns 儲存字組 2 1 2 2 分支 2 1 2 跳躍 2 7ns 5ns 2ns 4 - 50 Computer Organization & Design 5th. A Simple Implementation Scheme • 範例 – – – – 假設我們有浮點指令單元: 執行浮點加法需要8ns 執行浮點乘法需要16ns 所有功能單元所需的時間如同上例。下列的製作方式何會比較快? • 1.每個指令在一個固定長度的時脈週期內運作完成 • 2.每個指令在一個時脈週期內運作完成,但時脈週期長度是可變動 – 為了計算效能,假設我們使用下列指令的混合比例: – 31%載入, 21%儲存, 27% R形態指令, 5%分支,2% 跳躍指令, 7%浮點加法及7% FP浮點乘法 • 解答 – 1. 最長的指令為浮點乘法,其時脈週期為 2 + 1 + 16 + 1 = 20 ns – 2. 浮點指令的加法須時 2 + 1 + 8 + 1 = 12 ns. – CPU 時脈週期 = 8*31% + 7*21% + 6*27% + 5*5% + 2*2% +20*7% + 12*7%= 7.0 ns – 效能改進的比例為20/7 = 2.9. SHU-TE UNIVERSITY CSIE DEPT. 4 - 51 Computer Organization & Design 5th. 管線化 Pipelining • 管線式洗衣(Pipelined laundry: overlapping execution) – 平行處理增進效能(Parallelism improves performance) Four loads: Non-stop: SHU-TE UNIVERSITY CSIE DEPT. Speedup = 8/3.5 = 2.3 Speedup = 2n/0.5n + 1.5 ≈ 4 = number of stages 4 - 52 Computer Organization & Design 5th. MIPS管線處理(MIPS Pipeline) • 管線處理(pipelining)之定義 – • 將指令區分成數個步驟,分別由不同的功能單元同時加以執行,以 增進整體程式之效能 管線處理五個步驟(Five stages, one step per stage) 1. IF: Instruction Fetch(擷取指令) 從記憶體擷取指令(Instruction fetch from memory) 2. ID: Instruction Decode(指令解碼) 解碼指令並讀取暫存器(Instruction decode & register read) 3. EX: Execution(執行指令) 執行運算或計算位址(Execute operation or calculate address) 4. MEM: Memory Access(記憶體存取) 存取記憶體運算元(Access memory operand) 5. WB: Write Back(寫回) 將結果寫回至暫存器(Write result back to register) SHU-TE UNIVERSITY CSIE DEPT. 4 - 53 Computer Organization & Design 5th. 管線處理效能Pipeline Performance • 假設每一階段的時間為(Assume time for stages is) – 暫存器讀寫:100ps for register read or write – 其他階段:200ps for other stages • 比較管線處理與單一週期的資料路徑 (Compare pipelined datapath with single-cycle datapath) Instr Instr fetch Register read ALU op Memory access Register write Total time lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps R-format 200ps 100 ps 200ps beq 200ps 100 ps 200ps SHU-TE UNIVERSITY CSIE DEPT. 700ps 100 ps 600ps 500ps 4 - 54 Computer Organization & Design 5th. Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) SHU-TE UNIVERSITY CSIE DEPT. 4 - 55 Computer Organization & Design 5th. 管線處理加速Pipeline Speedup • 所有階段都一致If all stages are balanced – 每一階段時間都相同(i.e., all take the same time) – Time between instructionspipelined = Time between instructionsnonpipelined Number of stages • 若階段不一致,加速值較少If not balanced, speedup is less 管線處理的「加速』源自增加處理量(產量) Speedup due to increased throughput – 延遲時間(latency每一指令的時間)沒有減少 Latency (time for each instruction) does not decrease – SHU-TE UNIVERSITY CSIE DEPT. 4 - 56 Computer Organization & Design 5th. Pipelining and ISA Design • MIPS ISA專為管線化處理所設計 (MIPS ISA designed for pipelining) – 所有指令皆為32位元(All instructions are 32-bits) • 較容易在一個週期內擷取並解碼 (Easier to fetch and decode in one cycle) • c.f. x86: 1- to 17-byte instructions – 少量且規則的指令格式(Few and regular instruction formats) • 能在一個步驟內解碼並讀取暫存器 (Can decode and read registers in one step) – 載入與儲存定址(Load/store addressing) • 能在第3階段計算位址,在第4階段存取記憶體 (Can calculate address in 3rd stage, access memory in 4th stage) – 記憶體運算元對齊(Alignment of memory operands) • 記憶體存取只需一個週期 (Memory access takes only one cycle) SHU-TE UNIVERSITY CSIE DEPT. 4 - 57 Computer Organization & Design 5th. 危障Hazards • 下一週期的起始位址不是下一指令 Situations that prevent starting the next instruction in the next cycle • 危障種類Hazard types – 結構危障(Structure hazards) • 所需資源忙碌中(A required resource is busy) – 資料(數據)危障(Data hazard) • 等待前一指令完成資料讀寫 Need to wait for previous instruction to complete its data read/write – 控制危障(Control hazard) • 依前一指令結果決定控制動作 Deciding on control action depends on previous instruction SHU-TE UNIVERSITY CSIE DEPT. 4 - 58 Computer Organization & Design 5th. 結構危障Structure Hazards • 定義:當安排好的指令由於硬體無法支援當時應執行的指令 在適當的時脈週期內執行 • 使用資源衝突Conflict for use of a resource – MIPS中只有一個記憶體 In MIPS pipeline with a single memory – Load/store 需要做資料存取 Load/store requires data access – 該週期的指令擷取必須延遲(stall),需管線泡泡 Instruction fetch would have to stall for that cycle Would cause a pipeline “bubble” • 管線式資料路徑需要獨立的指令/資料記憶體 – 或獨立的指令/資料快取(記憶體) Hence, pipelined datapaths require separate instruction/data memories – Or separate instruction/data caches SHU-TE UNIVERSITY CSIE DEPT. 4 - 59 Computer Organization & Design 5th. 結構危障Structure Hazards [問題]若有第四個指令進入管線中,則….. 指令1在時間6~8需要存取記憶體,此時指令4也需要從記憶 體讀取指令 SHU-TE UNIVERSITY CSIE DEPT. 沖!沖!沖! 4 - 60 Computer Organization & Design 5th. 資料(數據)危障Data Hazards • 定義:當安排好的指令執行所需的資料未取得而無法在適當 的時脈週期內執行 • 與前依指令資料存取完成結果有關 An instruction depends on completion of data access by a previous instruction – add sub $s0, $t0, $t1 $t2, $s0, $t3 SHU-TE UNIVERSITY CSIE DEPT. 4 - 61 Computer Organization & Design 5th. 前饋/繞送 (Forwarding/Bypassing) • 使用已經計算完成的結果Use result when it is computed – 不需等到存至暫存器中Don’t wait for it to be stored in a register – 資料路徑需要額外的連接線 Requires extra connections in the datapath SHU-TE UNIVERSITY CSIE DEPT. 4 - 62 Computer Organization & Design 5th. Load指令-資料危障(Load-Use Data Hazard) • 使用前饋仍無法避免要用延遲/停滯(stall) Can’t always avoid stalls by forwarding – 當所需要的值尚未計算完成 If value not computed when needed – 無法前饋至之前的時間 Can’t forward backward in time! SHU-TE UNIVERSITY CSIE DEPT. 4 - 63 Computer Organization & Design 5th. Code Scheduling to Avoid Stalls • 利用指令重排避免下一 指令為Load指令 Reorder code to avoid use of load result in the next instruction • C code for stall stall lw lw add sw lw add sw A = B + E; C = B + F; $t1, $t2, $t3, $t3, $t4, $t5, $t5, 0($t0) 4($t0) $t1, $t2 12($t0) 8($t0) $t1, $t4 16($t0) 13 cycles SHU-TE UNIVERSITY CSIE DEPT. lw lw lw add sw add sw $t1, $t2, $t4, $t3, $t3, $t5, $t5, 0($t0) 4($t0) 8($t0) $t1, $t2 12($t0) $t1, $t4 16($t0) 11 cycles 4 - 64 Computer Organization & Design 5th. 控制危障Control Hazards • 定義:當所擷取的指令並非所需的指令而造成適當的指 令無法在適當的管線時脈週期中執行;亦即,指令位址 產生的順序非管線所期待者 • 分支決定控制流程Branch determines flow of control – 擷取下一指令取決於分支結果 Fetching next instruction depends on branch outcome – 管線處理不可能永遠擷取正確的下一個指令 Pipeline can’t always fetch correct instruction • 仍在分支指令的ID階段 Still working on ID stage of branch • MIPS的管線處理中In MIPS pipeline – 需要在管線中比較暫存器與提早計算目標位址Need to compare registers and compute target early in the pipeline – 在ID階段增加硬體來處理 Add hardware to do it in ID stage SHU-TE UNIVERSITY CSIE DEPT. 4 - 65 Computer Organization & Design 5th. 分支中的延遲Stall on Branch • 等到分支結果來決定擷取下一指令 Wait until branch outcome determined before fetching next instruction SHU-TE UNIVERSITY CSIE DEPT. 4 - 66 Computer Organization & Design 5th. 分支預測Branch Prediction • 較長的管線無法完全提早決定分之結果 Longer pipelines can’t readily determine branch outcome early – 延遲時間變得無法接受 Stall penalty becomes unacceptable • 分支預測結果Predict outcome of branch – 預測錯誤只有造成延遲Only stall if prediction is wrong • MIPS管線處理中In MIPS pipeline – 可以預測分支未發生Can predict branches not taken – 分支後的擷取指令沒有延遲 Fetch instruction after branch, with no delay SHU-TE UNIVERSITY CSIE DEPT. 4 - 67 Computer Organization & Design 5th. MIPS with Predict Not Taken Prediction correct Prediction incorrect SHU-TE UNIVERSITY CSIE DEPT. 4 - 68 Computer Organization & Design 5th. More-Realistic Branch Prediction • 靜態分支預測Static branch prediction – 基於典型分支行為Based on typical branch behavior – 範例:迴圈與if指令Example: loop and if-statement branches • 預測反向分支會發生Predict backward branches taken • 預測前向分支不會發生Predict forward branches not taken • 動態分支預測Dynamic branch prediction – 硬體預測器根據每道分支的行為來作預測,並且在程式運作過程中可 以改變對一道分支的預測 – 硬體測量實際分支行為Hardware measures actual branch behavior • 例如:記錄每一分支最近結果的歷史 e.g., record recent history of each branch – 假設未來行為會持續趨勢 Assume future behavior will continue the trend • 當猜錯時,使用重新擷取時延遲stall、並更新歷史紀錄 When wrong, stall while re-fetching, and update history SHU-TE UNIVERSITY CSIE DEPT. 4 - 69 Computer Organization & Design 5th. 管線處理總結Pipeline Summary • 管線處理增加效能是利用增加指令處理量 Pipelining improves performance by increasing instruction throughput – 同時執行多個指令Executes multiple instructions in parallel – 每一指令有相同延遲時間Each instruction has the same latency • 危障Subject to hazards – 結構、資料、控制Structure, data, control • 指令集設計影響實現管線處理的複雜度 Instruction set design affects complexity of pipeline implementation SHU-TE UNIVERSITY CSIE DEPT. 4 - 70 Computer Organization & Design 5th. 了解程式效能 • 有效率的管道運作通常是除了記憶體系統之外,決定處理器的 CPI 也就是其效能最重要的因素 • 結構危障通常發生在可能無法完全管道化的浮點單元中 • 控制危障通常在有較多分支而且分支較難預測的整數程式中較 難處理 • 數據危障 – 通常在浮點程式中由於其較少的分支以及較規律的記憶體存取樣式,方 便讓編譯器試著安排指令而較易避免數據危障 • 對於偏向使用指標(pointer)而導致較不規律記憶體存取的整數 程式則較難以這個方法來改善 • 管道化改善了指令的處理量然而反而會增加單一指令的執行時 間或延遲 SHU-TE UNIVERSITY CSIE DEPT. 4 - 71 Computer Organization & Design 5th. 管道化數據通道及控制 • 單一週期數據通道 – 5 級的管道 – 代表在任一時脈週期內最多有5 道指令正在執行 1. IF:指令擷取 2. ID:指令解碼與暫存器檔案讀取 3. EX:執行或位址計算 4. MEM:數據記憶體存取 5. WB:寫回 SHU-TE UNIVERSITY CSIE DEPT. 4 - 72 Computer Organization & Design 5th. MIPS Pipelined Datapath • 2個例外 – Write Back(WB) – Next PC MEM Right-to-left flow leads to hazards WB SHU-TE UNIVERSITY CSIE DEPT. 4 - 73 Computer Organization & Design 5th. 管線暫存器Pipeline registers • 管線各階段間需要暫存器Need registers between stages – 保存上一階段產生的資訊 To hold information produced in previous cycle 64bits SHU-TE UNIVERSITY CSIE DEPT. 128 97 64bits4 - 74 Computer Organization & Design 5th. Pipeline Operation • 管線化資料路徑「逐一週期」的流程 Cycle-by-cycle flow of instructions through the pipelined datapath – 單一週期管線圖 “Single-clock-cycle” pipeline diagram • 顯示單一週期管線的使用Shows pipeline usage in a single cycle • 標示使用到的資源Highlight resources used – 比較:多重時脈週期管線圖 c.f. “multi-clock-cycle” diagram • Graph of operation over time • 審視load跟store指令單一週期管線圖 We’ll look at “single-clock-cycle” diagrams for load & store SHU-TE UNIVERSITY CSIE DEPT. 4 - 75 Computer Organization & Design 5th. 單一時脈週期數據路徑假設以管道化的方式來執行 SHU-TE UNIVERSITY CSIE DEPT. 4 - 76 Computer Organization & Design 5th. 管道化數據通道及控制 • 顯示載入指令通過管道中五個階段時數據通道中強 調出有動作的部分的情形 – 任何在後方管道階段會使用到的資訊都必須透過 管道暫存器來傳遞給各該階段 – 數據通道中的每個邏輯元件如指令記憶體、暫存 器讀取埠、ALU、數據記憶體和暫存器寫入埠只 能用於唯一的管道階段期間,否則會發生結構危 障 • 現在我們來發現在載入指令設計中的一個錯誤。你 看出來了嗎? – 寫入暫存器的編號! SHU-TE UNIVERSITY CSIE DEPT. 4 - 77 Computer Organization & Design 5th. IF for Load, Store, … SHU-TE UNIVERSITY CSIE DEPT. 4 - 78 Computer Organization & Design 5th. ID for Load, Store, … SHU-TE UNIVERSITY CSIE DEPT. 4 - 79 Computer Organization & Design 5th. EX for Load SHU-TE UNIVERSITY CSIE DEPT. 4 - 80 Computer Organization & Design 5th. MEM for Load SHU-TE UNIVERSITY CSIE DEPT. 4 - 81 Computer Organization & Design 5th. WB for Load Wrong register number SHU-TE UNIVERSITY CSIE DEPT. 4 - 82 Computer Organization & Design 5th. Load指令之修正資料路徑 (Corrected Datapath for Load) SHU-TE UNIVERSITY CSIE DEPT. 4 - 83 Computer Organization & Design 5th. EX for Store SHU-TE UNIVERSITY CSIE DEPT. 4 - 84 Computer Organization & Design 5th. MEM for Store SHU-TE UNIVERSITY CSIE DEPT. 4 - 85 Computer Organization & Design 5th. WB for Store SHU-TE UNIVERSITY CSIE DEPT. 4 - 86 Computer Organization & Design 5th. Multi-Cycle Pipeline Diagram • 顯示資源利用(Form showing resource) usage SHU-TE UNIVERSITY CSIE DEPT. 4 - 87 Computer Organization & Design 5th. SHU-TE UNIVERSITY CSIE DEPT. 4 - 88