Chapter04

Computer Organization & Design 5th.
Chapter 4
The Processor: Datapath and Control
ROBERT CHEN
SHU-TE UNIVERSITY CSIE DEPT.
Outline
• Introduction
• Logic Design Conventions
• Building a Datapath
• A Simple Implementation Scheme
• A Multicycle Implementation
• Exception
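Introduction
• Computer performance is affected by three factors:
– Instruction count
– Clock cycles per instruction (CPI)
• varies by instruction class: integer, arithmetic-logical, memory-reference, and branch instructions
– Clock cycle time
• The compiler and the instruction set architecture (ISA) determine the instruction count a program requires.
• The clock cycle time and the CPI, however, are determined by the implementation of the processor itself.
• In this chapter we construct the datapath and control unit for two different implementations of the MIPS instruction set:
– Single-cycle implementation
– Multicycle implementation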
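The three factors combine in the classic performance equation, CPU time = instruction count × CPI × clock cycle time. The Python sketch below only illustrates that relationship; the function name and every number in it are hypothetical and do not come from the slides.

```python
def cpu_time_ns(instruction_count: int, cpi: float, cycle_time_ns: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical machines running the same one-million-instruction program.
# Single-cycle style: CPI = 1 but a long clock cycle.
single_cycle = cpu_time_ns(1_000_000, cpi=1.0, cycle_time_ns=8.0)
# Multicycle style: shorter clock cycle but several cycles per instruction.
multi_cycle = cpu_time_ns(1_000_000, cpi=3.5, cycle_time_ns=2.0)

print(single_cycle, multi_cycle)
```

The compiler and ISA influence the first argument, while the processor implementation determines the other two.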
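Introduction
• A MIPS implementation is built from functional units of two kinds of logic elements:
– Elements that operate on data values
• Example: ALU
• Combinational (the output depends only on the current inputs)
– Elements that contain state
• Example: memories and the register file
• Sequential (the output depends on both the inputs and the internal state)
– Sequential logic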
Introduction
• Stages of instruction execution
– Instruction fetch
– Decode
– Operand fetch
– Execute
– Write back
• Figure 4.1 gives a high-level overview of a MIPS implementation
Introduction
• We're ready to look at an implementation of the MIPS
• Simplified to contain only:
– memory-reference instructions:
• lw, sw
– arithmetic-logical instructions:
• add, sub, and, or, slt
– control flow instructions:
• beq, j
Introduction
• State Elements
– Unclocked vs. clocked
– Clocks are used in synchronous logic
• When should an element that contains state be updated?
[Figure: clock waveform showing the rising edge, the falling edge, and the cycle time]
Introduction
• An unclocked state element
– The set-reset latch
• Output depends on present inputs and also on past inputs
• Latches and flip-flops
– Latches and flip-flops are the simplest memory elements.
– The output is equal to the stored value inside the element (there is no need to ask for permission to look at the value)
– Change of state (value) is based on the clock
– Latch: state changes whenever the inputs change and the clock is asserted
– Flip-flop: state changes only on a clock edge (edge-triggered methodology)
• A clocking methodology defines when signals can be read and written
• We would not want to read a signal at the same time it is being written
Introduction
• D latch
– Two inputs:
• the data value to be stored (D)
• the clock signal (C) indicating when to read and store D
– Two outputs:
• the value of the internal state (Q) and its complement
– When the latch is open (C asserted), the value of Q changes as D changes; this is why it is called a transparent latch.
[Figure: D latch circuit and timing diagram with signals D, C, and Q]
Introduction
• D flip-flop
– Flip-flops are not transparent
– Output changes only on the clock edge
– The first latch, called the master, is open and follows the input D when C is asserted. When the clock input falls, the first latch is closed, but the second latch, called the slave, is open and gets its input from the output of the master latch.
[Figure: D flip-flop built from two D latches (master and slave), with a timing diagram for D, C, and Q]
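The difference between a transparent latch and a master-slave flip-flop can be sketched behaviorally. This is a minimal Python model under the assumptions stated in the comments (ideal, zero-delay elements; class and method names are hypothetical), not a description of any real circuit library.

```python
class DLatch:
    """Transparent D latch: Q follows D while the clock input C is asserted."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, c: int, d: int) -> int:
        if c:            # latch open: output tracks the data input
            self.q = d
        return self.q    # latch closed: output holds the stored value


class DFlipFlop:
    """Master-slave D flip-flop: master open while C = 1, slave open while C = 0.

    Assumption: the slave is clocked by the inverted clock, so Q only changes
    when the clock falls.
    """

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, c: int, d: int) -> int:
        m = self.master.update(c, d)
        return self.slave.update(1 - c, m)


ff = DFlipFlop()
trace = [ff.update(c, d) for c, d in [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1)]]
print(trace)
```

While C is high the output stays at its old value even though D toggles; the value of D present when the clock falls is the one that appears on Q.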
Introduction
• Set-up time and hold time
– Set-up time: the minimum time that the input must remain valid before the clock edge
– Hold time: the minimum time that the input must be valid after the clock edge (usually very small)
[Figure: timing diagram of D and C marking the set-up time and the hold time around the clock edge]
Introduction
• An edge-triggered methodology
– Determines when signals are read and when they are written
• Typical execution:
– read the contents of some state elements,
– send the values through some combinational logic,
– write the results to one or more state elements
[Figure: state element 1 → combinational logic → state element 2, all within one clock cycle]
Introduction
• Register File
– A register file consists of a set of registers that can be read and written by supplying a register number to be accessed.
– Built using D flip-flops and decoders (which select the register number)
– Read part (left): supply a register number as input, and the output is the information stored in that register.
– A register file with 2 read ports and 1 write port (right)
[Figure, left: two multiplexors select among Register 0 to Register n using Read register number 1 and Read register number 2, producing Read data 1 and Read data 2]
[Figure, right: register file block with inputs Read register number 1, Read register number 2, Write register, Write data, and Write, and outputs Read data 1 and Read data 2]
Introduction
• Register File
– Write part: needs 3 inputs: a register number, the data to write, and a clock that controls the writing into the register.
– Note: we still use the real clock to determine when to write
[Figure: write port of the register file; an n-to-2^n decoder on the register number, gated by the Write signal, drives the C input of each register (Register 0 through Register n), and Register data drives every D input]
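The read and write behavior described above can be sketched as a small behavioral model. This is only an illustration in Python: the class and method names are hypothetical, reads are modeled as combinational, and the write happens only when `clock_edge` is called with the write control asserted.

```python
class RegisterFile:
    """Behavioral model of a register file with 2 read ports and 1 write port."""

    def __init__(self, count: int = 32) -> None:
        self.regs = [0] * count

    def read(self, read_reg1: int, read_reg2: int) -> tuple[int, int]:
        # Read ports act like multiplexors: a register number selects a value.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write: bool, write_reg: int, write_data: int) -> None:
        # The decoder output is gated by the Write signal, so only the
        # selected register is updated, and only when Write is asserted.
        if write:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(True, 8, 5)     # write 5 into register 8
rf.clock_edge(False, 8, 9)    # Write deasserted: register 8 is unchanged
print(rf.read(8, 0))
```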
Introduction
• Simple Implementation
– Basic components:
• Two state elements, the instruction memory and the program counter (PC), are needed to store and access instructions.
• An adder is needed to compute the next instruction address.
– Since the instruction memory is read-only, we can treat it as combinational logic.
[Figure: a. Instruction memory (input: instruction address; output: instruction), b. Program counter, c. Adder]
Introduction
• Fetching instructions and incrementing the PC
– A portion of the datapath used for fetching instructions and incrementing the program counter
– After the PC supplies the address used to read the instruction, it is immediately incremented by 4 to point to the next instruction
[Figure: the PC drives the Read address input of the instruction memory and an adder that adds 4 to the PC]
Introduction
• R-Format ALU operations
– An R-format instruction has 3 register operands: 2 are read and 1 is written
– E.g. add $t0, $t1, $t2
– Register numbers are 5 bits wide to select among 32 registers, the data buses are 32 bits wide, and the ALU control has 4 bits
[Figure: a. Registers (three 5-bit register numbers: Read register 1, Read register 2, Write register; Write data input; RegWrite control; outputs Read data 1 and Read data 2), b. ALU (4-bit ALU control input; outputs Zero and ALU result)]
Introduction
• Datapath for R-type Instruction
– E.g. add $t0, $t1, $t2
[Figure: the instruction supplies Read register 1, Read register 2, and Write register to the register file; Read data 1 and Read data 2 feed the ALU, whose result returns to Write data; control signals are RegWrite and the 4-bit ALU operation]
Introduction
• Load and Store Instructions
– Load and store instructions compute a memory address by adding the base register to the 16-bit signed offset field contained in the instruction
– The "sign extension unit" extends the 16-bit value to 32 bits by replicating the high-order sign bit into the upper 16 bits
– E.g. lw $t0, 40($t1)
       sw $t0, 32($t1)
[Figure: a. Data memory unit (inputs Address and Write data; controls MemRead and MemWrite; output Read data), b. Sign-extension unit (16 bits in, 32 bits out)]
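Sign extension and the address calculation for a load or store can be sketched in a few lines of Python. The function names and the base register value below are hypothetical; the offset 40 is the one from the `lw $t0, 40($t1)` example.

```python
def sign_extend16(value: int) -> int:
    """Extend a 16-bit field to 32 bits by replicating its sign bit."""
    value &= 0xFFFF
    if value & 0x8000:            # sign bit set: fill the upper 16 bits with 1s
        return value | 0xFFFF0000
    return value                  # sign bit clear: upper 16 bits stay 0


def effective_address(base: int, offset16: int) -> int:
    """Memory address = base register + sign-extended offset (32-bit wrap)."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


# lw $t0, 40($t1), assuming $t1 holds 0x1000 (hypothetical value)
print(hex(effective_address(0x1000, 40)))
# A negative offset of -4 is encoded as 0xFFFC in the 16-bit field.
print(hex(effective_address(0x1000, 0xFFFC)))
```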
Introduction
• Datapath for load and store instructions
– Steps of a load or store on the datapath
• Register access, followed by the memory address calculation.
• A read from or write to memory.
• For a load instruction, a write into the register file.
lw $t0, 40($t1)
sw $t0, 32($t1)
[Figure: the instruction supplies $t1 as Read register 1 and $t0 as Read register 2 or Write register; the ALU adds Read data 1 to the sign-extended 16-bit offset (40) to form the data memory Address; Read data 2 goes to the memory Write data input; memory Read data returns to the register file Write data input]
Introduction
• Branch instruction (beq, I-type)
– Branch datapath
• Needs to compute the branch target address
– PC+4 is the address of the next instruction
– The offset field is shifted left two bits so that it is a word offset.
– (For a jump, by contrast: PC[27:0] ← address[25:0] followed by 00)
• Needs to compare register contents
beq $t1, $t2, offset
[Figure: the register file supplies Read data 1 and Read data 2 to the ALU, whose Zero output goes to the branch control logic; the 16-bit offset is sign-extended to 32 bits, shifted left 2, and added to PC + 4 from the instruction datapath to form the branch target]
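The branch target calculation can be written out as a short sketch. The Python below assumes the usual MIPS rule, target = (PC + 4) + (sign-extended offset << 2); the function names and the PC value are hypothetical.

```python
def sign_extend16(value: int) -> int:
    """Extend a 16-bit field to 32 bits by replicating its sign bit."""
    value &= 0xFFFF
    return value | 0xFFFF0000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """Branch target = (PC + 4) + (sign-extended offset shifted left by 2)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def next_pc(pc: int, offset16: int, rs_value: int, rt_value: int) -> int:
    """beq: take the branch only when the ALU subtraction yields zero."""
    zero = (rs_value - rt_value) & 0xFFFFFFFF == 0
    return branch_target(pc, offset16) if zero else (pc + 4) & 0xFFFFFFFF


pc = 0x00400000                       # hypothetical instruction address
print(hex(next_pc(pc, 25, 7, 7)))     # equal registers: branch taken
print(hex(next_pc(pc, 25, 7, 8)))     # unequal registers: fall through
```

An offset field of 25 words corresponds to a byte distance of 100 from PC + 4.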
Introduction
• Combining the datapaths
– Use multiplexors (MUXes), also called data selectors, to combine the datapaths for R-type instructions and memory instructions without duplicating identical functional units
Introduction
• Combining the datapaths
– Add the instruction fetch portion of the datapath
Introduction
• Combining the datapaths
– Add the branch portion of the datapath
– Branch target address = (PC + 4) + (sign-extended offset field shifted left by 2)
Introduction
• Are we done?
– The hardest part is the design of the control unit
A Simple Implementation Scheme
• This simple implementation covers
– load word (lw) and store word (sw)
– branch equal (beq)
– ALU instructions: add, sub, and, or, and set on less than (slt)
• Depending on the instruction class, the ALU must perform one of the following operations
– Addition: compute the memory address for lw and sw
– Subtraction: for branch equal
– AND, OR, subtract, add, or slt: for R-type instructions (determined by the 6-bit funct field)
• ALU control inputs
– 0000 : AND
– 0001 : OR
– 0010 : add
– 0110 : subtract
– 0111 : set on less than
– 1100 : NOR (for other MIPS instructions)
[Figure: ALU with inputs a and b, a 4-bit ALU operation control input, and outputs Zero, Result, Overflow, and CarryOut]
A Simple Implementation Scheme
• Purpose
– Selecting the operations to perform (ALU, read/write, etc.)
– Controlling the flow of data (multiplexor inputs)
• How to get these control signals:
– Information comes from the 32 bits of the instruction
Example: add $8, $17, $18
Instruction format:
op      rs     rt     rd     shamt  funct
000000  10001  10010  01000  00000  100000
• The ALU's operation is based on the instruction type and the function code
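The field boundaries of the R-format can be checked by decoding the example instruction. In the Python sketch below the function name is hypothetical; the word 0x02324020 is simply the six bit fields of `add $8, $17, $18` shown above, concatenated.

```python
def decode_r_format(word: int) -> dict[str, int]:
    """Split a 32-bit R-format instruction word into its six fields."""
    return {
        "op":    (word >> 26) & 0x3F,   # bits 31:26
        "rs":    (word >> 21) & 0x1F,   # bits 25:21
        "rt":    (word >> 16) & 0x1F,   # bits 20:16
        "rd":    (word >> 11) & 0x1F,   # bits 15:11
        "shamt": (word >> 6) & 0x1F,    # bits 10:6
        "funct": word & 0x3F,           # bits 5:0
    }


# 000000 10001 10010 01000 00000 100000
word = 0b000000_10001_10010_01000_00000_100000
fields = decode_r_format(word)
print(hex(word), fields)
```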
What Control Signals Do We Need?
Design Method for Control
• Multi-level control (decoding)
• Instruction opcode: main control unit (first level)
– ALU control
• Sub-control for arithmetic
– MUX control
• Which source registers and destination registers
• ALU input source
• Input source of destination register
• Input source of PC
– Result for first level
• Seven 1-bit control lines
• 2-bit ALUOp control signal
• The above control signals can be set based solely on the opcode field of the instruction
– Exception: PCSrc (depends on the beq result)
A Simple Implementation Scheme
• The ALU control bits are determined by the ALUOp control bits
• ALUOp identifies the ALU operation required by each instruction type
Instruction opcode  ALUOp  Instruction operation  Funct field  Desired ALU action  ALU control input
LW                  00     load word              XXXXXX       add                 0010
SW                  00     store word             XXXXXX       add                 0010
Branch equal        01     branch equal           XXXXXX       subtract            0110
R-type              10     add                    100000       add                 0010
R-type              10     subtract               100010       subtract            0110
R-type              10     AND                    100100       and                 0000
R-type              10     OR                     100101       or                  0001
R-type              10     set on less than       101010       set on less than    0111
ALU Control
• ALU control
– Instructions using the ALU
– Load/store
• address calculation: add
lw $t1, offset($t2)
– Branch equal
• Subtract for comparison
• 'taken' or 'not taken'
• add/subtract for address calculation
beq $t1, $t2, offset
– R-type
• and/or
• set-on-less-than
[Figure: the ALU control unit takes the 2-bit ALUOp and the 6-bit function field as inputs and produces the 4-bit ALU operation signal for the ALU]
ALU Control
• Multi-level control (decoding)
– Instruction opcode: main control unit (first level)
00 = lw, sw
01 = beq
10 = arithmetic
• Second level: function code for arithmetic (sub-control)
– The main control unit generates the ALUOp bits as inputs to the ALU control unit
– This reduces the size of the main control unit but may increase the delay
ALU Control
• Truth table
– X : don't care term
– All zeros or don't care terms are eliminated
ALUOp1  ALUOp0  |  F5  F4  F3  F2  F1  F0  |  Operation
0       0       |  X   X   X   X   X   X   |  0010
X       1       |  X   X   X   X   X   X   |  0110
1       X       |  X   X   0   0   0   0   |  0010
1       X       |  X   X   0   0   1   0   |  0110
1       X       |  X   X   0   1   0   0   |  0000
1       X       |  X   X   0   1   0   1   |  0001
1       X       |  X   X   1   0   1   0   |  0111
Notes:
1. ALUOp currently never takes the value '11', so the original '10' is written as '1X'.
2. In the funct field, F5 F4 are always '10', so they are written as 'XX'.
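The truth table can be expressed directly as a lookup. The Python sketch below follows the seven rows above, including the don't-care simplifications from the notes; the function name is hypothetical.

```python
def alu_control(alu_op: int, funct: int) -> int:
    """Return the 4-bit ALU operation for a 2-bit ALUOp and 6-bit funct field."""
    if alu_op & 0b01:            # row X1: beq, subtract
        return 0b0110
    if alu_op == 0b00:           # row 00: lw/sw, add
        return 0b0010
    # rows 1X: R-type, decided by funct bits F3..F0 (F5 and F4 are don't cares)
    r_type = {
        0b0000: 0b0010,          # add
        0b0010: 0b0110,          # subtract
        0b0100: 0b0000,          # AND
        0b0101: 0b0001,          # OR
        0b1010: 0b0111,          # set on less than
    }
    return r_type[funct & 0b1111]


for name, op, funct in [("lw", 0b00, 0), ("beq", 0b01, 0),
                        ("add", 0b10, 0b100000), ("slt", 0b10, 0b101010)]:
    print(name, format(alu_control(op, funct), "04b"))
```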
Design Main Control Unit
• Instruction formats
– Opcode field: Op[5:0]
– For R-type, branch-equal (beq), and store instructions, the two registers to be read are given by the rs field (instruction bits 25:21) and the rt field (bits 20:16)
– Base register for load and store instructions: instruction bits 25:21 (rs)
– 16-bit offset for branch-equal (beq), load, and store instructions: instruction bits 15:0
A Simple Implementation Scheme
• Seven single-bit control lines, one 2-bit ALUOp control signal
• Except for PCSrc, the control signals can be set solely based on the opcode field of the instruction.
• To generate PCSrc, we need to AND together a signal from the control unit, which we call Branch, with the Zero signal out of the ALU.
Signal    Deasserted                                 Asserted
RegDst    destination register from rt (bits 20-16)  destination register from rd (bits 15-11)
RegWrite  none                                       write to destination register
ALUSrc    2nd ALU operand from register output 2     2nd ALU operand from 16-bit sign extension
PCSrc     PC ← PC + 4                                PC ← branch destination
MemRead   none                                       read from memory
MemWrite  none                                       write to memory
MemtoReg  put ALU result into register               put memory read data into register
The Simple Datapath with the Control Unit
[Figure: single-cycle datapath with the control unit. Instruction[31-26] drives the Control block, which produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. Instruction[25-21] and Instruction[20-16] select the registers to read, Instruction[15-11] is the alternative write register, Instruction[15-0] is sign-extended from 16 to 32 bits, and Instruction[5-0] feeds the ALU control.]
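The Simple Datapath with the Control Unit
• Define how every control line should be set for each opcode
• The first row corresponds to the R-format instructions (add, sub, and, or, and slt)
– The source registers are always rs and rt and the destination register is always rd; this defines how the ALUSrc and RegDst control lines are set
– An R-type instruction writes its result to a register (RegWrite = 1) but neither reads nor writes data memory
– When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise, the PC is replaced with the branch target address if the Zero output of the ALU is also 1
– The ALUOp field of an R-type instruction is set to 10, indicating that the ALU control is generated from the funct field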
The Simple Datapath with the Control Unit
• The second and third rows give the control signal settings for the lw and sw instructions.
– ALUSrc and ALUOp are set to perform the address calculation.
– MemRead and MemWrite are set to perform the memory access.
– For a load, RegDst and RegWrite are set so that the result is stored in the rt register.
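The Simple Datapath with the Control Unit
• For the branch instruction, the ALUOp field is set to perform a subtraction (ALUOp = 01), used to test whether the two values are equal.
– When the RegWrite control signal is 0, the MemtoReg field is irrelevant: since the register file is not written, the value at its Write data input is not used.
– The MemtoReg entries in the last two rows of the table are therefore marked X, meaning "don't care".
– When RegWrite is 0, RegDst can also be marked X.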
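The control table that these slides describe row by row is not reproduced in the text, so the Python sketch below reconstructs it from the descriptions above and from the standard single-cycle MIPS control settings. Treat the dictionary layout and the names `CONTROL` and `pc_src` as assumptions; `None` stands for a don't care (X).

```python
# Control settings per opcode; None marks a don't care (X).
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw (35)
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw (43)
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq (4)
}


def pc_src(opcode: int, zero: int) -> int:
    """PCSrc is the AND of the Branch control signal and the ALU Zero output."""
    return CONTROL[opcode]["Branch"] & zero


for opcode, signals in CONTROL.items():
    print(format(opcode, "06b"), signals)
```

Every signal except PCSrc is a pure function of the opcode, which is why a simple table lookup is enough for this sketch.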
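Datapath operation
• Datapath operation for an R-type instruction: add $t1, $t2, $t3
– Although everything happens in one clock cycle, it can be thought of as four steps:
1. The instruction is fetched, and the PC is incremented
2. Two registers are read from the register file; meanwhile, the main control unit computes the control line settings
3. The ALU operates on the values read from the register file, using the function code (bits 5:0, the funct field of the instruction) to generate the ALU function
4. The ALU result is written into the register file, using bits 15:11 of the instruction to select the destination register ($t1)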
lw $t0, 32($s3) ; 35 19 8 32
Field  35 or 43  rs     rt     address
Bits   31:26     25:21  20:16  15:0
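Datapath operation
• Active functional units and asserted control signals for a load instruction
lw $t0, 32($s3)
– It can be thought of as five steps:
1. The instruction is fetched from the instruction memory, and the PC is incremented
2. A register value is read from the register file
3. The ALU computes the sum of the value read from the register file and the sign-extended lower 16 bits of the instruction (the offset)
4. The sum from the ALU is used as the address for the data memory
5. The data returned by the memory is written into the register file; the destination register is given by bits 20:16 of the instruction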
beq $s1, $s2, 100 ; 4 17 18 25
Field  4      rs     rt     address
Bits   31:26  25:21  20:16  15:0
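Datapath operation
• beq instruction: beq $s1, $s2, 100
– It can be thought of as four steps:
1. The instruction is fetched from the instruction memory, and the PC is incremented
2. Two registers, $s1 and $s2, are read from the register file
3. The ALU subtracts the values read from the register file. The value of PC + 4 is added to the sign-extended lower 16 bits of the instruction (the offset) shifted left by two; the result is the branch target address
4. The Zero output of the ALU is used to decide which adder result is written back to the PC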
Complete Control Unit
• Add the jump instruction to show how the basic datapath and control can be extended to handle other instructions
Adding the jump instruction
Jump
Field  2      address
Bits   31:26  25:0
Performance Issues
• Longest delay determines clock period
– Critical path: load instruction
– Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary period for different instructions
• Violates design principle
– Making the common case fast
• We will improve performance by pipelining
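A Simple Implementation Scheme
• Why is the single-cycle implementation not used?
– Every instruction's clock cycle must have the same length (hence CPI = 1)
– The longest path among all instructions determines the clock cycle length
– Overall performance is therefore unlikely to be good
• Example: performance of a single-cycle machine. Assume the following operation times for the functional units:
– Memory units: 2 ns
– ALU and adders: 2 ns
– Register file (read or write): 1 ns
• Which of the following implementations is faster?
1. Every instruction completes in one clock cycle of fixed length
2. Every instruction completes in one clock cycle, but the clock cycle length is variable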
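A Simple Implementation Scheme
• Example (continued)
– To compute performance, assume the following instruction mix: 24% loads, 12% stores, 44% R-format instructions, 18% branches, and 2% jumps
Functional units used by each instruction class:
R-format    Instruction fetch  Register access  ALU  Register access
Load word   Instruction fetch  Register access  ALU  Memory access  Register access
Store word  Instruction fetch  Register access  ALU  Memory access
Branch      Instruction fetch  Register access  ALU
Jump        Instruction fetch
Time required by each instruction class (ns):
Instruction class  Instruction memory  Register read  ALU operation  Data memory  Register write  Total
R-format           2                   1              2              0            1               6 ns
Load word          2                   1              2              2            1               8 ns
Store word         2                   1              2              2                            7 ns
Branch             2                   1              2                                           5 ns
Jump               2                                                                              2 ns
• Answer
1. With a fixed-length clock, the CPU clock cycle is 8 ns.
2. With a variable-length clock, the average CPU clock cycle
= 8 × 24% + 7 × 12% + 6 × 44% + 5 × 18% + 2 × 2%
= 6.3 ns
The performance improvement is 8/6.3 = 1.27.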
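A Simple Implementation Scheme
• Example
– Suppose we add floating-point units:
– A floating-point add takes 8 ns
– A floating-point multiply takes 16 ns
– All other functional unit times are as in the previous example. Which of the following implementations is faster?
• 1. Every instruction completes in one clock cycle of fixed length
• 2. Every instruction completes in one clock cycle, but the clock cycle length is variable
– To compute performance, assume the following instruction mix:
– 31% loads, 21% stores, 27% R-format instructions, 5% branches, 2% jumps, 7% floating-point adds, and 7% floating-point multiplies
• Answer
– 1. The longest instruction is the floating-point multiply, so the fixed clock cycle is 2 + 1 + 16 + 1 = 20 ns
– 2. A floating-point add takes 2 + 1 + 8 + 1 = 12 ns.
– Average CPU clock cycle with a variable-length clock
= 8 × 31% + 7 × 21% + 6 × 27% + 5 × 5% + 2 × 2% + 20 × 7% + 12 × 7% = 8.1 ns
– The performance improvement is 20/8.1 ≈ 2.5.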
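Both worked examples are weighted averages of per-class latencies, which is easy to check mechanically. The Python sketch below uses the latencies and instruction mixes quoted in the two examples; the function and variable names are hypothetical.

```python
def average_cycle_ns(mix: dict[str, float], latency_ns: dict[str, float]) -> float:
    """Average clock cycle for a variable-length clock: sum of share x latency."""
    return sum(share * latency_ns[name] for name, share in mix.items())


latency = {"load": 8, "store": 7, "r_format": 6, "branch": 5, "jump": 2,
           "fp_mul": 20, "fp_add": 12}

# First example: no floating-point instructions, fixed clock = 8 ns.
mix_basic = {"load": 0.24, "store": 0.12, "r_format": 0.44,
             "branch": 0.18, "jump": 0.02}
avg_basic = average_cycle_ns(mix_basic, latency)

# Second example: floating-point instructions included, fixed clock = 20 ns.
mix_fp = {"load": 0.31, "store": 0.21, "r_format": 0.27, "branch": 0.05,
          "jump": 0.02, "fp_mul": 0.07, "fp_add": 0.07}
avg_fp = average_cycle_ns(mix_fp, latency)

print(round(avg_basic, 2), round(8 / avg_basic, 2))
print(round(avg_fp, 2), round(20 / avg_fp, 2))
```

The fixed clock period in each case is the largest latency among the instruction classes that appear in the mix.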
Pipelining
• Pipelined laundry: overlapping execution
– Parallelism improves performance
• Four loads: speedup = 8/3.5 = 2.3
• Non-stop: speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
[Figure: sequential versus pipelined laundry timelines]
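The laundry speedup formula can be evaluated for a few values of n to see how it approaches the number of stages. In this Python sketch the timing assumptions (four stages of 0.5 hours each, so 2 hours per load sequentially) are read off the formula 2n/(0.5n + 1.5) rather than stated on the slide, and the function name is hypothetical.

```python
def laundry_speedup(n: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Speedup of pipelined over sequential laundry for n loads."""
    sequential = n * stages * stage_hours          # 2n hours
    pipelined = (n + stages - 1) * stage_hours     # 0.5n + 1.5 hours
    return sequential / pipelined


for n in (4, 40, 4000):
    print(n, round(laundry_speedup(n), 2))
```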
MIPS Pipeline
• Definition of pipelining
– Instructions are divided into several steps, which are executed simultaneously by different functional units to improve overall program performance
• Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
Pipeline Performance
• Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle datapath
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
Pipeline Performance
[Figure: instruction timing, single-cycle (Tc = 800ps) versus pipelined (Tc = 200ps)]
Pipeline Speedup
• If all stages are balanced
– i.e., all take the same time
– $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
• If not balanced, speedup is less
• Speedup is due to increased throughput
– Latency (time for each instruction) does not decrease
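With the stage times given earlier (800ps single-cycle clock, 200ps pipelined clock, five stages), the total time for n instructions can be compared directly. The Python sketch below assumes an ideal pipeline with no hazards; the function names are hypothetical.

```python
def single_cycle_time_ps(n: int, cycle_ps: int = 800) -> int:
    """Every instruction takes one long clock cycle."""
    return n * cycle_ps


def pipelined_time_ps(n: int, stages: int = 5, cycle_ps: int = 200) -> int:
    """The first instruction needs `stages` cycles; each later one adds a cycle."""
    return (n + stages - 1) * cycle_ps


for n in (3, 1_000_000):
    single, piped = single_cycle_time_ps(n), pipelined_time_ps(n)
    print(n, single, piped, round(single / piped, 3))
```

The speedup approaches 4 rather than 5 because the stages are not balanced: the 200ps clock is set by the slowest stage, while the register stages need only 100ps.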
Pipelining and ISA Design
• MIPS ISA designed for pipelining
– All instructions are 32 bits
• Easier to fetch and decode in one cycle
• c.f. x86: 1- to 17-byte instructions
– Few and regular instruction formats
• Can decode and read registers in one step
– Load/store addressing
• Can calculate address in 3rd stage, access memory in 4th stage
– Alignment of memory operands
• Memory access takes only one cycle
Hazards
• Situations that prevent starting the next instruction in the next cycle
• Hazard types
– Structure hazards
• A required resource is busy
– Data hazard
• Need to wait for previous instruction to complete its data read/write
– Control hazard
• Deciding on control action depends on previous instruction
Structure Hazards
• Definition: a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute
• Conflict for use of a resource
– In a MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
– Would cause a pipeline "bubble"
• Hence, pipelined datapaths require separate instruction/data memories
– Or separate instruction/data caches
Structure Hazards
• Question: what happens if a fourth instruction enters the pipeline?
– Instruction 1 needs to access memory during time 6 to 8, at the same moment that instruction 4 needs to fetch its instruction from memory: a conflict!
Data Hazards
• Definition: a planned instruction cannot execute in the proper clock cycle because the data it needs is not yet available
• An instruction depends on completion of data access by a previous instruction
– add $s0, $t0, $t1
  sub $t2, $s0, $t3
Forwarding (Bypassing)
• Use result when it is computed
– Don't wait for it to be stored in a register
– Requires extra connections in the datapath
Load-Use Data Hazard
• Can't always avoid stalls by forwarding
– If value not computed when needed
– Can't forward backward in time!
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• C code for A = B + E; C = B + F;
Original order (13 cycles):
       lw   $t1, 0($t0)
       lw   $t2, 4($t0)
stall
       add  $t3, $t1, $t2
       sw   $t3, 12($t0)
       lw   $t4, 8($t0)
stall
       add  $t5, $t1, $t4
       sw   $t5, 16($t0)
Reordered (11 cycles):
       lw   $t1, 0($t0)
       lw   $t2, 4($t0)
       lw   $t4, 8($t0)
       add  $t3, $t1, $t2
       sw   $t3, 12($t0)
       add  $t5, $t1, $t4
       sw   $t5, 16($t0)
Control Hazards
• Definition: the fetched instruction is not the one that is needed, so the proper instruction cannot execute in the proper pipeline clock cycle; that is, the flow of instruction addresses is not what the pipeline expected
• Branch determines flow of control
– Fetching next instruction depends on branch outcome
– Pipeline can't always fetch correct instruction
• Still working on ID stage of branch
• In MIPS pipeline
– Need to compare registers and compute target early in the pipeline
– Add hardware to do it in ID stage
Stall on Branch
• Wait until branch outcome determined before fetching next instruction
Branch Prediction
• Longer pipelines can't readily determine branch outcome early
– Stall penalty becomes unacceptable
• Predict outcome of branch
– Only stall if prediction is wrong
• In MIPS pipeline
– Can predict branches not taken
– Fetch instruction after branch, with no delay
MIPS with Predict Not Taken
[Figure: two pipeline diagrams, one where the prediction is correct and one where the prediction is incorrect]
More-Realistic Branch Prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken
• Dynamic branch prediction
– A hardware predictor makes its prediction based on the behavior of each branch and may change its prediction for a branch over the life of the program
– Hardware measures actual branch behavior
• e.g., record recent history of each branch
– Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history
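One of the simplest forms of "record recent history of each branch" is a predictor that remembers only the last outcome of each branch. The Python sketch below is an illustration under that assumption; the slides do not specify a particular predictor, and the class name, the initial not-taken prediction, and the branch address are all hypothetical.

```python
class LastOutcomePredictor:
    """Predict that each branch repeats its most recent outcome."""

    def __init__(self) -> None:
        self.history: dict[int, bool] = {}

    def predict(self, pc: int) -> bool:
        # Assumption: a branch with no history is predicted not taken.
        return self.history.get(pc, False)

    def update(self, pc: int, taken: bool) -> None:
        self.history[pc] = taken


def count_mispredictions(outcomes: list[bool], pc: int = 0x400) -> int:
    predictor = LastOutcomePredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1                  # would stall while re-fetching
        predictor.update(pc, taken)     # update the recorded history
    return wrong


# A loop branch that is taken nine times and then falls through.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))
```

Under these assumptions the loop branch is mispredicted on its first and last iterations only.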
Pipeline Summary
• Pipelining improves performance by increasing instruction throughput
– Executes multiple instructions in parallel
– Each instruction has the same latency
• Subject to hazards
– Structure, data, control
• Instruction set design affects complexity of pipeline implementation
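Understanding Program Performance
• Outside of the memory system, the effective operation of the pipeline is usually the most important factor in determining the CPI of the processor and hence its performance
• Structural hazards usually occur around the floating-point unit, which may not be fully pipelined
• Control hazards are usually harder to handle in integer programs, which tend to have more branches as well as less predictable branches
• Data hazards
– In floating-point programs, the lower branch frequency and more regular memory access patterns allow the compiler to try to schedule instructions so that data hazards are easier to avoid
• Integer programs that lean on pointers, and therefore have less regular memory accesses, are harder to improve this way
• Pipelining improves instruction throughput but can actually increase the execution time, or latency, of an individual instruction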
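Pipelined Datapath and Control
• The single-cycle datapath
– is divided into a 5-stage pipeline
– which means that up to five instructions are in execution during any single clock cycle
1. IF: Instruction fetch
2. ID: Instruction decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back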
MIPS Pipelined Datapath
• Two exceptions to the left-to-right flow
– Write back (WB)
– Next PC
• Right-to-left flow leads to hazards
[Figure: pipelined datapath with the right-to-left paths from the MEM and WB stages highlighted]
Pipeline registers
• Need registers between stages
– To hold information produced in previous cycle
[Figure: pipelined datapath with four pipeline registers of widths 64 bits, 128 bits, 97 bits, and 64 bits, from left to right]
Pipeline Operation
• Cycle-by-cycle flow of instructions through the pipelined datapath
– "Single-clock-cycle" pipeline diagram
• Shows pipeline usage in a single cycle
• Highlight resources used
– c.f. "multi-clock-cycle" diagram
• Graph of operation over time
• We'll look at "single-clock-cycle" diagrams for load & store
The single-cycle datapath, assuming pipelined execution
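Pipelined Datapath and Control
• The figures highlight the active portions of the datapath as a load instruction goes through the five pipeline stages
– Any information needed in a later pipeline stage must be passed to that stage via a pipeline register
– Each logical component of the datapath, such as the instruction memory, register read ports, ALU, data memory, and register write port, can be used only within a single pipeline stage; otherwise there would be a structural hazard
• Now let us uncover a bug in the design of the load instruction. Did you see it?
– The write register number!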
IF for Load, Store, …
ID for Load, Store, …
EX for Load
MEM for Load
WB for Load
[Figure annotation: wrong register number]
Corrected Datapath for Load
EX for Store
MEM for Store
WB for Store
Multi-Cycle Pipeline Diagram
• Form showing resource usage