Chapter Six: Enhancing Performance with Pipelining
Instructor: Dr. Chuan-Yu Chang (張傳育), Ph.D.
E-mail: chuanyu@yuntech.edu.tw
Tel: (05)5342601 ext. 4337
Medical Image Processing Lab. (醫學影像處理實驗室)

An Overview of Pipelining
• Pipelining is an implementation technique that overlaps the execution of multiple instructions.
• Example: a 4-stage pipeline (assuming every stage takes the same time)
  – Without pipelining, four tasks take 4 × 4 = 16 time units.
  – With pipelining, they take 4 + 3 = 7 time units.
• Pipelining does not shorten the execution time of a single instruction; it increases instruction throughput.
• If every stage takes the same time and there is enough work to do, the speedup from pipelining is approximately equal to the number of pipeline stages.
• Example
  – With many loads of laundry, pipelining is very effective.
  – With only one load of laundry, the total time is the same with or without pipelining.

[Figure: laundry analogy; labels: washer, dryer]

• MIPS instructions are generally divided into the following five pipeline stages:
  – Fetch the instruction from instruction memory
  – Read registers while decoding the instruction
  – Execute the operation (R-type) or calculate an address (memory access)
  – Access an operand in data memory
  – Write the result into a register
• Pipelined execution of MIPS instructions therefore needs five pipeline stages.

• Example: single-cycle versus pipelined performance (page 438)
(1) Single-cycle, non-pipelined execution:
  – Execution time of one instruction = 8 ns
  – Execution time of three instructions = 3 × 8 = 24 ns
(2) Pipelined execution:
  – All pipeline stages take the same time (one clock cycle)
  – The clock cycle must be long enough to accommodate the slowest stage
  – Stage length = 2 ns
  – Execution time = time to complete the first instruction + (n − 1) × stage length = 10 + (3 − 1) × 2 = 14 ns (the first instruction takes 5 stages × 2 ns = 10 ns)

Instruction class                  Instruction fetch  Register read  ALU operation  Data access  Register write  Total time
Load word (lw)                     2 ns               1 ns           2 ns           2 ns         1 ns            8 ns
Store word (sw)                    2 ns               1 ns           2 ns           2 ns         -               7 ns
R-format (add, sub, and, or, slt)  2 ns               1 ns           2 ns           -            1 ns            6 ns
Branch (beq)                       2 ns               1 ns           2 ns           -            -               5 ns

Only these eight instructions are considered.

[Figure 6.3, top panel: non-pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through instruction fetch, Reg, ALU, data access, Reg and starts 8 ns after the previous one.]
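The timing comparison in this example can be checked with a few lines of Python. This is a minimal sketch, assuming the stage latencies from the table above; the names `STAGE_NS`, `single_cycle_time`, and `pipelined_time` are illustrative and do not come from the slides.

```python
# Stage latencies in ns from the table above, in the order
# fetch, register read, ALU, data access, register write (0 = stage unused).
STAGE_NS = {
    "lw":  [2, 1, 2, 2, 1],
    "sw":  [2, 1, 2, 2, 0],
    "add": [2, 1, 2, 0, 1],  # stands in for every R-format instruction
    "beq": [2, 1, 2, 0, 0],
}


def single_cycle_time(n, stage_ns=STAGE_NS):
    """Total time for n instructions when the clock fits the slowest instruction."""
    cycle = max(sum(stages) for stages in stage_ns.values())
    return n * cycle


def pipelined_time(n, stage_ns=STAGE_NS, num_stages=5):
    """Time to finish the first instruction plus one stage time per extra instruction."""
    stage = max(max(stages) for stages in stage_ns.values())  # slowest stage
    return num_stages * stage + (n - 1) * stage


for n in (3, 1000):
    t_single = single_cycle_time(n)
    t_pipe = pipelined_time(n)
    print(n, t_single, t_pipe, round(t_single / t_pipe, 2))
# 3 24 14 1.71
# 1000 8000 2008 3.98
```

With these latencies the speedup approaches 8 / 2 = 4 and not 5, because the stages are not perfectly balanced: the 1 ns register stages are padded to the 2 ns clock.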
[Figure 6.3, bottom panel: pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction starts 2 ns after the previous one and every stage is allotted 2 ns.]
Fig 6.3 Single-cycle, non-pipelined execution (top) vs. pipelined execution (bottom)
• What about branch instructions? They leave "holes" in the pipeline.

An Overview of Pipelining (cont.)
• Under ideal conditions
  – The speedup from pipelining equals the number of pipe stages.
• In practice
  – Pipelining involves some overhead.
  – The time per instruction on the pipelined machine exceeds the minimum possible, so the speedup is less than the number of pipeline stages.
• Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction.

• Designing Instruction Sets for Pipelining
  – All instructions are the same length.
  – There are only a few instruction formats, with the source register fields located in the same place in each instruction.
  – Memory operands appear only in loads and stores.
  – Operands must be aligned in memory.

• Pipeline Hazards
  A hazard is a situation in which the next instruction cannot execute in the following clock cycle.
• Three types of hazards
  – Structural hazards: suppose we had only one memory
  – Control hazards: we need to worry about branch instructions
  – Data hazards: an instruction depends on a previous instruction

• 1. Structural hazards
  The hardware does not have enough resources to support the combination of instructions that must execute in the same clock cycle.
• Example
  Suppose we had a single memory instead of two separate memories. If the pipeline in Fig 6.3 had a fourth instruction, then in one clock cycle the first instruction would be accessing data in memory while the fourth instruction was being fetched from that same memory. Two instructions accessing the same memory at the same time is a structural hazard.
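The single-memory conflict described above can be located mechanically. The sketch below assumes an ideal five-stage pipeline in which instruction i enters IF in cycle i, every instruction is a load (so it uses MEM), and both IF and MEM need the one shared memory; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
MEMORY_STAGES = {"IF", "MEM"}  # stages that would touch the single shared memory


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a 1-based clock cycle."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None


def memory_conflicts(num_instructions):
    """List (cycle, [instruction numbers]) where more than one instruction needs memory."""
    conflicts = []
    last_cycle = num_instructions + len(STAGES) - 1
    for cycle in range(1, last_cycle + 1):
        users = [i + 1 for i in range(num_instructions)
                 if stage_in_cycle(i, cycle) in MEMORY_STAGES]
        if len(users) > 1:
            conflicts.append((cycle, users))
    return conflicts


print(memory_conflicts(3))  # []
print(memory_conflicts(4))  # [(4, [1, 4])]
```

Three instructions never collide, but adding a fourth makes instruction 1 (in MEM) and instruction 4 (in IF) compete for the memory in cycle 4, which is the case the slide describes.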
2. Control hazards
  A control hazard arises when a decision must be made based on the result of one instruction while other instructions are already executing.
• Example: the lw instruction in Fig 6.4
[Figure 6.4: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). The fetch of lw starts 4 ns after the fetch of beq instead of 2 ns.]
• Solution 1: stall
  – Assume we have enough extra hardware to test the registers, calculate the branch address, and update the PC during the second pipeline stage.
  – The lw instruction is held back for one extra 2-ns clock cycle. This is called a pipeline stall, or bubble.

• Solution 2: predict
  – Predict that branches are never taken.
  – When the prediction is right, the pipeline runs at full speed (Fig 6.5 (a)).
  – The pipeline stalls only when a branch is taken (Fig 6.5 (b)).
[Figure 6.5 (a): add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), each starting 2 ns after the previous one (branch not taken).]
[Figure 6.5 (b): add $4, $5, $6; beq $1, $2, 40; a bubble; or $7, $8, $9, which starts 4 ns after beq (branch taken).]
Fig 6.5 Predicting that branches are not taken as a solution to control hazards

• Solution 3: delayed decision
  – Some instructions ("safe instructions") must execute whether or not the branch is taken and do not affect the correctness of the pipeline, so they can be placed in the clock cycle that would otherwise be a stall.
  – MIPS places a safe instruction immediately after the branch instruction (the delayed branch slot).
[Figure: beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), each starting 2 ns after the previous one.]

3. Data hazards
  A data hazard arises when an instruction needs a result produced by an earlier instruction that is still in the pipeline.
• Example
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
  The add instruction does not write its result back to the register file until the fifth pipeline stage, so three bubbles would be needed.
• The solution to this data hazard is not to wait for the whole instruction to complete, but to pass the ALU result directly to the next instruction. Getting the data early from internal resources is called forwarding or bypassing (Fig 6.8).
  In the figures, shading on the left half of a unit means write and shading on the right half means read.
[Figure 6.8: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB, one stage (2 ns) apart.]

• Example
    lw  $s0, 20($t1)
    sub $t2, $s0, $t3
  Even with forwarding we still have to stall the pipeline, because a memory-reference instruction does not finish its data memory access until the fourth stage.
[Figure: lw $s0, 20($t1) passes through IF, ID, EX, MEM, WB; a row of bubbles follows; sub $t2, $s0, $t3 then passes through IF, ID, EX, MEM, WB one cycle later than it otherwise would.]

A Pipelined Datapath
• The five stages of MIPS instruction execution
  – IF: instruction fetch
  – ID: instruction decode and register file read
  – EX: execution or effective memory address calculation
  – MEM: data memory access
  – WB: write back
• Fig 6.9 shows the single-cycle datapath divided into five pipeline stages (same as Fig. 5.17).
[Figure 6.9: single-cycle datapath split into IF, ID, EX, MEM, WB; components: PC, instruction memory, adders, register file, sign extend, shift left 2, ALU, data memory, multiplexors.]

• Up to five instructions are in execution during any single clock cycle.
• Example: see Fig 6.10
[Figure 6.10: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) over CC 1 to CC 7, each passing through IM, Reg, ALU, DM, Reg.]
  The abbreviations in the boxes name the unit that handles the data: IM: instruction memory, DM: data memory. Shading on the left half means write; shading on the right half means read.
  There are two exceptions to the left-to-right flow of instructions:
  1. The write-back stage places the result back into the register file in the middle of the datapath, which leads to data hazards.
  2. The next value of the PC is either the incremented PC (PC + 4) or the branch address obtained from the MEM stage, which leads to control hazards.
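A multiple-clock-cycle diagram such as Fig 6.10 can be generated from one rule: instruction i (counting from 0) occupies stage s in cycle i + s + 1. The sketch below assumes an ideal pipeline with no stalls; `pipeline_diagram` and `format_diagram` are illustrative names, not part of the slides.

```python
STAGE_LABELS = ["IM", "Reg", "ALU", "DM", "Reg"]


def pipeline_diagram(instructions):
    """Return (number of cycles, [(instruction text, per-cycle unit labels)])."""
    n_cycles = len(instructions) + len(STAGE_LABELS) - 1
    rows = []
    for i, text in enumerate(instructions):
        cells = [""] * n_cycles
        for s, label in enumerate(STAGE_LABELS):
            cells[i + s] = label  # instruction i is in stage s during cycle i + s + 1
        rows.append((text, cells))
    return n_cycles, rows


def format_diagram(instructions):
    """Render the diagram as aligned text, one row per instruction."""
    n_cycles, rows = pipeline_diagram(instructions)
    width = max(len(text) for text, _ in rows)
    header = " " * width + " | " + " ".join(f"CC{c:<3}" for c in range(1, n_cycles + 1))
    lines = [header]
    for text, cells in rows:
        lines.append(f"{text:<{width}} | " + " ".join(f"{c:<5}" for c in cells))
    return "\n".join(lines)


program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]
print(format_diagram(program))
```

For the three lw instructions this gives 3 + 5 − 1 = 7 clock cycles, matching CC 1 to CC 7 in the figure.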
Pipelined Datapath
• Pipeline registers
  – The instruction read from instruction memory is saved in a pipeline register so that its information remains available to the remaining four stages.
  – Fig 6.11 shows the pipelined datapath with the pipeline registers highlighted.
[Figure 6.11: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]

• When a memory or register is read, its right half is highlighted; when it is written, its left half is highlighted.
• The five stages of the lw instruction (see Figs 6.12 to 6.14):
  – Instruction fetch: the instruction is read from memory using the address in the program counter (PC) and placed in the IF/ID pipeline register (at this point the computer does not yet know which type of instruction is being fetched).
  – Instruction decode and register file read: the register numbers are supplied, and the register contents, the sign-extended 16-bit immediate field, and the incremented PC value are placed in the ID/EX pipeline register.
  – Execute or address calculation: the load reads the contents of register 1 and the sign-extended immediate from the ID/EX pipeline register, adds them using the ALU, and places the sum in the EX/MEM pipeline register.
  – Memory access: the load reads the data memory using the address in the EX/MEM pipeline register and writes the data into the MEM/WB pipeline register.
  – Write back: the data is read from the MEM/WB pipeline register and written into the register file.

[Fig. 6.12: lw, instruction fetch (top) and instruction decode with register file read (bottom).]
[Fig. 6.13: lw, execute or address calculation.]
[Fig. 6.14: lw, memory access (top) and write back (bottom).]
[Fig. 6.15: sw, third stage (address calculation): the effective memory address is calculated and placed in EX/MEM.]
[Fig. 6.16: sw, memory write (top) and write back (bottom). Because sw has already written its data to memory in the fourth stage, nothing happens in the write-back stage.]
[Fig. 6.17: the pipelined datapath corrected to handle lw properly. The destination register number from the IF/ID register must be preserved until the write-back stage, so that the data read from memory can be written into the register file.]
[Fig. 6.18: the portion of the datapath used by lw.]

Graphically Representing Pipelines
Multiple-clock-cycle pipeline diagram (1)
[Figure: lw $10, 20($1) and sub $11, $2, $3 over CC 1 to CC 6, each passing through IM, Reg, ALU, DM, Reg.]
• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Use this representation to help understand datapaths.

Multiple-clock-cycle pipeline diagram (2)
[Figure: lw $10, 20($1) and sub $11, $2, $3 over CC 1 to CC 6, with each stage labelled by name: instruction fetch, instruction decode, execution, data access, write back.]

Single-clock-cycle pipeline diagrams
[Figures: one datapath snapshot per clock, Clock 1 to Clock 6, summarized below.]

Clock  lw $10, 20($1) (first instruction)  sub $11, $2, $3 (second instruction)
1      Instruction fetch (stage 1)         -
2      Instruction decode (stage 2)        Instruction fetch (stage 1)
3      Execution (stage 3)                 Instruction decode (stage 2)
4      Memory (stage 4)                    Execution (stage 3)
5      Write back (stage 5)                Memory (stage 4)
6      -                                   Write back (stage 5)
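The lw walk-through and the correction in Fig. 6.17 can be sketched by modelling each pipeline register as a small record. This is only an illustration under simplifying assumptions: the instruction is given as an already decoded dictionary, the immediate is an ordinary Python integer (so sign extension is a no-op), and all class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IFID:
    instruction: dict
    pc_plus_4: int


@dataclass
class IDEX:
    read_data1: int
    sign_ext_imm: int
    pc_plus_4: int
    write_reg: int  # destination register number, carried along as in Fig. 6.17


@dataclass
class EXMEM:
    alu_result: int
    write_reg: int


@dataclass
class MEMWB:
    read_data: int
    write_reg: int


def run_lw(instr, pc, regs, data_mem):
    """Push one lw through the five stages, one pipeline register at a time."""
    if_id = IFID(instruction=instr, pc_plus_4=pc + 4)                  # IF
    id_ex = IDEX(read_data1=regs[if_id.instruction["rs"]],             # ID
                 sign_ext_imm=if_id.instruction["imm"],
                 pc_plus_4=if_id.pc_plus_4,
                 write_reg=if_id.instruction["rt"])
    ex_mem = EXMEM(alu_result=id_ex.read_data1 + id_ex.sign_ext_imm,   # EX
                   write_reg=id_ex.write_reg)
    mem_wb = MEMWB(read_data=data_mem[ex_mem.alu_result],              # MEM
                   write_reg=ex_mem.write_reg)
    regs[mem_wb.write_reg] = mem_wb.read_data                          # WB
    return regs


regs = [0] * 32
regs[1] = 100
data_mem = {120: 42}
run_lw({"op": "lw", "rs": 1, "rt": 10, "imm": 20}, pc=0, regs=regs, data_mem=data_mem)
print(regs[10])  # 42
```

The `write_reg` field is the point of Fig. 6.17: by the time lw reaches write back, IF/ID already holds a later instruction, so the destination register number has to travel with the load through ID/EX, EX/MEM, and MEM/WB.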
Pipeline Control
[Figure: pipelined datapath with the control signals identified: PCSrc, Branch, RegWrite, MemRead, MemWrite, ALUSrc, ALUOp, RegDst, MemtoReg, plus the ALU control unit fed by instruction bits [15–0]; the RegDst multiplexor selects between instruction bits [20–16] and [15–11].]

Pipeline Control
• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back
• How would control be handled in an automobile plant?
  – A fancy control center telling everyone what to do?
  – Should we use a finite state machine?

Pipeline Control
• Pass control signals along just like the data

             Execution/address calculation     Memory access stage       Write-back stage
             stage control lines               control lines             control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite RegWrite  MemtoReg
R-format     1       1       0       0         0       0        0        1         0
lw           0       0       0       1         0       1        0        1         1
sw           X       0       0       1         0       0        1        0         X
beq          X       0       1       0         1       0        0        0         X

[Figure: the Control unit generates the WB, M, and EX signal groups from the instruction in IF/ID. ID/EX holds WB, M, and EX; EX/MEM holds WB and M; MEM/WB holds WB.]

Datapath with Control
[Figure: pipelined datapath with the control signals connected to the control portions (WB, M, EX) of the pipeline registers ID/EX, EX/MEM, and MEM/WB.]
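The idea of passing control signals along with the data can be sketched as a lookup table whose groups are dropped one by one as the instruction moves down the pipeline. The values are taken from the table above, with `None` standing for a don't-care (X); the function name is an assumption made for illustration.

```python
X = None  # don't-care

CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M":  {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M":  {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M":  {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":      {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M":  {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
}


def control_in_pipeline_registers(op):
    """Show which control groups each pipeline register still carries for op."""
    signals = CONTROL[op]
    return {
        "ID/EX":  {g: signals[g] for g in ("EX", "M", "WB")},
        "EX/MEM": {g: signals[g] for g in ("M", "WB")},   # EX group consumed
        "MEM/WB": {g: signals[g] for g in ("WB",)},       # M group consumed
    }


for reg_name, groups in control_in_pipeline_registers("lw").items():
    print(reg_name, groups)
```

Each stage uses its own group and forwards only what later stages still need, which is why no central controller or finite state machine is required.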
Dependencies
• Problem with starting the next instruction before the first is finished
  – Dependencies that "go backward in time" are data hazards.
[Figure: pipeline diagram, CC 1 to CC 9, for the sequence below. Register $2 holds 10 through CC 4, changes from 10 to -20 during CC 5, and holds -20 from CC 6 on.]
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Software Solution
• Have the compiler guarantee no hazards
• Where do we insert the "nops"?
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
• Problem: this really slows us down!

Forwarding
• Use temporary results, don't wait for them to be written
  – Register file forwarding to handle read/write to the same register
  – ALU forwarding
[Figure: pipeline diagram, CC 1 to CC 9, for the same five instructions, with the values below.]

                      CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2  10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM       X     X     X     -20   X       X     X     X     X
Value of MEM/WB       X     X     X     X     -20     X     X     X     X

  Question noted on the figure: what if this $2 was $13?

Forwarding
[Figure: pipelined datapath with a forwarding unit. The unit receives the Rs and Rt register numbers of the instruction in EX, together with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the multiplexors in front of the ALU inputs.]

Can't always forward
• Load word can still cause a hazard:
  – An instruction tries to read a register following a load instruction that writes to the same register.
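The decision made by the forwarding unit can be sketched as follows. The slides show only the unit's inputs, so the comparison rules below are an assumption: they follow the conditions commonly given for this five-stage MIPS design (forward only if the earlier instruction writes a register, that register is not $0, and it matches the source; the more recent EX/MEM result takes priority over MEM/WB). The names and the dictionary layout are illustrative.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose an ALU input: '10' = EX/MEM result, '01' = MEM/WB value, '00' = register file."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
        return "10"  # result of the immediately preceding instruction
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
        return "01"  # result of the instruction two ahead
    return "00"


def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return (forward_select(id_ex["Rs"], ex_mem, mem_wb),
            forward_select(id_ex["Rt"], ex_mem, mem_wb))


EMPTY = {"RegWrite": 0, "Rd": 0}
SUB = {"RegWrite": 1, "Rd": 2}     # sub $2, $1, $3
AND = {"RegWrite": 1, "Rd": 12}    # and $12, $2, $5

# and $12, $2, $5 in EX while sub is in MEM
print(forwarding_unit({"Rs": 2, "Rt": 5}, ex_mem=SUB, mem_wb=EMPTY))  # ('10', '00')
# or $13, $6, $2 in EX while and is in MEM and sub is in WB
print(forwarding_unit({"Rs": 6, "Rt": 2}, ex_mem=AND, mem_wb=SUB))    # ('00', '01')
```

For the load case in the last bullet, the loaded value is not yet in EX/MEM when the following instruction reaches EX, so choosing a forwarding path is not enough on its own.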
[Figure: pipeline diagram, CC 1 to CC 9, for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7.]
  – Thus, we need a hazard detection unit to "stall" the instruction that follows the load.

Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram, CC 1 to CC 10, for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. A bubble is inserted after lw, and the and and or instructions repeat a stage.]

Hazard Detection Unit
• Stall by letting an instruction that won't write anything go forward
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt, and controls PCWrite, IF/IDWrite, and a multiplexor that zeroes the control signals.]

Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
[Figure: pipeline diagram, CC 1 to CC 9, for the instructions below, listed by address.]
    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4, 50($7)
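The addresses in the branch hazard figure can be reproduced with a little arithmetic. The sketch below assumes the usual MIPS rule that a beq offset is counted in words from the instruction after the branch, and assumes (as the figure suggests) that the branch outcome is known in the fourth stage; the function names are illustrative.

```python
def branch_target(branch_pc, word_offset):
    """Target of a MIPS beq: address of the next instruction plus offset * 4."""
    return branch_pc + 4 + 4 * word_offset


def fetched_before_resolution(branch_pc, resolve_stage):
    """Addresses fetched after the branch before its outcome is known.

    resolve_stage is the 1-based stage in which the branch is decided
    (4 = MEM, assumed here for the figure; 2 = ID if the decision is moved earlier).
    """
    return [branch_pc + 4 * k for k in range(1, resolve_stage)]


print(branch_target(40, 7))                 # 72
print(fetched_before_resolution(40, 4))     # [44, 48, 52]
print(fetched_before_resolution(40, 2))     # [44]
```

With the decision in MEM, the instructions at 44, 48, and 52 are already in the pipeline when the branch to 72 is taken; moving the decision to ID leaves only one instruction to discard.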
Flushing Instructions
[Figure: pipelined datapath with the branch decision moved to the ID stage: an equality comparator (=) on the register outputs, the branch adder with shift left 2, the IF.Flush control line, the hazard detection unit, and the forwarding unit.]

Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
• Add a "branch delay slot"
  – The next instruction after a branch is always executed.
  – Rely on the compiler to "fill" the slot with something useful.
• Superscalar: start more than one instruction in the same cycle

Dynamic Scheduling
• The hardware performs the "scheduling"
  – Hardware tries to find instructions to execute.
  – Out-of-order execution is possible.
  – Speculative execution and dynamic branch prediction
• All modern processors are very complicated
  – DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
  – PowerPC and Pentium: branch history table
  – Compiler technology is important.
• This class has given you the background you need to learn more
• Video: An Overview of Intel's Pentium Processor (available from University Video Communications)

Superscalar and Dynamic Pipelining
• n-way superscalar
  – Replicate the internal components of the computer so that n instructions can be handled in every pipeline stage.
  – The ideal CPI is 1/n.
• Superscalar MIPS
  – Assume two instructions are issued per clock cycle.
  – One instruction can be an integer ALU operation or branch; the other can be a load or store.

Instruction type   Pipe stages (one column per clock cycle)
ALU or branch      IF  ID  EX  MEM WB
Load or store      IF  ID  EX  MEM WB
ALU or branch          IF  ID  EX  MEM WB
Load or store          IF  ID  EX  MEM WB
ALU or branch              IF  ID  EX  MEM WB
Load or store              IF  ID  EX  MEM WB
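Returning to the reordering exercise under Improving Performance, the effect of a reordering can be checked by counting load-use stalls. This sketch assumes the load-use rule suggested by the hazard detection unit (a one-cycle stall when an instruction immediately after a lw reads the register that lw loads, with the data register of sw counted as a source) and handles only lw and sw; the parser and function names are hypothetical.

```python
def parse(line):
    """Parse 'lw $t0, 0($t1)' or 'sw $t2, 0($t1)' into (op, rt, base)."""
    op, rest = line.split(None, 1)
    rt, addr = [part.strip() for part in rest.split(",")]
    base = addr[addr.index("(") + 1:addr.index(")")]
    return op, rt, base


def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    prev = None
    for line in program:
        op, rt, base = parse(line)
        sources = {base} | ({rt} if op == "sw" else set())
        if prev is not None and prev[0] == "lw" and prev[1] in sources:
            stalls += 1
        prev = (op, rt, base)
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original))   # 1
print(load_use_stalls(reordered))  # 0
```

Swapping the two stores is one possible reordering: it separates the load of $t2 from its first use without changing what is stored.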
Superscalar and Dynamic Pipelining
• Example
    Loop: lw   $t0, 0($s1)
          addi $s1, $s1, -4
          addu $t0, $t0, $s2
          bne  $s1, $zero, Loop
          sw   $t0, 4($s1)

        ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:                               lw   $t0, 0($s1)            1
        addi $s1, $s1, -4                                       2
        addu $t0, $t0, $s2                                      3
        bne  $s1, $zero, Loop       sw   $t0, 4($s1)            4

• Extra hardware required (Fig 6.58)
  – Another 32 bits from instruction memory
  – Extra ports on the register file
  – Another ALU to calculate addresses for data transfers

• Dynamic pipeline scheduling
  – While waiting for a stall to be resolved, dynamic pipeline scheduling looks past the stalled instruction for later instructions to execute.
  – Dynamic pipeline scheduling is normally combined with extra hardware resources so that later instructions can execute in parallel.
  – The cost is much more complex pipeline control and a more complex instruction execution model.
• Example
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20
  Without dynamic scheduling, the sub and slti instructions must wait for lw and addu to finish first, even though they are ready to execute.

• The pipeline is divided into three major units
  – Instruction fetch and issue unit: fetches instructions, decodes them, and sends each instruction to the corresponding functional unit of the execute stage. In-order issue.
  – Execute units: each functional unit has buffers, called reservation stations, that hold the operands and the operation. As soon as a buffer contains all its operands and the functional unit is ready to execute, the result is calculated. Out-of-order execution.
  – Commit unit: decides when it is safe to put the result into the register file or memory. In-order commit.

• Example
    lw   $t0, 20($s2)
    addu $t1, $t0, $t2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20
    sub  $s4, $s5, $t6

• Example: DEC Alpha 21264
  – Fetches four instructions per clock cycle and can issue up to six.
  – Uses out-of-order execution with in-order completion.
  – The pipeline takes 9 stages for simple integer and floating-point operations.
  – Clock rate in 1997: 600 MHz.
• Dynamic pipelining is more complex than traditional static pipelining
  – Combined with branch prediction: the commit unit must be able to discard results in the execution unit that belong to instructions issued after a mispredicted branch.
  – Combined with superscalar execution.
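Why sub and slti in the dynamic scheduling example are free to run ahead can be seen by listing the read-after-write dependences. The sketch below encodes each instruction by hand as (operation, destination, sources); that encoding and the function name are assumptions made for illustration.

```python
PROGRAM = [
    ("lw",   "$t0", ["$s2"]),         # lw   $t0, 20($s2)
    ("addu", "$t1", ["$t0", "$t2"]),  # addu $t1, $t0, $t2
    ("sub",  "$s4", ["$s4", "$t3"]),  # sub  $s4, $s4, $t3
    ("slti", "$t5", ["$s4"]),         # slti $t5, $s4, 20
]


def raw_dependences(program):
    """Map each instruction index to the earlier instructions whose results it reads."""
    deps = {}
    last_writer = {}  # register name -> index of the latest instruction writing it
    for i, (_, dest, sources) in enumerate(program):
        deps[i] = sorted({last_writer[s] for s in sources if s in last_writer})
        last_writer[dest] = i
    return deps


for index, producers in raw_dependences(PROGRAM).items():
    print(PROGRAM[index][0], "depends on", [PROGRAM[p][0] for p in producers])
# lw depends on []
# addu depends on ['lw']
# sub depends on []
# slti depends on ['sub']
```

Neither sub nor slti reads a value produced by lw or addu, so a dynamically scheduled pipeline can execute them while addu waits for the load.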
Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction
[Figure: state diagram of a 2-bit prediction scheme with four states (two "Predict taken", two "Predict not taken") and transitions labelled "Taken" and "Not taken".]
A 2-bit prediction scheme

Branch Prediction
• Sophisticated techniques:
  – A "branch target buffer" to help us look up the destination
  – Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  – A "branch delay slot" which the compiler tries to fill with a useful instruction (making the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
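The behavior of a 2-bit prediction scheme can be sketched with a saturating counter. The exact transitions of the state diagram above are not recoverable from the slide text, so this is the common saturating-counter variant, stated here as an assumption; the class and function names are illustrative.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outcomes, predictor):
    """Count wrong predictions over a sequence of actual branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken 9 times and then not taken, executed 3 times over.
outcomes = ([True] * 9 + [False]) * 3
print(mispredictions(outcomes, TwoBitPredictor(state=3)))  # 3
```

Starting from the strongly-taken state, the predictor misses only the loop exit each time: a single not-taken outcome moves it one step but does not flip the prediction for the next loop entry.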