ECE369 The MIPS Processor ECE369 1 Datapath & control design • • • • We will design a simplified MIPS processor Some of the instructions supported are – Memory-reference instructions: lw, sw – Arithmetic-logical instructions: add, sub, and, or, slt – Control flow instructions: beq, j Generic implementation – Use the program counter (PC) to supply instruction address – Get the instruction from memory – Read registers – Use the instruction to decide exactly what to do All R-type and I-Type instructions use the ALU after reading the registers ECE369 2 Summary of Instruction Types R-Type: op=0 31:26 op 25:21 rs 20:16 rt 15:11 10:6 rd shamt 5:0 funct Load/Store: op=35 or 43 31:26 op 25:21 rs 20:16 rt 15:0 address 20:16 rt 15:0 address Branch: op=4 31:26 op 25:21 rs ECE369 3 Building blocks Instruction address PC Instruction Instruction memory Address a. Instruction memory 5 Register numbers 5 5 Data MemWrite Add Sum b. Program counter 3 Read register 1 Read register 2 Registers Write register Write data c. Adder Write data Read data Data memory 16 Sign extend 32 ALU control MemRead Read data 1 Data Zero ALU ALU result a. Data memory unit b. Sign-extension unit Read data 2 RegWrite a. Registers b. ALU ECE369 4 Fetching instructions ECE369 5 Reading registers op rs rt rd ECE369 shamt funct 6 Load/Store memory access 31:26 op 25:21 rs 20:16 rt ECE369 15:0 address 7 Branch target ECE369 8 Combining datapath for memory and R-type instructions ECE369 9 Appending instruction fetch ECE369 10 Now Insert Branch ECE369 11 The simple datapath ECE369 12 Adding control to datapath ECE369 13 ALU Control • given instruction type 00 = lw, sw 01 = beq, 10 = arithmetic ECE369 14 Control (Reading Assignment: Appendix C.2) • Simple combinational logic (truth tables) Inputs Op5 Op4 Op3 ALUOp Op2 Op1 ALU control block Op0 ALUOp0 ALUOp1 Outputs R-format F3 F2 F (5– 0) Operation2 Iw sw beq RegDst ALUSrc Operation1 Operation MemtoReg RegWrite F1 MemRead Operation0 MemWrite F0 Branch ALUOp1 ALUOpO ECE369 15 Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format lw sw ECE369 beq 16 Datapath in Operation for R-Type Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw sw ECE369 beq 17 Datapath in Operation for Load Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0ECE369 0 1 0 0 0 beq 18 Datapath in Operation for Branch Equal Instruction Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0ECE369 0 1 0 0 0 beq X 0 X 0 0 0 1 0 1 19 Datapath with control for Jump instruction • • J-type instructions use 6 bits for the opcode, and 26 bits for the immediate value (called the target). newPC <- PC[31-28] IR[25-0] 00 ECE369 20 Timing: Single cycle implementation • Calculate cycle time assuming negligible delays except – Memory (2ns), ALU and adders (2ns), Register file access (1ns) ECE369 21 Why is Single Cycle not GOOD??? • • • Memory - 2ns; ALU - 2ns; Adder - 2ns; Reg - 1ns Instruction class Instruction memory Register read ALU Data memory Register write Total (in ns) ALU type 2 1 2 0 1 6 Load word 2 1 2 2 1 8 Store word 2 1 2 2 Branch 2 1 2 Jump 2 ECE369 – what if we had floating point instructions to handle? 
Store word total: 7 ns; branch: 5 ns; jump: 2 ns.

Delays: memory 2 ns; ALU 2 ns; adders 2 ns; register file 1 ns.
Instruction mix: loads 24%, stores 12%, R-type 44%, branches 18%, jumps 2%.

Instruction class: instruction memory / register read / ALU / data memory / register write / total (ns)
ALU type:   2 / 1 / 2 / 0 / 1 / 6
Load word:  2 / 1 / 2 / 2 / 1 / 8
Store word: 2 / 1 / 2 / 2 / 0 / 7
Branch:     2 / 1 / 2 / 0 / 0 / 5
Jump:       2 / 0 / 0 / 0 / 0 / 2
ECE369 23

If the clock period could be varied per instruction class, the average instruction time would be
Average instruction time = 8 x 24% + 7 x 12% + 6 x 44% + 5 x 18% + 2 x 2% = 6.3 ns,
versus 8 ns for every instruction with the fixed single-cycle clock. ECE369 24

Single Cycle Problems
– Wasteful of area: each unit is used only once per clock cycle
– The clock cycle must equal the worst-case scenario
• Will reducing the delay of the common case help? No: the cycle time is still set by the slowest instruction (the load). ECE369 25

Pipelining (laundry analogy)
Four loads: speedup = 8/3.5 = 2.3
Non-stop: speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages ECE369 26

Pipelining
• Five stages, one step per stage
– IF: instruction fetch from memory
– ID: instruction decode & register read
– EX: execute operation or calculate address
– MEM: access memory operand
– WB: write result back to register ECE369 27

Pipelining
• Improve performance by increasing instruction throughput
• Single-cycle (Tc = 800 ps) vs. pipelined (Tc = 200 ps)
• Ideal speedup is the number of stages in the pipeline. Do we achieve this? ECE369 28

Pipelining
• The MIPS ISA was designed for pipelining
– All instructions are 32 bits: easier to fetch and decode in one cycle
– Few and regular instruction formats: can decode and read registers in one step
– Load/store addressing: can calculate the address in the 3rd stage and access memory in the 4th stage ECE369 29

Pipelining: What makes it hard?
Situations that prevent starting the next instruction in the next cycle:
• Structure hazard: a required resource is busy
• Data hazard: need to wait for a previous instruction to complete its data read/write
• Control hazard: deciding on a control action depends on a previous instruction ECE369 30

Pipelining: Structure Hazards
• Conflict for use of a resource
• In a MIPS pipeline with a single memory, a load/store requires a data access, so the instruction fetch would have to stall for that cycle, causing a pipeline "bubble"
• Hence, pipelined datapaths require separate instruction/data memories (or separate instruction/data caches) ECE369 31

Pipelining: Data Hazards
• An instruction depends on completion of a data access by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3 ECE369 32

Pipelining: Control Hazards
• A branch determines the flow of control, so fetching the next instruction depends on the branch outcome
• The pipeline can't always fetch the correct instruction: it is still working on the ID stage of the branch
• Wait until the branch outcome is determined before fetching the next instruction ECE369 33

Pipelining: Summary
• Pipelining improves performance by increasing instruction throughput: multiple instructions execute in parallel, while each instruction keeps the same latency
• Subject to hazards: structure, data, control
• Instruction set design affects the complexity of the pipeline implementation ECE369 34

Representation
• What do we need to add to actually split the datapath into stages? ECE369 35

Pipelined datapath ECE369 36

IF for Load, Store, …
• Memory and registers: left half shaded = write, right half shaded = read ECE369 37

ID for Load, Store, … ECE369 38

EX for Load ECE369 39

MEM for Load ECE369 40

WB for Load
• What is wrong with this datapath? (See the sketch below.)
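To make the question concrete, here is a minimal C sketch (not from the slides; all struct and field names are invented for illustration) of the state the four pipeline registers have to carry. It also shows the well-known answer to the question above: the destination register number must travel down the pipeline with its own instruction. If the WB stage took the write-register field from the instruction currently sitting in IF/ID, a load would write back to a register belonging to a later instruction.

/* Minimal sketch of the four pipeline registers of the 5-stage MIPS pipeline
 * (IF/ID, ID/EX, EX/MEM, MEM/WB).  Field names are invented for illustration;
 * a real datapath also carries the control signals for the later stages.     */
#include <stdint.h>
#include <stdio.h>

typedef struct {                 /* IF/ID: fetched instruction + PC+4         */
    uint32_t instr;
    uint32_t pc_plus4;
} IF_ID;

typedef struct {                 /* ID/EX: operands, immediate, and the        */
    uint32_t pc_plus4;           /* possible destination register numbers      */
    uint32_t read_data1, read_data2;
    int32_t  sign_ext_imm;
    uint8_t  rs, rt, rd;         /* register numbers travel with the instr.    */
} ID_EX;

typedef struct {                 /* EX/MEM: ALU result + data to store         */
    uint32_t branch_target;
    uint32_t alu_result;
    uint32_t write_data;
    uint8_t  dest_reg;           /* rt for a load, rd for an R-type            */
    int      zero;
} EX_MEM;

typedef struct {                 /* MEM/WB: value to write back                */
    uint32_t mem_read_data;
    uint32_t alu_result;
    uint8_t  dest_reg;           /* WB must take the register number from here,*/
} MEM_WB;                        /* not from the instruction sitting in IF/ID  */

int main(void) {
    /* Follow one lw through the back half of the pipeline. */
    EX_MEM exmem = { .alu_result = 0x1000, .dest_reg = 8 /* $t0 */ };
    MEM_WB memwb = { .mem_read_data = 42, .dest_reg = exmem.dest_reg };

    uint32_t regfile[32] = {0};
    regfile[memwb.dest_reg] = memwb.mem_read_data;        /* WB stage */
    printf("$%u <- %u\n", (unsigned)memwb.dest_reg,
           (unsigned)regfile[memwb.dest_reg]);
    return 0;
}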
ECE369 41 WB for Load ECE369 42 EX for Store ECE369 43 MEM for Store ECE369 44 WB for Store ECE369 45 Graphically representing pipelines ECE369 46 Graphically representing pipelines ECE369 47 Pipeline operation • • • • • One operation begins in every cycle One operation completes in each cycle Each instruction takes 5 clock cycles When a stage is not used, no control needs to be applied How to generate control signals for them is an issue ECE369 48 Pipeline control ECE369 49 Pipeline operation Control signals derived from instruction As in single-cycle implementation ECE369 50 Pipeline control Instruction R-format lw sw beq Execution/Address Calculation stage control lines Reg ALU ALU ALU Dst Op1 Op0 Src 1 1 0 0 0 0 0 1 X 0 0 1 X 0 1 0 ECE369 Memory access stage control lines Branc Mem Mem h Read Write 0 0 0 0 1 0 0 0 1 1 0 0 Write-back stage control lines Reg Mem write to Reg 1 0 1 1 0 X 0 X 51 Pipelining is not quite that easy! • Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) – Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). ECE369 52 Data Hazards in ALU Instructions Consider this sequence: sub and or add sw $2, $1,$3 $12,$2,$5 $13,$6,$2 $14,$2,$2 $15,100($2) ECE369 53 Dependencies • Problem with starting next instruction before first is finished – Dependencies that “go backward in time” are data hazards ECE369 54 Three Generic Data Hazards Inst I before inst j in in the program • Read After Write (RAW) InstrJ tries to read operand before InstrI writes it I: add r1,r2,r3 J: sub r4,r1,r3 • Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. ECE369 55 Three Generic Data Hazards • Write After Read (WAR) InstrJ writes operand before InstrI reads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. • Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 ECE369 56 Three Generic Data Hazards • Write After Write (WAW) InstrJ writes operand before InstrI writes it. I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 • • Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5 ECE369 57 Hazards ECE369 58 Forwarding • Use temporary results, don’t wait for them to be written – register file forwarding to handle read/write to same register – ALU forwarding ECE369 59 Detecting the Need to Forward Pass register numbers along pipeline ALU operand register numbers in EX stage are given by e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register ID/EX.RegisterRs, ID/EX.RegisterRt Data hazards when 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. 
MEM/WB.RegisterRd = ID/EX.RegisterRt Fwd from EX/MEM pipeline reg Fwd from MEM/WB pipeline reg • ECE369 60 Detecting the Need to Forward Pass register numbers along pipeline ALU operand register numbers in EX stage are given by e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register ID/EX.RegisterRs, ID/EX.RegisterRt Data hazards when 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt Fwd from EX/MEM pipeline reg Fwd from MEM/WB pipeline reg • ECE369 61 Detecting the Need to Forward But only if forwarding instruction will write to a register! EX/MEM.RegWrite, MEM/WB.RegWrite And only if Rd for that instruction is not $zero EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0 ECE369 62 Forwarding ECE369 63 Forwarding Conditions EX hazard if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10 MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 ECE369 64 Double Data Hazard Consider the sequence: add $1,$1,$2 add $1,$1,$3 add $1,$1,$4 Both hazards occur Want to use the most recent Revise MEM hazard condition Only fwd if EX hazard condition isn’t true ECE369 65 Revised Forwarding Condition MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 ECE369 66 Datapath with Forwarding ECE369 67 Can't always forward Need to stall for one cycle ECE369 68 Stalling • • Hardware detection and no-op insertion is called stalling Stall pipeline by keeping instruction in the same stage Program Time (in clock cycles) execution CC 1 CC 2 order (in instructions) lw $2, 20($1) and $4, $2, $5 or $8, $2, $6 IM CC 3 Reg IM CC 4 CC 5 DM Reg Reg Reg IM IM CC 6 CC 7 DM Reg Reg DM CC 8 CC 9 CC 10 Reg bubble add $9, $4, $2 IM slt $1, $6, $7 IM ECE369 DM Reg Reg Reg DM Reg 69 ECE369 70 Load-Use Hazard Detection Check when using instruction is decoded in ID stage ALU operand register numbers in ID stage are given by Load-use hazard when IF/ID.RegisterRs, IF/ID.RegisterRt ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) If detected, stall and insert bubble ECE369 71 How to Stall the Pipeline Force control values in ID/EX register to 0 EX, MEM and WB do nop (no-operation) Prevent update of PC and IF/ID register Using instruction is decoded again Following instruction is fetched again 1-cycle stall allows MEM to read data for lw Can subsequently forward to EX stage ECE369 72 Stall logic • Stall logic – If (ID/EX.MemRead) // Load word instruction AND – If ((ID/EX.Rt == IF/ID.Rs) or (ID/EX.Rt == IF/ID.Rt)) • Insert no-op (no-operation) – Deasserting all control signals • Stall following instruction – Not writing program counter – Not writing IF/ID registers PCWrite IF/ID.Rs IF/ID.Rt ECE369 ID/EX.Rt 73 Pipeline with hazard detection ECE369 74 Assume that register file is 
written in the first half and read in the second half of the clock cycle. load r2 <- mem(r1+0) r3 <- r3 + r2 load r4 <- mem(r2+r3) r4 <- r5 - r3 Cycles 1 2 3 4 5 6 7 8 9 ID EX ME WB IF ID S S EX ME WB IF S S ID EX ME WB S S IF ID S EX ; LOAD1 ; ADD ; LOAD2 ; SUB 10 11 12 13 load r2 <- mem(r1+0) IF r3 <- r3 + r2 load r4 <- mem(r2+r3) r4 <- r5 - r3 ECE369 ME WB 75 Summary ECE369 76 Multi-cycle ECE369 77 Multi-cycle ECE369 78 Multi-cycle Pipeline ECE369 79 Branch Hazards ECE369 80 Branch hazards • • When we decide to branch, other instructions are in the pipeline! We are predicting “branch not taken” – need to add hardware for flushing instructions if we are wrong Flush these instructions (Set control values to 0) PC ECE369 81 Solution to control hazards • Branch prediction – By executing next instruction we are predicting “branch not taken” – Need to add hardware for flushing instructions if we are wrong • Reduce branch penalty – By advancing the branch decision to ID stage – Compare the data read from two registers read in ID stage – Comparison for equality is a simpler design! (Why?) – Still need to flush instruction in IF stage • Make the hazard into a feature! – Always execute instruction following branch ECE369 82 Branch detection in ID stage Flush IF/ID if we miss-predict ECE369 83 Data Hazards for Branches If a comparison register is a destination of 2nd or 3rd preceding ALU instruction Can resolve using forwarding add $1, $2, $3 IF add $4, $5, $6 … ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM beq $1, $4, target ECE369 WB 84 Data Hazards for Branches If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction lw Need 1 stall cycle $1, addr IF add $4, $5, $6 beq stalled ID EX MEM WB IF ID EX MEM WB IF ID ID EX beq $1, $4, target ECE369 MEM WB 85 Data Hazards for Branches If a comparison register is a destination of immediately preceding load instruction lw Need 2 stall cycles $1, addr IF beq stalled ID EX IF ID beq stalled MEM WB ID beq $1, $0, target ID ECE369 EX MEM WB 86 Static Branch Prediction •Scheduling (reordering) code around delayed branch • need to predict branch statically at compile time • use profile information collected from earlier runs •Behavior of branch is often bimodally distributed! • Biased toward taken or not taken •Effectiveness depend on • frequency of branches and accuracy of the scheme 22% 18% 20% 15% 15% 12% 11% 12% 10% 9% 10% 4% 5% 6% Integer r 2c o p dl jd FP su d m hy dr o2 ea r li c do du c gc eq nt ot es t pr es so m pr e ss 0% co Integer benchmarks have higher branch frequency Misprediction Rate 25% 87 Four Branch Hazard Alternatives #1: Stall until branch direction is clear: branch penalty is fixed and can not be reduced by software (this is the example of MIPS) #2: Predict Branch Not Taken (treat every branch as not taken) – Execute successor instructions in sequence – “flush” instructions in pipeline if branch actually taken – 47% MIPS branches not taken on average – PC+4 already calculated, so use it to get next instruction ECE369 88 Four Branch Hazard Alternatives: #3: Predict Branch Taken (treat every branch as taken) As soon as the branch is decoded and the target address is computed, we assume the branch is taken and begin fetching and executing at the target address. – 53% MIPS branches taken on average – Because in our MIPS pipeline we don’t know the target address any earlier than we know the branch outcome, there is no advantage in this approach for MIPS. 
– MIPS still incurs 1 cycle branch penalty • Other machines: branch target known before outcome ECE369 89 Four Branch Hazard Alternatives #4: Delayed Branch – In a delayed branch, the execution cycle with a branch delay of length n is: branch instruction sequential successor1 sequential successor2 Branch delay of length n ........ sequential successorn branch target if taken These sequential successor instructions are in a branch-delay slots. The sequential successors are executed whether or not the branch is taken. The job of the compiler is to make the successor instructions valid and useful. ECE369 90 Scheduling Branch Delay Slots (Fig A.14) A. From before branch add $1,$2,$3 if $2=0 then delay slot becomes if $2=0 then add $1,$2,$3 B. From branch target C. From fall through sub $4,$5,$6 add $1,$2,$3 if $1=0 then add $1,$2,$3 if $1=0 then delay slot becomes Sub $4, $5, $6 add $1,$2,$3 if $1=0 then sub $4,$5,$6 ECE369 delay slot Or $7, $8, $ 9 sub $4,$5,$6 becomes add $1,$2,$3 if $1=0 then Or $7, $8, $9 Sub $4,$5,$6 91 Delayed Branch • • • Where to get instructions to fill branch delay slot? – Before branch instruction: this is the best choice if feasible. – From the target address: only valuable when branch taken – From fall through: only valuable when branch not taken Compiler effectiveness for single branch delay slot: – Fills about 60% of branch delay slots – About 80% of instructions executed in branch delay slots useful in computation – About 50% (60% x 80%) of slots usefully filled Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches – Growth in available transistors has made dynamic approaches relatively cheaper ECE369 92 Reducing branch penalty (loop unrolling) Source: For ( i=1000; i>0; i=i-1 ) x[i] = x[i] + s; Direct translation: – Loop: LD ADDD SD DADDUI BNE F0, 0 (R1); F4, F0, F2; F4, 0(R1) R1, R1, #-8 R1, R2, loop; R1 points x[1000] F2 = scalar value R2 last element Producer Consumer Latency FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Store double Store double 0 Assume 1 cycle latency from unsigned integer arithmetic to dependent instruction 93 Reducing stalls 1 2 3 4 5 6 7 8 9 • Pipeline Implementation: – Loop: Loop: LD F0, 0 (R1) stall ADDD F4, F0, F2 stall stall SD F4, 0(R1) DADDUI R1, R1, #-8 stall BNE R1, R2, loop stall LD DADDUI ADDD stall stall SD BNE Producer Consumer Latency FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Store double Store double 0 F0, 0 (R1) R1, R1, #-8 F4, F0, F2 F4, 8(R1) R1, R2, Loop 94 Loop Unrolling Loop LD ADDD SD LD ADDD SD LD ADDD SD LD ADDD SD DADDUI BNE F0, 0(R1) F4, F0, F2 F4, 0(R1) F6, -8 (R1) F8, F6, F2 F8, -8 (R1) F10, -16 (R1) F12, F10, F2 F12, -16 (R1) F14, -24 (R1) F16, F14, F2 F16, -24 (R1) R1, R1, #-32 R1, R2, Loop Producer Consumer Latency FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Store double Store double 0 ; drop SUBI & BNEZ ; drop SUBI & BNEZ ; drop SUBI & BNEZ 27 cycles: 14 instructions, 1 for each LD, 2 for each ADDD, 1 for DADDUI 95 Loop LD LD LD LD ADDD ADDD ADDD ADDD SD SD DADDUI SD SD BNE F0, 0(R1) F6, -8 (R1) F10, -16(R1) F14, -24(R1) F4, F0, F2 F8, F6, F2 F12, F10, F2 F16, F14, F2 F4, 0(R1) F8, -8 (R1) R1, R1, #-32 F12, -16 (R1) F16, 8(R1) R1, R2, Loop Design Issues: • Code size! 
• Instruction cache • Register space • Iteration dependence • Loop termination • Memory addressing 14 instructions (3.5 cycles per element vs. 9 cycles!) 96 Dynamic Branch Prediction In deeper and superscalar pipelines, branch penalty is more significant Use dynamic prediction Branch prediction buffer (aka branch history table) Indexed by recent branch instruction addresses Stores outcome (taken/not taken) To execute a branch Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction 1-Bit Branch Prediction •Branch History Table: Lower bits of PC address index table of 1-bit values • Says whether or not branch taken last time NT 0x40010100 0x40010104 for (i=0; i<100; i++) { …. } addi r10, r0, 100 addi r1, r1, r0 L1: …… …… … 0x40010A04 addi r1, r1, 1 0x40010A08 bne r1, r10, L1 …… 0x40010108 T 1-bit Branch History Table T NT T : : T NT Prediction 98 1-Bit Bimodal Prediction (SimpleScalar Term) • For each branch, keep track of what happened last time and use that outcome as the prediction • Change mind fast 99 1-Bit Branch Prediction •What is the drawback of using lower bits of the PC? • Different branches may have the same lower bit value •What is the performance shortcome of 1-bit BHT? • in a loop, 1-bit BHT will cause two mispredictions • End of loop case, when it exits instead of looping as before • First time through loop on next time through code, when it predicts exit instead of looping 100 2-bit Saturating Up/Down Counter Predictor •Solution: 2-bit scheme where change prediction only if get misprediction twice 2-bit branch prediction State diagram 101 2-Bit Bimodal Prediction (SimpleScalar Term) • For each branch, maintain a 2-bit saturating counter: if the branch is taken: counter = min(3,counter+1) if the branch is not taken: counter = max(0,counter-1) • If (counter >= 2), predict taken, else predict not taken • Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”) • Can be easily extended to N-bits (in most processors, N=2) 102 Branch History Table •Misprediction reasons: • Wrong guess for that branch • Got branch history of wrong branch when indexing the table 103 Branch History Table (4096-entry, 2-bits) Branch intensive benchmarks have higher miss rate. How can we solve this problem? Increase the buffer size or Increase the accuracy 104 Branch History Table (Increase the size?) Need to focus on increasing the accuracy of the scheme! 105 Correlated Branch Prediction • Standard 2-bit predictor uses local information • Fails to look at the global picture •Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch 106 Correlated Branch Prediction • A shift register captures the local path through the program • For each unique path a predictor is maintained • Prediction is based on the behavior history of each local path • Shift register length determines program region size 107 Correlated Branch Prediction •Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table • In general, (m,n) predictor means record last m branches to select between 2^m history tables each with n-bit counters • Old 2-bit BHT is then a (0,2) predictor If (aa == 2) aa=0; If (bb == 2) bb = 0; If (aa != bb) do something; 108 Correlated Branch Prediction Global Branch History: m-bit shift register keeping T/NT status of last m branches. 
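As a concrete illustration of the last two slides, here is a small C sketch (not from the lecture; the table size and names are invented) of a 2-bit saturating-counter branch history table, together with an (m,n)-style index that folds an m-bit global history register into the PC bits so that different global paths to the same branch use different counters. It follows the update rule given above: taken -> counter = min(3, counter+1), not taken -> counter = max(0, counter-1), predict taken when counter >= 2.

/* Sketch of a 2-bit bimodal BHT plus a global-history (gshare-style) index.
 * Sizes and names are illustrative, not taken from the lecture.             */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BHT_BITS 12                               /* 4096-entry table        */
#define BHT_SIZE (1u << BHT_BITS)
#define GHR_BITS 8                                /* m = 8 history bits      */

static uint8_t  bht[BHT_SIZE];                    /* 2-bit counters, init 0  */
static uint32_t ghr;                              /* global history register */

/* Bimodal: index with the low bits of the branch PC only. */
static uint32_t index_bimodal(uint32_t pc) { return (pc >> 2) & (BHT_SIZE - 1); }

/* Correlating: fold the last GHR_BITS branch outcomes into the index. */
static uint32_t index_global(uint32_t pc) {
    return ((pc >> 2) ^ (ghr & ((1u << GHR_BITS) - 1))) & (BHT_SIZE - 1);
}

static bool predict(uint32_t idx) { return bht[idx] >= 2; }

static void update(uint32_t idx, bool taken) {
    if (taken) { if (bht[idx] < 3) bht[idx]++; }  /* counter = min(3, c+1)   */
    else       { if (bht[idx] > 0) bht[idx]--; }  /* counter = max(0, c-1)   */
    ghr = (ghr << 1) | (taken ? 1u : 0u);         /* shift in the outcome    */
}

int main(void) {
    /* A loop branch taken 9 times and then not taken: the 2-bit counter
     * mispredicts twice while warming up from 0 and once at the loop exit.  */
    uint32_t pc = 0x40010A08;                     /* bne address from the slides */
    int mispredicts = 0;
    for (int i = 0; i < 10; i++) {
        bool actual = (i < 9);
        uint32_t idx = index_bimodal(pc);
        if (predict(idx) != actual) mispredicts++;
        update(idx, actual);
    }
    printf("bimodal mispredictions: %d out of 10\n", mispredicts);
    printf("(m,n)-style index for the same PC after this history: %u\n",
           (unsigned)index_global(pc));
    return 0;
}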
Accuracy of Different Schemes
Comparison of three predictors: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) correlating BHT.
[Chart: frequency of mispredictions (0-18%) across the SPEC89 benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li; the 1,024-entry (2,2) predictor matches or outperforms both 2-bit tables, even the one with unlimited entries.] 110

Tournament Predictors
• A local predictor might work well for some branches or programs, while a global predictor might work well for others
• Provide one of each and maintain another predictor to identify which predictor is best for each branch
[Diagram: the branch PC indexes both a local predictor and a global predictor; the tournament predictor drives a mux that selects between them.] 111

Tournament Predictors
• Multilevel branch predictor
• A selector chooses between the global and local predictors of correlating branch prediction
• Use an n-bit saturating counter to choose between predictors
• The usual choice is between a global and a local predictor 112

Tournament Predictors
• The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
• A typical tournament predictor selects the global predictor about 40% of the time for the SPEC integer benchmarks
• AMD Opteron and Phenom use a tournament-style predictor 113

Tournament Predictors (Intel Core i7)
• Based on the predictors used in the Core Duo chip
• Combines three different predictors:
– Two-bit
– Global history
– Loop exit predictor: uses a counter to predict the exact number of taken branches (the number of loop iterations) for a branch that is detected as a loop branch
• Tournament: tracks the accuracy of each predictor
• Main problem of speculation: a mispredicted branch may lead to another branch being mispredicted! 114

Branch Prediction
• Sophisticated techniques:
– A "branch target buffer" to help us look up the destination
– Correlating predictors that base the prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction is based on what happened in previous branches)
– Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
– A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
• Modern processors predict correctly about 95% of the time! 115

Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls the instruction fetch.
•BTB stores PCs the same way as caches •The PC of a branch is sent to the BTB •When a match is found the corresponding Predicted PC is returned •If the branch was predicted taken, instruction fetch continues at the returned predicted PC 116 Branch Target Buffers (BTB) 117 Pipeline without Branch Predictor IF (br) PC Reg Read Compare Br-target PC + 4 In the 5-stage pipeline, a branch completes in two cycles If the branch went the wrong way, one incorrect instr is fetched One stall cycle per incorrect branch 118 Pipeline with Branch Predictor IF (br) PC Branch Predictor Reg Read Compare Br-target 119 Other issues in pipelines • • • • • • Exceptions – Errors in ALU for arithmetic instructions – Memory non-availability Exceptions lead to a jump in a program However, the current PC value must be saved so that the program can return to it back for recoverable errors Multiple exception can occur in a pipeline Preciseness of exception location is important in some cases I/O exceptions are handled in the same manner ECE369 120 Exceptions ECE369 121 Improving Performance • Try and avoid stalls! E.g., reorder these instructions: lw lw sw sw $t0, $t2, $t2, $t0, 0($t1) 4($t1) 0($t1) 4($t1) • Dynamic Pipeline Scheduling – Hardware chooses which instructions to execute next – Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!) – Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect) • Trying to exploit instruction-level parallelism ECE369 122 Dynamically scheduled pipeline ECE369 123 Dynamic Scheduling using Tomasulo’s Method 1 FIFO 2 3 124 Where is the store queue? 125 Advanced Pipelining • • • • Increase the depth of the pipeline Start more than one instruction each cycle (multiple issue) Loop unrolling to expose more ILP (better scheduling) “Superscalar” processors – DEC Alpha 21264: 9 stage pipeline, 6 instruction issue • All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”) • VLIW: very long instruction word, static multiple issue (relies more on compiler technology) • This class has given you the background you need to learn more! ECE369 126 Superscalar architecture -Two instructions executed in parallel ECE369 127 VLIW: Very Large Instruction Word • Each “instruction” has explicit coding for multiple operations – In IA-64, grouping called a “packet” – In Transmeta, grouping called a “molecule” (with “atoms” as ops) – Moderate LIW also used in Cray/Tera MTA-2 • Tradeoff instruction space for simple decoding – The long instruction word has room for many operations – By definition, all the operations the compiler puts in one long instruction word are independent => can execute in parallel – E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide – Need compiling techniques to schedule across several branches (called “trace scheduling”) 128 Thrice Unrolled Loop that Eliminates Stalls for Scalar Pipeline Computers 1 Loop: 2 3 4 5 6 7 8 9 10 11 L.D L.D L.D ADD.D ADD.D ADD.D S.D S.D DSUBUI BNEZ S.D F0,0(R1) F6,-8(R1) F10,-16(R1) F4,F0,F2 F8,F6,F2 F12,F10,F2 0(R1),F4 -8(R1),F8 R1,R1,#24 R1,LOOP 8(R1),F12 Minimum times between pairs of instructions: L.D to ADD.D: 1 Cycle ADD.D to S.D: 2 Cycles A single branch delay slot follows the BNEZ. 
; 8-24 = -16 11 clock cycles, or 3.67 per iteration 129 Loop Unrolling in VLIW L.D to ADD.D: +1 Cycle ADD.D to S.D: +2 Cycles Memory reference 1 Memory reference 2 FP operation 1 1 Loop: 2 3 4 5 6 7 8 9 10 11 FP op. 2 L.D L.D L.D ADD.D ADD.D ADD.D S.D S.D DSUBUI BNEZ S.D Int. op/ branch F0,0(R1) F6,-8(R1) F10,-16(R1) F4,F0,F2 F8,F6,F2 F12,F10,F2 0(R1),F4 -8(R1),F8 R1,R1,#24 R1,LOOP 8(R1),F12 Clock 130 Loop Unrolling in VLIW L.D to ADD.D: +1 Cycle ADD.D to S.D: +2 Cycles Memory reference 1 Memory reference 2 FP operation 1 1 Loop: 2 3 4 5 6 7 8 9 10 11 FP op. 2 L.D L.D L.D ADD.D ADD.D ADD.D S.D S.D DSUBUI BNEZ S.D Int. op/ branch F0,0(R1) F6,-8(R1) F10,-16(R1) F4,F0,F2 F8,F6,F2 F12,F10,F2 0(R1),F4 -8(R1),F8 R1,R1,#24 R1,LOOP 8(R1),F12 Clock L.D F0,0(R1) 1 S.D 0(R1),F4 2 3 4 5 6 ADD.D F4,F0,F2 131 Loop Unrolling in VLIW L.D to ADD.D: +1 Cycle ADD.D to S.D: +2 Cycles Memory reference 1 Memory reference 2 L.D F0,0(R1) L.D F6,-8(R1) FP operation 1 L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) ADD.D L.D F26,-48(R1) ADD.D ADD.D S.D 0(R1),F4 S.D -8(R1),F8 ADD.D S.D -16(R1),F12 S.D -24(R1),F16 S.D -32(R1),F20 S.D -40(R1),F24 S.D 8(R1),F28 1 Loop: 2 3 4 5 6 7 8 9 10 11 FP op. 2 L.D L.D L.D ADD.D ADD.D ADD.D S.D S.D DSUBUI BNEZ S.D Int. op/ branch F0,0(R1) F6,-8(R1) F10,-16(R1) F4,F0,F2 F8,F6,F2 F12,F10,F2 0(R1),F4 -8(R1),F8 R1,R1,#24 R1,LOOP 8(R1),F12 Clock 1 2 F4,F0,F2 ADD.D F8,F6,F2 3 F12,F10,F2 ADD.D F16,F14,F2 4 F20,F18,F2 ADD.D F24,F22,F2 5 F28,F26,F2 6 7 DSUBUI R1,R1,#56 8 BNEZ R1,LOOP 9 Unrolled 7 times to avoid stall delays from ADD.D to S.D 7 results in 9 clocks, or 1.3 clocks per iteration (2.8X: 1.3 vs 3.67) Average: 2.5 ops per clock (23 ops in 45 slots), 51% efficiency Note: 8, not -48, after DSUBUI R1,R1,#56 - which may be out of place. See next slide. Note: We needed more registers in VLIW (used 15 pairs vs. 6 in SuperScalar) 132 Problems with 1st Generation VLIW • Increase in code size – generating enough operations in a straight-line code fragment requires ambitiously unrolling loops – whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding • Operated in lock-step; no hazard detection HW – a stall in any functional unit pipeline caused entire processor to stall, since all functional units must be kept synchronized – Compiler might predict function unit stalls, but cache stalls are hard to predict • Binary code incompatibility – Pure VLIW => different numbers of functional units and unit latencies require different versions of the code 133 Multiple Issue Processors • Exploiting ILP – Unrolling simple loops – More importantly, able to exploit parallelism in a less structured code size • Modern Processors: – Multiple Issue – Dynamic Scheduling – Speculation 134 Multiple Issue, Dynamic, Speculative Processors • How do you issue two instructions concurrently? – What happens at the reservation station if two instructions issued concurrently have true dependency? – Solution 1: » Issue first during first half and Issue second instruction during second half of the clock cycle – Problem: » Can we issue 4 instructions? – Solution 2: » Pipeline and widen the issue logic » Make instruction issue take multiple clock cycles! 
– Problem: » Can not pipeline indefinitely, new instructions issued every clock cycle » Must be able to assign reservation station » Dependent instruction that is being used should be able to refer to the correct reservation stations for its operands • Issue step is the bottleneck in dynamically scheduled superscalars! 135 ARM Cortex-A8 and Intel Core i7 • A8: • • •I7: • • • Multiple issue iPad, Motorola Droid, iPhones Multiple issue high end dynamically scheduled speculative High-end desktops, server 136 ARM Cortex-A8 • A8 Design goal: low power, reasonably high clock rate • Dual-issue • Statically scheduled superscalar • Dynamic issue detection • Issue one or two instructions per clock (in-order) • 13 stage pipeline • Fully bypassing • Dynamic branch predictor • 512-entry, 2-way set associative branch target buffer • 4K-entry global history buffer • If branch target buffer misses • Prediction through global history buffer • 8-entry return address stack • I7: aggressive 4-issue dynamically scheduled speculative 137 pipeline ARM Cortex-A8 The basic structure of the A8 pipeline is 13 stages. Three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline. This yields a 13-cycle branch misprediction penalty. The instruction fetch unit tries to keep the 12-entry instruction queue filled. 138 ARM Cortex-A8 The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline. 139 ARM Cortex-A8 The five-stage instruction decode of the A8. Multiply operations are always performed in ALU pipeline 0. 140 ARM Cortex-A8 Figure 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. eon deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiples, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1 and L2 generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to 141 obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction. ARM Cortex-A8 vs A9 A9: Issue 2 instructions/clk Dynamic scheduling Speculation Figure 3.40 The performance ratio for the A9 compared to the A8, both using a 1 GHz clock and the same size caches for L1 and L2, shows that the A9 is about 1.28 times faster. Both runs use a 32 KB primary cache and a 1 MB secondary cache, which is 8-way set associative for the A8 and 16-way for the A9. The block sizes in the caches are 64 bytes for the A8 and 32 bytes for the A9. 
As mentioned in the caption of Figure 3.39, eon makes intensive use of integer multiply, and the combination of dynamic scheduling and a faster multiply pipeline significantly improves performance on the A9. twolf experiences a small slowdown, likely because its cache behavior is worse with the smaller L1 block size of the A9. 142

Intel Core i7
• The total pipeline depth is 14 stages.
• There are 48 load and 32 store buffers.
• The six independent functional units can each begin execution of a ready micro-op in the same cycle. 143

Intel Core i7
• Instruction fetch:
– Multilevel branch target buffer
– Return address stack (for function returns)
– Fetch 16 bytes from the instruction cache into a 16-byte predecode instruction buffer
– Macro-op fusion: a compare followed by a branch is fused into one instruction
– Break the 16 bytes into individual instructions and place them into an 18-entry queue 144

Intel Core i7
• Micro-op decode: translate x86 instructions into micro-ops (directly executable by the pipeline)
– Generate up to 4 micro-ops/cycle and place them into a 28-entry buffer
• Micro-op buffer:
– Loop stream detection: for a small sequence of instructions in a loop (< 28 instructions), fetch and decode are eliminated
– Microfusion: fuse load/ALU and ALU/store pairs and issue them to a single reservation station 145

Intel Core i7 vs. Atom 230 (45 nm technology)
(Columns: Intel i7 920 / ARM A8 / Intel Atom 230)
Cores: 4, each with FP / 1, no FP / 1, with FP
Clock rate: 2.66 GHz / 1 GHz / 1.66 GHz
Power: 130 W / 2 W / 4 W
Cache: 3-level, all 4-way, 128 I, 64 D, 512 L2 / 1-level, fully associative, 32 I, 32 D / 2-level, all 4-way, 16 I, 16 D, 64 L2
Pipeline: 4 ops/cycle, speculative, out-of-order / 2 ops/cycle, in-order, dynamic issue / 2 ops/cycle, in-order, dynamic issue
Branch prediction: two-level / two-level, 512-entry BTB, 4K-entry global history, 8-entry return stack / two-level 146

Intel Core i7 vs. Atom 230 (45 nm technology)
Figure 3.45 The relative performance and energy efficiency for a set of single-threaded benchmarks shows that the i7 920 is 4 to over 10 times faster than the Atom 230 but that it is about 2 times less power efficient on average! Performance is shown in the columns as i7 relative to Atom, i.e., execution time (Atom)/execution time (i7). Energy is shown with the line as energy (Atom)/energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. [2011]. The SPEC benchmarks were compiled with optimization on using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) HotSpot Java VM. Only one core is active on the i7, and the rest are in deep power-saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency. Copyright © 2011, Elsevier Inc. All rights Reserved.
Improving Performance • Techniques to increase performance: pipelining improves clock speed increases number of in-flight instructions hazard/stall elimination branch prediction register renaming out-of-order execution bypassing increased pipeline bandwidth 148 Deep Pipelining • Increases the number of in-flight instructions • Decreases the gap between successive independent instructions • Increases the gap between dependent instructions • Depending on the ILP in a program, there is an optimal pipeline depth • Tough to pipeline some structures; increases the cost of bypassing 149 Increasing Width • Difficult to find more than four independent instructions • Difficult to fetch more than six instructions (else, must predict multiple branches) • Increases the number of ports per structure 150 Reducing Stalls in Fetch • Better branch prediction novel ways to index/update and avoid aliasing cascading branch predictors • Trace cache stores instructions in the common order of execution, not in sequential order in Intel processors, the trace cache stores pre-decoded instructions 151 Reducing Stalls in Rename/Regfile • Larger ROB/register file/issue queue • Virtual physical registers: assign virtual register names to instructions, but assign a physical register only when the value is made available • Runahead: while a long instruction waits, let a thread run ahead to prefetch (this thread can deallocate resources more aggressively than a processor supporting precise execution) • Two-level register files: values being kept around in the register file for precise exceptions can be moved to 2nd level 152 Performance beyond single thread ILP •There can be much higher natural parallelism in some applications (e.g., Database or Scientific codes) •Explicit Thread Level Parallelism or Data Level Parallelism •Thread: process with own instructions and data • thread may be a process part of a parallel program of multiple processes, or it may be an independent program • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute •Data Level Parallelism: Perform identical operations on data, and lots of data 153 Thread-Level Parallelism • Motivation: a single thread leaves a processor under-utilized for most of the time by doubling processor area, single thread performance barely improves • Strategies for thread-level parallelism: multiple threads share the same large processor reduces under-utilization, efficient resource allocation Simultaneous Multi-Threading (SMT) each thread executes on its own mini processor simple design, low interference between threads Chip Multi-Processing (CMP) 154 New Approach: Mulithreaded Execution •Multithreading: multiple threads to share the functional units of 1 processor via overlapping • processor must duplicate independent state of each thread e.g., a separate copy of register file, a separate PC, and for running independent programs, a separate page table • memory shared through the virtual memory mechanisms, which already support multiple processes • HW for fast thread switch; much faster than full process switch ~100s to 1000s of clocks •When switch? • Alternate instruction per thread (fine grain) • When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain) 155 M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes Simultaneous Multi-threading ... 
One thread, 8 units Cycle M M FX FX FP FP BR CC Two threads, 8 units Cycle M M FX FX FP FP BR CC 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 156 Time (processor cycle) Multithreaded Categories Superscalar Fine-Grained Coarse-Grained Thread 1 Thread 2 Multiprocessing Thread 3 Thread 4 Simultaneous Multithreading Thread 5 Idle slot 157 Data-Level Parallelism in Vector, SIMD, and GPU Architectures 158 SIMD Variations •Vector architectures •SIMD extensions • MMX: Multimedia Extensions (1996) • SSE: Streaming SIMD Extensions • AVX: Advanced Vector Extension (2010) •Graphics Processor Units (GPUs) 159 Vector Architectures •Basic idea: • Read sets of data elements into “vector registers” • Operate on those registers • Disperse the results back into memory 160 VMIPS Instructions 161 Example: VMIPS •Vector registers • Each register holds a 64-element, 64 bits/element vector • Register file has 16 read ports and 8 write ports •Vector functional units • Fully pipelined • Data and control hazards are detected •Vector load-store unit • Fully pipelined • Words move between registers • One word per clock cycle after initial latency •Scalar registers • 32 general-purpose registers • 32 floating-point registers 162 VMIPS Instructions •Example: DAXPY L.D LV MULVS.D LV ADDVV SV F0,a V1,Rx V2,V1,F0 V3,Ry V4,V2,V3 Ry,V4 ;load scalar a ;load vector X ;vector-scalar mult ;load vector Y ;add ;store result • In MIPS Code • ADD waits for MUL, SD waits for ADD • In VMIPS • Stall once for the first vector element, subsequent elements will flow smoothly down the pipeline. • Pipeline stall required once per vector instruction! 163 Vector Chaining Advantage • Without chaining, must wait for last element of result to be written before starting dependent instruction Load Mul Time Add • With chaining, can start dependent instruction as soon as first result appears Load Mul Add Vector Chaining • Vector version of register bypassing • Chaining • Allows a vector operation to start as soon as the individual elements of its vector source operand become available • Results from the first functional unit are forwarded to the second unit V 1 LV v1 MULV v3,v1,v2 ADDV v5, v3, v4 V 2 Chain Load Unit Memory V 3 V 4 Chain Mult. Add V 5 VMIPS Instructions •Flexible • 64 64-bit / 128 32-bit / 256 16-bit, 512 8-bit • Matches the need of multimedia (8bit), scientific applications that require high precision. 166 Vector Instruction Execution ADDV C,A,B Four-lane execution using four pipelined functional units Execution using one pipelined functional unit A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27] A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23] A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19] A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15] C[2] C[8] C[9] C[10] C[11] C[1] C[4] C[5] C[6] C[7] C[0] C[0] C[1] C[2] C[3] Multiple Lanes •Element n of vector register A is “hardwired” to element n of vector register B • Allows for multiple hardware lanes • No communication between lanes • Little increase in control overhead • No need to change machine code Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance! 
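To make the lane structure concrete, here is a small C sketch (illustrative only; it models the element-to-lane assignment, not the VMIPS hardware) of how a 4-lane vector unit would execute ADDV C,A,B over 64-element vector registers. Lane k owns elements k, k+4, k+8, ..., so the lanes never have to communicate, and with 4 lanes the 64 element operations complete in 16 beats instead of 64.

/* Sketch of 4-lane execution of ADDV C,A,B on 64-element vector registers.
 * Illustrative only: lane k processes elements k, k+NLANES, k+2*NLANES, ...,
 * so no data ever has to cross between lanes.                               */
#include <stdio.h>

#define VLEN   64                 /* elements per vector register (as in VMIPS) */
#define NLANES 4                  /* hardware lanes                             */

static double A[VLEN], B[VLEN], C[VLEN];

static void addv_lanes(void) {
    /* Each iteration of the outer loop is one "beat": in hardware, the
     * NLANES element additions below would all happen in the same cycle.   */
    for (int beat = 0; beat < VLEN / NLANES; beat++) {
        for (int lane = 0; lane < NLANES; lane++) {
            int n = beat * NLANES + lane;   /* element n is wired to lane n % NLANES */
            C[n] = A[n] + B[n];
        }
    }
}

int main(void) {
    for (int i = 0; i < VLEN; i++) { A[i] = i; B[i] = 2.0 * i; }
    addv_lanes();
    printf("C[0]=%g  C[63]=%g  (beats = %d)\n", C[0], C[63], VLEN / NLANES);
    return 0;
}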
168 Memory Banks •Memory system must be designed to support high bandwidth for vector loads and stores •Spread accesses across multiple banks • Control bank addresses independently • Load or store non sequential words • Support multiple vector processors sharing the same memory 169 Vector Summary • Vector is alternative model for exploiting ILP • If code is vectorizable, then simpler hardware, energy efficient, and better real-time model than out-of-order • More lanes, slower clock rate! • Scalable if elements are independent • If there is dependency • One stall per vector instruction rather than one stall per vector element • Programmer in charge of giving hints to the compiler! • Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations • Fundamental design issue is memory bandwidth 170 • Especially with virtual address translation and caching SIMD •Implementations: • Intel MMX (1996) • Repurpose 64-bit floating point registers • Eight 8-bit integer ops or four 16-bit integer ops • Streaming SIMD Extensions (SSE) (1999) • Separate 128-bit registers • Eight 16-bit ops, Four 32-bit ops or two 64-bit ops • Single precision floating point arithmetic • Double-precision floating point in • SSE2 (2001), SSE3(2004), SSE4(2007) • Advanced Vector Extensions (2010) 171 • Four 64-bit integer/fp ops SIMD •Implementations: • Advanced Vector Extensions (2010) • Doubles the width to 256 bits • Four 64-bit integer/fp ops • Extendible to 512 and 1024 bits for future generations • Operands must be consecutive and aligned memory locations 172 Example SIMD •Example DXPY: L.D MOV MOV MOV DADDIU Loop: L.4D MUL.4D L.4D ADD.4D S.4D DADDIU DADDIU DSUBU BNEZ F0,a F1, F0 F2, F0 F3, F0 R4,Rx,#512 ;load scalar a ;copy a into F1 for SIMD MUL ;copy a into F2 for SIMD MUL ;copy a into F3 for SIMD MUL ;last address to load F4,0[Rx] F4,F4,F0 F8,0[Ry] F8,F8,F4 0[Ry],F8 Rx,Rx,#32 Ry,Ry,#32 R20,R4,Rx R20,Loop ;load X[i], X[i+1], X[i+2], X[i+3] ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3] ;load Y[i], Y[i+1], Y[i+2], Y[i+3] ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3] ;store into Y[i], Y[i+1], Y[i+2], Y[i+3] ;increment index to X ;increment index to Y ;compute bound ;check if done 173 SIMD extensions •Meant for programmers to utilize •Not for compilers to generate • Recent x86 compilers • Capable for FP intensive apps • Why is it popular? • Costs little to add to the standard arithmetic unit • Easy to implement • Need smaller memory bandwidth than vector • Separate data transfers aligned in memory • Vector: single instruction , 64 memory accesses, page fault in the middle of the vector likely! • Use much smaller register space • Fewer operands • No need for sophisticated mechanisms of vector 174 architecture Graphics Processing Unit •Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications? •Basic idea: • Heterogeneous execution model • CPU is the host, GPU is the device • Develop a C-like programming language for GPU • Unify all forms of GPU parallelism as CUDA thread • Programming model: “Single Instruction Multiple Thread” 175 Graphics Processing Unit •CUDA’s design goals extend a standard sequential programming language, specifically C/C++, focus on the important issues of parallelism—how to craft efficient parallel algorithms—rather than grappling with the mechanics of an unfamiliar and complicated language. 
minimalist set of abstractions for expressing parallelism highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores. 176 GTX570 GPU Global Memory 1,280MB L2 Cache 640KB Texture Cache 8KB Up to 1536 Threads/SM L1 Cache 16KB Constant Cache 8KB SM 0 Shared Memory 48KB Registers 32,768 SM 14 Shared Memory 48KB Registers 32,768 32 cores 32 cores 177 Programming the GPU • CUDA Programming Model – Single Instruction Multiple Thread (SIMT) • • • • A thread is associated with each data element Threads are organized into blocks Blocks are organized into a grid GPU hardware handles thread management, not applications or OS GTX570 GPU • 32 threads within a block work collectively Memory access optimization, latency hiding 179 GTX570 GPU Kernel Grid Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 8 Block 9 Block 10 Block 11 Block 12 Block 13 Block 14 Block 15 Device with 4 Multiprocessors MP 0 MP 1 MP 2 MP 3 Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 • Up to 1024 Threads/Block and 8 Active Blocks per SM Programming the GPU Matrix Multiplication Matrix Multiplication • For a 4096x4096 matrix multiplication - Matrix C will require calculation of 16,777,216 matrix cells. • On the GPU each cell is calculated by its own thread. • We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel. • On a general purpose processor we can only calculate one cell at a time. • Each thread exploits the GPUs fine granularity by computing one element of Matrix C. • Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth. Solving Systems of Equations Thread Organization •If we expand to 4096 equations, we can process each row completely in parallel with 4096 threads •We will require 4096 kernel launches. One for each equation Results CPU Configuration: Intel Xeon @2.33GHz with 2GB RAM GPU Configuration: NVIDIA Tesla C1060 @1.3GHz *For single precision, speedup improves by at least a factor of 2X Execution time includes data transfer from host to device Programming the GPU • Distinguishing execution place of functions: _device_ or _global_ => GPU Device Variables declared are allocated to the GPU memory _host_ => System processor (HOST) • Function call Name<<dimGrid, dimBlock>>(..parameter list..) blockIdx: block identifier threadIdx: threads per block identifier blockDim: threads per block Programming the GPU //Invoke DAXPY daxpy(n,2.0,x,y); //DAXPY in C void daxpy(int n, double a, double* x, double* y) { for (int i=0;i<n;i++) y[i]= a*x[i]+ y[i] } Programming the GPU //Invoke DAXPY with 256 threads per Thread Block _host_ int nblocks = (n+255)/256; daxpy<<<nblocks, 256>>> (n,2.0,x,y); //DAXPY in CUDA _device_ void daxpy(int n,double a,double* x,double* y){ int i=blockIDx.x*blockDim.x+threadIdx.x; if (i<n) y[i]= a*x[i]+ y[i] } Programming the GPU • CUDA • Hardware handles thread management • Invisible to the programmer (productivity), • Performance programmers need to know the operation principles of the threads! • Productivity vs. performance • How much power to be given to the programmer, CUDA is still evolving! Efficiency Considerations • Avoid execution divergence – threads within a warp follow different execution paths. – Divergence between warps is ok • Allow loading a block of data into SM – process it there, and then write the final result back out to external memory. 
• Coalesce memory accesses – Access executive words instead of gather-scatter • Create enough parallel work – 5K to 10K threads Efficiency Considerations • GPU Architecture – Each SM executes multiple warps in a time-sharing fashion while one or more are waiting for memory values • Hiding the execution cost of warps that are executed concurrently. – How many memory requests can be serviced and how many warps can be executed together while one warp is waiting for memory values. Easy to Learn Takes time to master Logic Design 194 State Elements • Unclocked vs. Clocked • Clocks used in synchronous logic • Clocks are needed in sequential logic to decide when an element that contains state should be updated. State element 1 Clock cycle Combinational logic State element 2 Latches and Flip-flops C Q _ Q D Latches and Flip-flops D D C C Q D latch D C Q D latch Q Q Q Latches and Flip-flops Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge (edge-triggered methodology) SRAM SRAM vs. DRAM Which one has a better memory density? static RAM (SRAM): value stored in a cell is kept on a pair of inverting gates dynamic RAM (DRAM), value kept in a cell is stored as a charge in a capacitor. DRAMs use only a single transistor per bit of storage, By comparison, SRAMs require four to six transistors per bit Which one is faster? In DRAMs, the charge is stored on a capacitor, so it cannot be kept indefinitely and must periodically be refreshed. (called dynamic) Synchronous RAMs ?? is the ability to transfer a burst of data from a series of sequential addresses within an array or row.