ECE232: Hardware Organization and Design
Part 11: Pipelining
Chapter 4/6
http://www.ecs.umass.edu/ece/ece232/
Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

CPI Calculation
• CPI stands for the average number of Cycles Per Instruction
• Assume an instruction mix of 24% loads, 12% stores, 44% R-format, 18% branches, and 2% jumps
• CPI = 0.24 * 5 + 0.12 * 4 + 0.44 * 4 + 0.18 * 3 + 0.02 * 3 = 4.04
• Speedup?
• Question: Can we achieve a CPI of 1?

ECE232: Pipelining I. Adapted from Computer Organization and Design, Patterson & Hennessy, UCB, Kundu, UMass. Koren

Speeding up through pipelining
• Ann, Brian, Cathy, Dave (loads A, B, C, D) each have one load of clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 30 minutes
  • "Folder" takes 30 minutes
  • "Stasher" takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; task order A, B, C, D; sixteen 30-minute slots, four per load, one load after another]
• Sequential laundry takes 8 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to 2 AM; task order A, B, C, D; seven 30-minute slots with the loads overlapped]
• Pipelined laundry takes 3.5 hours for 4 loads!
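
The two calculations above (the weighted-average CPI and the laundry timelines) can be checked with a few lines of Python. This is only a sketch: the function names are illustrative and not part of the course material, and the pipelined formula assumes every stage takes the same time and that a new load can start as soon as the first stage is free.

```python
# Illustrative helpers (hypothetical names, not from the slides).

def average_cpi(mix):
    """Weighted-average CPI from (fraction_of_instructions, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


def sequential_minutes(loads, stages, stage_minutes):
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """Assumes equal-length stages: the first load needs `stages` slots,
    and every further load finishes one slot after the previous one."""
    return (stages + loads - 1) * stage_minutes


# Instruction mix from the CPI slide: loads, stores, R-format, branches, jumps.
MIX = [(0.24, 5), (0.12, 4), (0.44, 4), (0.18, 3), (0.02, 3)]

if __name__ == "__main__":
    print(f"CPI = {average_cpi(MIX):.2f}")
    print(f"sequential laundry: {sequential_minutes(4, 4, 30) / 60} hours")
    print(f"pipelined laundry:  {pipelined_minutes(4, 4, 30) / 60} hours")
```

With 4 loads and 4 stages of 30 minutes each, the sketch reproduces the 8 hours and 3.5 hours quoted on the slides.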

Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM; task order A, B, C, D; seven 30-minute slots]
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup

Pipelining Instructions
• Fetch = 10 ns
• Decode = 6 ns
• Execute = 8 ns
• Memory = 10 ns
• Write back = 6 ns
[Figure: instructions vs. time (in cycles); each instruction passes through F, D, EX, M, W, starting one cycle after the previous instruction]

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: three clock timing diagrams. Single Cycle Implementation: Load, then Store with wasted time, over cycles 1 and 2. Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem), then R-type (Ifetch, ...), over cycles 1 to 10. Pipeline Implementation: Load, Store, and R-type overlapped in time]

Why Pipeline?
Suppose we execute 100 instructions
• Single Cycle Machine
  • 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
  • 10 ns/cycle x 4.04 CPI (for the given inst mix) x 100 inst = 4040 ns
  • Instruction mix of 24% loads, 12% stores, 44% R-format, 18% branches, and 2% jumps
• Ideal pipelined machine (with 5 stages)
  • 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
• Speedup = 4.33 vs. single-cycle, 3.88 vs.
multi-cycle (for the given inst mix)

Why Pipeline? Because the resources are there!
[Figure: instruction order (Inst 1 to Inst 5) vs. time (clock cycles); each instruction uses Im, Reg, ALU, Dm, Reg in successive cycles]

Pipelining Rules
[Figure: Inst 1 to Inst 5 occupying IMem, Reg, ALU, DMem, Reg]
• Forward-traveling signals at each stage are latched
• Only perform logic on signals in the same stage
  • signal labeling is useful to prevent errors, e.g., IRR, IRA, IRM, IRW
• Backward-traveling signals at each stage represent hazards

MIPS Pipelined Datapath
• State registers between pipeline stages isolate them
• Stages: IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack
[Figure: pipelined datapath with PC, Instruction Memory, Register File, Sign Extend (16 to 32), Shift left 2, two adders, ALU, and Data Memory, separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB state registers and driven by the system clock; Inst 1 to Inst 5 occupy the five stages]

Pipeline Hazards
• Data hazards: an instruction uses the result of a previous instruction (RAW)
    ADD R1, R2, R3      or      SW R1, 4(R2)
    SUB R4, R1, R5              LW R3, 4(R2)
• Control hazards: the address of the next instruction to be executed depends on a previous instruction
          BEQ R1,R2,CONT
          SUB R6,R7,R8
          …
    CONT: ADD R3,R4,R5
• Structural hazards: two instructions need access to the same resource
  • e.g., single memory shared for instruction fetch and load/store

Structural
Hazard
[Figure: instruction order (lw, Inst 1 to Inst 4) vs. time (clock cycles) with a single memory (Mem); annotations: "Reading data from memory" and "Reading instruction from memory"]
• Fix with separate instruction and data memories (I$ and D$)

Data Hazards (RAW)
    ADD R1, R2, R3
    SUB R4, R1, R5
[Figure: instructions vs. time (in cycles); ADD and SUB each pass through F, D, EX, M, W; ADD is annotated "Write data to R1 here" and SUB is annotated "Get data from R1 here"]

One Way to Handle a Data Hazard
• By waiting (introducing stalls), but this impacts CPI
[Figure: add $1,… followed by three stall cycles before sub $4,$1,$5; stages IM, Reg, ALU, DM, Reg]

Must allow Wr/Rd in REG in same cycle
• Split the cycle into two halves
[Figure: Inst 1 to Inst 5 vs. time (clock cycles); stages Im, Reg, ALU, Dm, Reg]

Only two stall cycles
• Write in 1st half, read in 2nd half
[Figure: add $1,… followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7; stages IM, Reg, ALU, DM, Reg]

Another Way to "Fix" a Data Hazard
• By forwarding
[Figure: instruction order vs. time; stages IM, Reg, ALU, DM, Reg; the sequence begins with add $1,… and sub $4,$1,$5]
[Figure, continued: the remaining instructions in the sequence are and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5]

Register File (write and then read)
• Fix the register file access hazard by doing writes in the first half of the cycle and reads in the second half
[Figure: add $1,…; Inst 1; Inst 2; or $8,$1,$9 vs. time (clock cycles); stages IM, Reg, ALU, DM, Reg; annotation: "clock edge that controls loading of pipeline state registers"]

Internal data forwarding
• Fix data hazards by forwarding results as soon as they are available to where they are needed
• ALU-to-ALU forwarding vs. full forwarding
[Figure: add $1,…; sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 vs. time; stages IM, Reg, ALU, DM, Reg]

Forwarding with Load-use Data Hazards
[Figure: lw $1,4($2); sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 vs. time; stages IM, Reg, ALU, DM, Reg]
• sub needs to stall
• Will still need one stall cycle even with forwarding

Injecting Bubbles
Stage contents in two successive cycles:
    IF     ID     EX       MEM       WB
    and    sub    lw       Inst -1   Inst -2
    and    sub    bubble   lw        Inst -1
[Figure: MIPS pipelined datapath (PC, Instruction Memory, Register File, Sign Extend, Shift left 2, adders, ALU, Data Memory, with the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB state registers and the system clock), labeled with the instructions and, sub, lw, Inst -1, Inst -2]

3 Types of Data Hazards
• RAW (read after write)
  • only hazard for 'fixed' pipelines
  • later instruction must read after earlier instruction writes
      add $1,$2,$3    F D EX M W
      sub $4,$1,$5      F D EX M W
• WAW (write after write)
  • variable-length pipeline
  • later instruction must write after earlier instruction writes
      div $1,$4,$3    F D E1 E2 E3 E4 E5 W
      add $1,$2,$5      F D EX M W
• WAR (write after read)
  • instruction with late read (e.g., waiting for an execution unit)
  • later instruction must write after earlier instruction reads
      mlt $4,$1,$3    F D s1 s2 s3 s4 s5 E1 E2 E3 W
      add $1,$2,$5      F D EX M W

Control Hazard
    XX: JR  R25
        ...
        ADD ...
[Figure: instructions vs. time (in cycles); both instructions pass through F, D, EX, M, W; the first is annotated "Destination available here" and the second "Need destination here"]
• Simple solution: flush instruction fetch until the branch is resolved
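
The three data-hazard types listed earlier (RAW, WAW, WAR) come down to comparing the destination and source registers of an earlier and a later instruction. The sketch below shows that comparison in Python. The `Instr` type and `data_hazards` function are hypothetical names introduced here for illustration, and the check reports only register dependences: whether a dependence actually causes a hazard depends on the pipeline (per the slides, only RAW matters in a 'fixed' pipeline).

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: one destination, some sources."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def data_hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the register dependences between two instructions in program order."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if earlier.dest == later.dest:
        found.add("WAW")  # both write the same register
    if later.dest in earlier.srcs:
        found.add("WAR")  # later writes what earlier reads
    return found


if __name__ == "__main__":
    # Instruction pairs taken from the "3 Types of Data Hazards" slide.
    pairs = [
        (Instr("add", "$1", ("$2", "$3")), Instr("sub", "$4", ("$1", "$5"))),
        (Instr("div", "$1", ("$4", "$3")), Instr("add", "$1", ("$2", "$5"))),
        (Instr("mlt", "$4", ("$1", "$3")), Instr("add", "$1", ("$2", "$5"))),
    ]
    for earlier, later in pairs:
        print(earlier.op, "->", later.op, sorted(data_hazards(earlier, later)))
```

A pair can carry more than one dependence at once, which is why the function returns a set rather than a single label.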