IKI20210 Introduction to Computer Organization
Lecture 25: Pipeline

Sources:
1. Hamacher. Computer Organization, 4th ed.
2. CS152 lecture materials, 1997, UCB.

10 January 2003
Bobby Nazief (nazief@cs.ui.ac.id)
Johny Moningka (moningka@cs.ui.ac.id)
Course materials: http://www.cs.ui.ac.id/~iki20210/

Pipeline
One Way to Speed Up Instruction Execution

Pipelining is Natural!
° Laundry example
° Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
° Washer takes 30 minutes
° Dryer takes 40 minutes
° “Folder” takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, run one after another, each taking 30 + 40 + 20 minutes]
° Sequential laundry takes 6 hours for 4 loads
° If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM; loads A, B, C, D overlap, with successive intervals of 30, 40, 40, 40, 40, 20 minutes]
° Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
[Figure: the pipelined laundry timeline, repeated]
° Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
° Pipeline rate is limited by the slowest pipeline stage
° Multiple tasks operate simultaneously using different resources
° Potential speedup = number of pipe stages
° Unbalanced lengths of pipe stages reduce speedup
° Time to “fill” the pipeline and time to “drain” it reduce speedup
° Stall for dependences

Pipelining Instruction Execution

Flashback: Stages of Instruction Execution
Instruction: Add R1,(R3) ; R1 ← R1 + M[R3]
Steps:
1. Fetch the instruction
   1. PCout, MARin, Read, Clear Y, Set carry-in to ALU, Add, Zin
   2. Zout, PCin, WMFC
   3. MDRout, IRin
2. Fetch operand #1 (the contents of the memory location pointed to by R3)
   4. R3out, MARin, Read
   5. R1out, Yin, WMFC
3. Perform the addition
   6. MDRout, Add, Zin
4. Store the sum in R1
   7. Zout, R1in, End

The Five Stages of (MIPS) Load Instruction
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr
° Ifetch: Instruction Fetch
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec: Calculate the memory address
° Mem: Read the data from the Data Memory
° Wr: Write the data back to the register file
Load/Store Architecture: access to/from memory only by Load/Store instructions

Pipelined Execution
[Figure: successive instructions in program flow over time, each passing through IFetch, Dcd, Exec, Mem, WB and each starting one cycle after the previous one]
° Overlapping instruction execution
° Maximum number of instructions executing simultaneously = number of stages

Why Pipeline?
° Non-pipelined machine
  • 10 ns/cycle x 4.6 CPI (due to instr mix) x 100 inst = 4600 ns
° Ideal pipelined machine
  • 10 ns/cycle x (4 cycle fill + 1 CPI x 100 inst) = 1040 ns
[Figure: clock cycles 1 to 10. Non-pipelined implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store, then R-type, one after another. Pipelined implementation: Load, Store, and R-type overlap, each starting one cycle after the previous one]

Why Pipeline? Because the resources are there!
[Figure: Inst 0 to Inst 4 in instruction order over clock cycles, each using Im, Reg, ALU, Dm, Reg in successive cycles]

Restructuring Datapath

Partitioning the Datapath (1/2)
° Add registers between smallest steps
[Figure: datapath partitioned into Instruction Fetch (PC, Next PC), Operand Fetch (Reg. File), Exec, Mem Access (Data Mem), and Result Store, with added registers that store the instruction, the source (register) operands, the results, and the read data (from memory). Control signals shown: nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegDst, RegWr]

Partitioning the Datapath (2/2)
[Figure: pipelined datapath for Load across Cycle 1 (Ifetch), Cycle 2 (Reg/Dec), Cycle 3 (Exec), Cycle 4 (Mem), Cycle 5 (Wr). Elements shown: PC, Next PC, Inst. Mem, IR, Reg File, A, B, ALU, Equal, R, Data Mem, M, Mem Access, and the pipelined instruction registers IRexe, IRmem, IRwb feeding Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl]
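The timing figures quoted in the laundry example (6 hours versus 3.5 hours) and on the "Why Pipeline?" slide (4600 ns versus 1040 ns) can be reproduced with a short calculation. The following is a minimal sketch in Python; the function names are hypothetical and not part of the lecture, and the pipelined model assumes a task may wait between stages until the next stage is free.

```python
def sequential_time(stage_times, n_tasks):
    """Total time when each task runs all stages before the next task starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Total time when tasks overlap, one task per stage at a time."""
    # done[s] = time at which stage s finished its most recent task
    done = [0] * len(stage_times)
    for _ in range(n_tasks):
        ready = 0  # time at which the current task leaves the previous stage
        for s, t in enumerate(stage_times):
            start = max(ready, done[s])  # wait for the task and for the stage
            done[s] = start + t
            ready = done[s]
    return done[-1]


def nonpipelined_ns(cycle_ns, cpi, n_instr):
    """Execution time of a non-pipelined machine with the given average CPI."""
    return cycle_ns * cpi * n_instr


def ideal_pipelined_ns(cycle_ns, n_stages, n_instr):
    """Execution time of an ideal pipeline: fill cycles plus 1 CPI."""
    fill = n_stages - 1  # cycles before the first instruction completes
    return cycle_ns * (fill + n_instr)


laundry = [30, 40, 20]  # washer, dryer, "folder" (minutes)
print(sequential_time(laundry, 4) / 60)   # hours for 4 loads, sequential
print(pipelined_time(laundry, 4) / 60)    # hours for 4 loads, pipelined
print(nonpipelined_ns(10, 4.6, 100))      # ns, non-pipelined machine
print(ideal_pipelined_ns(10, 5, 100))     # ns, ideal 5-stage pipeline
```

The pipelined laundry total is set by the 40-minute dryer, which illustrates the lesson that the pipeline rate is limited by the slowest stage.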
Pipeline Hazards

Can pipelining get us into trouble?
° Yes: Pipeline Hazards
  • structural hazards: attempt to use the same resource in two different ways at the same time
    - E.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV)
  • data hazards: attempt to use an item before it is ready
    - E.g., one sock of a pair in the dryer and one in the washer; can’t fold until the sock from the washer gets through the dryer
    - instruction depends on the result of a prior instruction still in the pipeline
  • control hazards: attempt to make a decision before the condition is evaluated
    - E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in
    - branch instructions
° Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • take action (or delay action) to resolve hazards

Single Memory is a Structural Hazard
[Figure: Load followed by Instr 1 to Instr 4 in instruction order over clock cycles, each using Mem, Reg, ALU, Mem, Reg in successive cycles]
Detection is easy in this case! (right half highlight means read, left half write)

Control Hazard Solutions
° Stall: wait until the decision is clear
  • It’s possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are read
[Figure: Add, Beq, Load in instruction order over clock cycles, each using Mem, Reg, ALU, Mem, Reg]
° Impact: 2 clock cycles per branch instruction => slow

Control Hazard Solutions
° Predict: guess one direction, then back up if wrong
  • Predict not taken
[Figure: Add, Beq, Load in instruction order over clock cycles, each using Mem, Reg, ALU, Mem, Reg]
° Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of time)
° More dynamic scheme: history of 1 branch (~90%)

Control Hazard Solutions
° Redefine branch behavior (takes place after next instruction): “delayed branch”
[Figure: Add, Beq, Misc, Load in instruction order over clock cycles, each using Mem, Reg, ALU, Mem, Reg]
° Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (~50% of time)
° Less useful as more instructions are launched per clock cycle

Data Hazard on r1
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11

Data Hazard on r1:
• Dependencies backwards in time are hazards
[Figure: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in instruction order over clock cycles, through stages IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg)]

Data Hazard Solution:
• “Forward” result from one stage to another
[Figure: the same five instructions over clock cycles, through stages IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg)]

Forwarding Structure
[Figure: datapath with IAU, PC, npc, I mem, Regs, a forward mux feeding operand latches A and B, im, alu, S, D mem, m, and the fields op, rw, rs, rt, n carried along each stage]
° Detect the nearest valid write to an operand register and forward it into the operand latches, bypassing the remainder of the pipe
  • Increase muxes to add paths from pipeline registers
  • Data Forwarding = Data Bypassing

Forwarding (or Bypassing): What about Loads?
• Dependencies backwards in time are hazards
[Figure: lw r1,0(r2) followed by sub r4,r1,r3 over clock cycles, through stages IF, ID/RF, EX, MEM, WB]
• Can’t solve with forwarding
• Must delay/stall the instruction that depends on the load

Execution Delay/Stall
[Figure: lw r1,0(r2), then a no-op, then sub r4,r1,r3 over clock cycles, through stages IF, ID/RF, EX, MEM, WB]
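The load delay shown on the last two slides can be sketched as a small check over an instruction list. This is an illustrative Python sketch, not the hardware mechanism from the lecture: it assumes a five-stage pipeline with full forwarding, so that only an instruction that reads a register loaded by the immediately preceding lw needs a one-cycle delay. The names Instr, NOP, and insert_load_stalls are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                  # mnemonic, e.g. "lw", "add", "sub"
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


# Hypothetical representation of the bubble inserted into the pipeline.
NOP = Instr("no-op", None, ())


def insert_load_stalls(program: List[Instr]) -> List[Instr]:
    """Insert a no-op after each lw whose result is read by the next instruction.

    Assumption: forwarding covers every other dependence, so only the
    load-use case needs a delay.
    """
    out: List[Instr] = []
    for instr in program:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "lw" and prev.dest in instr.srcs:
            out.append(NOP)  # delay the dependent instruction by one cycle
        out.append(instr)
    return out


# The example from the slides: sub reads r1 right after lw writes it.
load_use = [
    Instr("lw", "r1", ("r2",)),
    Instr("sub", "r4", ("r1", "r3")),
]
print([i.op for i in insert_load_stalls(load_use)])

# An ALU result can be forwarded, so no delay is inserted here.
alu_use = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r1", "r3")),
]
print([i.op for i in insert_load_stalls(alu_use)])
```

A real pipeline would detect this condition in its control logic and hold the dependent instruction for a cycle; rewriting the instruction list is only a convenient way to show where the bubble goes.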