CDA 5155 Computer Architecture Week 1.5

Start with the materials: Conductors and Insulators
• Conductor: a material that permits electrical current to flow easily (low resistance to current flow). A lattice of atoms with free electrons.
• Insulator: a material that is a poor conductor of electrical current (high resistance to current flow). A lattice of atoms with strongly held electrons.
• Semiconductor: a material that can act like a conductor or an insulator depending on conditions (variable resistance to current flow).

Making a semiconductor using silicon
[Diagram: pure silicon lattice with shared bonding electrons]
Question: What is a pure silicon lattice? A. Conductor B. Insulator C. Semiconductor

N-type Doping
We can increase the conductivity by adding atoms of phosphorus or arsenic to the silicon lattice. They have one more electron, which is free to wander. This is called n-type doping since we add some free (negatively charged) electrons.
[Diagram: silicon lattice with a phosphorus atom; its extra electron is easily moved]
Question: What is an n-doped silicon lattice? A. Conductor B. Insulator C. Semiconductor

P-type Doping
Interestingly, we can also improve the conductivity by adding atoms of gallium or boron to the silicon lattice. They have one fewer electron, which creates a hole. Holes also conduct current by stealing electrons from a neighbor (thus moving the hole). This is called p-type doping since we have fewer (negatively charged) electrons in the bonds holding the atoms together.
[Diagram: silicon lattice with a gallium atom and a hole. The gallium atom will accept an electron, even though it is one too many for it, since the electron fills the eighth position in the bonding shell. Again this lets current flow, since the electron must come from somewhere to fill this position.]

Using doped silicon to make a junction diode
A junction diode allows current to flow in one direction and blocks it in the other.
[Diagram: junction diode with GND and Vcc applied one way – electrons like to move toward Vcc, and electrons move from GND to fill holes]
[Diagram: junction diode biased the other way – current flows]

Making a transistor
Our first level of abstraction is the transistor (basically 2 diodes sitting back-to-back).
Transistors are electronic switches connecting the source to the drain if the gate is "on".
[Diagram: MOS transistor – gate over a p-type substrate, with source and drain regions connected to Vcc]
http://www.intel.com/education/transworks/INDEX.HTM

Review of basic pipelining
• 5-stage "RISC" load-store architecture
  – About as simple as things get
• Instruction fetch: get instruction from memory/cache
• Instruction decode: translate opcode into control signals and read regs
• Execute: perform ALU operation
• Memory: access memory if load/store
• Writeback/retire: update register file

Pipelined implementation
• Break the execution of the instruction into cycles (5 in this case).
• Design a separate datapath stage for the execution performed during each cycle.
• Build pipeline registers to communicate between the stages.

Stage 1: Fetch
Design a datapath that can fetch an instruction from memory every cycle.
• Use the PC to index memory to read the instruction
• Increment the PC (assume no branches for now)
• Write everything needed to complete execution to the pipeline register (IF/ID)
The next stage will read this pipeline register. Note that the pipeline register must be edge-triggered.
[Datapath diagram: PC indexes the instruction memory (both with enable inputs); the instruction bits and PC + 1 are written to the IF/ID pipeline register, and PC + 1 feeds back through a MUX to the PC. The rest of the pipelined datapath follows.]

Stage 2: Decode
Design a datapath that reads the IF/ID pipeline register, decodes the instruction, and reads the register file (specified by the regA and regB fields of the instruction bits).
Decode is easy: just pass on the opcode and let later stages figure out their own control signals for the instruction.
Write everything needed to complete execution to the pipeline register (ID/EX): pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!), including PC + 1 even though decode didn't use it.
[Datapath diagram: IF/ID feeds the register file with regA and regB; the ID/EX pipeline register holds the contents of regA, the contents of regB, the instruction bits, and PC + 1]

Stage 3: Execute
Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register.
• The inputs are the contents of regA and either the contents of regB or the offset field of the instruction.
• Also calculate PC + 1 + offset in case this is a branch.
• Write everything needed to complete execution to the pipeline register (EX/Mem):
  – ALU result, contents of regB, and PC + 1 + offset
  – Instruction bits for the opcode and destReg specifiers
  – Result from the comparison of the regA and regB contents
[Datapath diagram: the ALU takes the contents of regA and a MUX selecting between the contents of regB and the offset; an adder computes PC + 1 + offset; the EX/Mem pipeline register holds the ALU result, contents of regB, PC + 1 + offset, and instruction bits]

Stage 4: Memory Operation
Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register.
• The ALU result contains the address for ld and st instructions.
• Opcode bits control the memory R/W and enable signals.
• Write everything needed to complete execution to the pipeline register (Mem/WB):
  – ALU result and MemData
  – Instruction bits for the opcode and destReg specifiers
[Datapath diagram: data memory with R/W and enable inputs; the Mem/WB pipeline register holds the ALU result, the memory read data, and the instruction bits. The MUX control for the PC input and PC + 1 + offset go back to the MUX before the PC in stage 1.]
Stage 5: Write back
Design a datapath that completes the execution of this instruction, writing to the register file if required.
• Write MemData to destReg for the ld instruction
• Write the ALU result to destReg for add or nand instructions
• Opcode bits also control the register write enable signal.
[Datapath diagram: a MUX selects between the ALU result and the memory read data from Mem/WB; the result goes back to the data input of the register file, and instruction bits 0-2 go back as the destination register specifier, along with the register write enable]

[Full 5-stage datapath diagram: PC, instruction memory, register file, sign extend, ALU, data memory, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers; bits 0-2 and 16-18 feed the destination MUXes]

Sample Test Question (Easy)
• Which item does not need to be included in the Mem/WB pipeline register for the LC3101 pipelined implementation discussed in class?
A. ALU result
B. Memory read data
C. PC+1+offset
D. Destination register specifier
E. Instruction opcode

Sample Test Question (Hard?)
• What items need to be added to one of the pipeline registers (discussed in class) to support the <insert nasty instruction description here>?
A. IF/ID: PC
B. ID/EX: PC+offset
C. EX/Mem: Contents of regA
D. EX/Mem: ALU2 result
E. Mem/WB: Contents of regA

Things to think about…
1. How would you modify the pipeline datapath if you wanted to double the clock frequency?
2. Would it actually double?
3. How do you determine the frequency?

Sample Code (Simple)
Run the following code on the pipelined LC3101:
add 1 2 3     ; reg 3 = reg 1 + reg 2
nand 4 5 6    ; reg 6 = ~(reg 4 & reg 5)
lw 2 4 20     ; reg 4 = Mem[reg 2 + 20]
add 2 5 5     ; reg 5 = reg 2 + reg 5
sw 3 7 10     ; Mem[reg 3 + 10] = reg 7

[Full pipeline datapath diagram with labeled signals: PC + 1, target, eq?, regA/regB, valA/valB, ALU result, mdata, offset, dest; instruction bits 0-2, 16-18, and 22-24 feed the dest MUXes, and op/dest fields travel down ID/EX, EX/Mem, and Mem/WB]
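The pipeline-register contents listed in the stage descriptions above can be written down as plain record types, which makes questions like the easy one above mechanical to answer. A sketch in Python; the field names are my own, not from the slides:

```python
# Sketch: what each pipeline register must carry, per the stage
# descriptions (field names are illustrative, not from the slides).
from dataclasses import dataclass

@dataclass
class IF_ID:
    instruction: int        # raw instruction bits
    pc_plus_1: int

@dataclass
class ID_EX:
    instruction: int        # opcode, offset, and dest specifiers ride along
    pc_plus_1: int
    regA_contents: int
    regB_contents: int

@dataclass
class EX_Mem:
    instruction: int        # opcode and destReg specifiers
    alu_result: int
    regB_contents: int      # store data
    pc_plus_1_plus_offset: int
    eq: bool                # comparison of regA and regB contents (for branches)

@dataclass
class Mem_WB:
    instruction: int        # opcode and destReg specifiers
    alu_result: int
    mem_data: int

# Note: PC+1+offset is consumed in the memory stage and does NOT appear
# in Mem/WB, which is the answer to the easy sample question above.
```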
Initial State (Time: 0)
[Diagram: registers R0–R7 = 0, 36, 9, 12, 18, 7, 41, 22; PC = 0; all pipeline registers hold noops]

Fetch: add 1 2 3 (Time: 1)
[Diagram: add 1 2 3 enters IF/ID; PC = 1]

Fetch: nand 4 5 6 (Time: 2)
[Diagram: add decodes, reading 36 and 9; nand enters IF/ID; PC = 2]

Fetch: lw 2 4 20 (Time: 3)
[Diagram: add executes, 36 + 9 = 45; nand decodes, reading 18 and 7; lw enters IF/ID; PC = 3]

Fetch: add 2 5 5 (Time: 4)
[Diagram: add in the memory stage with result 45; nand executes, ~(18 & 7) = -3; lw decodes; add 2 5 5 enters IF/ID; PC = 4]

Fetch: sw 3 7 10 (Time: 5)
[Diagram: add writes back, R3 = 45; nand in the memory stage with result -3; lw executes its address, 9 + 20 = 29; add 2 5 5 decodes; sw enters IF/ID; PC = 5]
No more instructions (Time: 6)
[Diagram: nand writes back, R6 = -3; lw in the memory stage, reading Mem[29] = 99; add 2 5 5 executes, 9 + 7 = 16; sw decodes]

No more instructions (Time: 7)
[Diagram: lw writes back, R4 = 99; add 2 5 5 in the memory stage with result 16; sw executes its address, 45 + 10 = 55]

No more instructions (Time: 8)
[Diagram: add 2 5 5 writes back, R5 = 16; sw in the memory stage, writing Mem[55] = 22]

No more instructions (Time: 9)
[Diagram: sw completes; final registers R0–R7 = 0, 36, 9, 45, 99, 16, -3, 22]

Time graphs

Time:    1      2       3        4        5         6          7          8          9
add    fetch  decode  execute  memory   writeback
nand          fetch   decode   execute  memory    writeback
lw                    fetch    decode   execute   memory     writeback
add                            fetch    decode    execute    memory     writeback
sw                                      fetch     decode     execute    memory     writeback

What can go wrong?
• Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if it is about to be written.
• Control hazards: a branch instruction may change the PC, but not until stage 4. What do we fetch before that?
• Exceptions: how do you handle exceptions in a pipelined processor with 5 instructions in flight?

Data Hazards
What are they? How do you detect them? How do you deal with them?
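The architectural results of the sample program can be double-checked with a purely functional model (no pipelining, just the instruction semantics). The initial register values and Mem[29] = 99 are taken from the walkthrough diagrams:

```python
# Functional check of the sample program's architectural results.
# Registers R0-R7 start as in the Time 0 diagram; Mem[29] = 99 is the
# value the lw reads in the walkthrough.
regs = [0, 36, 9, 12, 18, 7, 41, 22]
mem = {29: 99}

regs[3] = regs[1] + regs[2]        # add 1 2 3
regs[6] = ~(regs[4] & regs[5])     # nand 4 5 6
regs[4] = mem[regs[2] + 20]        # lw 2 4 20
regs[5] = regs[2] + regs[5]        # add 2 5 5
mem[regs[3] + 10] = regs[7]        # sw 3 7 10

print(regs)     # → [0, 36, 9, 45, 99, 16, -3, 22], matching Time 9
print(mem[55])  # → 22

# With no stalls, N instructions take N + 4 cycles in a 5-stage pipeline:
print(5 + 4)    # → 9, matching the last diagram's Time 9
```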
Pipeline function for ADD
• Fetch: read instruction from memory
• Decode: read source operands from reg
• Execute: calculate sum
• Memory: pass results to next stage
• Writeback: write sum into register file

Data Hazards
add 1 2 3
nand 3 4 5

time:    1      2       3        4        5
add    fetch  decode  execute  memory   writeback
nand          fetch   decode   execute  memory    writeback

If not careful, nand will read the wrong value of R3.

[Datapath diagrams: the baseline pipeline; the pipeline with hazard-detection comparisons on the regA/regB fields; and the pipeline with forwarding (fwd) control fields and bypass paths into the ALU MUXes]

Three approaches to handling data hazards
• Avoid: make sure there are no hazards in the code
• Detect and Stall: if hazards exist, stall the processor until they go away
• Detect and Forward: if hazards exist, fix up the pipeline to get the correct value (if possible)

Handling data hazards I: Avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation. Make sure no hazards exist.
• Put noops between any dependent instructions.
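For the avoid approach, the compiler must compute how many noops to insert between a writer and a dependent reader. A sketch, under the assumption (made explicit later in the slides) that a register read and write in the same cycle returns the new value:

```python
# Sketch: noops needed between a writer and a dependent reader in the
# 5-stage pipeline. Stage numbers: fetch=1, decode=2, execute=3,
# memory=4, writeback=5. Assumes the register file gives the right
# value when a register is read and written during the same cycle.

WRITE_STAGE = 5   # writeback updates the register file
READ_STAGE = 2    # decode reads source operands

def noops_needed(distance):
    """distance = how many instructions after the writer the reader is
    fetched (1 = immediately after). The reader's decode cycle must not
    come before the writer's writeback cycle."""
    return max(0, (WRITE_STAGE - READ_STAGE) - distance)

# add 1 2 3 followed immediately by nand 3 4 5: two noops, as on the slide.
print(noops_needed(1))  # → 2
```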
write R3 in cycle 5:   add 1 2 3
                       noop
                       noop
read R3 in cycle 5:    nand 3 4 5

Problems with this solution
• Old programs (legacy code) may not run correctly on new implementations
• Longer pipelines need more noops
• Programs get larger as noops are included
  – Especially a problem for machines that try to execute more than one instruction every cycle
  – Intel EPIC: often 25% – 40% of instructions are noops
• Program execution is slower (CPI is 1, but some instructions are noops)

Handling data hazards II: Detect and stall until ready
• Detect:
  – Compare regA with the previous DestRegs (3-bit operand fields)
  – Compare regB with the previous DestRegs (3-bit operand fields)
• Stall:
  – Keep the current instructions in fetch and decode
  – Pass a noop to execute

First half of cycle 3
[Diagram: nand 3 4 5 sits in IF/ID while add 1 2 3 is in ID/EX. Registers: R1 = 14, R2 = 7, R3 = 10, R4 = 11. The hazard detection logic compares nand's regA field (011 = 3) and regB field against the dest specifiers ahead of it and detects a hazard on regA. The enable (en) inputs of the PC and IF/ID are deasserted so fetch and decode hold their instructions.]
[Diagram snapshots of the stall:
• End of cycle 3: a noop has been passed to execute; the add computes 14 + 7 = 21.
• First half of cycle 4: the hazard is still detected; fetch and decode stay frozen.
• End of cycle 4: another noop enters ID/EX; the add (result 21) reaches Mem/WB.
• First half of cycle 5: no hazard, since the add writes back this cycle.
• End of cycle 5: nand has decoded, reading the correct values R3 = 21 and R4 = 11.]

No more stalling
add 1 2 3
nand 3 4 5

time:    1      2       3        4        5        6        7       8
add    fetch  decode  execute  memory   writeback
nand          fetch   decode   decode   decode   execute  memory  writeback
                      (hazard) (hazard)

Assume the register file gives the right value of R3 when it is read and written during the same cycle.

Problems with detect and stall
CPI increases every time a hazard is detected! Is that necessary? Not always!
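The detect half of detect-and-stall can be sketched as a comparison of the decoding instruction's source specifiers against the destination specifiers still in flight. This is an illustrative model, not the actual control logic; the instruction fields are dicts with made-up names:

```python
# Sketch: hazard detection for detect-and-stall. The instruction in
# decode compares its regA/regB source specifiers (3-bit fields) against
# the dest specifiers in ID/EX and EX/Mem. Mem/WB is excluded because it
# writes back this cycle and the register file forwards same-cycle
# writes (an assumption stated on the slides).

def writes_register(instr):
    # In this ISA, add, nand, and lw write a destination register.
    return instr is not None and instr["op"] in ("add", "nand", "lw")

def hazard(decoding, id_ex, ex_mem):
    for instr in (id_ex, ex_mem):
        if writes_register(instr) and instr["dest"] in (decoding["regA"], decoding["regB"]):
            return True
    return False

add_instr = {"op": "add", "regA": 1, "regB": 2, "dest": 3}
nand_instr = {"op": "nand", "regA": 3, "regB": 4, "dest": 5}

print(hazard(nand_instr, add_instr, None))  # cycle 3: add in ID/EX → True
print(hazard(nand_instr, None, add_instr))  # cycle 4: add in EX/Mem → True
print(hazard(nand_instr, None, None))       # cycle 5: add writes back → False
```

Two stall cycles, matching the time graph above.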
Re-route the result of the add to the nand
• nand no longer needs to read R3 from the reg file
• It can get the data later (when it is ready)
• This lets us complete the decode this cycle
  – But we need more control to remember that the data we aren't getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.

Handling data hazards III: Detect and forward
• Detect: same as detect and stall, except that all 4 hazards are treated differently (i.e., you can't logical-OR the 4 hazard signals)
• Forward:
  – New bypass datapaths route computed data to where it is needed
  – New MUXes and control to pick the right data
• Beware: stalling may still be required even in the presence of forwarding

Sample Code
Which hazards do you see?
add 1 2 3
nand 3 4 5
add 4 3 7
add 6 3 7
lw 3 6 10
sw 6 2 12

[Diagram snapshots of forwarding (initial registers R0–R7 = 0, 14, 7, 10, 11, 77, 1, 8):
• First half of cycle 3: hazard on nand's regA (R3), but decode completes anyway; forwarding state H1 is recorded.
• End of cycle 3: the add's ALU result, 14 + 7 = 21, sits in EX/Mem.
• First half of cycle 4: a new hazard, since add 6 3 7 also needs R3; the nand executes with 21 forwarded in place of the stale R3, computing ~(21 & 11) = -2.
• End of cycle 4 and first half of cycle 5: 21 is now forwarded from one stage further down the pipeline (state H2); no stall is needed.]
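The forward half adds bypass MUXes in front of the ALU. The selection priority (newest value wins) can be sketched as follows; the function and argument names are illustrative:

```python
# Sketch: choosing one ALU input under forwarding. The newest value
# wins: EX/Mem (the ALU result of the previous instruction) takes
# priority over Mem/WB (an older result about to be written back);
# otherwise the value read from the register file is used.

def select_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """ex_mem / mem_wb: (dest_reg, value) pairs for in-flight
    instructions that write a register, or None."""
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]      # forward from EX/Mem (most recent)
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]      # forward from Mem/WB
    return regfile_value      # no hazard: register file is up to date

# nand 3 4 5 right after add 1 2 3: R3's stale value (10) is bypassed
# by the add's ALU result (21) sitting in EX/Mem.
print(select_operand(3, 10, (3, 21), None))  # → 21
```

This is why the four hazard signals cannot simply be ORed: each one steers a different MUX input.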
[Forwarding snapshots continue through cycle 8:
• add 4 3 7 and add 6 3 7 execute with R3's new value (21) forwarded to them; lw 3 6 10 and sw 6 2 12 follow them down the pipeline.
• The lw computes its address, 21 + 10 = 31, and reads Mem[31] = 99 into R6.
• The sw needs R6, which the lw loads. Even with forwarding the pipeline must stall (fetch and decode frozen, a noop inserted, hazard state H3) until the loaded value can be forwarded to the sw, whose store address works out to 99 + 12 = 111.]

Control hazards
How can the pipeline handle branch and jump instructions?
Pipeline function for BEQ
• Fetch: read instruction from memory
• Decode: read source operands from reg
• Execute: calculate target address and test for equality
• Memory: send target to PC if test is equal
• Writeback: nothing left to do

Control Hazards
beq 1 1 10
sub 3 4 5

time:    1      2       3        4        5
beq    fetch  decode  execute  memory   writeback
sub           fetch   decode   execute  …

Approaches to handling control hazards
• Avoid: make sure there are no hazards in the code
• Detect and Stall: delay fetch until the branch is resolved
• Speculate and Squash-if-Wrong: go ahead and fetch more instructions in case it is correct, but stop them if they shouldn't have been executed

Handling control hazards I: Avoid all hazards
• Don't have branch instructions! Maybe a little impractical.
• Delay taking the branch: dbeq r1 r2 offset. Instructions at PC+1, PC+2, etc. will execute before deciding whether to fetch from PC+1+offset. (If no useful instructions can be placed after the dbeq, noops must be inserted.)

Problems with this solution
• Old programs (legacy code) may not run correctly on new implementations
• Longer pipelines need more instructions/noops after a delayed beq
• Programs get larger as noops are included
  – Especially a problem for machines that try to execute more than one instruction every cycle
  – Intel EPIC: often 25% – 40% of instructions are noops
• Program execution is slower (CPI equals 1, but some instructions are noops)

Handling control hazards II: Detect and stall
• Detection:
  – Must wait until decode
  – Compare the opcode to beq or jalr (alternately, this is just another control signal)
• Stall:
  – Keep the current instruction in fetch
  – Pass a noop to the decode stage (not execute!)
[Datapath diagram: a MUX driven by the Control unit injects a noop into ID/EX while the PC and IF/ID hold]

Control Hazards (with stalling)

time:    1      2       3        4        5
beq    fetch  decode  execute  memory   writeback
sub           fetch   fetch    fetch    fetch (from PC+1 or the target)

Problems with detect and stall
CPI increases every time a branch is detected! Is that necessary? Not always!
Only about ½ of the time is the branch taken.
• Let's assume that it is NOT taken…
  – In this case, we can ignore the beq (treat it like a noop)
  – Keep fetching PC + 1
• What if we are wrong?
  – OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e., don't perform writeback)

Handling control hazards III: Speculate and squash
• Speculate: assume not taken
  – Keep fetching from PC+1 until we know that the branch is really taken
• Squash: stop the bad instructions if taken
  – Send a noop to decode, execute, and memory
  – Send the target address to the PC
[Diagram: the beq resolves taken in the memory stage; the sub, add, and nand fetched behind it are replaced with noops, and the target address is routed to the PC]

Problems with fetching PC+1
• CPI increases every time a branch is taken! (About ½ of the time.)
• Is that necessary? No! But how can you fetch from the target before you even know the previous instruction is a branch – much less whether it is taken???
[Diagram: stored branch PC (bpc) and target values feed the PC MUX so the target can be fetched early]

Branch prediction
• Predict not taken: ~50% accurate
• Predict backward taken: ~65% accurate
• Predict same as last time: ~80% accurate
• Pentium: ~85% accurate
• Pentium Pro: ~92% accurate
• Best paper designs: ~97% accurate
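The value of those accuracy numbers can be seen with a little CPI arithmetic. The 3-instruction squash penalty follows from resolving branches in stage 4; the 20% branch frequency is an illustrative assumption, not from the slides:

```python
# Sketch: effect of branch prediction accuracy on CPI. A mispredicted
# branch squashes the instructions fetched behind it; with the branch
# resolved in stage 4 (memory), 3 wrongly fetched instructions are lost.
# A branch frequency of 20% is an illustrative assumption.

def cpi(accuracy, branch_freq=0.20, penalty=3):
    return 1.0 + branch_freq * (1.0 - accuracy) * penalty

for name, acc in [("predict not taken", 0.50),
                  ("same as last time", 0.80),
                  ("Pentium Pro", 0.92),
                  ("best paper designs", 0.97)]:
    print(f"{name}: CPI = {cpi(acc):.3f}")
```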
Role of the Compiler
• The primary user of the instruction set
  – Exceptions: getting less common
    • Some device drivers; specialized library routines
    • Some small embedded systems (synthesized arch)
• Compilers must:
  – generate a correct translation into machine code
• Compilers should:
  – have fast compile time; generate fast code
• While we are at it:
  – generate reasonable code size; good debug support

Structure of Compilers
• Front-end: translates high-level semantics to some generic intermediate form
  – The intermediate form does not have any resource constraints, but uses simple instructions.
• Back-end: translates the intermediate form into assembly/machine code for the target architecture
  – Resource allocation; code optimization under resource constraints
• Architects are mostly concerned with optimization

Typical optimizations: CSE
• Common sub-expression elimination:
  Before: c = array1[d+e] / array2[d+e];
  After:  i = d + e; c = array1[i] / array2[i];
• Purpose:
  – reduce instructions / faster code
• Architectural issues:
  – more register pressure

Typical optimization: LICM
• Loop invariant code motion:
  Before: for (i=0; i<100; i++) { t = 5; array1[i] = t; }
  After:  t = 5; for (i=0; i<100; i++) { array1[i] = t; }
• Purpose:
  – remove statements or expressions from loops that need only be executed once (idempotent)
• Architectural issues:
  – more register pressure

Other transformations
• Procedure inlining: better instruction schedule
  – greater code size, more register pressure
• Loop unrolling: better loop schedule
  – greater code size, more register pressure
• Software pipelining: better loop schedule
  – greater code size, more register pressure
• In general, "global" optimization: faster code
  – greater code size, more register pressure

Compiled code characteristics
• Optimized code has different characteristics than unoptimized code.
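As a concrete instance of that difference, the slide's LICM loop sketched in Python, before and after hoisting:

```python
# Sketch: loop-invariant code motion (LICM) on the slide's example.
# The assignment t = 5 is invariant inside the loop, so the compiler
# hoists it out; the result is the same with fewer operations per
# iteration.

def unoptimized(array1):
    for i in range(100):
        t = 5              # executed 100 times
        array1[i] = t

def optimized(array1):
    t = 5                  # hoisted: executed once
    for i in range(100):
        array1[i] = t

a, b = [0] * 100, [0] * 100
unoptimized(a)
optimized(b)
print(a == b)  # → True: same result, less work in the loop body
```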
– Fewer memory references, but it is generally the "easy ones" that are eliminated
  • Example: better register allocation retains active data in the register file; these would be cache hits in unoptimized code.
– Removing redundant memory and ALU operations leaves a higher ratio of branches in the code
  • Branch prediction becomes more important
• Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure

What do compiler writers want in an instruction set architecture?
• More resources: better optimization tradeoffs
• Regularity: same behaviour in all contexts
  – no special cases (flags set differently for immediates)
• Orthogonality:
  – data type independent of addressing mode
  – addressing mode independent of operation performed
• Primitives, not solutions:
  – keep instructions simple; it is easier to compose than to fit (ex. MMX operations)

What do architects want in an instruction set architecture?
• Simple instruction decode:
  – tends to increase orthogonality
• Small structures:
  – more resource constraints
• Small data bus fanout:
  – tends to reduce orthogonality and regularity
• Small instructions:
  – make things implicit
  – non-regular, non-orthogonal, non-primitive

To make faster processors
• Make the compiler team unhappy
  – More aggressive optimization over the entire program
  – More resource constraints; caches; HW schedulers
  – Higher expectations: increase IPC
• Make the hardware design team unhappy
  – Tighter design constraints (clock)
  – Execute optimized code with more complex execution characteristics
  – Make all stages bottlenecks (Amdahl's law)
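The closing Amdahl's law point can be made quantitative: speeding up one part of execution helps only in proportion to that part's share of the total, which is why no single stage should remain the sole bottleneck. The standard formula, with illustrative numbers:

```python
# Amdahl's law: overall speedup when a fraction f of execution time is
# sped up by a factor s.

def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Doubling the speed of something that accounts for only 40% of
# execution time yields far less than a 2x overall speedup:
print(round(amdahl_speedup(0.4, 2.0), 2))  # → 1.25
```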