PROCESSOR ARCHITECTURE
Jehan-François Pâris
jparis@uh.edu

Chapter Organization
• Logic design conventions
• Implementation of a "toy" CPU
• Pipelining
• Pipelining hazards (IMPORTANT)
  – Data hazards
  – Control hazards
• Exceptions
• Parallelism

LOGIC DESIGN CONVENTIONS

Combinational/state elements
• Combinational elements:
  – Outputs only depend on current inputs
  – Stateless
    • Adders and, more generally, arithmetic logic unit (ALU)

Combinational/state elements
• State elements:
  – Have a memory holding a state
  – Output depends on current inputs and state of element
  – State reflects past inputs
    • Flip-flops, …

Judicial analogy
• In our legal system
  – Guilty/not guilty decision is stateless
    • Good reasons
  – Sentencing decision is not
    • "Three strikes and you are out" laws
    • Good reasons

Clocking methodology
• We will assume an edge-triggered clocking technology
  – Edge is short enough to prevent data propagation in state elements
  – Can read current state of a memory element at the same time we update it

Clocking convention
• Omit write control signal if state element is updated at every active clock edge

A "TOY" CPU

Motivation
• "Toy" CPU will implement a subset of MIPS instruction set
• Subset will be
  – Self-sufficient
  – Simpler to implement
  – Complex enough to allow a serious discussion of CPU architecture

The subset
• Will include
  – Load and store instructions: lw (load word) and sw (store word)
  – Arithmetic-logic instructions: add, sub, and, or and slt (set less than)
  – Branch instructions: beq (branch if equal) and j (jump)

Load and store instructions
• Format I
• Three operands:
  – Two registers $r1 and $r2
  – One displacement d
• lw $r1, d($r2) loads into register $r1 main memory word at address contents($r2) + d
• sw $r1, d($r2) stores contents of register $r1 into main memory word at address contents($r2) + d

Arithmetic-logic instructions
• Format R
• Three operands:
  – Three registers $r1, $r2 and $r3
• Store into register $r1 result of $r2 <op> $r3 where <op> can
be add, subtract, and, or as well as set if less than

Branch instruction
• Format I
• Three operands:
  – Two registers $r1 and $r2
  – One displacement d
• beq $r1, $r2, d sets value of PC to PC+4 + 4×d iff $r1 = $r2

The simplest data path
• Assume CPU will do nothing but
  – Increment its program counter and
  – Deliver the next instruction

The simplest data path
[Figure: PC, instruction memory (read address, instruction) and an adder with constant input 4]

Implementing R2R instructions
• ALU takes two 32-bit inputs
• Returns
  – A 32-bit output
  – A 1-bit signal if the result is zero

The register file
• Two read outputs that are always available
• One write input activated by a RegWrite signal
• Three register selectors

The register file
[Figure: register file with three 5-bit selectors (read select 1, read select 2, write select), a write data input and two read data outputs; RegWrite enables register writes]

Implementing R2R instructions
[Figure: register file, ALU and ALU result; RegWrite is enabled]

Implementing load and store
• Require
  – An address calculation:
    • contents($r2) + d
  – An access to data memory
• Before doing the address calculation, we must transform 16-bit displacement d into a 32-bit value using sign extension

The data memory
• One address selector
• One write data input
• One read data output
• Two controls
  – MemWrite
  – MemRead

Sign extension (I)
• If 16-bit number has a zero as MSB
  – It is positive
  – Must add 16 zero bits on the left
      0110 1010 1010 0100
    becomes
      0000 0000 0000 0000 0110 1010 1010 0100

Sign extension (II)
• If 16-bit number has a one as MSB
  – It is negative
  – Must add 16 one bits on the left
      1110 1010 1010 0100
    becomes
      1111 1111 1111 1111 1110 1010 1010 0100

The data memory
[Figure: data memory with memory address and write data inputs and a read data output; MemWrite enables memory writes, MemRead enables memory reads]

Implementing the store instruction
[Figure: register file, ALU, sign extension (SE) of the d field, and data memory (address, write, read)]

Implementing the load instruction
[Figure: register file, ALU, sign extension (SE) of the d field, and data memory (address, write, read)]

Implementing conditional branch
• Target Address:
  – Sign-extend 16-bit immediate part of instruction
  – Shift left 2
  – Add to PC+4
• Branch Control
Logic:
  – Perform test operation on two registers
  – Check result

Implementing conditional branch
[Figure: PC+4 and the sign-extended d field shifted left 2 feed an adder that produces the branch destination; register file outputs feed the ALU, whose result goes to the branch control logic]

Note
• Arithmetic-logic operations only use
  – Register file and ALU
• Load and store use
  – ALU for computing memory address
  – Data memory

Implementing other instructions
[Figure]

Combining everything
[Figure]

Left to be done
• All control signals:
  – Two multiplexers: ALUSrc and MemtoReg
  – RegWrite, MemRead and MemWrite switches
  – ALU controls (4 bits)

ALU control signals

  ALU control lines   Function
  0000                and
  0001                or
  0010                add
  0110                subtract
  0111                set on less than
  1100                nor (not in "toy" subset)

Controlling the ALU
• Recall that all R-format instructions have same opcode
  – Operation performed by ALU is specified in the function field (bits 5:0)

Controlling the ALU
• ALU control inputs generated by two-step process
  – Construct two ALUOp control bits from opcode
  – Construct four ALU control bits using
    • Two ALUOp bits
    • Six bits from function field when they are needed

Dependence table

  Opcode   ALUOp   Operation   Function   Action     ALU Ctl
  lw       00      lw          --         add        0010
  sw       00      sw          --         add        0010
  beq      01      beq         --         subtract   0110
  R-type   10      add         100000     add        0010
  R-type   10      subtract    100010     subtract   0110
  R-type   10      and         100100     and        0000
  R-type   10      or          100101     or         0001
  R-type   10      slt         101010     slt        0111

Notes
• Two-step process simplifies combinational logic
• Many don't care conditions in truth table

Truth table

  ALU Op1   ALU Op2   F5   F4   F3   F2   F1   F0   ALU control bits
  0         0         X    X    X    X    X    X    0010
  0         1         X    X    X    X    X    X    0110
  1         0         X    X    0    0    0    0    0010
  1         X         X    X    0    0    1    0    0110
  1         0         X    X    0    1    0    0    0000
  1         0         X    X    0    1    0    1    0001
  1         X         X    X    1    0    1    0    0111

Note
• Bits 4 and 5 of function field are not used
• ALUOp bits only have three possible values: 00, 01 and 10
  – Introduces don't care conditions
• All R instructions use same data paths
  – Other control bits depend only on opcode

Control signal effects
• RegDest
  – When deasserted: destination register comes from rt field (bits 20:16)
  – When asserted: destination register comes from rd field (bits 15:11)
• RegWrite
  – When deasserted: none
  – When asserted: enables write into destination register
• ALUSrc
  – When deasserted: second ALU operand comes from second register output
  – When asserted: second ALU operand comes from sign-extended displacement (bits 15:0)

Control signal effects
• PCSrc
  – When deasserted: PC is incremented by 4
  – When asserted: PC set to branch target value
• MemRead
  – When deasserted: none
  – When asserted: enables memory read output
• MemWrite
  – When deasserted: none
  – When asserted: enables memory write
• MemtoReg
  – When deasserted: value fed to destination register comes from ALU
  – When asserted: value fed to destination register comes from memory

Note
• PCSrc is asserted when
  – Instruction is a branch and
  – ALU Zero result bit is asserted
• We will introduce a Branch control line

Control line settings

  Instruction   RegDest   ALUSrc   MemtoReg   RegWrite
  R-format      1         0        0          1
  lw            0         1        1          1
  sw            X         1        X          0
  beq           X         0        X          0

Control line settings

  Instruction   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
  R-format      0         0          0        1        0
  lw            1         0          0        0        0
  sw            0         1          0        0        0
  beq           0         0          1        0        1

Active datapaths for a R instruction
[Figure]

Active datapaths for a load instruction
[Figure]

Active datapaths for a beq instruction
[Figure]

The "weird" jump instruction
• Uses J format
  – Single 26-bit operand
  – Implements an unconditional jump
• New value of PC is obtained as follows
  – Bits 1:0 are zero (address is multiple of 4)
  – Bits 27:2 come from jump operand
  – Bits 31:28 come from PC+4

Implementing the jump instruction
[Figure]

Limitations of single-cycle design
• If we want all instructions to be executed in one cycle
  – Clock cycle must be long enough to accommodate instruction taking the most time
    • Floating-point multiply or divide
• Does not work for CPUs that have a rich instruction set

PIPELINING

An analogy (I)
• Washing your clothes
  – Four steps:
    1. Putting in the washer
    2. Putting in the dryer
    3. Folding/ironing
    4.
Putting them away

An analogy (II)
• Most people
  – Start second wash load as soon as first wash load is in dryer
  – Put second wash load in dryer and start a third wash load while they are folding/ironing the first wash load

Purely sequential approach
[Figure: timeline from 6 pm to 9:30 pm; each load goes through Wash, Dry, Fold and Store before the next load starts]

Smart approach
[Figure: timeline from 6 pm to 9:30 pm; the Wash, Dry, Fold and Store steps of successive loads overlap]
• Solution assumes that a housemate puts folded/ironed clothes away for us

Main advantage
• Can do much more in much less time

Limitation
• Slowed down by time taken by longest step
  – Could be washing/drying/ironing

Instruction steps (I)
• Good candidates for pipelining steps
  1. Fetch instruction from memory
  2. Decode instruction
  3. Read registers
  4. Execute register to register operation or calculate address
  5. Access operand in memory
  6. Write results into a register

Instruction steps (II)
• Since MIPS instruction set has fixed fields, we can combine steps 2 and 3
  1. Fetch instruction from memory
  2. Read registers while decoding instruction
  3. Execute register to register operation or calculate address
  4. Access operand in memory
  5.
Write results into a register

Sample step timings

  Instruction class      Instruction  Register  ALU        Data    Register  Total
                         fetch        read      operation  access  write     time
  Load word (lw)         200 ps       100 ps    200 ps     200 ps  100 ps    800 ps
  Store word (sw)        200 ps       100 ps    200 ps     200 ps  --        700 ps
  R format instruction   200 ps       100 ps    200 ps     --      100 ps    600 ps
  Branch (beq)           200 ps       100 ps    200 ps     --      --        500 ps

  Step 1: Fetch and decode
  Step 2: Read registers
  Step 3: Use the ALU
  Step 4: Access operand in memory
  Step 5: Store result in register

Observations
• Most R format instructions operate on three registers and skip step 4
• Same for most I format instructions with an immediate operand
• Store operations skip step 5
• Load register instructions go through all five steps

Pipelining limitations
• Some instructions that skip a step will still have to wait until preceding instruction is done.
• Hazards:
  – An instruction cannot proceed because
    • Hardware cannot support the combination of instructions (structural hazards)
    • Data are not ready (data hazards)
    • Control/branch hazards

Structural hazards
• Combinations of instructions that prevent pipelining

A bad MIPS instruction (I)
• Recall that the IBM instruction set had instructions that add the contents of a memory location to a register
  – RX format

A bad MIPS instruction (II)
• We could think of a MIPS instruction with three register operands
      ADDX r1, r2, r3
  adding to r1 the contents of the word at address contents of r2 + contents of r3
• We would have
      r1 = r1 + Mem[r2+r3]

A bad MIPS instruction (III)
• It would be great for accessing arrays
  – r2 would hold the starting address of the array (fixed value)
  – r3 would contain the array index multiplied by 4 (incremented after each step)

A bad MIPS instruction (IV)
• Adding this instruction would be a very bad idea
  – Why?
Answer
• Instruction would require two steps using the ALU
  – Adding r2 and r3 to compute the address of the memory operand (step 4)
  – Adding the memory operand to r1
• New step would introduce a structural hazard by preventing any other instruction from accessing the ALU

My comment
• Careful design of the MIPS CPU and instruction set should be noted
  – Not true for older instruction sets
    • IBM 360, DEC VAX, …
  – Not true for x86 instruction sets
    • CPU is designed to be compatible with an existing instruction set

Designing instruction sets for pipelining (I)
• All instructions should have the same length
  – Can fetch future instructions before the current one is decoded
• Have few instruction formats with register fields always in the same position
  – Can combine instruction decode and register read steps

Designing instruction sets for pipelining (II)
• Memory operands should only appear in load and store instructions
  – No instruction can use the ALU twice!
• Operands must be properly aligned in memory
  – Can always access them in a single memory cycle

Data hazards (I)
• Assume we have
      add $s0, $t0, $t1
      sub $t2, $s0, $t3
  or
      s0 = t0 + t1
      t2 = s0 – t3
• Need result of add before proceeding with sub instruction

Detail of steps

  Cycle   1    2       3       4       5       6
  add     IF   ID/RR   ALU     RW
  sub          IF      stall   stall   ID/RR   ALU

• Second instruction must wait until first instruction has updated $s0 in cycle 4 before reading its value in cycle 5

Data hazards (II)
• New value of $s0 computed by the add instruction is not stored in $s0 until its step 5 has completed
• New instruction must wait until add instruction has performed its step 5 before performing its step

Data hazards (III)
[Figure: pipeline diagram for add and sub]

Data hazards (IV)
• We lose two cycles during which nothing can be done
• Cannot trust compiler to remove all data hazards
• Observe that new value of $s0 becomes available at the end of step 3 of add instruction
  – Add special circuitry to provide this value at the end of step 2 of sub instruction
    • Forwarding or bypassing

After
forwarding
[Figure: pipeline diagram for add and sub with forwarding]

Detail of steps

  Cycle   1    2       3       4     5    6
  add     IF   ID/RR   ALU     RW
  sub          IF      ID/RR   ALU   RW

• Second instruction now gets updated value at the end of cycle 3 just in time to use it in cycle 4
  – No stall cycles

Limitations (I)
• Forwarding worked very well because output of step 3 of add was forwarded to be input of step 3 of sub
• Would not work as well if the output of a later step of an instruction is needed as input of an earlier step of the next instruction
  – Will still have one or more pipeline stalls (bubbles)

Limitations (II)
• Assume we have
      lw  $s0, 20($t1)
      sub $t2, $s0, $t3
  or
      s0 = Mem[t1+20]
      t2 = s0 – t3
• Need new value of s0 before proceeding with sub instruction

Limitations (III)
[Figure: pipeline diagram]

Detail of steps

  Cycle   1    2       3       4       5     6
  lw      IF   ID/RR   ALU     MEM     RW
  sub          IF      ID/RR   stall   ALU   RW

• Even with forwarding, second instruction must wait until completion of memory access of first instruction in cycle 4 before performing its ALU step in cycle 5
  – One stall cycle

A last word
• In many architectures, the floating point unit is a significant source of structural hazards
  – Less well adapted to pipelining
• The MIPS architecture assumes that we have separate memories for instructions and data
  – Having a single memory for both would result in many more hazards

Control / jump hazards
• Happen whenever we have a conditional jump
• Consider the instructions
      add $4, $5, $6
      beq $1, $2, 40
      or  $7, $8, $9
• Need result of conditional branch (beq) before deciding whether to execute next instruction (or)

Control hazards (II)
[Figure: pipeline diagram for beq and or]

Pipelined datapath
[Figure]

Datapaths for pipelined organization
• Define five steps
  1. Fetch instruction from memory (IF)
  2. Instruction decode and register reads (ID)
  3. Execute arithmetic-logic operation on ALU (EX)
  4. Access operand in memory (MEM)
  5. Write back results into a register (WB)

Datapaths for pipelined organization
• Insert registers to save outputs of each step before they get updated by the next step
  1. IF/ID registers
  2. ID/EX registers
  3. EX/MEM registers
  4.
MEM/WB registers

A first try
[Figure: datapath with the new pipeline registers]

Comments
• This first try is not correct
  – Load instruction will not be implemented correctly
    • Address of destination register will be lost as soon as a new instruction is fetched
    • Must save it at each step

The almost correct datapaths
[Figure: register address follows instruction]

The almost correct datapaths
[Figure]

More problems
• Address of destination register is not always at the same place in all instructions
  – Could be instruction bits (20-16)
    • For all I-format instructions that write into a register
  – Could be instruction bits (15-11)
    • In R format instructions

Why?
• In R format instructions
      opcode | source | source | dest | shamt | funct
• In I format instructions
      opcode | source | s/d | constant/address

The solution
• Add a multiplexer at stage EX

More about data hazards
• Consider
      sub $2, $1, $3
      and $12, $2, $5
      or  $13, $6, $2
      add $14, $2, $2
      sw  $15, 100($2)
• Last four instructions depend on result of sub

More about data hazards
• $2 is updated at the end of last cycle of sub
• First instruction that would get the correct value of $2 would be the add

More about data hazards

  sub   IF   ID+Reg   EX       MEM      WB
  and        IF       ID+Reg   EX       MEM      WB
  or                  IF       ID+Reg   EX       MEM
  add                          IF       ID+Reg   EX
  sw                                    IF       ID+Reg

Adding a forwarding unit
[Figure]

More data hazards
• We can forward the results of sub instruction at the end of its EX step
  – In time for all four following instructions
• To do that we need a special forwarding unit
• Not all data hazards can be avoided
  – lw followed by any instruction accessing the loaded word

Why?
• lw loads word from RAM into a register
  – Goes through IF, ID+Reg, EX, MEM and WB steps
  – Register value is updated at the end of WB step
• Must delay any following instruction that wants to access the contents of the register

Data hazard detection unit
• Detects hazards that cannot be avoided
• Inserts no operation instructions (nop)
  – They do nothing!
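
The role of the data hazard detection unit can be illustrated in software. The sketch below is only an illustration, not the hardware design: it assumes a five-step pipeline with forwarding, in which the only unavoidable data hazard is a lw immediately followed by an instruction reading the loaded register (one stall cycle, as in the lw/sub example). The names Instr, NOP and insert_load_use_nops are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, source registers."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())


def insert_load_use_nops(program: List[Instr]) -> List[Instr]:
    """Insert one nop after each lw whose result is read by the next instruction.

    Assumption: forwarding handles every other dependence, so only the
    load-use case needs a bubble.
    """
    out: List[Instr] = []
    for instr in program:
        prev = out[-1] if out else None
        if prev is not None and prev.op == "lw" and prev.dest in instr.srcs:
            out.append(NOP)  # the loaded word is not available before MEM ends
        out.append(instr)
    return out


program = [
    Instr("lw", "$s0", ("$t1",)),          # lw  $s0, 20($t1)
    Instr("sub", "$t2", ("$s0", "$t3")),   # sub $t2, $s0, $t3  (load-use hazard)
    Instr("add", "$t4", ("$t2", "$t3")),   # depends on sub: assumed forwarded
]
scheduled = insert_load_use_nops(program)
print([i.op for i in scheduled])  # ['lw', 'nop', 'sub', 'add']
```

Only the lw/sub pair gets a nop; the sub/add dependence is assumed to be resolved by the forwarding unit.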
More about control hazards
• Outcome of conditional branch is not known until end of step EX
  – beq and bne use arithmetic unit to evaluate the branch condition
  – If branch is taken, we must abort the two following instructions
    • Easy because they have not yet updated anything

More about control hazards

  beq    IF   ID+Reg   EX       MEM     WB
  next        IF       ID+Reg   ABORT
  next                 IF       ABORT
  dest                          IF      ID+Reg   EX

More about control hazards

  beq    IF   ID+Reg   EX      MEM      WB
  next        IF       ABORT
  dest                 IF      ID+Reg   EX       MEM

Better implementation of beq/bne

MIPS Optimization
• Move comparison ahead to reduce the number of aborted instructions
  – Add a simple EQUAL/NOT EQUAL comparison hardware that tests outputs of register file
    • Bitwise XOR then ORing the results
  – Will return zero if the register contents are identical

Explanations
• Moving the jump address calculation one step ahead means that we will always do the calculation even when it is not needed.
• Simple comparator duplicates one ALU function

New problem
• We now need the correct values of the input registers in step ID
  – More data hazards
      add $t0, $t2, $t3
      beq $t0, $s0, 400
  – Data forwarding can reduce the number of nops but not eliminate them

New data hazards

  add   IF   ID+Reg   EX   MEM   WB
  nop
  nop
  beq                      IF    ID+Reg   EX   MEM

EXCEPTIONS AND INTERRUPTS

Interrupts (I)
• Request to interrupt the flow of execution of the CPU
• Detected by the CPU hardware
  – After it has executed the current instruction
  – Before it starts the next instruction.
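
The timing rule above (an interrupt request is only noticed between two instructions) can be sketched as follows. This is a simplified model under stated assumptions: instructions execute one at a time, each taking a given number of ticks, and a request raised at tick t is recognized at the first instruction boundary at or after t. The function name recognize_interrupts and the tick values are hypothetical.

```python
from typing import List, Tuple


def recognize_interrupts(durations: List[int],
                         request_ticks: List[int]) -> List[Tuple[int, int]]:
    """Return (request tick, index of the instruction after which it is recognized)."""
    pending = sorted(request_ticks)
    recognized: List[Tuple[int, int]] = []
    now = 0
    for index, duration in enumerate(durations):
        now += duration  # the current instruction always runs to completion
        # Requests are only examined here, at the instruction boundary.
        while pending and pending[0] <= now:
            recognized.append((pending.pop(0), index))
    return recognized


# Four instructions lasting 1, 3, 1 and 2 ticks; requests raised at ticks 2 and 6.
result = recognize_interrupts([1, 3, 1, 2], [2, 6])
print(result)  # [(2, 1), (6, 3)]
```

The request raised at tick 2 arrives while instruction 1 is executing and is only recognized once that instruction has finished.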
Interrupts (II)
• When an interrupt occurs:
  a) The current state of the CPU (program counter, program status word, contents of registers, and so forth) is saved, normally on the top of a stack
  b) A new CPU state is fetched

Interrupts (III)
• New state includes a new hardware-defined value for the program counter
  – Cannot "hijack" interrupts
• Process is totally transparent to the task being interrupted
  – A process never knows whether it has been interrupted or not

Types of interrupts (I)
• I/O completion interrupts
  – Notify the OS that an I/O operation has completed
• Timer interrupts
  – Notify the OS that a task has exceeded its quantum of CPU time

Types of interrupts (II)
• Traps
  – Notify the OS of a program error (division by zero, illegal op code, illegal operand address, ...) or a hardware failure
• System calls
  – Notify OS that the running task wants to submit a request to the OS
    • Notification of another event

A surprising discovery
• Programs do interrupt themselves!

Context switches
• Each interrupt will result in two context switches:
  – One when the running task is interrupted
  – Another when it regains the CPU
• Context switches are not cheap
• The overhead of any simple system call is two context switches
  Remember that for 4330!

Prioritizing interrupts (I)
• Interrupt requests may occur while the system is processing another interrupt
• Not all interrupts are equally urgent
  – Some are more urgent than others
  – Also true in real life

Prioritizing interrupts (II)
• The best solution is to prioritize interrupts and assign to each source of interrupts a priority level
  – New interrupt requests will be allowed to interrupt lower-priority interrupts but will have to wait for the completion of all other interrupts
• Solution is known as vectorized interrupts.
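
The prioritization rule can be sketched in a few lines. This is only an illustrative model, not how any particular CPU implements it: a new request preempts the handler being serviced only if its priority is strictly higher; otherwise it waits until nothing of equal or higher priority is active. The class name InterruptController, the interrupt names and the priority numbers are all hypothetical.

```python
import heapq
from typing import List, Tuple


class InterruptController:
    """Toy model of prioritized interrupts (higher number = more urgent)."""

    def __init__(self) -> None:
        self.active: List[Tuple[str, int]] = []         # nested handlers, innermost last
        self.pending: List[Tuple[int, int, str]] = []   # heap of (-priority, order, name)
        self.order = 0
        self.log: List[str] = []

    def _current_priority(self) -> int:
        return self.active[-1][1] if self.active else -1

    def request(self, name: str, priority: int) -> None:
        if priority > self._current_priority():
            self.active.append((name, priority))        # interrupts the lower-priority handler
            self.log.append(f"start {name}")
        else:
            heapq.heappush(self.pending, (-priority, self.order, name))
            self.order += 1
            self.log.append(f"defer {name}")

    def finish(self) -> None:
        name, _ = self.active.pop()
        self.log.append(f"end {name}")
        # Start the most urgent waiting request if it outranks what we resume.
        if self.pending and -self.pending[0][0] > self._current_priority():
            neg_priority, _, waiting = heapq.heappop(self.pending)
            self.active.append((waiting, -neg_priority))
            self.log.append(f"start {waiting}")


ctl = InterruptController()
ctl.request("disk", 1)
ctl.request("timer", 2)           # higher priority: interrupts disk handler
ctl.request("timer2", 2)          # same priority: must wait
ctl.request("machine_check", 3)   # highest priority: interrupts timer handler
for _ in range(4):
    ctl.finish()
print(ctl.log)
```

In this run the second timer request is deferred and only starts once the first timer handler has completed, while the machine check is serviced immediately.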
Example from real life
• Let us try to prioritize
  – Phone is ringing
  – Washer signals end of cycle
  – Dark smoke is coming out of the kitchen
  – …
• With vectorized interrupts, a phone call will never interrupt another phone call

The solution
  1. Smoke in the kitchen
  2. Phone is ringing
  3. End of washer cycle
  4. More low-priority stuff

MIPS Implementation (I)
• Interrupts are a special case of a branch
  – Use same techniques for handling control hazards
• Almost all MIPS interrupts jump to the same hardware address (0x80000180)
  – MIPS uses a special register to pass along the type of interrupt to the interrupt handler
    • The Cause register

MIPS Implementation (II)
• MIPS also saves the address + 4 of the affected instruction in a special register
  – EPC register
• A STATUS register allows selective disabling of interrupts
  – Useful for handling short critical sections in single-threaded kernel

Issues (I)
• Interrupted instruction may have to be restarted
  – Typical for I/O completion interrupts
• Must then maintain precise exceptions that accurately identify the instruction being interrupted
  – Not true for hardware interrupts

Issues (II)
• Must be able to restart instruction at the exact point it was interrupted
  – Not always easy on many architectures
• MIPS solution is to roll back everything and restart instruction as if nothing had happened
  – Easier on MIPS since register/memory update is always the last step of any instruction
  – Must still ensure that we can restore the original values of all registers

Branch prediction
• CPU will try to predict whether a branch will be taken or not
• Important for loops
  – Branch is taken at every iteration but last one
• See speculative execution

Parallelism
• Instruction-level parallelism (ILP)
• Two ways:
  – Increasing the depth of the pipeline:
    • More steps can be executed in parallel
  – Multiple issue:
    • We duplicate some units (ALU)
    • Two or more units can be at the same pipeline stage

An example
• Could modify the toy MIPS architecture by adding
a second ALU:
  – Would allow RR instructions to be executed in parallel with load and store instructions
  – Would also need extra ports in the register bank
• Faster but much more complex

Hazards
• Become an even bigger issue
• Some architectures assume that the compiler will take care of all data hazards
  – Will never issue sequence of instructions with unsatisfied dependencies
• Other architectures check for problems

Speculative execution (I)
• Can speculate that
  – A branch will not be taken (think of loops)
  – A store that precedes a load will not store at the address the load will use
  and execute the instruction(s) hoping for the best
• If speculation is wrong, we must undo what we have done

Speculative execution (II)
• Any speculation mechanism must include methods to
  – Check if the speculation was correct
  – Undo the effect of the speculated instructions
• Quite complex
• Can be done by the compiler or the hardware

Fallacies
• Pipelining is easy
• Pipelining ideas can be implemented independently of technologies

Pitfalls
• Instruction set must be pipelining friendly