Lecture 5: Interrupts, Superscalar
Professor Alvin R. Lebeck
Computer Science 220 / ECE 252
Fall 2008

Admin
• Homework #1 Due Today
• Homework #2 Assigned
• Reading
  – H&P Chapter 2 & 3 (suggested)
  – Research papers (not yet ready to read, but will be soon!):
    » Hinton et al: “The Microarchitecture of the Pentium 4 Processor”
    » Palacharla, Jouppi, and Smith: “Complexity-Effective Superscalar Processors”
    » Akkary, Rajwar, and Srinivasan: “Checkpoint Processing and Recovery”

© 2008 Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti, Katz

Review: Hazards
Data Hazards
• RAW – only one that can occur in simple 5-stage pipeline
• WAR, WAW
• Data Forwarding (Register Bypassing)
  – send data from one stage to another bypassing the register file
• Still have load use delay
Structural Hazards
• Replicate Hardware, scheduling
Control Hazards
• Compute condition and target early (delayed branch)

Review: Dynamic Branch Prediction
• Solution: 2-bit counter where prediction changes only if mispredict twice:
• Increment for taken, decrement for not-taken
  – 00,01,10,11
• Helps when target is known before condition
[Figure: 2-bit counter state diagram with two Predict Taken states and two Predict Not Taken states, connected by T (taken) and NT (not taken) transitions]

Review: Correlating Branches
• Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
• Tournament
  – Choose between alternative predictors
• How do you choose?
[Figure: correlating predictor; labels: Branch address, 2-bits per branch predictor, Prediction]
[Figure label: 2-bit global branch history]

Review: Need Address @ Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  – Note: must check for branch match now, since can’t use wrong branch address
• Procedure Return Addresses Predicted with a Stack
[Figure: BTB with entries 0 … n-1, looked up with the PC of the instruction to fetch; each entry holds a predicted PC and a branch prediction (taken or not taken); on a match (=): yes, use predicted PC; otherwise: no, not branch]

Review: Multicycle Ops in Pipeline
[Figure: IF and ID/RF feed parallel execution paths: EX; multiply stages M1–M7; add stages A1–A4; FP/INT divide unit (not pipelined, 25 clocks); followed by MEM and WB]

Interrupts and Exceptions
• Unnatural change in control flow
• warning: varying terminology
  – “exception” sometimes refers to all cases
  – “Trap”: software trap, hardware trap
• Exception is potential problem with program
  – condition occurs within the processor
  – segmentation fault
  – bus error
  – divide by 0
  – Don’t want my bug to crash the entire machine
  – page fault (virtual memory…)

Interrupts and Exceptions
• Interrupt is external event
  – devices: disk, network, keyboard, etc.
  – clock for timeslicing
  – These are useful events, must do something when they occur.
• Trap is user-requested exception
  – operating system call (syscall)

Handling an Exception/Interrupt
[Figure: user program (ld, add, st, div, beq, ld, sub, bne) transferring control to the interrupt handler, which returns with RETT]
• Invoke specific kernel routine based on type of interrupt
  – interrupt/exception handler
• Must determine what caused interrupt
  – could use software to examine each device
  – PC = interrupt_handler
• Vectored Interrupts
  – PC = interrupt_table[i]
• Kernel initializes table at boot time
• Clear the interrupt
• May return from interrupt (RETT) to different process (e.g., context switch)
• Similar mechanism is used to handle interrupts, exceptions, traps

Execution Mode
• What if interrupt occurs while in interrupt handler?
  – Problem: could lose information for one interrupt; clearing interrupt #1 clears both #1 and #2
  – Solution: disable interrupts
• Disabling interrupts is a protected operation
  – Only the kernel can execute it
  – user vs. kernel mode
  – mode bit in CPU status register
• Other protected operations
  – installing interrupt handlers
  – manipulating CPU state (saving/restoring status registers)
• Changing modes
  – interrupts
  – system calls (syscall instruction)

A System Call (syscall)
[Figure: user program (ld, add, st, TA 6, beq, ld, sub, bne); kernel containing the trap handler and service routines; return via RETT]
• Special Instruction to change modes and invoke service
  – read/write I/O device
  – create new process
• Invokes specific kernel routine based on argument
• kernel defined interface
• May return from trap to different process (e.g., context switch)
• RETT, instruction to return to user process

Interrupts/exceptions
• classifying interrupts
  – terminal (fatal) vs. restartable (control returned to program)
  – synchronous (internal) vs. asynchronous (external)
  – user vs. coerced
  – maskable (ignorable) vs. non-maskable
  – between instructions vs. within instruction

Precise Exceptions
“unobserved system can exist in any intermediate state, upon observation system collapses to well-defined state”
  – 2nd postulate of quantum mechanics
• system = processor, observation = interrupt
• what is the “well-defined” state?
  – von Neumann: “sequential, instruction atomic execution”
  – precise state at interrupt
    » all instructions older than interrupt are complete
    » all instructions younger than interrupt haven’t started
• implies interrupts are taken in program order
• necessary for VM (why?), “highly recommended” by IEEE

Pipelining Complications
• Interrupts (Exceptions)
  – 5 instructions executing in 5 stage pipeline
  – How to stop the pipeline?
  – How to restart the pipeline?
  – Who caused the interrupt?

Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

Pipelining Complications
• Simultaneous exceptions in > 1 pipeline stage
  – Load with data page fault in MEM stage
  – Add with instruction page fault in IF stage
• Solution #1
  – Interrupt status vector per instruction
  – Defer check til last stage, kill state update if exception
• Solution #2
  – Interrupt ASAP
  – Restart everything that is incomplete
• Another advantage for state update late in pipeline!

Interrupts/Exceptions are Nasty
• odd bits of state must be precise (e.g., condition codes)
• delayed branches
  – what if instruction in delay slot takes an interrupt?
• Out of order Writes (e.g., autoinc, multicycle ops)
  – must undo write (e.g., future-file, history-file)
• some machines had precise interrupts only in integer pipe
  – sufficient for implementing VM (e.g., VAX/Alpha)
• Lucky for us, there’s a nice, clean way to handle precise state
  – We’ll see how this is done in a couple of lectures ...

Pipelining x86
• The x86 ISA has some really nasty instructions - how did Intel ever figure out how to build a pipelined x86 microprocessor?
• Solution: at runtime, “crack” x86 instructions (macro-ops) into RISC-like micro-ops
  – First used in P6 (Pentium Pro)
  – Used in all subsequent x86 processors, including those from AMD
• What are the potential challenges for implementing this solution?
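
To make "cracking" concrete, here is a toy sketch in Python. It is not a description of Intel's actual decoder: the function name `crack`, the temporaries `t0`/`t1`, the tuple layout of a micro-op, and the bracket notation for memory operands are all assumptions made for illustration. It only shows the general idea that a macro-op with a memory operand splits into separate load, ALU, and store micro-ops.

```python
from typing import List, Tuple

MicroOp = Tuple[str, ...]  # hypothetical format: (opcode, operands...)


def crack(macro_op: str, dst: str, src: str) -> List[MicroOp]:
    """Crack a two-operand macro-op (dst <- dst op src) into micro-ops.

    An operand written in brackets, e.g. "[rbx]", is treated as a
    memory operand; anything else is treated as a register name.
    """
    def is_mem(operand: str) -> bool:
        return operand.startswith("[") and operand.endswith("]")

    uops: List[MicroOp] = []
    a, b = dst, src
    if is_mem(src):
        # Memory source: load it into a temporary first.
        uops.append(("load", "t1", src))
        b = "t1"
    if is_mem(dst):
        # Memory destination is also read: load its old value.
        uops.append(("load", "t0", dst))
        a = "t0"
    # RISC-like three-operand ALU micro-op: a <- a op b.
    uops.append((macro_op, a, a, b))
    if is_mem(dst):
        # Write the result back to memory.
        uops.append(("store", dst, "t0"))
    return uops


if __name__ == "__main__":
    for uop in crack("add", "[rbx]", "rax"):
        print(uop)
```

Even this toy suggests one possible challenge: a single macro-op turns into a variable number of micro-ops, yet for precise state the whole macro-op still has to appear to execute atomically.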

Where are We
• principles of pipelining
  – pipeline depth: clock rate vs. number of stalls (CPI)
• hazards
  – structural
  – data (RAW, WAR, WAW)
  – control
• Branch prediction
• multi-cycle operations
  – structural hazards, WAW hazards
• interrupts
  – precise state
• Next up: CPI < 1

Getting CPI < 1: Issuing Multiple Instructions/Cycle
• “Flynn bottleneck”
  – single issue performance limit is CPI = IPC = 1
  – hazards + overhead → CPI >= 1 (IPC <= 1)
• diminishing returns from deep pipelines
• solution: issue multiple instructions per cycle
• Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler (statically scheduled) or by HW (Tomasulo; dynamically scheduled)
  – First superscalar IBM America → RS6000 → Power1
  – Pentium4, IBM PowerPC, Sun SuperSparc, DEC Alpha, HP PA8000

Base Implementation
• statically scheduled (in-order) superscalar
  – executes unmodified sequential programs
  – Figures out on its own what can be done in parallel
  – e.g., Sun UltraSPARC, Alpha 21164
  – we’ll start with this one
  – What has to change from single issue to multiple issue?

CPI < 1: Issuing Multiple Instructions/Cycle
• Ex 2-way superscalar: 1 FP & 1 anything else
  – Fetch 64-bits/clock cycle; Int on left, FP on right
  – Can only issue 2nd instruction if 1st instruction issues
  – More ports for FP registers to do FP load & FP op in a pair

Type               Pipe Stages
Int. instruction   IF  ID  EX  MEM WB
FP instruction     IF  ID  EX  MEM WB
Int. instruction       IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  MEM WB
Int. instruction           IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  MEM WB

• 1 cycle load delay expands to 3 instructions in SS
  – instruction in right half can’t use it, nor instructions in next slot

Implications of Superscalar
[Figure: pipeline datapath with stages F, D, X, M, W; PC, branch predictor (BP), I$, regfile, D$; pipeline latches F/D, D/X, X/M, M/W]
• what is involved in
  – fetching two instructions per cycle?
  – decoding two instructions per cycle?
  – executing two ALU operations per cycle?
  – accessing the data cache twice per cycle?
  – writing back two results per cycle?
• what about 4 or 8 instructions per cycle?

Wide Fetch
• Fetch N instructions per cycle
• if instructions are sequential...
  – and on same cache line → nothing really
  – and on different cache lines → banked I$ + combining network
• if instructions are not sequential...
  – more difficult
  – two serial I$ accesses (access1 → predict target → access2)? no
• note: embedded branches OK as long as predicted NT
  – serial access + prediction in parallel
  – if prediction is T, discard serial part after branch
• Trace Cache…

Wide Decode
• Decode N instructions per cycle
• actually decoding instructions?
  – easy if fixed length instructions (multiple decoders)
  – harder (but possible) if variable length
• reading input register values?
  – 2N register read ports (register file read latency ~2N)
  – actually less than 2N, since most values come from bypasses
• what about the stall logic to enforce RAW dependences?
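
One way to get a feel for the stall-logic question is a small software model. The sketch below is an illustration under stated assumptions, not the lecture's design: it assumes full bypassing (so only a load sitting in D/X forces a stall), in-order issue (a stalled older slot blocks every younger slot in the group), and hypothetical names throughout (`Inst`, `must_stall`, `comparator_count`, and the opcode strings).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Inst:
    op: str                    # hypothetical opcode name, e.g. "LOAD", "ADD"
    rd: Optional[int] = None   # destination register (None if no result)
    rs1: Optional[int] = None  # first source register
    rs2: Optional[int] = None  # second source register


def must_stall(decode: Sequence[Inst], dx: Sequence[Inst]) -> List[bool]:
    """Per-slot stall decision for the group being decoded this cycle.

    decode: instructions in D, oldest first.
    dx:     instructions currently in the D/X latch.
    """
    stalls: List[bool] = []
    for slot, inst in enumerate(decode):
        srcs = {r for r in (inst.rs1, inst.rs2) if r is not None}
        # Load-use hazard: a source matches the destination of a load in D/X.
        load_use = any(o.op == "LOAD" and o.rd in srcs for o in dx)
        # Intra-group RAW: a source is produced by an older instruction
        # that is being decoded in the same cycle.
        intra_group = any(o.rd in srcs for o in decode[:slot])
        # In-order issue: if the previous (older) slot stalls, so does this one.
        blocked = bool(stalls) and stalls[-1]
        stalls.append(load_use or intra_group or blocked)
    return stalls


def comparator_count(n: int) -> int:
    """Register-number comparisons needed per cycle in this model."""
    against_dx = 2 * n * n          # 2 sources x n decoding insts x n D/X dests
    within_group = n * (n - 1)      # 2 sources x n(n-1)/2 older/younger pairs
    return against_dx + within_group


if __name__ == "__main__":
    group = [Inst("ADD", rd=3, rs1=1, rs2=2), Inst("SUB", rd=5, rs1=3, rs2=4)]
    in_dx = [Inst("LOAD", rd=7, rs1=8), Inst("ADD", rd=9, rs1=1, rs2=1)]
    print(must_stall(group, in_dx))
    print([comparator_count(n) for n in (1, 2, 4)])
```

In this model the comparison count is 3N² − N, so it grows roughly quadratically with issue width.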

N² Dependence Check Logic
• remember stall logic for single issue pipeline
  – rs1(D) == rd(D/X) || rs1(D) == rd(X/M) || rs1(D) == rd(M/W)
  – same for rs2(D)
  – full-bypassing reduces to rs1(D) == rd(D/X) && op(D/X) == LOAD
• doubling issue width (N) quadruples stall logic!
  – not only 2 instructions in D, but two instructions in every stage
  – (rs1(D1) == rd(D/X1) && op(D/X1) == LOAD)
  – (rs1(D1) == rd(D/X2) && op(D/X2) == LOAD)
  – repeat for rs1(D2), rs2(D1), rs2(D2)
  – also check dependence of 2nd instruction on 1st: rs1(D2) == rd(D1)
• “N² dependence cross-check”
  – for N-wide pipeline, stall (and bypass) circuits grow as N²

Superscalar Stalls
• invariant: stalls propagate upstream to younger instructions
• what if older instruction in issue “pair” (inst0) stalls?
  – younger instruction (inst1) stalls too, cannot pass it
• what if younger instruction (inst1) stalls?
  – can older instruction from next group (inst2) move up?
• Rigid pipeline: No
• Fluid pipeline: Yes

Wide Execute
• What does it take to execute N instructions per cycle?
• multiple execution units...N of every kind?
  – N ALUs? OK, ALUs are small
  – N FP dividers? no, FP dividers are huge (and fdiv is uncommon)
• typically have some mix (proportional to instruction mix)
  – RS/6000: 1 ALU/memory/branch + 1 FP
  – Pentium: 1 any + 1 ALU
  – Pentium II: 1 ALU/FP + 1 ALU + 1 load + 1 store + 1 branch
  – Alpha 21164: 1 ALU/FP/branch + 2 ALU + 1 load/store

N² Bypass
• N² bypass logic... OK
  – only 5-bit quantities
  – compare to generate 1-bit outcomes
  – similar to stall logic
• N² bypass buses... not even close to OK
  – 32-bit or 64-bit quantities
  – broadcast, route, and multiplex (mux)
  – difficult to lay out and route all the wires
  – wide (SLOW) muxes
• big design problem today

One Solution to N² Bypass: Clustering
[Figure: functional units grouped into clusters between the D/X and X/M latches]
• group functional units into k clusters
  – full bypass within cluster
  – no bypass between clusters
  – ~(N/k) inputs at each mux
  – ~(N/k)² routed buses in each cluster
• steer instructions to different clusters
  – dependent instructions to same cluster
  – exploit intra-cluster bypass
  – static or dynamic steering is possible
• e.g., Alpha 21264
  – 4-wide, 300MHz
  – full bypass didn’t fit into 1 clock cycle
  – 2 clusters with full intra-cluster bypass

Wide Memory Access
• what is involved in accessing memory for multiple instructions per cycle?
• multi-banked D$
  – requires bank assignment and conflict-detection logic
• (rough) instruction mix: 20% loads, 15% stores
  – for width N, we need about 0.2*N load ports, 0.15*N store ports

Wide Writeback
• what is involved in writing back multiple instructions per cycle?
• nothing too special, just another port on the register file
  – everything else is taken care of earlier in pipeline
• adding ports isn’t free, though
  – increases area
  – increases access latency

Multiple Issue Summary
• superscalar problem spots
• fetch, branch prediction → trace cache?
• decode (N² dependence cross-check)
• execute (N² bypass) → clustering?

Can we do better?
• Problem: Stall in ID stage if any data hazard.
• Your task: Teams of two, propose a design to eliminate these stalls.

MULD  F2, F3, F4    (long latency…)
ADDD  F1, F2, F3
ADDD  F3, F4, F5
ADDD  F1, F4, F5

Next Time
• Dynamic Scheduling
• Read papers
• HW #2 Assigned
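
As a starting point for the “Can we do better?” exercise, the sketch below lists the data hazards in the four-instruction MULD/ADDD sequence. It assumes the first operand of each instruction is the destination register. The names `PROGRAM` and `find_hazards` are made up for this illustration. The check is purely pairwise and does not model the pipeline or intervening writes.

```python
from itertools import combinations

# (opcode, destination, source1, source2).
# Destination-first operand order is an assumption about the slide's listing.
PROGRAM = [
    ("MULD", "F2", "F3", "F4"),
    ("ADDD", "F1", "F2", "F3"),
    ("ADDD", "F3", "F4", "F5"),
    ("ADDD", "F1", "F4", "F5"),
]


def find_hazards(program):
    """Return (kind, older_index, younger_index, register) for each pair."""
    hazards = []
    for (i, older), (j, younger) in combinations(enumerate(program), 2):
        _, old_dst, *old_srcs = older
        _, new_dst, *new_srcs = younger
        if old_dst in new_srcs:
            # Younger reads what older writes: true dependence.
            hazards.append(("RAW", i, j, old_dst))
        if new_dst in old_srcs:
            # Younger overwrites a register the older one still has to read.
            hazards.append(("WAR", i, j, new_dst))
        if new_dst == old_dst:
            # Both write the same register; final value must be the younger's.
            hazards.append(("WAW", i, j, new_dst))
    return hazards


if __name__ == "__main__":
    for kind, i, j, reg in find_hazards(PROGRAM):
        print(f"{kind} on {reg}: instruction {i} -> instruction {j}")
```

Under that operand-order assumption, only the dependence on F2 is a RAW hazard; the hazards on F3 and F1 are WAR and WAW.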