IMPLEMENTATION OF PRECISE INTERRUPTS IN PIPELINED PROCESSORS
James E. Smith, Andrew R. Pleszkun
Proceedings of ISCA, 1985
Presented By: Vishal Shah

Outline
Interrupts & Precise Interrupts
Model Architecture and Implementation
Handling Interrupts Prior to Instruction Issue
In-order Instruction Completion (IIC)
Reorder Buffer (ROB)
ROB + Bypass Paths, History Buffer, Future File
Performance Evaluation
Extensions
Summary
Current Architectures
Retrospective

Interrupts
Exceptions
  – Division by zero, floating-point anomaly
  – Page fault, misaligned memory access
  – Using an undefined instruction
Traps
  – Breakpoints, system calls
External Interrupts
  – I/O device request
  – Timer interrupt
  – Hardware malfunctions

Problem due to Interrupts
Sequential model of program execution
High-performance implementations are pipelined and use parallel functional units
Sequential model and pipelined implementation clash when interrupts occur
Hardware may not be in a consistent state when interrupts occur

Example
DIV.D  F0, F2, F4
ADD.D  F1, F1, F8
SUB.D  F6, F6, F5

Precise Interrupts
Saved process state is consistent with the sequential architecture model
All instructions preceding the one indicated by the saved PC have completed
All instructions following the one indicated by the saved PC have not modified the process state
If the interrupt is an exception caused by an instruction, the saved PC points to that instruction

Need for Precise Interrupts
Restarting I/O & Timer Interrupts
Software Debugging
Graceful recovery from arithmetic exceptions
Serving page faults in virtual memory systems
Simulating unimplemented opcodes
Implementing virtual machines

Model Architecture and Implementation
Register-Register Architecture
Only one set of registers
Instructions stay in order until they are issued
Parallel Functional Units
Process state
  – General Purpose Registers
  – Program Counter
  – Main Memory

An Imprecise Interrupt
     Statement                   Comments                 Execution Time
 0          R2 <- 0              Init. loop index
 1          R0 <- 0              Init. loop count
 2          R5 <- 1              Loop inc. value
 3          R7 <- 100            Maximum loop count
 4   Loop:  R1 <- (R2 + A)       Load A(I)                11 clock cycles
 5          R3 <- (R2 + B)       Load B(I)                11 clock cycles
 6          R4 <- R1 +f R3       Floating add             6 clock cycles
 7          R0 <- R0 + R5        Inc. loop count          2 clock cycles
 8          (R0 + C) <- R4       Store C(I)
 9          R2 <- R2 + R5        Inc. loop index
10          P = Loop: R0 != R7   Cond. branch not equal   2 clock cycles

Handling Interrupts Prior to Instruction Issue
Easy to maintain precise interrupts in this case
  – Stop issuing new instructions
  – Let all previously issued instructions complete
  – Now the process is in a precise state
Examples
  – Privileged instruction faults
  – Unimplemented instructions
  – Some external interrupts, which can be checked at the issue stage

Maintaining Precise Interrupts
In-order Instruction Completion
Reorder Buffer
Reorder Buffer with Bypass Paths
History Buffer
Future File

In-order Instruction Completion (IIC)
Instructions modify process state only when all previous ones are free of exceptions
Assume
  – Pipeline delays are fixed, i.e., independent of the operands
Result bus may be reserved at the time of issue

IIC - Result Shift Register (RSR) [figure]

Result Shift Register (contd.)
Registers
  – An instruction that takes j clock periods reserves stage j in the RSR
  – Stages 1..j-1 that are not already reserved are loaded with null control info, so that an instruction that issues later cannot reserve a stage i (i<j)
  – Each clock period, the control information in the RSR is shifted up one stage (towards stage 1)
  – Logic on the result bus checks for exceptions and takes the actions needed to ensure precise interrupts

Result Shift Register
Main Memory (two methods)
  – Stores wait for the RSR to be empty before issuing, or
  – Stores issue and are held in the load/store pipeline until all preceding instructions are exception-free, by making a dummy store entry in the RSR
Program Counter
  – Store the PC of the instruction in the RSR when it issues
  – In case of an exception, the PC can be restored from the RSR

IIC – Advantages and Disadvantages
Advantage
  – Very simple to implement
Disadvantages
  – Fast instructions are held up at the issue register
  – These block the issue register while slower instructions behind them could issue
R0 <- R1 +f R2   (FP Add, 6 clock cycles)
R4 <- R3 + R8    (INT Add, 2 clock cycles)
R7 <- R6 +f R5   (FP Add, 6 clock cycles)

Reorder Buffer (ROB)
Separate the process of executing instructions from that of modifying process state
Instructions
  – Can finish out of order
  – Must modify process state in order
ROB (circular queue) reorders the instructions before they modify the process state

Reorder Buffer [figure]

Reorder Buffer
Main memory preciseness is maintained as in the IIC scheme
PC is stored in the reorder buffer at issue time
The RSR typically has more rows than the reorder buffer, so the PC is saved in the reorder buffer

Reorder Buffer – Advantages and Disadvantages
Advantage
  – A definite improvement over the IIC scheme, since instructions can complete out of order
Disadvantage
  – Results computed out of order are held in the reorder buffer until all previously issued instructions have written to the register file
  – Thus, instructions dependent on these results cannot issue until the results are written

Reorder Buffer with Bypass Paths [figure]

Reorder Buffer with Bypass Paths
Handling multiple bypass paths
  – Only the latest reorder buffer entry is used
  – When an instruction is placed in the reorder buffer, any existing entries with the same destination register are inhibited from matching a bypass check
Disadvantage
  – Number of bypass comparators needed
  – Amount of circuitry needed for the multiple bypass check

History Buffer
Maintain a history buffer instead of a reorder buffer
At issue time, load a buffer entry with control info and also store the current value of the destination register
Results are written directly to the register file
Buffer should be long enough (~no. of pipeline stages)

History Buffer Organization [figure]

History Buffer – Extra H/W
A large buffer to store history
3 read ports needed in the register file instead of 2
If the result bus has a bypass around the register file, the bypass has to be connected to the history buffer

Future File
Two separate register files
  – Architectural File: reflects the state of the sequential machine
  – Future File: working file used for computation by the functional units
A reorder buffer receives results at the same time as the Future File
Results are transferred in order from the reorder buffer to the Architectural File

Future File
Useful when the machine has multiple register sets
No bypass problem like History Buffer

Performance Evaluation
CRAY-1S simulator that reports the total number of clock cycles required
First 14 Lawrence Livermore Loops used as the simulation workload
Relative performance reported
3 groups
  – In-order Instruction Completion
  – Reorder Buffer
  – Reorder Buffer w/ Bypass, History Buffer, Future File

Relative Performance with two methods for handling stores

Stores blocked until result pipeline empty
Number of Buffer Entries   3        4        5        8        10
In-order                   1.2322   1.2322   1.2322   1.2322   1.2322
Reorder                    1.3315   1.2183   1.1954   1.1808   1.1808
Reorder with Bypass        1.3069   1.1743   1.1439   1.1208   1.1208

Stores held in memory pipeline after issue
Number of Buffer Entries   3        4        5        8        10
In-order                   1.156    1.156    1.156    1.156    1.156
Reorder                    1.3058   1.1724   1.1348   1.1167   1.1167
Reorder with Bypass        1.2797   1.1152   1.0539   1.0279   1.0279

Handling Other State Values
Additional state may include pointers to page tables, condition codes, interrupt mask conditions, etc.
The additional state can be saved in the reorder buffer or the history buffer

Virtual Memory
It should be possible to recover from page faults
The address translation pipeline should process load/store (L/S) instructions in order
If an L/S instruction causes an exception, all subsequent L/S instructions in the address translation pipeline are cancelled

Cache
Write-through Caches
  – Cache is updated immediately, while the main memory write-through is handled as before
  – In case of an interrupt, flush the cache
Write-back Caches
  – Before writing back, empty the reorder buffer or check it for data belonging to the line being written back
  – If such data is found, delay the write-back until the data has made its way to the cache
  – With a history buffer, the cache line can be saved there

Summary
5 methods for implementing precise interrupts
Performance degradation can range from 3% to 25%
Trade-offs between performance and cost
IIC offers decent performance at a very low cost

Current Architectures
MIPS R2000/3000, MIPS R4000, Pentium(?)
  – In-order Instruction Completion
Pentium Pro, HP PA8000
  – Reorder Buffer
P6, Athlon, R10000, PowerPC620
  – ROB with Bypass Paths
UltraSparc III, Gekko (Nintendo Games)
  – Future File
Alpha 21064, IBM Power-1/2, MIPS R8000
  – Fast imprecise interrupts, slow precise interrupts

Retrospective
Main contribution of the paper
  – Reorder Buffers
Reorder buffers are widely used for register renaming and speculative execution in current architectures
Some methods may not be easy to implement as processors become more complex
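
Reorder Buffer: illustrative sketch
The reorder buffer slides above describe results finishing out of order but updating process state in order. The following is a minimal sketch of that idea under simplifying assumptions: registers are the only process state, there are no bypass paths, and completion is signalled by hand rather than by a result shift register. The names ReorderBuffer, issue, complete and commit are hypothetical and do not come from the paper or the slides.

```python
from collections import deque


class Entry:
    """One reorder buffer slot (field names are illustrative)."""

    def __init__(self, pc, dest):
        self.pc = pc            # PC saved at issue time
        self.dest = dest        # destination register name
        self.value = None       # result, filled in on completion
        self.done = False       # has the functional unit finished?
        self.exception = False  # did the instruction raise an exception?


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()  # left end = oldest issued instruction
        self.regs = {}          # architectural register file

    def issue(self, pc, dest):
        """Allocate a slot in program order; None means issue must stall."""
        if len(self.entries) == self.size:
            return None
        entry = Entry(pc, dest)
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, exception=False):
        """Record a result; instructions may complete in any order."""
        entry.value = value
        entry.exception = exception
        entry.done = True

    def commit(self):
        """Retire finished entries from the head, in program order.

        Returns the PC of a faulting instruction, or None if no
        exception reached the head of the buffer.
        """
        while self.entries and self.entries[0].done:
            head = self.entries[0]
            if head.exception:
                # Younger results never reach the register file.
                self.entries.clear()
                return head.pc
            self.regs[head.dest] = head.value
            self.entries.popleft()
        return None


# Usage, loosely following the DIV.D / ADD.D / SUB.D example slide.
rob = ReorderBuffer(size=4)
div = rob.issue(pc=0, dest="F0")
add = rob.issue(pc=1, dest="F1")
sub = rob.issue(pc=2, dest="F6")

rob.complete(sub, value=9)         # youngest instruction finishes first
first_try = rob.commit()           # nothing retires: DIV.D is still pending

rob.complete(div, value=3)
rob.complete(add, exception=True)  # ADD.D raises an exception
saved_pc = rob.commit()            # DIV.D retires, then ADD.D is reported
```

In this sketch the saved PC identifies ADD.D, the register file holds only the DIV.D result, and the SUB.D result is discarded even though it finished first, which matches the definition on the Precise Interrupts slide.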