IMPLEMENTATION OF
PRECISE INTERRUPTS IN
PIPELINED PROCESSORS
James E. Smith, Andrew R. Pleszkun
Proceedings of ISCA, 1985
Presented By:
Vishal Shah
Outline

Interrupts & Precise Interrupts
Model Architecture and Implementation
Handling Interrupts Prior to Instruction Issue
In-order Instruction Completion (IIC)
Reorder Buffer (ROB)
ROB + Bypass Paths, History Buffer, Future File
Performance Evaluation
Extensions
Summary
Current Architectures
Retrospective
Interrupts

Exceptions
– Division by zero, Floating point anomaly
– Page Fault, Misaligned Memory Access
– Using an undefined instruction

Traps
– Breakpoints, System Calls

External Interrupts
– I/O Device Request
– Timer interrupt
– Hardware malfunctions
Problem due to Interrupts

Sequential model of program execution
High-performance implementations are pipelined and use parallel functional units
The sequential model and the pipelined implementation clash when interrupts occur
Hardware may not be in a consistent state when interrupts occur
Example
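DIV.D F0, F2, F4
ADD.D F1, F1, F8
SUB.D F6, F6, F5
If the long-latency DIV.D raises an exception after ADD.D and SUB.D have already written F1 and F6, the saved state no longer matches the sequential model.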
Precise Interrupts

Saved process state is consistent with the sequential architecture model
All instructions preceding the one indicated by the saved PC have completed
All instructions following the one indicated by the saved PC have not modified the process state
If the interrupt is an exception caused by an instruction, the saved PC points to that instruction
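
The first two conditions can be written down as a small check over a toy trace. This is only a sketch: the function name `is_precise` and the encoding of a run as a list of flags are assumptions made for illustration, not something taken from the paper.

```python
def is_precise(modified_state, saved_pc):
    """Check the precise-interrupt conditions on a toy trace.

    modified_state[i] is True if instruction i (in program order) has
    already modified the process state; saved_pc is the index of the
    instruction the saved PC points to. Both encodings are hypothetical.
    """
    before = modified_state[:saved_pc]
    after = modified_state[saved_pc + 1:]
    # Everything before the saved PC completed; nothing after it touched state.
    return all(before) and not any(after)


# DIV.D faults (index 0) after ADD.D and SUB.D already wrote their results.
print(is_precise([False, True, True], saved_pc=0))
# Same fault, but the younger instructions have not modified state.
print(is_precise([False, False, False], saved_pc=0))
```

The instruction at the saved PC itself is left unconstrained here, since whether it has executed depends on the kind of interrupt.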
Need for Precise Interrupts

Restarting after I/O and timer interrupts
Software debugging
Graceful recovery from arithmetic exceptions
Servicing page faults in virtual memory systems
Simulating unimplemented opcodes
Implementing virtual machines
Model Architecture and Implementation

Register-register architecture
Only one set of registers
Instructions stay in order until they are issued
Parallel functional units
Process state
– General Purpose Registers
– Program Counter
– Main Memory
An Imprecise Interrupt

Statement                          Comments                  Execution Time
0          R2 ← 0                  Init. loop index
1          R0 ← 0                  Init. loop count
2          R5 ← 1                  Loop inc. value
3          R7 ← 100                Maximum loop count
4   Loop:  R1 ← (R2 + A)           Load A(I)                 11 clock cycles
5          R3 ← (R2 + B)           Load B(I)                 11 clock cycles
6          R4 ← R1 +f R3           Floating add              6 clock cycles
7          R0 ← R0 + R5            Inc. loop count           2 clock cycles
8          (R0 + C) ← R4           Store C(I)
9          R2 ← R2 + R5            Inc. loop index
10         P = Loop: R0 != R7      Cond. branch not equal    2 clock cycles
Handling Interrupts Prior to Instruction Issue

Easy to maintain precise interrupts in this case
– Stop issuing new instructions
– Let all previously issued instructions complete
– Now the process is in a precise state

Examples
– Privileged instruction faults
– Unimplemented instructions
– Some external interrupts that can be checked at the issue stage
Maintaining Precise Interrupts

In-order Instruction Completion
Reorder Buffer
Reorder Buffer with Bypass Paths
History Buffer
Future File
In-order Instruction Completion (IIC)

Instructions modify process state only when all previous ones are free of exceptions
Assumption: pipeline delays are fixed, i.e., independent of the operands
The result bus may be reserved at the time of issue
IIC – Result Shift Register (RSR)
Result Shift Register (contd.)

Registers
– An instruction that takes j clock periods reserves stage j in the RSR
– Unreserved stages 1..j-1 are loaded with null control information, so that no instruction issued later can reserve a stage i (i < j)
– Each clock period, the control information in the RSR is shifted up one stage (towards stage 1)
– Logic on the result bus checks for exceptions and takes the actions needed to ensure precise interrupts
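
The reservation rule can be tried out with a toy model. The sketch below is an assumption-laden illustration, not the paper's hardware: the class `ResultShiftRegister`, the `NULL` marker, and the `run` driver are hypothetical names, at most one instruction issues per cycle, and the latencies (6 cycles for an FP add, 2 for an integer add) are the ones used elsewhere in these slides.

```python
NULL = "null"  # stage reserved with null control information


class ResultShiftRegister:
    """Toy RSR: stages[0] is stage 1, the next one to reach the result bus."""

    def __init__(self, n_stages):
        self.stages = [None] * n_stages  # None means the stage is free

    def try_issue(self, name, latency, pc):
        j = latency - 1
        if self.stages[j] is not None:
            return False  # stage j is reserved or blocked: hold at issue
        self.stages[j] = (name, pc)  # control info, including the PC
        for i in range(j):
            if self.stages[i] is None:
                self.stages[i] = NULL  # block later, faster instructions
        return True

    def tick(self):
        """Shift every stage up by one; return the entry leaving stage 1."""
        head = self.stages.pop(0)
        self.stages.append(None)
        return None if head in (None, NULL) else head


def run(program, n_stages=8):
    """Issue (name, latency, pc) tuples in order; return completion order."""
    rsr = ResultShiftRegister(n_stages)
    pending = list(program)
    done = []
    while pending or any(s is not None for s in rsr.stages):
        if pending and rsr.try_issue(*pending[0]):
            pending.pop(0)
        finished = rsr.tick()
        if finished:
            done.append(finished[0])
    return done


print(run([("FADD1", 6, 0), ("IADD", 2, 1), ("FADD2", 6, 2)]))
```

In this model the integer add has to wait at issue until the FP add ahead of it is about to finish, so results always reach the result bus in program order.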
Result Shift Register

Main Memory (two alternatives)
– Stores wait for the RSR to be empty before issuing, or
– Stores issue and are held in the load/store pipeline until all preceding instructions are exception-free; a dummy store entry in the RSR tracks when that point is reached

Program Counter
– Store the PC of each instruction in the RSR when it issues
– In case of an exception, the PC can be restored from the RSR
IIC – Advantages and Disadvantages

Advantage
– Very simple to implement

Disadvantages
– Fast instructions are held up at the issue register
– While held, they block the issue register, even though slower instructions behind them could issue

R0 ← R1 +f R2 (FP Add, 6 clock cycles)
R4 ← R3 + R8 (INT Add, 2 clock cycles)
R7 ← R6 +f R5 (FP Add, 6 clock cycles)
Reorder Buffer (ROB)

Separate the process of executing instructions from that of modifying process state
Instructions
– Can finish out-of-order
– Must modify process state in-order

The ROB (a circular queue) reorders the instructions before they modify the process state
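
A minimal sketch of the idea, under stated assumptions: the class `ReorderBuffer`, its method names, and the dict-based entries are hypothetical, a `deque` stands in for the circular queue with head and tail pointers, and only register state is modelled.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results may arrive in any order, commit is in order."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()  # oldest entry at the left (head)

    def allocate(self, dest, pc):
        """Called at issue time; returns None when the buffer is full."""
        if len(self.entries) == self.size:
            return None
        entry = {"dest": dest, "pc": pc, "value": None,
                 "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        """Called when a functional unit finishes, possibly out of order."""
        entry.update(value=value, done=True, exception=exception)

    def commit(self, regfile):
        """Retire finished entries from the head, in program order.

        Returns the PC of an excepting instruction, or None.
        """
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                self.entries.clear()  # younger results never reach the registers
                return head["pc"]
            regfile[head["dest"]] = head["value"]
            self.entries.popleft()
        return None


regs = {}
rob = ReorderBuffer(4)
fp_add = rob.allocate("R0", pc=0)
int_add = rob.allocate("R4", pc=1)
rob.complete(int_add, 7)   # the fast instruction finishes first
rob.commit(regs)           # nothing commits: the head is not done yet
rob.complete(fp_add, 3)
rob.commit(regs)           # both results now commit, oldest first
print(regs)
```

If the head entry carries an exception, `commit` returns its PC and the register file still holds the state produced by all older instructions only.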
Reorder Buffer
Reorder Buffer

Main memory preciseness is maintained as in the IIC scheme
The PC is stored in the reorder buffer at issue time
The RSR typically has more rows than the reorder buffer, so saving the PC in the reorder buffer needs less hardware
Reorder Buffer – Advantages and Disadvantages

Advantage
– A definite improvement over the IIC scheme, since instructions can complete out of order

Disadvantage
– Results computed out of order are held in the reorder buffer until all previously issued instructions have written to the register file
– Thus, instructions that depend on these results cannot issue until the results are written
Reorder Buffer with Bypass Paths
Reorder Buffer with Bypass Paths

Handling multiple bypass paths
– Only the latest reorder buffer entry is used
– When an instruction is placed in the reorder buffer, any existing entries with the same destination register are inhibited from matching a bypass check

Disadvantage
– Number of bypass comparators needed
– Amount of circuitry needed for the multiple bypass check
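
The "latest entry only" rule can be sketched as a lookup in software. This is a hypothetical illustration: the function `bypass_lookup` and the entry keys are assumptions, and a sequential search stands in for the parallel comparators that real hardware would use.

```python
def bypass_lookup(rob_entries, regfile, src):
    """Return (ready, value) for source register `src`.

    rob_entries is ordered oldest first; each entry is a dict with the
    hypothetical keys "dest", "value" and "done".
    """
    for entry in reversed(rob_entries):  # newest entry first
        if entry["dest"] == src:
            # Only the latest matching entry may satisfy the bypass check;
            # older entries with the same destination are ignored.
            return entry["done"], entry["value"]
    return True, regfile[src]  # no match: read the register file


rob = [
    {"dest": "R1", "value": 5, "done": True},      # older write to R1
    {"dest": "R1", "value": None, "done": False},  # latest write, still executing
]
print(bypass_lookup(rob, {"R1": 0, "R2": 4}, "R1"))  # must wait for the latest entry
print(bypass_lookup(rob, {"R1": 0, "R2": 4}, "R2"))  # comes from the register file
```

Taking the older, finished entry for R1 would hand the dependent instruction a stale value, which is why older matches are inhibited.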
History Buffer

Maintain a history buffer instead of a reorder buffer
At issue time, load a buffer entry with control information and also store the current value of the destination register
Results are written directly to the register file
The buffer should be long enough (about the number of pipeline stages)
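
A rough sketch of the save-and-restore idea follows. The class `HistoryBuffer` and its methods are hypothetical, only registers are modelled, and recovery assumes that all in-flight instructions have finished before the buffer is unwound.

```python
class HistoryBuffer:
    """Toy history buffer: results go straight to the register file,
    old values are kept so the state can be rolled back."""

    def __init__(self):
        self.entries = []  # oldest first

    def issue(self, regfile, dest, pc):
        # Save the current value of the destination register at issue time.
        entry = {"dest": dest, "pc": pc, "old": regfile[dest],
                 "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, regfile, entry, value, exception=False):
        entry["done"] = True
        entry["exception"] = exception
        if not exception:
            regfile[entry["dest"]] = value  # written directly, maybe out of order

    def retire(self, regfile):
        """Drop finished entries from the head; on an exception, undo.

        Returns the PC of the excepting instruction, or None.
        """
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                # Unwind from the newest entry back to the head,
                # restoring each saved register value.
                for e in reversed(self.entries):
                    regfile[e["dest"]] = e["old"]
                self.entries.clear()
                return head["pc"]
            self.entries.pop(0)
        return None


regs = {"R1": 1, "R2": 2}
hb = HistoryBuffer()
slow = hb.issue(regs, "R1", pc=0)
fast = hb.issue(regs, "R2", pc=1)
hb.complete(regs, fast, 20)                      # R2 is updated out of order
hb.complete(regs, slow, None, exception=True)    # the older instruction faults
print(hb.retire(regs), regs)                     # state is rolled back
```

Compared with a reorder buffer, dependent instructions can read results from the register file as soon as they are produced; the cost is the extra read of the old value at issue time.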
History Buffer Organization
History Buffer – Extra H/W

A large buffer to store history
3 read ports needed in the register file instead of 2
If the result bus has a bypass around the register file, the bypass has to be connected to the history buffer
Future File

Two Separate Register Files
– Architectural File: reflects the state of the sequential machine
– Future File: working file used for computation by functional units
A reorder buffer receives results at the same time as the Future File
Results are transferred in order from the reorder buffer to the Architectural File
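
The interplay of the two files can be sketched as follows. The class `FutureFileMachine` and its methods are hypothetical, and copying the architectural file back into the future file on an exception is one simple recovery choice assumed for this sketch.

```python
class FutureFileMachine:
    """Toy model with a working (future) file and an architectural file."""

    def __init__(self, regs):
        self.arch = dict(regs)    # precise, sequential state
        self.future = dict(regs)  # read and written by functional units
        self.rob = []             # oldest first

    def issue(self, dest, pc):
        entry = {"dest": dest, "pc": pc, "value": None,
                 "done": False, "exception": False}
        self.rob.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        # The result goes to the reorder buffer and the future file together.
        entry.update(value=value, done=True, exception=exception)
        if not exception:
            self.future[entry["dest"]] = value

    def retire(self):
        """Move finished results, in order, into the architectural file.

        Returns the PC of an excepting instruction, or None.
        """
        while self.rob and self.rob[0]["done"]:
            head = self.rob.pop(0)
            if head["exception"]:
                self.rob.clear()
                self.future = dict(self.arch)  # assumed recovery step
                return head["pc"]
            self.arch[head["dest"]] = head["value"]
        return None


m = FutureFileMachine({"R1": 1, "R2": 2})
slow = m.issue("R1", pc=0)
fast = m.issue("R2", pc=1)
m.complete(fast, 20)
print(m.future["R2"], m.arch["R2"])  # future file is ahead of the architectural file
m.complete(slow, None, exception=True)
print(m.retire(), m.arch, m.future)
```

Dependent instructions read the future file, so they see new results immediately without any bypass comparison against the reorder buffer.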
Future File

Useful when the machine has multiple register sets
No bypass problem like History Buffer
Performance Evaluation

CRAY-1S simulator that reports the total number of clock cycles required
First 14 Lawrence Livermore Loops used as simulation workload
Relative performance reported
3 groups
– In-order instruction completion
– Reorder Buffer
– Reorder Buffer w/ Bypass, History Buffer, Future File
Relative Performance with two methods for handling stores

Stores blocked until result pipeline empty:

Number of Buffer Entries    3        4        5        8        10
In-order                    1.2322   1.2322   1.2322   1.2322   1.2322
Reorder                     1.3315   1.2183   1.1954   1.1808   1.1808
Reorder with Bypass         1.3069   1.1743   1.1439   1.1208   1.1208

Stores held in memory pipeline after issue:

Number of Buffer Entries    3        4        5        8        10
In-order                    1.156    1.156    1.156    1.156    1.156
Reorder                     1.3058   1.1724   1.1348   1.1167   1.1167
Reorder with Bypass         1.2797   1.1152   1.0539   1.0279   1.0279
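
To read the figures as slowdowns, one can subtract the baseline. The helper below is a small sketch that assumes each figure is execution time relative to a baseline of 1.0 for the machine without precise interrupts; the function name `slowdown_percent` is made up for illustration.

```python
def slowdown_percent(relative_time):
    """Percent slowdown, assuming 1.0 is the baseline without precise interrupts."""
    return (relative_time - 1.0) * 100.0


# Figures for 10 buffer entries, stores held in the memory pipeline after issue.
for scheme, rel in [("In-order", 1.156),
                    ("Reorder", 1.1167),
                    ("Reorder with Bypass", 1.0279)]:
    print(f"{scheme}: {slowdown_percent(rel):.2f}% slower")
```

Under that reading, the reorder buffer with bypass at 8 or more entries costs roughly 3%, while in-order completion with stores blocked costs a little over 23%.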
Handling Other State Values

Additional state may include pointers to page tables, condition codes, interrupt mask conditions, etc.
The additional state can be saved in the reorder buffer or the history buffer
Virtual Memory

It should be possible to recover from page faults
The address translation pipeline should process load/store (L/S) instructions in order
If an L/S instruction causes an exception, all subsequent L/S instructions in the address translation pipeline are cancelled
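
The cancellation rule can be sketched as an in-order walk over pending accesses. The function `translate_in_order`, the tuple layout, and the dict used as a page table are assumptions for illustration only.

```python
def translate_in_order(accesses, page_table):
    """Translate load/store accesses strictly in program order.

    accesses is a list of (pc, virtual_page) tuples; page_table maps a
    virtual page to a physical page. Returns (translated, faulting_pc),
    where faulting_pc is None if no access faulted.
    """
    translated = []
    for pc, vpage in accesses:
        if vpage not in page_table:
            # Page fault: this access and every later one are cancelled.
            return translated, pc
        translated.append((pc, page_table[vpage]))
    return translated, None


page_table = {0: 7, 1: 3}
done, fault_pc = translate_in_order([(10, 0), (11, 5), (12, 1)], page_table)
print(done, fault_pc)
```

Only the accesses older than the faulting one are translated, so after the fault is serviced execution can resume at the returned PC.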
Cache

Write-Through Caches
– Cache is updated immediately, while the main memory write-through is handled as before
– In case of an interrupt, flush the cache

Write-back Caches
– Before writing back, empty the reorder buffer or check it for data belonging to the line being written back
– If such data is found, delay the write-back until the data has made its way to the cache
– In the case of a history buffer, the cache line can be saved there
Summary

5 methods for implementing precise interrupts
Performance degradation can range from 3% to 25%
Trade-offs between performance and cost
IIC offers decent performance at a very low cost
Current Architectures

MIPS R2000/3000, MIPS R4000, Pentium(?) – In-order Instruction Completion
Pentium Pro, HP PA8000 – Reorder Buffer
P6, Athlon, R10000, PowerPC620 – ROB with Bypass Paths
UltraSparc III, Gekko (Nintendo Games) – Future File
Alpha 21064, IBM Power-1/2, MIPS R8000 – fast imprecise interrupts, slow precise interrupts
Retrospective
Main contribution of the paper – Reorder Buffers
Reorder buffers are widely used for register renaming and speculative execution in current architectures
Some methods may not be easy to implement as processors become more complex
