ppt - TAMU Computer Science Faculty Pages

Hardware-Based Speculation
• As more instruction-level parallelism is exploited,
maintaining control dependences becomes an
increasing burden.
• Solution: speculate on the outcome of branches and
execute the program as if the guesses were
correct.
• Hardware Speculation
CSCE 614 Fall 2009
3 Key Ideas of Hardware Speculation
• Dynamic Branch Prediction
– Choose which instruction to execute.
• Speculation
– Allow the execution of instructions before the
control dependences are resolved (with the
ability to undo the effect of an incorrectly
speculated sequence).
• Dynamic Scheduling
– Deal with the scheduling of different
combinations of basic blocks
Examples
• PowerPC 603/604/G3/G4
• MIPS R10000/12000
• Intel Pentium II/III/4
• Alpha 21264
• AMD K5/K6/Athlon
Hardware Speculation
• Extended Tomasulo’s algorithm
• Additional step (instruction commit)
required
• Allow instructions to execute out of order
but force them to commit in order.
• Any irrevocable action (updating state or
taking an exception) is prevented until an
instruction commits.
Reorder Buffer (ROB)
• Holds the result of an instruction between
the time the operation associated with the
instruction completes and the time the
instruction commits.
• Source of operands for instructions
• With speculation, the register file (or
memory) is not updated until the
instruction commits.
ROB Fields
• Instruction type: indicates whether the instruction
is a branch, a store, or a register operation (ALU
or Load).
• Destination: supplies the register number (for
loads and ALU operations) or the memory
address (for stores).
• Value: holds the value of the instruction result
until the instruction commits.
• Ready: indicates that the instruction has
completed execution, and the value is ready.
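The four fields above can be pictured as a small record. The following is only a minimal sketch in Python, not a description of any real hardware; the names `InstrType`, `ROBEntry`, and `write_result` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    """Instruction type field: branch, store, or register operation."""
    BRANCH = "branch"
    STORE = "store"
    REGISTER = "register"  # ALU operation or load


@dataclass
class ROBEntry:
    """One reorder buffer entry with the four fields from the slide."""
    instr_type: InstrType
    # Register name (loads, ALU ops) or memory address (stores).
    destination: Optional[Union[str, int]] = None
    value: Optional[float] = None  # result held until commit
    ready: bool = False            # True once execution has completed

    def write_result(self, value: float) -> None:
        """Record the result and mark the entry ready (hypothetical helper)."""
        self.value = value
        self.ready = True


# Usage: an ALU operation targeting F0 finishes execution.
entry = ROBEntry(InstrType.REGISTER, destination="F0")
entry.write_result(3.5)
```

The entry starts with `ready` false and only becomes a candidate for commit after `write_result` has been called.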
Hardware Speculation
[Figure: Issue → Execute → Write result (to ROB) → Commit (write to RF, MEM). Labeled structures: Reservation Stations, Reorder Buffer (ROB).]
Basic Structure of MIPS FP Unit
• The ROB completely replaces the store buffer.
• The renaming function of the reservation stations
is replaced by the ROB.
4 Steps of Execution
1. Issue (also called “dispatch”)
- Get an instruction from the instruction
queue.
- Issue the instruction if there is an empty
reservation station and an empty slot in the ROB.
- If either all reservation stations are full or
the ROB is full, instruction issue stalls.
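The issue rule can be sketched as a single function. This is a simplified illustration, assuming one shared pool of reservation stations (real designs have separate pools per functional unit type); `issue_cycle` and its parameters are hypothetical names.

```python
from collections import deque


def issue_cycle(queue, rs_free, rob, rob_size):
    """Try to issue the instruction at the head of the instruction queue.

    Returns (issued_instruction_or_None, free_reservation_stations_after).
    Issue stalls when no reservation station is free or the ROB is full.
    """
    if not queue:
        return None, rs_free
    if rs_free == 0 or len(rob) >= rob_size:
        return None, rs_free          # stall: instruction stays in the queue
    instr = queue.popleft()
    rob.append(instr)                 # allocate the next ROB slot, in program order
    return instr, rs_free - 1         # one reservation station is now occupied


# Usage: one free reservation station, two ROB slots.
queue = deque(["L.D", "MUL.D", "S.D"])
rob = deque()
first, rs_free = issue_cycle(queue, 1, rob, 2)         # issues L.D
second, rs_free = issue_cycle(queue, rs_free, rob, 2)  # stalls: no station free
```

Because the instruction is only removed from the queue on success, a stalled instruction is retried unchanged on the next cycle.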
2. Execute
- If one or more of the operands are not yet
available, monitor the CDB while waiting for
the register to be computed.
- RAW hazards are checked in this step.
- When both operands are available at a
reservation station, execute the operation.
- Loads require two steps (effective address
calculation and memory access).
- Stores need one step (effective address
calculation).
3. Write Result
- When the result is available, write it on
the CDB and from the CDB into the ROB,
as well as to any reservation stations
waiting for this result.
- For stores, if the value to be stored is
available, it is written into the Value field of
the ROB entry for the store.
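The broadcast on the CDB can be sketched as follows. This is an illustrative model only: the ROB is a plain dictionary keyed by entry number, and the reservation station field names (`q_j`, `q_k` for pending tags, `v_j`, `v_k` for operand values) are assumptions borrowed from the usual textbook presentation of Tomasulo's algorithm, not from these slides.

```python
def write_result(rob, stations, tag, value):
    """Broadcast (tag, value) on the CDB.

    The ROB entry named by `tag` receives the value and is marked ready;
    every reservation station waiting on `tag` captures the value.
    """
    rob[tag]["value"] = value
    rob[tag]["ready"] = True
    for rs in stations:
        for q_field, v_field in (("q_j", "v_j"), ("q_k", "v_k")):
            if rs[q_field] == tag:
                rs[v_field] = value   # operand is now available
                rs[q_field] = None    # no longer waiting on a producer


# Usage: a MUL.D waits on ROB entry 1 for its first operand.
rob = {1: {"value": None, "ready": False}}
stations = [{"op": "MUL.D", "v_j": None, "v_k": 2.0, "q_j": 1, "q_k": None}]
write_result(rob, stations, tag=1, value=4.0)
```

After the call, the waiting station holds both operands and can begin execution, while the architectural register file is still untouched.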
4. Commit (also called “completion” or “graduation”)
- Normal commit: When an instruction reaches
the head of the ROB and its result is present in
the buffer, the processor updates the register
with the result and removes the instruction from
the ROB.
- Store commit: Similar except that memory is
updated.
- Branch with an incorrect prediction: The
speculation is wrong. The ROB is flushed and
execution is restarted at the correct successor of
the branch.
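The three commit cases can be sketched with a queue standing in for the ROB. This is a minimal sketch under stated assumptions: one commit per call, registers and memory modeled as dictionaries, and hypothetical names (`Entry`, `commit`, `mispredicted`, `correct_target`) that do not come from the slides.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    kind: str                             # "reg", "store", or "branch"
    dest: Optional[object] = None         # register name or memory address
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False            # meaningful for branches only
    correct_target: Optional[int] = None  # where to restart after a flush


def commit(rob, registers, memory):
    """Try to commit the instruction at the head of the ROB.

    Returns the restart PC if a mispredicted branch flushed the ROB,
    otherwise None.
    """
    if not rob or not rob[0].ready:
        return None                       # head not finished: nothing commits
    head = rob.popleft()
    if head.kind == "branch":
        if head.mispredicted:
            rob.clear()                   # discard all speculative work
            return head.correct_target
    elif head.kind == "store":
        memory[head.dest] = head.value    # store commit: update memory
    else:
        registers[head.dest] = head.value # normal commit: update register
    return None


# Usage: a register write, then a mispredicted branch, then a speculative store.
rob = deque([
    Entry("reg", "F0", 1.5, True),
    Entry("branch", ready=True, mispredicted=True, correct_target=0x40),
    Entry("store", 0x100, 9.0, True),
])
regs, mem = {}, {}
first = commit(rob, regs, mem)    # commits the write to F0
second = commit(rob, regs, mem)   # flushes the ROB, returns the restart PC
```

The store behind the mispredicted branch never reaches memory, which is the point of deferring irrevocable actions until commit.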
Example
    L.D    F6, 34(R2)
    L.D    F2, 45(R3)
    MUL.D  F0, F2, F4
    SUB.D  F8, F6, F2
    DIV.D  F10, F0, F6
    ADD.D  F6, F8, F2
Consider the machine state at the point when the MUL.D is ready to commit.
Example (p.109)
Loop:  L.D     F0, 0(R1)
       MUL.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDIU  R1, R1, #-8
       BNE     R1, R2, Loop
Assume that we have issued all the instructions in the loop twice.
Assume that L.D and MUL.D from the first iteration have committed
and all other instructions have completed execution.
Show the contents of the ROB and the FP registers.
Hardware Speculation
• Because neither the register values nor
any memory values are actually written
until an instruction commits, the processor
can easily undo its speculative actions
when a branch is found to be mispredicted.
• Exceptions are handled by not recognizing
the exception until the instruction that raised
it is ready to commit.
Hardware Speculation
• Figure 2.17 (p.113)
Multiple-Issue Processors
• Allow multiple instructions to issue in a
clock cycle.
• Ideal CPI < 1
• 3 flavors
– Statically Scheduled Superscalar
– Dynamically Scheduled Superscalar
– VLIW (Very Long Instruction Word)
Superscalar Processors
• Issue varying numbers of instructions per
clock
– statically scheduled
• using compiler techniques
• in-order execution
– dynamically scheduled
• Tomasulo’s algorithm
• out-of-order execution
VLIW Processors
• Issue a fixed number of instructions
formatted either as one large instruction or
as a fixed instruction packet with the
parallelism among instructions explicitly
indicated by the instruction (EPIC:
Explicitly Parallel Instruction Computers).
• Statically scheduled by the compiler.
Name                      | Issue structure | Hazard detection | Scheduling             | Distinguishing characteristic          | Examples
Superscalar (static)      | dynamic         | h/w              | static                 | in-order execution                     | MIPS and ARM (embedded)
Superscalar (dynamic)     | dynamic         | h/w              | dynamic                | some out-of-order execution            | None
Superscalar (speculative) | dynamic         | h/w              | dynamic w/ speculation | out-of-order execution w/ speculation  | Pentium 4, MIPS R12K, Alpha 21264, IBM Power5
VLIW/LIW                  | static          | primarily s/w    | static                 | all hazards determined by compiler     | TI C6x (embedded)
EPIC                      | mostly static   | mostly s/w       | mostly static          | all hazards determined by compiler     | Itanium
Multiple Instruction Issue with Dynamic Scheduling
• Two-issue dynamically scheduled
processor
– It can issue any pair of instructions if there are
reservation stations of the right type available.
– Extended Tomasulo’s algorithm
Note that Tomasulo’s algorithm (and Hardware Speculation) is used
for both integer operations and FP operations.
• Two approaches to implementing two-issue
– Issue one instruction in half a clock cycle, so
that two instructions can be processed in one
clock cycle.
– Build the logic necessary to handle two
instructions at once, including any possible
dependences between the instructions.
• Modern superscalar processors that issue
4 or more instructions per clock often
include both approaches.
How to Handle Branches?
• Dynamically scheduled processors (without
speculation)
– Instructions after a branch may be fetched and
issued, but not actually executed, until the
branch has completed.
– Example: IBM 360/91
• Processors with hardware speculation
– Can actually execute instructions based on
branch prediction.
• Note that we consider loads and stores,
including those to FP registers, as integer
operations.
• Assume that FP adds take 3 execution
cycles.
• Latency: [figure not recovered; labels: Execute, Write CDB]
• The throughput improvement versus a
single-issue pipeline is small.
– There is only one FP operation per iteration.
– There is only one Integer ALU for both integer
ALU operations and effective address
calculations.
• Larger improvements would be possible if
the processor could execute more integer
operations per cycle.
Multiple Issue with Speculation
• We process multiple instructions per clock,
assigning reservation stations and reorder
buffer entries to the instructions.
• To maintain throughput of greater than one
instruction per cycle, a speculative
processor must be able to handle multiple
instruction commits per clock cycle.
Example (p.119)
Loop:  LD      R2, 0(R1)
       DADDIU  R2, R2, #1
       SD      R2, 0(R1)
       DADDIU  R1, R1, #8
       BNE     R2, R3, Loop
Consider the execution of the loop on a two-issue processor, once without
speculation (dynamic scheduling/Tomasulo’s algorithm) and once with speculation.
Assume that there are separate integer functional units for effective address
calculation, for ALU operations, and for branch condition evaluation.
Assume that there are 2 CDBs.
Assume that up to two instructions of any type can commit per clock for a processor
with speculation.
Show the execution timing of the first three iterations of the loop.
High-Performance Instruction Delivery
• For multiple-issue processors (delivering 4 to 8
instructions per clock cycle)
– Branch-target buffers
– Integrated instruction fetch unit
– Return address prediction
Branch-Target Buffers
• To reduce the branch penalty for the classic
5-stage pipeline, we want to know what
address to fetch by the end of IF.
• Branch-target buffer: a branch-prediction
cache that stores the predicted address for
the next instruction after a branch.
• We access the buffer during the IF stage
using the instruction address. (The instruction
has not been decoded yet, so we do not know
whether it is a branch.)
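A branch-target buffer lookup can be sketched as a small tagged cache indexed by the fetch address. The sketch below makes several assumptions that are not in the slides: a direct-mapped organization, 16 entries, 4-byte instructions, the full PC used as the tag, and only taken branches kept in the buffer. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped branch-target buffer (illustrative sketch)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = {}  # index -> (branch PC used as tag, predicted target)

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits are dropped.
        return (pc >> 2) % self.num_entries

    def lookup(self, pc):
        """Return the predicted next PC, using only the fetch address."""
        entry = self.entries.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]   # hit: predicted-taken branch, fetch its target
        return pc + 4         # miss: fetch the sequential instruction

    def update(self, pc, taken, target):
        """Record the branch outcome once the branch has been resolved."""
        idx = self._index(pc)
        if taken:
            self.entries[idx] = (pc, target)
        else:
            entry = self.entries.get(idx)
            if entry is not None and entry[0] == pc:
                del self.entries[idx]  # keep only predicted-taken branches


# Usage: a taken branch at 0x100 with target 0x80.
btb = BranchTargetBuffer()
before = btb.lookup(0x100)   # 0x104: no entry yet, fall through
btb.update(0x100, True, 0x80)
after = btb.lookup(0x100)    # 0x80: predicted target
```

The tag comparison matters because different addresses can map to the same index; without it, a non-branch instruction could be redirected to another branch's target.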
Branch-Target Buffers
[Figure: Branch-Target Cache. Annotated field: optional; may be used for extra prediction state bits.]
Branch-Target Buffers
• We only need to store the predicted-taken
branches in the branch-target buffer.
– Why?
• No branch delay if a branch-prediction
entry is found and is correct.
Return Address Predictors
• Predicting indirect jumps (destination
address varies at run time)
– Procedure returns, procedure calls, case,
select, etc.
– SPEC89: 85% of indirect jumps are procedure
returns.
• A small buffer of return addresses operating
as a stack
– Caches the most recent return addresses
– Push a return address on the stack at a call
– Pop one off at a return
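The push and pop behavior can be sketched as a bounded stack. This is a minimal illustration; the depth of 8 and the policy of discarding the oldest entry on overflow are assumptions, and `ReturnAddressStack`, `on_call`, and `predict_return` are hypothetical names.

```python
class ReturnAddressStack:
    """Small stack of return addresses used to predict procedure returns."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        """Push the return address when a call is fetched."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)   # buffer full: discard the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


# Usage: three nested calls with a stack of depth 2.
ras = ReturnAddressStack(depth=2)
for addr in (0x104, 0x204, 0x304):
    ras.on_call(addr)
predictions = [ras.predict_return() for _ in range(3)]
# predictions == [0x304, 0x204, None]: the oldest address was lost to overflow
```

With a depth at least as large as the call depth, every return is predicted correctly; deeper call chains lose the outermost return addresses first.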
Integrated Instruction Fetch Units
• A separate autonomous unit that feeds
instructions to the rest of the pipeline for
multiple-issue processors.
• Have several functions
– Integrated branch prediction
– Instruction prefetch
– Instruction memory access and buffering