Here are the notes from Chapter 6.

CH 6 - PIPELINING
INTRODUCTION - basics of pipelining and the problems that are
generated by this technique.
Basic Principle is simple: Divide up the instruction execution into
discrete parts, much like the multicycle processor:
IF - Instruction fetch
ID - Register read/ Instruction decode
EXE - Execute operation
MEM - Memory Access
WB - Write back
Note that all instructions will write back on the 5th cycle (normally); data
will remain on hold during mem access stage if there is no mem access.
Key: we can execute several instructions in parallel (up to five) because we
have five 'functional units' in our pipeline. If instruction n is being
executed, n+1 is doing its register read, n+2 is being fetched, n-1 is in
memory access, and n-2 is being written back.
How does this improve performance? Before, each instruction took 5 cycles.
Now a new instruction can be fetched every cycle, giving an upper limit on
the performance increase of:
MaxSpeedup = CyclesBefore / EffectiveCyclesAfter = number of pipeline stages
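
As a quick sanity check on that limit, here is a minimal sketch (Python; the 5-stage depth and the instruction counts are just illustrative): n instructions on a k-stage pipeline take k + (n - 1) cycles instead of k*n.

# Illustration of the pipelining speedup limit.
# unpipelined cycles = k * n
# pipelined cycles   = k + (n - 1)   (fill the pipe once, then one per cycle)

def speedup(n_instructions: int, stages: int = 5) -> float:
    """Ideal speedup of a k-stage pipeline over a k-cycle multicycle design."""
    unpipelined = stages * n_instructions
    pipelined = stages + (n_instructions - 1)
    return unpipelined / pipelined

for n in (4, 100, 1_000_000):
    print(f"{n:>9} instructions: speedup = {speedup(n):.2f}")
# As n grows, the speedup approaches the number of stages (5 here).
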
It looks as if each instruction takes one cycle! However, the time for a
given instruction is the same as, or more than (due to the delayed
writeback), the multicycle case; it is throughput that improves, not
latency. Let's look at an example:
ClockCycle:     1    2    3    4    5    6    7    8
add $2,$3,$4    if   id   exe  mem  wb
lw  $4,0($2)         if   id   exe  mem  wb
beq $4,$0,foo             if   id   exe  mem  wb
sw  $2,0($6)                   if   id   exe  mem  wb
Much better, right? Well, yes and no. There are a few problems. HAZARDS
are situations that prevent a later instruction from executing in its
scheduled clock cycle, most often because the result of an earlier
instruction must be determined first.
There are three basic types of hazards:
1) Structural hazards occur when the hardware cannot support the combination
of operations needed in the same clock cycle. MIPS is simple enough that it
has no structural hazards. If, however, program and data memory were
unified, a data-memory read could not be done in the same cycle as an
instruction fetch. Or, if there were a two-cycle divide unit that was not
itself pipelined, we could not issue back-to-back divide instructions.
Solution: STALL. Let the earlier instruction continue, but hold all later
instructions back one cycle so the conflict is resolved. This inserts a
BUBBLE into the pipeline, a slot in which no useful work is done.
2) Control hazards occur because until a conditional branch is evaluated,
or a jump address is determined, the address of the next instruction is
unknown.
Question: in the example above, when is the address of the instruction
after the branch available? (After the branch's exe stage; only then could
the sw fetch begin.)
Solutions:
a) Stall until the decision is known (2 stall cycles IF the branch is
resolved early).
b) Assume "branch-not-taken" and inhibit the writes (WB) of the wrongly
fetched instructions if the branch is actually taken.
c) Assume "branch-taken", which causes at least one stall.
d) Branch prediction: keep track of each branch address and whether, over
its recent history, the branch has been taken more often than not. ***
Prediction is used in most systems. NO STALLS unless the prediction is wrong.
e) Delayed branch. The instruction after the branch is assumed to always
be executed. "Early decision" branch detection could then be used to
resolve the branch at the end of id; this allows only simple branches
(bzero, bnz, i.e., branch on zero / not zero).
The instruction after the branch is in the "branch delay slot". Often
there is no useful instruction to put there, so a nop is used.
3) Data hazards are most common. This happens when a register value
calculated in a previous instruction has not yet been written back by the time
a later instruction needs it in the id stage.
Question: Can you find examples in our code? ($2: add->lw, $4: lw->beq;
$2: add->sw is NOT a hazard, since we can assume that a writeback and a
register read in the same cycle are supported by the hardware: the write
completes at the beginning of the clock cycle, the read at the end.)
Solution: Non-load data hazards can be fixed using data forwarding; load
data hazards cannot be fully fixed this way and must also invoke a stall.
Note that for non-load instructions the result is available after the exe
stage, and the next instruction doesn't need it until ITS exe stage. Thus,
the result from the first instruction can be forwarded from the exe output
to the exe input, bypassing the register file altogether (this technique is
also called data bypassing). This is done with multiplexers.
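
To make the question above concrete, here is a small sketch (Python; the instruction descriptions are hand-written tuples, not real decoder output) that flags read-after-write dependences within a distance of two instructions, which is exactly the window where this pipeline needs forwarding or a stall:

# Flag RAW dependences within the 2-instruction window of the 5-stage pipeline.
snippet = [
    # (text,            destination, sources,       is_load)
    ("add $2,$3,$4",    "$2",        ["$3", "$4"],  False),
    ("lw  $4,0($2)",    "$4",        ["$2"],        True),
    ("beq $4,$0,foo",   None,        ["$4", "$0"],  False),
    ("sw  $2,0($6)",    None,        ["$2", "$6"],  False),
]

for i, (text, dest, srcs, is_load) in enumerate(snippet):
    # Look back at most two instructions; anything older has already written
    # back (write in the first half of the cycle, read in the second half).
    for back in (1, 2):
        if i - back < 0:
            continue
        prev_text, prev_dest, _, prev_load = snippet[i - back]
        if prev_dest is not None and prev_dest in srcs:
            kind = ("load-use (needs 1 stall)" if prev_load and back == 1
                    else "fixable by forwarding")
            print(f"hazard: {prev_text!r} -> {text!r} on {prev_dest} ({kind})")

Running it reports exactly the two hazards named above: add->lw on $2 (forwardable) and lw->beq on $4 (load-use).
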
Let's reexamine our program snippet. If we used just stalls, we'd have to:
ClockCycle:     1    2    3    4    5    6    7    8    9    10   11   12   13   14
add $2,$3,$4    if   id   exe  mem  wb
lw  $4,0($2)         if   STA  STA  id   exe  mem  wb
beq $4,$0,foo             if   STA  STA  STA  STA  id   exe  mem  wb
sw  $2,0($6)                                            if   id   exe  mem  wb
And the sw WB would occur on clock cycle 14! Note that this is not
unusually "unfortunate" code; data hazards are the rule more than the
exception. But it is folly to just wait for the branch decision to be made.
If we use the schemes we have discussed, and we make the correct branch
decision, then we would have:
ClockCycle:     1    2    3    4    5    6    7    8    9
add $2,$3,$4    if   id   exe  mem  wb
lw  $4,0($2)         if   id   exe  mem  wb                 (data forward)
beq $4,$0,foo             if   STA  id   exe  mem  wb       (data forward)
sw  $2,0($6)                   if   STA  id   exe  mem  wb  (branch predict)
Only one stall, due to the unavoidable load data hazard!
A performance improvement of 14/9 ≈ 1.56, about 56%!
That's the background on pipelining, next we'll consider how to extend our
simple MIPS to do pipelining.
MIPS pipeline implementation (Sec. 6.2) - fig. 6.10/6.12
To consider how to implement the pipelined MIPS, we'll have to go back to
assuming separate program and data memories, to avoid memory contention
problems for L/S instructions.
Let's start back at the single cycle block diagram, reorganized as 6.10.
This shows the datapath elements associated with the five pipeline stages:
IF: fetch instruction, update PC=PC+4
ID: register access
EXE: execute, calculate branch addr
MEM: Load/Store only
WB: dest. MUX only
Just like in the multicycle implementation, the key factor is that we need
to use registers to save the results of each stage so that they can be
available as inputs to the next stage in the next clock cycle. The four
pipeline registers are shown in the next Figure. The IF/ID register
contains the IR and updated PC value.
A few of the details are glossed over in this figure. Most egregious is
that the write-register number is not carried through the later pipeline
registers, so as drawn the wrong register would be written in WB; it must
travel along with its instruction. In general, control signals and data
have to travel together through the pipeline registers.
Note that registers are updated all at once at the clock edge. It is
assumed that the clock is slow enough that all signals settle in time
before the edge.
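
A minimal way to picture the four pipeline registers (a Python sketch with illustrative field names, not the book's exact signal list): each one is just a bundle of values latched at the clock edge and handed to the next stage.

# Sketch of the four pipeline registers as plain data bundles.
from dataclasses import dataclass

@dataclass
class IF_ID:
    instruction: int   # the fetched IR
    pc_plus_4: int     # updated PC, needed later for branches

@dataclass
class ID_EX:
    pc_plus_4: int
    reg_read_1: int    # value read for rs
    reg_read_2: int    # value read for rt
    sign_ext_imm: int
    rt: int            # register numbers carried along for RegDst / forwarding
    rd: int

@dataclass
class EX_MEM:
    branch_target: int
    zero: bool
    alu_result: int
    reg_read_2: int    # store data
    write_reg: int     # chosen destination register number, carried to WB

@dataclass
class MEM_WB:
    mem_data: int
    alu_result: int
    write_reg: int

# Example: latch the fetched instruction and PC+4 at the end of IF.
latched = IF_ID(instruction=0x00430820, pc_plus_4=0x00400004)  # add $1,$2,$3

On every clock edge all four registers latch at once, so each stage works only on what was latched for it on the previous edge.
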
Let's run our little program snippet on this architecture. Note that there
is no forwarding available, and ... when is the branch decision made?
(during exe, PC not updated with branch address until mem).
ClockCycle:     1    2    3    4    5    6    7    8    9    10   11   12   13   14
add $2,$3,$4    if   id   exe  mem  wb
lw  $4,0($2)         if   STA  STA  id   exe  mem  wb
beq $4,$0,foo             if   STA  STA  STA  STA  id   exe  mem  wb
sw  $2,0($6)                                            if   id   exe  mem  wb
PIPELINE CONTROL (6.3, handout Fig. 6.30)
In this lecture, we’ll consider the control signals for a basic pipeline, and
then expand to detection and (if possible) circumvention of hazards.
The key concept here is that control signals must go through the pipeline
registers in step with the data buses, so they can be applied in the correct
stage at the correct clock cycle.
Look at the handout. We see that once control signals are decoded, they are
carried along through the pipeline registers to later stages.
Dividing the control signals into the appropriate pipeline stage, we have:
IF – Nothing to assert (registers always clocked)
ID – Nothing to write here. RegWrite controlled in last stage.
EXE – RegDst, ALUop, ALUSrc applied here
MEM – Branch, MemRead, MemWrite needed here. Branch PC here!
WB – MemToReg and RegWrite. NOTE that register write happens here!
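
One way to picture the control signals riding along with the data (a Python sketch with illustrative values; the real unit is just wires through the ID/EX, EX/MEM, and MEM/WB registers): ID decodes the full control word, and each later stage peels off the bits it needs and passes the rest on. Signal names follow the list above.

# Control bits decoded in ID and shifted through the pipeline registers so
# each is applied in its own stage. The decode table covers only R-type/lw/sw.
EX_SIGNALS  = ("RegDst", "ALUOp", "ALUSrc")
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")
WB_SIGNALS  = ("MemToReg", "RegWrite")

def decode(opcode: str) -> dict:
    """Produce the whole control word in ID."""
    table = {
        "rtype": dict(RegDst=1, ALUOp=2, ALUSrc=0, Branch=0, MemRead=0,
                      MemWrite=0, MemToReg=0, RegWrite=1),
        "lw":    dict(RegDst=0, ALUOp=0, ALUSrc=1, Branch=0, MemRead=1,
                      MemWrite=0, MemToReg=1, RegWrite=1),
        "sw":    dict(RegDst=0, ALUOp=0, ALUSrc=1, Branch=0, MemRead=0,
                      MemWrite=1, MemToReg=0, RegWrite=0),
    }
    return table[opcode]

ctrl   = decode("lw")
id_ex  = {k: ctrl[k] for k in EX_SIGNALS + MEM_SIGNALS + WB_SIGNALS}
ex_mem = {k: id_ex[k] for k in MEM_SIGNALS + WB_SIGNALS}  # EX bits used, dropped
mem_wb = {k: ex_mem[k] for k in WB_SIGNALS}               # MEM bits used, dropped
print(ex_mem, mem_wb)   # RegWrite only takes effect here, in WB
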
WORK through the example in the text pp. 471-476. You should be able to
identify which instruction is doing what in which portion of the pipeline.
DATA HAZARDS and FORWARDING (6.4)
In order to deal with data hazards, we need to do the following:
1. Detection of data hazards
2. Forward data where possible
3. Stall where forwarding not possible (Loads)
Let’s consider these steps one at a time.
Detection of data hazards. First, consider how far apart instructions can be
and still have data hazard problems.
Clock           1    2    3    4    5    6    7    8
Add $1,$2,$3    if   id   exe  mem  wb
Add $4,$5,$1         if   id   exe  mem  wb
Add $6,$7,$1              if   id   exe  mem  wb
Add $8,$9,$1                   if   id   exe  mem  wb
Since we can do a register fetch and a writeback in the same cycle, the
last add is independent of the first; only the second and third adds are
affected. So we need to examine whether any source register of an
instruction is written by either of the two previous instructions.
This is done in the execution stage by pipelining the rs and rt addresses, and
comparing them to the rd address in the following two stages:
(Figure: the rs and rt fields carried in the ID/EX register are compared
against the rd/write-register fields held in EX/MEM and MEM/WB; any match
drives the Detect logic and selects a Forward path.)
If any address matches (rs=rd_mem, rs=rd_wb, rt=rd_mem, rt=rd_wb), data
must be forwarded, UNLESS the producing operation is a load. Forwarding is
achieved by adding TWO MORE inputs to the MUXes at the ALU inputs, which
are connected to the data outputs of the EX/MEM and MEM/WB registers. See
Handout.
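
The forwarding decision itself is just those four comparisons plus a check that the producing instruction actually writes a register. A minimal sketch (Python, simplified relative to the handout's forwarding unit; it ignores the $0 special case):

# EX-stage forwarding decision. Returns which value each ALU input should
# take: the register-file read, the EX/MEM result (one instruction ahead of
# us), or the MEM/WB result (two ahead).
def forward_select(rs, rt, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Return (selA, selB) in {'REG', 'EX/MEM', 'MEM/WB'} for the two ALU inputs."""
    def select(src):
        if ex_mem_regwrite and ex_mem_rd == src:
            return "EX/MEM"       # most recent result wins
        if mem_wb_regwrite and mem_wb_rd == src:
            return "MEM/WB"
        return "REG"              # no hazard: use the register-file value
    return select(rs), select(rt)

# Example: lw $4,0($2) is in EX (reads $2); add $2,$3,$4 is now in MEM and
# wrote $2, so the $2 operand is taken from the EX/MEM register.
print(forward_select(rs=2, rt=4,
                     ex_mem_regwrite=True, ex_mem_rd=2,
                     mem_wb_regwrite=False, mem_wb_rd=0))
# -> ('EX/MEM', 'REG')
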
What happens with a load?
Clock:            1      2      3      4      5      6      7      8
Lw  $1, 0($2)     if     id     exe    mem    wb
Add $3,$4,$1             if     id     stall  exe    mem    wb        (data haz)
Add $5,$6,$1                    if     stall  id     exe    mem    wb

Clock:            1      2      3      4      5      6      7
Lw  $1, 0($2)     if     id     exe    mem    wb
Add $3,$4,$10            if     id     exe    mem    wb               (no data haz)
Add $5,$6,$1                    if     id     exe    mem    wb        (data haz, forwarded)
Data forwarding can be used after one cycle. Stall required if instruction
after load has data hazard with load. Thus, we still need to be able to do a
stall…. Here’s how:
Need a HAZARD DETECTION UNIT (in the book’s pseudocode):
if (ID/EX.MemRead and
    ((ID/EX.Rt = IF/ID.Rs) or (ID/EX.Rt = IF/ID.Rt)))
then stall the pipeline's if and id stages for one cycle.
Note that exe continues (to do memory load), while the following
instructions (in id, if) are stalled one cycle. Note that ID/EX.rt is actually the
destination address for the Load, since the RegDst MUX is in the EX stage!
The stall is accomplished by inhibiting clocking of the PC and IF/ID
pipeline. Also, the control signals for the bubble are set to 0, which inhibits
writeback, mem write, and branching. That’s all!!
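
In code form, the book's condition plus the stall actions look roughly like this (a sketch; field names mirror the pseudocode above, and the dictionaries are simplified stand-ins for the pipeline registers, not a full simulator):

# A bubble is just an all-zero control word: no register write, no memory
# write, no branch.
ZERO_CTRL = dict(RegWrite=0, MemWrite=0, MemRead=0, Branch=0, MemToReg=0)

def load_use_hazard(id_ex, if_id):
    """True when the instruction in EX is a load whose destination (its rt
    field) is a source register of the instruction currently in ID."""
    return bool(id_ex["MemRead"]) and id_ex["rt"] in (if_id["rs"], if_id["rt"])

def hazard_unit(state):
    """Stall IF/ID for this cycle if needed; the load in EX keeps going."""
    if load_use_hazard(state["id_ex"], state["if_id"]):
        state["pc_write"] = False             # PC and IF/ID hold their values,
        state["if_id_write"] = False          # and the bubble goes down the pipe
        state["id_ex_control"] = dict(ZERO_CTRL)
    else:
        state["pc_write"] = True
        state["if_id_write"] = True

# Example: lw $1,0($2) is in EX, add $3,$4,$1 is in ID -> stall one cycle.
state = {"id_ex": {"MemRead": 1, "rt": 1},
         "if_id": {"rs": 4, "rt": 1},
         "id_ex_control": {}, "pc_write": True, "if_id_write": True}
hazard_unit(state)
print(state["pc_write"], state["if_id_write"])   # False False
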
BRANCH HAZARDS (6.6)
Reduction of branch hazards can be achieved in a number of ways:
1) Assume branch not taken. This gives no stalls IF the branch should not
be taken, but gives 3 stalls if it IS taken and the new address is not
calculated until the mem stage:
Clock:          1    2    3    4    5    6    7    8
Beq $1,$2,foo   if   id   exe  mem  wb
+1                   if   id   exe  mem  (wb)
+2                        if   id   exe  mem  (wb)
+3                             if   id   exe  mem  (wb)
foo: instruct                       if   id   exe  mem
*** write/writeback inhibited for +1, +2, +3 ***
2) Reducing branch delay. We can improve the above situation by one cycle
by pre-calculating the branch address: PC+4 is already available from the
IF stage, and the target (PC+4 plus the offset) can be calculated in the
ID stage. Normally we would still have to wait for exe to determine the
outcome, so the earliest the new IF could occur would be while the branch
is in mem (2 stalls).
We can go one better by adding hardware that determines whether rs and rt
are equal and putting it in the ID/register-fetch stage (see diagram).
Then the decision can be made in the ID stage, and the branched-to IF can
happen during EXE, giving only ONE stall.
3) Dynamic branch prediction. Based on actual history of execution!
Uses branch prediction buffer or branch history table. Small memory
that is indexed by lowest address bits, and has high address bits and
recent history:
(Figure: the table is indexed by the low address bits A2-A6; each entry
stores the branch's high address bits A7-A31 plus two history bits H0-H1.
The stored A7-A31 is compared (=?) against the fetched instruction's
address; a match means the instruction is a branch we have seen before,
and H0-H1 supply the branch decision.)
The decision comes from a small state machine that predicts based on the
last two or three outcomes of the branch at that address (a 2-bit scheme;
a minimal sketch follows this list). When the branch is resolved, the
history bits are updated.
4) Delayed Branch – instruction after branch ALWAYS executed. Used
to reduce bubbles even if branch prediction is wrong. Or for jumps.
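
A minimal sketch of the 2-bit scheme from item 3 (Python; the table size and the lack of the high-address tag check are simplifications relative to the buffer described above): each entry is a saturating counter that is nudged toward the actual outcome every time the branch resolves.

# 2-bit saturating-counter branch predictor, indexed by low address bits.
TABLE_BITS = 5                       # use address bits A2-A6, as in the notes
TABLE_SIZE = 1 << TABLE_BITS
counters = [1] * TABLE_SIZE          # 0,1 = predict not taken; 2,3 = predict taken

def index(branch_pc: int) -> int:
    return (branch_pc >> 2) & (TABLE_SIZE - 1)   # drop byte offset, keep low bits

def predict(branch_pc: int) -> bool:
    return counters[index(branch_pc)] >= 2

def update(branch_pc: int, taken: bool) -> None:
    """Move the counter one step toward the actual outcome (saturating)."""
    i = index(branch_pc)
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)

# A loop-closing branch: after a couple of taken outcomes the predictor locks
# onto 'taken', and the single not-taken loop exit does not flip it.
pc = 0x00400010
for outcome in (True, True, True, True, False):
    print(predict(pc), outcome)
    update(pc, outcome)
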
EXCEPTIONS (6.7)
As usual, exceptions mess up the elegance of the architectural model.
Exceptions can happen in several pipeline stages (id=bad instruction,
exe=overflow, mem=memory fault) and in some of these cases we need to
RESTART the program from the instruction that faulted. External interrupts
need to be dealt with as well.
Need logic that COMPLETES (WB) instructions BEFORE the faulting
instruction, but disables writeback of the faulting instruction or later ones.
Again, control bits of the faulting instruction and later ones can be zeroed
out.
Also we need to save the address (+4) of the faulting instruction in the
EPC, and the reason in the CAUSE register. This means passing the
address+4 down through the pipeline registers (Fig. 6.55); the datapath
shown handles exe-stage exceptions (overflow) but not mem-stage faults.
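
A rough sketch of that flush-and-record step (Python; register names follow the notes, while the handler address and the structures are illustrative):

# When, e.g., an overflow is detected in EX: squash the faulting instruction
# and everything younger by zeroing their control bits, record where to
# restart (EPC) and why (Cause), and redirect the PC to the handler.
EXCEPTION_HANDLER = 0x80000180       # illustrative handler address

def take_exception(state, faulting_pc_plus_4, cause_code):
    # Older instructions (already in MEM/WB) complete normally; the faulting
    # instruction and the ones behind it become bubbles.
    for reg in ("if_id_control", "id_ex_control", "ex_mem_control"):
        state[reg] = dict(RegWrite=0, MemWrite=0, Branch=0)
    state["EPC"] = faulting_pc_plus_4    # address (+4) passed down the pipeline
    state["Cause"] = cause_code
    state["PC"] = EXCEPTION_HANDLER

state = {"PC": 0, "EPC": 0, "Cause": 0,
         "if_id_control": {}, "id_ex_control": {}, "ex_mem_control": {}}
take_exception(state, faulting_pc_plus_4=0x00400054, cause_code=12)  # 12 = overflow
print(hex(state["PC"]), hex(state["EPC"]), state["Cause"])
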
For external interrupts, there is some flexibility as to which instruction
the pipeline completes before taking the interrupt; the choice is somewhat
arbitrary.
SUPERSCALAR PROCESSORS (6.8)
In the hunt for greater performance, the step after pipelining was superscalar.
A superscalar CPU is capable of issuing (fetching) more than one instruction
at a time. Usually the ORDER of a superscalar machine is 2 or 4
(simultaneous instructions). How is this done? (A small sketch of the
basic issue check follows the list below.)
1) The cache memory can send 2 or 4 instructions at a time to the CPU.
(The path from DRAM to cache may be narrower).
2) The IF and ID stages operate on these instructions in parallel, which
usually requires the register file to be multi-ported.
3) Once the instructions are decoded, they are scheduled for issue to one
of a collection of functional units:
(Diagram: IF and ID feed a collection of functional units (two integer
units, a float unit, a load/store unit, and a branch unit), whose results
go to a common writeback stage.)
4) Instructions may be issued OUT OF ORDER if there is no functional
unit available at the right time (structural hazard). This is called
dynamic scheduling.
5) Some instructions (div, float ops) take longer, so instructions can
complete OUT OF ORDER
6) WRITEBACK is done by collecting results and writing them back in
order, with appropriate data forwarding.
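
A minimal sketch of the simplest, in-order, 2-way issue check implied by items 1-4 (Python; the rules are deliberately simplified, e.g. one memory operation per pair and no RAW dependence between the two, and the instruction descriptions are hand-written tuples, not decoder output):

# Two consecutive instructions can leave ID together only if the second does
# not read what the first writes (RAW) and they don't both need the single
# load/store unit (a structural hazard).
def can_dual_issue(first, second):
    dest1, srcs1, is_mem1 = first[1], first[2], first[3]
    dest2, srcs2, is_mem2 = second[1], second[2], second[3]
    raw = dest1 is not None and dest1 in srcs2        # data dependence
    structural = is_mem1 and is_mem2                  # only one memory port
    return not raw and not structural

pairs = [
    (("add $1,$2,$3", "$1", ["$2", "$3"], False),
     ("add $4,$5,$6", "$4", ["$5", "$6"], False)),    # independent -> yes
    (("add $1,$2,$3", "$1", ["$2", "$3"], False),
     ("add $4,$1,$6", "$4", ["$1", "$6"], False)),    # RAW on $1   -> no
    (("lw  $1,0($2)", "$1", ["$2"],       True),
     ("sw  $3,4($2)", None, ["$3", "$2"], True)),     # both memory -> no
]
for a, b in pairs:
    print(a[0], "+", b[0], "->", can_dual_issue(a, b))
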
Control and exceptions are VERY COMPLICATED! A SCOREBOARD is
often used to keep track of instructions as they are executing, and
data/control dependencies between instructions.
Nifty techniques can be implemented, such as executing both the nonbranch
code and the branch-to code at the same time, then tossing out the
instructions that were not to be executed.
The DEC Alpha 21264 was the first processor to combine superscalar issue
with deep pipelines (nine stages for integer ops alone!): 4th order, a
maximum of 6 instructions issued at once, and the first CPU to reach a
1 GHz clock.
Problems:
1) (6.2) Show the forwarding paths needed to execute the following three
instructions:
add $2, $3, $4
add $4, $5, $6
add $5, $3, $4
Solution: we need a forwarding path from the second instruction to the
third, since the third depends on the second ($4).
2) (6.3) How could we modify the following code to make use of a delayed
branch slot?
Loop: lw   $2, 100($3)
      addi $3, $3, 4
      beq  $3, $4, Loop

Solution:

Loop: addi $3, $3, 4
      beq  $3, $4, Loop
      lw   $2, 96($3)

(The lw now sits in the delay slot; its offset becomes 96 because the addi
has already added 4 to $3 by the time it executes.)
3) (6.11) Consider executing the following code on the pipelined datapath of
Figure 6.46.
add $1, $2, $3
add $4, $5, $6
add $7, $8, $9
add $10, $11, $12
add $13, $14, $15
At the end of the fifth cycle of execution, which registers are being read and
which registers being written?
Solution: In the fifth cycle, register $1 will be written and $11 and $12 will
be read.
4) (6.12) With regard to the last problem, explain what the forwarding unit is
doing during the fifth cycle of execution. If any comparisons are being
made, mention them.
Solution: The forwarding unit is looking at the instructions in the fourth and
fifth stages and checking to see whether they intend to write to the register
file and whether the register written is being used as an ALU input. Thus, it
is comparing 8 = 4? 8 = 1? 9 = 4? 9 = 1?
5) (6.23) Normally we want to maximize performance on our pipelined
datapath with forwarding and stalls on use after a load. Rewrite this code to
minimize performance while still obtaining the same result.
lw  $3, 0($5)
lw  $4, 4($5)
add $7, $7, $3
add $8, $8, $4
add $10, $7, $8
sw  $6, 0($5)
beq $10, $11, Loop

Solution:

lw  $3, 0($5)
add $7, $7, $3
sw  $6, 0($5)
lw  $4, 4($5)
add $8, $8, $4
add $10, $7, $8
beq $10, $11, Loop

(Each load is now immediately followed by an instruction that uses its
result, forcing a stall after every lw, while the computed values are
unchanged.)
Homework:
1) (6.4) Identify all of the data dependencies in the following code. Which
dependencies are data hazards that will be removed via forwarding?
add $2, $5, $4
add $4, $2, $5
sw  $5, 100($2)
add $3, $2, $4
2) (6.9) Given Figure 6.71, determine as much as you can about the five
instructions in the five pipeline stages. If you cannot fill in a field of an
instruction, state why.
3) (6.14) Consider a program consisting of 100 lw instructions and in which
each instruction is dependent upon the instruction before it. What would the
actual CPI be if the program were run on the pipelined datapath of Figure
6.45?
4) (6.15) Consider executing the following code on the pipelined datapath of
Figure 6.46.
add $5, $6, $7
lw  $6, 100($7)
sub $7, $6, $8
How many cycles will it take to execute this code? State what hazards there
are and how they are fixed.