CS 1104
Help Session IV
Five Issues in Pipelining
Colin Tan,
ctank@comp.nus.edu.sg
S15-04-15
Issue 1
Pipeline Registers
• Instruction execution typically involves several
stages
– Fetch: Instructions are read from memory
– Decode: The instruction is interpreted, data is read from
registers.
– Execute: The instruction is actually executed on the
data
– Memory: Any data memory accesses are performed
(e.g. read, write data memory)
– Writeback: Results are written to destination registers.
Issue 1
Pipeline Registers
• Stages are built using combinational devices
– The moment something is placed on the inputs of a
combinational device, its outputs change.
– These outputs form the inputs to other stages, causing their
outputs to change as well.
– Hence instructions in the fetch stage affect every other
stage in the CPU.
– It is not possible to have multiple instructions at different
stages, since each stage would affect the stages further
down.
• The effect is that the CPU can only execute 1 instruction at a time.
Issue 1
Pipeline Registers
• To support pipelining, stages must be de-coupled
from each other.
– Add pipeline registers!
– Pipeline registers allow each stage to hold a different
instruction, as they prevent one stage from affecting
another stage until the appropriate time.
• Hence we can now execute 5 different instructions
in the 5 stages of the pipeline
– a different instruction at each stage.
Issue 2
Speedup
• Figure below shows a non-pipelined CPU
executing 2 instructions
Instr 1: IF ID EX MEM WB
Instr 2:                IF ID EX MEM WB
• Assuming each stage takes 1 cycle, each
instruction will take 5 cycles (i.e. CPI=5)
Issue 2
Speedup
Instr 1: IF ID EX MEM WB
Instr 2:    IF ID EX MEM WB
Instr 3:       IF ID EX MEM WB
• For pipelined case:
– First instruction takes 5 cycles
– Subsequent instructions will take 1 cycle
• The other 4 cycles of each subsequent instruction are
overlapped with the previous instructions (see diagram).
Issue 2
Speedup
• For N+1 instructions, the 1st instruction takes 5
cycles; the subsequent N instructions take 1 cycle each.
• The total number of cycles is now 5+N.
• Hence the average number of cycles per instruction is:
CPI = (5+N)/(N+1)
• As N tends to infinity, CPI tends to 1.
• Compared with the non-pipelined case, a 5-stage
pipeline gives a 5:1 speedup!
• Ideally, an M stage pipeline will give an M times
speedup.
Issue 3
Hazards
• A hazard is a situation where computation is
prevented from proceeding correctly.
– Hazards can cause performance problems.
– Even worse, hazards can cause computation to
be incorrect.
– Hence hazards must be resolved
Issue 3A
Structural Hazards
• Structural Hazards
• Generally a shared resource (e.g. Memory, disk drive) can be
used by only 1 processor or pipeline stage at a time.
• If >1 processor or pipeline stage needs to use the resource, we
have a structural hazard.
• E.g. if 2 processors want to use a bus, we have a structural
hazard that must be resolved by arbitration (See I/O notes).
• Structural hazards are reduced in the MIPS by having separate
instruction and data memories
– If they were in the same memory, it is possible that the IF stage
may try to fetch instructions at the same time as the MEM stage
accesses data => Structural Hazard results.
Issue 3B
Data Hazards
• Data Hazards
• Caused by having >1 instruction executing concurrently.
• Consider the following instructions:
add $1,$2,$3   IF ID EX MEM WB
add $4,$1,$1      IF ID EX MEM WB
• The first add instruction will update the contents of $1 in WB
stage.
• But the second instruction will read $1 2 cycles earlier, in the
ID stage! It will obviously read the old value of $1, and the add
instruction will give the wrong result.
Issue 3B
Data Hazards
• The result that will be written to $1 in the WB stage first
becomes available at the ALU in the EX stage.
• The result for the first add becomes available from the ALU
just as the second instruction needs it.
• If we can just send this result over, the hazard is resolved!
This is called “Forwarding”.
add $1,$2,$3   IF ID EX MEM WB
add $4,$1,$1      IF ID EX MEM WB
(the first add’s ALU result is forwarded straight into the second add’s EX stage)
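The forwarding decision itself is just a comparison of pipeline-register fields. A minimal C sketch, using illustrative field names (exmem_rd, idex_rs and so on) rather than the actual MIPS control-signal names:

```c
#include <stdbool.h>

/* Forward from the EX/MEM pipeline register into the ALU when the
 * instruction one stage ahead writes the register that the
 * instruction now in EX is about to read ($0 is never forwarded). */
bool forward_from_ex_mem(bool exmem_regwrite, int exmem_rd, int idex_rs) {
    return exmem_regwrite && exmem_rd != 0 && exmem_rd == idex_rs;
}
```

For add $1,$2,$3 followed by add $4,$1,$1, forward_from_ex_mem(true, 1, 1) returns true, so the second add’s ALU operand comes from the forwarded result rather than the stale register-file value.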
Issue 3B
Data Hazards
• Sometimes forwarding doesn’t quite work.
Consider:
lw $1, 4($3)
add $4, $1, $1
• For the lw instruction, the EX stage actually
computes 4 + $3 (i.e. the 4($3) portion of the
instruction).
• This is the load address, not the loaded data. There
is no use forwarding it to the add instruction!
Issue 3B
Data Hazards
• The result of the ALU stage (i.e. the load address
of the lw instruction) is sent to the memory
system in the MEM stage, and the contents of that
address become available at the end of the
MEM stage.
• But the add instruction needs the data at the start of the
EX stage. We have this situation:
Issue 3B
Data Hazards
lw $1,4($3)    IF ID EX MEM WB
add $4,$1,$1      IF ID EX MEM WB
(the loaded data is ready only at the end of MEM, after the add’s EX has already started)
• The forwarding would have to be done backwards in time:
at the point when the add instruction needs the
data, the data is not yet available.
• Since it is not yet available, it cannot be forwarded!!
Issue 3B
Data Hazards
• This form of hazard is called a “load-use” hazard,
and is the only type of data hazard that cannot be
resolved by forwarding.
• The way to resolve this is to stall the add instruction
by 1 cycle.
lw $1,4($3)    IF ID EX MEM WB
add $4,$1,$1      IF ID (stall) EX MEM WB
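This stall decision is the classic load-use interlock. A hedged C sketch; the field names are illustrative, not the real control-signal names:

```c
#include <stdbool.h>

/* Load-use interlock: stall (insert one bubble) when the
 * instruction in EX is a load whose destination register matches
 * a source register of the instruction being decoded. */
bool load_use_stall(bool idex_memread, int idex_rt,
                    int ifid_rs, int ifid_rt) {
    return idex_memread && (idex_rt == ifid_rs || idex_rt == ifid_rt);
}
```

For lw $1,4($3) followed by add $4,$1,$1, load_use_stall(true, 1, 1, 1) returns true, so one bubble is inserted before the add enters EX.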
Issue 3B
Data Hazards
• Stalling is a bad idea, as the processor spends 1
cycle not doing anything.
• Alternatively, we can find an independent instruction to
place between the lw and the add.
lw $1,4($3)    IF ID EX MEM WB
sub $5,$7,$7      IF ID EX MEM WB
add $4,$1,$1         IF ID EX MEM WB
(the independent sub fills the load delay, so the add no longer stalls)
Issue 3B
Data Hazards
• Forwarding can be done either from the ALU
stage or from the MEM stage.
add $1,$2,$3   IF ID EX MEM WB
add $4,$1,$5      IF ID EX MEM WB   ($1 forwarded from the ALU stage)
add $6,$1,$7         IF ID EX MEM WB   ($1 forwarded from the MEM stage)
Issue 3B
Data Hazards
• Hazards between adjacent instructions are
resolved from the ALU stage (between add $1,$2,
$3 and add $4,$1,$5)
• Hazards between instructions separated by another
instruction (between add $1,$2,$3 and add
$6,$1,$7) are resolved from the MEM stage.
– This is because if we tried to resolve this from the ALU
stage, the add $6,$1,$7 instruction would actually get the
result of the previous (add $4,$1,$5) instruction
instead (since that is the instruction that is in the EX
stage)
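The choice between the two forwarding sources can be sketched in C as well. Note that the EX/MEM (most recent) result takes priority, which is exactly why the one-apart hazard must be served from the MEM stage. Field names are illustrative:

```c
#include <stdbool.h>

enum fwd_src { FWD_NONE, FWD_EXMEM, FWD_MEMWB };

/* Pick the forwarding source for one ALU operand. The EX/MEM
 * (most recent) result takes priority over MEM/WB, so an
 * adjacent-instruction hazard is served from the ALU stage and a
 * one-apart hazard from the MEM stage. */
enum fwd_src pick_source(int src,
                         bool exmem_write, int exmem_rd,
                         bool memwb_write, int memwb_rd) {
    if (exmem_write && exmem_rd != 0 && exmem_rd == src)
        return FWD_EXMEM;
    if (memwb_write && memwb_rd != 0 && memwb_rd == src)
        return FWD_MEMWB;
    return FWD_NONE;
}
```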
Issue 3C
Control Hazards
• In an unoptimized pipeline branch decisions are
made after the EX stage
– The EX stage is where the comparisons are made.
– Hence the branching decision becomes available at the
end of the EX stage.
• Pipeline can be optimized by moving the
comparisons to the ID stage
– Comparisons are always made between register
contents (e.g. beq $1, $3, R).
– The register contents are available by the end of the ID
stage.
Issue 3C
Control Hazards
• However, we still have a problem. E.g.
L1: add $3, $1, $1
beq $3, $4, L1
sub $4, $3, $3
• Depending on whether the beq is taken or not,
the next instruction to be fetched is either add
(if the branch is taken) or sub (if the branch is
not taken)
Issue 3C
Control Hazard
beq $3,$4,L1   IF ID EX MEM WB
add or sub?       IF
(add $3,$1,$1 if the branch is taken, sub $4,$3,$3 if not)
• We don’t know which instruction to fetch until the
end of the ID stage.
• But the IF stage must still fetch something!
– Fetch add or fetch sub?
Issue 3C
Control Hazards - Solutions
• Always assume that the branch is taken:
– The fetch stage will fetch the add instruction.
– By the time this fetch is complete, the outcome of the
branching is known.
– If the branch is taken, the add instruction proceeds to
completion.
– If the branch is not taken, the add instruction is flushed
from the pipeline, and the sub instruction is fetched and
executed. This causes a 1 cycle stall.
Issue 3C
Control Hazards - Solutions
• Always assume that the branch is not taken
– The IF stage will fetch the sub instruction first.
– By this time, the outcome of the branch will be known.
– If the branch is not taken, then the sub instruction
executes to completion.
– If the branch is taken, then the sub instruction is
flushed, and the add instruction is fetched and
executed. Results in 1 cycle stall.
Issue 3C
Control Hazards - Solutions
• Delayed Branching
L1: add $3, $1, $1
beq $3, $4, L1
sub $4, $3, $3
ori $5, $2, $3
• Just like the assume not taken strategy, the IF
stage fetches the sub instruction. However, the sub
instruction executes to completion regardless of
the outcome of the branch.
Issue 3C
Control Hazards-Solutions
– The outcome of the branching will now be known.
– If the branch is taken, the add instruction is fetched and
executed.
– Otherwise the ori instruction is fetched and executed.
• This strategy is called “delayed branching”
because the effect of the branch is not felt until
after the sub instruction (i.e. 1 instruction later).
• The sub instruction here is called the delay slot or
delay instruction, and it will always be executed
regardless of the outcome of the branching.
Issue 4
Instruction Scheduling
• We must prevent pipeline stalls in order to gain
maximum pipeline performance.
• For example, for the load-use hazard, we must
find an instruction to place between the lw
instruction and the instruction that uses the lw
results to prevent stalling.
• Also may need to place instructions into delay
slots.
• This re-arrangement of instructions is called
Instruction Scheduling.
Issue 4
Instruction Scheduling
• Basic criteria:
– An instruction I3 to be placed between two instructions
I1 and I2 must be independent of both I1 and I2.
lw $1, 0($3)
add $2, $1, $4
sub $4, $3, $2
In this example, the sub instruction modifies $4, which is used by
the add instruction. Hence it cannot be moved between the lw
and the add.
Issue 4
Instruction Scheduling
– An instruction I3 that is moved must not violate
dependency orderings. For example:
add $1, $2, $3
sub $5, $1, $7
lw $4, 0($6)
ori $9, $4, $4
• The add instruction cannot be moved between the lw and ori
instructions as it would violate the dependency ordering with
the sub instruction.
– I.e. the sub depends on the add, and moving the add after the sub
would cause incorrect computation of the sub.
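Both criteria reduce to an independence check on the registers each instruction reads and writes. A hedged C sketch using register bitmasks; it deliberately ignores memory dependences, which a real scheduler must also track:

```c
#include <stdbool.h>

/* An instruction reduced to the registers it reads and writes,
 * encoded as bitmasks (bit k set means register $k). */
struct instr { unsigned reads, writes; };

/* Two instructions are independent if neither writes a register
 * the other reads or writes. */
bool independent(struct instr a, struct instr b) {
    return (a.writes & (b.reads | b.writes)) == 0 &&
           (b.writes & a.reads) == 0;
}

/* I3 may be placed between I1 and I2 only if it is independent
 * of both. */
bool can_move_between(struct instr i3, struct instr i1, struct instr i2) {
    return independent(i3, i1) && independent(i3, i2);
}
```

For the earlier example, sub $4,$3,$2 writes $4, which add $2,$1,$4 reads, so can_move_between reports that it cannot be moved between the lw and the add; an instruction like sub $5,$7,$7 passes the check.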
Issue 4
Instruction Scheduling
• The nop instruction stands for “no operation”.
• When the CPU reads and executes the nop
instruction, absolutely nothing happens, except
that 1 cycle is wasted executing this instruction.
• The nop instruction can be used in delay slots or
simply to waste time.
Issue 4
Instruction Scheduling
• Delay branch example: Suppose we have the following
program, and suppose that branches are not delayed:
add $3, $4, $5
add $2, $3, $7
beq $2, $3, L1
sub $7, $2, $4
L1:
• In this program, the 2 add instructions will be executed
regardless of the outcome of the branch, but the sub
instruction will not be executed if the branch is taken.
Issue 4
Instruction Scheduling
• Suppose a hardware designer modifies the beq instruction
to become a delayed branch.
– The sub instruction is now in the delay slot, and will be executed
regardless of the outcome of the branch!
– This is obviously not what the programmer originally intended.
– To correct this, we must place an instruction that will be executed
regardless of the outcome of the branch into the delay slot.
– Either of the 2 add instructions qualify, since they will be executed
no matter how the branch turns out.
– BUT moving either of them into the delay slot will cause incorrect
computation
• It would violate the dependency orderings between the first and
second add, and between the second add and the beq.
Issue 4
Instruction Scheduling
• But if we don’t move anything into the delay slot,
the program will not execute correctly.
• Solution: Place a nop instruction there.
add $3, $4, $5
add $2, $3, $7
beq $2, $3, L1
nop            # delay slot
sub $7, $2, $4
L1:
• The sub instruction now moves out of the delay
slot.
Issue 4
Instruction Scheduling
• Loop Unrolling
– Idea: If we loop 16 times to perform an operation, we
can duplicate that operation 4 times and loop only 4
times.
E.g.
for(int i=0; i<16; i++)
my[i] = my[i]+3;
Issue 4
Instruction Scheduling
• This loop can be unrolled to:
for(int i=0; i<16; i=i+4)
{
my[i] = my[i]+3;
my[i+1]=my[i+1]+3;
my[i+2]=my[i+2]+3;
my[i+3]=my[i+3]+3;
}
Issue 4
Instruction Scheduling
• But why even bother doing this?
– Loop unrolling actually generates more instructions!
• Previously we only had 1 instruction doing my[i]=my[i]+3
• Now we have 4 such instructions!
– Increasing the number of instructions gives us more
flexibility in scheduling the code.
• This allows us to eliminate pipeline stalls etc. more effectively.
Issue 5
Improving Performance
• Super-pipelines
– Each pipeline stage is further broken down.
– Effectively each pipeline stage is in turn pipelined.
– E.g. if each stage can be further broken down into 2
sub-stages:
IF1 IF2 ID1 ID2 EX1 EX2 M1 M2 WB1 WB2
Issue 5
Improving Performance
• This allows us to accommodate more
instructions in the pipeline.
– Now we can have 10 instructions operating
simultaneously.
– So now we can have a 10x speedup over the
non-pipeline architecture instead of just a 5x
speedup!
Issue 5
Improving Performance
• Unfortunately when things go wrong, penalties are
higher:
– E.g. a branch misprediction resulting in an IF-stage
flush will now cause 2 bubbles (in the IF1 and IF2
stages) instead of just 1.
– In a load-use stall, 2 bubbles must be inserted.
Issue 5
Improving Performance
• Superscalar Architectures
– Have 2 or more pipelines working at the same time!
• In a single pipeline, normally the best CPI possible is 1.
• With 2 pipelines, the average CPI goes down to 1/2!
– This will allow us to execute twice as many instructions
per second.
– The real situation is not that ideal
• Instructions going to 2 different pipelines simultaneously must
be independent of each other
– There is NO forwarding between pipelines!
Issue 5
Improving Performance
• If CPU is unable to find independent instructions,
then 1 pipeline will remain idle.
• Example of superscalar machine:
– Pentium processor - 2 integer pipelines, 1 floating-point
pipeline, giving a total of 3 pipelines!
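The pairing restriction can be sketched like the other hazard checks: with no forwarding between pipelines, the second instruction of a pair must not depend on the first. The field names are illustrative:

```c
#include <stdbool.h>

/* One issue slot: destination register and two source registers. */
struct slot { int rd, rs, rt; };

/* With no forwarding between pipelines, instruction b can issue in
 * the same cycle as a only if it neither reads nor overwrites a's
 * result register ($0 carries no real result). */
bool can_dual_issue(struct slot a, struct slot b) {
    if (a.rd == 0)
        return true;
    return b.rs != a.rd && b.rt != a.rd && b.rd != a.rd;
}
```

For example, add $1,$2,$3 and add $4,$1,$1 cannot issue together (the second reads $1), while add $1,$2,$3 and sub $5,$6,$7 can; when no independent pair is found, one pipeline idles.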
Summary
• Issue 1: Pipeline registers
– These decouple stages so that different instructions can
exist in each stage.
– Allows us to execute multiple instructions in a single
pipeline.
• Issue 2: Speed-up
– Ideally, an N stage pipeline should give you an N times
speedup.
Summary
• Issue 3: Hazards
– Structural hazards: Solved by having multiple
resources. E.g. Separate memory for instruction and
data.
– Data hazards: Solved by forwarding or stalling.
– Control hazards: Solved by branch prediction or
delayed branching.
Summary
• Issue 4: Instruction Scheduling
– Instructions may need to be re-arranged to avoid
pipeline stalls (e.g. load-use hazards) or to ensure
correct execution (e.g. filling in delay slots).
– Loop unrolling gives extra instructions.
• This in turn gives better scheduling opportunities.
Summary
• Issue 5: Improving Performance
– Super-pipelines: Increases pipeline depth.
• A 5-stage pipeline becomes a 10-stage pipeline,
improving performance by 10x instead of 5x.
• Also causes larger penalties.
– Super-scalar pipelines: Have multiple pipelines
• Can increase instruction execution rate.
• Average CPI can actually fall below 1!