ENEE350 Lecture Notes-Weeks 14 and 15
Pipelining & Amdahl’s Law
Pipelining is a method of processing in which a problem is divided into a number of sub-problems, and the solutions of the sub-problems for different instances of the problem are overlapped in time.
Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3,…,n
[Figure: a linear pipeline of four adders separated by latches, each stage with delay D. The operands b[i], c[i], d[i], e[i], f[i] enter at one end; successive instances (b[1], b[2], ..., f[1], f[2], ...) follow one another through the stages, and the results a[1], a[2], ... emerge at the other end.]
Adders have delay D to compute.
Computation time = 4D + (n-1)D = (n+3)D
Speed-up = 4nD/((n+3)D) -> 4 for large n.
We can describe the computation process in a linear pipeline algorithmically. There are
three distinct phases to this computation: (a) filling the pipeline,
(b) running the pipeline in the filled state until the last input arrives, and
(c) emptying the pipeline.
(linear pipeline; n segments, the last input arrives at clock m)
while (1)
{  resetLatches();
   clock = 0;
   // fill the pipeline
   for (j = 0; j <= n-1; j++)
   {  for (k = 0; k <= j; k++) segment(k);
      clock++;
   }
   // execute all segments until the last input arrives
   while (clock <= m)
   {  for (j = 0; j <= n-1; j++) segment(j);
      clock++;
   }
   // empty the pipeline
   for (j = 0; j <= n-1; j++)
   {  for (k = j; k <= n-1; k++) segment(k);
      clock++;
   }
}
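Note: as a concrete illustration of the three phases, the following C sketch simulates the four-adder pipeline for a[i] = b[i] + c[i] + d[i] + e[i] + f[i]. It is a minimal sketch, not part of the original notes: the latch arrays (who, sum), the sample operand values, and the back-to-front shifting scheme are all illustrative choices.

#include <stdio.h>

#define STAGES 4   /* four adder segments, each with delay D */
#define N 8        /* number of data instances (n) */

int main(void)
{
    int b[N], c[N], d[N], e[N], f[N], a[N];
    int  who[STAGES];          /* instance in each segment's output latch, -1 = empty */
    long sum[STAGES] = {0};    /* partial sum in each segment's output latch */

    for (int i = 0; i < N; i++) {                  /* sample operands */
        b[i] = i; c[i] = 2 * i; d[i] = 3 * i; e[i] = 4 * i; f[i] = 5 * i;
    }
    for (int k = 0; k < STAGES; k++) who[k] = -1;  /* resetLatches() */

    /* n + 3 clocks cover the fill, steady-state, and empty phases */
    for (int clock = 0; clock < N + STAGES - 1; clock++) {
        /* shift back to front so each latch is consumed before it is overwritten */
        for (int k = STAGES - 1; k > 0; k--) {
            who[k] = who[k - 1];
            sum[k] = sum[k - 1];
            if (who[k] >= 0) {                     /* segment k adds one operand */
                int i = who[k];
                sum[k] += (k == 1) ? d[i] : (k == 2) ? e[i] : f[i];
            }
        }
        /* segment 0 consumes the next instance and computes b[i] + c[i] */
        if (clock < N) { who[0] = clock; sum[0] = b[clock] + c[clock]; }
        else who[0] = -1;

        if (who[STAGES - 1] >= 0)                  /* a result leaves the pipeline */
            a[who[STAGES - 1]] = (int)sum[STAGES - 1];
    }

    for (int i = 0; i < N; i++)
        printf("a[%d] = %d (expected %d)\n", i, a[i], 15 * i);
    return 0;
}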
Instruction pipelines:
Goal:
(i) to increase the throughput (number of instructions/sec) in executing programs;
(ii) to reduce the execution time (clock cycles/instruction, etc.).
clock | fetch | decode | execute
  0   | I1    |        |
  1   | I2    | I1     |
  2   | I3    | I2     | I1
  3   | I4    | I3     | I2
  4   |       | I4     | I3
clock | fetch | decode | execute | memory | write back
  0   | I1    |        |         |        |
  1   | I2    | I1     |         |        |
  2   | I3    | I2     | I1      |        |
  3   | I4    | I3     | I2      | I1     |
  4   | I5    | I4     | I3      | I2     | I1
Speed-up of pipelined execution of instructions over a sequential execution:

S(5) = (CPI_u × N_u × f_5) / (CPI_p × N_p × f_1)

where the subscripts u and p refer to the unpipelined and the 5-stage pipelined machines, N is the number of instructions executed, and f_1 and f_5 are the respective clock frequencies. Assuming that the systems operate at the same clock rate and execute the same number of instructions:

S(5) = CPI_u / CPI_p
Example
Suppose that the instruction mix of programs executed on the serial and pipelined machines is 40% ALU, 20% branching, and 40% memory, with 4, 2, and 4 cycles per instruction in the three classes, respectively. Then, under ideal conditions (no stalls due to hazards),

CPI_u = 4 × 0.4 + 2 × 0.2 + 4 × 0.4 = 3.6
CPI_p = 1

S(5) = CPI_u / CPI_p = 3.6
If the clock cycle time needs to be increased for the pipelined implementation, then the speed-up will have to be scaled down accordingly.
MIPS Pipeline

Register operations:        IF  ID  EX  WB
Register/Memory operations: IF  ID  EX  ME  WB

Instruction Pipelines (Hennessy & Patterson)
Hazards
1- Structural Hazards
2- Data Hazards
3- Control Hazards
Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period.
Example: Memory conflict (data fetch + instruction fetch) or datapath conflict (arithmetic operation + PC update)
Clock | IF | ID | EX | ME | WB
  0   | I1 |    |    |    |
  1   | I2 | I1 |    |    |
  2   | I3 | I2 | I1 |    |
  3   | I4 | I3 | I2 | I1 |
  4   | I5 | I4 | I3 | I2 | I1
  5   | I6 | I5 | I4 | I3 | I2
  6   | I7 | I6 | I5 | I4 | I3

(For instance, at clock 4 the IF of I5 and the ME of I2 both need the memory.)
Fix: duplicate hardware (too expensive), or stall the pipeline to serialize the operation (too slow), as below:
Clock | IF | ID | EX | ME | WB
  0   | I1 |    |    |    |
  1   | I2 | I1 |    |    |
  2   |    | I2 | I1 |    |
  3   |    |    | I2 | I1 |
  4   | I3 |    |    | I2 | I1
  5   | I4 | I3 |    |    | I2
  6   |    | I4 | I3 |    |
  7   |    |    | I4 | I3 |
  8   | I5 |    |    | I4 | I3
  9   | I6 | I5 |    |    | I4

(Blank entries are stall bubbles: instructions are fetched in pairs, and the fetch unit then waits two clocks for the memory to become free.)
Speed-up = T_serial / T_pipeline
= 5n·t_s / (2n·t_s + 2t_s), for odd n
= 5n·t_s / (2n·t_s + 3t_s), for even n
-> 5/2 as the number of instructions, n, tends to infinity.
Thus, we lose half the throughput due to stalls.
Note: The pipeline time of execution can be computed using the recurrence
T_1 = 4
T_i = T_(i-1) + 1 for even i
T_i = T_(i-1) + 3 for odd i
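Note: a quick C check of this recurrence (a sketch; the instruction count of 10,000 is an arbitrary test size) confirms that the speed-up approaches 5/2:

#include <stdio.h>

int main(void)
{
    long n = 10000;                  /* arbitrary instruction count */
    long T = 4;                      /* T_1 = 4: clock at which I1 completes */
    for (long i = 2; i <= n; i++)
        T += (i % 2 == 0) ? 1 : 3;   /* T_i = T_(i-1) + 1 (even i), + 3 (odd i) */
    /* the serial machine spends 5 cycles per instruction */
    printf("speed-up = %f\n", (double)(5 * n) / (double)T);   /* about 2.5 */
    return 0;
}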
Data Hazards
They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result.
Read After Write (RAW) Hazard (Data Dependency)
Write After Read (WAR) Hazard (Data Anti-dependency)
Write After Write (WAW) Hazard (Data Output Dependency)
RAW Hazards
They occur when reads are early and writes are late.

Clock | IF | ID | EX        | ME | WB
  0   | I1 |    |           |    |
  1   | I2 | I1 |           |    |
  2   | I3 | I2 | I1        |    |
  3   | I4 | I3 | I2 (Read) | I1 |
  4   | I5 | I4 | I3        | I2 | I1 (Write)
  5   | I6 | I5 | I4        | I3 | I2
  6   | I7 | I6 | I5        | I4 | I3

I1: R1 = R1 + R2    I2: R3 = R1 + R2

I2 reads R1 (clock 3) before I1 writes it back (clock 4), so I2 operates on a stale value of R1.
RAW Hazards (Cont'd)
They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding:

Clock | IF | ID | EX        | ME | WB
  0   | I1 |    |           |    |
  1   | I2 | I1 |           |    |
  2   | I3 | I2 | I1        |    |
  3   | I4 | I3 | I2 (Read) | I1 |
  4   | I5 | I4 | I3        | I2 | I1 (Write)
  5   | I6 | I5 | I4        | I3 | I2
  6   | I7 | I6 | I5        | I4 | I3

I1: R1 = R1 + R2    I2: R3 = R1 + R2

Here I1's result is forwarded from its EX/ME latch directly into I2's EX stage at clock 3, so I2 uses the new value of R1 without stalling, even though the register itself is not written until clock 4.
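Note: the forwarding decision itself reduces to comparing register specifiers across pipeline latches. The following C sketch illustrates the idea; the latch structures and field names are illustrative, not those of an actual MIPS implementation.

#include <stdio.h>

/* illustrative slices of the EX/ME and ID/EX pipeline latches */
struct ex_me { int rd; long alu_out; };     /* destination register + ALU result */
struct id_ex { int rs, rt; long a, b; };    /* source registers + operands read in ID */

/* before the younger instruction enters EX, forward the older
   instruction's still-unwritten result if a RAW dependence exists */
void forward(struct id_ex *cur, const struct ex_me *prev)
{
    if (prev->rd == cur->rs) cur->a = prev->alu_out;   /* RAW on first operand  */
    if (prev->rd == cur->rt) cur->b = prev->alu_out;   /* RAW on second operand */
}

int main(void)
{
    /* I1: R1 = R1 + R2 just produced 7; I2: R3 = R1 + R2 read a stale R1 = 3 */
    struct ex_me i1 = { 1, 7 };
    struct id_ex i2 = { 1, 2, 3, 4 };
    forward(&i2, &i1);
    printf("I2 operands: %ld %ld\n", i2.a, i2.b);      /* 7 4: the fresh R1 is used */
    return 0;
}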
WAR Hazards
They occur when writes are early and reads are late. Consider a pipeline in which each instruction performs two operations, so that it passes through the stages IF, ID, EX, ME, WB, EX, ME, WB:

Clock | IF | ID | EX | ME | WB         | EX        | ME | WB
  0   | I1 |    |    |    |            |           |    |
  1   | I2 | I1 |    |    |            |           |    |
  2   | I3 | I2 | I1 |    |            |           |    |
  3   | I4 | I3 | I2 | I1 |            |           |    |
  4   | I5 | I4 | I3 | I2 | I1         |           |    |
  5   | I6 | I5 | I4 | I3 | I2 (Write) | I1 (Read) |    |
  6   | I7 | I6 | I5 | I4 | I3         | I2        | I1 |

I1: R2 = R2 + R3; R9 = R3 + R4    I2: R3 = R7 + R5; R6 = R2 + R8

At clock 5, I2 writes R3 (early write) in its first WB while I1 still has to read R3 (late read) for its second operation; if the write completes first, I1 reads the wrong value.
Branch Prediction in Pipeline Instruction Sequencing
One of the major issues in pipelined instruction processing is to schedule
conditional branch instructions.
When a pipeline controller encounters a conditional branch instruction, it must choose between two instruction streams:
if the branch condition is met, then execution continues from the target of the conditional branch instruction;
otherwise, it continues with the instruction that follows the conditional branch instruction.
Example: Suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):

JCD R0 < 10,add;
SUB R0,R1;
JMP D,halt;
add: ADD R0,R1;
halt: HLT;

If we assume that R0 < 10, then the SUB instruction would have been incorrectly fetched during the second clock cycle, and we will need another fetch cycle to fetch the ADD instruction.
Classification of branch prediction algorithms
Static Branch Prediction: The branch decision does not change over time-- we use a fixed
branching policy.
Dynamic Branch Prediction: The branch decision does change over time-- we use a
branching policy that varies over time.
Static Branch Prediction Algorithms
1- Don't predict (stall the pipeline)
2- Never take the branch
3- Always take the branch
4- Delayed branch
1- Stall the pipeline by 1 clock cycle: This allows us to determine the target of the branch instruction.

JCD:      IF  ID  EX  ME  WB
SUB/ADD:      --  IF  ID  EX  ME  WB

Stall one cycle, decide the branch, and then fetch either SUB or ADD.
Pipeline Execution Speed (stall case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

CPI of the pipeline
= CPI of ideal pipeline + the number of idle cycles/instruction
= 1 + branch penalty × branch frequency
= 1 + branch frequency

In general, the CPI of the pipeline is greater than 1 + branch frequency because of data and possibly structural hazards.
Pros: Straightforward to implement
Cons: The time overhead is high when the instruction mix includes a high
percentage of branch instructions.
2- Never take the branch. The instruction in the pipeline is flushed if, after the ID stage is carried out, it is determined that the branch should have been taken.

JCD: IF  ID  EX  ME  WB
SUB:     IF  ID  EX  ME  WB
IOR:         IF  ID  EX  ME  WB   (branch not taken)
XOR:         IF  ID  EX  ME  WB   (branch taken: SUB is flushed)

The SUB instruction is always fetched; then either it completes and the IOR instruction is executed next, or SUB is flushed and XOR (the branch target) is executed.
Pipeline Execution Speed (never take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

CPI of the pipeline
= CPI of ideal pipeline + the number of idle cycles/instruction
= 1 + branch penalty × branch frequency × misprediction rate
= 1 + branch frequency × misprediction rate

Pros: If the prediction is highly accurate, the pipeline can operate close to its full throughput.
Cons: Implementation is not as straightforward, and it requires flushing if decoding the branch address takes more than 1 clock cycle.
3- Always take the branch. The instruction in the pipeline is flushed if it is determined that the branch should not have been taken.

JCD: IF  ID  EX  ME  WB               (the target address is computed in EX)
XOR:         IF  ID  EX  ME  WB       (the target, fetched once the address computation completes)
SUB:             IF  ID  EX  ME  WB   (fetched instead if the branch turns out not to be taken)
Pipeline Execution Speed (always take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

CPI of the pipeline
= CPI of ideal pipeline + the number of idle cycles/instruction
= 1 + branch penalty × branch frequency × correct-prediction rate
  + branch penalty × branch frequency × misprediction rate
= 1 + branch frequency × correct-prediction rate
  + 2 × branch frequency × misprediction rate

Pros: Better suited for the execution of loops without the compiler's intervention (but this can generally be overcome; see the next slide).
Cons: Implementation is not as straightforward, and the misprediction penalty is higher. It is not as advantageous as never taking the branch, since the branch address computation is not completed until after the EX segment is carried out.
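Note: the following C sketch compares the CPI of the three static policies above; the branch frequency and misprediction rate are made-up illustrative values, not measurements.

#include <stdio.h>

int main(void)
{
    double bf   = 0.20;   /* assumed branch frequency */
    double miss = 0.40;   /* assumed misprediction rate of the static guess */

    double cpi_stall  = 1.0 + bf;                    /* always pay 1 idle cycle  */
    double cpi_never  = 1.0 + bf * miss;             /* 1-cycle flush on a miss  */
    double cpi_always = 1.0 + bf * (1.0 - miss)      /* 1 idle cycle when right  */
                            + 2.0 * bf * miss;       /* 2 idle cycles when wrong */

    printf("stall  : CPI = %.3f\n", cpi_stall);
    printf("never  : CPI = %.3f\n", cpi_never);
    printf("always : CPI = %.3f\n", cpi_always);
    return 0;
}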
Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;

"Branch always" will not work well without the compiler's help:

CLR R0;
loop: JCD R0 >= 10,exit;
LDD R1,R0;
ADD R1,1;
ST+ R1,R0;
JMP D,loop;
exit:

"Branch always" will work well with the compiler's help:

CLR R0;
loop: LDD R1,R0;
ADD R1,1;
ST+ R1,R0;
JCD R0 < 10,loop;
4- Delayed branch: Insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: The pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline.
Cons: It is not always possible to find a delay-slot instruction, in which case a NOP instruction may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It also makes compilers work harder.
Which instruction should be placed into the branch delay slot?
4.1- Choose an instruction from before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:
ADD R1,R2;
JCD R2 > 10,exit;
can be rescheduled as
JCD R2 > 10,exit;
ADD R1,R2; (delay slot)
4.2- Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.
Example:
ADD R1,R2;
JCD R2 > 10,sub;
JMP D,add;
….
sub: SUB R4,R5;
add: ADI R3,5;
can be rescheduled as
ADD R1,R2;
JCD R2 > 10,sub;
ADI R3,5; (delay slot)
….
sub: SUB R4,R5;
4.3- Choose an instruction from the anti-target of the branch, but make sure that the moved instruction is executable when the branch is taken.
Example:
ADD R1,R2;
JCD R2 > 10,exit;
ADD R3,R2;
exit: SUB R4,R5;
can be rescheduled as
ADD R1,R2;
JCD R2 > 10,exit;
ADD R3,R2; (delay slot; schedule it for execution only if it does not alter the program flow or output)
exit: SUB R4,R5;
Dynamic Branch Prediction
--Dynamic branch prediction relies on the history of how branch conditions were resolved
in the past.
--History of branches is kept in a buffer. To keep this buffer reasonably small and easy to
access, the buffer is indexed by some fixed number of lower order bits of the address of
the branch instruction.
--Assumption is that the address values in the lower address field are unique enough to
prevent frequent collisions or overrides. Thus if we are trying to predict branches in a
program which remains within a block of 256 locations, 8 bits should suffice.
x:     JCD …
x+1:   .
       .
x+256: JCD …

(Two branches 256 locations apart map to the same 8-bit index and would therefore share, and override, the same buffer entry.)
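Note: this indexing is easy to express in C. A minimal sketch (the table size, the raw low-order-bit index, and the addresses are illustrative assumptions):

#include <stdint.h>
#include <stdio.h>

#define BHT_BITS 8                    /* 8 low-order address bits index the buffer */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];         /* one prediction entry per slot */

/* index the buffer with the low-order bits of the branch address */
static unsigned bht_index(uint32_t pc) { return pc & (BHT_SIZE - 1u); }

int main(void)
{
    uint32_t x = 0x1000;
    bht[bht_index(x)] = 1;            /* record "taken" for the branch at x */
    /* the branch at x + 256 collides with the entry for the branch at x */
    printf("%u %u\n", bht_index(x), bht_index(x + 256));
    return 0;
}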
Branch instructions in the instruction cache include a branch prediction field that is used to predict if the branch should be taken.

Memory Location | Program            | Branch prediction field
x               | Branch instruction | 0 (branch was not taken)
x+4             | …                  |
x+8             | Branch instruction | 0 (branch was not taken)
x+12            | …                  |
x+16            | Branch instruction | 1 (branch was taken)
x+20            | …                  |
Branch prediction:
In the simplest case, the field is a 1-bit tag:
0 <=> branch was not taken last time (State A)
1 <=> branch was taken last time (State B)

State A: on "taken" go to B; on "not taken" stay in A.
State B: on "not taken" go to A; on "taken" stay in B.

While in state A, predict the branch as "not to be taken".
While in state B, predict the branch as "to be taken".

This works relatively well: it accurately predicts the branches in loops in all but two of the iterations.
CLR R0;
loop: LDD R1,R0;
ADD R1,1;
ST+ R1,R0;
JCD R0 < 10,loop;
Assuming that we begin in state A, prediction fails
when R0 = 1 (branch is not taken when it should be)
and R0 = 10 (branch is taken when it should not be).
Assuming that we begin in state B, prediction fails only
when R0 = 10 (branch is taken when it should not be).
We can modify the loop to make the branch prediction algorithm fail twice when we begin in state B as well.
CLR R0;
loop: LDD R1,R0;
ADD R1,1;
ST+ R1,R0;
JCD R0 >= 10,exit;
JMP D,loop;
exit:
Assuming that we begin in state B, prediction fails
when R0 = 1 (branch is taken when it should not be)
and R0 = 10 (branch is not taken when it should be).
What is worse is that we can make this branch prediction algorithm fail each time it
makes a prediction:
LDI R0,1;
loop: JCD R0 > 0,neg;
LDI R0,1;
JMP D,loop;
neg: LDI R0,-1;
JMP D,loop;
Assuming that we begin in state A, prediction fails
when R0 = 1 (branch is not taken when it should be)
R0 = -1 (branch is taken when it should not be)
R0 = 1 (branch is not taken when it should be)
R0 = -1 (branch is taken when it should not be)
and so on
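Note: the 1-bit predictor is easy to simulate. The following C sketch replays the first loop above (JCD R0 < 10,loop) starting in state A and reports the two mispredictions:

#include <stdio.h>

int main(void)
{
    int state  = 0;   /* 0 = state A (predict not taken), 1 = state B (predict taken) */
    int misses = 0;

    /* outcome stream of JCD R0 < 10,loop: taken for R0 = 1..9, not taken at R0 = 10 */
    for (int r0 = 1; r0 <= 10; r0++) {
        int taken = (r0 < 10);
        if (state != taken) misses++;   /* prediction disagrees with the outcome */
        state = taken;                  /* 1-bit predictor: remember the last outcome */
    }
    printf("misses starting in state A: %d\n", misses);   /* 2 (at R0 = 1 and 10) */
    return 0;
}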
2-bit prediction (a more reluctant flip in decision):

State A1: on "not taken" stay in A1; on "taken" go to A2.
State A2: on "not taken" go to A1; on "taken" go to B1.
State B1: on "taken" stay in B1; on "not taken" go to B2.
State B2: on "taken" go to B1; on "not taken" go to A1.

While in states A1 and A2, predict the branch as "not to be taken".
While in states B1 and B2, predict the branch as "to be taken".
Applying the 2-bit predictor to the loop:

CLR R0;
loop: LDD R1,R0;
ADD R1,1;
ST+ R1,R0;
JCD R0 < 10,loop;

Assuming that we begin in state A1, prediction fails
when R0 = 1, 2 (branch is not taken when it should be)
and R0 = 10 (branch is taken when it should not be).
Assuming that we begin in state B1, prediction fails only
when R0 = 10 (branch is taken when it should not be).
2-bit predictors are more resilient to branch inversions (predictions are reversed only when they are missed twice):

LDI R0,1;
loop: JCD R0 > 0,neg;
LDI R0,1;
JMP D,loop;
neg: LDI R0,-1;
JMP D,loop;

Assuming that we begin in state B1, prediction
succeeds when R0 = 1 (branch is taken when it should be),
fails when R0 = -1 (branch is taken when it should not be),
succeeds when R0 = 1 (branch is taken when it should be),
fails when R0 = -1 (branch is taken when it should not be),
and so on. Where the 1-bit predictor failed on every prediction, the 2-bit predictor fails only on every other one.
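Note: the following C sketch encodes the 2-bit state diagram above as a transition table and replays the alternating example starting in state B1; it confirms that only every other prediction fails.

#include <stdio.h>

/* states: 0 = A1, 1 = A2 (predict not taken); 2 = B2, 3 = B1 (predict taken) */
/* next[state][outcome], outcome 0 = not taken, 1 = taken, per the diagram:   */
/*   A1: NT -> A1, T -> A2;   A2: NT -> A1, T -> B1;                          */
/*   B2: NT -> A1, T -> B1;   B1: NT -> B2, T -> B1.                          */
static const int next[4][2] = { {0, 1}, {0, 3}, {0, 3}, {2, 3} };

int main(void)
{
    int state = 3;                        /* begin in state B1 */
    /* outcomes of JCD R0 > 0,neg in the alternating example: T, NT, T, NT, ... */
    for (int i = 0; i < 6; i++) {
        int taken = (i % 2 == 0);
        int predict_taken = (state >= 2);
        printf("outcome %-2s predicted %-2s -> %s\n",
               taken ? "T" : "NT", predict_taken ? "T" : "NT",
               (predict_taken == taken) ? "hit" : "miss");
        state = next[state][taken];
    }
    return 0;
}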
Amdahl's Law (Fixed-Load Speed-up)
Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors (assuming a linear work function), p ≥ 1. Then

T(p) = q·T(1) + (1 - q)·T(1)/p

S(p) = T(1)/T(p) = 1/(q + (1 - q)/p) -> 1/q as p -> ∞

All this means is that the maximum speed-up of a system is limited by the fraction of the work that must be completed sequentially: the execution time using p processors can be reduced to q·T(1) under the best of circumstances, so the speed-up cannot exceed 1/q.
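Note: the law is a one-line C function; evaluating it for increasing p with q = 0.75 (the serial fraction of the example that follows) shows the saturation toward 1/q:

#include <stdio.h>

/* Amdahl's Law: q is the serial fraction, p is the processor count */
double speedup(double q, double p)
{
    return 1.0 / (q + (1.0 - q) / p);
}

int main(void)
{
    double q = 0.75;
    for (int p = 4; p <= 1024; p *= 4)
        printf("S(%4d) = %.3f\n", p, speedup(q, p));
    printf("limit 1/q = %.3f\n", 1.0 / q);    /* 1.333 */
    return 0;
}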
Example
A 4-processor computer executes instructions that are fetched from a random access memory over a shared bus. The task to be performed is divided into two parts:
1. Fetch instruction (serial part): it takes 30 microseconds.
2. Execute instruction (parallel part): it takes 10 microseconds.
The serial fraction is thus q = 30/(30 + 10) = 0.75, and

S(4) = T(1)/T(4) = 1/(0.75 + 0.25/4) = 4/3.25 = 1.23

Now, suppose that the number of processors is doubled. Then
S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28
Suppose that the number of processors is doubled again. Then
S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 ≈ 1.31
What is the limit?
S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) -> 1/0.75 = 1.333 as p -> ∞.
Alternate Forms of Amdahl's Law
If the fraction 1 - q of a computation can be enhanced by a factor of s, then

S = T(1) / (T_unenhanced + T_enhanced)
  = T(1) / (T(1)·(q + (1 - q)/s))
  = 1 / (q + (1 - q)/s) -> 1/q as s -> ∞

where s is the speed-up of the part of the computation that can be enhanced.
Example:
Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speed-up you expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?
Using Amdahl's Law with q = 0.8 (the fetch fraction, which is not enhanced) and s = 2:
S = 1/(0.8 + 0.2/2) = 1/0.9 = 1.111
Very disappointing, as you are likely to have paid quite a bit of money for the upgrade!
Generalized Amdahl's Law
In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speed-up of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:

S(p_1, p_2, …, p_k) = T(1) / T(p_1, p_2, …, p_k)
                    = T(1) / (q_1·T(1)/p_1 + q_2·T(1)/p_2 + … + q_k·T(1)/p_k)
                    = 1 / (q_1/p_1 + q_2/p_2 + … + q_k/p_k)

where q_1 + q_2 + … + q_k = 1.

When k = 2, q_1 = q, q_2 = 1 - q, p_1 = 1, p_2 = p, this formula reduces to Amdahl's Law.
Remark:
The generalized Amdahl's Law can also be rewritten to express the speed-up due to different amounts of speed enhancement (S_e) that can be made to different parts of a system:

S_e(s_1, s_2, …, s_k) = T(1) / T(s_1, s_2, …, s_k)
                      = T(1) / (q_1·T(1)/s_1 + q_2·T(1)/s_2 + … + q_k·T(1)/s_k)
                      = 1 / (q_1/s_1 + q_2/s_2 + … + q_k/s_k)

where q_1 + q_2 + … + q_k = 1.
Example:
Suppose that your computer executes a program that has the following profile of execution:
(a) 30% integer operations,
(b) 20% floating-point operations,
(c) 50% memory reference instructions.
How much speed-up will you expect if you double the speed of the floating-point unit of your computer? Using the formula above:
S_e = 1/(0.3 + 0.2/2 + 0.5) = 1/0.9 ≈ 1.11
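Note: the generalized formula is also a one-liner in C; the sketch below reproduces the floating-point example above (the array-based interface is an illustrative choice):

#include <stdio.h>

/* generalized Amdahl: fractions q[i] (summing to 1) enhanced by factors s[i] */
double speedup(const double q[], const double s[], int k)
{
    double denom = 0.0;
    for (int i = 0; i < k; i++) denom += q[i] / s[i];
    return 1.0 / denom;
}

int main(void)
{
    /* 30% integer, 20% floating point (speed doubled), 50% memory reference */
    double q[] = { 0.3, 0.2, 0.5 };
    double s[] = { 1.0, 2.0, 1.0 };
    printf("Se = %.3f\n", speedup(q, s, 3));   /* 1/(0.3 + 0.1 + 0.5) = 1.111 */
    return 0;
}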
Example:
Suppose that you have a fixed budget of $500 to upgrade each of the
computers in your laboratory, and you find out that the computations you
perform on your computers require
(a) 40% integer operations,
(b) 60% floating-point operations.
If every dollar spent on the integer unit after $50 decreases its execution
time by 2%, and if every dollar spent on the floating-point unit after $100
decreases its execution time by 1%, how would you spend the $500?
Example (Continued):

S = T(1) / (T_i(x_1) + T_f(x_2)), where x_1 + x_2 = 350 (the $150 of fixed costs are spent first)

T_i(x_1) = (1 - 0.02)·T_i(x_1 - 1)  =>  T_i(x_1) = 0.98^x_1 · T_i(0)
T_f(x_2) = (1 - 0.01)·T_f(x_2 - 1)  =>  T_f(x_2) = 0.99^x_2 · T_f(0)

T_i(0) = 0.4·T(1)
T_f(0) = 0.6·T(1)

Substituting these into the generalized Amdahl's speed-up expression gives:

S = T(1) / (0.98^x_1 × 0.4 × T(1) + 0.99^x_2 × 0.6 × T(1))
  = 1 / (0.4 × 0.98^x_1 + 0.6 × 0.99^x_2)
Example (Continued):
So we maximize
1 / (0.4 × 0.98^x_1 + 0.6 × 0.99^x_2)
subject to x_1 + x_2 = 350, or, equivalently, maximize
1 / (0.4 × 0.98^x_1 + 0.6 × 0.99^(350 - x_1))
subject to 0 ≤ x_1 ≤ 350.
Example (Continued):
Computing the values in the neighborhood of x_1 = 120 reveals that the speed-up is maximized when x_1 = 126.
From Mathematica:
Table[1/(0.4 * 0.98^x + 0.6 * 0.99^(350 - x)), {x, 120, 128, 1}]
{10.5398, 10.5518, 10.5616, 10.5691, 10.5744, 10.5776, 10.5785, 10.5773, 10.574}
Note: It is possible to have a higher speed-up with all of the money invested in one of the units if the fixed cost of one of the units becomes sufficiently large.
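Note: the maximization can also be done by brute force in C (compile with -lm); the sketch below scans all feasible allocations and reproduces x1 = 126:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* maximize 1 / (0.4 * 0.98^x1 + 0.6 * 0.99^(350 - x1)) over 0 <= x1 <= 350 */
    int best_x = 0;
    double best_s = 0.0;
    for (int x1 = 0; x1 <= 350; x1++) {
        double s = 1.0 / (0.4 * pow(0.98, x1) + 0.6 * pow(0.99, 350 - x1));
        if (s > best_s) { best_s = s; best_x = x1; }
    }
    printf("x1 = %d, x2 = %d, S = %.4f\n", best_x, 350 - best_x, best_s);
    /* prints x1 = 126, x2 = 224, S = 10.5785, matching the Mathematica table */
    return 0;
}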
Addendum:
If the changes in performance due to upgrades are specified in terms of speed rather than time, we can use the following formulation. Writing t = L/s for the time to execute work L at speed s,

t(s) - t(s + Δs) = L/s - L/(s + Δs) = L·Δs / (s·(s + Δs)) ≈ (L/s)·(Δs/s) = t·(Δs/s)

so each upgrade step scales the execution time as

T(x) = T(x - 1) - T(x - 1)·(Δs/s) = (1 - Δs/s)·T(x - 1)

where Δs/s denotes the percentage change in speed.