Lecture 5:
Introduction to Advanced Pipelining
L.N. Bhuyan
CS 162
DAP.F96 1
Arithmetic Pipeline
• Floating-point operations cannot complete in one cycle in the EX stage. Giving EX more time either lengthens the pipeline cycle time or forces subsequent instructions to stall.
• The solution is to break the FP EX stage into several stages, each with a delay that matches the cycle time of the instruction pipeline.
• Such an FP (arithmetic) pipeline does not reduce latency, but it can be decoupled from the integer unit and increases throughput for a sequence of FP instructions.
• What is a vector instruction and/or a vector computer?
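The latency-versus-throughput point can be made concrete with a minimal sketch. It assumes an idealized FP unit with no stalls; the function names and the 4-stage, 100-operation figures are illustrative and do not come from the slides.

```python
def unpipelined_cycles(n_ops: int, stages: int) -> int:
    # Without pipelining, each operation holds the whole unit for `stages` cycles.
    return n_ops * stages


def pipelined_cycles(n_ops: int, stages: int) -> int:
    # With pipelining, the first result still takes `stages` cycles (latency is
    # unchanged), but each later operation finishes one cycle after the previous one.
    return 0 if n_ops == 0 else stages + (n_ops - 1)


if __name__ == "__main__":
    print(unpipelined_cycles(100, 4))  # 400
    print(pipelined_cycles(100, 4))    # 103
    print(pipelined_cycles(1, 4))      # 4: a single operation is no faster
```

A single operation gains nothing, while a long sequence approaches one result per cycle.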
MIPS R4000 Floating Point
• FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:
Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers
MIPS FP Pipe Stages
FP Instr          Stages by cycle (1, 2, 3, …)
Add, Subtract     U, S+A, A+R, R+S
Multiply          U, E+M, M, M, M, N, N+A, R
Divide            U, A, R, D^28, …, D+A, D+R, D+R, D+A, D+R, A, R
Square root       U, E, (A+R)^108, …, A, R
Negate            U, S
Absolute value    U, S
FP compare        U, A, R
(X^n: stage X repeated n times)

Stages:
M   First stage of multiplier
N   Second stage of multiplier
R   Rounding stage
S   Operand shift stage
U   Unpack FP numbers
A   Mantissa ADD stage
D   Divide pipeline stage
E   Exception test stage
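The stage sequences in the table determine when two FP operations collide on shared hardware. The following sketch checks which issue distances make an add clash with an earlier multiply. It assumes every stage is a single, non-replicated resource and only detects the clash at the nominal issue distance (it does not model how the stall is resolved). The names `conflicting_offsets`, `MULTIPLY` and `ADD` are illustrative.

```python
# Stage usage per cycle after issue, copied from the MIPS FP Pipe Stages table.
MULTIPLY = [{"U"}, {"E", "M"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]


def conflicting_offsets(first, second, max_offset):
    """Return the issue distances d (in cycles) at which `second`, issued d
    cycles after `first`, needs a stage that `first` uses in the same cycle."""
    conflicts = []
    for d in range(1, max_offset + 1):
        for cycle, stages in enumerate(second):
            t = d + cycle  # cycle index measured from the issue of `first`
            if t < len(first) and first[t] & stages:
                conflicts.append(d)
                break
    return conflicts


if __name__ == "__main__":
    print(conflicting_offsets(MULTIPLY, ADD, 7))  # [4, 5]
    print(conflicting_offsets(ADD, MULTIPLY, 3))  # []
```

Under these assumptions, an add issued 4 or 5 cycles after a multiply collides with the multiply's final A and R stages, which is the "last step uses FP adder HW" effect noted earlier.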
Pipeline with Floating point operations
• Example of an FP pipeline integrated with the instruction pipeline: Fig. A.31, A.32 and A.33, distributed in class
• The pipeline consists of one integer unit with 1 stage, one FP/integer multiply unit with 7 stages, one FP adder with 4 stages, and an FP/integer divider with 24 stages
• A.31 shows the pipeline, A.32 shows execution of independent instructions, and A.33 shows the effect of data dependency
• The impact of data dependency is severe. Out-of-order execution becomes possible, which creates different hazards, to be considered later
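A small sketch shows why units of different depth lead to out-of-order completion even when instructions issue in order. It uses the EX depths listed above and assumes independent instructions, one issued per cycle, with instruction i entering EX in cycle i + 1. IF, ID, MEM and WB are ignored, and the unit names are illustrative.

```python
# EX depths from the slide: integer 1, FP/integer multiply 7, FP add 4, divide 24.
EX_STAGES = {"int": 1, "fmul": 7, "fadd": 4, "fdiv": 24}


def completion_order(units):
    """Given the functional unit used by each instruction (in program order),
    return the instruction indices sorted by the cycle in which they leave EX."""
    finish = [(i + EX_STAGES[u], i) for i, u in enumerate(units)]
    return [i for _, i in sorted(finish)]


if __name__ == "__main__":
    # A multiply, an add, then two integer operations, all independent.
    print(completion_order(["fmul", "fadd", "int", "int"]))  # [2, 3, 1, 0]
```

The multiply is issued first but finishes last, which is the situation that creates the additional hazards mentioned in the last bullet.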
R4000 Performance
[Chart: per-benchmark CPI (axis 0 to 4.5) for tomcatv, su2cor, spice2g6, ora, nasa7, doduc, li, gcc, espresso and eqntott, broken down into Base, Load stalls, Branch stalls, FP result stalls and FP structural stalls]
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)
FP Loop: Where are the Hazards?
Loop: LD    F0,0(R1)    ;F0=vector element
      ADDD  F4,F0,F2    ;add scalar from F2
      SD    0(R1),F4    ;store result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot

Instruction         Instruction          Latency in
producing result    using result         clock cycles
FP ALU op           Another FP ALU op    3
FP ALU op           Store double         2
Load double         FP ALU op            1
Load double         Store double         0
Integer op          Integer op           0

• Where are the stalls?
FP Loop Showing Stalls
1 Loop: LD    F0,0(R1)    ;F0=vector element
2       stall
3       ADDD  F4,F0,F2    ;add scalar in F2
4       stall
5       stall
6       SD    0(R1),F4    ;store result
7       SUBI  R1,R1,8     ;decrement pointer 8B (DW)
8       BNEZ  R1,Loop     ;branch R1!=zero
9       stall             ;delayed branch slot

Instruction         Instruction          Latency in
producing result    using result         clock cycles
FP ALU op           Another FP ALU op    3
FP ALU op           Store double         2
Load double         FP ALU op            1

• 9 clocks: Rewrite code to minimize stalls?
Minimizing Stalls Technique 1:
Compiler Optimization
1 Loop: LD    F0,0(R1)
2       stall
3       ADDD  F4,F0,F2
4       SUBI  R1,R1,8
5       BNEZ  R1,Loop     ;delayed branch
6       SD    8(R1),F4    ;altered when move past SUBI

Swap BNEZ and SD by changing address of SD

Instruction         Instruction          Latency in
producing result    using result         clock cycles
FP ALU op           Another FP ALU op    3
FP ALU op           Store double         2
Load double         FP ALU op            1

6 clocks
Technique 2: Loop Unrolling
(Software Pipelining)
1  Loop: LD    F0,0(R1)
2        ADDD  F4,F0,F2       ;1 cycle delay *
3        SD    0(R1),F4       ;drop SUBI & BNEZ – 2 cycles delay *
4        LD    F6,-8(R1)
5        ADDD  F8,F6,F2       ;1 cycle delay
6        SD    -8(R1),F8      ;drop SUBI & BNEZ – 2 cycles delay
7        LD    F10,-16(R1)
8        ADDD  F12,F10,F2     ;1 cycle delay
9        SD    -16(R1),F12    ;drop SUBI & BNEZ – 2 cycles delay
10       LD    F14,-24(R1)
11       ADDD  F16,F14,F2     ;1 cycle delay
12       SD    -24(R1),F16    ;2 cycles delay
13       SUBI  R1,R1,#32      ;alter to 4*8
14       BNEZ  R1,LOOP
15       NOP
* 1 cycle delay for an FP operation after a load; 2 cycles delay for a store after an FP operation
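At source level, the same transformation looks like the sketch below: four elements are processed per trip, and the index update and loop test happen once per four elements. This is an illustrative analogue rather than a translation of the assembly. The function names are made up, and the unrolled version assumes the element count is a multiple of 4, as the unrolled assembly implicitly does.

```python
def add_scalar(x, s):
    # Original loop: one element per trip, one index update and test per element.
    i = len(x) - 1
    while i >= 0:
        x[i] += s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    # Unrolled by 4: one index update and one loop test per four elements.
    # Assumption: len(x) is a multiple of 4.
    assert len(x) % 4 == 0
    i = len(x) - 1
    while i >= 0:
        x[i] += s
        x[i - 1] += s
        x[i - 2] += s
        x[i - 3] += s
        i -= 4
    return x


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(add_scalar(list(data), 10) == add_scalar_unrolled4(list(data), 10))  # True
```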
Minimize Stall + Loop Unrolling
1  Loop: LD    F0,0(R1)
2        LD    F6,-8(R1)
3        LD    F10,-16(R1)
4        LD    F14,-24(R1)
5        ADDD  F4,F0,F2
6        ADDD  F8,F6,F2
7        ADDD  F12,F10,F2
8        ADDD  F16,F14,F2
9        SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP        ; Delayed branch
14       SD    8(R1),F16      ; 8-32 = -24

• What assumptions were made when the code was moved?
– OK to move the store past SUBI even though SUBI changes the register
– OK to move loads before stores: do they still get the right data?
– When is it safe for the compiler to make such changes?
14 clock cycles, or 3.5 per iteration
When is it safe to move instructions?
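The clock counts for the four versions of the loop can be reproduced with a small model. It assumes an in-order, single-issue pipeline that tracks only register dependences and stalls a consumer until the latency from the slides' table has elapsed. Producer/consumer pairs that the table does not list are assumed to have latency 0, and the unfilled delay slot is modeled as a NOP. All names in the code are illustrative.

```python
# Latencies from the slides' table: (producer kind, consumer kind) -> cycles.
LATENCY = {
    ("fp", "fp"): 3,
    ("fp", "store"): 2,
    ("load", "fp"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def total_clocks(program):
    """program: list of (kind, dest, sources) in issue order.
    Returns the cycle in which the last instruction issues."""
    producer = {}  # register -> (issue cycle, kind) of its latest writer
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in producer:
                when, producer_kind = producer[reg]
                # Assumption: pairs missing from the table need no stall.
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, when + gap + 1)
        cycle = earliest
        if dest is not None:
            producer[dest] = (cycle, kind)
    return cycle


def ld(f): return ("load", f, ["R1"])
def addd(d, s): return ("fp", d, [s, "F2"])
def sd(f): return ("store", None, [f, "R1"])


SUBI = ("int", "R1", ["R1"])
BNEZ = ("int", None, ["R1"])
NOP = ("int", None, [])
PAIRS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]

ORIGINAL = [ld("F0"), addd("F4", "F0"), sd("F4"), SUBI, BNEZ, NOP]
SCHEDULED = [ld("F0"), addd("F4", "F0"), SUBI, BNEZ, sd("F4")]
UNROLLED = [op for a, b in PAIRS for op in (ld(a), addd(b, a), sd(b))]
UNROLLED += [SUBI, BNEZ, NOP]
UNROLLED_SCHEDULED = (
    [ld(a) for a, _ in PAIRS]
    + [addd(b, a) for a, b in PAIRS]
    + [sd(b) for _, b in PAIRS[:3]]
    + [SUBI, BNEZ, sd("F16")]
)

if __name__ == "__main__":
    print(total_clocks(ORIGINAL))            # 9
    print(total_clocks(SCHEDULED))           # 6
    print(total_clocks(UNROLLED))            # 27
    print(total_clocks(UNROLLED_SCHEDULED))  # 14
```

The model ignores memory addresses, so it cannot answer the safety questions above. It only counts cycles once a legal order has been chosen. Under these assumptions the unscheduled unrolled loop takes 27 clocks for four elements, compared with 14 after scheduling.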
Compiler Perspectives on Code
Movement
• Definitions: compiler concerned about dependencies in
program, whether or not a HW hazard depends on a given
pipeline
• Try to schedule to avoid hazards
• (True) Data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data
dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
Compiler Perspectives on Code
Movement
• Another kind of dependence called name dependence:
two instructions use same name (register or memory
location) but don’t exchange data
• Antidependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i
reads from and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory
location; ordering between instructions must be preserved.
EXAMPLE
I1. Load R1, A    /R1 ← Memory(A)/
I2. Add R2, R1    /R2 ← (R2)+(R1)/
I3. Add R3, R4    /R3 ← (R3)+(R4)/
I4. Mul R4, R5    /R4 ← (R4)*(R5)/
I5. Comp R6       /R6 ← Not(R6)/
I6. Mul R6, R7    /R6 ← (R6)*(R7)/

Dependence graph (program order I1, I2, …, I6):
I1 → I2: flow dependence (RAW)
I3 → I4: antidependence (WAR)
I5 → I6: output dependence, also flow dependence (WAW and RAW)

Due to superscalar processing, it is possible that I4 completes before I3 starts. Similarly, the final value of R6 depends on when I5 and I6 start and finish. Unpredictable result!
A sample program and its dependence graph, where I2 and I3 share the adder and I4 and I6 share the same multiplier. These two resource dependences can be removed by duplicating the resources or by using pipelined adders and multipliers.
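The three dependence kinds can be found mechanically from the registers each instruction reads and writes. The sketch below applies a naive pairwise check to the example program. It assumes one destination register per instruction and does not account for intervening writes between the two instructions of a pair; the tuple encoding and function name are illustrative.

```python
from itertools import combinations

# (name, register written, registers read), following the EXAMPLE program.
PROGRAM = [
    ("I1", "R1", []),            # Load R1, A
    ("I2", "R2", ["R2", "R1"]),  # Add  R2, R1
    ("I3", "R3", ["R3", "R4"]),  # Add  R3, R4
    ("I4", "R4", ["R4", "R5"]),  # Mul  R4, R5
    ("I5", "R6", ["R6"]),        # Comp R6
    ("I6", "R6", ["R6", "R7"]),  # Mul  R6, R7
]


def dependences(program):
    """Return (earlier, later, kinds) for every pair with a dependence."""
    found = []
    for (ni, wi, ri), (nj, wj, rj) in combinations(program, 2):
        kinds = []
        if wi in rj:
            kinds.append("RAW")  # later instruction reads what the earlier one wrote
        if wj in ri:
            kinds.append("WAR")  # later instruction overwrites what the earlier one read
        if wi == wj:
            kinds.append("WAW")  # both write the same register
        if kinds:
            found.append((ni, nj, kinds))
    return found


if __name__ == "__main__":
    for entry in dependences(PROGRAM):
        print(entry)
    # ('I1', 'I2', ['RAW'])
    # ('I3', 'I4', ['WAR'])
    # ('I5', 'I6', ['RAW', 'WAR', 'WAW'])
```

The check also reports a WAR on R6 for I5 and I6, because I5 reads R6 before I6 overwrites it. The graph on the slide labels that pair only as WAW and RAW.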
Register Renaming
Rewrite the previous program as:
• I1. R1b ← Memory(A)
• I2. R2b ← (R2a) + (R1b)
• I3. R3b ← (R3a) + (R4a)
• I4. R4b ← (R4a) * (R5a)
• I5. R6b ← -(R6a)
• I6. R6c ← (R6b) * (R7a)
Allocate more registers and rename the registers that do not really have a flow dependency. The WAR hazard between I3 and I4 and the WAW hazard between I5 and I6 have been removed.
These two hazards are also called name dependencies.
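The renaming above follows a simple rule: every write gets a fresh name, and every read uses the latest name of its register. A minimal sketch of that rule, assuming each register starts as version "a" and that versions are single letters (so at most 25 writes per register); the encoding is illustrative, not a description of any particular hardware scheme.

```python
def rename(program):
    """program: list of (dest, sources) in program order.
    Returns the same program with versioned register names."""
    version = {}

    def current(reg):
        # Latest name of `reg`; registers never written so far are version 'a'.
        return reg + version.setdefault(reg, "a")

    renamed = []
    for dest, sources in program:
        new_sources = [current(r) for r in sources]  # reads see the old version
        version[dest] = chr(ord(version.setdefault(dest, "a")) + 1)
        renamed.append((dest + version[dest], new_sources))
    return renamed


PROGRAM = [
    ("R1", []),            # I1
    ("R2", ["R2", "R1"]),  # I2
    ("R3", ["R3", "R4"]),  # I3
    ("R4", ["R4", "R5"]),  # I4
    ("R6", ["R6"]),        # I5
    ("R6", ["R6", "R7"]),  # I6
]

if __name__ == "__main__":
    for dest, sources in rename(PROGRAM):
        print(dest, sources)
    # R1b []
    # R2b ['R2a', 'R1b']
    # R3b ['R3a', 'R4a']
    # R4b ['R4a', 'R5a']
    # R6b ['R6a']
    # R6c ['R6b', 'R7a']
```

After renaming, every destination name is unique and no instruction writes a name that an earlier instruction reads, so only the flow dependences (I1 to I2 and I5 to I6) remain.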
Where are the name
dependencies?
1  Loop: LD    F0,0(R1)
2        ADDD  F4,F0,F2
3        SD    0(R1),F4       ;drop SUBI & BNEZ
4        LD    F0,-8(R1)
5        ADDD  F4,F0,F2
6        SD    -8(R1),F4      ;drop SUBI & BNEZ
7        LD    F0,-16(R1)
8        ADDD  F4,F0,F2
9        SD    -16(R1),F4     ;drop SUBI & BNEZ
10       LD    F0,-24(R1)
11       ADDD  F4,F0,F2
12       SD    -24(R1),F4
13       SUBI  R1,R1,#32      ;alter to 4*8
14       BNEZ  R1,LOOP
15       NOP
How can we remove them?
Detailed Scoreboard Pipeline
Control
Instruction status: Issue
  Wait until: Not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← 'D'; Fj(FU) ← 'S1'; Fk(FU) ← 'S2'; Qj ← Result('S1'); Qk ← Result('S2'); Rj ← not Qj; Rk ← not Qk; Result('D') ← FU
Instruction status: Read operands
  Wait until: Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No
Instruction status: Execution complete
  Wait until: Functional unit done
Instruction status: Write result
  Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
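The wait conditions and bookkeeping can be sketched as a small Python class. This is only an illustration of the rules in the table: it has no clock, no execution latencies and one instruction per functional unit. The class, method and unit names are hypothetical, and clearing Qj/Qk when an operand becomes ready is an addition that the table does not list.

```python
class Scoreboard:
    def __init__(self, units):
        self.busy = {u: False for u in units}
        self.op, self.Fi, self.Fj, self.Fk = {}, {}, {}, {}
        self.Qj, self.Qk, self.Rj, self.Rk = {}, {}, {}, {}
        self.result = {}  # register -> functional unit that will write it

    def can_issue(self, fu, d):
        # Structural hazard (unit busy) or WAW hazard (pending write to d) blocks issue.
        return not self.busy[fu] and d not in self.result

    def issue(self, fu, op, d, s1, s2):
        assert self.can_issue(fu, d)
        self.busy[fu], self.op[fu] = True, op
        self.Fi[fu], self.Fj[fu], self.Fk[fu] = d, s1, s2
        self.Qj[fu], self.Qk[fu] = self.result.get(s1), self.result.get(s2)
        self.Rj[fu], self.Rk[fu] = self.Qj[fu] is None, self.Qk[fu] is None
        self.result[d] = fu

    def can_read_operands(self, fu):
        return self.Rj[fu] and self.Rk[fu]  # RAW: wait for both sources

    def read_operands(self, fu):
        assert self.can_read_operands(fu)
        self.Rj[fu] = self.Rk[fu] = False

    def can_write_result(self, fu):
        # WAR: no busy unit may still be waiting to read the old value of Fi(FU).
        return all(
            (self.Fj[f] != self.Fi[fu] or not self.Rj[f])
            and (self.Fk[f] != self.Fi[fu] or not self.Rk[f])
            for f in self.busy if self.busy[f]
        )

    def write_result(self, fu):
        assert self.can_write_result(fu)
        for f in self.busy:
            if self.busy[f] and self.Qj[f] == fu:
                self.Rj[f], self.Qj[f] = True, None  # clearing Qj: added here
            if self.busy[f] and self.Qk[f] == fu:
                self.Rk[f], self.Qk[f] = True, None  # clearing Qk: added here
        del self.result[self.Fi[fu]]
        self.busy[fu] = False


if __name__ == "__main__":
    sb = Scoreboard(["mult", "div", "add"])
    sb.issue("mult", "MULTD", "F0", "F2", "F4")
    sb.issue("div", "DIVD", "F10", "F0", "F6")  # waits for F0 from mult
    sb.issue("add", "ADDD", "F6", "F8", "F2")   # will overwrite F6, which div reads
    sb.read_operands("mult")
    sb.read_operands("add")
    print(sb.can_read_operands("div"))  # False: RAW on F0
    print(sb.can_write_result("add"))   # False: WAR on F6
    sb.write_result("mult")
    sb.read_operands("div")
    print(sb.can_write_result("add"))   # True
```

In the usage example, the divide cannot read its operands until the multiply writes F0, and the add cannot write F6 until the divide has read the old F6.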