CMPUT429/CMPE382 Winter 2001
Topic 7: Instruction Level Parallelism: Static Scheduling
(Adapted from David A. Patterson’s CS252,
Spring 2001 Lecture Slides)
Recall from Pipelining Review
• Pipeline CPI = Ideal pipeline CPI + Structural
Stalls + Data Hazard Stalls + Control Stalls
– Ideal pipeline CPI: measure of the maximum performance
attainable by the implementation
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps)
Ideas to Reduce Stalls

Technique (Chapter 3: hardware)             Reduces
  Dynamic scheduling                        Data hazard stalls
  Dynamic branch prediction                 Control stalls
  Issuing multiple instructions per cycle   Ideal CPI
  Speculation                               Data and control stalls
  Dynamic memory disambiguation             Data hazard stalls involving memory

Technique (Chapter 4: compiler)             Reduces
  Loop unrolling                            Control hazard stalls
  Basic compiler pipeline scheduling        Data hazard stalls
  Compiler dependence analysis              Ideal CPI and data hazard stalls
  Software pipelining and trace scheduling  Ideal CPI and data hazard stalls
  Compiler speculation                      Ideal CPI, data and control stalls
Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
– BB: a straight-line code sequence with no branches in except
to the entry and no branches out except at the exit;
– If one instruction of the basic block is executed, then all the
instructions in the basic block must be executed
– average dynamic branch frequency 15% to 25%
=> 4 to 7 instructions execute between a pair of branches
– Plus instructions in BB likely to depend on each other
• To obtain substantial performance enhancements,
we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism to exploit
parallelism among iterations of a loop
– Vector is one way
– If not vector, then either dynamic execution via branch
prediction or static scheduling via loop unrolling by compiler
Data Dependence and Hazards
• InstrJ is data dependent on InstrI
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• or InstrJ is data dependent on InstrK which is
dependent on InstrI
• Caused by a “True Dependence” (compiler term)
• If true dependence caused a hazard in the pipeline,
called a Read After Write (RAW) hazard
Data Dependence and Hazards
• Dependences are a property of programs
• Presence of dependence indicates potential for a
hazard, but actual hazard and length of any stall
is a property of the pipeline
• Importance of the data dependencies
1) indicates the possibility of a hazard
2) determines order in which results must be
calculated
3) sets an upper bound on how much parallelism can
possibly be exploited
• Today: looking at HW schemes to avoid hazards
Name Dependence #1:
Anti-dependence
• Name dependence: when 2 instructions use same
register or memory location, called a name, but no
flow of data between the instructions associated
with that name; 2 versions of name dependence
• InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”
• If anti-dependence caused a hazard in the
pipeline, called a Write After Read (WAR) hazard
Name Dependence #2:
Output dependence
• InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”
• If an output dependence caused a hazard in the pipeline,
called a Write After Write (WAW) hazard
ILP and Data Hazards
• HW/SW must preserve program order:
order instructions would execute in if executed
sequentially 1 at a time as determined by original
source program
• HW/SW goal: exploit parallelism by preserving
program order only where it affects the outcome
of the program
• Instructions involved in a name dependence can
execute simultaneously if name used in instructions
is changed so instructions do not conflict
– Register renaming resolves name dependence for regs
– Either by compiler or by HW
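A minimal C-level sketch of this (function and variable names are illustrative, not from the original slides): reusing one temporary creates the anti- and output dependences that renaming removes.

    /* Before: the two chains conflict only through the name t */
    double f(double a, double b, double c, double d, double *x) {
        double t, y;
        t = a + b;       /* I                          */
        *x = t * 2;      /* J: reads t (RAW on I)      */
        t = c + d;       /* K: WAR with J, WAW with I  */
        y = t * 2;
        return y;
    }

    /* After renaming t into t1 and t2, the chains are independent
       and may be scheduled in parallel: */
    double f2(double a, double b, double c, double d, double *x) {
        double t1 = a + b;
        *x = t1 * 2;
        double t2 = c + d;   /* no longer conflicts with t1 */
        return t2 * 2;
    }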
Control Dependencies
• Every instruction is control dependent on
some set of branches, and, in general,
these control dependencies must be
preserved to preserve program order
    if p1 {
        S1;
    }
    if p2 {
        S2;
    }
• S1 is control dependent on p1, and S2 is
control dependent on p2 but not on p1.
Control Dependence Ignored
• Control dependence need not be preserved
– we are willing to execute instructions that should not have been
executed, thereby violating the control dependences, if we can do
so without affecting the correctness of the program
• Instead, 2 properties critical to program
correctness are exception behavior and data flow
Exception Behavior
• Preserving exception behavior => any
changes in instruction execution order must
not change how exceptions are raised in
program (=> no new exceptions)
• Example:
        DADDU  R2,R3,R4
        BEQZ   R2,L1
        LW     R1,0(R2)
    L1:
• Problem with moving LW before BEQZ?
Static Branch Prediction
• Simplest: Predict taken
– average misprediction rate = untaken branch frequency,
which for the SPEC programs is 34%.
– Unfortunately, the misprediction rate ranges from not
very accurate (59%) to highly accurate (9%)
• Predict on the basis of branch direction?
– choosing backward-going branches to be taken (loop)
– forward-going branches to be not taken (if)
– In the SPEC programs, however, most forward-going branches
are taken => predict taken is better
• Predict branches on the basis of profile
information collected from earlier runs
– Misprediction varies from 5% to 22%
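A small C illustration of feeding such profile knowledge to a compiler; the __builtin_expect hint is a GCC/Clang mechanism, used here only to make the idea concrete (the lecture does not name a specific compiler, and handle_error is hypothetical):

    /* Hint that err != 0 is rare, so the compiler lays out and
       statically predicts the branch as not taken on the hot path. */
    extern void handle_error(void);   /* hypothetical handler */
    void check(int err) {
        if (__builtin_expect(err != 0, 0))
            handle_error();
    }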
Running Example
• This code, add a scalar to a vector:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• Assume the following latencies for all examples:

  Instruction producing result   Instruction using result   Execution in cycles   Latency in cycles
  FP ALU op                      Another FP ALU op          4                     3
  FP ALU op                      Store double               3                     2
  Load double                    FP ALU op                  1                     1
  Load double                    Store double               1                     0
  Integer op                     Integer op                 1                     0
FP Loop: Where are the Hazards?
• First translate into MIPS code:
-To simplify, assume 8 is lowest address
    Loop: L.D     F0,0(R1)   ;F0=vector element
          ADD.D   F4,F0,F2   ;add scalar in F2
          S.D     0(R1),F4   ;store result
          DSUBUI  R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ    R1,Loop    ;branch R1!=zero
          NOP                ;delayed branch slot
FP Loop Showing Stalls
     1 Loop: L.D    F0,0(R1)  ;F0=vector element
     2       stall
     3       ADD.D  F4,F0,F2  ;add scalar in F2
     4       stall
     5       stall
     6       S.D    0(R1),F4  ;store result
     7       DSUBUI R1,R1,8   ;decrement pointer 8B (DW)
     8       BNEZ   R1,Loop   ;branch R1!=zero
     9       stall            ;delayed branch slot

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
• 9 clocks: Rewrite code to minimize stalls?
Revised FP Loop Minimizing Stalls
     1 Loop: L.D    F0,0(R1)
     2       stall
     3       ADD.D  F4,F0,F2
     4       DSUBUI R1,R1,8
     5       BNEZ   R1,Loop    ;delayed branch
     6       S.D    8(R1),F4   ;altered when move past DSUBUI

Swap BNEZ and S.D by changing the address of S.D

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1

6 clocks, but just 3 for execution and 3 for loop
overhead. How can we make it faster?
Unroll Loop Four Times
(straightforward way)
     1 Loop: L.D    F0,0(R1)
     3       ADD.D  F4,F0,F2
     6       S.D    0(R1),F4      ;drop DSUBUI & BNEZ
     7       L.D    F6,-8(R1)
     9       ADD.D  F8,F6,F2
    12       S.D    -8(R1),F8     ;drop DSUBUI & BNEZ
    13       L.D    F10,-16(R1)
    15       ADD.D  F12,F10,F2
    18       S.D    -16(R1),F12   ;drop DSUBUI & BNEZ
    19       L.D    F14,-24(R1)
    21       ADD.D  F16,F14,F2
    24       S.D    -24(R1),F16
    25       DSUBUI R1,R1,#32     ;alter to 4*8
    26       BNEZ   R1,LOOP
    27       NOP

(The gaps in the cycle numbering are 1 cycle stall after each L.D
and 2 cycles stall after each ADD.D.)

27 clock cycles, or 6.8 per iteration
Assumes R1 is a multiple of 4
Rewrite loop to minimize stalls?
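At the source level, the straightforward 4x unrolling above corresponds to this C sketch of the running example (the function wrapper is illustrative):

    /* x has 1001 elements (1-based use), matching the running example */
    void add_scalar(double x[], double s) {
        int i;
        for (i = 1000; i > 0; i -= 4) {   /* 4 copies of the body */
            x[i]   = x[i]   + s;
            x[i-1] = x[i-1] + s;
            x[i-2] = x[i-2] + s;
            x[i-3] = x[i-3] + s;
        }
    }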
Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll
the loop to make k copies of the body
• Instead of a single unrolled loop, we
generate a pair of consecutive loops:
– 1st executes (n mod k) times and has a body that is
the original loop
– 2nd is the unrolled body surrounded by an outer loop
that iterates (n/k) times
– For large values of n, most of the execution time will
be spent in the unrolled loop
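A minimal C sketch of this two-loop scheme for the running example, with unroll factor k = 4 and n not known to be a multiple of k (counting up rather than down, purely for clarity):

    void add_scalar_n(double x[], double s, int n) {
        int i;
        int rem = n % 4;                /* 1st loop: n mod k iterations */
        for (i = 0; i < rem; i++)
            x[i] = x[i] + s;            /* body is the original loop    */
        for (; i < n; i += 4) {         /* 2nd loop: iterates n/k times */
            x[i]   = x[i]   + s;
            x[i+1] = x[i+1] + s;
            x[i+2] = x[i+2] + s;
            x[i+3] = x[i+3] + s;
        }
    }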
Unrolled Loop That Minimizes Stalls
     1 Loop: L.D    F0,0(R1)
     2       L.D    F6,-8(R1)
     3       L.D    F10,-16(R1)
     4       L.D    F14,-24(R1)
     5       ADD.D  F4,F0,F2
     6       ADD.D  F8,F6,F2
     7       ADD.D  F12,F10,F2
     8       ADD.D  F16,F14,F2
     9       S.D    0(R1),F4
    10       S.D    -8(R1),F8
    11       S.D    -16(R1),F12
    12       DSUBUI R1,R1,#32
    13       BNEZ   R1,LOOP
    14       S.D    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

• What assumptions were made when we moved code?
  – OK to move store past DSUBUI even though the store changes the register
  – OK to move loads before stores: do we get the right data?
  – When is it safe for the compiler to do such changes?
Compiler Perspectives on Code Movement
• Compiler concerned about dependencies in program
• Existence of a Hardware hazard depends on pipeline
• Try to schedule to avoid hazards that cause
performance losses
• (True) Data dependencies (RAW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory (“memory disambiguation” problem):
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
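In C terms, the disambiguation question looks like this (a sketch; the function and names are illustrative, not from the original):

    /* The compiler may not know whether p and q point to the same
       location, so it cannot reorder the load of *q above the store
       to *p unless it can prove p != q (no aliasing). */
    void scale(double *p, double *q, double s) {
        *p = *p * s;      /* store through p               */
        double t = *q;    /* load through q: is *q == *p?  */
        *q = t + s;
    }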
Where are the name dependencies?
     1 Loop: L.D    F0,0(R1)
     3       ADD.D  F4,F0,F2
     6       S.D    0(R1),F4
     7       L.D    F0,-8(R1)
     9       ADD.D  F4,F0,F2
    12       S.D    -8(R1),F4
    13       L.D    F0,-16(R1)
    15       ADD.D  F4,F0,F2
    18       S.D    -16(R1),F4
    19       L.D    F0,-24(R1)
    21       ADD.D  F4,F0,F2
    24       S.D    -24(R1),F4
    25       DSUBUI R1,R1,#32
    26       BNEZ   R1,LOOP
    27       NOP
How can we remove them?
Where are the name dependencies?
     1 Loop: L.D    F0,0(R1)
     3       ADD.D  F4,F0,F2
     6       S.D    0(R1),F4
     7       L.D    F6,-8(R1)
     9       ADD.D  F8,F6,F2
    12       S.D    -8(R1),F8
    13       L.D    F10,-16(R1)
    15       ADD.D  F12,F10,F2
    18       S.D    -16(R1),F12
    19       L.D    F14,-24(R1)
    21       ADD.D  F16,F14,F2
    24       S.D    -24(R1),F16
    25       DSUBUI R1,R1,#32
    26       BNEZ   R1,LOOP
    27       NOP
This is the original "register renaming"
Compiler Perspectives on Code
Movement
• Name Dependencies are Hard to discover for Memory
Accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1
doesn't change, then:
  0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were then no dependences between some loads and
stores, so they could be moved past each other
Steps Compiler Performs to Unroll
• Check if it is OK to move the S.D after DSUBUI
and BNEZ, and find amount to adjust S.D offset
• Determine that unrolling the loop is useful by
finding that the loop iterations are independent
• Rename registers to avoid name dependencies
• Eliminate extra test and branch instructions and
adjust the loop termination and iteration code
• Determine that loads and stores in unrolled loop
can be interchanged because loads and stores
from different iterations are independent
– requires analyzing memory addresses and finding that they do
not refer to the same address.
• Schedule the code, preserving any dependencies
needed to yield the same result as the original
code
When is a Loop Parallel?
• Example: Where are the data dependencies?
(Assume that A,B, C are distinct and non-overlapping)
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }
1. S2 uses the value A[i+1] computed by S1 in the same
iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since
iteration i computes A[i+1], which is read in iteration i+1. The
same is true of S2: it uses the value B[i] computed by S2 in the
previous iteration. This is a "loop-carried dependence".

[Dependence graph: S1 -> S1 with distance 1, S1 -> S2 with
distance 0, S2 -> S2 with distance 1]

Typically, when the loop dependence graph has a cycle with
dependence distance ≥ 1, the loop is not parallel.
How to find dependences?
• One way to find dependences is to unroll the loop and
find the RAW, WAR, and WAW dependences in the
unrolled loop.
• Example: Where are the data dependencies?
(Assume that A,B, C are distinct and non-overlapping)
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

Unrolling three iterations makes the dependences visible:

    A[1] = A[0] + C[0];    /* S1, iteration i=0 */
    B[1] = B[0] + A[1];    /* S2, iteration i=0 */
    A[2] = A[1] + C[1];    /* S1, iteration i=1 */
    B[2] = B[1] + A[2];    /* S2, iteration i=1 */
    A[3] = A[2] + C[2];    /* S1, iteration i=2 */
    B[3] = B[2] + A[3];    /* S2, iteration i=2 */
When is a Loop Parallel?
• Example: How about this loop? Is it parallel?
(Assume A, B, C, and D are distinct and non-overlapping)

    for (i=1; i<100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

[Figure: arrays C, A, and B drawn element by element, showing
the dependence pattern between iterations]
When is a Loop Parallel?
• Example: Where are the data dependencies?
(Assume A, B, and C are distinct and non-overlapping)

    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

[Figure: arrays C, A, and B; S1 carries a dependence on A from
each iteration to the next, and S2 carries a dependence on B
from each iteration to the next]
When is a Loop Parallel?
• Example: How about this loop? Is it parallel?
(Assume A, B, C, and D are distinct and non-overlapping)

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

[Dependence graph: S1 has an edge of distance 0; S2 -> S1 has
distance 1 (the loop-carried dependence)]

The only cycle in the loop has a dependence distance of zero;
thus we should be able to parallelize the loop. But the loop
has a loop-carried dependence, so it seems that iteration i+1
cannot execute until iteration i finishes.
Loop Parallelization?
• A good compiler will do the following code
transformation:
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

Now there are no more loop-carried dependences, and all
iterations can execute in parallel if the processor has
enough functional units.
Hardware Support for Exposing
More Parallelism at Compile-Time
• Conditional or Predicated Instructions
– Conditional instruction execution
  First Instruction Slot   Second Instruction Slot
  LW   R1, 40(R2)          ADD R3, R4, R5
                           ADD R6, R3, R7
  BEQZ R10, L
  LW   R8, 0(R10)
  LW   R9, 0(R8)
• Waste slot since 3rd LW dependent on result
of 2nd LW
Hardware Support for Exposing
More Parallelism at Compile-Time
• Use the predicated version of load word (LWC)?
  – the load occurs unless the third operand is 0

  First Instruction Slot   Second Instruction Slot
  LW   R1, 40(R2)          ADD R3, R4, R5
  LWC  R8, 20(R10),R10     ADD R6, R3, R7
  BEQZ R10, L
  LW   R9, 0(R8)
• If the sequence following the branch were
short, the entire block of code might be
converted to predicated execution, and the
branch eliminated
Exception Behavior Support
• Several mechanisms to ensure that
speculation by compiler does not violate
exception behavior
– For example, a predicated instruction that is squashed
cannot be allowed to raise an exception
– Prefetch does not cause exceptions
What if We Could Change the
Instruction Set?
• Superscalar processors decide on the fly
how many instructions to issue
– HW complexity grows as O(n^2) in the number of instructions issued per cycle
• Why not allow compiler to schedule
instruction level parallelism explicitly?
• Format the instructions in a potential issue
packet so that HW need not check explicitly
for dependences
VLIW: Very Long Instruction Word
• Each “instruction” has explicit coding for multiple
operations
– In IA-64, grouping called a “bundle”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long
instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168
bits wide
– Need compiling technique that schedules across several branches
Example of a VLIW Architecture:
IA-64. Suggested Reading
Intel IA-64 Architecture Software
Developer’s Manual, Chapters 8, 9
IA-64 Instruction Group
An instruction group is a set of instructions that
have no read after write (RAW) or write after write (WAW)
register dependencies.
Consecutive instruction groups are separated by stops
(represented by a double semicolon in the assembly code).
    ld8  r1=[r5]       // First group
    sub  r6=r8, r9     // First group
    add  r3=r2,r4 ;;   // First group
    st8  [r6]=r12      // Second group
Instruction Bundles
Instructions are organized in bundles of three instructions,
with the following format:
  127                     87 86                     46 45                      5 4        0
  +-------------------------+-------------------------+-------------------------+--------+
  |   instruction slot 2    |   instruction slot 1    |   instruction slot 0    |template|
  +-------------------------+-------------------------+-------------------------+--------+
           41 bits                   41 bits                   41 bits            5 bits

  Instruction   Description       Execution Unit Type
  A             Integer ALU       I-unit or M-unit
  I             Non-ALU integer   I-unit
  M             Memory            M-unit
  F             Floating-Point    F-unit
  B             Branch            B-unit
  L+X           Extended          I-unit/B-unit
Bundles
In assembly, each 128-bit bundle is enclosed in
curly braces and contains a template specification
    { .mii
      ld4  r28=[r8]    // Load a 4-byte value
      add  r9=2,r1     // 2+r1 and put in r9
      add  r30=1,r1    // 1+r1 and put in r30
    }
An instruction group can extend over an arbitrary
number of bundles.
Templates
There are restrictions on the type of instructions that
can be bundled together. The IA-64 has five slot types
(M, I, F, B, and L), six instruction types (M, I, A, F, B, L),
and twelve basic template types (MII, MI_I, MLX, MMI,
M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB).
The underscore in the bundle acronym indicates
a stop.
Every basic bundle type has two versions: one
with a stop at the end of the bundle and one
without.
Control Dependency Preventing Code
Motion
In the code below the ld4 is control dependent on the
branch, and thus cannot be safely moved up in
conventional processor architectures.
    add      r7=r6,1               // cycle 0
    add      r13=r25, r27
    cmp.eq   p1, p2=r12, r23
    (p1) br.cond some_label ;;
    ld4      r2=[r3] ;;            // cycle 1
    sub      r4=r2, r11            // cycle 3

[Figure: control-flow graph with block A ending in the br, and
the ld in block B below it]
Control Speculation
In the following code, suppose a load latency of two cycles
    (p1) br.cond.dptk L1      // cycle 0
    ld8  r3=[r5] ;;           // cycle 1
    shr  r7=r3,r87            // cycle 3

However, if we execute the load before we know whether
we actually have to do it (control speculation), we get:

    ld8.s r3=[r5]             // earlier cycle
    // other, unrelated instructions
    (p1) br.cond.dptk L1 ;;   // cycle 0
    chk.s r3, recovery        // cycle 1
    shr   r7=r3,r87           // cycle 1
Control Speculation
The ld8.s instruction is a speculative load, and the
chk.s instruction is a check instruction that verifies
if the value loaded is still good.
Ambiguous Memory Dependencies
An ambiguous memory dependency is a dependence
between a load and a store, or between two stores
where it cannot be determined if the instructions
involved access overlapping memory locations.
Two or more memory references are independent
if it is known that they access non-overlapping
memory locations.
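In C, the programmer can sometimes assert this independence for the compiler; the C99 restrict qualifier is one such mechanism, shown here as an illustration (it is not part of the original slides):

    /* 'restrict' promises the compiler that a and b never overlap,
       so loads from b may be freely reordered past stores to a. */
    void add_arrays(double * restrict a, const double * restrict b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] + b[i];
    }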
Data Speculation
An advanced load allows a load to be moved
above a store even when it is not known whether
the load and the store may reference overlapping
memory locations.
    st8  [r55]=r45            // cycle 0
    ld8  r3=[r5] ;;           // cycle 0
    shr  r7=r3,r87            // cycle 2

    ld8.a r3=[r5] ;;          // Advanced Load
    // other, unrelated instructions
    st8   [r55]=r45           // cycle 0
    ld8.c r3=[r5] ;;          // cycle 0 - check
    shr   r7=r3,r87           // cycle 0
Moving Up Loads + Uses: Recovery
Code
Original Code:

    st8  [r4] = r12      // cycle 0: ambiguous store
    ld8  r6 = [r8] ;;    // cycle 0: load to advance
    add  r5 = r6,r7      // cycle 2
    st8  [r18] = r5      // cycle 3

Speculative Code:

    ld8.a r6 = [r8] ;;   // cycle -3
    // other, unrelated instructions
    add   r5 = r6,r7     // cycle -1; add that uses r6
    // other, unrelated instructions
    st8   [r4]=r12       // cycle 0
    chk.a r6, recover    // cycle 0: check
    back:                // Return point from jump to recover
    st8   [r18] = r5     // cycle 0

    recover:
    ld8  r6 = [r8] ;;    // Reload r6 from [r8]
    add  r5 = r6,r7      // Re-execute the add
    br   back            // Jump back to main code
ld.c, chk.a and the ALAT
The execution of an advanced load, ld.a, creates an
entry in a hardware structure, the Advanced Load
Address Table (ALAT). This table is indexed by the
register number. Each entry records the load
address, the load type, and the size of the load.
When a check is executed, the entry for the register
is checked to verify that a valid entry with the type
specified is there.
ld.c, chk.a and the ALAT
Entries are removed from the ALAT when:
(1) A store overlaps with the memory locations
specified in the ALAT entry;
(2) Another advanced load to the same register
is executed;
(3) There is a context switch caused by the
operating system (or hardware);
(4) Capacity limitation of the ALAT implementation
requires reuse of the entry.
Not a Thing (NaT)
The IA-64 has 128 general purpose registers, each
with 64+1 bits, and 128 floating point registers, each
with 82 bits.
The extra bit in the GPRs is the NaT bit that is used to
indicate that the content of the register is not valid.
NaT=1 indicates that an instruction that generated an
exception wrote to the register. It is a way to defer
exceptions caused by speculative loads.
Any operation that uses NaT as an operand
results in NaT.
If-conversion
If-conversion uses predicates to transform a
conditional code into a single control stream code.
    if(r4) {
        add r1 = r2, r3
        ld8 r6 = [r5]
    }

becomes:

    cmp.ne   p1, p0=r4, 0 ;;    // Set predicate reg
    (p1) add r1=r2, r3
    (p1) ld8 r6=[r5]

and:

    if(r1)
        r2 = r3 + r4
    else
        r7 = r6 - r5

becomes:

    cmp.ne   p1, p2 = r1, 0 ;;  // Set predicate reg
    (p1) add r2 = r3, r4
    (p2) sub r7 = r6,r5
Trace Scheduling
• Two steps:
– Trace Selection
» Find a likely sequence of basic blocks (a trace) that forms
a long sequence of straight-line code
(statically predicted or profile predicted)
– Trace Compaction
» Squeeze trace into few VLIW instructions
» Need bookkeeping code in case prediction is wrong
• This is a form of compiler-generated speculation
– Compiler must generate recovery code to handle cases in which
execution does not go according to speculation.
– Needs extra registers: undo bad guesses by discarding
• Subtle compiler bugs may result in wrong answer:
no hardware interlocks
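A hand-waved C-level sketch of trace compaction with bookkeeping, assuming profiling says the condition is almost always true (the function and values are illustrative only, not the lecture's method):

    int trace(int a, int b, int c) {
        int x, y, z;
        x = a + b;
        y = x * 2;        /* speculated from the likely arm       */
        z = y + 3;        /* moved up into the compacted trace    */
        if (!c) {         /* off-trace exit: bookkeeping code     */
            y = x + 1;    /* redo the work the trace guessed wrong */
            z = y + 3;
        }
        return z;
    }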
Superscalar v. VLIW
Superscalar:
• Smaller code size
• Binary compatibility across generations of hardware

VLIW:
• Simplified hardware for decoding and issuing instructions
• No interlock hardware (compiler checks?)
• More registers, but simplified hardware for register ports
  (multiple independent register files?)
Problems with First Generation VLIW
• Increase in code size
– generating enough operations in a straight-line code
fragment requires ambitiously unrolling loops
– whenever VLIW instructions are not full, unused functional
units translate to wasted bits in instruction encoding
• Operated in lock-step; no hazard detection HW
– a stall in any functional unit pipeline caused entire processor
to stall, since all functional units must be kept synchronized
– Compiler might predict function units, but caches are hard to
predict
• Binary code compatibility
– Pure VLIW => different numbers of functional units and unit
latencies require different versions of the code
Intel/HP IA-64 “Explicitly Parallel
Instruction Computer (EPIC)”
• IA-64 is the instruction set architecture; EPIC is the style
– EPIC = 2nd generation VLIW?
• Itanium™ is name of first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on 0.18 µm process
• 128 64-bit integer registers + 128 82-bit floating
point registers
– Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies
(interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags)
=> 40% fewer mispredictions?
IA-64 Registers
• The integer registers are configured to help
accelerate procedure calls using a register stack
– mechanism similar to that developed in the Berkeley RISC-I
processor and used in the SPARC architecture.
– Registers 0-31 are always accessible and addressed as 0-31
– Registers 32-127 are used as a register stack, and each
procedure is allocated a set of registers (from 0 to 96)
– The new register stack frame is created for a called
procedure by renaming the registers in hardware;
– a special register called the current frame marker (CFM)
points to the set of registers to be used by a given procedure
• 8 64-bit Branch registers used to hold branch
destination addresses for indirect branches
• 64 1-bit predicate registers
IA-64 Registers
• Both the integer and floating point registers
support register rotation for registers 32-127.
• Register rotation is designed to ease the task of
allocating registers in software-pipelined loops
• When combined with predication, possible to avoid
the need for unrolling and for separate prologue
and epilogue code for a software pipelined loop
– makes SW pipelining usable for loops with smaller numbers
of iterations, where the overheads would traditionally negate
many of the advantages
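To make the idea concrete, here is a rough C-level sketch of a software-pipelined version of the running example (x[i] += s), overlapping the load, add, and store of three different iterations; this is an illustration, not IA-64 rotating-register code:

    /* assumes n >= 3; t and a hold in-flight values */
    void sw_pipelined(double x[], double s, int n) {
        double t, a;
        int i;
        t = x[n-1];                 /* prologue: load, iter n-1  */
        a = t + s;                  /* add, iter n-1             */
        t = x[n-2];                 /* load, iter n-2            */
        for (i = n-1; i >= 2; i--) {
            x[i] = a;               /* store for iteration i     */
            a = t + s;              /* add   for iteration i-1   */
            t = x[i-2];             /* load  for iteration i-2   */
        }
        x[1] = a;                   /* epilogue: drain pipeline  */
        x[0] = t + s;
    }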
Intel/HP IA-64 “Explicitly Parallel
Instruction Computer (EPIC)”
• Instruction group: a sequence of consecutive
instructions with no register data dependences
– All the instructions in a group could be executed in parallel, if
sufficient hardware resources existed and if any dependences
through memory were preserved
– An instruction group can be arbitrarily long, but the compiler must
explicitly indicate the boundary between one instruction group and
another by placing a stop between 2 instructions that belong to
different groups
• IA-64 instructions are encoded in bundles, which are
128 bits wide.
– Each bundle consists of a 5-bit template field and 3 instructions,
each 41 bits in length
• 3 instructions in each 128-bit bundle; the template field
  determines whether instructions are dependent or independent
  – Smaller code size than old VLIW, larger than x86/RISC
  – Bundles can be linked to show independence across more than 3 instructions
5 Types of Execution in Bundle
  Execution Unit Slot   Instruction type   Description       Example Instructions
  I-unit                A                  Integer ALU       add, subtract, and, or, cmp
  I-unit                I                  Non-ALU integer   shifts, bit tests, moves
  M-unit                A                  Integer ALU       add, subtract, and, or, cmp
  M-unit                M                  Memory access     Loads, stores for int/FP regs
  F-unit                F                  Floating point    Floating point instructions
  B-unit                B                  Branches          Conditional branches, calls
  L+X                   L+X                Extended          Extended immediates, stops
• 5-bit template field within each bundle describes
both the presence of any stops associated with the
bundle and the execution unit type required by each
instruction within the bundle (see Fig 4.12 page 271)
Itanium™ Processor Silicon
(Copyright: Intel at Hotchips ’00)
[Die photo: core processor die showing IA-32 control, FPU, IA-64
control, integer units, instruction fetch & decode, caches, TLB,
and bus; 4 x 1MB L3 cache alongside]
Itanium™ Machine Characteristics
(Copyright: Intel at Hotchips ’00)
  Frequency                 800 MHz
  Transistor Count          25.4M CPU; 295M L3
  Process                   0.18u CMOS, 6 metal layers
  Package                   Organic Land Grid Array
  Machine Width             6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
  Registers                 14 ported 128 GR & 128 FR; 64 Predicates
  Speculation               32 entry ALAT, Exception Deferral
  Branch Prediction         Multilevel 4-stage Prediction Hierarchy
  FP Compute Bandwidth      3.2 GFlops (DP/EP); 6.4 GFlops (SP)
  Memory -> FP Bandwidth    4 DP (8 SP) operands/clock
  Virtual Memory Support    64 entry ITLB, 32/96 2-level DTLB, VHPT
  L2/L1 Cache               Dual ported 96K Unified & 16KD; 16KI
  L2/L1 Latency             6 / 2 clocks
  L3 Cache                  4MB, 4-way s.a., BW of 12.8 GB/sec
  System Bus                2.1 GB/sec; 4-way Glueless MP;
                            scalable to large (512+ proc) systems
Itanium™ EPIC Design Maximizes SW-HW Synergy
(Copyright: Intel at Hotchips ’00)
Architecture features programmed by the compiler:
  branch hints, explicit parallelism, register stack & rotation,
  predication, data & control speculation, memory hints

Micro-architecture features in hardware:
  [Block diagram: fast, simple 6-issue fetch with instruction cache
  & branch predictors; issue and register handling with 128 GR &
  128 FR, register remap & stack engine; parallel resources: 4
  integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units, with
  bypasses & dependencies control; memory subsystem with three
  levels of cache (L1, L2, L3), a 32 entry ALAT, and speculation
  deferral management]
10 Stage In-Order Core Pipeline
(Copyright: Intel at Hotchips ’00)
Pipeline stages: IPG (instruction pointer generation), FET (fetch),
ROT (rotate), EXP (expand), REN (rename), WL.D (word-line decode),
REG (register read), EXE (execute), DET (exception detect),
WRB (write-back)

Front End (IPG, FET, ROT):
• Pre-fetch/fetch of up to 6 instructions/cycle
• Hierarchy of branch predictors
• Decoupling buffer

Instruction Delivery (EXP, REN):
• Dispersal of up to 6 instructions on 9 ports
• Register remapping
• Register stack engine

Operand Delivery (WL.D, REG):
• Register read + bypasses
• Register scoreboard
• Predicated dependencies

Execution (EXE, DET, WRB):
• 4 single-cycle ALUs, 2 ld/str
• Advanced load control
• Predicate delivery & branch
• NaT/exception/retirement
Itanium processor 10-stage pipeline
• Front-end (stages IPG, Fetch, and Rotate):
prefetches up to 32 bytes per clock (2
bundles) into a prefetch buffer, which can
hold up to 8 bundles (24 instructions)
– Branch prediction is done using a multilevel adaptive
predictor like P6 microarchitecture
• Instruction delivery (stages EXP and REN):
distributes up to 6 instructions to the 9
functional units
– Implements register renaming for both rotation and
register stacking.
Itanium processor 10-stage pipeline
• Operand delivery (WLD and REG): accesses
register file, performs register bypassing,
accesses and updates a register scoreboard,
and checks predicate dependences.
– Scoreboard used to detect when individual instructions
can proceed, so that a stall of 1 instruction in a bundle
need not cause the entire bundle to stall
• Execution (EXE, DET, and WRB): executes
instructions through ALUs and load/store
units, detects exceptions and posts NaTs,
retires instructions and performs write-back
– Deferred exception handling for speculative instructions is
supported by providing the equivalent of poison bits,
called NaTs for Not a Thing, for the GPRs (which makes
the GPRs effectively 65 bits wide), and NaT Val (Not a
Thing Value) for FPRs (already 82 bits wide)
Comments on Itanium
• Remarkably, the Itanium has many of the
features more commonly associated with the
dynamically-scheduled pipelines
– strong emphasis on branch prediction, register renaming,
scoreboarding, a deep pipeline with many stages before
execution (to handle instruction alignment, renaming, etc.),
and several stages following execution to handle exception
detection
• Surprising that an approach whose goal is to
rely on compiler technology and simpler HW
seems to be at least as complex as dynamically
scheduled processors!
Performance of IA-64 Itanium
(Source: Microprocessor Report, Jan 2002)
• ITANIUM (800 MHz): SPECint2000(base): 358; SPECfp2000(base): 703
• POWER4 (1.3 GHz): SPECint2000(base): 790; SPECfp2000(base): 1,098
• SUN UltraSPARC III (1.05 GHz): SPECint2000(base): 537; SPECfp2000(base): 701
Summary#1: Hardware versus
Software Speculation Mechanisms
• To speculate extensively, must be able to
disambiguate memory references
– Much easier in HW than in SW for code with pointers
• HW-based speculation works better when control
flow is unpredictable, and when HW-based
branch prediction is superior to SW-based
branch prediction done at compile time
– Mispredictions mean wasted speculation
• HW-based speculation maintains precise
exception model even for speculated instructions
• HW-based speculation does not require
compensation or bookkeeping code
Summary#2: Hardware versus Software
Speculation Mechanisms cont’d
• Compiler-based approaches may benefit from the
ability to see further in the code sequence,
resulting in better code scheduling
• HW-based speculation with dynamic scheduling
does not require different code sequences to
achieve good performance for different
implementations of an architecture
– may be the most important in the long run?
Summary #3: Software Scheduling
• Instruction Level Parallelism (ILP) found either by
compiler or hardware.
• Loop level parallelism is easiest to see
– SW dependencies/compiler sophistication determine if compiler can
unroll loops
– Memory dependencies hardest to determine => Memory disambiguation
– Very sophisticated transformations available
• Trace Scheduling to Parallelize If statements
• Superscalar and VLIW: CPI < 1 (IPC > 1)
– Dynamic issue vs. Static issue
– More instructions issue at same time => larger hazard penalty
– Limitation is often number of instructions that you can successfully
fetch and decode per cycle