In class notes

Software Exploits for ILP
• We have already looked at compiler scheduling to
support ILP
– Altering code to reduce stalls
– Loop unrolling and scheduling
– Compiler-based scheduling for superscalars
– VLIW
• Here, we examine appendix H to see additional
compiler-based ideas and hardware to help support
some of these ideas
– Few architectures have focused on static-based
approaches aside from minimal support with compiler
scheduling
– However, the EPIC architecture relied heavily on it; here
we will look at EPIC and consider to what extent it succeeds
(or fails) compared with dynamic approaches
Loop Dependencies
• To support loop unrolling, the compiler must detect
any dependencies that exist both within and between
loop iterations
– Within loop iterations are the typical RAW, WAW and
WAR hazards
– Between loop iterations, RAW, WAW and WAR hazards
may be hard to identify because an array index may not
match exactly
– Consider the following two loop bodies, both iterate over i
from 1 to 100
• x[i] = x[i] + s;
• x[i] = x[i-1] + s;
– In the first, the RAW dependence stays within a single
iteration (x[i] is read and then written), so it is not
complicated by loop unrolling
– But in the second, each iteration reads the value written by
the previous iteration, so the RAW hazard exists across loop
iterations and an unrolled loop could lead to problems
Example Examined
• Consider this code:
– x[0] = …;
– for(i=1;i<=100;i++) x[i] = x[i-1] + s;
• Assume x is an FP array so that the additions take 4 cycles
– Let’s unroll this loop to contain 4 iterations per pass; this
gives us the following four assignments in the first unrolled loop
iteration:
– x[1] = x[0] + s;
– x[2] = x[1] + s;
– x[3] = x[2] + s;
– x[4] = x[3] + s;
• If we schedule the above code, we would first attempt to L.D
x[0], x[1], x[2], x[3], then do the ADD.Ds and finally the
S.Ds
– But each ADD.D needs the value stored by the previous S.D
• If the compiler doesn’t detect this dependence, the code will
be incorrect!
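To make the problem concrete, here is a minimal Python sketch (not part of the original slides) that compares the sequential semantics of the loop with the naive "all loads first" schedule described above. The function names and the choice of zero-initialised array elements are assumptions made for illustration.

```python
def sequential(x0, s, n=4):
    """Reference semantics: x[i] = x[i-1] + s, one iteration at a time."""
    x = [x0] + [0.0] * n
    for i in range(1, n + 1):
        x[i] = x[i - 1] + s
    return x


def naive_unrolled(x0, s, n=4):
    """Incorrect schedule: every load is hoisted above every add and store."""
    x = [x0] + [0.0] * n
    loaded = [x[i - 1] for i in range(1, n + 1)]  # L.D x[0] .. x[n-1] up front
    sums = [value + s for value in loaded]        # then all the ADD.Ds
    for i in range(1, n + 1):                     # and finally all the S.Ds
        x[i] = sums[i - 1]
    return x
```

Only x[1] comes out right in the naive schedule, because the later loads read stale values that the earlier stores had not yet written.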
Forms of Dependencies
• There are 3 forms of dependencies
– True or data dependencies – these are the same as RAW
hazards as found in pipelining
• we have to make sure that the value is written before we can
subsequently read/use it
– Name dependencies – these arise because the same named
entity is referenced, but the data differs
• for instance, we put a result in R1 and use it in a later instruction,
but yet another instruction places a completely unrelated datum in
R1
– There are two forms of name dependencies
• output dependencies – these arise when two instructions write two
independent results to a named location without an intervening
read, that is, these are WAW hazards
• antidependencies – these arise when a read and write must occur
in the proper order so that the read takes place before the write,
these are WAR hazards
Example
• Find the dependencies
– both within a single loop iteration and across loop iterations
– identifying each type
– is the loop parallelizable (unrollable)?
for (i=1;i<=100;i=i+1) {
    y[i]=x[i]/c;    /* S1 */
    x[i]=x[i]+c;    /* S2 */
    z[i]=y[i]+c;    /* S3 */
    y[i]=c-y[i];    /* S4 */
}
True: from S1 to S3 (y), from S1 to S4 (y)
Anti: from S1 to S2 (x), from S3 to S4 (y)
Output: from S1 to S4 (y)
for (i=1;i<=100;i=i+1) {
    t[i]=x[i]/c;    /* S1, y renamed to t */
    x1[i]=x[i]+c;   /* S2, x renamed to x1 */
    z[i]=t[i]+c;    /* S3 */
    y[i]=c-t[i];    /* S4 */
}
As written, the name dependencies restrict how the statements can be
reordered, but if we use renaming on x (S2) and y (S1, S3 and S4), only the
true dependencies remain and we can unroll and schedule the code. Notice that
we have renamed x to x1 for this to work, so later code would have to
reference x1
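The renaming can be checked with a small Python sketch (an illustration, not from the slides). It runs the original loop body and the renamed one and compares the results; the 0-based indexing and the function names are assumptions.

```python
def loop_original(x, c):
    """Original body: S1..S4 reuse the names x and y."""
    x = list(x)
    n = len(x)
    y, z = [0.0] * n, [0.0] * n
    for i in range(n):
        y[i] = x[i] / c      # S1
        x[i] = x[i] + c      # S2
        z[i] = y[i] + c      # S3
        y[i] = c - y[i]      # S4
    return x, y, z


def loop_renamed(x, c):
    """Renamed body: t replaces the first y, x1 replaces the new x."""
    n = len(x)
    t, x1 = [0.0] * n, [0.0] * n
    y, z = [0.0] * n, [0.0] * n
    for i in range(n):
        t[i] = x[i] / c      # S1
        y[i] = c - t[i]      # S4, moved up: only the true dependence on t remains
        z[i] = t[i] + c      # S3
        x1[i] = x[i] + c     # S2, moved down: x is never overwritten
    return x1, y, z
```

After renaming, S2, S3 and S4 depend only on S1 (through t) and on the unmodified x, so they can be placed in any order after S1.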
Example
• The previous example had no loop carried
dependencies
– These can be tricky to find
• just because an array is specified as something other than index i
does not mean that there is a loop carried dependence
• we will examine how to prove a loop carried dependence exists
in a couple of slides, consider this loop:
– for(i=1;i<=100;i++) {
      a[i+1] = a[i] + c[i];       // S1
      b[i+1] = b[i] + a[i+1];     // S2
  }
– This code has the following true dependencies
• a from S1 to S2
• a from S1 to S1 (loop carried)
• b from S2 to S2 (loop carried)
– The loop carried dependencies, at least in this case,
prevent the loop from being parallelizable
Example
• Not all loop carried dependencies prevent a loop from
being parallelizable, consider this example
– for(i=1;i<=100;i++) {
      a[i] = a[i] + b[i];       // S1
      b[i+1] = c[i] + d[i];     // S2
  }
• here, we have a loop carried dependence on b from S2 to S1 (the
dependence from a to a in S1 is not loop carried)
– To parallelize this loop, we must eliminate the loop carried dependence
• S1 depends on S2 but S2 does not depend on S1, so the dependence is
not circular and the two statements can be interchanged
• this change requires adding an initial S1 before the loop and a final
S2 after the loop
– a[1] = a[1] + b[1];               // initial S1
– for(i=1;i<=99;i++) {
      b[i+1] = c[i] + d[i];         // S2
      a[i+1] = a[i+1] + b[i+1];     // S1
  }
– b[101] = c[100] + d[100];         // final S2
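As a sanity check, the following Python sketch (an illustration with hypothetical function names) runs the original loop and a transformed version and compares the resulting arrays. It assumes the transformed loop runs for i = 1 to n-1 and that the final S2 uses c[n] + d[n], which is what makes the two versions agree.

```python
def loop_original(a, b, c, d, n):
    """a[i] = a[i] + b[i]; b[i+1] = c[i] + d[i] for i = 1..n (index 0 unused)."""
    a, b = list(a), list(b)
    for i in range(1, n + 1):
        a[i] = a[i] + b[i]            # S1
        b[i + 1] = c[i] + d[i]        # S2
    return a, b


def loop_transformed(a, b, c, d, n):
    """Same computation with S1 and S2 shifted so no dependence crosses iterations."""
    a, b = list(a), list(b)
    a[1] = a[1] + b[1]                    # initial S1
    for i in range(1, n):                 # i = 1 .. n-1
        b[i + 1] = c[i] + d[i]            # S2
        a[i + 1] = a[i + 1] + b[i + 1]    # S1 for the next element
    b[n + 1] = c[n] + d[n]                # final S2
    return a, b
```

Inside the transformed loop, S1 now reads the b element written by S2 in the same iteration, so the iterations no longer depend on one another.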
Recurrences
• The key to identifying if a loop carries a dependence
across iterations is to find if a recurrence of loop indices
arises
– A recurrence is when a loop index for a given variable is
reused in another iteration
– With a[i], a[i+1], the recurrence is easy to detect, but consider
these two statements:
• a[i] = b[2*i] + c;       // S1
• b[2*i+1] = d[i] * e;     // S2
– There is no recurrence of b between S1 and S2 because the
index in S1 is always even and the index in S2 is always odd
• Identifying a recurrence can be computationally
challenging; there are a number of tests we can apply
– A simple test such as the GCD test can rule out a loop carried
dependence, but if it cannot rule one out, the dependence is only
possible (not proven) and other, more precise tests may be needed
GCD Test
• One easy test is applied when array indices are affine
– Basically, an affine index is one that can be written in the
form a*i + b where i is the loop index and a and b are
integer constants
– If two indices of the same array are affine, a loop carried
dependence exists if these two conditions hold:
• there are two iteration indices, j and k, within the bounds of the
loop
• the loop stores into an array element indexed by a*j+b and later
fetches from the same element indexed by c*k+d, that is, a*j+b = c*k+d
– The GCD test says that if such a dependence exists, then d - b
must be evenly divisible by the greatest common divisor of c and a
• if the GCD does not divide d - b, no dependence is possible
• if it does divide d - b, a dependence may exist, but the test ignores
the loop bounds so it is not guaranteed
– Examples:
• x[2*i+3] = x[2*i]: GCD(2, 2) = 2 does not divide 0 - 3 = -3, so no
dependence is possible
• y[5*i - 4] = y[15*i + 6]: GCD(15, 5) = 5 does divide 6 - (-4) = 10, so a
dependence is possible (for instance, the store with i = 8 and the
fetch with i = 2 both reference y[36])
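The GCD test is easy to express in code. The sketch below (illustrative only; `gcd_test` and `find_collision` are hypothetical helper names) implements the divisibility check for a store index a*j+b and a fetch index c*k+d, plus a brute-force search over the loop bounds that can be used to see whether two iterations really touch the same element.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """True if GCD(c, a) divides (d - b), i.e. a dependence is possible."""
    g = gcd(a, c)
    if g == 0:               # both coefficients are zero: indices are constants
        return b == d
    return (d - b) % g == 0


def find_collision(a, b, c, d, lo, hi):
    """Brute force: first (j, k) within [lo, hi] with a*j + b == c*k + d, else None."""
    for j in range(lo, hi + 1):
        for k in range(lo, hi + 1):
            if a * j + b == c * k + d:
                return j, k
    return None
```

For x[2*i+3] = x[2*i], `gcd_test(2, 3, 2, 0)` is false, so no dependence is possible. For y[5*i-4] = y[15*i+6], `gcd_test(5, -4, 15, 6)` is true, and the brute-force search confirms that matching index pairs exist within 1..100.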
Dependence Challenges
• There are a number of challenges that complicate loop
dependence analysis
– When storage is referenced by pointer rather than array index
– When array indexing is indirect through another array
– When dependencies exist for a subset of inputs but do not arise
under other sets of inputs
• Although pointers pose a very difficult problem to static
analysis (since pointers take on their values at run time),
there are some forms of analysis available
– If two pointers cannot point to the same type, there can be no
dependence
– If an object being referenced by a pointer is only allocated
under conditions that differ from those of another pointer
– If one pointer can only point to a local referent while another
points only to a global
Eliminating Computation
• Aside from loop unrolling/scheduling, another useful pursuit for the
compiler is to replace some common computations by storing the first
result in a register
– This can take on multiple forms
• two dependent immediate adds can be merged into one
      DADDI R1, R2, #4
      DADDI R1, R1, #4
  becomes
      DADDI R1, R2, #8
– We can take advantage of associativity
• here, we have two RAW hazards that might cause stalls
      DADD R1, R2, R3
      DADD R4, R1, R6
      DADD R8, R4, R7
• we replace them with the code below, in which the first two adds are
independent of each other
      DADD R1, R2, R3
      DADD R4, R6, R7
      DADD R8, R1, R4
– Additionally, if a particular computation, say c + d, is used in
several locations, place c + d into a register and replace all further
uses of the computation with the register
Software Pipelining
• As we saw earlier, a compiler can rearrange code in a
loop to remove loop carried dependencies
– The compiler can also rearrange the code to hide true
dependencies found within an iteration through a technique
called symbolic loop unrolling
– The idea is to identify in each iteration of a loop, the
instruction that can be paired with a previous and
successive loop iteration
– For instance, if one iteration performs a FP add which
takes multiple cycles, the store for that add can be moved
to the next iteration
• to prevent having to use multiple groups of registers, we would
place the store before the add of that iteration since we would be
storing the sum from the previous iteration
• this may require adding “startup” and “cleanup” code
Continued
• Pictorially, the concept works as follows:
– In iteration 0 we select the last instruction in the loop that has the
dependence
– In iteration 1 we select the second to last instruction, etc
– In the last iteration we select the first instruction in the loop that has
the dependence
• We add startup code so that the instructions preceding the last
instruction from iteration 0 are still available
• We add cleanup code so that the instructions from the last
iteration that follow the first instruction are performed
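Example
Original loop:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DSUBI  R1,R1,#8
      BNE    R1,R2,Loop
Symbolically unrolled (loop overhead omitted); the instructions marked *
(bold-faced on the original slide) are the ones selected for the new loop body:
Iteration i:    L.D    F0,0(R1)
                ADD.D  F4,F0,F2
                S.D    F4,0(R1)     *
Iteration i+1:  L.D    F0,0(R1)
                ADD.D  F4,F0,F2     *
                S.D    F4,0(R1)
Iteration i+2:  L.D    F0,0(R1)     *
                ADD.D  F4,F0,F2
                S.D    F4,0(R1)
Software-pipelined loop with startup and cleanup code (the startup code
assumes R1 already points two elements, 16 bytes, below the first element):
Startup:
      L.D    F6,16(R1)     ; load the first element
      ADD.D  F4,F6,F2      ; add for the first element
      L.D    F0,8(R1)      ; load the second element
Loop: S.D    F4,16(R1)     ; store the sum computed one pass earlier
      ADD.D  F4,F0,F2      ; add to the value loaded one pass earlier
      L.D    F0,0(R1)      ; load the next element
      DSUBI  R1,R1,#8
      BNE    R1,R2,Loop
Cleanup:
      S.D    F4,16(R1)     ; store the second-to-last sum
      ADD.D  F8,F0,F2      ; add for the last element
      S.D    F8,8(R1)      ; store the last sum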
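The effect of software pipelining can be checked with a short Python sketch (illustrative, with hypothetical function names). It models x[i] = x[i] + s walking from the last element down to the first, once with the original loop body and once with a store/add/load body surrounded by startup and cleanup code. The variables f0 and f4 stand in for the registers of the same names, and the sketch assumes at least three elements.

```python
def original_loop(x, s):
    """Each pass loads, adds and stores the same element."""
    x = list(x)
    for i in range(len(x) - 1, -1, -1):
        f0 = x[i]          # L.D
        f4 = f0 + s        # ADD.D
        x[i] = f4          # S.D
    return x


def software_pipelined(x, s):
    """Each pass stores element i+2, adds for element i+1 and loads element i."""
    x = list(x)
    top = len(x) - 1
    if top < 2:
        raise ValueError("this sketch needs at least three elements")
    # Startup: load and add for the first element, load the second.
    f4 = x[top] + s
    f0 = x[top - 1]
    for i in range(top - 2, -1, -1):
        x[i + 2] = f4      # S.D: sum computed in the previous pass
        f4 = f0 + s        # ADD.D: value loaded in the previous pass
        f0 = x[i]          # L.D: value needed by the next pass
    # Cleanup: finish the two elements still in flight.
    x[1] = f4
    x[0] = f0 + s
    return x
```

Within one pass of the pipelined loop the store, add and load belong to three different elements, so none of them waits on another instruction from the same pass.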
Code Scheduling
• So far, our code scheduling has been limited to
– Moving code within a basic block to fill stalls
– Loop unrolling and scheduling
• What about moving code across conditional branches?
– With branch history, we can make predictions on whether a
branch will be taken or not
– Are the benefits of moving code to avoid branch delays worth
the risk of guessing wrong?
• Branch speculation requires several supporting
mechanisms
– A buffer to consult that provides the branch prediction and
branch target location
– A mechanism for “killing off” the speculated operation(s) if the
prediction is wrong
– A mechanism to ensure that speculated code does not raise an
exception unless/until the speculation is proved correct
Example
• Consider the skeleton of an
if-else statement to the right,
we have some options for
code scheduling
– Move B(i) before the
condition
• only useful if condition is
usually true and executing it will
not impact X or C(i)
– Move X before the condition
• only useful if condition is
usually false and executing it
will not impact B(i) or C(i)
– Move C(i) before the
condition, or into one of the
clauses
• only useful if executing it does
not impact condition, B(i) or X
Variants
• Move B(i) up before the condition
– In X, reset B(i)
• that is, if the condition turns out to be false, wipe out the B(i)
assignment statement (reset it)
• Move C(i) up before the condition
– Doable if we can ensure that neither the condition, B(i) or
X would be impacted
• let’s assume that X would be impacted, if we predict the else
clause is rarely taken, we could reset C(i) before doing X in the
else clause
• The question comes down to
– Is the benefit from a correct prediction more than the cost
when incorrect?
– Again, the compiler has to ensure that the movement does
not cause an incorrect condition result or incorrect values
from the if clause and else clause
• Let’s use the code
– if(x > y) x++; else y--;
Example
• Further, let’s assume that x > y is
true 90% of the time
• Our original code is the first listing below
– if true, 4 instructions are executed
and if false, 3 instructions are
executed
– let’s assume no stalls, so each
instruction takes 1 cycle (a CPI of 1)
– average cycles = 4 * .9 + 3 * .1 = 3.9
• Given the prediction (x > y), the
compiler generates the second listing
below
– if true, 3 instructions are executed
and if false, 5
– average cycles = 3 * .9 + 5 * .1 = 3.2
– a speedup of 3.9 / 3.2 = 1.22 (22%)
Original code:
      SGT   R3, R1, R2
      BEQZ  R3, else
      DADDI R1, R1, #1
      J     next
else: DSUBI R2, R2, #1
next: …
Speculated code:
      SGT   R3, R1, R2
      DADDI R1, R1, #1
      BNEZ  R3, next
      DSUBI R1, R1, #1
      DSUBI R2, R2, #1
next: …
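The arithmetic behind this comparison can be written as a small Python sketch (an illustration; the function names are hypothetical). It computes the expected number of instructions per execution of the if-else for a given probability that the condition is true, and the probability at which the original and speculated versions break even.

```python
def expected_count(p_true, n_true, n_false):
    """Expected instructions executed when the condition is true with probability p_true."""
    return p_true * n_true + (1 - p_true) * n_false


def break_even(orig_true, orig_false, spec_true, spec_false):
    """Probability of a true condition at which both versions cost the same."""
    return (spec_false - orig_false) / (
        (orig_true - orig_false) - (spec_true - spec_false)
    )


original = expected_count(0.9, 4, 3)      # about 3.9
speculated = expected_count(0.9, 3, 5)    # about 3.2
speedup = original / speculated           # about 1.22
```

With these instruction counts the break-even point is 2/3, so moving x++ above the branch only pays off when x > y holds more than two thirds of the time.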
Trace Scheduling
• In the previous example, we
selected the “critical path” –
the most common path
through the selection
statement
– Typically, such a conditional is
found inside of a loop
• In trace scheduling, we
combine selecting the critical
path with loop unrolling so
that we move the critical path
out of the selection in
multiple iterations
– In order to handle a misprediction, we have exits out
of the unrolled code and
entrances to re-enter after
handling the misprediction
Superblocks
• The numerous entries and exits in our previous
figure indicate a major drawback of trace
scheduling
– First, it requires that the compiler build mechanisms for
recovering from mispredictions into the unrolled code
• for instance, imagine that we unroll a loop 4 times; the compiler
then has to build into the code what to do if the misprediction
occurs in the first iteration and how to re-enter, what to do if the
misprediction occurs in the second iteration and how to re-enter, etc
– Second, it increases the amount of code required and
complicates the code
• The superblock uses the same idea except that, upon
exiting, you enter a different block which foregoes
speculation
– When the loop terminates, the superblock is re-entered
– This is done using a technique called tail duplication
Example
• Assume our code is:
• for(i=0;i<n;i++) if(a[i]>0) x++; else x--;
– In most cases, a[i] is positive
• We choose to move x++ out of the selection
statement and replace the selection with if(a[i]<=0)
x=x-2;
– That is, we add 1 to x automatically and if we
mis-speculate, we subtract 2 from x
• We then unroll the loop giving us the following (in
C rather than assembly), assuming n is a multiple of 4
– for(i=0;i<n;i+=4) {
      x+=4;
      if(a[i]<=0) {…}
      if(a[i+1]<=0) {…}
      if(a[i+2]<=0) {…}
      if(a[i+3]<=0) {…}
  }
• the code in each { } requires subtracting from x, and then
completing the remaining loop iterations using the
original code
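The following Python sketch (illustrative; the function names are hypothetical) models this scheme. The fast path adds 4 to x for a block of four elements; on the first non-positive element it undoes the speculated increments that have not been confirmed and finishes the block with the original, non-speculative code, which is the role played by the duplicated tail. Leftover elements when n is not a multiple of 4 are also handled by the original code.

```python
def step_original(x, value):
    """Original loop body: if (a[i] > 0) x++; else x--;"""
    return x + 1 if value > 0 else x - 1


def count_original(a):
    x = 0
    for value in a:
        x = step_original(x, value)
    return x


def count_superblock(a):
    x, i, n = 0, 0, len(a)
    while i + 4 <= n:
        x += 4                                   # speculate: all four are positive
        for k in range(4):
            if a[i + k] <= 0:
                x -= 4 - k                       # undo increments for elements k..3
                for value in a[i + k:i + 4]:     # finish the block without speculation
                    x = step_original(x, value)
                break
        i += 4
    for value in a[i:]:                          # remainder when n is not a multiple of 4
        x = step_original(x, value)
    return x
```

When every element of a block is positive, the inner checks all fall through and the block costs a single addition to x.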
Predicated Instructions
• We have seen that with every loop and every selection
statement, there are branches
– These could result in branch delays, or in speculation that,
when mis-speculated, can lead to stalls
• If the condition and action are simple enough, can we do
them without a branch?
– The answer is yes, if we use predicated (or conditional)
instructions
• The idea is that the condition and the action are both
performed but that if the condition is determined false, the
register write is canceled
• In most cases, predicated instructions can
– only use a simple condition: value = 0 or value != 0
– only have a single, simple action such as x = y
• Here, we consider two:
– MOVZ – conditional move
– LWC – conditional load
Examples
• The code if(A==0) {S=T;} can be implemented in MIPS as
–      BNEZ R1, L
–      ADD R2, R3, R0
– L:   …
• Or with the MOVZ instruction as
–      MOVZ R2, R3, R1
– Move R3 to R2 if R1 = 0; if R1 != 0, cancel the move before it
is finalized
• assuming the MIPS pipeline, we reduce from 2 instructions to 1 and
remove the branch penalty, so a potential savings of 2 clock cycles
• In MIPS, conditional moves such as MOVZ are the only predicated
instructions, but other architectures offer others such as LWC
– Load if condition is true
– LWC R2, 0(R3), R1 – load 0(R3) into R2 if R1 = 0
– The instruction computes the address 0+R3 and performs the R1 = 0
test in the EX stage, and then either loads the value at 0+R3 into R2
(in MEM and WB respectively) or does nothing in MEM/WB, depending on
the result of the condition
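The semantics of the two predicated instructions can be sketched in Python (an illustration only; the functions model register values as plain arguments and memory as a dictionary, which are assumptions made for the sketch).

```python
def branch_version(a, s, t):
    """if (A == 0) { S = T; } written with an explicit branch."""
    if a == 0:
        s = t
    return s


def movz(dest, src, cond):
    """MOVZ dest, src, cond: the new value of dest; it is src only when cond is 0."""
    return src if cond == 0 else dest


def lwc(dest, memory, address, cond):
    """LWC dest, 0(address), cond: load from memory only when cond is 0."""
    return memory[address] if cond == 0 else dest
```

In both predicated forms the destination simply keeps its old value when the condition fails, which is why no branch is needed.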
Handling Exceptions
• Whether we use predicated instructions or
compiler scheduling (e.g., trace scheduling)
– an instruction that should not have executed because of
mis-speculation should not cause an exception
• recall exceptions may invoke an exception handler, which is very
time consuming, or may cause program termination
• we need a way to recover from a mis-speculated exception
situation
– in the former case, we can invoke the exception handler
and cancel it later if we determine the instruction was
mis-speculated; this wastes some time but preserves
proper behavior
– for the latter case, we need a mechanism to ensure that the
exception is either not raised or ignored until we know
whether the speculation was correct or not
Four Approaches
• Hardware and OS work cooperatively to ignore exceptions of
speculative instructions
– this only works for correct programs
• Speculative instructions are not permitted to raise exceptions
– speculative instructions are annotated as such
• for instance a speculative load might be sLW
– we disallow the instruction from raising an exception
• Poison bits are attached to registers to indicate if their value was
the result of speculation
– we add a bit to every register; a speculated instruction that writes to
the register sets the bit, and the bit is also set in any register
written using a source register whose poison bit is set
– exceptions are disallowed for any instruction with a set poison bit
until the outcome of the instruction’s speculation is known
• Buffers are used to store results of speculated instructions
– like a reorder buffer, we only permit results to move beyond the buffer
once the outcome of the speculation is known; until then, exceptions are
buffered
Example
• Imagine we have the following statement:
– if (A==0) A = B; else A = A + 4;
• The original code is shown first below, followed by the better code
the compiler generates using speculation
– If the speculation is correct 90% of the time the code goes from
.90 * 5 + .10 * 4 = 4.9 to .90 * 4 + .10 * 5 = 4.1 cycles (speedup
of 4.9 / 4.1, or about 19.5%)
Original code:
      LW    R1, 0(R3)
      BNEZ  R1, L1
      LW    R1, 0(R2)
      J     L2
L1:   DADDI R1, R1, #4
L2:   SW    R1, 0(R3)
Speculated code:
      LW    R1, 0(R3)
      LW    R14, 0(R2)
      BEQZ  R1, L3
      ADDI  R14, R1, #4
L3:   SW    R14, 0(R3)
The speculated code adds register R14 so that the value in R1 is not
destroyed if we have a mis-speculation. Additionally, we do not
want the SW to take place until the outcome of the speculation is known
Continued
• While the previous example
ensured the proper value was
stored to A, it did nothing to
prevent an exception from a
mis-speculation
– Specifically, we do not want to
load B (0(R2)) if A is not 0
• imagine that A is not 0 and 0(R2)
causes a memory violation; this would
cause an exception that should never
arise
– We indicate that the load for B is
speculative (sLD) and add a new
instruction called SPECCK
(speculative check), which ensures
that an exception only arises if the
speculated instruction should have
executed AND it caused an
exception
      LD     R1,0(R3)
      sLD    R14,0(R2)
      BNEZ   R1,L1
      SPECCK 0(R2)
      J      L2
L1:   DADDI  R14,R1,#4
L2:   SD     R14,0(R3)
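The behaviour of the speculative load and the check can be modelled with a short Python sketch (illustrative only). Memory is a dictionary, an address that is missing from it stands for a memory violation, and the function name and the use of MemoryError are assumptions made for the sketch.

```python
def run_speculated(memory, addr_a, addr_b):
    """if (A == 0) A = B; else A = A + 4; with B loaded speculatively."""
    r1 = memory[addr_a]                       # LD     R1,0(R3)
    fault_pending = addr_b not in memory      # sLD: a fault is recorded, not raised
    r14 = None if fault_pending else memory[addr_b]
    if r1 == 0:
        if fault_pending:                     # SPECCK: the load really was needed
            raise MemoryError("invalid address for B")
    else:
        r14 = r1 + 4                          # L1: DADDI  R14,R1,#4
    memory[addr_a] = r14                      # L2: SD     R14,0(R3)
    return memory[addr_a]
```

When A is not 0, the bad address for B is never reported because the check sits only on the path that uses the loaded value.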
Limitations on Speculation
• Instructions that are annulled (turned into no-ops) still
take execution time
• Conditional instructions are most useful when the
condition can be evaluated early
– such as during the ID stage of our pipeline
• Speculated instructions may cause a slow down
compared to unconditional instructions requiring either a
slower clock rate or greater number of cycles
• The use of conditional instructions can be limited when
the control flow involves more than a simple alternative
sequence
– for example, moving an instruction across multiple branches
requires making it conditional on both branches, which
requires two conditions to be specified or requires additional
instructions to compute the controlling predicate
– if such capabilities are not present, the overhead of if
conversion will be larger, reducing its advantage
Intel IA-64/EPIC
• This chapter introduced a number of compiler-based
strategies to promote ILP to support a superscalar
processor
– To date, very few processors have attempted to
aggressively schedule parallel instructions through the
compiler, instead relying on hardware scheduling
– The IA-64 is one of the few, here we look at a few
highlights of the instruction set and see how instructions
are bundled together to issue in a VLIW-like way
• 128 65-bit registers (1 poison bit included)
• 128 82-bit FP registers
• 64 1-bit predicate registers
• 8 64-bit branch registers (for indirect branching)
• register stack for parameter passing (rather than memory)
Instruction Format
• Compiler uses a number of strategies to provide ILP
– Loop unrolling
– Speculation
– Scheduling, etc
• Compiler selects up to 3 consecutive instructions to
place into a “bundle”
– A bundle is 128 bits wide consisting of
• a 5-bit template field
• up to three instructions which are 41 bits each (or no-ops as
necessary)
• the 5-bit template describes what each type of instruction is, and
each type has its own formatting so some of the instruction
information is encoded in the template
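• the template also indicates whether a stop follows; stops mark the
boundary between groups of instructions that can issue together, so
dependent instructions must be separated by a stop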
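The bundle layout can be illustrated with a Python sketch that packs a 5-bit template and three 41-bit slots into one 128-bit integer. Placing the template in the low-order bits with slot 0 directly above it is an assumption of this sketch, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and exactly three slot encodings into a 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for n, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into its template and three slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & SLOT_MASK
        for n in range(3)
    ]
    return template, slots
```

The sizes add up exactly: 5 + 3 * 41 = 128 bits, so a bundle has no spare bits beyond the template and the three slots.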
Bundle Components
• All instructions break into one of 5 types:
– I: integer ALU, non-ALU integer
– M: memory (int & FP), integer ALU
– F: floating point
– B: branches and conditional instructions
– L+X: instructions with extended immediate data, stops, and no-ops
Template #   Slot 0   Slot 1   Slot 2
0            M        I        I
1            M        I        I
2            M        I        I
3            M        I        I
8            M        M        I
9            M        M        I
12           M        F        I
13           M        F        I
14           M        M        F
15           M        M        F
…            …        …        …
29           M        F        B
See figure H.7 for full table
Example
• Unroll the x[i]=x[i]+s; loop seven times and schedule the
instructions in IA-64 bundles
– First to minimize bundles
– Second to minimize cycles
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      L.D    F22, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      ADD.D  F24, F22, F2
      ADD.D  F28, F26, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      S.D    F16, -24(R1)
      S.D    F20, -32(R1)
      S.D    F24, -40(R1)
      S.D    F28, -48(R1)
      DADDI  R1, R1, #-56
      BNE    R1, R2, Loop
Scheduled to minimize the number of bundles:
Bundle template   Slot 0              Slot 1                 Slot 2                Execute cycle
9: MMI            L.D F0, 0(R1)       L.D F6, -8(R1)         (empty)               1
14: MMF           L.D F10, -16(R1)    L.D F14, -24(R1)       ADD.D F4, F0, F2      3
15: MMF           L.D F18, -32(R1)    L.D F22, -40(R1)       ADD.D F8, F6, F2      4
15: MMF           L.D F26, -48(R1)    S.D F4, 0(R1)          ADD.D F12, F10, F2    6
15: MMF           S.D F8, -8(R1)      S.D F12, -16(R1)       ADD.D F16, F14, F2    9
15: MMF           S.D F16, -24(R1)    (empty)                ADD.D F20, F18, F2    12
15: MMF           S.D F20, -32(R1)    (empty)                ADD.D F24, F22, F2    15
15: MMF           S.D F24, -40(R1)    (empty)                ADD.D F28, F26, F2    18
16: MIB           S.D F28, -48(R1)    DADDUI R1, R1, #-56    BNE R1, R2, Loop      21
Scheduled to minimize the number of cycles:
Bundle template   Slot 0              Slot 1                 Slot 2                Execute cycle
8: MMI            L.D F0, 0(R1)       L.D F6, -8(R1)         (empty)               1
9: MMI            L.D F10, -16(R1)    L.D F14, -24(R1)       (empty)               2
14: MMF           L.D F18, -32(R1)    L.D F22, -40(R1)       ADD.D F4, F0, F2      3
14: MMF           L.D F26, -48(R1)    (empty)                ADD.D F8, F6, F2      4
15: MMF           (empty)             (empty)                ADD.D F12, F10, F2    5
14: MMF           (empty)             S.D F4, 0(R1)          ADD.D F16, F14, F2    6
14: MMF           (empty)             S.D F8, -8(R1)         ADD.D F20, F18, F2    7
15: MMF           (empty)             S.D F12, -16(R1)       ADD.D F24, F22, F2    8
14: MMF           (empty)             S.D F16, -24(R1)       ADD.D F28, F26, F2    9
9: MMI            S.D F20, -32(R1)    S.D F24, -40(R1)       (empty)               11
16: MIB           S.D F28, -48(R1)    DADDUI R1, R1, #-56    BNE R1, R2, Loop      12
Speculation Support
• Nearly every instruction can be predicated
– This is done by specifying one of the predicate registers
• a conditional branch can become an unconditional branch with a
predicate register
– Predicate registers are set using a compare or test
instruction
• hardware supports predication by controlling when exceptions
are handled: for a predicated instruction, an exception can only
be handled once the predicate’s result is known, which is done
using registers with poison bits
• the compiler is tasked with generating recovery code for
exceptions that arise because of mis-speculation
– Speculated loads use a special table so that if the load is
mis-speculated, it does not wipe out a current register
value
Itanium 2 Performance
• The IA-64/EPIC instruction set was implemented in
the Itanium 2 processor with a 1.5 GHz clock
– The graph below compares its performance on int and FP
benchmarks to those of Pentium IV (3.8 GHz), AMD
Athlon and Power5
• the Itanium 2 compares favorably to Pentium IV & Athlon for FP
benchmarks but is slower on int benchmarks