1. How does Global Code Scheduling work? (Trace Scheduling and
Superblocks)
Loop unrolling and software pipelining:
- These are useful if there are no conditionals within the loop, as it is then relatively easy to find independent code.
- Once there are conditionals within the loop (especially a tight loop), it becomes difficult to find independent instructions to schedule. It is very likely that whatever code is in the loop body will depend on the outcome of the conditional. E.g.:
for(i=0; i<n; i++)
{
    a[i] = a[i] + b[i];
    ....
    if(a[i] == 0)
    {
        b[i] = b[i] - 5;
        ....
    }
    else
    {
        X += b[i] + 7;
        ....
    }
    c[i] = ...
}
It is difficult to unroll this loop because future iterations depend on the outcome of the condition if(a[i] == 0).
- Global code scheduling attempts to schedule code across conditional statements to expose more ILP.
Complications of Global Scheduling
i) Which part of the conditional is most frequently executed? To get maximum savings we will need to move instructions out of the part of the branch that is executed most frequently. For this discussion we will assume that the THEN portion is executed most of the time.
ii) What is the cost of moving the statement to above the conditional branch? Obviously we will require empty ILP slots above the branch in order to see any benefit (e.g. if the CPU can issue 3 instructions per cycle but there is only 1 instruction above the branch that can be issued, then there are two empty ILP slots).
iii) What kind of compensation code must be inserted into the ELSE portion of the branch, and how much? E.g. if we moved b[i] = b[i] - 5 out of the THEN portion to above the branch, then we must insert b[i] = b[i] + 5 into the ELSE portion to compensate.
iv) How easy is it to schedule the compensation code? If the cost of the compensation code is high, then we must ensure that the ELSE portion is executed very infrequently compared to the THEN portion.
v) Is it easier to move the c[i] statement up instead? What if the THEN or ELSE portions also depend on c[i]? Moving it up would affect the correctness of the THEN and ELSE portions.
There are many other complications involved.
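As a concrete illustration of points (i)-(iii), here is a hedged C sketch (the function name scheduled_body and the test values are invented for illustration) of hoisting b[i] = b[i] - 5 out of the THEN part to above the branch, with compensation code inserted into the ELSE part:

```c
#include <assert.h>

int X; /* running sum from the ELSE part, as in the loop above */

/* One iteration of the loop body after global scheduling:
   b[i] = b[i] - 5 has been hoisted above the branch (it came from
   the THEN part, assumed to be the frequent path), and
   b[i] = b[i] + 5 is inserted into the ELSE part to undo it. */
void scheduled_body(int *a, int *b, int i)
{
    a[i] = a[i] + b[i];
    b[i] = b[i] - 5;        /* speculatively moved above the branch */
    if (a[i] == 0) {
        /* THEN: b[i] is already updated, nothing left to do */
    } else {
        b[i] = b[i] + 5;    /* compensation code: undo the hoisted store */
        X += b[i] + 7;
    }
}
```

On the frequent THEN path the branch body is now empty, so its work can be packed into ILP slots above the branch; the infrequent ELSE path pays for one extra add.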
How do we do global code scheduling?
Trace Scheduling
Idea:
Find as many instructions as possible in the frequently visited portion of the branch (the THEN part in our case), move them up, and compact them into as few wide instructions (for VLIW) or issue packets (for superscalar processors) as possible.
Step 1: Trace Selection
a. The loop is unrolled (since loop branches are usually taken) to form a sequence of instructions that are likely to be executed (this is called a "trace").
A trace may have multiple exit points (corresponding to the ELSE portion of our code) and multiple entry points (corresponding to returns to the trace, e.g. for the next iteration).
If the trace is only allowed to have one entry point but multiple exit points, it is called a "superblock". This restriction is placed to ease moving code.
E.g. if we unroll the above code four times and assume that the THEN portion is always taken, our superblock will be:
    a[i] = a[i] + b[i];
    b[i] = b[i] - 5;
    ....
    c[i] = ....
    if(a[i] != 0) goto EXIT;
    a[i+1] = a[i+1] + b[i+1];
    b[i+1] = b[i+1] - 5;
    ....
    c[i+1] = ....
    if(a[i+1] != 0) goto EXIT;
    a[i+2] = a[i+2] + b[i+2];
    b[i+2] = b[i+2] - 5;
    ....
    c[i+2] = ....
    if(a[i+2] != 0) goto EXIT;
    a[i+3] = a[i+3] + b[i+3];
    b[i+3] = b[i+3] - 5;
    ....
    c[i+3] = ....
    if(a[i+3] != 0) goto EXIT;
    ....
    goto SKIP;
EXIT:
    ....
    X += b[i] + 7;
    ....
SKIP:
- More code is actually inserted at the end of the superblock to finish up the remaining iterations of the loop. To do this, other code might be inserted to track which portion of the trace we exited from.
- This is not necessary for non-superblock traces (with multiple entry points), since additional iterations can be achieved by jumping to the next part of the trace. But other problems are introduced when we attempt to move code across entry points. It is simpler to just finish the remaining iterations at the exit point.
Step 2: Trace Compaction
This is basically code scheduling. Code from the unrolled loop is moved into empty ILP slots above its current position, and compensation code is inserted into other portions to compensate for these moves.
2. What is a write miss?
A write-miss, like a read-miss, occurs on a write to a memory location that has never been loaded into the cache (or was loaded and later invalidated).
Two ways to handle:
i) No-write-allocate
Useful for write-through caches. We simply update the memory location without creating an entry in the cache.
Of course, if we write to the same location again we will have another write-miss.
ii) Write-allocate
More complicated. A new entry is created in the cache, and the new value is written to this entry (and to memory if we are doing write-through).
Note that multiple-word blocks require a read from memory before the write; otherwise the other values in the block would be incorrect. E.g. a write of the value "1" to location x+2 misses: allocate a cache block (assume 4-word blocks), read locations x, x+1, x+2, x+3, then modify location x+2 in the cache (and in memory if write-through).
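The write-allocate policy above can be sketched in C. This is a minimal toy model (a single-block cache, invented names and sizes, write-through), not a real cache implementation:

```c
#include <assert.h>
#include <string.h>

#define BLOCK_WORDS 4

int memory[16];                   /* toy backing store, word-addressed */
int cache_block[BLOCK_WORDS];     /* a single cached block */
int cache_tag = -1;               /* which block is cached; -1 = empty */

/* Write-allocate with write-through: on a write miss, the whole
   block is read from memory first, so the three words we are NOT
   writing hold correct values; then the target word is modified
   in the cache and written through to memory. */
void write_word(int addr, int value)
{
    int block = addr / BLOCK_WORDS;
    if (cache_tag != block) {     /* write miss: allocate the block */
        memcpy(cache_block, &memory[block * BLOCK_WORDS],
               sizeof cache_block);
        cache_tag = block;
    }
    cache_block[addr % BLOCK_WORDS] = value; /* modify word in cache */
    memory[addr] = value;                    /* write-through */
}
```

Dropping the memcpy would leave the other three words of the block stale, which is exactly the hazard the read-before-write avoids.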
3. How to distinguish between RAW and WAR?
RAW – Read-after-write
The results of a previous write are read. E.g.:
I0: a[i] = b[i] + 3;
I1: c[i] = a[i] + 1;
There is a RAW between I0 and I1 based on a[i]. This is
the only TRUE dependency and cannot be removed by
renaming.
WAR – Write-after-read
A variable is written to, and this variable was read in a
previous instruction. E.g.:
I0: a[i] = b[i]+3;
I1: b[i] = 7+c[i];
Here there is a WAR between I0 and I1 based on b[i].
This is hazardous because if I1 is executed and retired
before I0, I0 will read the wrong value in b[i].
Can be resolved by renaming each new copy of b[i]:
I0: a[i] = b[i]+3;
I1: b1[i] = 7+c[i];
Now if I1 is executed and retired before I0, there is no problem.
WAW (not asked but added for completeness)
Two instructions write to the same variable. E.g.:
I0: a[i] = 3+b[i];
I1: a[i] = 7 - c[i];
This is hazardous because if I1 is executed and retired
before I0, the final a[i] answer is wrong. Again resolve
by renaming.
4. What is the justification behind IA-64?
Main justification:
a. Superscalar architectures attempt to discover and
benefit from ILP using hardware:
i) The ILP search is confined to a narrow window of instructions in the "pre-fetch queue" or "issue buffer". The CPU cannot "see" ILP beyond this narrow window. Large windows are expensive and can be hard to fill.
ii) It is difficult to find ILP in the instructions. Many conditions (e.g. on dependencies) need to be met before instructions can be issued in parallel.
iii) Branches introduce a further complication: we are not sure which portion to fetch instructions from. This further complicates instruction fetch and issue.
iv) The cost of a wrong branch assumption is very high. Sophisticated methods of branch prediction with prediction buffers, statistical predictors etc. are required to reduce poor predictions. These are very expensive and hard to build.
v) Loads are expensive and cause pipelines to idle.
b. Why not move all the complication to software?
i) Software is cheap and simple to write.
ii) There is extensive support for debugging and fixing faulty software. Faulty hardware, on the other hand, is a disaster (e.g. the Pentium bug).
c. Three ideas in Itanium (IA-64):
i)
Pass the responsibility of finding
independent instructions to the software
(compiler).
i. IA-64 is a VLIW architecture.
Three instructions can fit into
a single instruction word.
ii. All three instructions in the
instruction word are simply
sent to the execution units.
iii. Dependency information must be inserted by the compiler using a set of "template bits" at the end of the instruction word. These not only indicate independence within a single instruction word, but also across instruction words. Thus the CPU will only look at the template to determine which instructions can be executed in parallel; it will not try to discover ILP on its own.
ii) Forget about branch prediction. The compiler inserts predicates to indicate the THEN and the ELSE portions of code. E.g. for the following code:
      cmp  r1, r2
      bne  ELSE
      addi r1, r1, 2    // THEN
      sub  r3, r7, r1   // portion
      j    JOIN         // skip over ELSE
ELSE: subi r1, r1, 7
      mul  r3, r7, r1
JOIN: ...
Using Stallings' notation, this is rewritten as:
<P1, P2> = cmp(r1, r2);
<P1> addi r1, r1, 2
<P1> sub r3, r7, r1
<P2> subi r1, r1, 7
<P2> mul r3, r7, r1
…
The code for both predicates is executed together, but only the results of the correct predicate are committed.
This allows us to fully utilize all the execution units, yet does not require us to roll back any mispredicted branches.
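The effect of predication can be sketched in plain C (a hand-written analogy to the sequence above, not actual IA-64 semantics): both arms are computed, but only the results guarded by the true predicate are committed:

```c
#include <assert.h>

/* C analogy of the predicated sequence: P1 guards the THEN arm,
   P2 the ELSE arm.  Both arms are evaluated; each result commits
   only when its predicate is true, so no branch (and no branch
   prediction) is needed. */
void predicated(int *r1, int *r3, int r2, int r7)
{
    int p1 = (*r1 == r2);        /* <P1, P2> = cmp(r1, r2) */
    int p2 = !p1;
    int then_r1 = *r1 + 2;       /* both arms execute...          */
    int else_r1 = *r1 - 7;
    if (p1) *r1 = then_r1;       /* ...but only the results under */
    if (p2) *r1 = else_r1;       /* the true predicate commit     */
    if (p1) *r3 = r7 - *r1;
    if (p2) *r3 = r7 * *r1;
}
```

In hardware the `if (p…)` guards are per-instruction predicate bits, so both arms occupy execution units in the same cycles.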
iii) Speculative Loading
Loads from memory can be very slow, so we want to begin loads as early as possible. The Itanium also supports speculative loading, which allows us to move loads to the start of the program (as far away from the instructions that use the result of the load as possible).
Results are not committed until a “check”
instruction is executed. Exceptions are also
delayed until the “check” is executed.
This separation between the "load" (when memory is accessed) and the "check" (when the result is committed) allows us to do all the memory loads early, but the effect of the load (i.e. modifying the target register) takes place only at the point when the result of the load is needed.
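The load/check separation can be mimicked in plain C (a loose analogy with invented names, not real IA-64 instructions): the load runs early and records, instead of raising, any exception; the exception only matters at the check point where the value is used:

```c
#include <assert.h>
#include <stddef.h>

int spec_value;      /* result of the early (speculative) load */
int spec_faulted;    /* deferred-exception token               */

/* Issued as early as possible; a would-be exception (modelled
   here as a NULL pointer) is recorded rather than raised. */
void speculative_load(const int *p)
{
    if (p == NULL) { spec_faulted = 1; return; }
    spec_faulted = 0;
    spec_value = *p;
}

/* The "check", placed where the value is actually needed: only
   now does a deferred exception take effect (here we just fall
   back; real recovery code would redo the load non-speculatively). */
int check_load(int fallback)
{
    return spec_faulted ? fallback : spec_value;
}
```

If the speculative load turns out to be on an untaken path, its recorded fault is simply never checked and causes no visible exception.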
Surprisingly, the benefits of IA-64 are still not clear in practice.
5. From page 203 of CA-AQA 3rd Ed, Fig 3.13: why does the b1 branch prediction for the first row (d=2) start with NT/NT with the LHS bolded, whereas for b2 it starts with NT/NT with the RHS bolded?
Fig 3.13 is about using correlation bits to do branch prediction.
The bolding in Fig 3.13 depends on whether the previous branch was taken (bold RHS) or not taken (bold LHS). This is illustrated below.
Main Idea:
Base our prediction of the current branch on the behavior of other branches.
We use two bits. The first bit indicates what prediction to make if the previous branch was not taken, and the second bit indicates what prediction to make if the previous branch was taken. This is shown below:
Prediction bits   Prediction if previous   Prediction if previous
                  branch not taken         branch taken
NT/NT             NT                       NT
NT/T              NT                       T
T/NT              T                        NT
T/T               T                        T
Individually each bit is treated like a 1-bit predictor
(i.e. its direction is set according to whether the
current branch was actually taken or not).
Given the following C code:
if(d==0)
d=1;
if(d==1)
d=2;
Assuming d is in register r1, this gives us:
    bnez r1, L1      ; Branch b1. Taken if d != 0
    addi r1, r0, 1   ; Set r1 to 1 if d==0. Note that r0 is a
                     ; special register that always holds 0
L1: addi r3, r1, -1  ; Add -1 to r1. If r1==1, the result
                     ; in r3 is zero
    bnez r3, L2      ; Branch b2. Taken if d != 1
L2: ....
If this is in a loop, predictions for b1 will depend on b2, and predictions for b2 will depend on b1. Initially we assume both branches are not taken. The bolded portions indicate which prediction (previous branch not taken / previous branch taken) was used.
The table below shows several iterations through the loop when d takes values in the range [0,2]:

Iter  d=?  b1 pred  b1 act  new b1 pred  b2 pred  b2 act  new b2 pred
 0     2   NT/NT    T       T/NT         NT/NT    T       NT/T
 1     0   T/NT     NT      T/NT         NT/T     NT      NT/T
 2     1   T/NT     T       T/NT         NT/T     NT      NT/NT
 3     2   T/NT     T       T/NT         NT/NT    T       NT/T
 4     0   T/NT     NT      T/NT         NT/T     NT      NT/T
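The table can be reproduced with a short simulation. This is a sketch under the notes' assumptions (each branch keeps two 1-bit predictions; the bit selected by the previous branch's outcome is consulted and then updated with the actual outcome); all names are invented:

```c
#include <assert.h>

/* Two prediction bits per branch: bits[0] is used when the previous
   branch was Not Taken, bits[1] when it was Taken (0 = NT, 1 = T).
   The bit that was consulted is then updated 1-bit-predictor style. */
typedef struct { int bits[2]; } Pred;

static int predict(Pred *p, int last)              { return p->bits[last]; }
static void update(Pred *p, int last, int actual)  { p->bits[last] = actual; }

int run(void)  /* returns total correct predictions over 5 iterations */
{
    Pred b1 = {{0, 0}}, b2 = {{0, 0}};   /* both start as NT/NT */
    int last = 0;                        /* assume previous branch NT */
    int dvals[] = {2, 0, 1, 2, 0};
    int correct = 0;

    for (int i = 0; i < 5; i++) {
        int d = dvals[i];

        int act = (d != 0);              /* b1: bnez r1, taken if d != 0 */
        correct += (predict(&b1, last) == act);
        update(&b1, last, act);
        last = act;
        if (d == 0) d = 1;               /* THEN part of the first if */

        act = (d != 1);                  /* b2: bnez r3, taken if d != 1 */
        correct += (predict(&b2, last) == act);
        update(&b2, last, act);
        last = act;
    }
    /* Final states match the table: b1 = T/NT, b2 = NT/T */
    assert(b1.bits[0] == 1 && b1.bits[1] == 0);
    assert(b2.bits[0] == 0 && b2.bits[1] == 1);
    return correct;
}
```

As in the table, b1 mispredicts only on iteration 0 and b2 on iterations 0, 2 and 3, so run() returns 6 correct predictions out of 10.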