Chapter 4
Exploiting Instruction-Level Parallelism with Software Approaches
Rung-Bin Lin
Basic Compiler Techniques for Exposing ILP
– Basic pipeline scheduling and loop unrolling
• To keep a pipeline full, parallelism among instructions must be
exploited by finding sequences of unrelated instructions that can be
overlapped in the pipeline.
• A compiler's ability to perform this kind of scheduling depends both
on the amount of ILP available in the program and on the latencies of
the functional units in the pipeline.
• To avoid a pipeline stall, a dependent instruction must be separated
from the source instruction by a distance in clock cycles equal to the
pipeline latency of that source instruction.
Basic Pipeline Scheduling and Loop Unrolling
– Basic assumptions:
• The latencies of the FP unit:

  Instruction producing result   Instruction using result   Latency (clock cycles)
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
• The branch delay of the pipeline implementation is 1 delay slot.
• The functional units are fully pipelined or replicated such that no
structural hazards can occur.
Loop Unrolling by Compilers
– Example:
    for (j=1; j<=1000; j++)
        x[j] = x[j] + s;
• Assume R1 initially holds the address of the element with the highest
  address, and 8(R2) is the address of the last element to operate on.
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop
– Performance of scheduled code with loop unrolling.
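At the source level, unrolling this loop four times amounts to the following sketch (illustrative Python, not compiler output; the function name is made up). A clean-up loop handles trip counts that are not a multiple of 4:

```python
def add_scalar_unrolled(x, s):
    """4-way unrolled version of: for j in range(len(x)): x[j] += s."""
    x = list(x)
    n = len(x)
    j = 0
    while j + 4 <= n:        # unrolled body: 4 elements per trip
        x[j]     += s
        x[j + 1] += s
        x[j + 2] += s
        x[j + 3] += s
        j += 4
    while j < n:             # clean-up loop for leftover elements
        x[j] += s
        j += 1
    return x
```

The unrolled body gives the scheduler more independent instructions per trip and amortizes the loop-overhead instructions (counter update and branch) over four elements.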
Performance of Unscheduled Code without
Loop Unrolling
Loop: L.D    F0, 0(R1)    ; clock cycle 1
      stall               ; 2
      ADD.D  F4, F0, F2   ; 3
      stall               ; 4
      stall               ; 5
      S.D    F4, 0(R1)    ; 6
      DADDUI R1, R1, #-8  ; 7
      stall               ; 8
      BNE    R1, R2, Loop ; 9
      stall               ; 10 (branch delay slot)
– Need 10 cycles per result
Performance of Scheduled Code without Loop
Unrolling
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      stall
      BNE    R1, R2, Loop ; delayed branch
      S.D    F4, 8(R1)    ; fills the branch delay slot
– Need 6 cycles per result
Performance of Unscheduled Code with Loop
Unrolling
• Unroll the loop so the body holds 4 iterations:

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop

– Need 7 cycles per result
Performance of Scheduled Code with Loop
Unrolling
Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)   ; fills the branch delay slot

• Need 3.5 cycles per result (14 cycles for 4 elements)
Using Loop Unrolling and Pipeline Scheduling
with Static Multiple Issue
• Fig. 4.2 on page 313
Static Branch Prediction
– For a compiler to effectively schedule code, such as when filling
branch delay slots, we need to statically predict the behavior of
branches.
– Static branch prediction used in a compiler
      LD     R1, 0(R2)
      DSUBU  R1, R1, R3
      BEQZ   R1, L
      OR     R4, R5, R6
      DADDU  R10, R4, R3
L:    DADDU  R7, R8, R9
– If the BEQZ were almost always taken and the value of R7 were not
needed on the fall-through path, the DADDU at L could be moved to the
position after the LD.
– If the branch were rarely taken and the value of R4 were not needed on
the taken path, the OR could be moved to the position after the LD.
Branch Behavior in Programs
– Program behavior
  • Average frequency of taken branches: 67%
    – 60% of the forward branches are taken.
    – 85% of the backward branches are taken.
– Methods for static branch prediction
  • By examination of the program behavior
    – Predict-taken (misprediction rate: 9%~59%).
    – Predict forward branches untaken and backward branches taken.
    – Combined, the above two approaches have a misprediction rate of
      30%~40%.
  • By the use of profile information collected from earlier runs of the
    program.
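The quoted rates can be sanity-checked with back-of-the-envelope arithmetic. The forward/backward branch mix below (50/50) is an assumption for illustration, not a figure from the slides:

```python
# Taken frequencies from the slides:
p_fwd_taken, p_bwd_taken = 0.60, 0.85
frac_fwd = 0.5                     # assumed forward/backward branch mix

# Predict-taken mispredicts every branch that is actually untaken:
p_taken = frac_fwd * p_fwd_taken + (1 - frac_fwd) * p_bwd_taken
mispredict_taken = 1 - p_taken     # 0.275 under this assumed mix

# Predict forward-untaken / backward-taken mispredicts taken forward
# branches and untaken backward branches:
mispredict_dir = frac_fwd * p_fwd_taken + (1 - frac_fwd) * (1 - p_bwd_taken)
# 0.375 under this mix, consistent with the 30%~40% range quoted
```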
Mis-prediction Rate for a Profile-Based
Predictor
Comparison between Profile-Based and Predict-Taken
The Basic VLIW Approach
• VLIW uses multiple, independent functional units.
• Multiple, independent operations are issued by processing one large
instruction package that consists of multiple operations.
• A VLIW instruction might include one integer/branch operation, two
memory references, and two floating-point operations.
  – If each operation requires a 16- to 24-bit field, each VLIW
    instruction is 112 to 168 bits long.
• Performance of VLIW
Scheduling of VLIW Instructions
• Fig. 4.5 on page 318
Limitations to VLIW Implementation
• Limitations
  – Technical problem
    • Generating enough straight-line code requires ambitious loop
      unrolling, which increases code size.
  – Poor code density
    • Whenever an instruction is not full, the unused functional units
      translate into wasted bits in the instruction encoding
      (instructions are often only about 60% full).
  – Logistical problem
    • Binary code compatibility, which depends on
      – the instruction set definition, and
      – the detailed pipeline structure, including both the functional
        units and their latencies.
• Advantages of a superscalar processor over a VLIW processor
– Little impact on code density.
– Even unscheduled programs, or those compiled for older
implementations, can be run.
Advanced Compiler Support for Exposing and
Exploiting ILP
– Exploiting Loop-Level Parallelism
• Converting the loop-level parallelism into ILP
– Software pipelining (Symbolic loop unrolling)
– Global code scheduling
Loop-Level Parallelism
– Concepts and techniques
• Loop-level parallelism is normally analyzed at the source level, while
most ILP analysis is done after the instructions have been generated by
the compiler.
• The analysis of loop-level parallelism focuses on determining whether
data accesses in later iterations are data dependent on data values
produced in earlier iterations.
• Example:
    for (i=1; i<=1000; i++)
        x[i] = x[i] + s;
• Loop-carried data dependence: a dependence that exists between
different iterations of the loop.
• A loop is parallel unless there is a cycle in its dependences; a
loop-carried dependence that does not form a cycle can be eliminated by
code transformation.
Loop-Carried Data Dependence (1)
• Example
    for (I=1; I<=100; I=I+1){
        A[I+1] = A[I] + C[I];    /* S1 */
        B[I+1] = B[I] + A[I+1];  /* S2 */
    }
– Dependence graph:
    S1 -> S1 (loop-carried, through A)
    S2 -> S2 (loop-carried, through B)
    S1 -> S2 (within an iteration, through A[I+1])
Loop-Carried Data Dependence (2)
• Example
    for (I=1; I<=100; I=I+1){
        A[I] = A[I] + B[I];    /* S1 */
        B[I+1] = C[I] + D[I];  /* S2 */
    }
– Dependence graph: S2 -> S1 (loop-carried, through B); there is no
  cycle, so the dependence can be eliminated.
– Code transformation:
    A[1] = A[1] + B[1];
    for (I=1; I<=99; I=I+1){
        B[I+1] = C[I] + D[I];        /* S2 */
        A[I+1] = A[I+1] + B[I+1];    /* S1 */
    }
    B[101] = C[100] + D[100];
– This converts the loop-carried data dependence into a data dependence
  within a single iteration.
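The transformation can be checked quickly in Python (a sketch with made-up function names; arrays are 1-indexed by padding index 0):

```python
def run_original(A, B, C, D):
    A, B = list(A), list(B)
    for I in range(1, 101):               # I = 1 .. 100
        A[I] = A[I] + B[I]                # S1
        B[I + 1] = C[I] + D[I]            # S2
    return A, B

def run_transformed(A, B, C, D):
    A, B = list(A), list(B)
    A[1] = A[1] + B[1]                    # peeled first S1
    for I in range(1, 100):               # I = 1 .. 99
        B[I + 1] = C[I] + D[I]            # S2
        A[I + 1] = A[I + 1] + B[I + 1]    # S1, now same-iteration
    B[101] = C[100] + D[100]              # peeled last S2
    return A, B

def arr(k):
    return [i * k for i in range(102)]    # indices 1..101 are live

same = run_original(arr(1), arr(2), arr(3), arr(4)) == \
       run_transformed(arr(1), arr(2), arr(3), arr(4))
```

Both versions produce identical final A and B; the transformed loop's two statements no longer depend across iterations, so its iterations can run in parallel.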
Loop-Carried Data Dependence (3)
• True loop-carried data dependences usually take the form of a
  recurrence:

    for (I=2; I<=100; I++){
        Y[I] = Y[I-1] + Y[I];
    }

• Even a true loop-carried data dependence can leave parallelism when
  the dependence distance is greater than 1:

    for (I=6; I<=100; I++){
        Y[I] = Y[I-5] + Y[I];
    }

  – Any five consecutive iterations are mutually independent, so five
    iterations can execute in parallel.
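One way to see the parallelism: the distance-5 recurrence decomposes into five independent chains, one per residue class mod 5, which could be computed concurrently. A small Python sketch (function names are made up) checks that computing the chains separately matches the sequential loop:

```python
def sequential(Y):
    Y = list(Y)
    for I in range(6, 101):               # I = 6 .. 100
        Y[I] = Y[I - 5] + Y[I]
    return Y

def by_chains(Y):
    Y = list(Y)
    for start in range(6, 11):            # five independent chains
        for I in range(start, 101, 5):    # each walks one residue class
            Y[I] = Y[I - 5] + Y[I]
    return Y

chains_match = sequential(list(range(102))) == by_chains(list(range(102)))
```

Each Y[I] depends only on Y[I-5], which lies in the same chain, so the five chains never interact and their relative order does not matter.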
Detecting and Eliminating Dependencies
• Finding the dependences in a program is an
important part of three tasks:
– Good scheduling of code
– Determining which loops might contain parallelism, and
– Eliminating name dependence
• Example
    for (i=1; i<=100; i++) {
        A[i] = B[i] + C[i];
        D[i] = A[i] + E[i];
    }
• The absence of a loop-carried dependence implies a large amount of
  parallelism.
Dependence Detection Problem
• In general, exact dependence analysis is NP-complete.
• GCD test (a simple heuristic)
  – Suppose we have stored to an array element with index value a*j+b and
    loaded from the same array with index value c*k+d, where j and k are
    the for-loop index variables that run from m to n. A dependence
    exists if two conditions hold:
    – There are two iteration indices, j and k, both within the limits of
      the for loop.
    – The loop stores into an array element indexed by a*j+b and later
      fetches from that same array element when it is indexed by c*k+d;
      that is, a*j+b = c*k+d.
      » Note: a, b, c, and d are generally unknown at compile time,
        making it impossible to tell exactly whether a dependence exists.
  – The GCD test is a simple, sufficient test for the absence of a
    dependence: if a loop-carried dependence exists, then GCD(c,a) must
    divide (d-b). That is, if GCD(c,a) does not divide (d-b), no
    dependence is possible (example on page 324).
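The test itself is one line of arithmetic. A sketch (the function name is made up), applied to the textbook-style case a=2, b=3, c=2, d=0, i.e., a store indexed by 2*i+3 and a load indexed by 2*i:

```python
from math import gcd

def may_depend(a, b, c, d):
    """GCD test for a store indexed a*j+b and a load indexed c*k+d.
    Returns False only when a dependence is provably impossible;
    True means a dependence cannot be ruled out by this test."""
    return (d - b) % gcd(c, a) == 0

# a=2, b=3, c=2, d=0: GCD(2,2)=2 does not divide 0-3=-3 -> no dependence
# a=1, b=0, c=1, d=-1 (x[j] stored, x[k-1] loaded): dependence possible
```

Remember the asymmetry: failing the test proves independence, but passing it does not prove a dependence exists (the loop bounds may still rule it out).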
Situations where Dependence Analysis Fails
– When objects are referenced via pointers rather than array indices;
– When array indexing is indirect through another array;
– When a dependence may exist for some values of the inputs but does not
  occur with the actual input values;
– Others.
Eliminating Dependent Computations
• Copy propagation
      DADDUI R1, R2, #4
      DADDUI R1, R1, #4
  becomes
      DADDUI R1, R2, #8
• Tree height reduction
      ADD R1, R2, R3
      ADD R4, R1, R6      ; serial chain of three dependent adds
      ADD R8, R4, R7
  becomes
      ADD R1, R2, R3      ; R1 and R4 can now be computed in parallel
      ADD R4, R6, R7
      ADD R8, R1, R4
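The benefit of tree height reduction is a shorter dependence chain. A small sketch (a hypothetical helper, not from the text) measures the longest chain of dependent ADDs in each version, counting one cycle per ADD:

```python
def depth(instrs):
    """Longest dependence chain through a list of (dest, src1, src2)
    register triples; sources not produced in the list count as depth 0."""
    d = {}
    for dest, s1, s2 in instrs:
        d[dest] = 1 + max(d.get(s1, 0), d.get(s2, 0))
    return max(d.values())

serial  = [("R1", "R2", "R3"), ("R4", "R1", "R6"), ("R8", "R4", "R7")]
reduced = [("R1", "R2", "R3"), ("R4", "R6", "R7"), ("R8", "R1", "R4")]
# depth(serial) is 3; depth(reduced) is 2
```

Note that regrouping relies on associativity, which holds for integer adds but is not strictly valid for floating-point arithmetic.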
Software Pipelining: Symbolic Loop Unrolling
– Software pipelining is a technique for reorganizing loops such that
  each iteration in the software-pipelined code is made from instructions
  chosen from different iterations of the original loop.
– A software-pipelined loop interleaves instructions from different loop
  iterations without unrolling the loop.
– A software-pipelined loop consists of a steady-state loop body plus
  start-up code and clean-up code.
Example
Original loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

The body viewed across iterations (loop overhead omitted):

    Iteration i:   L.D    F0, 0(R1)
                   ADD.D  F4, F0, F2
                   S.D    F4, 0(R1)
    Iteration i+1: L.D    F0, 0(R1)
                   ADD.D  F4, F0, F2
                   S.D    F4, 0(R1)
    Iteration i+2: L.D    F0, 0(R1)
                   ADD.D  F4, F0, F2
                   S.D    F4, 0(R1)

Reorganized (software-pipelined) loop:

    Loop: S.D    F4, 16(R1)  ; store the result of iteration i
          ADD.D  F4, F0, F2  ; add for iteration i+1
          L.D    F0, 0(R1)   ; load for iteration i+2
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
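The reorganized loop can be mimicked in Python to check that the start-up code, steady-state kernel, and clean-up code together still compute x[i] = x[i] + s (a sketch; the function name is made up and F0/F4 are modeled as plain variables):

```python
def add_scalar_swp(x, s):
    """Software-pipelined form of: for i in range(len(x)): x[i] += s."""
    n = len(x)
    if n < 3:                   # too short to pipeline; plain loop
        return [v + s for v in x]
    x = list(x)
    # start-up code: load x[0], add for x[0], load x[1]
    f0 = x[0]
    f4 = f0 + s
    f0 = x[1]
    for i in range(n - 2):      # steady-state kernel
        x[i] = f4               # S.D  : store for iteration i
        f4 = f0 + s             # ADD.D: add for iteration i+1
        f0 = x[i + 2]           # L.D  : load for iteration i+2
    # clean-up code: drain the last two in-flight results
    x[n - 2] = f4
    x[n - 1] = f0 + s
    return x
```

Each kernel trip touches three different iterations at once, which is exactly why the store, add, and load inside it carry no intra-trip dependences.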
Comparison between Software-Pipelining and
Loop Unrolling
– Software pipelining consumes less code space.
– Loop unrolling reduces the overhead of the loop: the branch and
  counter-update code.
– Software pipelining reduces the time when the loop is not running at
  peak speed to once per loop, at the beginning and the end.
Global Code Scheduling
Trace Scheduling: Focusing on Critical Path
• Trace selection: find a likely sequence of basic blocks (a trace)
  whose operations will be scheduled together.
• Trace compaction: schedule the selected trace into as few wide
  instructions as possible.
• Bookkeeping code: compensation code inserted at trace entries and
  exits so the program remains correct when execution leaves the trace.
Hardware Support for Exposing More
Parallelism at Compile Time
– The difficulty of uncovering more ILP at compile time (due to unknown
  branch behavior) can be overcome by employing the following techniques:
  • Conditional or predicated instructions
  • Speculation
    – Static speculation performed by the compiler with hardware support.
    – Dynamic speculation performed by hardware, using branch prediction
      to guide the speculation process.
Conditional or Predicated instructions
– Basic concept
  • An instruction refers to a condition, which is evaluated as part of
    the instruction's execution. If the condition is true, the
    instruction executes normally; otherwise, execution continues as if
    the instruction were a no-op.
  • Conditional instructions allow us to convert the control dependence
    present in branch-based code into a data dependence.
– A conditional instruction can be used to speculatively move an
  instruction that is time critical.
– To use a conditional instruction successfully, as in the examples, we
  must ensure that the speculated instruction does not introduce an
  exception.
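A minimal sketch of the idea, modeled on a CMOVZ-style conditional move applied to the classic case `if (A == 0) S = T;` (Python stands in for the hardware; the function name is made up):

```python
def cmovz(s, t, a):
    """Conditional move: S gets T when A == 0, else S is unchanged.
    The condition value A is an ordinary data input; there is no branch
    for the hardware to predict."""
    return t if a == 0 else s

# if (A == 0) S = T;  becomes a single data-dependent assignment:
S = cmovz(5, 9, 0)    # condition holds, S takes T
```

The control dependence on the branch is gone; S now simply data-depends on T and A, which is what lets the scheduler move surrounding code freely.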
Conditional Move
• Example on page 341
On Time Critical Path
• Example on page 342 and 343
Example (Cont.)
Limiting Factors
• The usefulness of conditional instructions is limited by several
factors:
– Conditional instructions that are annulled still take execution time.
– Conditional instructions are most useful when the condition can be
evaluated early.
– The use of conditional instructions is limited when the control flow
involves more than a simple alternative sequence.
– Conditional instructions may have some speed penalty compared
with unconditional instructions.
• Machines that use conditional instructions
  – Alpha: conditional move;
  – HP PA: any register-register instruction;
  – SPARC: conditional move;
  – ARM: all instructions.
Compiler Speculation with Hardware Support
• In moving instructions across a branch, the compiler must ensure that
  exception behavior is not changed and that the dynamic data
  dependences remain the same.
  – The simplest case is when the compiler is conservative about which
    instructions it speculatively moves, so the exception behavior is
    unaffected.
• Four methods
– The hardware and OS cooperatively ignore exceptions for speculative
instructions.
– Speculative instructions that never raise exceptions are used, and
checks are introduced to determine when an exception should occur.
– Poison bits are attached to the result registers written by speculated
  instructions when the instructions cause exceptions.
– The instruction results are buffered until it is certain that the
instruction is no longer speculative.
Types of Exceptions
• Two types of exceptions need to be distinguished:
  – Exceptions that indicate a program error, meaning the program must be
    terminated (e.g., a memory protection violation).
  – Exceptions that can be handled and normally resumed (e.g., page
    faults).
• Basic principles employed by the mechanisms above:
  – Exceptions that can be resumed can be accepted and processed for
    speculative instructions just as if they were normal instructions.
  – Exceptions that indicate a program error should not occur in correct
    programs.
Hardware-Software Cooperation for Speculation
• The hardware and OS simply
  – handle all resumable exceptions when an exception occurs, and
  – return an undefined value for any exception that would cause
    termination.
• If a normal instruction generates a
  – terminating exception --> an undefined value is returned and the
    program proceeds normally --> an incorrect result may be produced, or
  – resumable exception --> the exception is accepted and handled --> the
    program continues and completes normally.
• If a speculative instruction generates a
  – terminating exception --> an undefined value is returned --> a
    correct program will never use it --> the result is still correct, or
  – resumable exception --> the exception is accepted and handled --> the
    program continues and completes normally.
Example
• On page 346 and 347
Speculative Instructions Never … (Method 2)
• Example on page 347
Answer
Speculation with Poison Bits
– A poison bit is added to every register, and another bit is added to
  every instruction to indicate whether the instruction is speculative.
– Three steps:
• The poison bit is set whenever a speculative instruction results in a
terminating exception; all other exceptions are handled immediately.
• If a speculative instruction uses a register with a poison bit turned
on, the destination register of the instruction simply has its poison
bit turned on.
• If a normal instruction attempts to use a register source with its
poison bit turned on, the instruction causes a fault.
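The three rules above can be modeled with a toy simulator (entirely hypothetical names; real hardware would track poison bits in the register file alongside the values):

```python
class PoisonFault(Exception):
    """Raised when a normal instruction reads a poisoned register."""

regs   = {"R1": 0, "R2": 0, "R3": 0, "R4": 1}
poison = {r: False for r in regs}

def execute(dest, srcs, compute, speculative):
    # Rules 2 and 3: a poisoned source poisons a speculative result,
    # but makes a normal instruction fault.
    if any(poison[s] for s in srcs):
        if speculative:
            poison[dest] = True
            return
        raise PoisonFault(dest)
    try:
        regs[dest] = compute(*(regs[s] for s in srcs))
        poison[dest] = False
    except ArithmeticError:
        if speculative:          # Rule 1: defer the terminating exception
            poison[dest] = True
        else:
            raise

# A speculative divide using R2 (== 0) faults; the fault is only deferred:
execute("R1", ["R4", "R2"], lambda x, y: x // y, speculative=True)
# A speculative use of R1 silently propagates the poison bit to R3:
execute("R3", ["R1"], lambda v: v + 1, speculative=True)
```

Only when a non-speculative instruction finally reads R3 does the deferred exception surface as a fault, which is the point of the scheme: mis-speculated faults never disturb a correct program.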
Example
• On page 348
Hardware Support for Memory Reference
Speculation
• Moving loads across stores is usually done when the compiler is
  certain the addresses do not conflict.
• To support speculative loads
  – A special check instruction, which checks for an address conflict, is
    placed at the original location of the load instruction.
  – When a speculated load is executed, the hardware saves the address of
    the accessed memory location.
  – If the value stored at that location is changed before the check
    instruction executes, the speculation fails; if not, it succeeds.
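A sketch of the mechanism for the simplest case, a load hoisted above a single possibly conflicting store (Python stands in for the hardware; the function name and the re-execute-the-load recovery are assumptions of this model):

```python
def speculate_load(mem, load_addr, store_addr, store_val):
    """Hoist a load above one store; a check at the load's original
    position redoes the load if the store hit the same address."""
    spec_val = mem[load_addr]      # speculative load, moved above the store
    mem[store_addr] = store_val    # the store the load was moved across
    if store_addr == load_addr:    # check instruction: address conflict?
        spec_val = mem[load_addr]  # speculation failed: redo the load
    return spec_val
```

When the addresses never conflict the check is a cheap comparison and the load latency is fully hidden; only on a conflict is the recovery path taken.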
Hardware- versus Software-Based Speculation
• Dynamic, runtime disambiguation of memory addresses makes extensive
  speculation possible, allowing loads to be moved past stores at run
  time.
• Hardware-based speculation works better when control flow is
  unpredictable, because hardware branch prediction is superior to
  software branch prediction done at compile time.
• Hardware-based speculation maintains a completely precise exception
  model.
• Hardware-based speculation does not require bookkeeping code.
• Hardware-based speculation with dynamic scheduling does not require
  different code sequences for different implementations of an
  architecture to achieve good performance.
• Compiler-based approaches can see further ahead in the code sequence.
Concluding Remarks
• Hardware and software approaches to increasing ILP
tend to fuse together.