Instruction Level Parallelism (ILP)

Instruction Level Parallelism
Advanced Computer Architecture
CSE 8383
Spring 2004
2/19/2004
Presented By:
Sa’ad Al-Harbi
Saeed Abu Nimeh
Outline

- What’s ILP
- ILP vs Parallel Processing
- Sequential execution vs ILP execution
- Limitations of ILP
- ILP Architectures
  - Sequential Architecture
  - Dependence Architecture
  - Independence Architecture
- ILP Scheduling
- Open Problems
- References

What’s ILP

- An architectural technique that allows the overlap of individual machine operations (add, mul, load, store, ...)
- Multiple operations execute in parallel (simultaneously)
- Goal: speed up execution
- Example: the first group of operations below is mutually independent and can overlap, while in the second group each operation needs the result of the one before it, so they must run in order

    load  R1 ← R2
    add   R3 ← R3, “1”
    add   R4 ← R4, R2

    add   R3 ← R3, “1”
    add   R4 ← R3, R2
    store [R4] ← R0

Example: Sequential vs ILP

Sequential execution (without ILP):
  add r1, r2 → r8   (4 cycles)
  add r3, r4 → r7   (4 cycles)
  Total: 8 cycles

ILP execution (overlapped execution):
  add r1, r2 → r8
  add r3, r4 → r7
  Total: 5 cycles  (a minimal cycle-count model is sketched below)

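A rough way to see where the 5-cycle total comes from: if each add has a 4-cycle latency but a new, independent operation can be issued every cycle on a pipelined unit, the second add finishes one cycle after the first. The Python sketch below is purely illustrative; the one-issue-per-cycle assumption is not stated on the slide but is the usual reading.

# Minimal cycle-count model: 4-cycle add latency, one new issue per cycle.
ADD_LATENCY = 4

def sequential_cycles(num_ops):
    # Without ILP each operation waits for the previous one to finish.
    return num_ops * ADD_LATENCY

def overlapped_cycles(num_ops):
    # With ILP an independent operation starts each cycle, so the last one
    # (issued num_ops - 1 cycles after the first) finishes ADD_LATENCY later.
    return (num_ops - 1) + ADD_LATENCY

print(sequential_cycles(2))  # 8 cycles, as on the slide
print(overlapped_cycles(2))  # 5 cycles, as on the slide
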
ILP vs Parallel Processing

ILP:
- Overlaps individual machine operations (add, mul, load, ...) so that they execute in parallel
- Transparent to the user
- Goal: speed up execution

Parallel Processing:
- Separate processors work on separate chunks of the program (the processors are programmed to do so)
- Not transparent to the user
- Goal: speed up execution and improve quality

ILP Challenges

- To achieve parallelism, there must be no dependences among the instructions that execute in parallel:
  - H/W terminology: data hazards (RAW, WAR, WAW)
  - S/W terminology: data dependences

Dependences and Hazards

- Dependences are a property of programs
- If two instructions are data dependent, they cannot execute simultaneously
- A dependence results in a hazard, and the hazard causes a stall (a simple hazard classifier is sketched below)
- Data dependences may occur through registers or memory

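To make the hazard terminology concrete, here is a minimal sketch that classifies the hazard between an earlier instruction i and a later instruction j from their destination and source registers. The (destination, sources) encoding is a simplification invented for this illustration.

# Classify the data hazard between instruction i (earlier) and j (later).
# Each instruction is modeled as (destination_register, [source_registers]).
def classify_hazard(i, j):
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = []
    if i_dest is not None and i_dest in j_srcs:
        hazards.append("RAW")   # j reads what i writes (true dependence)
    if j_dest is not None and j_dest in i_srcs:
        hazards.append("WAR")   # j writes what i reads (anti-dependence)
    if i_dest is not None and i_dest == j_dest:
        hazards.append("WAW")   # both write the same register (output dependence)
    return hazards or ["none"]

# add r6 <- r5, r4  followed by  sub r5 <- r8, r11
print(classify_hazard(("r6", ["r5", "r4"]), ("r5", ["r8", "r11"])))  # ['WAR']
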
Types of Dependencies

- Name dependences
  - Output dependence
  - Anti-dependence
- Data (true) dependence
- Control dependence
- Resource dependence

Name dependences

- Output dependence: instructions i and j write the same register or memory location. The ordering must be preserved to leave the correct value in the register.
    i: add r7, r4, r3
    j: div r7, r2, r8
- Anti-dependence: instruction j writes a register or memory location that instruction i reads.
    i: add r6, r5, r4
    j: sub r5, r8, r11
  (a register-renaming sketch that removes both orderings follows below)

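Because only the register name is shared in both cases above (no value actually flows from i to j), the ordering can be removed by renaming the destination registers. The sketch below is a minimal illustration of that idea; the (opcode, dest, src1, src2) tuples and the pool of fresh register names are assumptions made for the example.

# Remove name dependences (output and anti) by renaming destination registers.
# Instructions are modeled as (opcode, dest, src1, src2).
def rename(instructions, fresh_regs):
    newest = {}                        # architectural register -> newest name
    renamed = []
    for op, dest, s1, s2 in instructions:
        s1 = newest.get(s1, s1)        # sources read the most recent name
        s2 = newest.get(s2, s2)
        new_dest = fresh_regs.pop(0)   # every write gets a fresh register,
        newest[dest] = new_dest        # so WAW and WAR orderings disappear
        renamed.append((op, new_dest, s1, s2))
    return renamed

prog = [("add", "r7", "r4", "r3"),     # output dependence on r7 with the div
        ("div", "r7", "r2", "r8")]
print(rename(prog, ["p1", "p2", "p3"]))
# [('add', 'p1', 'r4', 'r3'), ('div', 'p2', 'r2', 'r8')] -- destinations no longer clash
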
Data Dependences

An instruction j is data dependent on instruction i if either of the following holds:
- instruction i produces a result that may be used by instruction j, or
- instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i

Example (the dependence chains are traced in the sketch below):
LOOP:  LD   F0, 0(R1)
       ADD  F4, F0, F2
       SD   F4, 0(R1)
       SUB  R1, R1, -8
       BNE  R1, R2, LOOP

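Applying the definition to the loop body above: ADD depends on LD through F0, SD depends on ADD through F4, and BNE depends on SUB through R1. A minimal sketch of how such chains can be read off from register definitions and uses follows; the tuple encoding of the instructions is an illustrative assumption.

# Find direct (true) data dependences in a straight-line block by tracking
# the last writer of each register. Instructions are (name, dest, [sources]).
block = [
    ("LD",  "F0", ["R1"]),
    ("ADD", "F4", ["F0", "F2"]),
    ("SD",  None, ["F4", "R1"]),   # store: no register destination
    ("SUB", "R1", ["R1"]),
    ("BNE", None, ["R1", "R2"]),
]

last_writer = {}
for name, dest, srcs in block:
    for src in srcs:
        if src in last_writer:
            print(f"{name} depends on {last_writer[src]} through {src}")
    if dest is not None:
        last_writer[dest] = name
# Prints: ADD depends on LD through F0; SD depends on ADD through F4;
#         BNE depends on SUB through R1.
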
Control Dependences

- A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order.
- Example (S1 is control dependent on p1; S2 is control dependent on p2 but not on p1):
    if p1 {
      S1;
    };
    if p2 {
      S2;
    };
- Two constraints imposed by control dependences:
  1. An instruction that is control dependent on a branch cannot be moved before the branch (illustrated in the sketch below)
  2. An instruction that is not control dependent on a branch cannot be moved after the branch

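A small, hypothetical illustration of the first constraint (the guarded division is invented for this example and is not from the slides): hoisting a statement that is control dependent on a branch above that branch changes the program's behavior when the guard is false.

# Constraint 1: an instruction that is control dependent on a branch
# cannot be moved before the branch without changing behavior.
def original(p2, x):
    if p2:                  # S2 is control dependent on the test of p2
        return 100 // x     # S2
    return 0

def hoisted(p2, x):
    s2 = 100 // x           # S2 moved above the branch: it now always executes
    return s2 if p2 else 0

print(original(False, 0))   # 0 -- the guarded division never runs
try:
    print(hoisted(False, 0))
except ZeroDivisionError:
    print("hoisted version faults -- behavior changed")
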
Resource dependences

- An instruction is resource-dependent on a previously issued instruction if it requires a hardware resource that is still being used by a previously issued instruction (a small model of the resulting stall follows below).
- Example:
    div r1, r2, r3
    div r4, r2, r5
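With a single, non-pipelined divide unit, the second div above cannot issue until the first releases the unit, even though the two have no data dependence. The sketch below models that stall; the 10-cycle divide latency and the single-unit assumption are illustrative, not from the slides.

# Model a resource (structural) dependence: one non-pipelined divide unit.
DIV_LATENCY = 10            # assumed latency, in cycles
unit_free_at = 0            # cycle at which the divide unit becomes free

def issue_div(ready_cycle):
    global unit_free_at
    start = max(ready_cycle, unit_free_at)   # stall while the unit is busy
    unit_free_at = start + DIV_LATENCY
    return start

print(issue_div(0))   # 0  -- div r1, r2, r3 starts immediately
print(issue_div(1))   # 10 -- div r4, r2, r5 must wait for the unit
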

ILP Architectures

- Computer architecture: a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs that are written for the architecture and the set of processor implementations of that architecture.
- In ILP architectures, the contract additionally includes information embedded in the program pertaining to the available parallelism between the instructions and operations in the program.

ILP Architectures: Classifications

- Sequential architectures: the program is not expected to convey any explicit information regarding parallelism (superscalar processors)
- Dependence architectures: the program explicitly indicates the dependences that exist between operations (dataflow processors)
- Independence architectures: the program provides information as to which operations are independent of one another (VLIW processors)

Sequential architecture and superscalar processors

- The program contains no explicit information regarding the dependences that exist between instructions
- Dependences between instructions must be determined by the hardware
  - It is only necessary to determine dependences with sequentially preceding instructions that have been issued but not yet completed
- The compiler may re-order instructions to facilitate the hardware’s task of extracting parallelism

Superscalar Processors

- Superscalar processors attempt to issue multiple instructions per cycle (a minimal issue-check sketch follows below)
  - However, essential dependences are specified by the sequential ordering, so operations must be processed in sequential order
  - This proves to be a performance bottleneck that is very expensive to overcome

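A minimal sketch of the kind of run-time check this implies: try to issue two instructions per cycle, in program order, but hold the second one back if it reads a result produced by the first. The two-wide width, single-cycle latencies, and the (name, dest, sources) encoding are simplifying assumptions for illustration only.

# In-order, 2-wide issue with a RAW check against the instruction paired
# in the same cycle. Instructions are (name, dest, [sources]).
def dual_issue(program):
    cycles = []
    i = 0
    while i < len(program):
        group = [program[i]]
        if i + 1 < len(program):
            nxt = program[i + 1]
            dest = program[i][1]
            if dest is None or dest not in nxt[2]:
                group.append(nxt)          # independent: issue together
        cycles.append([ins[0] for ins in group])
        i += len(group)
    return cycles

prog = [("load",  "R1", ["R2"]),
        ("add1",  "R3", ["R3"]),
        ("add2",  "R4", ["R3"]),   # reads R3 produced by add1
        ("store", None, ["R4"])]   # reads R4 produced by add2
print(dual_issue(prog))            # [['load', 'add1'], ['add2'], ['store']]
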
Dependence architecture and dataflow processors

- The compiler (or programmer) identifies the parallelism in the program and communicates it to the hardware by specifying the dependences between operations
- The hardware determines at run time when each operation is independent of the others, and performs the scheduling
- No scanning of a sequential program is needed to determine dependences
- Objective: execute each instruction at the earliest possible time (when its input operands and a functional unit are available)

Dependence architectures: Dataflow processors

- Dataflow processors are representative of dependence architectures
- They execute an instruction at the earliest possible time, subject to the availability of input operands and functional units (the firing rule is sketched below)
- Dependences are communicated by providing, with each instruction, a list of all its successor instructions
- As soon as all input operands of an instruction are available, the hardware fetches the instruction
- The instruction is executed as soon as a functional unit is available
- Few dataflow processors currently exist

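A minimal sketch of the firing rule just described: every operation lists its successors, and an operation fires as soon as all of its input operands have arrived. The operand-count representation and the tiny example graph are assumptions made for illustration.

# Dataflow firing: an operation executes when all its input operands have
# arrived; its result is then sent to the successor operations it lists.
from collections import deque

# op -> (number of operands it waits for, list of successor ops)
graph = {
    "load_a": (0, ["add"]),
    "load_b": (0, ["add"]),
    "add":    (2, ["store"]),
    "store":  (1, []),
}

pending = {op: need for op, (need, _) in graph.items()}
ready = deque(op for op, need in pending.items() if need == 0)

while ready:
    op = ready.popleft()
    print("fire", op)
    for succ in graph[op][1]:
        pending[succ] -= 1               # an operand token arrives
        if pending[succ] == 0:
            ready.append(succ)           # all operands present: ready to fire
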
Dataflow strengths and limitations

- Dataflow processors use control parallelism alone to fully utilize the functional units
- A dataflow processor is more successful than other designs at looking far down the execution path to find control parallelism
- When successful, this is better than speculative execution:
  - Every instruction that is executed is useful
  - The processor does not have to deal with the error conditions that speculative operations can raise

Independence architecture and VLIW processors

- By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
- The set of independent operations is far larger than the set of dependent operations
  - Only a subset of the independent operations is specified
- The compiler may additionally specify on which functional unit and in which cycle an operation is executed
  - The hardware then needs to make no run-time decisions

VLIW processors

- Operation vs. instruction:
  - Operation: a unit of computation (add, load, branch; equivalent to an instruction in a sequential architecture)
  - Instruction: a set of operations that are intended to be issued simultaneously
- The compiler decides which operations go into each instruction (scheduling); a minimal packing sketch follows below
- All operations that are supposed to begin at the same time are packaged into a single VLIW instruction

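A minimal sketch of that packing step: greedily place operations into the current VLIW instruction as long as they do not read a result produced inside the same instruction and a slot is still free. The two-slot width and the tuple encoding are illustrative assumptions; real VLIW schedulers also model functional-unit types and latencies.

# Greedy compile-time packing of operations into VLIW instructions.
# Operations are (name, dest, [sources]); WIDTH slots per instruction.
WIDTH = 2

def pack(operations):
    bundles = [[]]
    written = set()              # registers written by the current instruction
    for name, dest, srcs in operations:
        depends = any(s in written for s in srcs)
        if depends or len(bundles[-1]) == WIDTH:
            bundles.append([])   # start a new VLIW instruction
            written = set()
        bundles[-1].append(name)
        if dest is not None:
            written.add(dest)
    return bundles

ops = [("load",  "R1", ["R2"]),
       ("add1",  "R3", ["R3"]),
       ("add2",  "R4", ["R3"]),   # needs R3 from add1, so it starts a new bundle
       ("store", None, ["R4"])]
print(pack(ops))                  # [['load', 'add1'], ['add2'], ['store']]
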
VLIW strengths

- The hardware is very simple:
  - a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches
- More silicon goes to the actual processing (rather than being spent on branch prediction, for example)
- It should run fast, as the only limit is the latency of the functional units themselves
- Programming a VLIW chip is very much like writing microcode

VLIW limitations

- The need for a powerful compiler
- Increased code size arising from aggressive scheduling policies
- Larger memory-bandwidth and register-file-bandwidth requirements
- Limitations due to lock-step operation
- Binary compatibility across implementations with varying numbers of functional units and latencies

Summary: ILP Architectures

Sequential Architecture (typical ILP processor: superscalar)
- Additional info required in the program: none
- Dependence analysis: performed by HW
- Independence analysis: performed by HW
- Scheduling: performed by HW
- Role of compiler: rearranges the code to make the analysis and scheduling HW more successful

Dependence Architecture (typical ILP processor: dataflow)
- Additional info required in the program: specification of the dependences between operations
- Dependence analysis: performed by the compiler
- Independence analysis: performed by HW
- Scheduling: performed by HW
- Role of compiler: replaces some analysis HW

Independence Architecture (typical ILP processor: VLIW)
- Additional info required in the program: minimally, a partial list of independences; a complete specification of when and where each operation is to be executed
- Dependence analysis: performed by the compiler
- Independence analysis: performed by the compiler
- Scheduling: performed by the compiler
- Role of compiler: replaces virtually all the analysis and scheduling HW

ILP Scheduling

Static scheduling boosted by parallel code optimization
- Done by the compiler
- The processor receives dependency-free, optimized code for parallel execution
- Typical for VLIWs and a few pipelined processors (e.g. MIPS)

Dynamic scheduling without static parallel code optimization
- Done by the processor
- The code is not optimized for parallel execution; the processor detects and resolves dependences on its own
- Early ILP processors (e.g. CDC 6600, IBM 360/91)

Dynamic scheduling boosted by static parallel code optimization
- Done by the processor in conjunction with a parallel optimizing compiler
- The processor receives code optimized for parallel execution, but it still detects and resolves dependences on its own
- Usual practice for pipelined and superscalar processors (e.g. RS/6000)

ILP Scheduling: Trace Scheduling

- An optimization technique that has been widely used for VLIW, superscalar, and pipelined processors.
- It selects a sequence of basic blocks as a trace and schedules the operations from the trace together.
- Example (a trace that continues past a conditional branch):
    Instr1
    Instr2
    Branch x
    Instr3

Trace Scheduling

- Extracts more ILP
- Increases machine fetch bandwidth by storing logically consecutive blocks in physically contiguous cache locations (making it possible to fetch multiple basic blocks in one cycle)
- Trace scheduling can be implemented in hardware or in software

Trace Scheduling in HW

- The hardware technique makes use of the large amount of information available during dynamic execution to form traces dynamically and to schedule the instructions in a trace more efficiently.
- Since the dependences and memory access addresses have already been resolved during dynamic execution, instructions in a trace can be reordered more easily and efficiently.
- Example: the trace cache approach

Trace Scheduling in SW

- A supplement for machines without hardware trace-scheduling support.
- Forms traces based on static profile data (a minimal trace-selection sketch follows below), and schedules instructions using traditional compiler scheduling and optimization techniques.
- It faces some difficulties, such as code explosion and exception handling.

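A minimal sketch of the profile-driven trace selection step just described: starting from a seed block, repeatedly follow the most frequently taken successor until a block would repeat. The control-flow graph and edge counts are invented for illustration; real trace schedulers (e.g. Fisher's) apply many more rules.

# Profile-driven trace selection: grow a trace along the most frequently
# executed control-flow edges, stopping when a block would repeat.
successors = {                 # block -> {successor: profiled execution count}
    "B1": {"B2": 90, "B3": 10},
    "B2": {"B4": 88, "B5": 2},
    "B3": {"B4": 10},
    "B4": {"B1": 95, "B6": 5}, # back edge to B1 ends the trace
    "B5": {"B4": 2},
    "B6": {},
}

def select_trace(seed):
    trace, block = [], seed
    while block is not None and block not in trace:
        trace.append(block)
        succs = successors[block]
        block = max(succs, key=succs.get) if succs else None
    return trace

print(select_trace("B1"))   # ['B1', 'B2', 'B4'] -- the hot path through the loop
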
ILP open problems

- Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions; two simple types of pipelining (structural and functional).
- Controller cost: most scheduling algorithms do not consider the controller cost, which depends directly on the controller style used during scheduling.
- Area constraints: the resource-constrained algorithms could have better interaction between scheduling and floorplanning.
- Realism:
  - Scheduling realistic design descriptions that contain several special language constructs.
  - Using more realistic libraries and cost functions.
  - Expanding scheduling algorithms to incorporate different target architectures.

References

- B. Ramakrishna Rau, Joseph A. Fisher. Instruction-Level Parallel Processing: History, Overview and Perspective. Journal of Supercomputing, Vol. 7, No. 1, Jan. 1993, pages 9-50.
- Monica S. Lam, Robert P. Wilson. Limits of Control Flow on Parallelism. 19th ISCA, May 1992, pages 19-21.
- Joseph A. Fisher. Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2. Technical Report HPL-93-43, HP Labs, Jun. 1993.
- VLIW at IBM Research. http://www.research.ibm.com/vliw
- Dick Pountain. Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC. http://www.byte.com/art/9604/sec8/art3.htm
- Hardware and Software Trace Scheduling. http://charlotte.ucsd.edu/users/yhu/paperlist/summary.html
- ILP open problems. http://www.ececs.uc.edu/~ddel/projects/dss/hls_paper/node9.html
- John L. Hennessy, David A. Patterson. Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann.