Lecture 6: VLIW Processors What about Superscalar? Introduction and motivation

advertisement
2015-11-17
Lecture 6: VLIW Processors




Introduction and
motivation
Loop unrolling
IA-64 Architecture
Features of Itanium
Zebo Peng, IDA, LiTH
1
TDTS 08 – Lecture 6
What about Superscalar?
It is a good architecture from two perspectives:
 The hardware solves everything:
 It detects potential parallelism between instructions.
 It tries to issue as many instructions as possible in
parallel.
 It uses register renaming to increase parallelism.

Binary compatibility:
 If functional units are added or other improvements are
made (without changing the instruction set), programs
don’t need to be changed.
 Old programs can benefit from the additional machine
parallelism, since the new hardware will simply issue
instructions in a more efficient way.
Zebo Peng, IDA, LiTH
2
TDTS 08 – Lecture 6
1
2015-11-17
Problems with Superscalar

The architecture is very complex.
 A lot of hardware is needed for run-time detection of
parallelism.
 It consumes very much power (advanced cooling
technique may be needed).
 There is, therefore, a limit in how far we can go with this
technique.

The instruction window for execution is limited in size.
 This limits the capacity to detect large number of
parallel instructions.
Zebo Peng, IDA, LiTH
3
TDTS 08 – Lecture 6
Very Long Instruction Word Processors

In a VLIW (also called Very Large Instruction Word) processor,
several operations that can be executed in parallel are placed an
a single instruction word.
Instruction 1
Instruction 2
Instruction 3

op1
op1
Ø
op2
Ø
op2
op3
op3
op3
op4
op4
Ø
VLIW architectures rely on compile-time detection of parallelism.
 The compiler analyzes the program and detects operations to be
executed in parallel.

After one instruction has been fetched all the corresponding
operations are issued in parallel.

The instruction window limitation disappears: the compiler can
potentially analyze the whole program to detect parallel operations.
 No hardware is needed for run-time detection of parallelism.
Zebo Peng, IDA, LiTH
4
TDTS 08 – Lecture 6
2
2015-11-17
Typical Organization
Memory
System
Functional
Unit 1
Instruction
Fetch
Unit
Instruction
Decode
Unit
Functional
Unit 3
Functional
Unit 4
Register Files
Functional
Unit 2
…
FI
DI
EI
WO
Functional
Unit n
Execution Unit
Zebo Peng, IDA, LiTH
5
TDTS 08 – Lecture 6
Explicit Parallelism

Instruction parallelism scheduled at compile time.
 Included within the machine instructions explicitly.

An EPIC (Explicitly Parallel Instruction Computing) processor
uses this information to perform parallel execution.

The hardware is very much simplified.
 The controller is similar to a simple scalar computer.
 The number of FUs can be increased without needing additional
sophisticated hardware to detect parallelism, as in SSA.

Compiler has much more time to determine parallel operations.
 This analysis is only done once off-line, while run-time detection is
carried out by SSA hardware for each execution of the code.
 Good compilers can detect parallelism based on global analysis of
the whole program.
Zebo Peng, IDA, LiTH
6
TDTS 08 – Lecture 6
3
2015-11-17
Main Issues

A large number of registers is needed in order to keep all FUs
active (to store operands and results).

Large data transport capacity is needed between FUs and the
register files and between register files and memory.

High bandwidth between instruction cache and fetch unit is also
needed due to long instructions.
 Ex: each instruction with 8 operations, each 24 bits 
192 bits/instruction (six 32-bit words).
Large code size, partially because unused operations  wasted
bits in instruction words.

op1
op1
Ø
op1
Zebo Peng, IDA, LiTH
op
Ø2
Ø
op2
Ø
op3
op
Ø3
op3
op3
7
op4
op4
Ø
op4
TDTS 08 – Lecture 6
Software Issues

Incompatibility of binary code:
 If a new version of the processor introduces additional
FUs, the number of operations to execute in parallel is
increased.
 Therefore, the instruction word changes, and old binary
code cannot be run on the new processor.

We might not have sufficient parallelism in the program
to utilize the large degree of machine parallelism.
 Hardware resources will be wasted in this case.
 A lot of memory space will also be wasted.
 A technique to address this problem is loop unrolling,
which can be performed by a compiler.
Zebo Peng, IDA, LiTH
8
TDTS 08 – Lecture 6
4
2015-11-17
Lecture 6: VLIW Processors




Introduction and
motivation
Loop unrolling
IA-64 Architecture
Features of Itanium
Zebo Peng, IDA, LiTH
9
TDTS 08 – Lecture 6
An Example
for (i=959; i>=0; i--)
x[i] = x[i] + s;
Assumptions:
x is an array of floating
point values;
s is a floating point
constant.
Memory allocation:

R1 initially contains the address of the
last element in x; the other elements
are at lower addresses; x[0] is at
address 0.

Floating point register F2 contains the
value s.

A floating point value is 8 bytes long.
For an ordinary processor, this C code will be compiled to:
Loop: LDD
ADF
STD
SBI
BGEZ
Zebo Peng, IDA, LiTH
F0,(R1)
F4,F0,F2
(R1),F4
R1,R1,#8
R1,Loop
F0:= x[i];(load double)
F4:= F0+F2;(add floating point)
x[i]:= F4;(store double)
R1:= R1-8;
branch if R1 ≥ 0.
10
TDTS 08 – Lecture 6
5
2015-11-17
A VLIW Processor Example

Two memory references, two FP operations, and one
integer operation or branch can be packed in an
instruction, in the following way:
Instruction

Mem R Mem R
FP 1
FP 2
I/BRA
Assuming also that:
 Integer operations take one cycle to complete.
 The delay for a double word load is one additional
clock cycles.
 The delay for a floating point operation is two
additional clock cycles.
Zebo Peng, IDA, LiTH
11
TDTS 08 – Lecture 6
Loop Unrolling
Let us rewrite the example:
Loop unrolling: a technique used
in compilers in order to increase the
potential of parallelism in a program.
 It supports more efficient code
generation for processors with
instruction level parallelism.
for (i=959; i>=0; i-=2){
x[i] = x[i] + s;
x[i-1] = x[i-1] + s;
}
For an ordinary processor, this new code will be compiled to:
Loop: LDD
ADF
STD
LDD
ADF
STD
SBI
BGEZ
Zebo Peng, IDA, LiTH
F0,(R1)
F4,F0,F2
(R1),F4
F6,(R1-8)
F8,F6,F2
(R1-8),F8
R1,R1,#16
R1,Loop
F0:=x[i];(load double)
F4:=F0+F2;(add floating pnt)
x[i]:=F4;(store double)
F6:=x[i-1];(load double)
F8:=F6+F2;(add floating pnt)
x[i-1]:=F8;(store double)
R1:=R1-16;
branch if R1 ≥ 0.
12
TDTS 08 – Lecture 6
6
2015-11-17
Loop Unrolling (2 iterations)
Cycle
1 LDD F0,(R1) LDD F6,(R1-8)
2
3
4
5
6 STD(R1+16),F4 STD(R1+8),F8
ADF F4,F0,F2 ADF F8,F6,F2
SBI R1,R1,#16
BGEZ R1,Loop

There is an increased degree of parallelism in this case.

Still we have two completely empty cycles and many empty
operation slots.

Nevertheless, we have basically double the performance:
 Two iterations take 6 cycles
 The whole loop takes 480*6 = 2880 cycles
Zebo Peng, IDA, LiTH
13
TDTS 08 – Lecture 6
Loop Unrolling (4 iterations)
Cycle
1 LDD F0,(R1) LDD F6,(R1-8)
2 LDD F10,(R1-16) LDD F15,(R1-24)
ADF F4,F0,F2 ADF F8,F6,F2
3
ADF
F12,F10,F2 ADF F16,F15,F2
4
5
STD(R1),F4
STD(R1-8),F8
6
7 STD(R1+16),F12 STD(R1+24),F8
SBI R1,R1,#32
BGEZ R1,Loop

The degree of parallelism is further improved.

There is still an empty cycle and many empty operation slots.

Four iterations take 7 cycles, and the whole loop takes now
about 240*7 = 1680 cycles.

With 8 unrolled iteration, we need 120*9 = 1080 cycles (5.3
times better performance than the unrolled version).
Zebo Peng, IDA, LiTH
14
TDTS 08 – Lecture 6
7
2015-11-17
Loop Unrolling in General

Given a certain set of resources (processor architecture) and a
given loop, there is a limit on how many iterations should be
unrolled.
 Beyond that limit there is no gain in performance any more.

Loop unrolling increases the memory space needed to store the
machine code.
Original code
Loop: LDD
ADF
STD
SBI
BGEZ
Code with 2 iterations:
Loop: LDD
ADF
STD
LDD
ADF
STD
SBI
BGEZ
F0,(R1)
F4,F0,F2
(R1),F4
R1,R1,#8
R1,Loop
Zebo Peng, IDA, LiTH
F0,(R1)
F4,F0,F2
(R1),F4
F6,(R1-8)
F8,F6,F2
(R1-8),F8
R1,R1,#16
R1,Loop
15
TDTS 08 – Lecture 6
Loop Unrolling in General (Cont’d)

A compiler has to find the optimal level of unrolling for each loop.

We need also hardware support to keep a VLIW processor busy:
 Large number of registers (in order to store data for operations
which are active in parallel);
 Large traffic has to be supported in parallel:
• register files  memory
• register files  functional units
Original code
Loop: LDD
ADF
STD
SBI
BGEZ
Code with 2 iterations:
Loop: LDD
ADF
STD
LDD
ADF
STD
SBI
BGEZ
F0,(R1)
F4,F0,F2
(R1),F4
R1,R1,#8
R1,Loop
Zebo Peng, IDA, LiTH
16
F0,(R1)
F4,F0,F2
(R1),F4
F6,(R1-8)
F8,F6,F2
(R1-8),F8
R1,R1,#16
R1,Loop
TDTS 08 – Lecture 6
8
2015-11-17
Lecture 6: VLIW Processors




Introduction and
motivation
Loop unrolling
IA-64 Architecture
Features of Itanium
Zebo Peng, IDA, LiTH
17
TDTS 08 – Lecture 6
Commercial VLIW Processors

Example of successful VLIW processors:
 TriMedia of Philips.
 TMS320C6x of Texas Instruments.
Both are targeting the multimedia market (where there
was not much compiled legacy code to support).

The IA-64 architecture from Intel and Hewlett-Packard.
 This family uses many of the VLIW ideas.
 It is not "just" a multimedia processor, but intended to
become "the new generation" processor for servers
and workstations.
 The first product in 2001: Itanium
Zebo Peng, IDA, LiTH
18
TDTS 08 – Lecture 6
9
2015-11-17
The IA-64 Architecture

IA-64 is not a pure VLIW architecture, but many of its features
are typical for VLIW processors.

These are typical VLIW features of IA-64:
 Instruction-level parallelism fixed at compile-time.
 (Very) long instruction word (128 bits).

These are specific for the IA-64 Architecture:
 Flexibility of the number of concurrent operations.
 Branch predication (NOT prediction!).
 Speculative loading.
Zebo Peng, IDA, LiTH
19
TDTS 08 – Lecture 6
IA-64 Basic architecture
Memory
System
Functional
Unit
…
Instruction
Fetch
Unit


Functional
Unit
Instruction
Decode &
Control Unit
64 predicate
registers
Functional
Unit
Registers (both integer and
floating point) are 64-bit.
Predicate registers are 1-bit.
8 or more functional units.
Zebo Peng, IDA, LiTH
…

128 registers
for integers
128 registers
for FPs
Functional
Unit
20
TDTS 08 – Lecture 6
10
2015-11-17
Instruction Format
128 bits
Operation 1
Operation 2
Operation 3
Template
5 bits
Op
code
Reg
Reg
Reg
Pred
reg
41 bits

Three operations are specified in an instruction word.


However, the three operations are not necessarily to be executed in parallel!
Instead, the template indicates what can be executed in parallel.

The encoding in the template shows which of the operations in the
instruction can be executed in parallel.

The template connects also neighboring instructions. Therefore,
operations from different instructions can also be executed in parallel.
Zebo Peng, IDA, LiTH
21
TDTS 08 – Lecture 6
Instruction Format (cont’d)

The template provides high flexibility and avoids several problems
with classical VLIW processors:
 Operations in one instruction don’t have to be parallel.
• No places will be left empty when no operation is there to be
executed in parallel.
 The number of operations to be executed in parallel is not
restricted by the instruction size.
• New processor generations can have different number of
functional units without changing the instruction format.
• Improved binary compatibility.
 If, according to the template, more operations can be executed in
parallel than the functional units available, the processor will simply
take them sequentially.
Zebo Peng, IDA, LiTH
22
TDTS 08 – Lecture 6
11
2015-11-17
Predicated Execution

Any operation can refer to a predicate register.
<Pi> operation
where i is the number of a predicate register
(between 0 and 63)

This means that an operation is to be committed (the results made
permanent) only when the respective predicate is true (i.e., the
predicate register gets value 1).

If the predicate value is known when the operation is issued, the
operation is executed only if this value is true.

If the predicate is not known at that moment, the operation will
also be started. If the predicate turns out to be false, the operation
is discarded.

If no predicate register is mentioned, the operation is executed
and committed unconditionally.
Zebo Peng, IDA, LiTH
23
TDTS 08 – Lecture 6
Branch Predication (!)

Branch predication is an aggressive compilation technique to
generate code with higher degree of instruction level parallelism.

It lets operations from both branches of a conditional branch to be
executed in parallel, to increase the amount of parallel operations.

In this way, branches are eliminated and replaced by conditional
execution.
 Hardware support is needed, as implemented in the IA-64
architecture.
The idea is: let instructions from both branches go on in parallel,
before the branch condition is known. The hardware takes care
that only instructions of the right branch will be finally committed.
Zebo Peng, IDA, LiTH
24
TDTS 08 – Lecture 6
12
2015-11-17
Branch Predication Example
Instruction 1
For a branch instruction, the
compiler assigns a predicate
to each of the two following
instruction paths.
Instruction 2
Instruction 3
(branch)
<P1> Instruction 4
<P2> Instruction 7
<P1> Instruction 5
<P2> Instruction 8
<P1> Instruction 6
<P2> Instruction 9
CPU can execute
instructions from different
paths concurrently, but
only the correct path will
finally be committed.
For a VLIW machine, the instructions may be arranged as follows:
Instruction 1
Instruction 2
Instruction 3
<P1> Instruction 4
<P2> Instruction 7
<P1> Instruction 5
<P2> Instruction 8
<P1> Instruction 6
<P2> Instruction 9
Zebo Peng, IDA, LiTH
25
TDTS 08 – Lecture 6
Branch Predication (cont’d)
Branch predication is not branch prediction!

Branch prediction:
Guess for one branch and then go along that one. If the guess is
not correct, undo all the work, and go to the right one.

Branch predication:
Both branches are started and when the condition is known, the
right instructions are committed, the others are discarded.
-
There is no lost time with failed predictions.
-
You pay by introducing duplicated hardware.
-
Computation on some hardware will be wasted, instead of
doing useful things!
Zebo Peng, IDA, LiTH
26
TDTS 08 – Lecture 6
13
2015-11-17
Placement of Loading
A load-from-memory operation should be placed so that memory
latency is avoided, i.e., the value is there when it’s needed:
ADD R4, R3
LOAD R1, X
ADD R2, R1
R4 := R4 + R3
Loads from address X into R1
R2 := R2 + R1
LOAD R1, X
ADD R4, R3
ADD R2, R1
Loads from address X into R1
R4 := R4 + R3
R2 := R2 + R1
Zebo Peng, IDA, LiTH
27
TDTS 08 – Lecture 6
Speculative Loading

What if a load is moved across a branch?
 Ex. from within an ELSE-branch to before the conditional branch.
<P2> Instruction 8
(load data)
Instruction 1
Instruction 2
Instruction 3
(branch)
<P1> Instruction 4
<P2> Instruction 7
<P1> Instruction 5
<P2> Instruction 8
(load data)
<P1> Instruction 6
Zebo Peng, IDA, LiTH
<P2> Instruction 9
28
TDTS 08 – Lecture 6
14
2015-11-17
Speculative Loading

What if a load is moved across a branch?
 Ex. from within an ELSE-branch to before the conditional branch.

The load will be executed for both branches.
 This shouldn’t be a problem if hardware resources are available.
 However, if, for example, a page fault is generated, handling such
an exception takes extremely long time and resources.
• A whole page has to be brought in from the secondary
memory to the main memory.
 If the load will turn out to be not needed, it wastes a huge amount
of time, and may replace a useful page.

We would therefore like to handle the page faults (exception)
only when we really need the loaded value.

This is solved by speculative loading.
Zebo Peng, IDA, LiTH
29
TDTS 08 – Lecture 6
Speculative Loading (cont’d)

With speculative loading, a load instruction in the original
program is replaced by two instructions:
 A speculative load (LD.s) which performs the memory fetch and
detects if an exception (like page fault) is generated. The
exception, however is not signaled to the operating system.
• The speculative load is the one which is moved for latency
reduction.
 A checking instruction (CHK.s) which signals the exception if one
has been detected by the speculative load.
• If no exception has been detected, nothing happens.
• The checking instruction remains in the original place.
Zebo Peng, IDA, LiTH
30
TDTS 08 – Lecture 6
15
2015-11-17
Speculative Loading Example
Speculative
Load
Instruction 1
The load will only be
done, if there is no
exception.
Instruction 2
Instruction 3
(branch)
<P1> Instruction 4
<P2> Instruction 7
<P1> Instruction 5
<P2> Speculative
Check
<P1> Instruction 6
Exception handling
will occur here, if
needed.
<P2> Instruction 9
Zebo Peng, IDA, LiTH
31
TDTS 08 – Lecture 6
Superscalar vs IA-64
Superscalar
IA-64 (improved VLIW)
Multiple parallel execution units
Multiple parallel execution units
RISC-line instructions, one per
word
RISC-line instructions, bundled
into group of three
Reorder and optimize instruction
stream at run time
Reorder and optimize instruction
stream at compiler time
Branch prediction with speculative Speculative execution along both
execution of one path
paths of a branch
Load data from memory (caches)
only when needed.
Zebo Peng, IDA, LiTH
32
Speculatively load data before it is
needed (caches or memory).
TDTS 08 – Lecture 6
16
2015-11-17
VLIW Summary

In order to keep up with increasing demands on performance, superscalar
architectures become excessively complex.

VLIW architectures avoid hardware complexity by relying on the compiler for
parallelism detection.

Operation level parallelism is explicit in the instructions of a VLIW
processor.

Not having to deal with parallelism detection, VLIW processors can have a
higher number of functional units.

This generates the need for a large number of registers and high
communication bandwidth.

In order to keep the large number of functional units busy, compilers for
VLIW processors have to be aggressive in parallelism detection.

Loop unrolling is used in order to increase the degree of parallelism in
loops: several iterations of a loop are unrolled and handled in parallel.
Zebo Peng, IDA, LiTH
33
TDTS 08 – Lecture 6
IA/64 Architecture Summary

The IA-64 family uses VLIW techniques for implementing high-end
processors. Itanium is the first member of this family.

The instruction format avoids empty operations and allows more
flexibility in parallelism specification, by using the template field.

Beside typical VLIW features, the Itanium architecture uses branch
predication and speculative loading.

Branch predication is based on predicated execution:
 The execution of an instruction is connected to a predicate register,
which will be committed only if the predicate becomes true.
 It allows instructions from both branches to be started in parallel.

Speculative loading tries to reduce latencies generated by load
instructions.
 It allows load instructions to be moved across branch boundaries.
 Page exceptions will be handled only if really needed.
Zebo Peng, IDA, LiTH
34
TDTS 08 – Lecture 6
17
Download