Embedded Computer Architecture
Exploiting ILP
VLIW architectures
TU/e 5KK73
Henk Corporaal
What are we talking about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions),
from a single instruction stream,
in parallel
VLIW = Very Long Instruction Word architecture
Example instruction format of a 5-issue VLIW:
| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |
4/8/2015, Embedded Computer Architecture, H. Corporaal and B. Mesman
Single Issue RISC vs VLIW
[Figure: the compiler translates one sequential instruction stream into VLIW instructions. The RISC CPU executes 1 instr/cycle (one op each); the 3-issue VLIW also executes 1 instr/cycle, but each instruction carries 3 operation slots (3 ops/cycle), with nops filling slots the compiler cannot use.]
Topics Overview
• How to speed up your processor?
– What options do you have?
• Operation/Instruction Level Parallelism
– Limits on ILP
• VLIW
– Examples
– Clustering
• Code generation (2nd slide-set)
• Hands-on
Speed-up
Pipelined Execution of Instructions

Simple 5-stage pipeline:
  IF : Instruction Fetch
  DC : Instruction Decode
  RF : Register Fetch
  EX : Execute instruction
  WB : Write Result Register

[Figure: four instructions flowing through the pipeline over cycles 1-8; each cycle a new instruction enters IF while the earlier ones advance one stage.]
Purpose of pipelining:
• Reduce the number of gate levels in the critical path
• Reduce CPI to close to one (instead of a large number, as for the multicycle machine)
• More efficient hardware

Problems: hazards cause pipeline stalls
• Structural hazards: add more hardware
• Control hazards (branch penalties): use branch prediction
• Data hazards: bypassing required
Speed-up
Pipelined Execution of Instructions
Superpipelining:
• Split one or more of the critical pipeline stages
• Superpipelining degree S:

    S(architecture) = Σ_{Op ∈ I_set} f(Op) * lt(Op)

  where:
    f(Op) is the frequency of operation Op
    lt(Op) is the latency of operation Op
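The degree S is just the frequency-weighted average operation latency. A small sketch, with an illustrative operation mix that is an assumption (not from the slides):

```python
# Superpipelining degree S = sum over Op in I_set of f(Op) * lt(Op):
# the frequency-weighted average operation latency.
# The mix and latencies below are assumed example values.

op_mix = {
    # Op: (frequency f(Op), latency lt(Op) in cycles)
    "alu":    (0.45, 1),
    "load":   (0.25, 2),
    "store":  (0.10, 1),
    "branch": (0.15, 2),
    "mul":    (0.05, 3),
}

# frequencies must sum to 1
assert abs(sum(f for f, _ in op_mix.values()) - 1.0) < 1e-9

S = sum(f * lt for f, lt in op_mix.values())
print(f"S = {S:.2f}")  # S = 1.50
```

S > 1 means the average operation spans more than one pipeline stage, i.e. the machine is superpipelined with respect to that mix.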
Speed-up
Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction example. Source:

    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];    /* or simply: c = a + 5*b */

Assembly:

    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
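The vector computation above, written element-wise in plain Python; each pair of elements corresponds to one lane of the ldv/mulvi/addv/stv sequence (the input values are assumed):

```python
# c[i] = a[i] + 5*b[i], the loop a single vector instruction sequence
# replaces; 64 elements, matching "set vl,64" above.

a = list(range(64))
b = list(range(64))

c = [ai + 5 * bi for ai, bi in zip(a, b)]

print(c[:4])  # [0, 6, 12, 18]
```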
Speed-up
Powerful Instructions (1)
SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploits data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)

[Figure: SIMD execution method: the same instruction stream (Instruction 1 .. Instruction n) is broadcast over time to all nodes (node 1 .. node K).]
Speed-up
Powerful Instructions (1)
• Sub-word parallelism
  – SIMD on a restricted scale
  – Used for multimedia instructions
• Examples
  – MMX, SSE, Sun VIS, HP MAX-2, AMD 3DNow!, TriMedia II
  – Example: Σ_{i=1..4} |a_i - b_i| (sum of absolute differences)
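The sum-of-absolute-differences kernel Σ_{i=1..4} |a_i - b_i| is what a sub-word SIMD instruction computes on four packed sub-words in one operation. A plain-Python equivalent, with assumed sample values:

```python
# SAD over four sub-words; a sub-word SIMD unit does all four
# subtract/abs/accumulate steps in a single instruction.

a = [10, 20, 30, 40]
b = [12, 18, 33, 36]

sad = sum(abs(ai - bi) for ai, bi in zip(a, b))
print(sad)  # 2 + 2 + 3 + 4 = 11
```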
Speed-up
Powerful Instructions (2)
MO-technique: multiple operations per instruction
Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)
VLIW instruction example (one field per functional unit):

    FU 1: sub  r8, r5, 3
    FU 2: and  r1, r5, 12
    FU 3: mul  r6, r5, r2
    FU 4: ld   r3, 0(r5)
    FU 5: bnez r5, 13
VLIW architecture: central Register File
[Figure: a shared, multi-ported register file feeds three issue slots, each containing three exec units (units 1-9).]
Q: How many ports does the register file need for n-issue?
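One common back-of-the-envelope answer to that question (an assumption, not spelled out on the slide): with two source operands and one result per operation, an n-issue machine needs 2n read ports and n write ports on the shared register file.

```python
# Port count for an n-issue VLIW with a central register file,
# assuming 2-input / 1-output operations.

def rf_ports(n_issue, reads_per_op=2, writes_per_op=1):
    return {"read": n_issue * reads_per_op,
            "write": n_issue * writes_per_op}

print(rf_ports(3))  # {'read': 6, 'write': 3}
print(rf_ports(5))  # {'read': 10, 'write': 5}
```

Port count (and hence register-file area) growing linearly per slot, with wiring growing faster, is exactly the scaling problem revisited later in the deck.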
Philips oldie: TriMedia TM32A processor
• 0.18 micron process, area: 16.9 mm²
• 200 MHz (typ), 1.4 W, i.e. 7 mW/MHz
  (compare: MIPS processor: 0.9 mW/MHz)

[Figure: TM32A die floorplan: I/O interface, I-cache and D-cache with tags, sequencer/decode, and the function units ALU0-ALU4, SHIFTER0/1, DSPALU0, DSPALU2, DSPMUL1/2, IFMUL1/2 (float), FALU0, FALU3, FCOMP2, FTOUGH1.]
Speedup: Powerful Instructions (2)
VLIW Characteristics
• Only RISC-like operation support
  → short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions (up to 1024 bits)
• Not binary compatible !!!
• But good compilers exist
Speed-up
Multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e. which instructions can be executed in parallel?
• User: specifies multiple instruction streams
  – Multi-processor: MIMD (Multiple Instruction Multiple Data)
• HW: run-time detection of ready instructions
  – Superscalar
• Compiler: compiles into a dataflow representation
  – Dataflow processors
Multiple instruction issue
Three Approaches
Example code:

    a := b + 15;
    c := 3.14 * d;
    e := c / f;

Translation to DDG (Data Dependence Graph):

[Figure: three dataflow trees: (1) ld &b and constant 15 feed +, result stored to &a; (2) ld &d and constant 3.14 feed *, result stored to &c; (3) the * result and ld &f feed /, result stored to &e.]
Generated Code
Instr.  Sequential Code          Dataflow Code
I1      ld   r1,M(&b)            ld   M(&b)   -> I2
I2      addi r1,r1,15            addi 15      -> I3
I3      st   r1,M(&a)            st   M(&a)
I4      ld   r1,M(&d)            ld   M(&d)   -> I5
I5      muli r1,r1,3.14          muli 3.14    -> I6, I8
I6      st   r1,M(&c)            st   M(&c)
I7      ld   r2,M(&f)            ld   M(&f)   -> I8
I8      div  r1,r1,r2            div          -> I9
I9      st   r1,M(&e)            st   M(&e)
3 approaches:
• An MIMD may execute two streams: (1) I1-I3, (2) I4-I9
  – No dependences between the streams; in practice, communication and synchronization are required between streams
• A superscalar issues multiple instructions from the sequential stream
  – Obeys dependences (true and name dependences)
  – Reverse engineering of the DDG is needed at run-time
• Dataflow code is a direct representation of the DDG
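The dataflow representation can be executed directly by token firing: an instruction fires as soon as all of its input tokens have arrived. A minimal sketch over the I1-I9 code above, counting firing levels (cycles, assuming unlimited FUs):

```python
# Token-driven execution of the dataflow code: producer -> consumer
# edges taken from the dataflow column above.

succs = {
    "I1": ["I2"], "I2": ["I3"], "I3": [],
    "I4": ["I5"], "I5": ["I6", "I8"], "I6": [],
    "I7": ["I8"], "I8": ["I9"], "I9": [],
}

needed = {i: 0 for i in succs}        # input tokens each instr waits for
for consumers in succs.values():
    for c in consumers:
        needed[c] += 1

ready = [i for i, n in needed.items() if n == 0]   # the three loads
levels = 0
while ready:
    levels += 1
    fired, ready = ready, []
    for i in fired:
        for c in succs[i]:
            needed[c] -= 1
            if needed[c] == 0:
                ready.append(c)

print(levels)  # 4: the critical path I4 -> I5 -> I8 -> I9
```

So 9 instructions complete in 4 firing levels, bounded only by the longest dependence chain.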
Multiple Instruction Issue: Dataflow processor

[Figure: result tokens from the FUs (FU-1 .. FU-K) flow back to token matching and the token store; matched tokens generate instructions from the instruction store, which wait in reservation stations in front of the FUs.]
Instruction Pipeline Overview
[Figure: pipeline structure per architecture class:
  CISC (no pipelining): IF DC RF EX
  RISC: IF DC/RF EX WB
  Superpipelined: split stages IF1 .. IFs, DC, RF, EX1 .. EX5, WB
  VLIW: IF1 IF2, DC, then k parallel lanes RF1..RFk / EX1..EXk / WB1..WBk
  Superscalar: k parallel pipelines IF, DC, ISSUE, RF, EX, ROB (reorder buffer), WB
  DATAFLOW: token-driven, no sequential fetch pipeline]
Four dimensional representation of the
architecture design space <I, O, D, S>
[Figure: the design space drawn along four logarithmic axes: Instructions/cycle 'I' (Superscalar, Dataflow ~10, MIMD ~100), Operations/instruction 'O' (CISC ~1.2, VLIW ~10), Data/operation 'D' (Vector ~10, SIMD ~100), and Superpipelining degree 'S' (Superpipelined ~10); RISC sits near the origin <1, 1, 1, 1>.]

Note: MIMD would better be a separate, 5th dimension!
Architecture design space
Typical values of K (# of functional units or processor nodes), and
<I, O, D, S> for different architectures
Architecture     K     I    O    D     S    Mpar
CISC             1     0.2  1.2  1.1   1    0.26
RISC             1     1    1    1     1.2  1.2
VLIW             10    1    10   1     1.2  12
Superscalar      3     3    1    1     1.2  3.6
Superpipelined   1     1    1    1     3    3
Vector           7     0.1  1    64    5    32
SIMD             1024  1    1    1024  1.2  1229
MIMD             32    32   1    1     1.2  38
Dataflow         10    10   1    1     1.2  12
with S(architecture) = Σ_{Op ∈ I_set} f(Op) * lt(Op), and Mpar = I * O * D * S.
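The tabulated Mpar values follow directly from the four dimensions; a quick check over a few rows of the table:

```python
# Mpar = I * O * D * S for selected rows of the design-space table.

rows = {  # arch: (I, O, D, S)
    "CISC": (0.2, 1.2, 1.1, 1),
    "VLIW": (1, 10, 1, 1.2),
    "SIMD": (1, 1, 1024, 1.2),
    "MIMD": (32, 1, 1, 1.2),
}

mpar = {arch: I * O * D * S for arch, (I, O, D, S) in rows.items()}
print({a: round(m, 2) for a, m in mpar.items()})
# {'CISC': 0.26, 'VLIW': 12.0, 'SIMD': 1228.8, 'MIMD': 38.4}
```

1228.8 and 38.4 round to the table's 1229 and 38.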
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism (ILP)
– limits on ILP
• VLIW
– Examples
• Clustering
• Code generation
• Hands-on
General organization of an
ILP architecture
[Figure: general ILP CPU organization: an instruction fetch unit and instruction decode unit fed from instruction memory; FU-1 .. FU-5 connected through a bypassing network to a shared register file; data memory.]
Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
– multi-media (image, audio, video, 3-D, holographic)
– intelligent search and filtering engines
– neural, fuzzy, genetic computing
• More functionality
• Use of existing Code (Compatibility)
• Low Power: P = f·C·Vdd²
Low power through parallelism
• Sequential Processor
  – Switching capacitance C
  – Frequency f
  – Voltage V
  – P = f·C·V²
• Parallel Processor (two times the number of units)
  – Switching capacitance 2C
  – Frequency f/2
  – Voltage V' < V
  – P = (f/2)·2C·V'² = f·C·V'²
Measuring and exploiting available ILP
• How much ILP is there in applications?
• How to measure parallelism within applications?
– Using existing compiler
– Using trace analysis
• Track all the real data dependencies (RaWs) of instructions from issue
window
– register dependence
– memory dependence
• Check for correct branch prediction
– if prediction correct continue
– if wrong, flush schedule and start in next cycle
Trace analysis
Program:

    for i := 0..2
        A[i] := i;
    S := X+3;

Compiled code:

        set  r1,0
        set  r2,3
        set  r3,&A
    Loop:
        st   r1,0(r3)
        add  r1,r1,1
        add  r3,r3,4
        brne r1,r2,Loop
        add  r1,r5,3

Trace:

    set  r1,0
    set  r2,3
    set  r3,&A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

How parallel can you execute this code?
Trace analysis
Parallel Trace:

    cycle 1: set r1,0      | set r2,3     | set r3,&A
    cycle 2: st  r1,0(r3)  | add r1,r1,1  | add r3,r3,4
    cycle 3: st  r1,0(r3)  | add r1,r1,1  | add r3,r3,4  | brne r1,r2,Loop
    cycle 4: st  r1,0(r3)  | add r1,r1,1  | add r3,r3,4  | brne r1,r2,Loop
    cycle 5: brne r1,r2,Loop
    cycle 6: add r1,r5,3

Max ILP = Speedup = L_serial / L_parallel = 16 / 6 = 2.7
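Recomputing that number from the schedule, as a check that all 16 trace instructions are accounted for:

```python
# Speedup of the parallel trace: 16 trace instructions in 6 cycles.

ops_per_cycle = [3, 3, 4, 4, 1, 1]     # instructions per cycle, from the
                                       # parallel trace above
L_serial = sum(ops_per_cycle)          # length of the sequential trace
L_parallel = len(ops_per_cycle)        # cycles of the parallel trace

print(L_serial, L_parallel, round(L_serial / L_parallel, 1))  # 16 6 2.7
```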
Ideal Processor
Assumptions for ideal/perfect processor:
1. Register renaming – infinite number of virtual registers => all
register WAW & WAR hazards avoided
2. Branch and Jump prediction – Perfect => all program
instructions available for execution
3. Memory-address alias analysis – addresses are known; a store can be moved before a load provided the addresses are not equal
Also:
– unlimited number of instructions issued per cycle (unlimited resources)
– unlimited instruction window
– perfect caches
– 1 cycle latency for all instructions (also FP *, /)

Programs were compiled using the MIPS compiler with maximum optimization level.
Upper Limit to ILP: Ideal Processor
Integer: 18 - 60, FP: 75 - 150

[Bar chart: instruction issues per cycle (IPC) per benchmark:
  gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1]
Window Size and Branch Impact
• Change from an infinite window to examining 2000 instructions and issuing at most 64 instructions per cycle

Integer: 6 - 12, FP: 15 - 45

[Bar chart: IPC per benchmark (gcc, espresso, li, fpppp, doducd, tomcatv) for branch predictors: Perfect, Tournament (selective) predictor, standard 2-bit BHT(512), static (profile-based), and no prediction.]
Limiting nr. of Renaming Registers
• Changes: 2000-instr. window, 64-instr. issue, 8K 2-level predictor (slightly better than a tournament predictor)

Integer: 5 - 15, FP: 11 - 45

[Bar chart: IPC per benchmark for the number of renaming registers: Infinite, 256, 128, 64, 32, None.]
Memory Address Alias Impact
• Changes: 2000-instr. window, 64-instr. issue, 8K 2-level predictor, 256 renaming registers

Integer: 4 - 9, FP: 4 - 45 (Fortran, no heap)

[Bar chart: IPC per benchmark for memory alias-analysis models: Perfect, Global/stack perfect, Inspection, None.]
Reducing Window Size
• Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many as the window allows

Integer: 6 - 12, FP: 8 - 45

[Bar chart: IPC per benchmark for window sizes: Infinite, 256, 128, 64, 32, 16, 8, 4.]
How to Exceed ILP Limits of
This Study?
• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not through memory
• Unnecessary dependences
  – the compiler did not unroll loops, so iteration-variable dependences remain
• Overcoming the dataflow limit: value prediction, i.e. predicting values and speculating on the prediction
  – Address value prediction and speculation predicts addresses and speculates by reordering loads and stores; this could provide better alias analysis
Conclusions
• Amount of parallelism is limited
– higher in multimedia and signal-processing applications
– higher in kernels
• Trace analysis detects all types of parallelism
– task, data and operation types
• Detected parallelism depends on
– quality of compiler
– hardware
– source-code transformations
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
– Examples
    • C6
    • TM
    • IA-64: Itanium, ...
    • TTA
• Clustering
• Code generation
• Hands-on
VLIW: general concept
A VLIW architecture with 7 FUs
[Figure: instruction memory feeds a wide instruction register with one field per FU; the function units (3 integer FUs and 2 LD/ST units on the integer register file, 2 FP FUs on the floating-point register file) connect to data memory.]
VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC-like operation support
  – Short cycle times
  – Easier to compile for
• Flexible: can implement any FU mixture
• Extensible / scalable

However:
• tight inter-FU connectivity required
• not binary compatible !!
  – (new long instruction format)
• low code density
VelociTI C6x datapath

[Figure: TI VelociTI C6x datapath.]
VLIW example: TMS320C62
TMS320C62 VelociTI Processor
• 8 operations (of 32 bit) per instruction (256 bit)
• Two clusters
  – 8 FUs: 4 FUs per cluster (2 multipliers, 6 ALUs)
  – 2 x 16 registers
  – One bus available to write into the register file of the other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• Originally: 5 ns, 200 MHz, 0.25 µm, 5-layer CMOS
• 128 KB on-chip RAM
VLIW example: Philips TriMedia TM1000
[Figure: TM1000 organization: PC and instruction cache (32 kB) feed an instruction register with 5 issue slots; the exec units and a data cache (16 kB) share a register file (128 regs, 32 bit, 15 ports). The FUs comprise 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, and 1 FP div/sqrt units.]
Intel EPIC Architecture IA-64
Explicit Parallel Instruction Computer (EPIC)
• IA-64 architecture -> Itanium, first realization 2001
Register model:
• 128 64-bit integer registers (each with an extra NaT bit); register stack, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean
• 8 64-bit branch target address
• system control registers
See http://en.wikipedia.org/wiki/Itanium
EPIC Architecture: IA-64
• Instructions grouped in 128-bit bundles
  – 3 x 41-bit instructions
  – 5 template bits, indicating instruction types and stop location
• Each 41-bit instruction
  – starts with a 4-bit opcode, and
  – ends with a 6-bit guard (boolean) register-id
• Supports speculative loads
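A quick sanity check of the bundle arithmetic described above:

```python
# IA-64 bundle layout: three 41-bit slots plus 5 template bits
# fill the 128-bit bundle exactly.

SLOTS, SLOT_BITS, TEMPLATE_BITS = 3, 41, 5

bundle = SLOTS * SLOT_BITS + TEMPLATE_BITS
print(bundle)  # 128

# a slot keeps 41 - 4 (opcode) - 6 (guard register-id) = 31 bits
# for register specifiers and immediates
print(SLOT_BITS - 4 - 6)  # 31
```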
Itanium organization
[Figure: Itanium organization block diagram.]
Itanium 2: McKinley

[Figure: Itanium 2 (McKinley) organization.]
EPIC Architecture: IA-64
• EPIC allows more binary compatibility than a plain VLIW:
  – Function-unit assignment is performed at run-time
  – Lock (stall) when FU results are not available
• See other website (course 5MD00) for more info
on IA-64:
– www.ics.ele.tue.nl/~heco/courses/ACA
– (look at related material)
What did we talk about?
ILP = Instruction Level Parallelism =
ability to perform multiple operations (or instructions),
from a single instruction stream,
in parallel
VLIW = Very Long Instruction Word architecture
Example instruction format (5-issue):
| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |
VLIW evaluation
[Figure: the ILP CPU organization again (FU-1 .. FU-5, register file, bypassing network, instruction fetch/decode, instruction and data memories), annotated with its scaling problems for N function units: bypassing network O(N²), register file O(N)-O(N²), and the instruction-fetch control problem.]
VLIW evaluation
Strong points of VLIW:
– Scalable (add more FUs)
– Flexible (an FU can be almost anything; e.g. multimedia support)
Weak points:
• With N FUs:
– Bypassing complexity: O(N2)
– Register file complexity: O(N)
– Register file size: O(N2)
• Register file design restricts FU flexibility
Solution: .................................................. ?
Solution
TTA: Transport Triggered Architecture
[Figure: a TTA transport network connecting function units (+, >, *, st); instead of routing everything through a central register file, operands move directly between units over the network.]
Transport Triggered Architecture
General organization of a TTA
[Figure: general TTA organization: FU-1 .. FU-5 and the register file all connect to the transport (bypassing) network; instruction fetch and decode units, instruction memory, and data memory as before.]
TTA structure; datapath details
[Figure: TTA datapath detail: data memory on top with two load/store units; two integer ALUs and a float ALU; boolean, integer, and float register files; an instruction unit and an immediate unit; all connected to the move buses through sockets, with instruction memory below.]
TTA hardware characteristics
• Modular: building blocks easy to reuse
• Very flexible and scalable
– easy inclusion of Special Function Units (SFUs)
• Very low complexity
  – > 50% reduction in the number of register ports
  – reduced bypass complexity (no associative matching)
  – up to 80% reduction in bypass connectivity
  – trivial decoding
  – reduced register pressure
  – easy register file partitioning (a single port is enough!)
TTA software characteristics
The operation

    add r3, r1, r2

becomes three moves, to the o1, o2, and r ports of the add unit:

    r1 -> add.o1;
    r2 -> add.o2;
    add.r -> r3

That does not look like an improvement !?!
• More difficult to schedule !
• But: extra scheduling optimizations become possible
Programming TTAs

How to do data operations?
1. Transport of operands to the FU
   • Operand move(s)
   • Trigger move
2. Transport of results from the FU
   • Result move(s)

[Figure: FU pipeline: operand and trigger registers feed an internal stage; a result register holds the outcome.]

Example:

    add r3,r1,r2

becomes

    r1 -> Oint      // operand move to integer unit
    r2 -> Tadd      // trigger move to integer unit
    .....           // addition operation in progress
    Rint -> r3      // result move from integer unit

How to do control flow?
1. Jump:   #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call:   pc -> r; #call-address -> pcd
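A toy model of the move style above: writing the trigger port starts the operation, and reading the result port yields it. The class and port names here are illustrative, mirroring the Oint/Tadd/Rint names on the slide:

```python
# Minimal transport-triggered add: operand move, trigger move,
# result move, as three separate data transports.

class AddFU:
    def __init__(self):
        self.o1 = 0          # operand register
        self.r = 0           # result register
    def trigger(self, value):
        self.r = self.o1 + value   # trigger move starts the add

regs = {"r1": 40, "r2": 2, "r3": 0}
fu = AddFU()

fu.o1 = regs["r1"]           # r1 -> add.o1   (operand move)
fu.trigger(regs["r2"])       # r2 -> Tadd     (trigger move)
regs["r3"] = fu.r            # add.r -> r3    (result move)

print(regs["r3"])  # 42
```

Because each transport is explicit, the compiler can drop the result move entirely and route `fu.r` straight into the next unit's operand port, which is exactly the bypassing optimization shown in the scheduling example below.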
Scheduling example
VLIW code:

    add r1,r1,r2
    sub r4,r1,95

TTA code:

    r1 -> add.o1, r2 -> add.o2
    add.r -> sub.o1, 95 -> sub.o2
    sub.r -> r4

[Figure: datapath with a load/store unit, two integer ALUs, an integer RF, and an immediate unit; the add result moves straight into the subtract's operand port, bypassing the register file.]
TTA Instruction format
General MOVE field:

    g | i | src | dst

  g   : guard specifier
  i   : immediate specifier
  src : source
  dst : destination

General MOVE instructions contain multiple fields:

    move 1 | move 2 | move 3 | move 4

How to use immediates?

  Small (6 bits):  g | 1 | imm  | dst
  Long (32 bits):  g | 0 | Ir-1 | dst   with a separate 32-bit imm field
Programming TTAs
How to do conditional execution?
Each move is guarded.

Example:

    r1 -> cmp.o1    // operand move to compare unit
    r2 -> cmp.o2    // trigger move to compare unit
    cmp.r -> g      // put result in boolean register g
    g:r3 -> r4      // guarded move takes place when r1 = r2
Register file port pressure for TTAs
[Figure: 3-D plot of achieved ILP degree (1.00 - 3.50) versus the number of register-file read ports (1-5) and write ports (1-5); the ILP saturates well below full porting.]
Summary of TTA Advantages
• Better usage of transport capacity
  – Instead of 3 transports per dyadic operation, about 2 are needed
  – Number of register ports reduced by at least 50%
  – Inter-FU connectivity reduced by 50-70%
• No full connectivity required
• Both the transport capacity and the number of register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: #FUs, #reg-files, etc. can be changed
• FU splitting yields extra exploitable concurrency
• TTAs are easy to design and can have short cycle times
TTA automatic DSE
[Figure: automatic design-space exploration flow: user interaction steers an optimizer, which proposes architecture parameters; a parametric compiler (the Move framework) produces parallel object code and feedback, and a hardware generator produces the chip and feedback; the explored design points form a Pareto curve in the cost/performance solution space.]
Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
• Hands-on
Clustered VLIW
• Clustering = Splitting up the VLIW data path
  (the same can be done for the instruction path)

[Figure: three clusters, each with its own loop buffer, FUs, and register file; a level-1 instruction cache and level-1 data cache connect to a shared level-2 cache.]
Clustered VLIW
Why clustering?
• Timing: faster clock
• Lower Cost
– silicon area
– T2M (Time-to-Market)
• Lower Energy
What’s the disadvantage?
Want to know more? See the PhD thesis of Andrei Terechko.
Fine-Grained reconfigurable:
Xilinx XC4000 FPGA
[Figure: XC4000 structure: configurable logic blocks (CLBs) containing F, G, and H function generators with flip-flops; programmable interconnect with switch matrices; and I/O blocks (IOBs) with input/output buffers, output delay, slew-rate control, and passive pull-up/pull-down.]
Recent Coarse Grain Reconfigurable
Architectures
• SmartCell 2009
– read http://www.hindawi.com/journals/es/2009/518659.html
• Montium (reconfigurable VLIW)
• RAPID
• NIOS II
• RAW
• PicoChip
• PACT XPP64
• ADRES (IMEC)
• many more ...
Xilinx Zynq with 2 ARM processors
[Figure: Xilinx Zynq block diagram: dual ARM cores plus programmable logic.]
ADRES
• Combines a VLIW and a reconfigurable array
• PEs have local registers
• Top-row PEs share registers with the VLIW

[Figure: ADRES architecture.]
PACT XPP: Architecture
• XPP (eXtreme Processing Platform)
  – A hierarchical structure consisting of PAEs
• PAEs
  – Coarse-grain PEs
  – Adaptive
  – Clustered in PACs (PA = PAC + CM)
  – A hierarchical configuration tree
  – Memory elements (beside the PAs)
  – I/O elements (on each side of the chip)

[Figure: XPP chip with four processing arrays (PAs).]
RAW with Mesh network
[Figure: RAW tile with compute pipeline and mesh routers; 8 32-bit channels, registered at the input, so the longest wire equals the length of a tile.]
Granularity Makes a Difference

                     Fine-Grained    Coarse-Grained
Clock Speed          Low             High
Configuration Time   Long            Short
# of Blocks          Large           Small
Flexibility          High            Low
Power                High            Low
Area                 Large           Small
Reconfiguration time
HW or SW reconfigurable?

[Figure: design space of reconfiguration time versus datapath granularity: FPGAs are fine-grained with spatial mapping, reconfigured at reset; VLIWs are coarse-grained with temporal mapping, a new "configuration" every cycle; loop buffers/contexts and subword parallelism sit in between.]