Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Compiling for EDGE Architectures:
The TRIPS Prototype Compiler
Kathryn McKinley
Doug Burger, Steve Keckler,
Jim Burrill1, Xia Chen, Katie Coons, Sundeep Kushwaha,
Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder
et al.
The University of Texas at Austin
1University of Massachusetts, Amherst
July 13, 2016
ASPLOS XII
Technology Scaling Hitting the Wall
Analytically …
Qualitatively …
[Figure: feature sizes of 130 nm, 100 nm, 70 nm, and 35 nm shown against a 20 mm chip edge]
Either way … Partitioning for on-chip communication is key
OO SuperScalars Out of Steam
Clock ride is over
Wire and pipeline limits
Quadratic out-of-order issue logic
Power, a first order constraint
Problems for any architectural solution
ILP - instruction level parallelism
Memory and on-chip latency
Major vendors ending processor lines
What’s next?
Post-RISC Solutions

CMP - An evolutionary path
Replicate what we already have 2 to N times on a chip
Coarse grain parallelism
Exposes the resources to the programmer and compiler
Explicit Data Graph Execution (EDGE)
1. Program graph is broken into sequence of blocks
Blocks commit atomically or not - a block never partially commits
2. Dataflow within a block, ISA support for direct producer-consumer communication
No shared named registers (point-to-point dataflow edges only)
Memory is still a shared namespace
The block’s dataflow graph (DFG) is explicit in the architecture
Outline
TRIPS Execution Model & ISA
TRIPS Architectural Constraints
Compiler Structure
Spatial Path Scheduling
Block Atomic Execution Model
[Figure: a TRIPS block shown as a Flow Graph and as a Dataflow Graph, mapped onto the Execution Substrate (Gtile, Register File, Data Caches)]
TRIPS block - single entry constrained hyperblock
Dataflow execution w/ target position encoding
TRIPS Block Constraints
Registers: 32 reads and 32 writes, 8 to each of 4 banks (in addition to the 128 instructions)
Load/Store Identifiers: 32 load or store queue identifiers
More than 32 static loads and stores are possible
Fixed Size: 128 instructions
Padded with no-ops if needed
Constant Output: all stores and writes execute, one branch
Simplifies hardware logic for detecting block completion
Every path of execution through a block must produce the same stores and register writes
[Figure: instruction DFG (1 - 128 instructions) with labels PC read, 32 reads, 32 writes, 32 loads, 32 stores, Memory, Register banks, terminating branch, PC]
Simplifies the hardware, more work for the compiler
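The numeric limits above can be checked mechanically. The following is a minimal sketch, assuming a block is summarized by a few counts; the names `BlockSummary`, `violations`, and `padding_needed` are hypothetical and not part of the TRIPS toolchain, and the constant-output constraint is not modeled here.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the slide; all names below are illustrative assumptions.
MAX_INSTRUCTIONS = 128
MAX_LSIDS = 32
NUM_BANKS = 4
MAX_PER_BANK = 8  # 8 reads and 8 writes per bank, 32 of each in total


@dataclass
class BlockSummary:
    instructions: int           # instructions before no-op padding
    lsids: int                  # distinct load/store identifiers used
    reads_per_bank: List[int]   # register reads sent to each of the 4 banks
    writes_per_bank: List[int]  # register writes sent to each of the 4 banks


def violations(block: BlockSummary) -> List[str]:
    """Return the violated constraints as strings (empty list if none)."""
    problems = []
    if block.instructions > MAX_INSTRUCTIONS:
        problems.append("more than 128 instructions")
    if block.lsids > MAX_LSIDS:
        problems.append("more than 32 load/store IDs")
    for kind, per_bank in (("read", block.reads_per_bank),
                           ("write", block.writes_per_bank)):
        if len(per_bank) != NUM_BANKS:
            problems.append(f"expected {NUM_BANKS} banks of register {kind}s")
            continue
        for bank, count in enumerate(per_bank):
            if count > MAX_PER_BANK:
                problems.append(f"bank {bank}: more than 8 register {kind}s")
    return problems


def padding_needed(block: BlockSummary) -> int:
    """Number of no-ops needed to reach the fixed block size."""
    return max(0, MAX_INSTRUCTIONS - block.instructions)
```

A compiler phase could call such a check after each transformation and split the block when the returned list is non-empty.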
Compiler Phases (Classic)
Scale Compiler (UTexas/UMass)
Phases: C, FORTRAN -> Frontend -> Inlining -> Unrolling/Flattening -> Scalar Optimizations -> Code Generation
Code generation targets: Alpha, SPARC, PPC, TRIPS TIL
Scalar optimizations:
PRE
Global Value Numbering
Scalar Replacement
Global Variable Replacement
SCC
Copy Propagation
Array Access Strength Reduction
LICM
Tree Height Reduction
Useless Copy Removal
Dead Variable Elimination
TIL: TRIPS Intermediate Language - RISC-like three-address form
TASL: TRIPS Assembly Language - dataflow target form w/ locations encoded
Backend Compiler Flow
TIL
Hyperblock Formation: If-conversion, Loop peeling, While loop unrolling, Instruction merging, Predicate optimizations
Resource Allocation: Register allocation, Reverse if-conversion & split, Load/Store ID assignment, SSA for constant outputs
Scheduling: Fanout insertion, Instruction placement, Target form generation
TASL
Correctness:
Progressively Satisfy Constraints
[Figure: the backend flow (Hyperblock Formation, Resource Allocation, Scheduling; TIL to TASL) shown alongside the block constraints]
Constraint
128 instructions
32 load/store IDs
32 reg. read/write (8 per 4 banks)
constant output
Predication & Hyperblock Formation
Predication
Convert control dependence to data dependence
Improves instruction fetch bandwidth
Eliminates branch mispredictions
Adds overhead
Any instruction can have a predicate, but...
Predicate head (low power) or bottom (speculative)
Hyperblock
Scheduling region (set of basic blocks)
Single entry, multiple exit, predicated instructions
Expose parallelism w/o oversaturating resources
Must satisfy block constraints
[Figure: dataflow chains with the predicate (P) applied at the head or at the bottom]
Accuracy?
[Figure: the same backend flow and constraint list as on the previous slide]
Spatial Scheduling Problem
Partitioned microarchitecture
Anchor points
Balance latency and concurrency
[Figure: a dataflow graph of add, mul, ld, and st instructions placed onto the partitioned microarchitecture]
Outline
Background
Spatial Path Scheduling
Simulated Annealing
Extending SPS
Conclusions and Future Work
Dissecting the Problem

Scheduling can have two components
Placement: Where an instruction executes
Issue: When an instruction executes

                     Static issue       Dynamic issue
Static placement     VLIW (SPSI)        TRIPS (SPDI), EDGE
Dynamic placement    Bad idea (DPSI)    Superscalars (DPDI)
Explicit Data Graph Execution

Block-atomic execution
Instruction groups fetch, execute, and commit atomically
Direct instruction communication
Explicitly encode dataflow graph by specifying targets
RISC:
    add r1, r4, r5
    add r2, r5, r6
    add r3, r1, r2
EDGE:
    i1: add i3
    i2: add i3
    i3: add i4
[Figure: RISC operands R4, R5, R6 pass through a Centralized Register File; EDGE instructions i1, i2, i3 send results directly to their consumers]
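The RISC-to-EDGE rewrite on this slide can be sketched in a few lines. This is an illustration only, assuming each instruction is a tuple `(opcode, dest, src1, src2)`; the function name `to_target_form` is hypothetical. On the slide, i3 targets i4, an instruction outside the three shown, so the sketch leaves i3's target list empty.

```python
def to_target_form(instrs):
    """Turn three-address code into producer -> consumer target lists.

    instrs: list of (opcode, dest, src1, src2) tuples in program order.
    Returns a list of (label, opcode, [consumer labels]).
    """
    last_writer = {}                     # register name -> index of its latest producer
    targets = [[] for _ in instrs]
    for idx, (_, dest, *srcs) in enumerate(instrs):
        for src in srcs:
            # A source produced inside the block becomes a direct dataflow edge.
            if src in last_writer:
                targets[last_writer[src]].append(f"i{idx + 1}")
        last_writer[dest] = idx
    return [(f"i{k + 1}", op, targets[k])
            for k, (op, *_rest) in enumerate(instrs)]


risc = [
    ("add", "r1", "r4", "r5"),
    ("add", "r2", "r5", "r6"),
    ("add", "r3", "r1", "r2"),
]
for label, op, consumers in to_target_form(risc):
    print(f"{label}: {op} {' '.join(consumers)}")
```

Sources that have no producer inside the block (r4, r5, r6 here) would come from register reads, which this sketch does not model.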
Scheduling for TRIPS
TRIPS ISA
Up to 128 instructions/block
Any instruction can be in any slot
TRIPS microarchitecture
Up to 8 blocks in flight
1 cycle latency between adjacent ALUs
Known
Execution latencies
Lower bound for communication latency
Unknown (estimated)
Memory access latencies
Resource conflicts
[Figure: TRIPS tile grid with Ctrl, Register File tiles R0-R3, Data Cache tiles D0-D3, and execution tiles E0-E15]
Greedy Scheduling for TRIPS

GRST [PACT ‘04]: Based on VLIW list-scheduling
Augmented with five heuristics
1. Prioritizes critical path (C)
2. Reprioritizes after each placement (R)
3. Accounts for data cache locality (L)
4. Accounts for register output locality (O)
5. Load balancing for local issue contention (B)
Drawbacks
Unnecessary restrictions on scheduling order
Inelegant and overly specific
Replace heuristics with elegant approach designed for spatial scheduling
Spatial Path Scheduling Overview
[Figure: a Dataflow Graph and a Topology are inputs to the Scheduler, which outputs a Placement. Legend: Register, Data cache, Execution, Control]
Initialize all known anchor points
Until all instructions are scheduled:
1. Populate the open list
2. Find placement costs
3. Choose the minimum cost location
4. Schedule the instruction whose minimum placement cost is largest (Choose the max of the mins)
Spatial Path Scheduling Example

Initialize all known anchor points
[Figure: Register File tiles (R1, R2), Data Cache tiles (D0, D1), and ctrl shown as anchor points next to the dataflow graph. Legend: Register, Data cache, Execution, Control, Unplaced]
Spatial Path Scheduling Example

Populate the open list (marked in yellow)
Open list: Instructions that are candidates for scheduling
We include: Instructions with no parents, or with at least one placed parent
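The open-list rule stated on this slide can be written as a small filter. A minimal sketch, assuming the dataflow graph is given as a dict from each instruction to its list of parents; the function name `open_list` and the instruction names are hypothetical.

```python
def open_list(parents, placed):
    """Instructions that are candidates for scheduling.

    parents: dict mapping instruction -> list of its DFG parents.
    placed:  set of instructions already placed (anchor points included).
    An unplaced instruction qualifies if it has no parents, or if at
    least one of its parents has already been placed.
    """
    return [
        instr
        for instr, ps in parents.items()
        if instr not in placed and (not ps or any(p in placed for p in ps))
    ]


dfg = {"read_R1": [], "mul": ["read_R1"], "add": ["mul"], "write_R1": ["add"]}
print(open_list(dfg, placed={"read_R1"}))
```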
Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot
Placement cost(i,slot): Longest path length through i if placed at slot
cost = inputCost + execCost + outputCost
(includes communication and execution latencies)
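The three-term cost can be sketched as follows. This is an assumption-laden illustration: slots are `(row, column)` grid coordinates, communication costs one cycle per hop (the stated one-cycle latency between adjacent ALUs, used as a lower bound), and the argument layout of `placement_cost` is hypothetical rather than the scheduler's actual interface.

```python
def hops(a, b):
    """Manhattan distance between two (row, col) slots, one cycle per hop."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def placement_cost(slot, placed_parents, exec_latency, placed_children):
    """cost = inputCost + execCost + outputCost for one candidate slot.

    placed_parents:  list of (parent_slot, cycles_until_parent_result).
    placed_children: list of (child_slot, cycles_remaining_after_child).
    """
    # Longest path arriving at this slot from any already placed producer.
    input_cost = max(
        (ready + hops(p_slot, slot) for p_slot, ready in placed_parents),
        default=0,
    )
    # Longest path leaving this slot toward any already placed consumer.
    output_cost = max(
        (hops(slot, c_slot) + rest for c_slot, rest in placed_children),
        default=0,
    )
    return input_cost + exec_latency + output_cost


# One parent two hops away whose result is ready after 3 cycles, a 3-cycle
# operation, and one consumer two hops away with 1 cycle left after it.
print(placement_cost((1, 1), [((0, 0), 3)], 3, [((3, 1), 1)]))
```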
Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot
[Figure: mul evaluated at slot E1, with path segments labeled 3 cycles, 1 cycle, and 5 cycles]
Total placement cost = 16 + 3 + 3 = 22
Spatial Path Scheduling Example

Calculate placement cost for each instruction in the open list at each slot
Placement costs per slot, one 4x4 grid per open-list instruction:
mul
22 22 24 26
22 22 24 26
24 24 26 28
26 26 28 30
add
10 8 8 10
10 10 10 12
12 12 12 14
14 14 14 16
mul
24 24 22 24
22 22 22 24
24 24 24 28
26 26 26 28
add
22 22 24 26
22 22 24 26
24 24 26 28
26 26 28 30
Spatial Path Scheduling Example

Choose the minimum cost location for each instruction
[Figure: the same per-slot cost grids for the mul and add candidates as on the previous slide]
Spatial Path Scheduling Example


Break ties
Example heuristics:
Links consumed
ALU utilization
[Figure: per-slot cost grids for the mul and add candidates]
Spatial Path Scheduling Example

Place the instruction with the highest minimum cost
(Choose the max of the mins)
[Figure: the chosen mul placed on the grid; per-slot cost grids for the remaining mul and add candidates]
Spatial Path Scheduling Algorithm
Schedule (block, topology)
    initialize known anchor points
    while (not all instructions scheduled)
        for each instruction in open list, i
            for each available location, n
                calculate placement cost for (i, n)
                keep track of n with min placement cost
            keep track of i with highest min placement cost
        schedule i with highest min placement cost
Per-block complexity:
SPS: O(i^2 * n)
GRST: O(i^2 + i * n)
Exhaustive search: i!
i = # of instructions
n = # of ALUs
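The pseudocode above translates fairly directly into Python. The sketch below is a simplified illustration, not the TRIPS scheduler: it assumes one instruction per slot, takes the cost function as a parameter, breaks ties by first occurrence, and uses a toy cost (one cycle of execution plus hop distance from the farthest placed parent). The names `sps_schedule` and `toy_cost` are hypothetical.

```python
def sps_schedule(instrs, parents, anchors, slots, cost):
    """Greedy max-of-mins placement.

    instrs:  instruction names in any order.
    parents: dict name -> list of parent names (missing means no parents).
    anchors: dict name -> fixed location, placed before the loop starts.
    slots:   list of free locations (one instruction per slot here).
    cost:    callable (instr, slot, placement) -> number.
    """
    placement = dict(anchors)            # initialize known anchor points
    free = list(slots)
    todo = [i for i in instrs if i not in placement]
    while todo:
        candidates = [
            i for i in todo
            if not parents.get(i) or any(p in placement for p in parents[i])
        ]
        if not candidates or not free:
            raise ValueError("cannot make progress: no candidate or no free slot")
        best = None                      # (min cost, instruction, slot)
        for i in candidates:
            slot = min(free, key=lambda n: cost(i, n, placement))
            c = cost(i, slot, placement)
            if best is None or c > best[0]:
                best = (c, i, slot)      # keep the highest of the minimums
        _, chosen, where = best
        placement[chosen] = where
        free.remove(where)
        todo.remove(chosen)
    return placement


def make_toy_cost(parents):
    """Toy cost: 1 cycle to execute plus hops from the farthest placed parent."""
    def toy_cost(instr, slot, placement):
        dists = [
            abs(placement[p][0] - slot[0]) + abs(placement[p][1] - slot[1])
            for p in parents.get(instr, []) if p in placement
        ]
        return 1 + max(dists, default=0)
    return toy_cost


deps = {"mul": ["read"], "add": ["mul"]}
result = sps_schedule(["read", "mul", "add"], deps, {"read": (0, 0)},
                      [(0, 1), (1, 1)], make_toy_cost(deps))
print(result)
```

The two nested loops over candidates and free slots, repeated once per scheduled instruction, correspond to the O(i^2 * n) bound quoted for SPS.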
SPS Benefits and Limitations

Benefits
Automatically exploits known communication latencies
Designed for spatial scheduling
Minimizes critical path length at each step
Naturally encompasses four of five GRST heuristics
Limitations of basic algorithm
Does not account for resource contention
Uses no global information
Minimum communication latencies may be optimistic
Experimental Methodology

26 hand-optimized microbenchmarks
Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries
Average dynamic instructions fetched/block: 67.3 (Ranges from 14.5 to 117.5)
Cycle-accurate simulator
Within 4% of RTL on average
Models communication and contention delays
Comparison points
Greedy Scheduling for TRIPS (GRST)
Simulated annealing
SPS Performance
Geometric mean of speedup over GRST: 1.19
[Figure: bar chart of Basic SPS speedup over GRST for each hand-coded microbenchmark and the geometric mean; y-axis: Speedup, 0.8 to 2]
How well can we do?

Simulated annealing
Artificial intelligence search technique
Uses random perturbations to avoid local optima
Approximates a global optimum
Cost function: simulated cycles
Uncertainty makes static cost functions insufficient
Best cost function
Purpose
Optimization
Discover performance upper bound
Tool to improve scheduler
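A generic annealing loop over placements might look like the sketch below. It is an illustration under stated assumptions: the perturbation swaps the slots of two instructions, the temperature follows a geometric cooling schedule, and the cost function is supplied by the caller. The slides use simulated cycles as the cost; the `wire_length` function here is only a toy stand-in, and all names and parameter values are hypothetical.

```python
import math
import random


def anneal(placement, cost, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Simulated annealing over a dict instruction -> slot (needs >= 2 entries)."""
    rng = random.Random(seed)
    current = dict(placement)
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost
    names = sorted(current)
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        a, b = rng.sample(names, 2)
        current[a], current[b] = current[b], current[a]      # random perturbation
        new_cost = cost(current)
        delta = new_cost - current_cost
        # Always accept improvements; accept worse moves with a
        # temperature-dependent probability to escape local optima.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = dict(current), new_cost
        else:
            current[a], current[b] = current[b], current[a]  # undo the swap
    return best, best_cost


EDGES = [("a", "b"), ("b", "c"), ("c", "d")]


def wire_length(p):
    """Toy cost: total Manhattan distance along the dataflow edges."""
    return sum(abs(p[u][0] - p[v][0]) + abs(p[u][1] - p[v][1]) for u, v in EDGES)


start = {"a": (0, 0), "b": (3, 3), "c": (0, 1), "d": (3, 0)}
print(anneal(start, wire_length)[1], "<=", wire_length(start))
```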
Speedup with Simulated Annealing
Geometric mean of speedup over GRST
Basic SPS: 1.19
Annealed: 1.40
[Figure: bar chart of Basic SPS and Annealed speedup over GRST for each hand-coded microbenchmark and the geometric mean; y-axis: Speedup, 0.8 to 2.2]
Extending SPS
Contention
Network link contention
Local and Global ALU contention
Global register prioritization
Path volume scheduling
ALU Contention

What if two instructions are ready to execute on the same ALU at the same time?
[Figure: a dataflow graph placed on the grid, with two instructions assigned to the same ALU]
Local vs. Global ALU Contention
Local ALU contention
Keep track of expected issue time
Increase placement cost if conflict occurs
Global ALU contention
Resource utilization in previous/next block
Weighting function
Modify placement cost
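The local rule (track expected issue times, raise the cost on a conflict) can be sketched as below. This is a hypothetical illustration: the fixed one-cycle `penalty`, the `claimed` bookkeeping dict, and both function names are assumptions, and the global weighting function is not modeled.

```python
def claim(claimed, alu, expected_issue):
    """Record that a placed instruction expects to issue on `alu` at that cycle."""
    claimed.setdefault(alu, set()).add(expected_issue)


def local_contention_cost(base_cost, alu, expected_issue, claimed, penalty=1):
    """Increase the placement cost if the ALU is already claimed for that cycle."""
    conflict = expected_issue in claimed.get(alu, set())
    return base_cost + (penalty if conflict else 0)


claimed = {}
claim(claimed, "E1", 5)
print(local_contention_cost(10, "E1", 5, claimed))  # conflicting cycle on E1
print(local_contention_cost(10, "E1", 6, claimed))  # free cycle on E1
```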
Speedup over GRST
Geometric mean of speedup over GRST
Basic SPS: 1.19
SPS extended: 1.31
Annealed: 1.40
[Figure: bar chart of Basic SPS, SPS extended, and Annealed speedup over GRST for each hand-coded microbenchmark and the geometric mean; y-axis: Speedup, 0.8 to 2.2]
Related Work

Scheduling for VLIW [Ellis, Fisher]
Scheduling for other partitioned architectures
Partitioned VLIW [Gilbert, Kailas, Kessler, Özer, Qian, Zalamea]
RAW [Lee]
Wavescalar [Mercaldi]
ASIC and FPGA place and route [Paulin]
Resource conflicts known statically
Substrate may not be fixed
Simulated annealing [Betz]
Conclusions and Future Work

Conclusions
General spatial instruction scheduling algorithm
Reasons explicitly about anchor points
Performance within 4% of annealed results
Future work
Register allocation
Memory placement
Reliability-aware scheduling
Questions?
Mapping instructions to Physical Locations

Scheduler converts operand format to target format, and assigns IDs
ID assigned to each instruction indicates physical location
The microarchitecture can interpret this ID in many different ways
To schedule well, the scheduler must understand how the microarchitecture translates ID -> Physical location
TIL (operand format):
    read  t0, g1
    read  t1, g2
    muli  t2, t1, 4
    ld    t3, 0(t2)
    ld    t4, 4(t2)
    mul   t5, t3, t4
    add   t6, t5, t0
    addi  t7, t1, 8
    br    t7
    write g1, t6
TASL (target format), as produced by the scheduler:
    R[1]   read, G[1], N[5]
    R[2]   read, N[2], N[6]
    N[2]   muli, N[34], N[1]
    N[34]  ld, N[32]
    N[1]   ld, N[32]
    N[32]  mul, N[5]
    N[5]   add, W[1]
    N[6]   addi, N[0]
    N[0]   br
    W[1]   write, G[1]
ID to physical location in this example (IDs and registers held by each tile):
ctrl    R0,R4,… R28     R1,R5,… R29     R2,R6,… R30     R3,R7,… R31
D0      0,4,8,… 28      1,5,9,… 29      2,6,10,… 30     3,7,11,… 31
D1      32,36,… 60      33,37,… 61      34,38,… 62      35,39,… 63
D2      64,68,… 92      65,69,… 93      66,70,… 94      67,71,… 95
D3      96,100,… 124    97,101,… 125    98,102,… 126    99,103,… 127
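The interleaving shown in the grid can be expressed arithmetically. The sketch below assumes exactly the mapping pictured on this slide (four rows of 32 consecutive IDs, interleaved across four columns, eight slots per tile); since the microarchitecture can interpret the ID in other ways, this is one illustrative interpretation, and the function name `locate` is hypothetical.

```python
def locate(instr_id):
    """Map an instruction ID (0..127) to (row, column, slot within tile)."""
    if not 0 <= instr_id < 128:
        raise ValueError("instruction ID must be in 0..127")
    row = instr_id // 32           # rows next to D0..D3 hold 32 IDs each
    col = instr_id % 4             # consecutive IDs alternate across the 4 columns
    slot = (instr_id % 32) // 4    # position among the 8 IDs sharing one tile
    return row, col, slot


# N[34] from the TASL listing lands in the D1 row, third column, first slot.
print(locate(34))
```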
Simulated Annealing Over Time
[Figure: Simulation Cycles (60000 to 100000) vs. Annealing Iterations (1 to 1680); series: random accepted, random best, guided accepted, guided best]
Simulated Annealing

Cost function: Simulated cycles
Prune space further with critical path tool
[Figure: Guided vs. unguided Annealing for memset_hand; Simulation Cycles (76000 to 83000) vs. Annealing Times (1 to 101); series: Random Move, Guided Move]
Contention

ALU contention
Local (within a block) - Estimate temporal schedule
Global (between blocks) - Probabilistic - use weighting function
Network link contention
Precise measurements too inaccurate
Estimate with threshold, weighting function
Weight network link and global ALU contention based on annealed results
weight = (1 - fullness) * (1 … criticality / concurrency)
Global Register Prioritization

Problem: Any register dependence may be important with speculative execution
Solution: Extend path lengths through registers
Register prioritization:
1) Schedule smaller loops before larger loops
2) Schedule loop-carried dependences first
3) Extend placement cost through registers to previous/next block
Path Volume Scheduling

Problem: The basic SPS algorithm does not account for the number of instructions in the path
Solution: Perform a depth-first search with iterative deepening to find the shortest path that holds all instructions