Compiler-directed Synthesis of Programmable Loop Accelerators
Kevin Fan, Hyunchul Park, Scott Mahlke
September 25, 2004
EDCEP Workshop
University of Michigan
Electrical Engineering and Computer Science
Loop Accelerators
• Hardware implementation of a critical loop nest
– Hardwired state machine
– Digital camera application – 1000x speedup vs. Pentium III
– Multiple accelerators hooked up in a pipeline
• Loop accelerator vs. customized processor
– 1 block of code vs. multiple blocks
– Trivial control flow vs. handling generic branches
– Traditionally state machine vs. instruction driven
Programmable Loop Accelerators
• Goals
– Multifunction accelerators – Accelerator hardware can
handle multiple loops (re-use)
– Post-programmable – To a degree, allow changes to the
application
– Use compiler as architecture synthesis tool
• But …
– Don’t build a customized processor
– Maintain ASIC-level efficiency
NPA (Nonprogrammable Accelerator) Synthesis in PICO
PICO Frontend
• Goals
– Exploit loop-level parallelism
– Map loop to abstract hardware
– Manage global memory BW
• Steps
– Tiling
– Load/store elimination
– Iteration mapping
– Iteration scheduling
– Virtual processor clustering
Original loop:

for i = 1 to ni
  for j = 1 to nj
    y[i] += w[j] * x[i+j]

After transformation:

for jt = 1 to 100 step 10
  for t = 0 to 502
    for p = 0 to 1
      (i,j) = function of (t,p)
      if (i>1) W[t][p] = W[t-5][p]
      else     W[t][p] = w[jt+j]
      if (i>1 && j<bj) X[t][p] = X[t-4][p+1]
      else             X[t][p] = x[i+jt+j]
      Y[t][p] += W[t][p] * X[t][p]
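As a concrete illustration of the tiling step alone, here is a minimal C sketch of the tiled j loop, assuming nj = 100 and a tile size of 10 to match the jt loop above; the fir_tiled name and int data types are ours, not PICO's:

void fir_tiled(int ni, int y[], const int w[], const int x[]) {
    /* jt walks the tile origins 1, 11, ..., 91; the inner j covers one tile */
    for (int jt = 1; jt <= 100; jt += 10)
        for (int i = 1; i <= ni; i++)
            for (int j = 0; j < 10; j++)
                y[i] += w[jt + j] * x[i + jt + j];
}

The mapping of (i, j) iterations onto the (t, p) virtual-processor space and the load/store elimination (reusing W and X values produced by earlier iterations) are what the transformed code above adds on top of this.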
PICO Backend
• Resource allocation (II, operation graph)
• Synthesize machine description for “fake” fully connected processor with allocated resources
Reduced VLIW Processor after Modulo Scheduling
Data/control-path Synthesis → NPA

[Datapath diagram: loaded values w[jj], x[ii-jj], and y[ii], together with rotating-register values (Xr-1, Yr-1), feed a network of 0/1 muxes selected by control signals t1-t3 into an adder; the updated y[ii] is stored back.]
PICO Methodology – Why It Works
• Systematic design methodology
– 1. Parameterized meta-architecture – all NPAs have
same general organization
– 2. Performance/throughput is input
– 3. Abstract architecture – We know how to build
compilers for this
– 4. Mapping mechanism – Determine architecture
specifics from schedule for abstract architecture
Direct Generalization of PICO?
• Programmability would require full interconnect between elements
• Back to the meta architecture!
• Generalize connectivity to enable post-programmability
• But stylize it
Programmable Loop Accelerator – Design Strategy
• Compile for partially defined architecture
– Build long distance communication into schedule
– Limit global communication bandwidth
• Proposed meta-architecture
– Multi-cluster VLIW
• Explicit inter-cluster transfers (varying latency/BW)
• Intra-cluster communication is complete
– Hardware partially defined – expensive units
Programmable Loop Accelerator Schema
[Schema diagram: DRAM feeds a stream unit and stream buffers; each accelerator datapath holds FUs, local MEM/SRAM, shift registers (sized by II), and a control unit; intra-cluster communication is complete, while clusters exchange values through an inter-cluster register file; accelerators compose into a pipeline of tiled or clustered accelerators.]
Flow Diagram
[Flow diagram: assembly code and II feed FU Alloc, which fixes the # of clusters and # of expensive FUs; Partition then assigns FUs to clusters, determining the # of cheap FUs and the inter-cluster bandwidth; Modulo Schedule sets shift register depth, width, and porting, yielding the Loop Accelerator.]
Sobel Kernel
for (i = 0; i < N1; i++) {
  for (j = 0; j < N2; j++) {
    int t00, t01, t02, t10, t12, t20, t21, t22;
    int e1, e2, e12, e22, e, tmp;
    t00 = x[i  ][j  ];
    t01 = x[i  ][j+1];
    t02 = x[i  ][j+2];
    t10 = x[i+1][j  ];
    t12 = x[i+1][j+2];
    t20 = x[i+2][j  ];
    t21 = x[i+2][j+1];
    t22 = x[i+2][j+2];
    e1 = ((t00 + t01) + (t01 + t02)) -
         ((t20 + t21) + (t21 + t22));
    e2 = ((t00 + t10) + (t10 + t20)) -
         ((t02 + t12) + (t12 + t22));
    e12 = e1*e1;
    e22 = e2*e2;
    e = e12 + e22;
    if (e > threshold) tmp = 1;
    else tmp = 0;
    edge[i][j] = tmp;
  }
}
FU Allocation
• Determine number of clusters: ⌈#ops / (4 × II)⌉
• Determine number of expensive FUs (MPY, DIV, memory): ⌈#ops_of_type / II⌉
• Sobel with II=4:
  – 41 ops → 3 clusters
  – 2 MPY ops → 1 multiplier
  – 9 memory ops → 3 memory units
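The allocation rules above are two ceiling divisions; a minimal C sketch using the Sobel numbers from this slide (function and variable names are illustrative):

#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main(void) {
    int II = 4;                    /* initiation interval (design input) */
    int total_ops = 41;            /* ops in the Sobel loop body */
    int mpy_ops = 2, mem_ops = 9;  /* expensive-op counts */

    int clusters  = ceil_div(total_ops, 4 * II);  /* ceil(41/16) = 3 */
    int mpy_units = ceil_div(mpy_ops, II);        /* ceil(2/4)   = 1 */
    int mem_units = ceil_div(mem_ops, II);        /* ceil(9/4)   = 3 */

    printf("clusters=%d, multipliers=%d, memory units=%d\n",
           clusters, mpy_units, mem_units);
    return 0;
}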
Partitioning
• Multi-level approach consists of two phases
– Coarsening
– Refinement
• Minimize inter-cluster communication
• Load balance
– Max of 4 × II operations per cluster
• Take FU allocation into account
– Restricted # of expensive units
– # of cheap units (ADD, logic) determined from partition
Coarsening
• Group highly related operations together
– Pair operations together at each step
– Forces partitioner to consider several operations as a
single unit
• Coarsening Sobel subgraph into 2 groups:
[Diagram: a Sobel dataflow subgraph of loads (L) and adds (+), shown being coarsened step by step until two groups remain.]
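A minimal sketch of one coarsening pass in C, pairing each unmatched op with its most heavily connected unmatched neighbor so the partitioner later treats the pair as a single unit; the data layout and greedy matching order are our assumptions, not necessarily the authors' implementation:

#include <string.h>

#define MAX_OPS 64

int weight[MAX_OPS][MAX_OPS];   /* dataflow edge weights between ops */
int group[MAX_OPS];             /* group id assigned by this pass */

void coarsen(int n_ops) {
    int matched[MAX_OPS];
    memset(matched, 0, sizeof matched);
    int next_group = 0;
    for (int u = 0; u < n_ops; u++) {
        if (matched[u]) continue;
        int best = -1;
        /* pick the heaviest edge to a still-unmatched neighbor */
        for (int v = 0; v < n_ops; v++)
            if (!matched[v] && v != u && weight[u][v] > 0 &&
                (best < 0 || weight[u][v] > weight[u][best]))
                best = v;
        group[u] = next_group;
        matched[u] = 1;
        if (best >= 0) { group[best] = next_group; matched[best] = 1; }
        next_group++;
    }
}

Repeated passes roughly halve the graph until the desired number of groups remains.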
Refinement
• Move operations between clusters
• Good moves:
– Reduce inter-cluster communication
– Improve load balance
– Reduce hardware cost
• Reduce number of expensive units to meet limit
• Collect similar-bitwidth operations together

[Diagram: a candidate move ("?") of one operation between two load/add groups.]
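A hedged sketch of how one refinement move might be scored, combining the cut-edge and load-balance criteria above; the 4 × II capacity check follows the partitioning slide, but all names and the linear weighting are illustrative:

#define MAX_OPS 64
#define MAX_CLUSTERS 8

int cluster_of[MAX_OPS];
int weight[MAX_OPS][MAX_OPS];   /* dataflow edge weights */
int load[MAX_CLUSTERS];         /* ops currently in each cluster */
int n_ops, II;

/* edges from op u that would cross clusters if u lived in cluster c */
static int cut_edges(int u, int c) {
    int cut = 0;
    for (int v = 0; v < n_ops; v++)
        if (v != u && weight[u][v] && cluster_of[v] != c)
            cut += weight[u][v];
    return cut;
}

/* higher is better; very negative rejects the move outright */
int move_gain(int u, int c) {
    if (load[c] + 1 > 4 * II)   /* respect max 4*II ops per cluster */
        return -1000000;
    int comm_gain    = cut_edges(u, cluster_of[u]) - cut_edges(u, c);
    int balance_gain = (load[cluster_of[u]] - 1) - load[c];
    return 10 * comm_gain + balance_gain;  /* illustrative weighting */
}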
Partitioning Example
• From Sobel, II=4
• Place MPYs together
• Place each tree of LOADs and ADDs together
• Cuts 6 edges
Modulo Scheduling
• Determines shift register width, depth, and number of read ports
• Sobel II=4:

[Table: for each FU (FU0-FU3) and each op scheduled on it (ADDs and LDs), the issue cycle, the maximum result lifetime, and the required shift register depth and read ports.]
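A minimal sketch, under our assumption that the required shift register depth for an FU is the longest result lifetime (in cycles) among the ops scheduled on it; the Op layout and the sample values are illustrative, not the table's actual entries:

#include <stdio.h>

typedef struct { int fu; int def_cycle; int last_use_cycle; } Op;

int main(void) {
    /* hypothetical schedule: (FU, cycle defined, cycle of last use) */
    Op ops[] = { {0, 0, 4}, {0, 3, 4}, {1, 1, 2}, {1, 2, 4} };
    int n = (int)(sizeof ops / sizeof ops[0]);
    int depth[4] = {0, 0, 0, 0};

    for (int i = 0; i < n; i++) {
        int lifetime = ops[i].last_use_cycle - ops[i].def_cycle;
        if (lifetime > depth[ops[i].fu])
            depth[ops[i].fu] = lifetime;
    }
    for (int f = 0; f < 2; f++)
        printf("FU%d shift register depth >= %d\n", f, depth[f]);
    return 0;
}

Read ports would similarly come from the maximum number of consumers reading an FU's register in the same cycle.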
Test Cases
• Sobel and fsed kernels, II=4 designs
• Each machine has 4 clusters with 4 FUs per cluster
[Diagram: per-cluster FU mix for the sobel and fsed machines (M = memory, B = branch, * = multiply, << = shift, & = logic, +- = add/sub, +& = add/logic); e.g., the sobel machine has 3 memory units and 1 multiplier, matching the FU allocation.]
Cross Compile Results
• Computation is localized
– sobel: 1.5 moves/cycle
– fsed: 1 move/cycle
• Cross compile
– Can still achieve II=4
– More inter-cluster communication
– May require more units
– sobel on fsed machine: ~2 moves/cycle
– fsed on sobel machine: ~3 moves/cycle
Concluding Remarks
• Programmable loop accelerator design strategy
– Meta-architecture with stylized interconnect
– Systematic compiler-directed design flow
• Costs of programmability:
– Interconnect, inter-cluster communication
– Control – “micro-instructions” are necessary
• Just scratching the surface of this work
• For more, see the CCCP group webpage
– http://cccp.eecs.umich.edu