Fast Compilation for
Reconfigurable Hardware
Mihai Budiu and Seth Copen Goldstein
Carnegie Mellon University
Computer Science Department
Joint work with
Srihari Cadambi, Herman Schmit, Matt Moe,
Robert Taylor, Ronald Laufer
Goal
To program reconfigurable devices using the
standard software development processes:
– Compile C or Java
– Do it quickly
[Figure: compilation flow. Labels: Java, Partitioner, CPU, Data-flow Intermediate Language (DIL), Configuration, Reconfigurable HW; part of the flow is marked "This talk".]
FPGA, Feb 23 1999
(c) 1998 by Mihai Budiu
Compiler Performance on 1D DCT (8 inputs, 8 bits each)

                        DIL             Classical tools
Total compile time      2.4 s           ~75 min (Synopsys + Design Manager)
Place and route         1 s             14 min 22 s (Design Manager)
Target clock speed      75 MHz          33 MHz
Circuit size            7816 bit-ops    899 CLBs
Application speed-up    20              ~20
Target                  PipeRench       Xilinx 4085XL

Compilation: ~700x faster
The Place and Route Problem
[Figure: a data-flow graph with operators ~, &, <<, >>, .[1,2], and +. Labels: interconnection operators, interconnection network, processing elements.]
Our Target:
• Medium grain processing elements (4 bits)
• Pipelined architecture
• Virtualized hardware
• Local interconnection network
• Wide pipelined bus
The Place and Route Problem
[Figure: a data-flow graph with operators ~, &, <<, >>, .[1,2], and +. Labels: interconnection operators, interconnection network, stripe, processing elements.]
Why Place and Route Is Hard
• Hard constraints:
– Stripe width
– Pipelined bus width
• Word-based circuit
– interconnection network switches words
– fixed PE size
• Scarce input ports for the interconnection network
How We Simplify Place and Route
• Computation-oriented programs (restricted language, with unidirectional data flow)
• Hardware resources virtualized
• Relatively rich interconnection network
• High granularity placement (i.e., one 32-bit adder instead of 100 gates)
• There is a wide pipelined bus available
• Timing is very predictable
The Key Idea
• Global analysis and transformations guarantee
placeability using lazy noops (conservatively)
• Deterministic, greedy place & route
(no backtracking)
• All passes linear time in the size of the circuit
Guaranteeing Placement
[Figure: the operator graph (~, &, <<, >>, .[1,2], +) shown without and with inserted noops. Labels: complex permutation, simple permutation, noop.]
The inserted noops are sufficient but not necessary
Placement of a Non-lazy Noop
[Figure: operator graph (~, &, +) with noop nodes between the operators.]
Lazy Noops Are Not Placed
[Figure: operator graph (~, &, +) with noop nodes between the operators.]
Place and Route Overview
• Analysis:
– Noops have been inserted to guarantee that the graph is routable.
• Place & Route:
– Determines which lazy noops are instantiated
Next: actual Place and Route
Step 1: Analyze Routability
[Figure: operators ~ and & are already placed, with noops below them; a row of candidate positions for the + node is shown.]
Q: Can we place the + given the placement of its ancestors?
Step 2: If a Node Is Unroutable
[Figure: operator graph (~, &, +) with noop nodes between the operators.]
Solution: promote a lazy noop
Step 3: Choosing a Noop
[Figure: operator graph (~, &, +) with noop nodes between the operators; one noop is highlighted.]
Choose the closest noop that is routable.
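Steps 1 to 3 can be illustrated with a small toy model. This is a sketch under stated assumptions, not the authors' algorithm: the stripe width `WIDTH`, the interconnect reach `REACH`, the rule that an operand must sit in the stripe directly above and within `REACH` columns, and the tie-breaking used to pick a noop are all hypothetical choices made for illustration. What it does share with the slides is the shape of the loop: test routability, promote a lazy noop when the test fails, and never backtrack.

```python
WIDTH = 4        # PEs per stripe (hypothetical value)
REACH = 1        # a PE reads only columns within +/-REACH in the stripe above (hypothetical)
MAX_STEPS = 100  # safety bound for the toy model


def place_and_route(order, preds):
    """Greedy placement of a DAG whose nodes are listed in topological order.

    Returns (placement, promoted): placement maps node -> (stripe, column);
    promoted lists the slots of lazy noops that had to be instantiated.
    """
    used = set()   # occupied (stripe, column) slots
    where = {}     # node -> slot where its value is currently available
    placement, promoted = {}, []

    def free_cols(stripe, cols):
        return [c for c in cols if 0 <= c < WIDTH and (stripe, c) not in used]

    for node in order:
        ops = preds.get(node, [])
        for _ in range(MAX_STEPS):
            stripe = 1 + max((where[p][0] for p in ops), default=-1)
            # Step 1: is there a free slot from which every operand is reachable?
            ok = [c for c in free_cols(stripe, range(WIDTH))
                  if all(where[p][0] == stripe - 1
                         and abs(where[p][1] - c) <= REACH for p in ops)]
            if ok:
                placement[node] = where[node] = (stripe, ok[0])
                used.add(where[node])
                break
            if not ops:
                raise RuntimeError("toy model: more inputs than one stripe holds")
            # Steps 2-3: promote one lazy noop. Pick the operand in the highest
            # stripe; break ties by distance from the operands' middle column.
            mid = sum(where[p][1] for p in ops) // len(ops)
            p = min(ops, key=lambda q: (where[q][0], -abs(where[q][1] - mid)))
            s, c = where[p]
            near = free_cols(s + 1, range(c - REACH, c + REACH + 1))
            if not near:
                raise RuntimeError(f"toy model cannot forward {p}")
            slot = (s + 1, min(near, key=lambda k: abs(k - mid)))
            used.add(slot)
            promoted.append(slot)
            where[p] = slot   # the value is now available one stripe lower
        else:
            raise RuntimeError(f"toy model gave up on {node}")
    return placement, promoted


# Four inputs fill stripe 0; "sum" needs inputs a and d, which sit too far apart.
order = ["a", "b", "c", "d", "and", "sum"]
preds = {"and": ["a", "b"], "sum": ["a", "d"]}
placement, promoted = place_and_route(order, preds)
print(placement["sum"], promoted)
```

In this toy run the `and` node is placed directly, while `sum` only becomes routable after two noops are promoted to bring its operands within reach of one slot. In the real compiler, the global analysis pass is what guarantees that such a promotion always exists; the toy model simply raises an error when it does not.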
Other Details
• Operators are decomposed into pieces to meet:
– timing constraints
– size constraints
• When placing, optimize for:
– register pressure when accessing the bus
– constraints placed on future nodes
• Long critical paths are sliced with pipeline registers
Compilation Times (Seconds on PII/400)
[Figure: bar chart of compilation time in seconds per kernel (Lfir, atr2, cordic, csdmult, dct, encoder, idea, nqueens, over, square, varpoly); y-axis 0 to 9. Bar labels in extraction order: 8.07, 2.43, 2.27, 1.36, 1.25, 0.95, 0.84, 0.13, 0.07, 0.47, 0.86.]
Compilation Speed (PII/400)
[Figure: per-kernel chart with two series, bitops (axis: Bit Operations/Kernel) and bitops/sec (axis: Bit Operations Compiled/Sec). Kernels: ATR, Cordic, CSD, DCT, Encoder, FIR, IDEA, Nqueens, Over, Square, Varpoly, plus the geometric mean (GMean).]
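The bit-ops/sec metric can be read as circuit size divided by compile time. The sketch below applies that reading to the DCT figures quoted earlier in the talk (7816 bit-ops, 2.4 s total compile time). Dividing by total compile time rather than by a single phase is an assumption, and `bitops_per_second` is a hypothetical helper name.

```python
def bitops_per_second(bit_ops, compile_seconds):
    """Compilation throughput: bit operations compiled per second."""
    return bit_ops / compile_seconds


# DCT figures from the performance comparison slide.
dct_rate = bitops_per_second(7816, 2.4)
print(round(dct_rate))
```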
Compilation Times Breakdown
[Figure: stacked bar chart, 0% to 100%, of compilation time per kernel, split into evaluation, simplification, library, analysis, place, and other. An annotation marks the place and route portion.]
Placed Circuit Utilization
[Figure: per-kernel chart, 0% to 100%, with two series: utilization and effective utilization. Kernels: ATR, Cordic, CSD, DCT, Encoder, FIR, IDEA, Nqueens, Over, Square, Varpoly, plus the geometric mean (GMean).]
Simulated Speed-up vs. UltraSparc @ 300 MHz
[Figure: bar chart on a logarithmic axis from 1.0 to 1000.0. Kernels: ATR, Cordic, DCT, FIR, IDEA, Nqueens, Over. Bar labels in extraction order: 328.8, 90.9, 76.1, 61.8, 29.0, 26.0, 20.6.]
Conclusions
• Fast compilation from HLL is achievable (seconds, not tens of minutes)
• High-quality output is achievable (60% density)
• Linear-time place and route is feasible using the technique of lazy noops
Future Work
• Time-multiplexing the bus
• Porting to commercial FPGAs
• Front-end from C/Java to DIL
Timing and Size Guarantees
[Figure: an addition on 24-bit operands shown alongside additions on 8-bit pieces.]
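The decomposition on this slide (a 24-bit addition split into 8-bit pieces) can be sketched in software as a chain of narrow adds linked by a carry. How the compiler actually wires the carries between pieces is not stated in the slides, so the ripple-carry scheme below is an assumption, and `chunked_add` and `CHUNK_BITS` are hypothetical names.

```python
CHUNK_BITS = 8  # width of one piece, matching the 8-bit pieces on the slide


def chunked_add(a, b, total_bits=24):
    """Add two unsigned total_bits-wide numbers using only CHUNK_BITS-wide adds.

    Returns (sum modulo 2**total_bits, carry out of the top piece).
    """
    mask = (1 << CHUNK_BITS) - 1
    result, carry = 0, 0
    for shift in range(0, total_bits, CHUNK_BITS):
        # One narrow adder: two 8-bit operands plus the carry from the piece below.
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> CHUNK_BITS
    return result, carry


print(chunked_add(0x00FFFF, 0x000001))
```

Each piece depends only on its own operand slices and one carry bit. That keeps every piece within a fixed size and delay, which is the property the timing and size constraints call for.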
Optimize for Register Pressure
[Figure: operators ~ and & with noops below them, and a row of candidate positions for the + node.]
Cost per candidate position: 1 2 1 -- -- 0
Best position: the candidate with cost 0
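Choosing the best position from a row of costs like the one on this slide amounts to a minimum over the routable candidates. In the sketch below, reading "--" as "not routable" and breaking ties toward the leftmost candidate are both assumptions, and `best_position` is a hypothetical helper name.

```python
def best_position(costs):
    """Return the index of the cheapest routable position, or None if there is none.

    A cost of None marks a position that is not routable.
    """
    routable = [(cost, i) for i, cost in enumerate(costs) if cost is not None]
    if not routable:
        return None
    return min(routable)[1]  # lowest cost wins; ties go to the leftmost index


costs = [1, 2, 1, None, None, 0]  # the cost row from the slide
print(best_position(costs))
```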
Kernels

Benchmark   Description
ATR         Automatic Target Recognition (image pattern scan).
Cordic      Honeywell timing benchmark for vector rotation.
CSD         Canonical signed digit multiplier with the constant 123.
DCT         One-dimensional 8-point discrete cosine transform.
Encoder     Huffman encoder for fixed frequencies.
FIR         Finite Impulse Response filter with 20 taps.
IDEA        PGP encryption algorithm.
Nqueens     8x8 queens solution tester.
Over        Porter-Duff "over" operator.
Square      Squaring a 16-bit number.
Varpoly     Evaluating a degree-3 polynomial with variable coefficients at a given point.