A Library of Parameterized Hardware Modules for Floating Point Arithmetic and Its Use
Prof. Miriam Leeser
Pavle Belanovic
Department of Electrical and Computer Engineering
Northeastern University
Boston MA
Outline
• Introduction and motivation
• Library of hardware modules for variable precision
floating point
• Application: K-means algorithm using variable
precision floating point library
• Conclusions
Accelerating Algorithms
• Reconfigurable hardware used to accelerate
image and signal processing algorithms
• Exploit parallelism for speedup
• Customize design to fit the particular task
– signals in fixed or floating-point format
– area, power vs. range, precision trade-offs
Format Design Trade-offs
• Narrower signal bitwidth: fewer hardware resources, less power dissipation, more parallelism
• Wider signal bitwidth: better range and precision
• Goal: use the optimal format for every value
Area Gains Using Reduced Precision
• One Xilinx XCV1000 FPGA holds:
  – IEEE single precision: 31 adders or 13 multipliers
  – 12-bit floating-point format: 113 adders or 85 multipliers
General Floating-Point Format

Field               Symbol   Bitwidth
sign                s        1
exponent            e        exp_bits
fraction/mantissa   f        man_bits

Value = (-1)^s * 1.f * 2^(e-BIAS)
Fields are stored sign, exponent, fraction/mantissa, from MSB to LSB.
IEEE Floating Point Format
• Fields: s (1 bit), exp e (8 bits), fraction f (23 bits); 32 bits total
Value = (-1)^s * 1.f * 2^(e-BIAS)
• BIAS depends on the number of exponent bits
  » 127 in IEEE single precision format
• Implied 1 in mantissa not stored
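For illustration, a small software model (not part of the VHDL library) that decodes a bit pattern in the general (exp_bits, man_bits) format of the two slides above; the bias formula 2^(exp_bits-1) - 1 is an assumption that matches the IEEE value of 127 for 8 exponent bits, and the function name is ours.

def decode(bits, exp_bits, man_bits):
    """Interpret a bit pattern as (-1)^s * 1.f * 2^(e - BIAS).

    Software model of the general format above; the bias is assumed to be
    2**(exp_bits - 1) - 1, which gives 127 for IEEE single precision.
    Special cases (zero, infinities, NaN) are ignored for brevity.
    """
    bias = (1 << (exp_bits - 1)) - 1
    f = bits & ((1 << man_bits) - 1)                 # fraction field (LSBs)
    e = (bits >> man_bits) & ((1 << exp_bits) - 1)   # exponent field
    s = (bits >> (man_bits + exp_bits)) & 1          # sign bit (MSB)
    return (-1.0) ** s * (1.0 + f / (1 << man_bits)) * 2.0 ** (e - bias)

# IEEE single precision (exp_bits=8, man_bits=23): 0x3F800000 encodes 1.0
assert decode(0x3F800000, 8, 23) == 1.0
# a 12-bit format (exp_bits=5, man_bits=6): 0x3C0 also encodes 1.0
assert decode(0x3C0, 5, 6) == 1.0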
Library of Parameterized Modules
• Total of seven parameterized hardware modules
for arbitrary precision floating-point arithmetic
Group            Module      Latency
format control   denorm      0
                 rnd_norm    2
operators        fp_add      4
                 fp_sub      4
                 fp_mul      3
conversion       fix2float   4/5
                 float2fix   4/5
Highlights
• Completely general floating-point format
• All IEEE formats are a subset
• All previously published non-IEEE formats are a
subset
• Normalization abstracted from other operations
• Rounding toward zero or to nearest
• Pipelining signals
• Some error handling
Assembly of Modules
• 2 denorm + 1 fp_add + 1 rnd_norm = 1 IEEE single precision adder
• Normalized IEEE single precision values feed two DENORM instances (exp_bits=8, man_bits=23)
• FP_ADD (exp_bits=8, man_bits=24) operates on the denormalized operands
• RND_NORM (exp_bits=8, man_bits_in=25, man_bits_out=23) produces the normalized IEEE single precision sum
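A minimal bookkeeping sketch of this assembly: the module and generic names come from the slide above, while the width and latency checks (latencies taken from the module table, assumed to add stage by stage) are our own illustration, not library code.

# Software bookkeeping for the IEEE single precision adder assembly above.
adder_pipeline = [
    ("denorm",   {"exp_bits": 8, "man_bits": 23}, 0),   # x2, one per operand
    ("fp_add",   {"exp_bits": 8, "man_bits": 24}, 4),
    ("rnd_norm", {"exp_bits": 8, "man_bits_in": 25, "man_bits_out": 23}, 2),
]

# denorm inserts the implied bit: 23-bit IEEE fraction -> 24-bit mantissa
assert adder_pipeline[0][1]["man_bits"] + 1 == adder_pipeline[1][1]["man_bits"]
# fp_add can produce a carry-out: 24-bit mantissas -> 25-bit sum
assert adder_pipeline[1][1]["man_bits"] + 1 == adder_pipeline[2][1]["man_bits_in"]
# rnd_norm returns to the 23-bit IEEE fraction
assert adder_pipeline[2][1]["man_bits_out"] == 23
# total latency, assuming the stage latencies from the module table simply add
assert sum(stage[2] for stage in adder_pipeline) == 6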
Denormalization
• "Unpack" the input number: insert the implied digit
• If the input value is zero, insert '0'; otherwise insert '1'
• Output is 1 bit wider than the input
• Latency = 0
[Block diagram: Denormalize (denorm). Inputs: IN1, READY, EXCEPTION_IN. Outputs: OUT1, DONE, EXCEPTION_OUT.]
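A behavioral sketch of the denorm step (illustrative Python, not the module's VHDL); zero is assumed here to be encoded with all-zero exponent and fraction fields.

def denorm(sign, exp, frac, man_bits):
    """Unpack a normalized number: insert the implied digit above the fraction.

    The implied bit is '0' when the input encodes the value zero (assumed
    to be all-zero exponent and fraction) and '1' otherwise, so the
    mantissa is one bit wider than the stored fraction.  Behavioral model.
    """
    implied = 0 if (exp == 0 and frac == 0) else 1
    mantissa = (implied << man_bits) | frac
    return sign, exp, mantissa          # mantissa is man_bits + 1 bits wide

# IEEE single precision 1.0: exponent field 127, 23-bit fraction 0
assert denorm(0, 127, 0, 23) == (0, 127, 1 << 23)
# the value zero keeps an all-zero mantissa
assert denorm(0, 0, 0, 23) == (0, 0, 0)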
Rounding and Normalizing
• Returns input to normalized format
• Designed to follow arithmetic operation(s)
[Block diagram: Round and Normalize (rnd_norm). Inputs: IN1, READY, CLK, ROUND, EXCEPTION_IN. Outputs: OUT1, DONE, EXCEPTION_OUT. Internally, a ROUND_ADD stage feeds a NORMALIZER stage operating on the s, e, f fields.]
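A behavioral sketch of round-and-normalize (round to nearest shown; round toward zero simply drops the discarded bits). This is an illustrative model: tie handling and exception signals are omitted.

def rnd_norm(exp, mantissa, man_bits_in, man_bits_out, round_nearest=True):
    """Renormalize a widened mantissa and round it to man_bits_out bits.

    mantissa is an unsigned value of up to man_bits_in bits, e.g. an adder
    result with a carry-out.  Returns (exp, fraction) with the implied
    leading 1 stripped.  Behavioral model: ties and exceptions are ignored,
    and round_nearest=False simply truncates (round toward zero).
    """
    assert mantissa < (1 << man_bits_in)
    if mantissa == 0:
        return 0, 0                               # the value zero
    while mantissa >= (1 << (man_bits_out + 1)):  # leading 1 too high: shift right
        round_bit = mantissa & 1
        mantissa >>= 1
        exp += 1
        if round_nearest and round_bit:
            mantissa += 1                         # may ripple; loop re-checks
    while mantissa < (1 << man_bits_out):         # leading 1 too low: shift left
        mantissa <<= 1
        exp -= 1
    return exp, mantissa - (1 << man_bits_out)    # strip the implied 1

# 25-bit sum of 1.5 + 1.5 (carry out) renormalizes to 1.5 * 2^1
assert rnd_norm(127, 0b11 << 23, 25, 23) == (128, 1 << 22)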
Addition and Subtraction
[Block diagram: Addition (fp_add). Inputs: OP1, OP2, READY, CLK, EXCEPTION_IN. Outputs: RESULT, DONE, EXCEPTION_OUT. Internally, a PARAMETERIZED_COMPARATOR and SWAP order the operands, SHIFT_ADJUST aligns the smaller fraction, ADD_SUB adds or subtracts, and CORRECTION forms the result and the exception flag.]
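A behavioral sketch of the addition algorithm the diagram implies: compare and swap, align, add or subtract, correct. Operands are assumed to be unpacked (sign, exponent, mantissa) tuples; rounding and normalization are left to rnd_norm, and the function itself is illustrative.

def fp_add(a, b):
    """Add two unpacked operands (sign, exp, mantissa), as in the diagram:
    a comparator/SWAP orders them, SHIFT_ADJUST aligns the smaller
    mantissa, ADD_SUB adds or subtracts, and CORRECTION fixes the sign.
    Behavioral model; rounding and normalization are left to rnd_norm.
    """
    (s1, e1, m1), (s2, e2, m2) = a, b
    if (e2, m2) > (e1, m1):                       # SWAP: keep the larger magnitude
        (s1, e1, m1), (s2, e2, m2) = (s2, e2, m2), (s1, e1, m1)   # as operand 1
    m2 >>= (e1 - e2)                              # SHIFT_ADJUST: align mantissas
    m = m1 + m2 if s1 == s2 else m1 - m2          # ADD_SUB
    s = s1 if m != 0 else 0                       # CORRECTION: sign of the result
    return s, e1, m                               # mantissa may be 1 bit wider

# 1.5 + 1.5 with 24-bit mantissas (implied 1 at bit 23): unnormalized 3.0
assert fp_add((0, 127, 0b11 << 22), (0, 127, 0b11 << 22)) == (0, 127, 0b11 << 23)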
Multiplication
[Block diagram: Multiplication (fp_mul). Inputs: OP1, OP2, READY, CLK, EXCEPTION_IN. Outputs: RESULT, DONE, EXCEPTION_OUT. Internally, the sign bits s1 and s2 are combined, the exponents e1 and e2 are added and the constant BIAS-1 is subtracted, and the fractions f1 and f2 are multiplied.]
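A behavioral sketch of floating-point multiplication: signs combined, exponents added with the bias removed, mantissas multiplied, then renormalized. The BIAS-1 constant in the diagram is folded into the normalization step here, so this is a functional model rather than the exact datapath.

def fp_mul(a, b, man_bits, bias):
    """Multiply two unpacked operands (sign, exp, mantissa).

    Signs are combined, the exponents are added and the bias removed, the
    mantissas are multiplied, and the product is renormalized when it lies
    in [2, 4).  Behavioral model of fp_mul; the BIAS-1 constant shown in
    the diagram is folded into this normalization.
    """
    (s1, e1, m1), (s2, e2, m2) = a, b
    s = s1 ^ s2
    e = e1 + e2 - bias
    m = (m1 * m2) >> man_bits               # keep man_bits fraction bits
    if m >= (1 << (man_bits + 1)):          # product >= 2.0: shift down
        m >>= 1
        e += 1
    return s, e, m

# 1.5 * 2.0 = 3.0 with man_bits=23 (implied 1 at bit 23) and bias 127
assert fp_mul((0, 127, 0b11 << 22), (0, 128, 1 << 23), 23, 127) == (0, 128, 0b11 << 22)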
Fixed to Floating-Point
[Block diagram: Fix to Float (fix2float). Inputs: FIXED, READY, CLK, ROUND, EXCEPTION_IN. Outputs: FLOAT, DONE, EXCEPTION_OUT. Internally, an absolute value unit, a priority encoder that locates the leading one, and a variable shifter form the exponent and fraction.]
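A behavioral sketch of fix2float following the diagram (absolute value, priority encoder, variable shifter); discarded low bits are truncated rather than rounded, and the bias formula 2^(exp_bits-1) - 1 is an assumption.

def fix2float(value, exp_bits, man_bits):
    """Convert a signed integer to unpacked (sign, exp, fraction) fields.

    As in the diagram: absolute value, a priority encoder to find the
    leading one, and a variable shifter to align it with the implied-bit
    position.  Behavioral model: truncation instead of rounding, and the
    bias 2**(exp_bits - 1) - 1 is an assumption.
    """
    bias = (1 << (exp_bits - 1)) - 1
    if value == 0:
        return 0, 0, 0
    sign = 1 if value < 0 else 0
    mag = abs(value)
    pos = mag.bit_length() - 1                # priority encoder: leading-one index
    if pos >= man_bits:                       # variable shifter
        frac = (mag >> (pos - man_bits)) & ((1 << man_bits) - 1)
    else:
        frac = (mag << (man_bits - pos)) & ((1 << man_bits) - 1)
    return sign, pos + bias, frac

# -6 = -1.5 * 2^2 in a 12-bit format (exp_bits=5, man_bits=6): exp = 2 + 15
assert fix2float(-6, 5, 6) == (1, 17, 0b100000)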
Floating to Fixed-Point
[Block diagram: Float to Fix (float2fix). Inputs: FLOAT, READY, CLK, ROUND, EXCEPTION_IN. Outputs: FIXED, DONE, EXCEPTION_OUT. Internally, the exponent e is compared against the bias and against fix_bits+bias-2 to detect out-of-range values, a variable shifter positions the fraction f, and the sign s is applied.]
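A behavioral sketch of float2fix following the diagram (range check against fix_bits+bias-2, variable shifter, sign); rounding is omitted and the bias is again assumed to be 2^(exp_bits-1) - 1.

def float2fix(sign, exp, frac, exp_bits, man_bits, fix_bits):
    """Convert unpacked (sign, exp, fraction) fields to a signed integer.

    As in the diagram: the exponent is checked against fix_bits + bias - 2
    (overflow), a variable shifter positions the mantissa, and the sign is
    applied.  Behavioral model: discarded bits are truncated rather than
    rounded, and the bias 2**(exp_bits - 1) - 1 is an assumption.
    """
    bias = (1 << (exp_bits - 1)) - 1
    if exp > fix_bits + bias - 2:
        raise OverflowError("value out of range for the fixed-point format")
    mantissa = (1 << man_bits) | frac         # restore the implied 1
    shift = (exp - bias) - man_bits           # align the binary point
    mag = mantissa << shift if shift >= 0 else mantissa >> -shift
    return -mag if sign else mag

# -1.5 * 2^2 in the 12-bit format (exp_bits=5, man_bits=6) -> -6 in 12-bit fixed
assert float2fix(1, 17, 0b100000, 5, 6, 12) == -6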
Implementation Experiments
• Designs specified in VHDL
• Mapped to Xilinx Virtex FPGA
• Wildstar reconfigurable computing engine by Annapolis
Micro Systems Inc.
– PCI interface to host processor
– 3 Xilinx XCV1000 FPGAs
– total of 3 million system gates
– 40 Mbytes of SRAM
– 1.6 Gbytes/sec I/O bandwidth
– 6.4 Gbytes/sec memory bandwidth
– clock rates up to 100 MHz
Synthesis Results
[Plot: area in slices (0 to 700) vs. total bitwidth (6 to 34) for fp_add and fp_mul]
Synthesis Results
[Plot: median number of cores per XCV1000 (0 to 250) vs. total bitwidth (6 to 34) for fp_add and fp_mul]
K-means Algorithm
[Figure: image spectral data. Each pixel X_ij (i = 0 to I, j = 0 to J) has spectral components k with 12-bit magnitudes; in the clustered image every pixel X_ij is assigned a class (classes 0 to 4 shown).]
• Each cluster has a center:
  – the mean value of the pixels in that cluster
• Each pixel belongs to the cluster whose center is closest to it
  – requires a distance metric
• The algorithm is iterative
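A software sketch of one iteration; the sum-of-absolute-differences distance matches the subtract/absolute value/add datapath on the following slides, but the metric and the helper names here are illustrative, not taken from the hardware.

def kmeans_step(pixels, centers):
    """One k-means iteration: assign each pixel to its nearest center,
    then recompute each center as the mean of its assigned pixels.

    pixels and centers are lists of equal-length spectral vectors.
    Distance is the sum of absolute differences, matching the
    subtract / absolute value / add datapath; illustrative only.
    """
    def dist(p, c):
        return sum(abs(pk - ck) for pk, ck in zip(p, c))

    # Assignment: nearest center for every pixel
    labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
              for p in pixels]

    # Update: mean of the pixels in each cluster (center kept if empty)
    new_centers = []
    for j, c in enumerate(centers):
        members = [p for p, lab in zip(pixels, labels) if lab == j]
        if members:
            c = [sum(ch) / len(members) for ch in zip(*members)]
        new_centers.append(c)
    return labels, new_centers

# Two 3-channel pixels, two centers
labels, centers = kmeans_step([[0, 0, 0], [10, 10, 10]], [[1, 1, 1], [9, 9, 9]])
assert labels == [0, 1]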
Hardware Implementation of k-means Clustering
[Datapath diagram (8x replication): 10 input channels of pixel data P_i,k and cluster-center data C_j,k feed dist units; adders accumulate the per-channel distances and a tree of min units selects the closest cluster, producing the Cluster Number and the Cluster Distance Measure. The datapath performs 290 16-bit operations.]
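A software model of what this datapath computes for one pixel; the 10 channels and the minimum over the cluster centers follow the diagram, while the function itself is illustrative.

def classify_pixel(pixel, centers):
    """Compute the datapath's result for one pixel.

    For each cluster center: subtract, take absolute values, and sum over
    the spectral channels (dist units plus adder tree); a tree of min
    units then selects the smallest distance.  Returns the pair
    (cluster_number, cluster_distance).  Software model; illustrative only.
    """
    distances = [sum(abs(p - c) for p, c in zip(pixel, center))
                 for center in centers]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]

# 10-channel pixel against 8 cluster centers (the 8x replication above)
pixel = [5] * 10
centers = [[v] * 10 for v in range(8)]
assert classify_pixel(pixel, centers) == (5, 0)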
K-means Clustering Algorithm
[Diagram: two datapath variants. The purely fixed-point version performs every stage in fixed-point; the hybrid version performs the first and last stages in fixed-point and the middle stages in floating-point.]
Structure of the K-means Circuit
[Block diagram: memory supplies pixel data and cluster centers to the datapath (pixel shift, subtraction, absolute value, addition, comparison); acknowledge and data-valid signals, accumulators, and the cluster assignment logic complete the circuit.]
Hybrid Datapath
[Datapath diagram: the PIXEL(k) and CLUSTER(k) inputs, k = 0 to 9, arrive in unsigned fixed-point (12 bits) and each passes through a fix2float module; subtraction, absolute value, and the addition tree operate in a 12-bit floating-point format (exp_bits=5, man_bits=6); a float2fix module converts the accumulated DISTANCE to unsigned fixed-point (16 bits) for the comparison stage.]
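A behavioral model of the hybrid flow: fixed-point inputs, per-channel processing in the small floating-point format, and a fixed-point distance out. Python floats quantized to man_bits=6 stand in for the custom format here, so this only approximates where precision is lost; the helper names are ours, not the hardware's.

import math

def quantize(value, man_bits=6):
    """Round a non-negative value to man_bits fraction bits of mantissa,
    modeling the 12-bit exp_bits=5 / man_bits=6 format of the hybrid
    datapath (exponent range checks omitted).  Illustrative only."""
    if value == 0:
        return 0.0
    e = math.floor(math.log2(value))
    scale = 2.0 ** (e - man_bits)
    return round(value / scale) * scale

def hybrid_distance(pixel, center):
    """Distance as the hybrid datapath computes it: fixed-point inputs
    pass through fix2float, subtraction, absolute value, and the adder
    tree run in the small float format, and float2fix returns the result
    as an integer (16-bit unsigned fixed-point in the hardware)."""
    total = 0.0
    for p, c in zip(pixel, center):
        diff = abs(quantize(p) - quantize(c))     # fix2float, subtract, abs
        total = quantize(total + quantize(diff))  # addition in the float format
    return int(total)                             # float2fix

# 10-channel example with exactly representable 12-bit input samples
pixel  = [64, 128, 192, 256, 320, 384, 448, 512, 576, 640]
center = [72, 136, 200, 264, 328, 392, 456, 520, 584, 648]
assert hybrid_distance(pixel, center) == 80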
Results of Processing
[Clustered output images: purely fixed-point vs. hybrid fixed and floating-point]
Synthesis Results
Property             Fixed-point    Hybrid
Area                 9420 slices    10883 slices
Percent of FPGA      76%            88%
Minimum period       16 ns          20 ns
Maximum frequency    64 MHz         50 MHz
Throughput           8 cycles       1 cycle
Conclusions
• Library of fully parameterized hardware modules for
floating-point arithmetic available
• Ability to form arithmetic pipelines in custom floating-point
formats demonstrated
• Future work
– More floating point modules (ACC, MAC, DIV ...)
– More applications
– Automation of design process using the library
• Automatically choose best format for each
variable and operation