A Library of Parameterized Hardware Modules for Floating Point Arithmetic and Its Use
Prof. Miriam Leeser
Pavle Belanovic
Department of Electrical and Computer Engineering
Northeastern University
Boston MA

Outline
• Introduction and motivation
• Library of hardware modules for variable precision floating point
• Application: K-means algorithm using variable precision floating point library
• Conclusions

Accelerating Algorithms
• Reconfigurable hardware used to accelerate image and signal processing algorithms
• Exploit parallelism for speedup
• Customize design to fit the particular task
  – signals in fixed or floating-point format
  – area, power vs. range, precision trade-offs

Format Design Trade-offs
Signal bitwidth:
• Narrower: fewer hardware resources, less power dissipation, more parallelism
• Wider: better range, precision
Goal: use optimal format for every value

Area Gains Using Reduced Precision
Capacity of one Xilinx XCV1000 FPGA:

  Format                          Adders        Multipliers
  IEEE single precision           31       OR   13
  12-bit floating point format    113      OR   85

General Floating-Point Format

  Field      sign   exponent   fraction/mantissa
  Symbol     s      e          f
  Bitwidth   1      exp_bits   man_bits

Bit layout (MSB to LSB): sign | exponent | fraction/mantissa
Value: (-1)^s * 1.f * 2^(e-BIAS)

IEEE Floating Point Format
Bit layout (32 bits): s (1 bit) | exp (e) (8 bits) | fraction (f) (23 bits)
Value: (-1)^s * 1.f * 2^(e-BIAS)
• BIAS depends on number of exponent bits
  » 127 in IEEE single precision format
• Implied 1 in mantissa not stored

Library of Parameterized Modules
• Total of seven parameterized hardware modules for arbitrary precision floating-point arithmetic

  Group            Module      Latency (clock cycles)
  format control   denorm      0
                   rnd_norm    2
  operators        fp_add      4
                   fp_sub      4
                   fp_mul      3
  conversion       fix2float   4/5
                   float2fix   4/5

Highlights
• Completely general floating-point format
• All IEEE formats are a subset
• All previously published non-IEEE formats are a subset
• Abstract normalization from other operations
• Rounding to zero or nearest
• Pipelining signals
• Some error handling

Assembly of Modules
2 denorm + 1 fp_add + 1 rnd_norm = 1 IEEE single precision adder
[Block diagram, top to bottom:
  Normalized IEEE single precision values
  2x DENORM (exp_bits=8, man_bits=23)
  FP_ADD (exp_bits=8, man_bits=24)
  RND_NORM (exp_bits=8, man_bits_in=25, man_bits_out=23)
  Normalized IEEE single precision sum]

Denormalization
• “Unpack” input number: insert implied digit
• If input is value zero, insert ‘0’; otherwise, insert ‘1’
• Output 1 bit wider than input
• Latency = 0
Denormalize (denorm)
  INPUTS:  IN1, READY, EXCEPTION_IN
  OUTPUTS: OUT1, DONE, EXCEPTION_OUT

Rounding and Normalizing
• Returns input to normalized format
• Designed to follow arithmetic operation(s)
Round and Normalize (rnd_norm)
  INPUTS:  IN1, READY, CLK, ROUND, EXCEPTION_IN
  OUTPUTS: OUT1, DONE, EXCEPTION_OUT
[Block diagram: IN1 split into s, e, f; NORMALIZER followed by ROUND_ADD]

Addition and Subtraction
Addition (fp_add)
  INPUTS:  OP1, OP2, READY, CLK, EXCEPTION_IN
  OUTPUTS: RESULT, DONE, EXCEPTION_OUT
[Block diagram: PARAMETERIZED_COMPARATOR, SWAP, SHIFT_ADJUST, ADD_SUB, CORRECTION]

Multiplication
Multiplication (fp_mul)
  INPUTS:  OP1, OP2, READY, CLK, EXCEPTION_IN
  OUTPUTS: RESULT, DONE, EXCEPTION_OUT
[Block diagram: e1 + e2 with carry out (exp_bits+1 bits), subtraction of BIAS-1, f1 x f2 multiplier; signals s1, s2, MSB]

Fixed to Floating-Point
Fix to Float (fix2float)
  INPUTS:  FIXED, READY, CLK, ROUND, EXCEPTION_IN
  OUTPUTS: FLOAT, DONE, EXCEPTION_OUT
[Block diagrams (two variants): absolute value, priority encoder, const, variable shifter]

Floating to Fixed-Point
Float to Fix (float2fix)
  INPUTS:  FLOAT, READY, CLK, ROUND, EXCEPTION_IN
  OUTPUTS: FIXED, DONE, EXCEPTION_OUT
[Block diagrams (two variants): subtraction of bias from e, comparison e > fix_bits+bias-2, variable shifter on f]
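
As an illustration of the general floating-point format, the denorm step, and the fixed to floating-point conversion, here is a minimal software sketch in Python. It is not the library's VHDL. The function names only mirror the module names and are hypothetical models. It assumes BIAS = 2^(exp_bits-1) - 1 (which gives the 127 quoted for IEEE single precision), assumes zero is encoded as e = 0 and f = 0, treats the fixed-point input as an unsigned integer, and rounds to zero only. Exceptions and special values are left out.

```python
def bias(exp_bits):
    # Assumption: IEEE-style bias, 2^(exp_bits-1) - 1 (127 for exp_bits = 8).
    return (1 << (exp_bits - 1)) - 1


def unpack(word, exp_bits, man_bits):
    """Split a packed word into the fields (s, e, f)."""
    f = word & ((1 << man_bits) - 1)
    e = (word >> man_bits) & ((1 << exp_bits) - 1)
    s = (word >> (exp_bits + man_bits)) & 1
    return s, e, f


def denorm(word, exp_bits, man_bits):
    """Insert the implied digit; the mantissa becomes man_bits + 1 wide.

    Assumption: the value zero is encoded as e == 0 and f == 0.
    """
    s, e, f = unpack(word, exp_bits, man_bits)
    implied = 0 if (e == 0 and f == 0) else 1
    return s, e, (implied << man_bits) | f


def decode(word, exp_bits, man_bits):
    """Evaluate (-1)^s * 1.f * 2^(e-BIAS) as a Python float."""
    s, e, m = denorm(word, exp_bits, man_bits)
    return (-1) ** s * (m / (1 << man_bits)) * 2.0 ** (e - bias(exp_bits))


def fix2float(value, exp_bits, man_bits):
    """Convert an unsigned integer to the packed format, rounding to zero."""
    if value == 0:
        return 0
    msb = value.bit_length() - 1           # role of the priority encoder
    e = msb + bias(exp_bits)
    if e >= (1 << exp_bits):
        raise OverflowError("exponent does not fit in exp_bits")
    mask = (1 << man_bits) - 1
    if msb >= man_bits:                    # role of the variable shifter
        f = (value >> (msb - man_bits)) & mask   # low bits are truncated
    else:
        f = (value << (man_bits - msb)) & mask
    return (e << man_bits) | f


# IEEE single precision is one instance of the general format.
one = decode(0x3F800000, exp_bits=8, man_bits=23)      # 1.0
# A 12-bit format: 1 sign bit, 5 exponent bits, 6 fraction bits.
word = fix2float(4095, exp_bits=5, man_bits=6)
approx = decode(word, exp_bits=5, man_bits=6)          # 4064.0 after truncation
```

The last two lines show the range versus precision trade-off of a narrow format: the 12-bit unsigned value 4095 is representable, but only its seven most significant bits survive.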

Implementation Experiments
• Designs specified in VHDL
• Mapped to Xilinx Virtex FPGA
• Wildstar reconfigurable computing engine by Annapolis Micro Systems Inc.
  – PCI interface to host processor
  – 3 Xilinx XCV1000 FPGAs
  – total of 3 million system gates
  – 40 Mbytes of SRAM
  – 1.6 Gbytes/sec I/O bandwidth
  – 6.4 Gbytes/sec memory bandwidth
  – clock rates to 100MHz

Synthesis Results
[Plot: area in slices (0 to 700) versus total bitwidth (6 to 34) for fp_add and fp_mul]

Synthesis Results
[Plot: median number of cores per XCV1000 (0 to 250) versus total bitwidth (6 to 34) for fp_add and fp_mul]

K-means Algorithm
[Figure: image spectral data (pixel Xij, i = 0 to I, j = 0 to J, 12-bit magnitude per spectral component ‘k’) and the clustered image with classes 0 to 4]
Every pixel Xij is assigned a class cj
• Each cluster has a center:
  – mean value of pixels in that cluster
• Each pixel is in the cluster whose center it is closest to
  – requires a distance metric
• Algorithm is iterative

Hardware Implementation of k-means Clustering
[Block diagram: 10 input channels; pixel components Pi,0 to Pi,3 and cluster centers Cj,0 to Cj,3 feed dist and + units (Cluster Distance Measure, 8x); min units select the Cluster Number]
290 16-bit operations

K-means Clustering Algorithm
[Figure: two versions of the algorithm, purely fixed-point and hybrid fixed and floating-point, with stages labeled Fixed-point or Floating-point]

Structure of the K-means Circuit
[Block diagram: Memory, Acknowledge, Pixel Data, Cluster Centers, Validity, Pixel Shift, DATAPATH (Subtraction, Abs. Value, Addition, Comparison), Data Valid, Accumulators, Cluster Assignment]

Hybrid Datapath
[Block diagram of the hybrid datapath]
• INPUTS: pixel and cluster center data, PIXEL(0) to PIXEL(9) and CLUSTER(0) to CLUSTER(9)
  – fixed-point format, 12 bits, unsigned
• fix2float on every input
• Datapath: Subtraction, Absolute Value, Addition
  – floating-point format, exp_bits=5, man_bits=6 (12 bits total)
• float2fix produces DISTANCE
  – fixed-point format, 16 bits, unsigned
• Comparison

Results of Processing
[Images: purely fixed-point result; hybrid fixed and floating-point result]

Synthesis Results

  Property            Fixed-point    Hybrid
  Area                9420 slices    10883 slices
  Percent of FPGA     76%            88%
  Minimum period      16ns           20ns
  Maximum frequency   64MHz          50MHz
  Throughput          8 cycles       1 cycle

Conclusions
• Library of fully parameterized hardware modules for floating-point arithmetic available
• Ability to form arithmetic pipelines in custom floating-point formats demonstrated
• Future work
  – More floating point modules (ACC, MAC, DIV ...)
  – More applications
  – Automation of design process using the library
    • Automatically choose best format for each variable and operation
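
For reference, the K-means iteration used in the application can be sketched in a few lines of Python. This is a plain software illustration, not the hardware design. It assumes the distance metric is the sum of absolute differences over the spectral channels, which is inferred from the Subtraction, Absolute Value and Addition stages of the datapath. The function names, the integer mean used for the cluster centers, and the stopping rule are hypothetical choices made for this sketch.

```python
def manhattan(pixel, center):
    # Subtraction, absolute value, addition across all channels.
    return sum(abs(p - c) for p, c in zip(pixel, center))


def assign(pixels, centers):
    # Comparison stage: each pixel goes to the closest cluster center.
    return [
        min(range(len(centers)), key=lambda j: manhattan(px, centers[j]))
        for px in pixels
    ]


def update(pixels, labels, centers):
    # Each center becomes the (integer) mean of the pixels in its cluster.
    new_centers = []
    for j, old in enumerate(centers):
        members = [px for px, lab in zip(pixels, labels) if lab == j]
        if not members:
            new_centers.append(old)        # keep the center of an empty cluster
        else:
            n = len(members)
            new_centers.append([sum(ch) // n for ch in zip(*members)])
    return new_centers


def kmeans(pixels, centers, max_iters=20):
    # Iterate until the assignment stops changing (assumed stopping rule).
    labels = assign(pixels, centers)
    for _ in range(max_iters):
        centers = update(pixels, labels, centers)
        new_labels = assign(pixels, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Four two-channel pixels and two initial centers.
pixels = [[0, 0], [1, 1], [10, 10], [11, 11]]
labels, centers = kmeans(pixels, [[0, 0], [10, 10]])
# labels is [0, 0, 1, 1]; centers is [[0, 0], [10, 10]]
```

In the hybrid design, the arithmetic inside manhattan corresponds to the part carried out in the 12-bit floating-point format, with the inputs and the final distance in fixed point.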