A case for 16-bit floating point data: FPGA image and media processing
Daniel Etiemble and Lionel Lacassagne
University Paris Sud, Orsay (France), de@lri.fr
U of T, 09/20/2005

Summary
• Graphics and media applications
  – Integer versus FP computations: accuracy, execution speed, compilation issues
  – A niche for a 16-bit floating-point format (F16, or "half")
• Methodology and benchmarks
• Hardware support
  – Customization of SIMD 16-bit FP operators on an FPGA soft core (Altera NIOS II CPU)
  – The SIMD 16-bit FP instructions
• Results
• Conclusion

Integer or FP computations?
• Both formats are used in graphics and media processing
  – Example: the Apple vImage library has four image types with four pixel types:
    • Unsigned char (0-255) or float (0.0-1.0) for color or alpha values
    • Sets of 4 unsigned chars or floats for Alpha, Red, Green, Blue
• Trade-offs
  – Precision and dynamic range
  – Memory occupation and cache footprint
  – Hardware cost (embedded applications): chip area, power dissipation

Integer or FP computations? (2)
• General trend to replace FP computations by fixed-point computations
  – Intel GPP library: "Using Fixed-Point instead of Floating-Point for Better 3D Performance" (G. Kolli), Intel Optimizing Center, http://www.devx.com/Intel/article/16478
  – Techniques for automatic floating-point to fixed-point conversion for DSP code generation (Menard et al.)

Menard et al. approach (LASTI, Lannion, France)
[Methodology diagram: starting from a correct FP algorithm, two fixed-point paths are derived. HW design (ASIC/FPGA): optimize the data-path width to minimize chip area. SW design (DSP): optimize the "mapping" of the algorithm onto a fixed architecture to maximize precision and minimize execution time and code size.]

Integer or FP computations?
(3)
• Opposite option: customized FP formats
  – "Lightweight" FP arithmetic (Fang et al.) to avoid conversions
    • With the IDCT, FP numbers with a 5-bit exponent and an 8-bit mantissa are sufficient to get a PSNR similar to 32-bit FP numbers
    • To compare with the "half" format

Integer or FP computations? (4)
• How to help a compiler "vectorize"?
  – Integers: different input and output formats
    • N bits + N bits => N+1 bits
    • N bits * N bits => 2N bits
  – FP numbers: same input and output formats
• Example: a Deriche filter on a size*size image

#define byte unsigned char
byte **X, **Y;
int32 b0, a1, a2;
for (i = 0; i < size; i++) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j]) >> 8);}}
for (i = size-1; i >= 0; i--) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i+1][j] + a2 * Y[i+2][j]) >> 8);}}

• Compiler vectorization is impossible. With 8-bit coefficients, this benchmark can be manually vectorized, but only if the programmer has a detailed knowledge of the parameters used.
• The float version is easily vectorized by the compiler.

Cases for 16-bit FP formats
• Computation when the data range exceeds the "16-bit integer" range without needing the "32-bit float" range
• Graphics and media applications
  – Not for GPUs (F16 is already used in NVidia GPUs)
  – For embedded applications
• Advantages of a 16-bit FP format
  – Reduced memory occupation (cache footprint) versus 32-bit integer or FP formats
    • CPUs without SIMD extensions (low-end embedded CPUs)
  – 2x wider SIMD instructions compared to float SIMD
    • CPUs with SIMD extensions (high-end embedded CPUs)
  – Huge advantage of SIMD float operations versus SIMD integer operations, both for compiler and manual vectorization
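To make the vectorization argument concrete, here is a float version of the causal (top-to-bottom) pass of the Deriche filter shown above. This is an illustrative sketch, not the talk's benchmark code: the image layout is flattened to a single array, the anti-causal pass is omitted, and initialization of the first two rows is assumed. With a single input/output format, no widening and no final shift are needed, so the inner loop vectorizes directly.

```c
#include <stddef.h>

/* Float version of the causal Deriche pass: inputs, outputs and
 * coefficients share one format, so the inner loop over j can be
 * vectorized by the compiler without programmer knowledge of the
 * coefficient ranges. */
static void deriche_causal_f32(const float *X, float *Y, int size,
                               float b0, float a1, float a2)
{
    /* X and Y are size x size images stored row by row; the first two
     * rows of Y are assumed already initialized. */
    for (int i = 2; i < size; i++)
        for (int j = 0; j < size; j++)
            Y[i * size + j] = b0 * X[i * size + j]
                            + a1 * Y[(i - 1) * size + j]
                            + a2 * Y[(i - 2) * size + j];
}
```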
Example: Points of Interest

Points of interest (PoI) in images: the Harris algorithm
[Dataflow diagram: image (byte) -> 3x3 gradient (Sobel) -> Ix, Iy (short) -> products Ix*Ix, Ix*Iy, Iy*Iy (int) -> 3x3 Gauss filters -> Sxx, Sxy, Syy (int) -> FI = (Sxx*Syy - Sxy^2) - 0.05*(Sxx+Syy)^2 -> threshold -> byte]
• Integer computation mixes char, short and int, and prevents an efficient use of SIMD parallelism
• F16 computation would profit from SIMD parallelism with a uniform 16-bit format

16-bit floating-point formats
• Some have been defined in DSPs but rarely used
  – Example: TMS320C32
    • Internal FP type (immediate operands): 1 sign bit, 4-bit exponent field and 11-bit fraction
    • External FP type (storage purposes): 1 sign bit, 8-bit exponent field and 7-bit fraction
• "Half" format: 1 sign bit, 5-bit exponent field and 10-bit fraction

"Half" format
• 16-bit version of the IEEE 754 single- and double-precision formats
• Introduced by ILM for the OpenEXR format
• Defined in Cg (NVidia)
• Motivation:
  – "16-bit integer based formats typically represent color component values from 0 (black) to 1 (white), but don't account for over-range values (e.g. a chrome highlight) that can be captured by film negative or other HDR displays... Conversely, 32-bit floating-point TIFF is often overkill for visual effects work. 32-bit FP TIFF provides more than sufficient precision and dynamic range for VFX images, but it comes at the cost of storage, both on disk and in memory."

Validation of the F16 approach
• Accuracy
  – Results presented in ODES-3 (2005) and CAMP'05 (2005)
  – Next slides.
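The bit-level view of the "half" layout (1 sign bit, 5-bit exponent, 10-bit fraction) can be sketched in C. This is a hypothetical software model, not the talk's VHDL operators, and it hard-codes the "minimum hardware" options evaluated later: fraction truncation, no denormals (small values flushed to zero), and overflow clamped to infinity.

```c
#include <stdint.h>
#include <string.h>

/* Convert an IEEE-754 float to the 16-bit "half" layout, truncating
 * the fraction and flushing out-of-range values (illustrative model). */
static uint16_t float_to_half_trunc(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                 /* reinterpret float bits */
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    int32_t  exp  = (int32_t)((x >> 23) & 0xFF) - 127 + 15; /* rebias 8->5 bits */
    uint32_t frac = x & 0x007FFFFFu;

    if (exp <= 0)                             /* too small: flush to zero   */
        return sign;
    if (exp >= 31)                            /* too large: clamp to +/-inf */
        return (uint16_t)(sign | 0x7C00u);
    return (uint16_t)(sign | (uint16_t)(exp << 10) | (uint16_t)(frac >> 13));
}

/* Widen a half back to float (denormals unsupported, as above). */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x03FFu;
    uint32_t x;
    if (exp == 0)
        x = sign;                             /* zero */
    else if (exp == 31)
        x = sign | 0x7F800000u | (frac << 13); /* inf / NaN */
    else
        x = sign | ((exp - 15 + 127) << 23) | (frac << 13);
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

For example, 1.0f maps to 0x3C00 and values beyond the half range (about 6.5e4) clamp to the infinity encoding 0x7C00.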
• Performance with general-purpose CPUs (Pentium 4 and PowerPC G4/G5)
  – Results presented in ODES-3 (2005) and CAMP'05 (2005)
• Performance with FPGAs (this presentation)
  – Execution time
  – Hardware cost (and power dissipation)
• Other embedded hardware (to be done another time)
  – SoC
  – Customizable CPUs (ex: Tensilica approach)

Accuracy
• Comparison of F16 computation results with F32 computation results
• Specificities of FP formats
  – Rounding?
  – Denormals?
  – NaN?

Impact of F16 accuracy and dynamic range
• Simulation of the "half" format with the "float" format on actual benchmarks or applications
  – Impact of reduced accuracy and range on results
  – F32-computed and F16-computed images are compared with PSNR measures
• Four different functions, ftd, frd, ftn and frn, simulate the F16
  – Fraction: truncation or rounding
  – With or without denormals
• For any benchmark, manual insertion of one function (ftd / frd / ftn / frn)
  – Function call before any use of a "float" value
  – Function call after any operation producing a "float" value
[Diagram: a 32-bit float (1 sign bit, 8-bit exponent, 23-bit fraction) restricted to half precision: the exponent is limited to the 5-bit range (field values 1-31) and only the 10 most significant fraction bits are kept.]

Impact of F16 accuracy and dynamic range
• Benchmark 1: zooming (A. Montanvert, Grenoble)
  – "Spline" technique for x1, x2 and x4 zooms
• Benchmark 2: JPEG (Mediabench)
  – 4 different DCT/IDCT functions: integer / fast integer / F32 / F16
• Benchmark 3: wavelet transform (L.
Lacassagne, Orsay)
  – SPIHT (Set Partitioning in Hierarchical Trees)

Accuracy (1): zooming benchmark
[Chart: difference (PSNR) between F32 and F16 images for Baboon, Lena and Lighthouse, for zoom factors 1, 2 and 3, with truncation (T) and without denormals (N).]
• Denormals are useless
• No significant difference between truncation and rounding for the mantissa
  – Minimum hardware (no denormals, truncation) is OK

Accuracy (2): JPEG (Mediabench)
[Chart: difference (dB) between the final compressed image and the uncompressed original, for the FAST, INT, FLOAT and F16 DCTs, on 512 x 512 images (Baboon, Lena, Lighthouse, Corridor) and 256 x 256 images (Einstein, Office).]

Accuracy (3): wavelet transform
[Chart: PSNR(F32) and PSNR loss versus compression rate (1 to 500) for Lena, Lighthouse and Man; 512 x 512 or 1024 x 1024 images.]

Accuracy (4): wavelet transforms
[Chart: PSNR(F32) and PSNR loss versus compression rate (1 to 500) for Office, Corridor, Einstein, Grenoble, Lena and Titanic; 256 x 256 images.]

Benchmarks
• Convolution operators
  – Horizontal-vertical version of the Deriche filter
  – Deriche gradient
• Image stabilization
  – Points of interest: Achard, Harris
  – Optical flow
• FDCT (JPEG 6-a)

Deriche, horizontal-vertical version:
for (i = 0; i < size-1; i++) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j]) >> 8);}}
for (i = size-1; i >= 0; i--) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i+1][j] + a2 * Y[i+2][j]) >> 8);}}

HW and SW support
• Altera NIOS development kit (Cyclone edition)
• NIOS
II/f
  – Fixed features
    • EP1C20F400C7 FPGA device
    • NIOS II/f CPU (50 MHz): 32-bit RISC CPU, dynamic branch predictor, barrel shifter, 4 KB instruction cache, 2 KB data cache
    • HW integer multiplication and division
  – Parameterized features
    • Customized instructions: VHDL description of all the F16 operators (arithmetic and data handling operators)
• Altera IDE
  – GCC tool chain (-O3 option)
  – High_res_timer (number of clock cycles for execution time)
• Quartus II design software

Customization of SIMD F16 instructions
• Data manipulation, ADD/SUB, MUL, DIV
• With a 32-bit CPU, it makes sense to implement F16 instructions as SIMD 2 x 16-bit instructions

SIMD F16 instructions
• Data conversions: 1 cycle
  – Bytes to/from F16 (B2F16L, B2F16H)
  – Shorts to/from F16
• Conversions and shifts: 1 cycle
  – Accesses to (i, i-1) or (i+2, i+1) pairs and conversions (B2FSRL, B2FSRH)
• Arithmetic instructions
  – ADD/SUB: 2 cycles (4 for F32)
  – MULF: 2 cycles (3 for F32)
  – DIVF: 5 cycles
  – DP2: 1 cycle

Execution time: basic vector operations
Instruction latencies (cycles):
        Add   Mul
  I32    1     ?
  F16    2     2
  F32    4     3
[Chart: cycles per iteration for N = 10, 100, 256, for copy (X[i]=A[i]), vector add and mul (X[i]=A[i]+B[i], X[i]=A[i]*B[i] and the ADDF16/MULF16/ADDF32/MULF32 variants) and vector-scalar add and mul (X[i]=A[i]+k, X[i]=A[i]*k and their F16 variants).]

Execution time: basic vector operations
• Speedup: SIMD F16 versus scalar I32 or F32
  – Smaller cache footprint for F16 compared to I32/F32
  – F16 latencies are smaller than F32 latencies
[Chart: F16/I32 and F16/F32 addition and multiplication speedups for N = 10, 100, 256; values roughly between 1 and 4.]

Benchmark speedups
• Speedup greater than 2.5 versus F32
• Speedup from 1.3 to 3 versus I32
  – Depends on the add/mul ratio and the amount of data manipulation
• Even scalar F16 can be faster than I32 (1.3 speedup for the JPEG DCT)
[Chart: F16/I32 and F16/F32 speedups for 128 x 128 images on the Deriche filter (no multiplication), Deriche gradient, Achard, Harris and optical flow benchmarks.]

Hardware cost
[Chart: number and percentage of used logic units (up to about 2500, i.e. 12%) for the CPU, system overhead, custom instruction overhead, ADDF16/SUBF16, MULF16, DIVF16, division by a power of 2, data handling, F32 ADD/SUB, F32 MUL and integer HW Mul+Div, for the F16 and F32 versions.]

Concluding remarks
• Intermediate-level graphics benchmarks generally need more than the I16 (short) or I32 (int) dynamic ranges without needing the F32 (float) dynamic range
• On our benchmarks, graphical results are not significantly different when using F16 instead of F32
• A limited set of SIMD F16 instructions has been customized for the NIOS II CPU
  – The hardware cost is limited
and compatible with today's FPGA technologies
  – The speedups range from 1.3 to 3 (generally 1.5) versus I32 and are greater than 2.5 versus F32
• Similar results have been found for general-purpose CPUs (Pentium 4, PowerPC)
• Tests should be extended to other embedded approaches
  – SoCs
  – Customizable CPUs (Tensilica approach)

References
• OpenEXR, http://www.openexr.org/details.html
• W.R. Mark, R.S. Glanville, K. Akeley and M.J. Kilgard, "Cg: A System for Programming Graphics Hardware in a C-like Language"
• NVIDIA, Cg User's Manual, http://developer.nvidia.com/view.asp?IO=cg_toolkit
• Apple, "Introduction to vImage", http://developer.apple.com/documentation/Performance/Conceptual/vImage/
• G. Kolli, "Using Fixed-Point Instead of Floating Point for Better 3D Performance", Intel Optimizing Center, http://www.devx.com/Intel/article/16478
• D. Menard, D. Chillet, F. Charot and O. Sentieys, "Automatic Floating-point to Fixed-point Conversion for DSP Code Generation", in International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2002)
• F. Fang, Tsuhan Chen and Rob A. Rutenbar, "Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform", EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems
• R. Deriche, "Using Canny's Criteria to Derive a Recursively Implemented Optimal Edge Detector", The International Journal of Computer Vision, 1(2):167-187, May 1987
• A. Kumar, "SSE2 Optimization - OpenGL Data Stream Case Study", Intel application notes, http://www.intel.com/cd/ids/developer/asmo-na/eng/segments/games/resources/graphics/19224.htm
• Sample code for the benchmarks: http://www.lri.fr/~de/F16/codetsi
• Multi-Chip Projects, "Design Kits", http://cmp.imag.fr/ManChap4.html
• J. Detrey and F.
De Dinechin, "A VHDL Library of Parametrisable Floating-Point and LNS Operators for FPGA", http://www.ens-lyon.fr/~jdetrey/FPLibrary

Back slides
• F16 SIMD instructions on general-purpose CPUs

Microarchitectural assumptions for Pentium 4 and PowerPC G5
• The new F16 instructions are compatible with the present implementation of the SIMD ISA extensions
  – 128-bit SIMD registers
  – Same number of SIMD registers
• Most SIMD 16-bit integer instructions can be used for F16 data
  – Transfers
  – Logical instructions
  – Pack/unpack, shuffle, permutation instructions
• New instructions
  – F16 arithmetic ones: add, sub, mul, div, sqrt
  – Conversion instructions
    • 16-bit integer to/from 16-bit FP
    • 8-bit integer to/from 16-bit FP

Some P4 instruction examples
• Latencies and throughputs are similar to those of the corresponding P4 FP instructions
  Instruction: ADDF16  MULF16  CBL2F16  CBH2F16  CF162BL  CF162BH
  Latency:       4       6       4        4        4        4
• Smaller latencies would be possible: ADDF16 2, MULF16 4, CONV 2
[Diagram: CBL2F16 and CBH2F16 are byte-to-half conversion instructions operating on the low or high 8 bytes of an XMM register.]

Measures
• Hardware "simulator"
  – IA-32: 2.4 GHz Pentium 4 with 768 MB running Windows 2000; Intel C++ 8 compiler with the QxW option, "maximize speed"; execution time measured with the RDTSC instruction
  – PowerPC: 1.6 GHz PowerPC G5 with 768 MB DDR400 running Mac OS X.3; Xcode programming environment including gcc 3.3
• Measures: average values of at least 10 executions (excluding abnormal ones)

SIMD execution time (1): Deriche benchmarks
[Chart: cycles per pixel (up to 73) for the scalar integer, scalar float, SIMD integer, SIMD float and F16 versions of P4-Deriche H, P4-Deriche HV, P4-Gradient, G5-Deriche H, G5-Deriche HV and G5-Gradient; starred bars are incorrect results.]
• SIMD integer results are incorrect (insufficient dynamic range)
• F16 results are close to the "incorrect" SIMD integer results
• F16 results are significantly better than 32-bit FP results

SIMD execution time (2): scan benchmarks
• Cumulative sum (+scan) and cumulative sum of squares (+*scan) of the preceding pixel values; execution time according to the input-output formats (byte-short, byte-int, byte-float, float-float, byte-F16)
[Chart: cycles per pixel (up to 25) for P4 and G5 SIMD copy, +scan and +*scan; starred bars are incorrect results.]
• Copy corresponds to the lower bound on execution time (memory-bound)
• Byte-short for +scan, and byte-short and byte-int for +*scan, give incorrect results (insufficient dynamic range)
• Same results as for the Deriche benchmarks
  – F16 results are close to the incorrect SIMD integer results
  – F16 results show a significant speedup compared to float-float for both scans, and compared to byte-float and float-float for +*scan

SIMD execution time (3): OpenGL data stream case
• Compute, for each triangle, the min and max values of the vertex coordinates
• Most of the computation time is spent in AoS-to-SoA conversion
• Results
  [Chart: cycles per triangle, F32 versus F16: P4 195 versus 107.5; G5 21.5 versus 10.5.]
  – Altivec is far better, but the relative F16/F32 speedup is similar

Overall comparison (1/2/3)
[Chart: speedups (up to about 4.5) for Deriche H, Deriche HV, Gradient, Scan+, Scan+* and bounding boxes: P4 F16/F32 and G5 F16/F32 (F16 version versus float version, left), P4 F16/I16 and G5 F16/I16 (F16 versus the "incorrect" 16-bit integer version, right).]
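The scan kernels measured above can be written as scalar C for reference. This is an illustrative reference version (the slides benchmark SIMD variants with several input/output format combinations); byte input with float output is one of the measured combinations.

```c
#include <stddef.h>

/* Scalar reference for the scan benchmarks: "+scan" is the running sum
 * of pixel values, "+*scan" the running sum of squares. */
static void scan_sums(const unsigned char *x, size_t n,
                      float *sum, float *sumsq)
{
    float s = 0.0f, s2 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float v = (float)x[i];
        s  += v;        /* +scan  */
        s2 += v * v;    /* +*scan */
        sum[i]   = s;
        sumsq[i] = s2;
    }
}
```

The dynamic-range issue reported on the slides is visible here: with a short accumulator, s2 overflows after only a few hundred bright pixels, while float (or F16, within its range) does not change format between input and accumulator.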
SIMD execution time (4): wavelet transform
[Chart: F32/F16 speedup on the Pentium 4 for the horizontal, vertical and overall wavelet transforms, versus image size.]
[Chart: the same F32/F16 execution-time ratios on the PowerPC.]

Chip area: "rough" evaluation
• Same approach as used by Talla et al. for the MediaBreeze architecture
• VHDL models of the FP operators
  – J. Detrey and F. De Dinechin (ENS Lyon)
  – Non-pipelined and pipelined versions
  – Adder: close path and far path for the exponent values
  – Divider: radix-4 SRT algorithm
  – SQRT: radix-2 SRT algorithm
• Cell-based library
  – ST 0.18 µm HCMOS8D technology
  – Cadence 4.4.3 synthesis tool (before placement and routing)
• Limitations
  – Full-custom VLSI ≠ VHDL + cell-based library
  – The actual implementation in the P4 (G5) data path is not considered

16-bit and 64-bit operators
[Chart: 16-bit/64-bit chip-area ratio (up to 25%) for the adder, multiplier, divider, SQRT and overall.]
• The two-path approach is too "costly" for a 16-bit FP adder; a straightforward approach would be sufficient

Chip area evaluation
[Chart, log scale: chip area (mm2) of the 16-bit and 64-bit adder, multiplier, divider and SQRT; 16-bit values range from about 0.016 to 0.097 mm2, 64-bit values from about 0.276 to 1.008 mm2.]
• A 16-bit FP functional unit's chip area is about 5.5% of the 64-bit FP FU
• Eight such units would be 11% of the four corresponding 64-bit ones
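As a closing illustration, the lane-wise semantics assumed throughout for the packed F16 arithmetic instructions (e.g. ADDF16 on the NIOS II, two halves per 32-bit word) can be modeled in C. The conversion helpers here are a simplified sketch in the spirit of the "minimum hardware" options (no denormals, fraction truncation, overflow clamped to infinity), not the synthesized VHDL operators.

```c
#include <stdint.h>
#include <string.h>

/* Widen one half lane to float (denormals treated as zero). */
static float h2f(uint16_t h)
{
    uint32_t e = (h >> 10) & 0x1Fu, m = h & 0x3FFu;
    uint32_t x = (uint32_t)(h & 0x8000u) << 16;
    if (e != 0)
        x |= ((e - 15 + 127) << 23) | (m << 13);
    float f; memcpy(&f, &x, sizeof f); return f;
}

/* Narrow a float to one half lane (truncation, flush/clamp). */
static uint16_t f2h(float f)
{
    uint32_t x; memcpy(&x, &f, sizeof x);
    uint16_t s = (uint16_t)((x >> 16) & 0x8000u);
    int32_t  e = (int32_t)((x >> 23) & 0xFF) - 127 + 15;
    if (e <= 0)  return s;                         /* flush to zero    */
    if (e >= 31) return (uint16_t)(s | 0x7C00u);   /* clamp to +/-inf  */
    return (uint16_t)(s | (e << 10) | ((x & 0x7FFFFFu) >> 13));
}

/* Model of a SIMD ADDF16 custom instruction: two half values packed
 * in each 32-bit word are added independently, lane by lane. */
static uint32_t addf16(uint32_t a, uint32_t b)
{
    uint16_t lo = f2h(h2f((uint16_t)a) + h2f((uint16_t)b));
    uint16_t hi = f2h(h2f((uint16_t)(a >> 16)) + h2f((uint16_t)(b >> 16)));
    return ((uint32_t)hi << 16) | lo;
}
```

With 1.0 encoded as 0x3C00 and 2.0 as 0x4000, adding the packed word 0x40003C00 to itself yields 0x44004000, i.e. 4.0 in the high lane and 2.0 in the low lane.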