
A case for 16-bit floating point data: FPGA
image and media processing
Daniel Etiemble and Lionel Lacassagne
University Paris Sud, Orsay (France)
de@lri.fr
U of T, 09/20/2005
Summary
• Graphics and media applications
– integer versus FP computations
• Accuracy
• Execution speed
• Compilation issues
– A niche for 16-bit floating point format (F16 or “half”)
• Methodology and benchmarks
• Hardware support
– Customization of SIMD 16-bit FP operators on an FPGA soft core
(Altera NIOS II CPU)
– The SIMD 16-bit FP instructions
• Results
• Conclusion
Integer or FP computations?
• Both formats are used in graphics and media processing
– Example: the Apple vImage library has four image types based on four
pixel types:
• Unsigned char (0-255) or Float (0.0-1.0) for color or alpha values
• Set of 4 unsigned chars or floats for Alpha, Red, Green, Blue
• Trade-offs
– Precision and dynamic range
– Memory occupation and cache footprint
– Hardware cost (embedded applications)
• Chip area
• Power dissipation
Integer or FP computations? (2)
• General trend to replace FP computations by
fixed-point computations
– Intel GPP library: “Using Fixed-Point Instead of Floating Point
for Better 3D Performance” (G. Kolli)
– Intel Optimizing Center,
http://www.devx.com/Intel/article/16478
– Techniques for automatic floating-point to fixed-point
conversions for DSP code generation (Menard et al)
Menard et al approach (LASTI, Lannion, France)
• Methodology: start from the FP algorithm and derive a correct fixed-point algorithm
• HW design (ASIC-FPGA)
  – Optimize the data path width
  – Minimize chip area
• SW design (DSP)
  – Optimize the “mapping” of the algorithm on a fixed architecture
  – Maximize precision
  – Minimize execution time & code size
Integer or FP computations? (3)
• Opposite option: Customized FP formats
– “Lightweight” FP arithmetic (Fang et al) to avoid
conversions
• With IDCT: FP numbers with 5-bit exponent and 8-bit
mantissa are sufficient to get a PSNR similar to 32-bit FP
numbers
• To compare with “half” format
Integer or FP computations? (4)
• How to help a compiler to
“vectorize”?
– Integers: different input and
output formats
• N bits + N bits => N+1 bits
• N bits * N bits => 2N bits
– FP numbers: same input and
output formats
• Example: a Deriche filter
on a size x size image
#define byte unsigned char
byte **X, **Y;
int32 b0, a1, a2;
for (i = 0; i < size; i++) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i-1][j]
                      + a2 * Y[i-2][j]) >> 8); } }
for (i = size-1; i >= 0; i--) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i+1][j]
                      + a2 * Y[i+2][j]) >> 8); } }
Compiler vectorization is impossible.
With 8-bit coefficients, this benchmark can be
vectorized manually, but only if the programmer has
detailed knowledge of the parameters used.
The float version is easily vectorized by the compiler
(see the sketch below).
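A minimal float counterpart of the loop above (illustrative names; the coefficients are assumed already normalized, so the >> 8 rescaling disappears). With uniform float inputs and outputs, the compiler can vectorize the inner loop:

void deriche_hv_float(float **X, float **Y, float b0, float a1, float a2, int size)
{
    /* border rows are handled as in the integer code above (not shown) */
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            Y[i][j] = b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j];
    for (int i = size - 1; i >= 0; i--)
        for (int j = 0; j < size; j++)
            Y[i][j] = b0 * X[i][j] + a1 * Y[i+1][j] + a2 * Y[i+2][j];
}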
Cases for 16-bit FP formats
• Computation when data range exceeds “16-bit integer”
range without needing “32-bit FP float” range
• Graphics and media applications
– Not for GPU (F16 already used in NVidia GPUs)
– For embedded applications
• Advantages of 16-bit FP format
– Reduce memory occupation (cache footprint) versus 32-bit integer
or FP formats
• CPU without SIMD extensions (low-end embedded CPUs)
– 2 x wider SIMD instructions compared to float SIMD
• CPU with SIMD extensions (high-end embedded CPUs)
• Huge advantage of SIMD float operations versus SIMD integer
operations both for compiler and manual vectorization.
Example: Points of Interest
Points of interest (PoI) in images: Harris algorithm
[Pipeline: image (byte) → 3 x 3 gradient (Sobel) → Ix, Iy (short) → products Ix*Ix, Ix*Iy, Iy*Iy → 3 x 3 Gauss filters → Sxx, Sxy, Syy (int) → FI = (Sxx*Syy − Sxy²) − 0.05*(Sxx + Syy)² → threshold → byte]
Integer computation mixes char, short and int and prevents an efficient use of
SIMD parallelism.
F16 computations would profit from SIMD parallelism with a uniform 16-bit
format (see the sketch below).
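As an illustration, a minimal sketch of the Harris response FI from the formula above, written with a single uniform type (float shown here; the F16 version keeps the same uniform layout):

float harris_response(float Sxx, float Sxy, float Syy)
{
    float det   = Sxx * Syy - Sxy * Sxy;   /* determinant of the structure tensor */
    float trace = Sxx + Syy;               /* trace of the structure tensor */
    return det - 0.05f * trace * trace;    /* FI = (Sxx*Syy - Sxy²) - 0.05*(Sxx+Syy)² */
}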
16-bit Floating-Point formats
• Some have been defined in DSPs but rarely used
– Example: TMS 320C32
• Internal FP type (immediate operand)
– 1 sign bit, 4-bit exponent field and 11-bit fraction
• External FP type (storage purposes)
– 1 sign bit, 8-bit exponent field and 7-bit fraction
• “Half” format
  – 1-bit sign (S), 5-bit exponent, 10-bit fraction
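For reference, a minimal sketch of decoding this layout (sign bit, 5-bit exponent with bias 15, 10-bit fraction) into a float; denormals, infinities and NaNs are handled for completeness:

#include <math.h>
#include <stdint.h>

float half_to_float(uint16_t h)
{
    int sign = (h >> 15) & 0x1;
    int exp  = (h >> 10) & 0x1F;
    int frac =  h        & 0x3FF;
    float value;

    if (exp == 0)            /* zero or denormal: frac * 2^-10 * 2^-14 */
        value = ldexpf((float)frac, -24);
    else if (exp == 31)      /* infinity (frac == 0) or NaN */
        value = frac ? NAN : INFINITY;
    else                     /* normal: (1 + frac/2^10) * 2^(exp-15) */
        value = ldexpf(1024.0f + (float)frac, exp - 15 - 10);

    return sign ? -value : value;
}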
“Half” format
• A 16-bit counterpart of the IEEE 754 single- and double-precision
formats.
• Introduced by ILM for OpenEXR format
• Defined in Cg (NVidia)
• Motivation:
– “16-bit integer based formats typically represent color component values from 0
(black) to 1 (white), but don’t account for over-range values (e.g. a chrome
highlight) that can be captured by film negative or other HDR displays…
Conversely, 32-bit floating-point TIFF is often overkill for visual effects work. 32-bit
FP TIFF provides more than sufficient precision and dynamic range for VFX
images, but it comes at the cost of storage, both on disk and in memory”
Validation of the F16 approach
• Accuracy
– Results presented in ODES-3 (2005) and CAMP’05 (2005)
– Next slides.
• Performance with general-purpose CPUs (Pentium 4 and
PowerPC G4/G5)
– Results presented in ODES-3 (2005) and CAMP’05 (2005)
• Performance with FPGAs (this presentation)
– Execution time
– Hardware cost (and power dissipation)
• Other embedded hardware (to be done)
– SoC
– Customizable CPU (e.g. the Tensilica approach)
Accuracy
• Comparison of F16 computation results with F32
computation results
• Specificities of FP formats
– Rounding?
– Denormals?
– NaN?
Impact of F16 accuracy and dynamic range
• Simulation of the “half” format with the “float” format on actual benchmarks or
applications
– Impact of reduced accuracy and range on the results
– F32-computed and F16-computed images are compared with PSNR measures.
• Four different functions (ftd, frd, ftn, frn) to simulate the F16 (see the sketch below)
– Fraction: truncation or rounding
– With or without denormals
• For any benchmark, manual insertion of one function (ftd / frd / ftn / frn)
– Function call before any use of a “float” value
– Function call after any operation producing a “float” value
[Diagram: the F16 is simulated inside the F32 layout (1 sign bit, 8-bit exponent, 23-bit fraction): the exponent is restricted to the 5-bit range (1-31, PE = 1023 to 1039) and only the 10 upper fraction bits are kept, the 13 lower bits being forced to 0.]
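A minimal sketch of one such simulation function (hypothetical name; this models the truncation / no-denormal variant, and the clamp-on-overflow behaviour is an assumption):

#include <math.h>
#include <stdint.h>
#include <string.h>

float f16_truncate(float x)
{
    uint32_t bits;
    float ax = fabsf(x);

    if (ax < ldexpf(1.0f, -14))   /* below the F16 normal range: flush to zero (no denormals) */
        return x < 0.0f ? -0.0f : 0.0f;
    if (ax > 65504.0f)            /* above the F16 range: clamp to the largest F16 value (assumption) */
        ax = 65504.0f;

    memcpy(&bits, &ax, sizeof bits);
    bits &= 0xFFFFE000u;          /* keep only the 10 upper fraction bits of the F32 */
    memcpy(&ax, &bits, sizeof ax);
    return x < 0.0f ? -ax : ax;
}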
Impact of F16 accuracy and dynamic range
• Benchmark 1 : zooming
(A. Montanvert, Grenoble)
– “Spline” technique for x1, x2 and x4 zooms
• Benchmark 2 : JPEG (Mediabench)
– 4 different DCT/IDCT functions
• Integer/Fast integer/F32/F16
• Benchmark 3: Wavelet transform (L. Lacassagne, Orsay)
– SPIHT (Set Partitioning in Hierarchical Trees)
Accuracy (1): Zooming benchmark
[Chart: PSNR difference between F32- and F16-computed images (Baboon, Lena, Lighthouse) as a function of the zoom factor (1-T, 1-N, 2-T, 2-N, 3-T, 3-N).]
• Denormals are useless
• No significant difference between truncation and rounding for
mantissa
– Minimum hardware (no denormals, truncation) is OK
Accuracy (2): JPEG (Mediabench)
[Charts: PSNR difference (dB) between the final compressed-then-uncompressed image and the original image for the four DCT versions (FAST, INT, FLOAT, F16); left: 512 x 512 images (Baboon, Lena, Lighthouse), right: 256 x 256 images (Corridor, Einstein, Office).]
Accuracy (3): Wavelet transform
[Charts: PSNR(F32) and PSNR loss (F32 vs F16) as a function of the compression rate (1 to 500) for Lena, Lighthouse and Man; 512 x 512 or 1024 x 1024 images.]
Accuracy (4): Wavelet transforms
[Charts: PSNR(F32) and PSNR loss (F32 vs F16) as a function of the compression rate (1 to 500) for Office, Corridor, Einstein, Grenoble, Lena and Titanic; 256 x 256 images.]
Benchmarks
• Convolution operators
  – Horizontal-vertical version of the Deriche filter
  – Deriche gradient
• Image stabilization
  – Points of Interest
    • Achard
    • Harris
  – Optical flow
• FDCT (JPEG 6-a)

Deriche: horizontal-vertical version
for (i = 0; i < size-1; i++) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i-1][j]
                      + a2 * Y[i-2][j]) >> 8); } }
for (i = size-1; i >= 0; i--) {
  for (j = 0; j < size; j++) {
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i+1][j]
                      + a2 * Y[i+2][j]) >> 8); } }
HW and SW support
• Altera NIOS development kit (Cyclone edition)
  – EP1C20F400C7 FPGA device
• NIOS II/f CPU (50 MHz)
  – Fixed features
    • 32-bit RISC CPU
    • Dynamic branch predictor
    • Barrel shifter
    • HW integer multiplication and division
    • 4 KB instruction cache, 2 KB data cache
    • Customized instructions
  – Parameterized features
    • VHDL description of all the F16 operators
      – Arithmetic operators
      – Data handling operators
• Altera IDE
  – GCC tool chain (-O3 option)
  – High_res_timer (number of clock cycles for execution time; see the timing sketch below)
• Quartus II design software
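A minimal sketch of cycle counting on the target, assuming the Nios II HAL timestamp API with high_res_timer selected as the timestamp timer in the BSP; the measured kernel is a placeholder:

#include <stdio.h>
#include <sys/alt_timestamp.h>

int main(void)
{
    if (alt_timestamp_start() < 0) {            /* start / reset the timestamp timer */
        printf("no timestamp device\n");
        return 1;
    }
    alt_timestamp_type t0 = alt_timestamp();
    /* ... benchmark kernel under test ... */
    alt_timestamp_type cycles = alt_timestamp() - t0;
    printf("%lu clock cycles\n", (unsigned long)cycles);
    return 0;
}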
Customization of SIMD F16 instructions
• Customized operators: data manipulation; ADD/SUB, MUL, DIV
• With a 32-bit CPU, it makes sense to implement the F16 instructions
as SIMD 2 x 16-bit instructions (see the sketch below)
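A minimal sketch of how such a customized instruction could be invoked from C, assuming the standard Nios II GCC custom-instruction builtins; the opcode value ADDF16_N and the f16x2 typedef are illustrative:

#define ADDF16_N 0                 /* assumption: opcode chosen when the instruction was generated */

typedef unsigned int f16x2;        /* two packed 16-bit floats in one 32-bit word */

static inline f16x2 addf16(f16x2 a, f16x2 b)
{
    /* one SIMD 2 x F16 addition; 'inii' = int result, two int operands */
    return (f16x2)__builtin_custom_inii(ADDF16_N, (int)a, (int)b);
}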
SIMD F16 instructions
• Data conversions: 1 cycle
  – Bytes to/from F16 (B2F16L / B2F16H convert the low / high pair of bytes i..i+3; modeled in the sketch below)
  – Shorts to/from F16
• Conversions and shifts: 1 cycle
  – Accesses to (i, i-1) or (i+2, i+1) and conversions (B2FSRL, B2FSRH)
• Arithmetic instructions
  – ADD/SUB: 2 cycles (4 for F32)
  – MULF: 2 cycles (3 for F32)
  – DIVF: 5 cycles
  – DP2: 1 cycle
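A software reference model (sketch) of the B2F16L conversion, assuming byte i goes to the low half and byte i+1 to the high half of the 32-bit result; byte_to_half is a helper written for this sketch:

#include <stdint.h>

static uint16_t byte_to_half(uint8_t v)
{
    if (v == 0) return 0;
    int e = 7;
    while (!(v & (1 << e))) e--;               /* position of the leading 1 bit */
    uint16_t exp  = (uint16_t)(e + 15);        /* F16 exponent bias is 15 */
    uint16_t frac = (uint16_t)((v << (10 - e)) & 0x3FF);  /* exact, since v <= 255 */
    return (uint16_t)((exp << 10) | frac);
}

static uint32_t b2f16l(uint32_t word)
{
    uint16_t lo = byte_to_half((uint8_t)(word & 0xFF));
    uint16_t hi = byte_to_half((uint8_t)((word >> 8) & 0xFF));
    return (uint32_t)lo | ((uint32_t)hi << 16);
}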
Execution time: basic vector operations
[Chart: cycles per iteration (N = 10, 100, 256) for copy, vector add and mul, and vector-scalar add and mul, in I32, F16 and F32 versions (X[i] = A[i]; X[i] = A[i] + B[i]; X[i] = A[i] * k; X[i] = ADDF16/ADDF32(A[i], B[i]); X[i] = MULF16/MULF32(A[i], B[i]); X[i] = ADDF16(A[i], k); X[i] = MULF16(A[i], k)).]
Instruction latencies (cycles)
  – I32: Add 1, Mul ?
  – F16: Add 2, Mul 2
  – F32: Add 4, Mul 3
Execution time: basic vector operations
• Speedup
– SIMD F16 versus scalar I32 or F32
– Smaller cache footprint for F16 compared to I32/F32
– F16 latencies are smaller than F32 latencies
[Chart: F16/I32 and F16/F32 addition and multiplication speedups for N = 10, 100 and 256.]
Benchmark speedups
• Speedup greater than 2.5 versus F32
• Speedup from 1.3 to 3 versus I32
  – Depends on the add/mul ratio and the amount of data manipulation
• Even scalar F16 can be faster than I32 (1.3 speedup for JPEG DCT)
[Chart: F16/I32 and F16/F32 speedups for 128 x 128 images on the Deriche filter, Deriche gradient, Achard, Harris and optical flow benchmarks (one case is labeled “NO MUL”).]
Hardware cost
[Chart: number (0-3000) and percentage (0-14%) of used logic units for the CPU, system overhead, instruction overhead, the customized F16 operators (ADDF16/SUBF16, MULF16, DIVF16, division by a power of 2, data handling), F32 ADD/SUB, F32 MUL, the integer HW Mul+Div and the custom logic as a whole; F16 and F32 bars are shown side by side.]
Concluding remarks
• Intermediate-level graphics benchmarks generally need more than the I16
(short) or I32 (int) dynamic range without needing the F32 (float)
dynamic range
• On our benchmarks, graphical results are not significantly different
when using F16 instead of F32
• A limited set of SIMD F16 instructions has been customized for the
NIOS II CPU
– The hardware cost is limited and compatible with today’s FPGA
technologies
– The speedups range from 1.3 to 3 (generally 1.5) versus I32 and are greater
than 2.5 versus F32
• Similar results have been found for general-purpose CPUs (Pentium 4,
PowerPC)
• Tests should be extended to other embedded approaches
– SoCs
– Customizable CPUs (Tensilica approach)
References
• OpenEXR, http://www.openexr.org/details.html
• W.R. Mark, R.S. Glanville, K. Akeley and M.J. Kilgard, “Cg: A System for Programming Graphics Hardware in a C-like Language”
• NVIDIA, Cg User’s Manual, http://developer.nvidia.com/view.asp?IO=cg_toolkit
• Apple, “Introduction to vImage”, http://developer.apple.com/documentation/Performance/Conceptual/vImage/
• G. Kolli, “Using Fixed-Point Instead of Floating Point for Better 3D Performance”, Intel Optimizing Center, http://www.devx.com/Intel/article/16478
• D. Menard, D. Chillet, F. Charot and O. Sentieys, “Automatic Floating-point to Fixed-point Conversion for DSP Code Generation”, in International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2002)
• F. Fang, T. Chen and R.A. Rutenbar, “Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform”, EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems
• R. Deriche, “Using Canny’s criteria to derive a recursively implemented optimal edge detector”, The International Journal of Computer Vision, 1(2):167-187, May 1987
• A. Kumar, “SSE2 Optimization – OpenGL Data Stream Case Study”, Intel application notes, http://www.intel.com/cd/ids/developer/asmo-na/eng/segments/games/resources/graphics/19224.htm
• Sample code for the benchmarks: http://www.lri.fr/~de/F16/codetsi
• Multi-Chip Projects, “Design Kits”, http://cmp.imag.fr/ManChap4.html
• J. Detrey and F. de Dinechin, “A VHDL Library of Parametrisable Floating-Point and LNS Operators for FPGA”, http://www.ens-lyon.fr/~jdetrey/FPLibrary
Back slides
• F16 SIMD instructions on General Purpose CPUs
Microarchitectural assumptions for
Pentium 4 and PowerPC G5
• The F16 new instructions are compatible with the present
implementation of the SIMD ISA extensions
– 128-bit SIMD registers
– Same number of SIMD registers
• Most SIMD 16-bit integer instructions can be used for F16
data
– Transfers
– Logical instructions
– Pack/unpack, Shuffle, Permutation instructions
• New instructions
– F16 arithmetic ones: add, sub, mul, div, sqrt
– Conversion instructions
• 16-bit integer to/from 16-bit FP
• 8-bit integer to/from 16-bit FP
Some P4 instruction examples
Latencies and throughput values are similar to those of the corresponding P4 FP instructions.
• Assumed latencies (cycles): ADDF16: 4, MULF16: 6, CBL2F16: 4, CBH2F16: 4, CF162BL: 4, CF162BH: 4
• “Smaller latencies” variant: ADDF16: 2, MULF16: 4, CONV: 2
[Diagram: byte-to-half conversion instructions — CBL2F16 / CBH2F16 convert the low / high 8 bytes of an XMM register into 8 F16 values in an XMM register.]
Measures
• Hardware “simulator”
– IA-32
• 2.4 GHz Pentium 4 with 768 MB of memory running Windows 2000
• Intel C++ 8 compiler with the QxW option, “maximize speed”
• Execution time measured with the RDTSC instruction (see the sketch below)
– PowerPC
• 1.6 GHz PowerPC G5 with 768 MB of DDR400 running Mac OS X 10.3
• Xcode programming environment including gcc 3.3
• Measures
– Average values of at least 10 executions (excluding
abnormal ones)
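For reference, a minimal RDTSC sketch (GCC-style inline asm shown for illustration; compilers also offer equivalent intrinsics):

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));   /* cycle counter in edx:eax */
    return ((uint64_t)hi << 32) | lo;
}

/* usage: uint64_t t0 = rdtsc(); kernel(); uint64_t cycles = rdtsc() - t0; */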
SIMD Execution time (1): Deriche benchmarks
[Chart: cycles per pixel (CPP) for the scalar integer, scalar float, SIMD integer, SIMD float and F16 versions of P4-Deriche H, P4-Deriche HV, P4-Gradient, G5-Deriche H, G5-Deriche HV and G5-Gradient; asterisks mark incorrect results.]
• SIMD integer results are incorrect (insufficient dynamic range)
• F16 results are close to “incorrect” SIMD integer results
• F16 results are significantly better than 32-bit FP results
SIMD Execution time (2): Scan benchmarks
[Chart: cycles per pixel for P4 and G5 SIMD copy, +scan and +*scan with byte-short, byte-int, byte-float, float-float and byte-F16 input-output combinations. The scans compute the cumulative sum and the sum of squares of the preceding pixel values (sketched below); asterisks mark incorrect results.]
• Copy corresponds to the lower bound in execution time (memory-bound)
• Byte-short for +scan, and byte-short and byte-int for +*scan, give incorrect results
(insufficient dynamic range)
• Same results as for the Deriche benchmarks
– F16 results are close to the incorrect SIMD integer results
– F16 results show a significant speed-up compared to float-float for both scans, and compared to
byte-float and float-float for +*scan
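A minimal scalar sketch of the two scan kernels under the semantics assumed above (running sum, and running sum plus sum of squares, of the preceding pixel values); names are illustrative:

void scan_sum(const unsigned char *in, float *sum, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        sum[i] = s;                 /* sum of the preceding pixels */
        s += in[i];
    }
}

void scan_sum_sq(const unsigned char *in, float *sum, float *sum_sq, int n)
{
    float s = 0.0f, s2 = 0.0f;
    for (int i = 0; i < n; i++) {
        sum[i] = s;
        sum_sq[i] = s2;
        s  += in[i];
        s2 += (float)in[i] * (float)in[i];
    }
}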
SIMD Execution time (3): OpenGL data stream case
• Compute, for each triangle, the min and max values of the vertex coordinates (sketched below)
• Most of the computation time is spent in the AoS to SoA conversion
• Results
  – AltiVec is far better, but the relative F16/F32 speed-up is similar
[Chart: cycles per triangle (log scale) for F32 and F16 on the P4 and G5; labeled values: 195, 107.5, 21.5 and 10.5.]
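A minimal scalar sketch of the bounding-box kernel (assumed data layout: an array of triangles, each with three (x, y, z) vertices; the SIMD version additionally needs the AoS-to-SoA conversion mentioned above):

typedef struct { float x, y, z; } Vertex;
typedef struct { Vertex v[3]; } Triangle;

static float min3(float a, float b, float c) { float m = a < b ? a : b; return m < c ? m : c; }
static float max3(float a, float b, float c) { float m = a > b ? a : b; return m > c ? m : c; }

void bounding_boxes(const Triangle *t, Vertex *bb_min, Vertex *bb_max, int n)
{
    for (int i = 0; i < n; i++) {
        bb_min[i].x = min3(t[i].v[0].x, t[i].v[1].x, t[i].v[2].x);
        bb_min[i].y = min3(t[i].v[0].y, t[i].v[1].y, t[i].v[2].y);
        bb_min[i].z = min3(t[i].v[0].z, t[i].v[1].z, t[i].v[2].z);
        bb_max[i].x = max3(t[i].v[0].x, t[i].v[1].x, t[i].v[2].x);
        bb_max[i].y = max3(t[i].v[0].y, t[i].v[1].y, t[i].v[2].y);
        bb_max[i].z = max3(t[i].v[0].z, t[i].v[1].z, t[i].v[2].z);
    }
}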
Overall comparison (1/2/3)
[Chart: F16 speed-ups for Deriche H, Deriche HV, Gradient, Scan+, Scan+* and bounding boxes; left: F16 version versus float version (P4 F16/F32, G5 F16/F32), right: F16 versus the “incorrect” 16-bit integer version (P4 F16/I16, G5 F16/I16).]
SIMD Execution time (4): Wavelet transform
[Chart: F32/F16 speed-up on the Pentium 4 for the horizontal, vertical and overall wavelet transform, as a function of the image size.]
SIMD Execution time (4): Wavelet transform (continued)
[Chart: F32/F16 execution time ratio on the PowerPC for the horizontal, vertical and overall wavelet transform, as a function of the image size.]
Chip area “rough” evaluation
• Same approach as used by Talla et al for the Mediabreeze
architecture
– VHDL models of FP operators
  • J. Detrey and F. de Dinechin (ENS Lyon)
  • Non-pipelined and pipelined versions
  • Adder: close path and far path according to the exponent values
  • Divider: radix-4 SRT algorithm
  • SQRT: radix-2 SRT algorithm
– Cell based library
• ST 0.18µm HCMOS8D technology
• Cadence 4.4.3 synthesis tool (before placement and routing)
• Limitations
– Full-custom VLSI ≠ VHDL + Cell-based library
– Actual implementation in the P4 (G5) data path is not considered
16-bit and 64-bit operators
[Chart: 16-bit/64-bit chip-area ratio (0% to 25%) for the adder, multiplier, divider, SQRT and overall.]
The two-path approach is too “costly” for the 16-bit FP adder; a straightforward approach
would be sufficient.
Chip area evaluation
[Chart (log scale): chip area (mm²) of the 16-bit and 64-bit adder, multiplier, divider and SQRT; reported values range from 0.016 mm² to 1.008 mm².]
16-bit FP FU chip area is about 5.5% of the 64-bit FP FU
Eight such units would be 11% of the four corresponding 64-bit ones