Implementation of binary floating-point arithmetic
on embedded integer processors
Polynomial evaluation-based algorithms and certified code generation

Guillaume Revy
Advisors: Claude-Pierre Jeannerod and Gilles Villard
Arénaire INRIA project-team (LIP, ENS Lyon) – Université de Lyon – CNRS

Ph.D. Defense – December 1st, 2009
Motivation

Embedded systems are ubiquitous
• microprocessors dedicated to one or a few specific tasks
• must satisfy constraints: area, energy consumption, design cost

Some embedded systems do not have any FPU (floating-point unit)

[Diagram: applications requiring floating-point computations run on FPU-less embedded systems, such as the ST231 core, through software implementing floating-point arithmetic]

Highly used in audio and video applications
• demanding on floating-point computations
Overview of the ST231 architecture

[Block diagram of the ST231 core: 4 integer units (IU), 2 multipliers, load/store unit (LSU), register file (64 registers, 8 read, 4 write), branch register file, PC and branch unit, instruction buffer, I-side and D-side memory subsystems, prefetch and write buffers, caches and TLBs, trap controller, interrupt controller, timers, debug support unit, STBus interfaces, 4 × SDI ports]

4-issue VLIW 32-bit integer processor → no FPU

Parallel execution unit
• 4 integer ALUs
• 2 pipelined multipliers 32 × 32 → 32
• Latencies: ALU → 1 cycle, Mul → 3 cycles

VLIW (Very Long Instruction Word)
→ instructions grouped into bundles
→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler

Example: four independent operations

    uint32_t R1 = A0 + C;
    uint32_t R2 = A3 * X;
    uint32_t R3 = A1 * X;
    uint32_t R4 = X * X;

Bundling on the 4 issues: R1, R2 and R3 are issued in cycle 0; since only two multipliers are available, the third multiplication R4 is issued in cycle 1.
How to emulate floating-point arithmetic in software?

Design and implementation of efficient software support for IEEE 754 floating-point arithmetic on integer processors

Existing software for IEEE 754 floating-point arithmetic:
• software floating-point support of GCC, Glibc and µClibc, GoFast Floating-Point Library
• SoftFloat (→ STlib)
• FLIP (Floating-point Library for Integer Processors)
  - software support for binary32 floating-point arithmetic on integer processors
  - correctly-rounded addition, subtraction, multiplication, division, square root, reciprocal, ...
  - handling of subnormals and of special inputs
Towards the generation of fast and certified codes

Underlying problem: development “by hand”
• long and tedious, error prone
• what about a new target? a new floating-point format?

⇒ need for automation and certification

Current challenge: tools and methodologies for the automatic generation of efficient and certified programs
• optimized for a given format and for the target architecture
Towards the generation of fast and certified codes

Arénaire’s developments: hardware (FloPoCo) and software (Sollya, Metalibm)
Spiral project: hardware and software code generation for DSP algorithms
→ Can we teach computers to write fast libraries?

Our tool: CGPE (Code Generation for Polynomial Evaluation)
→ In the particular case of polynomial evaluation, can we teach computers to write fast and certified codes, for a given target and optimized for a given format?
Basic blocks for implementing correctly-rounded operators

Flow of a correctly-rounded operator, from the encoded input (X, Y) to the encoded result R:
• function-independent blocks: special input detection (if a special input is detected, the result comes directly from special output selection; a small detection sketch follows this slide), floating-point number unpacking, normalization, result reconstruction
• function-dependent blocks: range reduction, result sign/exponent computation, result significand approximation, rounding condition decision, correct rounding computation

Objectives
→ low latency, correctly-rounded implementations
→ ILP exposure

Approach
→ uniform approach for nth roots and their reciprocals → polynomial evaluation
→ extension to division
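As an illustration of the function-independent blocks, here is a minimal sketch (illustrative C, not the FLIP code; the helper names are ours) of how special binary32 inputs can be classified directly from the standard encoding used later in the talk (1 sign bit, 8 exponent bits, 23 trailing significand bits):

    /* Sketch only: classify a binary32 encoding X (zero, subnormal, infinity, NaN). */
    #include <stdint.h>

    static inline int is_zero(uint32_t X)      { return (X << 1) == 0; }            /* Ex = 0,   Tx = 0  */
    static inline int is_subnormal(uint32_t X) { uint32_t a = X << 1;
                                                 return a != 0 && (a >> 24) == 0; }  /* Ex = 0,   Tx != 0 */
    static inline int is_inf(uint32_t X)       { return (X << 1) == 0xFF000000u; }   /* Ex = 255, Tx = 0  */
    static inline int is_nan(uint32_t X)       { return (X << 1) >  0xFF000000u; }   /* Ex = 255, Tx != 0 */

Shifting out the sign bit first reduces each test to one shift and one comparison on a single 32-bit word, which is the kind of cheap integer test an FPU-less implementation relies on.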
Flowchart for generating efficient and certified C codes

Fully automated flow: from the problem (the function to be evaluated) and the ST231 features, first compute a polynomial approximant, then generate efficient and certified C code; the outputs are the C code and a certificate.

Constraints, and the tools addressing them:
• accuracy of the approximant and of the C code → Sollya; interval arithmetic (MPFI), Gappa
• low evaluation latency on the ST231, ILP exposure → CGPE
• efficiency of the generation process
Outline of the talk
1. Design and implementation of floating-point operators
Bivariate polynomial evaluation-based approach
Implementation of correct rounding
2. Low latency parenthesization computation
Classical evaluation methods
Computation of all parenthesizations
Towards low evaluation latency
3. Selection of effective evaluation parenthesizations
General framework
Automatic certification of generated C codes
4. Numerical results
5. Conclusions and perspectives
Design and implementation of floating-point operators
Outline of the talk
1. Design and implementation of floating-point operators
Bivariate polynomial evaluation-based approach
Implementation of correct rounding
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Notation and assumptions

The division C code takes (x, y), encoded as (X, Y), and returns RN(x/y), encoded as R.

Input (x, y) and output RN(x/y): normal numbers
→ no underflow nor overflow
→ precision p, extremal exponents emin, emax:

    x = ±1.mx,1 ... mx,p−1 · 2^ex   with   ex ∈ {emin, ..., emax}

→ rounding to nearest even (RoundTiesToEven): x/y ↦ RN(x/y)

Standard binary encoding: a k-bit unsigned integer X encodes the input x
• sign sx: 1 bit
• biased exponent Ex = ex − emin + 1: w = k − p bits
• trailing significand Tx = mx,1 ... mx,p−1: p − 1 bits

Computation: k-bit unsigned integers
→ integer and fixed-point arithmetic
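For concreteness, a minimal unpacking sketch for the binary32 case (k = 32, p = 24, w = 8); it is illustrative C, not the FLIP code, and the struct and field names are ours:

    /* Sketch: extract the fields of a binary32 encoding X (normal input assumed). */
    #include <stdint.h>

    typedef struct { uint32_t sx, Ex, Tx, mx; } fields32_t;

    static inline fields32_t unpack_binary32(uint32_t X)
    {
        fields32_t f;
        f.sx = X >> 31;              /* sign bit                                  */
        f.Ex = (X >> 23) & 0xFF;     /* biased exponent field, w = 8 bits         */
        f.Tx = X & 0x7FFFFF;         /* trailing significand, p - 1 = 23 bits     */
        f.mx = f.Tx | (1u << 23);    /* significand 1.Tx of a normal number, held */
                                     /* as a fixed-point integer scaled by 2^23   */
        return f;
    }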
Range reduction of division

Express the exact result r = x/y as

    r = ℓ · 2^d   ⇒   RN(x/y) = RN(ℓ) · 2^d   with   ℓ ∈ [1, 2) and d ∈ {emin, ..., emax}

Definition: c = 1 if mx ≥ my, and c = 0 otherwise

Range reduction:

    x/y = (2^(1−c) · mx/my) · 2^d = ℓ · 2^d   with   ℓ := 2^(1−c) · mx/my ∈ [1, 2)   and   d = ex − ey − 1 + c

How to compute the correctly-rounded significand RN(ℓ)?
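A minimal sketch (our own illustrative C, not the FLIP code) of this range reduction in integer arithmetic, using the unpacked 24-bit significands mx, my (scaled by 2^23) and the unbiased exponents ex, ey:

    #include <stdint.h>

    typedef struct { uint32_t s; int32_t d; uint32_t c; } reduced_div_t;

    static inline reduced_div_t range_reduce_div(uint32_t mx, uint32_t my,
                                                 int32_t ex, int32_t ey)
    {
        reduced_div_t r;
        r.c = (mx >= my);                  /* c = 1 if mx >= my, 0 otherwise               */
        r.d = ex - ey - 1 + (int32_t)r.c;  /* result exponent d = ex - ey - 1 + c          */
        r.s = mx << (1u - r.c);            /* s* = 2^(1-c) * mx, so that s*/my lies in [1,2) */
        return r;
    }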
Methods for computing the correctly-rounded significand

Iterative methods: restoring, non-restoring, SRT, ...
• Oberman and Flynn (1997)
• minimal ILP exposure, sequential algorithm

Multiplicative methods: Newton-Raphson, Goldschmidt
• Piñeiro and Bruguera (2002) – Raina’s Ph.D., FLIP 0.3 (2006)
• exploit available multipliers, more ILP exposure

Polynomial-based methods
• Agarwal, Gustavson and Schmookler (1999) → univariate polynomial evaluation
• our approach → bivariate polynomial evaluation: maximal ILP exposure
Correct rounding via truncated one-sided approximation

How to compute RN(ℓ), with ℓ = 2^(1−c) · mx/my?

Three steps for correct rounding computation
1. compute v = 1.v1 ... vk−2 such that −2^−p ≤ ℓ − v < 0
   → implied by |(ℓ + 2^(−p−1)) − v| < 2^(−p−1)
   → bivariate polynomial evaluation
2. compute u as the truncation of v after p fraction bits
3. determine RN(ℓ) after possibly adding 2^−p

How to compute the one-sided approximation v and then deduce RN(ℓ)?
One-sided approximation via bivariate polynomials

1. Consider ℓ + 2^(−p−1) as the exact result of the function

       F(s, t) = s/(1 + t) + 2^(−p−1)

   at the points s* = 2^(1−c) · mx and t* = my − 1.

2. Approximate F(s, t) by a bivariate polynomial P(s, t)

       P(s, t) = s · a(t) + 2^(−p−1)

   → a(t): univariate polynomial approximant of 1/(1 + t)
   → approximation error Eapprox

3. Evaluate P(s, t) by a well-chosen efficient evaluation program P, giving v = P(s*, t*)
   → evaluation error Eeval

How to ensure that |(ℓ + 2^(−p−1)) − v| < 2^(−p−1)?
Sufficient error bounds

To ensure |(ℓ + 2^(−p−1)) − v| < 2^(−p−1), it suffices to ensure that

    µ · Eapprox + Eeval < 2^(−p−1),   with   µ = 4 − 2^(3−p),

since |(ℓ + 2^(−p−1)) − v| ≤ µ · Eapprox + Eeval.

This gives the following sufficient conditions:

    Eapprox ≤ θ   with   θ < 2^(−p−1)/µ   ⇒   Eeval < η = 2^(−p−1) − µ · θ
Example for the binary32 division

Sufficient conditions, with µ = 4 − 2^−21:

    Eapprox ≤ θ   with   θ < 2^−25/µ   and   Eeval < η = 2^−25 − µ · θ

Approximation of 1/(1 + t) by a Remez-like polynomial of degree 10:
• Eapprox ≤ θ, with θ = 3 · 2^−29 ≈ 6 · 10^−9
• Eeval < η, with η ≈ 7.4 · 10^−9

[Plot: absolute approximation error of a(t) over t ∈ [0, 1), staying within the required bound]
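As a quick sanity check of these binary32 numbers (p = 24, so µ = 4 − 2^−21 and 2^(−p−1) = 2^−25): the required bound on the approximation error is 2^−25/µ ≈ 7.5 · 10^−9, and θ = 3 · 2^−29 ≈ 5.6 · 10^−9 is indeed below it; the resulting evaluation error budget is η = 2^−25 − (4 − 2^−21) · 3 · 2^−29 = 2^−27 + 3 · 2^−50 ≈ 7.45 · 10^−9, consistent with the η ≈ 7.4 · 10^−9 above.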
Flowchart for generating efficient and certified C codes

From F(s, t), the error bounds Eapprox ≤ θ and Eeval < η, and the ST231 features: compute the polynomial approximant, compute low latency parenthesizations, then select effective parenthesizations. The outputs are the C code, the certificate, and the value v satisfying |(ℓ + 2^(−p−1)) − v| < 2^(−p−1), from which u is obtained by truncation.
Rounding condition: definition

Approximation u of ℓ, with ℓ = 2^(1−c) · mx/my.

The exact value ℓ may have an infinite number of bits
→ the sticky bit cannot always be computed

[Figure: possible positions of u and ℓ relative to the floating-point numbers and midpoints]

Computing RN(ℓ) requires being able to decide whether u ≥ ℓ
→ ℓ cannot be a midpoint

Rounding condition:

    u ≥ ℓ   ⇐⇒   u · my ≥ 2^(1−c) · mx
Rounding condition: implementation in integer arithmetic

Rounding condition: u · my ≥ 2^(1−c) · mx
• the approximation u and my are representable with 32 bits, so u · my is exactly representable with 64 bits
• 2^(1−c) · mx is representable with 32 bits, since c ∈ {0, 1}

⇒ one 32 × 32 → 32-bit multiplication and one comparison
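A portable sketch of this test (our own illustrative C, not the FLIP code): u is the truncated approximation with p = 24 fraction bits, held as an integer scaled by 2^24, and mx, my are the significands scaled by 2^23, so both sides below carry the same 2^47 scaling. A 64-bit product is used here for clarity; as stated above, the ST231 version gets away with a single 32 × 32 → 32-bit multiplication.

    #include <stdint.h>

    /* Return 1 if u >= l, i.e. if u * my >= 2^(1-c) * mx. */
    static inline int rounding_condition(uint32_t u, uint32_t mx, uint32_t my, uint32_t c)
    {
        uint64_t lhs = (uint64_t)u * my;                   /* u * my, exact in 64 bits        */
        uint64_t rhs = (uint64_t)(mx << (1u - c)) << 24;   /* 2^(1-c) * mx, same 2^47 scaling */
        return lhs >= rhs;
    }

The result of this test then drives the “possibly adding 2^−p” step of the correct rounding computation.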
Flowchart status: the polynomial approximant a(t) is now available; the next step of the flow is the computation of low latency parenthesizations.
Low latency parenthesization computation
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
Classical evaluation methods
Computation of all parenthesizations
Towards low evaluation latency
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Objectives

Compute an efficient parenthesization for evaluating P(s, t)
→ reduce the evaluation latency on unbounded parallelism

Evaluation program P = main part of the full software implementation
→ dominates the cost

Two families of algorithms
• algorithms with coefficient adaptation: Knuth and Eve (60’s), Paterson and Stockmeyer (1964), ...
  → ill-suited in the context of fixed-point arithmetic
• algorithms without coefficient adaptation
Classical parenthesizations for binary32 division

    P(s, t) = 2^−25 + s · ∑_{0 ≤ i ≤ 10} ai · t^i

Horner’s rule: (3 + 1) × 11 = 44 cycles
→ no ILP exposure

Second-order Horner’s rule: 27 cycles
→ evaluation of odd and even parts independently with Horner, more ILP

Estrin’s method: 19 cycles
→ evaluation of high and low parts in parallel, even more ILP
→ distributing the multiplication by s in the evaluation of a(t) → 16 cycles

... We can do better. How to explore the solution space of parenthesizations?
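To make the ILP difference concrete, here is a toy sketch (our own C, not generated CGPE code) contrasting Horner with an Estrin-like even/odd split on a degree-3 polynomial in unsigned fixed point; mul() stands for the 32 × 32 → 32-bit “high part” product and is a hypothetical helper, and fixed-point scaling details are glossed over.

    #include <stdint.h>

    static inline uint32_t mul(uint32_t a, uint32_t b)   /* upper 32 bits of the 64-bit product */
    {
        return (uint32_t)(((uint64_t)a * b) >> 32);
    }

    /* Horner: fully sequential; each step is one mul followed by a dependent add,
       so the latency is 3 * (M + A) cycles even on unbounded parallelism.        */
    static uint32_t horner3(uint32_t t, const uint32_t a[4])
    {
        return a[0] + mul(t, a[1] + mul(t, a[2] + mul(t, a[3])));
    }

    /* Estrin-like split: t*t, the low part and the high part are independent,
       so they can be issued in parallel on a VLIW core -> lower latency.      */
    static uint32_t estrin3(uint32_t t, const uint32_t a[4])
    {
        uint32_t t2 = mul(t, t);
        uint32_t lo = a[0] + mul(t, a[1]);
        uint32_t hi = a[2] + mul(t, a[3]);
        return lo + mul(t2, hi);
    }

With the ST231 latencies (A = 1, M = 3), horner3 needs 3 × 4 = 12 cycles on its critical path, while estrin3 needs max(3, 3 + 1) + 3 + 1 = 8; the generated codes for the degree-10 polynomial P(s, t) push the same idea much further.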
Algorithm for computing all parenthesizations

    a(x, y) = ∑_{0 ≤ i ≤ nx} ∑_{0 ≤ j ≤ ny} ai,j · x^i · y^j,   with n = nx + ny and anx,ny ≠ 0

Example
Let a(x, y) = a0,0 + a1,0 · x + a0,1 · y + a1,1 · x · y. Then a1,0 + a1,1 · y is a valid expression, while a1,0 · x + a1,1 · x is not.

Exhaustive algorithm: iterative process
→ step k = computation of all the valid expressions of total degree k
→ 3 building rules for computing all parenthesizations
Rules for building valid expressions

Consider step k of the algorithm:
• E(k): valid expressions of total degree k
• P(k): powers x^i · y^j of total degree k = i + j

Rule R1, for building the powers: p = p1 · p2, with deg(p) = deg(p1) + deg(p2), deg(p1) ≤ ⌊k/2⌋ and ⌈k/2⌉ ≤ deg(p2) < k.

Rule R2, for building expressions by multiplication: e = e′ · p, with deg(e) = deg(e′) + deg(p), deg(e′) < k and deg(p) ≤ k.

Rule R3, for building expressions by addition: e = e1 + e2, with deg(e) = max(deg(e1), deg(e2)), deg(e1) = k and deg(e2) ≤ k.
Number of parenthesizations

Number of generated parenthesizations for evaluating a bivariate polynomial of degree (nx, ny):

              ny = 0        ny = 1              ny = 2
    nx = 1    1             51                  67467
    nx = 2    7             67467               106191222651
    nx = 3    163           1133220387          10139277122276921118
    nx = 4    11602         207905478247998     ···
    nx = 5    2334244       ···                 ···
    nx = 6    1304066578    ···                 ···

Timings for parenthesization computation
→ univariate polynomial of degree 5: ≈ 1 h on a 2.4 GHz core
→ bivariate polynomial of degree (2,1): ≈ 30 s
→ P(s, t) of degree (3,1): ≈ 7 s (88384 schemes)

With the optimization for univariate polynomials and for P(s, t)
→ univariate polynomial of degree 5: ≈ 4 min
→ P(s, t) of degree (3,1): ≈ 2 s (88384 schemes)
Number of parenthesizations

[Histogram: number of degree-5 parenthesizations (from 10 to 10^6, log scale) versus latency on unbounded parallelism (from 10 to 20 cycles)]

→ minimal latency for a univariate polynomial of degree 5: 10 cycles (36 schemes)

How to compute only parenthesizations of low latency?
Determination of a target latency

Target latency τ = minimal cost for evaluating a0,0 + anx,ny · x^nx · y^ny
• if no scheme satisfies τ, then increase τ and restart

Static target latency τstatic
• as general as evaluating a0,0 + x^(nx+ny+1)
• τstatic = A + M × ⌈log2(nx + ny + 1)⌉

Dynamic target latency τdynamic
• takes the cost of the operator applied to anx,ny and the delays on the indeterminates into account
• computed by dynamic programming

Example: degree-9 bivariate polynomial (nx = 8, ny = 1), latencies A = 1 and M = 3, y available 9 cycles later than x
    τstatic  = 1 + 3 × ⌈log2(10)⌉ = 13 cycles
    τdynamic = 16 cycles
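A small sketch (illustrative C, names ours) of the static target latency formula, with the add and mul latencies A and M as parameters:

    /* tau_static = A + M * ceil(log2(nx + ny + 1)) */
    static unsigned ceil_log2(unsigned n)
    {
        unsigned k = 0;
        while ((1u << k) < n) ++k;
        return k;
    }

    static unsigned tau_static(unsigned nx, unsigned ny, unsigned A, unsigned M)
    {
        return A + M * ceil_log2(nx + ny + 1);
    }

For the example above, tau_static(8, 1, 1, 3) returns 1 + 3 × 4 = 13, as on the slide; the dynamic target latency additionally accounts for operand delays and is not captured by this closed formula.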
Optimized search of best parenthesizations

Example: let a(x, y) be a degree-2 bivariate polynomial

    a(x, y) = a0,0 + a1,0 · x + a0,1 · y + a1,1 · x · y.

⇒ find a best splitting of the polynomial → low latency

[Diagram: recursive splitting of a(x, y). At level 1, a(x, y) is split around a0,0 and anx,ny · x^nx · y^ny into subpolynomials a′(x, y) and a″(x, y); a subpolynomial whose support exceeds a threshold max is split further (level 2, ...), while one whose support is ≤ max is kept and handled by exhaustive search.]
Efficient evaluation parenthesization generation

    P(s, t) = 2^−25 + s · ∑_{0 ≤ i ≤ 10} ai · t^i

First target latency τ = 13 → no parenthesization found
Second target latency τ = 14 → obtained in about 10 sec.

Classical methods, for comparison:
• Horner: 44 cycles
• Estrin: 19 cycles
• Estrin with distribution of s: 16 cycles

[Figure: evaluation DAG of the generated parenthesization over 14 cycles, with inputs s, t, the coefficients a0, ..., a10 and the constant 2^−25, producing the result v at cycle 14]
Flowchart status: the polynomial approximant a(t) and the low latency parenthesizations are now available; the remaining step is the selection of effective parenthesizations, which produces the C code and its certificate.
Selection of effective evaluation parenthesizations
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
General framework
Automatic certification of generated C codes
4. Numerical results
5. Conclusions and perspectives
Selection of effective parenthesizations

1. Arithmetic operator choice
• all intermediate variables are of constant sign

2. Scheduling on a simplified model of the ST231
• architecture constraints: cost of operators, instruction bundling, ...
• delays on indeterminates

3. Certification of the generated C code
• straight-line polynomial evaluation program
• “certified C code”: we can bound the evaluation error in integer arithmetic
Certification of evaluation error for binary32 division

Recall the sufficient conditions, with µ = 4 − 2^−21:

    Eapprox ≤ θ   with   θ < 2^−25/µ   and   Eeval < η = 2^−25 − µ · θ

For the degree-10 approximant: Eapprox ≤ θ with θ = 3 · 2^−29 ≈ 6 · 10^−9, and Eeval < η with η ≈ 7.4 · 10^−9.

[Plot: absolute approximation error of a(t) over t ∈ [0, 1), against the required bound]
Certification of evaluation error for binary32 division

Case 1: mx ≥ my → condition satisfied
Case 2: mx < my → condition not satisfied: Eeval ≥ η, e.g. at
    s* = 3.935581684112548828125 and t* = 0.97490441799163818359375

[Plot: approximation error of a(t) for t near 0.965–0.995, against the required bound 2^−25/(4 − 2^−21) ≈ 8 · 10^−9 and the approximation error bound θ = 3 · 2^−29 ≈ 6 · 10^−9]

Around such a point:
1. determine an interval I around this point
2. compute Eapprox over I
3. determine an evaluation error bound η
4. check whether Eeval < η
Certification of evaluation error for binary32 division

Sufficient conditions for each subinterval T(i), with µ = 4 − 2^−21:

    Eapprox(i) ≤ θ(i)   with   θ(i) < 2^−25/µ   and   Eeval(i) < η(i) = 2^−25 − µ · θ(i)

[Plot: approximation error of a(t) over t ∈ [0, 1), with per-subinterval bounds θ(i) and η(i)]
Certification using a dichotomy-based strategy

Implementation of the splitting by dichotomy: for each subinterval T(i),
1. compute a certified approximation error bound θ(i)   [Sollya]
2. determine an evaluation error bound η(i)   [Sollya]
3. check this bound: Eeval(i) < η(i)   [Gappa]
⇒ if this bound is not satisfied, T(i) is split up into 2 subintervals

Example of binary32 implementation
→ launched on a 64-processor grid
→ 36127 subintervals found in several hours (≈ 5 h)
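The structure of that splitting loop can be sketched as follows (illustrative C only; approx_error_bound() and gappa_check_eval() are hypothetical hooks standing for the calls to Sollya and Gappa, and the real driver generates and checks Gappa scripts rather than calling a C function):

    #include <stdio.h>

    extern double approx_error_bound(double lo, double hi);             /* hypothetical: theta(i) via Sollya  */
    extern int    gappa_check_eval  (double lo, double hi, double eta); /* hypothetical: Gappa proof attempt  */

    #define MU     (4.0 - 0x1p-21)   /* mu = 4 - 2^-21 for binary32 */
    #define BOUND  0x1p-25           /* 2^-(p+1) = 2^-25            */

    static void certify(double lo, double hi)
    {
        double theta = approx_error_bound(lo, hi);
        double eta   = BOUND - MU * theta;          /* eta(i) = 2^-25 - mu * theta(i) */
        if (theta < BOUND / MU && gappa_check_eval(lo, hi, eta)) {
            printf("certified on [%a, %a]\n", lo, hi);
        } else {                                    /* split T(i) into two halves and retry */
            double mid = lo + 0.5 * (hi - lo);
            certify(lo, mid);
            certify(mid, hi);
        }
    }

Each leaf of the recursion corresponds to one subinterval T(i) of the final splitting.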
Numerical results
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Performances of FLIP on ST231

[Bar charts: latencies (# cycles) of FLIP 1.0, FLIP 0.3 and STlib, and speed-ups (%) of FLIP 1.0 vs STlib and vs FLIP 0.3, for the operators add, sub, mul, div and sqrt]

Performances on ST231, in RoundTiesToEven
⇒ speed-up between 20 and 50 %

Implementations of other operators, on ST231, in RoundTiesToEven (in number of cycles):

    x^−1: 25    x^−1/2: 29    x^1/3: 34    x^−1/3: 40    x^−1/4: 42
Impact of dynamic target latency

Latency (# cycles) on unbounded parallelism and on ST231:

                                                      x^1/3    x^−1/3
    Degree (nx, ny)                                   (8,1)    (9,1)
    Delay on the operand s (# cycles)                 9        9
    Static target latency                             13       13
    Dynamic target latency                            16       16
    Latency on unbounded parallelism and on ST231     16       16

⇒ this allows us to conclude on the optimality in terms of polynomial evaluation latency
Timings for code generation

Timing of each step of the generation flow:

                                        x^1/2     x^−1/2    x^1/3    x^−1/3   x^−1
    Degree (nx, ny)                     (8,1)     (9,1)     (8,1)    (9,1)    (10,0)
    Static target latency               13        13        13       13       13
    Dynamic target latency              13        13        16       16       13
    Latency on unbounded parallelism    13        13        16       16       13
    Latency on ST231                    13        14        16       16       13
    Parenthesization generation         172 ms    152 ms    53 s     56 s     168 ms
    Arithmetic operator choice          6 ms      6 ms      7 ms     11 ms    4 ms
    Scheduling                          29 s      4m21s     32 ms    132 ms   7 s
    Certification (Gappa)               6 s       4 s       1m38s    1m07s    11 s
    Total time (≈)                      35 s      4m25s     2m31s    2m03s    18 s

Impact of the target latency on the first step of the generation.

What may dominate the cost:
→ the scheduling algorithm
→ the certification using Gappa
Conclusions and perspectives
Outline of the talk
1. Design and implementation of floating-point operators
2. Low latency parenthesization computation
3. Selection of effective evaluation parenthesizations
4. Numerical results
5. Conclusions and perspectives
Conclusions

Design and implementation of floating-point operators
• uniform approach for correctly-rounded roots and their reciprocals
• extension to correctly-rounded division
• polynomial evaluation-based method, very high ILP exposure
⇒ new, much faster version of FLIP

Code generation for efficient and certified polynomial evaluation
• methodologies and tools for automating the implementation of polynomial evaluation
• heuristics and techniques for quickly generating efficient and certified C codes
⇒ CGPE: automatically writes and certifies ≈ 50 % of the codes of FLIP
Perspectives

Faithful implementation of floating-point operators
→ other floating-point operators: log2(1 + x) over [0.5, 1), 1/√(1 + x²) over [0, 0.5), ...
→ roots and their reciprocals: rounding condition decision not automated yet

Extension to other binary floating-point formats
→ square root in binary64: 171 cycles on ST231, versus 396 cycles with STlib

Extension to other architectures, typically FPGAs
→ the polynomial evaluation-based approach already seems to be a good alternative to multiplicative methods on FPGAs
→ the other techniques introduced in this thesis should be investigated further