Implementation of binary floating-point arithmetic on embedded integer processors
Polynomial evaluation-based algorithms and certified code generation

Guillaume Revy
Advisors: Claude-Pierre Jeannerod and Gilles Villard
Arénaire INRIA project-team (LIP, ENS Lyon), Université de Lyon, CNRS
Ph.D. Defense – December 1st, 2009


Motivation

Embedded systems are ubiquitous
- microprocessors dedicated to one or a few specific tasks
- must satisfy constraints: area, energy consumption, design cost

Some embedded systems, such as the ST231, do not have any FPU (floating-point unit), yet they are highly used in audio and video applications, which are demanding on floating-point computations.
→ floating-point arithmetic must be provided in software

[Figure: block diagram of the ST231 core — 4 integer units (IU), 2 multipliers, register file (64 registers, 8 read / 4 write ports), branch register file, load/store unit, prefetch and instruction buffers, I-side and D-side memory subsystems, control registers, interrupt controller, timers, debug support unit.]
Overview of the ST231 architecture

4-issue VLIW 32-bit integer processor → no FPU
- 4 integer ALUs and 2 pipelined 32 × 32 → 32 multipliers
- latencies: ALU → 1 cycle, Mul → 3 cycles

VLIW (Very Long Instruction Word) → instructions grouped into bundles
→ Instruction-Level Parallelism (ILP) explicitly exposed by the compiler

    uint32_t R1 = A0 + C;
    uint32_t R2 = A3 * X;
    uint32_t R3 = A1 * X;
    uint32_t R4 = X * X;

→ R1, R2, and R3 fit in one bundle (cycle 0); R4 is issued in the next bundle (cycle 1), since only two multipliers are available per cycle.

How to emulate floating-point arithmetic in software?

Goal: design and implementation of efficient software support for IEEE 754 floating-point arithmetic on integer processors.

Existing software for IEEE 754 floating-point arithmetic:
- software floating-point support of GCC, Glibc and µClibc, GoFast Floating-Point Library
- SoftFloat (→ STlib)
- FLIP (Floating-point Library for Integer Processors) — see the sketch after this list
  • software support for binary32 floating-point arithmetic on integer processors
  • correctly-rounded addition, subtraction, multiplication, division, square root, reciprocal, ...
  • handling of subnormals and of special inputs
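On such a processor a binary32 operand is just its 32-bit encoding, and a software operator is an ordinary integer function on encodings. A minimal sketch of this view; the prototype flip_div is hypothetical (the actual FLIP and compiler-runtime entry points differ):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical prototype of a software binary32 divide operating directly
       on 32-bit encodings; illustrative only. */
    uint32_t flip_div(uint32_t X, uint32_t Y);

    int main(void)
    {
        float x = 1.5f;
        uint32_t X;
        memcpy(&X, &x, sizeof X);      /* a float and its encoding: a bit copy */
        printf("encoding of 1.5f: 0x%08X\n", X);    /* prints 0x3FC00000 */
        return 0;
    }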
Towards the generation of fast and certified codes

Underlying problem: development "by hand"
- long, tedious, and error-prone
- what about a new target, or a new floating-point format?
⇒ need for automation and certification

Current challenge: tools and methodologies for the automatic generation of efficient and certified programs
- optimized for a given format and for the target architecture

Arénaire's developments: hardware (FloPoCo) and software (Sollya, Metalibm)
Spiral project: hardware and software code generation for DSP algorithms
→ Can we teach computers to write fast libraries?

Our tool: CGPE (Code Generation for Polynomial Evaluation)
→ In the particular case of polynomial evaluation, can we teach computers to write fast and certified codes, for a given target and optimized for a given format?

Basic blocks for implementing correctly-rounded operators

An operator takes the input encoding(s) (X, Y) and returns the result encoding R. The flowchart distinguishes function-independent from function-dependent blocks, and chains: special input detection, floating-point number unpacking, normalization, range reduction, result sign/exponent computation, result significand approximation, rounding condition decision, correct rounding computation, result reconstruction, and special output selection. A minimal example of the special input detection block is sketched below.

Objectives
→ low latency, correctly-rounded implementations
→ ILP exposure
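As a concrete illustration of the special input detection block, a minimal binary32 filter; the helper name and code are illustrative only (FLIP's actual special-value handling is more involved):

    #include <stdint.h>

    /* An input leaves the main (normal-number) path if its biased exponent
       field is all zeros (zero or subnormal) or all ones (infinity or NaN). */
    static int is_special_binary32(uint32_t X)
    {
        uint32_t E = (X >> 23) & 0xFF;    /* biased exponent field */
        return E == 0x00 || E == 0xFF;
    }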
Within these blocks, the result significand approximation is obtained by polynomial evaluation: a uniform approach covers nth roots and their reciprocals, and extends to division. The corresponding C code, together with a certificate, is generated fully automatically from the function to be evaluated and from the ST231 features.

Flowchart for generating efficient and certified C codes

Problem: function to be evaluated + ST231 features
→ computation of a polynomial approximant
→ efficient and certified C code generation
→ outputs: C code + certificate

Constraints
- accuracy of the approximant and of the C code → Sollya; interval arithmetic (MPFI), Gappa
- low evaluation latency on ST231, ILP exposure → CGPE
- efficiency of the generation process itself

Outline of the talk

1. Design and implementation of floating-point operators: bivariate polynomial evaluation-based approach; implementation of correct rounding
2. Low latency parenthesization computation: classical evaluation methods; computation of all parenthesizations; towards low evaluation latency
3. Selection of effective evaluation parenthesizations: general framework; automatic certification of generated C codes
4. Numerical results
5. Conclusions and perspectives

Part 1: Design and implementation of floating-point operators

Notation and assumptions

Running example: division, (x, y) → C code → RN(x/y).

Input (x, y) and output RN(x/y): normal numbers → no underflow nor overflow
- precision p, extremal exponents emin, emax
- x = ±1.mx,1 ... mx,p−1 · 2^ex  with  ex ∈ {emin, ..., emax}
- rounding attribute roundTiesToEven: x/y → RN(x/y)

Standard binary encoding: a k-bit unsigned integer X encodes the input x
- sign sx: 1 bit
- biased exponent Ex = ex − emin + 1: w = k − p bits
- trailing significand Tx = mx,1 ... mx,p−1: p − 1 bits

Computation: on k-bit unsigned integers → integer and fixed-point arithmetic (see the sketch below).
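A sketch of this encoding in C for the binary32 format (k = 32, p = 24, w = 8); the field extraction is standard, while the Q1.31 fixed-point format chosen here for the significand is an illustrative assumption, not necessarily FLIP's:

    #include <stdint.h>

    /* Field extraction for binary32: sign, biased exponent Ex, trailing
       significand Tx. */
    static uint32_t sign_field(uint32_t X)  { return X >> 31; }
    static uint32_t exp_field(uint32_t X)   { return (X >> 23) & 0xFF; }
    static uint32_t trail_field(uint32_t X) { return X & 0x7FFFFF; }

    /* Significand mx = 1.Tx of a normal number as a Q1.31 value (1 integer
       bit, 31 fraction bits): implicit leading 1 restored, fraction
       left-aligned in 32 bits. */
    static uint32_t significand_q1_31(uint32_t X)
    {
        return 0x80000000u | (trail_field(X) << 8);
    }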
Range reduction of division

Express the exact result r = x/y as

    r = ℓ · 2^d   ⇒   RN(x/y) = RN(ℓ) · 2^d,   with ℓ ∈ [1,2) and d ∈ {emin, ..., emax}.

Definition: c = 1 if mx ≥ my, and c = 0 otherwise.

Range reduction:

    x/y = ( 2^(1−c) · mx/my ) · 2^d,   with d = ex − ey − 1 + c   and   ℓ := 2^(1−c) · mx/my ∈ [1,2).

A sketch of this reduction on the encodings is given below.
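A minimal sketch of the range reduction in integer arithmetic, for binary32 encodings of normal inputs (illustrative, not FLIP's actual code):

    #include <stdint.h>

    /* Computes c and d of the range reduction x/y = 2^(1-c)*(mx/my)*2^d. */
    static void range_reduce_div(uint32_t X, uint32_t Y, uint32_t *c, int32_t *d)
    {
        uint32_t mx = 0x80000000u | ((X & 0x7FFFFF) << 8);   /* 1.Tx in Q1.31 */
        uint32_t my = 0x80000000u | ((Y & 0x7FFFFF) << 8);   /* 1.Ty in Q1.31 */
        int32_t  ex = (int32_t)((X >> 23) & 0xFF) - 127;     /* unbiased exponents */
        int32_t  ey = (int32_t)((Y >> 23) & 0xFF) - 127;

        *c = (mx >= my);                  /* c = 1 if mx >= my, 0 otherwise */
        *d = ex - ey - 1 + (int32_t)*c;   /* exponent d of the result */
    }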
How to compute the correctly-rounded significand RN(ℓ)?

Methods for computing the correctly-rounded significand

Iterative methods: restoring, non-restoring, SRT, ...
- Oberman and Flynn (1997)
- minimal ILP exposure, sequential algorithm

Multiplicative methods: Newton-Raphson, Goldschmidt
- Piñeiro and Bruguera (2002) – Raina's Ph.D., FLIP 0.3 (2006)
- exploit the available multipliers, more ILP exposure

Polynomial-based methods
- Agarwal, Gustavson and Schmookler (1999) → univariate polynomial evaluation
- our approach → bivariate polynomial evaluation: maximal ILP exposure

Correct rounding via truncated one-sided approximation

How to compute RN(ℓ), with ℓ = 2^(1−c) · mx/my? Three steps:

1. compute v = 1.v1 ... v_{k−2} such that −2^(−p) ≤ ℓ − v < 0
   → implied by |(ℓ + 2^(−p−1)) − v| < 2^(−p−1)
   → bivariate polynomial evaluation
2. compute u as the truncation of v after p fraction bits
3. determine RN(ℓ) after possibly adding 2^(−p)

How to compute the one-sided approximation v and then deduce RN(ℓ)?

One-sided approximation via bivariate polynomials

1. Consider ℓ + 2^(−p−1) as the exact result of the function
       F(s,t) = s/(1 + t) + 2^(−p−1)
   at the points s* = 2^(1−c) · mx and t* = my − 1.
2. Approximate F(s,t) by a bivariate polynomial
       P(s,t) = s · a(t) + 2^(−p−1),
   where a(t) is a univariate polynomial approximant of 1/(1 + t)
   → approximation error Eapprox.
3. Evaluate P(s,t) by a well-chosen, efficient evaluation program P:
       v = P(s*, t*)
   → evaluation error Eeval.
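A sketch of how the evaluation points can be set up in unsigned fixed point, starting from Q1.31 significands as in the previous sketches; the output formats Q2.30 and Q0.32 are illustrative assumptions, not necessarily the ones used in FLIP:

    #include <stdint.h>

    /* s* = 2^(1-c) * mx, in [1,4): Q2.30. Exact, since the low 8 bits of a
       binary32 significand in Q1.31 are zero. */
    static uint32_t make_s_q2_30(uint32_t mx_q1_31, uint32_t c)
    {
        return mx_q1_31 >> c;
    }

    /* t* = my - 1, in [0,1): Q0.32. The left shift drops the implicit
       leading 1 of my (well-defined wraparound on unsigned). */
    static uint32_t make_t_q0_32(uint32_t my_q1_31)
    {
        return my_q1_31 << 1;
    }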
How to ensure that |(ℓ + 2^(−p−1)) − v| < 2^(−p−1)?

Sufficient error bounds

To ensure |(ℓ + 2^(−p−1)) − v| < 2^(−p−1), it suffices to ensure that

    µ · Eapprox + Eeval < 2^(−p−1),

since |(ℓ + 2^(−p−1)) − v| ≤ µ · Eapprox + Eeval, with µ = 4 − 2^(3−p).

This gives the following sufficient conditions:

    Eapprox ≤ θ with θ < 2^(−p−1)/µ   ⇒   Eeval < η = 2^(−p−1) − µ · θ.

Example for the binary32 division

Sufficient conditions, with µ = 4 − 2^(−21):

    Eapprox ≤ θ with θ < 2^(−25)/µ   and   Eeval < η = 2^(−25) − µ · θ.

Approximating 1/(1 + t) by a Remez-like polynomial of degree 10 gives
- Eapprox ≤ θ, with θ = 3 · 2^(−29) ≈ 6 · 10^(−9)
- hence the evaluation must satisfy Eeval < η, with η ≈ 7.4 · 10^(−9).

[Figure: absolute approximation error of a(t) over [0,1), against the required bound 2^(−25)/µ ≈ 8 · 10^(−9).]
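A quick numerical check of these binary32 values (nothing more than the arithmetic above):

    #include <stdio.h>
    #include <math.h>

    /* mu = 4 - 2^-21, theta = 3*2^-29, eta = 2^-25 - mu*theta (p = 24). */
    int main(void)
    {
        double mu    = 4.0 - ldexp(1.0, -21);
        double theta = 3.0 * ldexp(1.0, -29);
        double eta   = ldexp(1.0, -25) - mu * theta;
        printf("theta = %.3g, eta = %.3g\n", theta, eta);  /* ~5.59e-09, ~7.45e-09 */
        return 0;
    }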
Flowchart for generating efficient and certified C codes (revisited)

Inputs: F(s,t), the bounds Eapprox ≤ θ and Eeval < η, and the ST231 features
→ computation of a polynomial approximant
→ computation of low latency parenthesizations
→ selection of effective parenthesizations
→ outputs: C code and certificate; the generated code computes v with |(ℓ + 2^(−p−1)) − v| < 2^(−p−1), from which u is obtained by truncation.

Implementation of correct rounding

Rounding condition: definition

We have an approximation u of ℓ, with ℓ = 2^(1−c) · mx/my. The exact value ℓ may have an infinite number of bits → the sticky bit cannot always be computed.

Computing RN(ℓ) requires being able to decide whether u ≥ ℓ
→ ℓ cannot be a midpoint, so this decision suffices.

Rounding condition:

    u ≥ ℓ   ⟺   u · my ≥ 2^(1−c) · mx.

Rounding condition: implementation in integer arithmetic

The approximation u and my are representable with 32 bits:
- u · my is exactly representable with 64 bits
- 2^(1−c) · mx is representable with 32 bits, since c ∈ {0, 1}
⇒ one 32 × 32 → 32-bit multiplication and one comparison (see the sketch below).
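A sketch of this test in portable C, written with a 64-bit product for clarity (the ST231 code reduces it to one 32 × 32 → 32 multiplication and a comparison, as stated above); the fixed-point formats are illustrative assumptions:

    #include <stdint.h>

    /* U  : approximation u of l, p = 24 fraction bits (u = U / 2^24)
       MX, MY : significands mx = 1.Tx, my = 1.Ty in Q1.31 (m = M / 2^31)
       c  : 1 if mx >= my, 0 otherwise
       Returns 1 iff u >= l, i.e. iff u*my >= 2^(1-c)*mx.                  */
    static int rounding_condition(uint32_t U, uint32_t MX, uint32_t MY, uint32_t c)
    {
        uint64_t lhs = (uint64_t)U * MY;           /* u*my        * 2^55 */
        uint64_t rhs = (uint64_t)MX << (25 - c);   /* 2^(1-c)*mx  * 2^55 */
        return lhs >= rhs;
    }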
(Flowchart revisited: the polynomial approximant a(t) is now available; the next step is the computation of low latency parenthesizations.)

Part 2: Low latency parenthesization computation

Objectives

Compute an efficient parenthesization for evaluating P(s,t)
→ reduce the evaluation latency on unbounded parallelism

The evaluation program P is the main part of the full software implementation → it dominates the cost.

Two families of algorithms
- algorithms with coefficient adaptation: Knuth and Eve (60's), Paterson and Stockmeyer (1964), ... → ill-suited in the context of fixed-point arithmetic
- algorithms without coefficient adaptation

Classical parenthesizations for binary32 division

    P(s,t) = 2^(−25) + s · Σ_{0≤i≤10} ai · t^i

Horner's rule: (3 + 1) × 11 = 44 cycles → no ILP exposure
Second-order Horner's rule: 27 cycles
→ evaluation of the odd and even parts independently with Horner, more ILP

Estrin's method: 19 cycles
→ evaluation of the high and low parts in parallel, even more ILP
→ distributing the multiplication by s into the evaluation of a(t) → 16 cycles

... We can do better. How to explore the solution space of parenthesizations?

Algorithm for computing all parenthesizations

    a(x,y) = Σ_{0≤i≤nx} Σ_{0≤j≤ny} a_{i,j} · x^i · y^j,   with n = nx + ny and a_{nx,ny} ≠ 0.

Example: let a(x,y) = a_{0,0} + a_{1,0}·x + a_{0,1}·y + a_{1,1}·x·y. Then a_{1,0} + a_{1,1}·y is a valid expression, while a_{1,0}·x + a_{1,1}·x is not.

Exhaustive algorithm: an iterative process
→ step k = computation of all the valid expressions of total degree k
→ 3 building rules for computing all parenthesizations
Rules for building valid expressions

At step k of the algorithm:
- E(k): valid expressions of total degree k
- P(k): powers x^i·y^j of total degree k = i + j

Rule R1, building the powers: p = p1 · p2 with deg(p) = deg(p1) + deg(p2), deg(p1) ≤ ⌊k/2⌋ and ⌈k/2⌉ ≤ deg(p2) < k.

Rule R2, building expressions by multiplication: e = e' · p with deg(e) = deg(e') + deg(p), deg(e') < k and deg(p) ≤ k.

Rule R3, building expressions by addition: e = e1 + e2 with deg(e) = max(deg(e1), deg(e2)), deg(e1) = k and deg(e2) ≤ k.

Number of parenthesizations

Number of generated parenthesizations for evaluating a bivariate polynomial of degree (nx, ny):

              nx = 1        nx = 2        nx = 3                 nx = 4         nx = 5        nx = 6
    ny = 0         1             7           163                  11602        2334244    1304066578
    ny = 1        51         67467    1133220387        207905478247998            ...           ...
    ny = 2     67467  106191222651    10139277122276921118            ...           ...           ...

Timings for parenthesization computation
→ univariate polynomial of degree 5: ≈ 1 h on a 2.4 GHz core
→ bivariate polynomial of degree (2,1): ≈ 30 s
→ P(s,t) of degree (3,1): ≈ 7 s (88384 schemes)

Optimization for univariate polynomials and for P(s,t)
→ univariate polynomial of degree 5: ≈ 4 min
→ P(s,t) of degree (3,1): ≈ 2 s (88384 schemes)

[Figure: number of degree-5 parenthesizations by latency on unbounded parallelism, from 10 to 20 cycles.]
→ minimal latency for a univariate polynomial of degree 5: 10 cycles (36 schemes)

How to compute only parenthesizations of low latency?
Towards low evaluation latency: determination of a target latency

Target latency τ = minimal cost for evaluating a_{0,0} + a_{nx,ny} · x^nx · y^ny
- if no scheme satisfies τ, then increase τ and restart

Static target latency τstatic
- as general as evaluating a_{0,0} + x^(nx+ny+1)
- τstatic = A + M × ⌈log2(nx + ny + 1)⌉

Dynamic target latency τdynamic
- takes into account the cost of the operators applied to a_{nx,ny} and the delays on the indeterminates
- computed by dynamic programming

Example: degree-9 bivariate polynomial, nx = 8 and ny = 1; latencies A = 1 and M = 3; y available 9 cycles later than x:
- τstatic = 1 + 3 × ⌈log2(10)⌉ = 13 cycles
- τdynamic = 16 cycles
(a helper computing τstatic is sketched after the next example)

Optimized search of best parenthesizations

Example: let a(x,y) be the degree-2 bivariate polynomial a(x,y) = a_{0,0} + a_{1,0}·x + a_{0,1}·y + a_{1,1}·x·y.
⇒ find a best splitting of the polynomial → low latency
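Going back to the static bound τstatic = A + M · ⌈log2(nx + ny + 1)⌉, a small helper sketch (with A = 1 and M = 3 it reproduces the 13 cycles of the example above; the dynamic bound requires the dynamic programming mentioned earlier):

    #include <stdint.h>

    static unsigned ceil_log2(unsigned n)
    {
        unsigned r = 0;
        while ((1u << r) < n)
            ++r;
        return r;
    }

    /* Static target latency: A = adder latency, M = multiplier latency. */
    static unsigned tau_static(unsigned nx, unsigned ny, unsigned A, unsigned M)
    {
        return A + M * ceil_log2(nx + ny + 1);
    }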
[Figure: recursive splitting of a(x,y). At level 1, a(x,y) is split around a_{0,0} and a_{nx,ny}·x^nx·y^ny into subpolynomials; a subpolynomial whose support is small enough (≤ max) is kept and handled by exhaustive search, otherwise (support > max) it is split again at level 2, and so on.]

Efficient evaluation parenthesization generation

    P(s,t) = 2^(−25) + s · Σ_{0≤i≤10} ai · t^i

- first target latency τ = 13 → no parenthesization found
- second target latency τ = 14 → a parenthesization obtained in about 10 seconds

Classical methods, for comparison: Horner 44 cycles, Estrin 19 cycles, Estrin with s distributed 16 cycles; a sketch of the difference in shape is given below.

[Figure: generated evaluation tree for P(s,t), of latency 14 cycles on unbounded parallelism, combining s, t, and the coefficients 2^(−25), a0, ..., a10.]
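The generated code is straight-line C over fixed-point integers. As an illustration of the difference in shape between Horner and a more parallel parenthesization, here is a small degree-3 example in Q0.32 (not the degree-10 scheme of the figure); mul() stands for a truncating fixed-point multiplication, and coefficient scalings avoiding overflow are assumed:

    #include <stdint.h>

    /* Fixed-point multiplication in Q0.32: keep the 32 high bits of the product. */
    static uint32_t mul(uint32_t a, uint32_t b)
    {
        return (uint32_t)(((uint64_t)a * b) >> 32);
    }

    /* Horner: each operation depends on the previous one -> no ILP
       (12 cycles on unbounded parallelism with A = 1, M = 3). */
    static uint32_t horner3(const uint32_t a[4], uint32_t t)
    {
        return a[0] + mul(t, a[1] + mul(t, a[2] + mul(t, a[3])));
    }

    /* Estrin-like: r0, r1 and t2 are independent -> ILP exposed (8 cycles). */
    static uint32_t estrin3(const uint32_t a[4], uint32_t t)
    {
        uint32_t r0 = a[0] + mul(t, a[1]);
        uint32_t r1 = a[2] + mul(t, a[3]);
        uint32_t t2 = mul(t, t);
        return r0 + mul(t2, r1);
    }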
(Flowchart revisited: the approximant a(t) and the low latency parenthesizations are now available; the last step is the selection of effective parenthesizations.)

Part 3: Selection of effective evaluation parenthesizations

Selection of effective parenthesizations

1. Arithmetic operator choice
   - all intermediate variables are of constant sign
2. Scheduling on a simplified model of the ST231
   - architectural constraints: cost of the operators, instruction bundling, ...
   - delays on the indeterminates
3. Certification of the generated C code
   - straight-line polynomial evaluation program
   - "certified C code": we can bound the evaluation error in integer arithmetic

Certification of the evaluation error for the binary32 division

Sufficient conditions, with µ = 4 − 2^(−21):

    Eapprox ≤ θ with θ < 2^(−25)/µ   and   Eeval < η = 2^(−25) − µ · θ,

where the Remez-like approximant gives Eapprox ≤ θ with θ = 3 · 2^(−29) ≈ 6 · 10^(−9), hence the target Eeval < η with η ≈ 7.4 · 10^(−9).

[Figure: absolute approximation error of a(t) over [0,1), against the required bound.]
Checking the evaluation error bound, by case:
- Case 1: mx ≥ my → the condition Eeval < η is satisfied.
- Case 2: mx < my → the condition is not satisfied: Eeval ≥ η, for instance at
      s* = 3.935581684112548828125 and t* = 0.97490441799163818359375.

[Figure: zoom of the approximation error around t* ≈ 0.975; required bound 2^(−25)/(4 − 2^(−21)) ≈ 8 · 10^(−9), approximation error bound θ = 3 · 2^(−29) ≈ 6 · 10^(−9).]

Remedy: work on subintervals. Around such a point:
1. determine an interval I around the point
2. compute Eapprox over I
3. determine an evaluation error bound η
4. check whether Eeval < η

Sufficient conditions for each subinterval T^(i), with µ = 4 − 2^(−21):

    E^(i)_approx ≤ θ^(i) with θ^(i) < 2^(−25)/µ   and   E^(i)_eval < η^(i) = 2^(−25) − µ · θ^(i).

These per-subinterval checks are applied through the dichotomy-based splitting described next; its structure is sketched below.
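A minimal sketch of that splitting structure; check() is an abstract predicate standing for the Sollya/Gappa steps above (the real tool drives external provers — this is only the recursion skeleton):

    #include <stdio.h>

    typedef int (*check_fn)(double lo, double hi);

    /* Accept an interval if check() succeeds on it, otherwise bisect it;
       returns the number of certified subintervals. */
    static unsigned long split(double lo, double hi, check_fn check)
    {
        if (check(lo, hi))
            return 1;
        double mid = 0.5 * (lo + hi);
        return split(lo, mid, check) + split(mid, hi, check);
    }

    /* Dummy predicate, for illustration only: accept widths <= 1/64. */
    static int narrow_enough(double lo, double hi) { return hi - lo <= 1.0 / 64; }

    int main(void)
    {
        printf("%lu subintervals\n", split(0.0, 1.0, narrow_enough));   /* 64 */
        return 0;
    }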
Certification using a dichotomy-based strategy

Implementation of the splitting by dichotomy. For each subinterval T^(i):
1. compute a certified approximation error bound θ^(i)  — Sollya
2. determine an evaluation error bound η^(i)  — Sollya
3. check this bound: E^(i)_eval < η^(i)  — Gappa
⇒ if this bound is not satisfied, T^(i) is split up into 2 subintervals.

Example for the binary32 implementation
→ launched on a 64-processor grid
→ 36127 subintervals, found in several hours (≈ 5 h)

Part 4: Numerical results

Performances of FLIP on ST231

[Figure: latencies (in cycles) of FLIP 1.0, FLIP 0.3 and STlib, and speed-ups of FLIP 1.0 vs STlib and vs FLIP 0.3, for add, sub, mul, div, sqrt, in roundTiesToEven.]

⇒ speed-up between 20 and 50 %

Implementations of other operators (latency on ST231, in roundTiesToEven, in number of cycles):

    x^(−1)    x^(−1/2)    x^(1/3)    x^(−1/3)    x^(−1/4)
      25         29          34          40          42
Impact of the dynamic target latency

                                              x^(1/3)    x^(−1/3)
    Degree (nx, ny)                            (8,1)      (9,1)
    Delay on the operand s (# cycles)            9          9
    Static target latency                       13         13
    Dynamic target latency                      16         16
    Latency on unbounded parallelism
      and on ST231                              16         16

⇒ allows us to conclude on the optimality of the generated schemes in terms of polynomial evaluation latency.

Timings for code generation

                                     x^(1/2)   x^(−1/2)   x^(1/3)   x^(−1/3)   x^(−1)
    Degree (nx, ny)                   (8,1)     (9,1)      (8,1)     (9,1)     (10,0)
    Static target latency               13        13         13        13        13
    Dynamic target latency              13        13         16        16        13
    Latency on unbounded parallelism    13        13         16        16        13
    Latency on ST231                    13        14         16        16        13
    Parenthesization generation       172ms     152ms        53s       56s      168ms
    Arithmetic operator choice          6ms       6ms        7ms       11ms       4ms
    Scheduling                          29s     4m21s       32ms      132ms        7s
    Certification (Gappa)                6s        4s      1m38s      1m07s       11s
    Total time (≈)                      35s     4m25s      2m31s      2m03s       18s

Notes
→ the target latency impacts the first step of the generation (parenthesization generation)
→ what may dominate the cost: the scheduling algorithm, and the certification using Gappa

Part 5: Conclusions and perspectives
Conclusions

Design and implementation of floating-point operators
- uniform approach for correctly-rounded roots and their reciprocals
- extension to correctly-rounded division
- polynomial evaluation-based method, very high ILP exposure
⇒ a new, much faster version of FLIP

Code generation for efficient and certified polynomial evaluation
- methodologies and tools for automating the implementation of polynomial evaluation
- heuristics and techniques for quickly generating efficient and certified C codes
⇒ CGPE: makes it possible to write and certify automatically ≈ 50 % of the codes of FLIP

Perspectives

Faithful implementation of floating-point operators
→ other floating-point operators: log2(1 + x) over [0.5, 1), 1/√(1 + x²) over [0, 0.5), ...
→ roots and their reciprocals: the rounding condition decision is not automated yet

Extension to other binary floating-point formats
→ square root in binary64: 171 cycles on ST231, versus 396 cycles with STlib

Extension to other architectures, typically FPGAs
→ the polynomial evaluation-based approach already seems to be a good alternative to multiplicative methods on FPGAs
→ the other techniques introduced in this thesis should be investigated further