Comparison between binary and decimal floating

advertisement
1
Comparison between binary and decimal
floating-point numbers
Nicolas Brisebarre, Christoph Lauter, Marc Mezzarobba, and Jean-Michel Muller
Abstract—We introduce an algorithm to compare a binary floating-point (FP) number and a decimal FP number, assuming the “binary
encoding” of the decimal formats is used, and with a special emphasis on the basic interchange formats specified by the IEEE 754-2008
standard for FP arithmetic. It is a two-step algorithm: a first pass, based on the exponents only, quickly eliminates most cases, then,
when the first pass does not suffice, a more accurate second pass is performed. We provide an implementation of several variants of our
algorithm, and compare them.
F
1
I NTRODUCTION
precision (bits)
𝑒min
𝑒max
binary32
binary64
24
−126
+127
53
−1022
+1023
binary128
113
The IEEE 754-2008 Standard for Floating-Point Arith−16382
metic [5] specifies binary (radix-2) and decimal (radix-10)
+16383
floating-point number formats for a variety of precisions.
The so-called “basic interchange formats” are presented
decimal64
decimal128
in Table 1.
precision (digits)
16
34
The Standard neither requires nor forbids comparisons
𝑒min
−383
−6143
between floating-point numbers of different radices (it
𝑒max
+384
+6144
states that floating-point data represented in different formats
shall be comparable as long as the operands’ formats have the
TABLE 1
same radix). However, such “mixed-radix” comparisons
The basic binary and decimal interchange formats
may offer several advantages. It is not infrequent to read
specified by the IEEE 754-2008 Standard.
decimal data from a database and to have to compare it
to some binary floating-point number. The comparison
may be inaccurate if the decimal number is preliminarily
converted to binary, or, respectively, if the binary number decimal followed by a decimal comparison. The compiler
is first converted to decimal.
emits no warning that the boolean result might not be
Consider for instance the following C code:
the expected one because of rounding.
This kind of strategy may lead to inconsistencies.
double x = ...;
Consider such a “naïve” approach built as follows:
_Decimal64 y = ...;
when comparing a binary floating-point number π‘₯2 of
if (x <= y)
format β„±2 , and a decimal floating-point number π‘₯10 of
...
The standardization of decimal floating-point arithmetic format β„±10 , we first convert π‘₯10 to the binary format β„±2
in C [6] is still at draft stage, and compilers supporting (that is, we replace it by the β„±2 number nearest π‘₯10 ), and
decimal floating-point arithmetic handle code sequences then we perform the comparison in binary. Denote the
<, β—‹
6 , β—‹,
> and β—‹.
>
such as the previous one at their discretion and often comparison operators so defined as β—‹
Consider
the
following
variables
(all
exactly
represented
in an unsatisfactory way. As it occurs, Intel’s icc 12.1.3
translates this sequence into a conversion from binary to in their respective formats):
55
βˆ™ π‘₯ = 3602879701896397/2 , declared as a binary64
number;
βˆ™ N. Brisebarre and J.-M. Muller are with CNRS, Laboratoire LIP, ENS
27
Lyon, INRIA, Université Claude Bernard Lyon 1, Lyon, France.
βˆ™ 𝑦 = 13421773/2 , declared as a binary32 number;
E-mail: nicolas.brisebarre@ens-lyon.fr, jean-michel.muller@ens-lyon.fr
βˆ™ 𝑧 = 1/10, declared as a decimal64 number.
βˆ™ C. Lauter is with Sorbonne Universités, UPMC Univ Paris 06, UMR
7606, LIP6, F-75005, Paris, France. E-mail: christoph.lauter@lip6.fr
Then it holds that π‘₯ β—‹
< 𝑦, but also 𝑦 β—‹
6 𝑧 and 𝑧 β—‹
6 π‘₯.
βˆ™ This work was initiated while M. Mezzarobba was with the Research
Such
an
inconsistent
result
might
for
instance
suffice
to
Institute for Symbolic Computation, Johannes Kepler University, Linz,
prevent
a
sorting
program
from
terminating.
Austria. M. Mezzarobba is now with CNRS, UMR 7606, LIP6, F-75005,
Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR 7606,
Remark that in principle, π‘₯2 and π‘₯10 could both be
LIP6, F-75005, Paris, France. E-mail: marc@mezzarobba.net
converted
to some longer format in such a way that
βˆ™ This work was partly supported by the TaMaDi project of the French
the
naïve
method
yields a correct answer. However, that
Agence Nationale de la Recherche, and by the Austria Science Fund
(FWF) grants P22748-N18 and Y464-N18.
method requires a correctly rounded radix conversion,
which is an intrinsically more expensive operation than
2
where 𝑀2 , 𝑀10 , 𝑒2 and 𝑒10 are integers that satisfy:
the comparison itself.
An experienced programer is likely to explicitly convert
𝑒min
− 𝑝2 + 1 6 𝑒2 6 𝑒max
,
2
2
all variables to a larger format. However we believe that
min
max
𝑒10 6 𝑒10 6 𝑒10 ,
ultimately the task of making sure that the comparisons
(1)
𝑝2 −1
are consistent and correctly performed should be left to
2
6 𝑀2 6 2𝑝2 − 1,
the compiler, and that the only fully consistent way of
1 6 𝑀10 6 10𝑝10 − 1
comparing two numeric variables π‘₯ and 𝑦 of a different
type is to effectively compare π‘₯ and 𝑦, and not to compare with 𝑝2 , 𝑝10 > 1. (The choice of lower bound for 𝑒2 makes
an approximation to π‘₯ to an approximation to 𝑦.
the condition 2𝑝2 −1 6 𝑀2 hold even for subnormal binary
In the following, we describe algorithms to perform numbers.)
such “exact” comparisons between binary numbers in
We assume that the so-called binary encoding (BID) [5],
any of the basic binary interchange formats and decimal [8] of IEEE 754-2008 is used for the decimal format1 , so
numbers in any of the basic decimal interchange formats. that the integer 𝑀10 is easily accessible in binary. Denote
Our algorithms were developed with a software imple- by
mentation in mind. They end up being pretty generic, but
𝑝′10 = ⌈𝑝10 log2 10⌉ ,
we make no claim as to their applicability to hardware
the number of bits that are necessary for representing
implementations.
It is natural to require that mixed-radix comparisons the decimal significands in binary.
When π‘₯2 and π‘₯10 have significantly different orders
not signal any floating-point exception, except in situations that parallel exceptional conditions specified in the of magnitude, examining their exponents will suffice
Standard [5, Section 5.11] for regular comparisons. For to compare them. Hence, we first perform a simple
this reason, we avoid floating-point operations that might exponent-based test (Section 3). When this does not sufsignal exceptions, and mostly use integer arithmetic to fice, a second step compares the significands multiplied
by suitable powers of 2 and 5. We compute “worst cases”
implement our comparisons.
that determine how accurate the second step needs to
Our method is based on two steps:
be (Section 4), and show how to implement it efficiently
βˆ™ Algorithm 1 tries to compare numbers just by
(Section 5). Section 6 is an aside describing a simpler
examining their floating-point exponents;
algorithm that can be used if we only want to decide
βˆ™ when Algorithm 1 is inconclusive, Algorithm 2
whether π‘₯2 and π‘₯10 are equal. Finally, Section 7 discusses
provides the answer.
our implementation of the comparison algorithms and
Algorithm 2 admits many variants and depends on a presents experimental results.
number of parameters. Tables 7 and 8 suggest suitable
parameters for most typical use cases.
Implementing the comparison for a given pair of 3 F IRST STEP : ELIMINATING THE “ SIMPLE
formats just requires the implementation of Algorithms CASES ” BY EXAMINING THE EXPONENTS
1 and 2. The analysis of the algorithms as a function of all
3.1 Normalization
parameters, as presented in Sections 4 to 6, is somewhat
As we have
technical. These details can be skipped at first reading.
1 6 𝑀10 6 10𝑝10 − 1,
This paper is an extended version of the article [2].
there exists a unique 𝜈 ∈ {0, 1, 2, . . . , 𝑝′10 − 1} such that
2
S ETTING
AND
O UTLINE
′
′
2𝑝10 −1 6 2𝜈 𝑀10 6 2𝑝10 − 1.
We consider a binary format of precision 𝑝2 , minimum Our initial problem of comparing π‘₯2 and π‘₯10 reduces to
exponent 𝑒min
and maximum exponent 𝑒max
, and a comparing 𝑀2 · 2𝑒2 −𝑝2 +1+𝜈 and (2𝜈 𝑀10 ) · 10𝑒10 −𝑝10 +1 .
2
2
decimal format of precision 𝑝10 , minimum exponent 𝑒min
The fact that we “normalize” the decimal significand
10
and maximum exponent 𝑒max
𝑀10 by a binary shift between two consecutive powers
10 . We want to compare a
binary floating-point number π‘₯2 and a decimal floating- of two is of course questionable; 𝑀10 could also be
point number π‘₯10 .
normalized into the range 10𝑝10 −1 6 10𝑑 · 𝑀10 6 10𝑝10 − 1.
Without loss of generality we assume π‘₯2 > 0 and However, hardware support for this operation [11] is not
π‘₯10 > 0 (when π‘₯2 and π‘₯10 are negative, the problem widespread. A decimal normalization would thus require
reduces to comparing −π‘₯2 and −π‘₯10 , and when they a loop, provoking pipeline stalls, whereas the proposed
have different signs the comparison is straightforward). binary normalization can exploit an existing hardware
The floating-point representations of π‘₯2 and π‘₯10 are
leading-zero counter with straight-line code.
π‘₯2 = 𝑀2 · 2𝑒2 −𝑝2 +1 ,
π‘₯10 = 𝑀10 · 10𝑒10 −𝑝10 +1 ,
1. This encoding is typically used in software implementations of
decimal floating-point arithmetic, even if BID-based hardware designs
have been proposed [11].
3
b32/d64
b32/d128
b64/d64
b64/d128
b128/d64
b128/d128
24
16
54
29
24
34
113
88
53
16
54
0
53
34
113
59
113
16
54
−60
113
34
113
−1
−570, 526
18
int32
−6371, 6304
23
int64
−1495, 1422
19
int32
−7296, 7200
23
int64
−16915, 16782
27
int64
−22716, 22560
27
int64
𝑝2
𝑝10
𝑝′10
𝑀
(1)
(1)
β„Žmin , β„Žmax
𝑠min
datatype
TABLE 2
The various parameters involved in Step 1 of the comparison algorithm.
3.2
Comparing the Exponents
That range is given in Table 2 for the basic IEEE formats.
Knowing that range, it is easy to implement πœ™ as follows.
Define
⎧
π‘š = 𝑀2 ,
βŽͺ
βŽͺ
βŽͺ
βŽͺ
′
βŽͺ
βŽͺ
⎨ β„Ž = 𝑒2 − 𝑒10 + 𝜈 + 𝑝10 − 𝑝10 + 1,
𝑛 = 𝑀10 · 2𝜈 ,
βŽͺ
βŽͺ
βŽͺ
𝑔 = 𝑒10 − 𝑝10 + 1,
βŽͺ
βŽͺ
βŽͺ
⎩
𝑀 = 𝑝′10 − 𝑝2 − 1
so that
{οΈƒ
𝜈
(2)
2 π‘₯2 = π‘š · 2
,
𝑔
Our comparison problem becomes:
Compare π‘š · 2β„Ž+𝑀 with 𝑛 · 5𝑔 .
We have
π‘šmin = 2𝑝2 −1 6 π‘š 6 π‘šmax = 2𝑝2 − 1,
′
′
𝑛min = 2𝑝10 −1 6 𝑛 6 𝑛max = 2𝑝10 − 1.
(3)
It is clear that π‘šmin · 2β„Ž+𝑀 > 𝑛max · 5𝑔 implies π‘₯2 > π‘₯10 ,
while π‘šmax · 2β„Ž+𝑀 < 𝑛min · 5𝑔 implies π‘₯2 < π‘₯10 . This gives
−𝑝′10
𝑔
β„Ž−2
)·5 <2
⇒
π‘₯10 < π‘₯2 ,
−𝑝2
β„Ž
𝑔
⇒
π‘₯2 < π‘₯10 .
(1 − 2
)·2 <5
(4)
In order to compare π‘₯2 and π‘₯10 based on these
implications, define
πœ™(β„Ž) = ⌊β„Ž · log5 2⌋.
Proposition 1. We have
𝑔 < πœ™(β„Ž) ⇒ π‘₯2 > π‘₯10 ,
𝑔 > πœ™(β„Ž) ⇒ π‘₯2 < π‘₯10 .
Proof: If 𝑔 < πœ™(β„Ž) then 𝑔 6 πœ™(β„Ž) − 1, hence 𝑔 6
β„Ž log5 2 − 1. This implies that 5𝑔 6 (1/5) · 2β„Ž < 2β„Ž /(4 ·
′
(1 − 2−𝑝10 )), therefore, from (4), π‘₯10 < π‘₯2 . If 𝑔 > πœ™(β„Ž)
then 𝑔 > πœ™(β„Ž) + 1, hence 𝑔 > β„Ž log5 2, so that 5𝑔 > 2β„Ž >
(1 − 2−𝑝2 ) · 2β„Ž . This implies, from (4), π‘₯2 < π‘₯10 .
Now consider the range of β„Ž: by (2), β„Ž lies between
(1)
′
β„Žmin = (𝑒min
− 𝑝2 + 1) − 𝑒max
2
10 + 𝑝10 − 𝑝10 + 1
and
(1)
β„Žmax = 𝑒max
− 𝑒min
2
10 + 𝑝10 .
with 𝐿 = ⌊2𝑠 log5 2⌉,
(1)
2 π‘₯10 = 𝑛 · 10 .
(1 − 2
πœ™(β„Ž)
^
= ⌊𝐿 · β„Ž · 2−𝑠 ⌋,
(1)
satisfies πœ™(β„Ž) = πœ™(β„Ž)
^
for all β„Ž in the range [β„Žmin , β„Žmax ].
β„Ž+𝑔+𝑀
𝜈
Proposition 2. Denote by ⌊·⌉ the nearest integer function.
For large enough 𝑠 ∈ N, the function defined by
Proposition 2 is an immediate consequence of the
irrationality of log5 2. For known, moderate values of
(1)
(1)
β„Žmin and β„Žmax , the optimal choice 𝑠min of 𝑠 is easy to find
and small. For instance, if the binary format is binary64
and the decimal format is decimal64, then 𝑠min = 19.
Table 2 gives the value of 𝑠min for the basic IEEE
formats. The product 𝐿 · β„Ž, for β„Ž in the indicated range
and 𝑠 = 𝑠min , can be computed exactly in (signed or
unsigned) integer arithmetic, with the indicated data type.
Computing ⌊πœ‰ · 2−𝛽 ⌋ of course reduces to a right-shift by
𝛽 bits.
Propositions 1 and 2 yield to the following algorithm.
Algorithm 1. First, exponent-based step
1 compute β„Ž = 𝑒2 − 𝑒10 + 𝜈 + 𝑝10 − 𝑝′10 + 1 and 𝑔 =
𝑒10 − 𝑝10 + 1;
2 with the appropriate value of 𝑠, compute πœ™(β„Ž) =
⌊𝐿 · β„Ž · 2−𝑠 ⌋ using integer arithmetic;
3 if 𝑔 < πœ™(β„Ž) then
4 return “π‘₯2 > π‘₯10 ”
5 else if 𝑔 > πœ™(β„Ž) then
6 return “π‘₯2 < π‘₯10 ”
7 else (𝑔 = πœ™(β„Ž))
8 first step is inconclusive (perform the second
step).
Note that, when π‘₯10 admits multiple distinct representations in the precision-𝑝10 decimal format (i.e., when its
cohort is non-trivial [5]), the success of the first step may
depend on the specific representation passed as input. For
instance, assume that the binary and decimal formats are
binary64 and decimal64, respectively. Both 𝐴 = {𝑀10 =
1015 , 𝑒10 = 0} and 𝐡 = {𝑀10 = 1, 𝑒10 = 15} are valid
representations of the integer 1. Assume we are trying
to compare π‘₯10 = 1 to π‘₯2 = 2. Using representation 𝐴,
we have 𝜈 = 4, β„Ž = −32, and πœ™(β„Ž) = −14 > 𝑔 = −15,
hence the test from Algorithm 1 shows that π‘₯10 < π‘₯2 . In
4
contrast, if π‘₯10 is given in the form 𝐡, we get 𝜈 = 53,
πœ™(β„Ž) = πœ™(2) = 0 = 𝑔, and the test is inconclusive.
3.3
We obtain
𝑁norm + 𝑁sub
6 min
#(𝑋2 × π‘‹10 )
How Often is the First Step Enough?
{οΈ‚
1 4
,
π‘Ÿ10 π‘Ÿ2
}οΈ‚
.
We may quantify the quality of this first filter as follows.
We say that Algorithm 1 succeeds if it answers “π‘₯2 > π‘₯10 ”
or “π‘₯2 < π‘₯10 ” without proceeding to the second step,
and fails otherwise. Let 𝑋2 and 𝑋10 denote the sets of
representations of positive, finite numbers in the binary,
resp. decimal formats of interest. (In the case of 𝑋2 , each
number has a single representation.) Assuming zeros,
infinities, and NaNs have been handled before, the input
of Algorithm 1 may be thought of as a pair (πœ‰2 , πœ‰10 ) ∈
𝑋2 × π‘‹10 .
These are rough estimates. One way to get tighter
bounds for specific formats is simply to count, for each
value of 𝜈, the pairs (𝑒2 , 𝑒10 ) such that πœ™(β„Ž) = 𝑔. For
instance, in the case of comparison between binary64
and decimal64 floating-point numbers, one can check
that the failure rate in the sense of Proposition 3 is less
than 0.1%.
As a matter of course, pairs (πœ‰2 , πœ‰10 ) will almost
never be equidistributed in practice. Hence the previous
estimate should not be interpreted as a probability of
Proposition 3. The proportion of input pairs (πœ‰2 , πœ‰10 ) ∈ success of Step 1. It seems more realistic to assume that
𝑋2 × π‘‹10 for which Algorithm 1 fails is bounded by
a well-written numerical algorithm will mostly perform
{οΈ‚
}οΈ‚
comparisons between numbers which are suspected to be
1
4
min
, max
,
max
min
min
close to each other. For instance, in an iterative algorithm
𝑒10 − 𝑒10 + 1 𝑒2 − 𝑒2 + 1
where comparisons are used as part of a convergence test,
assuming 𝑝2 > 3.
it is to be expected that most comparisons need to proceed
Proof: The first step fails if and only if πœ™(β„Ž) = 𝑔. to the second step. Conversely, there are scenarios, e.g.,
Write β„Ž = 𝑒2 + 𝑑, that is, 𝑑 = 𝑝10 − 𝑝′10 + 1 − 𝑒10 + 𝜈. Then checking for out-of-range data, where the first step should
be enough.
πœ™(β„Ž) = 𝑔 rewrites as
⌊(𝑒2 + 𝑑) log5 2⌋ = 𝑔,
4
which implies
−𝑑 + 𝑔 log2 5 6 𝑒2 < −𝑑 + (𝑔 + 1) log2 5
< −𝑑 + 𝑔 log2 5 + 2.4.
The value of 𝜈 is determined by 𝑀10 , so that 𝑑 and
𝑔 = 𝑒10 − 𝑝10 + 1 depend on πœ‰10 only. Thus, for any
given πœ‰10 ∈ 𝑋10 , there can be at most 3 values of 𝑒2 for
which πœ™(β„Ž) = 𝑔.
A similar argument shows that for given 𝑒2 and 𝑀10 ,
there exist at most one value of 𝑒10 such that πœ™(β„Ž) = 𝑔.
Let 𝑋2norm and 𝑋2sub be the subsets of 𝑋2 consisting of
normal and subnormal numbers respectively. Let π‘Ÿπ‘– =
𝑒max
− 𝑒min
+ 1. The number 𝑁norm of pairs (πœ‰2 , πœ‰10 ) ∈
𝑖
𝑖
norm
𝑋2
× π‘‹10 such that πœ™(β„Ž) = 𝑔 satisfies
𝑁norm 6 #{𝑀10 : πœ‰10 ∈ 𝑋10 } · #{𝑀2 : πœ‰2 ∈
𝑋2norm }
· #{(𝑒2 , 𝑒10 ) : πœ‰2 ∈ 𝑋2norm ∧ πœ™(β„Ž) = 𝑔}
6 (10𝑝10 − 1) · 2𝑝2 −1 · min{π‘Ÿ2 , 3π‘Ÿ10 }.
For subnormal π‘₯2 , we have the bound
𝑁sub := #{(πœ‰2 , πœ‰10 ) ∈ 𝑋2sub × π‘‹10 : πœ™(β„Ž) = 𝑔}
6 #{𝑀10 } · #𝑋2sub
= (10𝑝10 − 1) · (2𝑝2 −1 − 1).
The total number of elements of 𝑋2 × π‘‹10 for which
the first step fails is bounded by 𝑁norm + 𝑁sub . This is to
be compared with
#𝑋2 = π‘Ÿ2 · 2𝑝2 −1 + (𝑝2 − 1) · (2𝑝2 −1 − 1)
> (π‘Ÿ2 + 1) · 2𝑝2 −1 ,
#𝑋10 = π‘Ÿ10 · (10𝑝10 − 1).
STEP : A CLOSER LOOK AT THE
SIGNIFICANDS
4.1
S ECOND
Problem Statement
In the following we assume that 𝑔 = πœ™(β„Ž), i.e.,
𝑒10 − 𝑝10 + 1 = ⌊(𝑒2 − 𝑒10 + 𝜈 + 𝑝10 − 𝑝′10 + 1) log5 (2)⌋ .
(Otherwise, the first step already allowed us to compare
π‘₯2 and π‘₯10 .)
Define a function
𝑓 (β„Ž) =
We have
5πœ™(β„Ž)
.
2β„Ž+𝑀
⎧
⎨ 𝑓 (β„Ž) · 𝑛 > π‘š ⇒ π‘₯10 > π‘₯2 ,
𝑓 (β„Ž) · 𝑛 < π‘š ⇒ π‘₯10 < π‘₯2 ,
⎩
𝑓 (β„Ž) · 𝑛 = π‘š ⇒ π‘₯10 = π‘₯2 .
(5)
The second test consists in performing this comparison,
with 𝑓 (β„Ž) · 𝑛 replaced by an accurate enough approximation.
In order to ensure that an approximate test is indeed
equivalent to (5), we need a lower bound πœ‚ on the
minimum nonzero value of
βƒ’ πœ™(β„Ž)
βƒ’
βƒ’5
π‘š βƒ’βƒ’
βƒ’
π‘‘β„Ž (π‘š, 𝑛) = βƒ’ β„Ž+𝑀 − βƒ’
(6)
2
𝑛
that may appear at this stage. We want πœ‚ to be as tight
as possible in order to avoid unduly costly computations when approximating 𝑓 (β„Ž) · 𝑛. The search space is
constrained by the following observations.
5
Proposition 4. Let 𝑔 and β„Ž be defined by Equation (2). The
equality 𝑔 = πœ™(β„Ž) implies
𝑒min
− 𝑝2 − 𝑝′10 + 3
𝑒max + 2
2
6β„Ž< 2
.
1 + log5 2
1 + log5 2
(7)
Proof: From (2), we get
𝑒2 = β„Ž + 𝑔 + 𝑝′10 − 𝜈 − 2.
Since 𝑒min
− 𝑝2 + 1 6 𝑒2 6 𝑒max
and as we assumed
2
2
𝑔 = πœ™(β„Ž), it follows that
𝑒min
− 𝑝2 + 1 6 β„Ž + πœ™(β„Ž) − 𝜈 + 𝑝′10 − 2 6 𝑒max
.
2
2
Therefore we have
𝑒min
− 𝑝2 + 𝜈 − 𝑝′10 + 3 6 (1 + log5 2)β„Ž < 𝑒max
+ 𝜈 − 𝑝′10 + 3,
2
2
and the bounds (7) follow because 0 6 𝜈 6 𝑝′10 − 1.
The binary normalization of 𝑀10 , yielding 𝑛, implies
that 2𝜈 | 𝑛. If 𝑛 > 10𝑝10 , then, by (1) and (2), it follows
that 𝜈 > 1 and 𝑛 is even. As β„Ž + πœ™(β„Ž) − 𝜈 6 𝑒max
− 𝑝′10 + 2,
2
′
we also have 𝜈 > 𝜈 . Since πœ™ is increasing, there exists β„Ž0
such that 𝜈 ′ > 0 for β„Ž > β„Ž0 .
Table 3 gives the range (7) for β„Ž with respect to the
comparisons between basic IEEE formats.
Required Worst-Case Accuracy
Let us now deal with the problem of computing πœ‚,
considering π‘‘β„Ž (π‘š, 𝑛) under the constraints given by
Equation (3) and Proposition 4. A similar problem was
considered by Cornea et al. [3] in the context of correctly
rounded binary to decimal conversions. Their techniques
yield worst and bad cases for the approximation problem
we consider. We will take advantage of them in Section 7
to test our algorithms on many cases when π‘₯2 and π‘₯10 are
very close. Here, we favor a different approach that is less
computationally intensive and mathematically simpler,
making the results easier to check either manually or
with a proof-assistant.
Problem 1. Find the smallest nonzero value of
βƒ’ πœ™(β„Ž)
βƒ’
βƒ’5
π‘šβƒ’
π‘‘β„Ž (π‘š, 𝑛) = βƒ’βƒ’ β„Ž+𝑀 − βƒ’βƒ’
2
𝑛
subject to the constraints
⎧ 𝑝 −1
2
6 π‘š 6 2𝑝2 − 1,
βŽͺ
βŽͺ 2
βŽͺ
βŽͺ
′
′
βŽͺ
βŽͺ 2𝑝10 −1 6 𝑛 6 2𝑝10 − 1,
βŽͺ
⎨
(2)
(2)
β„Žmin 6 β„Ž 6 β„Žmax ,
βŽͺ
βŽͺ
βŽͺ
βŽͺ
𝑛 is even if 𝑛 > 10𝑝10 ,
βŽͺ
βŽͺ
βŽͺ
′
⎩
if β„Ž > β„Ž0 , then 2𝜈 | 𝑛
⌉οΈ‚
𝑒min
− 𝑝2 − 𝑝′10 + 3
2
,
1 + log5 2
⌊οΈ‚ max
⌋οΈ‚
𝑒2 + 2
(2)
β„Žmax =
,
1 + log 2
⌈οΈ‚ max 5 ′
⌉οΈ‚
𝑒2 − 𝑝10 + 3
β„Ž0 =
,
1 + log5 2
𝜈 ′ = β„Ž + πœ™(β„Ž) − 𝑒max
+ 𝑝′10 − 2.
2
(2)
⌈οΈ‚
β„Žmin =
Additionally, 𝑛 satisfies the following properties:
1) if 𝑛 > 10𝑝10 , then 𝑛 is even;
+ 𝑝′10 − 2 is nonnegative (which
2) if 𝜈 ′ = β„Ž + πœ™(β„Ž) − 𝑒max
2
′
holds for large enough β„Ž), then 2𝜈 divides 𝑛.
4.2
where
We recall how such a problem can be solved using the
classical theory of continued fractions [4], [7], [9].
Given 𝛼 ∈ Q, build two finite sequences (π‘Žπ‘– )06𝑖6𝑛 and
(π‘Ÿπ‘– )06𝑖6𝑛 by setting π‘Ÿ0 = 𝛼 and
{οΈƒ
π‘Žπ‘– = ⌊π‘Ÿπ‘– ⌋ ,
π‘Ÿπ‘–+1 = 1/(π‘Ÿπ‘– − π‘Žπ‘– ) if π‘Žπ‘– ΜΈ= π‘Ÿπ‘– .
For all 0 6 𝑖 6 𝑛, the rational number
𝑝𝑖
1
= π‘Ž0 +
1
π‘žπ‘–
π‘Ž1 +
..
1
.+
π‘Žπ‘–
is called the 𝑖th convergent of the continued fraction
expansion of 𝛼. Observe that the process defining the π‘Žπ‘– essentially is the Euclidean algorithm. If we assume 𝑝−1 = 1,
π‘ž−1 = 0, 𝑝0 = π‘Ž0 , π‘ž0 = 1, we have 𝑝𝑖+1 = π‘Žπ‘–+1 𝑝𝑖 + 𝑝𝑖−1 ,
π‘žπ‘–+1 = π‘Žπ‘–+1 π‘žπ‘– + π‘žπ‘–−1 , and 𝛼 = 𝑝𝑛 /π‘žπ‘› .
It is a classical result [7, Theorem 19] that any rational
number π‘š/𝑛 such that
βƒ’
1
π‘š βƒ’βƒ’
βƒ’
⃒𝛼 − βƒ’ < 2
𝑛
2𝑛
is a convergent of the continued fraction expansion of 𝛼.
For basic IEEE formats, it turns out that this classical
result is enough to solve Problem 1, as cases when it is
not precise enough can be handled in an ad hoc way.
First consider the case max(0, −𝑀) 6 β„Ž 6 𝑝2 . We can
then write
|5πœ™(β„Ž) 𝑛 − 2β„Ž+𝑀 π‘š|
2β„Ž+𝑀 𝑛
where the numerator and denominator are both integers.
If π‘‘β„Ž (π‘š, 𝑛) ΜΈ= 0, we have |5πœ™(β„Ž) 𝑛 − π‘š2β„Ž+𝑀 | > 1, hence
′
′
π‘‘β„Ž (π‘š, 𝑛) > 1/(2β„Ž+𝑀 𝑛) > 2𝑝2 −β„Ž−2𝑝10 +1 > 2−2𝑝10 +1 . For
the range of β„Ž where π‘‘β„Ž (π‘š, 𝑛) might be zero, the non-zero
′
minimum is thus no less than 2−2𝑝10 +1 .
(2)
If now 𝑝2 + 1 6 β„Ž 6 β„Žmax , then β„Ž + 𝑀 > 𝑝′10 , so 5πœ™(β„Ž)
′
β„Ž+𝑀
and 2
are integers, and π‘š2β„Ž+𝑀 is divisible by 2𝑝10 . As
′
𝑛 6 2𝑝10 − 1, we have π‘‘β„Ž (π‘š, 𝑛) ΜΈ= 0. To start with, assume
′
that π‘‘β„Ž (π‘š, 𝑛) 6 2−2𝑝10 −1 . Then, the integers π‘š and 𝑛 6
′
2𝑝10 − 1 satisfy
βƒ’ πœ™(β„Ž)
βƒ’
βƒ’5
′
π‘š βƒ’βƒ’
1
βƒ’
−
6 2−2𝑝10 −1 < 2 ,
βƒ’ 2β„Ž+𝑀
𝑛⃒
2𝑛
π‘‘β„Ž (π‘š, 𝑛) =
and hence π‘š/𝑛 is a convergent of the continued fraction
expansion of 𝛼 = 5πœ™(β„Ž) /2β„Ž+𝑀 .
We compute the convergents with denominators less
′
than 2𝑝10 and take the minimum of the |𝛼 − 𝑝𝑖 /π‘žπ‘– | for
6
(2)
(2)
β„Žmin , β„Žmax
(2)
(2)
𝑔min , 𝑔max
#𝑔
β„Ž0
b32/d64
b32/d128
b64/d64
b64/d128
b128/d64
b128/d128
−140, 90
−61, 38
100
54
−181, 90
−78, 38
117
12
−787, 716
−339, 308
648
680
−828, 716
−357, 308
666
639
−11565, 11452
−4981, 4932
9914
11416
−11606, 11452
−4999, 4932
9932
11375
TABLE 3
Range of β„Ž for which we may have 𝑔 = πœ™(β„Ž), and size of the table used in the “direct method” of Section 5.1.
which there exists π‘˜ ∈ N such that π‘š = π‘˜π‘π‘– and 𝑛 = π‘˜π‘žπ‘– be stated as follows. Recall that πœ‚ denotes a (tight) lower
satisfy the constraints of Problem 1. Notice that this strat- bound on (6).
egy only yields the actual minimum nonzero of π‘‘β„Ž (π‘š, 𝑛)
Proposition 6. Assume that πœ‡ approximates 𝑓 (β„Ž) · 𝑛 with
if we eventually find, for some β„Ž, a convergent with
′
′
relative accuracy πœ’ < πœ‚/2−𝑀+2 or better, that is,
|𝛼 − 𝑝𝑖 /π‘žπ‘– | < 2−2𝑝10 −1 . Otherwise, taking πœ‚ = 2−2𝑝10 −1
πœ‚
yields a valid (yet not tight) lower bound.
(8)
|𝑓 (β„Ž) · 𝑛 − πœ‡| 6 πœ’π‘“ (β„Ž) · 𝑛 < −𝑀+2 𝑓 (β„Ž) · 𝑛.
(2)
2
The case β„Žmin 6 β„Ž 6 min(0, −𝑀) is similar. A numerator
of π‘‘β„Ž (π‘š, 𝑛) then is |2−β„Ž−𝑀 𝑛−5−πœ™(β„Ž) π‘š| and a denominator The following implications hold:
′
is 5−πœ™(β„Ž) 𝑛. We notice that 5−πœ™(β„Ž) > 2𝑝10 if and only
⎧
𝑝2 +1
if β„Ž 6 −𝑝′10 . If −𝑝′10 + 1 6 β„Ž 6 min(0, −𝑀) and
=⇒ π‘₯10 > π‘₯2 ,
βŽͺ
⎨ πœ‡>π‘š+πœ’·2
−β„Ž
−πœ™(β„Ž)
π‘‘β„Ž (π‘š, 𝑛) ΜΈ= 0, we have |2 𝑛 − 5
π‘š| > 1, hence
𝑝2 +1
(9)
πœ‡<π‘š−πœ’·2
=⇒ π‘₯10 < π‘₯2 ,
′
(2)
βŽͺ
π‘‘β„Ž (π‘š, 𝑛) > 1/(5−πœ™(β„Ž) 𝑛) > 2−2𝑝10 . For β„Žmin 6 β„Ž 6 −𝑝′10 ,
⎩
𝑝2 +1
|π‘š − πœ‡| 6 πœ’ · 2
=⇒ π‘₯10 = π‘₯2 .
we use the same continued fraction tools as above.
′
There remains the case min(0, −𝑀) 6 β„Ž 6 max(0, −𝑀).
Proof: First notice that 𝑓 (β„Ž) 6 2−𝑀 = 2𝑝2 −𝑝10 +1 for
If 0 6 β„Ž 6 −𝑀, a numerator of π‘‘β„Ž (π‘š, 𝑛) is |5πœ™(β„Ž) 2−β„Ž−𝑀 𝑛−
′
all β„Ž. Since 𝑛 < 2𝑝10 , condition (8) implies |πœ‡ − 𝑓 (β„Ž) · 𝑛| <
π‘š| and a denominator is 𝑛. Hence, if π‘‘β„Ž (π‘š, 𝑛) ΜΈ= 0, we
′
2𝑝2 +1 πœ’, and hence |𝑓 (β„Ž) · 𝑛 − π‘š| > |πœ‡ − π‘š| − 2𝑝2 +1 πœ’.
have π‘‘β„Ž (π‘š, 𝑛) > 1/𝑛 > 2𝑝10 . If −𝑀 6 β„Ž 6 0, a numerator
Now consider each of the possible cases that appear
of π‘‘β„Ž (π‘š, 𝑛) is |𝑛 − 5−πœ™(β„Ž) 2β„Ž+𝑀 π‘š| and a denominator is
in
(9). If |πœ‡ − π‘š| > 2𝑝2 +1 πœ’, then 𝑓 (β„Ž) · 𝑛 − π‘š and πœ‡ − π‘š
−πœ™(β„Ž) β„Ž+𝑀
−β„Ž β„Ž+𝑀
2𝑝′10 −𝑝2 −1
5
2
𝑛 6 5·2 2
𝑛 < 5·2
. It follows
′
that if π‘‘β„Ž (π‘š, 𝑛) ΜΈ= 0, we have π‘‘β„Ž (π‘š, 𝑛) > 1/5 · 2−2𝑝10 +𝑝2 +1 . have the same sign. The first two implications from (5)
then translate into the corresponding cases from (9).
Proposition 5. The minimum nonzero values of π‘‘β„Ž (π‘š, 𝑛)
If finally |πœ‡ − π‘š| 6 2𝑝2 +1 πœ’, then the triangle inequality
under the constraints from Problem 1 for the basic IEEE yields |𝑓 (β„Ž) · 𝑛 − π‘š| 6 2𝑝2 +2 πœ’, which implies |𝑓 (β„Ž) −
formats are attained for the parameters given in Table 4.
π‘š/𝑛| < 2𝑝2 +2 πœ’/𝑛 < 2𝑝2 +𝑀 πœ‚/𝑛 6 πœ‚. By definition of πœ‚,
Proof: This follows from applying the algorithm this cannot happen unless π‘š/𝑛 = 𝑓 (β„Ž). This accounts for
outlined above to the parameters associated with each the last case.
For instance, in the case of binary64–decimal64 combasic IEEE format. Scripts to verify the results are
parisons,
it is more than enough to compute the prodprovided along with our example implementation of the
uct
𝑓
(β„Ž)
·
𝑛
in 128-bit arithmetic.
comparison algorithm (see Section 7).
Proposition 4 gives the number of values of β„Ž to consider in the table of 𝑓 (β„Ž), which grows quite large (23.5 kB
5 I NEQUALITY TESTING
for b64/d64 comparisons, 721 kB in the b128/d128 case,
As already mentioned, the first step of the full comparison assuming a memory alignment on 8-byte boundaries).
algorithm is straightforwardly implementable in 32-bit However, it could possibly be shared with other algoor 64-bit integer arithmetic. The second step reduces to rithms such as binary-decimal conversions. Additionally,
comparing π‘š and 𝑓 (β„Ž) · 𝑛, and we have seen in Section 4 β„Ž is available early in the algorithm flow, so that the
that it is enough to know 𝑓 (β„Ž) with a relative accuracy memory latency for the table access can be hidden behind
given by the last column of Table 4. We now discuss the first step.
The number of elements in the table can be reduced
several ways to perform that comparison. The main
at
the price of a few additional operations. Several
difficulty is to efficiently evaluate 𝑓 (β„Ž)·π‘› with just enough
consecutive
β„Ž map to a same value 𝑔 = πœ™(β„Ž). The
accuracy.
corresponding values of 𝑓 (β„Ž) = 2πœ™(β„Ž)·log2 5−β„Ž are binary
shifts of each other. Hence, we may store the most
5.1 Direct Method
significant bits of 𝑓 (β„Ž) as a function of 𝑔, and shift the
A direct implementation of Equation (5) just replaces 𝑓 (β„Ž) value read off the table by an appropriate amount to
with a sufficiently accurate approximation, read in a table recover 𝑓 (β„Ž). The table size goes down by a factor of
indexed with β„Ž. The actual accuracy requirements can about log2 (5) ≈ 2.3.
7
b32/d64
b32/d128
b64/d64
b64/d128
b128/d64
b128/d128
β„Ž
π‘š
𝑛
50
−159
−275
−818
2546
10378
10888194
11386091
4988915232824583
5148744585188163
7116022508838657793249305056613439
7977485665655127446147737154136553
13802425659501406
8169119658476861812680212016502305
12364820988483254
9254355313724266263661769079234135
13857400902051554
9844227914381600512882010261817769
log2 (πœ‚ −1 )
111.40
229.57
113.68
233.58
126.77
237.14
TABLE 4
Worst cases for π‘‘β„Ž (π‘š, 𝑛) and corresponding lower bounds πœ‚.
More precisely, define 𝐹 (𝑔) = 5𝑔 2−πœ“(𝑔) , where
πœ“(𝑔) = ⌊𝑔 · log2 5⌋.
(10)
We then have 𝑓 (β„Ž) = 𝐹 (πœ™(β„Ž)) · 2−𝜌(β„Ž) with 𝜌(β„Ž) = β„Ž −
πœ“(πœ™(β„Ž)). The computation of πœ“ and 𝜌 is similar to that
of πœ™ in Step 1. In addition, we may check that
1 6 𝐹 (πœ™(β„Ž)) < 2
and 0 6 𝜌(β„Ž) 6 3
for all β„Ž. Since 𝜌(β„Ž) is nonnegative, multiplication by
2𝜌(β„Ž) can be implemented with a bitshift, not requiring
branches.
We did not implement this size reduction: instead, we
concentrated on another size reduction opportunity that
we shall describe now.
(the order of the bounds depends on πœ–). The integer 5π‘Ÿ fits
on πœ“(π‘Ÿ) + 1 bits, where πœ“ is the function defined by (10).
Instead of tabulating 𝑓 (β„Ž) directly, we will use two
tables: one containing the most significant bits of 5π›Ύπ‘ž
roughly to the precision dictated by the worst cases
computed in Section 4, and the other containing the exact
value 5π‘Ÿ for 0 6 π‘Ÿ 6 𝛾 − 1. We denote by πœ†1 and πœ†2 the
respective precisions (entry sizes in bits) of these tables.
Based on these ranges and precisions, we assume
πœ†1 > log2 (πœ‚ −1 ) − 𝑀 + 3,
(12)
πœ†2 > πœ“(𝛾 − 1) + 1.
(13)
In addition, it is natural to suppose πœ†1 larger than or close
to πœ†2 : otherwise, an algorithm using two approximate
tables will likely be more efficient.
5.2 Bipartite Table
In practice, πœ†2 will typically be one of the available
That second table size reduction method uses a bipartite
machine word widths, while, for reasons that should
table. Additionally, it takes advantage of the fact that,
become clear later, πœ†1 will be chosen slightly smaller
for small nonnegative 𝑔, the exact value of 5𝑔 takes less
than a word size. For each pair of basic IEEE formats,
space than an accurate enough approximation of 𝑓 (β„Ž)
Tables 7 and 8 provide values of πœ–, 𝛾, πœ†1 , πœ†2 (and other
𝑔
(for instance, 5 fits on 64 bits for 𝑔 6 27). In the case of a
parameters whose definition follows later) suitable for
typical implementation of comparison between binary64
typical processors. The suggested values were chosen
and decimal64 numbers, the bipartite table uses only
by optimizing either the table size or the number of
800 bytes of storage.
elementary multiplications in the algorithm, with or
Most of the overhead in arithmetic operations for
without the constraint that 𝛾 be a power of two.
the bipartite table method can be hidden on current
It will prove convenient to store the table entries leftprocessors through increased instruction-level parallelism.
aligned, in a fashion similar to floating-point significands.
The reduction in table size also helps decreasing the
Thus, we set
probability of cache misses.
Recall that we suppose 𝑔 = πœ™(β„Ž), so that β„Ž lies
πœƒ1 (π‘ž) = 5π›Ύπ‘ž ·2πœ†1 −1−πœ“(π›Ύπ‘ž) ,
(2)
(2)
between bounds β„Žmin , β„Žmax deduced from Proposition 4.
πœƒ2 (π‘Ÿ) = 5π‘Ÿ ·2πœ†2 −1−πœ“(π‘Ÿ) ,
(2)
(2)
Corresponding bounds 𝑔min , 𝑔max on 𝑔 follow since πœ™ is
where the power-of-two factors provide for the desired
nondecreasing.
For some integer “splitting factor” 𝛾 > 0 and πœ– = ±1, alignment. One can check that
write
⌈οΈ‚ ⌉οΈ‚
2πœ†1 −1 6 πœƒ1 (π‘ž) < 2πœ†1 , 2πœ†2 −1 6 πœƒ2 (π‘Ÿ) < 2πœ†2
(14)
πœ–π‘”
𝑔 = πœ™(β„Ž) = πœ–(π›Ύπ‘ž − π‘Ÿ),
π‘ž=
,
(11)
𝛾
for all β„Ž.
The value 𝑓 (β„Ž) now decomposes as
so that
(οΈ‚ π›Ύπ‘ž )οΈ‚πœ–
(οΈ‚
)οΈ‚πœ–
5
−β„Ž−𝑀
𝑓 (β„Ž) =
·2
.
5𝑔
πœƒ1 (π‘ž)
𝑓 (β„Ž) = β„Ž+𝑀 =
2πœ–·(πœ†2 −πœ†1 )−𝜎(β„Ž)−𝑀
(15)
5π‘Ÿ
2
πœƒ2 (π‘Ÿ)
Typically, 𝛾 will be chosen as a power of 2, but other
(οΈ€
)οΈ€
values can occasionally be useful. With these definitions, where
𝜎(β„Ž) = β„Ž − πœ– · πœ“(π›Ύπ‘ž) − πœ“(π‘Ÿ) .
(16)
we have 0 6 π‘Ÿ 6 𝛾 − 1, and π‘ž lies between
⌈οΈƒ
⌉οΈƒ
⌈οΈƒ
⌉οΈƒ
(2)
(2)
Equations (10) and (16) imply that
𝑔
𝑔
πœ– min
and
πœ– max
𝛾
𝛾
β„Ž − 𝑔 log2 5 − 1 < 𝜎(β„Ž) < β„Ž − 𝑔 log2 5 + 1.
8
As we also have β„Ž log5 2 − 1 6 𝑔 = πœ™(β„Ž) < β„Ž log5 2 by
definition of πœ™, it follows that
Yet additional properties hold that allow for a more
efficient final decision step.
(17)
Proposition 8. Assume 𝑔 = πœ™(β„Ž), and let ΔΜƒ be the signed
integer defined above. For 𝜏 > 0, the following equivalences
hold:
0 6 𝜎(β„Ž) 6 3.
Now let 𝜏 > −1 be an integer serving as an adjustment
parameter to be chosen later. In practice, we will usually2
choose 𝜏 = 0 and sometimes 𝜏 = −1 (see Tables 7 and 8).
For most purposes the reader may simply assume 𝜏 = 0.
Define
⎧
′
βŽͺ
Δ = πœƒ1 (π‘ž) · 𝑛 · 2𝜏 −𝑝10
βŽͺ
βŽͺ
Furthermore,
βŽͺ
⎨
− πœƒ (π‘Ÿ) · π‘š · 2𝜏 +πœ†1 −πœ†2 +𝜎(β„Ž)−𝑝2 −1 , πœ– = +1,
2
βŽͺ
Δ = πœƒ1 (π‘ž) · π‘š · 2𝜏 −𝑝2 −1
βŽͺ
βŽͺ
βŽͺ
′
⎩
− πœƒ2 (π‘Ÿ) · 𝑛 · 2𝜏 +πœ†1 −πœ†2 −𝜎(β„Ž)−𝑝10 ,
πœ– = −1.
The relations |π‘₯2 | > |π‘₯10 |, |π‘₯2 | = |π‘₯10 |, and |π‘₯2 | < |π‘₯10 |
hold respectively when πœ–Δ < 0, Δ = 0, and πœ–Δ > 0.
Proposition 7. Unless π‘₯2 = π‘₯10 , we have |Δ| > 2𝜏 +1 .
Proof: Assume π‘₯2 ΜΈ= π‘₯10 . For πœ– = +1, and using (15),
the definition of Δ rewrites as
(︁
π‘š )︁
Δ = πœƒ2 (π‘Ÿ) 𝑛 2𝜏 +πœ†1 −πœ†2 +𝜎(β„Ž)−𝑝2 −1 𝑓 (β„Ž) −
.
𝑛
Since 𝑓 (β„Ž) = π‘š/𝑛 is equivalent to π‘₯2 = π‘₯10 , we know by
Proposition 5 that |𝑓 (β„Ž) − π‘š/𝑛| > πœ‚. Together with the
bounds (3), (12), (14), and (17), this implies
log2 |Δ| > (πœ†2 − 1) + (𝑝′10 − 1) + 𝜏 + πœ†1 − πœ†2 − 𝑝2 − 1
+ log2 πœ‚
= (πœ†1 + log2 πœ‚ + 𝑀 − 3) + 𝜏 + 1 > 𝜏 + 1.
Similarly, for πœ– = −1, we have
(︁
π‘š )︁
Δ = −πœƒ1 (π‘ž) 𝑛 2𝜏 −𝑝2 −1 𝑓 (β„Ž) −
,
𝑛
so that
Δ < 0 ⇐⇒
ΔΜƒ 6 −2𝜏 ,
Δ = 0 ⇐⇒ 0 6 ΔΜƒ 6
2𝜏 ,
Δ > 0 ⇐⇒
2𝜏 +1 .
ΔΜƒ >
if 𝜏 ∈ {−1, 0} and
{οΈƒ
𝑝2 + 1,
πœ†1 − πœ†2 + 𝜏 >
𝑝′10 + 3,
πœ– = +1,
πœ– = −1,
(18)
then Δ and ΔΜƒ have the same sign (in particular, ΔΜƒ = 0 if and
only if Δ = 0).
Proof: Write the definition of ΔΜƒ as
⌊οΈ€
⌋οΈ€ ⌊οΈ€
⌋οΈ€
ΔΜƒ = ⌈πœƒ1 (π‘ž)⌉𝑋 − πœƒ2 (π‘Ÿ)𝑋 ′
where the expressions of 𝑋 and 𝑋 ′ depend on πœ–, and
′
define error terms 𝛿tbl , 𝛿rnd , 𝛿rnd
by
𝛿tbl = ⌈πœƒ1 (π‘ž)⌉ − πœƒ1 (π‘ž),
𝛿rnd = ⌊⌈πœƒ1 (π‘ž)⌉𝑋⌋ − ⌈πœƒ1 (π‘ž)⌉𝑋,
′
𝛿rnd
= ⌊πœƒ2 (π‘Ÿ)𝑋 ′ ⌋ − πœƒ2 (π‘Ÿ)𝑋 ′ .
The 𝛿’s then satisfy
0 6 𝛿tbl < 1,
′
−1 < 𝛿rnd , 𝛿rnd
6 0.
For both possible values of πœ–, we have3 0 6 𝑋 6 2𝜏 and
hence
′
−1 < ΔΜƒ − Δ = 𝑋𝛿tbl + 𝛿rnd − 𝛿rnd
< 2𝜏 + 1.
If Δ < 0, Proposition 7 implies that
ΔΜƒ < −2𝜏 +1 + 2𝜏 + 1 = −2𝜏 + 1.
log2 |Δ| > (πœ†1 − 1) + (𝑝′10 − 1) + 𝜏 − 𝑝2 − 1 + log2 πœ‚
It follows that ΔΜƒ 6 −⌊2𝜏 ⌋ since ΔΜƒ is an integer. By the
same reasoning, Δ > 0 implies ΔΜƒ > ⌊2𝜏 +1 ⌋, and Δ = 0
implies 0 6 ΔΜƒ 6 ⌈2𝜏 ⌉. This proves the first claim.
when Δ ΜΈ= 0.
Now assume that (18) holds. In this case, we have
As already explained, the values of πœƒ2 (π‘Ÿ) are integers
′
𝛿rnd
= 0, so that −1 < ΔΜƒ − Δ < 2𝜏 . The upper bounds
and can be tabulated exactly. In contrast, only an approxon ΔΜƒ become ΔΜƒ 6 −⌊2𝜏 ⌋ − 1 for Δ < 0 and ΔΜƒ 6 ⌈2𝜏 ⌉ − 1
imation of πœƒ1 (π‘ž) can be stored. We represent these values
for Δ = 0. When −1 6 𝜏 6 0, these estimates respectively
as ⌈πœƒ1 (π‘ž)⌉, hence replacing Δ by the easy-to-compute
imply ΔΜƒ < 0 and ΔΜƒ = 0, while ΔΜƒ > ⌊2𝜏 +1 ⌋ implies Δ > 0.
⎧
⌊οΈ€
⌋οΈ€
𝜏 −𝑝′10
The second claim follows.
βŽͺ
ΔΜƒ = ⌈πœƒ1 (π‘ž)⌉ · 𝑛 · 2
βŽͺ
βŽͺ
⌊οΈ€
⌋οΈ€
βŽͺ
The conclusion ΔΜƒ = 0 = Δ when π‘₯2 = π‘₯10 for certain
𝜏 +πœ†1 −πœ†2 +𝜎(β„Ž)−𝑝2 −1
⎨
− πœƒ2 (π‘Ÿ) · π‘š · 2
, πœ– = +1,
parameter choices means that there is no error in the
⌊οΈ€
⌋οΈ€
βŽͺ
ΔΜƒ = ⌈πœƒ1 (π‘ž)⌉ · π‘š · 2𝜏 −𝑝2 −1
βŽͺ
approximate computation of Δ in these cases. Specifically,
βŽͺ
βŽͺ
⌊οΈ€
⌋οΈ€
⎩
𝜏 +πœ†1 −πœ†2 −𝜎(β„Ž)−𝑝′10
the tabulation error 𝛿tbl and the rounding error 𝛿rnd cancel
− πœƒ2 (π‘Ÿ) · 𝑛 · 2
,
πœ– = −1.
out, thanks to our choice to tabulate the ceiling of πœƒ1 (π‘ž)
Proposition 7 is in principle enough to compare π‘₯2 with (and not, say, the floor).
π‘₯10 , using a criterion similar to that from Proposition 6.
We now describe in some detail the algorithm resulting
from Proposition 8, in the case πœ– = +1. In particular, we
2. See the comments following Equations (19) to (21). Yet allowing
for positive values of 𝜏 enlarges the possible choices of parameters will determine integer datatypes on which all steps can be
>𝜏 +1
to satisfy (21) (resp. (21’)) for highly unusual floating-point formats
and implementation environments. For instance, it makes it possible
to store the table for πœƒ1 in a more compact fashion, using a different
word size for the table than for the computation.
3. Actually 0 6 𝑋 6 2𝜏 −1 for πœ– = −1: this dissymmetry is related to
our choice of always expressing the lower bound on Δ in terms of πœ‚
rather than using a lower bound on 𝑓 (β„Ž)−1 − 𝑛/π‘š when πœ– = −1.
9
performed without overflow, and arrange the powers of in Steps 3 and 4 are left shifts. They can be performed
two in the expression of ΔΜƒ in such a way that computing without overflow on unsigned words of respective widths
the floor functions comes down to dropping the least π‘Š (πœ†1 ), π‘Š (𝑝′10 ) and π‘Š (𝑝2 ) since π‘Š (πœ†1 ) − πœ†1 > 𝜏 + 1 and
π‘Š (𝑝2 ) − 𝑝2 > πœ” + 3, both by (20). The same inequality
significant words of multiple-word integers.
Let π‘Š : N → N denote a function satisfying π‘Š (π‘˜) > π‘˜ implies that the (positive) integers 𝐴 and 𝐡 both fit on
for all π‘˜. In practice, π‘Š (π‘˜) will typically be the width of π‘Š (πœ†1 )−1 bits, so that their difference can be computed by
the smallest “machine word” that holds at least π‘˜ bits. a subtraction on π‘Š (πœ†1 ) bits. The quantity 𝐴−𝐡 computed
We further require that (for πœ– = +1):
in Step 5 agrees with the previous definition of ΔΜƒ, and
Δ′ has the same sign as ΔΜƒ, hence Proposition 8 implies
′
′
π‘Š (𝑝10 ) − 𝑝10 > 1,
(19)
that the relation returned in Step 7 is correct.
π‘Š (πœ†1 ) − πœ†1 > π‘Š (πœ†2 ) − πœ†2 + 𝜏 + 3, (20)
When 𝜏 ∈ {−1, 0} and (18) holds, no low-order bits
π‘Š (𝑝2 ) − 𝑝2 + π‘Š (πœ†2 ) − πœ†2 > π‘Š (πœ†1 ) − πœ†1 − 𝜏 + 1. (21) are dropped in Step 4, and Step 6 can be omitted
(that is, replaced by Δ′ := ΔΜƒ). Also observe that a
These assumptions are all very mild in view of the version of Step 7 returning −1, 0 or 1 according to the
IEEE-754-2008 Standard. Indeed, for all IEEE interchange comparison result can be implemented without branching,
formats (including binary16 and formats wider than using bitwise operations on the individual words which
64 bits), the space taken up by the exponent and sign make up the representation of Δ′ in two’s complement
leaves us with at least 5 bits of headroom if we choose encoding.
for π‘Š (𝑝2 ), resp. π‘Š (𝑝′10 ) the word width of the format.
The algorithm for πœ– = −1 is completely similar, except
Even if we force π‘Š (πœ†2 ) = πœ†2 and 𝜏 = 0, this always that (19)–(21) become
allows for a choice of πœ†1 , πœ†2 and π‘Š (πœ†2 ) satisfying (12),
(13) and (19)–(21).
π‘Š (𝑝2 ) − 𝑝2 > 2,
(19’)
Denote π‘₯+ = max(π‘₯, 0) and π‘₯− = max(−π‘₯, 0), so that
π‘Š (πœ†1 ) − πœ†1 > π‘Š (πœ†2 ) − πœ†2 + 𝜏 + 1, (20’)
π‘₯ = π‘₯+ − π‘₯− .
π‘Š (𝑝′10 ) − 𝑝′10 + π‘Š (πœ†2 ) − πœ†2 > π‘Š (πœ†1 ) − πœ†1 − 𝜏 + 3, (21’)
Algorithm 2. Step 2, second method, πœ– = +1.
1 Compute π‘ž and π‘Ÿ as defined in (11), with πœ– = +1. and 𝐴 and 𝐡 are computed as
⌊οΈ€
⌋οΈ€
+
−
Compute 𝜎(β„Ž) using (16) and Proposition 2.
𝐴 = (⌈πœƒ1 (π‘ž)⌉ · 2𝜏 ) · (π‘š2π‘Š (𝑝2 )−𝑝2 −𝜏 −1 ) · 2−π‘Š (𝑝2 )
2 Read ⌈πœƒ1 (π‘ž)⌉ from the table, storing it on a π‘Š (πœ†1 )-bit
⌊οΈ€
⌋οΈ€
′
𝐡 = πœƒ2 (π‘Ÿ) · (𝑛2πœ”−𝜎(β„Ž) ) · 2π‘Š (πœ†1 )−π‘Š (πœ†2 )−π‘Š (𝑝10 ) ,
(multiple-)word.
3 Compute
with
⌊οΈ€
⌋οΈ€
𝜏+
π‘Š (𝑝′10 )−𝑝′10 −𝜏 −
−π‘Š (𝑝′10 )
𝐴 = (⌈πœƒ1 (π‘ž)⌉ · 2 ) · (𝑛2
)·2
πœ” = π‘Š (𝑝′10 ) − 𝑝′10 + π‘Š (πœ†2 ) − πœ†2 − π‘Š (πœ†1 ) + πœ†1 + 𝜏,
by (constant) left shifts followed by a π‘Š (πœ†1 )-byrespectively using a π‘Š (πœ†1 )-by-π‘Š (𝑝2 )-bit and a π‘Š (πœ†2 )π‘Š (𝑝′10 )-bit multiplication, dropping the least signifiby-π‘Š (𝑝′10 ) multiplication.
cant word(s) of the result that correspond to π‘Š (𝑝′10 )
Tables 7 and 8 suggest parameter choices for comparbits.
isons in basic IEEE formats. (We only claim that these are
4 Compute
reasonable choices that satisfy all conditions, not that they
⌊οΈ€
⌋οΈ€
are optimal for any well-defined metric.) Observe that,
𝐡 = πœƒ2 (π‘Ÿ) · (π‘š2𝜎(β„Ž)+πœ” ) · 2π‘Š (πœ†1 )−π‘Š (πœ†2 )−π‘Š (𝑝2 )
though it has the same overall structure, the algorithm
where the constant πœ” is given by
presented in [2] is not strictly speaking a special case
πœ” = π‘Š (𝑝2 ) − 𝑝2 + π‘Š (πœ†2 ) − πœ†2 − π‘Š (πœ†1 ) + πœ†1 + 𝜏 − 1 of Algorithm 2, as the definition of the bipartite table
(cf. (11)) is slightly different.
in a similar way, using a π‘Š (πœ†2 )-by-π‘Š (𝑝2 )-bit multiplication.
5 Compute the difference ΔΜƒ = 𝐴 − 𝐡 as a π‘Š (πœ†1 )-bit 6 A N E QUALITY T EST
Besides deciding the two possible inequalities, the direct
signed integer.
6 Compute Δ′ = ⌊ΔΜƒ2−𝜏 −1 ⌋2𝜏 +1 by masking the 𝜏 + 1 and bipartite methods described in the previous sections
are able to determine cases when π‘₯2 is equal to π‘₯10 .
least significant bits of ΔΜƒ.
However, when it comes to solely determine such equality
7 Return
⎧
′
βŽͺ
“π‘₯
<
π‘₯
”
if
Δ
<
0,
10
2
cases additional properties lead to a simpler and faster
⎨
′
algorithm.
“π‘₯10 = π‘₯2 ” if Δ = 0,
βŽͺ
⎩
′
The condition π‘₯2 = π‘₯10 is equivalent to
“π‘₯10 > π‘₯2 ” if Δ > 0.
Proposition 9. When either 𝜏 > 0 or 𝜏 = −1 and (18) holds,
Algorithm 2 correctly decides the ordering between π‘₯2 and π‘₯10 .
Proof: Hypotheses (19) and (21), combined with the
inequalities 𝜎(β„Ž) > 0 and 𝜏 > −1, imply that the shifts
π‘š · 2β„Ž+𝑀 = 𝑛 · 5𝑔 ,
(22)
and thus implies certain divisibility relations between π‘š,
𝑛, 5±π‘” and 2±β„Ž±π‘€ . We now describe an algorithm that
takes advantage of these relations to decide (22) using
10
only binary shifts and multiplications on no more than
max(𝑝′10 + 2, 𝑝2 + 1) bits.
Proposition 10. Assume π‘₯2 = π‘₯10 . Then, we have
eq
eq
𝑔min
6 𝑔 6 𝑔max
,
−𝑝′10 + 1 6 β„Ž 6 𝑝2 .
where
eq
𝑔min
= −⌊𝑝10 (1 + log5 2)⌋,
the algorithm correctly returns false. In particular, no table
eq
eq
access occurs unless 𝑔 lies between 𝑔min
and 𝑔max
.
Now assume that these four conditions hold. In particular, we have π‘š1 = 2−𝑒 π‘š as π‘š2 = π‘š, and similarly
+
𝑛1 = 2−𝑣 𝑛. The bounds on 𝑔 imply that 5𝑔 fits on 𝑝2
−
bits and 5𝑔 fits on 𝑝′10 bits. If 𝑔 and 𝑀 + β„Ž are both
nonnegative, then π‘š3 = π‘š, and we have
eq
𝑔max
= ⌊𝑝2 log5 2⌋.
Additionally,
𝑔
−𝑔
divides 𝑀10
βˆ™ if 𝑔 > 0, then 5 divides π‘š, otherwise 5
(and 𝑛);
β„Ž+𝑀
βˆ™ if β„Ž > −𝑀, then 2
divides 𝑛, otherwise 2−(β„Ž+𝑀)
divides π‘š.
Proof: First observe that 𝑔 must be equal to πœ™(β„Ž) by
Proposition 1, so 𝑔 > 0 if and only if β„Ž > 0. The divisibility
properties follow immediately from the coprimality of
2 and 5.
When 𝑔 > 0, they imply 5𝑔 6 π‘š < 2𝑝2 − 1, whence
𝑔 < 𝑝2 log5 2. Additionally, 2−β„Ž−𝑀 𝑛 is an integer. Since
′
β„Ž > 0, it follows that 2β„Ž 6 2−𝑀 𝑛 < 2𝑝10 −𝑀 and hence
β„Ž 6 𝑝′10 − 𝑀 − 1 = 𝑝2 .
If now 𝑔 < 0, then 5−𝑔 divides π‘š = 2𝜈 𝑀10 and hence
divides 𝑀10 , while 2−β„Ž divides 2𝑀 π‘š. Since 𝑀10 < 10𝑝10
and 2𝑀 π‘š < 2𝑝2 +𝑀 , we have −𝑔 < 𝑝10 log5 10 and −β„Ž <
𝑝10 log2 10 < 𝑝′10 .
These properties translate into Algorithm 3 below.
Recall that π‘₯+ = max(π‘₯, 0) and π‘₯− = π‘₯+ − π‘₯. (Branchless algorithms exist to compute π‘₯+ and π‘₯− [1].) Let
𝑝 = max(𝑝′10 + 2, 𝑝2 + 1).
′
+
5𝑔 𝑛1 = 5πœ™(β„Ž) 2−(𝑀+β„Ž) 𝑛 < 2−𝑀+𝑝10 = 2𝑝2 +1 6 2𝑝 .
By the same reasoning, if 𝑔, 𝑀 + β„Ž 6 0, then 𝑛3 = 𝑛 and
′
5−𝑔 2𝑀+β„Ž π‘š < 5 · 2𝑝10 −1 < 2𝑝 .
If now 𝑔 > 0 with β„Ž 6 −𝑀, then π‘š3 = π‘š1 and
′
′
5𝑔 𝑛 < 2β„Ž+𝑝10 6 2−𝑀+𝑝10 = 2𝑝2 +1 6 2𝑝 .
Similarly, for 𝑔 6 0, −β„Ž 6 𝑀, we have
5−𝑔 π‘š < 5 · 2−β„Ž+𝑝2 6 5 · 2𝑀+𝑝2 6 2𝑝 .
In all four cases,
−
π‘š3 = 5𝑔 2−𝑒 π‘š
and
+
𝑛3 = 5𝑔 2−𝑣 𝑛
are computed without overflow. Therefore
+
−
π‘š3
π‘š · 2(β„Ž+𝑀) −(β„Ž+𝑀)
=
𝑛3
𝑛 · 5𝑔+ −𝑔−
=
π‘š · 2β„Ž+𝑀
𝑛 · 5𝑔
and π‘₯2 = π‘₯10 if and only if π‘š3 = 𝑛3 , by (22).
7
E XPERIMENTAL
RESULTS
For each combination of a binary basic IEEE format and a
decimal one, we implemented the comparison algorithm
Algorithm 3. Equality test.
eq
eq
consisting of the first step (Section 3) combined with
1 If not 𝑔min 6 𝑔 6 𝑔max then return false.
−
+
the version of the second step based on a bipartite table
2 Set 𝑒 = (𝑀 + β„Ž) and 𝑣 = (𝑀 + β„Ž) .
(Section
5.2). We used the parameters suggested in the
3 Using right shifts, compute
second part of Table 8, that is for the case when 𝛾 is a
π‘š1 = ⌊2−𝑒 π‘š⌋,
𝑛1 = ⌊2−𝑣 𝑛⌋.
power of two and one tries to reduce table size.
The implementation was aided by scripts that take as
𝑒
𝑣
4 Using left shifts, compute π‘š2 = 2 π‘š1 and 𝑛2 = 2 𝑛1 .
+
−
input
the parameters from Tables 2 and 8 and produce C
5 Read 5𝑔 and 5𝑔 from a table (or compute them).
−
code
with
GNU extensions [10]. Non-standard extensions
6 Compute π‘š3 = 5𝑔 π‘š1 with a 𝑝′10 × π‘2 → 𝑝 bit mulare used for 128-bit integer arithmetic and other basic
tiplication (meaning that the result can be anything
integer operations. No further effort at optimization was
if the product does not fit on 𝑝 bits).
made.
+
7 Compute 𝑛3 = 5𝑔 𝑛1 with a 𝑝2 × π‘′10 → 𝑝 bit
Our implementations, the code generator itself, as well
multiplication.
as various scripts to select the parameters are available
8 Return (π‘š2 = π‘š) ∧ (𝑛2 = 𝑛) ∧ (𝑔 = πœ™(β„Ž)) ∧ (𝑛3 = π‘š3 ).
at
As a matter of course, the algorithm can return false
http://hal.archives-ouvertes.fr/hal-01021928/.
as soon as any of the conditions from Step 8 is known
The code generator can easily be adapted to produce
unsatisfied.
other variants of the algorithm.
Proposition 11. Algorithm 3 correctly decides whether π‘₯2 =
We tested the implementations on random inputs
π‘₯10 .
generated as follows. For each decimal exponent and
Proof: First observe that, if any of the conditions each quantum [5, 2.1.42], we generate a few random
eq
eq
𝑔 = πœ™(β„Ž), 𝑔min
6 𝑔 6 𝑔max
, π‘š2 = π‘š and 𝑛2 = 𝑛 is vio- significands and round the resulting decimal numbers
lated, then π‘₯2 ΜΈ= π‘₯10 by Propositions 1 and 10. Indeed, upwards and downwards to the considered binary format.
π‘₯2 = π‘₯10 implies π‘š2 = π‘š and 𝑛2 = 𝑛 due to the (We exclude the decimal exponents that merely yield to
divisibility conditions of Proposition 10. In all these cases, underflow or overflow on the binary side.)
11
Special cases (±0, NaNs, Inf)
π‘₯2 , π‘₯10 of opposite sign
π‘₯2 normal, same sign, “easy” cases
π‘₯2 subnormal, same sign, “easy” cases
π‘₯2 normal, same sign, “hard” cases
π‘₯2 subnormal, same sign, “hard” cases
Naïve method
converting π‘₯10
to binary64
(incorrect)
min/avg/max
Naïve method
converting π‘₯2
to decimal64
(incorrect)
min/avg/max
Direct method
(Section 5.1)
min/avg/max
Bipartite table
method (manual
implementation)
(Section 5.2)
min/avg/max
18/–/164
30/107/188
31/107/184
45/106/178
60/87/186
79/106/167
21/–/178
44/124/179
48/132/180
89/119/179
39/78/180
78/107/179
12/–/82
15/30/76
25/57/118
153/163/189
45/56/139
171/177/189
11/–/74
16/31/73
21/43/95
19/27/73
29/40/119
34/49/94
TABLE 5
Timings (in cycles) for binary64–decimal64 comparisons.
Table 6 reports the observed timings for our random
test data. The generated code was executed on a system
equipped with a quad-core Intel Core i5-3320M processor
clocked at 2.6 GHz, running Linux 3.16 in 64 bit mode.
We used the gcc 4.9.2 compiler, at optimization level
-O3, with the -march=native flag. The measurements
were performed using the Read Time Stamp Counter
instruction (RDTSC) after pipeline serialization. The
serialization and function call overhead was estimated by
timing an empty function and subtracted off. The caches
were preheated by additional comparison function calls,
the results of which were discarded.
It can be observed that “balanced” comparisons
(d64/b64, d128/b128) perform significantly better than
“unbalanced” ones. A quick look at the generated assembly code suggests that there is room for performance
improvements, especially in cases involving 128-bit floating point formats.
We performed more detailed experiments in the special
case of binary64–decimal64 comparisons. In complement
to the generated code, we wrote a version of the bipartite
table method with some manual optimization, and implemented the method based on direct tabulation of 𝑓 (β„Ž)
presented in Section 5.1. We tested these implementations
thoroughly with test vectors that extensively cover all
floating-point input classes (normal numbers, subnormals,
Not-A-Numbers, zeros, and infinities) and exercise all
possible values of the normalization parameter 𝜈 (cf.
Section 3) as well as all possible outcomes for both of
the algorithm’s steps.
Finally, we compared them for performance to the naïve
comparison method where one of the inputs is converted
d64
d128
b32
b64
b128
41
235
36
266
75
61
TABLE 6
Average run time (in cycles) for the bipartite table method,
using the parameters suggested in Table 8 for 𝛾 = 2π‘˜ ,
optimizing table size.
to the radix of the other input. These experimental results
are reported in Table 5. In the last two rows, “easy” cases
are cases when the first step of our algorithm succeeds,
while “hard” cases refer to those for which running the
second step is necessary.
8
C ONCLUSION
Though not foreseen in the IEEE 754-2008 Standard,
exact comparisons between floating-point formats of
different radices would enrich the current floating-point
environment and make numerical software safer and
easier to prove.
This paper has investigated the feasibility of such comparisons. A simple test has been presented that eliminates
most of the comparison inputs. For the remaining cases,
two algorithms were proposed, a direct method and a
technique based on a more compact bipartite table. For
instance, in the case of binary64–decimal64, the bipartite
table uses only 800 bytes of table space.
Both methods have been proven, implemented and
thoroughly tested. According to our experiments, they
outperform the naïve comparison technique consisting in
conversion of one of the inputs to the respectively other
format. Furthermore, they always return a correct answer,
which is not the case of the naïve technique.
Since the algorithmic problems of exact binary to decimal comparison and correctly rounded radix conversion
are related, future investigations should also consider
the possible reuse of tables for both problems. Finally,
one should mention that considerable additional work
is required in order to enable mixed-radix comparisons
in the case when the decimal floating-point number is
stored in dense-packed-decimal representation.
R EFERENCES
[1]
[2]
Jörg Arndt, Matters computational. Ideas, algorithms, source code,
Springer, 2011.
Nicolas Brisebarre, Christoph Lauter, Marc Mezzarobba, and
Jean-Michel Muller, Comparison between binary64 and decimal64
floating-point numbers, ARITH 21 (Austin, Texas, USA) (Alberto
Nannarelli, Peter-Michael Seidel, and Ping Tak Peter Tang, eds.),
IEEE Computer Society, 2013, pp. 145–152.
12
Marius Cornea, Cristina Anderson, John Harrison, Ping Tak Peter
Tang, and Eric Schneider Evgeny Gvozdev, A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary
encoding format, IEEE Transactions on Computers 58 (2009), no. 2,
148–162.
[4] Godfrey H. Hardy and Edward M. Wright, An introduction to the
theory of numbers, Oxford University Press, London, 1979.
[5] IEEE Computer Society, IEEE standard for floating-point arithmetic, IEEE Standard 754-2008, August 2008, available at http:
//ieeexplore.ieee.org/servlet/opac?punumber=4610933.
[6] ISO/IEC JTC 1/SC 22/WG 14, Extension for the programming
language C to support decimal floating-point arithmetic, Proposed
Draft Technical Report, May 2008.
[7] Aleksandr Ya. Khinchin, Continued fractions, Dover, New York,
1997.
[8] Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin,
Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond,
Nathalie Revol, Damien Stehlé, and Serge Torres, Handbook of
floating-point arithmetic, Birkhäuser, 2010.
[9] Oskar Perron, Die Lehre von den Kettenbrüchen, 3rd ed., Teubner,
Stuttgart, 1954–57.
[10] The GNU project, GNU compiler collection, 1987–2014, available at
http://gcc.gnu.org/.
[11] Charles Tsen, Sonia González-Navarro, and Michael Schulte,
Hardware design of a binary integer decimal-based floating-point adder,
ICCD 2007. 25th International Conference on Computer Design,
IEEE, 2007, pp. 288–295.
[3]
Nicolas Brisebarre was born in Bordeaux,
France, in 1971. He received his Ph.D. in pure
mathematics from the Université Bordeaux I,
France, in 1998. He is Chargé de recherche (junior researcher) at CNRS, France, and he is
a member of the LIP laboratory (LIP is a joint
computer science laboratory of CNRS, École Normale Supérieure de Lyon, INRIA, and Université
Claude Bernard Lyon 1). His research interests
are in computer arithmetic, number theory, validated computing and computer algebra.
Christoph Lauter was born in Amberg, Germany, in 1980. He received this Ph.D. in Computer Science from École Normale Supérieure
de Lyon, France, in 2008. He worked as a
Software Engineer in the Numerics Team at Intel
Corporation. Since 2010, he is Maître de Conférences (Assistant Professor) at Université Pierre
et Marie Curie Paris 6, Sorbonne Universités, in
the PEQUAN Team. His research interests are
Floating-Point Arithmetic and Validated Numerical Computing.
Marc Mezzarobba received the Ph.D. degree in
Computer Science from École polytechnique in
2011. Since 2013 he has been with CNRS and
Université Paris 6. His main research interests
are in computer algebra and computer arithmetic.
Jean-Michel Muller was born in Grenoble,
France, in 1961. He received his Ph.D. degree in
1985 from the Institut National Polytechnique de
Grenoble. He is Directeur de Recherches (senior
researcher) at CNRS, France, and he is the cohead of GDR-IM. His research interests are in
Computer Arithmetic. Dr. Muller was co-program
chair of the 13th IEEE Symposium on Computer
Arithmetic (Asilomar, USA, June 1997), general
chair of SCAN’97 (Lyon, France, sept. 1997),
general chair of the 14th IEEE Symposium on
Computer Arithmetic (Adelaide, Australia, April 1999), general chair of
the 22nd IEEE Symposium on Computer Arithmetic (Lyon, France, June
2015). He is the author of several books, including ”Elementary Functions,
Algorithms and Implementation” (2nd edition, Birkhauser, 2006), and he
coordinated the writing of the ”Handbook of Floating-Point Arithmetic
(Birkhäuser, 2010). He is an associate editor of the IEEE Transactions
on Computers, and a senior member of the IEEE.
13
b32/d64
b32/d128
b64/d64
b64/d128
b128/d64
b128/d128
32
64
32
128
64
64
64
128
128
64
128
128
−1
8
−4, 10
145
17
159, 160
32, 32
0
3
332
9
+1
8
−42, 39
117
17
125, 128
32, 32
0
3
1344
10
−1
8
−38, 45
178
17
191, 192
32, 32
0
3
2048
16
+1
8
−622, 617
190
17
190, 192
32, 32
−1
3
29792
16
+1
8
−624, 617
242
17
253, 256
32, 32
0
3
39776
36
−1
8
−4, 8
86
17
95, 96
32, 32
0
3
188
5
−1
16
−2, 5
145
35
159, 160
64, 64
0
7
288
13
+1
32
−10, 10
117
72
125, 128
96, 96
0
7
720
14
−1
32
−9, 12
178
72
191, 192
96, 96
0
7
912
24
−1
64
−77, 78
190
147
191, 192
160, 160
0
7
5024
34
+1
64
−78, 78
242
147
253, 256
160, 160
0
7
6304
52
any 𝛾, optimizing multiplication count
πœ–
−1
𝛾
13
range of π‘ž
−2, 5
min πœ†1 satisfying (12)
86
min πœ†2 satisfying (13)
28
chosen πœ†1 , π‘Š (πœ†1 )
95, 96
chosen πœ†2 , π‘Š (πœ†2 )
32, 32
𝜏
0
Step 6 of Algo. 2 can be omitted
3
total table size in bytes
148
32-bit muls
5
−1
13
−2, 6
145
28
159, 160
32, 32
0
3
232
9
+1
14
−24, 22
117
31
125, 128
32, 32
0
3
808
10
−1
14
−22, 26
178
31
191, 192
32, 32
0
3
1232
16
+1
14
−355, 353
190
31
190, 192
32, 32
−1
3
17072
16
+1
14
−357, 353
242
31
253, 256
32, 32
0
3
22808
36
−1
13
−2, 6
145
28
159, 160
32, 32
0
3
232
9
+1
28
−12, 11
117
63
125, 128
64, 64
0
3
608
12
+1
28
−12, 11
178
63
189, 192
64, 64
0
3
800
28
−1
69
−71, 73
190
158
191, 192
160, 160
0
7
4860
34
+1
83
−60, 60
242
191
253, 256
192, 192
0
7
5864
56
π‘Š (𝑝2 )
π‘Š (𝑝′10 )
𝛾 = 2π‘˜ , optimizing multiplication count
πœ–
−1
𝛾
8
range of π‘ž
−4, 8
min πœ†1 satisfying (12)
86
min πœ†2 satisfying (13)
17
chosen πœ†1 , π‘Š (πœ†1 )
95, 96
chosen πœ†2 , π‘Š (πœ†2 )
32, 32
𝜏
0
Step 6 of Algo. 2 can be omitted
3
total table size in bytes
188
32-bit muls
5
𝛾 = 2π‘˜ , optimizing table size
πœ–
𝛾
range of π‘ž
min πœ†1 satisfying (12)
min πœ†2 satisfying (13)
chosen πœ†1 , π‘Š (πœ†1 )
chosen πœ†2 , π‘Š (πœ†2 )
𝜏
Step 6 of Algo. 2 can be omitted
total table size in bytes
32-bit muls
any 𝛾, optimizing table size
πœ–
𝛾
range of π‘ž
min πœ†1 satisfying (12)
min πœ†2 satisfying (13)
chosen πœ†1 , π‘Š (πœ†1 )
chosen πœ†2 , π‘Š (πœ†2 )
𝜏
Step 6 of Algo. 2 can be omitted
total table size in bytes
32-bit muls
−1
13
−2, 5
86
28
95, 96
32, 32
0
3
148
5
TABLE 7
Suggested parameters for Algorithm 2 on a 32-bit machine.
14
b32/d64
b32/d128
b64/d64
b64/d128
b128/d64
b128/d128
64
64
64
128
64
64
64
128
128
64
128
128
−1
16
−2, 5
145
35
191, 192
64, 64
0
3
320
5
+1
16
−21, 20
117
35
125, 128
64, 64
0
3
800
3
−1
16
−19, 23
178
35
191, 192
64, 64
0
3
1160
5
+1
16
−311, 309
190
35
190, 192
64, 64
−1
3
15032
5
+1
16
−312, 309
242
35
253, 256
64, 64
0
3
20032
10
+1
16
−3, 3
86
35
125, 128
64, 64
0
3
240
3
−1
16
−2, 5
145
35
191, 192
64, 64
0
3
320
5
+1
16
−21, 20
117
35
125, 128
64, 64
0
3
800
3
−1
32
−9, 12
178
72
191, 192
128, 128
0
7
1040
7
−1
64
−77, 78
190
147
191, 192
192, 192
0
7
5280
9
+1
64
−78, 78
242
147
253, 256
192, 192
0
7
6560
14
any 𝛾, optimizing multiplication count
πœ–
+1
𝛾
13
range of π‘ž
−4, 3
min πœ†1 satisfying (12)
86
min πœ†2 satisfying (13)
28
chosen πœ†1 , π‘Š (πœ†1 )
125, 128
chosen πœ†2 , π‘Š (πœ†2 )
64, 64
𝜏
0
Step 6 of Algo. 2 can be omitted
3
total table size in bytes
232
64-bit muls
3
−1
20
−1, 4
145
45
191, 192
64, 64
0
3
304
5
+1
28
−12, 11
117
63
125, 128
64, 64
0
3
608
3
−1
28
−11, 13
178
63
191, 192
64, 64
0
3
824
5
+1
28
−177, 177
190
63
190, 192
64, 64
−1
3
8744
5
+1
28
−178, 177
242
63
253, 256
64, 64
0
3
11616
10
−1
20
−1, 4
145
45
191, 192
64, 64
0
3
304
5
+1
28
−12, 11
117
63
125, 128
64, 64
0
3
608
3
+1
28
−12, 11
178
63
189, 192
64, 64
0
3
800
7
−1
82
−60, 61
190
189
191, 192
192, 192
0
7
4896
9
+1
83
−60, 60
242
191
253, 256
192, 192
0
7
5864
14
π‘Š (𝑝2 )
π‘Š (𝑝′10 )
𝛾 = 2π‘˜ , optimizing multiplication count
πœ–
+1
𝛾
16
range of π‘ž
−3, 3
min πœ†1 satisfying (12)
86
min πœ†2 satisfying (13)
35
chosen πœ†1 , π‘Š (πœ†1 )
125, 128
chosen πœ†2 , π‘Š (πœ†2 )
64, 64
𝜏
0
Step 6 of Algo. 2 can be omitted
3
total table size in bytes
240
64-bit muls
3
𝛾 = 2π‘˜ , optimizing table size
πœ–
𝛾
range of π‘ž
min πœ†1 satisfying (12)
min πœ†2 satisfying (13)
chosen πœ†1 , π‘Š (πœ†1 )
chosen πœ†2 , π‘Š (πœ†2 )
𝜏
Step 6 of Algo. 2 can be omitted
total table size in bytes
64-bit muls
any 𝛾, optimizing table size
πœ–
𝛾
range of π‘ž
min πœ†1 satisfying (12)
min πœ†2 satisfying (13)
chosen πœ†1 , π‘Š (πœ†1 )
chosen πœ†2 , π‘Š (πœ†2 )
𝜏
Step 6 of Algo. 2 can be omitted
total table size in bytes
64-bit muls
+1
13
−4, 3
86
28
125, 128
64, 64
0
3
232
3
TABLE 8
Suggested parameters for Algorithm 2 on a 64-bit machine.
Download