
Offered here for public comment, this proposed standard facilitates
transportation of numerically oriented programs and encourages
development of high-quality numerical software.
A Proposed Standard for Binary
Floating-Point Arithmetic
Draft 8.0 of IEEE Task P754
Introductory Comments by David Stevenson, Chairman, Floating-Point Working Group
Microprocessor Standards Committee, IEEE Computer Society
Few programmers care how their computer performs
floating-point arithmetic. If they do, it is usually because
they've had a divide-by-zero fault (even after inserting a
test to ensure that x ≠ y before dividing by x - y) or some
equally mysterious incident. Specifying a programming
environment that minimizes such anomalies is one of the
goals of this standardization effort. Overall, it attempts
to facilitate the transportation of numerically oriented
programs and to encourage the development of high-quality numerical software. These two goals are especially
important in the microprocessor environment since component vendors are not likely to devote extensive resources to developing numerical software for the general
A number of rationales underlying the development of
this proposal should be brought to the reader's attention.
First, the working group responsible for this document
was not restricted to the format or other conventions of
an existing floating-point system; instead, the interests of
the user community were placed above the goal of industrial continuity.* In fact, the segment of the computer
industry that has shown the greatest interest in this work
has been the semiconductor industry that is currently introducing the second generation of floating-point units
on IC chips. The second major rationale, based on the
realization that most implementations would rely on software to supply the full functionality of the proposal, was
that the document describe a programming environment, meaning both hardware and software. Indeed,
one goal was to encourage hardware implementations
that do not preclude an efficient implementation of the
total desired functionality.
These rationales should be kept in mind as we review
the proposal's major features and indicate why they were
included. There are three major aspects of the proposal:
the format of the data types, the arithmetic, and the exception handling.
*A new working group, IEEE Task 854, has been formed recently to permit development of a floating-point standard parameterized to accommodate different computer word formats and radices. Dr. W. J. Cody of
Argonne National Laboratory will be chairman of this working group.
March 1981
Formats. The basic format sizes for floating-point
numbers (32 bits and 64 bits) were selected for efficient
calculation of array elements in byte-addressable memories. For the 32-bit format, precision was deemed the most
important criterion, hence the choice of radix 2 instead of
octal or hexadecimal. Other characteristics include not
representing the leading significand bit in normalized
numbers, a minimally acceptable exponent range which
uses eight bits, and an exponent bias which allows the
reciprocal of all normalized numbers to be represented
without overflow.
For the 64-bit format, the main consideration was
range; as a minimum, the desire was that the product of
any two 32-bit numbers should not overflow the 64-bit
format. The final choice of exponent range provides that
a product of eight 32-bit terms cannot overflow the 64-bit
format, a possible boon to users of optimizing compilers
which reorder the sequence of arithmetic operations from
that specified by the careful programmer.
The proposal also recommends the minimum requirements for extended-precision temporaries (quantities
whose range and precision are greater than a basic format
but do not require twice as many bits for representation).
With their greater precision, extended-precision temporaries lessen the chance of a final result that has been
contaminated by excessive roundoff error; with their
greater range, they also lessen the chance of an intermediate overflow aborting a computation whose result
would have been representable in a basic format. The
motivation for extended precision is to afford
some benefits of a higher basic precision without incurring the time penalty usually associated with higher precision; however, the proposed standard requires only single
precision (32-bit format) for conforming implementations.
0018-9162/81/0300-0051$00.75 © 1981 IEEE
Arithmetic. The proposed standard requires accurate
computation of all arithmetic results to within half a unit
in the last place of the destination format. Once the hardware is in place to achieve this goal (guard, round, and
sticky bits, for example), it requires little more to achieve
directed roundings that are useful in interval arithmetic,
so the proposal requires these additional rounding capabilities (see Figure 1).
In addition to the four basic arithmetic operations, the
proposed standard also requires remainder, square root,
and conversions between binary and decimal representations. Remainder, square root, and conversions within a
specified range must be as accurate as the basic arithmetic. Remainder is included because of its usefulness in
argument reduction in computing elementary transcendental functions, square root because of its frequency in
matrix algorithms; both are included because they can be
supported (in a well-designed divide unit) to the required accuracy with little additional hardware. Conversion was included to ensure that accurate, reproducible
results on different implementations would not be lost at
the I/O interface.
Figure 1. Guard, round, and sticky bits ensure accurate
unbiased rounding of computed results to within half a
unit in the least-significant bit. Two bits are required for
perfect rounding; the guard bit is the first bit beyond
rounding precision, and the sticky bit is the logical OR of
all bits thereafter. To accommodate post-normalization in
some operations, the round bit is kept, beyond the guard
bit, and the sticky bit is a logical OR of all bits beyond
the round bit.
It should be remembered that this proposal specifies a
programming environment; the supporting hardware
need not directly implement these operations as single instructions. Finally, the insistence on accuracy is not only
an end in itself (providing sharper error bounds), it also
ensures a host of pleasant derivative features, such as the
commutativity of addition and multiplication.
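The guard-and-sticky mechanism described for Figure 1 can be sketched on integer significands. The following is a minimal illustration in Python, not part of the proposal: the function name `shift_right_round_even` is hypothetical, and the separate round bit is omitted because post-normalization is not modelled here.

```python
def shift_right_round_even(sig: int, shift: int) -> int:
    """Drop the low `shift` bits of a nonnegative integer significand,
    rounding to nearest with ties going to the even result."""
    if shift <= 0:
        return sig << -shift                              # nothing is discarded
    kept = sig >> shift                                   # bits that survive
    guard = (sig >> (shift - 1)) & 1                      # first discarded bit
    sticky = (sig & ((1 << (shift - 1)) - 1)) != 0        # OR of all later bits
    # Round up when the discarded part exceeds one half (guard and sticky),
    # or equals exactly one half (guard only) and the kept value is odd.
    if guard and (sticky or (kept & 1)):
        kept += 1
    return kept


print(shift_right_round_even(0b10110, 2))  # tie, kept value odd: rounds up
print(shift_right_round_even(0b10010, 2))  # tie, kept value even: stays
```

Only two pieces of information about the discarded bits are consulted, which is the point the figure caption makes about "perfect rounding."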
Exceptions. Operations that produce results beyond
the range of normalized floating-point numbers are also
treated in this proposal. In situations where a trap is not
allowed, overflow and divide-by-zero generate infinities,
and subsequent arithmetic involving these infinities produces results obeying traditional mathematical conventions regarding infinity. Operations that have no
mathematical interpretation, such as zero divided by
zero, will produce a not-a-number called a NaN. Such
NaNs can be used to convey diagnostic information
regarding their creation, or can be used as escape-mechanism pointers to nonstandard representations.
Underflow is handled by introducing "denormalized"
numbers: nonzero numbers that lie between the largest
negative normalized number and the smallest positive
normalized number, with constant spacing (see Figure 2).
In many instances, denormalized numbers reduce potential underflow damage to no more than roundoff error.
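The effect of denormalized numbers on the divide-by-zero anecdote from the opening paragraph can be sketched as follows. This is an illustration only, assuming a Python interpreter whose `float` is a 64-bit binary format with gradual underflow (true of essentially all current platforms); the helper `flush_to_zero` is a hypothetical stand-in for a machine without denormalized numbers.

```python
import sys

m = sys.float_info.min      # smallest positive normalized double, 2**-1022


def flush_to_zero(v: float) -> float:
    """Hypothetical model of a system that replaces underflowed results by zero."""
    return 0.0 if abs(v) < m else v


x = 1.5 * m
y = m
diff = x - y                # 0.5 * m, representable only as a denormalized number

print(x != y)               # the programmer's test passes
print(diff != 0.0)          # with gradual underflow the difference is nonzero
print(flush_to_zero(diff))  # without it, the same difference collapses to zero
```

With gradual underflow, `x != y` guarantees that `x - y` is nonzero, so the guarded division is safe; under flush-to-zero that guarantee fails near the underflow threshold.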
One of the consequences of a floating-point system
with NaNs is that comparisons do not obey the trichotomy rule: two items may compare not only as less than,
equal, or greater than; they may also be unordered. This
complicates the handling of branching conditions, but
those responsible for this proposal felt that the additional
complexity was unavoidable and that the additional functionality which gives rise to unordered comparisons was
worth the logical expense.
Unordered comparisons are one of several exceptional
operations that can arise, and a default action for a nontrapping
environment is specified for each occasion; for
trapping contexts, the result to be delivered to a user-specifiable trap handler is indicated.
Figure 2. Each vertical tick stands for a 4-bit significand
binary floating-point number. The underflow threshold m
is a power of 1/2 depending upon the allowed range of exponents; every floating-point number bigger than m, but
none smaller, is representable as a normalized floating-point number. On some machines (IBM 7094, DEC PDP-10,
DEC PDP-11, etc.) m is a normalized number, too; on
others (HP-3000) it is not. Flushing underflows to zero introduces
a gap between m and 0 much wider than between m and the next larger number. Gradual underflow
fills that gap with denormalized numbers as densely
packed between m and 0 as are normalized numbers between m and 2m. Doing so relegates underflow in most
computations to a status comparable with roundoff
among the normalized numbers.
In this brief introduction, it is impossible to give an
adequate account of several years' work by the many people involved in the effort of the working group (the
minutes of the meetings and supporting documents run to
hundreds of pages). An early collection of most of the
ideas was made by Prof. William Kahan and Jerome
Coonen, both of the University of California at Berkeley,
and Prof. Harold Stone of the University of Massachusetts at Amherst; a much-revised version of that work appeared in the October 1979 ACM SIGNUM Newsletter, a
special issue providing an extensive account of the proposal's features and the alternatives considered by the
The Floating-Point Working Group, IEEE Task P754,
of the Microprocessor Standards Subcommittee, under
the initial chairmanship of Richard H. Delp of Four-Phase Systems, recast that work many times. Coonen,
Kahan, John F. Palmer of Intel, Tom Pittman of Itty
Bitty Computers, and David Stevenson of Zilog were responsible for drafting this proposal. Other members of
the working group who played major roles in its deliberations by presenting alternative proposals include Bob
Fraley and Steve Walther of Hewlett-Packard Laboratories and Mary Payne, Dileep Bhandarkar, and William
Strecker of Digital Equipment Corporation. Several
working-group members will present a one-day tutorial
on the proposed standard May 20, 1981, in conjunction
with the Fifth Symposium on Computer Arithmetic in
Ann Arbor, Michigan.
The IEEE Computer Society is publishing the draft
standard, along with related material, to invite public
comment prior to its submission to the IEEE Standards
Board for adoption as an IEEE Standard. Comments
should be sent to Stevenson by May 15, 1981, with copies
to Mike Smolin and Steve Diamond, co-chairmen of the
Microprocessor Standards Committee. Stevenson's address is Zilog, Inc., 10460 Bubb Road, Cupertino, CA
95014; (408) 446-4666, ext. 5476. Smolin and Diamond's
address is Synertek, PO Box 552 MS-34, Santa Clara, CA
95052. If you would like to participate in this and other efforts of the Microprocessor Standards Committee, please
contact one of the co-chairmen.
The Proposed Standard
This standard is a product of the Floating-Point Working Group of the Microprocessor Standards Subcommittee of the IEEE Computer Society Computer Standards
Committee. It is intended that the standard embody the
essence of "An Implementation Guide to a Proposed
Standard for Floating-Point Arithmetic" by Jerome T. Coonen.1
1. Computer, Vol. 13, No. 1, January 1980, pp. 68-79. See also errata on p. 61 of this issue.
This standard defines a family of commercially feasible
ways for new systems to perform binary floating-point
arithmetic. The issues of retrofitting were not considered.
The desiderata which guided the formulation of this standard included:
(a) Facilitate movement of existing programs from
diverse computers to those which adhere to this
(b) Enhance the capabilities and safety available to
programmers who, though not expert in numerical
methods, may well be attempting to produce
numerically sophisticated programs. However, we
recognize that utility and safety are sometimes antagonists.
(c) Encourage experts to develop and distribute robust
and efficient numerical programs portable, via
minor editing and recompilation, onto any computer which conforms to this standard and possesses adequate capacity. When restricted to a
declared subset of the standard, these programs
should produce identical results on all conforming
(d) Provide direct support for
* execution-time diagnosis of anomalies,
* smoother handling of exceptions, and
* interval arithmetic at a reasonable cost.
(e) Provide for development of
* elementary functions like exp, cos, ...,
* very high precision (multiword) arithmetic, and
* coupling of numerical and symbolic algebraic
(f) Enable rather than preclude further refinements
and extensions.
1. Scope
1.1. Implementation objectives. It is intended that an implementation of a floating-point system conforming to
this standard can be realized entirely in software, entirely
in hardware, or in any combination of hardware and software. It is the actual environment which the programmer
or user of the system sees that conforms or fails to conform to this standard. Hardware components that require
software support to conform shall not be said to conform
apart from such software.
1.2. Inclusions. This standard specifies:
(a) floating-point number formats;
(b) the results for add, subtract, multiply, divide,
square root, remainder, and compare;
(c) conversions between integers and floating-point
(d) conversions between different floating-point formats;
(e) conversion between basic format (see §3.1) floating-point numbers and decimal strings; and
(f) floating-point exceptions and their handling, including non-numbers (NaNs).
1.3. Exclusions. This standard does not specify:
(a) integer representation;
(b) interpretation of signs and fraction fields of NaNs;
(c) binary-decimal conversions to and from extended
formats; or
(d) formats of decimal strings.
2. Definitions
2.1. User. The user of a floating-point system is considered to be any person, hardware, or program, not itself
specified by this standard, having access to and controlling those operations of the programming environment
specified in this standard.
2.2. Binary floating-point number. A bit-string characterized by three components: a sign, a signed exponent,
and a significand. Its numerical value, if any, is the signed
product of its significand and two raised to the power of
its exponent. In this document a bit-string is not always
distinguished from a number it may represent.
2.3. Exponent. That component of a binary floating-point number which normally signifies the power to
which two is raised in determining the value of the
represented number. Occasionally the exponent is called
the signed or unbiased exponent.
2.4. Biased exponent. The sum of the exponent and a constant (bias) chosen to make the biased exponent's range
nonnegative.
2.5. Significand. That component of a binary floating-point number which consists of an explicit or implicit
leading bit to the left of its binary point and a fraction
field to the right of the binary point.
2.6. Fraction. That field of the significand that lies to the
right of its implied binary point.
2.7. Normal zero. The exponent is the format's
minimum, and the significand is zero. Normal zero may
have either a positive or a negative sign. Only the extended
formats have any unnormalized zeros (see §2.9).
2.8. Denormalized. The exponent is the format's minimum, the explicit or implicit leading bit is zero, and the
number is not normal zero. To denormalize a binary
floating-point number means to shift its significand right
while incrementing its exponent, until it becomes a denormalized number.
2.9. Unnormalized. Occurs only in the extended format.
The number's exponent is greater than the format's
minimum, and the explicit leading bit is zero. If the
significand is zero, this is an unnormalized zero.
2.10. Normalize. If the number is nonzero, shift its
significand left while decrementing its exponent until the
leading significand bit becomes one; the exponent is
regarded as if its range were unlimited. If the significand is
zero, the number becomes normal zero. Normalizing a
number does not change its sign.
2.11. NaN. Not a number; a symbolic entity encoded in
floating-point format. See §3 and §6.2.
2.12. Status flag. A variable which may take two states,
set and clear. A program may clear or copy a flag. When
set, a status flag may contain additional system-dependent
information, possibly inaccessible to the program.
The operations of this standard may, as a side effect, set
some of the following flags: inexact result, underflow,
overflow, divide by zero, and invalid operation.
2.13. Destination. Every unary or binary operation delivers its result to a destination, either explicitly designated
by the user or implicitly supplied by the system (e.g., intermediate results in subexpressions or arguments for
procedures). Some languages place the results of intermediate calculations in destinations beyond the programmer's control. Nonetheless, this standard defines
the result of an operation in terms of that destination format as well as the operands' values.
2.14. Mode. A mode is a variable which a program may
set, sense, save and restore, to control the execution of
subsequent arithmetic operations. The default mode is
that mode which a program can assume to be in effect
unless an explicitly contrary statement is included either
in the program or its specification. The standard entails
the modes
(a) projective/affine, which concerns the interpretation of ∞,
(b) rounding direction, which concerns the direction
of rounding errors,
and, in certain implementations,
(c) rounding precision, to shorten the precision of intermediate results.
Optionally, an implementor may provide the modes
(d) warning/normalizing, for handling underflowed
values, and
(e) traps disabled/enabled, for handling exceptions.
2.15. Shall and should. In this standard, the use of the
word "shall" signifies that which is obligatory in any conforming implementation; the use of the word "should"
signifies that which is strongly recommended as being in
keeping with the intent of the standard, despite architectural or other constraints beyond the scope of this standard that may on occasion render the recommendations
3. Formats
This standard defines four floating-point formats in
two groups, basic and extended, each having two widths,
single and double. The standard levels of implementation
are distinguished by the combinations of formats supported.
Figure 1. Single-precision format.
Figure 2. Double-precision format.
3.1. Basic formats
3.1.1. Single. A 32-bit format for a binary floating-point
number X is divided as shown in Figure 1. The component
fields of X are the 1-bit sign s, the 8-bit biased exponent e,
and the 23-bit fraction f. The value v of X is as follows:
(a) If $e = 255$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = 255$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $0 < e < 255$, then $v = (-1)^s 2^{e-127} (1.f)$.
(d) If $e = 0$ and $f \neq 0$, then $v = (-1)^s 2^{-126} (0.f)$.
(e) If $e = 0$ and $f = 0$, then $v = (-1)^s 0$ (zero).
3.1.2. Double. A 64-bit format for a binary floating-point
number X is divided as shown in Figure 2. The component
fields of X are the 1-bit sign s, the 11-bit biased exponent
e, and the 52-bit fraction f. The value v of X is as follows:
(a) If $e = 2047$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = 2047$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $0 < e < 2047$, then $v = (-1)^s 2^{e-1023} (1.f)$.
(d) If $e = 0$ and $f \neq 0$, then $v = (-1)^s 2^{-1022} (0.f)$.
(e) If $e = 0$ and $f = 0$, then $v = (-1)^s 0$ (zero).
3.2. Extended formats
3.2.1. Single-extended. Extended is an implementation-dependent format. An extended binary floating-point
number X has four components: a 1-bit sign s, an exponent e of specified range combined with an implementation-dependent bias, a 1-bit integer part j, and a fraction f with at least 31 bits. The exponent shall range between
a minimum value $m \leq -1023$ and a maximum
value $M \geq +1024$. The value v of X is as follows:
(a) If $e = M$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = M$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $m < e < M$, then $v = (-1)^s 2^{e} (j.f)$.
(d) If $e = m$ and $j = f = 0$, then $v = (-1)^s 0$ (normal zero).
(e) If $e = m$ and $j$ or $f$ is nonzero, then
$v = (-1)^s 2^{e'} (j.f)$, where $e' = m$ or $m + 1$ at the
implementor's option.
3.2.2. Double-extended. The double-extended format is
the same as single-extended described in §3.2.1, except
that the exponent shall range between $m \leq -16383$ and
$M \geq +16384$, and the fraction shall have at least 63 bits.
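The five value rules for the single format in §3.1.1 translate directly into a decoder. The sketch below is illustrative only: the function name `decode_single` is hypothetical, the 32-bit pattern is assumed to be supplied as a Python integer with the sign in the most significant bit, and the fraction bits of a NaN are ignored because the draft does not interpret them.

```python
import math


def decode_single(bits: int) -> float:
    """Return the value v of a 32-bit single-format bit pattern."""
    s = (bits >> 31) & 0x1          # 1-bit sign
    e = (bits >> 23) & 0xFF         # 8-bit biased exponent
    f = bits & 0x7FFFFF             # 23-bit fraction
    sign = -1.0 if s else 1.0
    if e == 255:
        # Rules (a) and (b): NaN when f is nonzero, otherwise signed infinity.
        return math.nan if f != 0 else sign * math.inf
    if e == 0:
        # Rules (d) and (e): (0.f) * 2**-126, which is a signed zero when f == 0.
        return sign * math.ldexp(f, -126 - 23)
    # Rule (c): (1.f) * 2**(e - 127), with the leading bit made explicit.
    return sign * math.ldexp((1 << 23) | f, e - 127 - 23)


print(decode_single(0x3F800000))    # biased exponent 127, fraction 0
print(decode_single(0x00000001))    # smallest positive denormalized number
```

The fraction is treated as an integer scaled by 2**-23, so rules (c), (d), and (e) differ only in the implicit leading bit and the exponent used.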
3.2.3. Exponent range. An implementation of this standard is not required to provide (and the user should not
assume) that single-extended have greater range than
3.3. Combinations of formats. All implementations conforming to this standard shall support single. Implementations should support the extended format corresponding to the widest basic format supported, and need not
support any other extended format.2
2. Only if upward compatibility and speed are important issues should a
system supporting the double-extended format also support single-extended.
4. Rounding
Except for binary-decimal conversion, all operations
specified in §5 and §7 shall be performed as if correct to
infinite precision, then rounded according to the specifications in this section. Rounding takes a number regarded
as infinitely precise and, if necessary, modifies it to fit in
the destination's format while signalling that it is inexact
(see §8.5).
4.1. Default rounding mode. An implementation of this
standard shall provide round to nearest, with rounding to
even in case of a tie, as the default rounding mode. When
rounding to nearest, the result shall differ from the infinite precision exact result by at most one half in the least-significant-digit position; rounding to even means that if
the difference is exactly half then the rounded result shall
have an even last digit.
4.2. Directed rounding modes. An implementation shall
provide user-selectable positive- and negative-directed
rounding (round toward +∞ and round toward -∞)
and truncation (round toward 0) for all operations.
When rounding toward +∞, the result shall be the format's value (possibly +∞) closest to and no less than the
infinitely precise result, except as specified in §8.3;
analogously, when rounding toward -∞, the result shall
be the format's value (possibly -∞) closest to and no
greater than the infinitely precise result, except as
specified in §8.3. When rounding toward 0, the result
shall be the format's value closest to and no greater in
magnitude than the infinitely precise result.
The rounding modes may affect the signs of zero sums
(see §6.3).
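The four rounding modes of §4.1 and §4.2 can be sketched with exact rational arithmetic, which plays the role of the "infinitely precise result." This is a simplified illustration under stated assumptions: the function name `round_to_precision` and the mode names are hypothetical, the exponent range is treated as unlimited (so overflow, underflow, and the exceptions of §8 are not modelled), and p is the number of significand bits.

```python
from fractions import Fraction
import math


def round_to_precision(x: Fraction, p: int, mode: str = "nearest") -> Fraction:
    """Round the exact value x to p significant bits in the given mode."""
    if x == 0:
        return Fraction(0)
    mag = abs(x)
    # Find e with 2**e <= |x| < 2**(e + 1) using exact integer arithmetic.
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)     # spacing of p-bit numbers near x
    q = x / ulp                          # exact quotient, not yet an integer
    if mode == "up":                     # round toward +infinity
        n = math.ceil(q)
    elif mode == "down":                 # round toward -infinity
        n = math.floor(q)
    elif mode == "zero":                 # truncation
        n = math.trunc(q)
    elif mode == "nearest":              # ties go to the even last digit
        n = math.floor(q)
        rem = q - n
        if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and n % 2 == 1):
            n += 1
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return n * ulp


tenth = Fraction(1, 10)
print(round_to_precision(tenth, 24, "down"), round_to_precision(tenth, 24, "up"))
```

The two directed results printed at the end bracket 1/10 and differ by exactly one unit in the last place, which is the property interval arithmetic relies on.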
4.3. Rounding precision. Normally a result is rounded to
the precision of its destination. However, some hardware
will always deliver results from single format operands to
double or extended destinations. On such a system the
user, which may be a high-level language compiler, shall
be able to specify that a result be rounded instead to single
precision, though it is stored in the double or extended
format with its wider exponent range.3 Similarly, a system
that delivers all results from double format operands to
extended destinations shall permit the user to specify
rounding to double precision. Note that to meet the
specifications in §4.1, the result cannot suffer more than
one rounding error.
3. Rounding precision control is intended to allow systems whose
destinations are always double or extended to mimic systems with single
and double destinations. However, use of precision control to combine
double (or extended) operands to produce a single format result with just
one rounding is considered nonstandard.
5. Operations
All conforming implementations of this standard shall
provide add, subtract, multiply, divide, square root, remainder, floating-point format conversions, conversions
between floating-point and integers, binary-decimal
conversions, and comparisons.
When all operands are normalized, the operations shall
be performed as if to infinite precision before rounding as
specified in §4. §7 specifies the results when at least one of
the operands is not normalized. §6 augments the specifications
to cover signed zero and ∞ and NaN; §8 enumerates exceptions.
5.1. Arithmetic. An implementation shall provide add,
subtract, multiply, divide, and remainder for any two
operands of the same format, for each supported format;
it should also provide the operations for operands of differing
formats. The destination format (regardless of the
rounding precision control of §4.3) shall be at least as
wide as the operands' format. All results shall be rounded
as specified in §4.
The remainder r = x REM y is defined regardless of the
rounding mode by the following relation when y ≠ 0:
$r = x - y \times n$,
where n is the integer nearest x/y; n is even whenever
$|n - x/y| = 1/2$. Note that with this definition the remainder
is exact. The result shall be normalized unless it
underflows.
5.2. Square root. The square root operation shall be provided
in all supported formats and is defined for all normalized
operands ≥ 0; $\sqrt{-0} = -0$. The destination format
shall be at least as wide as the operand's. The result
shall be rounded as specified in §4.
5.3. Floating-point format conversions. It shall be possible
to convert floating-point numbers between all supported
formats. If the conversion is to a less wide precision,
the result shall be rounded as specified in §4. If the
conversion is to a wider precision, it shall be exact, although
an invalid result exception may be raised as specified
in §8.1.2.
5.4. Conversion between floating-point and integer. It
shall be possible to round a floating-point number to an
integer value in the same floating-point format, for all
supported formats. The rounding shall be as specified in
§4, with the understanding that in round to nearest mode,
if the difference between the unrounded operand and the
rounded result is exactly one half, the rounded result is
even.
It shall be possible to convert between all supported
floating-point formats and all supported integer formats.
Conversion to integer shall be effected by rounding as
specified in §4. Conversions between floating-point integers
and integer formats shall be exact unless an exception arises as specified in §8.1.1.
5.5. Binary-decimal conversion. Conversion between
decimal strings in at least one format and binary floating-point
numbers in all supported basic formats shall be provided
for numbers throughout the ranges specified in
Table 1. The integers M and N in Tables 1 and 2 below are
such that the decimal strings have values $M \times 10^{\pm N}$. On
input, trailing zeros shall be appended to or stripped from
M (up to the limits specified in Table 1) in order to
minimize N. When the destination is a decimal string, its
least-significant digit should be located by format
specifications for purposes of rounding.
Conversions shall be correctly rounded as specified in
§4 for operands lying within the ranges specified in Table
2. Otherwise the error in the converted result shall not exceed
by more than 0.47 units in the destination's least-significant digit the error that would be incurred by the
rounding specifications of §4, provided that exponent
over/underflow does not occur.
Conversions shall be monotonic. That is, increasing the
value of a binary floating-point number shall not decrease
its value when converted to a decimal string, and increasing the value of a decimal string shall not decrease its value
when converted to a binary floating-point number.
When rounding to nearest, conversion from binary to
decimal and back to binary shall be the identity as long as
the decimal string is carried to the maximum precision
specified in Table 1, namely nine digits for single and 17
digits for double.4
If decimal to binary conversion over/underflows, the
response is as specified in §8. Over/underflow and NaNs
and infinities encountered during binary to decimal conversion
should be indicated to the user by appropriate strings.
Table 1. Decimal conversion ranges.
Table 2. Correctly rounded decimal conversion range.
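Two of the requirements in §5 lend themselves to a short check: the remainder defined in §5.1 and the binary-decimal round trip of §5.5. The sketch below is an illustration under stated assumptions, not the draft's method: the names `rem` and `round_trips` are hypothetical, the operands are finite Python floats (a 64-bit binary format), and it relies on Python's correctly rounded float formatting and parsing.

```python
from fractions import Fraction


def rem(x: float, y: float) -> float:
    """r = x - y*n, with n the integer nearest x/y and even on a tie."""
    if y == 0:
        raise ValueError("the relation is stated only for nonzero y")
    exact_x, exact_y = Fraction(x), Fraction(y)
    n = round(exact_x / exact_y)           # Fraction rounds halves to even
    return float(exact_x - exact_y * n)    # r is exact, so no rounding occurs


def round_trips(x: float, digits: int = 17) -> bool:
    """True if binary -> decimal string -> binary returns the same value."""
    return float(f"{x:.{digits}g}") == x


print(rem(5.0, 2.0), rem(7.0, 2.0))        # x/y is a tie in both cases
print(round_trips(0.1 + 0.2), round_trips(0.1 + 0.2, 16))
```

In the two remainder calls the quotient lies exactly halfway between integers, so the even choice of n decides the sign of the result. The last line shows that 17 digits recover the double while 16 do not for this particular value.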
4. The properties specified for conversions are implied by error bounds
that depend on the format (single or double) and the number of decimal
digits involved; the 0.47 mentioned is a worst-case bound only. For a detailed discussion of these error bounds and economical conversion
algorithms that exploit the extended format, see "Binary Decimal Conversion in KCS Arithmetic," by Jerome T. Coonen (to appear).
5.6. Comparison. It shall be possible to compare floating-point numbers in all supported formats, including comparisons between operands of differing formats. Comparisons are exact and never overflow or underflow. Four
mutually exclusive relations are possible: "less than,"
"equal," "greater than," and "unordered." The last
case arises when at least one operand is NaN, or when ∞ in
the projective mode is compared to anything other than
∞; every NaN shall compare "unordered" with everything, including itself. Comparisons shall ignore the sign
of infinity in the projective mode (where +∞ = -∞), and
shall ignore the sign of zero (so +0 = -0).
5.6.1. Condition codes. When the result of a comparison
is reported via condition codes, the result shall be an encoding of one of the four relations listed in §5.6.
5.6.2. Predicates. When the result of a comparison is
reported as an affirmation or negation of a predicate, the
following implications shall determine that response:
(a) The relation "less than" affirms the predicates <,
≤, ≠, and denies the predicates =, ≥, >, unordered.
(b) The relation "equal" affirms the predicates =, ≤,
≥, and denies the predicates <, >, ≠, unordered.
(c) The relation "greater than" affirms the predicates
>, ≥, ≠, and denies the predicates =, ≤, <, unordered.
(d) The relation "unordered" affirms the predicates
≠, unordered, and denies the predicates <, ≤, =, ≥, >.
In addition to the response specified above, an invalid
operation exception (see §8.1) shall arise in a comparison
just when two values whose relation is "unordered" are
compared via a predicate involving <, ≤, ≥, >, or their
negations, as specified in §8.1.1.h.
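The four relations of §5.6 and the predicate responses of §5.6.2 can be observed with ordinary Python floats. This sketch is illustrative only: the helper names `relation` and `predicates` are hypothetical, Python infinities behave as in the affine mode (the projective mode is not modelled), and Python does not raise the invalid operation exception that the draft attaches to ordered predicates on unordered operands.

```python
import math


def relation(x: float, y: float) -> str:
    """Classify x and y into one of the four mutually exclusive relations."""
    if math.isnan(x) or math.isnan(y):
        return "unordered"
    if x < y:
        return "less than"
    if x > y:
        return "greater than"
    return "equal"


def predicates(x: float, y: float) -> dict:
    """Evaluate each predicate as an affirmation (True) or denial (False)."""
    return {"<": x < y, "<=": x <= y, "==": x == y,
            ">=": x >= y, ">": x > y, "!=": x != y}


nan = math.nan
print(relation(nan, nan))        # a NaN is unordered even with itself
print(relation(0.0, -0.0))       # the sign of zero is ignored
print(predicates(nan, 1.0))      # only "!=" is affirmed
```

The last line reflects rule (d): an unordered pair denies every ordered predicate and equality, and affirms only inequality.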
6. Infinity, NaNs, and signed zero
6.1. Infinity arithmetic. Infinity arithmetic shall be construed as the limiting case of real arithmetic with operands
of arbitrarily large magnitude, when such a limit exists.
Infinity arithmetic shall be supported under two user-selectable modes, affine and projective, with projective
being the default. In affine mode -∞ < (every finite
number) < +∞, but in projective mode infinities compare "equal" regardless of sign and compare "unordered" with everything else. Consequently, the two
modes are distinguished by exceptions in add, subtract,
square root, and compare, as specified in §8.
Except for the invalid operations specified for ∞,
arithmetic upon ∞ is always exact and therefore shall raise
no exceptions. The three exceptions that do pertain to ∞
are raised only when
(a) ∞ is created from finite operands by overflow
(§8.3) or division by zero (§8.2), with the corresponding trap disabled, or
(b) ∞ is an invalid operand (§8.1).
6.2. Operations with NaNs. Two different kinds of NaN,
trapping and nontrapping, shall be supported in all operations.
Trapping NaNs shall be reserved operands which precipitate an invalid operation exception (§8.1.1) or some
other implementation-dependent exception for every
operation listed in §5 that is performed upon them.5
Nontrapping NaNs shall obey the following rules; these
NaNs should, by means left to the implementor's discretion, afford retrospective diagnostic information inherited from invalid or unavailable data and results.
For those operations specified to deliver floating-point
(a) every operation involving a trapping NaN or invalid operation (§8.1), if no trap occurs, shall set
the invalid operation flag and deliver in place of its
invalid result a nontrapping NaN;
(b) every operation involving one or two input NaNs,
none of them trapping, shall raise no exception but
deliver as a result either the same NaN (if operating
upon just one) or one or the other of the input
NaNs, according to an implementation-dependent
precedence rule.
The operations not covered in this paragraph, namely
those which do not deliver a floating-point result, are
comparison (§5.6) and conversion to a format that has no
NaNs (§5.4 and §5.5).
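Rule (b) and the default half of rule (a) can be observed with Python floats, whose NaN on common platforms behaves like the nontrapping kind described here. This is an illustration under that assumption; Python gives no access to the invalid operation flag, so only the delivered results are shown.

```python
import math

nan = float("nan")

# Rule (b): operations on a nontrapping NaN deliver a NaN and do not raise.
propagated = [nan + 1.0, nan * 0.0, 2.0 - nan, nan / nan, math.fabs(nan)]
all_nan = all(math.isnan(r) for r in propagated)

# Rule (a), default response: an invalid operation with no trap
# delivers a NaN in place of its invalid result.
invalid_difference = math.inf - math.inf
invalid_product = 0.0 * math.inf
```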
6.3. The sign bit. This standard says nothing about the
sign of a NaN. Otherwise the sign of a product or quotient
is the exclusive OR of the operands' signs; and the sign of
a sum, or of a difference x - y regarded as a sum
x + (-y), differs from at most one of the addends' signs.
These rules shall apply even when operands or results are
zero or infinite.
When the sum of two operands with opposite signs (or
the difference of two operands with like signs) is exactly
zero, either normal or unnormalized (see §7), the sign of
that sum (or difference) shall be "+" in all rounding
modes except round toward -∞, in which mode that sign
shall be "-." However, x + x = x - (-x) retains the same
sign as x even when x is zero.
A valid square root can have a negative sign only when
the operand is -0.
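These sign rules can be checked on Python floats, which round to nearest by default. The helper `sign_bit` is a hypothetical name introduced for this sketch; it relies on `math.copysign` because `-0.0 == 0.0` hides the sign of zero.

```python
import math

def sign_bit(x: float) -> int:
    """Return 1 if the sign bit of x is set, else 0 (works for zeros)."""
    return 1 if math.copysign(1.0, x) < 0 else 0

product = (-0.0) * 5.0            # operand signs differ, so the zero is negative
cancel = 1.5 + (-1.5)             # exact zero sum of opposite signs: +0
double_neg_zero = (-0.0) + (-0.0) # x + x keeps the sign of x, even for zero
root = math.sqrt(-0.0)            # the only valid square root with a negative sign
```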
7. Unnormalized and denormalized arithmetic
The default6 mode of arithmetic, when at least one
operand is not normalized, shall obey the following rules.
Rounding and over/underflow handling are performed
after the operations specified here and may modify the
results. In the following specifications expon(x) refers to
the unbiased exponent of x.
(a) Add or subtract (z := x ± y): If at least one of the
operands having exponent m, where m = max(expon(x), expon(y)),
is normalized, then z shall be normalized before rounding.
Otherwise expon(z) = m.
(b) Multiply (z := x × y): expon(z) = expon(x) + expon(y),
with the same exceptions as noted in §7(c).
(c) Divide (z := x/y): expon(z) = expon(x) - expon(y) - 1
when y is normalized and nonzero, except that when only one
of x and y is unnormalized and the other is normal 0 or ∞
the result z is the same (normal 0 or ∞ or invalid) as if
the unnormalized operand were replaced by its normalized
equivalent. Otherwise an exception shall be signalled as
specified in §8.1.
(d) Remainder (z := x REM y): z shall be calculated
as if x were first normalized.
(e) Square root is an invalid operation if its operand is
not normalized.
(f) Conversion (z := x): expon(z) = expon(x).
(g) Integer conversion (z := IntegerPart(x)): If
expon(x) > the number of fraction bits, then z shall
be identically x. Otherwise z shall be normalized.
(h) Conversion of denormalized binary floating-point
numbers to decimal forms representing values
M × 10^N, where M and N are integers, should use
leading zeros in the representation of M to indicate
the degree to which the number is denormalized.
(i) Compare: comparisons shall be made as if both
operands had first been normalized.
7.1. Normalizing mode. Another mode of arithmetic
should be provided which normalizes all denormalized
values before performing arithmetic with them, and
hence precludes the creation of new unnormalized
operands.7 This applies to all operations listed in §7.
Unnormalized operands shall not be affected by this mode.
8. Exceptions
There are five types of exceptions that shall be detected.
A trap under user control should be associated with each
exception as specified in §9. The default response to an exception shall be to proceed without a trap. This standard
specifies results to be delivered in both trapping and nontrapping situations. In some cases the result is different if
a trap is enabled.
For each type of exception the implementation shall
provide a status flag which shall be set on any occurrence
of the corresponding exception when no trap occurs. It
shall be reset only at the user's request. The user shall be
able to test and to alter the status flags individually, and
should further be able to save and restore all five at one
time.
8.1. Invalid operation. There are two kinds of invalid
operation exception. One, called invalid operand, arises
if an operand is invalid for the operation to be performed.
The other, called invalid result, arises if the result is invalid for the destination. The result to be delivered, when
either kind of invalid operation exception occurs without
a trap, shall be a nontrapping NaN (see §6.2).
8.1.1. Invalid operand. Invalid operation shall be signalled
in the following cases:
(a) if any operand is a trapping NaN (see §6.2) and no
other (implementation-dependent) trap is designated
and enabled;
(b) addition or subtraction ∞ ± ∞ in projective mode
and magnitude subtraction of infinities like
(+∞) + (-∞) in affine mode;
(c) multiplication 0 × ∞;
(d) division 0/0, ∞/∞, or the divisor is not normalized
and the dividend is finite and not normal zero;
(e) remainder x REM y, where y is zero or not normalized,
or x is infinite;
(f) square root if the operand is less than zero, ∞ in the
projective mode, or not normalized;
(g) conversion of a binary floating-point number to an
integer or decimal format when overflow, infinity,
or NaN precludes a faithful representation in that
format and this cannot otherwise be signalled; and
(h) comparison via the predicates <, ≤, ≥, > or their
negations, when the relation is "unordered."
A binary floating-point result to be delivered, when an
invalid operation exception arises without a trap, shall be
a nontrapping NaN.
8.1.2. Invalid result. In any operation, when the result
destined for a single or double format would be unnormalized
but not denormalized, invalid operation shall be
signalled.8 When an invalid result exception coincides
with an overflow or inexact or trapped underflow exception,
invalid result shall take precedence; but untrapped
underflows cannot be invalid results because they are
denormalized.
8.2. Division by zero. If the divisor is normal zero and the
dividend is a finite nonzero number, then division by zero
exception shall be signalled. The default result shall be a
correctly signed ∞ (see §6.3).
5. These NaNs afford arithmetic-like enhancements (such as complex-affine infinities, extremely wide range, etc.) that are not the subject of the
standard. However, if there is no special trap designated and enabled for
these NaNs, then the invalid operation exception is raised as specified in
§8.1.1.
6. These default rules are analogous to those for normalized numbers,
though they tend more toward excessive caution than optimal utility, and
offer pipelined processors a faster but second-best alternative to providing
an optional normalizing mode described later. More useful than these
rules, but probably harder to implement, are the rules for significance
7. In many computations the loss of significance due to denormalization
is not consequential, and the invalid results (see §8.1.2) that arise in the
warning mode are too pessimistic. Thus implementors are strongly encouraged to support the normalizing mode, except perhaps in pipelined array processors with an extended format for accumulation of intermediate
sums. Use of the extended format, with its explicit leading significant bit,
permits the calculation of intermediate products and quotients involving
denormalized numbers without invalid result exceptions.
8. This can happen only in certain cases of the following operations:
(a) When an unnormalized extended is converted to a basic format.
(b) Except in the normalizing mode, when operations upon denormalized single format operands have double destinations.
(c) Except in the normalizing mode, when a denormalized number is
magnified by multiplication or division and the destination's format is single or double.
8.3. Overflow. If a rounded result is finite and not an invalid result but its exponent is too large to represent in the
target floating-point format, then overflow shall be
signalled, unless the rounding mode is round toward +∞
or round toward -∞ and there is no trap on overflow. In
the latter case overflows shall be rounded thus:
(a) Round toward -∞ carries normalized positive
overflows to the format's largest number, and unnormalized positive overflows to the format's
largest number's exponent without changing the
significand.
(b) Round toward +∞ carries normalized negative
overflows to the format's most negative number,
and unnormalized negative overflows to the format's largest number's exponent without changing
the significand.
In these two cases invalid result may arise, but only in conversions from extended to basic formats. All other cases
of overflow without a trap shall yield ∞ with the appropriate sign.
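The untrapped cases for normalized overflows can be summarized in a small decision function. This is a sketch only, assuming the double format and using hypothetical rounding-mode labels ("nearest", "zero", "up" for toward +∞, "down" for toward -∞); unnormalized overflows are not modelled. As the text is worded, round toward zero falls under "all other cases."

```python
import math
import sys

def untrapped_overflow(negative: bool, rounding: str) -> float:
    """Default result of a normalized overflow when no trap is enabled."""
    largest = sys.float_info.max        # the double format's largest number
    if rounding == "down" and not negative:
        return largest                  # case (a): positive overflow, toward -inf
    if rounding == "up" and negative:
        return -largest                 # case (b): negative overflow, toward +inf
    # All other cases: infinity with the appropriate sign.
    return -math.inf if negative else math.inf
```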
If the overflow trap is enabled, overflow upon conversion shall deliver to the trap handler a result in the widest
format supported but rounded to the destination's precision, except that, when the result of decimal to binary
conversion lies outside that range, NaN shall be delivered.
All other trapped overflows shall deliver to the trap
handler a result with correctly rounded significand and
modified exponent. The modified exponent is the correct
exponent minus a bias adjust of 192 in the single format,
1536 in double, and 3 × 2^(n-2) in extended, where n is the
number of bits in the exponent field.9
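The three bias adjusts follow one pattern: 192 = 3 × 2^6 for the 8-bit single exponent and 1536 = 3 × 2^9 for the 11-bit double exponent. A short sketch, with hypothetical function names, makes the arithmetic explicit and applies it in both directions (subtracted on overflow here, added on underflow per §8.4).

```python
def bias_adjust(exponent_bits: int) -> int:
    """Bias adjust 3 * 2**(n - 2) for an exponent field of n bits."""
    return 3 * 2 ** (exponent_bits - 2)

def wrapped_exponent(true_exponent: int, exponent_bits: int, overflow: bool) -> int:
    """Exponent handed to the trap handler: the correct exponent moved
    back toward the middle of the range by the bias adjust."""
    adjust = bias_adjust(exponent_bits)
    return true_exponent - adjust if overflow else true_exponent + adjust
```

For instance, a single-format result whose correct exponent is 200 would reach the handler with exponent 200 - 192 = 8, which is representable.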
8.4. Underflow. Underflow occurs whenever
(a) a result which is not normal zero, when examined
either before or after rounding at the implementor's
option,10 is found to have too small an exponent to
be represented in the destination's format without
further denormalizing, or
(b) an extended format product or quotient with
neither operand a normal zero, when examined
either before or after rounding at the implementor's option, turns out to be indistinguishable from
a normal zero. (Note that this cannot happen with
normalized operands.)
When underflow occurs with no trap, the unrounded
result shall be first denormalized, then rounded, then
delivered to its destination; moreover the underflow flag
shall be set to signal the event unless the rounding mode is
round toward +∞ or -∞.
If the underflow trap is enabled, the result delivered to
the trap handler shall be as specified for overflow in §8.3
except that the bias adjust is added rather than subtracted.
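The untrapped response, denormalize and then round, can be observed with Python floats, assuming the usual case in which they are double-format numbers with gradual underflow. Python does not expose the underflow flag, so only the delivered values are shown.

```python
import sys

smallest_normal = sys.float_info.min                 # 2**-1022
denormal = smallest_normal / 4                       # 2**-1024, held exactly as a denormalized number
tiniest = smallest_normal * sys.float_info.epsilon   # 2**-1074, the smallest denormalized double
vanished = tiniest / 2                               # halfway to zero; round to nearest delivers 0
```

The second value shows that a result too small for a normalized exponent is still delivered with reduced precision rather than replaced by zero.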
9. The bias adjust is chosen to translate over/underflowed exponents as
nearly as possible to the middle of the exponent range so that a trap handler
can provide appropriate information for later reconstruction of the correct result.
10. To examine a number before rounding means to examine it as
though it were first rounded toward zero.
8.5. Inexact. In the absence of an invalid operation exception, if the rounded result of an operation is not exact or if
it overflows without a trap, then the inexact exception
shall be signalled. The rounded or overflowed result shall
be delivered to the destination.
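Whether a particular addition would be inexact can be tested by comparing the rounded result with the exact rational sum. The sketch below uses Python's `fractions` module for the exact side; the function name `is_inexact_sum` is hypothetical, and the check is limited to finite operands whose sum does not overflow.

```python
from fractions import Fraction

def is_inexact_sum(x: float, y: float) -> bool:
    """True if the rounded floating-point sum x + y differs from the
    exact sum of the two operands (finite, non-overflowing cases only)."""
    exact = Fraction(x) + Fraction(y)   # Fraction(float) is an exact conversion
    return exact != Fraction(x + y)
```

Adding 0.5 and 0.25 is exact, whereas adding the double-format values nearest 0.1 and 0.2 is not.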
9. Traps
A user should be able to request a trap on any of the five
exceptions and to request that trap to be disabled. If the
trap is disabled, then the corresponding exceptions shall
be handled in the default manner specified in §8. If an exception is signalled for which the trap is enabled, then the
execution of the program in which the exception occurred
shall be suspended, a handling routine specified by the
user shall be activated, and a result, if specified in §8, shall
be delivered to the trap handler.
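The pairing of a sticky status flag with a user-controlled trap can be tried out in Python's `decimal` module. That module implements decimal rather than binary arithmetic and is not an implementation of this proposal, so the sketch is offered only as an analogy; note also that an enabled trap there raises a Python exception instead of resuming with a handler's value.

```python
import decimal

with decimal.localcontext() as ctx:
    # Trap disabled: proceed with the default result and set the flag.
    ctx.traps[decimal.DivisionByZero] = False
    ctx.clear_flags()
    default_result = decimal.Decimal(1) / decimal.Decimal(0)
    flag_was_set = bool(ctx.flags[decimal.DivisionByZero])

    # Trap enabled: the operation is interrupted instead of completing.
    ctx.traps[decimal.DivisionByZero] = True
    try:
        decimal.Decimal(1) / decimal.Decimal(0)
        trapped = False
    except decimal.DivisionByZero:
        trapped = True
```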
9.1. Trap handler. For each trap supported, the user shall
be able to specify a trap handler having the capabilities of
a subroutine that can return a value to be used in lieu of
the exceptional operation's result; this result is undefined
unless delivered by the trap handler. Similarly, that
trapped exception's flag(s) may be undefined unless set or
reset by the trap handler. When a system traps, the trap
handler should be able to determine
(a) the type(s) of exception(s) that occurred on this operation,
(b) the kind of operation that was being performed,
(c) the destination's format,
(d) in overflow, underflow, inexact, and invalid result,
the correctly rounded result including information
that might not fit in the destination's format, and
(e) in invalid operand and divide by zero, the operand values.
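One way to picture such a handler is as a function that receives a record of items (a) through (e) and returns the value to use in place of the exceptional result. Everything in the sketch below is hypothetical: the `TrapInfo` record, its field names, and the `divide` wrapper are assumptions made for illustration, and only the division by zero case of §8.2 is modelled.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class TrapInfo:
    exceptions: Tuple[str, ...]        # (a) which exception(s) occurred
    operation: str                     # (b) the operation being performed
    destination_format: str            # (c) the destination's format
    rounded_result: Optional[float]    # (d) rounded result, where one exists
    operands: Tuple[float, ...]        # (e) the operand values

Handler = Callable[[TrapInfo], float]

def divide(x: float, y: float, handler: Optional[Handler] = None) -> float:
    """Divide x by y, modelling only the division by zero exception."""
    if y == 0.0 and x != 0.0 and math.isfinite(x):
        if handler is not None:
            # Trap enabled: the handler's return value replaces the result.
            info = TrapInfo(("division by zero",), "divide", "double", None, (x, y))
            return handler(info)
        # Trap disabled: deliver a correctly signed infinity.
        return math.copysign(math.inf, x) * math.copysign(1.0, y)
    return x / y   # other exceptional cases are left to Python itself
```

A caller might pass `lambda info: 0.0` as the handler to substitute zero for every division by zero.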
Appendix: Recommended functions and predicates
The following functions and predicates are recommended as aids to program portability across different
systems, perhaps performing arithmetic very differently.
They are described generically; that is, the types of the
operands and results are inherent in the operands.
Languages that require explicit typing will have corresponding families of functions and predicates.
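As an orientation for the list that follows, several of these functions have close counterparts in Python's `math` module. The sketch below is an assumption-laden illustration, not a conforming implementation: `scalb`, `logb`, and `unordered` are thin hypothetical wrappers, Python has no warning mode or unnormalized numbers, and `math.ldexp` raises `OverflowError` where the proposal would signal overflow.

```python
import math

def scalb(x: float, n: int) -> float:
    """x times 2**n, computed by adjusting the exponent."""
    return math.ldexp(x, n)

def logb(x: float) -> float:
    """Unbiased exponent of x, with the special cases listed for logb."""
    if math.isnan(x):
        return x
    if x == 0.0:
        return -math.inf
    if math.isinf(x):
        return math.inf
    _, e = math.frexp(x)      # x = m * 2**e with 0.5 <= |m| < 1
    return float(e - 1)       # shift to the 1 <= |significand| < 2 convention

def unordered(x: float, y: float) -> bool:
    """True when x and y are unordered, i.e. at least one is a NaN."""
    return math.isnan(x) or math.isnan(y)

magnitude = math.copysign(-3.0, 1.0)   # abs(x) expressed as copysign(x, 1.0)
```

`math.isfinite` and `math.isnan` play the roles of finite and isnan, and `math.nextafter` is available in Python 3.9 and later.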
(a) copysign(x,y) returns x with the sign of y. Hence,
abs(x) = copysign(x, 1.0).
(b) -x is x with its sign reversed.
(c) scalb(x,N) returns the product of x and 2^N, for integral values N; this is accomplished by adding N to
the exponent of x and then checking for exceptional
(d) logb(x) returns the unbiased exponent of x, a
signed integer in the format of x, except that
logb(0) is -∞, logb(∞) is +∞, and logb(NaN) is
that NaN. When x is positive and finite,
1 ≤ scalb(x, -logb(x)) < 2 except when x is denormalized in the warning mode or unnormalized.
(e) nextafter(x,y) returns the next representable
neighbor of x in the direction toward y. If x = y, or
either x or y is ∞ in the projective mode or a NaN,
then x is returned.
(f) finite(x) returns the value TRUE if
-∞ < x < +∞ and returns FALSE otherwise.
(g) isnan(x), or equivalently x ≠ x, returns the value
TRUE if x is a NaN and returns FALSE otherwise.
(h) x <> y is TRUE only when x < y or x > y, and is
distinct from x ≠ y which means NOT (x = y) and is
never an invalid operation.
(i) unordered(x,y) returns the value TRUE if x is
unordered with y and returns FALSE otherwise;
this is never an invalid operation.
IEEE P754 voting committee members
at time of adoption of the proposed draft
Andrew Allison, Los Altos Hills, California
William Ames, Hewlett-Packard Data Systems
Cupertino, Calif.
Janis Arya,
Baron, Intel
Dileep Bhandarkar, Digital Equipment Corporation
Joel Boney, Motorola
Jim Bunch, University of California, La Jolla
Ed Burdick, National Semiconductor
Paul Clemente, Prime Computer
W. J. Cody, Argonne National Laboratory
Jerome T. Coonen, University of California, Berkeley
Jim Crapuchettes, Menlo Computer Associates
Richard H. Delp, Four-Phase Systems
Alvin Despain, University of California, Berkeley
Tom Eggers, Digital Equipment Corporation
Dick Fateman, University of California, Berkeley
Don Feinberg, Digital Equipment Corporation
Stuart Feldman, Bell Laboratories
Eugene Fisher, Lawrence Livermore National Laboratory
Paul F. Flanagan, Analytical Mechanics
Gordon Force, Kylex
Lloyd Fosdick, University of Colorado
Robert Fraley, Hewlett-Packard Laboratories
Howard Fullmer, Parasitic Engineering
Daniel D. Gajski, University of Illinois, Urbana
David Gay, Massachusetts Institute of Technology
C. W. Gear, University of Illinois, Urbana
Martin Graham, University of California, Berkeley
David Gustavson, Stanford Linear Accelerator Center
Guy Haas, Datapoint
Chuck Hastings, Data General
David Hough, Apple Computer
John E. Howe, Intel
Thomas E. Hull, University of Toronto
Suren Irukulla, Prime Computer
Richard James III, Santa Clara, California
Paul S. Jensen, Lockheed Research Laboratory
William Kahan, University of California, Berkeley
Howard Kaikow, Nashua, New Hampshire
Dick Karpinski, University of California, San Francisco
Virginia Klema, Massachusetts Institute of Technology
Les Kohn, National Semiconductor
Dan Kuyper, Sperry Univac
M. Dundee Maples, M & E Associates
John Markiel, Westmont, New Jersey
Roy Martin, Apple Computer
Dean Miller, Motorola
Webb Miller, University of California, Santa Barbara
John C. Nash, Vanier, Ontario, Canada
Dan O'Dowd, National Semiconductor
Cash Olsen, Signetics
John F. Palmer, Intel
Beresford Parlett, University of California, Berkeley
Dave Patterson, University of California, Berkeley
Mary Payne, Digital Equipment Corporation
Tom Pittman, Itty Bitty Computers
Lewis Randall, Apple Computer
Robert Reid, Dunstable, Massachusetts
Christian Reinsch, Leibniz-Rech/Bay. Akad. Wiss.
Roger Stafford, Beckman Instruments
David Stevenson, Zilog
G. W. Stewart, University of Maryland
Robert G. Stewart, Stewart Research Enterprises
Harold Stone, University of Massachusetts
William D. Strecker, Digital Equipment Corporation
Robert Swarz, Digital Equipment Corporation
George Taylor, University of California, Berkeley
Dar-Sun Tsien, Intel
Greg Walker, Motorola
John Stephen Walther, Hewlett Packard Laboratories
P. C. Waterman, Burlington, Massachusetts