A Proposed Standard for Binary Floating-Point Arithmetic

Offered here for public comment, this proposed standard facilitates transportation of numerically oriented programs and encourages development of high-quality numerical software.

A Proposed Standard for Binary Floating-Point Arithmetic
Draft 8.0 of IEEE Task P754

Introductory Comments by David Stevenson, Chairman, Floating-Point Working Group, Microprocessor Standards Committee, IEEE Computer Society

Few programmers care how their computer performs floating-point arithmetic. If they do, it is usually because they've had a divide-by-zero fault (even after inserting a test to ensure that x ≠ y before dividing by x - y) or some equally mysterious incident. Specifying a programming environment that minimizes such anomalies is one of the goals of this standardization effort. Overall, it attempts to facilitate the transportation of numerically oriented programs and to encourage the development of high-quality numerical software. These two goals are especially important in the microprocessor environment since component vendors are not likely to devote extensive resources to developing numerical software for the general community.

A number of rationales underlying the development of this proposal should be brought to the reader's attention. First, the working group responsible for this document was not restricted to the format or other conventions of an existing floating-point system; instead, the interests of the user community were placed above the goal of industrial continuity.* In fact, the segment of the computer industry that has shown the greatest interest in this work has been the semiconductor industry, which is currently introducing the second generation of floating-point units on IC chips. The second major rationale, based on the realization that most implementations would rely on software to supply the full functionality of the proposal, was that the document describe a programming environment, meaning both hardware and software.
Indeed, one goal was to encourage hardware implementations that do not preclude an efficient implementation of the total desired functionality. These rationales should be kept in mind as we review the proposal's major features and indicate why they were included.

*A new working group, IEEE Task 854, has been formed recently to permit development of a floating-point standard parameterized to accommodate different computer word formats and radices. Dr. W. J. Cody of Argonne National Laboratory will be chairman of this working group.

There are three major aspects of the proposal: the format of the data types, the arithmetic, and the exception handling.

Formats. The basic format sizes for floating-point numbers, 32 bits and 64 bits, were selected for efficient calculation of array elements in byte-addressable memories. For the 32-bit format, precision was deemed the most important criterion, hence the choice of radix 2 instead of octal or hexadecimal. Other characteristics include not representing the leading significand bit in normalized numbers, a minimally acceptable exponent range which uses eight bits, and an exponent bias which allows the reciprocal of all normalized numbers to be represented without overflow. For the 64-bit format, the main consideration was range; as a minimum, the desire was that the product of any two 32-bit numbers should not overflow the 64-bit format. The final choice of exponent range provides that a product of eight 32-bit terms cannot overflow the 64-bit format, a possible boon to users of optimizing compilers which reorder the sequence of arithmetic operations from that specified by the careful programmer. The proposal also recommends the minimum requirements for extended-precision temporaries (quantities whose range and precision are greater than a basic format but do not require twice as many bits for representation).
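The range claims for the two basic formats can be checked with exact rational arithmetic. The following is only an illustrative sketch: the helper names are invented here, and the format parameters (bias 127 with a 23-bit fraction, bias 1023 with a 52-bit fraction) are taken from §3.1 of the proposed standard rather than from these introductory comments.

```python
from fractions import Fraction

# Format parameters as given in section 3.1 of the proposal:
# (exponent bias, largest biased exponent of a normalized number, fraction bits).
# The tuple layout and the names below are hypothetical, chosen for this sketch.
SINGLE = (127, 254, 23)
DOUBLE = (1023, 2046, 52)


def largest_normalized(fmt):
    """Largest finite value: 2**(emax) * (2 - 2**(-fraction_bits))."""
    bias, max_biased, frac_bits = fmt
    return Fraction(2) ** (max_biased - bias) * (2 - Fraction(1, 2 ** frac_bits))


def smallest_normalized(fmt):
    """Smallest positive normalized value: biased exponent 1, fraction 0."""
    bias, _, _ = fmt
    return Fraction(2) ** (1 - bias)


def reciprocal_cannot_overflow(fmt):
    """The largest reciprocal comes from the smallest normalized number."""
    return 1 / smallest_normalized(fmt) <= largest_normalized(fmt)


def product_fits(n_terms, src, dst):
    """True if a product of n_terms values of format src cannot exceed dst's range."""
    return largest_normalized(src) ** n_terms <= largest_normalized(dst)


if __name__ == "__main__":
    print("single reciprocals safe:", reciprocal_cannot_overflow(SINGLE))
    print("8 single terms fit in double:", product_fits(8, SINGLE, DOUBLE))
    print("9 single terms fit in double:", product_fits(9, SINGLE, DOUBLE))
```

The check uses worst-case magnitudes only; it says nothing about rounding, and the reciprocal of the very largest single value falls below the normalized range, which is the underflow side of the trade-off.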
With their greater precision, extended-precision temporaries lessen the chance of a final result that has been contaminated by excessive roundoff error; with their greater range, they also lessen the chance of an intermediate overflow aborting a computation whose result would have been representable in a basic format. The motivation for supplying extended precision is to afford some benefits of a higher basic precision without incurring the time penalty usually associated with higher precision; however, the proposed standard requires only single precision (32-bit format) for conforming implementations.

0018-9162/81/0300-0051$00.75 © 1981 IEEE

Arithmetic. The proposed standard requires accurate computation of all arithmetic results to within half a unit in the last place of the destination format. Once the hardware is in place to achieve this goal (guard, round, and sticky bits, for example), it requires little more to achieve directed roundings that are useful in interval arithmetic, so the proposal requires these additional rounding capabilities (see Figure 1). In addition to the four basic arithmetic operations, the proposed standard also requires remainder, square root, and conversions between binary and decimal representations. Remainder, square root, and conversions within a specified range must be as accurate as the basic arithmetic. Remainder is included because of its usefulness in argument reduction in computing elementary transcendental functions, square root because of its frequency in matrix algorithms; both are included because they can be supported, in a well-designed divide unit, to the required accuracy with little additional hardware. Conversion was included to ensure that accurate, reproducible results on different implementations would not be lost at the I/O interface. It should be remembered that this proposal specifies a programming environment; the supporting hardware need not directly implement these operations as single instructions. Finally, the insistence on accuracy is not only an end in itself (providing sharper error bounds), it also ensures a host of pleasant derivative features, such as the commutativity of addition and multiplication.

Figure 1. Guard, round, and sticky bits ensure accurate unbiased rounding of computed results to within half a unit in the least-significant bit. Two bits are required for perfect rounding; the guard bit is the first bit beyond rounding precision, and the sticky bit is the logical OR of all bits thereafter. To accommodate post-normalization in some operations, the round bit is kept beyond the guard bit, and the sticky bit is a logical OR of all bits beyond round. [Diagram: the significand bits, ending at the least-significant bit, followed by the guard bit, the round bit, and the sticky bit.]

Exceptions. Operations that produce results beyond the range of normalized floating-point numbers are also treated in this proposal. In situations where a trap is not allowed, overflow and divide-by-zero generate infinities, and subsequent arithmetic involving these infinities produces results obeying traditional mathematical conventions regarding infinity. Operations that have no mathematical interpretation, such as zero divided by zero, will produce a not-a-number called a NaN. Such NaNs can be used to convey diagnostic information regarding their creation, or can be used as escape-mechanism pointers to nonstandard representations. Underflow is handled by introducing "denormalized" numbers: nonzero numbers that lie between the largest negative normalized number and the smallest positive normalized number, with constant spacing (see Figure 2). In many instances, denormalized numbers reduce potential underflow damage to no more than roundoff error. One of the consequences of a floating-point system with NaNs is that comparisons do not obey the trichotomy rule: two items may compare not only as less than, equal, or greater than; they may also be unordered.
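The failure of trichotomy can be observed directly in any language whose floating-point type follows these conventions. The sketch below uses Python floats and makes several assumptions that should be read as such: the function names are invented for illustration, Python's float is taken to be a 64-bit binary format with NaNs and infinities (true on common platforms), its infinities behave as in the affine mode only, and Python raises ZeroDivisionError for 1.0/0.0 instead of delivering an infinity, so infinities are created with float("inf").

```python
import math

NAN = float("nan")
INF = float("inf")


def relation(x, y):
    """Classify x and y into one of four mutually exclusive relations."""
    if math.isnan(x) or math.isnan(y):
        return "unordered"
    if x < y:
        return "less than"
    if x > y:
        return "greater than"
    return "equal"


def trichotomy_holds(x, y):
    """True when exactly the classical three-way comparison applies."""
    return (x < y) or (x == y) or (x > y)


if __name__ == "__main__":
    for a, b in [(1.0, 2.0), (0.0, -0.0), (INF, 1e308), (NAN, 1.0), (NAN, NAN)]:
        print(a, b, relation(a, b), trichotomy_holds(a, b))
    # An operation with no mathematical interpretation yields a NaN.
    print(INF - INF)
```

A branch written as `if not (x < y)` therefore differs from `if x >= y` whenever an operand is a NaN, which is the branching complication discussed in the text.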
This complicates the handling of branching conditions, but those responsible for this proposal felt that the additional complexity was unavoidable and that the additional functionality which gives rise to unordered comparisons was worth the logical expense. Unordered comparisons are one of several exceptional operations that can arise, and a default action for a nontrapping environment is specified for each occasion; for trapping contexts, the result to be delivered to a user-specifiable trap handler is indicated.

Figure 2. Each vertical tick stands for a 4-bit significand binary floating-point number. The underflow threshold m is a power of 1/2 depending upon the allowed range of exponents; every floating-point number bigger than m, but none smaller, is representable as a normalized floating-point number. On some machines (IBM 7094, DEC PDP-10, DEC PDP-11, etc.) m is a normalized number, too; on others (HP-3000) it is not. Flushing underflows to zero introduces a gap between m and 0 much wider than between m and the next larger number. Gradual underflow fills that gap with denormalized numbers as densely packed between m and 0 as are normalized numbers between m and 2m. Doing so relegates underflow in most computations to a status comparable with roundoff among the normalized numbers. [Diagram: tiny binary floating-point numbers on a number line marked 0, m, 2m, 4m, and 8m, with gradual underflow supplying the denormalized numbers between 0 and m.]

In this brief introduction, it is impossible to give an adequate account of several years' work by the many people involved in the effort of the working group (the minutes of the meetings and supporting documents run to hundreds of pages). An early collection of most of the ideas was made by Prof. William Kahan and Jerome Coonen, both of the University of California at Berkeley, and Prof.
Harold Stone of the University of Massachusetts at Amherst; a much-revised version of that work appeared in the October 1979 ACM SIGNUM Newsletter, a special issue providing an extensive account of the proposal's features and the alternatives considered by the group. The Floating-Point Working Group, IEEE Task P754, of the Microprocessor Standards Subcommittee, under the initial chairmanship of Richard H. Delp of Four-Phase Systems, recast that work many times. Coonen, Kahan, John F. Palmer of Intel, Tom Pittman of Itty Bitty Computers, and David Stevenson of Zilog were responsible for drafting this proposal. Other members of the working group who played major roles in its deliberations by presenting alternative proposals include Bob Fraley and Steve Walther of Hewlett-Packard Laboratories and Mary Payne, Dileep Bhandarkar, and William Strecker of Digital Equipment Corporation.

Several working-group members will present a one-day tutorial on the proposed standard May 20, 1981, in conjunction with the Fifth Symposium on Computer Arithmetic in Ann Arbor, Michigan.

The IEEE Computer Society is publishing the draft standard, along with related material, to invite public comment prior to its submission to the IEEE Standards Board for adoption as an IEEE Standard. Comments should be sent to Stevenson by May 15, 1981, with copies to Mike Smolin and Steve Diamond, co-chairmen of the Microprocessor Standards Committee. Stevenson's address is Zilog, Inc., 10460 Bubb Road, Cupertino, CA 95014; (408) 446-4666, ext. 5476. Smolin and Diamond's address is Synertek, PO Box 552 MS-34, Santa Clara, CA 95052. If you would like to participate in this and other efforts of the Microprocessor Standards Committee, please contact one of the co-chairmen.

The Proposed Standard

Foreword

This standard is a product of the Floating-Point Working Group of the Microprocessor Standards Subcommittee of the IEEE Computer Society Computer Standards Committee.
It is intended that the standard embody the essence of "An Implementation Guide to a Proposed Standard for Floating-Point Arithmetic" by Jerome T. Coonen.[1] This standard defines a family of commercially feasible ways for new systems to perform binary floating-point arithmetic. The issues of retrofitting were not considered.

[1] Computer, Vol. 13, No. 1, January 1980, pp. 68-79. See also errata on p. 61 of this issue.

The desiderata which guided the formulation of this standard included:
(a) Facilitate movement of existing programs from diverse computers to those which adhere to this standard.
(b) Enhance the capabilities and safety available to programmers who, though not expert in numerical methods, may well be attempting to produce numerically sophisticated programs. However, we recognize that utility and safety are sometimes antagonists.
(c) Encourage experts to develop and distribute robust and efficient numerical programs portable, via minor editing and recompilation, onto any computer which conforms to this standard and possesses adequate capacity. When restricted to a declared subset of the standard, these programs should produce identical results on all conforming systems.
(d) Provide direct support for
    * execution-time diagnosis of anomalies,
    * smoother handling of exceptions, and
    * interval arithmetic at a reasonable cost.
(e) Provide for development of
    * elementary functions like exp, cos, ...,
    * very high precision (multiword) arithmetic, and
    * coupling of numerical and symbolic algebraic computation.
(f) Enable rather than preclude further refinements and extensions.

1. Scope

1.1. Implementation objectives. It is intended that an implementation of a floating-point system conforming to this standard can be realized entirely in software, entirely in hardware, or in any combination of hardware and software. It is the actual environment which the programmer or user of the system sees that conforms or fails to conform to this standard.
Hardware components that require software support to conform shall not be said to conform apart from such software.

1.2. Inclusions. This standard specifies:
(a) floating-point number formats;
(b) the results for add, subtract, multiply, divide, square root, remainder, and compare;
(c) conversions between integers and floating-point numbers;
(d) conversions between different floating-point formats;
(e) conversion between basic format (see §3.1) floating-point numbers and decimal strings; and
(f) floating-point exceptions and their handling, including non-numbers (NaNs).

1.3. Exclusions. This standard does not specify:
(a) integer representation;
(b) interpretation of signs and fraction fields of NaNs;
(c) binary ↔ decimal conversions to and from extended formats; or
(d) formats of decimal strings.

Preliminary-Subject to Revision

2. Definitions

2.1. User. The user of a floating-point system is considered to be any person, hardware, or program, not itself specified by this standard, having access to and controlling those operations of the programming environment specified in this standard.

2.2. Binary floating-point number. A bit-string characterized by three components, a sign, a signed exponent, and a significand. Its numerical value, if any, is the signed product of its significand and two raised to the power of its exponent. In this document a bit-string is not always distinguished from a number it may represent.

2.3. Exponent. That component of a binary floating-point number which normally signifies the power to which two is raised in determining the value of the represented number. Occasionally the exponent is called the signed or unbiased exponent.

2.4. Biased exponent. The sum of the exponent and a constant (bias) chosen to make the biased exponent's range non-negative.

2.5. Significand. That component of a binary floating-point number which consists of an explicit or implicit leading bit to the left of its binary point and a fraction field to the right of the binary point.

2.6. Fraction. That field of the significand that lies to the right of its implied binary point.

2.7. Normal zero. The exponent is the format's minimum, and the significand is zero. Normal zero may have either a positive or a negative sign. Only the extended formats have any unnormalized zeros (see §2.9).

2.8. Denormalized. The exponent is the format's minimum, the explicit or implicit leading bit is zero, and the number is not normal zero. To denormalize a binary floating-point number means to shift its significand right while incrementing its exponent, until it becomes a denormalized number.

2.9. Unnormalized. Occurs only in the extended format. The number's exponent is greater than the format's minimum, and the explicit leading bit is zero. If the significand is zero, this is an unnormalized zero.

2.10. Normalize. If the number is nonzero, shift its significand left while decrementing its exponent until the leading significand bit becomes one; the exponent is regarded as if its range were unlimited. If the significand is zero, the number becomes normal zero. Normalizing a number does not change its sign.

2.11. NaN. Not a number; a symbolic entity encoded in floating-point format. See §3 and §6.2.

2.12. Status flag. A variable which may take two states, set and clear. A program may clear or copy a flag. When set, a status flag may contain additional system-dependent information, possibly inaccessible to the program. The operations of this standard may, as a side effect, set some of the following flags: inexact result, underflow, overflow, divide by zero, and invalid operation.

2.13. Destination.
Every unary or binary operation delivers its result to a destination, either explicitly designated by the user or implicitly supplied by the system (e.g., intermediate results in subexpressions or arguments for procedures). Some languages place the results of intermediate calculations in destinations beyond the programmer's control. Nonetheless, this standard defines the result of an operation in terms of that destination format as well as the operands' values.

2.14. Mode. A mode is a variable which a program may set, sense, save and restore, to control the execution of subsequent arithmetic operations. The default mode is that mode which a program can assume to be in effect unless an explicitly contrary statement is included either in the program or its specification. The standard entails the modes
(a) projective/affine, which concerns the interpretation of ∞,
(b) rounding direction, which concerns the direction of rounding errors, and, in certain implementations,
(c) rounding precision, to shorten the precision of intermediate results.
Optionally, an implementor may provide the modes
(d) warning/normalizing, for handling underflowed values, and
(e) traps disabled/enabled, for handling exceptions.

2.15. Shall and should. In this standard, the use of the word "shall" signifies that which is obligatory in any conforming implementation; the use of the word "should" signifies that which is strongly recommended as being in keeping with the intent of the standard, despite architectural or other constraints beyond the scope of this standard that may on occasion render the recommendations impractical.

3. Formats

This standard defines four floating-point formats in two groups, basic and extended, each having two widths, single and double. The standard levels of implementation are distinguished by the combinations of formats supported.

Figure 1. Single-precision format. [Diagram: fields s, e (8 bits), and f, with bit positions 31 and 0 marked.]

Figure 2. Double-precision format. [Diagram: fields s, e (11 bits), and f, with bit positions 63 and 0 marked.]
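As a concrete reading of the single format of Figure 1, whose field values are specified in §3.1.1, the sketch below splits a 32-bit pattern into s, e, and f and evaluates it by the five cases (a) through (e) of that clause. It is an illustration under stated assumptions, not part of the proposal: the function names are invented here, NaN and ∞ are reported as strings, and the sign bit is assumed to occupy bit 31 with the exponent in bits 30 to 23.

```python
import struct
from fractions import Fraction


def decode_single(bits):
    """Return (s, e, f, value) for a 32-bit pattern, following section 3.1.1.

    value is an exact Fraction for finite numbers, or one of the strings
    "NaN", "+inf", "-inf". The sign of a zero is carried only by s.
    """
    s = (bits >> 31) & 0x1
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    sign = -1 if s else 1
    if e == 255:
        # Cases (a) and (b): NaN when f is nonzero, otherwise infinity.
        value = "NaN" if f else ("-inf" if s else "+inf")
    elif e == 0:
        # Cases (d) and (e): denormalized numbers and zero, 2**-126 * (0.f).
        value = sign * Fraction(f, 2 ** 23) * Fraction(2) ** -126
    else:
        # Case (c): normalized numbers, 2**(e - 127) * (1.f).
        value = sign * (1 + Fraction(f, 2 ** 23)) * Fraction(2) ** (e - 127)
    return s, e, f, value


def bits_of(x):
    """Bit pattern of x after conversion to a 32-bit float (big-endian view)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


if __name__ == "__main__":
    for pattern in (0x3F800000, 0x00000001, 0x7F800000, 0x7FC00000):
        print(hex(pattern), decode_single(pattern))
    print(decode_single(bits_of(-0.15625)))
```

Using exact fractions keeps the denormalized case visible: the pattern 0x00000001 decodes to 2**-149, which no normalized single number can represent.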
3.1. Basic formats

3.1.1. Single. A 32-bit format for a binary floating-point number X is divided as shown in Figure 1. The component fields of X are the 1-bit sign s, the 8-bit biased exponent e, and the 23-bit fraction f. The value v of X is as follows:
(a) If $e = 255$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = 255$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $0 < e < 255$, then $v = (-1)^s 2^{e-127} (1.f)$.
(d) If $e = 0$ and $f \neq 0$, then $v = (-1)^s 2^{-126} (0.f)$.
(e) If $e = 0$ and $f = 0$, then $v = (-1)^s 0$ (zero).

3.1.2. Double. A 64-bit format for a binary floating-point number X is divided as shown in Figure 2. The component fields of X are the 1-bit sign s, the 11-bit biased exponent e, and the 52-bit fraction f. The value v of X is as follows:
(a) If $e = 2047$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = 2047$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $0 < e < 2047$, then $v = (-1)^s 2^{e-1023} (1.f)$.
(d) If $e = 0$ and $f \neq 0$, then $v = (-1)^s 2^{-1022} (0.f)$.
(e) If $e = 0$ and $f = 0$, then $v = (-1)^s 0$ (zero).

3.2. Extended formats

3.2.1. Single-extended. Extended is an implementation-dependent format. An extended binary floating-point number X has four components: a 1-bit sign s, an exponent e of specified range combined with an implementation-dependent bias, a 1-bit integer part j, and a fraction f with at least 31 bits. The exponent shall range between a minimum value $m \le -1023$ and a maximum value $M \ge +1024$. The value v of X is as follows:
(a) If $e = M$ and $f \neq 0$, then $v = \mathrm{NaN}$.
(b) If $e = M$ and $f = 0$, then $v = (-1)^s \infty$.
(c) If $m < e < M$, then $v = (-1)^s 2^{e} (j.f)$.
(d) If $e = m$ and $j = f = 0$, then $v = (-1)^s 0$ (normal zero).
(e) If $e = m$ and j or f is nonzero, then $v = (-1)^s 2^{e'} (j.f)$, where $e' = m$ or $m + 1$ at the implementor's option.

3.2.2. Double-extended. The double-extended format is the same as single-extended described in §3.2.1, except that the exponent shall range between $m \le -16383$ and $M \ge +16384$, and the fraction shall have at least 63 bits.

3.2.3. Exponent range.
An implementation of this standard is not required to provide (and the user should not assume) that single-extended have greater range than double.

3.3. Combinations of formats. All implementations conforming to this standard shall support single. Implementations should support the extended format corresponding to the widest basic format supported, and need not support any other extended format.[2]

[2] Only if upward compatibility and speed are important issues should a system supporting the double-extended format also support single-extended.

4. Rounding

Except for binary ↔ decimal conversion, all operations specified in §5 and §7 shall be performed as if correct to infinite precision, then rounded according to the specifications in this section. Rounding takes a number regarded as infinitely precise and, if necessary, modifies it to fit in the destination's format while signalling that it is inexact (see §8.5).

4.1. Default rounding mode. An implementation of this standard shall provide round to nearest, with rounding to even in case of a tie, as the default rounding mode. When rounding to nearest, the result shall differ from the infinite precision exact result by at most one half in the least-significant-digit position; rounding to even means that if the difference is exactly half then the rounded result shall have an even last digit.

4.2. Directed rounding modes. An implementation shall provide user-selectable positive- and negative-directed rounding (round toward +∞ and round toward -∞) and truncation (round toward 0) for all operations. When rounding toward +∞, the result shall be the format's value (possibly +∞) closest to and no less than the infinitely precise result, except as specified in §8.3; analogously, when rounding toward -∞, the result shall be the format's value (possibly -∞) closest to and no greater than the infinitely precise result, except as specified in §8.3.
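The rounding directions of §4.1 and §4.2 (round to nearest even, toward +∞, toward -∞, and the truncation toward 0 that §4.2 also requires) can be sketched on exact rationals. The code below is an illustrative model only, with assumptions stated plainly: the function name and mode strings are invented here, the exponent range is treated as unbounded so overflow and underflow (§8) are ignored, and p is the number of significand bits kept.

```python
import math
from fractions import Fraction


def round_to_precision(x, p, mode="nearest"):
    """Round the exact value x to p significand bits in the given direction."""
    x = Fraction(x)
    if x == 0:
        return x
    a = abs(x)
    # Find k with 2**k <= |x| < 2**(k + 1).
    k = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** k > a:
        k -= 1
    ulp = Fraction(2) ** (k - p + 1)   # weight of the last kept bit
    q = x / ulp                        # exact, in units of the last place
    lo = math.floor(q)
    if q == lo:
        return x                       # already representable, result is exact
    hi = lo + 1
    if mode == "nearest":
        diff = q - lo                  # the discarded part, strictly between 0 and 1
        if diff < Fraction(1, 2):
            n = lo
        elif diff > Fraction(1, 2):
            n = hi
        else:
            n = lo if lo % 2 == 0 else hi   # tie: choose the even last digit
    elif mode == "toward +inf":
        n = hi
    elif mode == "toward -inf":
        n = lo
    elif mode == "toward 0":
        n = lo if x > 0 else hi
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return n * ulp


if __name__ == "__main__":
    for value in (Fraction(17, 16), Fraction(19, 16), Fraction(-17, 16)):
        for mode in ("nearest", "toward +inf", "toward -inf", "toward 0"):
            print(value, mode, round_to_precision(value, 4, mode))
```

In hardware terms, the three-way test on the discarded part corresponds to inspecting the first discarded bit together with the OR of the remaining ones; here the exact fraction plays that role. Running both directed modes on the same computation brackets the exact result, which is the use in interval arithmetic.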
When rounding toward 0, the result shall be the format's value closest to and no greater in magnitude than the infinitely precise result. The rounding modes may affect the signs of zero sums (see §6.3).

4.3. Rounding precision. Normally a result is rounded to the precision of its destination. However, some hardware will always deliver results from single format operands to double or extended destinations. On such a system the user, which may be a high-level language compiler, shall be able to specify that a result be rounded instead to single precision, though it is stored in the double or extended format with its wider exponent range.[3] Similarly, a system that delivers all results from double format operands to extended destinations shall permit the user to specify rounding to double precision. Note that to meet the specifications in §4.1, the result cannot suffer more than one rounding error.

[3] Rounding precision control is intended to allow systems whose destinations are always double or extended to mimic systems with single and double destinations. However, use of precision control to combine double (or extended) operands to produce a single format result with just one rounding is considered nonstandard.

5. Operations

All conforming implementations of this standard shall provide add, subtract, multiply, divide, square root, remainder, floating-point format conversions, conversions between floating-point and integers, binary ↔ decimal conversions, and comparisons.

When all operands are normalized, the operations shall be performed as if to infinite precision before rounding as specified in §4. §7 specifies the results when at least one of the operands is not normalized. §6 augments the specifications to cover signed zero and ∞ and NaN; §8 enumerates exceptions.

5.1. Arithmetic. An implementation shall provide add, subtract, multiply, divide, and remainder for any two operands of the same format, for each supported format; it should also provide the operations for operands of differing formats. The destination format (regardless of the rounding precision control of §4.3) shall be at least as wide as the operands' format. All results shall be rounded as specified in §4.

The remainder r = x REM y is defined regardless of the rounding mode by the following relation when y ≠ 0:

    $r = x - y \times n$

where n is the integer nearest x/y; n is even whenever $|n - x/y| = 1/2$. Note that with this definition the remainder is exact. The result shall be normalized unless it underflows.

5.2. Square root. The square root operation shall be provided in all supported formats and is defined for all normalized operands ≥ 0; $\sqrt{-0} = -0$. The destination format shall be at least as wide as the operand's. The result shall be rounded as specified in §4.

5.3. Floating-point format conversions. It shall be possible to convert floating-point numbers between all supported formats. If the conversion is to a less wide precision, the result shall be rounded as specified in §4. If the conversion is to a wider precision, it shall be exact, although an invalid result exception may be raised as specified in §8.1.2.

5.4. Conversion between floating-point and integer. It shall be possible to round a floating-point number to an integer value in the same floating-point format, for all supported formats. The rounding shall be as specified in §4, with the understanding that in round to nearest mode, if the difference between the unrounded operand and the rounded result is exactly one half, the rounded result is even.

It shall be possible to convert between all supported floating-point formats and all supported integer formats. Conversion to integer shall be effected by rounding as specified in §4. Conversions between floating-point integers and integer formats shall be exact unless an exception arises as specified in §8.1.1.

5.5. Binary ↔ decimal conversion. Conversion between decimal strings in at least one format and binary floating-point numbers in all supported basic formats shall be provided for numbers throughout the ranges specified in Table 1. The integers M and N in Tables 1 and 2 below are such that the decimal strings have values $M \times 10^{\pm N}$. On input, trailing zeros shall be appended to or stripped from M (up to the limits specified in Table 1) in order to minimize N. When the destination is a decimal string, its least-significant digit should be located by format specifications for purposes of rounding.

Conversions shall be correctly rounded as specified in §4 for operands within the ranges specified in Table 2. Otherwise the error in the converted result shall not exceed by more than 0.47 units in the destination's least-significant digit the error that would be incurred by the rounding specifications of §4, provided that exponent over/underflow does not occur.

Conversions shall be monotonic. That is, increasing the value of a binary floating-point number shall not decrease its value when converted to a decimal string, and increasing the value of a decimal string shall not decrease its value when converted to a binary floating-point number.

When rounding to nearest, conversion from binary to decimal and back to binary shall be the identity as long as the decimal string is carried to the maximum precision specified in Table 1, namely nine digits for single and 17 digits for double.[4]

If decimal to binary conversion over/underflows, the response is as specified in §8. Over/underflow and NaNs and infinities encountered during binary to decimal conversion should be indicated to the user by appropriate strings.

Table 1. Decimal conversion ranges.

            Decimal to binary        Binary to decimal
Format      Max M        Max N       Max M        Max N
Single      10^9 - 1     99          10^9 - 1     54
Double      10^17 - 1    999         10^17 - 1    341

Table 2. Correctly rounded decimal conversion range.

            Decimal to binary        Binary to decimal
Format      Max M        Max N       Max M        Max N
Single      10^9 - 1     13          10^9 - 1     13
Double      10^17 - 1    27          10^17 - 1    27

5.6. Comparison.
It shall be possible to compare floating-point numbers in all supported formats, including comparisons between operands of differing formats. Comparisons are exact and never overflow or underflow. Four mutually exclusive relations are possible: "less than," "equal," "greater than," and "unordered." The last case arises when at least one operand is NaN, or when ∞ in the projective mode is compared to anything other than ∞; every NaN shall compare "unordered" with everything, including itself. Comparisons shall ignore the sign of infinity in the projective mode (where +∞ = -∞), and shall ignore the sign of zero (so +0 = -0).

[4] The properties specified for conversions are implied by error bounds that depend on the format (single or double) and the number of decimal digits involved; the 0.47 mentioned is a worst-case bound only. For a detailed discussion of these error bounds and economical conversion algorithms that exploit the extended format, see "Binary Decimal Conversion in KCS Arithmetic," by Jerome T. Coonen (to appear).

5.6.1. Condition codes. When the result of a comparison is reported via condition codes, the result shall be an encoding of one of the four relations listed in §5.6.

5.6.2. Predicates. When the result of a comparison is reported as an affirmation or negation of a predicate, the following implications shall determine that response:
(a) The relation "less than" affirms the predicates <, ≤, ≠, and denies the predicates =, ≥, >, unordered.
(b) The relation "equal" affirms the predicates =, ≤, ≥, and denies the predicates <, >, ≠, unordered.
(c) The relation "greater than" affirms the predicates >, ≥, ≠, and denies the predicates =, ≤, <, unordered.
(d) The relation "unordered" affirms the predicates ≠, unordered, and denies the predicates <, ≤, =, ≥, >.
In addition to the response specified above, an invalid operation exception (see §8.1) shall arise in a comparison just when two values whose relation is "unordered" are compared via a predicate involving <, ≤, ≥, >, or their negations, as specified in §8.1.1(h).

6. Infinity, NaNs, and signed zero

6.1. Infinity arithmetic. Infinity arithmetic shall be construed as the limiting case of real arithmetic with operands of arbitrarily large magnitude, when such a limit exists. Infinity arithmetic shall be supported under two user-selectable modes, affine and projective, with projective being the default. In affine mode -∞ < (every finite number) < +∞, but in projective mode infinities compare "equal" regardless of sign and compare "unordered" with everything else. Consequently, the two modes are distinguished by exceptions in add, subtract, square root, and compare, as specified in §8.

Except for the invalid operations specified for ∞, arithmetic upon ∞ is always exact and therefore shall raise no exceptions. The three exceptions that do pertain to ∞ are raised only when
(a) ∞ is created from finite operands by overflow (§8.3) or division by zero (§8.2), with the corresponding trap disabled, or
(b) ∞ is an invalid operand (§8.1).

6.2. Operations with NaNs. Two different kinds of NaN, trapping and nontrapping, shall be supported in all operations. Trapping NaNs shall be reserved operands which precipitate an invalid operation exception (§8.1.1) or some other implementation-dependent exception for every operation listed in §5 that is performed upon them.[5]

Nontrapping NaNs shall obey the following rules; these NaNs should, by means left to the implementor's discretion, afford retrospective diagnostic information inherited from invalid or unavailable data and results.
For those operations specified to deliver floating-point results,
(a) every operation involving a trapping NaN or invalid operation (§8.1), if no trap occurs, shall set the invalid operation flag and deliver in place of its invalid result a nontrapping NaN;
(b) every operation involving one or two input NaNs, none of them trapping, shall raise no exception but deliver as a result either the same NaN (if operating upon just one) or one or the other of the input NaNs, according to an implementation-dependent precedence rule.
The operations not covered in this paragraph, namely those which do not deliver a floating-point result, are comparison (§5.6) and conversion to a format that has no NaNs (§5.4 and §5.5).

6.3. The sign bit. This standard says nothing about the sign of a NaN. Otherwise the sign of a product or quotient is the exclusive OR of the operands' signs; and the sign of a sum, or of a difference x − y regarded as a sum x + (−y), differs from at most one of the addends' signs. These rules shall apply even when operands or results are zero or infinite. When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, either normal or unnormalized (see §7), the sign of that sum (or difference) shall be "+" in all rounding modes except round toward −∞, in which mode that sign shall be "−". However, x + x = x − (−x) retains the same sign as x even when x is zero. A valid square root can have a negative sign only when the operand is −0.

7. Unnormalized and denormalized arithmetic

The default⁶ mode of arithmetic, when at least one operand is not normalized, shall obey the following rules. Rounding and over/underflow handling are performed after the operations specified here and may modify the results. In the following specifications expon(x) refers to the unbiased exponent of x.
(a) Add or subtract (z := x ± y): If at least one of the operands having exponent m, where m = max(expon(x), expon(y)), is normalized, then z shall be normalized before rounding. Otherwise expon(z) = m.
(b) Multiply (z := x × y): expon(z) = expon(x) + expon(y), with the same exceptions as noted in §7(c).
(c) Divide (z := x/y): expon(z) = expon(x) − expon(y) − 1 when y is normalized and nonzero, except that when only one of x and y is unnormalized and the other is normal 0 or ∞ the result z is the same (normal 0 or ∞ or invalid) as if the unnormalized operand were replaced by its normalized equivalent. Otherwise an exception shall be signalled as specified in §8.1.
(d) Remainder (z := x REM y): z shall be calculated as if x were first normalized.
(e) Square root is an invalid operation if its operand is not normalized.
(f) Conversion (z := x): expon(z) = expon(x).
(g) Integer conversion (z := IntegerPart(x)): If expon(x) > the number of fraction bits, then z shall be identically x. Otherwise z shall be normalized.
(h) Conversion of denormalized binary floating-point numbers to decimal forms representing values M × 10^N, where M and N are integers, should use leading zeros in the representation of M to indicate the degree to which the number is denormalized.
(i) Compare: comparisons shall be made as if both operands had first been normalized.

7.1. Normalizing mode. Another mode of arithmetic should be provided which normalizes all denormalized values before performing arithmetic with them, and hence precludes the creation of new unnormalized operands.⁷ This applies to all operations listed in §7. Unnormalized operands shall not be affected by this mode.

5. These NaNs afford arithmetic-like enhancements (such as complex-affine infinities, extremely wide range, etc.) that are not the subject of the standard. However, if there is no special trap designated and enabled for these NaNs, then the invalid operation exception is raised as specified in §8.1.1.

6. These default rules are analogous to those for normalized numbers, though they tend more toward excessive caution than optimal utility, and offer pipelined processors a faster but second-best alternative to providing an optional normalizing mode described later. More useful than these rules, but probably harder to implement, are the rules for significance arithmetic.

7. In many computations the loss of significance due to denormalization is not consequential, and the invalid results (see §8.1.2) that arise in the warning mode are too pessimistic. Thus implementors are strongly encouraged to support the normalizing mode, except perhaps in pipelined array processors with an extended format for accumulation of intermediate sums. Use of the extended format, with its explicit leading significant bit, permits the calculation of intermediate products and quotients involving denormalized numbers without invalid result exceptions.

8. Exceptions

There are five types of exceptions that shall be detected. A trap under user control should be associated with each exception as specified in §9. The default response to an exception shall be to proceed without a trap. This standard specifies results to be delivered in both trapping and nontrapping situations. In some cases the result is different if a trap is enabled. For each type of exception the implementation shall provide a status flag which shall be set on any occurrence of the corresponding exception when no trap occurs. It shall be reset only at the user's request. The user shall be able to test and to alter the status flags individually, and should further be able to save and restore all five at one time.

8.1. Invalid operation. There are two kinds of invalid operation exception. One, called invalid operand, arises if an operand is invalid for the operation to be performed. The other, called invalid result, arises if the result is invalid for the destination. The result to be delivered, when either kind of invalid operation exception occurs without a trap, shall be a nontrapping NaN (see §6.2).

8.1.1. Invalid operand. Invalid operation shall be signalled in the following cases:
(a) if any operand is a trapping NaN (see §6.2) and no other (implementation-dependent) trap is designated and enabled;
(b) addition or subtraction ∞ ± ∞ in projective mode and magnitude subtraction of infinities like (+∞) + (−∞) in affine mode;
(c) multiplication 0 × ∞;
(d) division 0/0, ∞/∞, or the divisor is not normalized and the dividend is finite and not normal zero;
(e) remainder x REM y, where y is zero or not normalized, or x is infinite;
(f) square root if the operand is less than zero, ∞ in the projective mode, or not normalized;
(g) conversion of a binary floating-point number to an integer or decimal format when overflow, infinity, or NaN precludes a faithful representation in that format and this cannot otherwise be signalled; and
(h) comparison via the predicates <, ≤, ≥, > or their negations, when the relation is "unordered."
A binary floating-point result to be delivered, when an invalid operation exception arises without a trap, shall be a nontrapping NaN.

8.1.2. Invalid result. In any operation, when the result destined for a single or double format would be unnormalized but not denormalized, invalid operation shall be signalled.⁸ When an invalid result exception coincides with an overflow or inexact or trapped underflow exception, invalid result shall take precedence; but untrapped underflows cannot be invalid results because they are denormalized.

8. This can happen only in certain cases of the following operations:
(a) When an unnormalized extended is converted to a basic format.
(b) Except in the normalizing mode, when operations upon denormalized single format operands have double destinations.
(c) Except in the normalizing mode, when a denormalized number is magnified by multiplication or division and the destination's format is single or double.

8.2. Division by zero. If the divisor is normal zero and the dividend is a finite nonzero number, then division by zero exception shall be signalled. The default result shall be a correctly signed ∞ (see §6.3).

8.3. Overflow. If a rounded result is finite and not an invalid result but its exponent is too large to represent in the target floating-point format, then overflow shall be signalled, unless the rounding mode is round toward +∞ or round toward −∞ and there is no trap on overflow. In the latter case overflows shall be rounded thus:
(a) Round toward −∞ carries normalized positive overflows to the format's largest number, and unnormalized positive overflows to the format's largest number's exponent without changing the significand.
(b) Round toward +∞ carries normalized negative overflows to the format's most negative number, and unnormalized negative overflows to the format's largest number's exponent without changing the significand.
In these two cases invalid result may arise, but only in conversions from extended to basic formats. All other cases of overflow without a trap shall yield ∞ with the appropriate sign. If the overflow trap is enabled, overflow upon conversion shall deliver to the trap handler a result in the widest format supported but rounded to the destination's precision, except that, when the result of decimal to binary conversion lies outside that range, NaN shall be delivered. All other trapped overflows shall deliver to the trap handler a result with correctly rounded significand and modified exponent. The modified exponent is the correct exponent minus a bias adjust of 192 in the single format, 1536 in double, and 3 × 2^(n−2) in extended, where n is the number of bits in the exponent field.⁹

8.4. Underflow.
Underflow occurs whenever
(a) a result which is not normal zero, when examined either before or after rounding at the implementor's option,¹⁰ is found to have too small an exponent to be represented in the destination's format without further denormalizing, or
(b) an extended format product or quotient with neither operand a normal zero, when examined either before or after rounding at the implementor's option, turns out to be indistinguishable from a normal zero. (Note that this cannot happen with normalized operands.)
When underflow occurs with no trap, the unrounded result shall be first denormalized, then rounded, then delivered to its destination; moreover the underflow flag shall be set to signal the event unless the rounding mode is round toward +∞ or −∞. If the underflow trap is enabled, the result delivered to the trap handler shall be as specified for overflow in §8.3 except that the bias adjust is added rather than subtracted.

8.5. Inexact. In the absence of an invalid operation exception, if the rounded result of an operation is not exact or if it overflows without a trap, then the inexact exception shall be signalled. The rounded or overflowed result shall be delivered to the destination.

9. The bias adjust is chosen to translate over/underflowed exponents as nearly as possible to the middle of the exponent range so that a trap handler can provide appropriate information for later reconstruction of the correct result.

10. To examine a number before rounding means to examine it as though it were first rounded toward zero.

9. Traps

A user should be able to request a trap on any of the five exceptions and to request that trap to be disabled. If the trap is disabled, then the corresponding exceptions shall be handled in the default manner specified in §8.
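As a worked illustration of the bias adjust used for trapped overflow (§8.3) and trapped underflow (§8.4), the sketch below evaluates 3 × 2^(n−2) and applies it to an exponent. It assumes exponent fields of 8 bits for single and 11 bits for double, taken from the format definitions earlier in the proposal; `bias_adjust` and `wrapped_exponent` are hypothetical names used only for this example.

```python
def bias_adjust(exponent_bits: int) -> int:
    """Bias adjust of §8.3 for a format whose exponent field has
    exponent_bits bits: 3 * 2**(n - 2)."""
    return 3 * 2 ** (exponent_bits - 2)

def wrapped_exponent(true_exponent: int, exponent_bits: int, kind: str) -> int:
    """Exponent delivered to the trap handler.

    kind is "overflow" (adjust subtracted, §8.3) or
    "underflow" (adjust added, §8.4).
    """
    adjust = bias_adjust(exponent_bits)
    if kind == "overflow":
        return true_exponent - adjust
    if kind == "underflow":
        return true_exponent + adjust
    raise ValueError("kind must be 'overflow' or 'underflow'")

print(bias_adjust(8))    # 192, the single-format value
print(bias_adjust(11))   # 1536, the double-format value
print(wrapped_exponent(200, 8, "overflow"))    # 8
print(wrapped_exponent(-200, 8, "underflow"))  # -8
```

In the single-format example, a true exponent of 200 is too large for the format, but the adjusted exponent 8 sits near the middle of the range, so a handler could record the adjustment and reconstruct the intended value later.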
If an exception is signalled for which the trap is enabled, then the execution of the program in which the exception occurred shall be suspended, a handling routine specified by the user shall be activated, and a result, if specified in §8, shall be delivered to the trap handler.

9.1. Trap handler. For each trap supported, the user shall be able to specify a trap handler having the capabilities of a subroutine that can return a value to be used in lieu of the exceptional operation's result; this result is undefined unless delivered by the trap handler. Similarly, that trapped exception's flag(s) may be undefined unless set or reset by the trap handler. When a system traps, the trap handler should be able to determine
(a) the type(s) of exception(s) that occurred on this operation,
(b) the kind of operation that was being performed,
(c) the destination's format,
(d) in overflow, underflow, inexact, and invalid result, the correctly rounded result including information that might not fit in the destination's format, and
(e) in invalid operand and divide by zero, the operand values.

Appendix: Recommended functions and predicates

The following functions and predicates are recommended as aids to program portability across different systems, perhaps performing arithmetic very differently. They are described generically; that is, the types of the operands and results are inherent in the operands. Languages that require explicit typing will have corresponding families of functions and predicates.
(a) copysign(x, y) returns x with the sign of y. Hence, abs(x) = copysign(x, 1.0).
(b) −x is x with its sign reversed.
(c) scalb(x, N) returns the product of x and 2^N, for integral values N; this is accomplished by adding N to the exponent of x and then checking for exceptional conditions.
(d) logb(x) returns the unbiased exponent of x, a signed integer in the format of x, except that logb(0) is −∞, logb(∞) is +∞, and logb(NaN) is that NaN.
When x is positive and finite, 1 ≤ scalb(x, −logb(x)) < 2 except when x is denormalized in the warning mode or unnormalized.
(e) nextafter(x, y) returns the next representable neighbor of x in the direction toward y. If x = y, or either x or y is ∞ in the projective mode or a NaN, then x is returned.
(f) finite(x) returns the value TRUE if −∞ < x < +∞ and returns FALSE otherwise.
(g) isnan(x), or equivalently x ≠ x, returns the value TRUE if x is a NaN and returns FALSE otherwise.
(h) x <> y is TRUE only when x < y or x > y, and is distinct from x ≠ y, which means NOT (x = y) and is never an invalid operation.
(i) unordered(x, y) returns the value TRUE if x is unordered with y and returns FALSE otherwise; this is never an invalid operation.

IEEE P754 voting committee members at time of adoption of the proposed draft

Andrew Allison, Los Altos Hills, California
William Ames, Hewlett-Packard Data Systems
Mike Arya, Cupertino, Calif.
Janis Baron, Intel
Dileep Bhandarkar, Digital Equipment Corporation
Joel Boney, Motorola
Jim Bunch, University of California, La Jolla
Ed Burdick, National Semiconductor
Paul Clemente, Prime Computer
W. J. Cody, Argonne National Laboratory
Jerome T. Coonen, University of California, Berkeley
Jim Crapuchettes, Menlo Computer Associates
Richard H. Delp, Four-Phase Systems
Alvin Despain, University of California, Berkeley
Tom Eggers, Digital Equipment Corporation
Dick Fateman, University of California, Berkeley
Don Feinberg, Digital Equipment Corporation
Stuart Feldman, Bell Laboratories
Eugene Fisher, Lawrence Livermore National Laboratory
Paul F. Flanagan, Analytical Mechanics
Gordon Force, Kylex
Lloyd Fosdick, University of Colorado
Robert Fraley, Hewlett-Packard Laboratories
Howard Fullmer, Parasitic Engineering
Daniel D. Gajski, University of Illinois, Urbana
David Gay, Massachusetts Institute of Technology
C. W. Gear, University of Illinois, Urbana
Martin Graham, University of California, Berkeley
David Gustavson, Stanford Linear Accelerator Center
Guy Haas, Datapoint
Chuck Hastings, Data General
David Hough, Apple Computer
John E. Howe, Intel
Thomas E. Hull, University of Toronto
Suren Irukulla, Prime Computer
Richard James III, Santa Clara, California
Paul S. Jensen, Lockheed Research Laboratory
William Kahan, University of California, Berkeley
Howard Kaikow, Nashua, New Hampshire
Dick Karpinski, University of California, San Francisco
Virginia Klema, Massachusetts Institute of Technology
Les Kohn, National Semiconductor
Dan Kuyper, Sperry Univac
M. Dundee Maples, M & E Associates
John Markiel, Westmont, New Jersey
Roy Martin, Apple Computer
Dean Miller, Motorola
Webb Miller, University of California, Santa Barbara
John C. Nash, Vanier, Ontario, Canada
Dan O'Dowd, National Semiconductor
Cash Olsen, Signetics
John F. Palmer, Intel
Beresford Parlett, University of California, Berkeley
Dave Patterson, University of California, Berkeley
Mary Payne, Digital Equipment Corporation
Tom Pittman, Itty Bitty Computers
Lewis Randall, Apple Computer
Robert Reid, Dunstable, Massachusetts
Christian Reinsch, Leibniz-Rech/Bay. Akad. Wiss.
Roger Stafford, Beckman Instruments
David Stevenson, Zilog
G. W. Stewart, University of Maryland
Robert G. Stewart, Stewart Research Enterprises
Harold Stone, University of Massachusetts
William D. Strecker, Digital Equipment Corporation
Robert Swarz, Digital Equipment Corporation
George Taylor, University of California, Berkeley
Dar-Sun Tsien, Intel
Greg Walker, Motorola
John Stephen Walther, Hewlett Packard Laboratories
P. C. Waterman, Burlington, Massachusetts
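Several of the functions recommended in the Appendix have close counterparts in present-day language libraries. The following sketch, which is an illustration rather than a conforming implementation, expresses a few of them with Python's `math` module (`math.nextafter` needs Python 3.9 or later). The wrappers `scalb`, `logb`, and `unordered` are hypothetical helpers written for this example; they assume affine infinities and treat denormalized inputs as if normalized first.

```python
import math

def scalb(x: float, n: int) -> float:
    """x * 2**n for integral n, via the standard library's ldexp."""
    return math.ldexp(x, n)

def logb(x: float) -> float:
    """Unbiased exponent of x, with the special cases of Appendix (d).

    Assumption: denormalized inputs are treated as if normalized first,
    because math.frexp normalizes its argument.
    """
    if math.isnan(x):
        return x                    # logb(NaN) is that NaN
    if math.isinf(x):
        return math.inf             # logb(infinity) is +infinity
    if x == 0.0:
        return -math.inf            # logb(0) is -infinity
    # frexp gives x = m * 2**e with 0.5 <= |m| < 1, so the exponent is e - 1.
    return float(math.frexp(x)[1] - 1)

def unordered(x: float, y: float) -> bool:
    """True when x and y compare unordered (affine mode: a NaN is involved)."""
    return math.isnan(x) or math.isnan(y)

x = 10.0
print(math.copysign(-3.0, 1.0))     # 3.0, i.e. abs(-3.0)
print(logb(x))                      # 3.0
print(scalb(x, -int(logb(x))))      # 1.25, which lies in [1, 2)
print(math.nextafter(1.0, 2.0) - 1.0 == 2.0 ** -52)   # True
print(math.isfinite(math.inf), unordered(math.nan, 1.0))
```

The third printed line checks the property stated under (d): for positive finite x, scaling by the negated unbiased exponent brings the value into the interval from 1 up to, but not including, 2.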
