Running head: THE FIXED-POINT REPRESENTATION OF REALS

The Fixed-Point Representation of Real Numbers

Anthony DesArmier

El Paso Community College

Abstract

Computers are remarkable devices, capable of processing enormous amounts of information at impressive speeds. They were not designed quickly or easily, and much work by computer scientists and engineers has gone into developing these machines. One aspect of computers often taken for granted by programmers is how they represent real number values beyond integers. By examining the designs and techniques that have been developed, programmers will gain a better understanding of how to represent rational numbers and avoid common misconceptions and mistakes.

The Fixed-Point Representation of Real Numbers

Computers can execute several million instructions per second, handling and processing everything from complex abstract data structures (particular ways of organizing data in a computer so that it can be used efficiently) to the most basic branch of mathematics: arithmetic. Electronic devices, with features as small as a few nanometers, are capable of processing arithmetic operations at speeds many orders of magnitude beyond most humans. However, as complex as they have become, they are incapable of the level of thought humans use instinctively. Many data structures have been developed to represent data efficiently. A very popular primitive data type is the floating-point number, able to represent non-integer rational values. Many general programmers are unaware of some of its limitations and of what to do when it is not available, typically learning a different technique only out of necessity. Computers accept and process data in a predefined way, making binary logical choices along the way, although parts of this process can be abstracted away from the programmer.
Consider the simplest class of instructions a computer is capable of handling: arithmetic. A computer can add two integers together and, by extension, subtract, multiply, divide, raise to powers, and take square roots. These basic instructions can be combined to handle higher and more complicated mathematical operations. However, many problems and issues must be addressed for a computer to do these calculations properly, owing to limitations in its fundamental design. One such issue arises when a computer is required to handle a number besides an integer, such as a non-integer real number. In the worst case this value must be approximated through the use of certain data structures, and care must be taken to evaluate it with a minimum tolerance for error. Computers store integers as sequences of bits, which are digits with values of either 1 or 0. There are two common encodings for integers: base-2, a positional notation system with a radix of 2; and binary coded decimal (BCD), in which each decimal digit is encoded in a fixed number of bits, with a radix of 10. When storing integers that can be either positive or negative, a single bit representing the sign is appended to the base-2 string, or to a predetermined bit length if BCD encoding is used. The computer hardware reads, stores, and manipulates these representations directly. Handling integers is quite straightforward, unlike handling other real numbers. Fractional values, which often can only be stored approximately, need to be encoded in a different way. The most common encoding method is the floating-point format, established under a technical standard known as the IEEE Standard for Floating-Point Arithmetic (IEEE 754) in 1985 and revised in 2008 (IEEE, 2008).
Before this format was commonplace, many programmers used a practical technique called fixed-point, which, although now mostly forgotten, still has advantages over floating-point when computing resources are constrained. Floating-point representation is so popular and well known that the vast majority of hardware systems include a specialized device known as a Floating-Point Unit (FPU), also called a math co-processor, designed purely to handle this encoding, with its own instruction sets for operations on that data. Fixed-point, by contrast, while used extensively before floating-point was introduced, is now not even mentioned in many computer programming textbooks, nor is it widely known among general computer science professors (R. Escalante-Ruiz, personal communication, February 23, 2016); it is found almost exclusively in Digital Signal Processing (DSP) manuals (e.g., TMS320UC5402, 2008; TMS320VC5402A, 2008; TMS320VC5410A, 2008; TMS320VC5471, 2008), video codec design textbooks (e.g., Richardson, 2002, p. 191), specialized computer arithmetic textbooks (e.g., Parhami, 2000, p. 8), and other embedded systems in which no FPU is used (Labrosse, 1998). Embedded systems are computer systems found within a larger electrical system, usually designed to accomplish one task. Examples include radar equipment, missile guidance systems, fuel emission systems in cars, internet routers, home appliances, and even television remotes and calculators. They are often designed to be low in power consumption, small, able to handle weather conditions, and very cheap to mass produce. To accomplish this, however, they are often built with only what is feasible given the limited resources of size, space, and cost. This usually leads to the absence of an FPU, which is relatively expensive to include when trying to cut costs for mass-scale production.
Thus, embedded systems must use techniques other than floating-point to handle non-integer arithmetic, typically through software simulation. The distinction between fixed and floating point, besides cost to implement, lies in their concepts. Floating-point encodes, in a single word, a value (the significand) and an exponent by which it is scaled. The base for scaling is typically two or ten, though two is much more widely used due to the binary nature of hardware. It is called floating because the radix point is allowed to be placed anywhere within the significand, as determined by the exponent component; it can be described as an implementation of scientific notation. Fixed-point is a value scaled by an implicit factor, set by the programmer, causing the radix point to be fixed in a specific place, hence the nomenclature. This implicit scaling is much less complicated and computationally demanding than floating-point. The multinational semiconductor and software design company ARM Holdings plc (ARM) elaborates: if the scale factor is known before compile time, it is fixed-point, and if it is not known, it is floating-point ("Application Note 33", 1996, p. 3). A fixed radix point ensures a set amount of range and precision given the word size and the number of bits available, creating a fixed range of possible values, while a floating radix point offers the ability to trade range for precision as needed (DSP, 2016, Dynamic Range and Precision section, para. 1), allowing the user to represent extremely large through extremely small values automatically. In BCD, a fixed-point value is simply an integer interpreted with an implicit decimal point after a certain number of encoded digits determined by the language used. There is no support for IEEE 754 floating-point BCD, though conversion algorithms have been made (e.g., Bende & Tembhurne, 2014) and other implementations have been designed.
(Range is determined by the largest and smallest absolute values one can represent; precision is determined by the smallest relative difference between two representable numbers. Further discussion will deal mostly with raw binary encoding rather than BCD.)

Floating-point numbers must be carefully handled and processed in order to perform arithmetic: stored values are decoded, the appropriate arithmetic functions applied, and the result re-encoded. Even basic arithmetic functions normally involve many clock cycles and numerous operations, but this cost has been alleviated through specialized algorithm designs, pipelining, and out-of-order execution, as well as by offloading work onto the dedicated FPU so the CPU is free to handle other tasks, increasing FPU performance enough to make it feasible against arithmetic logic units (ALUs) using much simpler arithmetic functions (such as fixed-point or integer). Even so, integer operations are still faster than floating-point operations (LaPlante, 2012, p. 465). Fixed-point operations involve only simple integer and bit-wise operations and some involvement from the programmer, allowing them to be used on even the cheapest microprocessors. The programmer decides how much precision and range a function needs by examining boundary cases, in order to make the best use of the number of bits available in the word size. Texas Instruments, the leading producer of DSPs (TI DSP Firsts, 2001), uses the Q format notation for binary (base 2) fixed-point:

For example, the Q15 is a popular format in which the most significant bit is the sign bit, followed by 15 bits of fraction. This Q-value specifies how many binary digits are allocated for the fractional portion of the number. (J.
Stevenson, 2002)

ARM also uses the Q format (CMSIS, 2015), though ARM's own general documentation states the notation has been superseded (ARM Information Center, 2001), and ARM uses other Q-like formats for different devices (VCVT, 2013). Many computer engineers develop their own style of representing a fixed-point number (e.g., Labrosse, 1998) due to the lack of an international standard, such as IEEE 754 for floating-point, resulting in various devices each using their own, albeit similar, notation in documentation. Regardless of notation, the methodology for handling binary fixed-point arithmetic is the same (in the following sections the mathematically focused (n, e) notation will be used):

In computing arithmetic, fractional quantities can be approximated by using a pair of integers (n, k): the mantissa and the exponent. The pair represents the fraction n · 2^(-k). The exponent k can be considered as the number of digits you have to move into n before placing the binary point. ("Application Note 33", 1996)

To convert a fractional value into a binary fixed-point format, the decimal value is multiplied by 2 to the power of the number of bits desired for precision. For example:

2.75 · 2^2 = 11

The result is an (11, 2)_2 integer, which can be stored in a regular integer variable (or register) and noted with the implicit factor of 2^2. In another example:

3.14159265359 · 2^16 ≈ 205887

The data after the decimal point is not even calculated, so the resulting value is treated as a (205887, 16)_2 integer. Note that this has become an approximation. To convert a fixed-point value back to a decimal value, the fixed-point value is divided by 2 to the power of the implied factor, known as unscaling the factor:

205887 / 2^16 = 3.1415863037109375

The result shows that 3.141586 ≈ 3.141592, which is fairly accurate.
For greater precision, more bits could be dedicated to the resolution at a further cost of range, but this is determined by the application's needs. (Precision is determined by the resolution, which is calculated as base^(-e) for fixed-point numbers; the resolution in the first example above is 2^(-2) = 0.25, which is able to represent 0.75 exactly, resulting in perfect accuracy. These multiplications and divisions by powers of 2 can be done with much faster arithmetic bit shifts: left bit-shift (<<) for multiplication and right bit-shift (>>) for division.) In reality, however, the data after the decimal point is discarded, so 3 is the returned result, and for negative numbers the result is also reduced by 1. This process, known as flooring, can create bias in the results. Consider −2.05:

−2.05 · 2^16 ≈ −134348
−134348 / 2^16 = −2.04998779296875
floor(−2.04998779296875) = −3

The closest integer to −2.05 is −2, yet −3 was returned after conversion to fixed-point and back to integer. Bias due to flooring can be avoided by simulating rounding rules, making the result more accurate (S. Stevenson, 2012, para. 5). A common method is to add 0.5, in the same format, before conversion back to an integer (S. Stevenson, 2012, para. 7). For example:

0.5 · 2^16 = 32768
(−134348 + 32768) / 2^16 = −1.54998779296875
floor(−1.54998779296875) = −2

Now the returned result is −2, which is the closest integer to −2.05. Further rounding rules may be implemented, such as round half to even, which is the default rounding mode in IEEE 754 (The GNU C Library: Rounding, 2016). After this conversion, arithmetic becomes available using only ALU functions. The ARM manual for fixed-point arithmetic (1996) provides examples and proofs for the various basic functions such as add, subtract, multiply, divide, and square root, as well as examples for determining the exponent to use. Care must be taken with regard to potential overflow.
(Flooring returns the greatest integer not greater than the operand. Under round half to even, results are rounded to the nearest representable value, with ties going to the even neighbor. Range is determined by precision, base, and word size, defined by the interval [−base^m, base^m − base^(−e)] for signed integers, with m = wordsize − 1 − e.)

As precision is increased, available range is decreased, and it becomes quite easy to overflow the word and create unusable data. This is handled by observing the maximum and minimum possible values between operations and whether the magnitude of the exponent exceeds the word size, necessitating careful coding by the programmer. The use of 2 as the base for scale factoring, in both fixed and floating point, can cause inaccuracies with various values that cannot be represented exactly by any amount of binary scaling. The most popular example is the value 0.1, or 1/10:

0001_2 / 1010_2 = 0.000110011…_2 (the pattern 0011 repeating)

As this result is non-terminating, a finite amount of precision is unable to represent the value exactly. Because of this, an expression such as 0.1 + 0.2 in binary floating point will not be exactly equal to 0.3 (Regan, 2012). Any rational number (in lowest terms) whose denominator has a prime factor that does not divide the base will produce an infinite expansion. For example, in base 10 the fraction 1/3 is non-terminating because 3 is not a factor of 10, while the fraction 1/8 does terminate because its denominator's only prime factor, 2, divides 10. In binary, 2's only prime factor is 2, and thus only fractions whose denominator is a power of 2 will terminate, such as 0.5 (1/2), whose denominator is 2^1. This inability to accurately store repeating expansions, as well as irrational numbers such as π and Euler's number, in a finite set of data has posed a challenge for various industries in the financial, commercial, and scientific sectors.
In response, IEEE 754 has put forth larger word sizes to handle greater precision, while numerical analysis is used to develop algorithms that reduce accuracy errors. Fixed-point offers an easier solution. The decision to use 2 as the base for scaling in binary fixed-point is made purely for ease of computation, as multiplications and divisions by 2 can instead be done with bit shifts, drastically increasing performance. Fixed-point works with any base, not just 2. Consider base 7:

9/7 ≈ 1.2857142857
1.2857142857… · 7^3 = 441

This result is a perfectly accurate (441, 3)_7 integer, which can be stored. This number can only be added to or subtracted from values using the same base. The least common multiple of the different bases can be used to create a common base (here a and b are scaled integers, m and n are bases, and j and k are exponents):

a/m^j + b/n^k = (a·n^k)/(m^j·n^k) + (b·m^j)/(m^j·n^k) = (a·n^k + b·m^j)/(m^j·n^k)

For example:

441/7^3 + 11/2^2 = (2^2·441)/(2^2·7^3) + (7^3·11)/(2^2·7^3)
1764/1372 + 3773/1372 = 5537/1372 = 4.0357142857…
1.2857142857… + 2.75 = 4.0357142857…

Multiplication and division are also possible:

(a/m^j) · (b/n^k) = ab/(m^j·n^k)

For example:

(441/7^3) · (11/2^2) = 4851/1372 = 3.5357142857…
1.2857142857… · 2.75 = 3.5357142857…

The result is no longer necessarily a power-of-base scale and thus must be taken as a raw value or converted into a new fixed-point value. While these methods retain perfect accuracy through at least one operation, they are very costly in performance and complex for the programmer to handle, as the denominators used in these examples are implicit and not stored in memory. When the denominator is stored in memory, the representation is considered a rational data type. It is best to use a single scale within a function whenever possible.
A popular example would be using base 10 to handle decimal currency with perfect accuracy, though fixed-point BCD would suit this just as well. Another fixed-point technique involves exploiting overflow and underflow of integer registers to simulate modulo logic very quickly and to simplify look-up table mapping. For example, suppose two two-dimensional points are given as (x, y) integer pairs. An angle can be derived in relation to these two points using trigonometric functions and fixed-point operations. This result is then immediately scaled by 2^−(wordsize−1) and stored as a binary angular measurement (BAM). A look-up table is made to map BAM values to degree values so that 0 is 0 degrees, 0.5 is 90 degrees, −1.0 and 0.999… are 180 degrees, and −0.5 is 270 degrees, or any further derived set of mapped values (e.g., Sanglard, 2010, Walls section, para. 3). These BAM values can be added and subtracted together, multiplied and divided by scalars, and put through the look-up table at the last stage to return a decimal angular value. Even if a BAM function produces a value beyond its range, the overflow or underflow wraps the result back into the valid range and the rotation of the angles remains correct, eliminating the need for modulo logic such as checks for ≥ 360 and < 0, a costly operation (LaPlante, 2012, p. 467). According to LaPlante (2012), "BAM is frequently used in many applications including navigation software, robotic control, and in conjunction with digitizing imaging devices" (p. 467). This technique may also be used to count a loop of 60 seconds accurately instead of 360 degrees, and extended to minutes, then to hours on a loop of 24, and so on. Given all these functions, techniques, and encodings, fixed-point is a powerful tool that can be used as an alternative to floating-point and should not be quickly dismissed as obsolete, archaic, or too abstract to learn.
(BAM range is defined as the interval [−1, 1 − 2^−(wordsize−1)]; −1.0 and 0.999… are its minimum and maximum values.)

While floating-point arithmetic has been heavily researched and developed in hardware in order to simplify the use of fractional values for the programmer, it is not without caveats, including cost, performance, and numerical error. Fixed-point allows programmers, with a bit of thought, to manipulate and calculate fractional values with more granularity and in more applications than floating-point alone can offer, while also saving processing time and resource costs. Fixed-point arithmetic and rational data types beyond integers can benefit programmers of all types and should not be left to be taught out of necessity in the workplace.

References

Application Note 33 [Computer software manual]. (1996, September). Retrieved 2016-30-03, from http://infocenter.arm.com/help/topic/com.arm.doc.dai0033a/DAI0033A_fixedpoint_appsnote.pdf

ARM Information Center. (2001). Retrieved 2016-21-03, from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0066d/CHDFAAEI.html

Bende, S., & Tembhurne, P. S. (2014). Design of BCD to Floating Point Converter Based On Single Precision Format. IOSR Journal of Electrical and Electronics Engineering, 43-45.

CMSIS - Cortex Microcontroller Software Interface Standard. (2015). Retrieved 2016-30-03, from http://www.arm.com/products/processors/cortex-m/cortex-microcontroller-software-interface-standard.php

Fixed-Point vs. Floating-Point Digital Signal Processing. (2016). Retrieved 2016-22-03, from http://www.analog.com/en/education/education-library/articles/fixed-point-vs-floating-point-dsp.html

The GNU C Library: Rounding. (2016). Retrieved 2015-12-03, from https://www.gnu.org/software/libc/manual/html_node/Rounding.html

IEEE Standard for Floating-Point Arithmetic [Computer software manual]. (2008, August).
Retrieved 2015-22-11, from http://www.csee.umbc.edu/~tsimo1/CMSC455/IEEE-754-2008.pdf

Labrosse, J. (1998, February). Fixed-Point Arithmetic for Embedded Systems. Retrieved 2016-24-03, from http://www.drdobbs.com/fixed-point-arithmetic-for-embedded-syst/184403460

LaPlante, P. A. (2012). Real-Time Systems Design And Analysis (4th ed.). Wiley-IEEE Press.

Parhami, B. (2000). Computer Arithmetic. Oxford University Press.

Regan, R. (2012). Why 0.1 Does Not Exist In Floating-Point. Retrieved 2016-15-03, from http://www.exploringbinary.com/why-0-point-1-does-not-exist-in-floating-point/

Richardson, I. E. (2002). Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, England: John Wiley & Sons.

Sanglard, F. (2010). Doom Engine Code Review. Retrieved 2015-11-26, from http://fabiensanglard.net/doomIphone/doomClassicRenderer.php

Stevenson, J. (2002, February). Q-Values in the Watch Window [Computer software manual]. Retrieved 2016-29-03, from http://www.ti.com/lit/an/spra109/spra109.pdf

Stevenson, S. (2012). Rounding in fixed point number conversions. Retrieved 2015-12-03, from https://sestevenson.wordpress.com/2009/08/19/rounding-in-fixed-point-number-conversions/

TI DSP Firsts. (2001). Retrieved 2016-30-03, from http://www.ti.com/corp/docs/investor/dsp/firsts.htm

TMS320UC5402 Fixed-Point Digital Signal Processor [Computer software manual]. (2008). Retrieved 2016-24-03, from http://www.ti.com/lit/ds/symlink/tms320uc5402.pdf

TMS320VC5402A Fixed-Point Digital Signal Processor [Computer software manual]. (2008). Retrieved 2016-24-03, from http://www.ti.com/lit/ds/symlink/tms320vc5402a.pdf

TMS320VC5410A Fixed-Point Digital Signal Processor [Computer software manual]. (2008). Retrieved 2016-24-03, from http://www.ti.com/lit/ds/symlink/tms320vc5410a.pdf

TMS320VC5471 Fixed-Point Digital Signal Processor [Computer software manual]. (2008).
Retrieved 2016-24-03, from http://www.ti.com/lit/ds/symlink/tms320vc5471.pdf

VCVT (between floating-point and fixed-point). (2013). Retrieved 2016-30-03, from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489i/CIHDJFHF.html