The Fixed-Point Representation of Real Numbers

Anthony DesArmier

El Paso Community College
Abstract
Computers are remarkable devices, capable of processing vast amounts of information at impressive speeds. These devices were not designed quickly or easily, and much work has been done by computer scientists and engineers to develop them. One aspect of computers often taken for granted by programmers is how they represent real number values beyond integers. By examining the designs and techniques that have been developed, programmers will better understand how to represent rational numbers and avoid common misconceptions and mistakes.
The Fixed-Point Representation of Real Numbers
Computers can execute many millions of instructions per second, handling and processing everything from complex abstract data structures (particular ways of organizing data in a computer so that it can be used efficiently) to the most basic branch of mathematics: arithmetic. Electronic devices, as small as a few nanometers, are capable of performing arithmetic operations at speeds far greater than most humans. However, as complex as they have become, they are incapable of the level of thought humans use instinctively. Many data structures have been developed to represent data efficiently. A very popular primitive data type is the floating-point number, which can represent non-integer rational values. Many general programmers are unaware of some of its limitations and of what to do when it is not available, and typically must learn a different technique out of necessity.
Computers accept and process data in a predefined way, making binary logical choices along the way, although parts of this process may be abstracted away from the programmer. Take the simplest class of instructions a computer can handle as an example: arithmetic. A computer can add two integers together and, by extension, subtract, multiply, divide, raise to a power, and take square roots. These basic instructions can be combined to handle higher and more complicated mathematical operations. However, many problems must be addressed before a computer can perform these calculations properly, owing to limitations in its fundamental design. One such issue arises when a computer must handle a number other than an integer, such as a non-integer real number. In the worst case, such a value must be approximated through the use of certain data structures, and care must be taken to evaluate it within a minimum tolerance for error.
Computers store integers as sequences of bits, which are digits with values of either 1
or 0. There are two common encodings for integers: base-2, which is a positional notation
system with a radix of 2; and binary coded decimal (BCD), in which each decimal digit is
encoded in a fixed number of bits, with a radix of 10. When storing integers that can be either
positive or negative, a single bit representing the sign is appended to the base-2 string, or placed at a predetermined bit position if BCD encoding is used. The computer hardware reads, stores, and manipulates these representations directly. Handling integers is quite straightforward, unlike handling other real numbers.
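As a brief illustration (a sketch of my own; to_packed_bcd is a hypothetical helper, not a standard routine), the following C program shows the integer 42 under both encodings:

    #include <stdio.h>
    #include <stdint.h>

    /* Pack each decimal digit of a non-negative value into 4 bits (BCD). */
    uint32_t to_packed_bcd(uint32_t value) {
        uint32_t bcd = 0;
        int shift = 0;
        while (value > 0) {
            bcd |= (value % 10) << shift;   /* one decimal digit per nibble */
            value /= 10;
            shift += 4;
        }
        return bcd;
    }

    int main(void) {
        uint32_t n = 42;
        printf("base-2 of %u: ", n);        /* positional notation, radix 2 */
        for (int bit = 7; bit >= 0; bit--)
            putchar(((n >> bit) & 1) ? '1' : '0');             /* 00101010 */
        printf("\npacked BCD of %u: 0x%X\n", n, to_packed_bcd(n)); /* 0x42 */
        return 0;
    }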
Fractional values, because of their approximate nature, need to be encoded in a different way. The most common encoding method is the floating-point format, established under a technical standard known as the IEEE Standard for Floating-Point Arithmetic (IEEE 754) in 1985 and revised in 2008 (IEEE, 2008). Before this format was commonplace, many programmers used a practical technique called fixed-point, which, although now mostly forgotten, still has advantages over floating-point when computing resources are constrained. Floating-point representation is so popular and well-known that the vast majority of hardware systems include a specialized device known as a Floating-Point Unit (FPU), also known as a math co-processor, designed purely to handle this encoding, with its own instruction sets for operating on that data. Fixed-point, however, while used extensively before floating-point was introduced, is now not even mentioned in many computer programming textbooks, nor is it widely known among general computer science professors (R. Escalante-Ruiz, personal communication, February 23, 2016); it is found almost exclusively in Digital Signal Processing (DSP) manuals[1], video codec design textbooks[2], specialized computer arithmetic textbooks[3], and other embedded systems documentation in which no FPU is used[4].

[1] e.g., TMS320UC5402, 2008; TMS320VC5402A, 2008; TMS320VC5410A, 2008; TMS320VC5471, 2008.
[2] e.g., Richardson, 2002, p. 191.
[3] e.g., Parhami, 2000, p. 8.
[4] Labrosse, 1998.

Embedded systems are computer systems found within a larger electrical system. These systems are usually designed to accomplish one task. Some examples include radar equipment, missile guidance systems, fuel emission systems in cars, internet routers, home appliances, and even television remotes and calculators. They are often designed to be of low power consumption, small, able to handle weather conditions, and very cheap to mass
produce. However, to accomplish this, they are often built with only what is strictly feasible given the limited resources available, including size, space, and cost. This usually leads to the absence of an FPU, which is relatively expensive to include when cutting costs for mass-scale production. Thus, embedded systems must use techniques other than floating-point to handle non-integer arithmetic, typically through software simulation.
The distinction between fixed and floating point, beyond cost of implementation, lies in their concepts. Floating-point encodes a value (the significand) and an exponent by which it is scaled, together in a single word. The base for scaling is typically two or ten, though two is much more widely used due to the binary nature of hardware. It is called floating because the decimal place, or radix point, is allowed to be placed anywhere within the significand, as determined by the exponent component. It can be described as an implementation of scientific notation. Fixed-point is a value scaled by an implicit factor, set by the programmer, causing the radix point to be fixed in a specific place, hence the nomenclature. This implicit scaling is much less complicated and computationally demanding than floating-point. The multinational semiconductor and software design company ARM Holdings plc (ARM) elaborates: if the scale factor is known before compile time, it is fixed-point, and if it is not known, it is floating-point ("Application Note 33", 1996, p. 3). A fixed radix point ensures a set amount of range[5] and precision[6], offered by the word size and the number of bits available, creating a fixed range of possible values, while a floating radix point offers the ability to trade range for precision as needed (DSP, 2016, Dynamic Range and Precision section, para. 1), allowing the user to represent everything from extremely large values to extremely small ones. In BCD, a fixed-point value is simply an integer interpreted with an implicit decimal point after a certain number of encoded digits, determined by the language used. There is no support for IEEE 754 floating-point BCD, though conversion algorithms have been devised (e.g., Bende & Tembhurne, 2014) and other implementations have been designed[7].
[5] Range is determined by the largest and smallest absolute values that can be represented.
[6] Precision is determined by the smallest (relative) difference between two representable numbers.
[7] Further discussion will deal mostly with raw binary encoding and does not pertain to BCD.
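To make the distinction concrete, here is a small C sketch of my own (only frexp is a standard library call): the floating-point value carries its significand and exponent with it, while the fixed-point value is a bare integer whose scale factor exists only in the program's design:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double x = 2.75;

        /* Floating-point: each value stores its own significand and exponent. */
        int exp;
        double sig = frexp(x, &exp);   /* x == sig * 2^exp, sig in [0.5, 1) */
        printf("floating-point: %.4f * 2^%d\n", sig, exp);  /* 0.6875 * 2^2 */

        /* Fixed-point: a bare integer; the scale factor 2^-2 is implicit,
           chosen by the programmer and never stored. */
        int fixed = 11;                /* represents 11 * 2^-2 = 2.75 */
        printf("fixed-point: %d with implicit factor 2^-2 = %.2f\n",
               fixed, fixed / 4.0);
        return 0;
    }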
Floating-point numbers must be carefully handled in order to perform arithmetic: stored values are decoded, the appropriate arithmetic functions are applied, and the result is re-encoded before being returned. While even basic arithmetic functions normally involve many clock cycles across numerous operations, this cost has been alleviated through specialized algorithm designs, pipelining, and out-of-order execution, as well as by offloading work onto the dedicated FPU so the CPU is free to handle other tasks. These improvements make FPUs feasible against arithmetic logic units (ALUs) using much simpler arithmetic (such as fixed-point or integer). Even so, integer operations are still faster than floating-point operations (LaPlante, 2012, p. 465).
Fixed-point operations involve only simple integer and bit-wise operations and some involvement from the programmer, allowing them to be used on even the cheapest microprocessors. The programmer decides how much precision and range the function needs by examining bound-case scenarios, in order to make the best possible use of the number of bits the word size provides. Texas Instruments, the leading producer of DSPs (TI DSP Firsts, 2001), uses the Q format notation for binary (base 2) fixed-point:

For example, the Q15 is a popular format in which the most significant bit is the sign bit, followed by 15 bits of fraction. This Q-value specifies how many binary digits are allocated for the fractional portion of the number. (J. Stevenson, 2002)
ARM also uses the Q format (CMSIS, 2015), though its own general documentation states the format has been superseded (ARM Information Center, 2001), and it uses other Q-like formats for different devices (VCVT, 2013). Many computer engineers develop their own style of representing a fixed-point number (e.g., Labrosse, 1998) due to the lack of an international standard such as IEEE 754 for floating-point, resulting in various devices each using their own, albeit similar, notation in documentation.
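As a minimal sketch of the Q15 format (the type and helper names below are my own, not TI's or ARM's definitions), a Q15 value fits a signed 16-bit integer with an implicit scale of 2⁻¹⁵:

    #include <stdio.h>
    #include <stdint.h>

    /* Q15: one sign bit followed by 15 fractional bits in an int16_t.
       Resolution is 2^-15; the representable range is [-1, 1 - 2^-15]. */
    typedef int16_t q15_t;

    q15_t q15_from_double(double x) { return (q15_t)(x * 32768.0); }
    double q15_to_double(q15_t q)   { return q / 32768.0; }

    int main(void) {
        q15_t a = q15_from_double(0.25);    /* 8192 = 0x2000 */
        q15_t b = q15_from_double(-0.5);    /* -16384 */
        printf("0.25 -> %d, -0.5 -> %d\n", a, b);
        printf("8192 back to real: %f\n", q15_to_double(a));  /* 0.250000 */
        return 0;
    }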
Regardless of notation, the methodology for handling binary fixed-point arithmetic is the same[8]:

In computing arithmetic, fractional quantities can be approximated by using a pair of integers (n, k): the mantissa and the exponent. The pair represents the fraction n · 2⁻ᵏ. The exponent k can be considered as the number of digits you have to move into n before placing the binary point. ("Application Note 33", 1996)

[8] In the following sections, the mathematically focused (n, k) notation will be used.
To convert a fractional value into a binary fixed-point format, the decimal value is multiplied by 2 to the power of the number of bits desired for precision[9]. For example:

    2.75 · 2² = 11

The result is an (11, 2)₂ integer, which can be stored in a regular integer variable (or register) and noted with the implicit factor of 2²[10]. In another example:

    3.14159265359 · 2¹⁶ ≈ 205887

The data after the decimal point is simply not calculated, so the resulting value is treated as a (205887, 16)₂ integer. It should be noted that this has become an approximation. To convert a fixed-point value back to a decimal value, the fixed-point value is divided by 2 to the power of the implied factor[11], known as unscaling the factor:

    205887 / 2¹⁶ = 3.1415863037109375

The result shows that 3.141586 ≈ 3.141592, which is fairly accurate. For greater precision, more bits could be dedicated to the resolution at a further cost of range, but this is determined by the application's needs.

[9] Precision is determined by the resolution, which is calculated as base⁻ᵏ for fixed-point numbers.
[10] Note that the resolution in this example is 2⁻² = 0.25, which can exactly represent 0.75, resulting in perfect accuracy.
[11] These multiplications and divisions by 2 can be done with much faster arithmetic bit-shifts: left shift (<<) for multiplication and right shift (>>) for division.

In reality, however, when the unscaled value is taken as an integer the data after the decimal point is discarded and 3 is the returned result; for negative numbers, the result is further reduced by 1. This process, known as flooring[12], can create bias in the results. Consider −2.05:
    −2.05 · 2¹⁶ ≈ −134348

    −134348 / 2¹⁶ = −2.04998779296875

    floor(−2.04998779296875) = −3

The closest integer to −2.05 is −2, yet −3 was returned after conversion to fixed-point and back to integer. Bias due to flooring can be avoided by simulating rounding rules, making the result more accurate (S. Stevenson, 2012, para. 5). A common method is to add 0.5, in the same format, before converting back to an integer (S. Stevenson, 2012, para. 7). For example:

    0.5 · 2¹⁶ = 32768

    (−134348 + 32768) / 2¹⁶ = −1.54998779296875

    floor(−1.54998779296875) = −2
Now the returned result is −2, which is the closest integer to −2.05. Further rounding rules may be implemented, such as round half to even[13], which is the default rounding mode in IEEE 754 (The GNU C Library: Rounding, 2016). After this conversion, arithmetic becomes available using only ALU functions. The ARM manual for fixed-point arithmetic (1996) provides examples and proofs for the various basic functions such as add, subtract, multiply, divide, and square root, as well as examples for determining the exponent to use. Care must be taken with regard to potential overflow: as precision is increased, available range is decreased[14], and it becomes quite easy to overflow the word and create unusable data. This is handled by observing the maximum and minimum possible values between operations and whether the magnitude of the exponent exceeds the word size, necessitating careful coding by the programmer.

[12] Flooring returns the greatest integer not greater than the operand.
[13] Round half to even rounds to the nearest representable value, with ties rounded to the even neighbor.
[14] Range is determined by precision, base, and word size, and is defined by the interval [−baseᵐ, baseᵐ − base⁻ᵏ] for signed integers, with m = wordsize − 1 − k.
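The whole round trip described above (scaling, flooring, and add-half rounding) can be sketched in a few lines of C. The function names are my own, and the sketch assumes, as virtually all compilers guarantee in practice, that >> on a negative signed integer is an arithmetic shift:

    #include <stdio.h>
    #include <stdint.h>

    #define K 16  /* number of fractional bits: values are (n, 16)2 */

    /* Scale into fixed-point; the cast truncates, as in the pi example. */
    int32_t to_fixed(double x)   { return (int32_t)(x * (1 << K)); }

    /* Unscale back to a real value for display. */
    double  to_double(int32_t n) { return (double)n / (1 << K); }

    /* Round to the nearest integer by adding 0.5 in the same format
       (1 << (K - 1) == 32768) before flooring with an arithmetic shift. */
    int32_t round_to_int(int32_t n) { return (n + (1 << (K - 1))) >> K; }

    int main(void) {
        int32_t n = to_fixed(-2.05);                    /* -134348 */
        printf("fixed: %d -> %.14f\n", n, to_double(n));
        printf("floored: %d, rounded: %d\n",
               (int)(n >> K), (int)round_to_int(n));    /* -3, -2 */
        return 0;
    }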
The use of 2 as the base for scale factoring, in both fixed and floating-point, can cause inaccuracies with various values that cannot be represented exactly by any amount of binary scaling. The most popular example is the value 0.1, or 1/10:

    0001₂ / 1010₂ = 0.000110011…₂

As this result is non-terminating, a finite amount of precision is unable to represent the value exactly. Because of this, the floating-point expression 0.1 + 0.1 will not equal the real value 0.2 exactly (Regan, 2012). Any rational number whose denominator, in lowest terms, has a prime factor not shared with the base will produce an infinite expansion. For example, in base 10 the fraction 1/3 is non-terminating because 3 is not among 10's prime factors of 2 and 5, while the fraction 1/8 does terminate because its only prime factor is 2. In binary, 2's only prime factor is 2, and thus only fractions whose denominator is a power of 2 will terminate, such as 0.5 (1/2), whose denominator is 2¹.
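This is easy to demonstrate in C: summing ten copies of 0.1 in standard doubles accumulates the representation error into a visible discrepancy:

    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        for (int i = 0; i < 10; i++)
            sum += 0.1;                       /* each 0.1 is already inexact */
        printf("sum = %.17f\n", sum);         /* 0.99999999999999989 */
        printf("sum == 1.0? %s\n", sum == 1.0 ? "yes" : "no");   /* no */
        return 0;
    }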
This problem of being unable to accurately store or represent repeating expansions, as well as irrational numbers such as π and Euler's number, in a finite set of data has posed a challenge for various industries, such as the financial, commercial, and scientific sectors. In response, IEEE 754 has put forth larger word sizes to handle greater precision, while numerical analysis is used to develop algorithms that reduce accuracy errors. Fixed-point offers an easier solution. The decision to use 2 as the base for scaling in binary fixed-point is made purely for ease of computation, as multiplications and divisions by 2 can instead be done with bit shifts, drastically increasing performance. Fixed-point works with any base, not just 2. Consider base 7:

    (9/7) · 7³ = 1.2857142857… · 7³ = 441
This result is a perfectly accurate (441, 3)₇ integer which can be stored. This number can only be added or subtracted with values using the same base. The least common multiple of the different bases can be used to create a common base[15]:

    a/mʲ + b/nᵏ = (a·nᵏ)/(mʲ·nᵏ) + (b·mʲ)/(mʲ·nᵏ) = (a·nᵏ + b·mʲ)/(mʲ·nᵏ)
For example:

    441/7³ + 11/2² = (2² · 441)/(2² · 7³) + (7³ · 11)/(7³ · 2²)
                   = 1764/1372 + 3773/1372
                   = 5537/1372
                   = 4.0357142857…

    1.2857142857… + 2.75 = 4.0357142857…
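A sketch of this exact mixed-base addition in C (ipow is a hypothetical helper of my own; the values mirror the (441, 3)₇ and (11, 2)₂ example above):

    #include <stdio.h>
    #include <stdint.h>

    /* Integer power: a hypothetical helper, not a standard routine. */
    int64_t ipow(int64_t base, int exp) {
        int64_t r = 1;
        while (exp-- > 0) r *= base;
        return r;
    }

    int main(void) {
        int64_t a = 441, mj = ipow(7, 3);   /* 9/7 scaled by 7^3 -> (441, 3)7 */
        int64_t b  = 11,  nk = ipow(2, 2);  /* 2.75 scaled by 2^2 -> (11, 2)2 */
        int64_t num = a * nk + b * mj;      /* 1764 + 3773 = 5537 */
        int64_t den = mj * nk;              /* 343 * 4 = 1372 */
        printf("%lld/%lld = %.10f\n", (long long)num, (long long)den,
               (double)num / (double)den);  /* 5537/1372 = 4.0357142857 */
        return 0;
    }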
Multiplication and division are also possible[15]:

    (a/mʲ) · (b/nᵏ) = (a·b)/(mʲ·nᵏ)

For example:

    (441/7³) · (11/2²) = 4851/1372 = 3.5357142857…

    1.2857142857… · 2.75 = 3.5357142857…
The result is no longer necessarily a scale of powers and thus must be taken as a raw value or converted into a new fixed-point value. While these methods can retain perfect accuracy through at least one operation, they are very costly in performance and complex for the programmer to handle, as the denominators used in these examples are implicit and not stored in memory. When the denominator is stored in memory, the value is considered a rational data type. It is best to use a single scale within a function whenever possible. A popular example is using base 10 to handle decimal currency with perfect accuracy, though fixed-point BCD would suit this just as well.

[15] Variables a and b are scaled integers, variables m and n are bases, and variables j and k are exponents.
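As a minimal sketch of that currency example (my own illustration, assuming amounts are stored as whole cents, an implicit scale factor of 10²):

    #include <stdio.h>
    #include <inttypes.h>

    typedef int64_t cents_t;   /* base-10 fixed-point: implicit scale 10^2 */

    int main(void) {
        cents_t a = 10, b = 20;            /* $0.10 and $0.20, held as cents */
        cents_t sum = a + b;               /* exactly 30: no binary rounding */
        printf("$%" PRId64 ".%02" PRId64 "\n", sum / 100, sum % 100); /* $0.30 */
        return 0;
    }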
Another fixed-point technique involves exploiting the overflow and underflow of integer registers to simulate modulo logic very quickly and to simplify look-up table mapping. For example, two two-dimensional points are given as (x, y) integer pairs. An angle can be derived in relation to these two points using trigonometric functions and fixed-point operations. This result is then immediately scaled by 2^−(wordsize−1) and stored as a binary angular measurement (BAM)[16]. A look-up table maps BAM values to degree values so that 0 is 0 degrees, 0.5 is 90 degrees, −1.0 and 0.999…[17] are 180 degrees, and −0.5 is 270 degrees, or any further derived set of mapped values (e.g., Sanglard, 2010, Walls section, para. 3). These BAM values can be added and subtracted together, multiplied and divided by scalars, and put through the look-up table at the last stage to return a decimal angular value. Even if a BAM function would produce a value beyond its range, the overflow or underflow wraps the result back into the valid range and the rotation of the angles remains correct, eliminating the need for modulo logic such as checks against ≥ 360 and < 0, a costly operation (LaPlante, 2012, p. 467). According to LaPlante (2012), "BAM is frequently used in many applications including navigation software, robotic control, and in conjunction with digitizing imaging devices" (p. 467). This technique may also be used to count a loop of 60 seconds accurately instead of 360 degrees, and extended to minutes, then hours on a loop of 24, and so on.

[16] The BAM range is defined as the interval [−1, 1 − 2^−(wordsize−1)].
[17] The maximum and minimum values.
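A minimal sketch of BAM wraparound (my own illustration, using an unsigned 16-bit word so that overflow wrapping is well-defined in C; the signed [−1, 1) interpretation above corresponds to the same bit patterns):

    #include <stdio.h>
    #include <stdint.h>

    /* 16-bit BAM: the 65536 counts of a uint16_t span one full turn, so
       0 = 0 deg, 16384 = 90 deg, 32768 = 180 deg, 49152 = 270 deg. */
    typedef uint16_t bam16_t;

    double bam_to_degrees(bam16_t a) { return a * (360.0 / 65536.0); }

    int main(void) {
        bam16_t a = 49152;         /* 270 degrees */
        bam16_t b = a + 32768;     /* 270 + 180 wraps to 90: no modulo check */
        printf("%.1f deg + 180.0 deg -> %.1f deg\n",
               bam_to_degrees(a), bam_to_degrees(b));
        return 0;
    }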
Given all these functions, techniques, and encodings, fixed-point is a powerful tool that can be used as an alternative to floating-point, and it should not be dismissed so quickly as obsolete, archaic, or too abstract to learn. While floating-point arithmetic has been heavily researched and developed in hardware in order to simplify the use of fractional values for the programmer, it is not without caveats, including cost, performance, and numerical error. Fixed-point allows programmers, with a bit of thought, to manipulate and calculate fractional values with more granularity and in more applications than floating-point alone can offer, while also saving processing time and resource costs. Fixed-point arithmetic and rational data types beyond integers can benefit programmers of all types and should not be left to be taught out of necessity in the workplace.
References
Application Note 33 [Computer software manual]. (1996, September). Retrieved March 30, 2016, from http://infocenter.arm.com/help/topic/com.arm.doc.dai0033a/DAI0033A_fixedpoint_appsnote.pdf

ARM Information Center. (2001). Retrieved March 21, 2016, from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0066d/CHDFAAEI.html

Bende, S., & Tembhurne, P. S. (2014). Design of BCD to Floating Point Converter Based On Single Precision Format. IOSR Journal of Electrical and Electronics Engineering, 43-45.

CMSIS - Cortex Microcontroller Software Interface Standard. (2015). Retrieved March 30, 2016, from http://www.arm.com/products/processors/cortex-m/cortex-microcontroller-software-interface-standard.php

Fixed-Point vs. Floating-Point Digital Signal Processing. (2016). Retrieved March 22, 2016, from http://www.analog.com/en/education/education-library/articles/fixed-point-vs-floating-point-dsp.html

The GNU C Library: Rounding. (2016). Retrieved March 12, 2015, from https://www.gnu.org/software/libc/manual/html_node/Rounding.html

IEEE Standard for Floating-Point Arithmetic [Computer software manual]. (2008, August). Retrieved November 22, 2015, from http://www.csee.umbc.edu/~tsimo1/CMSC455/IEEE-754-2008.pdf

Labrosse, J. (1998, February). Fixed-Point Arithmetic for Embedded Systems. Retrieved March 24, 2016, from http://www.drdobbs.com/fixed-point-arithmetic-for-embedded-syst/184403460

LaPlante, P. A. (2012). Real-Time Systems Design and Analysis (4th ed.). Wiley-IEEE Press.

Parhami, B. (2000). Computer Arithmetic. Oxford University Press.
Regan, R. (2012). Why 0.1 Does Not Exist In Floating-Point. Retrieved March 15, 2016, from http://www.exploringbinary.com/why-0-point-1-does-not-exist-in-floating-point/

Richardson, I. E. (2002). Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, England: John Wiley & Sons.

Sanglard, F. (2010). Doom Engine Code Review. Retrieved November 26, 2015, from http://fabiensanglard.net/doomIphone/doomClassicRenderer.php

Stevenson, J. (2002, February). Q-Values in the Watch Window [Computer software manual]. Retrieved March 29, 2016, from http://www.ti.com/lit/an/spra109/spra109.pdf

Stevenson, S. (2012). Rounding in fixed point number conversions. Retrieved March 12, 2015, from https://sestevenson.wordpress.com/2009/08/19/rounding-in-fixed-point-number-conversions/

TI DSP Firsts. (2001). Retrieved March 30, 2016, from http://www.ti.com/corp/docs/investor/dsp/firsts.htm

TMS320UC5402 Fixed-Point Digital Signal Processor [Computer software manual]. (2008). Retrieved March 24, 2016, from http://www.ti.com/lit/ds/symlink/tms320uc5402.pdf

TMS320VC5402A Fixed-Point Digital Signal Processor [Computer software manual]. (2008). Retrieved March 24, 2016, from http://www.ti.com/lit/ds/symlink/tms320vc5402a.pdf

TMS320VC5410A Fixed-Point Digital Signal Processor [Computer software manual]. (2008). Retrieved March 24, 2016, from http://www.ti.com/lit/ds/symlink/tms320vc5410a.pdf

TMS320VC5471 Fixed-Point Digital Signal Processor [Computer software manual]. (2008). Retrieved March 24, 2016, from http://www.ti.com/lit/ds/symlink/tms320vc5471.pdf

VCVT (between floating-point and fixed-point). (2013). Retrieved March 30, 2016, from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489i/CIHDJFHF.html