Fast Integer Division Using Multiplicative Inverse with Lookup Table

Kazmirenko V., Golubeva I.
National Technical University of Ukraine “KPI”, Kyiv, Ukraine
e-mail: v.kazmirenko@ieee.org, golubeva@phbme.kpi.ua
Abstract – The paper presents a method for fast approximate integer division. The multiplicative inverse of the divisor is approximated using a lookup table. The error of the quotient calculated by the proposed method does not exceed 1 bit for 16-bit integers.
Keywords – OFDM; equalizer; integer division; multiplicative inverse; lookup table
I. INTRODUCTION
Many modern communication standards rely on orthogonal frequency division multiplexing (OFDM) as the underlying multiple-access technology. It offers advantages such as robustness against frequency-selective fading caused by multipath propagation and other interference effects of the urban landscape. In addition, equalization can be implemented simply in the frequency domain.
Zero forcing is one of the simplest equalizers. It was introduced in the early 1960s by the renowned Robert W. Lucky [1]. Nevertheless, this simple idea is still widely used in modern communication systems, such as IEEE 802.11n (MIMO). Suppose the channel has frequency response $F(f)$. Then the zero-forcing equalizer $C(f)$ is constructed as $C(f) = 1/F(f)$, so the combination of channel and equalizer gives a flat frequency response and linear phase, $F(f)\,C(f) = 1$. Note that constructing the equalizer requires a division in which the divisor is the channel response, which is unknown a priori.
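As an illustration of where the division appears, the following minimal sketch builds zero-forcing coefficients for a few subcarriers. The function name `zero_forcing_coefficients` and the sample channel values are hypothetical and floating point is used only for brevity. The sketch relies on the identity $1/F = F^{*}/|F|^{2}$, so the only true division is by the real number $|F|^{2}$.

```python
def zero_forcing_coefficients(channel):
    """Return C[k] = 1 / F[k] for each subcarrier response F[k]."""
    coeffs = []
    for f in channel:
        # |F|^2 is the real-valued divisor; it is not known in advance
        power = f.real * f.real + f.imag * f.imag
        # 1/F = conj(F) / |F|^2
        coeffs.append(complex(f.real / power, -f.imag / power))
    return coeffs


# Hypothetical channel responses on three subcarriers
channel = [complex(0.8, 0.3), complex(-0.2, 1.1), complex(0.5, -0.5)]
equalizer = zero_forcing_coefficients(channel)

# Channel followed by equalizer should be flat: F[k] * C[k] is close to 1
flat = [f * c for f, c in zip(channel, equalizer)]
```

On an integer-only DSP the two divisions by `power` are the costly step that the rest of the paper aims to replace.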
To aid estimation of the channel frequency response, designers embed reference signals among the regular data signals. They may take the form of frequency combs (DVB, WiFi, WiMAX) or span all subcarriers (LTE). Reference signals may accompany every symbol or appear periodically. The modulation of the reference signals is known to the receiver. Thus the receiver estimates the channel response by dividing the expected signal (1 in the derivation above) by the one actually received. This calculation has to be performed for every reference subcarrier of the OFDM system. The number of used subcarriers may reach 1 200, as in LTE.
The invention of the mathematical coprocessor started the era of hardware-accelerated math and cheap floating-point support. However, many modern microcontrollers and digital signal processors (DSPs), used both in handheld terminals and in base stations of cellular networks, still rely on their integer arithmetic logic unit (ALU). Another important consideration is that modern DSPs are optimized for multiply-accumulate operations: they have hardware single-cycle multipliers and adders. As DSPs have to handle large chunks of data, they perform best in pipelined loops. However, most DSPs call an assembly subroutine for division. This not only adds call overhead but also breaks the pipelining of loops. From this point of view division, even integer division, is quite an expensive operation. Replacing division with a couple of cheaper operations may give performance benefits.
II. FAST INTEGER DIVISION
For the further derivations let us state the problem as
$$q = \frac{n}{d}, \tag{1}$$
where $q$ is the quotient to find, $n$ (numerator) is the dividend, and $d$ is the divisor.
Besides the obvious case of the divisor being a power of 2, there is one special case: division by a constant. Reference [2] gives some well-crafted examples for a limited set of divisors using multiplications, additions, and shifts.
Generally, division is implemented as a multistep, iterative process. Division algorithms can be classified as slow and fast. Slow algorithms, such as restoring division, non-restoring division, and the Sweeney-Robertson-Tocher (SRT) algorithm, produce one binary digit per iteration, while fast ones, such as Newton-Raphson and Goldschmidt division, roughly double the number of correct digits with each iteration.
In the Newton-Raphson algorithm the quotient $q$ is found by multiplying the dividend $n$ by the reciprocal of the divisor $d$. Newton's method is used to estimate the reciprocal of $d$. Applying Newton's method to the function $f(x) = 1/x - d$ results in the iterative process
$$x_{i+1} = x_i - \frac{1/x_i - d}{-1/x_i^2} = x_i\,(2 - d\,x_i),$$
which uses only multiplications and subtraction (effectively addition). Coding such an algorithm directly in a custom application may not be profitable, as DSP compilers struggle to pipeline the inner loop, which would be the loop of Newton's iterations. However, with a good initial approximation just 2 iterations would produce 22 correct
binary digits, which is enough for many practical applications, since the range of ubiquitous ADCs rarely exceeds 16 bits. The loop may then be unrolled, allowing an efficient pipelined implementation. The interested reader may find an implementation in open sources [3].
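The unrolled iteration can be sketched in integer arithmetic as follows. This is not the implementation from [3]. The Q30 fixed-point format, the function names `newton_reciprocal` and `newton_divide`, and the linear initial guess $x_0 = 48/17 - (32/17)\,m$ are assumptions made for illustration. That guess has a relative error of at most 1/17, so two iterations give roughly 16 correct bits. Reaching the 22 bits mentioned above would need a better initial approximation.

```python
FRAC_BITS = 30                     # assumed fixed-point format (Q30)
ONE = 1 << FRAC_BITS
C48_17 = (48 * ONE) // 17          # constants, computed once offline
C32_17 = (32 * ONE) // 17


def newton_reciprocal(d):
    """Return (x, shift) with x / 2**shift approximating 1/d, for d >= 1."""
    e = d.bit_length()             # d = m * 2**e with m in [0.5, 1)
    m = (d << FRAC_BITS) >> e      # normalized divisor m in Q30
    # Linear initial guess for 1/m (an assumption, see lead-in)
    x = C48_17 - ((C32_17 * m) >> FRAC_BITS)
    # Two unrolled Newton iterations: x <- x * (2 - m * x)
    x = (x * (2 * ONE - ((m * x) >> FRAC_BITS))) >> FRAC_BITS
    x = (x * (2 * ONE - ((m * x) >> FRAC_BITS))) >> FRAC_BITS
    return x, FRAC_BITS + e


def newton_divide(n, d):
    """Approximate n / d with one multiplication and one shift."""
    x, shift = newton_reciprocal(d)
    return (n * x) >> shift
```

Because the iteration approaches the reciprocal from below, the quotient for 16-bit operands may come out one less than the exact value.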
A more general approach still relies on the divisor's reciprocal. The original problem (1) can be reformulated as $q = n \cdot (1/d)$. To make it suitable for integers, note that in place of $1/d$ one may use any fraction $a/b$ that reduces to $1/d$. Although $a/b$ itself still evaluates to zero in integer arithmetic, we may rewrite the desired operation as $q = (n \cdot a)/b$. If $b$ is a power of 2, the division is replaced by a cheap right shift. For arbitrary $d$ it is not always possible to find such $a$ and $b$ where $b$ is a power of 2. However, as the final right shift throws away some bits, their accuracy does not matter. So $a/b$ does not need to be exactly $1/d$, only close to it. Several algorithms based on this idea are presented in [4].
For unsigned 32-bit integers one can replace division by 3 with multiplication by $\frac{1\,431\,655\,765}{4\,294\,967\,296}$, division by 5 with multiplication by $\frac{858\,993\,459}{4\,294\,967\,296}$, and so on. Note that the denominator is $b = 2^{32}$. Since division is replaced with a multiplication and a shift, these methods are sometimes referred to as multiplicative inverse methods.
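A short sketch of the multiply-and-shift idea, assuming a 64-bit product is available. The helper `mul_shift` and the function names are hypothetical. The first pair of functions uses the constants quoted above. Those constants equal $\lfloor 2^{32}/d \rfloor$, so the result is one too small when $n$ is an exact multiple of $d$. The second pair uses the widely known rounded-up constants with a wider shift, which are not taken from this paper and are exact for all unsigned 32-bit $n$.

```python
MASK64 = (1 << 64) - 1   # emulate a 64-bit product register


def mul_shift(n, a, shift):
    """Return (n * a) >> shift, keeping 64 bits of the product."""
    return ((n * a) & MASK64) >> shift


# Constants quoted in the text: a = floor(2**32 / d), b = 2**32
def div3_text(n):
    return mul_shift(n, 1431655765, 32)


def div5_text(n):
    return mul_shift(n, 858993459, 32)


# Rounded-up constants with a wider shift (assumption, not from the paper):
# 3 * 0xAAAAAAAB = 2**33 + 1 and 5 * 0xCCCCCCCD = 2**34 + 1
def div3_exact(n):
    return mul_shift(n, 0xAAAAAAAB, 33)


def div5_exact(n):
    return mul_shift(n, 0xCCCCCCCD, 34)
```

For example, `div3_text(3)` yields 0 while `div3_exact(3)` yields 1, which shows how the accuracy of $a/b$ trades against the width of the shift.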
III. APPROXIMATION OF MULTIPLICATIVE INVERSE WITH LOOKUP TABLE
In this paper we propose an approximate algorithm suitable for small numbers. Division by $b = 2^{32}$ is equivalent to a right shift by 32, which would zero a 32-bit result. To apply this scheme successfully, we have to keep 64 bits of the product $n \cdot a$. Similarly, for 16-bit data we have to keep 32 bits of the product. Fortunately, DSPs implement this type of operation.
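Another obstacle is that finding the multiplicative inverse in the form $a/b$ still requires a multistep branching procedure, which would break the pipeline of the loop. Let us restrict $b = 2^N$, where $N$ is either 16 or 32, and use a lookup table for $a$. To avoid branching, the lookup table contains the numerator part $a$ for $d$ in the range $[1..2^k - 1]$. The size of the lookup table, $2^k - 1$, is restricted by memory limitations. Also, the table should be small enough to fit in the processor cache. This would limit the range of divisors. So we propose to represent $d$ in the form
$$d = z \cdot 2^m, \tag{2}$$
where $m$ is the smallest integer such that $z \le 2^k - 1$. Then, after multiplication by the inverse from the lookup table and the right shift by $N$, an additional right shift by $m$ is applied. Pseudocode for this algorithm is shown in Fig. 1.
Figure 1. Pseudocode for division by multiplication using lookup table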
Constant k defines the size of the lookup table. Variable m is the exponent in (2). Function norm() returns the number of bits up to the first nonredundant sign bit of its argument, and is often available as a single-cycle intrinsic instruction. Thus, 31 – norm(d) is the number of the most significant bit of d. Operator shr denotes a bitwise right shift. Array “inverse” is precomputed at the initialization stage and stores the values
$$\mathrm{inverse}[i] = \frac{\mathrm{RANGE\_MAX}}{i}, \quad i = 1..2^k - 1,$$
where RANGE_MAX is the maximum value for the data type used, e.g. $2^{16} - 1$ for 16-bit integers.
The pseudocode in Fig. 1 works for positive divisors. For negative ones, an additional step of saving the sign and taking the absolute value is needed.
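The body of Fig. 1 is not reproduced in this text, so the following is only a sketch reconstructed from the description above, not the authors' pseudocode. It assumes k = 8 and N = 16, integer (floor) division when filling the table, z obtained by truncation as `d >> m`, and Python's `int.bit_length()` standing in for 31 – norm(d). The names `lut_divide` and `lut_divide_signed` are hypothetical.

```python
K = 8                           # table covers divisors 1 .. 2**K - 1
N = 16                          # b = 2**N
RANGE_MAX = (1 << N) - 1        # 65535 for 16-bit data

# inverse[i] = RANGE_MAX / i; entry 0 is a placeholder and is never read
INVERSE = [0] + [RANGE_MAX // i for i in range(1, 1 << K)]


def lut_divide(n, d):
    """Approximate n / d for 0 <= n <= 32767 and 1 <= d <= 32767."""
    m = d.bit_length() - K      # stand-in for (31 - norm(d)) - k
    if m < 0:                   # the single conditional assignment
        m = 0
    z = d >> m                  # truncated divisor, z <= 2**K - 1
    # One table load, one multiplication, one shift by N + m
    return (n * INVERSE[z]) >> (N + m)


def lut_divide_signed(n, d):
    """Keep the sign of d, divide by |d|, then restore the sign."""
    q = lut_divide(n, abs(d))
    return -q if d < 0 else q
```

Under these assumptions the result may differ from the exact quotient by one, for instance `lut_divide(32767, 1)` gives 32766.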
A numerical experiment was performed with 16-bit integers using a lookup table of 256 entries. The numerator n and divisor d were varied in the range 1..32 767. In the experiment the discrepancy of q does not exceed one least significant bit, so the relative error of the quotient is $\delta q = 1/q^{*}$, where $q^{*}$ is the accurate quotient. Thus, the method performs better for larger quotients. In many practical applications 16-bit data is acquired with a 10- to 14-bit ADC. The dividend n can then be prescaled so that $n/d = q$ reaches $4..64$.
Since the lookup table size is known at compile time, the whole algorithm costs 2 additions, 2 shifts, 1 memory load, 1 multiplication and 1 conditional assignment. This is at most half the number of operations needed by the Newton method.
IV. CONCLUSION
The presented algorithm provides a considerable performance benefit at the expense of the memory used for the lookup table. An implicit cost is the memory load from the lookup table, which is efficient in the presence of a cache. One-bit accuracy may be enough in practical applications.
REFERENCES
[1] R. W. Lucky. The adaptive equalizer. IEEE Signal Processing Magazine. – 2006. – Vol. 23. – P. 104-107.
[2] H. S. Warren, Jr. Hacker's Delight. Addison-Wesley Professional. – 2002. – 320 p., chapter 10.
[3] Shuhua Zhang. Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method: Division by Multiplication // http://www.dsprelated.com/showcode/201.php.
[4] T. Granlund, P. L. Montgomery. Division by Invariant Integers using Multiplication. ACM SIGPLAN Notices. – Vol. 29, Issue 6. – June 1994. – P. 61-72.
ELNANO’ 2012, April 10-12, 2012, Kyiv, Ukraine