Fast Integer Division Using Multiplicative Inverse with Lookup Table

Kazmirenko V., Golubeva I.
National Technical University of Ukraine “KPI”, Kyiv, Ukraine
e-mail: v.kazmirenko@ieee.org, golubeva@phbme.kpi.ua

Abstract – The paper presents a method for fast approximate integer division. The multiplicative inverse is approximated using a lookup table. The error of the quotient calculated by the proposed method does not exceed 1 bit for 16-bit integers.

Keywords – OFDM; equalizer; integer division; multiplicative inverse; lookup table

I. INTRODUCTION

Many modern communication standards rely on orthogonal frequency division multiplexing (OFDM) as the underlying multiple access technology. It offers such advantages as robustness against frequency-selective fading caused by multipath propagation and other interference effects of the urban landscape. In addition, equalization can be implemented simply in the frequency domain.

Zero forcing is one of the simplest equalizers. It was introduced in the early 1960s by Robert W. Lucky [1]. Nevertheless, this simple idea is still widely used in modern communication systems, such as IEEE 802.11n (MIMO). Suppose the channel has frequency response $F(f)$. Then the zero forcing equalizer $C(f)$ is constructed as $C(f) = 1/F(f)$, so the combination of channel and equalizer gives a flat frequency response and linear phase: $F(f)\,C(f) = 1$. Note that equalizer construction requires division with the channel response as the divisor, which is unknown a priori.

To aid estimation of the channel frequency response, designers embed some kind of reference signals among the regular useful data signals. They may come in the form of frequency combs (DVB, WiFi, WiMAX) or span all subcarriers (LTE). Reference signals may come with every symbol or appear periodically in time. The modulation of the reference signals is known to the receiver. Thus the receiver estimates the channel response by dividing the expected signal (1 in the above derivation) by the actually received one.
A calculation of this kind has to be performed for every reference subcarrier of the OFDM system. The number of used subcarriers may reach 1200, as in LTE.

The invention of the mathematical coprocessor started the era of hardware-accelerated math and cheap floating point support. However, many modern microcontrollers and digital signal processors (DSPs), used both in handheld terminals and in base stations of cellular networks, still rely on their integer arithmetic logic unit (ALU).

Another important consideration is that modern DSPs are optimized for multiply-accumulate operations. They have hardware single-cycle multipliers and adders. As DSPs have to handle large chunks of data, their performance improves in pipelined loops. However, most DSPs call an assembly subroutine for division. This not only adds call overhead, but also breaks pipelining of the loops. From this point of view division, even integer division, is quite an expensive operation. Replacing division with a couple of cheaper operations may give performance benefits.

II. FAST INTEGER DIVISION

For the further derivations let us state the problem as

$$q = \frac{n}{d}, \qquad (1)$$

where q is the quotient to find, n (numerator) is the dividend, and d is the divisor.

Besides the obvious case of the divisor being a power of 2, there is one more special case: division by a constant, which can be replaced with multiplication by a constant. Reference [2] gives some well-crafted examples for a limited set of divisors using multiplications, additions, and shifts.

Generally, division is implemented as a multistep, iterative process. Division algorithms can be classified as slow and fast ones. Slow algorithms, such as restoring division, non-restoring division, and the Sweeney-Robertson-Tocher (SRT) algorithm, produce one binary digit per iteration, while fast ones, such as Newton-Raphson and Goldschmidt division, produce more than one. In the Newton-Raphson algorithm the quotient q is found by multiplying the dividend by the reciprocal of the divisor d. Newton's method is used to estimate the reciprocal of d.
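As a rough illustration of reciprocal-based division, the following Python sketch applies the standard Newton update for a reciprocal, x ← x(2 − d·x). It is not taken from the paper: floating point is used only for readability (a DSP would use fixed-point arithmetic), the function names are hypothetical, and the linear starting guess 48/17 − (32/17)·d for a divisor scaled into [0.5, 1) is an assumption chosen for the example.

```python
import math


def newton_reciprocal(d_norm, iterations=2):
    """Approximate 1/d_norm for 0.5 <= d_norm < 1 with Newton's method."""
    # Linear starting guess (an assumption of this sketch, not from the paper).
    x = 48.0 / 17.0 - (32.0 / 17.0) * d_norm
    for _ in range(iterations):
        # Newton update for f(x) = 1/x - d: multiplications and one subtraction.
        x = x * (2.0 - d_norm * x)
    return x


def approx_divide(n, d):
    """Approximate n / d for d > 0 by multiplying n with an estimated reciprocal."""
    mant, exp = math.frexp(d)          # d = mant * 2**exp with 0.5 <= mant < 1
    return n * newton_reciprocal(mant) / (2.0 ** exp)
```

With this starting guess the relative error after two iterations stays below 2e-5, so the fixed iteration count can be unrolled by hand instead of being left as a loop.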
Application of Newton's method to the function $f(x) = 1/x - d$ results in the iterative process

$$x_{i+1} = x_i - \frac{1/x_i - d}{-1/x_i^2} = x_i (2 - d x_i),$$

which uses only multiplications and subtraction (effectively addition). Straightforward coding of such an algorithm in a custom application may not be profitable, as DSP compilers struggle to pipeline the inner loop, which would be the loop of Newton's iterations. However, with a good initial approximation just 2 iterations produce 22 correct binary digits, which is enough for many practical applications, since the range of ubiquitous ADCs rarely exceeds 16 bits. The loop may then be unrolled, allowing an efficient pipelined implementation. The interested reader may find an implementation in open sources [3].

A more general approach still relies on the divisor's reciprocal. The original problem (1) can be reformulated as $q = n \cdot \frac{1}{d}$. To make it suitable for integers, note that in place of $1/d$ one may use any fraction $a/b$ that reduces to $1/d$. Although $a/b$ itself still evaluates to zero in integer arithmetic, we may rewrite the desired operation as $q = (n \cdot a)/b$. If b is a power of 2, the division is replaced with a cheap right shift. For an arbitrary d it is not always possible to find such a and b with b a power of 2. However, as the final right shift throws away some bits, their accuracy does not matter. So $a/b$ does not need to be exactly $1/d$, but rather close to it. Several algorithms based on this idea are presented in [4]. For unsigned 32-bit integers one can replace division by 3 with multiplication by $\frac{1\,431\,655\,765}{4\,294\,967\,296}$, division by 5 with multiplication by $\frac{858\,993\,459}{4\,294\,967\,296}$, and so on. Note that the denominator is $b = 2^{32}$. Since division is replaced with a multiplication and a shift, these methods are sometimes referred to as multiplicative inverse.

III. APPROXIMATION OF MULTIPLICATIVE INVERSE WITH LOOKUP TABLE

In this paper we propose an approximate algorithm, suitable for small numbers. Division by $b = 2^{32}$ is equal to a right shift by 32, which would zero a 32-bit result.
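A short Python sketch may help to see why the width of the product matters. It uses the two numerators quoted above for division by 3 and by 5; the function names and the explicit 32-bit mask (which mimics a multiplier that keeps only the low 32 bits) are assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF

A3 = 1_431_655_765   # floor(2**32 / 3), numerator quoted in the text
A5 = 858_993_459     # floor(2**32 / 5), numerator quoted in the text


def div_by_const(n, a):
    """Approximate n // d as (n * a) >> 32, keeping the full 64-bit product."""
    return (n * a) >> 32


def div_by_const_truncated(n, a):
    """Same operation, but the product is first cut down to its low 32 bits."""
    return ((n * a) & MASK32) >> 32   # nothing is left after the shift
```

For example, `div_by_const(100, A3)` gives 33, while `div_by_const(9, A3)` gives 2 rather than 3: because the numerators are rounded down, the result may fall one below the exact quotient. The truncated variant returns 0 for every input.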
In order to apply this scheme successfully, we have to keep 64 bits of the product $n \cdot a$. Similarly, for 16-bit data we have to keep 32 bits of the product. Luckily, DSPs implement this type of operation.

Another obstacle is that finding the multiplicative inverse in the form $a/b$ still requires a multistep branching procedure, which would break the pipeline of the loop. Let us restrict $b = 2^N$, where N is either 16 or 32, and use a lookup table for a. To avoid branching, the lookup table contains the numerator part a for d in the range $[1..2^k - 1]$. The size of the lookup table, $2^k - 1$, is restricted by memory limitations. Also, this table should be small enough to fit in the processor cache. This would limit the range of divisors. So we propose to represent d in the form

$$d = z \cdot 2^m. \qquad (2)$$

Figure 1. Pseudocode for division by multiplication using lookup table

Constant k defines the size of the lookup table. Variable m is the exponent in (2). Function norm() returns the number of bits up to the first nonredundant sign bit of its argument, and is often available as a single-cycle intrinsic instruction. Thus, 31 – norm(d) is the most significant bit number of d. Operator shr denotes a bitwise right shift. Array “inverse” is precomputed at the initialization stage and stores the values $\mathrm{inverse}_i = \mathrm{RANGE\_MAX}/i$, $i = 1..2^k - 1$, where RANGE_MAX is the maximum value for the data type used, e.g. $2^{16} - 1$ for 16-bit integers.

The pseudocode in Fig. 1 is good for positive divisors. For negative ones an additional step of saving the sign and taking the absolute value is needed.

A numerical experiment was performed with 16-bit integers using a lookup table of 256 entries. Numerator n and divisor d were varied in the range 1..32 767. In the experiment the discrepancy of q does not exceed one least significant bit, so the relative error of the quotient does not exceed $1/q^*$, where $q^*$ is the accurate quotient. Thus, the method performs better for larger quotients. In many practical applications 16-bit data are acquired with a 10 to 14 bit ADC. Then the dividend n can be prescaled so that $q = n/d \ge 4..64$.
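The procedure described for Fig. 1 can be sketched in Python as follows. This is a reconstruction from the prose, not the authors' pseudocode: the names are hypothetical, `bit_length()` stands in for the norm() intrinsic, and the settings k = 8, N = 16 are assumed to match the 16-bit experiment with a 256-entry table.

```python
K = 8                       # table covers divisors 1 .. 2**K - 1
N = 16                      # b = 2**N
RANGE_MAX = (1 << N) - 1    # 2**16 - 1 for 16-bit data

# inverse[i] = RANGE_MAX // i; index 0 is unused padding, giving 256 entries.
INVERSE = [0] + [RANGE_MAX // i for i in range(1, 1 << K)]


def lut_divide(n, d):
    """Approximate n // d for 1 <= n, d <= 32767 without a division."""
    msb = d.bit_length() - 1        # most significant bit number of d
    m = msb - (K - 1)               # low bits of d that do not fit the table index
    if m < 0:                       # conditional assignment: small d needs no shift
        m = 0
    z = d >> m                      # table index, 1 <= z <= 2**K - 1
    # One multiplication with a 32-bit product, then the shifts by N and by m.
    return (n * INVERSE[z]) >> (N + m)
```

For instance, `lut_divide(1000, 7)` returns 142, the exact quotient, while `lut_divide(30000, 1000)` returns 29 instead of 30, a discrepancy of one least significant bit. Negative divisors would need the sign handling mentioned above, which this sketch omits.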
In (2), m is the smallest integer such that $z \le 2^k - 1$. After multiplication by the inverse from the lookup table and the right shift by N, an additional right shift by m is applied. The pseudocode for this algorithm is shown in Fig. 1. Since the lookup table size is known at compile time, the whole algorithm costs 2 additions, 2 shifts, 1 memory load, 1 multiplication and 1 conditional assignment. This is at least two times fewer operations than in the Newton method.

IV. CONCLUSION

The presented algorithm provides a considerable performance benefit at the expense of the memory used for the lookup table. An implicit expense is the memory load from the lookup table, which is efficient in the presence of a cache. 1-bit accuracy may be enough in practical applications.

REFERENCES

[1] R.W. Lucky. The adaptive equalizer. IEEE Signal Processing Magazine. – 2006. – Vol. 23. – P. 104-107.
[2] H. S. Warren, Jr. Hacker's Delight. Addison-Wesley Professional. – 2002. – 320 p., chapter 10.
[3] Shuhua Zhang. Computing Reciprocals of Fixed-Point Numbers by the Newton-Raphson Method — Division by Multiplication // http://www.dsprelated.com/showcode/201.php.
[4] T. Granlund, P. L. Montgomery. Division by Invariant Integers using Multiplication. ACM SIGPLAN Notices. – Vol. 29, Issue 6. – June 1994. – P. 61-72.

ELNANO’ 2012, April 10-12, 2012, Kyiv, Ukraine