PhD Research Plan

Proposed thesis title: "Implementation of the Computational Arithmetic Operations for Signal Processing Applications"

Proposal Submitted for the Degree of Doctor of Philosophy

by Deepak Kumar

Under the Supervision of Dr. Anup Dandapat
Department of Electronics and Communication Engineering
National Institute of Technology Meghalaya

Department of Computer Science and Engineering
National Institute of Technology Meghalaya
Shillong-793003, Meghalaya, India

December, 2014

Abstract

Computational arithmetic operations such as multiplication, division, squaring, square root, and reciprocal play a pivotal role in digital signal processing, image processing, computer graphics, application-specific (embedded) systems, cryptography, etc. Nowadays these operations are commonly implemented as software routines targeted to FPGAs (field-programmable gate arrays), and such implementations no longer satisfy present demands for high speed, low power consumption, and small chip area. With advances in VLSI technology, hardware implementation has become an attractive alternative: assigning complex computation tasks to hardware and exploiting the parallelism and pipelining in algorithms yield significant speedups in running time. Moreover, although computers keep getting faster, there are always new applications that need more processing speed than before. Examples of current high-demand applications include real-time video stream encoding and decoding, real-time biometric (face, retina, and/or fingerprint) recognition, and military aerial and satellite surveillance. To meet the demands of present and future applications, modern techniques (algorithms) for accelerating applications on commercial hardware need to be developed.

Contents

1. Introduction
2. Literature review
   2.1 Works related to multiplier
   2.2 Works related to floating point arithmetic
   2.3 Works related to divider
   2.4 Works related to reciprocal
   2.5 Works related to square root
3. Scope of work
4. PhD Plan
   4.1 Work done so far
   4.2 Future work
   4.3 Achievements and goals
5. Conclusions
6. References
7. Publications of the author related to the proposed work

1. Introduction

Rapid advancements in electronics, particularly the fabrication of integrated circuits for commercial applications, have had a major impact on both industry and society. With remarkable progress in very large scale integration (VLSI) circuit technology, many complex circuits have become easily realizable, and algorithms that once seemed impossible to implement now have attractive implementation possibilities. An amalgamation of conventional and unconventional computer arithmetic methods will therefore set the trend for the investigation of new designs in the near future. The trend is toward an exciting interaction between current research in theoretical computer science and practical applications in the area of VLSI design. This has mainly been driven by two facts:

I. Chip design without the assistance of computers is no longer conceivable.
II. The construction of powerful chips is pushing toward the absolute frontiers of current possibilities and capabilities.

An ASIC (application-specific integrated circuit) provides a very efficient solution for the well-defined logic of a complex mathematical function. Optimizing a digital system means striking a good balance between the physical structure of the circuits and the informational structure of the programs running on them. Because complex systems belong to the class of programmable systems, the hardware support offered by circuits may be oriented toward programmable structures whose functionality is actualized by the embedded information (the program).
Circuit implementations are evaluated on the basis of the following objectives: latency (propagation delay), power consumption, cycle time, and area, as well as throughput (computation rate) for pipelined circuits. In addition, the circuit structure may be constrained by pre- and post-specified operations on the input/output (I/O) ports. In general, propagation delay and power consumption depend on the resources as well as on the steering logic.

2. Literature review

Digital arithmetic (DA) encompasses the study of number representations, operations on numbers, hardware implementation of arithmetic units, and applications in general-purpose and application-specific systems. An arithmetic unit (processor) is a system that performs operations on numbers. The most common cases are those in which these numbers are:

I. fixed-point numbers (integers, rational numbers), or
II. floating-point numbers.

Floating-point numbers approximate real numbers and facilitate computation over a wide dynamic range. An arithmetic processor operates on one or more operands, depending on the application. The operands are characterized and represented as sets of values. The operation is selected from an allowable set, which usually includes addition, subtraction, multiplication, division, square root, change of sign, comparison, and so on. The results can be DA numbers, logical variables (conditions), and/or singularity conditions (exceptions). The domain of digital arithmetic systems can be considered part of computing science. This viewpoint presents digital systems as systems that compute their associated transfer functions. From a functional viewpoint, however, a digital system is simply a computational system, because future technologies may impose different physical means of implementation (for example, different kinds of ASIC systems).
Therefore, we decided to start our approach with a functionally oriented introduction to digital arithmetic circuits such as multiplication and division, considered as a subdomain of computing science. Technology-dependent knowledge is described only as supporting background for various design options. The initial and abstract level of computation is described at the algorithmic level: algorithms specify the steps to be executed in order to perform a computation. The most concrete level consists of two realms:

I. the huge and complex domain of application software, and
II. the very tangible domain of real machines implemented in a given technology.

An intermediate level provides the means for embodying an algorithm in the physical structure of a machine or in the informational structure of a program. It comprises:

I. the domain of formal programming languages, and
II. the domain of hardware architecture.

Both are described using specific and rigorous formal tools. The hardware embodiment of computations is done in digital systems. Pseudo-code is used to express algorithms; the 'main user' of this kind of language is the human mind. A survey of present computational techniques for arithmetic operations is presented below.

2.1 Works related to multiplier

Conventional multiplication is done using shift-and-add operations [1], where a sequential mechanism is used and produces a large propagation delay. In parallel multipliers, the partial products are generated through Booth's encoding [2] and added with the help of parallel adders; the generation and addition stages therefore limit the overall speed of the parallel multiplier.
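For reference, the sequential shift-and-add mechanism can be sketched in software (an illustrative Python model, not the hardware datapath; the function name and the width parameter are ours):

```python
def shift_add_multiply(a: int, b: int, width: int = 8) -> int:
    """Sketch of sequential shift-and-add multiplication.

    The multiplier b is scanned LSB-first; whenever the current bit is 1,
    the correspondingly shifted multiplicand is added to a running sum.
    One addition per multiplier bit is why the latency of this scheme
    grows with the operand width.
    """
    product = 0
    for i in range(width):
        if (b >> i) & 1:          # current multiplier bit
            product += a << i     # add shifted multiplicand (partial product)
    return product

print(shift_add_multiply(13, 11))  # 143
```

Booth encoding [2] and the modified Booth algorithm [3] reduce the number of nonzero partial products that such a loop must accumulate, which is precisely what parallel multipliers exploit.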
To reduce the number of partial products, the modified Booth algorithm [3] is one of the most popular mechanisms, while for speed improvement the Wallace tree algorithm [4], which reduces the number of sequential addition stages, can be incorporated. Another solution for partial-product addition was reported by Wang in 1995, where compressors [5] are used in the partial-product addition stages, reducing carry propagation significantly. Vedic Mathematics is the ancient system of Indian mathematics with a unique technique of calculation based on 16 sutras (formulae). The "Urdhva-tiryakbyham" formula (a Sanskrit term meaning 'vertically and crosswise') is used for the multiplication of small numbers, and a few research papers [6] have so far been published using it for fast multiplication. However, Mehta et al. [7], exploring multiplication using the "Urdhva-tiryakbyham" sutra, indicate carry-propagation issues. Likewise, a multiplier design using the "all from 9 and the last from 10" formula (the "Nikhilam Navatascaramam Dasatah" sutra) was reported by Tiwari et al. [8] in 2009, but without any circuit-level hardware implementation.

2.2 Works related to floating point arithmetic

The IEEE Standard 754-1985 for Binary Floating-Point Arithmetic [9] was revised, and an important addition is the definition of decimal floating-point arithmetic. This is intended mainly to provide a robust, reliable framework for financial applications, which are often subject to legal requirements concerning the rounding and precision of results, because binary floating-point arithmetic may introduce small but unacceptable errors. There is increased interest in decimal floating-point arithmetic in both industry and academia as the IEEE 754R draft approaches the stage where it may soon become the new standard for floating-point arithmetic.
The P754 draft describes two different possibilities for encoding decimal floating-point values: the binary encoding, which uses a binary integer to represent the significand (BID, or Binary Integer Decimal), and the decimal encoding, which uses the Densely Packed Decimal (DPD) [10] method to represent groups of up to three decimal digits of the significand as 10-bit declets. An inherent problem of binary floating-point arithmetic in financial calculations is that most decimal fractions cannot be represented exactly in binary floating-point formats, so unacceptable errors may accumulate in the course of a computation. Decimal floating-point arithmetic addresses this problem, but at a performance cost compared with binary floating-point operations implemented in hardware. Despite its performance disadvantage, decimal floating-point is required by applications that need results identical to those calculated by hand. This is true for currency conversion, banking, billing, and other financial applications. Sometimes these requirements are mandated by law; at other times they are necessary to avoid large accounting discrepancies. Because of the importance of this problem, a number of decimal solutions exist, both in hardware and in software. Software solutions include C#, COBOL, and XML, which provide decimal operations and datatypes; Java and C/C++ also have packages, called BigDecimal and decNumber, respectively. Hardware solutions were more prominent earlier in the computer age, with the ENIAC and UNIVAC. More recent examples include the CADAC, IBM's z900 [11] and z9 architectures, and numerous other proposed hardware implementations [12] [13] [14].
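The motivation for decimal arithmetic can be shown with a toy example (Python is used here only for illustration; its standard decimal module is a software implementation of decimal floating-point arithmetic):

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 exactly, so ten 10-cent
# increments accumulate a small but, for financial use, unacceptable error:
binary_total = sum(0.10 for _ in range(10))
print(binary_total == 1.0)               # False

# A decimal floating-point type keeps the value exact, matching the
# hand-calculated result that financial applications require:
decimal_total = sum(Decimal("0.10") for _ in range(10))
print(decimal_total == Decimal("1.00"))  # True
```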
More hardware examples can be found in [15], and a more in-depth discussion in Wang's Processor Support for Decimal Floating-Point Arithmetic [16]. It has been estimated that hardware approaches to decimal floating-point will achieve average speedups of 100-1000 times over software [15]. Hardware implementations would undoubtedly yield a significant speedup, though perhaps not as dramatic, and it will make a difference only for applications that spend a large percentage of their time in decimal floating-point computation.

2.3 Works related to divider

Many recursive techniques have been proposed to implement high-speed dividers [17-27]: digit-recurrence methods (restoring [23-25, 27] and non-restoring [20-22, 27]), division by convergence (the Newton-Raphson method [27]), and division by series expansion (the Goldschmidt algorithm [26]). The cost of digit-recurrence algorithms [20-25, 27] in terms of area and computational complexity is low, but the large number of iterations makes the latency (propagation delay) high. Some investigators rely on higher-radix implementations of digit recurrence [28, 29] to reduce the number of iterations; the latency improves over the earlier reports [20-25], but these schemes additionally increase the hardware complexity. Other attractive approaches are based on functional iteration, such as the Newton-Raphson [27] and Goldschmidt [26] algorithms, which use multiplication along with series expansion so that the number of quotient digits obtained per iteration is doubled. The drawbacks of these methods are that the operands must first be normalized, the most-used primitive is multiplication, and the remainder is not obtained directly [30].

2.4 Works related to reciprocal

Reciprocal approximation and division play a pivotal role in several applications such as digital signal and image processing, computer graphics, and scientific computing [31], because division can be computed as follows: the reciprocal of the divisor is computed first and then used as the multiplier in a subsequent multiplication by the dividend [32]. This method is especially economical when different dividends are to be divided by the same divisor. Such 'reciprocal approximation' methods are typically based on the Newton-Raphson iteration [33]. They are nevertheless used less frequently than the other basic arithmetic operations, such as addition and multiplication, because of their poor performance (high computation time) [34]. Many algorithms, such as the Taylor series method [35] and iterative techniques such as Newton-Raphson [32, 33] and Goldschmidt [34], have been reported for implementing this function, and a substantial amount of work building on these basic algorithms has been investigated in the literature [31-34]. These algorithms have long latencies or large area overhead and exhibit a linear convergence rate, so a large number of operations is required for the task. Moreover, iterative division starts from an initial approximation of the reciprocal of the divisor, usually implemented as a look-up table; a large ROM is then required to accommodate the denominator, leading to higher delay and power. The achievable precision of the reciprocal unit is limited by the ROM size, since the look-up table grows with the precision level.

2.5 Works related to square root

Several methods have been used to perform the square-root operation, as summarized in [38-61]. Among them is the digit-recurrence technique, in which one digit of the result is produced per cycle. To reduce the number of cycles it is convenient to use a high-radix algorithm; however, the complexity of digit selection limits the direct application of this approach to radices up to eight. A comprehensive presentation of this method is given in [43].
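The radix-2 digit-recurrence idea (one result digit per cycle) can be sketched as an integer square-root routine. This is a simplified software model under our own naming; practical hardware units use redundant remainders and more elaborate digit-selection logic:

```python
def digit_recurrence_sqrt(x: int, bits: int = 16) -> int:
    """Restoring radix-2 digit-recurrence square root (sketch).

    Each iteration brings down two bits of the radicand and selects one
    result bit, mirroring the one-digit-per-cycle behaviour of hardware
    digit-recurrence units. Requires 0 <= x < 2**(2 * bits).
    """
    root = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # bring down the next pair of radicand bits
        remainder = (remainder << 2) | ((x >> (2 * i)) & 0b11)
        trial = (root << 2) | 1      # trial subtrahend: 4*root + 1
        root <<= 1
        if trial <= remainder:       # digit selection: result bit is 1
            remainder -= trial
            root |= 1
    return root                      # floor(sqrt(x)); remainder holds x - root**2

print(digit_recurrence_sqrt(144))  # 12
```

A high-radix version retires several result bits per iteration at the cost of a much more complex digit-selection function, which is the trade-off discussed above.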
Another technique is based on functional iteration [46], [53], [57], in which the implementation relies on a multiplier. Although the convergence of these algorithms can be quadratic, the time for a multiplication and the difficulty of producing a correctly rounded result make the execution time comparable to that of digit-recurrence algorithms. For division, very-high-radix algorithms have been proposed that simplify the selection function by prescaling the divisor [38], [40], [41], [48], [49], [52], [58], [59], [60]. In [44] it was demonstrated that this technique, combined with selection by rounding, can achieve a faster implementation than other known dividers, including dividers based on functional iteration; further comparisons confirming this claim are presented in [43], [56]. The area of this implementation is larger than that of low-radix dividers, but it seems reasonable given the number of transistors available in today's chips. Because of the similarities between division and square root, combined implementations have been proposed [39], [42], [45], [47], [50], [51], [52], [54], [61]. This motivates the development of a square-root algorithm similar to very-high-radix division with prescaling. An algorithm of this type was presented in [52]; however, the resulting implementation is complex and is not compatible with the corresponding division unit.

3. Scope of work

About a hundred years ago, researchers in the West discovered the Indian Vedas: millions of ancient texts containing some of the most profound knowledge; indeed, the Sanskrit word 'Veda' means 'knowledge'. Researchers found Vedic texts on medicine, architecture, astronomy, ethics, etc., and according to the Indian tradition all knowledge is contained in these Vedas.
Popular accounts describe "Vedic Mathematics" as an "amazingly compact and powerful system of calculation", and one also hears claims such as "once you have learnt the 16 sutras by heart, you can solve any long problem orally"; in essence, it is a compilation of techniques for simple arithmetic and algebra. One of the main purposes of Vedic mathematics is to transform tedious calculations into simpler, verbally manageable operations with little help from pen and paper. Humans can perform mental operations only on numbers of small magnitude; Vedic mathematics provides techniques for operating easily on numbers of large magnitude, and it offers more than one method for complex calculations such as multiplication and division. For each operation there is at least one generic method, along with methods directed at specific cases that simplify the calculation further. After intensive research and a literature survey, we found that algorithms based on these methods, implemented in hardware, improve the results.

4. PhD Plan

4.1 Work done so far

A systematic study of computational mathematics, together with its application to application-specific processor design, has been completed. Hardware implementation of decimal arithmetic is becoming a topic of interest to researchers because of its wide use in human-centric applications, where exact results are required. Computer algorithms and architectures are generally based on binary number systems because of their simplicity compared with decimal number systems. However, many decimal numbers cannot be represented exactly in binary format due to finite-word-length effects, so exact implementation is impractical. Recently, decimal arithmetic has been commercialized for general-purpose computers through Binary Coded Decimal (BCD) encoding techniques.
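As a small illustration of the BCD encoding mentioned above (the helper name is ours, for exposition only):

```python
def to_bcd_8421(n: int) -> str:
    """Encode a non-negative integer in 8421 BCD: each decimal digit is
    stored in its own 4-bit group, so decimal values are represented
    exactly -- at the cost of density, since only 10 of the 16 possible
    4-bit codes are used per digit."""
    if n < 0:
        raise ValueError("expects a non-negative integer")
    return " ".join(format(int(d), "04b") for d in str(n))

print(to_bcd_8421(2014))  # 0010 0000 0001 0100
```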
The work done in the last 1.5 years is summarized as follows.

Reciprocal: Reciprocal approximation and division play a pivotal role in several applications such as digital signal and image processing, computer graphics, and scientific computing [31]. At the algorithmic and structural levels, many reciprocal implementation techniques have been developed to reduce the propagation delay (latency) by reducing the number of iterations, but the principle behind the implementation methodology was the same in all cases. Vedic Mathematics [36] is the ancient system of Indian mathematics with a unique technique of calculation based on 16 sutras (formulae). In a paper we report a reciprocal algorithm and its architecture based on this ancient Indian mathematics. Sahayaks (auxiliary fractions) is a Sanskrit term adopted from the Vedas; this formula is employed to implement the reciprocal circuitry. Using the Vedic methodology, the reciprocal is implemented by transforming the digits into smaller digits and carrying out the entire division on the transformed digits, which reduces circuit-level complexity and yields a substantial reduction in propagation delay. For the transistor-level implementation of the decimal reciprocal unit, optimized 8421 BCD recoding techniques [37] were adopted. The reciprocal unit is fully optimized in terms of calculation, so the reciprocal of any configuration of input digits can be elaborated. The transistor-level implementation of the reciprocal circuitry combines BCD arithmetic with Vedic mathematics; performance parameters such as propagation delay and dynamic switching power consumption of the proposed method were calculated with SPICE (Spectre) using 90 nm CMOS technology and compared with other designs such as a Newton-Raphson (NR)-based implementation [32].
The calculated results revealed that a 5-digit reciprocal unit has a propagation delay of only ~3.57 µs with ~30.8 mW dynamic switching power.

Division: Division is a basic operation in many scientific and engineering applications. Division is generally a sequential operation and is therefore more costly in terms of computational complexity and latency than other mathematical operations such as multiplication and addition. At the algorithmic and structural levels, many division techniques have been developed to reduce the latency of divider circuitry by reducing the number of iterations, but the principle behind the division was the same in all cases. A division algorithm and its architecture based on this ancient mathematics have been reported in a paper. Paravartya-Yojayet (PY) is a Sanskrit term from the Vedas meaning 'transpose and apply', and it is implemented in the proposed division circuitry. Using the Vedic methodology, division is implemented through multiplication and addition, which reduces the number of iterations and yields a substantial reduction in propagation delay. The transistor-level implementation of the division circuitry combines Boolean logic with Vedic mathematics. Performance parameters such as propagation delay and dynamic switching power consumption were obtained with SPICE (Spectre) using 90 nm CMOS technology and by scaling down to 65 nm, 45 nm, and 32 nm technology. Moreover, the proposed methodology was compared with other designs based on digit recurrence, convergence, and series expansion. The calculated results revealed that the (32÷16)-bit circuitry has a propagation delay of only ~23.5 ns with ~35.7 mW dynamic switching power.

Division: In another paper we report a decimal division technique and its hardware implementation based on this ancient Vedic mathematics.
'Nikhilam Navatascaramam Dasatah (NND)' is a Sanskrit term meaning 'all from 9 and the last from 10', adopted from the Vedas; this formula is employed to achieve a high-speed decimal division circuit. In this approach the decimal divider is implemented through the complement of the divisor digits instead of the actual digits, together with addition and a small amount of multiplication, which reduces the number of iterations and yields a substantial reduction in propagation delay. The algorithmic implementation of the division circuitry combines BCD arithmetic with Vedic mathematics; performance parameters such as propagation delay and dynamic switching power consumption were calculated with the Xilinx ISE simulator and compared with other designs such as digit-recurrence (D-R) and Newton-Raphson (N-R) based implementations. The calculated results revealed that the (6÷3)-digit divider circuitry has a propagation delay of only ~41 ns with ~93 mW dynamic switching power.

4.2 Future work

Reported works: division, reciprocal.
Future work: some special computational techniques (square root, inverse square root, etc.), floating-point arithmetic (e.g. multiplier, divider), reversible arithmetic circuits, field logarithmic techniques, and special base number systems (complex bases: 1+j, 1-j).

4.3 Achievements and goals

Objective O1:
   T1: Background study of computational arithmetic
   T2: Course work
   T3: Literature review about reciprocal, divider
   T4: Course work
Objective O2:
   T5: Proposal of an algorithm for divider
   T6: Proposal of algorithms for reciprocal
   T7: Proposal of an algorithm for improved division based on different Vedic formulae
Objective O3:
   T8: Literature review about square root and inverse square root
   T9: Proposal of an algorithm for square root and inverse square root
Objective O4:
   T10: Literature review about reversible circuits
   T11: Design of reversible arithmetic circuits
   T12: Literature review about field logarithmic techniques and special base number systems
Objective O5:
   T13: Proposal of an algorithm for field logarithmic techniques
   T14: Proposal of algorithms for operations on special base number systems

Gantt chart of PhD Research Plan: the Gantt chart distributes objectives O1-O5 (tasks T1-T14) over seven semesters, from Autumn 2013 through Autumn 2016, ending with final thesis writing.

5. Conclusions

Computational arithmetic operations such as multiplication, division, squaring, square root, and reciprocal play a pivotal role in digital signal processing, image processing, computer graphics, application-specific (embedded) systems, cryptography, etc. All these operations are usually implemented in software but may use special-purpose hardware for speed. Although computers keep getting faster every year, there are always new applications that need more processing than is available. To meet the demands of these and future applications, we need to develop new techniques (algorithms) for accelerating applications on commercial hardware. Ancient mathematics, especially Vedic mathematics, contains methods developed for fast mental calculation.
A thorough study of the literature on Vedic mathematics indicates that the Vedic procedures may be fruitful at the algorithmic level as well as for the VLSI implementation of the above-mentioned circuits.

6. References

[1] M. M.-Dastjerdi, A. A.-Kusha, and M. Pedram, "BZ-FAD: A Low-Power Low-Area Multiplier Based on Shift-and-Add Architecture," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 2, pp. 302-306, Feb. 2009.
[2] A.D. Booth, "A signed binary multiplication technique," Quarterly Journal of Mechanics and Applied Mathematics, vol. IV, pp. 236-240, 1952.
[3] J. Hu, L. Wang, and T. Xu, "A low-power adiabatic multiplier based on modified Booth algorithm," in Proc. of the IEEE Int. Symp. on Integrated Circuits, Singapore, Sept. 2007, pp. 489-492.
[4] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electronic Computers, vol. EC-13, no. 1, pp. 14-17, Jan. 1964.
[5] Z. Wang, G.A. Jullien, and W.C. Miller, "A New Design Technique for Column Compression Multipliers," IEEE Trans. on Computers, vol. 44, no. 8, pp. 962-970, Aug. 1995.
[6] M. Ramalatha, K.D. Dayalan, P. Dharani, and S.D. Priya, "High Speed Energy Efficient ALU Design using Vedic Multiplication Techniques," in Proc. of the IEEE Int. Conf. on Advances in Computational Tools for Engineering Applications, Zouk Mosbeh, July 2009, pp. 600-603.
[7] P. Mehta and D. Gawali, "Conventional versus Vedic mathematical method for hardware implementation of a multiplier," in Proc. of the IEEE Int. Conf. on Advances in Computing, Control, and Telecommunication Technologies, Trivandrum, Kerala, Dec. 2009, pp. 640-642.
[8] H.D. Tiwari, G. Gankhuyag, C.M. Kim, and Y.B. Cho, "Multiplier design based on ancient Indian Vedic Mathematics," in Proc. of the IEEE Int. SoC Design Conf., Busan, Nov. 2008, pp. 65-68.
[9] Institute of Electrical and Electronics Engineers, "Standard for Binary Floating-Point Arithmetic," IEEE Std 754-1985.
[10] M. F.
Cowlishaw, "Densely packed decimal encoding," IEE Proceedings - Computers and Digital Techniques, vol. 149, pp. 102-104, May 2002.
[11] F.Y. Busaba, C.A. Krygowski, W.H. Li, E.M. Schwarz, and S.R. Carlough, "The IBM z900 Decimal Arithmetic Unit," in Proc. of the 35th Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335, IEEE Computer Society, November 2001.
[12] M. A. Erle, J. M. Linebarger, and M. J. Schulte, "Potential Speedup Using Decimal Floating-Point Hardware," in Proc. of the Thirty-Sixth Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, California, IEEE Press, pp. 1073-1077, November 2002.
[13] M. A. Erle and M. J. Schulte, "Decimal Multiplication Via Carry-Save Addition," in Proc. of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, The Hague, Netherlands, IEEE Computer Society Press, pp. 348-358, June 2003.
[14] G. Bohlender and T. Teufel, "A Decimal Floating-Point Processor for Optimal Arithmetic," pp. 31-58 in Computer Arithmetic: Scientific Computation and Programming Languages, Teubner, Stuttgart, 1987.
[15] M. F. Cowlishaw, "Decimal Floating-Point: Algorism for Computers," in Proc. of the 16th IEEE Symposium on Computer Arithmetic, pp. 104-111, June 2003.
[16] L. Wang, "Processor Support for Decimal Floating-Point Arithmetic," Technical Report, Electrical and Computer Engineering Department, University of Wisconsin-Madison. Available upon request.
[17] M. Langhammer, "Improved subtractive division algorithm," in Proc. IEEE Int. ASIC Conf. 1998, Rochester, NY, Sept. 13-16, 1998, pp. 343-347.
[18] R. Hagglund, P. Lowenborg, and M. Vesterbacka, "A polynomial-based division algorithm," in Proc. IEEE Int. Symp. on Circuits and Systems, May 26-29, 2002, vol. 3, pp. 571-574.
[19] S-G. Chen and C-C. Li, "An efficient division algorithm and its architecture," in Proc. IEEE Int. Computer, Communication, Control and Power Engineering Conf. 1993, Beijing, China, Oct.
19-21, 1993, vol. 1, pp. 24-27.
[20] W.P. Marnane, S.J. Bellis, and P. Larsson-Edefors, "Bit-serial interleaved high speed division," Electronics Letters, vol. 33, no. 13, pp. 1124-1125, June 1997.
[21] H. T. Vergos, "An Efficient BIST Scheme for Non-Restoring Array Dividers," in Proc. IEEE 10th Euromicro Conf. on Digital System Design Architectures, Methods and Tools 2007, Lubeck, Aug. 29-31, 2007, pp. 664-667.
[22] J.B. Andersen, A.F. Nielsen, and O. Olsen, "A Systolic ON-LINE Non-restoring Division Scheme," in Proc. IEEE Twenty-Seventh Hawaii Int. Conf. on System Sciences, Wailea, USA, Jan. 4-7, 1994, pp. 339-348.
[23] N. Aggarwal, K. Asooja, S.S. Verma, and S. Negi, "An Improvement in the Restoring Division Algorithm (Needy Restoring Division Algorithm)," in Proc. IEEE Int. Conf. Computer Science and Information Technology 2009, Beijing, Aug. 8-11, 2009, pp. 246-249.
[24] P. Nair, D. Kudithipudi, and E. John, "Design and Implementation of a CMOS Non-Restoring Divider," in Proc. IEEE Region 5 Conf., San Antonio, USA, April 7-9, 2006, pp. 211-217.
[25] C. Senthilpari, S. Kavitha, and J. Joseph, "Lower delay and area efficient non-restoring array divider by using Shannon based adder technique," in Proc. IEEE Int. Conf. on Semiconductor Electronics, Melaka, June 28-30, 2010, pp. 140-144.
[26] S.F. Oberman and M.J. Flynn, "Division Algorithms and Implementations," IEEE Trans. on Computers, vol. 46, no. 8, pp. 833-854, Aug. 1997.
[27] J-P. Deschamps, G.J.A. Bioul, and G.D. Sutter, Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems, John Wiley & Sons, Inc., 2006.
[28] T. M. Carter and J.E. Robertson, "Radix-16 signed-digit division," IEEE Trans. Computers, vol. C-39, no. 12, pp. 1424-1433, Dec. 1990.
[29] H. R. Srinivas and K. K. Parhi, "A Fast Radix 4 Division Algorithm," in Proc. IEEE Int. Symp. on Circuits & Systems, London, May 30-June 2, 1994, vol. 4, pp. 311-314.
[30] G. Sutter and J-P.
Deschamps, "High speed fixed point dividers for FPGAs," in Proc. IEEE Int. Conf. on Field Programmable Logic and Applications, Prague, Aug. 31-Sept. 2, 2009, pp. 448-452.
[31] M.J. Schulte, J.E. Stine, and K.E. Wires, "High-Speed Reciprocal Approximations," in Proc. of the Thirty-First Asilomar Conference on Signals, Systems & Computers, vol. 2, 1997, pp. 1183-1187.
[32] D. Chen and S.-B. Ko, "Design and Implementation of Decimal Reciprocal Unit," in Proc. Canadian Conf. on Electrical and Computer Engineering, 2007, pp. 1094-1097.
[33] D.L. Fowler and J. E. Smith, "An Accurate, High Speed Implementation of Division by Reciprocal Approximation," in Proc. 9th Symp. on Computer Arithmetic, 1989, pp. 60-67.
[34] J.-A. Pineiro and J.D. Bruguera, "High-Speed Double-Precision Computation of Reciprocal, Division, Square Root, and Inverse Square Root," IEEE Trans. on Computers, vol. 52, no. 12, pp. 1377-1388, Dec. 2002.
[35] P.M. Farmwald, "High Bandwidth Evaluation of Elementary Functions," in Proc. Fifth IEEE Symp. Computer Arithmetic, pp. 139-142, 1981.
[36] J.S.S.B.K.T. Maharaja, Vedic Mathematics, Motilal Banarsidass Publishers Pvt Ltd, Delhi, 2001.
[37] J. Bhattacharya, A. Gupta, and A. Singh, "A high performance binary to BCD converter for decimal multiplication," in Proc. of the IEEE International Symp. on VLSI Design Automation and Test (VLSI-DAT), Hsin Chu, 2010, pp. 315-318.
[38] W.S. Briggs and D.W. Matula, "Method and Apparatus for Performing Division Using a Rectangular Aspect Ratio Multiplier," U.S. Patent No. 5 046 038, Sept. 1991.
[39] J. Cortadella and T. Lang, "High-Radix Division and Square Root with Speculation," IEEE Trans. Computers, vol. 43, no. 8, pp. 919-931, Aug. 1994.
[40] M.D. Ercegovac, "A Division with Simple Selection of Quotient Digits," in Proc. Sixth IEEE Symp. Computer Arithmetic, pp. 94-98, Aarhus, Denmark, June 1983.
[41] M.D. Ercegovac and T. Lang, "Simple Radix-4 Division with Operands Scaling," IEEE Trans. Computers, vol. 39, no.
9, pp. 1,204-1,208, Sept. 1990. [42] M.D. Ercegovac and T. Lang, “Module to Perform Multiplication,Division and Square Root in Systolic Arrays for Matrix Computations”, J. Parallel and Distributed Computing, vol. 11, no. 3, pp. 212-221, Mar. 1991. [43] M.D. Ercegovac and T. Lang, “Division and Square Root: Digit-Recurrence Algorithms and Implementations”. Kluwer Academic, 1994. [44] M.D. Ercegovac, T. Lang, and P. Montuschi, “Very-High Radix Division with Prescaling and Selection by Rounding”, IEEE Trans. Computers, vol. 43, no. 8, pp. 909-918, Aug. 1994. [45] J. Fandrianto, “Algorithm for High Speed Shared Radix 8 Division and Radix 8 Square-Root”, Proc. Ninth IEEE Symp. Computer Arithmetic, pp. 68-75, Santa Monica, Calif., Sept. 1989. [46] M.J. Flynn, “On Division by Functional Iteration”, IEEE Trans. Computers, vol. 19, no. 8, pp. 702-706, Aug. 1970. [47] J.B. Gosling and C.M.S. Blakeley, “Arithmetic Unit with Integral Division and Square Root”, IEEE Proc., Part E, vol. 134, pp. 17-23,Jan. 1987. [48] J. Klir, “A Note on Svoboda's Algorithm for Division”, Information Processing Machines (Stroje na Zpracovani Informaci), no. 9, pp. 35-39, 1963. [49] E.V. Krishnamurthy, “On Range-Transformation Techniques for Division”, IEEE Trans. Computers, vol. 19, no. 2, pp. 157-160, Feb. 1970. [50] S.E. McQuillan and J.V. McCanny, “VLSI Module for High Performance Multiply, Square Root and Divide”, IEEE Proc., Part E, vol. 139, no. 6, pp. 505-510, June 1992. [51] S.E. McQuillan, J.V. McCanny, and R. Hamill, “New Algorithms and VLSI Architectures for SRT Division and Square Root”, Proc. 11th IEEE Symp. Computer Arithmetic, pp. 80-86, Windsor, Ontario, Canada, July 1993. [52] D.W. Matula, “Highly Parallel Divide and Square Root Algorithms for a New Generation Floating Point Processors”, Proc. SCAN-89 Symp. Computer Arithmetic and Self-Validating Numerical Methods, Oct. 1989. [53] P. Montuschi and M. Mezzalama, “Survey of Square Rooting Algorithms,” IEEE Proc., Part E, vol. 
137, no. 1, pp. 31-40, Jan. 1990. [54] P. Montuschi and L. Ciminiera, “Reducing Iteration Time When Result Digit is Zero for Radix-2 SRT Division and Square Root with Redundant Remainders”, IEEE Trans. Computers, vol. 42, no. 2, pp. 239246, Feb. 1993. [55] S.F. Oberman and M.J. Flynn, “Design Issues in Division and Other Floating-Point Operations”, IEEE Trans. Computers, vol. 46, no. 2, pp. 154-161, Feb. 1997. [56] S.F. Oberman and M.J. Flynn, “Division Algorithms and Implementations, “ IEEE Trans. Computers, vol. 46, no. 8, pp. 833-854, Aug. 1997. [57] C.V. Ramamoorthy, J.R. Goodman, and K.H. Kim, “Some Properties of Iterative Square-Rooting Methods Using High-Speed Multiplication”, IEEE Trans. Computers, vol. 21, no. 8, pp. 837-847, Aug. 1972. [58] A. Svoboda, “An Algorithm for Division”, Information Processing Machines, vol. 9, pp. 25-32, 1963. [59] C. Tung, “A Division Algorithm for Signed-Digit Arithmetic”, IEEE Trans. Computers, vol. 17, no. 9, pp. 887-889, Sept. 1970. [60] D.C. Wong and M.J. Flynn, “Fast Division Using Accurate Quotient Approximations to Reduce the Number of Iterations”, IEEE Trans. Computers, vol. 41, no. 8, pp. 981-995, Aug. 1992. [61] J.H.P. Zurawski and J.B. Gosling, “Design of a High-Speed Square Root Multiply and Divide Unit”, IEEE Trans. Computers, vol. 36, no. 1, pp. 13-23, Jan. 1987. 7. Publications of the author related to the proposed work Journals: [1] P. Saha, D. Kumar, P. Bhattacharyya, and A. Dandapat, “Vedic division methodology for high-speed very large scale integration applications”, IET Journal of Engineering 2014, pp. 1-9. DOI: 10.1049/joe.2013.0213 , Online ISSN 2051-3305 [2] P. Saha, D. Kumar, P. Bhattacharyya, and A. Dandapat, “Design of 64-bit squarer based on Vedic mathematics,” Journal of Circuits Systems and Computers 2014. vol. 23, no.7, pp. xx, 2014.ISSN: 0218-1266, DOI: 10.1142/S0218126614500923. [3] P. Saha, D. Kumar, P. Bhattacharyya, and A. 
Dandapat, “Improved Division Algorithm using Vedic Mathematics for VLSI Applications”, Microelectronics Journal. (communicated) Conference papers: [1] P. Saha, D. Kumar, P. Bhattacharyya, A. Dandapat, “Reciprocal Unit Based on Vedic Mathematics for Signal Processing Applications”, IEEE International Symposium on Electronic System Design, 2013, pp. 41-45. Digital Object Identifier: 10.1109/ISED.2013.15