Modulo Multiplicative Inverse Circuit Design

Xiaoying Li1, Fuming Sun2, Ehua Wu1,3

1 Department of Computer and Information Science, FST, University of Macau, Macao, China
2 School of Information Engineering, University of Science and Technology, Beijing, China
3 State Key Lab of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
Email: ya27404@umac.mo, sunfuming@263.net, ehwu@umac.mo

Supported by Research Grant of University of Macau

Abstract: This paper presents the circuit design of an arithmetic module used in cryptography, the modulo multiplicative inverse, and its implementation in FPGA hardware. This modular arithmetic function consists of iterative computations of division, multiplication and accumulation with a variable number of loop iterations. Besides standard HDL programming and schematic input, Simulink-to-FPGA has been tried as a different design flow. Experimental results of the different design methods are compared, with a discussion of their pros and cons.

I. INTRODUCTION

With the increasing importance of information security, research on cryptography and cipher design [1] has become more and more significant. Modular arithmetic, also called clock arithmetic, is the central mathematical concept in cryptography and is used in almost every cipher, from the Caesar cipher to the RSA cipher. Unlike basic modular calculations, the modulo multiplicative inverse is a relatively complex and time-consuming iterative procedure whose number of loop iterations is not fixed.

In this paper, two design flows, HDL-based circuit description and Simulink-to-FPGA circuit module design, are used to implement the modulo multiplicative inverse function in FPGA hardware. The module can be applied to cipher hardware design as a basic unit, or it can be encapsulated into a math IP core [2].

II. MOD-MULTIPLICATIVE INVERSE FUNCTION

A. Definition of Element

Suppose m is a positive integer and u ∈ {0, 1, 2, …, m-1}. If there exists $u^{-1}$ ∈ {0, 1, 2, …, m-1} which satisfies

$u \cdot u^{-1} \equiv 1 \pmod{m}$    (1)

then $u^{-1}$ is called the multiplicative inverse element of u modulo m.

B. Computational Method

Let m be a positive integer. For any u ∈ {0, 1, 2, …, m-1}, the procedure for calculating the mod-multiplicative inverse element $u^{-1}$ is as follows.

Step 1. If u = 0, then $u^{-1}$ is set to zero, end; else
Step 2. Set initial values as n1 = m, n2 = u, b1 = 0 and b2 = 1;
Step 3. Divide n1 by n2 as n1 = q*n2 + r, get the quotient q and the remainder r;
Step 4. If r ≠ 0, update variables as n1 = n2, n2 = r, t = b2, b2 = b1 - q*b2, b1 = t, then go back to Step 3;
Step 5. If n2 ≠ 1, $u^{-1}$ does not exist, end; else
Step 6. If b2 < 0, update b2 as b2 = b2 + m;
Step 7. $u^{-1}$ = b2, end.

The function $u^{-1} = f(u, m)$ is an iterative procedure of integer division, multiplication and accumulation. It converges when the remainder r reaches zero, so the number of loop iterations varies with u. For $m = 2^{16} + 1$ in our cipher design, every input u in the range {0, 1, 2, …, 65536} has a corresponding mod-multiplicative inverse element $u^{-1}$ (with u = 0 mapped to zero by Step 1), and the maximum number of loop iterations is eighteen. The flowchart of the algorithm is shown in Fig. 1.

Figure 1. Flowchart of modulo multiplicative inverse algorithm.

III. CIRCUIT DESIGN

From the analysis of the mod-multiplicative inverse function, the computation can be separated into two loops. One is the division loop: m and u are the initial dividend and divisor, and in each subsequent division the previous divisor and remainder become the current dividend and divisor respectively. A zero remainder r terminates the division loop and determines the convergence of the function. The other is the multiplication and accumulation loop, whose input is the quotient q from the divider. Two temporary variables, b1 and b2, are swapped and updated by multiplication and accumulation in each iteration.
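As a software reference for the two loops described above, the following minimal Python sketch follows Steps 1 to 7 directly. It is a model for checking results, not the circuit itself; the function name `mod_inverse` and the returned division count are assumptions introduced here for illustration.

```python
def mod_inverse(u, m):
    """Software model of Steps 1-7 (hypothetical helper, not from the paper).

    Returns (inverse, divisions); inverse is None when no inverse exists.
    """
    if u == 0:                      # Step 1: zero is mapped to zero
        return 0, 0
    n1, n2, b1, b2 = m, u, 0, 1     # Step 2: initial values
    divisions = 0
    while True:
        q, r = divmod(n1, n2)       # Step 3: division loop
        divisions += 1
        if r == 0:                  # zero remainder ends the iteration
            break
        n1, n2 = n2, r              # Step 4: divisor/remainder move up
        b1, b2 = b2, b1 - q * b2    # Step 4: swap and multiply-accumulate
    if n2 != 1:                     # Step 5: no inverse exists
        return None, divisions
    if b2 < 0:                      # Step 6: offset a negative result by m
        b2 += m
    return b2, divisions            # Step 7


if __name__ == "__main__":
    inv, count = mod_inverse(2, 65537)
    print(inv, count)
```

For u = 2 and m = 65537 the model returns the inverse 32769 after two divisions, since 2 * 32769 = 65538 ≡ 1 (mod 65537). Returning the division count alongside the result makes it easy to study how the iteration count varies with u.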
The final result depends on the value of b2, with an offset m added if b2 is less than zero. In this section, besides the standard FPGA development flow, the Simulink-to-FPGA design flow is applied to the circuit design of the mod-multiplicative inverse module.

A. Schematic and HDL-based Circuit Design

As shown in Fig. 2, the circuit structure of the mod-multiplicative inverse module is mainly composed of two iterative procedures: div_loop and mulacc_loop. In the division loop, the DFF registers holding the dividend n1 and the divisor n2 are controlled by reset and enable signals. est loads the initial values m and u for the first division loop. ed denotes that the results from the divider are ready after a long latency. en2 is valid when n2 is not equal to 1. Once the value of n2 is 1, which means that the remainder r will be zero, the registers of n1 and n2 are locked and retain their contents.

In the multiplication and accumulation loop, two DFF registers hold b1 and b2. Similarly, est loads the initial values zero and one into b1 and b2 respectively for the first mul-acc loop. The registers in both div_loop and mulacc_loop in Fig. 2 therefore implement multiplexing with latching. Signal em is generated after ed according to the latency of the mul-acc operation. For this loop, the reset must be done correctly to avoid accumulating errors between consecutive inputs.

The timing diagram and control signals are illustrated in Fig. 3. Signal est accompanies every input u and starts the module. Signal ed is generated by a counter of the division latency. eout is the enable signal of the output ummi ($u^{-1}$).

In the HDL-based design, both the unsigned pipelined divider and the signed parallel multiplier are generated by the Xilinx Core Generator tool. If m is 65537 ($2^{16} + 1$), the latency of the 17-bit unsigned divider with both quotient and remainder outputs is twenty cycles.

Figure 2. Circuit structure

Figure 3. Timing diagram and control signals.

B. Simulink-to-FPGA Circuit Design

With the continued growth in complexity of FPGA-based designs, more flexible, efficient and higher-level design methodologies have begun to change the traditional HDL-centric flows. Unlike behavioral or structural specification in VHDL or Verilog, higher-level languages such as C and Java allow designers familiar with software programming to describe hardware. A higher-level design flow can directly incorporate model simulation with hardware implementation. Matlab&Simulink is a well-known tool that allows designers to model a system at a high level and is ideal for certain classes of applications, such as digital signal processing, automotive control and communication. The algorithmic complexity of designs and the requirement of fast time-to-market drive this need.

To take advantage of the modeling and simulation functionality of Simulink, major FPGA vendors have released products that are integrated into Simulink as dedicated blocksets. Two popular ones are Xilinx System Generator for DSP [3] and Altera DSP Builder [4]. AccelChip [5] also provides a DSP synthesis tool for FPGA. These blocksets and tools implement a full FPGA design flow from Simulink modeling to simulation to hardware [6, 7], and can transform a Simulink model into synthesizable HDL code with a test bench.

Therefore, in this paper, besides the HDL circuit description method, the Simulink-to-FPGA flow using the Xilinx System Generator tool is also applied to the mod-multiplicative inverse module design. The top two levels of the model are shown in Fig. 4 and Fig. 5. The sub-modules div_loop and mulacc_loop in Fig. 5 follow the same idea as in Fig. 2, so their internal Simulink blocks are not illustrated in detail.

Figure 4. Top-level model of Mod-Mul Inv function in Simulink

Figure 5. Second-level model of Mod-Mul Inv function in Simulink

The whole circuit model can be built up directly from the Xilinx System Generator blockset in Simulink.
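The div_loop and mulacc_loop sub-modules, common to Fig. 2 and Fig. 5, reduce to a small amount of register state. The sketch below is a behavioral software model under stated assumptions, not the authors' HDL or Simulink model: the class name `ModInvRegs`, its methods and the helper `run` are hypothetical, both loops are assumed to be gated by the same n2 ≠ 1 condition, and the divider and multiplier latencies (signals ed and em) are collapsed into one call per iteration. It also assumes an inverse exists for every nonzero u, as is the case for m = 65537.

```python
class ModInvRegs:
    """Behavioral model of the n1/n2 and b1/b2 registers (hypothetical)."""

    def __init__(self, m):
        self.m = m
        self.n1 = self.n2 = self.b1 = self.b2 = 0

    def start(self, u):
        # est: load the initial values for a new input u
        self.n1, self.n2 = self.m, u
        self.b1, self.b2 = 0, 1

    def step(self):
        # One ed/em event: divider result ready, then mul-acc result ready.
        en2 = self.n2 != 1                      # enable valid while n2 != 1
        if not en2:
            return False                        # registers locked, contents kept
        q, r = divmod(self.n1, self.n2)
        self.n1, self.n2 = self.n2, r           # div_loop register update
        self.b1, self.b2 = self.b2, self.b1 - q * self.b2  # mulacc_loop update
        return True

    def output(self):
        # eout: add the offset m when b2 is negative
        return self.b2 + self.m if self.b2 < 0 else self.b2


def run(u, m=65537):
    """Drive the register model until it locks (assumes an inverse exists)."""
    if u == 0:                                  # Step 1 is handled outside the loops
        return 0
    regs = ModInvRegs(m)
    regs.start(u)
    while regs.step():
        pass
    return regs.output()
```

Locking on n2 = 1 rather than testing r = 0 gives the same result as the algorithm whenever the inverse exists, because the division that would follow n2 = 1 always produces a zero remainder.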
It is easy to encapsulate small modules into subsystems in a hierarchical way. Multiplexer, register, adder/subtractor, constant and multiplier are all basic blocks that can be customized to different types. A CORDIC divider is available in the reference blockset, but this type of divider is not well suited to integer modular arithmetic: it cannot output a remainder, and its quotient has computation error for integer division. To satisfy the computation requirement of the CORDIC divider, the number format has to be changed from integer to an extended real number with fractional bits, which increases the latency of the divider. Other blocks have to be built for the remainder calculation.

C. Experimental Results

The HDL-based circuit design flow is completed with the Xilinx ISE tool, which performs synthesis, implementation, place & route and device programming for the whole cipher design. Behavioral and post simulation are supported by the Mentor Graphics ModelSim tool. For the arithmetic units, an unsigned pipelined integer divider with both quotient and remainder outputs is parameterized and generated by the Xilinx Core Generator tool. Multiplication uses the embedded multiplier in the hardware. The target FPGA chip is the Xilinx Virtex II xc2v2000.

In the Simulink-to-FPGA design flow, the circuit model is built with Simulink basic blocks and Xilinx-specific blocks. Input and output data are exchanged with the Matlab workspace, which makes number format conversion and debugging convenient. The System Generator tool generates synthesizable VHDL code for the circuit model, together with a complete ISE project including a test bench. By importing the project into ISE, or the related design files into other third-party tools, the subsequent standard design flow can be completed.

TABLE I. PERFORMANCE COMPARISON

                      Design Flow
Resource and Speed    HDL-based    Simulink-to-FPGA    Mixed
SLICES                682          3573                746
FLIP FLOPS            1142         5600                1253
LUTS                  550          6082                568
MULT18X18S            1            9                   1
Max. Frequency        103 MHz      84 MHz              103 MHz

Area and speed can be compared in Table I. Owing to the simple circuit structure in Fig. 2, the HDL-based design can describe the control logic quickly, and the optimized divider and multiplier cores achieve area-efficient, high-speed performance. In this case, the resource consumption of the Simulink-to-FPGA flow is much larger than that of the HDL flow, and its maximum frequency is lower. Because the CORDIC divider in the reference blockset of System Generator is not a good choice for this module, a mixed HDL and Simulink design flow is also adopted, in which the divider is replaced by an HDL-based IP core. Simulink and System Generator support the mixed design and HDL co-simulation well. The resource consumption is greatly reduced, with performance equivalent to the HDL-based circuit.

D. Discussion

With the development of FPGA technology, design methodology drives the evolution of various EDA tools. Building on the standard development flow (Fig. 6), effort has shifted toward high-level design and synthesis. There are many conversion tools, such as C-to-FPGA, Stateflow diagram to VHDL (SF2VHD) and Matlab-to-FPGA (MATCH). The features of the Simulink-to-FPGA flow can be summarized as follows.

• Friendly graphical interface. Although schematic entry is also a GUI, Simulink makes it easier to organize input data and much more convenient to observe outputs in many ways.
• Easy number format conversion. Conversion from double to fixed-point numbers is parameterized in the functional blocks, but data type consistency must be watched along the data flow.
• Flexible modeling and simulation. The design can be organized into hierarchical modules, is easy to combine with other entry methods when making design decisions, and is convenient to debug and simulate.
• Fast time-to-market for DSP development. With the assistance of dedicated DSP blocks for FPGA, the Simulink-to-FPGA flow can greatly shorten the development cycle from algorithm to hardware. The arithmetic blocksets might be further reinforced.

Figure 6. Standard FPGA development flow

As shown in Fig. 7, high-level designs are supported by more and more EDA vendors. Currently, most methods target synthesizable HDL so as to follow the standard FPGA development flow, which also keeps them compatible with the other parts of the whole system. Besides C synthesis, a flow combined with Matlab&Simulink has become applicable as well.

Figure 7. High-level FPGA design flow and tools

IV. CONCLUSION

In this paper, a circuit module design of the modulo multiplicative inverse function for ciphers has been proposed and mapped to FPGA hardware by different design flows. The standard HDL-based design shows good performance using optimized arithmetic IP cores. The Simulink-to-FPGA high-level design takes advantage of a good graphical interface and flexible design choices. For other DSP applications such as image processing and communication, more functional blocks will be encapsulated into FPGA-mapped blocks in Simulink, and the flow will perform better with future improvements.

ACKNOWLEDGMENT

The research is supported by the Research Grant of University of Macau & University PhD Studentship to the first author.

REFERENCES

[1] A. Daly, W. Marnane, “Efficient architectures for implementing Montgomery modular multiplication and RSA modular exponentiation on reconfigurable logic”, Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, Monterey, California, USA, pp. 40-49, 2002.
[2] D. W. Matula, A. Fit-Florea, M. A. Thornton, “Table Lookup Structures for Multiplicative Inverses Modulo 2^k”, 17th IEEE Symposium on Computer Arithmetic (ARITH'05), pp. 156-163, 2005.
[3] Xilinx, “Xilinx System Generator”, Version 6.2, Xilinx Inc., USA, http://www.xilinx.com/ise/optional_prod/system_generator.htm.
[4] Altera, “Altera DSP Builder”, Version 5.1, Altera Inc., USA, http://www.altera.com/products/software/products/dsp/dsp-builder.html.
[5] AccelChip, “Integrating MATLAB Algorithms into FPGA Designs”, Xcell Journal, pp. 73-75, 2005.
[6] M. A. Shanblatt, B. Foulds, “A Simulink-to-FPGA Implementation Tool for Enhanced Design Flow”, Proceedings of the 2005 IEEE International Conference on Microelectronic Systems Education (MSE'05), pp. 89-90, 2005.
[7] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee, “A System for Synthesizing Optimized FPGA Hardware from MATLAB”, Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design, pp. 314-319, 2001.