2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing

A Fast Modular Multiplication Method

Ali Şentürk and Mustafa Gök
Department of Electrical and Electronics Engineering
Cukurova University
Adana, Turkey
Email: asenturk@cu.edu.tr, musgok@cu.edu.tr

978-0-7695-4222-5/10 $26.00 © 2010 IEEE
DOI 10.1109/IIHMSP.2010.58

Authorized licensed use limited to: STAATS U UNIBIBL BREMEN. Downloaded on April 22,2021 at 08:55:52 UTC from IEEE Xplore. Restrictions apply.

Abstract-Fast execution of modular multiplication is crucial to speed up public key cryptography applications. This paper presents a modular multiplication method that exploits the high-speed multiply-accumulate instructions supported in modern general-purpose architectures. The algorithm is implemented as a C program and tested on large operands by using the GNU Multiple Precision Library (GMP). The performance of the method is compared with the performance of Montgomery's algorithm. The comparison results show that the proposed method runs up to 5 times faster than Montgomery's algorithm.

Keywords-modular multiplication method; multiply-add;

I. INTRODUCTION

Cryptography algorithms and key exchange schemes such as RSA, Diffie-Hellman, and ElGamal use modular multiplication [1], [2], [3]. Also, the Digital Signature Standard proposed by the National Institute of Standards and Technology requires the computation of modular exponentiation [4]. In particular, the RSA algorithm relies heavily on modular multiplication. RSA encrypts a plain text $T$ as $T^E \bmod M$, where $E$ is the public exponent and $M$ is a large modulus. RSA decrypts a ciphertext $C$ as $C^D \bmod M$, where $D$ is the private exponent. Since the exponentiation can be performed by executing a series of modular multiplications, the efficient execution of this operation improves the performance of the algorithm.

Direct computation of the modular multiplication requires a multiplication and a division. To compute a modular exponent, most algorithms feed the result of one modular multiplication into the next, which causes a data dependency. Because of this data dependency, hardware may waste many cycles if a division instruction is used in the computation. Previous work on modular multiplication mainly focuses on improving Montgomery's algorithm [5], since it does not require trial divisions. Over the years it has also been accepted that Montgomery's algorithm is the best fit for hardware implementations, and significant research effort has been dedicated to efficient hardware implementations of this algorithm [6], [7], [8].

Efficient hardware implementations decrease the time needed for the decryption of the cipher. To make brute force attacks difficult, the preferred bit sizes for the modulus, public key, and private key have been increased. Recently, applications that use 1024- to 2048-bit sizes have emerged. A strategy to process such large operand sizes is to divide the operand into sub-operands equal to the processor's word size, operate on them, and combine the sub-results to get the final result. To free programmers from the burden of this tedious task, multiple-precision libraries have been written that contain the functions needed to operate on arbitrary-precision operands. Using these libraries is the best choice when there is no special hardware support, which is the case in general. One of the best known is the GNU Multiple Precision Library (GMP) [9]. In this paper, GMP is used in the test programs written to verify the proposed method.

Modern general-purpose processor architectures have multiple high-performance multiply and multiply-add units in their datapaths due to the extensive use of this operation in all kinds of applications. Thus, significant research effort is dedicated to decreasing the latency of these instructions. For example, the current Intel Core Duo processor family, which uses 65 nm technology, has a 32-bit integer multiply instruction with a 3-cycle latency, whereas the addition instructions in this processor family have a 1-cycle latency [10]. Most modular multiplication algorithms, including Montgomery's algorithm, have been studied under the assumption that there is a large performance gap between the addition and the multiplication instructions. However, as stated above, this gap has narrowed in modern processors. To the best of the authors' knowledge, there has not been a significant study that examines the performance gap between multiply-intensive modular multiplication methods and other methods.

This study presents a modular multiplication algorithm that uses the multiply-add operation and shows that this type of approach can give better results in comparison to addition-intensive algorithms. The rest of the paper is organized as follows: Section II reviews Montgomery's algorithm, Section III presents the proposed method, Section IV discusses the efficiency of the algorithm, and Section V presents the conclusion.

II. MONTGOMERY'S ALGORITHM

The Montgomery modular multiplication algorithm is very popular since it does not require trial division [5]. For a given modulus, the algorithm maps the numbers to be operated on to another domain where they can be processed easily. Due to the initial mapping cost, the algorithm is not advantageous for a single modular multiplication but is quite effective for long exponentiation operations. The algorithm computes $A \cdot B \bmod N$ by selecting a new radix $R$ that is coprime to $N$ and satisfies $R > N$. The algorithm requires the computation of values $R^{-1}$ and $N'$ that satisfy $R \cdot R^{-1} - N \cdot N' = 1$ for the reduction operation explained below. The steps of the algorithm are given below:
• The operands $A$ and $B$ are converted to the $N$-residue class as $\bar{A} = A \cdot R \bmod N$ and $\bar{B} = B \cdot R \bmod N$.
• The multiplication is performed in this domain as $T = \bar{A} \cdot \bar{B}$.
• $T \equiv A \cdot B \cdot R^{2} \pmod{N}$ should be restored to the $N$-residue class by computing $T R^{-1} \bmod N$. However, this cannot be done by multiplying $T$ by $R^{-1}$ and dividing the result by $N$, since division is to be avoided.
• Montgomery reduction is performed as follows:

    $m \leftarrow (T \bmod R) \cdot N' \pmod{R}$    $(0 \le m < R)$
    $t \leftarrow (T + mN)/R$
    if ($t \ge N$) return $t - N$ else return $t$

Note that if $R$ is equal to some power of 2, division by $R$ is inexpensive.

III. THE PROPOSED METHOD

Modern general-purpose processor architectures have multiple high-performance multiply and multiply-add units in their datapaths. The proposed method exploits this property to perform modular multiplication for large moduli. The proposed method is based on the following equivalency expression:

$$X \bmod Z \equiv (X_H \cdot 2^{l} + X_L) \bmod Z \qquad (1)$$

where $Z$ is a $k$-bit modulus and $X$ is a $2k$-bit number which is divided at bit position $l$ into two parts as

$$X_H = x_{2k-1} \cdot 2^{2k-1-l} + \ldots + x_{l} \cdot 2^{0}$$
$$X_L = x_{l-1} \cdot 2^{l-1} + \ldots + x_1 \cdot 2^{1} + x_0 \cdot 2^{0}$$

Because $2^{l} \bmod Z$ can be precomputed, the term $X_H \cdot 2^{l}$ can be replaced by the shorter product $X_H \cdot (2^{l} \bmod Z)$ without changing the residue. Using Equation 1, the proposed algorithm computes a shorter representation of the product value in the given modulus at each iteration. The steps of the proposed method are given below:
• Initial step:
  – Assume $A$ and $B$ are two $n$-bit numbers and $M$ is the $n$-bit modulus. Multiply $A$ and $B$: $P(0) = A \cdot B$.
  – Find the location of the most significant bit in $P(0)$, $ls(0)$, and divide the product $P(0)$ into two parts based on $ls(0)$ such that either the more significant part's size is one less than the less significant part's size or both sizes are equal.
  – Initially, the high part is labelled $PH(0)$ and the low part is labelled $PL(0)$. The high and low parts of the product are generated in each iteration $i$ as $PH(i) = P(i)_{ls(i):x(i)}$ and $PL(i) = P(i)_{(x(i)-1):0}$, where $x(i) = ls(i)/2 + n/2$, taken as an integer bit position.
• Iteration step:
  – Perform $P(i+1) = PH(i) \cdot W(i) + PL(i)$ iteratively until $PH(i) = 1$, where $W(i) = 2^{x(i)} \bmod M$.
• Final step:
  – When $PH(i) = 1$, $M < P(i) < 4M$. To obtain the correct result, the subtraction $P(i) - 3M$ is performed first; if the result is negative, it is corrected by adding $M$, and if the result is still negative, another $M$ is added.

The flow chart of the algorithm is given in Figure 1. The algorithm requires pre-computation of the $W(i)$ values. For an $n/2$-bit modulus the number of $W(i)$ values is $n/2$. These elements are computed only once for each new modulus. In the worst case, the algorithm computes the modular multiplication in $\log_2 n$ multiply-add iterations and two addition iterations.

[Figure 1. The flow chart. Boxes: Start; $P(0) = A \cdot B$, $PH(0) = P(0)_{ls(i):x(i)}$, $PL(0) = P(0)_{x(i)-1:0}$; $P(i) = PH(i-1) \cdot W(i-1) + PL(i-1)$, then $PH(i)$, $PL(i)$, $W(i)$; decision $PH(i) = 1$ (No: repeat the iteration; Yes: $P(i) = P(i) - 3n$); decision $P(i) < 0$ (Yes: $P(i) = P(i) + n$; No: Finish).]

A numerical example that computes $A^2 \equiv (7F)_{16}^{2} \bmod (81)_{16}$ using the proposed method is given. To show the intermediate values more clearly, the numbers in the example are written in hexadecimal.

Example
• Iteration 1: $P(0) = A^2 = 3F01$, $ls(0) = 13$, $x(0) = 11$
  $PH(0) = P(0)_{13:11} = 7$, $PL(0) = P(0)_{10:0} = 701$, $W(0) = 71$
  $P(1) = (7 \cdot 71) + 701 = A18$
• Iteration 2: $P(1) = A18$, $ls(1) = 11$, $x(1) = 10$
  $PH(1) = P(1)_{11:10} = 2$, $PL(1) = P(1)_{9:0} = 218$, $W(1) = 79$
  $P(2) = (2 \cdot 79) + 218 = 30A$
• Iteration 3: $P(2) = 30A$, $ls(2) = 9$, $x(2) = 8$
  $PH(2) = P(2)_{9:8} = 3$, $PL(2) = P(2)_{7:0} = A$, $W(2) = 7F$
  $P(3) = (3 \cdot 7F) + A = 187$
• Final step: $P(3) - 3M = 187 - 3 \cdot 81 = 4$.

IV. RESULTS

The performance of the proposed method and Montgomery's algorithm are compared using C programs. These programs compute modular exponents that require many modular squaring and modular multiplication operations. The modular operations are performed on 128, 256, 512, 1024, and 2048 bit sizes on randomly selected operands. The exponentiations are done using the well-known Left-to-Right Binary method. The standard library of C does not support operations on operands of this length. To overcome this problem, the GNU Multiple Precision Library [9], which has been developed for cryptography applications and research, internet security applications, and computer algebra systems, is used.

The performance tests are executed on several PC systems. Similar results are obtained for all systems. In this section, the results for the best available configuration are presented, which has an Intel® Core™ 2 Duo T6600 processor. The processor has a 2.20 GHz clock, 2x64 KB L1 cache and 2048 KB L2 cache. The memory of the system is DDR3 2x2048 MB at 533 MHz. The performance results are expressed in terms of clock cycles. In order to count the number of cycles for the critical regions of the programs, first the standard rdtsc function is used. Although this timestamp counter function is known to give excellent resolution, its use has recently been discouraged with the advent of multicore systems. More reliable cycle counter routines for various architectures are written by M. Frigo and offered as a header file (cycle.h) which can be downloaded from the internet [11]. The tests are performed with these two routines. The results obtained by the second method are faster, but the percentage of the speed-up achieved by the proposed method has not changed.

Figure 2 shows the execution clock cycles for 2048-bit operand exponents where the exponent sizes vary from 3 to 2048 bits. In this figure, the x axis shows the different exponent sizes, the y axis shows the cycles executed for the corresponding exponent in units of 100 million cycles, and the cycle counts are measured using the rdtsc function. Figure 3 shows the results of similar operations on 2048-bit operands, this time measured by using the cycle.h routines. These figures show that, compared to Montgomery's algorithm, the proposed method has better performance results when measured using both methods.

[Figure 2. Results for 2048-bit exponent computations using rdtsc. x axis: exponent bit size; y axis: clock cycles (x 10^8); curves: Montgomery Algorithm, Proposed Algorithm.]

[Figure 3. Results for 2048-bit exponent computations using cycle.h [11] routines. x axis: exponent bit size; y axis: clock cycles (x 10^8); curves: Montgomery Algorithm, Proposed Algorithm.]

Table I presents the minimum, maximum, and average execution times for the proposed method and Montgomery's method in terms of cycle counts for various sizes. For example, the first line shows the best, worst, and average performance results obtained for 128-bit exponents, which report an average speed-up of 3.23 for the proposed method. In general, the average speed-up ranges from 2.40 to 4.58 for the tested operand sizes, and it is greater than four for sizes of 512 bits and above.

Table I
EXECUTION TIMES FOR THE PROPOSED METHOD AND MONTGOMERY'S METHOD

            Proposed Method (Cycles)           Montgomery's Method (Cycles)         Speed-Up
Bit Size    Min.       Max.         Avr.       Min.        Max.         Avr.        Min.   Max.   Avr.
128         38863      487355       194438     136642      1204456      628026      3.51   2.47   3.23
256         78771      1420639      651334     189882      3455309      1565042     2.41   2.43   2.40
512         166716     5084156      2374925    873796      19883424     10120386    5.24   3.91   4.26
1024        915024     19681398     10538454   4485107     90255231     48243927    4.90   4.59   4.58
2048        7505410    141866153    49757653   27548081    478877036    218171419   4.38   3.37   4.38

V. CONCLUSION

This paper presents a fast modular multiplication method that can be used as an alternative to Montgomery's algorithm or other modular multiplication methods. The method mainly uses the multiply-add operation, which is directly supported on most general-purpose processors and DSPs. Due to the improvements in the implementation of the arithmetic hardware that performs this operation, the proposed method can be faster than other algorithms that depend on addition. Test programs that use the GNU arithmetic library show that the exponent computations that use the proposed method are at least four times faster than Montgomery's algorithm for 1024-bit and 2048-bit exponents. The performance improvement can also be related to the fact that in each iteration the proposed method performs a multiply-add operation on operands about half the size of those in the previous iteration. Multiple-precision libraries tend to use parallel algorithms to operate on large-precision operands. Thus, shortening the multiply-add operands might improve the performance quadratically. Another advantage of the proposed method over Montgomery's algorithm is that it does not place any specific restriction on the selection of the modulus. The planned future work is to implement this algorithm on state-of-the-art FPGA platforms, which contain special high-speed circuitry for integer multiplication.

REFERENCES

[1] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, p. 126, 1978.
[2] W. Diffie and M. Hellman, "New directions in cryptography," IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644–654, 1976.
[3] T. Elgamal, "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Transactions on Information Theory, vol. 31, no. 4, pp. 469–472, 1985.
[4] (1994) Digital signature standard (DSS). [Online]. Available: http://www.itl.nist.gov/fipspubs/fip186.htm
[5] P. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[6] S. Eldridge and C. Walter, "Hardware implementation of Montgomery's modular multiplication algorithm," IEEE Transactions on Computers, vol. 42, no. 6, pp. 693–699, 1993.
[7] E. Brickell, "A survey of hardware implementations of RSA," in Advances in Cryptology, CRYPTO89 Proceedings. Springer, 1986, pp. 368–370.
[8] N. Nedjah and L. Mourelle, "Three hardware architectures for the binary modular exponentiation: Sequential, parallel, and systolic," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 53, no. 3, pp. 627–633, 2006.
[9] (2010) GMP: GNU multiple precision library. [Online]. Available: http://gmplib.org/
[10] Intel 64 and IA-32 Architectures Optimization Reference Manual, 248966-020, Intel Corporation, 2009.
[11] (2008) cycle.h. [Online]. Available: www.fftw.org/cycle.h
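As an illustration of the Montgomery steps in Section II, the following is a minimal Python sketch rather than the C and GMP programs used in the paper. It assumes an odd modulus and picks $R = 2^k$ with $k$ equal to the bit length of $N$; the function names `montgomery_setup`, `redc`, and `montgomery_mulmod` are hypothetical, and `pow(n, -1, R)` requires Python 3.8 or later.

```python
def montgomery_setup(n: int):
    """Return (r_bits, R, n_prime) for an odd modulus n, with R = 2**r_bits > n."""
    if n < 3 or n % 2 == 0:
        raise ValueError("this sketch assumes an odd modulus n >= 3")
    r_bits = n.bit_length()
    R = 1 << r_bits
    # n_prime satisfies R * R^-1 - n * n_prime = 1, i.e. n_prime = -n^-1 mod R.
    n_prime = (-pow(n, -1, R)) % R
    return r_bits, R, n_prime


def redc(t: int, n: int, r_bits: int, n_prime: int) -> int:
    """Montgomery reduction: return t * R^-1 mod n for 0 <= t < R * n."""
    mask = (1 << r_bits) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> r_bits           # t + m*n is a multiple of R, so the shift is exact
    return u - n if u >= n else u


def montgomery_mulmod(a: int, b: int, n: int) -> int:
    """Compute a * b mod n by mapping into and out of the N-residue domain."""
    r_bits, R, n_prime = montgomery_setup(n)
    a_bar = (a * R) % n                          # mapping cost, paid once per operand
    b_bar = (b * R) % n
    ab_bar = redc(a_bar * b_bar, n, r_bits, n_prime)   # equals a*b*R mod n
    return redc(ab_bar, n, r_bits, n_prime)            # second reduction leaves the domain
```

In an exponentiation the two mapping multiplications would be done once and every intermediate product would stay in the residue domain, which is why the mapping cost only pays off for long operation chains.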
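The multiply-add folding idea of Section III can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the split position is chosen as $x = n$ plus half of the bits above position $n-1$ (an assumed rule, since the rounding in the paper's expression for $x(i)$ is not recoverable), and the loop runs until the value fits in $n$ bits so that a single conditional subtraction replaces the subtract-$3M$ final step. The names `precompute_w` and `fold_mulmod` are hypothetical.

```python
def precompute_w(m: int) -> dict:
    """Table of W values, 2**x mod m, for every split position x this sketch can use."""
    n = m.bit_length()
    return {x: pow(2, x, m) for x in range(n, 2 * n)}


def fold_mulmod(a: int, b: int, m: int, w: dict = None) -> int:
    """Compute a * b mod m with repeated multiply-add folding (0 <= a, b < m)."""
    if m < 1 or not (0 <= a < m and 0 <= b < m):
        raise ValueError("expected m >= 1 and operands in the range [0, m)")
    n = m.bit_length()
    if w is None:
        w = precompute_w(m)              # done once per modulus in practice
    p = a * b
    while p >> n:                        # p still has bits at position n or above
        extra = p.bit_length() - n       # number of bits above position n-1
        x = n + extra // 2               # assumed split rule, see the note above
        p_high = p >> x                  # PH: bits ls..x
        p_low = p & ((1 << x) - 1)       # PL: bits x-1..0
        p = p_high * w[x] + p_low        # congruent to the old p, and strictly smaller
    # Here p < 2**n <= 2*m, so one conditional subtraction completes the reduction.
    return p - m if p >= m else p
```

Each pass shrinks the value because $W < M < 2^{x}$, so termination does not depend on the particular split rule; the rule only affects how many multiply-add passes are needed.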
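The Left-to-Right Binary exponentiation mentioned in Section IV, and its chain of dependent modular multiplications, can be sketched as below. The function name `modexp_left_to_right` and the pluggable `mulmod` parameter are assumptions made for illustration; any modular multiplication routine with the signature `mulmod(a, b, m)` could be passed in, and the default uses Python's built-in arithmetic.

```python
def modexp_left_to_right(base: int, exponent: int, modulus: int, mulmod=None) -> int:
    """Compute base**exponent mod modulus, scanning exponent bits from the most significant."""
    if exponent < 0 or modulus < 1:
        raise ValueError("expected exponent >= 0 and modulus >= 1")
    if mulmod is None:
        def mulmod(a, b, m):
            return (a * b) % m           # stand-in for any modular multiplication method
    base %= modulus
    result = 1 % modulus
    for bit in bin(exponent)[2:]:
        # Every step consumes the previous result: this is the data dependency
        # discussed in the introduction.
        result = mulmod(result, result, modulus)     # modular squaring
        if bit == "1":
            result = mulmod(result, base, modulus)   # modular multiplication
    return result
```

For a $k$-bit exponent this performs $k$ squarings and at most $k$ multiplications, so the cost of one modular multiplication dominates the running time of RSA-style exponentiation.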
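The average speed-up column of Table I can be recomputed from the average cycle counts as a quick consistency check. The sketch below copies the averages from the table; the names `AVERAGE_CYCLES` and `average_speedups` are hypothetical.

```python
# Average cycle counts from Table I: bit size -> (proposed method, Montgomery's method).
AVERAGE_CYCLES = {
    128: (194438, 628026),
    256: (651334, 1565042),
    512: (2374925, 10120386),
    1024: (10538454, 48243927),
    2048: (49757653, 218171419),
}


def average_speedups(table: dict) -> dict:
    """Return Montgomery cycles divided by proposed-method cycles, rounded to two decimals."""
    return {bits: round(montgomery / proposed, 2)
            for bits, (proposed, montgomery) in table.items()}


if __name__ == "__main__":
    for bits, speedup in average_speedups(AVERAGE_CYCLES).items():
        print(f"{bits:5d} bits: {speedup:.2f}x")
```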