2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing
A Fast Modular Multiplication Method
Ali Şentürk and Mustafa Gök
Department of Electrical and Electronics Engineering
Cukurova University
Adana, Turkey
Email: asenturk@cu.edu.tr, musgok@cu.edu.tr
Abstract-Fast execution of modular multiplication is crucial for speeding up public-key cryptography applications. This paper presents a modular multiplication method that exploits the high-speed multiply-accumulate instructions supported in modern general-purpose architectures. The algorithm is implemented as a C program and tested on large operands by using the GNU Multiple Precision Library (GMP). The performance of the method is compared with the performance of Montgomery's algorithm. The comparison results show that the proposed method runs up to 5 times faster than Montgomery's algorithm.
Keywords-modular multiplication method; multiply-add
I. INTRODUCTION
Cryptography algorithms and key exchange schemes such as RSA, Diffie-Hellman, and ElGamal use modular multiplication [1], [2], [3]. Also, the Digital Signature Standard proposed by the National Institute of Standards and Technology requires the computation of modular exponentiation [4]. In particular, the RSA algorithm relies mostly on modular multiplication. RSA encrypts a plain text $T$ as $T^E \bmod M$, where $E$ is the public exponent and $M$ is a large modulus. RSA decrypts a ciphertext $C$ as $C^D \bmod M$, where $D$ is the private exponent. Since the exponentiation operation can be performed by executing a series of modular multiplications, the efficient execution of this operation improves the performance of the algorithm. Direct computation of the modular multiplication requires a multiplication and a division. To compute a modular exponent, most algorithms use the result of the modular multiplication iteratively, which causes a data dependency. Because of the data dependency, hardware may waste many cycles if a division instruction is used in the computation. Previous work on modular multiplication mainly focuses on improving Montgomery's algorithm [5], since it does not require trial divisions. Over the years it has also been accepted that Montgomery's algorithm is the best fit for hardware implementations, and significant research effort has been dedicated to efficient hardware implementation of this algorithm [6], [7], [8]. Efficient hardware implementations decrease the time needed for the decryption of the cipher. To make brute force attacks difficult, the preferred bit sizes for the modulus, public key, and private key have been increased. Recently, applications that use 1024- to 2048-bit sizes have appeared. A strategy to process such large operand sizes is to divide the operand into sub-operands equal to the processor's word size, operate on them, and combine the sub-results to get the final result. To free programmers from the burden of this tedious task, multiple-precision libraries that contain the necessary functions to operate on arbitrary-precision operands have been written. Use of these libraries is the best choice when there is no special hardware support, which is the case in general. One of the well-known libraries is the GNU Multiple Precision Library (GMP) [9]. In this paper, GMP is used in the test programs written to verify the proposed method.
Modern general-purpose processor architectures have multiple high-performance multiply and multiply-add units in their datapaths due to the extensive use of this operation in all kinds of applications. Thus, significant research effort is dedicated to decreasing the latency of these instructions. For example, the current Intel Core Duo processor family that uses 65 nm technology has a 32-bit integer multiply instruction with 3-cycle latency, whereas the addition instructions in this processor family have 1-cycle latency [10]. Most modular multiplication algorithms, including Montgomery's algorithm, have been studied under the assumption that there is a large performance gap between the addition and the multiplication instructions. However, as stated above, this gap has narrowed considerably in modern processors. To the best of the authors' knowledge, no significant study has examined the performance gap between multiply-intensive modular multiplication methods and other methods. This study presents a modular multiplication algorithm that uses the multiply-add operation and shows that this type of approach can give better results in comparison to addition-intensive algorithms. The rest of the paper is organized as follows: Section II reviews Montgomery's algorithm, Section III presents the proposed method, Section IV discusses the efficiency of the algorithm, and Section V presents the conclusion.
978-0-7695-4222-5/10 $26.00 © 2010 IEEE
DOI 10.1109/IIHMSP.2010.58
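The introduction notes that a modular exponentiation such as $T^E \bmod M$ is carried out as a series of modular multiplications. The following is a minimal Python sketch of the left-to-right binary method under that reading. The function name `modexp_left_to_right` is hypothetical, Python's built-in `%` stands in for whichever modular multiplication routine is being studied, and the small RSA parameters in the usage lines are a toy illustration, not values from the paper.

```python
def modexp_left_to_right(base, exponent, modulus):
    """Compute base**exponent mod modulus, scanning exponent bits MSB first.

    Every step is one modular squaring, followed by one modular
    multiplication when the current exponent bit is 1.
    """
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:              # most significant bit first
        result = (result * result) % modulus   # modular squaring
        if bit == "1":
            result = (result * base) % modulus  # modular multiplication
    return result


# Toy RSA round trip (illustrative parameters only): M = 61 * 53.
M, E, D = 3233, 17, 2753
cipher = modexp_left_to_right(65, E, M)
plain = modexp_left_to_right(cipher, D, M)
print(cipher, plain)
```

Each result feeds the next squaring, which is the data dependency mentioned above: a slow modular multiplication is paid once or twice per exponent bit.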
II. MONTGOMERY'S ALGORITHM
Montgomery's modular multiplication algorithm is very popular since it does not require trial division [5]. For a given modulus, the algorithm maps the numbers to be operated on to another domain where they can be processed easily. Due to the initial mapping cost, the algorithm is not
Authorized licensed use limited to: STAATS U UNIBIBL BREMEN. Downloaded on April 22,2021 at 08:55:52 UTC from IEEE Xplore. Restrictions apply.
advantageous for a single modular multiplication but is quite effective for long exponentiation operations. The algorithm computes $A \cdot B \bmod N$ by choosing a new radix $R$ coprime to $N$ such that $R > N$. The algorithm requires the computation of values $R^{-1}$ and $N'$ that satisfy $R \cdot R^{-1} - N \cdot N' = 1$ for the reduction operation explained below. The steps of the algorithm are given below:
• The operands $A$ and $B$ are converted to the $N$-residue class as $\bar{A} = A \cdot R \bmod N$ and $\bar{B} = B \cdot R \bmod N$.
• The multiplication is performed in this domain as $T = \bar{A} \cdot \bar{B}$.
• $T \equiv A \cdot B \cdot R^{2} \pmod N$ should be restored to the $N$-residue class by computing $T R^{-1} \bmod N$. However, this cannot be done by multiplying $T$ by $R^{-1}$ and dividing the result by $N$, since division is to be avoided.
• Montgomery reduction is performed as follows:
and
$P_L(i) = P(i)_{(x(i)-1):0}$
where $x(i) = ls(i)/2 + n/2$.
• Iteration step:
– Perform $P(i+1) = P_H(i) \cdot W(i) + P_L(i)$ iteratively until $P_H(i) = 1$, where $W(i) = 2^{x(i)} \bmod M$.
• Final step:
– When $P_H(i) = 1$, $M < P(i) < 4M$. To obtain the correct result, first the subtraction $P(i) - 3M$ is performed; if the result is negative, it is corrected by adding $M$; if the result is still negative, another $M$ is added.
The flow chart of the algorithm is given in Figure 1. The algorithm requires pre-computation of the $W(i)$ values. For an $n/2$-bit modulus the number of $W(i)$ values is $n/2$. These elements are computed only once for each new modulus. In the worst case, the algorithm computes the modular multiplication in $\log_2 n$ multiply-add iterations and two addition iterations.
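The folding idea behind the iteration step can be sketched in Python as follows. This is a simplified variant, not the authors' C/GMP program. The names `precompute_weights` and `fold_mod` are hypothetical. The split point is assumed to be $x = \lfloor (ls + n)/2 \rfloor$, and the loop runs until the value fits in $n$ bits so that a single conditional subtraction finishes the reduction, instead of the $P_H(i) = 1$ test and the $3M$ correction described above.

```python
def precompute_weights(M):
    """Return W[x] = 2**x mod M for every split point the loop may request."""
    n = M.bit_length()
    return {x: pow(2, x, M) for x in range(n, 2 * n)}


def fold_mod(P, M, W=None):
    """Reduce P modulo M using multiply-add folding steps only.

    Assumes 0 <= P < 2**(2n), where n is the bit length of M (true for a
    product of two residues). Each step replaces PH * 2**x + PL by
    PH * (2**x mod M) + PL, which is congruent mod M and strictly smaller.
    """
    n = M.bit_length()
    assert 0 <= P < (1 << (2 * n)), "P must fit in 2n bits"
    if W is None:
        W = precompute_weights(M)
    while P.bit_length() > n:
        ls = P.bit_length() - 1        # index of the most significant set bit
        x = (ls + n) // 2              # assumed split rule, n <= x <= ls
        PH = P >> x                    # high part
        PL = P & ((1 << x) - 1)        # low part
        P = PH * W[x] + PL             # one multiply-add
    if P >= M:                         # here P < 2**n <= 2M
        P -= M
    return P


print(hex(fold_mod(0x7F * 0x7F, 0x81)))  # 0x4
```

The weight table depends only on the modulus, so in an exponentiation it would be built once and passed to every call through the `W` argument.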
Montgomery reduction of $T$:
$m \leftarrow (T \bmod R) \cdot N' \bmod R$, with $0 \le m < R$
$t \leftarrow (T + m N)/R$
if $t \ge N$ return $t - N$ else return $t$
Note that if $R$ is a power of 2, division by $R$ is inexpensive.
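A minimal Python sketch of Montgomery multiplication, following the conversion, multiplication, and reduction steps described in this section. The function names (`montgomery_constants`, `redc`, `mont_mul`) are hypothetical, and a real implementation would keep operands in the residue domain across a whole exponentiation rather than convert on every call.

```python
import math


def montgomery_constants(N, R):
    """Return (R_inv, N_prime) with R * R_inv - N * N_prime == 1."""
    if R <= N or math.gcd(R, N) != 1:
        raise ValueError("R must exceed N and be coprime to N")
    R_inv = pow(R, -1, N)              # modular inverse (Python 3.8+)
    N_prime = (R * R_inv - 1) // N     # exact, since R * R_inv = 1 (mod N)
    return R_inv, N_prime


def redc(T, N, R, N_prime):
    """Montgomery reduction: return T * R^-1 mod N for 0 <= T < R * N."""
    m = ((T % R) * N_prime) % R
    t = (T + m * N) // R               # T + m*N is a multiple of R
    return t - N if t >= N else t


def mont_mul(A, B, N, R):
    """Compute A * B mod N through the Montgomery domain."""
    _, N_prime = montgomery_constants(N, R)
    A_bar = (A * R) % N                # map the operands into the domain
    B_bar = (B * R) % N
    T = A_bar * B_bar
    prod_bar = redc(T, N, R, N_prime)  # equals A * B * R mod N
    return redc(prod_bar, N, R, N_prime)  # map the result back


print(mont_mul(0x7F, 0x7F, 0x81, 1 << 8))  # 4
```

With `R` a power of two, `T % R` and `// R` reduce to a mask and a shift, which is why the reduction avoids trial division.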
III. THE PROPOSED METHOD
Modern general-purpose processor architectures have multiple high-performance multiply and multiply-add units in their datapaths. The proposed method exploits this property to perform modular multiplication for large moduli. The proposed method is based on the following equivalence:
$X \bmod Z \equiv (X_H \cdot 2^{l} + X_L) \bmod Z \qquad (1)$
A numerical example that computes $A^2 \equiv (7F)_{16}^2 \bmod (81)_{16}$ using the proposed method is given. To demonstrate the intermediate values more clearly, the hexadecimal representation of the numbers is used in the example.
Example
• Iteration 1:
$P(0) = A^2 = 3F01$, $ls(0) = 13$, $x(0) = 11$
$P_H(0) = P(0)_{13:11} = 7$, $P_L(0) = P(0)_{10:0} = 701$, $W(0) = 71$
$P(1) = (7 \cdot 71) + 701 = A18$
• Iteration 2:
$P(1) = A18$, $ls(1) = 11$, $x(1) = 10$
$P_H(1) = P(1)_{11:10} = 2$, $P_L(1) = P(1)_{9:0} = 218$, $W(1) = 79$
$P(2) = (2 \cdot 79) + 218 = 30A$
• Iteration 3:
$P(2) = 30A$, $ls(2) = 9$, $x(2) = 8$
$P_H(2) = P(2)_{9:8} = 3$, $P_L(2) = P(2)_{7:0} = A$, $W(2) = 7F$
$P(3) = (3 \cdot 7F) + A = 187$
• Final step:
$P(3) - 3M = 187 - 3 \cdot 81 = 4$.
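The arithmetic of the example can be replayed with a few lines of Python. This is only a check of the listed numbers. The split points 11, 10, and 8 are copied from the example rather than derived from a rule, and the function name `replay_example` is hypothetical.

```python
def replay_example():
    """Replay the hexadecimal example: 0x7F squared, reduced modulo 0x81."""
    M = 0x81
    P = 0x7F * 0x7F                  # P(0) = 0x3F01
    trace = []
    for x in (11, 10, 8):            # x(0), x(1), x(2) as listed in the example
        PH = P >> x                  # high part, bits ls..x
        PL = P & ((1 << x) - 1)      # low part, bits x-1..0
        W = pow(2, x, M)             # W(i) = 2**x(i) mod M
        P = PH * W + PL              # multiply-add step
        trace.append((PH, PL, W, P))
    result = P - 3 * M               # final step: subtract 3M ...
    for _ in range(2):               # ... then add M back at most twice
        if result < 0:
            result += M
    return trace, result


trace, result = replay_example()
for PH, PL, W, P in trace:
    print(f"PH={PH:X} PL={PL:X} W={W:X} P={P:X}")
print(result)  # 4
```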
where $Z$ is a $k$-bit modulus and $X$ is a $2k$-bit number which is divided into two parts as
$X_H = x_{2k-1} \cdot 2^{2k-1} + \dots + x_{k-l} \cdot 2^{k-l}$
$X_L = x_{k-l-1} \cdot 2^{k-l-1} + \dots + x_1 \cdot 2^1 + x_0 \cdot 2^0$
Using Equation 1, the proposed algorithm computes a shorter representation of the product in the given modulus at each iteration. The steps of the proposed method are given below:
• Initial step:
– Assume $A$ and $B$ are two $n$-bit numbers and $M$ is the $n$-bit modulus. Multiply $A$ and $B$: $P(0) = A \cdot B$.
– Find the location of the most significant bit in $P(0)$, $ls(0)$, and divide the product $P(0)$ into two parts based on $ls(0)$ such that either the more significant part is one bit shorter than the less significant part or both parts have the same size.
– Initially, the high part is labelled $P_H(0)$ and the low part is labelled $P_L(0)$. The high and low parts of the product are generated in each iteration $i$ as follows:
IV. RESULTS
The performance of the proposed method and Montgomery’s algorithm are compared using C programs. These
programs compute modular exponents that require many
modular squaring and modular multiplication operations.
The modular operations are performed on 128, 256, 512,
$P_H(i) = P(i)_{ls(i):x(i)}$
Table I
EXECUTION TIMES FOR THE PROPOSED METHOD AND MONTGOMERY'S METHOD

Bit   | Proposed Method (Cycles)          | Montgomery's Method (Cycles)        | Speed-Up
Size  | Min.      Max.        Avr.        | Min.       Max.        Avr.         | Min.   Max.   Avr.
128   | 38863     487355      194438      | 136642     1204456     628026       | 3.51   2.47   3.23
256   | 78771     1420639     651334      | 189882     3455309     1565042      | 2.41   2.43   2.40
512   | 166716    5084156     2374925     | 873796     19883424    10120386     | 5.24   3.91   4.26
1024  | 915024    19681398    10538454    | 4485107    90255231    48243927     | 4.90   4.59   4.58
2048  | 7505410   141866153   49757653    | 27548081   478877036   218171419    | 4.38   3.37   4.38
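The average speed-up column of Table I can be re-derived from the average cycle counts, reading speed-up as Montgomery cycles divided by proposed-method cycles. The short Python sketch below does that arithmetic. The cycle counts are copied from the table, while the dictionary and function names are hypothetical.

```python
# Average cycle counts from Table I: bit size -> (proposed, Montgomery).
average_cycles = {
    128: (194438, 628026),
    256: (651334, 1565042),
    512: (2374925, 10120386),
    1024: (10538454, 48243927),
    2048: (49757653, 218171419),
}


def speed_up(proposed_cycles, montgomery_cycles):
    """Speed-up of the proposed method relative to Montgomery's method."""
    return montgomery_cycles / proposed_cycles


for bits, (proposed, montgomery) in average_cycles.items():
    print(f"{bits:5d} bits: {speed_up(proposed, montgomery):.2f}")
```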
Figure 2. Results for 2048-bit exponent computations using rdtsc. (Plot of clock cycles, in units of $10^8$, versus exponent bit size from 0 to 2500, for the Montgomery Algorithm and the Proposed Algorithm.)
1024, and 2048 bit sizes on randomly selected operands. The exponentiations are done using the well-known left-to-right binary method. The standard library of C does not allow operations on operands of such length. To overcome this problem, the GNU Multiple Precision Library [9], which has been developed for cryptography applications and research, internet security applications, and computer algebra systems, is used. The performance tests are executed on several PC systems. Similar results are obtained for all systems. In this section, the results for the best available configuration are presented, which has an Intel® Core™ 2 Duo T6600 processor. The processor has a 2.20 GHz clock, 2x64 KB L1 cache and 2048 KB L2 cache. The memory of the system is DDR3 2x2048 MB at 533 MHz. The performance results are expressed in terms of clock cycles. In order to count the number of cycles for the critical regions of the programs, first the standard rdtsc function is used. Though this timestamp counter function is known to give excellent resolution, its use has recently been discouraged with the advent of multicore systems. More reliable cycle counter routines for various architectures are written by M. Frigo and offered as a header file (cycle.h), which can be downloaded from the internet [11]. The tests are performed with these two routines. The results obtained by the second method
are faster, but the percentage of the speed-up achieved by the proposed method has not changed. Figure 2 shows the execution clock cycles for exponentiations on 2048-bit operands where the exponent size is varied from 3 to 2048 bits. In this figure, the x axis shows the exponent size, the y axis shows the cycles executed for the corresponding exponent in units of 100 million, and the cycle counts are measured using the rdtsc function. Figure 3 shows the results of similar operations on 2048-bit operands, this time measured using the cycle.h routines. These figures show that, compared to Montgomery's algorithm, the proposed method has better performance under both measurement methods. Table I presents the minimum, maximum, and average execution times for the proposed method and Montgomery's method in terms of cycle counts for various sizes. For example, the first line shows the best, worst, and average performance results obtained for 128-bit exponents, which report an average speed-up of 3.23 for the proposed method. In general, the average speed-up ranges from 2.40 to 4.58 for the tested operand sizes, while for sizes of 512 bits and above the average speed-up is greater than four.

Figure 1. The Flow Chart. (Start; $P(0) = A \cdot B$; form $P_H(i)$, $P_L(i)$, $W(i)$; $P(i) = P_H(i-1) \cdot W(i-1) + P_L(i-1)$; repeat until $P_H(i) = 1$; then $P(i) = P(i) - 3M$; while $P(i) < 0$, $P(i) = P(i) + M$; Finish.)

Figure 3. Results for 2048-bit exponent computations using cycle.h [11] routines. (Plot of clock cycles, in units of $10^8$, versus exponent bit size from 0 to 2500, for the Montgomery Algorithm and the Proposed Algorithm.)

V. CONCLUSION
This paper presents a fast modular multiplication method that can be used as an alternative to Montgomery's algorithm or other modular multiplication methods. The method basically uses the multiply-add operation, which is directly supported on most general-purpose processors and DSPs. Due to the improvements in the implementation of the arithmetic hardware that performs this operation, the proposed method can be faster than other algorithms that depend on addition. Test programs that use the GNU arithmetic library show that exponent computations that use the proposed method are on average more than four times faster than Montgomery's algorithm for 1024-bit and 2048-bit exponents. The performance improvement can also be related to the fact that in each iteration the proposed method performs a multiply-add operation about half the size of that in the previous iteration. Multiple-precision libraries tend to use parallel algorithms to operate on large-precision operands. Thus, shortening of the multiply-add operands might quadratically improve the performance. Another advantage of the proposed method over Montgomery's algorithm is that it does not place any specific restriction on the selection of the modulus. The planned future work is to implement this algorithm on state-of-the-art FPGA platforms, which contain special high-speed circuitry for integer multiplication.

REFERENCES
[1] R. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, vol. 21, no. 2, p. 126, 1978.
[2] W. Diffie and M. Hellman, “New directions in cryptography,” IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644–654, 1976.
[3] T. Elgamal, “A public key cryptosystem and a signature scheme based on discrete logarithms,” IEEE Transactions on Information Theory, vol. 31, no. 4, pp. 469–472, 1985.
[4] (1994) Digital Signature Standard (DSS). [Online]. Available: http://www.itl.nist.gov/fipspubs/fip186.htm
[5] P. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[6] S. Eldridge and C. Walter, “Hardware implementation of Montgomery’s modular multiplication algorithm,” IEEE Transactions on Computers, vol. 42, no. 6, pp. 693–699, 1993.
[7] E. Brickell, “A survey of hardware implementations of RSA,” in Advances in Cryptology, CRYPTO89 Proceedings. Springer, 1986, pp. 368–370.
[8] N. Nedjah and L. Mourelle, “Three hardware architectures for the binary modular exponentiation: Sequential, parallel, and systolic,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 53, no. 3, pp. 627–633, 2006.
[9] (2010) GMP: GNU multiple precision library. [Online]. Available: http://gmplib.org/
[10] Intel 64 and IA-32 Architectures Optimization Reference Manual, 248966-020, Intel Corporation, 2009.
[11] (2008) cycle.h. [Online]. Available: www.fftw.org/cycle.h