Efficient Parallel Encryption/Decryption Information
Algorithm
Erick Fredj
Department of Computer Sciences, Jerusalem College of Technology (Machon Lev)
21 Havaad Haleumi St., P.O.B. 16031, 91160 Jerusalem, Israel
Abstract: This paper deals with the parallel implementation of the RSA algorithm for encryption
and decryption on a network of workstations. We present a new algorithm based on a residue
number system (RNS) and a hybrid of Montgomery’s method. RNS provides a good means for
extremely long integer arithmetic; its carry-free operations make parallel implementations
feasible. This paper shows a new combination of RNS with modulo reduction methods. The
algorithm complexity is of the order of O(n), with n denoting the amount of data.
Key-Words: Computer Arithmetic, Cryptography, Modular Multiplication, Residue Number
System Arithmetic, Text Decryption, Text Encryption, Text Security
1 Introduction
Over the last years, concerns about the lack of security online and the potential loss of
privacy have prevented many computer users from realizing the full potential of the
Internet. Encryption systems, which scramble electronic communications and
information [i][ii], allow users to communicate text on the Internet with confidence,
knowing their security and privacy are protected. The commonly used Rivest, Shamir and
Adleman (RSA) [iii],[iv] solution provides enhanced security. The RSA method is based on
a series of operations involving very large integers: whole numbers usually at least 300
digits long. An ordinary RSA 1024-bit decryption involves about 3,000 multiplications and
divisions with 310-digit numbers. The RSA algorithm has become widely used in industry
and academia; therefore fast implementations are in high demand. There exist several ways
of speeding up RSA [v]:
• optimization of the sequential algorithm using special-purpose hardware,
• faster clock rates,
• parallel computers and algorithms [vi].
This article focuses on the last option, parallel computing, which seems to provide the
greatest potential for speedup over the long term.
1.1 How does RSA work?
Suppose that we have some plain-text message that we wish to encrypt, say M1 M2 … Mn.
RSA encryption and decryption work on one letter at a time, so we are going to deal with a
single letter Mi from this sequence. RSA is an example of a public/private-key
cryptographic system. In such a system, encryption is done with a public key and
decryption with a private key. RSA encryption relies on two numbers, N and e, so the
public key is simply the set {N, e}. Similarly, the private key is the set {N, d}, since
decryption relies on these two numbers.
1.2 Public and private keys
The public key {N, e} consists of an exponent e and a modulus N, and the encryption
operation transforms a message M < N into a ciphertext C = M^e mod N. The private key
{N, d} likewise consists of an exponent d and the same modulus, and the decryption
operation M = C^d mod N converts the ciphertext back into the original message. The
modulus N is a product of two large prime numbers p and q. Since e is usually on the
order of 3, the public-key operations are relatively fast (about O(n^2) operations,
where n is the size of the modulus).
The private exponent d is of the same order as N, so private-key operations require
O(n^3) time. Because of this huge speed gap between private-key and public-key
operations, I will focus on speeding up private-key operations. Certainly, the function
that is most important for the actual encryption and decryption process is modular
exponentiation. Consider trying to evaluate the expression 5^1,000,000,000 (mod 22).
Attempting to calculate this directly using the built-in arithmetic of C++ or any other
language will overflow when we raise 5 to such a large exponent. It would be possible to
write a class to represent large integers; however, since the answer lies in the range
[0, 21], we should be able to compute it without such a class. The basic modular
exponentiation algorithm used by RSA loops over each bit b_i of the exponent b, and so
needs only log2(b) iterations. Given a message M, a modulus N, and an exponent b, the
basic modular exponentiation algorithm used by RSA is shown in Table 1a.
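The key relations just described can be illustrated with a toy numeric example (the small primes and exponent below are mine, for illustration only; real RSA uses primes hundreds of digits long):

```python
# Toy RSA key generation, encryption, and decryption (illustrative values).
p, q = 61, 53
N = p * q                      # modulus, part of both keys
phi = (p - 1) * (q - 1)        # Euler totient of N
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent: e*d ≡ 1 (mod phi)

M = 42                         # a message M < N
C = pow(M, e, N)               # encryption: C = M^e mod N
assert pow(C, d, N) == M       # decryption: M = C^d mod N
```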
Step 1: Find the base-2 representation of b = sum_{i=0}^{n-1} b_i * 2^i.
Step 2: ans := 1; T_0 := a.
Step 3: for i := 0 to n-1, where n = sizeof(b) * 8 is the total number of bits
        (8 bits per byte):
Step 4:   Case 1: b_i = (b >> i) & 1 = 0. In this case the factor a^(b_i * 2^i) = 1,
          so ans is unchanged; set T_{i+1} := T_i^2 mod m.
Step 5:   Case 2: b_i = (b >> i) & 1 = 1. In this case a^(b_i * 2^i) = T_i, so
          ans := ans * T_i mod m; set T_{i+1} := T_i^2 mod m.
Table 1a. The basic modular exponentiation algorithm simplifies the expression a^b mod m.
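Table 1a’s loop can be sketched in Python (a minimal illustration; the function name is mine):

```python
def modexp(a, b, m):
    """Square-and-multiply over the bits of b, as in Table 1a."""
    ans, T = 1, a % m
    while b:
        if b & 1:              # Case 2: current exponent bit is 1
            ans = (ans * T) % m
        T = (T * T) % m        # squaring step, done every iteration
        b >>= 1
    return ans

# The overflow example from the text: 5^1,000,000,000 mod 22 stays in [0, 21]
assert modexp(5, 10**9, 22) == pow(5, 10**9, 22)
```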
From Table 1a, the final value of ans is the result of a^b mod m. There does not seem to
be any way to perform modular exponentiation faster than the above method. As a result,
there appears to be an inherently sequential core to the main loop of RSA. We gain only a
factor of 3 by squaring in parallel with multiplication, and by exponentiating modulo the
two relatively prime factors together.
1.3 Chances to parallelize the RSA algorithm
There exist four different approaches to parallelizing the basic RSA algorithm.
• If there is a sequence of messages to be decrypted, each of these operations may be
performed independently on a different processor. This does not speed up the elapsed time
for a single private-key operation, but we expect it to speed up the overall throughput.
• Step 5, squaring, can be performed in parallel with the multiplication in Step 4 from
the previous iteration, saving some 33%. This is only possible with the loop running from
low-order exponent bits to high-order; the reverse order removes this parallelism.
• For private-key operations, we may assume that the factors p and q of N are known. The
modular exponentiation can be performed separately and in parallel mod p and mod q, and
the two results can then be combined by the Chinese remainder theorem.
• Finally, the multiplications in Steps 4 and 5 can all be performed and the results
summed in parallel. This does not appear competitive until very large numbers are used;
moreover, parallel multiplication is not practical if the communication overhead between
processors is much larger than the multiplication time.
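The third approach, splitting the private-key exponentiation mod p and mod q and recombining with the Chinese remainder theorem, can be sketched as follows (toy parameters of my choosing; the two `pow` calls are the two independent half-size exponentiations that would run on separate processors):

```python
# CRT-based RSA decryption: two half-size exponentiations, then recombination.
p, q, e = 61, 53, 17
N = p * q                     # 3233
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)           # private exponent

def decrypt_crt(C):
    # Exponentiate separately mod p and mod q (parallelizable), with the
    # exponents reduced via Fermat's little theorem.
    mp = pow(C % p, d % (p - 1), p)
    mq = pow(C % q, d % (q - 1), q)
    # Combine the two results with the Chinese remainder theorem (Garner).
    q_inv = pow(q, -1, p)
    h = (q_inv * (mp - mq)) % p
    return mq + h * q

M = 65
C = pow(M, e, N)              # encryption with the public key
assert decrypt_crt(C) == M
```

Each half-size exponentiation costs roughly (n/2)^3 work, so even sequentially this is about a 4x saving; run in parallel, the two halves give the speedup described above.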
The message M_0 M_1 M_2 … M_m is distributed among the processors; on each processor, RSA
performs a binary-to-RNS conversion, RNS modular multiplication, and an RNS-to-binary
conversion.
Figure 1. Parallel RSA implementation model.
Our parallel implementation of RSA incorporates the first three approaches, as shown in
Figure 1. The first, performing several RSA operations independently on different
processors, is the most scalable technique. Moreover, we also speed up the response time
of a single operation, for which the other techniques are needed, by using RNS modular
multiplication.
2 Modular multiplication
The most frequent operation we perform in RSA is the modular multiplication x*y mod m.
The sequential implementation of this operation requires about twice the time of a simple
multiplication, since each multiplication is followed by one modular reduction step.
Furthermore, each multiplication depends on the result of the previous one, so they must
be done sequentially. Therefore, this method is inherently sequential; to my knowledge
there are no obvious analogues in parallel multiplication. In the case of RSA, we use
Montgomery’s method for modular multiplication combined with a modular arithmetic in
which high-precision numbers are represented by their residues modulo a set of small
relatively prime numbers. The Montgomery algorithm is a modular multiplication algorithm
in which one reduction is performed at each iteration of the multiplication. The
advantage of this algorithm is that the modular reduction is performed by a shift instead
of a division. Let w = 2^n be at least 4m and choose m' such that m*m' ≡ -1 (mod w).
Notice that Montgomery’s method does not depend on w being a power of 2. Montgomery’s
method [vii] for modular multiplication works as follows:
s : x  y
Step 1:
Step 2:
q : msmod w where
mm  1mod w
Step 3:
r : s  qm w
Table1b.Montgomery modular multiplication
The products s and q*m have twice as many digits as m, x, y, or r. Since
q*m ≡ -s (mod w), s + q*m will always be a multiple of w. Reducing modulo w and dividing
by w are simple operations for multiple-precision binary numbers. However, we still
require three multiplications. To overcome this problem we implement Montgomery’s method
[viii] in Residue Number System (RNS) arithmetic [ix].
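The three steps of Table 1b can be sketched directly (an illustrative helper of mine; w is taken as a power of two greater than 4m, as in the text, so the reduction mod w is a bit-mask and the division by w is a shift):

```python
def montgomery_mul(x, y, m, k):
    """One Montgomery step: returns r ≡ x*y*w^{-1} (mod m), with w = 2^k."""
    w = 1 << k                       # w = 2^k; gcd(w, m) = 1 for odd m
    m_prime = (-pow(m, -1, w)) % w   # m * m' ≡ -1 (mod w)
    s = x * y                        # Step 1
    q = (m_prime * s) & (w - 1)      # Step 2: q := m'*s mod w (a mask)
    r = (s + q * m) >> k             # Step 3: s + q*m is a multiple of w
    return r                         # 0 <= r < 2m for x, y < m

# Example: the result carries a factor w^{-1}, i.e. r * 2^k ≡ x*y (mod m)
m, k = 97, 9                         # w = 512 > 4*97
r = montgomery_mul(40, 50, m, k)
assert (r << k) % m == (40 * 50) % m
```

The extra factor w^{-1} is the usual Montgomery residue; in a chain of multiplications it is folded into the operands once at the start and removed once at the end.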
3 Residue Number System
RNS have long been studied because of their potential for high-speed arithmetic
processing, achieved by breaking long-word-length numbers up into many short-word-length
numbers that may be operated on in parallel. RNS coding suffers from a number of serious
drawbacks, all stemming from its inherent inability to perform magnitude comparison on
pairs of numbers: converting a number from RNS to binary [x] is difficult; overflows are
not easily detectable; scaling an RNS number by a constant is time consuming; and general
division is only practical by converting the operands out of RNS.
We now introduce our RNS terminology:
• The vector (m_1, m_2, …, m_n) forms a set of moduli, called the RNS base, where the
m_i are relatively prime.
• M is the value of the product prod_{i=1}^{n} m_i.
• The vector (x_1, …, x_n) is the RNS representation of X, an integer less than M, where
x_i = |X|_{m_i} = X mod m_i.
Any X less than M has one and only one RNS representation, according to the Chinese
Remainder Theorem. RNS addition (+_RNS) and RNS multiplication (×_RNS) can be implemented
in parallel and performed in one single step:
RNS addition:        A +_RNS B ~ |a_j + b_j|_{m_j}, for j = 1, …, n
RNS multiplication:  A ×_RNS B ~ |a_j × b_j|_{m_j}, for j = 1, …, n
RNS division:        R /_RNS m_i ~ r^_j ~ |r_j × |m_i^{-1}|_{m_j}|_{m_j},
                     for j = 1, …, i-1, i+1, …, n
where |X^{-1}|_{m_j} denotes the inverse of X modulo m_j, for X and m_j relatively prime.
The Mixed Radix System (MRS) associated with this RNS is defined using the same base of
moduli. Assuming that (x'_1, x'_2, x'_3, …, x'_n), with 0 <= x'_i < m_i, is the MRS
representation of X, an integer less than M, then
X = x'_1 + x'_2*m_1 + x'_3*m_1*m_2 + … + x'_n*m_1*…*m_{n-1}.
Motivation: A residue base (m_1, m_2, …, m_n), where M = prod_{i=1}^{n} m_i; a modulus N
expressed in RNS, with GCD(N, M) = 1 and satisfying 0 < N < M / (3 * max_{i=1,…,n} m_i);
an integer A given in MRS, A = sum_{i=1}^{n} a_i * prod_{j=1}^{i-1} m_j; and an integer B
given in RNS. Here r_i, b_i, n_i denote the residues of R, B, N modulo m_i.
Answer: An integer R < 2N expressed in RNS, such that R ≡ A*B*M^{-1} (mod N).
Method:
  R := 0
  for i := 1 to n do
    q_i := (r_i + a_i*b_i) * |(m_i - n_i)^{-1}|_{m_i} mod m_i
    R := R +_RNS (a_i ×_RNS B) +_RNS (q_i ×_RNS N)
    R := R /_RNS m_i
  end for
Table 2. RNS method for modular multiplication.
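The loop of Table 2 can be sketched as follows. This is an illustrative simulation of mine: the n residues of R are represented by a single Python integer for readability, whereas the real algorithm keeps R as an RNS vector and performs each step on all channels in parallel.

```python
from math import prod

def to_mrs(A, base):
    """Mixed-radix digits of A: A = a1 + a2*m1 + a3*m1*m2 + ..."""
    digits = []
    for m in base:
        digits.append(A % m)
        A //= m
    return digits

def rns_modmul(A, B, N, base):
    """Compute R ≡ A*B*M^{-1} (mod N), M = prod(base), following Table 2."""
    a = to_mrs(A, base)
    R = 0
    for i, m in enumerate(base):
        n_i = N % m
        # MRS digit q_i, chosen so that R + a_i*B + q_i*N is divisible by m_i
        q = ((R % m) + a[i] * (B % m)) * pow(m - n_i, -1, m) % m
        # update R and divide out m_i (the division is exact by construction)
        R = (R + a[i] * B + q * N) // m
    return R

base = [5, 7, 9, 11, 13]          # pairwise coprime moduli, M = 45045
N = 1009                          # satisfies 0 < N < M / (3 * max(base))
A, B = 123, 456
R = rns_modmul(A, B, N, base)
assert (R * prod(base)) % N == (A * B) % N   # R ≡ A*B*M^{-1} (mod N)
```

Unrolling the loop shows why this works: M*R = A*B + Q*N for the integer Q built from the digits q_i, which is exactly the Montgomery congruence with M playing the role of w.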
The algorithm goes through n iterations. At each iteration step a MRS digit q_i of a
number Q is computed, and a new value of R, using q_i and a_i, is determined in RNS. At
each step R is constructed to be a multiple of m_i; since the moduli are relatively
prime, dividing R by m_i is equivalent to multiplying each residue of R by the modular
inverse of m_i. However, this cannot be evaluated for the i-th residue, because m_i is
not relatively prime to itself; therefore the i-th residue is lost. We propose two
solutions for correctly expressing R.
1. Use of an auxiliary residue system for expressing the result.
2. Reconstruction of the missing residue after it is lost.
Since at each step of the algorithm one residue is lost, the intermediate result R cannot
be correctly expressed after one step, because R < 3N < M / max(m_i). The solution
presented here consists of extending the modular system with an auxiliary base
B~ = (m~_1, m~_2, …, m~_n), with M~ = prod_{i=1}^{n} m~_i,
M~ / (3 * max m~_i) > M / (3 * max m_i), and GCD(M~, M) = 1. This base extension can be
computed with the Szabo-Tanaka algorithm. The algorithm computes A*B*M^{-1} mod N in RNS,
the result being obtained in the auxiliary base B~.
This algorithm can be split into three kinds of tasks, see Table 3:
• Task I computes the MRS digit q_i at the i-th step of the algorithm, using a_i.
• Task II computes the new value of R.
• Task III performs the conversion of operand A from RNS to MRS using the Szabo-Tanaka
[xi] conversion algorithm.
Task I_i:     q_i = (r_i + a_i*b_i) * |(m_i - n_i)^{-1}|_{m_i} mod m_i
Task II_{i,j}: r_j := (r_j + a_i*b_j + q_i*n_j) * |m_i^{-1}|_{m_j} mod m_j, for j ≠ i
Task III_{i,j}: a_j := (a_j - a_i) * |m_i^{-1}|_{m_j} mod m_j, for 0 < i < j,
               j = 2, 3, …, n
Table 3. General tasks.
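The RNS-to-MRS conversion of Task III, and the evaluation of the MRS digits in a second base used for base extension, can be sketched as follows (a hypothetical illustration with small moduli of mine; this is the textbook Szabo-Tanaka scheme rather than the paper's exact task scheduling):

```python
base = [5, 7, 9, 11]              # original RNS base (pairwise coprime)

def rns_to_mrs(residues, base):
    """Szabo-Tanaka style conversion of an RNS vector to MRS digits."""
    xs = list(residues)
    digits = []
    for i, mi in enumerate(base):
        d = xs[i]
        digits.append(d)
        for j in range(i + 1, len(base)):
            # subtract the known digit, then divide by m_i in each channel
            xs[j] = (xs[j] - d) * pow(mi, -1, base[j]) % base[j]
    return digits                  # X = d0 + d1*m0 + d2*m0*m1 + ...

def base_extend(residues, base, new_base):
    """Extend an RNS representation to an auxiliary base via MRS digits."""
    digits = rns_to_mrs(residues, base)
    out = []
    for mt in new_base:
        acc, w = 0, 1
        for d, m in zip(digits, base):
            acc = (acc + d * w) % mt   # Horner-style evaluation mod mt
            w = (w * m) % mt
        out.append(acc)
    return out

X = 1234                           # any X < M = 5*7*9*11 = 3465
assert base_extend([X % m for m in base], base, [13, 17]) == [X % 13, X % 17]
```

The inner update of `rns_to_mrs` is exactly Task III performed across the channels j > i, which is why the conversion parallelizes over the moduli.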
Task I~_i:    q~_i = (r~_i + a~_i*b~_i) * |(m~_i - n~_i)^{-1}|_{m~_i} mod m~_i,
              performed on all residues in the auxiliary system, with a~_i from the
              original system.
Task II~_{i,j}: r~_j := (r~_j + a~_i*b~_j + q~_i*n~_j) * |m~_i^{-1}|_{m~_j} mod m~_j,
              the complementary computation of R in the auxiliary system.
Task III_{i,j}: r_j := (r_j + a~_i*b_j + q~_i*n_j) * |m~_i^{-1}|_{m_j} mod m_j,
              for j ≠ i.
Task IV_{i,j}: conversion to MRS: x~_j := (x~_j - x~_i) * |m~_i^{-1}|_{m~_j} mod m~_j.
Task V_{i,j}: computation of the residues in the auxiliary system from the MRS digits:
              x~_j := (x~_j + x_i * |m_1*m_2*…*m_{i-1}|_{m~_j}) mod m~_j.
Table 4. Tasks for the reverse multiplication.
4 Distributed RSA Implementation
The most straightforward approach to parallelizing the RSA algorithm using the Message
Passing Interface (MPI) consists in applying the general principle of space
decomposition, so that each processor runs essentially the same program on its own data.
The algorithm described has been
implemented on a parallel Mosix machine using the MPI library. The parallel machine is a
cluster of 8 identical Pentium II PCs with 128 MB RAM, locally connected by a fast
communication protocol (Myrinet), each computer with 20 GB of local storage, as shown in
Figure 2.
Figure 2. Mosix parallel computer: PCs connected by fast Myrinet communication.
For the 1024-bit RSA algorithm, if we use 32-bit processors then we need about 33 moduli
for our RNS implementation. The timing results for the Mosix [xii] machine running the
parallel hybrid RSA algorithm are 1.99 times faster on 2 processors and 3.99 times faster
on 4 processors than a single Intel Pentium II processor. The parallelization results
show an almost linear speedup of p, where p denotes the number of processors, and a
quasi-perfect efficiency of 1.
5 Conclusions and Future Work
In this work we investigate a new strategy to
implement an Encryption/Decryption information
algorithm based on RSA. The use of the RNS
allows the decomposition of a given dynamic range
into slices of smaller sub ranges on which the
computation can be efficiently implemented in
parallel. Using a parallel machine to do RSA
public/private key operations seem realistic today.
Our future work will focus on an interactive
computer service shall store and transmit with
integrity any text security measure associated with
certified security technologies that is used in
connection with copyrighted material or other
protected text content such service transmits or
stores.
References:
[i] Ronald L. Rivest and Adi Shamir, CryptoBytes, volume 2, number 1, RSA Laboratories,
Spring 1996, pp. 7-11.
[ii] Ronald L. Rivest, "The RC5 Encryption Algorithm," in Fast Software Encryption, ed.
Bart Preneel, Springer, pp. 86-96, 1995.
[iii] Ronald L. Rivest, Adi Shamir, and Leonard Adleman, "A method for obtaining digital
signatures and public-key cryptosystems," Communications of the ACM, 21(2): 120-126,
1978.
[iv] E.F. Brickell, "A Survey of Hardware Implementations of RSA," Advances in Cryptology
- CRYPTO '89, G. Brassard, ed., pp. 368-370, Springer-Verlag, 1990.
[v] Mark Shand and Jean Vuillemin, "Fast implementations of RSA cryptography," in
Proceedings, 11th Symposium on Computer Arithmetic, pp. 252-259, IEEE, 1993.
[vi] F. Thomson Leighton, Introduction to Parallel Algorithms and Architectures: Arrays,
Trees, Hypercubes. Morgan Kaufmann, 1992.
[vii] Peter L. Montgomery, "Modular multiplication without trial division," Mathematics
of Computation, 44(170): 519-521, 1985.
[viii] S.E. Eldridge and C.D. Walter, "Hardware Implementation of Montgomery's Modular
Multiplication Algorithm," IEEE Trans. Computers, vol. 42, no. 6, pp. 693-699, 1993.
[ix] M.A. Soderstrand, W.K. Jenkins, G.A. Jullien, and F.J. Taylor, Residue Number System
Arithmetic: Modern Applications in Digital Signal Processing. New York: IEEE Press, 1986.
[x] R.M. Capocelli and R. Giancarlo, "Efficient VLSI Networks for Converting an Integer
from Binary to Residue Number System and Vice Versa," IEEE Trans. Circuits and Systems,
vol. CAS-35, pp. 1425-1431, 1988.
[xi] N.S. Szabo and R.I. Tanaka, Residue Arithmetic and its Applications to Computer
Technology. New York: McGraw-Hill, 1967.
[xii] A. Barak, S. Guday, and R. Wheeler, The MOSIX Distributed Operating System, Load
Balancing for UNIX. Lecture Notes in Computer Science, vol. 672, Springer-Verlag, 1993.
Acknowledgment:
The author would like to thank Professor Joseph Steiner of the Department of Applied
Mathematics at the Jerusalem College of Technology for his important comments.