Efficient Parallel Encryption/Decryption Information Algorithm

Erick Fredj
Department of Computer Sciences, Jerusalem College of Technology (Machon Lev)
21 Havaad Haleumi St., P.O.B. 16031, 91160 Jerusalem, Israel

Abstract: This paper deals with the parallel implementation of the RSA algorithm for encryption and decryption on a network of workstations. We present a new algorithm based on a residue number system (RNS) and a hybrid of Montgomery’s method. Residue number systems provide a good means for arithmetic on extremely long integers, and their carry-free operations make parallel implementations feasible. This paper shows a new combination of RNS with modular reduction methods. The algorithm complexity is of the order of $O(n)$, with $n$ denoting the amount of data.

Key-Words: Computer Arithmetic, Cryptography, Modular Multiplication, Residue Number System Arithmetic, Text Decryption, Text Encryption, Text Security

1 Introduction

Over the last years, concerns about the lack of security online and the potential loss of privacy have prevented many computer users from realizing the full potential of the Internet. Encryption systems, which scramble electronic communications and information [i][ii], allow users to communicate text on the Internet with confidence, knowing that their security and privacy are protected. The commonly used Rivest, Shamir and Adleman (RSA) [iii],[iv] solution provides enhanced security. The RSA method is based on a series of operations involving very large integers: whole numbers usually at least 300 digits long. An ordinary 1024-bit RSA decryption involves about 3,000 multiplications and divisions with 310-digit numbers. The RSA algorithm has become widely used in industry and academia, so fast implementations are in great demand.
There exist several ways of speeding up RSA [v]: optimization of the sequential algorithm using special-purpose hardware, faster clock rates, and parallel computers and algorithms [vi]. This article focuses on the last option, parallel computing, which seems to provide the greatest potential for speedup over the long term.

1.1 How RSA works

Suppose that we have some plain-text message that we wish to encrypt, and that the message is $M_1 M_2 \dots M_n$. RSA encryption and decryption work on one letter at a time, so we deal with a single letter $M_i$ from this sequence. RSA is an example of a public/private key cryptographic system. In such a system, encryption is done with a public key and decryption with a private key. RSA encryption relies on two numbers $N$ and $e$, so the public key is simply the set $\{N, e\}$. Similarly, the private key is the set $\{N, d\}$, since decryption relies on these two numbers.

1.2 Public and private keys

The public key $\{N, e\}$ consists of an exponent $e$ and a modulus $N$, and the encryption operation transforms a message $M < N$ into a ciphertext $C = M^e \bmod N$. The private key $\{N, d\}$ also consists of an exponent $d$ and a modulus $N$, and the decryption operation $M = C^d \bmod N$ converts the ciphertext back into the original message. The modulus $N$ is the product of two large prime numbers $p$ and $q$. Since $e$ is usually small (on the order of 3), public-key operations are relatively fast (about $O(n^2)$ operations, where $n$ is the size of the modulus). The private exponent $d$ is of the same order as $N$, so private-key operations require $O(n^3)$ time. Because of the huge speed gap between private-key and public-key operations, I will focus on speeding up private-key operations.

The function that matters most for the actual encryption and decryption process is modular exponentiation. Consider trying to evaluate the expression $5^{1,000,000,000} \pmod{22}$.
Attempting to calculate this directly using the built-in function of C++ or any other language will overflow when we raise 5 to such a large exponent. It would be possible to write a class to represent large integers; however, since the answer lies in the range [0, 21], we should be able to compute the answer without such a class. The basic modular exponentiation algorithm used by RSA loops over each bit $b_i$ of the exponent $b$ and needs only about $\log_2 b$ iterations. Given a message $M$, a modulus $N$, and an exponent $b$ (written $a$, $m$ and $b$ in the table), the basic modular exponentiation algorithm used by RSA is shown in Table 1a.

Step 1: Find the base-2 representation of $b$: $b = \sum_{i=0}^{n-1} b_i 2^i$.
Step 2: $ans := 1$; $T_0 := a$.
Step 3: for $i := 0$ to $n - 1$, where $n = \mathrm{sizeof}(b) \times 8$ is the total number of bits (8 bits per byte):
Step 4: Case 1: $b_i = (b \gg i) \,\&\, 1 = 0$. In this case the expression $a^{b_i 2^i} = 1$, and $ans$ is unchanged.
        Case 2: $b_i = (b \gg i) \,\&\, 1 = 1$. In this case the expression $a^{b_i 2^i} = a^{2^i} = T_i$, then $ans := ans \cdot T_i \bmod m$.
Step 5: $T_{i+1} := T_i^2 \bmod m$.
Table 1a. The basic modular exponentiation algorithm, which evaluates the expression $a^b \bmod m$.

From Table 1a, the final value of $ans$ is the result of $a^b \bmod m$. There does not seem to be any way to perform modular exponentiation faster than the above method. As a result, there appears to be an inherently sequential core in the main loop of RSA. We gain only a factor of 3 by squaring in parallel with multiplication and by exponentiating modulo the two relatively prime factors together.

1.3 Opportunities for parallelizing the RSA algorithm

There exist four different approaches to parallelizing the basic RSA algorithm. If there is a sequence of messages to be decrypted, each of these operations may be performed independently on a different processor. This does not reduce the elapsed time of a single private-key operation, but we expect it to speed up the overall performance. Step 5, squaring, can be performed in parallel with the multiplication in Step 4 from the previous iteration, saving some 33%.
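The loop of Table 1a can be sketched in a few lines. This is an illustrative Python version rather than the implementation used in the paper; the function name mod_exp is an assumption. It scans the exponent from the low-order bit upward, so the squaring of Step 5 never depends on the multiplication of Step 4 in the same pass. Every intermediate value stays below $m^2$, so for a small modulus such as 22 a fixed-width machine integer would also be enough.

```python
def mod_exp(a: int, b: int, m: int) -> int:
    """Right-to-left square-and-multiply: returns a**b mod m for b >= 0, m >= 1."""
    ans = 1 % m          # Step 2: ans := 1 (1 % m also covers m == 1)
    t = a % m            # Step 2: T_0 := a
    while b > 0:         # Step 3: one pass per exponent bit
        if b & 1:        # Step 4, case 2: the current bit is 1, so multiply T_i in
            ans = (ans * t) % m
        t = (t * t) % m  # Step 5: T_{i+1} := T_i^2 mod m
        b >>= 1          # move to the next higher bit
    return ans


# The example from Section 1.2; 5**5 % 22 == 1 and 5 divides the exponent.
print(mod_exp(5, 1_000_000_000, 22))  # prints 1
```

The call needs about 30 passes of the loop, one per bit of the exponent.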
This overlap of squaring and multiplication is only possible with the loop running from the low-order exponent bits to the high-order bits. Doing so removes some inherent parallelism. For private-key operations, we may assume that the factors $p$ and $q$ of $N$ are known. The modular exponentiation can be performed separately and in parallel mod $p$ and mod $q$, and the two results can then be combined by the Chinese Remainder Theorem. Finally, the multiplications in Steps 4 and 5 can all be performed and the results summed in parallel. Such schemes do not appear competitive until the numbers become very large. Moreover, parallel multiplication is not practical if the communication overhead between processors is much larger than the multiplication time.

[Figure 1. Parallel RSA implementation model: the message is split into blocks $M_0, M_1, M_2, \dots, M_m$, and each block passes through its own RSA pipeline (binary to RNS, RNS modular multiplication, RNS to binary).]

Our parallel implementation of RSA will incorporate the first three approaches, as shown in Figure 1. The first, performing several RSA operations independently on different processors, is the most scalable technique. Moreover, we also speed up the response time of a single operation, for which the other three techniques are needed, by using an RNS modular multiplication.

2 Modular multiplication

The most frequent operation we perform in RSA is the modular multiplication $x \cdot y \bmod m$. The sequential implementation of this operation requires about twice the time of a simple multiplication, since each multiplication is followed by one modular reduction step. Furthermore, each multiplication depends on the result of the previous one, so they must be done sequentially. Therefore, this method is inherently sequential; to my knowledge there are no obvious analogues in parallel multiplication.
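Returning to the approach of Section 1.3 that exponentiates separately mod $p$ and mod $q$, the following is a minimal sketch under stated assumptions: the function name rsa_private_crt and the toy key (p = 61, q = 53, e = 17, d = 2753) are chosen for illustration and do not come from the paper, and the recombination uses Garner's formula as one common way of applying the Chinese Remainder Theorem. The two half-size exponentiations share no data, which is what would allow them to run on different processors.

```python
def rsa_private_crt(c: int, d: int, p: int, q: int) -> int:
    """Compute c**d mod (p*q) from two independent half-size exponentiations."""
    # Each line below uses only its own prime, so the two could run in parallel.
    m_p = pow(c % p, d % (p - 1), p)
    m_q = pow(c % q, d % (q - 1), q)
    # Recombine with the Chinese Remainder Theorem (Garner's formula).
    h = (pow(q, -1, p) * (m_p - m_q)) % p
    return m_q + h * q


# Toy key, far too small for real use: N = 61 * 53 = 3233.
p, q, e, d = 61, 53, 17, 2753
message = 65
cipher = pow(message, e, p * q)
assert rsa_private_crt(cipher, d, p, q) == message
```

The modular inverse is written as pow(q, -1, p), which requires Python 3.8 or later.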
In the case of RSA, we use Montgomery’s method for modular multiplication, combined with a modular arithmetic in which high-precision numbers are represented by their residues modulo a set of small relatively prime numbers. The Montgomery algorithm is a modular multiplication algorithm in which one reduction is performed at each iteration of the multiplication. The advantage of this algorithm is that the modular reduction is performed by a shift instead of a division. Let $w = 2^n$ be at least $4m$ and choose $m'$ such that $m m' \equiv -1 \pmod w$. Notice that Montgomery’s method doesn’t depend on $w$ being a power of 2. Montgomery’s method [vii] for modular multiplication works as follows:

Step 1: $s := x \cdot y$
Step 2: $q := m' s \bmod w$, where $m m' \equiv -1 \pmod w$
Step 3: $r := (s + q m) / w$
Table 1b. Montgomery modular multiplication

The products $s$ and $qm$ have twice as many digits as $m$, $x$, $y$, or $r$. Since $qm \equiv -s \pmod w$, $s + qm$ will always be a multiple of $w$. Reducing modulo $w$ and dividing by $w$ are simple operations for multi-precision binary numbers. However, we still require three multiplications. To overcome this problem we implement Montgomery’s method [viii] in Residue Number System (RNS) arithmetic [ix].

3 Residue Number System

RNS have long been studied because of their potential for high-speed arithmetic processing, achieved by breaking long word-length numbers up into many short word-length numbers that may be operated on in parallel. RNS coding suffers from a number of serious drawbacks, all stemming from its inherent inability to perform magnitude comparison on pairs of numbers: converting a number from RNS to binary [x] is difficult; overflows are not easily detectable; scaling an RNS number by a constant is time consuming; and general division is only practical by converting the operands out of RNS.

We now introduce our RNS terminology. The vector $\{m_1, m_2, \dots, m_n\}$ forms a set of moduli, called the RNS base, where the $m_i$ are pairwise relatively prime. $M$ is the value of the product $\prod_{i=1}^{n} m_i$.
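A small sketch of such a base may help fix ideas. The moduli 7, 11 and 13 and the helper names is_rns_base and to_residues are assumptions made for illustration only. The sketch checks that the moduli are pairwise relatively prime, forms $M$, and shows that addition and multiplication can be carried out one short residue at a time, with no carry passing from one modulus to another.

```python
from itertools import combinations
from math import gcd, prod


def is_rns_base(moduli):
    """True when the moduli are pairwise relatively prime."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def to_residues(x, moduli):
    """One short residue per modulus; each channel is independent of the others."""
    return [x % m for m in moduli]


base = [7, 11, 13]   # toy moduli, assumed for illustration
M = prod(base)       # 1001
assert is_rns_base(base)

x, y = 123, 45
rx, ry = to_residues(x, base), to_residues(y, base)
# Channel-by-channel arithmetic: no information crosses between moduli.
r_sum = [(a + b) % m for a, b, m in zip(rx, ry, base)]
r_prod = [(a * b) % m for a, b, m in zip(rx, ry, base)]
assert r_sum == to_residues((x + y) % M, base)
assert r_prod == to_residues((x * y) % M, base)
```

Because each list comprehension touches one modulus per element, each element could be assigned to a separate processor.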
The $n$-vector $(x_1, \dots, x_n)$ is the RNS representation of $X$, an integer less than $M$, where $x_i = |X|_{m_i} = X \bmod m_i$. Any $X$ less than $M$ has one and only one RNS representation, according to the Chinese Remainder Theorem. Addition and multiplication in RNS can be implemented in parallel and performed in one single step:

RNS addition: $A +_{RNS} B \sim |a_j + b_j|_{m_j}$, for $j = 1, \dots, n$
RNS multiplication: $A \times_{RNS} B \sim |a_j \times b_j|_{m_j}$, for $j = 1, \dots, n$
RNS division: $R \div_{RNS} m_i \sim \hat r_j = \left| r_j \times |m_i^{-1}|_{m_j} \right|_{m_j}$, for $j = 1, \dots, i-1, i+1, \dots, n$

where $|X^{-1}|_{m_j}$ denotes the inverse of $X$ modulo $m_j$, for $X$ and $m_j$ relatively prime.

The Mixed Radix System (MRS) associated with this RNS is defined using the same base of moduli. Assuming that $(x'_1, x'_2, x'_3, \dots, x'_n)$, with $0 \le x'_i < m_i$, is the MRS representation of $X$, an integer less than $M$, then $X = x'_1 + x'_2 m_1 + x'_3 m_1 m_2 + \dots + x'_n m_1 \cdots m_{n-1}$.

Motivation: A residue base $\{m_1, m_2, \dots, m_n\}$, where $M = \prod_{i=1}^{n} m_i$. A modulus $N$ expressed in RNS, with $\gcd(N, M) = 1$ and satisfying $0 < N < \frac{M}{3 \max_{i=1,\dots,n} m_i}$. An integer $A$ given in MRS, $A = \sum_{i=1}^{n} a_i \prod_{j=1}^{i-1} m_j$, and an integer $B$ given in RNS.
Answer: An integer $R < 2N$ expressed in RNS, such that $R \equiv A B M^{-1} \pmod N$.
Method:
  $R := 0$
  for $i = 1$ to $n$ do
    $q_i := (r_i + a_i b_i)\,|(m_i - n_i)^{-1}|_{m_i} \bmod m_i$
    $R := R +_{RNS} a_i \times_{RNS} B +_{RNS} q_i \times_{RNS} N$
    $R := R \div_{RNS} m_i$
  end for
Table 2. RNS method for modular multiplication

The algorithm goes through $n$ iterations. At each iteration step an MRS digit $q_i$ of a number $Q$ is computed, and a new value of $R$, using $q_i$ and $a_i$, is determined in RNS. At each step, $R$ is computed to be a multiple of $m_i$; since the moduli are relatively prime, dividing $R$ by $m_i$ is equivalent to multiplying each residue of $R$ by the modular inverse of $m_i$. However, this cannot be evaluated for the $i$th residue, because $m_i$ is not relatively prime to itself. Therefore, the $i$th residue is lost. We propose two solutions for correctly expressing $R$:

1. Use of an auxiliary residue system for expressing the result.
2. Reconstruction of the missing residue after it is lost.

Since at each step of the algorithm one residue is lost, the intermediate result $R$ cannot be correctly expressed after one step because $R < 3N < M / \max_i m_i$. Our solution presented here consists of extending the modular system with an auxiliary base $\tilde B = \{\tilde m_1, \tilde m_2, \dots, \tilde m_n\}$, with $\tilde M = \prod_{i=1}^{n} \tilde m_i$ and $\gcd(M, \tilde M) = 1$, where $M$ and $\tilde M$ are bounded in terms of $3 \max_i \tilde m_i$ and $3 \max_i m_i$ respectively. This base extension can be computed with the Szabo-Tanaka algorithm. The algorithm computes $A B M^{-1} \bmod N$ in RNS, the result being obtained in the auxiliary base $\tilde B$.

This algorithm can be split into three kinds of tasks (see Table 3). Task I computes the MRS digit $q_i$ at the $i$th step of the algorithm with $a_i$. Task II computes the new value of $R$. Task III performs the conversion of operand $A$ from RNS to MRS using the Szabo-Tanaka [xi] conversion algorithm.

Task I$_i$: $q_i = (r_i + a_i b_i)\,|(m_i - n_i)^{-1}|_{m_i} \bmod m_i$
Task II$_{i,j}$: $r_j = (r_j + a_i b_j + q_i n_j)\,|m_i^{-1}|_{m_j} \bmod m_j$, for $j \ne i$
Task III$_{i,j}$: $a_j = (a_j - a_i)\,|m_i^{-1}|_{m_j} \bmod m_j$, for $0 < i < j$, $j = 2, 3, \dots, n$
Table 3. General tasks

Task I$_i$: $\tilde q_i = (\tilde r_i + \tilde a_i \tilde b_i)\,|(\tilde m_i - \tilde n_i)^{-1}|_{\tilde m_i} \bmod \tilde m_i$
Task II$_{i,j}$: performed on all residues in the auxiliary system, with $\tilde a_i$ from the original system: $\tilde r_j = (\tilde r_j + \tilde a_i \tilde b_j + \tilde q_i \tilde n_j)\,|\tilde m_i^{-1}|_{\tilde m_j} \bmod \tilde m_j$, for $j \ne i$
Task III$_{i,j}$: the complementary computation of $R$ in the auxiliary system: $r_j = (r_j + \tilde a_i b_j + \tilde q_i n_j)\,|\tilde m_i^{-1}|_{m_j} \bmod m_j$, for $j \ne i$
Task IV$_{i,j}$: conversion to MRS: $\tilde x_j = (\tilde x_j - \tilde x_i)\,|\tilde m_i^{-1}|_{\tilde m_j} \bmod \tilde m_j$
Task V$_{i,j}$: computation of the residues in the auxiliary system from the MRS digits: $x_j = (x_j + \tilde x_i \tilde m_1 \tilde m_2 \cdots \tilde m_{i-1}) \bmod m_j$
Table 4. Tasks for the reverse multiplication

4 Distributed RSA Implementation

The most straightforward approach to parallelizing the RSA algorithm using the Message Passing Interface (MPI) consists in applying the general principle of space decomposition, so that each processor runs essentially the same program on its own data. The algorithm described has been implemented on a parallel Mosix machine using the MPI library.
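The independence of the residue channels that this decomposition relies on can be seen in a sequential sketch of Table 2 with an auxiliary base. It is written under several assumptions that are not taken from the paper: the toy moduli, the helper names mrs_digits, crt and rns_montgomery, the use of big-integer division to obtain the MRS digits of $A$ (instead of Task III of Table 3), and a final Chinese Remainder reconstruction that is used only to check the result. Each pass of the two inner loops touches a single modulus, so each pass could in principle be handed to a separate processor.

```python
from math import gcd, prod


def mrs_digits(x, base):
    """Mixed-radix digits of x (shortcut: big-integer division, not Szabo-Tanaka)."""
    digits = []
    for m in base:
        digits.append(x % m)
        x //= m
    return digits


def crt(residues, base):
    """Integer in [0, prod(base)) with the given residues; used only for checking."""
    M = prod(base)
    return sum(r * (M // m) * pow(M // m, -1, m) for r, m in zip(residues, base)) % M


def rns_montgomery(A, B, N, base, aux):
    """Residues, in the auxiliary base, of R with R = A*B*M^-1 (mod N)."""
    a = mrs_digits(A, base)
    b, n = [B % m for m in base], [N % m for m in base]
    b_aux, n_aux = [B % m for m in aux], [N % m for m in aux]
    r, r_aux = [0] * len(base), [0] * len(aux)
    for i, mi in enumerate(base):
        # Task I: MRS digit q_i that makes R + a_i*B + q_i*N a multiple of m_i.
        q = (r[i] + a[i] * b[i]) * pow(mi - n[i], -1, mi) % mi
        # Task II: update, then divide by m_i, in every channel that can invert m_i.
        for j in range(i + 1, len(base)):   # residue i itself is lost
            mj = base[j]
            r[j] = (r[j] + a[i] * b[j] + q * n[j]) * pow(mi, -1, mj) % mj
        for j, mj in enumerate(aux):        # the auxiliary channels keep R intact
            r_aux[j] = (r_aux[j] + a[i] * b_aux[j] + q * n_aux[j]) * pow(mi, -1, mj) % mj
    return r_aux


base, aux = [7, 11, 13], [5, 9, 17]   # toy moduli, assumed for illustration
M, N, A, B = prod(base), 23, 17, 20   # N < M / (3 * max(base))
assert gcd(N, M) == 1 and gcd(M, prod(aux)) == 1
R = crt(rns_montgomery(A, B, N, base, aux), aux)
assert R < 2 * N and R % N == A * B * pow(M, -1, N) % N
```

After the last iteration every residue of the original base has been consumed, which is why the result is read from the auxiliary base.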
The parallel machine is a cluster of 8 identical Pentium II PCs with 128 MB of RAM each, locally connected by a fast communication network such as Myrinet, each computer having 20 GB of local storage, as shown in Figure 2.

[Figure 2. Mosix parallel computer: PCs connected by a fast Myrinet communication network.]

For the 1024-bit RSA algorithm, if we use 32-bit processors then we need about 33 moduli for our RNS implementation. The timing results for the Mosix [xii] machine running the parallel hybrid RSA algorithm are 1.99 times faster on 2 processors and 3.99 times faster on 4 processors than on a single Intel Pentium II processor. The parallelization results therefore show near-linear behavior: the speedup is close to $p$, where $p$ denotes the number of processors, and the efficiency is close to the ideal value of 1.

5 Conclusions and Future Work

In this work we investigated a new strategy for implementing an encryption/decryption information algorithm based on RSA. The use of the RNS allows the decomposition of a given dynamic range into slices of smaller sub-ranges on which the computation can be efficiently implemented in parallel. Using a parallel machine to perform RSA public/private key operations seems realistic today. Our future work will focus on an interactive computer service that stores and transmits, with integrity, any text security measure associated with certified security technologies used in connection with copyrighted material or other protected text content that the service transmits or stores.

References:
[i] Ronald L. Rivest and Adi Shamir, CryptoBytes, vol. 2, no. 1, RSA Laboratories, Spring 1996, pp. 7-11.
[ii] Ronald L. Rivest, The RC5 Encryption Algorithm, in Fast Software Encryption, ed. Bart Preneel, Springer, pp. 86-96, 1995.
[iii] Ronald L. Rivest, Adi Shamir, and Leonard Adleman, A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM, 21(2):120-126, 1978.
[iv] E.F. Brickell, A Survey of Hardware Implementations of RSA, in Advances in Cryptology: CRYPTO ’89, G. Brassard, ed., pp. 368-370, Springer-Verlag, 1990.
[v] Mark Shand and Jean Vuillemin, Fast implementation of RSA cryptography, in Proceedings, 11th Symposium on Computer Arithmetic, pp. 252-259, IEEE, 1993.
[vi] F. Thomson Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.
[vii] Peter L. Montgomery, Modular multiplication without trial division, Mathematics of Computation, 44(170):519-521, 1985.
[viii] S.E. Eldridge and C.D. Walter, Hardware Implementation of Montgomery’s Modular Multiplication Algorithm, IEEE Trans. Computers, vol. 42, no. 6, pp. 693-699, 1993.
[ix] M.A. Soderstrand, W.K. Jenkins, G.A. Jullien, and F.J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing, New York: IEEE Press, 1986.
[x] R.M. Capocelli and R. Giancarlo, Efficient VLSI Networks for Converting an Integer from Binary to Residue Number System and Vice Versa, IEEE Trans. Circuits and Syst., vol. CAS-35, pp. 1425-1431, 1988.
[xi] N.S. Szabo and R.I. Tanaka, Residue Arithmetic and its Applications to Computer Technology, New York: McGraw-Hill, 1967.
[xii] A. Barak, S. Guday, and R. Wheeler, The MOSIX Distributed Operating System, Load Balancing for UNIX, Lecture Notes in Computer Science, vol. 672, Springer-Verlag, 1993.

Acknowledgment: The author would like to thank Professor Joseph Steiner of the Department of Applied Mathematics at the Jerusalem College of Technology for his important comments.