Generating RSA Primes
Jim Townsend
CSE633 Final Results, Fall 2010

Importance
• Encryption is harder to secure than ever
• RSA is an important standard in public-key encryption
• Developed in 1977, it began with relatively small keys (128- and 256-bit)
• Current standard: 1024-bit keys (roughly 310 decimal digits)
• Math on these numbers is very CPU intensive

How Keys are Generated
• Use the Miller-Rabin algorithm
• Tests each candidate against a few specific numbers
• Only a probabilistic method
• Probability that a single pass detects a composite: at least .75
• Repeated passes are used to eliminate false positives
• 16 repetitions: a composite passes every round with probability at most (1-.75)^16
• Runtime: O(ln(N)^4)

Sieve of Eratosthenes
• Decided to run a small sieve on the numbers before using the Miller-Rabin algorithm
• Using all the prime numbers less than 1000 (168 numbers), check first whether any of them evenly divides the candidate
• Decreased serial runtime by more than half

Current Program
• The program takes in two strings: a starting value and a range
• Runs a sieve on the range with the first 168 primes
• Tests each remaining number with the Miller-Rabin algorithm, up to 16 times
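The pipeline on the Current Program slide (sieve with the 168 primes below 1000, then up to 16 Miller-Rabin passes) might look roughly like the Python sketch below. This is not the original program: the function names (`small_primes`, `is_probable_prime`, `count_probable_primes`) are hypothetical, and the choice of random bases for each pass is an assumption, since the slides do not say which numbers are tested against.

```python
import random


def small_primes(limit=1000):
    """Sieve of Eratosthenes: every prime below `limit` (168 primes for 1000)."""
    flags = [True] * limit
    flags[0] = flags[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            for j in range(i * i, limit, i):
                flags[j] = False
    return [i for i, is_p in enumerate(flags) if is_p]


SMALL_PRIMES = small_primes()


def is_probable_prime(n, rounds=16):
    """Return False if n is certainly composite, True if n is probably prime."""
    if n < 2:
        return False
    # Cheap pre-filter: trial division by the small primes.
    for p in SMALL_PRIMES:
        if n == p:
            return True
        if n % p == 0:
            return False
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)  # assumption: random base per pass
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # exit on the first failed pass
    return True


def count_probable_primes(start, span, rounds=16):
    """Count probable primes in [start, start + span); both arguments are strings."""
    lo, width = int(start), int(span)
    return sum(1 for n in range(lo, lo + width) if is_probable_prime(n, rounds))
```

Calling `count_probable_primes("1000", "100")` walks the 100 numbers starting at 1000. The two string arguments mirror the starting value and range described on the slide; Python's arbitrary-precision integers stand in for whatever big-number library the original program used.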
Serial Results
• Finding small numbers was relatively fast
• Found 2263 primes 20 digits long in just .68 seconds
• Large numbers are a different story: 310 digits (current RSA standard) took 27.01 seconds to find only 118 primes

Parallel Algorithm
• Divided the range among the processors
• Each node checked its set and reported the number of primes it found
• Final reduction to sum up the count

Gains
• Saw incredible speedup due to the minimal communication needed
• Most of the real gains came from tweaking the serial algorithm: using the sieve and only checking odd numbers
• Would see much more from load balancing with OpenMP

[Figure: Single Parallel vs Serial Algorithm. Time (s) vs. Number of Decimal Digits (10 to 310); series: Single, Serial]

[Figure: All Parallel Test Runs. Time (s) vs. Number of Decimal Digits; series: 8, 16, 24, 32, 40, 48, 56, 64 Cores]

[Figure: Total Speedup. Speedup Factor vs. Number of Cores (1 to 64); series: 310, 240, 180, 120, 60 Digits]

[Figure: Efficiency: Ts/(P*Tp). Percent vs. Cores (1 to 64); series: 310, 240, 180, 120, 60 Digits]

Future Work
• Could be improved further by load balancing the test with OpenMP
• Exit on first failed test
• Much better synchronization would be possible
• Could also use this to divide the test into smaller pieces
• Implementation in CUDA using GPGPUs

Any Questions?
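For reference, the scheme on the Parallel Algorithm slide (split the range, count locally, reduce by summing) and the Ts/(P*Tp) efficiency measure can be sketched as below. This is a serial simulation under stated assumptions, not the original code: the workers run one after another in a loop rather than as separate processes, the trial-division `is_prime_stand_in` replaces the sieve plus Miller-Rabin test only to keep the block self-contained, and all function names are hypothetical.

```python
def split_range(start, length, workers):
    """Split [start, start + length) into `workers` contiguous, near-equal chunks."""
    base, extra = divmod(length, workers)
    chunks, lo = [], start
    for rank in range(workers):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        chunks.append((lo, lo + size))
        lo += size
    return chunks


def is_prime_stand_in(n):
    """Trial division; a stand-in for the sieve plus Miller-Rabin test."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True


def local_count(lo, hi):
    """Work done by one node: count the primes in its own chunk [lo, hi)."""
    return sum(1 for n in range(lo, hi) if is_prime_stand_in(n))


def parallel_count(start, length, workers):
    """Simulate the parallel run: one partial count per worker, then a sum reduction."""
    partial = [local_count(lo, hi) for lo, hi in split_range(start, length, workers)]
    return sum(partial)


def speedup(t_serial, t_parallel):
    """Speedup factor Ts / Tp."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, procs):
    """Efficiency Ts / (P * Tp), as defined on the Efficiency slide."""
    return t_serial / (procs * t_parallel)
```

Because each worker only returns a single integer, the only communication is the final sum, which is consistent with the minimal-communication point on the Gains slide. Equal-width chunks do not guarantee equal work, since candidates that survive the sieve cost far more than those that do not; that imbalance is what the load-balancing item under Future Work would address.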