CSEP 590 – Practical Applications of Modern Cryptography
Greg Hullender
Microsoft Campus
Tuesday, March 7, 2006
Final Project

Estimating a Lower Bound on the Entropy of HTTPS Public Keys

Greg Hullender

Abstract

Public-key cryptography secures online commerce on https sites, but it is only as secure as the keys themselves. A poor random-number generator could lead to some sites sharing prime factors, making it trivial to compute their private keys. This paper aims to place an upper bound on the actual risk by using a large sample of actual public keys to estimate a lower bound on the entropy of the primes chosen for the moduli of the keys.

Background

Most e-commerce depends on the security of "https" web pages. Those pages implement TLS/SSL [1], which depends on the security of RSA public-key cryptography [2]. In the RSA public-key system, a site owner picks two large prime numbers (p and q) at random, where "large" means anywhere from 256 to 4096 bits long. The owner publishes the product of those two primes (called the modulus) plus another number (called the exponent), and together these two numbers constitute the public key. The private key is simply the same modulus plus the multiplicative inverse of the public exponent, modulo (p-1)(q-1). The security of the algorithm rests on the fact that the private exponent cannot be computed easily without knowing p and q, and that the problem of factoring the modulus is intractable. Accordingly, the site owner has to keep p, q, and the private exponent as closely guarded secrets.

A potential weakness is the entropy of the random-number generator used to produce p and q. This is not just a theoretical concern; the same problem has affected other systems in the recent past. For example, in 2003 it was reported that many implementations of the popular Kerberos protocol used a random-number generator that had only about 20 bits of entropy [3]. It is at least conceivable that many sites using TLS/SSL have a similar problem.

To exploit this weakness, if it exists, an attacker would obtain a large number of public keys from https sites and compute the Greatest Common Divisor (GCD) of the moduli of all possible pairs of keys. Unlike the problem of factoring the modulus, the problem of computing the GCD of a pair of moduli is only O(n²), where n is the length of the modulus. If the GCD of two moduli is not 1, they share a prime, and the GCD is that shared prime. Dividing each modulus by the shared prime recovers the unshared one. The secret exponents (for both keys) can then be computed using the Extended Euclidean algorithm – the same way the legitimate owners created them in the first place [2].
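To make the recovery step concrete, here is a minimal sketch in Python. It is illustrative only: the experiments in this paper used the .NET Framework and the Miracl library, and the function names, placeholder values, and default public exponent 65537 below are assumptions for the example. If two moduli share a prime, the GCD exposes it, and each private exponent then follows by modular inversion, exactly as described above.

    # Illustrative sketch: recover both private keys from two RSA moduli
    # that share a prime factor. All names and values are placeholders.
    from math import gcd

    def recover_private_exponent(n, e, p):
        """Given modulus n = p*q, public exponent e, and one factor p,
        return the private exponent d = e^-1 mod (p-1)(q-1)."""
        q = n // p
        phi = (p - 1) * (q - 1)
        return pow(e, -1, phi)          # modular inverse (Python 3.8+)

    def attack_pair(n1, n2, e=65537):
        """If n1 and n2 share a prime, return (d1, d2); otherwise None."""
        g = gcd(n1, n2)
        if g == 1 or g == n1 or g == n2:
            return None                 # no usable shared factor
        return (recover_private_exponent(n1, e, g),
                recover_private_exponent(n2, e, g))

    # Toy usage with tiny primes (real keys use 512-bit primes):
    # attack_pair(61 * 53, 61 * 59, e=17)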
Experimental Procedure

Summary: I queried hundreds of thousands of domains for their public keys, collecting 50,727 unique 1024-bit moduli. I then computed the GCD of all 1.25 billion pairs; every GCD was 1, which means I did not find any compromised keys in my sample.

Details: I obtained a list of 15,000,000 "active" domains from Microsoft's MSN Search. A domain is "active" if it shows some sign of being used (e.g., customers click on it when it appears as a search result). The list included hit counts as well as domain names, and I selected domains from the top of it in order to maximize the probability of finding domains with public keys. (Someone could equally well start with the registrar database.) This potentially biases the experiment towards more popular sites, which one might expect to be more careful about choosing their keys, but casual examination of the list shows that most of these are very small organizations, so that risk is probably small.

Because TLS/SSL is a property of a domain, not a web page, it is sufficient to query domains to get public keys. Microsoft's .NET Framework 2.0 contains networking primitives to open a secure connection to a domain and authenticate its public key using its X.509 certificate [4]. In order to maximize the number of keys, I accepted all certificates, no matter what authentication problems they had (e.g., expiration). Unfortunately, Microsoft does not document the format it uses to store public keys, but Michel Gallant's RSAPubKeyData.cs [5] provides excellent APIs to extract keys in any of several different formats. Finally, I used the MIRACL library [6] from Shamus Software to do the GCD calculations. All experiments were done on a single-processor 2.4 GHz Pentium 4 running Windows XP.
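For comparison, the following is a rough Python equivalent of this collection step. It is a sketch under stated assumptions, not the code used for this paper: it relies on the standard ssl module and the third-party cryptography package, and the function name and parameters are invented for illustration. Like the procedure above, it accepts every certificate, valid or not, and keeps only the RSA modulus.

    # Illustrative Python equivalent of the .NET-based collection step.
    # Requires the third-party "cryptography" package to parse the DER
    # certificate; all names here are examples, not the paper's code.
    import socket
    import ssl
    from cryptography import x509

    def fetch_rsa_modulus(domain, port=443, timeout=10):
        """Connect to a domain over TLS, accept any certificate, and
        return the RSA modulus from its X.509 certificate (or None)."""
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.check_hostname = False          # accept all certificates,
        ctx.verify_mode = ssl.CERT_NONE     # expired or otherwise invalid
        with socket.create_connection((domain, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=domain) as tls:
                der = tls.getpeercert(binary_form=True)
        cert = x509.load_der_x509_certificate(der)
        key = cert.public_key()
        try:
            return key.public_numbers().n   # RSA modulus
        except AttributeError:
            return None                     # not an RSA key

A survey would simply call fetch_rsa_modulus for each domain in the list, discard failures and non-RSA keys, and de-duplicate the resulting moduli before the pairwise GCD pass.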
Experimental Results

I tested 598,705 domains and found 113,397 public keys, of which 60,543 were unique. (Duplicates result from the same company having multiple domains; for example, Google appears to use the same public key on dozens of different domains.) Of the unique keys, 50,727 (over 80%) were 1024-bit keys. This process took two weeks, although it could be reduced to much less with better use of multithreading.

Next I used MIRACL's multi-precision arithmetic package to compute the pairwise GCDs. My machine could do about 5,600 of these per second, so it took about 2½ days to completely analyze all 50,727 keys. This process did not detect any shared primes.

Interpreting the Results

We can answer (or partly answer) three important questions from these results: what it would take to scale this process up to test all key pairs on the web; the maximum number of compromised sites on the web; and the maximum number of sites using key generation with entropy as low as Kerberos v4's.

Scaling up

Since I could query about 100,000 domains per day, it would have taken 150 days to interrogate all 15,000,000 domains in my list. Obviously this scales linearly with the number of computers; someone with 20 machines could do it in just a week. (There are other optimizations that could speed this up, but those are outside the scope of this paper.) At the point I stopped, roughly one site in 5 had a public key, half of them duplicates. One expects this ratio to drop off for less and less popular sites, so there should certainly be no more than 1,500,000 sites with unique public keys. My machine would have taken 6 years to compute the GCDs of all these pairs, since that task scales quadratically. A modern Opteron 64-bit dual-processor system is about 7× as fast as my machine, so a set of 20 machines should be able to complete this task in under three weeks.

What is the maximum number of compromised sites on the web?

How many compromised pairs could there be, given that we found none? Imagine that there are x reused primes, and hence 2x vulnerable keys. (Ignore the possibility that a prime is reused three or more times.) For each reused prime, order the two keys that share it by the value of their unshared primes, so that one is the "left-hand" key and the other is the "right-hand" key.

If we randomly select M keys from a total population of N, the probability that a random URL is a left-hand key is

    $p = \frac{x}{N}$

and the binomial distribution tells us that the expected number of left-hand keys among the M URLs is

    $\bar{x} = \frac{xM}{N}$

That is, we ought to have drawn about x̄ left-hand keys, which means we should see a duplicate if we drew any of the corresponding x̄ right-hand keys. The odds of not seeing a duplicate in any of M keys (when there are x̄ such keys out of N) are

    $\left(1 - \frac{\bar{x}}{N}\right)^M = \left(1 - \frac{xM}{N^2}\right)^M = \frac{1}{2}$

where we assume 1/2 because we found none. Solving for x, and using N = 1,500,000 and M = 50,727, we get

    $x \approx \frac{N^2 \ln 2}{M^2} \approx 606$

So there is only a 50% chance that there are as many as 600 duplicated primes, and therefore as many as 1,200 vulnerable keys, out of 1,500,000, which is less than 0.1%. Since that estimate for the number of domains with unique keys is unquestionably high, it is unlikely there are as many as 1,000 compromised sites.

Estimating Entropy

Assume that most sites are generating truly random 512-bit primes, so there is no significant chance that any two will get the same prime. Now assume that some proportion of sites is using an inferior key generator with considerably lower entropy. We will use these variables:

    M     all sites surveyed
    m     proportion of sites using bad key generation (one site in m)
    M/m   number of sites using bad key generation
    b     bits of randomness (entropy) in the bad key generator
    p     probability that we missed seeing any collisions

This is simply a form of the "Birthday Problem" from statistics, treating M/m as the number of people and 2^b as the number of possible "birthdays." (See the Appendix.) Plugging these values directly into the birthday equation, we get

    $\frac{M}{m} = \sqrt{2^{b+1}\ln\frac{1}{p}}$

(This will not be very accurate for small numbers of bits, but we are not interested in key generators with entropy below 8 bits anyway, since they would be extremely unlikely to have gone undetected.)

Using this equation, we can estimate the proportion of sites that would have to be using the Kerberos v4 random-number generator for us to have a 50% chance of failing to see any collisions:

    $m = \frac{M}{\sqrt{2^{b+1}\ln\frac{1}{p}}} = \frac{50{,}727}{\sqrt{2^{21}\ln 2}} \approx 64$

In other words, we are 50% confident that no more than one site in 64 is using a key generator as bad as the Kerberos v4 one. If we had actually tested all 3,000,000 domains instead of just over 50,000, we would estimate no more than one in about 4,000 had that problem, assuming we still found no collisions.

From the same equation, what is the proportion if they are using the old 15-bit Unix rand() function?

    $m = \frac{50{,}727}{\sqrt{2^{16}\ln 2}} \approx 238$

So if even one site in 200 were using something as crude as rand(), we should have spotted a collision.

We can also put a limit on the entropy of the random-number generators used by all sites (that is, assume m = 1):

    $b = \log_2\!\left(\frac{(M/m)^2}{\ln\frac{1}{p}}\right) - 1 = \log_2\!\left(\frac{50{,}727^2}{\ln 2}\right) - 1 \approx 30.8$

This says we are 50% confident that the average site is generating its primes with at least 31 bits of randomness. Had we tested all 1,500,000 domains without seeing a collision, that estimate would go up to 40.5 bits.
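Several of these figures can be reproduced with a few lines of arithmetic. The Python fragment below is illustrative only (variable names are mine, and p = 1/2 throughout); it evaluates the duplicated-prime bound from the previous section and two of the entropy bounds from the formulas above.

    # Numeric check of the bounds derived above (p = 1/2 throughout).
    from math import log, log2, sqrt

    M = 50_727       # unique 1024-bit moduli sampled
    N = 1_500_000    # estimated sites with unique public keys
    ln2 = log(2)     # ln(1/p) with p = 1/2

    # 50%-confidence bound on the number of duplicated primes
    x = N**2 * ln2 / M**2
    print(f"duplicated primes x <= {x:.0f}")                      # ~606

    # Proportion of sites that could use a 15-bit rand()-style generator
    b = 15
    m = M / sqrt(2**(b + 1) * ln2)
    print(f"one site in {m:.0f} could use a {b}-bit generator")   # ~238

    # Lower bound on average entropy, assuming every site (m = 1) is alike
    b_min = log2(M**2 / ln2) - 1
    print(f"average entropy >= {b_min:.1f} bits")                 # ~30.8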
Conclusions

If this low-entropy key generation is a problem at all, it certainly is not a common one. Someone with sufficient resources could do a comprehensive test of all domains, but the results of this sample show that he or she is unlikely to retrieve as many as 1,000 private keys that way. Further, since the test explicitly included all the high-value sites on the Internet, it seems impossible that a malefactor could actually make money by attempting this.

References

[1] Tim Dierks and Eric Rescorla, 2005, The TLS Protocol, Version 1.1, The Internet Engineering Task Force, http://www.ietf.org/internet-drafts/draft-ietf-tls-rfc2246-bis-13.txt
[2] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone, 1997, Handbook of Applied Cryptography, CRC Press, pp. 287 ff.
[3] Massachusetts Institute of Technology, 2003, Security Advisory MITKRB5-SA-2003-004, http://web.mit.edu/kerberos/www/advisories/MITKRB5-SA-2003-004-krb4.txt
[4] R. Housley, W. Ford, W. Polk, and D. Solo, 1999, Internet X.509 Public Key Infrastructure: Certificate and CRL Profile, The Internet Engineering Task Force, http://www.ietf.org/rfc/rfc2459.txt
[5] Michel I. Gallant, JKeyNet: Using ASN.1 Encoded Public Keys in .NET, JavaScience, http://www.jensign.com/JavaScience/dotnet/JKeyNet/index.html
[6] Shamus Software, Ltd., MIRACL: Multiprecision Integer and Rational Arithmetic C/C++ Library, http://indigo.ie/~mscott/

Appendix

The Birthday Problem

A group of m people having n possible birthdays (usually 365) has the following probability of no shared birthdays:

    $p = \prod_{i=1}^{m-1} \frac{n-i}{n} = \prod_{i=1}^{m-1} \left(1 - \frac{i}{n}\right)$

Making use of the following approximation (which is quite good when i/n is small),

    $\ln\!\left(1 - \frac{i}{n}\right) \approx -\frac{i}{n}$

we can approximate

    $\ln p \approx -\sum_{i=1}^{m-1} \frac{i}{n} = -\frac{m(m-1)}{2n} \approx -\frac{m^2}{2n}$

Since we usually want to estimate the expected number of people before the probability of no collisions drops to p, we solve for m:

    $m \approx \sqrt{2n \ln\frac{1}{p}}$

This is the formula used above.
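As a quick sanity check of this approximation, the short Python sketch below (function names are mine; the classic n = 365 case is used only for illustration) compares the exact product with the estimate derived above.

    # Compare the exact birthday probability with the approximation above.
    from math import exp, prod, sqrt, log

    def p_no_collision(m, n):
        """Exact probability that m people with n equally likely
        birthdays all have distinct birthdays."""
        return prod(1 - i / n for i in range(1, m))

    def m_for_probability(p, n):
        """Approximate number of people at which the probability of
        no collision falls to p (the formula derived above)."""
        return sqrt(2 * n * log(1 / p))

    n = 365
    m = round(m_for_probability(0.5, n))     # ~22 people
    print(m, p_no_collision(m, n))           # exact value is close to 0.5
    print(exp(-m * m / (2 * n)))             # e^(-m^2/2n), also close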