CSEP 590 – Practical Applications of Modern Cryptography
Greg Hullender
Microsoft Campus
Tuesday, March 7, 2006
Final Project
Estimating a Lower Bound on the Entropy of HTTPS Public Keys
Greg Hullender
Abstract
Public-key cryptography secures online commerce on https sites, but that commerce is only as secure as the keys themselves. A poor random-number generator could lead some sites to share prime factors, making it trivial to compute their private keys. This paper aims to place an upper bound on the actual risk by using a large sample of real public keys to estimate a lower bound on the entropy of the primes chosen for the moduli of the keys.
Background
Most e-commerce depends on the security of “https” web pages. Those pages implement “TLS/SSL” [1], which depends on the security of RSA public-key cryptography [2].
In the RSA public key system, a site-owner picks two large prime numbers (p and q) at
random, where large means anywhere from 256 to 4096 bits long. The owner publishes
the product of those two primes (called the modulus) plus another number (called the
exponent) and together these two numbers constitute the public key. The private key is
simply the same modulus plus the multiplicative inverse of the public exponent, modulo
(p-1)(q-1). The security of the algorithm rests on the fact that the private exponent
cannot be computed easily without knowing p and q, and that the problem of factoring
the modulus is intractable. Accordingly, the site owner has to keep p, q, and the private
exponent as closely-guarded secrets.
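To make these quantities concrete, here is a minimal sketch of RSA key construction in Python. The primes are deliberately tiny toy values chosen for illustration; real keys use primes of 512 bits or more.

```python
from math import gcd

# Toy RSA key construction (NOT secure: real primes are 512+ bits).
p, q = 61, 53                # two "randomly chosen" primes
n = p * q                    # the modulus, published as part of the public key
e = 17                       # the public exponent; must be coprime to (p-1)(q-1)
phi = (p - 1) * (q - 1)
assert gcd(e, phi) == 1
d = pow(e, -1, phi)          # private exponent: inverse of e modulo (p-1)(q-1)

m = 42                       # a message encoded as a number smaller than n
c = pow(m, e, n)             # encrypt with the public key
assert pow(c, d, n) == m     # decrypting with the private key recovers the message
```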
A potential weakness is the entropy of the random-number generator used to produce p
and q. This is not just a theoretical concern; the same problem has affected other systems
in the recent past. For example, in 2003, it was reported that many implementations of
the popular Kerberos protocol used a random-number generator that only had about 20
bits of entropy. [3] It’s at least conceivable that many sites using TLS/SSL have a
similar problem.
To exploit this weakness, if it exists, an attacker would obtain a large number of public
keys from https sites and compute the Greatest Common Divisor (GCD) of the moduli of
all possible pairs of keys. Unlike the problem of factoring the modulus, the problem of
computing the GCD from a pair of keys is only O(n²), where n is the length of the
modulus. If the GCD of two keys is not 1, that means they share a prime, and the GCD
will be that shared prime. Division of each modulus by the shared prime recovers the
unshared one. The secret exponents (for both keys) can then be computed using the
Extended Euclidean algorithm – the same way the legitimate owners created them in the
first place. [2]
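A minimal sketch of that attack, again with toy numbers: the two moduli below are assumed to share the prime 61, and the public exponent 17 stands in for whatever exponents the two sites actually published.

```python
from math import gcd

# Two toy moduli that (unknown to the attacker) share the prime factor 61.
p, q1, q2 = 61, 53, 47
n1, n2 = p * q1, p * q2
e = 17                                        # public exponent assumed for both keys

shared = gcd(n1, n2)                          # cheap compared with factoring the moduli
if shared != 1:                               # a GCD other than 1 is the shared prime
    r1, r2 = n1 // shared, n2 // shared       # dividing it out recovers the other primes
    d1 = pow(e, -1, (shared - 1) * (r1 - 1))  # private exponents via the extended
    d2 = pow(e, -1, (shared - 1) * (r2 - 1))  # Euclidean algorithm (modular inverse)
    print("recovered private exponents:", d1, d2)
```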
Experimental Procedure
Summary: I queried hundreds of thousands of domains for their public keys, collecting 50,727 unique 1024-bit moduli. I then computed the GCDs of all 1.25 billion pairs; every GCD was 1, which means I didn’t find any compromised keys in my sample.
Details: I got a list of 15,000,000 “active” domains from Microsoft’s MSN Search. A
domain is “active” if it shows some signs of being used. (E.g. customers click on it when
it shows as a result of a search.) The list included hit counts as well as domain names,
and I selected domains from the top of it in order to maximize the probability that I
would find domains with public keys. (Someone could equally well start with the
Registrar database.) This potentially has the effect of biasing the experiment towards
more popular sites, which one might expect to be more careful about choosing their keys,
but casual examination of the list reveals that most of these are very small organizations,
so that risk is probably small.
Because TLS/SSL is a property of a domain, not a web page, it’s sufficient to query
domains to get public keys. Microsoft’s .NET Framework 2.0 contains primitives in the
TCPIP library to open a secure connection to a domain and authenticate its public key
using its X509 [4] certificate. In order to maximize the number of keys, I accepted all
certificates, no matter what authentication problems they had (e.g. expiration).
Unfortunately, Microsoft does not document the format it uses to store public keys, but Michel Gallant’s RSAPubKeyData.cs [5] provides excellent APIs to extract keys in any of several different formats.
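For readers who want to reproduce the collection step without the .NET Framework, here is a rough Python sketch using the standard ssl module and the third-party cryptography package; those libraries are my assumption, not the tools used in this project. Like the original procedure, it accepts certificates without validating them.

```python
import ssl
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import rsa

def fetch_modulus(host, port=443):
    """Fetch a host's certificate without validating it; return the RSA modulus or None."""
    try:
        pem = ssl.get_server_certificate((host, port))   # no CA list given, so no validation
    except OSError:
        return None                                      # no HTTPS listener, timeout, etc.
    cert = x509.load_pem_x509_certificate(pem.encode())
    key = cert.public_key()
    if isinstance(key, rsa.RSAPublicKey):
        return key.public_numbers().n                    # modulus of the RSA public key
    return None                                          # e.g. a non-RSA certificate

# Deduplicate moduli across domains before doing any pairwise work.
moduli = {m for m in (fetch_modulus(d) for d in ["example.com"]) if m}
```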
Finally, I used the Miracl [6] library from Shamus Software to do the GCD calculations.
All experiments were done on a single-processor 2.4GHz Pentium 4 running Windows
XP.
Experimental Results
I tested 598,705 domains and found 113,397 public keys, of which 60,543 were unique.
(Duplicates result from the same company having multiple domains. For example,
Google appears to use the same public key on dozens of different domains.) Of the
unique ones, 50,727 (over 80%) were 1024-bit keys. This process took two weeks, although it could have been made considerably faster with better use of multithreading.
Next I used Miracl’s multi-precision arithmetic package to compute the pair-wise GCDs.
My machine could do about 5600 of these per second, so it took about 2 ½ days to
completely analyze all 50,727 keys.
This process did not detect any duplicates.
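A rough Python equivalent of that pairwise scan (a naive double loop over all M(M−1)/2 pairs, which is what the running times above reflect; the Miracl-based code used for the experiment is not shown here):

```python
from math import gcd
from itertools import combinations

def find_shared_primes(moduli):
    """Return (n1, n2, shared prime) for every pair of moduli with a common factor."""
    hits = []
    for n1, n2 in combinations(moduli, 2):   # all M*(M-1)/2 pairs of unique moduli
        g = gcd(n1, n2)
        if g != 1:                           # a shared prime compromises both keys
            hits.append((n1, n2, g))
    return hits
```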
Interpreting the Results
We can answer (or partly answer) three important questions from these results:

• What would it take to scale this process up to test all key pairs on the web?
• What is the maximum number of compromised sites on the web?
• What is the maximum number of sites using key generation with entropy as low as Kerberos v4?
Scaling up
Since I could query about 100,000 domains per day, it would have taken 150 days to
interrogate all 15,000,000 domains in my list. Obviously this scales linearly with the
number of computers – someone with 20 machines could do it in just a week. (There are
other optimizations that could speed this up, but those are outside the scope of this
paper.)
At the point I stopped, roughly one site in 5 had a public key – half of them duplicates. One expects this rate to drop off among less and less popular sites, so there should be no more than 1,500,000 sites with unique public keys. My machine would have
taken 6 years to compute the GCD of all these pairs, since that task scales quadratically.
A modern Opteron 64-bit dual-processor system is about 7× as fast as my machine, so a
set of 20 machines should be able to complete this task in under three weeks.
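These estimates are easy to reproduce; here is a small sketch using the throughput figures quoted above (the 7× and 20-machine factors are the same assumptions made in the text):

```python
DOMAINS     = 15_000_000      # domains in the MSN Search list
QUERY_RATE  = 100_000         # domains queried per day on one machine
UNIQUE_KEYS = 1_500_000       # upper estimate of sites with unique public keys
GCD_RATE    = 5_600           # pairwise GCDs per second on the test machine

query_days = DOMAINS / QUERY_RATE                    # ~150 days on a single machine
pairs      = UNIQUE_KEYS * (UNIQUE_KEYS - 1) // 2    # ~1.1e12 key pairs
gcd_years  = pairs / GCD_RATE / (3600 * 24 * 365)    # ~6.4 years on a single machine
cluster_days = gcd_years * 365 / (20 * 7)            # 20 machines, each ~7x faster
print(round(query_days), round(gcd_years, 1), round(cluster_days))   # 150, 6.4, 17
```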
What is the maximum number of compromised sites on the web?
How many compromised pairs could there be, given that we found none? Imagine that there are x reused primes; that gives x vulnerable pairs and 2x vulnerable keys. Order these pairs according to the value of the non-duplicated prime, so each pair has a left half and a right half. (Ignore the possibility of triples or worse.) If we randomly select M keys from a total population of N, then we’d get the following probabilities:
$$p = \frac{x}{N} \qquad \text{(probability that a random URL is a left half)}$$

The binomial distribution tells us that

$$\bar{x} = \frac{xM}{N} \qquad \text{(expected number of left halves among the M URLs)}$$

That is, we ought to have x̄ “left-hand” keys, which means we should see a duplicate if we drew any of the corresponding x̄ “right-hand” keys. The odds of not seeing a duplicate in any of M keys (when there are x̄ out of N) are
$$\left(1 - \frac{\bar{x}}{N}\right)^M = \left(1 - \frac{xM}{N^2}\right)^M \approx \frac{1}{2} \qquad \text{(taken to be 1/2 because we found none)}$$

Solving for x and using 1,500,000 for N and 50,727 for M, we get

$$x \approx \frac{N^2}{M^2}\ln 2 \approx 606$$
So there is only a 50% chance that there are as many as 600 duplicated primes (about 0.04% of the 1,500,000 unique keys), and therefore as many as 1,200 vulnerable keys. Since that estimate for the number of domains with unique keys is unquestionably high, it’s unlikely there are as many as 1,000 compromised sites.
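A quick numerical check of that bound (a sketch; N and M are the values defined above):

```python
from math import log

N = 1_500_000   # estimated population of sites with unique public keys
M = 50_727      # unique 1024-bit moduli actually sampled

x = (N / M) ** 2 * log(2)   # duplicated primes consistent with seeing no collision
print(round(x))             # ~606, i.e. at most ~1,200 vulnerable keys at 50% odds
```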
Estimating Entropy
Assume that most users are generating truly random 512-bit primes, so there is no
significant chance that any two will get the same prime. Now assume that some
proportion of sites are using an inferior key generator with considerably lower entropy.
We’ll use these variables:
M      All sites surveyed
m      Proportion of sites using bad key generation (i.e., one site in m uses it)
M/m    Number of sites using bad key generation
b      Number of bits of randomness (entropy) in the bad key generator
p      Probability we missed seeing any collisions

This is simply a form of the “Birthday Problem” from statistics, treating M/m as the number of people and 2^b as the number of possible “birthdays.” (See Appendix.)
Directly plugging these numbers into the Birthday Equation we get
$$\frac{M}{m} \approx \sqrt{2^{b+1}\ln\frac{1}{p}}$$
(This will not be very accurate for small numbers of bits, but we’re not interested in key
generators with entropy below 8 anyway, since they’re extremely unlikely to have gone
undetected.)
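In code the bound is a one-liner; this sketch just restates the equation, using the 15-bit rand() case discussed below as a check.

```python
from math import log, sqrt

def max_bad_sites(b, p=0.5):
    """Largest number of sites (M/m) that could share a b-bit key generator and still,
    with probability p, show no repeated prime among the sampled keys."""
    return sqrt(2 ** (b + 1) * log(1 / p))

M = 50_727
m = M / max_bad_sites(15)   # "one site in m" could be using a 15-bit generator
print(round(m))             # ~238, matching the rand() estimate below
```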
Using this equation, we can estimate the proportion of sites that would have to be using
the Kerberos v4 random-number generator for us to have a 50% chance of failing to see
any collisions:
$$m \approx \frac{M}{\sqrt{2^{b+1}\ln\frac{1}{p}}} = \frac{50{,}727}{\sqrt{2^{21}\ln 2}} \approx 64$$
In other words, we’re 50% confident that no more than one site in 64 is using a key
generator as bad as the Kerberos v4 one. If we actually had tested all 3,000,000 domains
instead of just over 50,000, we’d estimate no more than one in about 4,000 had that
problem – assuming we still found no collisions.
From the same equation, what’s the proportion if they’re using the old 15-bit Unix rand()
function?
$$m \approx \frac{50{,}727}{\sqrt{2^{16}\ln 2}} \approx 238$$
So if even one site in 200 were using something as crude as the rand() function, we
should have spotted one.
We can also put a limit on the entropy of the random number generators used by all sites (that is, assume m = 1):

$$b = \log_2\!\left(\frac{(M/m)^2}{\ln\frac{1}{p}}\right) - 1 = \log_2\!\left(\frac{50{,}727^2}{\ln 2}\right) - 1 \approx 30.8$$
This says we’re 50% confident that the average site is at least generating 31-bit random
primes. Had we tested all 1,500,000 domains without a collision, that estimate would go
up to 40.5 bits.
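The same bound solved for b, as a sketch (the two sample sizes are the ones discussed above):

```python
from math import log, log2

def entropy_lower_bound(M, p=0.5, m=1.0):
    """Minimum bits of entropy consistent with seeing no shared prime among M keys."""
    return log2((M / m) ** 2 / log(1 / p)) - 1

print(entropy_lower_bound(50_727))      # ~30.8 bits for the sample actually tested
print(entropy_lower_bound(1_500_000))   # ~40.5 bits had every domain been tested
```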
Conclusions
If this low-entropy key generation is a problem at all, it certainly isn’t a common one.
Someone with the resources could certainly do a comprehensive test of all domains, but
the results of this sample show that he or she is unlikely to retrieve as many as 1,000
private keys this way. Further, since the test explicitly included all the high-value sites
on the Internet, it seems impossible that a malefactor could actually make money by
attempting this.
References
[1] Tim Dierks, Eric Rescorla, 2005, The TLS Protocol, Version 1.1, The Internet
Engineering Task Force, http://www.ietf.org/internet-drafts/draft-ietf-tlsrfc2246-bis-13.txt
[2] Alfred J. Menezes, Paul C. van Oorschot, Scott A. Vanstone, 1997, Handbook of
Applied Cryptography, CRC Press, pp 287 ff.
[3] Massachusetts Institute of Technology, 2003, Security Advisory
http://web.mit.edu/kerberos/www/advisories/MITKRB5-SA-2003-004-krb4.txt
[4] R. Housley, W. Ford, W. Polk, D. Solo, 1999, Internet X.509 Public Key
Infrastructure, Certificate and CRL Profile, The Internet Engineering Task Force,
http://www.ietf.org/rfc/rfc2459.txt
[5] Michel I. Gallant, JkeyNet: Using ASN.1 encoded public keys in .NET, JavaScience,
http://www.jensign.com/JavaScience/dotnet/JKeyNet/index.html
[6] Shamus Software, Ltd., Multiprecision Integer and Rational Arithmetic C/C++
Library, http://indigo.ie/~mscott/
Appendix
The Birthday Problem
A group of m people having n possible birthdays (usually 365) has the following
probability of no shared birthdays
$$p = \prod_{i=1}^{m-1}\frac{n-i}{n} = \prod_{i=1}^{m-1}\left(1 - \frac{i}{n}\right)$$
Making use of the following approximation (which is quite good if i/n is small)
$$\ln\left(1 - \frac{i}{n}\right) \approx -\frac{i}{n}$$
We can approximate ln p as

$$\ln p \approx -\sum_{i=1}^{m-1}\frac{i}{n} = -\frac{m(m-1)}{2n} \approx -\frac{m^2}{2n}$$
Since we usually want to estimate the expected number of people before the probability of no collisions drops to p, we solve for m:

$$m \approx \sqrt{2n\ln\frac{1}{p}}$$

This is the formula used above.
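The approximation is easy to check numerically against the exact product; a small sketch follows (the test values m = 23 and n = 365 are arbitrary):

```python
from math import exp, log, prod, sqrt

def exact_no_collision(m, n):
    """Exact probability that m people with n equally likely birthdays are all distinct."""
    return prod(1 - i / n for i in range(1, m))

def approx_no_collision(m, n):
    """The exp(-m^2 / 2n) approximation derived above."""
    return exp(-m * m / (2 * n))

m, n = 23, 365
print(exact_no_collision(m, n))    # ~0.493
print(approx_no_collision(m, n))   # ~0.484
print(sqrt(2 * n * log(2)))        # ~22.5 people for a 50% chance of a repeat
```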