
CSE 525 Randomized Algorithms & Probabilistic Analysis
Winter 2008
Lecture 2: January 09
Lecturer: James R. Lee
Scribes: Elisa Celis and Andrey Kolobov
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
2.1 Random measurements
By cleverly measuring (i.e. viewing data) at random, it is sometimes possible to give a correct answer
with high probability while reducing the amount of time or storage needed for computation. Random
measurements have applications in dimensionality reduction, nearest-neighbor search, compressed sensing,
and other problems.
2.1.1 String Equality
Suppose Alice has a string a = a0 , . . . , an−1 , and Bob has a string b = b0 , . . . , bn−1 . How can they determine
whether a equals b?
A trivial solution would have Alice send a to Bob, and have Bob compare it to b. While this is correct, it is
non-optimal because n bits need to be transmitted. We want to do better.
The Fingerprint method uses a random measurement as follows. Without loss of generality, assume ai , bi ∈
{0, 1}, so a is a binary representation of some integer t. Now, let Alice choose a uniformly random prime
number p ∈ {2, . . . , T } where t ≤ T . (In Section 2.3, we give a good way to choose p). Let the fingerprint of
a be defined as Fp(a) = a mod p. Now, Alice sends Fp(a) and p to Bob, and Bob computes Fp(b). If Fp(a) = Fp(b), Bob outputs YES; otherwise he outputs NO.
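A minimal sketch of this protocol in Python (the function names are our own, and the prime is found here by naive trial division purely for illustration; Section 2.3 discusses how to test primality properly):

```python
import random

def is_prime_naive(m):
    """Trial division; fine for the small T used in this illustration."""
    if m < 2:
        return False
    return all(m % k for k in range(2, int(m**0.5) + 1))

def random_prime(T):
    """Sample a uniformly random prime in {2, ..., T} by rejection sampling."""
    while True:
        m = random.randrange(2, T + 1)
        if is_prime_naive(m):
            return m

def fingerprint(bits, p):
    """F_p(a) = a mod p, where a is the integer with binary representation `bits`."""
    return int(bits, 2) % p

# Alice and Bob compare n-bit strings by exchanging only p and F_p(a).
a = "1011010111010010"
b = "1011010111010010"
p = random_prime(10**6)
print(fingerprint(a, p) == fingerprint(b, p))  # always True when a == b
```

When a = b the fingerprints agree for every choice of p, which is the one-sidedness of the error discussed next.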
The error for this solution is one-sided since Fp(a) = Fp(b) whenever a = b, so there are no false negatives. However, if a ≠ b we can still have Fp(a) = Fp(b), so there may be false positives. We wish to show Pr[error] is small when a ≠ b.
Consider the prime-counting function π(x) = |{p : p is prime, p ≤ x}|. If Fp(a) = Fp(b), then a ≡ b (mod p), so p divides |a − b|. We can now use the following proposition.
Proposition 2.1. A nonzero n-bit integer has at most n distinct prime divisors.
Proof. Each distinct prime divisor is at least 2, and the integer itself is at most 2^n − 1. Therefore, by the fundamental theorem of arithmetic, there are no more than n prime divisors.
Since |a − b| is a nonzero n-bit integer, there are at most n primes that divide |a − b|. Therefore,

Pr[error] ≤ n / π(T).

Additionally, recall the following theorem.
Theorem 2.2 (Prime Number Theorem). π(x) ∼ x/ln(x) as x → ∞. In particular, for x ≥ 17,

x/ln(x) ≤ π(x) ≤ 1.26x/ln(x).
Hence, Pr[error] ≤ n ln(T)/T. If we want this to be small, we can simply choose T = cn ln(n). Thus,

Pr[error] ≤ n · ln(cn ln(n)) / (cn ln(n)) = (1/c) · (1 + ln(c ln(n))/ln(n)).

Since ln(c ln(n))/ln(n) = o(1), we have

Pr[error] ≤ 1/c + o(1),
so we can make P r[error] arbitrarily small with the appropriate choice of c.
Since p ≤ T, we know p will use at most log(cn ln(n)) bits. To improve this, we take note of the following.
Fact 2.3. A nonzero n-bit number has at most π(n) prime factors.
Thus, Pr[error] ≤ π(n)/π(T), and we can choose T = cn to get

Pr[error] ≤ (1.26n/ln(n)) · (ln(T)/T) = (1.26/c) · (ln(cn)/ln(n)).

Hence

Pr[error] ≤ (1.26/c) · (1 + o(1)).
This way we can make Pr[error] arbitrarily small with a prime p of length at most log(cn) bits. Thus, both p and Fp(a) use O(log n) bits, and Alice sends Bob O(log n) bits.
To give a concrete example, if n = 1 MB (approximately 2^23 bits) and T = 2^32 (a 32-bit fingerprint), then Pr[error] ≈ 0.0035.
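This figure can be sanity-checked with the Prime Number Theorem bounds from Theorem 2.2, bounding π(n)/π(T) from above via π(n) ≤ 1.26n/ln(n) and π(T) ≥ T/ln(T) (a rough estimate, not an exact computation of π):

```python
import math

n = 2**23   # string length in bits (about 1 MB)
T = 2**32   # fingerprints taken modulo a 32-bit prime

# Pr[error] <= pi(n)/pi(T) <= (1.26 n / ln n) * (ln T / T)
bound = (1.26 * n / math.log(n)) * (math.log(T) / T)
print(round(bound, 4))  # roughly 0.0034
```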
2.1.2 Pattern Matching
Suppose we have two input strings, X = x0 , . . . , xn−1 and Y = y0 , . . . , ym−1 , where m < n. How do we
determine whether Y is a contiguous substring of X? Let X(j) = xj , xj+1 , . . . , xj+m−1 . We can now ask if
X(j) = Y for some j ∈ {0, 1, . . . , n − m}.
The most trivial deterministic algorithm (explicitly comparing every contiguous substring of X to Y ) takes
O(mn) time. There is a more efficient algorithm that works in O(m + n) time, but is hard to implement, has
large overhead, and does not generalize well to similar problems. We will provide a randomized approach
that also works in O(m + n), but is simpler and easier to extend.
As before, let us treat X(j) and Y as binary integers. Choose a random prime p ∈ {2, . . . , T }, and compute Fp(Y) = Y mod p and Fp(X(j)) for all j ∈ {0, 1, . . . , n − m}. If there exists some j for which Fp(X(j)) = Fp(Y), output MATCH; otherwise output NO MATCH.
The error is one-sided since there are no false negatives, but there may be false positives. If X(j) ≠ Y for every j ∈ {0, 1, . . . , n − m}, then by the union bound Pr[error] ≤ n · π(m)/π(T). To get a tighter bound, recall that if Fp(X(j)) = Fp(Y) then p divides |X(j) − Y|. Thus, if there is an error, p divides the product

∏_{j=0}^{n−m} |X(j) − Y|.
Since |X(j) − Y| is an m-bit number, and we multiply n − m + 1 of these together, the product is at most an nm-bit integer. Thus, Pr[error] ≤ π(mn)/π(T), and if we choose T = cmn we have

Pr[error] ≤ (1.26mn/ln(mn)) · (ln(cmn)/(cmn)) = (1.26/c) · (1 + o(1)).

Hence we can make the error arbitrarily small using a prime p with no more than O(log(mn)) = O(log(n^2)) = O(log(n)) bits.
A trivial bound on the runtime of this algorithm is O(mn), since we must compute Fp(X(j)) (in time O(m)) for each of the n − m + 1 values of j, giving a runtime of O(m(n − m)) = O(nm). This is worse than the best deterministic algorithm. However, Karp and Rabin improved the runtime to O(n + m) in 1981 by making the following observation: if we know Fp(X(j)), then we can compute Fp(X(j + 1)) in O(1) steps.
Specifically (under a binary representation), we know that X(j + 1) = 2(X(j) − 2^{m−1} x_j) + x_{j+m} (note that the most significant bits in this representation are to the left). In other words, we need only “slide” the m-bit window one bit along the string as depicted below.
By performing all the arithmetic modulo p, we see that Fp(X(j + 1)) = Fp(2(Fp(X(j)) − Fp(2^{m−1}) x_j) + x_{j+m}). This takes O(1) time (assuming that p fits into a standard integer variable, a fact we assumed implicitly in the previous runtime analysis). Therefore, the runtime becomes O(n + m) = O(n).
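A sketch of the rolling-fingerprint computation on bit strings, following the update rule above (the function name is our own; this is the Monte Carlo version, so matches are not verified and false positives remain possible):

```python
def karp_rabin(X, Y, p):
    """Return True if some window X(j) has the same fingerprint as Y modulo prime p."""
    n, m = len(X), len(Y)
    if m > n:
        return False
    fy = fx = 0
    for i in range(m):          # fingerprints of Y and of the first window X(0)
        fy = (2 * fy + Y[i]) % p
        fx = (2 * fx + X[i]) % p
    msb = pow(2, m - 1, p)      # F_p(2^{m-1}), used to drop the window's leading bit
    for j in range(n - m + 1):
        if fx == fy:
            return True         # probable match at position j
        if j < n - m:           # slide: F_p(X(j+1)) from F_p(X(j)) in O(1)
            fx = (2 * (fx - msb * X[j]) + X[j + m]) % p
    return False

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(karp_rabin(bits, [1, 1, 0], 10007))  # True: the pattern occurs at j = 2
```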
To give a concrete example, if we are looking for a substring of length m = 2^8 in a DNA string of length n = 2^14, then picking T = 2^32 (so p is a 32-bit integer) yields Pr[error] ≤ 0.002.
2.2 Types of Randomized Algorithms
There are different kinds of randomized algorithms. In a Monte Carlo Algorithm, the output for a single
instance may differ from run to run and may be incorrect. However, a Las Vegas Algorithm is never incorrect
(it always returns a correct result, or reports a failure) and has a potentially unbounded worst-case runtime,
but a small expected runtime.
The algorithms we saw above were Monte Carlo algorithms. Every Las Vegas algorithm can be turned into a Monte Carlo algorithm by running it for (a constant multiple of) its expected runtime and guessing an answer if it has not yet terminated.
It is unknown whether every Monte Carlo algorithm can be converted into a Las Vegas algorithm. However,
we can convert the above algorithm for pattern matching into a Las Vegas algorithm as follows.
Since the only error that can occur is a false positive, we need only check each case where Fp(X(j)) = Fp(Y) to make sure it is a true positive. Check each such case explicitly using bit-by-bit comparison. If we are very unlucky, we will need to do this n − m + 1 times, so the worst-case runtime is O(mn). However, this is extremely unlikely, and it can be shown that the expected running time is O(n) and is highly concentrated.
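A sketch of this Las Vegas variant (the function name is our own): each fingerprint match is confirmed by direct comparison before it is reported, so the answer is never wrong, even for an unluckily chosen prime.

```python
def las_vegas_match(X, Y, p):
    """Return a verified position where Y occurs in X, or None; never a false positive."""
    n, m = len(X), len(Y)
    if m > n:
        return None
    fy = fx = 0
    for i in range(m):
        fy = (2 * fy + Y[i]) % p
        fx = (2 * fx + X[i]) % p
    msb = pow(2, m - 1, p)
    for j in range(n - m + 1):
        # Candidate found by fingerprint; verify bit by bit before answering.
        if fx == fy and X[j:j + m] == Y:
            return j                      # verified true positive
        if j < n - m:
            fx = (2 * (fx - msb * X[j]) + X[j + m]) % p
    return None

print(las_vegas_match([1, 0, 1, 1, 0, 1], [1, 1, 0], 2))  # 2, even with a terrible prime
```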
2.3 Primality Testing
To implement the aforementioned algorithms, we need to be able to find prime numbers. In practice, we
obtain a prime p ∈ {2, . . . , T } by choosing an integer m ∈ {2, . . . , T } uniformly at random, and checking
if m is a prime. Theorem 2.2 says we only need approximately ln(T ) attempts before we find a prime
number. Thus, the only remaining question is how to determine if an integer is prime. Establishing this deterministically is a hard problem. A naive solution, trying to divide m by every k = 2, 3, . . . , ⌊√m⌋, has runtime exponential in the size of the input, while the best known deterministic algorithm (Agrawal, Kayal and Saxena, 2002) runs in O(n^6) time (where n is the number of bits of m), and is impractical for many applications.
We wish to develop a randomized algorithm for primality testing based on the following theorem.
Theorem 2.4 (Fermat’s Little Theorem). If p is a prime number and p does not divide a, then a^{p−1} ≡ 1 (mod p).
To determine if m is prime, choose a ∈ {2, 3, . . . , m − 1} uniformly at random. If gcd(a, m) ≠ 1, output NOT PRIME. Additionally, if a^{m−1} ≢ 1 (mod m), output NOT PRIME. If m passes these two tests, return PRIME.
We can compute gcd(a, m) in O(log^2 m) time with Euclid’s algorithm. Additionally, a^{m−1} mod m can be computed in O(log^2 m) time by modular exponentiation (compute a^2 mod m, then a^4 mod m, etc.). Thus, the total runtime is O(log^2 m), polynomial in the size of the input.
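A sketch of the Fermat test, with independent repetitions to drive the error down (the trial count is an arbitrary illustrative choice, and the function names are our own):

```python
import math
import random

def fermat_test(m, trials=20):
    """Monte Carlo primality test: PRIME may be wrong, NOT PRIME never is."""
    if m < 4:
        return m in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, m - 1)
        if math.gcd(a, m) != 1:
            return False                  # a shares a factor with m
        if pow(a, m - 1, m) != 1:         # fast modular exponentiation
            return False                  # m violates Fermat's little theorem
    return True                           # probably prime

# As described above, a random prime p <= T is then found by rejection sampling:
def random_prime(T):
    while True:
        m = random.randrange(2, T + 1)
        if fermat_test(m):
            return m                      # prime with high probability

print(fermat_test(2**31 - 1))  # True: 2147483647 is prime
```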
The error is once again one-sided since we can only get false positives. However, this test is far from perfect, since there are infinitely many Carmichael numbers: composite numbers m such that a^{m−1} ≡ 1 (mod m) for all a with gcd(a, m) = 1. It is known that Pr[m is Carmichael] → 0 as b → ∞, where b is the number of bits in m. Thus, we are unlikely to encounter a Carmichael number. However, even if we do not encounter one, this does not necessarily mean we are safe; a non-Carmichael number m may still satisfy a^{m−1} ≡ 1 (mod m) for most a. We will consider this problem in the next lecture, and introduce the Miller–Rabin test, which circumvents the problem of Carmichael numbers altogether.