Hash Functions

Introduction to
Algorithms
Jiafen Liu
Sept. 2013
Today’s Tasks
Hashing
• Direct access tables
• Choosing good hash functions
– Division Method
– Multiplication Method
• Resolving collisions by chaining
• Resolving collisions by open
addressing
Symbol-Table Problem
• Hashing comes up, for example, in compilers,
where it is known as the symbol-table problem.
• Suppose: Table S holding n records:
• Operations on S:
– INSERT(S, x)
– DELETE(S, x)
– SEARCH(S, k)
• Dynamic Set vs Static Set
The Simplest Case
• Suppose that the keys are drawn from the set
U⊆{0, 1, …, m–1}, and keys are distinct.
• Direct-access table: set up an array T[0 . . m–1]:
T[k] = x if x ∈ S and key[x] = k,
T[k] = NIL otherwise.
• In the worst case, the 3 operations take time of
– Θ(1)
• Limitations of direct-access table?
– The range of keys can be large: 64-bit numbers
– character strings (difficult to represent directly as array indices).
• Hashing: Try to keep the table small, while
preserving the property of linear running time.
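A direct-access table is nothing more than an array indexed by key. A minimal Python sketch (the class and method names are illustrative, not from the lecture):

```python
# Hedged sketch of a direct-access table for keys drawn from {0, ..., m-1}.
class DirectAccessTable:
    def __init__(self, m):
        self.slots = [None] * m        # T[0 .. m-1], all initially empty

    def insert(self, key, record):     # INSERT(S, x): Theta(1)
        self.slots[key] = record

    def search(self, key):             # SEARCH(S, k): Theta(1); None if absent
        return self.slots[key]

    def delete(self, key):             # DELETE(S, x): Theta(1)
        self.slots[key] = None
```

All three operations are single array accesses, which is the Θ(1) worst case claimed above; the cost is an array as large as the whole key universe.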
Naïve Hashing
• Solution: Use a hash function h to map
the keys of records in S into {0, 1, …, m–1}.
[Figure: the hash function h maps keys k1, …, k5 from the universe into slots of T[0 . . m–1]; k2 and k5 collide, since h(k2) = h(k5).]
Collisions
• When a record to be inserted maps to an
already occupied slot in T, a collision
occurs.
• The simplest way to resolve a collision?
– Link records that hash to the same slot into a list.
[Figure: records 49, 86, and 52 all hash to the same slot i, h(49) = h(86) = h(52) = i, and are linked together into a chain.]
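Chaining fits in a few lines of Python; this is a hedged sketch (the division-method hash and all names here are my own illustrative choices):

```python
# Hash table with collision resolution by chaining: one Python list per slot.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # m empty chains

    def _h(self, k):
        return k % self.m                     # division-method hash (illustrative)

    def insert(self, k):
        self.table[self._h(k)].append(k)      # add to the chain in slot h(k)

    def search(self, k):
        return k in self.table[self._h(k)]    # walk only the chain in slot h(k)
```

Colliding keys simply share a chain; a search touches only the one chain its key hashes to.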
Worst Case of Chaining
• What’s the worst case of chaining?
– Every key hashes to the same slot, and the table degenerates into a single linked list.
• Access Time in the worst case?
– Θ(n) if we assume the size of S is n.
Average Case of Chaining
• In order to analyze the average case
– we should know all possible inputs and their
probability.
– We don’t know exactly the distribution, so we
always make assumptions.
• Here, we make the assumption of simple
uniform hashing:
– Each key k in S is equally likely to be hashed to any slot in T, independently of other keys.
• Simple uniform hashing includes an
independence assumption.
Average Case of Chaining
• Let n be the number of keys in the table,
and let m be the number of slots.
• Under the simple-uniform-hashing assumption, what's the probability that two keys hash to the same slot?
– 1/m.
• Define: load factor of T to be α= n/m, that
means?
– The average number of keys per slot.
Search Cost
• The expected time for an unsuccessful
search for a record with a given key is?
Θ(1 + α)
– Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the list.
• If α= O(1), expected search time = Θ(1)
• How about a successful search?
– It has same asymptotic bound.
– Reserved for your homework.
Choosing a hash function
• The assumption of simple uniform hashing
is hard to guarantee, but several common
techniques tend to work well in practice.
– A good hash function should distribute the
keys uniformly into all the slots.
– Regularity of the key distribution should not
affect this uniformity.
• For example, all the keys are even numbers.
• The simplest way to distribute keys to m
slots evenly?
Division Method
• Assume all keys are integers, and define
h(k) = k mod m.
• Advantage: usually simple and practical.
• Caution:
– Be careful about choice of modulus m.
– It doesn't work well for every size m of table.
• Example of trouble: picking m with a small divisor d.
Deficiency of Division Method
• Deficiency: if we pick m with a small
divisor d.
– Example: d = 2, so that m is an even number.
– Suppose all the keys happen to be even.
– What happens to the hash table?
– We will never hash anything to an odd-numbered slot: half the table is wasted.
Deficiency of Division Method
• Extreme deficiency: if m = 2^r, then all of m's divisors are small.
• If k = (1011000111011010)2 and m = 2^6,
what does the hash value turn out to be?
– h(k) = k mod 2^6 = (011010)2, just the low-order 6 bits of k.
• The hash value doesn't depend on all the bits of k.
• Suppose all the low-order bits are the same and all the high-order bits differ: then every key hashes to the same slot.
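This is easy to verify directly: when m = 2^r, k mod m just masks off the low-order r bits, so the high-order bits of k never influence the slot. A quick check with the slide's key:

```python
# With m = 2^r, the division method keeps only the r low-order bits of k.
k = 0b1011000111011010      # the 16-bit key from the slide
m = 2 ** 6                  # m = 2^6

assert k % m == k & 0b111111      # mod 2^6 is just a 6-bit mask
assert k % m == 0b011010          # the low-order 6 bits of k

# Keys differing only in their high-order bits always collide:
assert (k % m) == ((k + 64) % m)
```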
How to choose modulus?
• Heuristics for choosing modulus m:
– Choose m to be a prime
– Make m not close to a power of two or ten.
• Division method is not a really good one:
– Sometimes making the table size a prime is inconvenient; we often want a table of size 2^r.
– Also, division takes more time to compute than multiplication or addition on most computers.
Another method—Multiplication
• Multiplication method is a little more
complicated but superior.
• Assume that all keys are integers, m = 2^r, and our computer has w-bit words.
• Define h(k) = (A·k mod 2^w) rsh (w–r):
– A is an odd integer in the range 2^(w–1) < A < 2^w.
– (Both the highest bit and the lowest bit are 1.)
– rsh is the "bitwise right-shift" operator.
• Multiplication modulo 2^w is fast compared to division, and the rsh operator is fast.
• Tip: don't pick A too close to 2^(w–1) or 2^w.
Example of multiplication method
• Suppose that m = 8 = 2^3, r = 3, and that our computer has w = 7-bit words.
• We chose A = 1011001 (binary).
• k = 1101011 (binary).
• A·k = 10010100110011 (binary).
• The high-order 7 bits (1001010) are ignored by mod 2^7; the low-order w–r = 4 bits (0011) are ignored by rsh; the middle r = 3 bits give h(k) = (011)2 = 3.
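The whole method is a multiply, a mask, and a shift. A Python sketch using the slide's toy constants (in practice w would be the machine word size, e.g. 64):

```python
# Multiplication-method hash: h(k) = (A*k mod 2^w) rsh (w - r).
w, r = 7, 3              # w-bit words; table size m = 2^r = 8
A = 0b1011001            # odd, with 2^(w-1) < A < 2^w

def h(k):
    # keep the low w bits of A*k (i.e. mod 2^w), then shift out the low w-r bits
    return ((A * k) & ((1 << w) - 1)) >> (w - r)
```

For the slide's key k = 1101011 (binary), h returns 3.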
Another way to resolve collisions
• We've talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.
• There's another way, open addressing, whose idea is to use no storage for links.
• We should systematically probe the table
until an empty slot is found.
Open Addressing
• The hash function depends on both the
key and probe number:
h : U × {0, 1, …, m–1} → {0, 1, …, m–1}
(universe of keys × probe number → slot number)
• The probe sequence ⟨h(k,0), h(k,1), …, h(k,m–1)⟩ should be a permutation of {0, 1, …, m–1}.
Implementation of Insertion
• What about HASH-SEARCH(T,k)?
Implementation of Searching
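The pseudocode on these two slides follows CLRS's HASH-INSERT and HASH-SEARCH; below is a hedged Python rendering. Linear probing stands in for the probe function h(k, i) only to make the sketch concrete (any valid probe sequence works):

```python
# Open addressing: probe slots h(k,0), h(k,1), ... until the key or an
# empty slot is found.
def probe(k, i, m):
    return (k % m + i) % m        # h(k, i); linear probing, for illustration

def hash_insert(T, k):
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is None:          # empty slot: store k here
            T[j] = k
            return j
    raise RuntimeError("hash table overflow")

def hash_search(T, k):
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is None:          # empty slot: k cannot be in the table
            return None
        if T[j] == k:
            return j
    return None
```

Note that hash_search stops at the first empty slot it probes; that behavior is exactly why deletion, discussed below, needs care.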
More about Open Addressing
• The hash table may fill up.
– We must have the number of elements less
than or equal to the table size.
• Deletion is difficult. Why?
– Suppose we remove a key from the table, leaving its slot empty.
– A later search may follow a probe sequence that passes through that slot on the way to its key.
– Finding the slot empty, the search stops early and wrongly concludes that the key it is looking for isn't in the table.
• So we should mark deleted slots instead of emptying them.
Example of open addressing
Some heuristics about probe
• We can record, globally, the largest number of probes any insertion has needed.
– A search never needs to look further than that.
• There are lots of ideas about forming a
probe sequence effectively.
• The simplest one is ?
– linear probing.
The simplest probing strategy
• Linear probing: given an ordinary hash function h(k), linear probing uses
h(k, i) = (h(k) + i) mod m
• Advantage: Simple
• Disadvantage?
– primary clustering
Primary Clustering
• It suffers from primary clustering: long runs of occupied slots build up, filling whole regions of the table.
– Anything that hashes into such a region has to probe through all of it.
– What's more, the longer a run gets, the more likely it is to grow further, increasing the average search time.
Another probing strategy
• Double hashing: given two ordinary hash
functions h1(k), h2(k), double hashing uses
h(k,i) = ( h1(k) +i⋅h2(k) ) mod m
• If h2(k) is relatively prime to m, double
hashing generally produces excellent
results.
– We always make m a power of 2 and design
h2(k) to produce only odd numbers.
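With m a power of 2 and h2(k) always odd, the step size is relatively prime to m, so each key's probe sequence visits every slot exactly once. A small sketch (these particular h1 and h2 are illustrative choices, not from the lecture):

```python
# Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m, with m = 2^r and h2 odd.
m = 8                                  # m = 2^3

def h1(k):
    return k % m

def h2(k):
    return 2 * (k % (m // 2)) + 1      # always odd, hence coprime to m

def probe_sequence(k):
    return [(h1(k) + i * h2(k)) % m for i in range(m)]

# Every key's probe sequence is a permutation of {0, ..., m-1}:
assert all(sorted(probe_sequence(k)) == list(range(m)) for k in range(100))
```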
Analysis of open addressing
• We make the assumption of uniform
hashing:
– Each key is equally likely to have any one of
the m! permutations as its probe sequence,
independent of other keys.
• Theorem. Given an open-addressed hash
table with load factor α= n/m< 1, the
expected number of probes in an
unsuccessful search is at most 1/(1–α) .
Proof of the theorem
Proof:
• At least one probe is always necessary.
• With probability n/m , the first probe hits an
occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1) ,the second probe
hits an occupied slot, and a third probe is
necessary.
• With probability (n–2)/(m–2) ,the third probe
hits an occupied slot, etc.
• How do we finish the proof?
• Observe that
(n–i)/(m–i) < n/m = α
for i = 1, 2, …, n, since n < m.
Proof of the theorem
• Therefore, the expected number of probes is at most
1 + (n/m)(1 + ((n–1)/(m–1))(1 + ((n–2)/(m–2))(⋯)))
≤ 1 + α(1 + α(1 + α(⋯)))
≤ 1 + α + α² + α³ + ⋯
= 1/(1–α). (geometric series)
Implications of the theorem
• If α is constant, then accessing an open-addressed hash table takes constant time.
• If the table is half full, then the expected
number of probes is ?
– 1/(1–0.5) = 2.
• If the table is 90% full, then the expected
number of probes is ?
– 1/(1–0.9) = 10.
• As the table approaches full utilization, hashing becomes slow.
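The two computations above are instances of the theorem's 1/(1–α) bound; as α approaches 1 the bound blows up, which is the precise sense in which a nearly full table is slow. A one-line check:

```python
# Expected number of probes in an unsuccessful search, per the theorem.
def expected_probes(alpha):
    assert 0 <= alpha < 1          # the bound requires alpha < 1
    return 1 / (1 - alpha)

assert expected_probes(0.5) == 2.0              # half full: 2 probes
assert abs(expected_probes(0.9) - 10.0) < 1e-9  # 90% full: 10 probes
```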
Still Hashing
• Universal hashing
• Perfect hashing
A weakness of hashing
• Problem: For any hash function h, there
exists a bad set of keys that all hash to
the same slot.
– It causes the average access time of a hash
table to skyrocket.
– An adversary can pick all keys from {k: h(k) =
i } for some slot i.
• IDEA: Choose the hash function at
random, independently of the keys.
Universal hashing
• Definition: Let U be the universe of keys, and let H be a finite collection of hash functions mapping U into {0, 1, …, m–1}.
• H is universal if for each pair of distinct keys x, y ∈ U,
|{h ∈ H : h(x) = h(y)}| = |H|/m,
i.e., if h is chosen at random from H, the chance that x and y collide is exactly 1/m.
Universality is good
• Theorem:
• Let h be a hash function chosen at
random from a universal set H of hash
functions.
• Suppose h is used to hash n arbitrary
keys into the m slots of a table T.
• Then for a given key x, we have:
E[number of collisions with x] < n/m.
Universality theorem
• Proof. Let Cx be the random variable denoting the total number of collisions of keys in T with x, and let
c_xy = 1 if h(x) = h(y), and c_xy = 0 otherwise.
• By universality, E[c_xy] = 1/m, and Cx = Σ_{y ∈ T–{x}} c_xy.
• Therefore, E[Cx] = Σ_{y ∈ T–{x}} E[c_xy] = (n–1)/m < n/m.
Constructing a universal set of hash functions
• One method to construct a set of universal
hash functions:
• Let m be prime. Decompose key k into
r+1 digits, each with value in the set {0, 1,
…, m–1}.
– That is, let k = <k0, k1, …, kr>, where 0≤ki<m.
• Randomized strategy:
– Pick a = ⟨a0, a1, …, ar⟩ where each ai is
chosen randomly from {0, 1, …, m–1}.
• Define h_a(k) = (a0·k0 + a1·k1 + ⋯ + ar·kr) mod m.
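The construction can be coded directly. A hedged sketch (the function names, digit order, and sample parameters are my own choices):

```python
import random

# Universal family: h_a(k) = (sum of a_i * k_i) mod m, with m prime.
m, r = 7, 2                  # m prime; keys decompose into r + 1 = 3 digits

def digits(k):
    # k = <k0, k1, ..., kr> in base m, with 0 <= ki < m
    return [(k // m**i) % m for i in range(r + 1)]

def h_a(a, k):
    return sum(ai * ki for ai, ki in zip(a, digits(k))) % m

a = [random.randrange(m) for _ in range(r + 1)]   # the random choice of h_a
assert 0 <= h_a(a, 123) < m
```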
One method of Construction
• How big is H = {ha}?
– |H| = m^(r+1).
• Theorem. The set H = {ha} is universal.
• Proof.
• Let x = ⟨x0, x1, …, xr⟩ and y = ⟨y0, y1, …, yr⟩ be distinct keys.
• Thus, they differ in at least one digit
position.
• Without loss of generality, position 0.
• For how many ha∈H do x and y collide?
One method of Construction
• ha(x) = ha(y), which implies that
a0·x0 + a1·x1 + ⋯ + ar·xr ≡ a0·y0 + a1·y1 + ⋯ + ar·yr (mod m).
• Equivalently, we have
Σ_{i=0}^{r} ai·(xi – yi) ≡ 0 (mod m), that is,
a0·(x0 – y0) ≡ –Σ_{i=1}^{r} ai·(xi – yi) (mod m).
Fact from number theory
• Theorem: Let m be prime. For any z ∈ Z_m with z ≠ 0, there exists a unique z^(–1) ∈ Z_m such that z · z^(–1) ≡ 1 (mod m).
• Example: m = 7, z = 3; then z^(–1) = 5, since 3 · 5 = 15 ≡ 1 (mod 7).
Back to the proof
• We just have
a0·(x0 – y0) ≡ –Σ_{i=1}^{r} ai·(xi – yi) (mod m),
and since x0 ≠ y0, an inverse (x0 – y0)^(–1) must exist, which implies that
a0 ≡ (–Σ_{i=1}^{r} ai·(xi – yi)) · (x0 – y0)^(–1) (mod m).
• Thus, for any choices of a1, a2, …, ar,
exactly one choice of a0 causes x and y to
collide.
Proof
• How many ha will cause x and y to collide?
– There are m choices for each of a1, a2, …, ar ,
but once these are chosen, exactly one choice
for a0 causes x and y to collide,
• Thus, the number of ha that cause x and y to collide is m^r · 1 = m^r = |H|/m.
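For tiny parameters the whole family can be enumerated and the count m^r = |H|/m checked by brute force (an illustrative verification, not part of the lecture):

```python
# Brute-force check of universality for m = 3 (prime), r = 1: keys are
# pairs <k0, k1> of digits mod m, and H contains all m^(r+1) = 9 functions.
m = 3
H = [(a0, a1) for a0 in range(m) for a1 in range(m)]

def h(a, key):
    return (a[0] * key[0] + a[1] * key[1]) % m

x, y = (0, 1), (2, 1)        # distinct keys, differing in digit 0
colliding = [a for a in H if h(a, x) == h(a, y)]

assert len(H) == m ** 2                 # |H| = m^(r+1)
assert len(colliding) == len(H) // m    # exactly |H|/m = m^r = 3 collide
```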
Perfect hashing
• Requirement: Given a set of n keys,
construct a static hash table of size m =
O(n) such that SEARCH takes Θ(1) time
in the worst case.
• IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!
Example of Perfect hashing
[Figure: a level-1 table of size m = O(n); each slot j holds its own level-2 table, hashed with its own universal hash function so that level 2 has no collisions.]
Collisions at level 2
• Theorem. Let H be a class of universal hash
functions for a table of size m = n2. If we use a
random h∈H to hash n keys into the table, the
expected number of collisions is at most 1/2.
• Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n². There are (n choose 2) pairs of keys that can possibly collide, so the expected number of collisions is
(n choose 2) · 1/n² = (n(n–1)/2) · (1/n²) < 1/2.
Another fact from number theory
• Markov's inequality says that for any nonnegative random variable X, we have
Pr{X ≥ t} ≤ E[X]/t.
• Theorem. The probability of no collisions
is at least 1/2.
• Proof. Applying this inequality with t = 1,
we find that the probability of 1 or more
collisions is at most 1/2.
• Conclusion: Just by testing random hash
functions in H, we’ll quickly find one that
works.