Hash Functions Andy Wang Data Structures, Algorithms, and Generic Programming

Introduction

Hash function
– Maps keys to integers (buckets)
 Hash(Key) = Integer
– Ideally in a random-like manner
 Evenly distributed bucket values
 Even if the input data is not evenly distributed
An Example

ID Number Generation
– Key = your name
– Hash(Key) = a number

Not a great hash function…
– Two people with the same name will have the
same number…
Simple Hash Functions

Assumptions:
– K: an unsigned 32-bit integer
– M: the number of buckets (the number of
entries in a hash table)

Goal:
– If a bit is changed in K, all bits are equally
likely to change for Hash(K)
A Simple Hash Function…

What if K < M (every key fits in its own bucket)?
 Hash(K) = K
 What is wrong?
 Your student ID = SSN
– I can’t use your SSN to post your grades…
Another Simple Function

If K > M
 Hash(K) = K % M
 What is wrong?
 Suppose M = 4, K = 2, 4, 6, 8
 K % M = 2, 0, 2, 0
Yet Another Simple Function

If K > P, P = prime number
 Hash(K) = K % P
 Suppose P = 3, K = 2, 4, 6, 8
 K % P = 2, 1, 0, 2
 More uniform distribution…but still
problematic for other cases
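The two modulo examples (M = 4 and P = 3) can be checked with a short C sketch. The names `mod_hash` and `count_used_buckets` are illustrative and do not come from the slides; the helper assumes at most 64 buckets to keep the demo small.

```c
#include <stddef.h>

/* Modulo hash: maps key k to one of m buckets (m must be nonzero). */
unsigned int mod_hash(unsigned int k, unsigned int m) {
    return k % m;
}

/* Hypothetical helper: counts how many distinct buckets the keys land in.
   Returns 0 if m is zero or larger than the 64-entry flag array. */
unsigned int count_used_buckets(const unsigned int *keys, size_t n,
                                unsigned int m) {
    unsigned char used[64] = {0};
    unsigned int count = 0;
    if (m == 0 || m > 64) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        unsigned int b = mod_hash(keys[i], m);
        if (!used[b]) {
            used[b] = 1;
            count++;
        }
    }
    return count;
}
```

With the keys 2, 4, 6, 8, the call with m = 4 reports only 2 of the 4 buckets in use, while m = 3 reports all 3 buckets in use.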
More on Prime Numbers

K > P1 > P2, P1 and P2 are prime numbers
 Hash(K) = (K % P1) % P2
 Suppose P1 = 5, P2 = 3, K = 2, 4, 6, 8, 10
 (K % 5) = 2, 4, 1, 3, 0
 (K % 5) % 3 = 2, 1, 1, 0, 0
 Still uniform distribution
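A minimal sketch of the two-prime scheme, so the values on this slide can be reproduced. The function name `two_prime_hash` is an assumption made for illustration.

```c
/* Two-stage modulo hash: reduce by prime p1, then by the smaller prime p2.
   Both p1 and p2 must be nonzero. */
unsigned int two_prime_hash(unsigned int k, unsigned int p1, unsigned int p2) {
    return (k % p1) % p2;
}
```

Calling it with p1 = 5 and p2 = 3 on the keys 2, 4, 6, 8, 10 yields 2, 1, 1, 0, 0.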
Polynomial Functions

If K > P, P = prime number
 Hash(K) = K(K + 3) % P
 Slightly better than pure modulo functions
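One way to code the polynomial hash is sketched below. The name `poly_hash` is illustrative. Because K(K + 3) can overflow even 64 bits when K is close to 2^32, this sketch reduces both factors modulo P before multiplying, which gives the same result as K(K + 3) % P.

```c
#include <stdint.h>

/* Polynomial hash K(K + 3) % P, computed without overflow.
   p must be nonzero; both factors are reduced mod p first, so their
   product always fits in 64 bits. */
uint32_t poly_hash(uint32_t k, uint32_t p) {
    uint64_t a = k % p;          /* K mod P */
    uint64_t b = (a + 3) % p;    /* (K + 3) mod P */
    return (uint32_t)((a * b) % p);
}
```

For example, with P = 7 the key 2 maps to (2 * 5) % 7 = 3 and the key 4 maps to (4 * 7) % 7 = 0.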
How About…

Hash(K) = rand()
 What is wrong?
 Not repeatable
How About…

K > P, P = prime number
 Hash(K) = rand(K) % P, where rand(K) is a random number generator seeded with K
 Better randomness
 Can be expensive to compute random
numbers
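A minimal sketch of a key-seeded hash, assuming that rand(K) means "seed the generator with K and take its first output". It uses the standard C `srand`/`rand` pair; the function name `seeded_rand_hash` is hypothetical.

```c
#include <stdlib.h>

/* Sketch of Hash(K) = rand(K) % P: seed the C library generator with the
   key, then take the first value it produces (p must be nonzero).
   Caveat: srand() resets the global generator state, so this disturbs
   any other code in the program that relies on rand(). */
unsigned int seeded_rand_hash(unsigned int k, unsigned int p) {
    srand(k);
    return (unsigned int)rand() % p;
}
```

The same key always gives the same bucket within one C library, which fixes the repeatability problem of a bare rand(), but every call pays for reseeding the generator.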
Pre-generated Randomness

Two prime numbers: P1 and P2
 K > P1 and K > P2
 A table R[P1], with R[i] pre-initialized to
rand(i) % P2
 Hash(K) = R[K % P1]
 Slight Problem: Possible duplicate mapping
To Avoid Duplicate Mapping…

Two prime numbers: P1 and P2
 K > P1 and K > P2
 A table R[P1], with R[i] pre-initialized to
unique random numbers
 Hash(K) = R[K % P1]
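The slide does not say how the unique random numbers are produced. One possible approach, sketched here as an assumption, is a partial Fisher-Yates shuffle over the values 0 to P2 - 1, keeping the first P1 of them. This requires P1 <= P2. The name `fill_unique_table` and the `seed` parameter are illustrative.

```c
#include <stdlib.h>

/* Fills r[0..p1-1] with distinct values drawn from [0, p2).
   Returns 0 on success, -1 on bad arguments or allocation failure. */
int fill_unique_table(unsigned int *r, unsigned int p1, unsigned int p2,
                      unsigned int seed) {
    if (r == NULL || p1 == 0 || p1 > p2) {
        return -1;
    }
    unsigned int *pool = malloc(p2 * sizeof *pool);
    if (pool == NULL) {
        return -1;
    }
    for (unsigned int i = 0; i < p2; i++) {
        pool[i] = i;                       /* candidate values 0..p2-1 */
    }
    srand(seed);
    for (unsigned int i = 0; i < p1; i++) {
        /* pick a not-yet-used candidate from positions i..p2-1 */
        unsigned int j = i + (unsigned int)rand() % (p2 - i);
        unsigned int tmp = pool[i];
        pool[i] = pool[j];
        pool[j] = tmp;
        r[i] = pool[i];
    }
    free(pool);
    return 0;
}
```

Because each value is swapped out of the candidate range once chosen, no two table entries can be equal, so distinct values of K % P1 never share a bucket.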
An Example
K = 0…2^32 - 1, P1 = 3, P2 = 5
 R[3] = {0, 4, 1}
 Hash(K) = R[K % 3]
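The example can be written directly in C. The table contents come from the slide; the function name `table_hash` is illustrative.

```c
/* Pre-generated table from the example: P1 = 3 entries, each a unique
   value below P2 = 5. */
static const unsigned int R[3] = {0, 4, 1};

/* Hash(K) = R[K % 3] */
unsigned int table_hash(unsigned int k) {
    return R[k % 3];
}
```

Keys 0, 1, 2 map to 0, 4, 1, and the pattern then repeats, so key 7 (7 % 3 = 1) also maps to 4.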

Hashing a Sequence of Keys
K = {K1, K2, …, Kn}
 E.g., Hash(“test”) = 98157
 Design Principles

– Use the entire key
– Use the ordering information
– Use pre-generated randomness
Use the Entire Key
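unsigned int Hash(const char *Key) {
    unsigned int hash = 0;
    /* XOR every character of the key into the hash,
       stopping at the terminating '\0' */
    for (unsigned int j = 0; Key[j] != '\0'; j++) {
        hash = hash ^ (unsigned char)Key[j];
    }
    return hash;
}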

Problem: Hash(“ab”) == Hash(“ba”)
Use the Ordering Information
unsigned int Hash(const char *Key) {
    unsigned int hash = 0;
    /* Stop at the terminating '\0' of the key */
    for (unsigned int j = 0; Key[j] != '\0'; j++) {
        hash = hash ^ (unsigned char)Key[j];
        hash = /* hash with some shiftings */
    }
    return hash;
}

Problem: H(short keys) will not perturb all 32 bits (clustering)
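To make the ordering idea concrete, here is a minimal sketch that fills in the shifting step with a plain left shift by one bit. The shift amount and the name `shift_hash` are assumptions; the slide leaves the shifting unspecified.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR each character into the hash, then shift left by one bit so that
   the position of a character affects the result. */
uint32_t shift_hash(const char *key) {
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        hash ^= (unsigned char)key[j];
        hash <<= 1;   /* assumed shift amount; not given on the slide */
    }
    return hash;
}
```

With this choice, "ab" hashes to 320 and "ba" to 330, so the two orderings no longer collide. The one-character key "a" hashes to 194, which shows the clustering problem: short keys only ever touch the low-order bits.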
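Use Pre-generated Randomness
/* R: table of pre-generated random numbers, one per character value */
unsigned int Hash(const char *Key) {
    unsigned int hash = 0;
    /* Stop at the terminating '\0' of the key */
    for (unsigned int j = 0; Key[j] != '\0'; j++) {
        hash = hash ^ R[(unsigned char)Key[j]];
        hash = /* hash with some shiftings */
    }
    return hash;
}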
CRC Variant

Do 5-bit circular shift of hash
 XOR hash and K[j]
…
for (…) {
    highorder = hash & 0xf8000000;
    hash = hash << 5;
    hash = hash ^ (highorder >> 27);
    hash = hash ^ K[j];
}
…
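The CRC variant loop can be fleshed out into a complete function as sketched below. The function name `crc_variant_hash`, the use of a NUL-terminated string as the key sequence, and the `uint32_t` type are assumptions; the four loop statements follow the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-variant hash: rotate the hash left by 5 bits, then XOR in the
   next key character. */
uint32_t crc_variant_hash(const char *key) {
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0xf8000000u; /* save the top 5 bits */
        hash <<= 5;                              /* shift them out */
        hash ^= highorder >> 27;                 /* wrap them to the bottom */
        hash ^= (unsigned char)key[j];           /* mix in the character */
    }
    return hash;
}
```

For example, "ab" hashes to 3138 and "ba" to 3105, so character order matters. The one-character key "a" hashes to 97, leaving the upper bits untouched.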
CRC Variant
+ For long keys, all 32 bits are exercised
+ More randomness toward lower bits
- Not all bits are changed for short keys
BUZ Hash

Set up an array R to store precomputed
random numbers
…
for (…) {
    highorder = hash & 0x80000000;
    hash = hash << 1;
    hash = hash ^ (highorder >> 31);
    hash = hash ^ R[K[j]];
}
…
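A self-contained sketch of the BUZ hash follows. The slide only says that R holds precomputed random numbers, so the xorshift32 generator used to fill the table, the 256-entry table size (one entry per byte value), and the names `buz_table`, `buz_init`, and `buz_hash` are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Table of precomputed random numbers, indexed by byte value. */
static uint32_t buz_table[256];

/* Fills the table with pseudo-random values from a xorshift32 generator
   (an assumed choice; any good source of random 32-bit values works). */
void buz_init(uint32_t seed) {
    uint32_t x = seed ? seed : 1u;   /* xorshift32 state must be nonzero */
    for (int i = 0; i < 256; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        buz_table[i] = x;
    }
}

/* BUZ hash: rotate the hash left by 1 bit, then XOR in the random
   table entry selected by the next key character. */
uint32_t buz_hash(const char *key) {
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0x80000000u; /* save the top bit */
        hash <<= 1;                              /* shift it out */
        hash ^= highorder >> 31;                 /* wrap it to the bottom */
        hash ^= buz_table[(unsigned char)key[j]];
    }
    return hash;
}
```

Call `buz_init` once before hashing. Since each character is replaced by a full 32-bit random value, even a one-character key can set bits across the whole word, unlike the CRC variant.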
References

Aho, Sethi, and Ullman. Compilers: Principles, Techniques, and Tools, 1986.
Cormen, Leiserson, and Rivest. Introduction to Algorithms, 1990.
Knuth. The Art of Computer Programming, 1973.
Kuenning. Hash Functions, 2003.