Hash Functions
Andy Wang
Data Structures, Algorithms, and Generic Programming

Introduction
Hash function
– Maps keys to integers (buckets)
  Hash(Key) = Integer
– Ideally in a random-like manner
  Evenly distributed bucket values
  Even if the input data is not evenly distributed

An Example
ID Number Generation
– Key = your name
– Hash(Key) = a number
Not a great hash function…
– Two people with the same name will have the same number…

Simple Hash Functions
Assumptions:
– K: an unsigned 32-bit integer
– M: the number of buckets (the number of entries in a hash table)
Goal:
– If one bit of K changes, every bit of Hash(K) is equally likely to change

A Simple Hash Function…
What if K = M?
Hash(K) = K
What is wrong?
– Your student ID = SSN
– I can’t use your SSN to post your grades…

Another Simple Function
If K > M
Hash(K) = K % M
What is wrong?
– Suppose M = 4, K = 2, 4, 6, 8
– K % M = 2, 0, 2, 0
– Only two of the four buckets are ever used

Yet Another Simple Function
If K > P, P = prime number
Hash(K) = K % P
– Suppose P = 3, K = 2, 4, 6, 8
– K % P = 2, 1, 0, 2
More uniform distribution…but still problematic for other cases

More on Prime Numbers
K > P1 > P2, P1 and P2 are prime numbers
Hash(K) = (K % P1) % P2
– Suppose P1 = 5, P2 = 3, K = 2, 4, 6, 8, 10
– (K % 5) = 2, 4, 1, 3, 0
– (K % 5) % 3 = 2, 1, 1, 0, 0
Still a fairly uniform distribution

Polynomial Functions
If K > P, P = prime number
Hash(K) = K(K + 3) % P
Slightly better than pure modulo functions

How About…
Hash(K) = rand()
What is wrong?
– Not repeatable

How About…
K > P, P = prime number
Hash(K) = rand(K) % P
– Better randomness
– Can be expensive to compute random numbers

Pre-generated Randomness
Two prime numbers: P1 and P2
K > P1 and K > P2
A table R[P1], with R[i] pre-initialized to rand(i) % P2
Hash(K) = R[K % P1]
Slight problem: possible duplicate mapping

To Avoid Duplicate Mapping…
Two prime numbers: P1 and P2
K > P1 and K > P2
A table R[P1], with R[i] pre-initialized to unique random numbers
Hash(K) = R[K % P1]

An Example
K = 0…2^32 - 1, P1 = 3, P2 = 5
R[3] = {0, 4, 1}
Hash(K) = R[K % 3]

Hashing a Sequence of Keys
K = {K1, K2, …, Kn}
E.g., Hash(“test”) = 98157
Design Principles
– Use the entire key
– Use the ordering information
– Use pre-generated randomness

Use the Entire Key
    unsigned int Hash(const char *Key) {
        unsigned int hash = 0;
        /* XOR every byte of the NUL-terminated key into the hash */
        for (unsigned int j = 0; Key[j] != '\0'; j++) {
            hash = hash ^ (unsigned char)Key[j];
        }
        return hash;
    }
Problem: Hash(“ab”) == Hash(“ba”)

Use the Ordering Information
    unsigned int Hash(const char *Key) {
        unsigned int hash = 0;
        for (unsigned int j = 0; Key[j] != '\0'; j++) {
            hash = hash ^ (unsigned char)Key[j];
            /* hash = hash with some shiftings */
        }
        return hash;
    }
Problem: Hash(short keys) will not perturb all 32 bits (clustering)

Use Pre-generated Randomness
    unsigned int Hash(const char *Key) {
        unsigned int hash = 0;
        for (unsigned int j = 0; Key[j] != '\0'; j++) {
            /* R: table of precomputed random numbers, indexed by byte value */
            hash = hash ^ R[(unsigned char)Key[j]];
            /* hash = hash with some shiftings */
        }
        return hash;
    }

CRC Variant
Do 5-bit circular shift of hash
XOR hash and K[j]
    …
    for (…) {
        highorder = hash & 0xf8000000;
        hash = hash << 5;
        hash = hash ^ (highorder >> 27);
        hash = hash ^ K[j];
    }
    …

CRC Variant
+ For long keys, all 32 bits are exercised
+ More randomness toward lower bits
- Not all bits are changed for short keys

BUZ Hash
Set up an array R to store precomputed random numbers
    …
    for (…) {
        highorder = hash & 0x80000000;
        hash = hash << 1;
        hash = hash ^ (highorder >> 31);
        hash = hash ^ R[K[j]];
    }
    …

References
Aho, Sethi, and Ullman. Compilers: Principles, Techniques, and Tools, 1986.
Cormen, Leiserson, Rivest. Introduction to Algorithms, 1990.
Knuth. The Art of Computer Programming, 1973.
Kuenning. Hash Functions, 2003.
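
A C Sketch of the String Hashes
The three string hashes from the slides (XOR of all bytes, the CRC variant, and BUZ hash) can be put side by side as a minimal sketch. Everything not shown in the slides is an assumption made for illustration: the function names xor_hash, crc_variant_hash, and buz_hash, the helper init_random_table, the use of uint32_t for the 32-bit hash, the treatment of keys as NUL-terminated strings, and the way the table R is filled (a fixed-seed xorshift32 generator standing in for whatever precomputed random numbers a real table would hold).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of pre-generated randomness: one 32-bit value per
 * possible byte value. Filled once, with a fixed seed, so that hashes
 * are repeatable across runs. */
static uint32_t R[256];
static int R_ready = 0;

static void init_random_table(void)
{
    uint32_t x = 2463534242u;       /* arbitrary nonzero seed (assumption) */
    for (int i = 0; i < 256; i++) {
        x ^= x << 13;               /* one xorshift32 step */
        x ^= x >> 17;
        x ^= x << 5;
        R[i] = x;
    }
    R_ready = 1;
}

/* Uses the entire key, but ignores the order of the bytes. */
uint32_t xor_hash(const char *key)
{
    assert(key != NULL);
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++)
        hash ^= (unsigned char)key[j];
    return hash;
}

/* CRC variant: 5-bit circular left shift, then XOR in the next byte. */
uint32_t crc_variant_hash(const char *key)
{
    assert(key != NULL);
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0xf8000000u;  /* top 5 bits */
        hash = hash << 5;
        hash = hash ^ (highorder >> 27);          /* wrap them to the bottom */
        hash = hash ^ (unsigned char)key[j];
    }
    return hash;
}

/* BUZ hash: 1-bit circular left shift, then XOR in R[next byte]. */
uint32_t buz_hash(const char *key)
{
    assert(key != NULL);
    if (!R_ready)
        init_random_table();
    uint32_t hash = 0;
    for (size_t j = 0; key[j] != '\0'; j++) {
        uint32_t highorder = hash & 0x80000000u;  /* top bit */
        hash = hash << 1;
        hash = hash ^ (highorder >> 31);          /* wrap it to the bottom */
        hash = hash ^ R[(unsigned char)key[j]];
    }
    return hash;
}
```

With these definitions, xor_hash("ab") and xor_hash("ba") both return 3, which is the ordering problem noted in the slides, while crc_variant_hash("ab") returns 0xC42 and crc_variant_hash("ba") returns 0xC21. Both values stay in the low 12 bits, which illustrates the remark that short keys do not change all bits; buz_hash spreads even a one-byte key across the whole word because it XORs in a full 32-bit table entry.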