Introduction to Algorithms
Jiafen Liu
Sept. 2013

Today’s Tasks
Hashing
• Direct access tables
• Choosing good hash functions
  – Division Method
  – Multiplication Method
• Resolving collisions by chaining
• Resolving collisions by open addressing

Symbol-Table Problem
• Hashing comes up in compilers, where it is known as the Symbol-Table Problem.
• Suppose: a table S holds n records, and each record x has a key, key[x].
• Operations on S:
  – INSERT(S, x)
  – DELETE(S, x)
  – SEARCH(S, k)
• Dynamic Set vs Static Set

The Simplest Case
• Suppose that the keys are drawn from the set U ⊆ {0, 1, …, m–1}, and keys are distinct.
• Direct-access table: set up an array T[0 . . m–1] with
  T[k] = x if x ∈ S and key[x] = k,
  T[k] = NIL otherwise.
• In the worst case, each of the 3 operations takes Θ(1) time.
• Limitations of direct-access tables?
  – The range of keys can be large: 64-bit numbers.
  – Character strings (difficult to represent as array indices).
• Hashing: try to keep the table small while preserving the property of constant running time.

Naïve Hashing
• Solution: use a hash function h to map the keys of records in S into {0, 1, …, m–1}.
[Figure: keys k1, …, k5 mapped by h into slots 0 . . m–1 of table T; h(k2) = h(k5).]

Collisions
• When a record to be inserted maps to an already occupied slot in T, a collision occurs.
• The simplest way to resolve collisions?
  – Link records in the same slot into a list.
  – Example: h(49) = h(86) = h(52) = i, so slot i holds the list 49 → 86 → 52.

Worst Case of Chaining
• What’s the worst case of chaining?
  – Every key hashes to the same slot, and the table degenerates into a single linked list.
• Access time in the worst case?
  – Θ(n), if we assume the size of S is n.

Average Case of Chaining
• In order to analyze the average case
  – we should know all possible inputs and their probabilities.
  – We don’t know the exact distribution, so we make assumptions.
• Here, we make the assumption of simple uniform hashing:
  – Each key k in S is equally likely to be hashed to any slot in T, independent of where other keys are hashed.
• Simple uniform hashing includes an independence assumption.
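The chaining scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name ChainedHashTable, its method names, and the default hash function k mod m are hypothetical choices, not part of the slides. The demo forces the worst case by using a hash function that sends every key to slot 0.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining (one list per slot)."""

    def __init__(self, m, h=None):
        self.m = m
        # Assumed default for illustration: h(k) = k mod m, for integer keys.
        self.h = h if h is not None else (lambda k: k % m)
        self.slots = [[] for _ in range(m)]

    def insert(self, key, value):
        chain = self.slots[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # keys are distinct: overwrite
                return
        chain.append((key, value))

    def search(self, key):
        # Only the chain in slot h(key) is scanned.
        for k, v in self.slots[self.h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self.h(key)
        self.slots[i] = [(k, v) for k, v in self.slots[i] if k != key]

    def chain_lengths(self):
        return [len(chain) for chain in self.slots]


# Worst case: every key hashes to the same slot, so the table is one list.
worst = ChainedHashTable(4, h=lambda k: 0)
for key, value in ((49, "a"), (86, "b"), (52, "c")):
    worst.insert(key, value)
print(worst.chain_lengths())
```

With the constant hash function, a search has to walk a single chain holding all n records, which is the Θ(n) worst case from the slide.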
Average Case of Chaining
• Let n be the number of keys in the table, and let m be the number of slots.
• Under the simple uniform hashing assumption, what’s the probability that two given keys hash to the same slot?
  – 1/m.
• Define the load factor of T to be α = n/m. What does that mean?
  – The average number of keys per slot.

Search Cost
• The expected time for an unsuccessful search for a record with a given key is Θ(1 + α):
  – 1: apply the hash function and access the slot
  – α: search the list
• If α = O(1), expected search time = Θ(1).
• How about a successful search?
  – It has the same asymptotic bound.
  – Reserved for your homework.

Choosing a hash function
• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.
  – A good hash function should distribute the keys uniformly into all the slots.
  – Regularity of the key distribution should not affect this uniformity.
    • For example, all the keys are even numbers.
• The simplest way to distribute keys to m slots evenly?

Division Method
• Assume all keys are integers, and define h(k) = k mod m.
• Advantage: simple, and usually practical.
• Caution:
  – Be careful about the choice of modulus m.
  – It doesn't work well for every table size m.
• Example: if we pick m with a small divisor d.

Deficiency of Division Method
• Deficiency: if we pick m with a small divisor d.
  – Example: d = 2, so that m is an even number.
  – Suppose it happens that all keys are even.
  – What happens to the hash table?
  – We will never hash anything to an odd-numbered slot.

Deficiency of Division Method
• Extreme deficiency: if $m = 2^r$, that is, all of m's prime factors are 2, then h(k) is just the r low-order bits of k.
• If $k = (1011000111011010)_2$ and $m = 2^6$, what does the hash value turn out to be?
  – $h(k) = (011010)_2 = 26$.
• The hash value doesn’t even depend on all the bits of k.
• Suppose all the low-order bits are the same and all the high-order bits differ: every key then hashes to the same slot.

How to choose modulus?
• Heuristics for choosing modulus m:
  – Choose m to be a prime.
  – Make m not close to a power of two or ten.
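The deficiency of the division method can be checked with a small experiment. The sketch below is an illustration only: the helper names h_div and slot_usage and the particular key set (100 even numbers) are assumptions made for the demo. It hashes the same even keys with an even modulus and with a prime modulus, and it evaluates the $m = 2^6$ example from the slide.

```python
from collections import Counter


def h_div(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


def slot_usage(keys, m):
    """Number of keys landing in each slot 0..m-1."""
    counts = Counter(h_div(k, m) for k in keys)
    return [counts.get(i, 0) for i in range(m)]


even_keys = range(0, 200, 2)              # 100 keys, all even (assumed input)
usage_even_m = slot_usage(even_keys, 8)   # m has the small divisor 2
usage_prime_m = slot_usage(even_keys, 7)  # m is prime

# With m = 2^6 the hash value is just the 6 low-order bits of k.
k = 0b1011000111011010
low6 = h_div(k, 2 ** 6)

print(usage_even_m)
print(usage_prime_m)
print(bin(low6))
```

With m = 8 the odd-numbered slots stay empty, while the prime modulus 7 spreads the same keys over every slot.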
• The division method is not always a good choice:
  – Sometimes making the table size a prime is inconvenient. We often want a table of size $2^r$.
  – Division also takes more time to compute than multiplication or addition on most computers.

Another method: Multiplication
• The multiplication method is a little more complicated but superior.
• Assume that all keys are integers, $m = 2^r$, and our computer has w-bit words.
• Define $h(k) = (A \cdot k \bmod 2^w) \ \mathrm{rsh}\ (w - r)$:
  – A is an odd integer in the range $2^{w-1} < A < 2^w$.
  – (Both the highest bit and the lowest bit of A are 1.)
  – rsh is the “bitwise right-shift” operator.
• Multiplication modulo $2^w$ is fast compared to division, and the rsh operator is fast.
• Tip: don’t pick A too close to $2^{w-1}$ or $2^w$.

Example of multiplication method
• Suppose that $m = 8 = 2^3$ (so r = 3), and that our computer has w = 7-bit words.
• We choose $A = (1011001)_2$ and $k = (1101011)_2$.
• $A \cdot k = (10010100110011)_2$:
  – the high 7 bits, 1001010, are ignored by mod $2^w$;
  – the next r = 3 bits, 011, are h(k);
  – the low w – r = 4 bits, 0011, are ignored by rsh.
• So $h(k) = (011)_2 = 3$.

Another way to resolve collisions
• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.
• There's another way, open addressing, whose idea is: no storage for links.
• We systematically probe the table until an empty slot is found.

Open Addressing
• The hash function depends on both the key and the probe number:
  $h : U \times \{0, 1, \ldots, m-1\} \to \{0, 1, \ldots, m-1\}$
  (universe of keys × probe number → slot number)
• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.

Implementation of Insertion
[Figure: insertion pseudocode.]
• What about HASH-SEARCH(T, k)?

Implementation of Searching
[Figure: pseudocode for HASH-SEARCH(T, k).]

More about Open Addressing
• The hash table may fill up.
  – The number of elements must be less than or equal to the table size.
• Deletion is difficult. Why?
  – Suppose we remove a key from the table, and later someone searches for a different key.
  – The probe sequence for that search happens to hit the slot of the key we deleted.
  – The search finds an empty slot and wrongly concludes that the key is not in the table.
• We should mark deleted slots instead of emptying them.
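The deletion problem and the "mark deleted slots" fix can be sketched as follows. This is a minimal illustration, not the pseudocode from the slides: the class name OpenAddressTable, the EMPTY and DELETED sentinels, and the probe sequence (key + i) mod m are all assumptions chosen to keep the example short. Keys are assumed to be distinct non-negative integers.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # slot held a key that was later removed


class OpenAddressTable:
    """Open addressing with a DELETED marker (assumed integer keys)."""

    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key, i):
        # Assumed probe sequence: visits every slot once for i = 0..m-1.
        return (key + i) % self.m

    def insert(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = key  # DELETED slots may be reused
                return j
        raise OverflowError("hash table is full")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY:
                return None          # a never-used slot ends the search
            if self.slots[j] is not DELETED and self.slots[j] == key:
                return j
            # DELETED or a different key: keep probing
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED  # mark, do not empty


t = OpenAddressTable(8)
positions = [t.insert(k) for k in (3, 11, 19)]  # all start probing at slot 3
t.delete(11)
found = t.search(19)    # must probe past the DELETED marker in slot 4
missing = t.search(11)
reused = t.insert(27)   # reuses the DELETED slot
print(positions, found, missing, reused)
```

If delete reset the slot to EMPTY instead, the search for 19 would stop at slot 4 and report the key as absent, which is exactly the failure described above.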
Example of open addressing
[Figures: worked example spanning four slides.]

Some heuristics about probing
• We can globally record the largest number of probes that any insertion has needed.
  – A search never needs to probe more than that number of slots.
• There are lots of ideas about forming a probe sequence effectively.
• The simplest one is?
  – Linear probing.

The simplest probing strategy
• Linear probing: given an ordinary hash function h(k), linear probing uses
  h(k, i) = (h(k, 0) + i) mod m, where h(k, 0) = h(k).
• Advantage: simple.
• Disadvantage?
  – Primary clustering.

Primary Clustering
• Linear probing suffers from primary clustering, where regions of the hash table fill up.
  – Anything that hashes into such a region has to probe through all of its occupied slots.
  – What’s more, long runs of occupied slots build up and tend to get longer, increasing the average search time.

Another probing strategy
• Double hashing: given two ordinary hash functions $h_1(k)$ and $h_2(k)$, double hashing uses
  $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$
• If $h_2(k)$ is relatively prime to m, double hashing generally produces excellent results.
  – A common way to ensure this is to make m a power of 2 and design $h_2(k)$ to produce only odd numbers.

Analysis of open addressing
• We make the assumption of uniform hashing:
  – Each key is equally likely to have any one of the m! permutations as its probe sequence, independent of other keys.
• Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1–α).

Proof of the theorem
Proof:
• At least one probe is always necessary.
• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.
• With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.
• How do we finish the proof?
• Observe that for i = 1, 2, …, n:
  $$\frac{n-i}{m-i} < \frac{n}{m} = \alpha$$
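Proof of the theorem
• Therefore, the expected number of probes is
  $$1 + \frac{n}{m}\left(1 + \frac{n-1}{m-1}\left(1 + \frac{n-2}{m-2}\left(\cdots\left(1 + \frac{1}{m-n+1}\right)\cdots\right)\right)\right)$$
  $$\le 1 + \alpha\left(1 + \alpha\left(1 + \alpha\left(\cdots(1 + \alpha)\cdots\right)\right)\right)$$
  $$\le 1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1-\alpha}$$
  (geometric series)

Implications of the theorem
• If α is constant, then accessing an open-addressed hash table takes constant time.
• If the table is half full, then the expected number of probes is at most?
  – 1/(1–0.5) = 2.
• If the table is 90% full, then the expected number of probes is at most?
  – 1/(1–0.9) = 10.
• Filling the table close to capacity makes hashing slow.

Still Hashing
• Universal hashing
• Perfect hashing

A weakness of hashing
• Problem: for any fixed hash function h, there exists a bad set of keys that all hash to the same slot.
  – It causes the average access time of a hash table to skyrocket.
  – An adversary can pick all keys from {k : h(k) = i} for some slot i.
• IDEA: choose the hash function at random, independently of the keys.

Universal hashing
• Definition. Let U be a universe of keys, and let H be a finite collection of hash functions, each mapping U to {0, 1, …, m–1}. H is universal if, for all x, y ∈ U with x ≠ y,
  |{h ∈ H : h(x) = h(y)}| = |H|/m.
• That is, if h is chosen at random from H, the probability that x and y collide is 1/m.

Universality is good
• Theorem. Let h be a hash function chosen at random from a universal set H of hash functions. Suppose h is used to hash n arbitrary keys into the m slots of a table T. Then for a given key x, we have
  E[number of collisions with x] < n/m.

Universality theorem
• Proof. Let $C_x$ be the random variable denoting the total number of collisions of keys in T with x, and let
  $$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y), \\ 0 & \text{otherwise.} \end{cases}$$
• Note that $E[c_{xy}] = 1/m$ and $C_x = \sum_{y \in T - \{x\}} c_{xy}$.

Universality theorem
• By linearity of expectation, and because $E[c_{xy}] = 1/m$:
  $$E[C_x] = E\Big[\sum_{y \in T - \{x\}} c_{xy}\Big] = \sum_{y \in T - \{x\}} E[c_{xy}] = \sum_{y \in T - \{x\}} \frac{1}{m} = \frac{n-1}{m} < \frac{n}{m}$$

Constructing a set of universal hash functions
• One method to construct a set of universal hash functions:
• Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}.
  – That is, let $k = \langle k_0, k_1, \ldots, k_r \rangle$, where $0 \le k_i < m$.
• Randomized strategy:
  – Pick $a = \langle a_0, a_1, \ldots, a_r \rangle$, where each $a_i$ is chosen randomly from {0, 1, …, m–1}.
• Define
  $$h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m$$

One method of Construction
• How big is $H = \{h_a\}$?
  – $|H| = m^{r+1}$.
• Theorem. The set $H = \{h_a\}$ is universal.
• Proof.
• Let $x = \langle x_0, x_1, \ldots, x_r \rangle$ and $y = \langle y_0, y_1, \ldots, y_r \rangle$ be distinct keys.
• Thus, they differ in at least one digit position.
• Without loss of generality, position 0.
• For how many $h_a \in H$ do x and y collide?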
One method of Construction
• $h_a(x) = h_a(y)$, which implies that
  $$\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}$$
• Equivalently, we have
  $$\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$$
  or
  $$a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$$

Fact from number theory
• Theorem. Let m be prime. For any $z \in \mathbb{Z}_m$ with $z \ne 0$, there exists a unique $z^{-1} \in \mathbb{Z}_m$ such that $z \cdot z^{-1} \equiv 1 \pmod{m}$.

Back to the proof
• We just have
  $$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}$$
  and since $x_0 \ne y_0$, an inverse $(x_0 - y_0)^{-1}$ must exist, which implies that
  $$a_0 \equiv \Big(-\sum_{i=1}^{r} a_i (x_i - y_i)\Big) \cdot (x_0 - y_0)^{-1} \pmod{m}$$
• Thus, for any choices of $a_1, a_2, \ldots, a_r$, exactly one choice of $a_0$ causes x and y to collide.

Proof
• How many $h_a$ will cause x and y to collide?
  – There are m choices for each of $a_1, a_2, \ldots, a_r$, but once these are chosen, exactly one choice for $a_0$ causes x and y to collide.
• Thus, the number of functions $h_a$ that cause x and y to collide is $m^r \cdot 1 = m^r = |H|/m$.

Perfect hashing
• Requirement: given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.
• IDEA: two-level scheme with universal hashing at both levels. No collisions at level 2!

Example of Perfect hashing
[Figure: two-level hash table example.]

Collisions at level 2
• Theorem. Let H be a class of universal hash functions for a table of size $m = n^2$. If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2.
• Proof. By the definition of universality, the probability that two given keys collide under h is $1/m = 1/n^2$. Since there are $\binom{n}{2}$ pairs of keys that can possibly collide, the expected number of collisions is
  $$\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}$$

A fact from probability theory
• Markov’s inequality says that for any nonnegative random variable X and any t > 0, we have Pr{X ≥ t} ≤ E[X]/t.
• Theorem. The probability of no collisions is at least 1/2.
• Proof. Applying this inequality with t = 1 to the number of collisions, we find that the probability of 1 or more collisions is at most 1/2.
• Conclusion: just by testing random hash functions in H, we’ll quickly find one that works.
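The dot-product construction and the "test random hash functions until one works" conclusion can be tried out with a short Python sketch. Everything named here (is_prime, digits, h_a, count_colliding, find_collision_free, the sample keys) is a hypothetical helper for illustration, and keys are assumed to be distinct non-negative integers. One further assumption deserves emphasis: the construction needs a prime m, while the theorem uses a table of size $n^2$, so the sketch takes m to be the smallest prime that is at least $n^2$. The collision probability is then $1/m \le 1/n^2$, so the bound of 1/2 on the expected number of collisions still applies.

```python
import random
from itertools import product
from math import isqrt


def is_prime(p):
    return p >= 2 and all(p % d for d in range(2, isqrt(p) + 1))


def digits(k, m, r):
    """Decompose integer k into r+1 base-m digits <k0, k1, ..., kr>."""
    out = []
    for _ in range(r + 1):
        out.append(k % m)
        k //= m
    return out


def h_a(a, x, m):
    """Dot-product hash: sum of a_i * x_i, taken mod m."""
    return sum(ai * xi for ai, xi in zip(a, x)) % m


def count_colliding(x, y, m):
    """Exhaustively count the vectors a for which x and y collide."""
    return sum(1 for a in product(range(m), repeat=len(x))
               if h_a(a, x, m) == h_a(a, y, m))


def find_collision_free(keys, rng):
    """Try random h_a until the n keys land in distinct slots."""
    n = len(keys)
    m = n * n
    while not is_prime(m):          # assumed: smallest prime >= n^2
        m += 1
    r = 0
    while m ** (r + 1) <= max(keys):  # enough digits to represent every key
        r += 1
    while True:
        a = [rng.randrange(m) for _ in range(r + 1)]
        slots = [h_a(a, digits(k, m, r), m) for k in keys]
        if len(set(slots)) == n:
            return a, m, slots


# Universality check for m = 5, r = 1: exactly m^r = 5 of the 25 functions collide.
collisions = count_colliding((1, 2), (3, 4), 5)

keys = [1013, 2027, 3001, 4099, 5003]  # sample keys, chosen arbitrarily
a, m, slots = find_collision_free(keys, random.Random(0))
print(collisions, m, sorted(slots))
```

Because each random trial succeeds with probability at least 1/2, the retry loop is expected to finish after about two attempts.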