Hash Tables

• Hash tables
• Dealing with raw bytes
• Some probabilistic analysis

Motivations
• Balanced search trees
  – Store (key, value)-pairs
  – O(log n)-time search, insert, delete, max, min
  – O(log n + |output|)-time range query
  – Relatively complex implementation
• Can we improve running times for basic operations?

5/28/2016 CSE 250, SUNY Buffalo, @Hung Q. Ngo

Example
• Given a set of (key, value)-pairs
  – Keys are in the set {0, 1, 2, 3, …, n-1}
  – Values could be anything
• Store them in an array (Direct Access Table)

  Index:  0     1     2     3     …   n-2       n-1
  Value:  v_0   NULL  v_2   NULL  …   v_{n-2}   v_{n-1}

• Search/insert/delete takes O(1)-time

What are the drawbacks?
• We are unlikely to see such a “good” key set in practice
  – (Key, Value) = (English word, meaning)
  – (Key, Value) = (function, address)
  – (Key, Value) = (URL, IP address)
• Even if keys were non-negative integers, there might be lots of NULL entries
  – Wastes lots of space because n >> # pairs
  – Say keys are 8-byte integers: then n = $2^{64}$
• Can’t do range queries efficiently

THE STATIC KEY SET CASE
• Dictionary
• Reserved words in a programming language
• Command names in an operating system
• File names on a CD-ROM
• Lazy array & minimal perfect hashing

Static Key Set
• Wanted: data structure for an online dictionary
• With what we know so far, what are the choices?
  – Sorted array, use binary search
  – Use a balanced BST such as an RB tree or AVL tree
  – Randomize the keys and insert them one by one into a normal BST or a splay tree
• Sorting + binary search is the best of the three options (with caveats)
  – Search still takes O(log n)-time
  – Can we do better?
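As a baseline for the static key set case, here is a minimal sketch of the "sorted array + binary search" option. The names `Entry`, `build`, and `lookup` are made up for illustration, and the value type is assumed to be a string; this is one possible way to write it, not the course's reference code.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry type: (word, meaning).
using Entry = std::pair<std::string, std::string>;

// Sort once up front: O(n log n) comparisons.
void build(std::vector<Entry>& dict) {
    std::sort(dict.begin(), dict.end());
}

// Binary search on the word: O(log n) string comparisons.
// Returns a pointer to the meaning, or nullptr if the word is absent.
const std::string* lookup(const std::vector<Entry>& dict, const std::string& word) {
    auto it = std::lower_bound(
        dict.begin(), dict.end(), word,
        [](const Entry& e, const std::string& w) { return e.first < w; });
    if (it != dict.end() && it->first == word) return &it->second;
    return nullptr;
}
```

Every lookup still pays the O(log n) factor, which is what the rest of the lecture tries to remove.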
A Wild Solution
• Every key is a series of 0s and 1s
• There is always some integer m such that, for all practical purposes
  – A key K is an (≤ m)-bit number bin-rep(K)
  – ASCII: bin-rep(“cse250”) = 0x637365323530
  – Use this m-bit number as an index to an array of values
• What’s the longest non-technical English word?
  – Floccinaucinihilipilification (29 characters)
• What’s the longest technical English word?
  – Pneumonoultramicroscopicsilicovolcanoconiosis (a disease)

The solution is wild in at least two ways
• So, say m = 8 x 30 = 240 bits
  – Need an array A of $2^{240} \approx 1.77 \times 10^{72}$ elements
• Even if we have that much memory space, there is still one major problem
  – A[x] initialized to NULL for all x from 0 to $2^{240}-1$
  – NULL is just 0
  – Dictionary has n = 150000 = $15 \times 10^{4}$ words, say
  – Initializing the data structure takes ≥ $n^{13}$ steps
  – The O(n log n) sorting + binary search looks great!

The lazy array data structure
• Sequentially read inputs into an array Dict of size n
  – Dict[i] is the i’th (word, value) pair
• Insert all words into the huge array A, without initializing A first
  – for (i=0; i<n; ++i) A[Dict[i].word] = i
• search(x): (x is a word)
  – If (0 ≤ A[x] ≤ n-1 && Dict[A[x]].word == x)
    • Return Dict[A[x]].value
  – Else
    • Return false
• We can even delete(x) in O(1)-time
  – Just set Dict[A[x]] = NULL

Lazy array, an illustration

  Dict (size n)
  Index   Entry
  0       (“UB”, “University at Buffalo”)
  1       (“Mark Twain”, “Great writer”)
  2       (“cse250”, “boring course”)
  3       NULL
  …       …
  n-1     (“Matrix”, “Best Scifi Movie”)

  A (huge, never initialized)
  Index                        Content
  0x4D6174726978 (“Matrix”)    n-1
  0x637365323530 (“cse250”)    2
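The lazy-array lookup can be tried out on a toy scale. The sketch below assumes 16-bit keys (so A has $2^{16}$ slots rather than $2^{240}$) and fills A with pseudo-random junk to stand in for uninitialized memory, since actually reading uninitialized memory is undefined behavior in C++. The struct name `LazyArray`, the key width, and the junk seed are all illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Toy lazy array over a 16-bit key universe. Keys in `entries` are assumed distinct.
struct LazyArray {
    std::vector<std::pair<std::uint16_t, std::string>> dict;  // Dict[i] = (key, value)
    std::vector<std::size_t> A;                                // A[key] = index into Dict, or junk

    explicit LazyArray(std::vector<std::pair<std::uint16_t, std::string>> entries)
        : dict(std::move(entries)), A(std::size_t(1) << 16) {
        // Simulate "never initialized": overwrite every slot with arbitrary junk.
        // A real lazy array would skip this step entirely.
        std::mt19937 junk(250);
        for (std::size_t& slot : A) slot = junk();
        // Only n slots are ever written on purpose.
        for (std::size_t i = 0; i < dict.size(); ++i) A[dict[i].first] = i;
    }

    // A[x] is trusted only if it points into Dict AND Dict agrees on the key.
    const std::string* search(std::uint16_t x) const {
        const std::size_t i = A[x];
        if (i < dict.size() && dict[i].first == x) return &dict[i].second;
        return nullptr;
    }
};
```

Whatever junk A[x] holds for an absent key x, the cross-check against Dict rejects it, so the answer never depends on the initial contents of A.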
In the illustration, A[0x637365323531] (the slot for “cse251”, which is not in Dict) was set to 2 by accident. The search for “cse251” still fails correctly, because Dict[2].word is “cse250”.

Major drawback and an inspiration
• Used a humongous amount of space
  – Typically n << $2^{b}$, where b is the number of bits needed to represent the longest word
• However, if there was a function h such that
  – 0 ≤ h(word) ≤ n-1
  – And, for any two distinct words x & y, h(x) ≠ h(y)
• Then, we’re (almost) in good shape

Sample Input, n=6

  Key        Hash code (using ASCII), in hex
  “Adam”     4164616D
  “Ashley”   4173686C6579
  “Daniel”   44616E69656C
  “Kayla”    4B61796C61
  “Mike”     4D696B65
  “Troy”     54726F79

• A function h(hash_code) maps the hash codes to the indices {0,1,2,3,4,5}
• The value of each key is stored at index h(hash_code)

What was that function h?

    int h(int a) {
        a = (((a%256)%100)%41)%10;
        a = (a*a)%14;
        return (a>2) ? a-4 : a;
    }

Took me ½ hour to come up with this stupid function

(Minimal) Perfect Hash Function
• S: the set of n (hash codes of) keys
  – S = { 0x4164616D, 0x4173686C6579, 0x44616E69656C, 0x4B61796C61, 0x4D696B65, 0x54726F79 } in the example above
• h: S → {0,1,…,n-1} is an MPHF if it is a bijection
• We want to
  – Find such a function h (in a short amount of time)
  – Maybe store the function … in a data structure!
  – Evaluate h(code) in O(1)-time
• Possible, but a little bit complicated
  – http://cmph.sourceforge.net/
  – http://www.gnu.org/software/gperf/

HASHING – GENERAL IDEAS
• Hashing first proposed by Arnold Dumey (1956)
• Hash function = Hash codes + Compression function
• Separate chaining
• Open addressing, linear probing, quadratic probing

Top level view
• Arbitrary objects (strings, doubles, ints), of which n are actually used
  – Hash code: object → int with a wide range
  – Compression function: int with a wide range → {0,1,…,m-1}
• h(object) is the composition of the two
• We will also call the compression function a hash function

Good Hash Function
• If key1 ≠ key2, then it’s extremely unlikely that h(key1) = h(key2)
  – Collision problem!
• Constructing the function h takes little time
• Given key, computing h(key) takes O(|key|)-time

Collision unavoidable for unknown key set
• Pigeonhole principle
  – K+1 pigeons, K holes ⇒ at least one hole with ≥ 2 pigeons
• There are many more objects in the universe than m
  – Object set = set of strings of length ≤ 30 characters
  – Object set = set of possible URLs
  – Object set = set of possible file names on a CD-ROM
  – While the range size m is something like a few hundred thousand or less

Hash codes for int-style types
• Say we want hash codes that map to a (4-byte) int
• Easy when objects =
  – short, or int, or char or unsigned char
  – Simply cast them to uint32_t
• What about when objects =
  – long int (8-byte integers)

[Figure: the 8 bytes x0 x1 … x7 of the object are mapped to a hash code of 4 bytes y0 y1 y2 y3.]

Casting Down, Lose Information

    #include <iostream>
    using namespace std;

    // Assumes unsigned long is 8 bytes and unsigned int is 4 bytes.
    unsigned int hash_code1(unsigned long a) {
        return static_cast<unsigned int>(a);   // keeps only the low-order 4 bytes
    }

    int main() {
        unsigned long a = 0x8888888877777777;
        unsigned long b = 0x1111111177777777;
        cout << hex << a << " converted to " << hash_code1(a) << endl;
        cout << hex << b << " converted to " << hash_code1(b) << endl;
        return 0;
    }

Output:

    8888888877777777 converted to 77777777
    1111111177777777 converted to 77777777

Drawback of Casting from long to int
• We ignore the first (high-order) 4 bytes of information
• If key1 and key2 differ only in the first 4 bytes, they will collide!
• On the other hand, if keys are uniformly distributed, we are OK.
• Could also sum the first 4 bytes with the last 4 bytes

Hash codes for strings & variable length objects
• Say we have a universe of character array objects
  – “Computer Science”
  – “Floccinaucinihilipilification”
  – “Alan Turing”
  – …
• How do we produce 4-byte hash codes for them?
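Two of the ideas above can be sketched in a few lines: summing the two halves of an 8-byte integer, and (as one possible answer to the question about strings) a polynomial hash code evaluated with Horner's rule. The names `hash_code2` and `poly_hash` are made up here, and the default multiplier 31 is an assumption chosen for illustration; it is the constant Java's `String.hashCode()` uses.

```cpp
#include <cstdint>
#include <string>

// Fold an 8-byte integer into 4 bytes by adding its two halves.
// Unsigned addition wraps around modulo 2^32.
std::uint32_t hash_code2(std::uint64_t a) {
    return static_cast<std::uint32_t>(a >> 32) + static_cast<std::uint32_t>(a);
}

// Polynomial hash code: s[0]*c^(k-1) + s[1]*c^(k-2) + ... + s[k-1], modulo 2^32.
// Horner's rule gives O(|s|) time, and the result depends on character order.
std::uint32_t poly_hash(const std::string& s, std::uint32_t c = 31) {
    std::uint32_t h = 0;
    for (unsigned char ch : s) h = h * c + ch;
    return h;
}
```

With `hash_code2`, the two keys from the casting example no longer collide: 0x8888888877777777 folds to 0xffffffff and 0x1111111177777777 folds to 0x88888888. Keys can still collide, of course, since 8 bytes are being squeezed into 4.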
Hash codes for strings (or byte-sequences)
• Add up the characters
• XOR 4 bytes at a time
• Polynomial hash codes
• Shifting hash codes
• FNV hash
• MurmurHash
• Etc.
• Important lesson: data-dependency!

Some experimental results

  Hash code function into uint32_t   # of collisions   Max bucket size
  Sum                                1730769           175
  Xor                                583               3
  Shift7                             56                2
  Poly31                             22                2
  Poly33                             22                2
  FNV                                0                 1

• FNV (or FNV1a) is widely used, in DNS & Twitter, for example
• Poly31 is used in Java’s hashCode() method
  – http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#hashCode%28%29

Compression functions
• 4-byte hash codes can’t be used as indices
  – $2^{32}$ = 4,294,967,296 ≈ $4.3 \times 10^{9}$ is too many
• Store n entries, need indices in {0,1,…,m-1}
  – m should be close to n (say n = 150K, m = 200K)
• Compression function
  – f: uint32_t → {0,1,…,m-1}
  – Division method
  – Multiplication method
  – Universal hashing
• Compression functions are hash functions, and thus these methods can be used to design hash codes too!

Hash/compression function design problem
• Universe U = all uint32_t integers
• S, an unknown subset of n members of U
• Find f : U → {0,1,…,m-1}
  – Computing f(u) is fast
  – Minimize collisions
• Note: suppose |U| > m ≥ n
  – For a fixed S, there always exists f with no collisions
  – For a fixed f, there always exists S with lots of collisions
• If S’s distribution is truly arbitrary
  – The best f is such that f(s) is uniformly distributed on {0,…,m-1}
  – Balls-into-bins model: throw n “balls” randomly into m “bins”

An analogy – the birthday problem
• U = 7 billion people in the world
• S = set of students in this room
• f : { students } → {Jan 01, …, Dec 31}
  – So m = 365 (forget leap years)
• Question:
  – If S is chosen randomly from U, how large must S be until it is more likely to have a collision than not?
• This is called the birthday “paradox”

Birthday paradox
• Say there are n students in this room
• Prob[1st student does not “collide”] = 1
• Prob[2nd student does not “collide”] = 1 - 1/m
• Prob[3rd student does not “collide” | first two didn’t collide] = 1 - 2/m
• …
• Overall probability of no collision is
  $$\left(1-\frac{1}{m}\right)\left(1-\frac{2}{m}\right)\cdots\left(1-\frac{n-1}{m}\right) < \frac{1}{2}$$
  when n = 23 and m = 365
• When n = 30, Prob[no collision] ≈ 30%

Rarity of Minimal Perfect Hash Function
• Consider |U| = N, m = n; an MPHF is a bijection from S to {0,1,…,n-1}
• Number of functions from U to {0,1,…,n-1} is $n^{N}$
• For a fixed (but unknown) S of size n
  – The number of functions that are MPHFs for S is $n!\,n^{N-n}$
  – Hence, the fraction of functions which are MPHFs is $\frac{n!\,n^{N-n}}{n^{N}} = \frac{n!}{n^{n}}$
  – When n = 10, the ratio is 0.00036…
  – When n = 20, the ratio is $2.32 \times 10^{-8}$

Division method
• $f(u) = u \bmod m$
• How does this function perform for different m?
• The answer depends a lot on the distribution of S in the universe U

m from 50K to 60K
• Total # collisions: 19K
• Max bucket size 6-8, typically
• Recall n ≈ 47K
• Could we have guessed this result without coding?
  – Something in the spirit of the birthday paradox?
  – Motto: Think, then code!

Balls into Bins
• Throw n balls into m bins randomly
• Probability a given bin is empty is $(1-1/m)^{n} \approx e^{-n/m}$
• Expected number of empty bins is $m e^{-n/m}$
• It can be shown mathematically that on average, when m ≈ n, the maximum bin size is about $\frac{\ln n}{\ln \ln n}$ (up to lower-order terms)

These estimates are incredibly good!
• n = 47000, m = 50000
• $m e^{-n/m} \approx 19500$
• And $\frac{\ln n}{\ln \ln n} \approx 4.5$, the same order as the observed maximum bucket size of 6-8
• You can repeat the experiment with m ≈ 100K

Multiplication method – slightly better!
• The multiplier is based on the golden ratio
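The two compression functions can be sketched as follows. The division method is $f(u) = u \bmod m$. For the multiplication method, the sketch assumes the textbook form $f(u) = \lfloor m \cdot \mathrm{frac}(uA) \rfloor$ with $A = (\sqrt{5}-1)/2$, computed in fixed-point arithmetic; the function names and the exact formulation are assumptions, since the slide only mentions the golden ratio.

```cpp
#include <cassert>
#include <cstdint>

// Division method: f(u) = u mod m.
std::uint32_t compress_div(std::uint32_t u, std::uint32_t m) {
    assert(m > 0);
    return u % m;
}

// Multiplication method: f(u) = floor(m * frac(u * A)), A = (sqrt(5) - 1) / 2.
// kA is floor(A * 2^32), so the low 32 bits of u * kA approximate
// frac(u * A) scaled by 2^32.
std::uint32_t compress_mul(std::uint32_t u, std::uint32_t m) {
    assert(m > 0);
    const std::uint32_t kA = 2654435769u;   // floor(2^32 * 0.6180339887...)
    const std::uint32_t frac = u * kA;      // wraps modulo 2^32 on purpose
    // Scale the 32-bit fraction to the range [0, m).
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(frac) * m) >> 32);
}
```

Unlike the division method, this version does not need m to be prime, because every bit of u influences the high-order bits of the product.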
Universal Hashing
• For any fixed hash function, an adversary can always pick a key set S which creates Ω(n) collisions
  – Denial of Service attack (more later!)
• Universal hashing approach
  – Design a family H of hash functions such that for any k ≠ k’, $\Pr_{h \in H}[h(k) = h(k')] \le 1/m$
  – Pick a hash function h in H uniformly at random
  – Note that the key set is chosen by the adversary

Theoretical Results
• α = n/m is the load factor of the hash table
• Expected bucket size is at most 1 + α
• Fact: when the universal family is n-independent, the keys behave almost as if we throw balls randomly and independently into bins

Additional notes
• Table size m: should choose a prime for mod compression
  – If m is even, then even % m is even, so even hash codes only land in even buckets
  – Objects in computer memory often start at even addresses
  – If m is a power of 2, then we effectively mod out the high-order bits, i.e. only the low-order bits are used
• Lost the relative order of keys
  – Can’t answer queries such as:
    • “what are the keys (& associated values) in between 3 and 432?”
    • “list the smallest k keys”
  – BSTs handle those just fine!

COLLISION RESOLUTION
• Separate chaining
• Open addressing
• Cuckoo hashing

Separate Chaining
[Figure: an index array with entries 0 to 4, each a pointer to a chain; the chains hold the keys Turing, Cantor, Knuth, Karp, and Dijkstra.]

Performance
• Under the simple uniform hashing assumption
  – i.e., each object is hashed into each bucket with probability 1/m, independently of the other objects
• Expected search time Θ(1+α)
• Worst-case search time Θ(n) – though very unlikely
• Using universal hashing, expected time for n operations is Θ(n) when α = O(1)
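A minimal sketch of separate chaining, assuming string keys, int values, a fixed number of buckets m, and `std::hash<std::string>` followed by the division method as the hash function. The class name `ChainedMap` and its interface are illustrative choices, not the course's implementation.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedMap {
public:
    explicit ChainedMap(std::size_t m) : buckets_(m), size_(0) {}  // m must be positive

    // Insert a new pair, or overwrite the value if the key is already present.
    void insert(const std::string& key, int value) {
        auto& chain = buckets_[index(key)];
        for (auto& kv : chain)
            if (kv.first == key) { kv.second = value; return; }
        chain.emplace_back(key, value);
        ++size_;
    }

    // Walk one chain only: expected Theta(1 + alpha) under simple uniform hashing.
    const int* find(const std::string& key) const {
        for (const auto& kv : buckets_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    bool erase(const std::string& key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (it->first == key) { chain.erase(it); --size_; return true; }
        return false;
    }

    // alpha = n / m
    double load_factor() const { return static_cast<double>(size_) / buckets_.size(); }

private:
    // Hash code (std::hash) followed by compression (division method).
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>()(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> buckets_;
    std::size_t size_;
};
```

Note that the load factor may exceed 1 here: chains simply grow longer. A production table would also resize when α gets large, which this sketch leaves out.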
Denial of Service Attacks
• http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html
• http://www.ocert.org/advisories/ocert-2011-003.html
• http://permalink.gmane.org/gmane.comp.security.full-disclosure/83694
• “Hash tables are a commonly used data structure in most programming languages. Web application servers or platforms commonly parse attacker-controlled POST form data into hash tables automatically, so that they can be accessed by application developers. If the language does not provide a randomized hash function or the application server does not recognize attacks using multi-collisions, an attacker can degenerate the hash table by sending lots of colliding keys. The algorithmic complexity of inserting n elements into the table then goes to O(n**2), making it possible to exhaust hours of CPU time using a single HTTP request.”

Poor Hash Choices Brought Down msn.com
• http://blogs.msdn.com/b/ericlippert/archive/2003/09/19/arrrrr-cap-n-eric-be-learnin-about-threadin-the-harrrrd-way.aspx

BTW
• You can do Separate Treeing too!
• They don’t teach that in school
• What’s the performance of your hash table then?
• What’s the drawback?

Open Addressing
• Store all entries in the hash table itself, no pointer to the “outside”
• Advantage
  – Less wasted space
  – Perhaps good cache usage
• Disadvantage
  – More complex collision resolution
  – Slower operations

Open Addressing
[Figure: a table with indices 0 to 7 holding the keys Turing, Cantor, Knuth, Karp, and Dijkstra directly in its slots; arrows show the probes h(“Knuth”, 0), h(“Knuth”, 1), h(“Karp”, 0), h(“Karp”, 1), h(“Dijkstra”, 0), h(“Dijkstra”, 1), h(“Dijkstra”, 2).]

Open Addressing Scheme
• Instead of h : U → {0,1,…,m-1}
  – e.g. h(key) = 3
• We use an extended hash function which defines a probe sequence
• h : U x {0,1,…,m-1} → {0,1,…,m-1}
  – e.g. h(key, 0) = 5, h(key, 1) = 9, h(key, 2) = 7, …

Insert Algorithm

    for (i=0; i<m; i++) {
        j = h(key, i);
        if (Table[j] == NULL) {
            insert entry;
            break;
        }
    }
    if (i == m) report error “hash table overflow”

Desirable Property of Probe Sequence
• For any key, h(key, 0), …, h(key, m-1) is a permutation of the set {0,1,…,m-1}
• What happens if the property does not hold?
• How do we search, BTW?

Delete
• Find where the key is
• Can’t simply remove and set the entry to NULL
  – Why?
• One solution
  – Set the deleted entry to be a special DELETED object
  – Modify insert so that a new object replaces a DELETED entry as well as a NULL entry
  – When searching, pass over DELETED entries – don’t stop!

Three typical choices for probe sequence
• Linear probing -- $h(key, i) = h'(key) + c\,i \pmod m$
  – Good hash function h’, and c relatively prime to m (why?)
  – Causes primary clustering problem
  – Widely used due to excellent cache usage
• Quadratic probing -- $h(key, i) = h'(key) + c_1 i + c_2 i^2 \pmod m$
  – $c_2 \ne 0$ is an auxiliary constant
• Double hashing -- $h(key, i) = h_1(key) + i \cdot h_2(key) \pmod m$
  – Need $h_2(key)$ relatively prime to m
  – E.g., $m = 2^k$ for some k, and $h_2(key)$ always odd

Analysis (α < 1), assuming uniform hashing
• Expected # of probes in an unsuccessful search is at most $\frac{1}{1-\alpha}$
• Insertion on average takes at most $\frac{1}{1-\alpha}$ probes
• Expected # of probes in a successful search is at most $\frac{1}{\alpha}\ln\frac{1}{1-\alpha}$
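The insert, search, and delete rules above can be put together in one small table. This is a sketch under stated assumptions: linear probing with c = 1, string keys and int values, `std::hash<std::string>` as h’, a fixed table size m, and no resizing. The class name `LinearProbingMap` and the slot states EMPTY/FULL/DELETED are naming choices made here.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class LinearProbingMap {
public:
    explicit LinearProbingMap(std::size_t m) : table_(m) {}   // m must be positive

    // Returns false when no free slot exists ("hash table overflow").
    bool insert(const std::string& key, int value) {
        const std::size_t m = table_.size();
        std::size_t target = m;                  // first reusable slot seen so far
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t j = probe(key, i);
            Slot& s = table_[j];
            if (s.state == FULL && s.key == key) { s.value = value; return true; }
            if (s.state != FULL && target == m) target = j;
            if (s.state == EMPTY) break;         // the key cannot be further along
        }
        if (target == m) return false;
        table_[target].state = FULL;
        table_[target].key = key;
        table_[target].value = value;
        return true;
    }

    const int* find(const std::string& key) const {
        const std::size_t j = locate(key);
        return j == table_.size() ? nullptr : &table_[j].value;
    }

    bool erase(const std::string& key) {
        const std::size_t j = locate(key);
        if (j == table_.size()) return false;
        table_[j].state = DELETED;               // tombstone, not EMPTY
        return true;
    }

private:
    enum State { EMPTY, FULL, DELETED };
    struct Slot {
        State state;
        std::string key;
        int value;
        Slot() : state(EMPTY), value(0) {}
    };

    // h(key, i) = (h'(key) + i) mod m: a permutation of {0,...,m-1} for every key.
    std::size_t probe(const std::string& key, std::size_t i) const {
        const std::size_t m = table_.size();
        return (std::hash<std::string>()(key) % m + i) % m;
    }

    // Search passes over DELETED slots and stops only at an EMPTY one.
    // Returns the slot index of key, or table_.size() if absent.
    std::size_t locate(const std::string& key) const {
        for (std::size_t i = 0; i < table_.size(); ++i) {
            const std::size_t j = probe(key, i);
            if (table_[j].state == EMPTY) break;
            if (table_[j].state == FULL && table_[j].key == key) return j;
        }
        return table_.size();
    }

    std::vector<Slot> table_;
};
```

The tombstone is what answers the "Why?" on the Delete slide: if erase reset the slot to EMPTY, a later search for a key that had probed past that slot would stop too early and wrongly report the key as absent.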
Cuckoo Hashing
• Rasmus Pagh & Flemming Friche Rodler, 2001
• A variant of open addressing
• Does not use perfect hashing
• Time:
  – O(1)-lookup time in the worst-case
  – O(1) expected amortized insertion time
• Space:
  – 3 words per key like BSTs
• Very competitive in practice

Cuckoo Hashing – Basic Idea
[Figure: two hash functions h1 and h2 place the keys Karp, HQN, Levin, Knuth, Cantor, Dijkstra, and Turing; when an insertion cannot be completed: Rehash! (pick new & random h1, h2).]
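The basic idea can be sketched as a small set of 64-bit keys. Everything beyond the idea itself is an assumption made for this sketch: two tables of equal power-of-two size, multiply-shift hash functions with random odd multipliers, a limit of 32 evictions before rehashing, and doubling the tables once they are more than a quarter full overall. The class name `CuckooSet` is also made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

class CuckooSet {
public:
    CuckooSet() : bits_(4), count_(0), a1_(1), a2_(1), rng_(250) { reset_tables(); }

    // Worst-case O(1): a key can only live in t1_[h1(key)] or t2_[h2(key)].
    bool contains(std::uint64_t key) const {
        const Slot& a = t1_[h1(key)];
        const Slot& b = t2_[h2(key)];
        return (a.used && a.key == key) || (b.used && b.key == key);
    }

    void insert(std::uint64_t key) {
        if (contains(key)) return;
        ++count_;
        if (2 * count_ > t1_.size()) {   // keep each table at most half full
            ++bits_;                     // double both tables
            rebuild(Slot{true, key});
        } else {
            place(Slot{true, key});
        }
    }

private:
    struct Slot { bool used; std::uint64_t key; };
    static const int kMaxKicks = 32;

    // Multiply-shift hashing: the top bits_ bits of a * k (mod 2^64), with a odd.
    std::size_t h1(std::uint64_t k) const { return static_cast<std::size_t>((a1_ * k) >> (64 - bits_)); }
    std::size_t h2(std::uint64_t k) const { return static_cast<std::size_t>((a2_ * k) >> (64 - bits_)); }

    // Empty both tables and pick new random h1, h2.
    void reset_tables() {
        t1_.assign(std::size_t(1) << bits_, Slot{false, 0});
        t2_.assign(std::size_t(1) << bits_, Slot{false, 0});
        a1_ = rng_() | 1;
        a2_ = rng_() | 1;
    }

    // Put x into table 1; whoever was there moves to table 2, and so on.
    void place(Slot x) {
        for (int kick = 0; kick < kMaxKicks; ++kick) {
            std::swap(x, t1_[h1(x.key)]);
            if (!x.used) return;
            std::swap(x, t2_[h2(x.key)]);
            if (!x.used) return;
        }
        rebuild(x);                      // too many evictions: Rehash!
    }

    // Collect every stored key plus the homeless one, then reinsert them all.
    void rebuild(Slot homeless) {
        std::vector<Slot> all(1, homeless);
        for (const Slot& s : t1_) if (s.used) all.push_back(s);
        for (const Slot& s : t2_) if (s.used) all.push_back(s);
        reset_tables();
        for (const Slot& s : all) place(s);
    }

    unsigned bits_;                      // each table has 2^bits_ slots
    std::size_t count_;
    std::uint64_t a1_, a2_;
    std::mt19937_64 rng_;
    std::vector<Slot> t1_, t2_;
};
```

Lookups never loop, which is where the worst-case O(1) bound comes from; all the work is pushed into insertion, where an occasional rehash is paid for by the many cheap insertions around it.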