HASH TABLES
- DEALING WITH RAW BYTES
- SOME PROBABILISTIC ANALYSIS

CSE 250, Spring 2012, SUNY Buffalo, @Hung Q. Ngo 5/28/2016

Motivations
- Balanced search trees
  - Store (key, value)-pairs
  - O(log n)-time search, insert, delete
  - Relatively complex implementations
- Can we improve running times for basic operations?

Example
- Given a set of (key, value)-pairs
  - Keys are in the set {0, 1, 2, 3, …, n-1}
  - Values could be anything
- Store them in an array (direct access table)

    Index:  0    1     2    3     …  n-2     n-1
    Value:  v0   NULL  v2   NULL  …  V[n-2]  V[n-1]

- Search/insert/delete takes O(1)-time

What is the drawback?
- Unlikely to see such a perfect key set in practice
  - (Key, Value) = (English word, meaning)
  - (Key, Value) = (function, address)
  - (Key, Value) = (URL, IP address)
- Wastes lots of space when n >> # pairs
  - Say keys are 8-byte integers: then n = $2^{64}$ (keys range over 0 … $2^{64}-1$)

The Static Key Set Case
- DICTIONARY
- RESERVED WORDS IN A PROGRAMMING LANGUAGE
- COMMAND NAMES IN AN OS
- FILE NAMES IN CD-ROM
- LAZY ARRAY & MINIMAL PERFECT HASHING

Static Key Set
- Wanted: data structure for an online dictionary
- With what we know so far, what are the choices?
  - Use a balanced BST such as an RB tree or an AVL tree
  - Randomize the keys and insert them one by one into a normal BST or a splay tree
  - Sort the keys, use binary search
- Sorting + binary search is the best of the three options
  - Search still takes O(log n)-time
- Can we do better?

A Wild Solution
- Every key is a series of 0s and 1s
- There is always some integer m such that, for all practical purposes, a key K is an (≤ m)-bit number bin-rep(K)
  - E.g. in ASCII, bin-rep(“cse250”) = 0x637365323530
- Use this m-bit number as an index into an array of values
- What’s the longest non-technical English word?
  - Floccinaucinihilipilification (29 characters)
- What’s the longest technical English word?
  - Pneumonoultramicroscopicsilicovolcanoconiosis (a disease)

The solution is wild in at least two ways
- So, say m = 8 × 30 = 240 bits
- Need an array A of $2^{240} \approx 1.77 \times 10^{72}$ elements
- Even if we had that much memory space, there is still one major problem
  - A[x] has to be initialized to NULL for all x from 0 to $2^{240}-1$
  - NULL is just 0
  - We have n = 150000 = $1.5 \times 10^{5}$ words, say
  - Initializing the data structure takes ≥ $n^{13}$ steps
  - The O(n log n) sorting + binary search looks great by comparison

The lazy array data structure
- Sequentially read inputs into an array Dict of size n
  - Dict[i] is the i’th (word, value) pair
- Insert all words:

      for (i = 0; i < n; ++i)
          A[Dict[i].word] = i;

- search(x): (x is a word)

      if (0 <= A[x] && A[x] <= n-1 && Dict[A[x]].word == x)
          return Dict[A[x]].value;
      else
          return false;

- You can even delete(x) in O(1)-time
  - Just set Dict[A[x]] = NULL (after the same validity check as in search)

Lazy array, an illustration

    Dict
      0    (“UB”, “University at Buffalo”)
      1    (“Mark Twain”, “Great writer”)
      2    (“cse250”, “boring course”)
      3    NULL
      …    …
      n-1  (“Matrix”, “Best Scifi Movie”)

    A
      A[0x4D6174726978] = n-1   (“Matrix”)
      A[0x637365323530] = 2     (“cse250”)
      A[0x637365323531] = 2     (“cse251”: set to 2 by accident)

- search(“cse251”) still fails, because Dict[2].word is “cse250”

Major drawback and an inspiration
- Used a humongous amount of space
  - n << $2^{b}$, where b = # bits to represent the longest word
- However, suppose there were a function h such that
  - 0 ≤ h(word) ≤ n-1
  - For any two distinct words x & y, h(x) ≠ h(y)
- Then, we’re (almost) in good shape
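
The lazy array idea above can be tried out on a toy universe. What follows is a minimal sketch under stated assumptions, not the course's implementation: keys are small integers below a bound passed to the constructor rather than 240-bit numbers, and the index array `A` is deliberately filled with arbitrary garbage to stand in for uninitialized memory (actually reading uninitialized memory is undefined behavior in C++). The names `LazyArray`, `insert`, and `search` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy lazy array: keys are integers in [0, universe), values are strings.
class LazyArray {
public:
    explicit LazyArray(std::size_t universe) : A(universe) {
        // Simulate uninitialized memory with arbitrary small garbage values.
        // Some of them will point into Dict "by accident"; search() must cope.
        for (std::size_t x = 0; x < universe; ++x) A[x] = (x * 2654435761u) % 7;
    }

    // Append the pair to Dict and record its position in A[key].
    void insert(std::size_t key, std::string value) {
        assert(key < A.size());
        A[key] = dict.size();
        dict.emplace_back(key, std::move(value));
    }

    // A[key] is trusted only if it points into Dict AND Dict agrees on the key.
    std::optional<std::string> search(std::size_t key) const {
        assert(key < A.size());
        std::size_t i = A[key];
        if (i < dict.size() && dict[i].first == key) return dict[i].second;
        return std::nullopt;
    }

private:
    std::vector<std::size_t> A;                             // garbage until written
    std::vector<std::pair<std::size_t, std::string>> dict;  // Dict, filled in input order
};
```

The point of the sketch is the double check in `search`: a garbage entry in `A` either points outside `Dict` or points at a pair whose key does not match, so no initialization pass over `A` is needed for correctness.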
Sample Input, n=6

    Key       Hash code (using ASCII, in hex)
    “Adam”    4164616D
    “Ashley”  4173686C6579
    “Daniel”  44616E69656C
    “Kayla”   4B61796C61
    “Mike”    4D696B65
    “Troy”    54726F79

    Index   0  1  2  3  4  5
    Value   …  …  …  …  …  …

- Function h(hash_code) maps each hash code to an index in {0,1,2,3,4,5}

What’s that function h?

    int h(int a) {
        a = (((a%256)%100)%41)%10;
        a = (a*a)%14;
        return (a>2) ? a-4 : a;
    }

- Took me ½ hour to come up with that

Minimal Perfect Hash Function
- S: the set of n (hash codes of) keys
  - S = {0x4164616D, 0x4173686C6579, 0x44616E69656C, 0x4B61796C61, 0x4D696B65, 0x54726F79} in the example above
- h: S → {0,1,…,n-1} is a MPHF if it is a bijection
- We want to
  - Find such a function h (in a short amount of time)
  - Maybe store the function in a data structure!
  - Evaluating h(code) should take O(1)-time
- Possible, but a little bit complicated

Hashing: General Ideas
- HASHING FIRST PROPOSED BY ARNOLD DUMEY (1956)
- HASH CODES
- CHAINING
- OPEN ADDRESSING, LINEAR PROBING, QUADRATIC PROBING

Top level view

    Arbitrary objects (strings, doubles, ints); n objects actually used
        → hash code →             int with wide range
        → compression function →  {0,1,…,m-1}

- h(object) is the composition of the two steps
- Compression function: we will also call this the hash function

Good Hash Function
- If key1 = key2, then h(key1) = h(key2)
- If key1 ≠ key2, then it’s extremely unlikely that h(key1) = h(key2)
  - Collision problem!
- Constructing the function h takes little time
- Given key, computing h(key) takes O(|key|)-time
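
For a key set as small as the six sample hash codes, a minimal perfect hash function can also be found by brute force. The sketch below is only an illustration under assumptions: it searches a hypothetical family h_a(x) = ((a · (x mod P)) mod P) mod n with P = 2^31 - 1, which is not the function shown on the slide, and the names `is_minimal_perfect` and `find_mphf` are invented here. A uniformly random map of 6 keys into 6 slots is a bijection with probability 6!/6^6 ≈ 1.5%, so one would expect a modest number of tries to be enough.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical candidate family (not the slide's function):
//   h_a(x) = ((a * (x mod P)) mod P) mod n,  with P = 2^31 - 1 (prime).
constexpr std::uint64_t P = 2147483647ULL;

std::uint64_t h(std::uint64_t a, std::uint64_t x, std::uint64_t n) {
    // a < P and x % P < P, so the product stays below 2^62: no overflow.
    return ((a * (x % P)) % P) % n;
}

// True if h_a maps the keys one-to-one onto {0, ..., n-1}, n = keys.size().
bool is_minimal_perfect(std::uint64_t a, const std::vector<std::uint64_t>& keys) {
    std::vector<bool> used(keys.size(), false);
    for (std::uint64_t x : keys) {
        std::uint64_t slot = h(a, x, keys.size());
        if (used[slot]) return false;  // two keys share a slot
        used[slot] = true;
    }
    return true;
}

// Try multipliers a = 1, 2, ..., max_tries; return the first one that works.
std::optional<std::uint64_t> find_mphf(const std::vector<std::uint64_t>& keys,
                                       std::uint64_t max_tries) {
    for (std::uint64_t a = 1; a <= max_tries && a < P; ++a)
        if (is_minimal_perfect(a, keys)) return a;
    return std::nullopt;
}
```

Calling `find_mphf` with the six hash codes from the sample input and a generous `max_tries` yields a multiplier whose function can then be evaluated in O(1) time. Brute force like this only works for tiny key sets; the rarity of such functions for larger n is why constructing them is "a little bit complicated".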
Collision is simply unavoidable
- Pigeonhole principle
  - K+1 pigeons, K holes ⇒ at least one hole with ≥ 2 pigeons
- There are many more objects in the universe than m
  - Object set = set of strings of length ≤ some fixed number of characters
  - Object set = set of possible URLs
  - Object set = set of possible file names in a CD-ROM
- While m is something like a few hundred thousand or less

Hash codes for int-style types
- Say we want hash codes to map to a (4-byte) int
- Easy when objects = short int, or int, or char, or unsigned char
  - Simply cast them to uint32_t
- What about when objects = long int (8-byte integers)?

    Object:     x0 x1 x2 x3 x4 x5 x6 x7
    Hash code:  y0 y1 y2 y3

Casting Down

    #include <iostream>
    using namespace std;

    // Keep only the low-order 4 bytes (assumes an 8-byte unsigned long).
    unsigned int hash_code1(unsigned long a) {
        return static_cast<unsigned int>(a);
    }

    int main() {
        // a and b differ only in their high-order 4 bytes
        unsigned long a = 0x8888888877777777;
        unsigned long b = 0x1111111177777777;
        cout << hex << a << " converted to " << hash_code1(a) << endl;
        cout << hex << b << " converted to " << hash_code1(b) << endl;
        return 0;
    }

Drawback of Casting from Long to Int
- We ignore the first 4 bytes of information
- If key1 and key2 differ only in the first 4 bytes, they will collide!
- On the other hand, if keys are uniformly distributed, we are OK.
- Could also sum the 1st 4 bytes with the 2nd 4 bytes

Hash codes for strings & variable length objects
- Say we have a universe of character array objects
  - “Computer Science”
  - “Floccinaucinihilipilification”
  - “Alan Turing”
  - …
- How do we produce 4-byte hash codes for them?

Hash codes for strings (or byte-sequences)
- Add up the characters
- XOR 4 bytes at a time
- Polynomial hash codes
- Shifting hash codes
- FNV hash
- MurmurHash
- Etc.
- Important lesson: data-dependency!
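
Three of the string hash codes listed above can be sketched in a few lines each. This is an illustrative sketch, not the code used for the course experiments: the function names are made up, the polynomial base 31 is one common choice, and the FNV variant shown is 32-bit FNV-1a (the slides do not say which FNV variant they mean).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Additive hash: order is ignored, so anagrams always collide.
std::uint32_t sum_hash(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h += c;
    return h;
}

// Polynomial hash with base 31; arithmetic wraps modulo 2^32.
std::uint32_t poly31_hash(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h = h * 31u + c;
    return h;
}

// 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime.
std::uint32_t fnv1a_hash(const std::string& s) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}
```

The data-dependency lesson shows up immediately: `sum_hash("listen")` equals `sum_hash("silent")` because addition forgets character order, while the polynomial and FNV codes separate the two words.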
Some experimental results

    Hash code function into uint32_t   # of collisions   Max bucket size
    Sum                                1730769           175
    Xor                                583               3
    Shift7                             56                2
    Poly31                             22                2
    Poly33                             22                2
    FNV                                0                 1

- FNV is widely used, in DNS & Twitter, for example

Compression functions
- 4-byte hash codes can’t be used as indices
  - $2^{32}$ = 4,294,967,296 ≈ $4 \times 10^{9}$ is too many
- To store n entries, need indices in {0,1,…,m-1}
  - m should be close to n (say n = 50K, m = 60K)
- Compression function f: uint32_t → {0,1,…,m-1}
  - Division method
  - Multiplication method
  - Universal hashing
- Compression functions are hash functions, and thus their methods can be used to design hash codes too!

Hash function design problem
- Universe U = all uint32_t integers
- S, an unknown subset of n members of U
- Find f : U → {0,1,…,m-1}
  - Computing f(u) is fast
  - Minimize collisions
- Note: suppose |U| > m ≥ n
  - For a fixed S, there always exists f with no collisions
  - For a fixed f, there always exists S with lots of collisions
- If S’s distribution is truly arbitrary
  - The best f is such that f(s) is uniformly distributed on {0,…,m-1}
  - Balls-into-bins model: throw n “balls” randomly into m “bins”

An analogy: the birthday problem
- U = 7 billion people in the world
- S = set of students in this room
- f maps students to birthdates {Jan 01, …, Dec 31}
  - So m = 365 (forget leap years)
- Question: if S is chosen randomly from U, how large must S be until it is more likely to have a collision than not?
- This is called the birthday “paradox”
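
The birthday question can be answered numerically. Assuming each of the n keys lands in one of m slots uniformly and independently, the probability that all of them are distinct is the product (1 - 1/m)(1 - 2/m)…(1 - (n-1)/m). The sketch below evaluates that product; the function names are invented for illustration.

```cpp
#include <cassert>
#include <cmath>

// Probability that n keys thrown uniformly at random into m slots
// are all distinct: (1 - 1/m)(1 - 2/m)...(1 - (n-1)/m).
double prob_no_collision(int n, int m) {
    double p = 1.0;
    for (int i = 1; i < n; ++i) {
        if (i >= m) return 0.0;  // more keys than slots: a collision is certain
        p *= 1.0 - static_cast<double>(i) / m;
    }
    return p;
}

// Smallest n for which a collision is more likely than not.
int smallest_n_with_likely_collision(int m) {
    int n = 1;
    while (prob_no_collision(n, m) >= 0.5) ++n;
    return n;
}
```

With m = 365 the threshold is n = 23, and the same computation applies to a hash table with m slots: collisions become likely once n is on the order of √m, long before the table is full.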
Birthday paradox
- Say there are n students in this room
- Prob[1st student does not “collide”] = 1
- Prob[2nd student does not “collide”] = 1 - 1/m
- Prob[3rd student does not “collide” | first two didn’t collide] = 1 - 2/m
- …
- Overall probability of no collision is $(1-1/m)(1-2/m)\cdots(1-(n-1)/m)$, which is < ½ when n = 23 and m = 365
- When n = 30, Prob[no collision] ≈ 30%

Rarity of Minimal Perfect Hash Function
- Consider |U| = N, m = n; a MPHF is a bijection from S to {0,1,…,n-1}
- Number of functions from U to {0,1,…,n-1} is $n^{N}$
- For a fixed (but unknown) S of size n, the number of MPHFs for S is $n! \cdot n^{N-n}$
- Hence, the fraction of functions which are MPHFs is $n!/n^{n}$
- When n = 10, the ratio is 0.00036…
- When n = 20, the ratio is $2.32 \times 10^{-8}$

Division method
- $f(x) = x \bmod m$
- How does this function perform for different m?
- The answer depends a lot on the distribution of S in the universe U

m from 50K to 60K
- Total # collisions: 19K
- Max bucket size 6-8, typically
- Recall n ≈ 47K
- Could we have guessed this result without coding?
  - Something in the spirit of the birthday paradox?
- Motto: Think, then code!

Balls into Bins
- Throw n balls into m bins randomly
- Probability a given bin is empty is $(1-1/m)^{n} \approx e^{-n/m}$
- Expected number of empty bins is $m e^{-n/m}$
- It can be shown mathematically that on average, when m ≈ n, the maximum bin size is about $\ln n / \ln\ln n$

These estimates are incredibly good!
- n = 47000, m = 50000
- $m e^{-n/m} \approx 19500$
- You can repeat the experiment with m ≈ 100K

Multiplication method: slightly better!
- Standard form: $f(x) = \lfloor m \,(xA \bmod 1) \rfloor$ for a constant 0 < A < 1
- Golden ratio: $A = (\sqrt{5}-1)/2 \approx 0.618$ is a common choice

Universal Hashing
- An adversary can always pick a key set S which creates Ω(n) collisions
  - Denial of Service attack (more later!)
- Universal hashing approach
  - Design a family H of hash functions such that for any k ≠ k’, $\Pr_{h \in H}[h(k) = h(k')] \le 1/m$
  - Pick a hash function h in H uniformly at random
  - Note that the key set is chosen by the adversary

Theoretical Results
- α = n/m is the load factor of the hash table
- Expected bucket size is at most 1 + α
- Fact: when the universal family is n-independent, the keys behave almost as if we throw balls randomly and independently into bins

Additional notes
- Table size m: should choose a prime for mod compression
  - If it’s even, even % even = even
    - Objects in computer memory often start at even addresses
  - If it’s a power of 2 then we effectively mod out the high-order bits
- Lost the relative order of keys
  - Can’t answer queries such as:
    - “what are the keys (& associated values) in between 3 and 432?”
    - “list the smallest k keys”
  - BTW, how do we answer such queries with a BST?

Collision resolution
- SEPARATE CHAINING
- OPEN ADDRESSING
- CUCKOO HASHING

Separate Chaining

[Figure: the keys Turing, Cantor, Knuth, Karp, Dijkstra hashed into a table with indices 0-4; each slot holds a pointer to a chain of the entries that hash to it]

Performance
- Under the simple uniform hashing assumption
  - i.e. each object is hashed into a bucket with probability 1/m, uniformly and independently of other objects
- Expected search time Θ(1+α)
- Worst-case search time Ω(n), though very unlikely
- Using universal hashing, expected time for n operations is Θ(n) (when α = O(1))

Denial of Service Attacks
- http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html
- http://www.ocert.org/advisories/ocert-2011-003.html
- http://permalink.gmane.org/gmane.comp.security.full-disclosure/83694
- “Hash tables are a commonly used data structure in most programming languages.
  Web application servers or platforms commonly parse attacker-controlled POST form data into hash tables automatically, so that they can be accessed by application developers. If the language does not provide a randomized hash function or the application server does not recognize attacks using multi-collisions, an attacker can degenerate the hash table by sending lots of colliding keys. The algorithmic complexity of inserting n elements into the table then goes to O(n**2), making it possible to exhaust hours of CPU time using a single HTTP request.”

BTW
- You can do Separate Treeing too!
  - They don’t teach that in school
- What’s the performance of your hash table then?
- What’s the drawback?

Open Addressing
- Store all entries in the hash table itself, no pointer to the “outside”
- Advantage
  - Less wasted space
  - Perhaps good cache usage
- Disadvantage
  - More complex collision resolution
  - Slower operations

Open Addressing

[Figure: a table with indices 0-7 storing Turing, Cantor, Knuth, Karp, Dijkstra in the table itself; arrows show the probes h(“Knuth”, 0), h(“Knuth”, 1), h(“Karp”, 0), h(“Karp”, 1), h(“Dijkstra”, 0), h(“Dijkstra”, 1), h(“Dijkstra”, 2)]

Open Addressing Scheme
- Instead of h : U → {0,1,…,m-1}
  - e.g. h(key) = 3
- We use an extended hash function which defines a probe sequence
  - h : U × {0,1,…,m-1} → {0,1,…,m-1}
  - e.g. h(key, 0) = 5, h(key, 1) = 9, h(key, 2) = 7, …

Insert Algorithm

    for (i = 0; i < m; i++) {
        j = h(key, i);              // i-th slot in the probe sequence of key
        if (Table[j] == NULL) {
            Table[j] = entry;       // insert entry
            break;
        }
    }
    if (i == m)
        report error “hash table overflown”
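
The insert loop above, together with the matching search, can be written out as a small C++ class. This is a sketch under assumptions rather than the course's code: the slides leave h(key, i) abstract at this point, so the sketch picks linear probing h(key, i) = (h'(key) + i) mod m with `std::hash` as h', stores strings only, and uses the made-up names `ProbingTable`, `probe`, `insert`, and `contains`. Deletion is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

class ProbingTable {
public:
    explicit ProbingTable(std::size_t m) : slots(m) { assert(m > 0); }

    // Probe sequence h(key, i) = (h'(key) + i) mod m (linear probing, step 1).
    std::size_t probe(const std::string& key, std::size_t i) const {
        return (std::hash<std::string>{}(key) % slots.size() + i) % slots.size();
    }

    // Returns false when all m probes hit occupied slots ("hash table overflown").
    bool insert(const std::string& key) {
        for (std::size_t i = 0; i < slots.size(); ++i) {
            std::size_t j = probe(key, i);
            if (!slots[j]) { slots[j] = key; return true; }
            if (*slots[j] == key) return true;  // already present
        }
        return false;
    }

    // Follow the same probe sequence; an empty slot proves the key is absent.
    bool contains(const std::string& key) const {
        for (std::size_t i = 0; i < slots.size(); ++i) {
            const auto& s = slots[probe(key, i)];
            if (!s) return false;
            if (*s == key) return true;
        }
        return false;
    }

private:
    std::vector<std::optional<std::string>> slots;  // empty optional plays the role of NULL
};
```

Search stops at the first empty slot because insert would have used that slot had the key been added; this is also why an entry cannot simply be reset to empty when it is removed.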
Desirable Property of Probe Sequence
- For any key, h(key, 0), …, h(key, m-1) is a permutation of the set {0,1,…,m-1}
- What happens if the property does not hold?
- How do we do search, BTW?

Delete
- Find where the key is
- Can’t simply remove and set the entry to NULL
  - Why?
- One solution
  - Set the deleted entry to be a special DELETED object
  - Modify insert so that a new object replaces a DELETED entry as well as a NULL entry
  - When searching, pass over DELETED entries: don’t stop!

Three typical choices for probe sequence
- Linear probing: $h(\text{key}, i) = (h'(\text{key}) + c\,i) \bmod m$
  - Good hash function h’, and c relatively prime to m (why?)
  - Causes primary clustering problem
  - Widely used due to excellent cache usage
- Quadratic probing: $h(\text{key}, i) = (h'(\text{key}) + c_1 i + c_2 i^2) \bmod m$
  - $c_2 \ne 0$ is an auxiliary constant
- Double hashing: $h(\text{key}, i) = (h_1(\text{key}) + i \cdot h_2(\text{key})) \bmod m$
  - Need $h_2(\text{key})$ relatively prime to m
  - E.g., $m = 2^{k}$ for some k, and $h_2(\text{key})$ always odd

Analysis (α < 1)
- Assuming uniform hashing:
  - Expected # of probes in an unsuccessful search is at most $1/(1-\alpha)$
  - Insertion on average takes at most $1/(1-\alpha)$ probes
  - Expected # of probes in a successful search is at most $\frac{1}{\alpha}\ln\frac{1}{1-\alpha}$

Cuckoo Hashing
- Rasmus Pagh & Flemming Friche Rodler, 2001
- A variant of open addressing
- Does not use perfect hashing
- Time:
  - O(1)-lookup time in the worst case
  - O(1)-amortized insertion time
- Space:
  - 3 words per key like BSTs
- Very competitive in practice

Cuckoo Hashing: Basic Idea

[Figure: the keys HQN, Karp, Levin, Knuth, Cantor, Dijkstra, Turing placed using two hash functions h1 and h2; when an insertion cannot be completed: Rehash! (pick new & random h1, h2)]
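
The basic idea of cuckoo hashing can be sketched as follows. This is a simplified illustration under several assumptions, not Pagh and Rodler's exact algorithm: keys are 64-bit integers, the two hash functions are an ad hoc seeded bit mixer rather than a provably suitable family, the eviction limit `kMaxKicks` is arbitrary, and the tables are doubled on rehash when they are more than half full (growth is an addition of this sketch; the slide only says to pick new random h1, h2). The names `CuckooSet`, `place`, and `rehash` are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <utility>
#include <vector>

class CuckooSet {
public:
    explicit CuckooSet(std::size_t m) : rng(12345), t1(m), t2(m) {
        assert(m > 0);
        seed1 = rng(); seed2 = rng();
    }

    // Worst-case O(1): a key can only live in one of its two slots.
    bool contains(std::uint64_t key) const {
        return t1[h(key, seed1)] == key || t2[h(key, seed2)] == key;
    }

    void insert(std::uint64_t key) {
        if (contains(key)) return;
        if (auto homeless = place(key)) rehash(*homeless);
    }

private:
    static constexpr int kMaxKicks = 64;  // arbitrary eviction limit

    // Seeded 64-bit mixer (an assumption, chosen only for illustration).
    std::size_t h(std::uint64_t key, std::uint64_t seed) const {
        std::uint64_t x = key + seed;
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
        x ^= x >> 27; x *= 0x94D049BB133111EBULL;
        x ^= x >> 31;
        return static_cast<std::size_t>(x % t1.size());
    }

    // Kick occupants back and forth between the tables.
    // Returns the key left without a slot, or nullopt on success.
    std::optional<std::uint64_t> place(std::uint64_t cur) {
        for (int kicks = 0; kicks < kMaxKicks; ++kicks) {
            auto& s1 = t1[h(cur, seed1)];
            if (!s1) { s1 = cur; return std::nullopt; }
            std::swap(cur, *s1);  // evict the occupant of table 1
            auto& s2 = t2[h(cur, seed2)];
            if (!s2) { s2 = cur; return std::nullopt; }
            std::swap(cur, *s2);  // evict the occupant of table 2
        }
        return cur;
    }

    // Rehash: pick new random hash functions and reinsert every key.
    void rehash(std::uint64_t homeless) {
        std::vector<std::uint64_t> keys{homeless};
        for (const auto& s : t1) if (s) keys.push_back(*s);
        for (const auto& s : t2) if (s) keys.push_back(*s);
        bool placed_all = false;
        while (!placed_all) {
            std::size_t m = keys.size() > t1.size() ? 2 * t1.size() : t1.size();
            t1.assign(m, std::nullopt);
            t2.assign(m, std::nullopt);
            seed1 = rng(); seed2 = rng();
            placed_all = true;
            for (std::uint64_t k : keys)
                if (place(k)) { placed_all = false; break; }
        }
    }

    std::mt19937_64 rng;
    std::vector<std::optional<std::uint64_t>> t1, t2;
    std::uint64_t seed1 = 0, seed2 = 0;
};
```

Lookup inspects exactly two slots, which is where the worst-case O(1) bound comes from; all the cost of collisions is shifted to insertion, where a long eviction chain triggers a rehash.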