Hash Tables
• Hash tables
• Dealing with raw bytes
• Some probabilistic analysis

Motivations
• Balanced search trees
  – Store (key, value)-pairs
  – O(log n)-time search, insert, delete, max, min
  – O(log n + |output|)-time range query
  – Relatively complex implementation
• Can we improve running times for basic operations?

5/28/2016 CSE 250, SUNY Buffalo, @Hung Q. Ngo

Example
• Given a set of (key, value)-pairs
  – Keys are in the set {0, 1, 2, 3, …, n-1}
  – Values could be anything
• Store them in an array (Direct Access Table)

  Index:  0    1     2    3     …   n-2      n-1
  Value:  v0   NULL  v2   NULL  …   v(n-2)   v(n-1)

• Search/insert/delete takes O(1)-time

What are the drawbacks?
• Unlikely to see a “good” key set in practice
  – (Key, Value) = (English Word, Meaning)
  – (Key, Value) = (function, address)
  – (Key, Value) = (URL, IP address)
• Even if keys were non-negative integers, there might be lots of NULL entries
  – Wastes lots of space because n >> # pairs
  – Say keys are 8-byte integers, then n = $2^{64}$
• Can’t do range queries efficiently

THE STATIC KEY SET CASE
- Dictionary
- Reserved words in a programming language
- Command names in an operating system
- File names on a CD-ROM
- Lazy array & minimal perfect hashing

Static Key Set
• Wanted: data structure for an online dictionary
• With what we know so far, what are the choices?
  – Sorted array, use binary search
  – Use a balanced BST such as an RB tree or an AVL tree
  – Randomize the keys and insert them one by one into a normal BST or a splay tree
• Sorting + binary search is the best of the three options (with caveats)
  – Search still takes O(log n)-time
  – Can we do better?
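The "sorted array + binary search" option above can be sketched as follows. This is only a minimal illustration, not code from the course: the class name `SortedDict` and its interface are hypothetical, and keys and values are assumed to be strings.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical static dictionary: sort once, then answer lookups by binary search.
class SortedDict {
public:
    using Entry = std::pair<std::string, std::string>;  // (key, value)

    // Build step: O(n log n) comparisons to sort the pairs by key.
    explicit SortedDict(std::vector<Entry> entries) : entries_(std::move(entries)) {
        std::sort(entries_.begin(), entries_.end(),
                  [](const Entry& a, const Entry& b) { return a.first < b.first; });
    }

    // Search step: O(log n) key comparisons.
    // Returns a pointer to the value, or nullptr if the key is absent.
    const std::string* search(const std::string& key) const {
        auto it = std::lower_bound(
            entries_.begin(), entries_.end(), key,
            [](const Entry& e, const std::string& k) { return e.first < k; });
        if (it != entries_.end() && it->first == key) return &it->second;
        return nullptr;
    }

private:
    std::vector<Entry> entries_;  // sorted by key after construction
};
```

Because the key set is static, the sort cost is paid once; every later `search` still needs O(log n) string comparisons, which is the cost the following slides try to beat.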
A Wild Solution
• Every key is a series of 0s and 1s
• There is always some integer m such that, for all practical purposes
  – A key K is an (≤ m)-bit number bin-rep(K)
  – ASCII: bin-rep(“cse250”) = 0x637365323530
  – Use this m-bit number as an index to an array of values
• What’s the longest non-technical English word?
  – Floccinaucinihilipilification (29 characters)
• What’s the longest technical English word?
  – Pneumonoultramicroscopicsilicovolcanoconiosis (a disease)

The solution is wild in at least two ways
• So, say m = 8 × 30 = 240 bits
  – Need an array A of $2^{240} \approx 1.77 \times 10^{72}$ elements
• Even if we had that much memory space, there is still one major problem
  – A[x] initialized to NULL for all x from 0 to $2^{240} - 1$
  – NULL is just 0
  – Dictionary has n = 150000 = $15 \times 10^{4}$ words, say
  – Initializing the data structure takes ≥ $n^{13}$ steps
  – The O(n log n) sorting + binary search looks great!

The lazy array data structure
• Sequentially read inputs into an array Dict of size n
  – Dict[i] is the i’th (word, value) pair
• Insert all words into the huge (uninitialized) array A
  – for (i=0; i<n; ++i) A[Dict[i].word] = i
• search(x): (x is a word)
  – If (0 ≤ A[x] ≤ n-1 && Dict[A[x]].word == x)
    • Return Dict[A[x]].value
  – Else
    • Return false
• We can even delete(x) in O(1)-time
  – Just set Dict[A[x]] = NULL

Lazy array, an illustration
[Figure: the array Dict holds 0: (“UB”, “University at Buffalo”), 1: (“Mark Twain”, “Great writer”), 2: (“cse250”, “boring course”), 3: NULL, …, n-1: (“Matrix”, “Best Scifi Movie”). In the huge array A, slot 0x637365323530 (“cse250”) holds 2 and slot 0x4D6174726978 (“Matrix”) holds n-1. The never-initialized slot 0x637365323531 (“cse251”) is set to 2 by accident; search(“cse251”) still fails because Dict[2].word is “cse250”.]

Major drawback and an inspiration
• Used a humongous amount of space
  – Typically n << $2^{b}$, where b = # bits needed to represent the longest word
• However, if there was a function h such that
  – 0 ≤ h(word) ≤ n-1
  – And, for any two distinct words x & y, h(x) ≠ h(y)
• Then, we’re (almost) in good shape

Sample Input, n=6

  Key        Hash code (using ASCII), in hex
  “Adam”     4164616D
  “Ashley”   4173686C6579
  “Daniel”   44616E69656C
  “Kayla”    4B61796C61
  “Mike”     4D696B65
  “Troy”     54726F79

Function h(hash_code) ∈ {0,1,2,3,4,5} gives the index into a table with columns (Index, Value) and rows 0, 1, …, 5.

What was that function h?

    int h(int a) {
        a = (((a%256)%100)%41)%10;
        a = (a*a)%14;
        return (a>2) ? a-4 : a;
    }

Took me ½ hour to come up with this stupid function

(Minimal) Perfect Hash Function
• S: the set of n (hash codes of) keys
  – S = { 0x4164616D, 0x4173686C6579, 0x44616E69656C, 0x4B61796C61, 0x4D696B65, 0x54726F79 } in the example above
• h: S → {0,1,…,n-1} is an MPHF if it is a bijection
• We want to
  – Find such a function h (in a short amount of time)
  – Maybe store the function … in a data structure!
  – Evaluate h(code) in O(1)-time
• Possible, but a little bit complicated
  – http://cmph.sourceforge.net/
  – http://www.gnu.org/software/gperf/

HASHING – GENERAL IDEAS
- Hashing first proposed by Arnold Dumey (1956)
- Hash function = Hash codes + Compression function
- Separate chaining
- Open addressing, linear probing, quadratic probing

Top level view
[Figure: h(object) is computed in two stages. Arbitrary objects (strings, doubles, ints), of which n are actually used, are mapped by a hash code to an int with a wide range; a compression function then maps that int into {0,1,…,m-1}. We will also call the compression function a hash function.]

Good Hash Function
• If key1 ≠ key2, then it’s extremely unlikely that h(key1) = h(key2)
  – Collision problem!
• Constructing the function h takes little time
• Given key, computing h(key) takes O(|key|)-time

Collision unavoidable for unknown key set
• Pigeonhole principle
  – K+1 pigeons, K holes ⇒ at least one hole with ≥ 2 pigeons
• There are many more objects in the universe than m
  – Object set = set of strings of length ≤ 30 characters
  – Object set = set of possible URLs
  – Object set = set of possible file names on a CD-ROM
  – While the range size m is something like a few hundred thousand or less

Hash codes for int-style types
• Say we want hash codes to map to (4-byte) ints
• Easy when objects =
  – short, or int, or char or unsigned char
  – Simply cast them to uint32_t
• What about when objects =
  – long int (8-byte integers)
[Figure: the 8 bytes x0 x1 x2 x3 x4 x5 x6 x7 of the object must be mapped to the 4 bytes y0 y1 y2 y3 of the hash code.]

Casting Down, Lose Information

    #include <iostream>
    using namespace std;

    // Assumes unsigned long is 8 bytes and unsigned int is 4 bytes.
    unsigned int hash_code1(unsigned long a) {
        return static_cast<unsigned int>(a);  // keeps only the low-order 4 bytes
    }

    int main() {
        unsigned long a = 0x8888888877777777;
        unsigned long b = 0x1111111177777777;
        cout << hex << a << " converted to " << hash_code1(a) << endl;
        cout << hex << b << " converted to " << hash_code1(b) << endl;
        return 0;
    }

Output:

    8888888877777777 converted to 77777777
    1111111177777777 converted to 77777777

Drawback of Casting from long to int
• We ignore the high-order 4 bytes of information
• If key1 and key2 differ only in the high-order 4 bytes, they will collide!
• On the other hand, if keys are uniformly distributed, we are OK.
• Could also sum the 1st 4 bytes with the 2nd 4 bytes

Hash codes for strings & variable length objects
• Say we have a universe of character array objects
  – “Computer Science”
  – “Floccinaucinihilipilification”
  – “Alan Turing”
  – …
• How do we produce 4-byte hash codes for them?
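As one possible answer to the question above, here is a hedged sketch of two string hash codes, plus the "sum the two halves" idea for 8-byte integers. The function names `poly31`, `fnv1a`, and `fold64` are made up for this illustration; the FNV constants are the standard 32-bit FNV-1a offset basis and prime, and the multiplier 31 is the one Java's String.hashCode() uses.

```cpp
#include <cstdint>
#include <string>

// Polynomial hash code with multiplier 31.
// uint32_t arithmetic wraps around modulo 2^32, which is what we want here.
std::uint32_t poly31(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) h = 31u * h + c;
    return h;
}

// 32-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;              // FNV prime
    }
    return h;
}

// Folding an 8-byte integer: add the high-order 4 bytes to the low-order 4 bytes,
// so that keys differing only in their high-order half no longer collide.
std::uint32_t fold64(std::uint64_t a) {
    return static_cast<std::uint32_t>(a >> 32) + static_cast<std::uint32_t>(a);
}
```

With the two values from the casting example, `fold64(0x8888888877777777)` gives 0xFFFFFFFF and `fold64(0x1111111177777777)` gives 0x88888888, so they no longer collide. Whether any of these functions spreads a particular key set well still depends on the data.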
Hash codes for strings (or byte-sequences)
• Add up the characters
• XOR 4 bytes at a time
• Polynomial hash codes
• Shifting hash codes
• FNV hash
• MurmurHash
• Etc.
• Important Lesson: data-dependency!

Some experimental results

  Hash code function into uint32_t   # of collisions   Max bucket size
  Sum                                1730769           175
  Xor                                583               3
  Shift7                             56                2
  Poly31                             22                2
  Poly33                             22                2
  FNV                                0                 1

FNV (or FNV1a) is widely used, in DNS & Twitter, for example
Poly31 is used in the Java hashCode() method
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#hashCode%28%29

Compression functions
• 4-byte hash codes can’t be used as indices
  – $2^{32}$ = 4,294,967,296 ≈ $4 \times 10^{9}$ is too many
• Store n entries, need indices in {0,1,…,m-1}
  – m should be close to n (say n = 150K, m = 200K)
• Compression function
  – f: uint32_t → {0,1,…,m-1}
  – Division method
  – Multiplication method
  – Universal hashing
Compression functions are hash functions, and thus their methods can be used to design hash codes too!

Hash/compression function design problem
• Universe U = all uint32_t integers
• S, an unknown subset of n members of U
• Find f : U → {0,1,…,m-1}
  – Computing f(u) is fast
  – Minimize collisions
• Note: suppose |U| > m ≥ n
  – For a fixed S, there always exists f with no collisions
  – For a fixed f, there always exists S with lots of collisions
• If S’s distribution is truly arbitrary
  – The best f is such that f(s) is uniformly distributed on {0,…,m-1}
  – Balls-into-Bins model: throw n “balls” randomly into m “bins”

An analogy – the birthday problem
• U = 7 billion people in the world
• S = set of students in this room
• f : { students } → {Jan 01, …, Dec 31}
  – So m = 365 (forget leap years)
• Question:
  – If S is chosen randomly from U, how large must S be until it is more likely to have a collision than not?
• This is called the birthday “paradox”

Birthday paradox
• Say there are n students in this room
• Prob[1st student does not “collide”] = 1
• Prob[2nd student does not “collide”] = 1 - 1/m
• Prob[3rd student does not “collide” | first two didn’t collide] = 1 - 2/m
• …
• Overall probability of no collision is $\prod_{i=1}^{n-1} (1 - i/m) = (1 - 1/m)(1 - 2/m) \cdots (1 - (n-1)/m)$, which is < ½ when n = 23 and m = 365
• When n = 30, Prob[no collision] ≈ 30%

Rarity of Minimal Perfect Hash Function
• Consider |U| = N, m = n; an MPHF is a bijection from S to {0,1,…,n-1}
• Number of functions from U to {0,1,…,n-1} is $n^{N}$
• For a fixed (but unknown) S of size n
  – The number of functions that are MPHFs for S is $n! \, n^{N-n}$
  – Hence, the fraction of functions which are MPHFs is $n! \, n^{N-n} / n^{N} = n! / n^{n}$
  – When n = 10, the ratio is 0.00036…
  – When n = 20, the ratio is $2.32 \times 10^{-8}$

Division method
• f(u) = u mod m
• How does this function perform for different m?
• The answer depends a lot on the distribution of S in the universe U

m from 50K to 60K
• Total # collisions: 19K
• Max bucket size 6-8, typically
• Recall n ≈ 47K
• Could we have guessed this result without coding?
  – Something in the spirit of the birthday paradox?
  – Motto: Think, then code!

Balls into Bins
• Throw n balls into m bins randomly
• Probability a given bin is empty is $(1 - 1/m)^{n} \approx e^{-n/m}$
• Expected number of empty bins is $m e^{-n/m}$
• It can be shown mathematically that, when m ≈ n, the maximum bin size is $\Theta(\ln n / \ln \ln n)$ with high probability

These estimates are incredibly good!
• n = 47000, m = 50000
• $m e^{-n/m} \approx 19000$
• And $\ln n / \ln \ln n \approx 4.5$, the same order as the observed maximum bucket size
• You can repeat the experiment with m ≈ 100K

Multiplication method – slightly better!
• f(u) = $\lfloor m \cdot (uA \bmod 1) \rfloor$, where $uA \bmod 1$ is the fractional part of uA
• A common choice for A is the golden ratio conjugate, $A = (\sqrt{5} - 1)/2 \approx 0.618$
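The two compression functions can be sketched as below. This is an illustrative version under stated assumptions, not necessarily what was used for the experiments: the names `compress_div` and `compress_mul` are hypothetical, and the multiplication method is done in 32-bit fixed point, with the constant 2654435769 = floor(2^32 · (√5 − 1)/2) standing in for the golden-ratio multiplier.

```cpp
#include <cassert>
#include <cstdint>

// Division method: f(u) = u mod m.
std::uint32_t compress_div(std::uint32_t u, std::uint32_t m) {
    assert(m > 0);
    return u % m;
}

// Multiplication method: f(u) = floor(m * frac(u * A)), A = (sqrt(5) - 1) / 2.
// frac(u * A) is approximated as a 32-bit fixed-point fraction:
// (u * A32) mod 2^32 represents frac(u * A) * 2^32.
std::uint32_t compress_mul(std::uint32_t u, std::uint32_t m) {
    assert(m > 0);
    const std::uint32_t A32 = 2654435769u;        // floor(2^32 * A)
    std::uint32_t frac = u * A32;                 // wraps modulo 2^32
    // Scale the fraction by m; the 64-bit product shifted right by 32 is always < m.
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(frac) * m) >> 32);
}
```

Unlike the division method, the multiplication method does not need m to be prime; for example, `compress_mul(1, 65536)` yields 40503 with a power-of-two table size.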
Universal Hashing
• An adversary can always pick a key set S which creates Θ(n) collisions
  – Denial of Service attack (more later!)
• Universal hashing approach
  – Design a family H of hash functions such that, for any k ≠ k’, $\Pr_{h \in H}[h(k) = h(k')] \le 1/m$
  – Pick a hash function h in H uniformly at random
  – Note that the key set is chosen by the adversary

Theoretical Results
• α = n/m is the load factor of the hash table
• Expected bucket size is at most 1 + α
• Fact: when the universal family is n-independent, the hash values behave almost as if we throw balls randomly and independently into bins

Additional notes
• Table size m: should choose a prime for mod compression
  – If m is even, then (even key) % m is even
  – Objects in computer memory often start at even addresses
  – If m is a power of 2 then we effectively mod out the high-order bits
• Lost the relative order of keys
  – Can’t answer queries such as:
    • “what are the keys (& associated values) in between 3 and 432?”
    • “list the smallest k keys”
  – BSTs handle those just fine!

COLLISION RESOLUTION
- Separate chaining
- Open addressing
- Cuckoo hashing

Separate Chaining
[Figure: a table of slots (Index, Pointer) for indices 0 to 4; each slot points to a linked list of the keys hashed to it. The keys shown are Turing, Cantor, Knuth, Karp, and Dijkstra.]

Performance
• Under the simple uniform hashing assumption
  – i.e. each object is hashed into a given bucket with probability 1/m, independently of the other objects
• Expected search time Θ(1+α)
• Worst-case search time Θ(n)
  – though very unlikely
• Using universal hashing, the expected time for n operations is O(n), provided α = O(1)
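A separate-chaining table along the lines of the slides above might look like the following minimal sketch. The class name `ChainedTable` is hypothetical, keys and values are assumed to be strings, and `std::hash` followed by `% m` stands in for the hash code and compression function; the table does not resize.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical separate-chaining hash table with a fixed number of buckets m (m > 0).
class ChainedTable {
public:
    explicit ChainedTable(std::size_t m) : buckets_(m) {}

    // Insert or overwrite; cost is proportional to the length of one chain.
    void insert(const std::string& key, const std::string& value) {
        auto& chain = buckets_[index(key)];
        for (auto& kv : chain)
            if (kv.first == key) { kv.second = value; return; }
        chain.emplace_back(key, value);
        ++size_;
    }

    // Returns a pointer to the value, or nullptr if the key is absent.
    const std::string* search(const std::string& key) const {
        for (const auto& kv : buckets_[index(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    // Removes the key if present; returns whether anything was removed.
    bool erase(const std::string& key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (it->first == key) { chain.erase(it); --size_; return true; }
        return false;
    }

    // alpha = n / m, the load factor from the slides.
    double load_factor() const {
        return static_cast<double>(size_) / static_cast<double>(buckets_.size());
    }

private:
    // Hash code (std::hash) followed by division-method compression.
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, std::string>>> buckets_;
    std::size_t size_ = 0;
};
```

Every operation walks a single chain, which is where the Θ(1+α) expected search time comes from; if all keys land in one bucket, the same code degrades to a linear scan.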
Denial of Service Attacks
• http://events.ccc.de/congress/2011/Fahrplan/events/4680.en.html
• http://www.ocert.org/advisories/ocert-2011-003.html
• http://permalink.gmane.org/gmane.comp.security.full-disclosure/83694
• “Hash tables are a commonly used data structure in most programming languages. Web application servers or platforms commonly parse attacker-controlled POST form data into hash tables automatically, so that they can be accessed by application developers. If the language does not provide a randomized hash function or the application server does not recognize attacks using multi-collisions, an attacker can degenerate the hash table by sending lots of colliding keys. The algorithmic complexity of inserting n elements into the table then goes to O(n**2), making it possible to exhaust hours of CPU time using a single HTTP request.”

Poor Hash Choices Brought Down msn.com
http://blogs.msdn.com/b/ericlippert/archive/2003/09/19/arrrrr-cap-n-eric-be-learnin-about-threadin-the-harrrrd-way.aspx

BTW
• You can do Separate Treeing too!
• They don’t teach that in school
• What’s the performance of your hash table then?
• What’s the drawback?

Open Addressing
• Store all entries in the hash table itself, no pointer to the “outside”
• Advantage
  – Less wasted space
  – Perhaps good cache usage
• Disadvantage
  – More complex collision resolution
  – Slower operations

Open Addressing
[Figure: a table with slots 0 to 7 that stores Turing, Cantor, Knuth, Karp, and Dijkstra directly in the slots. The probe sequences shown are h(“Knuth”, 0), h(“Knuth”, 1); h(“Karp”, 0), h(“Karp”, 1); and h(“Dijkstra”, 0), h(“Dijkstra”, 1), h(“Dijkstra”, 2).]

Open Addressing Scheme
• Instead of h : U → {0,1,…,m-1}
  – e.g. h(key) = 3
• We use an extended hash function which defines a probe sequence
• h : U × {0,1,…,m-1} → {0,1,…,m-1}
  – e.g. h(key, 0) = 5, h(key, 1) = 9, h(key, 2) = 7, …

Insert Algorithm

    for (i=0; i<m; i++) {
        j = h(key, i);
        if (Table[j] == NULL) {
            insert entry;
            break;
        }
    }
    if (i == m)
        report error “hash table overflow”

Desirable Property of Probe Sequence
• For any key, h(key, 0), …, h(key, m-1) is a permutation of the set {0,1,…,m-1}
• What happens if the property does not hold?
• How do we search, BTW?

Delete
• Find where the key is
• Can’t simply remove it and set the entry to NULL
  – Why?
• One solution
  – Set the deleted entry to be a special DELETED object
  – Modify insert so that a new object replaces a DELETED entry as well as a NULL entry
  – When searching, pass over DELETED entries – don’t stop!

Three typical choices for probe sequence
• Linear probing -- $h(key, i) = h'(key) + c \cdot i \pmod{m}$
  – Good hash function h’, and c relatively prime to m (why?)
  – Causes the primary clustering problem
  – Widely used due to excellent cache usage
• Quadratic probing -- $h(key, i) = h'(key) + c_1 i + c_2 i^2 \pmod{m}$
  – $c_2 \ne 0$ is an auxiliary constant
• Double hashing -- $h(key, i) = h_1(key) + i \cdot h_2(key) \pmod{m}$
  – Need $h_2(key)$ relatively prime to m
  – E.g., $m = 2^{k}$ for some k, and $h_2(key)$ always odd

Analysis (α < 1)
Under the uniform hashing assumption (every probe sequence is equally likely to be any permutation of the slots):
• Expected # of probes in an unsuccessful search is at most $1/(1 - \alpha)$
• Insertion on average takes at most $1/(1 - \alpha)$ probes
• Expected # of probes in a successful search is at most $\frac{1}{\alpha} \ln \frac{1}{1 - \alpha}$
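The insert, search, and delete rules for open addressing can be sketched with linear probing (c = 1) and a DELETED marker as below. This is a minimal illustration, not the course's implementation: the class name `LinearProbingSet` is hypothetical, only keys are stored, `std::hash` stands in for h', and the table never resizes.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical open-addressing set of strings using linear probing:
// h(key, i) = (h'(key) + i) mod m, with m > 0 slots.
class LinearProbingSet {
public:
    explicit LinearProbingSet(std::size_t m) : slots_(m) {}

    // Returns false if every slot on the probe sequence is occupied (overflow).
    bool insert(const std::string& key) {
        if (contains(key)) return true;            // already present
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[probe(key, i)];
            if (s.state != State::Occupied) {      // Empty or Deleted slots are reusable
                s.key = key;
                s.state = State::Occupied;
                return true;
            }
        }
        return false;
    }

    bool contains(const std::string& key) const { return find(key) != npos; }

    // Mark the slot as Deleted instead of Empty so later searches do not stop early.
    bool erase(const std::string& key) {
        std::size_t j = find(key);
        if (j == npos) return false;
        slots_[j].state = State::Deleted;
        return true;
    }

private:
    enum class State { Empty, Occupied, Deleted };
    struct Slot { std::string key; State state = State::Empty; };
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    std::size_t probe(const std::string& key, std::size_t i) const {
        return (std::hash<std::string>{}(key) % slots_.size() + i) % slots_.size();
    }

    // Follow the probe sequence; a never-used (Empty) slot ends the search,
    // while Deleted slots are passed over.
    std::size_t find(const std::string& key) const {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            std::size_t j = probe(key, i);
            if (slots_[j].state == State::Empty) return npos;
            if (slots_[j].state == State::Occupied && slots_[j].key == key) return j;
        }
        return npos;
    }

    std::vector<Slot> slots_;
};
```

With c = 1 the probe sequence visits every slot exactly once, so it satisfies the permutation property. If `erase` set a slot back to Empty, a key that had probed past that slot would become unreachable, which is the answer to the "Why?" on the Delete slide.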
Cuckoo Hashing
- Rasmus Pagh & Flemming Friche Rodler, 2001
- A variant of open addressing
- Does not use perfect hashing
- Time:
  - O(1) lookup time in the worst case
  - O(1) amortized insertion time
- Space:
  - 3 words per key, like BSTs
- Very competitive in practice

Cuckoo Hashing – Basic Idea
[Figure: two hash functions h1 and h2 give each key two candidate slots; the keys shown are Karp, HQN, Levin, Knuth, Cantor, Dijkstra, and Turing. When an insertion cannot be completed: Rehash! (pick new & random h1, h2).]
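The basic idea can be sketched as follows. This is a simplified illustration under several assumptions that are not from the slides: the class name `CuckooSet` is hypothetical, h1 and h2 are seeded FNV-1a-style functions rather than a provably good random family, the eviction limit is an arbitrary bound, and each rehash also doubles the table size so that the sketch is guaranteed to make progress.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cuckoo hash set: two tables, two hash functions, O(1) worst-case lookup.
class CuckooSet {
public:
    explicit CuckooSet(std::size_t m = 8) : m_(m), t1_(m), t2_(m), rng_(250) { reseed(); }

    // A key can only live in t1_[h1(key)] or t2_[h2(key)], so two probes suffice.
    bool contains(const std::string& key) const {
        const Slot& a = t1_[h(key, seed1_)];
        const Slot& b = t2_[h(key, seed2_)];
        return (a.used && a.key == key) || (b.used && b.key == key);
    }

    void insert(std::string key) {
        if (contains(key)) return;
        if (!try_insert(key)) rehash(std::move(key));  // key now names the homeless item
    }

private:
    struct Slot { std::string key; bool used = false; };

    // Seeded FNV-1a-style hash (an assumption standing in for "random h1, h2").
    std::size_t h(const std::string& s, std::uint64_t seed) const {
        std::uint64_t x = seed;
        for (unsigned char c : s) { x ^= c; x *= 1099511628211ULL; }
        x ^= x >> 32;
        return static_cast<std::size_t>(x % m_);
    }

    void reseed() { seed1_ = rng_(); seed2_ = rng_(); }

    // Place key, evicting occupants back and forth between the two tables.
    // On failure, `key` holds whichever item is still without a slot.
    bool try_insert(std::string& key) {
        for (std::size_t kicks = 0; kicks < 2 * m_ + 8; ++kicks) {
            Slot& a = t1_[h(key, seed1_)];
            if (!a.used) { a.key = key; a.used = true; return true; }
            std::swap(key, a.key);                 // evict occupant of table 1
            Slot& b = t2_[h(key, seed2_)];
            if (!b.used) { b.key = key; b.used = true; return true; }
            std::swap(key, b.key);                 // evict occupant of table 2
        }
        return false;
    }

    // Rehash: collect all keys, pick new h1 and h2 (and, in this sketch, double m),
    // then reinsert everything; repeat until every key finds a slot.
    void rehash(std::string pending) {
        std::vector<std::string> keys;
        keys.push_back(std::move(pending));
        for (Slot& s : t1_) if (s.used) keys.push_back(std::move(s.key));
        for (Slot& s : t2_) if (s.used) keys.push_back(std::move(s.key));
        bool ok = false;
        while (!ok) {
            m_ *= 2;
            t1_.assign(m_, Slot{});
            t2_.assign(m_, Slot{});
            reseed();
            ok = true;
            for (std::string k : keys)             // copy: try_insert may swap k
                if (!try_insert(k)) { ok = false; break; }
        }
    }

    std::size_t m_;
    std::vector<Slot> t1_, t2_;
    std::mt19937_64 rng_;
    std::uint64_t seed1_ = 0, seed2_ = 0;
};
```

The lookup never loops, which is the O(1) worst-case guarantee from the slide; all the cost of collisions is pushed into `insert`, where a long eviction chain triggers the rehash.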
