Hashing
CMPSC 465 – Related to CLRS Chapter 11

I. Dictionaries and Dynamic Sets

In discrete math, we learned quite a bit about sets. Perhaps you've now "bought in" and believe that we can represent many different entities and concepts with sets. Whenever we store and search any kind of data (numbers, objects, records), those collections are closely related to sets. In the pure mathematical sense, a set doesn't change. In computer science, we're rather interested in dynamic sets, sets that can grow and shrink based on our algorithms. Here are some operations we might want to perform on a dynamic set:

- insert elements into it
- delete elements from it
- test membership in it

If we have a dynamic set that supports these operations, we call it a dictionary. Many of the elementary data structures you know from 121 and 122 – arrays, stacks, queues, linked lists – fit this generalization.

As with sorting, we assume the elements being stored are objects that have a key and satellite data, but we don't worry about the satellite data here. This relates to object-oriented programming in another way: we have two kinds of operations. Using language from CLRS,

- queries are like accessors in classes; they're the operations that return information but don't change the set
- modifying operations are (not so surprisingly…) like modifiers in classes; they change the set

Dynamic set operations include searching, inserting, deleting, finding a minimum, finding a maximum, finding a successor, and finding a predecessor. We don't always care about all of these. In this unit, we first look at hash tables, which implement dictionaries more like arrays, and then focus on tree structures. We'll support three dictionary operations: insert, delete, and search.

II. Why Hashing?

Question: What are the two basic search algorithms and what are their running times? Can we do better?
Making some reasonable assumptions, yes: we can search a hash table in O(1) time, but the worst case is still Θ(n).

A hash table is a generalization of an ordinary array. Before we proceed, note that CLRS decided to index the arrays from 0 for this topic. We will, happily (at least from my perspective), do the same.

Example 1: Let's do a very simple example of hashing called a direct-address table. Using the array below, store the keys 3, 1, 5, and 0 in the slots whose indices match the keys.

0 1 2 3 4 5 6

Example 2: Now insert 5 in the same table.

Page 1 of 11 Prepared by D. Hogan referencing materials from CLRS - Introduction to Algorithms (3rd ed.) for PSU CMPSC 465

Example 3: Now insert 12.

Let's try something different and bring in the notion of a hash function, which tells us which slot of the array to use for values we insert. In the prior example, we could have said that our hash function, h: {keys} → {array indices}, was h(k) = k for all k ∈ {keys}.

Example 4: Define a hash function h by h(k) = k mod 7 for all k ∈ {keys} and use it to store keys 3, 15, 37, and 21 in the array below.

0 1 2 3 4 5 6

Of course, we haven't solved all of the problems. If we try inserting 35 in the above table, what happens?

Definition: When a hash function maps two keys to the same slot in the hash table, we say a collision has occurred.

When we design hash tables, we have two big issues that can affect performance:

- collisions
- the choice of hash function

We'll look at these in detail as we go. Let's formalize a bit:

- We say the keys come from a universal set U = ________________________________.
- We use T as the array holding the hash table.
- Our hash function is defined as h : ________ → ____________________ s.t. h(k) is a legal slot number in T.

Question: If K is the set of keys and m is the number of slots in the hash table, when are we guaranteed to have collisions? Why?
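Example 4 and the collision question above can be made concrete in code. Below is a minimal Python sketch (the function and variable names are illustrative, not from CLRS) of the division hash h(k) = k mod 7, showing that inserting 35 collides with 21:

```python
# Sketch of the hash function from Example 4: h(k) = k mod 7.
# A real hash table would also resolve collisions; this only shows one occurring.

def h(k, m=7):
    """Map a nonnegative integer key to a slot index in [0, m)."""
    return k % m

table = [None] * 7
for key in (3, 15, 37, 21):
    table[h(key)] = key          # each of these keys lands in its own slot

print(table)                      # [21, 15, 37, 3, None, None, None]

# Inserting 35 collides: h(35) = 0, the slot already holding 21.
print(h(35) == h(21))             # True -> a collision
```

Since h maps a large universe of keys into only m = 7 slots, the pigeonhole principle guarantees collisions as soon as more than m distinct keys are hashed.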
Quickly, here are a few other notes from CLRS:

- We use a hash table when we don't want to or can't allocate a whole array with one position per possible key.
- We use a hash table when the number of keys actually stored is small relative to the number of possible keys (think about the domain from which keys are drawn).

III. Direct Address Tables, Formalized

We make these assumptions when we use direct-address tables:

- Each key is drawn from a universal set U = {0, 1, …, m-1}, where m isn't too large
- No two keys are the same

We can formalize the hash table as an array T[0..m-1]. Then, each slot T[k] in the array…

- contains a pointer to some element whose key is k, or,
- is empty, represented by NIL.

Here is pseudocode for each of the operations:

DIRECT-ADDRESS-SEARCH(T, k)
    return T[k]

DIRECT-ADDRESS-INSERT(T, x)
    T[key[x]] = x

DIRECT-ADDRESS-DELETE(T, x)
    T[key[x]] = NIL

Question: What are their running times? all O(1)

Homework: CLRS Exercise 11.1-1

IV. Collision Resolution with Chaining

There are two standard strategies for collision resolution. One is open addressing, which we'll look at later. We'll start with chaining, which tends to perform better than open addressing. The idea with chaining is…

Let's look at this via an example.

Example: Define a hash function h by h(k) = k mod 7 for all k ∈ {keys} and use it to store keys 3, 15, 38, 13, 6, and 55 in the hash table below, using chaining for collision resolution.

0 1 2 3 4 5 6

Question 1: How would we delete 6?

Question 2: How would we search for 15?

Question 3: How about searching for 4? 14? -8?
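Looking back at Section III, the direct-address pseudocode translates almost line for line into Python. A minimal sketch (the Element class and its attributes are illustrative assumptions; NIL is modeled as None):

```python
# Direct-address table: array T[0..m-1], one slot per possible key.
# All three operations run in O(1) time.

class Element:
    """Illustrative element holding a key and satellite data."""
    def __init__(self, key, data=None):
        self.key = key
        self.data = data

def direct_address_search(T, k):
    return T[k]                      # returns None (NIL) if no element has key k

def direct_address_insert(T, x):
    T[x.key] = x

def direct_address_delete(T, x):
    T[x.key] = None

m = 7
T = [None] * m
direct_address_insert(T, Element(3, "three"))
print(direct_address_search(T, 3).data)   # three
print(direct_address_search(T, 5))        # None: no element with key 5
```

Note that correctness relies on the two Section III assumptions: keys come from {0, 1, …, m-1}, and no two stored elements share a key.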
Here is pseudocode for each of the operations:

CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(key[x])]

CHAINED-HASH-DELETE(T, x)
    delete x from list T[h(key[x])]

CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]

Here are a few notes:

- To make searching work, we assume uniqueness of keys. So, CHAINED-HASH-INSERT must certainly have a precondition that x is not in some list in T. Otherwise, we'd need to do an additional search first.
- As written, the worst-case running time for CHAINED-HASH-INSERT is O(1).
- How the linked lists are implemented affects CHAINED-HASH-DELETE's performance:
  o With singly linked lists, we'd need to find the predecessor of the deleted element, so…
  o With doubly linked lists, we can immediately find the predecessor, so…
- Performance of CHAINED-HASH-SEARCH depends on list length. Much more soon…

V. Load Factor and Other Influences on Performance

Let's define some quantities we'll use in analysis of hash tables:

n =
m =
α =
For j = 0, 1, …, m-1, nj =

A way to think of the load factor α is… the average # of elements per linked list.

Trichotomy tells us that ____________________, ________________________, or ________________________.

The big question in analysis here is: How long does it take to find (an element with) key k, or to determine that (an element with) key k is not in the table?

Question: When does the worst case happen?

We're most interested in average-case behavior. Probability comes into play here. We assume simple uniform hashing, i.e., that any key is equally likely to hash into any of the m slots in the hash table. Let's also assume that we can compute the hash function h in O(1) time, so the length of the list being searched dominates.

So, using the definitions above, searching for key k involves searching list ___________, which has length _____________.
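The chaining operations from Section IV can be sketched in Python, with built-in lists standing in for the linked lists (a simplification: CLRS uses doubly linked lists so deletion of a located element is O(1), which Python's list.remove does not give us):

```python
# Hashing with chaining: slot j holds the chain of keys k with h(k) = j.
# Uses h(k) = k mod 7, matching the running example.

m = 7
T = [[] for _ in range(m)]

def h(k):
    return k % m

def chained_insert(T, k):
    T[h(k)].insert(0, k)      # insert at head: O(1); precondition: k not present

def chained_search(T, k):
    return k in T[h(k)]       # time proportional to the chain's length

def chained_delete(T, k):
    T[h(k)].remove(k)

for k in (3, 15, 38, 13, 6, 55):
    chained_insert(T, k)

print(chained_search(T, 15))  # True
print(chained_search(T, 4))   # False: chain at slot h(4) = 4 is empty
chained_delete(T, 6)
print(chained_search(T, 6))   # False after deletion
```

After the six insertions, slot 3 holds the chain [38, 3] and slot 6 holds [55, 6, 13], illustrating how collisions simply lengthen a chain rather than forcing a new slot.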
Problem: What is the average value of nj?

VI. Analysis of Hash Tables with Chaining (Average Case)

We consider two cases: unsuccessful searches and successful searches. In all of our analysis, we continue to use the definitions from the last section.

Unsuccessful Search

Let's derive the running time of an unsuccessful search:

Which slot do we use?

How many elements do we examine?

What is the total running time?

So, we get:

Theorem: A search in a hash table using simple uniform hashing and collision resolution by chaining takes ___________ time in the unsuccessful search case.

The proof of this theorem is essentially the derivation above.

Successful Search

Now, let's consider a successful search. As it turns out, while the circumstances are slightly different and the analysis is much more involved, the expected time is the same:

Theorem: A search in a hash table using simple uniform hashing and collision resolution by chaining takes ___________ time in the successful search case.

Proof: Let…

We seek to find the number of elements examined during a successful search for x. We can say…

Bringing in probability,

So, consider the expected number of elements examined in a successful search:

[long sequence manipulation]

We conclude with the total running time:

Let's interpret our results. Assume n = O(m). Given this assumption,

- the load factor is…
- average search time is…
- worst-case insertion time is…
- worst-case deletion time (for doubly linked lists) is…

Not bad at all!

Homework: CLRS Exercises 11.2-1, 11.2-2, 11.2-3
VII. Picking Hash Functions

What makes a good hash function? To answer this question, recall our prior analysis. When is the performance of hashing good and when is it bad? What was our assumption in the analysis?

- bad when everything maps to the same place
- ideally, we want to keep the chains small, with as close to one element per slot as possible
- so, ideally, we'll satisfy the simple uniform hashing assumption

In practice, we won't know in advance the probability distribution from which the keys are drawn, and keys may not be drawn independently, so it's not possible to achieve the ideal. But we can use heuristics to create a hash function that performs well and use some "tricks" from number theory.

Before we can talk about any hash functions, we must remember that hash functions take keys from ℤnonneg. When this doesn't happen naturally, we need to find a way to convert them first.

Example: Suppose we want to interpret a string numerically. We could use ASCII paired with some sort of radix notation. For the string CLRS, the ASCII values are 67, 76, 82, and 83, respectively, and there are 128 basic ASCII values. So one way to turn this string into a number is…

CLRS proposes two strategies commonly used: the division method and the multiplication method. Let's compare:

Method: Division
  General Description: For k ∈ ℤnonneg, m ∈ ℤ+, …
  Advantage:
  Disadvantage:
  Suggestions:

Method: Multiplication
  General Description: For k ∈ ℤnonneg, A ∈ ℝ s.t. 0 < A < 1, …
  Advantage:
  Disadvantage:
  Suggestions:

Example: For k = 91 and m = 20, the division method hashes k to…

VIII. A Taste of Universal Hashing

Consider this scenario: A malicious adversary gets to choose the keys to be hashed and knows your hash function. This could force worst-case performance. How can we thwart this attack?
- use a different hash function each time
- pick randomly… from a set of good candidates

We define a finite collection H of hash functions that map a universe U of keys into {0, 1, …, m-1} as universal if, for each pair of keys k1, k2 ∈ U (where k1 ≠ k2), the number of hash functions h ∈ H for which h(k1) = h(k2) is at most |H| / m.

Why is this good? The probability of a collision is no more than the 1/m chance we'd get by randomly and independently choosing 2 slots.

Using universal hashing and chaining, we can maintain searches that run in expected constant time.

IX. Collision Resolution via Open Addressing

Another common collision resolution technique is open addressing. Here we store the keys in the hash table itself instead of in linked lists. Slots that don't contain keys contain NIL instead. When we try to insert into a slot that is occupied, we must then find a new slot. This process of checking slots is called probing. The most basic kind of probing is linear probing, which essentially means…

Let's look at this via some (familiar) examples before formalizing it.

Example: Define a hash function h by h(k) = k mod 7 for all k ∈ {keys} and use it to store keys 3, 15, 38, 13, 6, and 55 in the hash table below, using linear probing for collision resolution.

0 1 2 3 4 5 6

Question 1: How would we search for 38?

Question 2: How about searching for 4? 14?

Question 3: What happens when we try to insert 70 into the hash table above? …and then 77?

Let's note a few summary issues:

- We begin either operation by computing our hash function, going to the desired slot, and then going to the next available slot(s) as necessary.
- When we go past the end of the array, we wrap around to the beginning (easily implemented with modular arithmetic).
- In searching, we "give up" when we find a NIL slot or we have checked all m slots.
- In insertion, we stop probing when no slots remain (and report failure).

Okay, now that we have the idea for insertion and searching, there's one more issue: deletion. Chaining made this easy, but it's trickier with open addressing. Empty slots contain NIL, but if we were to replace a value to be deleted with NIL, we'd run into a problem. Say that some slot, j, had had an element in it that has been deleted by storing NIL in slot j. To be concrete, suppose in the example, we deleted 6 this way. What happens when we search for 55?

A solution to the deletion problem is to use DELETED instead of NIL to tag deleted elements. Then…

- Searching treats DELETED as…
- Insertion treats DELETED as…

There's one other very important detail about all open addressing: the hash function must take a second argument to tell which probe attempt we're on. Formally, now, the hash function is

h : ______________________________________________________ → ___________________________________

It's helpful to maintain the usual hash function we had before with one argument; we'll call that an auxiliary hash function h'. Then, we briefly note three kinds of probing:

- Linear Probing defines the hash function, given key k and probe number i (0 ≤ i < m), as
  h(k, i) = (h'(k) + i) mod m
- Quadratic Probing defines the hash function, given key k and probe number i (0 ≤ i < m), as
  h(k, i) = (h'(k) + c1·i + c2·i²) mod m   (where c1 and c2 are nonzero constants)
- Double Hashing uses two different auxiliary hash functions h1 and h2 and defines the hash function, given key k and probe number i (0 ≤ i < m), as
  h(k, i) = (h1(k) + i·h2(k)) mod m

The book provides more formality and pseudocode for open addressing.

X. Analysis of Open Addressing, Briefly

(We will not analyze open addressing rigorously like we did with chaining.
See CLRS Section 11.4 if you are interested in more rigor and proof for this topic. The analysis is a nice application of conditional probability.)

We make the following assumptions in the analysis of hashing with open addressing:

- We again use a load factor α.
- We assume the table never completely fills, so ____________________________ ⇒ _________________________
- We assume uniform hashing.
- No deletion.
- In a successful search, each key is ___________________________________ to be searched for.

So then, we have the following theorems:

- The expected number of probes in an unsuccessful search is
- The expected number of probes in a successful search is

Problem: How many probes are expected in an unsuccessful search when a hash table is half full?

Problem: How many probes are expected in an unsuccessful search when a hash table is 90% full?

Homework: CLRS Exercises 11.3-3, 11.3-4, 11.4-1
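To tie the open addressing material together, here is a minimal Python sketch of linear probing with the DELETED tag from the deletion discussion (the sentinel object and function names are illustrative choices, not CLRS's exact code). It replays the running example: insert 3, 15, 38, 13, 6, and 55, delete 6, then search for 55.

```python
# Open addressing with linear probing: h(k, i) = (h'(k) + i) mod m,
# with auxiliary hash h'(k) = k mod m, matching the linear-probing example.
# A DELETED sentinel keeps probe sequences intact after deletions.

m = 7
NIL = None
DELETED = object()          # unique tag, never equal to any key
T = [NIL] * m

def probe(k, i):
    return (k % m + i) % m

def oa_insert(T, k):
    for i in range(m):
        j = probe(k, i)
        if T[j] is NIL or T[j] is DELETED:   # insertion treats DELETED as free
            T[j] = k
            return j
    raise RuntimeError("hash table overflow")

def oa_search(T, k):
    for i in range(m):
        j = probe(k, i)
        if T[j] is NIL:                      # searching gives up at NIL...
            return None
        if T[j] == k:                        # ...but probes past DELETED slots
            return j
    return None

def oa_delete(T, k):
    j = oa_search(T, k)
    if j is not None:
        T[j] = DELETED                       # tag the slot; don't blank it to NIL

for k in (3, 15, 38, 13, 6, 55):
    oa_insert(T, k)

oa_delete(T, 6)
print(oa_search(T, 55))      # 2: still found, probing past the DELETED slot
```

Had oa_delete stored NIL instead of DELETED, the search for 55 would have stopped early at slot 0 and wrongly reported 55 absent, which is exactly the deletion problem raised in the notes.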