Hashing
CMPSC 465 – Related to CLRS Chapter 11
I. Dictionaries and Dynamic Sets
In discrete math, we learned quite a bit about sets. Perhaps you’ve now “bought in” and believe that we can represent many
different entities and concepts with sets. When we store and search any kind of data (numbers, objects, records),
those collections are closely related to sets.
In the pure mathematical sense, a set doesn’t change. In computer science, we’re rather interested in dynamic sets, sets that
can grow and shrink based on our algorithms.
Here are some operations we might want to perform on a dynamic set:
• insert elements into it
• delete elements from it
• test membership in it
If we have a dynamic set that supports these operations, we call it a dictionary. Many of the elementary data structures you
know from 121 and 122 – arrays, stacks, queues, linked lists – fit this generalization.
As with sorting, we assume elements being stored are objects that have a key and satellite data, but we don’t worry about the
satellite data here.
This relates to object-oriented programming in another way: we have two kinds of operations. Using language from CLRS,
• queries are like accessors in classes; they’re the operations that return information but don’t change the set
• modifying operations are (so surprisingly…) like modifiers in classes; they change the set
Dynamic set operations include searching, inserting, deleting, finding a minimum, finding a maximum, finding a successor,
and finding a predecessor. We don’t always care about all of these.
In this unit, we first look at hash tables, which implement dictionaries more like arrays, and then focus on tree structures.
We’ll support three dictionary operations: insert, delete, and search.
II. Why Hashing?
Question: What are the two basic search algorithms and what are their running times?
Can we do better?
making some reasonable assumptions, yes, we can search a hash table in O(1) time
but the worst case is still Θ(n)
A hash table is a generalization of an ordinary array. Before we proceed, note that CLRS decided to index the arrays from 0
for this topic. We will, happily (at least from my perspective), do the same.
Example 1: Let’s do a very simple example of hashing called a direct-address table. Using the array below, store the keys
3, 1, 5, and 0 in the slots whose indices match the keys.
slot:   0    1    2    3    4    5    6
key:   ___  ___  ___  ___  ___  ___  ___
Example 2: Now insert 5 in the same table.
Example 3: Now insert 12.
Let’s try something different and bring in the notion of a hash function, which tells us which slot of the array to use for
values we insert. In the prior example, we could have said that our hash function, h : {keys} → {array indices}, was h(k) = k
for all k ∈ {keys}.
Example 4: Define a hash function h by h(k) = k mod 7 for all k ∈ {keys} and use it to store keys 3, 15, 37, and 21 in the
array below.
slot:   0    1    2    3    4    5    6
key:   ___  ___  ___  ___  ___  ___  ___
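As a quick check of where these keys land, here is a minimal Python sketch (my own illustration; the function name is not from CLRS):

    def h(k):
        return k % 7                     # Example 4's hash function

    table = [None] * 7                   # 7 slots, indices 0..6; None plays the role of NIL
    for key in [3, 15, 37, 21]:
        table[h(key)] = key              # these four keys happen not to collide

    print(table)                         # [21, 15, 37, 3, None, None, None]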
Of course, we haven’t solved all of the problems. If we try inserting 35 in the above table, what happens?
Definition: When a hash function maps two keys to the same slot in the hash table, we say a collision has occurred.
When we design hash tables, we have two big issues that can affect performance:
• collisions
• the choice of hash function
We’ll look at these in detail as we go.
Let’s formalize a bit:
• We say the keys come from a universal set U = ________________________________.
• We use T as the array holding the hash table.
• Our hash function is defined as h : ________ → ____________________ s.t. h(k) is a legal slot number in T.
Question: If K is the set of keys and m is the number of slots in the hash table, when are we guaranteed to have collisions?
Why?
Quickly, here are a few other notes from CLRS:
• We use a hash table when we don’t want to or can’t allocate a whole array with one position per possible key
• We use a hash table when the number of keys actually stored is small relative to the number of possible keys (think
about the domain from which keys are drawn)
III. Direct Address Tables, Formalized
We make these assumptions when we use direct-address tables:
• Each key is drawn from a universal set U = {0, 1, …, m-1}, where m isn’t too large
• No two keys are the same
We can formalize the hash table as an array T[0..m-1]. Then, each slot T[k] in the array…
• contains a pointer to some element whose key is k, or,
• is empty, represented by NIL.
Here is pseudocode for each of the operations:
DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
DIRECT-ADDRESS-INSERT(T, x)
T[key[x]] = x
DIRECT-ADDRESS-DELETE(T, x)
T[key[x]] = NIL
Question: What are their running times?
all O(1)
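To make the O(1) claims concrete, here is a minimal Python sketch of a direct-address table (my own illustration, assuming each stored object exposes a .key attribute, per the CLRS key/satellite-data convention):

    class DirectAddressTable:
        def __init__(self, m):
            self.T = [None] * m      # one slot per possible key; None plays the role of NIL

        def search(self, k):
            return self.T[k]         # O(1): index directly by the key

        def insert(self, x):
            self.T[x.key] = x        # O(1): store x in the slot matching its key

        def delete(self, x):
            self.T[x.key] = None     # O(1): just clear the slot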
Homework: CLRS Exercise 11.1-1
IV. Collision Resolution with Chaining
There are two standard strategies for collision resolution. One is open addressing, which we’ll look at later. We’ll start with
chaining, which tends to perform better than open addressing. The idea with chaining is…
Let’s look at this via an example.
Example: Define a hash function h by h(k) = k mod 7 for all k ∈ {keys} and use it to store keys 3, 15, 38, 13, 6, and 55 in the
hash table below, using chaining for collision resolution.
slot:    0    1    2    3    4    5    6
chain:  ___  ___  ___  ___  ___  ___  ___
Question 1: How would we delete 6?
Question 2: How would we search for 15?
Question 3: How about searching for 4? 14? -8?
Here is pseudocode for each of the operations:
CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
CHAINED-HASH-DELETE(T, x)
delete x from list T[h(key[x])]
CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
Here are a few notes:
• To make searching work, we assume uniqueness of keys. So, CHAINED-HASH-INSERT must certainly have a
precondition that x is not in some list in T. Otherwise, we’d need to do an additional search first.
• As written, worst-case running time for CHAINED-HASH-INSERT is O(1).
• How the linked lists are implemented affects CHAINED-HASH-DELETE’s performance:
  o With singly linked lists, we’d need to find the predecessor to the deleted element, so…
  o With doubly linked lists, we can immediately find the predecessor, so…
• Performance of CHAINED-HASH-SEARCH depends on list length. Much more soon… (a quick code sketch follows this list)
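Here is a minimal Python sketch of chaining (my own illustration, not CLRS’s pseudocode; Python lists stand in for linked lists, so insert-at-head becomes list.insert(0, …)):

    m = 7
    T = [[] for _ in range(m)]          # one (initially empty) chain per slot

    def h(k):
        return k % m

    def chained_insert(T, k):
        T[h(k)].insert(0, k)            # insert at the head of the chain: O(1)

    def chained_search(T, k):
        return k in T[h(k)]             # walk only the chain that k hashes to

    def chained_delete(T, k):
        T[h(k)].remove(k)               # O(1) with a real doubly linked list, given the node

    for k in [3, 15, 38, 13, 6, 55]:
        chained_insert(T, k)
    print(T)                            # [[], [15], [], [38, 3], [], [], [55, 6, 13]]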
V. Load Factor and Other Influences on Performance
Let’s define some quantities we’ll use in analysis of hash tables:
• n =
• m =
• α =
• For j = 0, 1, …, m-1, n_j =
A way to think of the load factor is…
average # of elements per linked list
Trichotomy tells us that ____________________, ________________________, or ________________________.
The big question in analysis here is
How long does it take to find (an element with) key k or that (an element with) key k is not in the table?
Question: When does the worst case happen?
We’re most interested in average-case behavior. Probability comes into play here. We assume simple uniform hashing, i.e.,
that any key is equally likely to hash into any of the m slots in the hash table.
Let’s also assume that we can compute the hash function h in O(1) time, so the length of the list being searched dominates.
So, using the definitions above, searching for key k involves searching list ___________, which has length _____________.
Problem: What is the average value of n_j?
VI. Analysis of Hash Tables with Chaining (Average Case)
We consider two cases: unsuccessful searches and successful searches. In all of our analysis, we continue to use the
definitions from the last section.
Unsuccessful Search
Let’s derive the running time of an unsuccessful search:
• Which slot do we use?
• How many elements do we examine?
• What is the total running time?
So, we get:
Theorem: A search in a hash table using simple uniform hashing and collision resolution by chaining is ___________ in the
unsuccessful search case.
The proof of this theorem is essentially the derivation above.
Successful Search
Now, let’s consider a successful search. As it turns out, while the circumstances are slightly different and the analysis is
much more involved, the expected time is the same:
Theorem: A search in a hash table using simple uniform hashing and collision resolution by chaining is ___________ in the
successful search case.
Proof:
Let…
We seek to find the number of elements examined during a successful search for x. We can say…
Bringing in probability,
So, consider the expected number of elements examined in a successful search:
[long sequence manipulation]
We conclude with the total running time: ______________________________

Let’s interpret our results:
• Assume n = O(m).
• Given this assumption, the load factor is…
• Average search time is…
• Worst case insertion time is…
• Worst case deletion time (for doubly-linked lists) is…
Not bad at all!
Homework: CLRS Exercises 11.2-1, 11.2-2, 11.2-3
VII. Picking Hash Functions
What makes a good hash function?
To answer this question, recall our prior analysis. When is the performance of hashing good and when is it bad? What was
our assumption in the analysis?
bad when everything maps to the same place
ideally, want to keep the chains small and want as close to one element per slot as possible
so, ideally, we’ll satisfy the simple uniform hashing assumption
In practice, we won’t know the probability distribution from which the keys are drawn in advance and keys may not be drawn
independently, so it’s not possible to achieve the ideal. But, we can use heuristics to create a hash function that performs well
and use some “tricks” from number theory.
Before we can talk about any hash functions, we must remember that hash functions take keys from Znonneg (the nonnegative
integers). When keys aren’t naturally nonnegative integers, we need to find a way to convert them first.
Example: Suppose we want to interpret a string numerically. We could use ASCII paired with some sort of radix notation.
For the string CLRS, the ASCII values are 67, 76, 82, and 83, respectively, and there are 128 basic ASCII values. So one way
to turn this string into a number is…
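As a sketch of the radix idea (my own Python; the helper name is hypothetical): treating the character codes as base-128 digits gives 67·128³ + 76·128² + 82·128 + 83 = 141,764,947.

    def string_to_key(s, radix=128):
        """Interpret a string as a base-`radix` number, using character codes as digits."""
        k = 0
        for ch in s:
            k = k * radix + ord(ch)      # shift left one "digit", then add the next code
        return k

    print(string_to_key("CLRS"))         # 141764947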
CLRS proposes two strategies commonly used: the division method and the multiplication method. Let’s compare:

Division method
  General description: For k ∈ Znonneg, m ∈ Z+, …
  Advantage:
  Disadvantage:
  Suggestions:

Multiplication method
  General description: For k ∈ Znonneg, A ∈ R s.t. 0 < A < 1, …
  Advantage:
  Disadvantage:
  Suggestions:
Example: For k = 91 and m = 20, the division method hashes k to…
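A hedged sketch of both methods in Python (my own code, following the CLRS formulas h(k) = k mod m and h(k) = ⌊m·(kA mod 1)⌋; A = (√5 − 1)/2 is the constant Knuth suggests):

    import math

    def division_hash(k, m):
        return k % m                          # h(k) = k mod m

    def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
        frac = (k * A) % 1                    # fractional part of k*A
        return math.floor(m * frac)           # h(k) = floor(m * (k*A mod 1))

    print(division_hash(91, 20))              # 11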
VIII. A Taste of Universal Hashing
Consider this scenario:
• A malicious adversary gets to choose the keys to be hashed and knows your hash function.
• This could force worst-case performance.
How can we thwart this attack?
use a different hash function each time
pick randomly… from a set of good candidates
We define a finite collection H of hash functions that map a universe U of keys into {0, 1, …, m-1} as universal if for each
pair of keys k1, k2 ∈ U (where k1 ≠ k2), the number of hash functions h ∈ H for which h(k1) = h(k2) is at most |H| / m.
Why is this good?
probability of a collision is no more than the 1/m chance we’d randomly and independently choose 2 slots
Using universal hashing and chaining, we can maintain searches that run in expected constant time.
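One standard universal family (CLRS §11.3.3) fixes a prime p larger than any key and picks a ∈ {1, …, p−1} and b ∈ {0, …, p−1} at random. A minimal Python sketch (identifiers are my own):

    import random

    def make_universal_hash(m, p=2**31 - 1):      # p = 2^31 - 1 is prime; keys must be < p
        a = random.randrange(1, p)                # a and b are drawn once,
        b = random.randrange(0, p)                # when the table is created
        return lambda k: ((a * k + b) % p) % m    # h_{a,b}(k) = ((a*k + b) mod p) mod m

    h = make_universal_hash(7)
    print([h(k) for k in [3, 15, 38, 13, 6, 55]]) # slots vary from run to run, by design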
IX. Collision Resolution via Open Addressing
Another common collision resolution technique is open addressing. Here we store the keys in the hash table itself instead of
in linked lists. Slots that don’t contain keys contain NIL instead. When we try to insert into a slot that is occupied, we must
then find a new slot. This process of checking slots is called probing. The most basic kind of probing is linear probing,
which essentially means…
Let’s look at this via some (familiar) examples before formalizing it.
Example: Define a hash function h by h(k) = k mod 7 for all k ∈ {keys} and use it to store keys 3, 15, 38, 13, 6, and 55 in the
hash table below, using linear probing for collision resolution.
slot:   0    1    2    3    4    5    6
key:   ___  ___  ___  ___  ___  ___  ___
Question 1: How would we search for 38?
Question 2: How about searching for 4?
14?
Question 3: What happens when we try to insert 70 into the hash table above? …and then 77?
Let’s note a few summary issues:
• We begin either operation by computing our hash function, going to the desired slot, and then going to the next
available slot(s) as necessary.
• When we go past the end of the array, we wrap around to the beginning (easily implemented with modular
arithmetic).
• In searching, we “give up” when we find a NIL slot or we have checked all m slots.
• In insertion, we stop probing when no slots remain (and report failure).
Okay, now that we have the idea for insertion and searching, there’s one more issue: deletion. Chaining made this easy, but
it’s trickier with open addressing. Empty slots contain NIL, but if we were to replace a value to be deleted with NIL, we’d run
into a problem. Say some slot j held an element that we then deleted by storing NIL in slot j. To be concrete,
suppose in the example we deleted 6 this way. What happens when we search for 55?
A solution to the deletion problem is to use DELETED instead of NIL to tag deleted elements. Then…
• Searching treats DELETED as…
• Insertion treats DELETED as…
There’s one other very important detail about all open addressing: the hash function must take in a second argument to tell
which probe attempt we’re on. Formally, now, the hash function is
h : ______________________________________________________ → ___________________________________
It’s helpful to maintain the usual hash function we had before with one argument; we’ll call that an auxiliary hash function
h'. Then, we briefly note three kinds of probing (a sketch in code follows the list):
• Linear Probing defines the hash function, given key k and probe number i (0 ≤ i < m), as
    h(k, i) = (h'(k) + i) mod m
• Quadratic Probing defines the hash function, given key k and probe number i (0 ≤ i < m), as
    h(k, i) = (h'(k) + c1·i + c2·i²) mod m   (where c1 and c2 are nonzero constants)
• Double Hashing uses two different auxiliary hash functions h1 and h2 and defines the hash function, given key k and
probe number i (0 ≤ i < m), as
    h(k, i) = (h1(k) + i·h2(k)) mod m
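Below is a minimal Python sketch of open addressing with linear probing and the DELETED sentinel (my own illustration; CLRS’s HASH-INSERT and HASH-SEARCH pseudocode is the authoritative version):

    m = 7
    NIL, DELETED = None, "DELETED"
    T = [NIL] * m

    def h(k, i):
        return (k % 7 + i) % m                   # h(k, i) = (h'(k) + i) mod m, with h'(k) = k mod 7

    def oa_insert(T, k):
        for i in range(m):
            j = h(k, i)
            if T[j] is NIL or T[j] == DELETED:   # insertion treats DELETED as empty
                T[j] = k
                return j
        raise RuntimeError("hash table overflow")  # all m slots probed; report failure

    def oa_search(T, k):
        for i in range(m):
            j = h(k, i)
            if T[j] is NIL:                      # a truly empty slot ends the search
                return None
            if T[j] == k:                        # DELETED slots are skipped, not terminal
                return j
        return None

    for k in [3, 15, 38, 13, 6, 55]:
        oa_insert(T, k)
    print(T)                                     # [6, 15, 55, 3, 38, None, 13]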
The book provides more formality and pseudocode for open addressing.
X. Analysis of Open Addressing, Briefly
(We will not analyze open addressing rigorously like we did with chaining. See CLRS Section 11.4 if you are interested in
more rigor and proof for this topic. The analysis is a nice application of conditional probability.)
We make the following assumptions in analysis of hashing with open addressing:
• We again use a load factor α. We assume the table never completely fills, so
____________________________ ⇒ _________________________
• We assume uniform hashing.
• No deletion.
• In a successful search, each key is ___________________________________ to be searched for.
So then, we have the following theorems:
• The expected number of probes in an unsuccessful search is
• The expected number of probes in a successful search is
Problem: How many probes are expected in an unsuccessful search when a hash table is half full?
Problem: How many probes are expected in an unsuccessful search when a hash table is 90% full?
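For a numeric check, assuming the CLRS bound that an unsuccessful search under uniform hashing makes at most 1/(1 − α) probes in expectation (Theorem 11.6):

    def expected_unsuccessful_probes(alpha):
        """CLRS Theorem 11.6: at most 1/(1 - alpha) expected probes, for alpha < 1."""
        return 1 / (1 - alpha)

    print(expected_unsuccessful_probes(0.5))   # half full: 2 probes expected
    print(expected_unsuccessful_probes(0.9))   # 90% full: about 10 probes expected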
Homework: CLRS Exercises 11.3-3, 11.3-4, 11.4-1
Prepared by D. Hogan referencing materials from CLRS - Introduction to Algorithms (3rd ed.) for PSU CMPSC 465